library(mall)
library(ellmer)
chat <- chat_openai()
#> Using model = "gpt-4.1".
llm_use(chat)
#>
#> ── mall session object
#> Backend: ellmer
#> LLM session: model:gpt-4.1
#> R session: cache_folder:/var/folders/y_/f_0cx_291nl0s8h26t4jg6ch0000gp/T//RtmpmtAm72/_mall_cache14c3f6e10b6db
Use Large Language Models (LLMs) to run Natural Language Processing (NLP) operations against your data. It takes advantage of the LLM’s general language training in order to get the predictions, thus removing the need to train a new NLP model. mall is available for R and Python.
It works by running multiple LLM predictions against your data. The predictions are processed row-wise over a specified column. It relies on the “one-shot” prompt technique to instruct the LLM on a particular NLP operation to perform. The package includes prompts to perform the following specific NLP operations:
- Sentiment analysis
- Text summarizing
- Classify text
- Extract one, or several, specific pieces of information from the text
- Translate text
- Verify that something is true about the text (binary)
For other NLP operations, mall offers the ability for you to write your own prompt.

In R, the functions inside mall are designed to work easily with piped commands, such as dplyr.
reviews |>
llm_sentiment(review)
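Because the result is a regular data frame, the output of an llm_ function can flow straight into other pipeline steps. A minimal sketch, assuming dplyr is attached (.sentiment is the column that llm_sentiment() appends):
reviews |>
  llm_sentiment(review) |>
  # keep only the rows the LLM labeled as negative
  filter(.sentiment == "negative")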
In Python, mall is a library extension to Polars.
reviews.llm.sentiment("review")
Motivation
We want to find new ways to help data scientists use LLMs in their daily work. Unlike the familiar interfaces, such as chatting and code completion, this interface runs your text data directly against the LLM. This package is inspired by the SQL AI functions now offered by vendors such as Databricks and Snowflake.
The LLM’s flexibility allows it to adapt to the subject of your data and provide surprisingly accurate predictions. This saves the data scientist the need to write and tune an NLP model.
In recent times, the capabilities of LLMs that can run locally on your computer have increased dramatically. This means that this sort of analysis can run on your machine with good accuracy. It also makes it possible to take advantage of LLMs at your institution, since the data will not leave the corporate network. Additionally, LLM management and integration platforms, such as Ollama, are now very easy to set up and use. mall uses Ollama to interact with local LLMs.
In its latest version, mall lets you use external LLM providers such as OpenAI, Gemini, and Anthropic. In R, mall uses the ellmer package to integrate with the external LLM; in Python, it uses the chatlas package.
Install mall
Install the package to get started:
Official version from CRAN:
install.packages("mall")Development version from GitHub (required for remote LLM integration):
pak::pak("mlverse/mall/r")Official version from PyPi:
pip install mlverse-mallDevelopment version from GitHub:
pip install "mlverse-mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python"Setup the LLM
Choose one of the following two options to set up LLM connectivity:
Local LLMs, via Ollama
Install and start Ollama on your computer
Install Ollama on your machine. The ollamar package’s website provides this Installation guide.
Download an LLM model. For example, I have been developing this package using Llama 3.2 for testing. To get that model you can run:
ollamar::pull("llama3.2")
Install the official Ollama library
pip install ollama
Download an LLM model. For example, I have been developing this package using Llama 3.2 for testing. To get that model you can run:
import ollama
ollama.pull('llama3.2')
Remote LLMs
mall uses the ellmer package as the integration point to the LLM. This package supports multiple providers such as OpenAI, Anthropic, Google Gemini, etc.
Install ellmer:
install.packages("ellmer")
Refer to ellmer’s documentation to find out how to set up the connections with your selected provider: https://ellmer.tidyverse.org/reference/index.html#chatbots
Let mall know which ellmer object to use during the R session. To do this, call llm_use(). Here is an example of using OpenAI:
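(These are the same calls shown in the session output at the top of this page.)
library(mall)
library(ellmer)
# create the ellmer chat object and register it with mall
chat <- chat_openai()
llm_use(chat)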
Set a default LLM for your R session
As a convenience, mall is able to automatically establish a connection with the LLM at the beginning of the R session. To do this, you can use the .mall_chat option:
options(.mall_chat = ellmer::chat_openai(model = "gpt-4o"))
Add this line to your .Rprofile file in order for that code to run every time you start R. You can call usethis::edit_r_profile() to open your .Rprofile file so you can add the option.
In Python, mall uses the chatlas package as the integration point to the LLM. This package supports multiple providers such as OpenAI, Anthropic, Google Gemini, etc.
Install the chatlas library:
pip install chatlas
Refer to chatlas’s documentation to find out how to set up the connections with your selected provider: https://posit-dev.github.io/chatlas/reference/#chat-model-providers
Let mall know which chatlas object to use during the Python session. To do this, call llm.use(). Here is an example of using OpenAI:
import mall
from chatlas import ChatOpenAI
chat = ChatOpenAI()
data = mall.MallData
reviews = data.reviews
reviews.llm.use(chat)
LLM functions
We will start by loading a very small data set contained in mall. It has three product reviews that we will use as the source of our examples.
library(mall)
data("reviews")
reviews
#> # A tibble: 3 × 1
#> review
#> <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound.
#> 2 I regret buying this laptop. It is too slow and the keyboard is too noisy
#> 3 Not sure how to feel about my new washing machine. Great color, but hard to f…
import mall
data = mall.MallData
reviews = data.reviews
reviews
| review |
|---|
| "This has been the best TV I've ever used. Great screen, and sound." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" |
Sentiment
Automatically returns “positive”, “negative”, or “neutral” based on the text.
reviews |>
llm_sentiment(review)
#> # A tibble: 3 × 2
#> review .sentiment
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound. positive
#> 2 I regret buying this laptop. It is too slow and the keyboard is to… negative
#> 3 Not sure how to feel about my new washing machine. Great color, bu… neutral
For more information and examples visit this function’s R reference page
reviews.llm.sentiment("review")
| review | sentiment |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "positive" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" |
For more information and examples visit this function’s Python reference page
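The set of sentiments to choose from can also be narrowed. A minimal sketch in R, assuming llm_sentiment() exposes an options argument for the allowed values (check the function’s reference page for the exact signature):
reviews |>
  # restrict the output to two possible sentiments (assumed `options` argument)
  llm_sentiment(review, options = c("positive", "negative"))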
Summarize
There may be a need to reduce the number of words in a given text, typically to make its intent easier to understand. The function has an argument to control the maximum number of words to output (max_words):
reviews |>
llm_summarize(review, max_words = 5)
#> # A tibble: 3 × 2
#> review .summary
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… this tv is excellent quality
#> 2 I regret buying this laptop. It is too slow … i regret my laptop purchase
#> 3 Not sure how to feel about my new washing ma… confused about the purchase
For more information and examples visit this function’s R reference page
reviews.llm.summarize("review", 5)
| review | summary |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "it's a great tv set" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was not wise" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "mixed feelings about new appliance" |
For more information and examples visit this function’s Python reference page
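The name of the new column can also be controlled. A minimal sketch in R, assuming the function exposes a pred_name argument (check the reference page for the exact name):
reviews |>
  # write the result to "review_summary" instead of the default .summary column (assumed `pred_name` argument)
  llm_summarize(review, max_words = 5, pred_name = "review_summary")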
Classify
Use the LLM to categorize the text into one of the options you provide:
reviews |>
llm_classify(review, c("appliance", "computer"))
#> # A tibble: 3 × 2
#> review .classify
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… computer
#> 2 I regret buying this laptop. It is too slow … computer
#> 3 Not sure how to feel about my new washing ma… appliance
For more information and examples visit this function’s R reference page
reviews.llm.classify("review", ["computer", "appliance"])
| review | classify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |
For more information and examples visit this function’s Python reference page
Extract
This is one of the most interesting use cases. Using natural language, we can tell the LLM to return a specific part of the text. In the following example, we request that the LLM return the product being referred to. We do this by simply saying “product”. The LLM understands what we mean by that word, and looks for that in the text.
reviews |>
llm_extract(review, "product")
#> # A tibble: 3 × 2
#> review .extract
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… tv
#> 2 I regret buying this laptop. It is too slow … laptop
#> 3 Not sure how to feel about my new washing ma… washing machine
For more information and examples visit this function’s R reference page
reviews.llm.extract("review", "product")
| review | extract |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "tv" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" |
For more information and examples visit this function’s Python reference page
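The list of operations at the top of this page mentions extracting several pieces of information at once. A minimal sketch in R of how that could look, assuming llm_extract() accepts a vector of labels (check the reference page for the exact argument):
reviews |>
  # ask for two pieces of information in a single call (assumes `labels` takes a vector)
  llm_extract(review, labels = c("product", "feelings"))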
Verify
This function allows you to check whether a statement is true, based on the provided text. By default, it will return a 1 for “yes”, and 0 for “no”. This can be customized, as sketched at the end of this section.
reviews |>
llm_verify(review, "is the customer happy with the purchase")
#> # A tibble: 3 × 2
#> review .verify
#> <chr> <fct>
#> 1 This has been the best TV I've ever used. Great screen, and sound. 1
#> 2 I regret buying this laptop. It is too slow and the keyboard is too n… 0
#> 3 Not sure how to feel about my new washing machine. Great color, but h… 0
For more information and examples visit this function’s R reference page
reviews.llm.verify("review", "is the customer happy with the purchase")
| review | verify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | 1 |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | 0 |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | 0 |
For more information and examples visit this function’s Python reference page
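As noted above, the values returned by the verify operation can be customized. A minimal sketch in R, assuming llm_verify() exposes a yes_no argument for the two return values (check the reference page for the exact name and defaults):
reviews |>
  # return "yes"/"no" instead of the default 1/0 (assumed `yes_no` argument)
  llm_verify(review, "is the customer happy with the purchase", yes_no = c("yes", "no"))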
Translate
As the title implies, this function will translate the text into a specified language. What is really nice is that you don’t need to specify the language of the source text; only the target language needs to be defined. The translation accuracy will depend on the LLM.
reviews |>
llm_translate(review, "spanish")
#> # A tibble: 3 × 2
#> review .translation
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… Esta ha sido la mejor televisió…
#> 2 I regret buying this laptop. It is too slow … Lo lamento comprar este portáti…
#> 3 Not sure how to feel about my new washing ma… No estoy seguro de cómo sentirm…
For more information and examples visit this function’s R reference page
reviews.llm.translate("review", "spanish")
| review | translation |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Esto ha sido la mejor televisión que he utilizado jamás. Pantalla y sonido excelentes." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "Lamento haber comprado este portátil. Es demasiado lento y la tecla de espacio es demasiado ruidosa." |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No estoy seguro de cómo sentirme con mi nueva lavadora. Me gusta mucho el color, pero no sé cómo fun… |
For more information and examples visit this function’s Python reference page
Custom prompt
It is possible to pass your own prompt to the LLM, and have mall run it against each text entry:
my_prompt <- paste(
"Answer a question.",
"Return only the answer, no explanation",
"Acceptable answers are 'yes', 'no'",
"Answer this about the following text, is this a happy customer?:"
)
reviews |>
llm_custom(review, my_prompt)
#> # A tibble: 3 × 2
#> review .pred
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound. No
#> 2 I regret buying this laptop. It is too slow and the keyboard is too noi… No
#> 3 Not sure how to feel about my new washing machine. Great color, but har… No
For more information and examples visit this function’s R reference page
my_prompt = (
    "Answer a question. "
    "Return only the answer, no explanation "
    "Acceptable answers are 'yes', 'no' "
    "Answer this about the following text, is this a happy customer?:"
)
reviews.llm.custom("review", prompt = my_prompt)
| review | custom |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Yes" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "No" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No" |
For more information and examples visit this function’s Python reference page
Model selection and settings
Local LLMs via Ollama
You can set the model and its options to use when calling the LLM. In this case, we refer to options as model-specific settings, such as seed or temperature.
Invoking an llm function will automatically initialize a model selection if you don’t have one selected yet. If there is only one option, it will be pre-selected for you. If there is more than one available model, mall will present you with a menu so you can select which model you wish to use.
Calling llm_use() directly will let you specify the model and backend to use. You can also specify additional arguments that will be passed down to the function that actually runs the prediction. In the case of Ollama, that function is chat().
The model to use, and other options can be set for the current R session
llm_use("ollama", "llama3.2", seed = 100, temperature = 0)The model and options to be used will be defined at the Polars data frame object level. If not passed, the default model will be llama3.2.
reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100))Remote LLMs
The provider and model selection will be based on the chat object you create. Any model-related settings, such as temperature and seed, should be set at the time of object creation as well.
library(mall)
library(ellmer)
chat <- chat_openai(model = "gpt-4o", seed = 100)
llm_use(chat)
import mall
from chatlas import ChatOpenAI
chat = ChatOpenAI(model = "gpt-4o", seed= 100)
data = mall.MallData
reviews = data.reviews
reviews.llm.use(chat)
Results caching
By default, mall caches the requests and corresponding results from a given LLM run. Each response is saved as an individual JSON file. By default, the folder name is _mall_cache. The folder name can be customized, if needed. Also, caching can be turned off by setting the argument to an empty string ("").
llm_use(.cache = "_my_cache")To turn off:
llm_use(.cache = "")reviews.llm.use(_cache = "my_cache")To turn off:
reviews.llm.use(_cache = "")
For more information see the Caching Results article.
Key considerations
The main consideration is cost, either in time or in money.
If using this method with a locally available LLM, the cost will be a long running time. Unless using a very specialized LLM, a given LLM is a general model that was fitted using a vast amount of data, so determining a response for each row takes longer than using a manually created NLP model. The default model used with Ollama is Llama 3.2, which has 3 billion parameters.
If using an external LLM service, the consideration will be the billing cost of using such a service. Keep in mind that you will be sending a lot of data to be evaluated.
Another consideration is the novelty of this approach. Early tests are providing encouraging results. But you, as a user, will still need to keep in mind that the predictions will not be infallible, so always check the output. At this time, I think the best use for this method is for quick analysis.
Vector functions
mall includes functions that expect a vector, instead of a table, to run the predictions. This should make it easier to test things, such as custom prompts or the results for specific texts. Each llm_ function has a corresponding llm_vec_ function:
llm_vec_sentiment("I am happy")
#> [1] "positive"llm_vec_translate("Este es el mejor dia!", "english")
#> [1] "It's the best day!"mall is also able to process vectors contained in a list object. This allows us to avoid having to convert a list of texts without having to first convert them into a single column data frame. To use, initialize a new LLMVec class object with either an Ollama model, or a chatlas Chat object, and then access the same NLP functions as the Polars extension.
# Initialize a Chat object
from chatlas import ChatOllama
chat = ChatOllama(model = "llama3.2")
# Pass it to a new LLMVec
from mall import LLMVec
llm = LLMVec(chat)
Access the functions via the new LLMVec object, and pass the text to be processed.
llm.sentiment(["I am happy", "I am sad"])
#> ['positive', 'negative']
llm.translate(["Este es el mejor dia!"], "english")
#> ['This is the best day!']
For more information visit the reference page: LLMVec
