library(mall)
data("reviews")

reviews
#> # A tibble: 3 × 1
#>   review
#>   <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound.
#> 2 I regret buying this laptop. It is too slow and the keyboard is too noisy
#> 3 Not sure how to feel about my new washing machine. Great color, but hard to f…
Run multiple LLM predictions against a data frame. The predictions are processed row-wise over a specified column. It works by combining a pre-determined one-shot prompt with the current row's content. mall has been implemented for both R and Python, and will use the appropriate prompt for the requested analysis.
Currently, the included prompts perform the following:
- Sentiment analysis
- Text summarizing
- Classify text
- Extract one, or several, specific pieces of information from the text
- Translate text
- Verify that something is true about the text (binary)
- Custom prompt
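The row-wise, one-shot pattern described above can be sketched in a few lines of Python. Note that `ONE_SHOT`, `run_rowwise`, and `fake_model` are illustrative names invented for this sketch, not mall's internals, and the stand-in model lets the example run without Ollama:

```python
# Toy sketch of mall's row-wise pattern: a fixed one-shot prompt is
# combined with each row's text and sent to the model, one call per row.
ONE_SHOT = (
    "You are a helpful sentiment engine. Return only one of "
    "'positive', 'negative', 'neutral'. Text: "
)

def run_rowwise(column, model):
    # `model` stands in for a call to a local LLM (e.g. via Ollama)
    return [model(ONE_SHOT + text) for text in column]

# A fake model so the sketch runs without an LLM backend
fake_model = lambda prompt: "positive" if "best" in prompt else "negative"

print(run_rowwise(
    ["This has been the best TV I've ever used.",
     "I regret buying this laptop."],
    fake_model,
))
# ['positive', 'negative']
```

In the real package, the model call goes to a locally installed LLM, and the one-shot prompt varies per analysis (sentiment, summarize, classify, and so on).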
This package is inspired by the SQL AI functions now offered by vendors such as Databricks and Snowflake. mall uses Ollama to interact with LLMs installed locally. For R, that interaction takes place via the ollamar package. The functions are designed to work easily with piped commands, such as those from dplyr.
reviews |>
  llm_sentiment(review)
For Python, mall is a library extension to Polars. To interact with Ollama, it uses the official Python library.

reviews.llm.sentiment("review")
Motivation
We want to find new ways to help data scientists use LLMs in their daily work. Unlike familiar interfaces, such as chatting and code completion, this interface runs your text data directly against the LLM.
The LLM's flexibility allows it to adapt to the subject of your data and provide surprisingly accurate predictions. This saves the data scientist from needing to write and tune an NLP model.
In recent times, the capabilities of LLMs that can run locally on your computer have increased dramatically. This means that this sort of analysis can run on your machine with good accuracy. Additionally, it makes it possible to take advantage of LLMs at your institution, since the data will not leave the corporate network.
Get started
Install mall
From CRAN:
install.packages("mall")
From GitHub:
pak::pak("mlverse/mall/r")
From PyPi:
pip install mlverse-mall
From GitHub:
pip install "mall @ git+https://git@github.com/mlverse/mall.git#subdirectory=python"
Install and start Ollama on your computer
Install Ollama on your machine. The ollamar package's website provides this Installation guide.

Download an LLM model. For example, I have been developing this package using Llama 3.2 for testing. To get that model you can run:

ollamar::pull("llama3.2")
Install the official Ollama library
pip install ollama
Download an LLM model. For example, I have been developing this package using Llama 3.2 for testing. To get that model you can run:
import ollama
ollama.pull('llama3.2')
With Databricks (R only)
If you pass a table connected to Databricks via odbc, mall will automatically use Databricks' LLM instead of Ollama. You won't need Ollama installed if you are using Databricks only. mall will call the appropriate SQL AI function. For more information see our Databricks article.
LLM functions
We will start by loading a very small data set contained in mall. It has 3 product reviews that we will use as the source of our examples.
import mall
data = mall.MallData
reviews = data.reviews

reviews
| review |
|---|
| "This has been the best TV I've ever used. Great screen, and sound." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" |
Sentiment
Automatically returns “positive”, “negative”, or “neutral” based on the text.
reviews |>
  llm_sentiment(review)
#> # A tibble: 3 × 2
#> review .sentiment
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound. positive
#> 2 I regret buying this laptop. It is too slow and the keyboard is to… negative
#> 3 Not sure how to feel about my new washing machine. Great color, bu… neutral
For more information and examples visit this function’s R reference page
reviews.llm.sentiment("review")
| review | sentiment |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "positive" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "negative" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "neutral" |
For more information and examples visit this function’s Python reference page
Summarize
There may be a need to reduce the number of words in a given text, typically to make its intent easier to understand. The function has an argument to control the maximum number of words to output (max_words):
reviews |>
  llm_summarize(review, max_words = 5)
#> # A tibble: 3 × 2
#> review .summary
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… it's a great tv
#> 2 I regret buying this laptop. It is too slow … laptop purchase was a mistake
#> 3 Not sure how to feel about my new washing ma… having mixed feelings about it
For more information and examples visit this function’s R reference page
reviews.llm.summarize("review", 5)
| review | summary |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "great tv with good features" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop purchase was a mistake" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "feeling uncertain about new purchase" |
For more information and examples visit this function’s Python reference page
Classify
Use the LLM to categorize the text into one of the options you provide:
reviews |>
  llm_classify(review, c("appliance", "computer"))
#> # A tibble: 3 × 2
#> review .classify
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… computer
#> 2 I regret buying this laptop. It is too slow … computer
#> 3 Not sure how to feel about my new washing ma… appliance
For more information and examples visit this function’s R reference page
reviews.llm.classify("review", ["computer", "appliance"])
| review | classify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "appliance" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "computer" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "appliance" |
For more information and examples visit this function’s Python reference page
Extract
This is one of the most interesting use cases. Using natural language, we can tell the LLM to return a specific part of the text. In the following example, we request that the LLM return the product being referred to. We do this by simply saying "product". The LLM understands what we mean by that word and looks for it in the text.
reviews |>
  llm_extract(review, "product")
#> # A tibble: 3 × 2
#> review .extract
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… tv
#> 2 I regret buying this laptop. It is too slow … laptop
#> 3 Not sure how to feel about my new washing ma… washing machine
For more information and examples visit this function’s R reference page
reviews.llm.extract("review", "product")
| review | extract |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "tv" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "laptop" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "washing machine" |
For more information and examples visit this function’s Python reference page
Verify
This function allows you to check whether a statement is true, based on the provided text. By default, it will return a 1 for "yes" and a 0 for "no". This can be customized.
reviews |>
  llm_verify(review, "is the customer happy with the purchase")
#> # A tibble: 3 × 2
#> review .verify
#> <chr> <fct>
#> 1 This has been the best TV I've ever used. Great screen, and sound. 1
#> 2 I regret buying this laptop. It is too slow and the keyboard is too n… 0
#> 3 Not sure how to feel about my new washing machine. Great color, but h… 0
For more information and examples visit this function’s R reference page
reviews.llm.verify("review", "is the customer happy with the purchase")
| review | verify |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | 1 |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | 0 |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | 0 |
For more information and examples visit this function’s Python reference page
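The yes/no mapping that a verify-style call performs can be illustrated with a small Python sketch. The names `verify`, `yes_no`, and `fake_model` are made up for this illustration and are not mall's API; the stand-in model avoids needing a running LLM:

```python
# Toy sketch of a verify-style check: the model answers "yes" or "no"
# per row, and we map those answers onto configurable output labels.
def verify(texts, statement, model, yes_no=(1, 0)):
    results = []
    for text in texts:
        answer = model(f"Is this true about the text: {statement}? Text: {text}")
        results.append(yes_no[0] if answer == "yes" else yes_no[1])
    return results

# Stand-in for a real LLM call
fake_model = lambda prompt: "yes" if "best" in prompt else "no"

print(verify(
    ["This has been the best TV I've ever used.", "I regret buying this laptop."],
    "is the customer happy with the purchase",
    fake_model,
))
# [1, 0]
```

Swapping the `yes_no` tuple (for example, to `("TRUE", "FALSE")`) is how such a mapping could be customized in this sketch.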
Translate
As the title implies, this function will translate the text into a specified language. What is really nice is that you don't need to specify the language of the source text; only the target language needs to be defined. The translation accuracy will depend on the LLM.
reviews |>
  llm_translate(review, "spanish")
#> # A tibble: 3 × 2
#> review .translation
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Gr… Esta ha sido la mejor televisió…
#> 2 I regret buying this laptop. It is too slow … Me arrepiento de comprar este p…
#> 3 Not sure how to feel about my new washing ma… No estoy seguro de cómo me sien…
For more information and examples visit this function’s R reference page
reviews.llm.translate("review", "spanish")
| review | translation |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Esta ha sido la mejor televisión que he utilizado hasta ahora. Gran pantalla y sonido." |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "Me arrepiento de comprar este portátil. Es demasiado lento y la tecla es demasiado ruidosa." |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No estoy seguro de cómo sentirme con mi nueva lavadora. Un color maravilloso, pero muy difícil de en… |
For more information and examples visit this function’s Python reference page
Custom prompt
It is possible to pass your own prompt to the LLM, and have mall run it against each text entry:
my_prompt <- paste(
  "Answer a question.",
  "Return only the answer, no explanation",
  "Acceptable answers are 'yes', 'no'",
  "Answer this about the following text, is this a happy customer?:"
)

reviews |>
  llm_custom(review, my_prompt)
#> # A tibble: 3 × 2
#> review .pred
#> <chr> <chr>
#> 1 This has been the best TV I've ever used. Great screen, and sound. Yes
#> 2 I regret buying this laptop. It is too slow and the keyboard is too noi… No
#> 3 Not sure how to feel about my new washing machine. Great color, but har… No
For more information and examples visit this function’s R reference page
my_prompt = (
    "Answer a question."
    "Return only the answer, no explanation"
    "Acceptable answers are 'yes', 'no'"
    "Answer this about the following text, is this a happy customer?:"
)

reviews.llm.custom("review", prompt = my_prompt)
| review | custom |
|---|---|
| "This has been the best TV I've ever used. Great screen, and sound." | "Yes" |
| "I regret buying this laptop. It is too slow and the keyboard is too noisy" | "No" |
| "Not sure how to feel about my new washing machine. Great color, but hard to figure" | "No" |
For more information and examples visit this function’s Python reference page
Model selection and settings
You can set the model and its options to use when calling the LLM. In this case, we refer to options as model-specific settings, such as seed or temperature.
Invoking an llm function will automatically initialize a model selection if you don't have one selected yet. If there is only one option, it will pre-select it for you. If there is more than one available model, then mall will present you with a menu selection so you can pick which model you wish to use.
Calling llm_use() directly will let you specify the model and backend to use. You can also set up additional arguments that will be passed down to the function that actually runs the prediction. In the case of Ollama, that function is chat().
The model to use, and other options, can be set for the current R session:
llm_use("ollama", "llama3.2", seed = 100, temperature = 0)
The model and options to be used will be defined at the Polars data frame object level. If not passed, the default model will be llama3.2.
reviews.llm.use("ollama", "llama3.2", options = dict(seed = 100))
Results caching
By default, mall caches the requests and corresponding results from a given LLM run. Each response is saved as an individual JSON file. By default, the folder name is _mall_cache. The folder name can be customized, if needed. Also, the caching can be turned off by setting the argument to an empty string ("").
llm_use(.cache = "_my_cache")
To turn off:
llm_use(.cache = "")
reviews.llm.use(_cache = "my_cache")
To turn off:
reviews.llm.use(_cache = "")
For more information see the Caching Results article.
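As a rough illustration of per-request JSON caching as described above, here is a Python sketch. The function name, hashing scheme, and file layout are all assumptions for the sketch, not mall's actual cache format:

```python
import hashlib
import json
import tempfile
from pathlib import Path

# Toy sketch of request-level caching: hash the prompt, store one JSON
# file per unique request, and reuse the file on repeat calls.
def cached_call(prompt, model, cache_dir):
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Cache hit: skip the (slow) LLM call entirely
        return json.loads(path.read_text())["response"]
    response = model(prompt)
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response

calls = []
def fake_model(prompt):
    calls.append(prompt)  # record how often the "LLM" was actually hit
    return "positive"

with tempfile.TemporaryDirectory() as cache:
    cached_call("sentiment: great TV", fake_model, cache)
    cached_call("sentiment: great TV", fake_model, cache)  # served from cache

print(len(calls))
# 1
```

The payoff is the same as in mall: re-running an analysis over the same rows does not re-invoke the model.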
Key considerations
The main consideration is cost: either time cost or money cost.
If using this method with a locally available LLM, the cost will be a long running time. Unless you are using a very specialized LLM, a given LLM is a general-purpose model fitted on a vast amount of data, so determining a response for each row takes longer than it would with a manually created NLP model. The default model used in Ollama is Llama 3.2, which has 3 billion parameters.
If using an external LLM service, the consideration will be the billing costs of using such a service. Keep in mind that you will be sending a lot of data to be evaluated.
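To make the data-volume concern concrete, here is a back-of-envelope Python calculation. All numbers are illustrative assumptions, not measurements of mall or any particular service:

```python
# Rough estimate of the data volume a per-row LLM job sends.
rows = 100_000                 # assumed table size
avg_review_chars = 200         # assumed average text length per row
prompt_overhead_chars = 150    # assumed fixed one-shot instructions per request

total_chars = rows * (avg_review_chars + prompt_overhead_chars)
approx_tokens = total_chars // 4   # crude heuristic: ~4 characters per token

print(f"{total_chars:,} characters ~= {approx_tokens:,} tokens sent")
# 35,000,000 characters ~= 8,750,000 tokens sent
```

Even under these modest assumptions, a single run sends millions of tokens, which is worth checking against a provider's per-token pricing before committing to a large job.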
Another consideration is the novelty of this approach. Early tests are providing encouraging results. But you, as a user, will still need to keep in mind that the predictions will not be infallible, so always check the output. At this time, I think this method is best suited for quick analyses.
Vector functions (R only)
mall includes functions that expect a vector, instead of a table, to run the predictions. This should make it easier to test things, such as custom prompts or results for specific text. Each llm_ function has a corresponding llm_vec_ function:
llm_vec_sentiment("I am happy")
#> [1] "positive"
llm_vec_translate("Este es el mejor dia!", "english")
#> [1] "It's the best day!"