Performance

We will briefly cover this method's performance from two perspectives:

  • How long the analysis takes to run locally

  • How well it predicts

To do so, we will use the data_bookReviews data set provided by the classmap package. For this exercise, only the first 100 of the 1,000 reviews will be part of the analysis.

library(mall)
library(classmap)
library(dplyr)

data(data_bookReviews)

data_bookReviews |>
  glimpse()
#> Rows: 1,000
#> Columns: 2
#> $ review    <chr> "i got this as both a book and an audio file. i had waited t…
#> $ sentiment <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, …

As per the docs, sentiment is a factor indicating the sentiment of the review: negative (1) or positive (2).
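
As a quick check of how those two levels are distributed in the subset we analyze, a simple count works. This snippet is an addition to the original write-up; it only tabulates the labels and does not touch the LLM.

data_bookReviews |>
  head(100) |>
  count(sentiment)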

length(strsplit(paste(head(data_bookReviews$review, 100), collapse = " "), " ")[[1]])
#> [1] 20470

Just to get an idea of how much data we're processing, the line above performs a very simple word count. So we're analyzing a bit over 20 thousand words across the 100 reviews.

reviews_llm <- data_bookReviews |>
  head(100) |> 
  llm_sentiment(
    col = review,
    options = c("positive" ~ 2, "negative" ~ 1),
    pred_name = "predicted"
  )
#> ! There were 2 predictions with invalid output, they were coerced to NA

As far as timing goes, on my Apple M3 machine it took about 1.5 minutes to process the 100 rows containing roughly 20 thousand words. Setting the temperature to 0 in llm_use() made the model run faster.
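
For reference, this is roughly how such a run could be set up and timed. The model name ("llama3.2") and the temperature argument are assumptions about the setup rather than values taken from the original analysis, so adjust them to your own environment.

# Sketch only: the model name and the temperature argument are assumptions
llm_use("ollama", "llama3.2", temperature = 0)

# Time the same pipeline shown above with system.time()
timing <- system.time(
  reviews_llm <- data_bookReviews |>
    head(100) |>
    llm_sentiment(
      col = review,
      options = c("positive" ~ 2, "negative" ~ 1),
      pred_name = "predicted"
    )
)
timing["elapsed"]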

The package uses purrr to send each prompt individually to the LLM. I did try a few different ways to speed up the process, unsuccessfully:

  • Used furrr to send multiple requests at a time. This did not work because either the LLM or Ollama processed all of my requests serially, so there was no improvement (see the sketch after this list).

  • I also tried sending more than one row's text at a time. This caused instability in the number of results: sending 5 reviews at a time, for example, sometimes returned 7 or 8 results. Even sending 2 was not stable.
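
For completeness, here is a sketch of what the furrr attempt looked like. It assumes every worker can reach the same local Ollama instance and is illustrative rather than code from the package; as noted above, it produced no speedup because the requests were still served one at a time.

library(furrr)
plan(multisession, workers = 4)

# Split the first 100 reviews into 4 chunks and score each chunk in a
# separate R session; each worker may need its own llm_use() call
chunks <- data_bookReviews |>
  head(100) |>
  mutate(chunk = rep(1:4, each = 25)) |>
  group_split(chunk)

# Ollama still handled the prompts serially, so elapsed time did not improve
parallel_attempt <- future_map_dfr(
  chunks,
  \(x) mall::llm_sentiment(
    x,
    col = review,
    options = c("positive" ~ 2, "negative" ~ 1),
    pred_name = "predicted"
  )
)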

This is what the new table looks like:

reviews_llm
#> # A tibble: 100 × 3
#>    review                                        sentiment             predicted
#>    <chr>                                         <fct>                     <dbl>
#>  1 "i got this as both a book and an audio file… 1                             1
#>  2 "this book places too much emphasis on spend… 1                             1
#>  3 "remember the hollywood blacklist? the holly… 2                             2
#>  4 "while i appreciate what tipler was attempti… 1                             1
#>  5 "the others in the series were great, and i … 1                             1
#>  6 "a few good things, but she's lost her edge … 1                             1
#>  7 "words cannot describe how ripped off and di… 1                             1
#>  8 "1. the persective of most writers is shaped… 1                            NA
#>  9 "i have been a huge fan of michael crichton … 1                             1
#> 10 "i saw dr. polk on c-span a month or two ago… 2                             2
#> # ℹ 90 more rows

I used yardstick to see how well the model performed. Of course, the “truth” here is not an absolute ground truth, but rather the labels recorded in the data set’s sentiment column, which we compare against the model’s predictions in predicted.

library(forcats)

reviews_llm |>
  mutate(predicted = as.factor(predicted)) |>
  yardstick::accuracy(sentiment, predicted)
#> # A tibble: 1 × 3
#>   .metric  .estimator .estimate
#>   <chr>    <chr>          <dbl>
#> 1 accuracy binary         0.980
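
As a quick follow-up, not part of the original write-up, a cross-tabulation of the labels against the predictions shows where the two NA predictions and the remaining misclassifications fall:

reviews_llm |>
  count(sentiment, predicted)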