IMDB movie review sentiment classification dataset — imdb

The format of this dataset is meant to replicate that provided by Keras.

imdb_dataset(
  root,
  download = FALSE,
  split = "train",
  shuffle = (split == "train"),
  num_words = Inf,
  skip_top = 0,
  maxlen = Inf,
  start_char = 2,
  oov_char = 3,
  index_from = 4
)

Arguments

root: path to the data location
download: whether to download the data or not
split: which split to load: "train", "test" or "valid"
shuffle: whether to shuffle the dataset. Defaults to TRUE when split == "train" and FALSE otherwise.
num_words: Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. Defaults to Inf, so all words are kept.
skip_top: skip the top N most frequently occurring words (which may not be informative). These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
maxlen: int or Inf. Maximum sequence length. Any longer sequence will be truncated. Defaults to Inf, which means no truncation.
start_char: The start of a sequence will be marked with this character. Defaults to 2, because 1 is usually the padding character.
oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character. Defaults to 3.
index_from: int. Index actual words with this index and higher. Defaults to 4.
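
To make the interaction between num_words, skip_top, start_char, oov_char and index_from concrete, here is a minimal sketch in base R. It is not the code used by imdb_dataset(): encode_ranks() is a hypothetical helper, and it assumes that a word with frequency rank r (1 = most frequent) is stored as r + index_from - 1, that skip_top drops ranks 1 to skip_top, and that num_words keeps ranks up to and including num_words. The exact boundary handling in the real dataset may differ.

```r
# Hypothetical helper, for illustration only (not part of the package).
# `ranks` holds the frequency rank of each word in one review, 1 = most frequent.
encode_ranks <- function(ranks,
                         num_words = Inf,
                         skip_top = 0,
                         start_char = 2,
                         oov_char = 3,
                         index_from = 4) {
  # Assumption: a word is kept if it is not among the skip_top most frequent
  # words and is among the num_words most frequent words.
  keep <- ranks > skip_top & ranks <= num_words

  # Assumption: kept words are shifted so that rank 1 maps to index_from;
  # every other word is replaced by oov_char.
  ids <- ifelse(keep, ranks + index_from - 1, oov_char)

  # Every sequence starts with start_char.
  c(start_char, ids)
}

# With the defaults, no word is replaced.
encode_ranks(c(1, 5, 2, 10))
# 2 4 8 5 13

# Keep only the 5 most frequent words and skip the single most frequent one.
encode_ranks(c(1, 5, 2, 10), num_words = 5, skip_top = 1)
# 2 3 8 5 3
```

Under these assumptions the values below index_from stay reserved: 1 for padding, 2 for the start of a sequence and 3 for out-of-vocabulary words, which is why actual words begin at 4 by default.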