The format of this dataset is meant to replicate that provided by Keras.

imdb_dataset(
  root,
  download = FALSE,
  split = "train",
  shuffle = (split == "train"),
  num_words = Inf,
  skip_top = 0,
  maxlen = Inf,
  start_char = 2,
  oov_char = 3,
  index_from = 4
)

Arguments

root

path to the directory where the dataset is stored

download

whether to download the dataset

split

which split to load: "train", "test" or "valid"

shuffle

whether to shuffle the dataset. Defaults to TRUE when split == "train", FALSE otherwise.

num_words

Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. Defaults to Inf, so all words are kept.

skip_top

Skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. Defaults to 0, so no words are skipped.

maxlen

int or Inf. Maximum sequence length. Any longer sequence will be truncated. Defaults to Inf, which means no truncation.

start_char

The start of a sequence will be marked with this character. Defaults to 2, because 1 is usually the padding character.

oov_char

int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character.

index_from

int. Actual words are assigned indices starting at this value; lower indices are reserved for special markers such as start_char and oov_char.
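
Examples

The following base R sketch illustrates how the encoding arguments described above could interact. It is not the package's implementation: encode_ranks() is a hypothetical helper, and it assumes that words arrive as 1-based frequency ranks, that rank 1 maps to index_from, and that the start marker counts toward maxlen.

```r
# Hypothetical helper (not part of the package): encode a vector of
# 1-based word frequency ranks following the documented argument semantics.
encode_ranks <- function(ranks,
                         num_words = Inf,
                         skip_top = 0,
                         maxlen = Inf,
                         start_char = 2,
                         oov_char = 3,
                         index_from = 4) {
  # Assumption: the most frequent word (rank 1) is assigned index_from.
  ids <- ranks + index_from - 1

  # Words among the skip_top most frequent, or rarer than the num_words
  # most frequent, are replaced by the out-of-vocabulary marker.
  dropped <- ranks <= skip_top | ranks > num_words
  ids[dropped] <- oov_char

  # Mark the start of the sequence.
  ids <- c(start_char, ids)

  # Truncate long sequences (assumption: the start marker counts).
  if (is.finite(maxlen) && length(ids) > maxlen) {
    ids <- ids[seq_len(maxlen)]
  }
  ids
}

# Ranks 1 and 120 fall outside the kept range and become oov_char.
encode_ranks(c(1, 5, 120, 7), num_words = 100, skip_top = 1)
# [1]  2  3  8  3 10
```

With the default arguments no word is replaced, so the same input would yield 2, 4, 8, 123, 10 under these assumptions.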