The format of this dataset is meant to replicate that provided by Keras.
imdb_dataset(
  root,
  download = FALSE,
  split = "train",
  shuffle = (split == "train"),
  num_words = Inf,
  skip_top = 0,
  maxlen = Inf,
  start_char = 2,
  oov_char = 3,
  index_from = 4
)

root: Path to the data location.
download: Whether to download the dataset.
split: Which split to load: "train", "test" or "valid".
shuffle: Whether to shuffle the dataset. Defaults to TRUE when split == "train" and FALSE otherwise.
num_words: int or Inf. Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word appears as the oov_char value in the sequence data. Defaults to Inf, so all words are kept.
skip_top: int. Skip the top N most frequently occurring words (which may not be informative). These words appear as the oov_char value in the dataset. Defaults to 0, so no words are skipped.
maxlen: int or Inf. Maximum sequence length. Any longer sequence is truncated. Defaults to Inf, which means no truncation.
start_char: int. The start of each sequence is marked with this value. Defaults to 2, because 1 is usually the padding character.
oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits are replaced with this value. Defaults to 3.
index_from: int. Actual words are indexed with this value and higher. Defaults to 4.
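
How num_words, skip_top, maxlen, start_char, oov_char and index_from interact can be illustrated with a minimal base R sketch. It is not the package's implementation: encode_ranks is a hypothetical helper, and it assumes that the most frequent word (rank 1) is stored as index_from and that both the num_words and skip_top limits are applied to frequency ranks.

```r
# Hypothetical helper, not part of any package: encodes one review given
# the frequency rank of each of its words (rank 1 = most frequent).
encode_ranks <- function(ranks,
                         num_words = Inf,
                         skip_top = 0,
                         maxlen = Inf,
                         start_char = 2,
                         oov_char = 3,
                         index_from = 4) {
  # Assumption: rank 1 is stored as index_from, rank 2 as index_from + 1, ...
  ids <- ranks + index_from - 1
  # Assumption: both limits are applied to the frequency rank.
  dropped <- ranks <= skip_top | ranks > num_words
  ids[dropped] <- oov_char
  # Mark the start of the sequence, then keep at most maxlen entries.
  ids <- c(start_char, ids)
  ids[seq_len(min(length(ids), maxlen))]
}

# Four words with ranks 1, 50, 12000 and 7: rank 1 is skipped by skip_top,
# rank 12000 falls outside num_words, so both become oov_char.
encoded <- encode_ranks(c(1, 50, 12000, 7), num_words = 10000, skip_top = 1)
print(encoded)
```

Under these assumptions the example yields 2 3 53 3 10: the start marker, an out-of-vocabulary marker for the skipped top word, two shifted word indices, and a second out-of-vocabulary marker in between for the rare word.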