The format of this dataset is meant to replicate that provided by Keras.
imdb_dataset(
  root,
  download = FALSE,
  split = "train",
  shuffle = (split == "train"),
  num_words = Inf,
  skip_top = 0,
  maxlen = Inf,
  start_char = 2,
  oov_char = 3,
  index_from = 4
)
root: Path to the data location.
download: Whether to download the dataset. Defaults to FALSE.
split: The split to load: "train", "test" or "valid".
shuffle: Whether to shuffle the dataset. Defaults to TRUE if split == "train".
num_words: Words are ranked by how often they occur (in the training set), and only the num_words most frequent words are kept. Any less frequent word will appear as the oov_char value in the sequence data. Defaults to Inf, so all words are kept.
skip_top: Skip the top N most frequently occurring words (which may not be informative). These words will appear as the oov_char value in the dataset. Defaults to 0, so no words are skipped.
maxlen: int or Inf. Maximum sequence length. Any longer sequence will be truncated. Defaults to Inf, which means no truncation.
start_char: The start of a sequence will be marked with this character. Defaults to 2, because 1 is usually the padding character.
oov_char: int. The out-of-vocabulary character. Words that were cut out because of the num_words or skip_top limits will be replaced with this character. Defaults to 3.
index_from: int. Index actual words with this index and higher. Defaults to 4.
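
The way these encoding arguments interact can be illustrated with a small base R sketch. It is not the implementation behind imdb_dataset(): encode_ranks() is a hypothetical helper, and it assumes that the input is the 1-based frequency rank of each word, that the most frequent word is encoded as index_from, and that the start marker counts toward maxlen.

```r
# Hypothetical helper, not part of any package: encodes one review given the
# 1-based frequency rank of each word (1 = most frequent word).
encode_ranks <- function(ranks,
                         num_words = Inf,
                         skip_top = 0,
                         maxlen = Inf,
                         start_char = 2,
                         oov_char = 3,
                         index_from = 4) {
  # Assumption: a word is kept when it is among the num_words most frequent
  # words and not among the skip_top most frequent ones.
  keep <- ranks > skip_top & ranks <= num_words
  # Assumption: rank 1 maps to index_from, rank 2 to index_from + 1, and so on.
  # Words that are not kept are replaced by oov_char.
  ids <- ifelse(keep, ranks + index_from - 1, oov_char)
  # Mark the start of the sequence.
  ids <- c(start_char, ids)
  # Assumption: the start marker counts toward maxlen when truncating.
  if (is.finite(maxlen) && length(ids) > maxlen) {
    ids <- ids[seq_len(maxlen)]
  }
  ids
}

ranks <- c(1, 50, 3, 12000, 7)

# Defaults: every word is kept and shifted by index_from - 1.
encode_ranks(ranks)
# 2 4 53 6 12003 10

# Ranks 1 (skip_top) and 12000 (num_words) become oov_char, then truncation.
encode_ranks(ranks, num_words = 10000, skip_top = 2, maxlen = 4)
# 2 3 53 6
```

With the defaults, the values 1, 2 and 3 never collide with a real word, which is why index_from starts at 4: 1 is left for padding, 2 marks the start of a sequence and 3 stands for out-of-vocabulary words.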