tokenizer.Rd
The tokenizer class
A class that wraps a tokenizer model and exposes methods to train it, encode and decode text, inspect its vocabulary, and save it to disk.
new()
Create a new tokenizer from the given model.
tokenizer$new(model)
train()
Train the tokenizer's model on a set of files.
tokenizer$train(files, trainer = NULL)
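A usage sketch; the file paths below are placeholders, and passing trainer = NULL assumes a sensible model-specific default:

```r
# Assumes `tk` is a tokenizer instance created with tokenizer$new(model).
# Train the underlying model on raw text files (placeholder paths);
# NULL requests the model's default trainer.
tk$train(files = c("corpus-1.txt", "corpus-2.txt"), trainer = NULL)
```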
encode()
Encode a sequence (optionally together with a paired sequence) into ids.
tokenizer$encode(sequence, pair = NULL, is_pretokenized = FALSE, add_special_tokens = TRUE)
decode()
Decode a vector of ids back into a string.
tokenizer$decode(ids, skip_special_tokens = TRUE)
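Taken together, encode() and decode() form a round trip. A minimal sketch, assuming a tokenizer instance tk and that the object returned by encode() exposes an ids field (this field name is an assumption, not documented above):

```r
# Encode a single sequence; special tokens are added by default.
enc <- tk$encode("Hello world!")

# Decode the ids back to text, dropping any special tokens.
text <- tk$decode(enc$ids, skip_special_tokens = TRUE)
```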
get_vocab()
Get the tokenizer's vocabulary, optionally including added tokens.
tokenizer$get_vocab(with_added_tokens = FALSE)
save()
Save the tokenizer to a file at the given path.
tokenizer$save(path, pretty = FALSE)
token_to_id()
Convert a token to its id in the vocabulary.
tokenizer$token_to_id(token)
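A short sketch combining the three methods above; tk is assumed to be a trained tokenizer instance and the output path is purely illustrative:

```r
# Inspect the vocabulary, including any added tokens.
vocab <- tk$get_vocab(with_added_tokens = TRUE)

# Look up the id of a single token.
id <- tk$token_to_id("hello")

# Serialize the tokenizer; `pretty = TRUE` asks for human-readable output.
tk$save("tokenizer.json", pretty = TRUE)
```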
encode_batch()
Encode a batch of inputs in a single call.
tokenizer$encode_batch(inputs, is_pre_tokenized = FALSE, add_special_tokens = TRUE)
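A one-line batch sketch, assuming a tokenizer instance tk:

```r
# Encode several inputs at once; each element here is a single sequence.
encs <- tk$encode_batch(list("first sentence", "a second, longer sentence"))
```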
enable_padding()
Enable padding of encoded sequences.
tokenizer$enable_padding(direction = "right", pad_id = 0, pad_type_id = 0, pad_token = "[PAD]", length = NULL, pad_to_multiple_of = NULL)
direction
(str, optional, defaults to "right") – The direction in which to pad. Can be either "right" or "left".
pad_id
(int, defaults to 0) – The id to be used when padding
pad_type_id
(int, defaults to 0) – The type id to be used when padding
pad_token
(str, defaults to [PAD]) – The pad token to be used when padding.
length
(int, optional) – If specified, the length at which to pad. If not specified, we pad to the length of the longest sequence in the batch.
pad_to_multiple_of
(int, optional) – If specified, the padding length always snaps to the next multiple of the given value. For example, if we would otherwise pad to a length of 250 and pad_to_multiple_of = 8, we pad to 256 instead.
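For example, to reproduce the 250-to-256 behaviour described above (a sketch, assuming a tokenizer instance tk):

```r
# Pad to the longest sequence in each batch, snapping the padding
# length up to the next multiple of 8 (e.g. 250 -> 256).
tk$enable_padding(
  direction = "right",
  pad_token = "[PAD]",
  pad_to_multiple_of = 8
)
```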
clone()
The objects of this class are cloneable with this method.
tokenizer$clone(deep = FALSE)
deep
Whether to make a deep clone.