The tokenizer class
Methods (see the sketch after this list):

new(): tokenizer$new(model) – Create a new tokenizer from a tokenization model.
train(): tokenizer$train(files, trainer = NULL) – Train the tokenizer on a set of text files.
encode(): tokenizer$encode(sequence, pair = NULL, is_pretokenized = FALSE, add_special_tokens = TRUE) – Encode a sequence (optionally with a paired sequence) into token ids.
decode(): tokenizer$decode(ids, skip_special_tokens = TRUE) – Decode a vector of token ids back into text.
get_vocab(): tokenizer$get_vocab(with_added_tokens = FALSE) – Get the vocabulary as a mapping from tokens to ids.
save(): tokenizer$save(path, pretty = FALSE) – Save the tokenizer to the given path; pretty controls pretty-printing.
token_to_id(): tokenizer$token_to_id(token) – Look up the id of a single token.
encode_batch(): tokenizer$encode_batch(inputs, is_pre_tokenized = FALSE, add_special_tokens = TRUE) – Encode a batch of sequences at once.
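
The sketch below strings these methods together into a train/encode/decode round trip. The model_bpe$new() and trainer_bpe$new() constructors, the file paths, and the $ids field on the returned encoding are assumptions for illustration and are not documented on this page.

# Minimal sketch: build, train, and use a tokenizer.
# `model_bpe$new()` and `trainer_bpe$new()` are hypothetical constructor
# names; substitute the model/trainer constructors your package provides.
tok <- tokenizer$new(model_bpe$new())

# Train on plain-text files (paths are illustrative).
tok$train(files = c("corpus-1.txt", "corpus-2.txt"), trainer = trainer_bpe$new())

# Encode a single sequence and decode it back.
enc <- tok$encode("Hello world!")                       # special tokens added by default
txt <- tok$decode(enc$ids, skip_special_tokens = TRUE)  # assumes ids live in `enc$ids`

# Inspect the vocabulary, look up one token, and save to disk.
vocab <- tok$get_vocab()
id <- tok$token_to_id("hello")        # returns the id if the token is in the vocabulary
tok$save("tokenizer.json", pretty = TRUE)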
enable_padding(): Enable padding of encoded sequences (see the sketch after the argument list).

Usage: tokenizer$enable_padding(direction = "right", pad_id = 0, pad_type_id = 0, pad_token = "[PAD]", length = NULL, pad_to_multiple_of = NULL)

Arguments:
direction (str, optional, defaults to "right") – The direction in which to pad. Can be either "right" or "left".
pad_id (int, defaults to 0) – The id to be used when padding.
pad_type_id (int, defaults to 0) – The type id to be used when padding.
pad_token (str, defaults to "[PAD]") – The pad token to be used when padding.
length (int, optional) – If specified, the length at which to pad. If not specified, padding uses the size of the longest sequence in the batch.
pad_to_multiple_of (int, optional) – If specified, the padding length always snaps to the next multiple of the given value. For example, if we would otherwise pad to a length of 250 but pad_to_multiple_of = 8, we pad to 256 instead.
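
A short sketch of padding combined with batch encoding, continuing with the trained tokenizer tok from the previous example; the $ids field on each returned encoding is again an assumption.

# Pad on the right with "[PAD]" (id 0) and snap lengths to a multiple of 8.
tok$enable_padding(
  direction = "right",
  pad_id = 0,
  pad_token = "[PAD]",
  pad_to_multiple_of = 8
)

# Without `length`, the batch is padded to its longest sequence, then
# rounded up to the next multiple of 8 (e.g. 250 becomes 256).
batch <- tok$encode_batch(c("a short input", "a noticeably longer input sequence"))

# All encodings in the batch now contain the same number of ids
# (assuming each encoding exposes them as `$ids`).
sapply(batch, function(e) length(e$ids))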
clone(): tokenizer$clone(deep = FALSE) – The objects of this class are cloneable with this method.

Arguments:
deep – Whether to make a deep clone.
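
A brief illustration of the standard R6 clone semantics: a shallow clone shares nested R6 fields with the original, while deep = TRUE copies them as well.

tok_copy <- tok$clone()             # shallow clone: nested R6 objects are shared
tok_deep <- tok$clone(deep = TRUE)  # deep clone: nested R6 objects are copied too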