Byte pair encoding trainer class
hftokenizers::Trainer -> trainers_bpe
Inherited methods
new()
Trainer capable of training a BPE model
trainers_bpe$new( vocab_size = NULL, min_frequency = NULL, show_progress = NULL, special_tokens = NULL, limit_alphabet = NULL, initial_alphabet = NULL, continuing_subword_prefix = NULL, end_of_word_suffix = NULL )
vocab_size
(int, optional) – The size of the final vocabulary, including all tokens and alphabet.
min_frequency
(int, optional) – The minimum frequency a pair should have in order to be merged.
show_progress
(bool, optional) – Whether to show progress bars while training.
special_tokens
(List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.
limit_alphabet
(int, optional) – The maximum number of different characters to keep in the alphabet.
initial_alphabet
(List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If the strings contain more than one character, only the first one is kept.
continuing_subword_prefix
(str, optional) – A prefix to be used for every subword that is not a beginning-of-word.
end_of_word_suffix
(str, optional) – A suffix to be used for every subword that is an end-of-word.
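The effect of two of the arguments above, initial_alphabet and min_frequency, can be illustrated with a toy sketch in base R. This is not how hftokenizers implements training: the helper names normalize_initial_alphabet and mergeable_pairs and the sample word counts are hypothetical, chosen only to show what "only the first one is kept" and "minimum frequency a pair should have" mean in practice.

```r
# Toy illustration only; the helpers below are hypothetical and are not
# part of hftokenizers.

# initial_alphabet: when a string has more than one character,
# only its first character is kept.
normalize_initial_alphabet <- function(initial_alphabet) {
  unique(substr(initial_alphabet, 1, 1))
}

# min_frequency: count adjacent symbol pairs over a word-frequency table
# and keep only the pairs seen at least `min_frequency` times.
mergeable_pairs <- function(word_counts, min_frequency = 2) {
  pair_counts <- list()
  for (word in names(word_counts)) {
    symbols <- strsplit(word, "", fixed = TRUE)[[1]]
    if (length(symbols) < 2) next
    for (i in seq_len(length(symbols) - 1)) {
      pair <- paste(symbols[i], symbols[i + 1])
      previous <- if (is.null(pair_counts[[pair]])) 0 else pair_counts[[pair]]
      # Each occurrence of the word contributes one occurrence of the pair.
      pair_counts[[pair]] <- previous + word_counts[[word]]
    }
  }
  counts <- unlist(pair_counts)
  counts[counts >= min_frequency]
}

alphabet <- normalize_initial_alphabet(c("a", "bc", "abc"))
word_counts <- c(low = 5, lower = 2, newest = 1)
pairs <- mergeable_pairs(word_counts, min_frequency = 3)
print(alphabet)
print(pairs)
```

In this sketch, raising min_frequency shrinks the set of pairs eligible for merging, which is why a high value can leave the trained vocabulary smaller than the requested vocab_size.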
clone()
The objects of this class are cloneable with this method.
trainers_bpe$clone(deep = FALSE)
deep
Whether to make a deep clone.