Byte pair encoding trainer class

Super class

hftokenizers::Trainer -> trainers_bpe

Methods

Method new()

Trainer capable of training a BPE model

Usage

trainers_bpe$new(
  vocab_size = NULL,
  min_frequency = NULL,
  show_progress = NULL,
  special_tokens = NULL,
  limit_alphabet = NULL,
  initial_alphabet = NULL,
  continuing_subword_prefix = NULL,
  end_of_word_suffix = NULL
)

Arguments

vocab_size

(int, optional) – The size of the final vocabulary, including all tokens and the alphabet.

min_frequency

(int, optional) – The minimum frequency a pair should have in order to be merged.

show_progress

(bool, optional) – Whether to show progress bars while training.

special_tokens

(List[Union[str, AddedToken]], optional) – A list of special tokens the model should know of.

limit_alphabet

(int, optional) – The maximum number of different characters to keep in the alphabet.

initial_alphabet

(List[str], optional) – A list of characters to include in the initial alphabet, even if not seen in the training dataset. If a string contains more than one character, only the first one is kept.

continuing_subword_prefix

(str, optional) – A prefix to be used for every subword that is not a beginning-of-word.

end_of_word_suffix

(str, optional) – A suffix to be used for every subword that is an end-of-word.
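
Example

As a rough sketch of how these arguments fit together, the call below builds a trainer for a small vocabulary. It assumes that the class is exported from the hftokenizers namespace under the name trainers_bpe and that special_tokens accepts a character vector; neither is confirmed by this page, and the argument values are illustrative only, not documented defaults. The call is guarded so the snippet still runs when the package is unavailable.

```r
# Illustrative argument values; none of these are documented defaults.
bpe_args <- list(
  vocab_size = 1000L,                     # target size of the final vocabulary
  min_frequency = 2L,                     # pairs seen fewer than 2 times are not merged
  show_progress = FALSE,
  special_tokens = c("[UNK]", "[PAD]"),   # assumed: a character vector is accepted
  continuing_subword_prefix = "##"
)

trainer <- NULL
# Assumed: trainers_bpe is exported by hftokenizers. Check before calling.
if (requireNamespace("hftokenizers", quietly = TRUE) &&
    "trainers_bpe" %in% getNamespaceExports("hftokenizers")) {
  trainer <- do.call(hftokenizers::trainers_bpe$new, bpe_args)
}
```

Arguments left out of the list keep their NULL default from the usage above.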


Method clone()

The objects of this class are cloneable with this method.

Usage

trainers_bpe$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.
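
Example

The deep argument follows the general R6 cloning convention. The sketch below shows the difference using two hypothetical R6 classes, Holder and Inner, which are not part of this package. How a trainer object behaves under a deep clone depends on what it stores internally, which this page does not describe.

```r
library(R6)

# Hypothetical classes, used only to illustrate shallow vs. deep cloning.
Inner <- R6Class("Inner", public = list(value = 1))
Holder <- R6Class("Holder", public = list(
  inner = NULL,
  initialize = function(inner) {
    self$inner <- inner
  }
))

a <- Holder$new(Inner$new())
shallow <- a$clone()           # deep = FALSE: the inner object is shared with a
deep <- a$clone(deep = TRUE)   # deep = TRUE: R6 fields are cloned as well

a$inner$value <- 2
shallow$inner$value  # 2, because shallow and a point to the same Inner object
deep$inner$value     # 1, because deep holds its own copy of Inner
```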