The tokenizer class

Methods

Public methods


Method new()

Usage

tokenizer$new(model)


Method train()

Usage

tokenizer$train(files, trainer = NULL)


Method encode()

Usage

tokenizer$encode(
  sequence,
  pair = NULL,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)


Method decode()

Usage

tokenizer$decode(ids, skip_special_tokens = TRUE)


Method get_vocab()

Usage

tokenizer$get_vocab(with_added_tokens = FALSE)


Method save()

Usage

tokenizer$save(path, pretty = FALSE)


Method token_to_id()

Usage

tokenizer$token_to_id(token)


Method encode_batch()

Usage

tokenizer$encode_batch(
  inputs,
  is_pre_tokenized = FALSE,
  add_special_tokens = TRUE
)


Method enable_padding()

Enables padding.

Usage

tokenizer$enable_padding(
  direction = "right",
  pad_id = 0,
  pad_type_id = 0,
  pad_token = "[PAD]",
  length = NULL,
  pad_to_multiple_of = NULL
)

Arguments

direction

(str, optional, defaults to "right") – The direction in which to pad. Can be either "right" or "left".

pad_id

(int, defaults to 0) – The id to use when padding.

pad_type_id

(int, defaults to 0) – The type id to use when padding.

pad_token

(str, defaults to "[PAD]") – The pad token to use when padding.

length

(int, optional) – If specified, the length to pad to. If not specified, sequences are padded to the length of the longest sequence in the batch.

pad_to_multiple_of

(int, optional) – If specified, the padding length always snaps to the next multiple of the given value. For example, with a padding length of 250 and pad_to_multiple_of = 8, sequences are padded to 256.
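
The interaction between length, pad_to_multiple_of, and direction can be sketched in base R. The helpers padded_length() and pad_ids() below are hypothetical and written only for illustration. They are not part of the tokenizer class, and they assume that the multiple is applied after the target length is chosen.

```r
# Illustrative helpers only; they are not part of the tokenizer class.

# Target length: the fixed `length` if given, otherwise the longest sequence,
# then rounded up to the next multiple of `pad_to_multiple_of` (assumed order).
padded_length <- function(seq_lengths, length = NULL, pad_to_multiple_of = NULL) {
  target <- if (is.null(length)) max(seq_lengths) else length
  if (!is.null(pad_to_multiple_of) && pad_to_multiple_of > 0) {
    remainder <- target %% pad_to_multiple_of
    if (remainder > 0) target <- target + pad_to_multiple_of - remainder
  }
  target
}

# Pad one vector of ids up to `target` on the chosen side.
# Sequences already at or beyond `target` are returned unchanged.
pad_ids <- function(ids, target, pad_id = 0, direction = "right") {
  n_pad <- max(target - length(ids), 0)
  padding <- rep(pad_id, n_pad)
  if (direction == "right") c(ids, padding) else c(padding, ids)
}

padded_length(c(120, 250), pad_to_multiple_of = 8)   # 256
pad_ids(c(5, 6, 7), target = 5)                      # 5 6 7 0 0
pad_ids(c(5, 6, 7), target = 5, direction = "left")  # 0 0 5 6 7
```

The first call reproduces the example above: the longest sequence has 250 ids, and the next multiple of 8 is 256.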


Method clone()

The objects of this class are cloneable with this method.

Usage

tokenizer$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.
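
The difference between a shallow and a deep clone can be sketched with two small R6 classes. Inner and Outer are hypothetical classes defined only for this example. How much of a tokenizer's internal state a deep clone copies is not stated here, so this shows general R6 behaviour only.

```r
library(R6)

# Hypothetical classes: Outer holds another R6 object in a field.
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
  inner = NULL,
  initialize = function() self$inner <- Inner$new()
))

original <- Outer$new()
shallow <- original$clone()           # fields holding R6 objects are shared
deep <- original$clone(deep = TRUE)   # fields holding R6 objects are copied

original$inner$value <- 2
shallow$inner$value  # 2: same Inner object as `original`
deep$inner$value     # 1: independent copy made before the change
```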