The tokenizer class

Methods

Public methods

tokenizer$new()
tokenizer$train()
tokenizer$encode()
tokenizer$decode()
tokenizer$get_vocab()
tokenizer$save()
tokenizer$token_to_id()
tokenizer$encode_batch()
tokenizer$enable_padding()
tokenizer$clone()

Method `new()`

Usage

tokenizer$new(model)

Method `train()`

Usage

tokenizer$train(files, trainer = NULL)

Method `encode()`

Usage

tokenizer$encode(
  sequence,
  pair = NULL,
  is_pretokenized = FALSE,
  add_special_tokens = TRUE
)

Method `decode()`

Usage

tokenizer$decode(ids, skip_special_tokens = TRUE)

Method `get_vocab()`

Usage

tokenizer$get_vocab(with_added_tokens = FALSE)

Method `save()`

Usage

tokenizer$save(path, pretty = FALSE)

Method `token_to_id()`

Usage

tokenizer$token_to_id(token)

Method `encode_batch()`

Usage

tokenizer$encode_batch(
  inputs,
  is_pre_tokenized = FALSE,
  add_special_tokens = TRUE
)

Method `enable_padding()`

Enables padding of encoded sequences.

Usage

tokenizer$enable_padding(
  direction = "right",
  pad_id = 0,
  pad_type_id = 0,
  pad_token = "[PAD]",
  length = NULL,
  pad_to_multiple_of = NULL
)

Arguments

direction: (str, optional, defaults to "right") – The side on which to pad. Either "right" or "left".
pad_id: (int, defaults to 0) – The id to use for padding positions.
pad_type_id: (int, defaults to 0) – The type id to use for padding positions.
pad_token: (str, defaults to "[PAD]") – The token to use for padding positions.
length: (int, optional) – If specified, the length to pad to. If not specified, sequences are padded to the length of the longest sequence in the batch.
pad_to_multiple_of: (int, optional) – If specified, the padded length is rounded up to the next multiple of this value. For example, with a padding length of 250 and pad_to_multiple_of = 8, sequences are padded to 256.
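
The way these arguments interact can be sketched in base R. This is only an illustration of the semantics described above, not the class's implementation: `pad_ids()` is a hypothetical helper that works on a plain list of integer id vectors, and it ignores `pad_type_id` and `pad_token`.

```r
# Hypothetical helper (not part of the tokenizer class): pads a list of
# integer id vectors following the argument descriptions above.
pad_ids <- function(ids, direction = "right", pad_id = 0L,
                    length = NULL, pad_to_multiple_of = NULL) {
  # Target length: the fixed `length` if given, otherwise the longest sequence.
  target <- if (is.null(length)) max(lengths(ids)) else length
  # Round up to the next multiple, e.g. 250 becomes 256 for a multiple of 8.
  if (!is.null(pad_to_multiple_of)) {
    target <- ceiling(target / pad_to_multiple_of) * pad_to_multiple_of
  }
  lapply(ids, function(x) {
    # Sequences already at or beyond the target are returned unchanged.
    fill <- rep(pad_id, max(target - base::length(x), 0))
    if (direction == "left") c(fill, x) else c(x, fill)
  })
}

batch <- list(c(101L, 7L, 102L), c(101L, 102L))
pad_ids(batch)                          # every sequence padded to length 3
pad_ids(batch, direction = "left")      # padding added at the start instead
pad_ids(batch, pad_to_multiple_of = 8)  # every sequence padded to length 8
```

Padding to a multiple of a fixed value is a common choice when downstream hardware or kernels prefer aligned sequence lengths.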

Method `clone()`

The objects of this class are cloneable with this method.

Usage

tokenizer$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.
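
The difference between a shallow and a deep clone can be shown with a small R6 example. This is a sketch that assumes the tokenizer is an R6 class, as its `$new()` and `$clone()` interface suggests; `Inner` and `Outer` are hypothetical classes made up for illustration, and the example says nothing about how the tokenizer stores its own state.

```r
library(R6)

# Hypothetical classes: `Outer` holds another R6 object in a field.
Inner <- R6Class("Inner", public = list(value = 1))
Outer <- R6Class("Outer", public = list(
  inner = NULL,
  initialize = function() {
    self$inner <- Inner$new()
  }
))

a <- Outer$new()
shallow <- a$clone()          # the `inner` field still points to the same object
deep <- a$clone(deep = TRUE)  # R6 objects held in fields are cloned as well

a$inner$value <- 2
shallow$inner$value  # 2: shared with `a`
deep$inner$value     # 1: independent copy
```

Note that in R6 a deep clone copies fields that are themselves R6 objects; other reference-like fields, such as plain environments, remain shared unless the class defines its own cloning logic.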
