tokenizeBert

First introduced in versions: 3.00.4, 3.00.3.1

Syntax

tokenizeBert(text, vocabName, [addSpecialTokens=true])

Details

Tokenizes the input text using the specified vocabulary. This function implements the WordPiece tokenization algorithm, which is designed for use with the BERT (Bidirectional Encoder Representations from Transformers) model.
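WordPiece tokenization greedily matches the longest vocabulary entry at each position, prefixing non-initial subwords with "##" and falling back to an unknown token when no prefix matches. A minimal Python sketch of this greedy longest-match-first step (the function name `wordpiece_tokenize` and the `[UNK]` fallback are illustrative assumptions, not part of this API):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first WordPiece split of a single word.
    # Non-initial subwords are looked up with a "##" prefix.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while start < end:
            sub = ("##" if start > 0 else "") + word[start:end]
            if sub in vocab:
                pieces.append(sub)
                break
            end -= 1
        else:
            # No subword of the remainder is in the vocabulary.
            return [unk]
        start = end
    return pieces

# With a vocabulary containing "un", "##aff", "##able":
# wordpiece_tokenize("unaffable", {"un", "##aff", "##able"})
# yields ["un", "##aff", "##able"]
```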

Parameters

text A LITERAL scalar representing the text to be tokenized.

vocabName A STRING scalar specifying the name of the vocabulary to use.

addSpecialTokens (optional) A boolean indicating whether to add special tokens at the beginning and end of the input text. Currently, only [CLS] at the beginning and [SEP] at the end are supported. Defaults to true.

Returns

A table with the following columns:

  • tokens: The list of tokens.
  • input_ids: The corresponding token IDs.
  • attention_mask: The attention mask values used for model input; currently always 1.
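How the three columns relate can be sketched as follows, assuming the tokens have already been produced and the vocabulary is available as a token-to-ID mapping (the helper name `assemble_output` and the dictionary return shape are illustrative assumptions; the actual function returns a table):

```python
def assemble_output(tokens, vocab, add_special_tokens=True):
    # Optionally wrap the token sequence in [CLS] ... [SEP],
    # then derive the ID and attention-mask columns.
    if add_special_tokens:
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
    return {
        "tokens": tokens,
        "input_ids": [vocab[t] for t in tokens],  # one ID per token
        "attention_mask": [1] * len(tokens),      # currently always 1
    }
```

Each row of the returned table pairs one token with its ID and mask value, so all three columns have equal length.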

Examples

loadVocab("/home/data/vocab.txt", "vocab1")
tokenizeBert("apple ```\n—— abcd1234", "vocab1", true)

Related functions: loadVocab, unloadVocab