tokenizeBert

Syntax

tokenizeBert(text, vocabName, [addSpecialTokens=true])

Arguments

text A STRING scalar representing the text to be tokenized.

vocabName A STRING scalar specifying the name of the vocabulary to use.

addSpecialTokens (optional) A Boolean scalar indicating whether to add special tokens to the input text. Currently, only [CLS] at the beginning and [SEP] at the end are supported. Defaults to true.

Details

Tokenizes the input text using the specified vocabulary. This function uses the WordPiece tokenization algorithm, designed for use with the BERT (Bidirectional Encoder Representations from Transformers) model.

Return value: A table with the following columns:

  • tokens: The list of tokens produced from the input text.
  • input_ids: The vocabulary ID corresponding to each token.
  • attention_mask: The mask value for model input; currently always set to 1.
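To illustrate how WordPiece works, the following is a minimal Python sketch of the greedy longest-match-first algorithm described above. It is not the function's actual implementation; the "##" continuation prefix, the [UNK] token, and the sample vocabulary are assumptions following the original BERT convention.

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization (illustrative sketch)."""
    tokens = []
    for word in text.lower().split():
        start = 0
        sub_tokens = []
        while start < len(word):
            # Try the longest remaining substring first, shrinking until a match.
            end = len(word)
            cur = None
            while start < end:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # continuation pieces carry the "##" prefix
                if piece in vocab:
                    cur = piece
                    break
                end -= 1
            if cur is None:
                # No sub-piece matched: the whole word maps to [UNK].
                sub_tokens = [unk_token]
                break
            sub_tokens.append(cur)
            start = end
        tokens.extend(sub_tokens)
    return tokens

# Hypothetical vocabulary for demonstration only.
vocab = {"[UNK]", "app", "##le", "play", "##ing"}
print(wordpiece_tokenize("apple playing", vocab))  # ['app', '##le', 'play', '##ing']
```

The actual function performs this matching against the vocabulary loaded with loadVocab and additionally returns the corresponding input_ids and attention_mask columns.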

Examples

loadVocab("/home/data/vocab.txt", "vocab1")
tokenizeBert("apple ```\n—— abcd1234", "vocab1", true)

Related functions: loadVocab, unloadVocab