tokenizeBert
First introduced in versions: 3.00.4, 3.00.3.1
Syntax
tokenizeBert(text, vocabName, [addSpecialTokens=true])
Details
Tokenizes the input text using the specified vocabulary. This function implements the WordPiece tokenization algorithm, the subword tokenizer used by the BERT (Bidirectional Encoder Representations from Transformers) model.
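As a rough illustration of how WordPiece splits a word, the sketch below (in Python, with a hypothetical toy vocabulary; the actual vocabulary comes from the file passed to loadVocab) applies the standard greedy longest-match-first rule: continuation pieces carry a "##" prefix, and a word with no matching pieces maps to [UNK].

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first WordPiece over a single word.
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces are prefixed with ##
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk]  # no piece matched: the whole word becomes [UNK]
        tokens.append(cur)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration only.
vocab = {"app", "##le", "un", "##believ", "##able"}
print(wordpiece_tokenize("apple", vocab))         # ['app', '##le']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```

A real BERT vocabulary contains tens of thousands of such pieces, so most common words survive as a single token and only rarer words are split.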
Parameters
text A LITERAL scalar indicating the text to be tokenized.
vocabName A STRING scalar specifying the name of the vocabulary to use.
addSpecialTokens (optional) A Boolean indicating whether to add special tokens at the beginning and end of the input text. Currently, only [CLS] at the beginning and [SEP] at the end are supported. Defaults to true.
Returns
A table with the following columns:
- tokens: The list of tokens produced by the tokenizer.
- input_ids: The vocabulary ID corresponding to each token.
- attention_mask: The mask value used for model input, currently always set to 1.
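To make the relationship between the three columns concrete, the Python sketch below builds them by hand for a tokenized input with special tokens added (the token-to-ID mapping here is a hypothetical toy vocabulary, not the IDs a real BERT vocabulary would assign): input_ids is the vocabulary lookup of each token, and attention_mask is 1 for every token since this function produces no padding.

```python
# Hypothetical token-to-ID mapping for illustration only.
vocab_ids = {"[CLS]": 101, "[SEP]": 102, "app": 4, "##le": 5}

# Tokens as they would appear with addSpecialTokens=true.
tokens = ["[CLS]", "app", "##le", "[SEP]"]

input_ids = [vocab_ids[t] for t in tokens]   # one ID per token
attention_mask = [1] * len(tokens)           # no padding, so all ones

print(input_ids)       # [101, 4, 5, 102]
print(attention_mask)  # [1, 1, 1, 1]
```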
Examples
loadVocab("/home/data/vocab.txt", "vocab1")
tokenizeBert("apple ```\n—— abcd1234", "vocab1", true)
Related functions: loadVocab, unloadVocab
