tokenizeBert
Syntax
tokenizeBert(text, vocabName, [addSpecialTokens=true])
Arguments
text A LITERAL scalar representing the text to be tokenized.
vocabName A STRING scalar specifying the name of the vocabulary to use.
addSpecialTokens (optional) A Boolean value indicating whether to add special tokens
at the beginning and end of the input text. Currently, only [CLS] at the beginning
and [SEP] at the end are supported. Defaults to true.
Details
Tokenizes the input text using the specified vocabulary. This function uses the WordPiece tokenization algorithm, designed for use with the BERT (Bidirectional Encoder Representations from Transformers) model.
Return value: A table with the following columns:
- tokens: The list of tokens produced from the input text.
- input_ids: The vocabulary ID of each token.
- attention_mask: The mask value used as model input; currently always set to 1.
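The WordPiece step mentioned above can be illustrated with a short sketch (plain Python, not DolphinDB; the vocabulary, the greedy longest-match loop, and the [UNK] fallback here are simplified assumptions, and real BERT tokenizers also perform punctuation splitting and optional lowercasing first):

```python
def wordpiece_tokenize(text, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece tokenization (simplified sketch)."""
    tokens = []
    for word in text.split():
        start = 0
        sub_tokens = []
        while start < len(word):
            # Greedily match the longest subword present in the vocabulary
            end = len(word)
            cur = None
            while start < end:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # continuation pieces carry the ## prefix
                if piece in vocab:
                    cur = piece
                    break
                end -= 1
            if cur is None:
                sub_tokens = [unk_token]  # no match: map the whole word to [UNK]
                break
            sub_tokens.append(cur)
            start = end
        tokens.extend(sub_tokens)
    return tokens

# Hypothetical toy vocabulary for illustration only
vocab = {"app", "##le", "tok", "##en", "##ize", "[UNK]"}
print(wordpiece_tokenize("apple tokenize xyz", vocab))
# → ['app', '##le', 'tok', '##en', '##ize', '[UNK]']
```

In the actual function, the vocabulary comes from a file registered via loadVocab, and the resulting tokens are mapped to their input_ids using that same vocabulary.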
Examples
loadVocab("/home/data/vocab.txt", "vocab1")
tokenizeBert("apple ```\n—— abcd1234", "vocab1", true)
Related functions: loadVocab, unloadVocab