encode
Encodes a string into tokens.
Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. So we want to be careful about accidentally encoding special tokens, since they can be used to trick a model into doing something we don't want it to do.
Hence, by default, encode will raise an error if it encounters text that corresponds to a special token. This behavior can be controlled on a per-token level using the allowedSpecial and disallowedSpecial
parameters. In particular:
Setting disallowedSpecial to an empty set will prevent this function from raising exceptions and cause all text corresponding to special tokens to be encoded as natural text.
Setting allowedSpecial to "all" will cause all text corresponding to special tokens to be encoded as special tokens.
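The check described above can be sketched roughly as follows. This is a simplified illustration, not the real implementation: the names SPECIAL_TOKENS and checkSpecialTokens are hypothetical, and only the allowed/disallowed resolution logic is shown.

```kotlin
// Hypothetical set of special tokens; the real tokenizer defines its own.
val SPECIAL_TOKENS: Set<String> = setOf("<|endoftext|>")

// Sketch of the validation encode performs before tokenizing (names are illustrative).
fun checkSpecialTokens(
    text: String,
    allowedSpecial: Set<String> = emptySet(),
    disallowedSpecial: Set<String> = setOf("all"),
) {
    // "all" expands to every known special token.
    val allowed = if ("all" in allowedSpecial) SPECIAL_TOKENS else allowedSpecial
    // Disallowed tokens are everything not explicitly allowed (when set to "all").
    val disallowed =
        if ("all" in disallowedSpecial) SPECIAL_TOKENS - allowed else disallowedSpecial
    for (token in disallowed) {
        require(token !in text) { "Text contains disallowed special token $token" }
    }
}
```

With the defaults, plain text passes the check, while text containing "<|endoftext|>" raises an exception unless that token is listed in allowedSpecial.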
>>> tokenizer.encode("hello world")
[31373, 995]
>>> tokenizer.encode("<|endoftext|>", allowedSpecial = setOf("<|endoftext|>"))
[50256]
>>> tokenizer.encode("<|endoftext|>", allowedSpecial = setOf("all"))
[50256]
>>> tokenizer.encode("<|endoftext|>")
# Raises exception
>>> tokenizer.encode("<|endoftext|>", disallowedSpecial = emptySet())
[27, 91, 437, 1659, 5239, 91, 29]
Return
An array of integers representing the encoded text.
Parameters
text - The text to be encoded.
allowedSpecial - A set of special tokens that are permitted during the encoding process.
disallowedSpecial - A set of special tokens that are not allowed during the encoding process; encountering one raises an exception.