encode

abstract fun encode(text: String, allowedSpecial: Set<String> = emptySet(), disallowedSpecial: Set<String> = setOf("all")): List<Int>

Encodes a string into tokens.

Special tokens are artificial tokens used to unlock capabilities from a model, such as fill-in-the-middle. Text that is accidentally encoded as a special token can be used to trick a model into doing something we don't want it to do, so such text has to be handled with care.

Hence, by default, encode throws an exception if it encounters text that corresponds to a special token. This can be controlled on a per-token level using the allowedSpecial and disallowedSpecial parameters. In particular:

  • Setting disallowedSpecial to an empty set prevents this function from throwing exceptions and causes all text corresponding to special tokens to be encoded as natural text.

  • Setting allowedSpecial to setOf("all") causes all text corresponding to special tokens to be encoded as special tokens.

>>> tokenizer.encode("hello world")
[31373, 995]
>>> tokenizer.encode("<|endoftext|>", allowedSpecial = setOf("<|endoftext|>"))
[50256]
>>> tokenizer.encode("<|endoftext|>", allowedSpecial = setOf("all"))
[50256]
>>> tokenizer.encode("<|endoftext|>")
# Raises exception
>>> tokenizer.encode("<|endoftext|>", disallowedSpecial = emptySet())
[27, 91, 437, 1659, 5239, 91, 29]
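
The way allowedSpecial and disallowedSpecial interact can be sketched in a few lines. This is a toy model, not the library's implementation: specialTokens, resolveDisallowed, and checkSpecial are hypothetical names, the vocabulary is made up, and the rule that "all" in disallowedSpecial means "every special token not explicitly allowed" is an assumption that happens to match the examples above.

```kotlin
// Hypothetical special-token vocabulary; a real tokenizer defines its own.
val specialTokens = setOf("<|endoftext|>", "<|fim_prefix|>")

// Resolves the "all" shorthand and returns the special tokens that must be rejected.
// Assumption: "all" in disallowedSpecial means every special token not explicitly allowed.
fun resolveDisallowed(
    allowedSpecial: Set<String> = emptySet(),
    disallowedSpecial: Set<String> = setOf("all")
): Set<String> {
    val allowed = if ("all" in allowedSpecial) specialTokens else allowedSpecial
    return if ("all" in disallowedSpecial) specialTokens - allowed else disallowedSpecial
}

// Throws IllegalArgumentException if the text contains a disallowed special token.
fun checkSpecial(
    text: String,
    allowedSpecial: Set<String> = emptySet(),
    disallowedSpecial: Set<String> = setOf("all")
) {
    val hit = resolveDisallowed(allowedSpecial, disallowedSpecial).firstOrNull { it in text }
    require(hit == null) { "Disallowed special token in text: $hit" }
}

fun main() {
    checkSpecial("hello world")                                    // no special token, passes
    checkSpecial("<|endoftext|>", allowedSpecial = setOf("all"))   // explicitly allowed, passes
    checkSpecial("<|endoftext|>", disallowedSpecial = emptySet())  // check disabled, passes
    val rejected = runCatching { checkSpecial("<|endoftext|>") }.isFailure
    println(rejected)  // true: the defaults reject special-token text
}
```

The sketch only covers the validation step. Producing the token IDs themselves depends on the tokenizer's vocabulary and is left out.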

Return

A list of integers representing the encoded text.

Parameters

text

The text to be encoded.

allowedSpecial

A set of special tokens that may be encoded as special tokens when they appear in the text. Pass setOf("all") to allow every special token. Defaults to an empty set.

disallowedSpecial

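A set of special tokens that cause an exception when they appear in the text. Defaults to setOf("all"). Pass an empty set to disable the check, so that such text is encoded as natural text.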