LLM API Basics — Prompts, Tokens, and Temperature
Think of an LLM tokenizer as a specialized dictionary in which common words, subwords, and frequent byte sequences each get a single integer ID, or "token." Plain English text averages roughly four characters per token, but rare or complex sequences (emoji, hex literals, unusual symbols) are often broken into many smaller, unexpected tokens.
The Setup
You are writing a preprocessing utility for an LLM-powered document classifier. To prevent API payload rejection, you use a standard character-to-token ratio heuristic to determine whether a document chunk fits within the remaining context window before calling the API.
What Does This Print?
def trim_to_token_limit(prompt: str, max_tokens: int = 100) -> str:
    # Heuristic: 1 token is roughly 4 characters
    max_chars = max_tokens * 4
    if len(prompt) > max_chars:
        return prompt[:max_chars]
    return prompt

# A user submits a prompt with heavy Unicode formatting and technical code
prompt = "System diagnostic: ⚡🚀🔋 (Error code: [0xFF99_X])" * 10
trimmed = trim_to_token_limit(prompt, max_tokens=20)
print(f"Trimmed character length: {len(trimmed)}")
The Output
Trimmed character length: 80
The character count is exactly 80, but passing this string to a Byte-Pair Encoding (BPE) tokenizer such as OpenAI's cl100k_base yields more tokens than the 20-token budget allows; the exact count depends on the tokenizer's vocabulary. The heuristic fails because emoji, symbols, and code-like syntax such as [0xFF99_X] do not follow the four-characters-per-token average. Emoji and rare characters are split into several byte-level tokens, so the number of tokens per character rises well above the assumed 0.25.
Why Python Does This
In Python, len() on a string returns the number of Unicode code points, not bytes and not tokens. Under the hood, CPython stores strings using PEP 393's flexible representation (1, 2, or 4 bytes per code point, depending on the widest character in the string). LLMs, however, do not see code points; they consume tokens, which tokenizers such as cl100k_base build by applying byte-pair encoding to the text's UTF-8 bytes. For example, '⚡' is 1 character in Python but 3 bytes in UTF-8, and a byte-level BPE tokenizer may map those bytes to more than one token. Slicing a Python string by character count therefore cuts at positions that have no relation to the vocabulary of the target model's tokenizer. To manage a token budget accurately, use the model's tokenizer library (tiktoken for OpenAI models, whose core is implemented in Rust rather than C) to count actual tokens, and slice the list of token IDs instead.
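The gap between code points and bytes can be checked with the standard library alone. The following is a minimal sketch; the `samples` dictionary and the `utf8_len` helper are illustrative names, not part of the original utility, and the emoji are written as escape sequences so the exact code points are unambiguous.

```python
# Code points (what len() counts) versus UTF-8 bytes
# (the raw material a byte-level BPE tokenizer works from).
samples = {
    "ASCII letter": "E",
    "high voltage sign": "\u26a1",  # the lightning emoji from the prompt
    "rocket": "\U0001F680",         # the rocket emoji from the prompt
    "battery": "\U0001F50B",        # the battery emoji from the prompt
}


def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


for name, ch in samples.items():
    # Every sample is 1 code point; the byte counts are 1, 3, 4, and 4.
    print(f"{name:18} code points={len(ch)}  utf-8 bytes={utf8_len(ch)}")
```

Each emoji counts as a single character for the slicing heuristic, yet it hands the tokenizer three or four bytes to encode, which is why a character budget says little about a token budget.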
The Fix
import tiktoken

def trim_to_token_limit(prompt: str, max_tokens: int = 100) -> str:
    # Get the correct tokenizer encoding for the model
    encoding = tiktoken.get_encoding("cl100k_base")
    # Encode the entire string to a list of token IDs
    tokens = encoding.encode(prompt)
    if len(tokens) > max_tokens:
        # Slice the token list directly, then decode back to a string
        return encoding.decode(tokens[:max_tokens])
    return prompt

Because the function runs the model's actual Byte-Pair Encoding (BPE) tokenizer, it measures the prompt the way the LLM "sees" it. The limit is enforced in the model's own units rather than estimated from character counts, so a prompt is neither cut shorter than necessary nor allowed to overflow the context window.
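One detail worth guarding against: with a byte-level tokenizer, a cut at an arbitrary token index can land inside a multi-byte character, and decoding the sliced list then ends in a replacement character (U+FFFD). The sketch below shows one possible way to back off to a clean boundary. `TokenCodec`, `Utf8ByteCodec`, and `trim_on_character_boundary` are hypothetical names introduced for illustration; the stand-in codec treats every UTF-8 byte as one token so the example runs without third-party packages, and it is not a real model tokenizer. An object returned by `tiktoken.get_encoding("cl100k_base")` exposes `encode` and `decode` methods of the same shape and could be passed in its place.

```python
from typing import Protocol, Sequence


class TokenCodec(Protocol):
    """Hypothetical interface: anything with tiktoken-style encode/decode."""

    def encode(self, text: str) -> list[int]: ...

    def decode(self, tokens: Sequence[int]) -> str: ...


class Utf8ByteCodec:
    """Stand-in for the demo: one token per UTF-8 byte (not a model tokenizer)."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: Sequence[int]) -> str:
        # Invalid or incomplete byte sequences become U+FFFD,
        # which is also tiktoken's default decode behavior.
        return bytes(tokens).decode("utf-8", errors="replace")


def trim_on_character_boundary(prompt: str, max_tokens: int, codec: TokenCodec) -> str:
    """Trim to at most max_tokens tokens without splitting a character."""
    tokens = codec.encode(prompt)
    if len(tokens) <= max_tokens:
        return prompt
    # Back off one token at a time until the decoded text is an exact
    # prefix of the original prompt, i.e. no character was cut in half.
    for cut in range(max_tokens, 0, -1):
        text = codec.decode(tokens[:cut])
        if prompt.startswith(text):
            return text
    return ""


codec = Utf8ByteCodec()
sample = "ab\u26a1cd"  # the emoji occupies bytes 3 to 5 of 7
print(trim_on_character_boundary(sample, 3, codec))  # cut falls inside the emoji: prints "ab"
print(trim_on_character_boundary(sample, 5, codec) == "ab\u26a1")  # emoji fits whole: prints True
```

The prefix check is a design choice: comparing against the original prompt avoids treating a U+FFFD that was already present in the input as a sign of a bad cut.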
How This Fails in Real Systems
An enterprise log-analysis service used the character-based truncation heuristic to process core dumps before sending them to GPT-4. When an application crash dumped extensive memory addresses, escape sequences, and stack traces, the number of tokens per character rose far above what the heuristic assumed. The API client repeatedly failed with 400 Bad Request errors caused by context window limit violations, leaving the operations team blind to the crash origin for 14 hours.