LLM API Basics — Prompts, Tokens, and Temperature

Mental Model

Imagine an LLM tokenizer as a highly specialized dictionary in which common words, subwords, and frequent byte sequences each get a single, compact "ID" or "token." Plain prose averages out to a rough character-to-token ratio, but rare or complex sequences (emojis, hex literals, unusual symbols) are broken into many smaller tokens, often far more than you would expect.

Rule: Never use character-based heuristics to estimate LLM token footprints; always run payloads through the model's dedicated BPE tokenizer.

The Setup

You are writing a preprocessing utility for an LLM-powered document classifier. To prevent API payload rejection, you use a standard character-to-token ratio heuristic to determine whether a document chunk fits within the remaining context window before calling the API.

What Does This Print?

Broken code
Python
def trim_to_token_limit(prompt: str, max_tokens: int = 100) -> str:
    # Heuristic: 1 token is roughly 4 characters
    max_chars = max_tokens * 4
    if len(prompt) > max_chars:
        return prompt[:max_chars]
    return prompt

# A user submits a prompt with heavy Unicode formatting and technical code
prompt = "System diagnostic: ⚡🚀🔋 (Error code: [0xFF99_X])" * 10
trimmed = trim_to_token_limit(prompt, max_tokens=20)
print(f"Trimmed character length: {len(trimmed)}")
Predict what happens to the actual token count of the returned string compared to your expected 20-token limit.

The Output

What actually happens
Trimmed character length: 80

The character count is exactly 80, but passing this string to a Byte-Pair Encoding (BPE) tokenizer such as OpenAI's cl100k_base yields considerably more than 20 tokens (the exact count depends on the encoding), well past your constraint. The heuristic fails because Unicode symbols, emojis, and code-like syntax do not follow the naive 4-characters-per-token average. Emojis and rare characters are split into multiple byte-level tokens, and punctuation-heavy fragments such as [0xFF99_X] break into many short tokens, so the string carries far more tokens per character than the heuristic assumes.

Why Python Does This

In Python, len() on a string returns the number of Unicode code points (characters), not bytes and not tokens. Under the hood, CPython strings use PEP 393's flexible string representation, which stores 1, 2, or 4 bytes per code point (Latin-1, UCS-2, or UCS-4) depending on the contents. LLMs, however, do not see Unicode characters directly; they consume tokens produced by a byte-pair encoding algorithm that operates on the UTF-8 bytes of the text. For example, a single emoji like '⚡' is 1 character in Python but occupies 3 bytes in UTF-8, and the BPE tokenizer can map those bytes to more than one token. Slicing a Python string by character count therefore cuts at arbitrary points, with no knowledge of the target model's vocabulary. To manage token budgets accurately, use the model's tokenizer library (tiktoken for OpenAI models, a Python package with a compiled Rust core) to compute actual token counts, and slice the encoded list of token IDs instead.
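The gap between code points and bytes can be checked with the standard library alone. The sketch below is illustrative: the helper name char_and_byte_counts is made up for this example, and the escape sequences are assumed to match the three emojis in the prompt above (written as escapes so the exact code points are unambiguous). It does not count tokens; it only shows that len() hides the extra bytes a byte-level tokenizer has to work through.

```python
# Escape sequences keep the exact code points unambiguous.
EMOJIS = {
    "high voltage": "\u26a1",
    "rocket": "\U0001F680",
    "battery": "\U0001F50B",
}


def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (Unicode code points, UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for name, emoji in EMOJIS.items():
    chars, nbytes = char_and_byte_counts(emoji)
    print(f"{name}: {chars} character, {nbytes} UTF-8 bytes")

# The prompt from the broken example, trimmed by characters as before.
prompt = (
    "System diagnostic: \u26a1\U0001F680\U0001F50B (Error code: [0xFF99_X])" * 10
)
trimmed = prompt[:80]
chars, nbytes = char_and_byte_counts(trimmed)
print(f"trimmed: {chars} characters, {nbytes} UTF-8 bytes")
# trimmed: 80 characters, 96 UTF-8 bytes
```

The six emojis in the trimmed string add 16 bytes that len() never reports, and those bytes are what the tokenizer actually splits.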

The Fix

Corrected pattern
Python
import tiktoken

def trim_to_token_limit(prompt: str, max_tokens: int = 100) -> str:
    # Get the correct tokenizer encoding for the model
    encoding = tiktoken.get_encoding("cl100k_base")
    # Encode the entire string to a list of token IDs
    tokens = encoding.encode(prompt)
    if len(tokens) > max_tokens:
        # Slice the token array directly, then decode back to a string
        return encoding.decode(tokens[:max_tokens])
    return prompt

Because it uses the model's actual Byte-Pair Encoding (BPE) tokenizer, this version measures the prompt the way the LLM "sees" it. The limit is enforced on token IDs rather than characters, so the prompt is cut at max_tokens tokens instead of at an arbitrary character offset that may overshoot the context window. Make sure the encoding name matches the target model; cl100k_base is the encoding used by GPT-4.
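One edge case of slicing token IDs is worth knowing about: because byte-level tokens can cover only part of a multi-byte character, a cut may land inside an emoji, and decoding the slice then produces the U+FFFD replacement character. The sketch below shows one possible way to back off to a clean boundary. The helper trim_on_clean_boundary and the one-token-per-byte toy tokenizer are hypothetical stand-ins invented for illustration, not how cl100k_base tokenizes text. With tiktoken, the assumption is that encoding.encode and encoding.decode_bytes would be passed in their place.

```python
from typing import Callable, List


def trim_on_clean_boundary(
    text: str,
    max_tokens: int,
    encode: Callable[[str], List[int]],
    decode_bytes: Callable[[List[int]], bytes],
) -> str:
    """Trim to at most max_tokens, backing off until the bytes are valid UTF-8."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    for end in range(max_tokens, 0, -1):
        try:
            return decode_bytes(tokens[:end]).decode("utf-8")
        except UnicodeDecodeError:
            continue  # the cut landed inside a multi-byte character
    return ""


# Toy tokenizer for illustration only: one "token" per UTF-8 byte.
def toy_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_decode_bytes(tokens: List[int]) -> bytes:
    return bytes(tokens)


sample = "ok \U0001F680 go"

# Naive slice: 5 byte-tokens ends two bytes into the 4-byte rocket emoji.
naive = toy_decode_bytes(toy_encode(sample)[:5]).decode("utf-8", errors="replace")
safe = trim_on_clean_boundary(sample, 5, toy_encode, toy_decode_bytes)

print(repr(naive))  # contains the U+FFFD replacement character
print(repr(safe))   # 'ok '
```

The back-off returns slightly fewer tokens than the limit allows, which is usually an acceptable trade for not sending a corrupted character to the model.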

How This Fails in Real Systems

An enterprise log-analysis service used the character-based truncation heuristic to process core dumps before sending them to GPT-4. When an application crash dumped extensive memory addresses, escape sequences, and stack traces, the character-to-token ratio exploded. The API client repeatedly crashed with 400 Bad Request errors due to context window limit violations, leaving the operations team blind to the crash origin for 14 hours.

Key Takeaway

Never use character-based heuristics to estimate LLM token footprints; always run payloads through the model's dedicated BPE tokenizer.
Common mistake: Assuming that token counts track character counts, and therefore enforcing LLM input limits with string-length heuristics instead of actual tokenization.