Streaming LLM Responses — async for chunk Patterns

Mental Model

The asyncio event loop is like a single, hyper-efficient chef juggling many tasks. If the chef spends 10 minutes chopping vegetables (a blocking, synchronous task), nothing else gets done during that time, even if other tasks are ready. To keep things moving, the chef delegates the chopping to an assistant (a thread pool) and checks back when it is done.

Rule: Never run blocking, synchronous I/O or CPU-heavy calculations directly inside an async for streaming loop; delegate them to thread pools.

The Setup

You are building a low-latency chat interface. To make the UI highly responsive, you use asynchronous streaming to yield tokens as they arrive from the LLM. Inside the generator loop, you record a diagnostic log to a local file using basic Python I/O operations.

What Does This Print?

Broken code
Python
import asyncio
import time

async def mock_llm_stream():
    # Simulating LLM yielding word chunks
    for chunk in ["AI ", "agents ", "are ", "active."]:
        await asyncio.sleep(0.05)
        yield chunk

async def stream_and_log_response():
    async for token in mock_llm_stream():
        # Standard blocking file write inside the hot event loop
        with open("session_log.txt", "a") as log_file:
            log_file.write(token)
            # Simulating OS disk write overhead
            time.sleep(0.1)

async def main():
    # Running the streaming task alongside another task
    start = time.time()
    await asyncio.gather(stream_and_log_response(), asyncio.sleep(0.1))
    print(f"Completed in {time.time() - start:.2f}s")

asyncio.run(main())
Predict how long the run takes, and whether the logging delays the auxiliary sleep task.

The Output

What actually happens
Completed in 0.61s

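The run takes a little over 0.6 seconds: four chunks, each costing 0.05 s of awaited sleep plus 0.1 s of blocking time.sleep, with nothing overlapping. Although the stream is 'async', time.sleep and the synchronous file write hold the event loop's single thread, so no other task can run until they return. The auxiliary asyncio.sleep(0.1) is due at about 0.10 s, but the loop is blocked from roughly 0.05 s to 0.15 s, so the task cannot resume until the blocking call finishes. Every other coroutine scheduled on the loop is stalled the same way during each write.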
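
To see the delay directly, the auxiliary sleep can be timed on its own. The following is a minimal sketch that reuses the mock stream from the broken example; timed_sleep and measure are hypothetical helper names introduced here, and the file write is replaced by a bare time.sleep so the script has no side effects.

```python
import asyncio
import time

async def mock_llm_stream():
    # Same mock stream as the broken example
    for chunk in ["AI ", "agents ", "are ", "active."]:
        await asyncio.sleep(0.05)
        yield chunk

async def blocking_consumer():
    async for _ in mock_llm_stream():
        # Stands in for the blocking log write
        time.sleep(0.1)

async def timed_sleep(delay: float) -> float:
    """Return how long asyncio.sleep(delay) really took to resume."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    return time.perf_counter() - start

async def measure() -> float:
    # gather returns results in argument order; keep only the sleep timing
    _, elapsed = await asyncio.gather(blocking_consumer(), timed_sleep(0.1))
    return elapsed

async def main():
    elapsed = await measure()
    print(f"asyncio.sleep(0.1) resumed after {elapsed:.2f}s")

asyncio.run(main())
```

On a typical machine the sleep should resume noticeably later than the requested 0.1 s, because its timer comes due while the loop is stuck in the first time.sleep call.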

Why Python Does This

asyncio runs all coroutines on a single thread, driven by the event loop. An async for loop hands control back to the event loop only at await points, such as when the generator awaits the next chunk. If the loop body makes synchronous, non-awaitable calls (standard file I/O, time.sleep(), or blocking HTTP requests), there is no await at which Python can switch tasks. The thread blocks in the OS-level system call, the event loop cannot start its next iteration, and every other pending task waits. To preserve concurrency, any I/O inside a hot async stream must be non-blocking, offloaded to a thread pool with run_in_executor, or handled by an async-compatible library such as aiofiles (which itself delegates file operations to a thread pool).
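
One way to observe this starvation is a heartbeat task that records the real time between its ticks. The sketch below is illustrative only: heartbeat, blocking_work, cooperative_work, and max_gap are hypothetical names, and the 0.01 s interval and 0.2 s workload are arbitrary choices.

```python
import asyncio
import time

async def heartbeat(interval: float, gaps: list, stop: asyncio.Event) -> None:
    """Append the real time between consecutive ticks to gaps."""
    last = time.perf_counter()
    while not stop.is_set():
        await asyncio.sleep(interval)
        now = time.perf_counter()
        gaps.append(now - last)
        last = now

async def blocking_work():
    time.sleep(0.2)           # holds the event loop's thread

async def cooperative_work():
    await asyncio.sleep(0.2)  # yields to the event loop while waiting

async def max_gap(work) -> float:
    """Run work() next to a heartbeat and return the longest gap seen."""
    gaps = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(0.01, gaps, stop))
    await asyncio.sleep(0.05)  # let the heartbeat tick a few times
    await work()
    await asyncio.sleep(0.05)
    stop.set()
    await hb
    return max(gaps)

async def main():
    print(f"blocking:    longest gap {await max_gap(blocking_work):.3f}s")
    print(f"cooperative: longest gap {await max_gap(cooperative_work):.3f}s")

asyncio.run(main())
```

With the blocking workload, the longest gap should be close to the full 0.2 s; with the cooperative one, it should stay near the heartbeat interval.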

The Fix

Corrected pattern
Python
import asyncio
import time
import anyio

async def mock_llm_stream():
    for chunk in ["AI ", "agents ", "are ", "active."]:
        await asyncio.sleep(0.05)
        yield chunk

def write_log(token: str):
    with open("session_log.txt", "a") as f:
        f.write(token)

async def stream_and_log_response_safe():
    async for token in mock_llm_stream():
        # Delegate blocking I/O to a thread pool, freeing the event loop
        await anyio.to_thread.run_sync(write_log, token)

async def main():
    start = time.time()
    await asyncio.gather(stream_and_log_response_safe(), asyncio.sleep(0.1))
    print(f"Completed in {time.time() - start:.2f}s")

asyncio.run(main())

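By offloading the synchronous file write to a worker thread with anyio.to_thread.run_sync() (anyio is a third-party package; the standard library equivalents are asyncio.to_thread() and loop.run_in_executor()), the event loop stays free to run other coroutines while the write is in progress. The stream still waits for each write to finish before requesting the next token, but the write no longer blocks the rest of the application. Note that this version also drops the simulated 0.1 s disk delay, so it finishes in roughly 0.2 s, the time spent awaiting the four chunks.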
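
If you prefer to avoid a third-party dependency, the same delegation can be sketched with the standard library alone. The example below assumes Python 3.9+ for asyncio.to_thread; stream_with_to_thread, stream_with_executor, and demo are hypothetical names, and the log goes to a temporary directory so the script leaves nothing behind.

```python
import asyncio
import os
import tempfile

async def mock_llm_stream():
    for chunk in ["AI ", "agents ", "are ", "active."]:
        await asyncio.sleep(0.05)
        yield chunk

def write_log(path: str, token: str) -> None:
    # Ordinary blocking write; it runs in a worker thread, not on the loop
    with open(path, "a") as f:
        f.write(token)

async def stream_with_to_thread(path: str) -> None:
    async for token in mock_llm_stream():
        # Python 3.9+: run the blocking call in the default thread pool
        await asyncio.to_thread(write_log, path, token)

async def stream_with_executor(path: str) -> None:
    loop = asyncio.get_running_loop()
    async for token in mock_llm_stream():
        # Passing None selects the loop's default ThreadPoolExecutor
        await loop.run_in_executor(None, write_log, path, token)

def demo() -> str:
    """Run both variants against one temporary log and return its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "session_log.txt")
        asyncio.run(stream_with_to_thread(path))
        asyncio.run(stream_with_executor(path))
        with open(path) as f:
            return f.read()

print(demo())
```

Both variants await each write before pulling the next token, so the log order matches the stream order.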

How This Fails in Real Systems

A real-time customer service agent system ran on FastAPI, serving 100 concurrent chat streams. A junior engineer added standard synchronous database logging inside the stream generator loop to log daily token usage metrics. The production server CPU usage immediately spiked to 100%, causing overall system latencies to jump from 150ms to over 12 seconds, dropping 90% of user connections.
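
A problem like this can often be caught before production with asyncio's debug mode, which logs a warning to the "asyncio" logger whenever a single callback or task step runs longer than loop.slow_callback_duration (0.1 s by default). The sketch below is not taken from the incident above: handler_with_blocking_call stands in for a request handler with a synchronous database call, and ListHandler and find_slow_callbacks are hypothetical helpers for capturing the warnings.

```python
import asyncio
import logging
import time

class ListHandler(logging.Handler):
    """Collect formatted log messages in a list."""
    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

async def handler_with_blocking_call():
    # Stand-in for a synchronous database write inside a coroutine
    time.sleep(0.2)

def find_slow_callbacks() -> list:
    capture = ListHandler()
    logger = logging.getLogger("asyncio")
    logger.addHandler(capture)
    try:
        # debug=True enables slow-callback detection on the event loop
        asyncio.run(handler_with_blocking_call(), debug=True)
    finally:
        logger.removeHandler(capture)
    # Slow-callback warnings have the form "Executing <...> took N seconds"
    return [m for m in capture.messages if "took" in m]

for message in find_slow_callbacks():
    print(message)
```

Debug mode adds overhead, so it is usually enabled in development or staging rather than on a loaded production server.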

Key Takeaway

Never run blocking, synchronous I/O or CPU-heavy calculations directly inside an async for streaming loop; delegate them to thread pools.
Common mistake: Assuming that because a function is async, everything inside it is automatically non-blocking. Standard synchronous I/O calls still block the single-threaded asyncio event loop.