Part 4: Streaming: Getting Tokens as the Model Generates Them

Part 4 of the Building with LLMs series. Streaming changes when you get the text, not what. It feels faster, and it quietly breaks structured output.

Krishna C

May 5, 2026

•

6 min read

TL;DR

Streaming sends you the model's tokens as they're generated instead of waiting for the whole response. It changes when you get the text, not what. It makes apps feel much faster, and it quietly breaks structured output and tool calls if you parse too early.

This is Part 4 of the Building with LLMs series. Part 1 showed an LLM is a stateless function: text in, text out. Part 2 made that text parseable. Part 3 made it act. All three assumed you get the full response in one shot. Streaming changes that, and it's worth understanding before it bites you.

Without Streaming You Wait for the Whole Thing

The default is request and response. You send the chain, the model generates the entire answer, and you get it back as one block. If the answer takes eight seconds to generate, the user stares at nothing for eight seconds, then the whole thing appears at once.

For a short classification that's fine. For a long answer in a chat UI, eight seconds of blank screen feels broken, even though nothing is wrong.

Streaming Sends Tokens as They Come

A model doesn't think up the whole answer and hand it over. It generates one token at a time, each token based on everything before it. Without streaming, the API holds all those tokens and sends them only when the last one is done. With streaming, the API forwards each token the moment it's produced.

Same tokens, same final answer, same total generation time. The only difference is you start receiving it almost immediately instead of at the end.

Why It Feels Faster

Two numbers matter. Total time is how long until the full answer is done. Time to first token is how long until the first piece shows up. Streaming barely changes total time. It massively cuts time to first token.

That gap is the whole point. A user reading words appear one by one perceives the system as fast and alive, even if the full answer still takes the same eight seconds. Perceived latency is a real product metric, and streaming is the cheapest win you get.

It's Still the Same Stateless Call

Streaming does not change anything from Part 1. The model still produced one response from one input. You just received that response in pieces instead of all at once. Once it finishes, you have the exact same final text you'd have gotten without streaming. You append it to the chain the same way and resend the chain on the next turn the same way. Streaming is a delivery detail, not a new kind of call.

What Streaming Breaks

This is the part that catches people. Streaming hands you a partial response, and a partial response is not valid data.

Think about Part 2. You asked for JSON constrained to a schema. Mid-stream you might have:

1{ "name": "Priya Nair", "ema

That's not JSON. It's the first few tokens of JSON. If your code tries to parse each chunk as it arrives, it fails on every chunk until the last one. The fix is simple but you have to know to do it: accumulate the chunks, and only parse and validate once the stream is complete.

Tool calls from Part 3 have the same problem. The tool name and arguments stream in as fragments. You cannot run the function on half an argument. You wait until the tool-call request is fully assembled, then dispatch it. Same rule: stream for display, but act only on the complete result.

What This Looks Like in Code

For plain text you can show each chunk as it lands:

1buffer = ""
2for chunk in llm.stream(messages):
3    buffer += chunk.text
4    render_to_screen(chunk.text)   # live, token by token
5final_text = buffer                # same as a non-streamed response

For structured output or tool calls, you still stream, but you do not act until the end:

1buffer = ""
2for chunk in llm.stream(messages, response_schema = schema):
3    buffer += chunk.text
4    # optionally show a typing indicator, but do not parse yet
5
6data = json.parse(buffer)          # only now, when complete
7if not looks_valid(data):
8    ...                            # retry, same as Part 2
9use(data)

The streaming loop is the same. The only discipline is to parse at the end, not in the middle.

Getting It to the Browser

That server loop is half the picture. The chunks still have to reach a browser, and the regular HTTP request and response model wasn't built for "send me bytes as you have them." You pick a transport. Three are common, and they aren't interchangeable.

Server-Sent Events

SSE is the simplest match for token streaming. The server holds an HTTP response open and writes events as plain text. The browser's built-in EventSource reads them as they arrive.

1// browser
2const es = new EventSource("/api/chat?id=42");
3es.onmessage = (e) => appendToUI(e.data);
4es.onerror = () => { /* EventSource auto-reconnects */ };

It's the default for chat UIs because the shape fits. The server pushes tokens, the client renders them, nothing else happens. Browsers handle reconnects for you. Most proxies and CDNs are fine with it because it's just HTTP.

The catches are small. Communication is one direction, server to client, which is what token streaming needs anyway. Plain EventSource only does GET, so if your API uses POST with a body, you either switch transports or use a fetch-based SSE polyfill.

Fetch with a ReadableStream

Modern browsers expose the response body as a ReadableStream. You read chunks the same way the server emits them. No new protocol, no extra library.

1const res = await fetch("/api/chat", {
2  method: "POST",
3  body: JSON.stringify(payload),
4});
5const reader = res.body.getReader();
6const decoder = new TextDecoder();
7
8while (true) {
9  const { value, done } = await reader.read();
10  if (done) break;
11  appendToUI(decoder.decode(value));
12}

This is what a lot of chat apps actually use, because they need POST with a body, custom auth headers, or their own framing (newline-delimited JSON is common). The trade is that you write the parsing yourself and handle dropped connections on your own. There's no automatic reconnect.

WebSockets

A WebSocket is a two-way channel that stays open. Either side can send at any time.

It's overkill for pure token streaming because the client has nothing to say while the model is generating. WebSockets earn their weight when there's real two-way traffic. The user can cancel mid-generation and you want that signal to land fast. You're multiplexing tool progress, status updates, and tokens over one connection. The app has collaborative or real-time features layered on top of chat.

The cost is operational. WebSockets are a separate protocol from your REST stack, some proxies treat them poorly, and you manage connection state, retries, and heartbeats yourself.

Picking One

The rough order I reach for: SSE if the API is a simple GET and tokens go one direction. Fetch with a ReadableStream when the API uses POST or you want custom framing. WebSockets when there's two-way traffic worth the extra cost.

None of these change what the model does. SSE, fetch streams, and WebSockets are three plumbing choices for getting the same chunks from your server to a browser.

When to Use It and When to Skip It

Stream when:

A human is watching a long answer appear, like a chat or writing assistant.
You want the app to feel responsive even when generation is slow.

Skip it when:

Your code needs the full structured object before it can do anything, and there's no human waiting on the words.
It's a background or batch job. Nobody sees the typing effect, so it adds complexity for no gain.
The turn is purely a tool call. The user doesn't read tool arguments, so there's nothing useful to show mid-stream.

Thoughts? Hit me up at [email protected]

Building with LLMs — full series

Part 4 of 22

Part 1:How an LLM Actually Works: It Has No Memory
Part 2:Structured Output: Getting Reliable Data Out of an LLM
Part 3:Tool Calls: How an LLM Takes Action
Part 4:Streaming: Getting Tokens as the Model Generates Them(you are here)
Part 5:Memory and Context Engineering (coming soon)
Part 6:Evaluating LLM Output (coming soon)
Part 7:Telemetry and Observability (coming soon)
Part 8:Human in the Loop (coming soon)
Part 9:RAG (coming soon)
Part 10:Agents (coming soon)
Part 11:Frameworks (coming soon)
Part 12:Model Context Protocol (MCP) (coming soon)
Part 13:Skills (coming soon)
Part 14:Agent-to-Agent Communication (A2A) (coming soon)
Part 15:Tokenomics and Cost (coming soon)
Part 16:Browser Use (coming soon)
Part 17:Computer Use (coming soon)
Part 18:Mobile Use (coming soon)
Part 19:LLM-Powered Frontends (coming soon)
Part 20:AI Gateways (coming soon)
Part 21:Design Patterns for Agentic AI (coming soon)
Part 22:Scaling AI Agents (coming soon)

← Previous

Everyone's Hardening Companies Against AI. Nobody's Securing Your Home.

The industry is racing to defend organizations from AI that finds zero-days at scale. Almost nobody is talking about your privacy, your family, and your data. Here's the privacy-first, defense-in-depth setup I actually run.

Part 3: Tool Calls: How an LLM Takes Action

Part 3 of the Building with LLMs series. A tool call is just the model emitting structured text that asks your code to run a function. The model never acts itself.