May 31, 2026

Stop Putting LLMs in the Hot Path

Ariel Cohen

Senior Software Engineer

You're building an agent that extracts structured review data from Yelp. Maybe it's tracking sentiment for a set of business listings, feeding a competitive intelligence tool, or powering a reputation monitoring dashboard. The task is straightforward: hit each page, pull the reviews, store them.

So you wire up an LLM. The agent fetches the page, the response goes into context, the model reads it, and it hands back structured data. It works fine the first few times.

Then you run it at scale.

The real numbers from 250 Yelp extraction runs:

Approach	Tokens per run	Time per run	Total cost (250 runs)
LLM extraction	~120,000	~18 seconds	$92.50 (depending on model)

That’s one batch of 250 pages. Run this daily and you're looking at ~$2,775/month before you've added a single new data source.
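The monthly figure follows directly from the batch cost. A quick sketch of the arithmetic, assuming one batch per day and a 30-day month (the 30-day month is an assumption, not something the numbers above state):

```python
# Figures taken from the table above.
RUNS_PER_BATCH = 250
BATCH_COST_USD = 92.50

# Assumption: one batch per day over a 30-day month.
BATCHES_PER_MONTH = 30

cost_per_run = BATCH_COST_USD / RUNS_PER_BATCH
monthly_cost = BATCH_COST_USD * BATCHES_PER_MONTH

print(f"per run:   ${cost_per_run:.2f}")
print(f"per month: ${monthly_cost:,.2f}")
```

That works out to about $0.37 per page, every time, for work the model has already done hundreds of times.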

And it's not just money. Because the model reads a different slice of the page each time, outputs drift. The same listing returns slightly different review structures across runs. When Yelp updates its page layout, you find out when the data breaks. Debugging means retracing a model's reasoning with no inspectable artifact.

The Insight: An LLM Is Only Doing Hard Work the First Time

Yelp review pages are structurally consistent. Every listing page renders reviews the same way. The model doesn't need to rediscover that on run 47 or run 231. It only needed to figure it out once.

The fix is to pull the model out of the hot path entirely. You use it once, deliberately, to understand the site's structure. It produces a plain-language rulebook, a markdown file describing exactly how to get the data.

From that point forward, every extraction run is deterministic code following the rulebook.

The model never sees another raw response.

How It Actually Works

This is the part that makes the difference between a demo and a production system.

A naive agent calls the API, pulls the response into context, and reasons over it. Responses for data-rich sites can be megabytes of nested JSON. A single page can consume an entire context window. Even when responses fit, you're paying for input tokens on every byte, including the ones the model only needs to glance at.
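To get a feel for why response size matters, here is a back-of-the-envelope estimate. It assumes roughly 4 bytes per token, which is a common rule of thumb rather than a measured value; real ratios vary by tokenizer and by how much of the payload is JSON punctuation. The payload sizes are illustrative.

```python
def estimate_tokens(num_bytes: int, bytes_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-bytes-per-token ratio is a heuristic assumption."""
    return round(num_bytes / bytes_per_token)


# Illustrative payload sizes, not measurements.
for label, size in [("50 KB", 50_000), ("5 MB", 5_000_000)]:
    print(f"{label}: ~{estimate_tokens(size):,} tokens")
```

Under that assumption, a multi-megabyte response lands well past a million tokens before the model has produced anything.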

So we inverted it.

The raw response never enters the model's context, either at investigation time or at runtime.

Agent → calls Nimble Web Extract → response goes to a temp file

↓

Agent → generates a Python parser from the markdown rulebook

↓

Parser → reads the temp file, extracts data → writes structured output

↓

Agent → reads only the clean output (small, structured)

The agent is the architect. Python is the parser. The model's context stays small regardless of response size. A 50 KB feed and a 5 MB feed cost the same in tokens.
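The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the actual skill: `save_response` stands in for the fetch step (no real Nimble API call is made), and the field names in the sample payload (`user`, `comment`, `tracking`) are hypothetical, not Yelp's real response schema.

```python
import json
import tempfile
from pathlib import Path


def save_response(raw: str, directory: Path) -> Path:
    """Stand-in for the fetch step: the raw response goes to disk, not into context."""
    path = directory / "response.json"
    path.write_text(raw, encoding="utf-8")
    return path


def parse_reviews(response_path: Path, output_path: Path) -> int:
    """Deterministic parser: read the raw file and keep only the needed fields."""
    payload = json.loads(response_path.read_text(encoding="utf-8"))
    reviews = [
        {
            "author": item.get("user", {}).get("name"),
            "rating": item.get("rating"),
            "text": item.get("comment", {}).get("text"),
        }
        for item in payload.get("reviews", [])
    ]
    output_path.write_text(json.dumps(reviews, indent=2), encoding="utf-8")
    return len(reviews)


# Hypothetical raw response; a real one would be far larger and noisier.
SAMPLE = json.dumps({
    "reviews": [
        {"user": {"name": "Dana"}, "rating": 5,
         "comment": {"text": "Great tacos."}, "tracking": "x" * 1000},
        {"user": {"name": "Lee"}, "rating": 3,
         "comment": {"text": "Slow service."}, "tracking": "y" * 1000},
    ],
    "ads": ["..."],
})

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    raw_path = save_response(SAMPLE, tmp_dir)
    clean_path = tmp_dir / "reviews.json"
    count = parse_reviews(raw_path, clean_path)
    # Only this small, structured output would ever be shown to the agent.
    clean = clean_path.read_text(encoding="utf-8")

print(f"{count} reviews, raw {len(SAMPLE)} chars, clean {len(clean)} chars")
```

The point of the design is that the clean output stays small no matter how much noise the raw response carries, so the agent's context cost is decoupled from the payload size.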

The same 250 Yelp extraction runs, with the new approach:

Stage	Tokens	Time	Cost
Investigation (one-time)	~180,000	~3 minutes	$4.50 (can be optimized)
Runtime extraction (per run)	~3,000	4 seconds	$0.005
250 runs total	~750,000	~8 minutes	$5.25

That's a 94% cost reduction and a roughly 9x speedup across the full batch. The investigation is a fixed, one-time cost. After that, you're running fast, deterministic code.
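Both headline figures can be checked against the batch totals quoted in this post. The sketch below uses only those totals and assumes the LLM runs execute one after another, so the baseline batch time is simply 250 runs at 18 seconds each.

```python
RUNS = 250

# Baseline: LLM extraction on every run.
LLM_TOTAL_COST = 92.50        # USD for the batch
LLM_SECONDS_PER_RUN = 18

# New approach: totals from the "250 runs total" row.
NEW_TOTAL_COST = 5.25         # USD for the batch
NEW_TOTAL_MINUTES = 8

cost_reduction = 1 - NEW_TOTAL_COST / LLM_TOTAL_COST
llm_total_minutes = RUNS * LLM_SECONDS_PER_RUN / 60  # assumes sequential runs
speedup = llm_total_minutes / NEW_TOTAL_MINUTES

print(f"cost reduction: {cost_reduction:.0%}")
print(f"baseline batch time: {llm_total_minutes:.0f} minutes")
print(f"speedup: {speedup:.1f}x")
```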

The Runtime Doesn't Even Need a Smart Model

Because the rulebook is tightly defined, the runtime model's only job is mechanical: read the markdown, emit a boilerplate parser, hand it to Python.

That's a job for a small, cheap, fast model, one that costs a tenth or less of what a frontier model does.

Investigation runs once on the frontier. Runtime runs forever on the budget tier.

The economics stack cleanly: the hard, novel work happens once; everything after that is cheap by design.
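One way to see how quickly the one-time work pays for itself is a break-even estimate. This sketch uses the per-run costs quoted in this post ($92.50 across 250 LLM runs, $0.005 per runtime extraction, $4.50 per investigation) and treats them as fixed, which real pricing may not be.

```python
import math

INVESTIGATION_COST = 4.50          # USD, one-time
LLM_COST_PER_RUN = 92.50 / 250     # USD, derived from the batch total
RUNTIME_COST_PER_RUN = 0.005       # USD

saving_per_run = LLM_COST_PER_RUN - RUNTIME_COST_PER_RUN

# Smallest whole number of runs whose savings cover the investigation.
break_even_runs = math.ceil(INVESTIGATION_COST / saving_per_run)

print(f"saving per run: ${saving_per_run:.3f}")
print(f"break-even after {break_even_runs} runs")
```

On these numbers the investigation has paid for itself within the first couple of dozen pages of a single batch.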

Nothing Rots

The markdown rulebook is the only durable artifact. No generated parsers are checked in. No per-site code accumulates. The Python parser is regenerated fresh from the rulebook every run, so there’s nothing to maintain and nothing to drift away from the live site.

When something breaks, you read the markdown and the parser. You don't squint at a model trace, tweak a prompt, or re-run inference to see what changed.

When a site updates its layout, you re-investigate once (~3 minutes, ~$4.50), get a new rulebook, and extraction is cheap again.
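Knowing when to re-investigate can itself be plain code. The check below is a hypothetical sketch: the required fields and the 20% threshold are illustrative choices, not part of the skills described here. It flags a run when too many extracted records come back with missing fields, which is a reasonable signal that the layout has moved.

```python
# Hypothetical output schema for a review record.
REQUIRED_FIELDS = ("author", "rating", "text")


def needs_reinvestigation(records: list[dict], max_bad_ratio: float = 0.2) -> bool:
    """Return True when too many records are missing required fields."""
    if not records:
        # An empty result is itself a sign the layout may have changed.
        return True
    bad = sum(
        1
        for record in records
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS)
    )
    return bad / len(records) > max_bad_ratio


healthy = [{"author": "Dana", "rating": 5, "text": "Great tacos."}]
broken = [{"author": None, "rating": None, "text": ""}]

print(needs_reinvestigation(healthy))
print(needs_reinvestigation(broken))
```

A check like this turns "you find out when the data breaks" into an explicit trigger for the next investigation.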

Why This Matters Beyond Categories

Category extraction is an obvious first target. Every site has its own category structure, and they're all different. But the pattern works anywhere you have a long tail of structurally similar but individually quirky data sources.

Product detail pages. Localized review feeds. Search result layouts. Directory listings. Any place where you'd otherwise be writing the 41st hand-rolled parser.

It also flips the per-site engineering cost from a Python pull request to a markdown pull request, and increasingly, the markdown itself is drafted by an agent.

Get the Skills

Both skills are available in the Nimble agent-skills repository, built on Nimble's web data infrastructure. Install them in Claude Code or Cursor, point them at any site, and run the investigation once.

After that, extraction is just code.