Persistent Extraction: A faster, cheaper approach for LLM-based web extraction
LLM-in-the-loop extraction gets more expensive, less reliable, and harder to debug at scale. Persistent Extraction solves all three.


Countless use cases require LLMs to extract information from webpages.
But putting an LLM in the middle of a web extraction pipeline creates three concrete problems that get worse over time: 1) costs that scale linearly with traffic, 2) outputs that drift run-to-run because the model reads a different slice of the page each time, and 3) failures that are nearly impossible to debug because there's no inspectable artifact.
In this post, we’ll explain the benefits of a new approach to LLM-based extraction, Persistent Extraction, which reuses what the model has already learned about how to extract from a site to cut the LLM overhead of repetitive extraction jobs.
The way most teams do it today
You're building an agent that extracts structured review data from Yelp. Maybe it's tracking sentiment for a set of business listings, feeding a competitive intelligence tool, or powering a reputation monitoring dashboard. The task is straightforward: hit each page, pull the reviews, store them in a structured format.
So you wire up an LLM. The agent fetches the page, the response goes into context, the model reads it, and it hands back structured data. It works fine the first few times.
Then you run it at scale
That's for a single batch of 250 pages. Run this daily across a meaningful set of listings and you're looking at roughly $2,775/month before you've added a single new data source.
And it's not just money. Because the model reads a different slice of the page each time, outputs drift. The same listing returns slightly different review structures across runs. Debugging means retracing a model's reasoning with no inspectable artifact.
The better way: investigate once, extract forever with Persistent Extraction
Yelp review pages are structurally consistent. Every listing page renders reviews the same way. The model shouldn’t need to rediscover that on run 47 or run 231.
That’s the insight that Persistent Extraction leverages: an LLM is only doing genuinely hard work the first time it encounters a site. After that, it's re-reading the same structure and producing the same output.
The same 250 Yelp extraction runs, with Persistent Extraction:
That's a 94% cost reduction and 9x speed improvement. The investigation is a fixed, one-time cost. After that, you're running fast, deterministic code.
This also eliminates two of the most painful problems with LLM-in-the-loop extraction. Output drift disappears because deterministic code produces the same result every time. And when something does break, you read the rulebook and the generated parser rather than retracing a model's reasoning through a prompt trace.
What actually changes
The difference comes down to what the model is doing in each approach.
In the old approach, the model reads the raw page response every single run. A Yelp listing page with a full review set, embedded JSON, and rendered HTML can run well over 1 MB.
With Persistent Extraction, the raw response never enters the model at all. It goes straight to a file. Code parses it. The model only sees the small, clean output. Investigation still uses a capable model, but it runs once. Runtime extraction uses a lightweight model handling a mechanical task, at a fraction of the cost.
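To make the difference concrete, here is a rough back-of-the-envelope sketch of input-token cost per page. The bytes-per-token ratio, the price, and the size of the clean output are illustrative assumptions, not figures from this post or from any provider's price list.

```python
# Illustrative assumptions only: real tokenizers and prices vary.
BYTES_PER_TOKEN = 4                # rough heuristic for English text and HTML
PRICE_PER_MILLION_TOKENS = 3.00    # hypothetical input price in USD


def input_tokens(num_bytes: int) -> int:
    """Estimate how many input tokens a payload of num_bytes consumes."""
    return num_bytes // BYTES_PER_TOKEN


def input_cost(num_bytes: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Estimate the USD cost of putting num_bytes into the model's context."""
    return input_tokens(num_bytes) * price_per_million / 1_000_000


raw_page_bytes = 1_000_000     # a ~1 MB raw page response (LLM-in-the-loop)
clean_output_bytes = 4_000     # hypothetical size of the parsed, structured output

in_loop = input_cost(raw_page_bytes)
persistent = input_cost(clean_output_bytes)

print(f"LLM-in-the-loop input cost per page: ${in_loop:.4f}")
print(f"Persistent Extraction input cost per page: ${persistent:.4f}")
print(f"Context size ratio: {raw_page_bytes // clean_output_bytes}x")
```

Swap in your own page sizes and model pricing. The point is only that cost tracks the bytes that enter context, not the bytes on the page.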
The rulebook the investigation produces looks something like this:
Step 1: Fetch the Yelp listing page
Step 2: Find the embedded JSON blob in the page HTML
Step 3: Extract the reviews array from the JSON
Step 4: For each review, pull reviewer name, rating, date, and text
Step 5: Return as structured records, sorted by date descending
Human-readable. Version-controlled. Reviewable. When something breaks, you read the rulebook. You don't retrace a model's reasoning.
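As an illustration, steps 2 through 5 of a rulebook like the one above might turn into a parser along these lines. This is a minimal sketch, not the parser Nimble generates: the script tag, the "reviews" key, and the field names are hypothetical, and the real page structure is whatever the investigation step records. Step 1 (the fetch) is left out so the example runs offline.

```python
import json
import re

# Hypothetical markup: assumes the page embeds its data in a
# <script type="application/json"> tag with a top-level "reviews" array.
JSON_BLOB = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def parse_reviews(html: str) -> list[dict]:
    """Find the embedded JSON blob, pull the review fields, sort newest first."""
    match = JSON_BLOB.search(html)
    if match is None:
        raise ValueError("embedded JSON blob not found; the page layout may have changed")
    data = json.loads(match.group(1))
    records = [
        {
            "reviewer": review["reviewer"],
            "rating": review["rating"],
            "date": review["date"],  # assumed ISO 8601 (YYYY-MM-DD), so string sort is date sort
            "text": review["text"],
        }
        for review in data["reviews"]
    ]
    return sorted(records, key=lambda r: r["date"], reverse=True)


sample_html = """
<html><body>
<script type="application/json">
{"reviews": [
  {"reviewer": "Ana", "rating": 4, "date": "2024-01-05", "text": "Good coffee."},
  {"reviewer": "Ben", "rating": 5, "date": "2024-03-12", "text": "Great service."}
]}
</script>
</body></html>
"""

reviews = parse_reviews(sample_html)
for review in reviews:
    print(review["date"], review["reviewer"], review["rating"])
```

Raising on a missing blob is a deliberate choice: when the site changes, the parser fails loudly instead of returning a plausible but wrong result.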
How we built Persistent Extraction at Nimble
A naive agent calls the API, pulls the response into context, and reasons over it. Responses for data-rich sites can be megabytes of nested JSON — a single page can consume an entire context window. Even when responses fit, you're paying input tokens for every byte the model only needs to glance at.
So we inverted it. The raw response never enters the model at investigation or at runtime. We built a markdown rulebook to define Persistent Extraction:
Agent → sends a URL to Nimble’s Extract API → response goes to a temp file
↓
Agent → generates a Python parser from the markdown rulebook
↓
Parser → reads the temp file, extracts data → writes structured output
↓
Agent → reads only the clean output (small, structured)
The markdown rulebook is the only durable artifact. The only heavy lift is generating the parsing template during the first run. Once it's built, the Python parser is regenerated fresh from the rulebook every run. The only maintenance is adjusting the Python parser when the rulebook changes, which our agent handles automatically.
The agent is the architect. Python is the parser. The model's context stays small regardless of response size — a 50 KB feed and a 5 MB feed cost the same in tokens.
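The flow above can be sketched in a few lines. This is an assumption-laden illustration, not Nimble's implementation: `fetch_raw` is a hypothetical stand-in that fabricates a large payload instead of calling any real API, and `run_parser` stands in for the parser generated from the rulebook.

```python
import json
import tempfile
from pathlib import Path


def fetch_raw(url: str) -> bytes:
    """Hypothetical stand-in for the extraction API call; returns a large fake payload."""
    items = [{"id": i, "title": f"item {i}", "padding": "x" * 500} for i in range(2000)]
    return json.dumps({"url": url, "items": items}).encode("utf-8")


def run_parser(raw_path: Path, out_path: Path) -> None:
    """Stand-in for the generated parser: reads the raw file, writes clean output."""
    payload = json.loads(raw_path.read_text(encoding="utf-8"))
    # Keep only the fields and rows the task needs (here, the first five items).
    clean = [{"id": item["id"], "title": item["title"]} for item in payload["items"][:5]]
    out_path.write_text(json.dumps(clean), encoding="utf-8")


def extract(url: str) -> tuple[str, int]:
    """Run the pipeline; return the clean output and the raw response size in bytes."""
    with tempfile.TemporaryDirectory() as workdir:
        raw_path = Path(workdir) / "response.json"
        out_path = Path(workdir) / "output.json"
        raw_path.write_bytes(fetch_raw(url))   # raw response goes to disk, not into context
        run_parser(raw_path, out_path)          # code, not the model, reads the raw bytes
        return out_path.read_text(encoding="utf-8"), raw_path.stat().st_size


clean_output, raw_size = extract("https://example.com/listing")
print(f"raw response: {raw_size:,} bytes")
print(f"model sees:   {len(clean_output):,} bytes")
```

Only `clean_output` would ever be handed to the model, so its context usage depends on the size of the structured result rather than the size of the response.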
What this means at scale
Running an LLM on every extraction job is like hiring an architect to lay every brick: you're paying for expertise on work that only needed expertise once. The better model is to have the architect draw the blueprints, then let the construction crew handle the rest:
- Significantly reduced costs. At 250 runs per day, the old approach costs roughly $3,200/month. The new approach costs around $38/month for the same output.
- Reproducible outputs. Deterministic extraction means you can write tests for your parsers, catch regressions before they reach production, and roll back cleanly when a site changes.
- Inspectable failures. When something breaks, you read the rulebook and the generated parser. There is no model trace to retrace, no prompt to tweak, no nondeterminism to account for.
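Because the parser is plain code, testing it looks like testing any other function. Here is a minimal sketch using Python's built-in unittest module. The fixture and the `parse` function are hypothetical stand-ins; in practice the fixture would be a raw response saved during the investigation run, and `parse` would be the generated parser.

```python
import json
import unittest

# Hypothetical saved fixture standing in for a captured raw response.
FIXTURE = json.dumps({
    "reviews": [
        {"reviewer": "Ana", "rating": 4, "date": "2024-01-05"},
        {"reviewer": "Ben", "rating": 5, "date": "2024-03-12"},
    ]
})


def parse(raw: str) -> list[dict]:
    """Stand-in for a generated parser: returns reviews, newest first."""
    reviews = json.loads(raw)["reviews"]
    return sorted(reviews, key=lambda r: r["date"], reverse=True)


class ParserRegressionTest(unittest.TestCase):
    def test_same_input_same_output(self):
        # Deterministic code: repeated runs over one input must agree exactly.
        self.assertEqual(parse(FIXTURE), parse(FIXTURE))

    def test_expected_records(self):
        self.assertEqual([r["reviewer"] for r in parse(FIXTURE)], ["Ben", "Ana"])

    def test_layout_change_fails_loudly(self):
        # If the site drops the "reviews" key, the parser should raise, not guess.
        with self.assertRaises(KeyError):
            parse(json.dumps({"items": []}))


# Run the suite programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParserRegressionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A test like the third one is what catches a site redesign in CI rather than in production data.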
Getting started
Both extraction approaches are available as agent skills in the Nimbleway/agent-skills repository, built on Nimble's web data infrastructure. Nimble will intelligently choose the most efficient extraction method based on your request.
Install them in Claude Code or Cursor, point them at any site, and run the investigation once.
After that, extraction is just code.