May 31, 2026

Persistent Extraction: A faster, cheaper approach for LLM-based web extraction

LLM-in-the-loop extraction gets more expensive, less reliable, and harder to debug at scale. Persistent Extraction solves all three.

Ariel Cohen

Senior Software Engineer

The way most teams do it today

You're building an agent that extracts structured review data from Yelp. Maybe it's tracking sentiment for a set of business listings, feeding a competitive intelligence tool, or powering a reputation monitoring dashboard. The task is straightforward: hit each page, pull the reviews, store them in a structured format.

So you wire up an LLM. The agent fetches the page, the response goes into context, the model reads it, and it hands back structured data. It works fine the first few times.

Then you run it at scale

	Tokens per run	Time per run	Total cost
LLM extraction	~120,000	~18 seconds	$92.50 (depends on the model)

That's for a single batch of 250 pages. Run this daily across a meaningful set of listings and you're looking at roughly $2,775/month (30 daily batches at $92.50) before you've added a single new data source.

And it's not just money. Because the model reads a different slice of the page each time, outputs drift. The same listing returns slightly different review structures across runs. Debugging means retracing a model's reasoning with no inspectable artifact.

The better way: investigate once, extract forever with Persistent Extraction

Yelp review pages are structurally consistent. Every listing page renders reviews the same way. The model shouldn’t need to rediscover that on run 47 or run 231.

That’s the insight that Persistent Extraction leverages: an LLM is only doing genuinely hard work the first time it encounters a site. After that, it's re-reading the same structure and producing the same output.

The same 250 Yelp extraction runs, with Persistent Extraction:

	Tokens	Time	Cost
Investigation (one-time)	~180,000	~3 minutes	$4.50 (can be optimized)
Runtime extraction (per run)	~3,000	4 seconds	$0.005
250 runs total	~750,000	~8 minutes	$5.25

That's a 94% cost reduction ($92.50 down to $5.25) and a roughly 9x speed improvement. The investigation is a fixed, one-time cost. After that, you're running fast, deterministic code.
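The headline percentages can be reproduced from the totals in the two tables. The short sketch below does that arithmetic; it assumes the 250 baseline runs execute back to back at ~18 seconds each, and the variable names are illustrative only.

```python
RUNS = 250

# Totals taken from the tables above (USD and minutes).
baseline_cost = 92.50
persistent_cost = 5.25
persistent_minutes = 8

# Assumption: baseline runs are sequential, ~18 seconds each.
baseline_minutes = RUNS * 18 / 60

cost_reduction = 1 - persistent_cost / baseline_cost
speedup = baseline_minutes / persistent_minutes

print(f"cost reduction: {cost_reduction:.0%}")
print(f"speedup: {speedup:.1f}x")
```

The savings grow with volume, because the investigation cost is paid once while the baseline pays the full model cost on every run.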

This also eliminates two of the most painful problems with LLM-in-the-loop extraction. Output drift disappears because deterministic code produces the same result every time. And when something does break, you read the rulebook and the generated parser rather than retracing a model's reasoning through a prompt trace.

What actually changes

The difference comes down to what the model is doing in each approach.

In the old approach, the model reads the raw page response every single run. A Yelp listing page with a full review set, embedded JSON, and rendered HTML can run well over 1 MB.

With Persistent Extraction, the raw response never enters the model at all. It goes straight to a file. Code parses it. The model only sees the small, clean output. Investigation still uses a capable model, but it runs once. Runtime extraction uses a lightweight model handling a mechanical task, at a fraction of the cost.

The rulebook the investigation produces looks something like this:

Step 1: Fetch the Yelp listing page

Step 2: Find the embedded JSON blob in the page HTML

Step 3: Extract the reviews array from the JSON

Step 4: For each review, pull reviewer name, rating, date, and text

Step 5: Return as structured records, sorted by date descending

Human-readable. Version-controlled. Reviewable. When something breaks, you read the rulebook. You don't retrace a model's reasoning.
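To make steps 2 through 5 concrete, here is a minimal sketch of the kind of parser that could be generated from such a rulebook. The script-tag pattern and the JSON keys (`reviews`, `author`, `rating`, `date`, `text`) are hypothetical placeholders, not Yelp's actual page structure; in practice the investigation would discover the real ones and record them in the rulebook.

```python
import json
import re
from datetime import date

# Hypothetical marker for the embedded JSON blob (rulebook step 2).
BLOB_PATTERN = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def parse_reviews(html: str) -> list[dict]:
    """Locate the JSON blob, extract the reviews, and sort newest first."""
    match = BLOB_PATTERN.search(html)
    if match is None:
        # Fail loudly so a layout change is visible instead of silent.
        raise ValueError("embedded JSON blob not found; page layout may have changed")

    payload = json.loads(match.group(1))

    # Steps 3 and 4: pull the four fields from each review (assumed key names).
    records = [
        {
            "reviewer": item["author"],
            "rating": int(item["rating"]),
            "date": date.fromisoformat(item["date"]),
            "text": item["text"],
        }
        for item in payload["reviews"]
    ]

    # Step 5: structured records, sorted by date descending.
    return sorted(records, key=lambda r: r["date"], reverse=True)
```

Because the function is plain code with no model call, the same input file always yields the same records, and a missing blob raises an error that points straight at the step that failed.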

How we built Persistent Extraction at Nimble

A naive agent calls the API, pulls the response into context, and reasons over it. Responses for data-rich sites can be megabytes of nested JSON — a single page can consume an entire context window. Even when responses fit, you're paying input tokens for every byte the model only needs to glance at.

So we inverted it. The raw response never enters the model, either during investigation or at runtime. A markdown rulebook defines the extraction, and the flow looks like this:

Agent → sends a URL to Nimble’s Extract API → response goes to a temp file
↓
Agent → generates a Python parser from the markdown rulebook
↓
Parser → reads the temp file, extracts data → writes structured output
↓
Agent → reads only the clean output (small, structured)

The markdown rulebook is the only durable artifact. The only heavy lift is generating the parsing template during the first run. Once it's built, the Python parser is regenerated fresh from the rulebook every run. The only maintenance required is adjusting the Python parser if the rulebook changes, which our agent handles automatically.

The agent is the architect. Python is the parser. The model's context stays small regardless of response size — a 50 KB feed and a 5 MB feed cost the same in tokens.
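The flow above can be sketched in a few lines. This is an illustration under stated assumptions, not Nimble's implementation: `fetch` and `parse` are hypothetical callables standing in for the Extract API call and the generated parser, and `run_extraction` is a name invented for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_extraction(
    url: str,
    fetch: Callable[[str], bytes],   # hypothetical stand-in for the API call
    parse: Callable[[Path], list],   # hypothetical stand-in for the generated parser
) -> str:
    """Route the raw response through a temp file and return only clean output."""
    with tempfile.TemporaryDirectory() as workdir:
        raw_path = Path(workdir) / "response.raw"
        # The raw bytes go to disk and are never placed in model context.
        raw_path.write_bytes(fetch(url))
        # Deterministic code reads the file and extracts the records.
        records = parse(raw_path)
    # This small JSON string is the only text the agent reads.
    return json.dumps(records)
```

With this shape, the size of the string returned to the agent depends on the extracted records rather than on the size of the response, which is why a 50 KB feed and a 5 MB feed cost the same in tokens.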

What this means at scale

Running an LLM on every extraction job is like hiring an architect to lay every brick: you're paying for expertise on work that only needed expertise once. The better model is to have the architect draw the blueprints, then let the construction crew handle the rest:

Significantly reduced costs. At 250 runs per day, the old approach costs roughly $2,775/month. The new approach costs around $38/month for the same output.
Reproducible outputs. Deterministic extraction means you can write tests for your parsers, catch regressions before they reach production, and roll back cleanly when a site changes.
Inspectable failures. When something breaks, you read the rulebook and the generated parser. There is no model trace to retrace, no prompt to tweak, no nondeterminism to account for.
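As a rough illustration of the testing point, a deterministic parser can be pinned to a saved response with an ordinary unit test. The sketch below uses a hypothetical `extract_ratings` function as a stand-in for a generated parser, and an inline string in place of a saved response snapshot.

```python
import json
import unittest


def extract_ratings(raw: str) -> list[int]:
    """Hypothetical stand-in for a generated parser."""
    return [int(review["rating"]) for review in json.loads(raw)["reviews"]]


# In practice this would be a saved response file checked into version control.
FIXTURE = '{"reviews": [{"rating": 5}, {"rating": 3}]}'


class ParserRegressionTest(unittest.TestCase):
    def test_golden_output(self):
        # The expected output is fixed, so any drift is a real regression.
        self.assertEqual(extract_ratings(FIXTURE), [5, 3])

    def test_repeatable(self):
        # Same input, same output, on every run.
        self.assertEqual(extract_ratings(FIXTURE), extract_ratings(FIXTURE))

    def test_layout_change_fails_loudly(self):
        # A renamed key surfaces as an error, not as silently different data.
        with self.assertRaises(KeyError):
            extract_ratings('{"comments": []}')
```

Saved in a test module, this runs with `python -m unittest`. When a site changes, the failing test and the rulebook diff together show what moved.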

Getting started

Both extraction approaches are available as agent skills in the Nimbleway/agent-skills repository, built on Nimble's web data infrastructure. Nimble will intelligently choose the most efficient extraction method based on your request.

Install them in Claude Code or Cursor, point them at any site, and run the investigation once.

After that, extraction is just code.
