The Complete Guide to LLM Scraping

OpenAI updated its model pricing multiple times in 2024. There's no webhook, no changelog feed, no official notification. If you're building something that depends on accurate LLM pricing data — a cost estimator, a model comparison tool, a procurement dashboard — you probably found out about those changes the hard way.
This is the core problem with LLM data: it's important, it changes constantly, and none of the major providers make it easy to collect programmatically. This guide covers both sides of the problem — scraping LLM provider pages for pricing, model specs, and policy data, and querying LLM interfaces directly to collect structured AI-generated responses.
Table of Contents
- What LLM Scraping Actually Covers
- What Data Is Worth Collecting
- Why LLM Provider Pages Are Hard to Scrape
- Scraping LLM Provider Pages with Python
- Scaling Beyond a Single Script
- Querying LLM Interfaces Without the Browser Automation Headaches
- Best Practices
- Getting Started with Nimble
- Continue Exploring
What LLM Scraping Actually Covers
LLM scraping is two different problems that often get conflated.
The first is monitoring LLM provider pages — pricing tables, model deprecation notices, rate limit documentation, changelog entries, terms of service updates. This data lives on provider websites and changes without notice. Collecting it means scraping web pages directly, storing snapshots over time, and diffing them to detect changes.
The second is querying LLM web interfaces — sending prompts to ChatGPT, Gemini, Google AI Mode, or Grok and collecting structured responses: answers, source citations, referenced links. This is different from hitting the providers' official APIs. The web interfaces often return richer output — including citations and browsing results — that the programmatic APIs don't expose the same way.
Both use cases need different tools. Both are worth understanding.
What Data Is Worth Collecting
From LLM provider documentation pages
- Input and output pricing per model, per million tokens
- Context window sizes and output token limits
- Rate limits by tier (requests per minute, tokens per minute, tokens per day)
- Model deprecation dates and replacement recommendations
- Supported modalities (text, vision, audio, code)
- Regional availability
- Policy and terms of service updates
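Normalizing these fields into one record per model, per run, makes later diffing and querying much easier. Here's a sketch with field names of our own choosing and placeholder values rather than real prices:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelRecord:
    """One row per model, per scrape run. Field names are this guide's own
    suggestion, not a provider schema; the prices below are placeholders."""
    provider: str
    model: str
    input_price_per_mtok: float      # USD per million input tokens
    output_price_per_mtok: float     # USD per million output tokens
    context_window: int              # tokens
    max_output_tokens: int
    deprecation_date: Optional[str]  # ISO date, or None if not announced
    scraped_at: str                  # ISO timestamp of the collection run

record = ModelRecord(
    provider="openai",
    model="example-model",
    input_price_per_mtok=1.00,
    output_price_per_mtok=4.00,
    context_window=128_000,
    max_output_tokens=16_384,
    deprecation_date=None,
    scraped_at="2025-01-01T00:00:00Z",
)
print(asdict(record)["model"])  # example-model
```

Storing one such record per model per run gives you a flat table you can diff, chart, or query without re-parsing raw HTML.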
Key pages to monitor across providers:
- platform.openai.com/docs/models — OpenAI model specs and deprecation status
- openai.com/api/pricing — GPT-4o, o1, o3 pricing tables
- ai.google.dev/gemini-api/docs/models — Gemini model specs
- ai.google.dev/gemini-api/docs/pricing — Gemini pricing by tier
- x.ai/api — Grok model availability and pricing
- console.x.ai — xAI API console and rate limit info
- docs.anthropic.com/en/docs/about-claude/models — Claude model catalog
From LLM web interfaces
- Model-generated answers in plain text and HTML
- Source citations and referenced URLs
- Associated images and media links
- Real-time responses reflecting the current live model version
Why LLM Provider Pages Are Hard to Scrape
This is where most DIY approaches run into trouble.
JavaScript rendering. OpenAI, Google, and xAI all use React or Next.js for their documentation and pricing pages. The HTML returned by a plain HTTP request is mostly an app shell — the actual pricing data is fetched asynchronously after the JavaScript runs. requests and BeautifulSoup alone can't see it.
Cloudflare and bot detection. OpenAI's web properties sit behind Cloudflare. A headless Chromium browser has a well-known fingerprint — TLS handshake patterns, navigator properties, missing browser APIs — that Cloudflare's bot detection identifies reliably. Even with stealth patches, getting consistent access requires residential IP rotation and ongoing maintenance as detection rules evolve.
Page structure changes. LLM providers redesign their documentation frequently. Any extraction logic tied to specific CSS selectors or HTML structure breaks when the page layout changes. Someone has to notice and fix it.
No official data feed. None of the major LLM providers publish structured pricing or model metadata through a public API. The web page is the source of truth.
Scraping LLM Provider Pages with Python
The requests approach
The obvious starting point is a plain HTTP request:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get("https://openai.com/api/pricing", headers=headers)
print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
tables = soup.find_all("table")
print(tables)  # []

Two things will happen here. Either you get a 403 from Cloudflare, or you get a 200 with HTML that contains no pricing data — just the React app shell and a handful of <script> tags. The content you want is fetched dynamically after page load.
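Before reaching for a browser, it can help to classify what a plain fetch actually returned. The following is a rough stdlib-only heuristic, not a hardened detector; the thresholds are assumptions you'd tune per target:

```python
import re

def classify_response(status_code: int, html: str) -> str:
    """Rough triage for a plain-HTTP fetch of a JS-heavy page."""
    if status_code in (403, 429):
        return "blocked"  # Cloudflare challenge or rate limiting
    # Strip scripts, then tags, to estimate how much visible text rendered.
    body = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    visible = " ".join(re.sub(r"<[^>]+>", " ", body).split())
    script_count = html.count("<script")
    # An app shell ships many script tags but almost no rendered text.
    if len(visible) < 500 and script_count > 5:
        return "app_shell"
    return "content"

shell = "<div id='root'></div>" + "<script src='a.js'></script>" * 8
print(classify_response(200, shell))  # app_shell
```

Logging this classification on every run tells you whether a failure means you're blocked (an access problem) or the page needs rendering (a tooling problem), which are fixed very differently.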
Adding JavaScript rendering with Playwright
Playwright executes JavaScript and waits for the page to finish loading, which gets you closer:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    print(page.title())
    print(page.content()[:500])
    browser.close()

On many LLM provider pages, page.title() returns "Just a moment..." — the Cloudflare interstitial challenge. Headless Chromium is fingerprinted by its TLS signature, missing browser APIs (window.chrome, navigator.plugins), and behavioral patterns. Cloudflare catches most out-of-the-box headless setups.
Patching the fingerprint
Libraries like playwright-stealth and undetected-chromedriver patch some of the most obvious fingerprinting signals:
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    content = page.content()
    browser.close()

This gets through more reliably on pages with lighter bot detection. For OpenAI specifically, results are inconsistent — Cloudflare's detection has evolved beyond what stealth patches address, and the arms race means library updates lag behind detection updates.
Parsing the structured data
If you do get clean content, you still need to extract the data. Pricing tables on LLM provider pages are rarely simple HTML <table> elements — they're usually rendered from JavaScript state. The most reliable approach is intercepting the network requests that populate the page:
from playwright.sync_api import sync_playwright
import json
pricing_data = []

def handle_response(response):
    if "pricing" in response.url or "models" in response.url:
        try:
            data = response.json()
            pricing_data.append(data)
        except Exception:
            pass

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    browser.close()

print(json.dumps(pricing_data, indent=2))

This approach intercepts the underlying API calls the page makes to fetch pricing data, which is often cleaner than parsing rendered HTML. It also breaks whenever the provider changes their internal API routes.
Where the DIY approach breaks down
For one-off collection, a Playwright script gets you there. The problems surface when you need it to run reliably over time:
- Cloudflare detection rules update, breaking scripts that worked last week
- Page structure changes break CSS selectors and network request interception
- Running a browser at scale is resource-heavy and slow
- Residential IP rotation adds cost and complexity
- Each provider needs its own maintenance when things break
Scaling Beyond a Single Script
One-off scraping scripts and recurring production pipelines are different problems.
For production monitoring — checking OpenAI's pricing page every day and alerting when something changes — you need reliable access that holds up against bot detection, structured output that's consistent across runs so diffs are meaningful, and handling for page failures, layout changes, and provider downtime.
Teams that reach this stage either invest heavily in proxy infrastructure and browser automation maintenance, or move to managed extraction services that handle access, rendering, and scheduling. Here's what the latter looks like in practice.
Extracting a provider page
The call below fetches OpenAI's pricing page with full JavaScript rendering and stealth access, returning structured, parseable content:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.extract(
    url="https://openai.com/api/pricing",
    render=True,
    driver="vx10"
)
print(result.data.html)
# Also available: result.data.markdown, result.data.parsing

The same call works against any provider page — swap the URL for ai.google.dev/gemini-api/docs/pricing, x.ai/api, or any other documentation page you want to monitor.
Comparing snapshots for changes
Once you're pulling clean content on each run, the monitoring logic is straightforward:
import hashlib
from datetime import datetime
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
def fetch_page(url):
    result = nimble.extract(url=url, render=True, driver="vx10")
    return result.data.markdown

def snapshot_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()

pages = [
    "https://openai.com/api/pricing",
    "https://ai.google.dev/gemini-api/docs/pricing",
    "https://x.ai/api"
]

previous_hashes = {}  # load from storage in practice

for url in pages:
    content = fetch_page(url)
    current_hash = snapshot_hash(content)
    if url in previous_hashes and previous_hashes[url] != current_hash:
        print(f"[{datetime.now()}] CHANGE DETECTED: {url}")
        # trigger alert, store new snapshot, diff the content
    previous_hashes[url] = current_hash

Driver selection
Not all LLM provider pages need the same rendering approach. The Standard Driver (VX6, $0.90/1,000 URLs) handles pages with minimal bot detection. For heavily JS-rendered pages or sites sitting behind Cloudflare — OpenAI being the primary example — the Render JS + Stealth Driver (VX10, $1.45/1,000 URLs) is the reliable option. Recurring monitoring workflows can be set up through the Managed Service, which handles scheduling and delivery without requiring you to maintain the pipeline.
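At these rates, a recurring monitor stays cheap. A back-of-envelope check using the VX10 price quoted above:

```python
# Daily monitoring of 20 provider pages with VX10 at $1.45 per 1,000 URLs.
pages_per_day = 20
rate_per_1000 = 1.45
monthly_cost = pages_per_day * 30 * rate_per_1000 / 1000
print(f"${monthly_cost:.2f}/month")  # $0.87/month
```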
Querying LLM Interfaces Without the Browser Automation Headaches
Querying the LLM interfaces themselves — rather than their documentation pages — is a separate challenge. The official Python SDKs (openai, google-generativeai) let you query models programmatically, but they return different output than the web interfaces do.
The web interfaces include citation panels, source links, and web search results that aren't exposed the same way through the raw API. If you're trying to collect what users actually see in the ChatGPT or Gemini UI — including referenced sources and associated links — the API response is not the equivalent.
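To make the gap concrete, here is the difference in response shape, sketched with plain dictionaries. The API side mirrors the chat completions format; the web-interface fields are illustrative, not a fixed schema:

```python
# What the chat completions API typically gives you back: text only.
api_response = {
    "choices": [{"message": {"role": "assistant", "content": "GPT-4o costs ..."}}],
}

# What the web interface shows users, and what interface scraping captures.
# Field names here are illustrative, not a guaranteed schema.
web_response = {
    "answer": "GPT-4o costs ...",
    "sources": [{"title": "OpenAI pricing", "url": "https://openai.com/api/pricing"}],
    "links": ["https://openai.com/api/pricing"],
}

# The citation data simply has no counterpart in the plain API payload.
web_only = set(web_response) - {"answer"}
print(sorted(web_only))  # ['links', 'sources']
```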
Automating the actual web interfaces with Playwright is technically possible but fragile: session management, login flows, CSRF tokens, and rate limiting all need to be handled, and any UI change breaks the automation.
A more reliable approach is using pre-built agents that handle session management and structured data extraction for each provider's interface — returning clean, consistent output without managing the browser automation layer yourself.
For example, querying ChatGPT via Nimble's Web Search Agent:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "OpenAI GPT-4o pricing per million tokens",
        "refinements": "current 2026"
    }
)
print(result)

This returns answer, markdown, sources, and links — the full response including citations — without managing a browser session or handling authentication.
The same pattern works for Gemini (gemini_mirror_prod_data), Google AI Mode (google_ai_mirror_prod_data), and Grok (grok_mirror_prod_data). Grok additionally returns images in the response.
Best Practices
Start with network interception, not HTML parsing. If you're building your own scraper, intercept the XHR/fetch calls the page makes rather than parsing rendered HTML. It's less brittle and more likely to return structured data.
Snapshot everything, not just the changed fields. Store the full page content or API response with a timestamp on every run. Diffing full snapshots lets you catch structural changes, not just value changes.
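A minimal version of this practice: write every run to a timestamped file, then diff full snapshots with difflib. The directory layout and filenames here are assumptions:

```python
import difflib
from datetime import datetime, timezone
from pathlib import Path

SNAP_DIR = Path("snapshots")  # assumed layout: snapshots/<slug>/<timestamp>.md

def save_snapshot(slug: str, content: str) -> Path:
    """Store the full content of every run, timestamped, even if unchanged."""
    run_dir = SNAP_DIR / slug
    run_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = run_dir / f"{stamp}.md"
    path.write_text(content)
    return path

def diff_snapshots(old: str, new: str) -> str:
    """Unified diff between two full snapshots; surfaces structural changes too."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        )
    )

print(diff_snapshots("input: $2.50\n", "input: $3.00\n"))
```

Keeping full snapshots also means you can re-run improved parsing logic over historical data later, which per-field storage makes impossible.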
Separate the access problem from the parsing problem. Keeping proxy rotation and browser automation in the same script as your parsing logic makes both harder to maintain. Treat them as separate concerns.
Test against the actual pages you need to monitor. Bot detection behavior varies significantly between providers. OpenAI's pages are considerably harder to access reliably than Google's. Test your setup against each target before building a production workflow around it.
Don't rely on provider APIs as a substitute. The official SDKs return model API behavior, not web interface behavior, and they don't expose pricing, rate limit, or deprecation data. For intelligence about what providers are doing, the documentation pages are the source.
Track response structure changes alongside content. When a provider changes the format of their response — adding or removing fields, changing types — that's signal worth capturing. Schema changes often precede content changes.
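One lightweight way to track this is to flatten each response into a set of field paths with types and diff the sets between runs. The flattener below is this guide's own helper, not a library function:

```python
def schema_of(obj, prefix=""):
    """Flatten a JSON-like object into a set of dotted field paths with types."""
    fields = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            fields.add(f"{path}:{type(value).__name__}")
            fields |= schema_of(value, path)
    elif isinstance(obj, list) and obj:
        # Sample the first element; enough for change detection.
        fields |= schema_of(obj[0], prefix + "[]")
    return fields

old = {"model": "gpt-4o", "input_price": "2.50"}
new = {"model": "gpt-4o", "input_price": 2.50, "cached_price": 1.25}

added = schema_of(new) - schema_of(old)    # new field, plus a type change
removed = schema_of(old) - schema_of(new)  # the old string-typed price
print(sorted(added))
print(sorted(removed))
```

A type change (string price becoming a number, say) shows up as one path removed and one added, which is exactly the kind of early signal worth alerting on.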
Use pre-built agents for LLM interface queries. If you're collecting responses from ChatGPT, Gemini, Google AI Mode, or Grok rather than scraping their documentation pages, pre-built agents handle the session management and structured extraction for you. The output — including source citations and links — is consistent across runs without writing browser automation code for each provider's interface.
Getting Started with Nimble
Create an account and get your API key from the dashboard. Install the Python SDK:
pip install nimble_python

Your first Extract request:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.extract(
    url="https://openai.com/api/pricing",
    render=True,
    driver="vx10"
)
print(result.data.markdown)

Your first agent query:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={"keyword": "GPT-4o pricing", "refinements": "per million tokens 2026"}
)
print(result)

Web Search Agents are $1 per 1,000 pages. Extract starts at $0.90 per 1,000 URLs. The free trial includes 5,000 pages.
Sign up at https://app.nimbleway.com/signup
Full API reference at https://docs.nimbleway.com/
Continue Exploring
These guides go deeper on specific providers and use cases covered in this post.
- How to Extract ChatGPT Responses as Structured Data with Python — Collect answers, source citations, and links from ChatGPT's web interface programmatically.
- Scraping Google AI Mode for LLM Overviews and Sources — Pull Google's AI-generated overviews and their cited sources using Python.
- Scraping Grok: Real-Time Answers, Images, and Web Search Results — Extract structured responses from Grok, including answer HTML and image data.
- Scraping Gemini's Web Search Answers with Python — Collect Gemini's grounded answers with full source metadata and position data.
- How to Track OpenAI, Gemini, and Grok Pricing Automatically — Build a monitor that detects pricing changes across providers before they affect your stack.
- How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.