The Complete Guide to LLM Scraping

OpenAI updated its model pricing multiple times in 2024. There's no webhook, no changelog feed, no official notification. If you're building something that depends on accurate LLM pricing data — a cost estimator, a model comparison tool, a procurement dashboard — you probably found out about those changes the hard way.
This is the core problem with LLM data: it's important, it changes constantly, and none of the major providers make it easy to collect programmatically. This guide covers both sides of the problem — scraping LLM provider pages for pricing, model specs, and policy data, and querying LLM interfaces directly to collect structured AI-generated responses.
Table of Contents
- What LLM Scraping Actually Covers
- What Data Is Worth Collecting
- Why LLM Provider Pages Are Hard to Scrape
- Scraping LLM Provider Pages with Python
- Scaling Beyond a Single Script
- Querying LLM Interfaces Without the Browser Automation Headaches
- Best Practices
- Getting Started with Nimble
- Continue Exploring
What LLM Scraping Actually Covers
LLM scraping is two different problems that often get conflated.
The first is monitoring LLM provider pages — pricing tables, model deprecation notices, rate limit documentation, changelog entries, terms of service updates. This data lives on provider websites and changes without notice. Collecting it means scraping web pages directly, storing snapshots over time, and diffing them to detect changes.
The second is querying LLM web interfaces — sending prompts to ChatGPT, Gemini, Google AI Mode, or Grok and collecting structured responses: answers, source citations, referenced links. This is different from hitting the providers' official APIs. The web interfaces often return richer output — including citations and browsing results — that the programmatic APIs don't expose the same way.
Both use cases need different tools. Both are worth understanding.
What Data Is Worth Collecting
From LLM provider documentation pages
- Input and output pricing per model, per million tokens
- Context window sizes and output token limits
- Rate limits by tier (requests per minute, tokens per minute, tokens per day)
- Model deprecation dates and replacement recommendations
- Supported modalities (text, vision, audio, code)
- Regional availability
- Policy and terms of service updates
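Normalizing these fields into one record per model, per run, makes later diffing and querying much easier. Here's a sketch with field names of our own choosing and placeholder values rather than real prices:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ModelRecord:
    """One row per model, per scrape run. Field names are this guide's own
    suggestion, not a provider schema; the prices below are placeholders."""
    provider: str
    model: str
    input_price_per_mtok: float      # USD per million input tokens
    output_price_per_mtok: float     # USD per million output tokens
    context_window: int              # tokens
    max_output_tokens: int
    deprecation_date: Optional[str]  # ISO date, or None if not announced
    scraped_at: str                  # ISO timestamp of the collection run

record = ModelRecord(
    provider="openai",
    model="example-model",
    input_price_per_mtok=1.00,
    output_price_per_mtok=4.00,
    context_window=128_000,
    max_output_tokens=16_384,
    deprecation_date=None,
    scraped_at="2025-01-01T00:00:00Z",
)
print(asdict(record)["model"])  # example-model
```

Storing one such record per model per run gives you a flat table you can diff, chart, or query without re-parsing raw HTML.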
Key pages to monitor across providers:
- platform.openai.com/docs/models — OpenAI model specs and deprecation status
- openai.com/api/pricing — GPT-4o, o1, o3 pricing tables
- ai.google.dev/gemini-api/docs/models — Gemini model specs
- ai.google.dev/gemini-api/docs/pricing — Gemini pricing by tier
- x.ai/api — Grok model availability and pricing
- console.x.ai — xAI API console and rate limit info
- docs.anthropic.com/en/docs/about-claude/models — Claude model catalog
From LLM web interfaces
- Model-generated answers in plain text and HTML
- Source citations and referenced URLs
- Associated images and media links
- Real-time responses reflecting the current live model version
Why LLM Provider Pages Are Hard to Scrape
This is where most DIY approaches run into trouble.
JavaScript rendering. OpenAI, Google, and xAI all use React or Next.js for their documentation and pricing pages. The HTML returned by a plain HTTP request is mostly an app shell — the actual pricing data is fetched asynchronously after the JavaScript runs. requests and BeautifulSoup alone can't see it.
Cloudflare and bot detection. OpenAI's web properties sit behind Cloudflare. A headless Chromium browser has a well-known fingerprint — TLS handshake patterns, navigator properties, missing browser APIs — that Cloudflare's bot detection identifies reliably. Even with stealth patches, getting consistent access requires residential IP rotation and ongoing maintenance as detection rules evolve.
Page structure changes. LLM providers redesign their documentation frequently. Any extraction logic tied to specific CSS selectors or HTML structure breaks when the page layout changes. Someone has to notice and fix it.
No official data feed. None of the major LLM providers publish structured pricing or model metadata through a public API. The web page is the source of truth.
Scraping LLM Provider Pages with Python
The requests approach
The obvious starting point is a plain HTTP request:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}

response = requests.get("https://openai.com/api/pricing", headers=headers)
print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
tables = soup.find_all("table")
print(tables)  # []

Two things will happen here. Either you get a 403 from Cloudflare, or you get a 200 with HTML that contains no pricing data — just the React app shell and a handful of <script> tags. The content you want is fetched dynamically after page load.
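Before reaching for a browser, it can help to classify what a plain fetch actually returned. The following is a rough stdlib-only heuristic, not a hardened detector; the thresholds are assumptions you'd tune per target:

```python
import re

def classify_response(status_code: int, html: str) -> str:
    """Rough triage for a plain-HTTP fetch of a JS-heavy page."""
    if status_code in (403, 429):
        return "blocked"  # Cloudflare challenge or rate limiting
    # Strip scripts, then tags, to estimate how much visible text rendered.
    body = re.sub(r"<script.*?</script>", "", html, flags=re.S)
    visible = " ".join(re.sub(r"<[^>]+>", " ", body).split())
    script_count = html.count("<script")
    # An app shell ships many script tags but almost no rendered text.
    if len(visible) < 500 and script_count > 5:
        return "app_shell"
    return "content"

shell = "<div id='root'></div>" + "<script src='a.js'></script>" * 8
print(classify_response(200, shell))  # app_shell
```

Logging this classification on every run tells you whether a failure means you're blocked (an access problem) or the page needs rendering (a tooling problem), which are fixed very differently.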
Adding JavaScript rendering with Playwright
Playwright executes JavaScript and waits for the page to finish loading, which gets you closer:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    print(page.title())
    print(page.content()[:500])
    browser.close()

On many LLM provider pages, page.title() returns "Just a moment..." — the Cloudflare interstitial challenge. Headless Chromium is fingerprinted by its TLS signature, missing browser APIs (window.chrome, navigator.plugins), and behavioral patterns. Cloudflare catches most out-of-the-box headless setups.
Patching the fingerprint
Libraries like playwright-stealth and undetected-chromedriver patch some of the most obvious fingerprinting signals:
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    content = page.content()
    browser.close()

This gets through more reliably on pages with lighter bot detection. For OpenAI specifically, results are inconsistent — Cloudflare's detection has evolved beyond what stealth patches address, and the arms race means library updates lag behind detection updates.
Parsing the structured data
If you do get clean content, you still need to extract the data. Pricing tables on LLM provider pages are rarely simple HTML <table> elements — they're usually rendered from JavaScript state. The most reliable approach is intercepting the network requests that populate the page:
from playwright.sync_api import sync_playwright
import json
pricing_data = []

def handle_response(response):
    if "pricing" in response.url or "models" in response.url:
        try:
            data = response.json()
            pricing_data.append(data)
        except Exception:
            pass

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto("https://openai.com/api/pricing")
    page.wait_for_load_state("networkidle")
    browser.close()

print(json.dumps(pricing_data, indent=2))

This approach intercepts the underlying API calls the page makes to fetch pricing data, which is often cleaner than parsing rendered HTML. It also breaks whenever the provider changes their internal API routes.
Where the DIY approach breaks down
For one-off collection, a Playwright script gets you there. The problems surface when you need it to run reliably over time:
- Cloudflare detection rules update, breaking scripts that worked last week
- Page structure changes break CSS selectors and network request interception
- Running a browser at scale is resource-heavy and slow
- Residential IP rotation adds cost and complexity
- Each provider needs its own maintenance when things break
Scaling Beyond a Single Script
One-off scraping scripts and recurring production pipelines are different problems.
For production monitoring — checking OpenAI's pricing page every day and alerting when something changes — you need reliable access that holds up against bot detection, structured output that's consistent across runs so diffs are meaningful, and handling for page failures, layout changes, and provider downtime.
Teams that reach this stage either invest heavily in proxy infrastructure and browser automation maintenance, or move to managed extraction services that handle access, rendering, and scheduling. Here's what the latter looks like in practice.
Extracting a provider page
The call below fetches OpenAI's pricing page with full JavaScript rendering and stealth access, returning structured, parseable content:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.extract(
    url="https://openai.com/api/pricing",
    render=True,
    driver="vx10"
)
print(result.data.html)
# Also available: result.data.markdown, result.data.parsing

The same call works against any provider page — swap the URL for ai.google.dev/gemini-api/docs/pricing, x.ai/api, or any other documentation page you want to monitor.
Comparing snapshots for changes
Once you're pulling clean content on each run, the monitoring logic is straightforward:
import hashlib
from datetime import datetime
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
def fetch_page(url):
    result = nimble.extract(url=url, render=True, driver="vx10")
    return result.data.markdown

def snapshot_hash(content):
    return hashlib.sha256(content.encode()).hexdigest()

pages = [
    "https://openai.com/api/pricing",
    "https://ai.google.dev/gemini-api/docs/pricing",
    "https://x.ai/api"
]

previous_hashes = {}  # load from storage in practice

for url in pages:
    content = fetch_page(url)
    current_hash = snapshot_hash(content)
    if url in previous_hashes and previous_hashes[url] != current_hash:
        print(f"[{datetime.now()}] CHANGE DETECTED: {url}")
        # trigger alert, store new snapshot, diff the content
    previous_hashes[url] = current_hash

Driver selection
Not all LLM provider pages need the same rendering approach. The Standard Driver (VX6, $0.90/1,000 URLs) handles pages with minimal bot detection. For heavily JS-rendered pages or sites sitting behind Cloudflare — OpenAI being the primary example — the Render JS + Stealth Driver (VX10, $1.45/1,000 URLs) is the reliable option. Recurring monitoring workflows can be set up through the Managed Service, which handles scheduling and delivery without requiring you to maintain the pipeline.
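At these rates, a recurring monitor stays cheap. A back-of-envelope check using the VX10 price quoted above:

```python
# Daily monitoring of 20 provider pages with VX10 at $1.45 per 1,000 URLs.
pages_per_day = 20
rate_per_1000 = 1.45
monthly_cost = pages_per_day * 30 * rate_per_1000 / 1000
print(f"${monthly_cost:.2f}/month")  # $0.87/month
```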
Querying LLM Interfaces Without the Browser Automation Headaches
Querying the LLM interfaces themselves — rather than their documentation pages — is a separate challenge. The official Python SDKs (openai, google-generativeai) let you query models programmatically, but they return different output than the web interfaces do.
The web interfaces include citation panels, source links, and web search results that aren't exposed the same way through the raw API. If you're trying to collect what users actually see in the ChatGPT or Gemini UI — including referenced sources and associated links — the API response is not the equivalent.
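To make the gap concrete, here is the difference in response shape, sketched with plain dictionaries. The API side mirrors the chat completions format; the web-interface fields are illustrative, not a fixed schema:

```python
# What the chat completions API typically gives you back: text only.
api_response = {
    "choices": [{"message": {"role": "assistant", "content": "GPT-4o costs ..."}}],
}

# What the web interface shows users, and what interface scraping captures.
# Field names here are illustrative, not a guaranteed schema.
web_response = {
    "answer": "GPT-4o costs ...",
    "sources": [{"title": "OpenAI pricing", "url": "https://openai.com/api/pricing"}],
    "links": ["https://openai.com/api/pricing"],
}

# The citation data simply has no counterpart in the plain API payload.
web_only = set(web_response) - {"answer"}
print(sorted(web_only))  # ['links', 'sources']
```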
Automating the actual web interfaces with Playwright is technically possible but fragile: session management, login flows, CSRF tokens, and rate limiting all need to be handled, and any UI change breaks the automation.
A more reliable approach is using pre-built agents that handle session management and structured data extraction for each provider's interface — returning clean, consistent output without managing the browser automation layer yourself.
For example, querying ChatGPT via Nimble's Web Search Agent:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "OpenAI GPT-4o pricing per million tokens",
        "refinements": "current 2026"
    }
)
print(result)

This returns answer, markdown, sources, and links — the full response including citations — without managing a browser session or handling authentication.
The same pattern works for Gemini (gemini_mirror_prod_data), Google AI Mode (google_ai_mirror_prod_data), and Grok (grok_mirror_prod_data). Grok additionally returns images in the response.
Best Practices
Start with network interception, not HTML parsing. If you're building your own scraper, intercept the XHR/fetch calls the page makes rather than parsing rendered HTML. It's less brittle and more likely to return structured data.
Snapshot everything, not just the changed fields. Store the full page content or API response with a timestamp on every run. Diffing full snapshots lets you catch structural changes, not just value changes.
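A minimal version of this practice: write every run to a timestamped file, then diff full snapshots with difflib. The directory layout and filenames here are assumptions:

```python
import difflib
from datetime import datetime, timezone
from pathlib import Path

SNAP_DIR = Path("snapshots")  # assumed layout: snapshots/<slug>/<timestamp>.md

def save_snapshot(slug: str, content: str) -> Path:
    """Store the full content of every run, timestamped, even if unchanged."""
    run_dir = SNAP_DIR / slug
    run_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = run_dir / f"{stamp}.md"
    path.write_text(content)
    return path

def diff_snapshots(old: str, new: str) -> str:
    """Unified diff between two full snapshots; surfaces structural changes too."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        )
    )

print(diff_snapshots("input: $2.50\n", "input: $3.00\n"))
```

Keeping full snapshots also means you can re-run improved parsing logic over historical data later, which per-field storage makes impossible.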
Separate the access problem from the parsing problem. Keeping proxy rotation and browser automation in the same script as your parsing logic makes both harder to maintain. Treat them as separate concerns.
Test against the actual pages you need to monitor. Bot detection behavior varies significantly between providers. OpenAI's pages are considerably harder to access reliably than Google's. Test your setup against each target before building a production workflow around it.
Don't rely on provider APIs as a substitute. The official SDKs return model API behavior, not web interface behavior, and they don't expose pricing, rate limit, or deprecation data. For intelligence about what providers are doing, the documentation pages are the source.
Track response structure changes alongside content. When a provider changes the format of their response — adding or removing fields, changing types — that's signal worth capturing. Schema changes often precede content changes.
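One lightweight way to track this is to flatten each response into a set of field paths with types and diff the sets between runs. The flattener below is this guide's own helper, not a library function:

```python
def schema_of(obj, prefix=""):
    """Flatten a JSON-like object into a set of dotted field paths with types."""
    fields = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            fields.add(f"{path}:{type(value).__name__}")
            fields |= schema_of(value, path)
    elif isinstance(obj, list) and obj:
        # Sample the first element; enough for change detection.
        fields |= schema_of(obj[0], prefix + "[]")
    return fields

old = {"model": "gpt-4o", "input_price": "2.50"}
new = {"model": "gpt-4o", "input_price": 2.50, "cached_price": 1.25}

added = schema_of(new) - schema_of(old)    # new field, plus a type change
removed = schema_of(old) - schema_of(new)  # the old string-typed price
print(sorted(added))
print(sorted(removed))
```

A type change (string price becoming a number, say) shows up as one path removed and one added, which is exactly the kind of early signal worth alerting on.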
Use pre-built agents for LLM interface queries. If you're collecting responses from ChatGPT, Gemini, Google AI Mode, or Grok rather than scraping their documentation pages, pre-built agents handle the session management and structured extraction for you. The output — including source citations and links — is consistent across runs without writing browser automation code for each provider's interface.
Getting Started with Nimble
Create an account and get your API key from the dashboard. Install the Python SDK:
pip install nimble_python

Your first Extract request:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.extract(
    url="https://openai.com/api/pricing",
    render=True,
    driver="vx10"
)
print(result.data.markdown)

Your first agent query:
from nimble_python import Nimble
nimble = Nimble(api_key="YOUR_API_KEY")
result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={"keyword": "GPT-4o pricing", "refinements": "per million tokens 2026"}
)
print(result)

Web Search Agents are $1 per 1,000 pages. Extract starts at $0.90 per 1,000 URLs. The free trial includes 5,000 pages.
Sign up at https://app.nimbleway.com/signup
Full API reference at https://docs.nimbleway.com/
Continue Exploring
These guides go deeper on specific providers and use cases covered in this post.
- How to Extract ChatGPT Responses as Structured Data with Python — Collect answers, source citations, and links from ChatGPT's web interface programmatically.
- Scraping Google AI Mode for LLM Overviews and Sources — Pull Google's AI-generated overviews and their cited sources using Python.
- Scraping Grok: Real-Time Answers, Images, and Web Search Results — Extract structured responses from Grok, including answer HTML and image data.
- Scraping Gemini's Web Search Answers with Python — Collect Gemini's grounded answers with full source metadata and position data.
- How to Track OpenAI, Gemini, and Grok Pricing Automatically — Build a monitor that detects pricing changes across providers before they affect your stack.
- How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.