Scraping Google AI Mode for LLM Overviews and Sources

Google Search now returns AI-generated overviews alongside traditional search results — concise summaries of web sources that Google's AI has pulled and cited. Unlike the standard Google Search API, these overviews are only available in the web interface and come with structured source attribution. If you need to collect Google's AI take on a topic at scale, you'll want to automate this.
Google AI Mode returns two key pieces of data: the answer (the AI-generated overview as plain text) and sources (a clean list of citations, each with title and url). You could use this to track how Google's AI summarizes your brand over time, compare its take against ChatGPT or Grok, or feed cited sources into downstream research pipelines.
What Google AI Returns
When Google renders an AI overview for a query, the response includes:
- answer — The plain text AI-generated summary. This is what you'd see in the gray box at the top of Google's search results.
- sources — A list of dictionaries, each containing:
  - title — The page title or headline
  - url — The full URL to the source
That's the schema. No metadata bloat, no extra fields. The sources list is your citation record — it's the only place citations appear in the response.
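As a Python dict, a response following that schema might look like this (the values below are invented for illustration):

```python
# Illustrative response following the answer/sources schema (values invented)
result = {
    "answer": "Gemini API pricing is tiered by model and token volume.",
    "sources": [
        {"title": "Gemini API Pricing", "url": "https://ai.google.dev/pricing"},
        {"title": "Gemini Developer Docs", "url": "https://ai.google.dev/docs"},
    ],
}

# Every citation lives in sources; nothing else in the response carries them
for source in result["sources"]:
    print(f"{source['title']}: {source['url']}")
```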
Scraping Google AI with Python
Let's walk through the DIY path first, then show you where it breaks.
Attempt 1: Plain Requests
Start simple. Hit Google's search endpoint with your query:
```python
import requests

query = "Google Gemini API pricing 2026"
url = f"https://www.google.com/search?q={query}"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

response = requests.get(url, headers=headers)
print(response.text[:500])
```

What you get back: static HTML with the standard blue-link results. No AI overview. Google renders AI overviews client-side after the page loads — the initial HTML response doesn't include them, so requests can't see what the browser renders.
Attempt 2: Add Playwright
A headless browser executes JavaScript and waits for the page to settle:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    print(page.content()[:1000])
    browser.close()
```

One of two things happens here. Either Google returns a CAPTCHA challenge — headless Chromium has a recognizable fingerprint and Google's bot detection catches it — or you get a rendered page where the AI overview simply isn't present. AI overviews don't appear on every query; Google decides when to show them based on query type, freshness, and geography.
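When Google blocks a request, it typically redirects to a CAPTCHA interstitial under the /sorry/ path. A minimal check you can run on page.url after navigation (a sketch assuming that path convention holds; the helper name is ours):

```python
def hit_captcha(current_url: str) -> bool:
    # Google's CAPTCHA interstitial lives under google.com/sorry/
    return "/sorry/" in current_url

# After page.goto(...), pass page.url here before trying to parse anything
print(hit_captcha("https://www.google.com/sorry/index?continue=https%3A%2F%2Fwww.google.com"))
```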
Attempt 3: Stealth patching and parsing
If you get past bot detection, the next problem is extraction. The AI overview is buried in deeply nested, dynamically generated HTML. There's no stable semantic element — the class names are obfuscated and change frequently:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
# AI overview container — this selector breaks whenever Google updates the page
overview = soup.find("div", {"class": "IVvmDb"})
if overview:
    print(overview.get_text())
else:
    print("No AI overview found")
```

Stealth patches help with fingerprinting. They don't fix the fundamental problems: AI overviews are inconsistent, the HTML structure changes without notice, and Google actively works against automated access to its search results at scale.
A Cleaner Approach
Instead of wrestling with bot detection, inconsistent rendering, and brittle selectors, use an agent that handles all of that for you:
```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={
        "keyword": "Google Gemini API pricing 2026"
    }
)
print(result)
```

What comes back is a clean dictionary with answer and sources. No HTML parsing. No browser wrangling. No login sessions to maintain.
Working with the Response
Extract the answer:
```python
answer = result["answer"]
print(answer)
```

Pull all cited sources:

```python
for source in result["sources"]:
    print(f"{source['title']}: {source['url']}")
```

A more complete example — store results with timestamps and extract all URLs for downstream scraping:
```python
import json
from datetime import datetime

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

# Collect multiple queries
queries = [
    "Google Gemini API pricing 2026",
    "Claude API pricing 2026",
    "OpenAI GPT-4 pricing 2026",
]

results = []
for query in queries:
    result = nimble.agent.run(
        agent="google_ai_mirror_prod_data",
        params={"keyword": query}
    )
    # Store with timestamp
    record = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "answer": result["answer"],
        "sources": result["sources"],
        "source_urls": [s["url"] for s in result["sources"]],
    }
    results.append(record)

# Save to file
with open("google_ai_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Print source URLs
for record in results:
    print(f"\n{record['query']}:")
    for url in record["source_urls"]:
        print(f"  {url}")
```

Use Cases
Track brand perception over time. Run the same query weekly and compare how Google's AI describes your product, company, or service. Notice when new sources enter the summary or when the framing shifts.
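A lightweight way to spot those shifts is comparing stored snapshots with difflib from the standard library (the two snapshot strings below are invented):

```python
import difflib

# Two hypothetical snapshots of the same query's answer, a week apart
last_week = "Gemini API pricing starts at $0.10 per million input tokens."
this_week = "Gemini API pricing now starts at $0.15 per million input tokens, with a free tier."

# A ratio near 1.0 means stable framing; a drop flags a rewrite worth reviewing
similarity = difflib.SequenceMatcher(None, last_week, this_week).ratio()
print(f"similarity: {similarity:.2f}")
for line in difflib.unified_diff([last_week], [this_week], lineterm=""):
    print(line)
```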
Understand source authority. Collect the sources Google's AI cites. Which domains does Google trust for your topic? Are your company's pages in the citation list? Are competitors' pages ranked higher?
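Counting cited domains across collected results makes the authority pattern visible; a sketch using only the standard library (the URL list is invented):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical source URLs pooled from several queries
source_urls = [
    "https://ai.google.dev/pricing",
    "https://ai.google.dev/gemini-api/docs",
    "https://cloud.google.com/vertex-ai/pricing",
    "https://techcrunch.com/2026/01/05/gemini-pricing",
]

# Domain frequency approximates which sites Google's AI leans on for the topic
domains = Counter(urlparse(url).netloc for url in source_urls)
for domain, count in domains.most_common():
    print(f"{domain}: {count}")
```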
Compare LLM perspectives. Run the same query against Google AI, ChatGPT, and Grok. See where they diverge. This reveals gaps in training data, regional differences, or recency bias across models.
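One simple divergence measure is the overlap between the domain sets each model cites for the same query; a sketch with invented domain sets:

```python
# Hypothetical cited-domain sets for the same query from two interfaces
google_ai = {"ai.google.dev", "cloud.google.com", "techcrunch.com"}
chatgpt = {"ai.google.dev", "openai.com", "techcrunch.com"}

# Jaccard similarity: 1.0 means identical citations, 0.0 means none shared
shared = google_ai & chatgpt
jaccard = len(shared) / len(google_ai | chatgpt)
print(f"shared: {sorted(shared)}, Jaccard: {jaccard:.2f}")
```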
Feed cited sources into research pipelines. Use the source URLs from Google's AI overview as seed content for further scraping, comparison analysis, or fact-checking workflows.
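Before feeding source_urls into a scraper, it helps to normalize and deduplicate them so the same page isn't fetched twice; a sketch using the standard library (the URLs and helper name are ours):

```python
from urllib.parse import urlparse, urlunparse

def normalize(url: str) -> str:
    # Drop query string, fragment, and trailing slash for dedup purposes
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", "", ""))

seen = set()
queue = []
for url in [
    "https://example.com/gemini-pricing?utm_source=google",
    "https://example.com/gemini-pricing/",
]:
    clean = normalize(url)
    if clean not in seen:
        seen.add(clean)
        queue.append(clean)

print(queue)
```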
Getting Started with Nimble
Install the Python client:

```shell
pip install nimble_python
```

Then make your first request:

```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={"keyword": "your query here"}
)
print(result)
```

Pricing: Web Search Agents are $1 per 1,000 pages. Start free with 5,000 pages.
Sign up: https://app.nimbleway.com/signup
Continue Exploring
The same approach works across all four major LLM interfaces. These posts cover the other providers and related use cases.
- The Complete Guide to LLM Scraping — The full picture: scraping provider documentation pages and querying LLM interfaces, with DIY approaches for each.
- How to Extract ChatGPT Responses as Structured Data with Python — Collect answers, source citations, and links from ChatGPT's web interface programmatically.
- Scraping Grok: Real-Time Answers, Images, and Web Search Results — Extract structured responses from Grok, including answer HTML and image data.
- Scraping Gemini's Web Search Answers with Python — Collect Gemini's grounded answers with full source metadata and position data.
- How to Track OpenAI, Gemini, and Grok Pricing Automatically — Build a monitor that detects pricing changes across providers before they affect your stack.
- How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.