March 24, 2026

Scraping Google AI Mode for LLM Overviews and Sources

7 min read

Tom Shaked


Google Search now returns AI-generated overviews alongside traditional search results — concise summaries of web sources that Google's AI has pulled and cited. Unlike the standard Google Search API, these overviews are only available in the web interface and come with structured source attribution. If you need to collect Google's AI take on a topic at scale, you'll want to automate this.

Google AI Mode returns two key pieces of data: the answer (the AI-generated overview as plain text) and sources (a clean list of citations, each with title and url). You could use this to track how Google's AI summarizes your brand over time, compare its take against ChatGPT or Grok, or feed cited sources into downstream research pipelines.

What Google AI Returns

When Google renders an AI overview for a query, the response includes:

  • answer — The plain text AI-generated summary. This is what you'd see in the gray box at the top of Google's search results.
  • sources — A list of dictionaries, each containing:
    • title — The page title or headline
    • url — The full URL to the source

That's the schema. No metadata bloat, no extra fields. The sources list is your citation record — it's the only place citations appear in the response.
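Because the schema is this small, it's easy to validate a record before it enters your pipeline. A minimal sketch — the sample data below is illustrative, standing in for a real response:

```python
# Hypothetical response, matching the answer/sources shape described above
result = {
    "answer": "Gemini API pricing is tiered by model and token volume.",
    "sources": [
        {"title": "Gemini API Pricing", "url": "https://ai.google.dev/pricing"},
        {"title": "Google Cloud Blog", "url": "https://cloud.google.com/blog"},
    ],
}

def is_valid_overview(record: dict) -> bool:
    """Check that a record has a string answer and a list of title/url dicts."""
    if not isinstance(record.get("answer"), str):
        return False
    sources = record.get("sources")
    if not isinstance(sources, list):
        return False
    return all(
        isinstance(s, dict)
        and isinstance(s.get("title"), str)
        and isinstance(s.get("url"), str)
        for s in sources
    )

print(is_valid_overview(result))  # True
```

A check like this is cheap insurance if you're writing thousands of records to disk before analyzing them.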

Scraping Google AI with Python

Let's walk through the DIY path first, then show you where it breaks.

Attempt 1: Plain Requests

Start simple. Hit Google's search endpoint with your query:

import requests

query = "Google Gemini API pricing 2026"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

# Pass the query via params so requests handles URL encoding
response = requests.get("https://www.google.com/search", params={"q": query}, headers=headers)
print(response.text[:500])

What you get back: static HTML with the standard blue-link results. No AI overview. Google renders AI overviews client-side after the page loads — the initial HTML response doesn't include them. requests can't see what the browser renders.

Attempt 2: Add Playwright

A headless browser executes JavaScript and waits for the page to settle:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    print(page.content()[:1000])
    browser.close()

One of two things happens here. Either Google returns a CAPTCHA challenge — headless Chromium has a recognizable fingerprint, and Google's bot detection catches it — or you get a rendered page where the AI overview simply isn't present. AI overviews don't appear on every query; Google decides when to show them based on query type, freshness, and geography.
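If you go down this road, it helps to distinguish "blocked" from "no overview shown" so you can back off or retry appropriately. A rough heuristic, assuming Google's challenge page still contains its usual marker phrases — these are assumptions and may change without notice:

```python
def looks_blocked(html: str) -> bool:
    """Rough heuristic for Google's bot-challenge interstitial.

    The marker strings are assumptions based on the block page's
    typical wording and URLs; Google may change them at any time.
    """
    markers = ("unusual traffic", "recaptcha", "/sorry/")
    lowered = html.lower()
    return any(m in lowered for m in markers)

# Usage with the Playwright snippet above:
# if looks_blocked(page.content()):
#     ...back off, rotate the session, or retry later
```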

Attempt 3: Stealth patching and parsing

If you get past bot detection, the next problem is extraction. The AI overview is buried in deeply nested, dynamically generated HTML. There's no stable semantic element — the class names are obfuscated and change frequently:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")

# AI overview container — this selector breaks whenever Google updates the page
overview = soup.find("div", {"class": "IVvmDb"})
if overview:
    print(overview.get_text())
else:
    print("No AI overview found")

Stealth patches help with fingerprinting. They don't fix the fundamental problems: AI overviews are inconsistent, the HTML structure changes without notice, and Google actively works against automated access to its search results at scale.

A Cleaner Approach

Instead of wrestling with bot detection, inconsistent rendering, and brittle selectors, use an agent that handles all of that for you:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={
        "keyword": "Google Gemini API pricing 2026"
    }
)

print(result)

What comes back is a clean dictionary with answer and sources. No HTML parsing. No browser wrangling. No login sessions to maintain.

Working with the Response

Extract the answer:

answer = result["answer"]
print(answer)

Pull all cited sources:

for source in result["sources"]:
    print(f"{source['title']}: {source['url']}")

A more complete example — store results with timestamps and extract all URLs for downstream scraping:

import json
from datetime import datetime
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

# Collect multiple queries
queries = [
    "Google Gemini API pricing 2026",
    "Claude API pricing 2026",
    "OpenAI GPT-4 pricing 2026"
]

results = []

for query in queries:
    result = nimble.agent.run(
        agent="google_ai_mirror_prod_data",
        params={"keyword": query}
    )

    # Store with timestamp
    record = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "answer": result["answer"],
        "sources": result["sources"],
        "source_urls": [s["url"] for s in result["sources"]]
    }

    results.append(record)

# Save to file
with open("google_ai_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Print source URLs
for record in results:
    print(f"\n{record['query']}:")
    for url in record["source_urls"]:
        print(f"  {url}")

Use Cases

Track brand perception over time. Run the same query weekly and compare how Google's AI describes your product, company, or service. Notice when new sources enter the summary or when the framing shifts.
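A lightweight way to flag a framing shift is a similarity score between consecutive answers. A sketch using the standard library's difflib, with made-up answer text and an arbitrary threshold:

```python
import difflib

def answer_drift(old: str, new: str) -> float:
    """Return dissimilarity between two answers, from 0.0 (identical) to 1.0."""
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

# Illustrative weekly snapshots
last_week = "Acme's widget is a mid-range option with strong reviews."
this_week = "Acme's widget is a premium option with mixed reviews."

# Threshold is arbitrary; tune it against your own data
if answer_drift(last_week, this_week) > 0.1:
    print("Framing shifted - review the new answer")
```

Sequence similarity won't catch every meaningful change, but it's a cheap first-pass filter before a human (or another model) reads the diff.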

Understand source authority. Collect the sources Google's AI cites. Which domains does Google trust for your topic? Are your company's pages in the citation list? Are competitors' pages ranked higher?
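Counting cited domains is a few lines once you have the sources list. The data below is illustrative, standing in for result["sources"]:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative sources, standing in for result["sources"]
sources = [
    {"title": "Pricing", "url": "https://ai.google.dev/pricing"},
    {"title": "Docs", "url": "https://ai.google.dev/docs"},
    {"title": "Review", "url": "https://www.techradar.com/review"},
]

# Tally citations per domain
domain_counts = Counter(urlparse(s["url"]).netloc for s in sources)
print(domain_counts.most_common())  # [('ai.google.dev', 2), ('www.techradar.com', 1)]
```

Run this over weeks of saved results and you get a ranked view of which domains Google's AI keeps coming back to for your topic.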

Compare LLM perspectives. Run the same query against Google AI, ChatGPT, and Grok. See where they diverge. This reveals gaps in training data, regional differences, or recency bias across models.

Feed cited sources into research pipelines. Use the source URLs from Google's AI overview as seed content for further scraping, comparison analysis, or fact-checking workflows.
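When feeding URLs downstream, you'll usually want to dedupe across records while preserving citation order. A small helper, assuming records shaped like the saved JSON above:

```python
def unique_urls(records: list[dict]) -> list[str]:
    """Flatten source_urls across saved records, deduped, order preserved."""
    seen = set()
    out = []
    for record in records:
        for url in record.get("source_urls", []):
            if url not in seen:
                seen.add(url)
                out.append(url)
    return out

# Illustrative records in the shape written by the collection script
records = [
    {"query": "a", "source_urls": ["https://x.com/1", "https://y.com/2"]},
    {"query": "b", "source_urls": ["https://y.com/2", "https://z.com/3"]},
]
print(unique_urls(records))  # ['https://x.com/1', 'https://y.com/2', 'https://z.com/3']
```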

Getting Started with Nimble

Install the Python client:

pip install nimble_python

Then make your first request:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={"keyword": "your query here"}
)

print(result)

Pricing: Web Search Agents are $1 per 1,000 pages. Start free with 5,000 pages.

Sign up: https://app.nimbleway.com/signup

Continue Exploring

The same approach works across all four major LLM interfaces. These posts cover the other providers and related use cases.

How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.
