Scraping Google AI Mode for LLM Overviews and Sources

Google Search now returns AI-generated overviews alongside traditional search results — concise summaries of web sources that Google's AI has pulled and cited. Unlike the standard Google Search API, these overviews are only available in the web interface and come with structured source attribution. If you need to collect Google's AI take on a topic at scale, you'll want to automate this.
Google AI Mode returns two key pieces of data: the answer (the AI-generated overview as plain text) and sources (a clean list of citations, each with title and url). You could use this to track how Google's AI summarizes your brand over time, compare its take against ChatGPT or Grok, or feed cited sources into downstream research pipelines.
What Google AI Returns
When Google renders an AI overview for a query, the response includes:
- answer — The plain text AI-generated summary. This is what you'd see in the gray box at the top of Google's search results.
- sources — A list of dictionaries, each containing:
  - title — The page title or headline
  - url — The full URL to the source
That's the schema. No metadata bloat, no extra fields. The sources list is your citation record — it's the only place citations appear in the response.
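As a Python dict, a response following that schema might look like this (the values below are invented for illustration):

```python
# Illustrative response following the answer/sources schema (values invented)
result = {
    "answer": "Gemini API pricing is tiered by model and token volume.",
    "sources": [
        {"title": "Gemini API Pricing", "url": "https://ai.google.dev/pricing"},
        {"title": "Gemini Developer Docs", "url": "https://ai.google.dev/docs"},
    ],
}

# Every citation lives in sources; nothing else in the response carries them
for source in result["sources"]:
    print(f"{source['title']}: {source['url']}")
```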
Scraping Google AI with Python
Let's walk through the DIY path first, then show you where it breaks.
Attempt 1: Plain Requests
Start simple. Hit Google's search endpoint with your query:
```python
import requests

query = "Google Gemini API pricing 2026"
url = f"https://www.google.com/search?q={query}"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}

response = requests.get(url, headers=headers)
print(response.text[:500])
```

What you get back: static HTML with the standard blue-link results. No AI overview. Google renders AI overviews client-side after the page loads — the initial HTML response doesn't include them, so requests can't see what the browser renders.
Attempt 2: Add Playwright
A headless browser executes JavaScript and waits for the page to settle:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    print(page.content()[:1000])
    browser.close()
```

One of two things happens here. Either Google returns a CAPTCHA challenge — headless Chromium has a recognizable fingerprint and Google's bot detection catches it — or you get a rendered page where the AI overview simply isn't present. AI overviews don't appear on every query; Google decides when to show them based on query type, freshness, and geography.
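When Google blocks a request, it typically redirects to a CAPTCHA interstitial under the /sorry/ path. A minimal check you can run on page.url after navigation (a sketch assuming that path convention holds; the helper name is ours):

```python
def hit_captcha(current_url: str) -> bool:
    # Google's CAPTCHA interstitial lives under google.com/sorry/
    return "/sorry/" in current_url

# After page.goto(...), pass page.url here before trying to parse anything
print(hit_captcha("https://www.google.com/sorry/index?continue=https%3A%2F%2Fwww.google.com"))
```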
Attempt 3: Stealth patching and parsing
If you get past bot detection, the next problem is extraction. The AI overview is buried in deeply nested, dynamically generated HTML. There's no stable semantic element — the class names are obfuscated and change frequently:
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync
from bs4 import BeautifulSoup

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)
    page.goto("https://www.google.com/search?q=Google+Gemini+API+pricing+2026")
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
# AI overview container — this selector breaks whenever Google updates the page
overview = soup.find("div", {"class": "IVvmDb"})
if overview:
    print(overview.get_text())
else:
    print("No AI overview found")
```

Stealth patches help with fingerprinting. They don't fix the fundamental problems: AI overviews are inconsistent, the HTML structure changes without notice, and Google actively works against automated access to its search results at scale.
A Cleaner Approach
Instead of wrestling with bot detection, inconsistent rendering, and brittle selectors, use an agent that handles all of that for you:
```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={
        "keyword": "Google Gemini API pricing 2026"
    }
)
print(result)
```

What comes back is a clean dictionary with answer and sources. No HTML parsing. No browser wrangling. No login sessions to maintain.
Working with the Response
Extract the answer:
```python
answer = result["answer"]
print(answer)
```

Pull all cited sources:

```python
for source in result["sources"]:
    print(f"{source['title']}: {source['url']}")
```

A more complete example — store results with timestamps and extract all URLs for downstream scraping:
```python
import json
from datetime import datetime

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

# Collect multiple queries
queries = [
    "Google Gemini API pricing 2026",
    "Claude API pricing 2026",
    "OpenAI GPT-4 pricing 2026",
]

results = []
for query in queries:
    result = nimble.agent.run(
        agent="google_ai_mirror_prod_data",
        params={"keyword": query}
    )
    # Store with timestamp
    record = {
        "query": query,
        "timestamp": datetime.now().isoformat(),
        "answer": result["answer"],
        "sources": result["sources"],
        "source_urls": [s["url"] for s in result["sources"]],
    }
    results.append(record)

# Save to file
with open("google_ai_results.json", "w") as f:
    json.dump(results, f, indent=2)

# Print source URLs
for record in results:
    print(f"\n{record['query']}:")
    for url in record["source_urls"]:
        print(f"  {url}")
```

Use Cases
Track brand perception over time. Run the same query weekly and compare how Google's AI describes your product, company, or service. Notice when new sources enter the summary or when the framing shifts.
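A lightweight way to spot those shifts is comparing stored snapshots with difflib from the standard library (the two snapshot strings below are invented):

```python
import difflib

# Two hypothetical snapshots of the same query's answer, a week apart
last_week = "Gemini API pricing starts at $0.10 per million input tokens."
this_week = "Gemini API pricing now starts at $0.15 per million input tokens, with a free tier."

# A ratio near 1.0 means stable framing; a drop flags a rewrite worth reviewing
similarity = difflib.SequenceMatcher(None, last_week, this_week).ratio()
print(f"similarity: {similarity:.2f}")
for line in difflib.unified_diff([last_week], [this_week], lineterm=""):
    print(line)
```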
Understand source authority. Collect the sources Google's AI cites. Which domains does Google trust for your topic? Are your company's pages in the citation list? Are competitors' pages ranked higher?
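Counting cited domains across collected results makes the authority pattern visible; a sketch using only the standard library (the URL list is invented):

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical source URLs pooled from several queries
source_urls = [
    "https://ai.google.dev/pricing",
    "https://ai.google.dev/gemini-api/docs",
    "https://cloud.google.com/vertex-ai/pricing",
    "https://techcrunch.com/2026/01/05/gemini-pricing",
]

# Domain frequency approximates which sites Google's AI leans on for the topic
domains = Counter(urlparse(url).netloc for url in source_urls)
for domain, count in domains.most_common():
    print(f"{domain}: {count}")
```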
Compare LLM perspectives. Run the same query against Google AI, ChatGPT, and Grok. See where they diverge. This reveals gaps in training data, regional differences, or recency bias across models.
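One simple divergence measure is the overlap between the domain sets each model cites for the same query; a sketch with invented domain sets:

```python
# Hypothetical cited-domain sets for the same query from two interfaces
google_ai = {"ai.google.dev", "cloud.google.com", "techcrunch.com"}
chatgpt = {"ai.google.dev", "openai.com", "techcrunch.com"}

# Jaccard similarity: 1.0 means identical citations, 0.0 means none shared
shared = google_ai & chatgpt
jaccard = len(shared) / len(google_ai | chatgpt)
print(f"shared: {sorted(shared)}, Jaccard: {jaccard:.2f}")
```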
Feed cited sources into research pipelines. Use the source URLs from Google's AI overview as seed content for further scraping, comparison analysis, or fact-checking workflows.
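Before feeding source_urls into a scraper, it helps to normalize and deduplicate them so the same page isn't fetched twice; a sketch using the standard library (the URLs and helper name are ours):

```python
from urllib.parse import urlparse, urlunparse

def normalize(url: str) -> str:
    # Drop query string, fragment, and trailing slash for dedup purposes
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", "", ""))

seen = set()
queue = []
for url in [
    "https://example.com/gemini-pricing?utm_source=google",
    "https://example.com/gemini-pricing/",
]:
    clean = normalize(url)
    if clean not in seen:
        seen.add(clean)
        queue.append(clean)

print(queue)
```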
Getting Started with Nimble
Install the Python client:

```shell
pip install nimble_python
```

Then make your first request:

```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="google_ai_mirror_prod_data",
    params={"keyword": "your query here"}
)
print(result)
```

Pricing: Web Search Agents are $1 per 1,000 pages. Start free with 5,000 pages.
Sign up: https://app.nimbleway.com/signup
Continue Exploring
The same approach works across all four major LLM interfaces. These posts cover the other providers and related use cases.
- The Complete Guide to LLM Scraping — The full picture: scraping provider documentation pages and querying LLM interfaces, with DIY approaches for each.
- How to Extract ChatGPT Responses as Structured Data with Python — Collect answers, source citations, and links from ChatGPT's web interface programmatically.
- Scraping Grok: Real-Time Answers, Images, and Web Search Results — Extract structured responses from Grok, including answer HTML and image data.
- Scraping Gemini's Web Search Answers with Python — Collect Gemini's grounded answers with full source metadata and position data.
- How to Track OpenAI, Gemini, and Grok Pricing Automatically — Build a monitor that detects pricing changes across providers before they affect your stack.
- How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.