Scraping Gemini's Web Search Answers with Python

Gemini's web interface returns something the google-generativeai Python SDK doesn't: grounded search results with source citations, position metadata, and HTML-formatted answers. When you interact with Gemini in your browser, you get structured data — the plain text answer, an HTML version ready for rendering, a clean list of cited sources, and information about where each source appears in the answer. If you're tracking how Google's LLM answers questions about your market, comparing its responses against ChatGPT, or building a citation-aware system, you need a way to collect this programmatically.
What Gemini Returns
A structured Gemini response contains four key fields:
- answer — The plain text response, just the words.
- html — The same answer formatted as HTML, with structure and emphasis intact. Useful for rendering directly or converting to Markdown.
- links — A list of URLs that appear in the answer. Simple list of strings.
- sources — A list of objects, each with:
  - title — The source's headline or page title.
  - source_domain — The domain (example.com).
  - description — A snippet describing the source.
  - snippet — The excerpt from the source that appears in the answer.
  - icon — URL to the source's favicon.
  - startPosition — Float: character position in the answer where this source is first cited.
  - endPosition — Float: character position where the citation ends.
The startPosition and endPosition fields are especially useful. They tell you exactly where in the answer each source was cited, letting you map sources back to the claims they support.
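As a minimal sketch of that mapping — assuming a `result` dict shaped like the fields described above, with a hypothetical sample response — you can slice the answer text by each source's positions:

```python
# Pull the span of answer text that each source's citation covers.
def cited_spans(result):
    spans = []
    for source in result["sources"]:
        start = int(source["startPosition"])
        end = int(source["endPosition"])
        spans.append((source["source_domain"], result["answer"][start:end]))
    return spans

# Hypothetical sample shaped like a real response
sample = {
    "answer": "Gemini 1.5 Pro allows 2 requests per minute on the free tier.",
    "sources": [
        {"source_domain": "ai.google.dev", "startPosition": 0.0, "endPosition": 14.0},
    ],
}
print(cited_spans(sample))  # [('ai.google.dev', 'Gemini 1.5 Pro')]
```

Each tuple pairs a cited domain with the exact claim it supports, which is the building block for any citation-auditing workflow.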
Scraping Gemini with Python
Let's walk through the progression of approaches, from DIY to practical.
Simple HTTP Request
Start simple:
```python
import requests

response = requests.get("https://gemini.google.com")
print(response.status_code)
print(response.text[:500])
```
You'll get a redirect to Google's login page. Gemini requires authentication.
Adding Playwright for Rendering
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://gemini.google.com")
    print(page.title())
    browser.close()
```

Playwright opens an actual browser and renders the page. But you still hit the Google login wall: the page loads, the login form appears, and you need credentials.
Handling Google Login
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://gemini.google.com")

    # Wait for and interact with the login form
    page.fill("input[type='email']", "your-email@gmail.com")
    page.press("input[type='email']", "Enter")
    page.wait_for_timeout(2000)
    page.fill("input[type='password']", "your-password")
    page.press("input[type='password']", "Enter")
    page.wait_for_load_state("networkidle")

    print(page.content()[:1000])
    browser.close()
```

This works — sometimes. But Google's login flow is complex. After entering credentials, you might hit 2FA, device verification, or a "suspicious activity" warning. And Google actively detects automation on its own properties: it serves different pages to Playwright than to a normal browser, making reliable scraping difficult.
Where This Breaks Down
Google's bot detection on Gemini is aggressive. The login requires handling cookies, token refresh, and conditional 2FA flows. Gemini's UI updates frequently — it's a fast-moving product — so selectors and page structure shift regularly. Building a production scraper requires ongoing maintenance just to keep authentication working, let alone parsing the dynamic interface.
A Cleaner Approach
Instead of fighting Google's bot detection, use a web agent that's built for this:
```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="gemini_mirror_prod_data",
    params={
        "prompt": "What are the current Gemini 1.5 Pro rate limits for the free tier?"
    }
)
print(result)
```

This returns all four fields: answer, html, links, and sources with position metadata.
Working with the Response
Parse the structured data:
```python
# Plain text answer
print(result["answer"])

# HTML version for rendering or conversion
print(result["html"])

# All linked URLs
for link in result["links"]:
    print(link)

# Sources with position metadata
for source in result["sources"]:
    print(source["title"])
    print(source["source_domain"])
    print(source["description"])
    print(f"cited at position {source['startPosition']}–{source['endPosition']}")
    print()
```

The position metadata is particularly valuable. You can map which sources back up which parts of the answer:
```python
# Find sources whose citation range overlaps a given position range
def sources_in_range(sources, start, end):
    return [s for s in sources if s["startPosition"] <= end and s["endPosition"] >= start]

# Example: which sources appear in the first 200 characters?
early_sources = sources_in_range(result["sources"], 0, 200)
for source in early_sources:
    print(f"{source['source_domain']} cited early: {source['description']}")
```

Use Cases
- Track Gemini's answer evolution — Run the same query weekly and compare how Gemini's response and source selection change over time. Useful for monitoring product messaging, feature announcements, or how your company appears in AI-grounded search.
- Build citation maps — Collect responses to 50+ related queries and analyze which domains Gemini cites for different topics. Does it favor certain sources? How does coverage vary across queries? This data is valuable for SEO and content strategy.
- Compare LLM grounding — Ask Gemini, ChatGPT, and Google AI the same questions, then compare their sources and answer patterns. Understand which LLM cites your site, whose answers are most recent, whose sources are most diverse.
- Extract HTML for custom rendering — The html field is ready to drop into a web UI, Markdown converter, or downstream pipeline. Build a custom interface that displays Gemini's answers with your own styling or additional context.
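For the citation-map use case above, a minimal sketch — the `results` list and its sample data here are hypothetical stand-ins for a batch of collected responses — is to tally cited domains across queries:

```python
from collections import Counter

# Count which domains appear in the sources of a batch of responses.
# Each item in `results` is assumed to have the `sources` field
# described earlier in this post.
def domain_counts(results):
    counter = Counter()
    for result in results:
        for source in result["sources"]:
            counter[source["source_domain"]] += 1
    return counter

# Hypothetical collected responses
results = [
    {"sources": [{"source_domain": "ai.google.dev"}, {"source_domain": "example.com"}]},
    {"sources": [{"source_domain": "ai.google.dev"}]},
]
print(domain_counts(results).most_common())
# [('ai.google.dev', 2), ('example.com', 1)]
```

Sorting by `most_common()` surfaces the domains Gemini leans on most for a topic, which is the raw material for an SEO or content-strategy analysis.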
Getting Started with Nimble
Install the SDK:
```shell
pip install nimble_python
```

Set your API key and make your first request:
```python
from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="gemini_mirror_prod_data",
    params={"prompt": "your query here"}
)
print(result)
```

Web Search Agents cost $1 per 1,000 pages. The free trial includes 5,000 pages. Sign up at https://app.nimbleway.com/signup.
Continue Exploring
The same approach works across all four major LLM interfaces. These posts cover the other providers and related use cases.
- The Complete Guide to LLM Scraping — The full picture: scraping provider documentation pages and querying LLM interfaces, with DIY approaches for each.
- How to Extract ChatGPT Responses as Structured Data with Python — Collect answers, source citations, and links from ChatGPT's web interface programmatically.
- Scraping Google AI Mode for LLM Overviews and Sources — Pull Google's AI-generated overviews and their cited sources using Python.
- Scraping Grok: Real-Time Answers, Images, and Web Search Results — Extract structured responses from Grok, including answer HTML and image data.
- How to Track OpenAI, Gemini, and Grok Pricing Automatically — Build a monitor that detects pricing changes across providers before they affect your stack.
- How to Monitor AI Model Deprecations in Real-Time — Set up alerts for model deprecation notices so you're not caught off guard when a model gets turned off.