March 24, 2026

How to Extract ChatGPT Responses as Structured Data with Python

8 min read

Tom Shaked


Open ChatGPT in your browser, ask it a research question, and you get back an answer with clickable source links, snippets, and publication names. Hit the official API with the same question, and you get text. The web interface integrates real-time search results and cites where it found them. If you need those citations, those links, those sources—you need to scrape the web interface.

Here's why you'd want to: you're building a competitive intelligence pipeline, monitoring which sources ChatGPT trusts for your industry, or tracking how its answers change over time. You need the structured data, not just the prose. Let's build it.

What ChatGPT Returns

When ChatGPT answers a research query via the web interface, it returns these fields:

  • answer — The plain text response, exactly as shown on the web.
  • links — A list of URLs that appeared in or near the answer (citations, related results, tracking URLs).
  • sources — A list of structured objects. Each has:
    • title — The headline or title of the cited source.
    • url — The direct link to the source.
    • source — The publication or domain name (e.g., "Fast Company", "IPCC", "TechCrunch").
    • snippet — A short excerpt from the source.

The sources field is what the API doesn't give you. That's your leverage. You can track which publications ChatGPT considers authoritative for a given topic, monitor if citations change week to week, or validate whether its sources actually support its claims.
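To make that concrete, here is a minimal sketch of tallying which publications show up in sources. The result dict is hypothetical sample data shaped like the fields listed above:

```python
from collections import Counter

# Hypothetical sample shaped like the fields described above
result = {
    "answer": "GPT-4o pricing has changed several times...",
    "links": ["https://techcrunch.com/a", "https://openai.com/pricing"],
    "sources": [
        {"title": "OpenAI cuts GPT-4o prices", "url": "https://techcrunch.com/a",
         "source": "TechCrunch", "snippet": "OpenAI announced..."},
        {"title": "API pricing", "url": "https://openai.com/pricing",
         "source": "OpenAI", "snippet": "Current per-token rates..."},
        {"title": "GPT-4o pricing analysis", "url": "https://techcrunch.com/b",
         "source": "TechCrunch", "snippet": "A breakdown of..."},
    ],
}

# Tally citations per publication name
by_publication = Counter(s["source"] for s in result["sources"])
print(by_publication.most_common())  # [('TechCrunch', 2), ('OpenAI', 1)]
```

Run this weekly and the counter tells you which publications ChatGPT leans on for your topic.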

Scraping ChatGPT with Python

Let's walk through the hard way first, so you see why a cleaner approach is necessary.

Step 1: A Naive Request

import requests

# Disable redirect-following so the 302 below is visible
response = requests.get("https://chatgpt.com", allow_redirects=False)
print(response.status_code)
print(response.text[:500])

What you get back:

302
<html>
    <head>
    	<title>Object moved</title>
    </head>
    <body>
    	<h1>Transfer in progress</h1>
    </body>
</html>

ChatGPT redirects unauthenticated requests to /auth/login. You hit a wall immediately.

Step 2: Bring in Playwright

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://chatgpt.com")
    print(page.url)
    print(page.content()[:500])
    browser.close()

Output:

https://chatgpt.com/auth/login
<html><head><title>ChatGPT</title>...</head><body>...

Now you see the login page, but you're blocked from proceeding. The page requires you to log in before it loads the chat interface.

Step 3: Handle Authentication

This is where it gets complex. You need to script the full login flow:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate to login (the real flow redirects through a separate auth
    # domain and its selectors change often; the ones below are illustrative)
    page.goto("https://chatgpt.com/auth/login")

    # Fill in email
    page.fill('input[type="email"]', "your-email@example.com")
    page.click('button:has-text("Continue")')

    # Wait for password field, fill it
    page.wait_for_selector('input[type="password"]')
    page.fill('input[type="password"]', "your-password")
    page.click('button:has-text("Continue")')

    # Handle 2FA if present
    page.wait_for_selector('input[placeholder*="code"]', timeout=30000)
    # You'd have to handle this manually, or read the code from an inbox (IMAP) hook

    # Wait for dashboard to load
    page.wait_for_url("https://chatgpt.com/**", timeout=15000)

    print("Logged in!")
    browser.close()

You're now past the gate, but you've introduced dependencies: storing credentials (securely, you hope), handling multi-factor authentication, managing session cookies, and dealing with CSRF tokens if the UI changes.

Step 4: Query and Extract

Once logged in, you can interact with the chat interface:

from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Assume you've logged in (code from Step 3)
    page.goto("https://chatgpt.com")

    # Click the chat input
    page.click('textarea')

    # Type your query
    page.fill('textarea', "What is the current price of GPT-4o per million tokens?")

    # Submit
    page.keyboard.press("Enter")

    # Wait for response
    page.wait_for_selector('[data-testid="citation"]', timeout=30000)

    # Extract sources
    sources = []
    for citation in page.query_selector_all('[data-testid="citation"]'):
        title_el = citation.query_selector('.citation-title')
        link_el = citation.query_selector('a')
        if title_el is None or link_el is None:
            continue  # selector drift: skip rather than crash mid-run
        sources.append({
            "title": title_el.inner_text(),
            "url": link_el.get_attribute('href'),
        })

    print(json.dumps(sources, indent=2))
    browser.close()

Where This Breaks Down

Session cookies expire. OpenAI rate-limits aggressive scraping. UI selectors change when they push updates—your [data-testid="citation"] query fails silently. Maintaining this per-provider (ChatGPT, Gemini, Claude, Grok) means duplicating the auth logic, selector logic, and error handling for each. You're locked into maintaining scraper code indefinitely, and your pipeline halts the moment OpenAI tweaks their frontend. For production use, this is a house of cards.
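If you do maintain your own scraper, one way to soften silent selector failures is to try a list of candidate selectors and fail loudly when none match. This is a generic sketch, not a Playwright or Nimble API; `query` is any callable that returns a node or None (with Playwright you would pass `page.query_selector`):

```python
def first_match(query, selectors):
    """Try candidate selectors in order; return (selector, node) for the first hit.

    Raises LookupError instead of failing silently when the UI has changed.
    """
    for sel in selectors:
        node = query(sel)
        if node is not None:
            return sel, node
    raise LookupError("no selector matched: " + ", ".join(selectors))

# A dict-backed stub stands in for a live page here
fake_dom = {'[data-testid="citation"]': "citation-node"}
sel, node = first_match(fake_dom.get, ['[data-testid="citation"]', ".citation"])
print(sel, node)  # [data-testid="citation"] citation-node
```

The LookupError turns "pipeline silently returns zero sources" into an alert you can act on.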

A Cleaner Approach

If what you need is structured, reliable output from ChatGPT—answer, sources, citations, all parsed and validated—use the Nimble agent.

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "OpenAI GPT-4o pricing per million tokens",
        "refinements": "current 2026"
    }
)

print(result)

What comes back is a dict with the structure we defined earlier: answer, links, sources (each with title, url, source, snippet). It's ready to use immediately. No session management, no selector hunting, no UI breakage.
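Because downstream code depends on those keys, a quick shape check before use is cheap insurance. A minimal sketch; the expected keys come from the field list earlier, and the sample dict is hypothetical:

```python
REQUIRED_TOP = {"answer", "links", "sources"}
REQUIRED_SOURCE = {"title", "url", "source", "snippet"}

def validate_result(result):
    """Return a list of problems; an empty list means the shape matches."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP - result.keys()]
    for i, src in enumerate(result.get("sources", [])):
        for k in REQUIRED_SOURCE - src.keys():
            problems.append(f"sources[{i}] missing key: {k}")
    return problems

# Hypothetical malformed sample: one source lacks its metadata fields
sample = {"answer": "...", "links": [], "sources": [{"title": "t", "url": "u"}]}
print(sorted(validate_result(sample)))
```

Reject or retry any response where the list is non-empty before it reaches your database.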

Working with the Response

Extract and work with each field:

# The plain text answer
answer = result["answer"]
print(answer)

# All sources with metadata
for source in result["sources"]:
    print(f"{source['title']}")
    print(f"  Publication: {source['source']}")
    print(f"  URL: {source['url']}")
    print(f"  Snippet: {source['snippet']}")
    print()

# All links found in the response
for link in result["links"]:
    print(link)

Here's a more complete example—storing results over time and tracking how citations evolve:

from nimble_python import Nimble
import json
from datetime import datetime, timezone

nimble = Nimble(api_key="YOUR_API_KEY")

# Track the same query across multiple runs
query = "OpenAI GPT-4o pricing per million tokens"

for run in range(3):
    result = nimble.agent.run(
        agent="chatgpt_mirror_prod_data",
        params={"keyword": query, "refinements": "current 2026"}
    )

    # Store with timestamp (datetime.utcnow() is deprecated since Python 3.12)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": result["answer"],
        "sources": result["sources"],
        "links": result["links"]
    }

    with open(f"chatgpt_result_{run}.json", "w") as f:
        json.dump(record, f, indent=2)

    print(f"Run {run + 1} complete. Found {len(result['sources'])} sources.")
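Once you have timestamped records on disk, the week-over-week comparison is a set difference over source URLs. A sketch using two hypothetical stored records:

```python
def citation_diff(old_record, new_record):
    """Compare the cited URLs between two stored runs."""
    old = {s["url"] for s in old_record["sources"]}
    new = {s["url"] for s in new_record["sources"]}
    return {
        "added": sorted(new - old),      # newly cited
        "dropped": sorted(old - new),    # no longer cited
        "kept": sorted(old & new),       # stable citations
    }

# Hypothetical records as saved by the loop above
run_1 = {"sources": [{"url": "https://openai.com/pricing"},
                     {"url": "https://techcrunch.com/a"}]}
run_2 = {"sources": [{"url": "https://openai.com/pricing"},
                     {"url": "https://fastcompany.com/b"}]}

print(citation_diff(run_1, run_2))
```

A non-empty `dropped` list is exactly the "stopped citing a source" signal the use cases below care about.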

Use Cases

Citation tracking: Monitor which sources ChatGPT cites for a given query over weeks or months. Detect when it shifts from citing academic papers to news articles, or stops citing a competitor entirely.

Competitive research pipeline: Query ChatGPT about competitors, products, pricing, market trends. Collect structured answers and sources, store them in a database. Compare what it says about Competitor A vs. Competitor B, and which sources it trusts for each.

Answer consistency monitoring: Run the same query repeatedly and diff the answers. Does ChatGPT give consistent results, or does it contradict itself? Which sources does it pull in consistently, and which appear only sometimes?

LLM benchmarking: Use the same queries across ChatGPT, Gemini, Grok, and Claude. Compare their answers and their sources. Which LLM cites the most authoritative publications? Which one is most conservative? Which one hallucinates sources?
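For the benchmarking case, one simple metric is how much two providers' cited domains overlap (Jaccard similarity). The provider results here are hypothetical sample data:

```python
from urllib.parse import urlparse

def cited_domains(result):
    """Set of domains cited in a result's sources."""
    return {urlparse(s["url"]).netloc for s in result["sources"]}

def jaccard(a, b):
    """Overlap between two sets: 1.0 identical, 0.0 disjoint."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical results from two providers for the same query
chatgpt = {"sources": [{"url": "https://arxiv.org/abs/1"},
                       {"url": "https://techcrunch.com/x"}]}
gemini = {"sources": [{"url": "https://arxiv.org/abs/2"},
                      {"url": "https://nature.com/y"}]}

overlap = jaccard(cited_domains(chatgpt), cited_domains(gemini))
print(round(overlap, 2))  # 0.33
```

Scores near zero across many queries suggest the models draw on very different source pools for your topic.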

Getting Started with Nimble

Install the SDK:

pip install nimble_python

Make your first call:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "your query here",
        "refinements": "context or date if needed"
    }
)

print(result)

Pricing: Web Search Agents cost $1 per 1,000 pages. Free trial includes 5,000 pages.

Sign up: https://app.nimbleway.com/signup

