March 24, 2026

How to Extract ChatGPT Responses as Structured Data with Python

8 min read

Tom Shaked


Open ChatGPT in your browser, ask it a research question, and you get back an answer with clickable source links, snippets, and publication names. Hit the official API with the same question, and you get text. The web interface integrates real-time search results and cites where it found them. If you need those citations, those links, those sources—you need to scrape the web interface.

Here's why you'd want to: you're building a competitive intelligence pipeline, monitoring which sources ChatGPT trusts for your industry, or tracking how its answers change over time. You need the structured data, not just the prose. Let's build it.

What ChatGPT Returns

When ChatGPT answers a research query via the web interface, it returns these fields:

  • answer — The plain text response, exactly as shown on the web.
  • links — A list of URLs that appeared in or near the answer (citations, related results, tracking URLs).
  • sources — A list of structured objects. Each has:
    • title — The headline or title of the cited source.
    • url — The direct link to the source.
    • source — The publication or domain name (e.g., "Fast Company", "IPCC", "TechCrunch").
    • snippet — A short excerpt from the source.

The sources field is what the API doesn't give you. That's your leverage. You can track which publications ChatGPT considers authoritative for a given topic, monitor if citations change week to week, or validate whether its sources actually support its claims.
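To make that concrete, here is a minimal sketch of tallying which publications show up in sources. The result dict is hypothetical sample data shaped like the fields listed above:

```python
from collections import Counter

# Hypothetical sample shaped like the fields described above
result = {
    "answer": "GPT-4o pricing has changed several times...",
    "links": ["https://techcrunch.com/a", "https://openai.com/pricing"],
    "sources": [
        {"title": "OpenAI cuts GPT-4o prices", "url": "https://techcrunch.com/a",
         "source": "TechCrunch", "snippet": "OpenAI announced..."},
        {"title": "API pricing", "url": "https://openai.com/pricing",
         "source": "OpenAI", "snippet": "Current per-token rates..."},
        {"title": "GPT-4o pricing analysis", "url": "https://techcrunch.com/b",
         "source": "TechCrunch", "snippet": "A breakdown of..."},
    ],
}

# Tally citations per publication name
by_publication = Counter(s["source"] for s in result["sources"])
print(by_publication.most_common())  # [('TechCrunch', 2), ('OpenAI', 1)]
```

Run this weekly and the counter tells you which publications ChatGPT leans on for your topic.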

Scraping ChatGPT with Python

Let's walk through the hard way first, so you see why a cleaner approach is necessary.

Step 1: A Naive Request

import requests

# Disable redirect-following so the 302 below is visible
response = requests.get("https://chatgpt.com", allow_redirects=False)
print(response.status_code)
print(response.text[:500])

What you get back:

302
<html>
    <head>
    	<title>Object moved</title>
    </head>
    <body>
    	<h1>Transfer in progress</h1>
    </body>
</html>

ChatGPT redirects unauthenticated requests to /auth/login. You hit a wall immediately.

Step 2: Bring in Playwright

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://chatgpt.com")
    print(page.url)
    print(page.content()[:500])
    browser.close()

Output:

https://chatgpt.com/auth/login
<html><head><title>ChatGPT</title>...</head><body>...

Now you see the login page, but you're blocked from proceeding. The page requires you to log in before it loads the chat interface.

Step 3: Handle Authentication

This is where it gets complex. You need to script the full login flow:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Navigate to login (the real flow redirects through a separate auth
    # domain and its selectors change often; the ones below are illustrative)
    page.goto("https://chatgpt.com/auth/login")

    # Fill in email
    page.fill('input[type="email"]', "your-email@example.com")
    page.click('button:has-text("Continue")')

    # Wait for password field, fill it
    page.wait_for_selector('input[type="password"]')
    page.fill('input[type="password"]', "your-password")
    page.click('button:has-text("Continue")')

    # Handle 2FA if present
    page.wait_for_selector('input[placeholder*="code"]', timeout=30000)
    # You'd have to handle this manually, or read the code from an inbox (IMAP) hook

    # Wait for dashboard to load
    page.wait_for_url("https://chatgpt.com/**", timeout=15000)

    print("Logged in!")
    browser.close()

You're now past the gate, but you've introduced dependencies: storing credentials (securely, you hope), handling multi-factor authentication, managing session cookies, and dealing with CSRF tokens if the UI changes.

Step 4: Query and Extract

Once logged in, you can interact with the chat interface:

from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Assume you've logged in (code from Step 3)
    page.goto("https://chatgpt.com")

    # Click the chat input
    page.click('textarea')

    # Type your query
    page.fill('textarea', "What is the current price of GPT-4o per million tokens?")

    # Submit
    page.keyboard.press("Enter")

    # Wait for response
    page.wait_for_selector('[data-testid="citation"]', timeout=30000)

    # Extract sources
    sources = []
    for citation in page.query_selector_all('[data-testid="citation"]'):
        title_el = citation.query_selector('.citation-title')
        link_el = citation.query_selector('a')
        if title_el is None or link_el is None:
            continue  # selector drift: skip rather than crash mid-run
        sources.append({
            "title": title_el.inner_text(),
            "url": link_el.get_attribute('href'),
        })

    print(json.dumps(sources, indent=2))
    browser.close()

Where This Breaks Down

Session cookies expire. OpenAI rate-limits aggressive scraping. UI selectors change when they push updates—your [data-testid="citation"] query fails silently. Maintaining this per-provider (ChatGPT, Gemini, Claude, Grok) means duplicating the auth logic, selector logic, and error handling for each. You're locked into maintaining scraper code indefinitely, and your pipeline halts the moment OpenAI tweaks their frontend. For production use, this is a house of cards.
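If you do maintain your own scraper, one way to soften silent selector failures is to try a list of candidate selectors and fail loudly when none match. This is a generic sketch, not a Playwright or Nimble API; `query` is any callable that returns a node or None (with Playwright you would pass `page.query_selector`):

```python
def first_match(query, selectors):
    """Try candidate selectors in order; return (selector, node) for the first hit.

    Raises LookupError instead of failing silently when the UI has changed.
    """
    for sel in selectors:
        node = query(sel)
        if node is not None:
            return sel, node
    raise LookupError("no selector matched: " + ", ".join(selectors))

# A dict-backed stub stands in for a live page here
fake_dom = {'[data-testid="citation"]': "citation-node"}
sel, node = first_match(fake_dom.get, ['[data-testid="citation"]', ".citation"])
print(sel, node)  # [data-testid="citation"] citation-node
```

The LookupError turns "pipeline silently returns zero sources" into an alert you can act on.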

A Cleaner Approach

If what you need is structured, reliable output from ChatGPT—answer, sources, citations, all parsed and validated—use the Nimble agent.

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "OpenAI GPT-4o pricing per million tokens",
        "refinements": "current 2026"
    }
)

print(result)

What comes back is a dict with the structure we defined earlier: answer, links, sources (each with title, url, source, snippet). It's ready to use immediately. No session management, no selector hunting, no UI breakage.
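Because downstream code depends on those keys, a quick shape check before use is cheap insurance. A minimal sketch; the expected keys come from the field list earlier, and the sample dict is hypothetical:

```python
REQUIRED_TOP = {"answer", "links", "sources"}
REQUIRED_SOURCE = {"title", "url", "source", "snippet"}

def validate_result(result):
    """Return a list of problems; an empty list means the shape matches."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP - result.keys()]
    for i, src in enumerate(result.get("sources", [])):
        for k in REQUIRED_SOURCE - src.keys():
            problems.append(f"sources[{i}] missing key: {k}")
    return problems

# Hypothetical malformed sample: one source lacks its metadata fields
sample = {"answer": "...", "links": [], "sources": [{"title": "t", "url": "u"}]}
print(sorted(validate_result(sample)))
```

Reject or retry any response where the list is non-empty before it reaches your database.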

Working with the Response

Extract and work with each field:

# The plain text answer
answer = result["answer"]
print(answer)

# All sources with metadata
for source in result["sources"]:
    print(f"{source['title']}")
    print(f"  Publication: {source['source']}")
    print(f"  URL: {source['url']}")
    print(f"  Snippet: {source['snippet']}")
    print()

# All links found in the response
for link in result["links"]:
    print(link)

Here's a more complete example—storing results over time and tracking how citations evolve:

from nimble_python import Nimble
import json
from datetime import datetime, timezone

nimble = Nimble(api_key="YOUR_API_KEY")

# Track the same query across multiple runs
query = "OpenAI GPT-4o pricing per million tokens"

for run in range(3):
    result = nimble.agent.run(
        agent="chatgpt_mirror_prod_data",
        params={"keyword": query, "refinements": "current 2026"}
    )

    # Store with timestamp (datetime.utcnow() is deprecated since Python 3.12)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": result["answer"],
        "sources": result["sources"],
        "links": result["links"]
    }

    with open(f"chatgpt_result_{run}.json", "w") as f:
        json.dump(record, f, indent=2)

    print(f"Run {run + 1} complete. Found {len(result['sources'])} sources.")
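Once you have timestamped records on disk, the week-over-week comparison is a set difference over source URLs. A sketch using two hypothetical stored records:

```python
def citation_diff(old_record, new_record):
    """Compare the cited URLs between two stored runs."""
    old = {s["url"] for s in old_record["sources"]}
    new = {s["url"] for s in new_record["sources"]}
    return {
        "added": sorted(new - old),      # newly cited
        "dropped": sorted(old - new),    # no longer cited
        "kept": sorted(old & new),       # stable citations
    }

# Hypothetical records as saved by the loop above
run_1 = {"sources": [{"url": "https://openai.com/pricing"},
                     {"url": "https://techcrunch.com/a"}]}
run_2 = {"sources": [{"url": "https://openai.com/pricing"},
                     {"url": "https://fastcompany.com/b"}]}

print(citation_diff(run_1, run_2))
```

A non-empty `dropped` list is exactly the "stopped citing a source" signal the use cases below care about.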

Use Cases

Citation tracking: Monitor which sources ChatGPT cites for a given query over weeks or months. Detect when it shifts from citing academic papers to news articles, or stops citing a competitor entirely.

Competitive research pipeline: Query ChatGPT about competitors, products, pricing, market trends. Collect structured answers and sources, store them in a database. Compare what it says about Competitor A vs. Competitor B, and which sources it trusts for each.

Answer consistency monitoring: Run the same query repeatedly and diff the answers. Does ChatGPT give consistent results, or does it contradict itself? Which sources does it pull in consistently, and which appear only sometimes?

LLM benchmarking: Use the same queries across ChatGPT, Gemini, Grok, and Claude. Compare their answers and their sources. Which LLM cites the most authoritative publications? Which one is most conservative? Which one hallucinates sources?
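For the benchmarking case, one simple metric is how much two providers' cited domains overlap (Jaccard similarity). The provider results here are hypothetical sample data:

```python
from urllib.parse import urlparse

def cited_domains(result):
    """Set of domains cited in a result's sources."""
    return {urlparse(s["url"]).netloc for s in result["sources"]}

def jaccard(a, b):
    """Overlap between two sets: 1.0 identical, 0.0 disjoint."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical results from two providers for the same query
chatgpt = {"sources": [{"url": "https://arxiv.org/abs/1"},
                       {"url": "https://techcrunch.com/x"}]}
gemini = {"sources": [{"url": "https://arxiv.org/abs/2"},
                      {"url": "https://nature.com/y"}]}

overlap = jaccard(cited_domains(chatgpt), cited_domains(gemini))
print(round(overlap, 2))  # 0.33
```

Scores near zero across many queries suggest the models draw on very different source pools for your topic.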

Getting Started with Nimble

Install the SDK:

pip install nimble_python

Make your first call:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="chatgpt_mirror_prod_data",
    params={
        "keyword": "your query here",
        "refinements": "context or date if needed"
    }
)

print(result)

Pricing: Web Search Agents cost $1 per 1,000 pages. Free trial includes 5,000 pages.

Sign up: https://app.nimbleway.com/signup

