May 10, 2026

We Asked 3 AIs 100 Questions — Web Search Agents Revealed Where They Diverge

10 min read

Tom Shaked

Ask ChatGPT, Perplexity, and Gemini the same question. You'll get one structured answer, one paragraph, and a 600-word essay with headers and a comparison table. There's nothing to compare.

Their outputs aren't shaped for comparison. Without normalizing the responses, you can't tell whether "Python is better for beginners" and "Python is the clearer choice early on" are the same answer or different ones, or whether the disagreements you do spot are real positions or just formatting noise.

We wanted to know whether AI disagreement is real and topic-dependent, or whether it just looks that way because no one has built the right comparison layer. So we asked 100 questions spanning tech, finance, health, e-commerce, and society to all three simultaneously, and used Claude to judge whether they agreed.

What we built

300 live responses collected from three AI platforms via a single API. Each question sent to ChatGPT, Perplexity, and Gemini in parallel. Claude Haiku reads all three raw responses and returns a consensus label — strong, moderate, or split — plus a normalized verdict from each model.

The results are browsable through a Streamlit dashboard filtered by category, consensus type, or free text. A second tab lets anyone ask a live question and get real-time answers from all three models, judged by Claude.

Final breakdown across 100 questions: 44 strong consensus, 48 moderate, 8 split. The models agree more often than expected. When they don't, the disagreements cluster around money and lifestyle — the questions where context actually matters.

The full project is on GitHub: nimbleway/nimble-data-apps

Step 1: Getting comparable responses

The first problem is format.

Without a constraint, ChatGPT gives a structured answer, Perplexity gives a paragraph, Gemini gives an essay. There's no analysis you can run on that. Every question ships with a prompt template that enforces structure:

{question}

Reply in exactly this format — no other text:
VERDICT: [your answer in 5 words or fewer]
REASON: [one sentence only]

Before writing any fetch code, we tested each model with a single call.

ChatGPT:

VERDICT: Python for most beginners
REASON: Its simpler syntax makes learning fundamentals easier.

Perplexity:

VERDICT: Python is the gentler start
REASON: Python has simpler, more readable syntax and a gentler learning curve.

Gemini returned a 600-word essay with headers, a comparison table, and a pro tip. It ignored the format entirely — and testing showed this was consistent regardless of how the prompt was phrased. The decision: store Gemini's full response as-is and handle normalization in the analysis step.

Step 2: 300 parallel calls via Nimble

ChatGPT's web interface isn't an API. Perplexity's search interface isn't either. Gemini's response page isn't. Each one is a live browser session with its own authentication, rendering behavior, and UI that changes without notice.

Building direct access to all three means writing and maintaining separate automation for each platform: browser drivers, session handling, and brittle selectors that break when the interface updates. That's the ongoing cost before any actual work begins.

Nimble's Web Search Agents remove that entirely. Each AI platform is available as a web search agent — chatgpt, perplexity, gemini — called through the same SDK with the same interface:

from nimble_python import Nimble

nimble = Nimble(api_key="YOUR_API_KEY")

result = nimble.agent.run(
    agent="chatgpt",
    params={"prompt": question["prompt"]}
)
text = result.data.parsing.get("answer", "").strip()

Swapping agent="chatgpt" for agent="perplexity" or agent="gemini" is the only change. No custom code per platform, no separate credentials, no knowledge of how each site is structured underneath. The same fetch function handles all three.

That uniformity is what makes the parallel fetch practical. 300 total calls — 100 questions across 3 models — run with ThreadPoolExecutor:

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(agent, question):
    result = nimble.agent.run(agent=agent, params={"prompt": question["prompt"]})
    parsing = result.data.parsing
    raw = (parsing.get("markdown") or parsing.get("answer") or "").strip()
    return {"raw": raw, "format": "freeform" if agent == "gemini" else "structured"}

with ThreadPoolExecutor(max_workers=9) as executor:
    futures = {
        executor.submit(fetch_one, agent, q): (agent, q)
        for q in questions for agent in ["chatgpt", "perplexity", "gemini"]
    }
    for future in as_completed(futures):
        agent, q = futures[future]
        # Persist each response the moment it completes
        # (save_response is sketched below).
        save_response(q["id"], agent, future.result())

Each response saves immediately as it arrives. A crash loses at most the in-flight calls. Re-running skips already-cached responses. ChatGPT takes ~56 seconds per call. With 9 workers, the full run takes 25–35 minutes.

Responses are stored with their question and format tag:

{
  "id": "q_023",
  "question": "AI agents or quantum computing — which will have more impact?",
  "responses": {
    "chatgpt":    {"raw": "VERDICT: AI agents\\nREASON: ...", "format": "structured"},
    "perplexity": {"raw": "VERDICT: AI agents will dominate\\nREASON: ...", "format": "structured"},
    "gemini":     {"raw": "AI agents will have a significantly greater impact...", "format": "freeform"}
  }
}

Step 3: Consensus analysis with Claude Haiku

Judging consensus across three responses — two structured, one a 600-word essay — is exactly the kind of task a language model handles well and rule-based approaches don't.

One Claude Haiku call per question. All three raw responses go in together. Haiku extracts the core position from each model and judges whether they agree, semantically:

import anthropic
import json

client = anthropic.Anthropic(api_key="YOUR_API_KEY")

def analyze_with_haiku(question, responses):
    prompt = f"""Question asked to three AI models: {question}

--- ChatGPT ---
{responses['chatgpt']['raw']}

--- Perplexity ---
{responses['perplexity']['raw']}

--- Gemini ---
{responses['gemini']['raw']}

Extract each model's core position and judge whether the three models agree.

Return this exact JSON, nothing else:
{{
  "chatgpt":    {{"verdict": "...", "reason": "..."}},
  "perplexity": {{"verdict": "...", "reason": "..."}},
  "gemini":     {{"verdict": "...", "reason": "..."}},
  "consensus":  {{"label": "strong|moderate|split", "summary": "..."}}
}}

Rules:
- verdict: the model's answer in 5 words or fewer
- reason: one sentence capturing their key argument
- label: "strong" = all 3 broadly agree, "moderate" = 2 agree and 1 differs, "split" = clear disagreement
- summary: one sentence on what they agree or disagree on"""

    msg = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(msg.content[0].text.strip())

It handles format differences without special cases. For Gemini's AI agents essay — 600 words, multiple headers, a comparison table — Haiku reads the whole thing, finds the conclusion, and returns "AI agents". It reads "AI agents" and "AI agents will dominate early impact" as the same answer, because they are. The consensus label reflects actual agreement, not surface string overlap.

100 questions, 5 parallel workers, under 2 minutes. Total cost: ~$0.16.

Step 4: Dashboard

The dashboard is built with Streamlit. Two tabs: browse the 100 pre-analyzed questions, or ask your own.

streamlit run app.py

The browse tab opens with a full-width color bar — green for strong consensus, yellow for moderate, red for split — and three stat cards below it. Five category cards show the same bar scoped to their subset. A filterable, sortable table shows every question with its consensus pill and all three verdicts inline, no expanding required.

The live tab takes any question, fetches all three AIs via Nimble in parallel (~60 seconds), runs Haiku analysis on the results (~2 seconds), and renders the same consensus layout.

What the data shows

Across 100 questions, 44 reached strong consensus. 48 landed in moderate territory — two models broadly agreeing, one diverging. Only 8 were genuine three-way splits.

Strong consensus clusters around empirically settled questions: Python over JavaScript for beginners, index funds over individual stocks, creatine is safe at recommended doses, sunscreen works on cloudy days. The models agree easily when the answer isn't really a matter of opinion.

Split cases cluster around preference and context: Amazon Prime vs. Walmart+, real estate vs. stock market, 4-day week vs. flexible hours. The disagreements show up precisely where the answer depends on who's asking.

Why Nimble

ChatGPT's chat interface is a browser session. Perplexity's search page is a browser session. Gemini's response interface is a browser session. None of them expose a public API that returns the actual model response.

Hitting all three programmatically means building and maintaining browser automation for three separate platforms — session management, bot detection, UI changes — or using Nimble. Nimble's Web Search Agents wrap each interface as a callable endpoint. The same SDK call works for chatgpt, perplexity, and gemini. There's no separate authentication per platform, no Playwright setup, no selector maintenance when the UI updates.

The live tab uses the same fetch path. When a user asks a question, Nimble hits all three live interfaces in real time, returns the raw responses, and Claude judges the result. The same infrastructure that powered the 300-call batch run handles individual queries in the dashboard.
