January 5, 2026

How to Scrape 100,000 URLs Without Melting Your Infrastructure

14 min read

Tom Shaked

How Nimble’s Asynchronous Batch model eliminates browser fleets, retries, proxies, and scaling headaches

For most engineering teams, scaling scraping is not a “bigger computers” problem; it’s an exponential complexity problem.

At a few thousand URLs, your Puppeteer or Playwright fleet already starts to feel fragile. At tens of thousands, it becomes a full-time job. At 100,000 and beyond, the machine is on fire: you are babysitting crashed browsers, juggling proxies, tuning timeouts, replaying failed batches, and dumping logs into the void.

But scraping 100,000 URLs shouldn’t require Kubernetes gymnastics, browser SREs, or building your own scraping platform.

There is a better pattern, one built from the ground up for massive-scale, real-time, structured web extraction.

Why Traditional Scraping Collapses at Scale

If you're reading this, you’ve probably built something like this:

 ┌──────────────────────┐
 │    Message Queue     │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │  Worker Autoscaler   │
 └──────────┬───────────┘
            │ spawn N
 ┌──────────▼───────────┐
 │  Headless Browser N  │
 │  (Puppeteer, etc.)   │
 └──────────┬───────────┘
            │ rotate IPs
 ┌──────────▼───────────┐
 │    Proxy Provider    │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │     Custom Retry     │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │    S3 / GCS Sink     │
 └──────────────────────┘

This architecture is excellent. It works. It just doesn’t scale:

  • Browsers leak memory and crash under load
  • Network failures cause retry storms
  • Proxy bans cause cascaded failures
  • Rendering and JavaScript execution choke CPU
  • Logging and observability become an existential crisis
  • Simple changes now require cluster rewrites

At 100,000 URLs, you don’t have a scraper; you have a distributed system that is harder to maintain than your product.

Offloading Execution to the Platform

Instead of managing a browser fleet, you can push tens of thousands of URLs to a system that already manages:

  • Browser rendering
  • Proxy routing
  • Session management
  • Retries
  • Backoff
  • Observability
  • Cloud delivery
  • Structured parsing
  • Cleanup
  • Scale

That system is Nimble’s Web API Batch endpoint, which lets you:

  • Submit 1,000 URLs per request, repeating until you hit 100,000 or more
  • Add per-URL interactions such as scrolling, clicking, and form submissions
  • Capture network-level API and XHR calls
  • Deliver results directly to S3 or GCS
  • Track status via task URLs and batch progress endpoints
  • Replay failed tasks in your own code using async re-submission
  • Stop running your own browsers entirely

Let’s walk through the workflow.

The Nimble Architecture for Massive Scale Extraction

Think of your team’s code as the conductor, not the orchestra.

 ┌────────────────────────────────┐
 │        Your Application        │
 │   (submits batches of 1,000)   │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │     Nimble Batch Endpoint      │
 │      (/api/v1/batch/web)       │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │     Nimble Browsing Engine     │
 │ (Rendering, JS, Interactions)  │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │       Nimble Proxy Layer       │
 │  (Proxy rotation, geotargets)  │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │ Structured Parsing and Capture │
 │  (HTML, JSON, XHR, Entities)   │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │   Cloud Delivery (S3 or GCS)   │
 └────────────────────────────────┘

Your infrastructure never grows.
Your workers never scale.
Your browsers never spin up.

Scraping 100,000 URLs becomes as simple as calling an API, not maintaining a distributed, horizontally scalable system.

Sending a 1,000-URL Batch

This first example shows the simplest valid batch:

No interactions or per-URL unique attributes.

Just fetch, render, parse, deliver, done.

import base64
import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"

# Base64 encoded "username:password" or API key pair
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# Build 1,000 simple URL requests
requests_payload = [
    {"url": f"https://www.example.com/page/{i}"} for i in range(1000)
]

data = {
    "requests": requests_payload,
    "render": True,
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

body = response.json()
print("HTTP status:", response.status_code)
print("Full response:", json.dumps(body, indent=2))
print("Batch id:", body.get("batch_id"))

What you get back from Nimble:

  • A batch_id for monitoring
  • A list of task_ids within the batch
  • A per-task status_url
  • A per-task download_url when using push or pull delivery
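
From there, reaching 100,000 URLs is just a loop: chunk your URL list into groups of 1,000 and submit one batch per chunk, keeping each returned batch_id for monitoring and replay. A minimal sketch, in which the URL list and the bookkeeping are my own scaffolding rather than anything the API requires:

import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# Hypothetical source of 100,000 URLs; in practice this comes from your own catalog
all_urls = [f"https://www.example.com/page/{i}" for i in range(100_000)]

batch_ids = []
for start in range(0, len(all_urls), 1000):
    chunk = all_urls[start:start + 1000]
    data = {
        "requests": [{"url": u} for u in chunk],
        "render": True,
        "storage_type": "s3",
        "storage_url": "s3://your-bucket-name/your/prefix/",
        "callback_url": "https://your.service.com/nimble-callback",
    }
    resp = requests.post(API_URL, headers=headers, json=data)
    resp.raise_for_status()
    # Keep every batch_id so progress checks and replays stay simple later
    batch_ids.append(resp.json().get("batch_id"))

print(f"Submitted {len(batch_ids)} batches")

One hundred submissions of 1,000 URLs each, and your own code never grows beyond this loop.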

Adding Page Interactions for Scroll and Click

With Nimble, you can define page interactions per URL inside the batch request without writing browser automation code:

  • Infinite scroll
  • Click “Load more”
  • Click through tabs
  • Expand sections
  • Fill simple forms
  • Render JavaScript-heavy pages

You remain inside a single batch request, still using the same /batch/web endpoint.

import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

data = {
    "requests": [
        # URL 1: multiple interactions, infinite scroll and click on load more
        {
            "url": "https://news.example.com/infinite-feed",
            "render": True,
            "render_flow": [
                {
                    "infinite_scroll": {
                        "duration": 15000,
                        "delay_after_scroll": 1000
                    }
                },
                {
                    "wait_and_click": {
                        "selector": ".load-more",
                        "timeout": 10000,
                        "scroll": True
                    }
                }
            ],
        },
        # URL 2: a single scroll interaction
        {
            "url": "https://shop.example.com/category/widgets",
            "render": True,
            "render_flow": [
                {
                    "scroll": {
                        "x": 0,
                        "y": 2500,
                        "timeout": 15000
                    }
                }
            ],
        },
        # URL 3: no interactions
        {
            "url": "https://static.example.com/landing",
            "render": False
        },
    ],
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

print("HTTP status:", response.status_code)
print(json.dumps(response.json(), indent=2))

This pattern lets you crawl, inside a single batch:

  • Infinite scroll news pages
  • Ecommerce category pages
  • Search result pages
  • Long social feeds

All with precise, per-URL interaction recipes.

If your target site has several page “types”, you can define one page-interaction workflow per type, share it across every URL of that type, and split your batch requests by page type, as sketched below.
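
A minimal sketch of that split, using two hypothetical page types with their own shared render_flow recipes (the type names, URLs, and bucket layout are placeholders):

import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# One shared interaction recipe per page type (hypothetical examples)
page_types = {
    "infinite_feed": {
        "urls": [
            "https://news.example.com/feed/politics",
            "https://news.example.com/feed/tech",
        ],
        "render_flow": [
            {"infinite_scroll": {"duration": 15000, "delay_after_scroll": 1000}}
        ],
    },
    "category_page": {
        "urls": [
            "https://shop.example.com/category/widgets",
            "https://shop.example.com/category/gadgets",
        ],
        "render_flow": [
            {"scroll": {"x": 0, "y": 2500, "timeout": 15000}}
        ],
    },
}

for page_type, spec in page_types.items():
    data = {
        # Every URL of this type shares the same render_flow recipe
        "requests": [
            {"url": u, "render": True, "render_flow": spec["render_flow"]}
            for u in spec["urls"]
        ],
        "storage_type": "s3",
        "storage_url": f"s3://your-bucket-name/{page_type}/",
        "callback_url": "https://your.service.com/nimble-callback",
    }
    resp = requests.post(API_URL, headers=headers, json=data)
    resp.raise_for_status()
    print(page_type, "batch_id:", resp.json().get("batch_id"))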

Network Capture for Hidden API Calls

Most modern sites hydrate their pages through client-side API calls.

With Nimble, you can enable network capture and get:

  • XHR and Fetch requests
  • JSON payloads
  • API endpoints and parameters
  • Response bodies

In many cases, this is cleaner than parsing HTML.

In the Web API, this is done with the network_capture field, which is supported for real-time, async, and batch Web requests and requires render to be enabled for that request.

import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIALSTRING}",
    "Content-Type": "application/json",
}

data = {
    "requests": [
        # Request with network capture enabled
        {
            "url": "https://www.walmart.com/search?q=iphone",
            "render": True,
            "network_capture": [
                {
                    "method": "GET",
                    "resource_type": ["xhr", "fetch"]
                }
            ],
        },
        # Request without network capture
        {
            "url": "https://www.example.com/static",
        },
        # Another rendered request without capture
        {
            "url": "https://www.example.com/products",
            "render": True
        },
    ],
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

print("HTTP status:", response.status_code)
print(json.dumps(response.json(), indent=2))

This unlocks use cases such as:

  • Product grids loaded through Ajax
  • Price and stock availability endpoints
  • Pagination APIs
  • Hidden JSON data behind a rendered view
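
Once the results land in your bucket, consuming the captured data is ordinary object-storage work. A sketch using boto3, where the object layout under the prefix and the shape of each delivered payload are assumptions about your own delivery configuration, not a documented format:

import json
import boto3

BUCKET = "your-bucket-name"
PREFIX = "your/prefix/"

s3 = boto3.client("s3")

# List whatever Nimble delivered under the configured prefix
# (paginate if you expect more than 1,000 objects)
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    try:
        result = json.loads(body)
    except ValueError:
        continue  # skip non-JSON deliveries such as raw HTML
    # Inspect the top-level keys before wiring this into a pipeline;
    # the exact schema depends on the options used in the request.
    print(obj["Key"], "->", list(result.keys()))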

Monitoring Progress at Scale and Replaying Failed Requests

Once the batch is running, you have multiple visibility options.

Check progress for the entire batch

import time
import requests

API_KEY = "<base64-encoded-credential-string>"
BATCH_ID = "<batch_id>"

headers = {
    "Authorization": f"Basic {API_KEY}",
}

url = f"https://api.webit.live/api/v1/batches/{BATCH_ID}/progress"

while True:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    progress = resp.json()
    print("Batch progress:", progress)
    if progress.get("progress") == 1:
        print("Batch completed")
        break
    time.sleep(10)

Check individual tasks

import time
import requests
API_KEY = "<base64-encoded-credential-string>"
headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json",
}

# Each URL in a batch becomes its own task, and Nimble returns a
# `status_url` for every task; that is the URL you poll.
# You do NOT construct this URL yourself.

# 1. `batch_data` is the parsed JSON response from the batch
#    submission request (see the first example).
status_url = batch_data["tasks"][0]["status_url"]

# 2. Poll the task using the returned status_url
while True:
    task_resp = requests.get(status_url, headers=headers)
    task_resp.raise_for_status()
    task = task_resp.json()
    state = task.get("task", {}).get("state")
    print("Task state:", state)
    if state in ("success", "failed"):
        print("Final task result:")
        print(task)
        break
    time.sleep(5)

Replay failed tasks

import requests

API_KEY = "<base64-encoded-credential-string>"
BATCH_ID = "<batch_id>"

headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json",
}

url = f"https://api.webit.live/api/v1/replay/batch/{BATCH_ID}"
resp = requests.post(url, headers=headers)
resp.raise_for_status()

print("Replay triggered:")
print(resp.json())

This gives you full observability and retry control without building your own monitoring layer or inventing custom infrastructure endpoints.
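
Tying those three snippets together, a small driver can wait for each batch to finish, check the final state of its tasks, and trigger a replay when anything failed. A sketch that assumes you kept each batch_id and tasks list from the submission responses:

import time
import requests

API_KEY = "<base64-encoded-credential-string>"
headers = {"Authorization": f"Basic {API_KEY}"}

# Hypothetical bookkeeping built while submitting batches: each entry holds
# the batch_id and the tasks list returned by the submission response.
submitted = [
    {"batch_id": "<batch_id>", "tasks": [{"status_url": "<status_url>"}]},
]

for batch in submitted:
    progress_url = f"https://api.webit.live/api/v1/batches/{batch['batch_id']}/progress"

    # 1. Wait for the whole batch to finish
    while True:
        resp = requests.get(progress_url, headers=headers)
        resp.raise_for_status()
        if resp.json().get("progress") == 1:
            break
        time.sleep(10)

    # 2. Check each task's final state via its status_url
    any_failed = False
    for task in batch["tasks"]:
        task_resp = requests.get(task["status_url"], headers=headers)
        task_resp.raise_for_status()
        if task_resp.json().get("task", {}).get("state") == "failed":
            any_failed = True
            break

    # 3. Replay the batch if anything failed
    if any_failed:
        replay_url = f"https://api.webit.live/api/v1/replay/batch/{batch['batch_id']}"
        requests.post(replay_url, headers=headers).raise_for_status()
        print("Replay triggered for", batch["batch_id"])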

When to Use Batch vs. Real Time

Think of these as two gears you can switch between, not two completely different systems.

Batch

Best for:

  • High volume, from hundreds to hundreds of thousands of URLs
  • Single page extraction per URL
  • Large autonomous jobs with async cloud delivery

Real time

Best for:

  • Multi step interactive sequences
  • Load a page, click, wait, click, parse
  • Programmatic workflows that need an immediate response
  • When you want deterministic behavior per request

Conclusion

Scraping at scale isn’t glamorous work. Most teams don’t struggle because they chose the “wrong” browser or the “wrong” proxy. They struggle because scaling scraping is fundamentally a distributed-systems problem disguised as a scripting problem.

Batch and async models give you a different way to think about the job: instead of scaling browsers, you scale requests. Instead of orchestrating retries, you let the system handle them. Instead of designing for failure, you inherit a workflow that already expects it.

Whether you’re collecting 10,000 URLs or 100,000, the goal is the same: less infrastructure to babysit, more time spent on the work that matters.

The fewer moving parts you operate, the fewer things you have to fix.

That’s the real win.
