January 5, 2026

How to Scrape 100,000 URLs Without Melting Your Infrastructure

14 min read

Tom Shaked

How Nimble’s Asynchronous Batch model eliminates browser fleets, retries, proxies, and scaling headaches

For most engineering teams, scaling scraping is not a “bigger computers” problem; it’s an exponential complexity problem.

At a few thousand URLs, your Puppeteer or Playwright fleet already starts to feel fragile. At tens of thousands, it becomes a full-time job. At 100,000 and beyond, the machine is on fire: you are babysitting crashed browsers, juggling proxies, tuning timeouts, replaying failed batches, and dumping logs into the void.

But scraping 100,000 URLs shouldn’t require Kubernetes gymnastics, browser SREs, or building your own scraping platform.

There is a better pattern, one built from the ground up for massive-scale, real-time, structured web extraction.

Why Traditional Scraping Collapses at Scale

If you're reading this, you’ve probably built something like this:

 ┌──────────────────────┐
 │    Message Queue     │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │  Worker Autoscaler   │
 └──────────┬───────────┘
            │ spawn N
 ┌──────────▼───────────┐
 │  Headless Browser N  │
 │  (Puppeteer, etc.)   │
 └──────────┬───────────┘
            │ rotate IPs
 ┌──────────▼───────────┐
 │    Proxy Provider    │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │     Custom Retry     │
 └──────────┬───────────┘
            │
 ┌──────────▼───────────┐
 │    S3 / GCS Sink     │
 └──────────────────────┘

This architecture is excellent. It works. It just doesn’t scale:

  • Browsers leak memory and crash under load
  • Network failures cause retry storms
  • Proxy bans cause cascaded failures
  • Rendering and JavaScript execution choke CPU
  • Logging and observability become an existential crisis
  • Simple changes now require cluster rewrites

At 100,000 URLs, you don’t have a scraper; you have a distributed system that is harder to maintain than your product.

Offloading Execution to the Platform

Instead of managing a browser fleet, you can push tens of thousands of URLs to a system that already manages:

  • Browser rendering
  • Proxy routing
  • Session management
  • Retries
  • Backoff
  • Observability
  • Cloud delivery
  • Structured parsing
  • Cleanup
  • Scale

That system is Nimble’s Web API Batch endpoint, which lets you:

  • Submit 1,000 URLs per request, repeating until you hit 100,000 or more
  • Add per-URL interactions such as scrolling, clicking, and form submissions
  • Capture network-level API and XHR calls
  • Deliver results directly to S3 or GCS
  • Track status via task URLs and batch progress endpoints
  • Replay failed tasks in your own code using async re-submission
  • Stop running your own browsers entirely

Let’s walk through the workflow.

The Nimble Architecture for Massive Scale Extraction

Think of your team’s code as the conductor, not the orchestra.

 ┌────────────────────────────────┐
 │        Your Application        │
 │   (submits batches of 1,000)   │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │     Nimble Batch Endpoint      │
 │      (/api/v1/batch/web)       │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │     Nimble Browsing Engine     │
 │ (Rendering, JS, Interactions)  │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │       Nimble Proxy Layer       │
 │  (Proxy rotation, geotargets)  │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │ Structured Parsing and Capture │
 │  (HTML, JSON, XHR, Entities)   │
 └───────────────┬────────────────┘
                 │
 ┌───────────────▼────────────────┐
 │   Cloud Delivery (S3 or GCS)   │
 └────────────────────────────────┘

Your infrastructure never grows.
Your workers never scale.
Your browsers never spin up.

Scraping 100,000 URLs becomes as simple as calling an API, not maintaining a distributed, horizontally scalable system.

Sending a 1,000-URL Batch

This first example shows the simplest valid batch:

No interactions or per-URL unique attributes.

Just fetch, render, parse, deliver, done.

import base64
import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"

# Base64 encoded "username:password" or API key pair
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# Build 1,000 simple URL requests
requests_payload = [
    {"url": f"https://www.example.com/page/{i}"} for i in range(1000)
]

data = {
    "requests": requests_payload,
    "render": True,
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

body = response.json()
print("HTTP status:", response.status_code)
print("Full response:", json.dumps(body, indent=2))
print("Batch id:", body.get("batch_id"))

What you get back from Nimble:

  • A batch_id for monitoring
  • A list of task_ids within the batch
  • A per-task status_url
  • A per-task download_url when using push or pull delivery
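
From there, reaching 100,000 URLs is just a loop: chunk your URL list into groups of 1,000 and submit one batch per chunk, keeping each returned batch_id for monitoring and replay. A minimal sketch, in which the URL list and the bookkeeping are my own scaffolding rather than anything the API requires:

import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# Hypothetical source of 100,000 URLs; in practice this comes from your own catalog
all_urls = [f"https://www.example.com/page/{i}" for i in range(100_000)]

batch_ids = []
for start in range(0, len(all_urls), 1000):
    chunk = all_urls[start:start + 1000]
    data = {
        "requests": [{"url": u} for u in chunk],
        "render": True,
        "storage_type": "s3",
        "storage_url": "s3://your-bucket-name/your/prefix/",
        "callback_url": "https://your.service.com/nimble-callback",
    }
    resp = requests.post(API_URL, headers=headers, json=data)
    resp.raise_for_status()
    # Keep every batch_id so progress checks and replays stay simple later
    batch_ids.append(resp.json().get("batch_id"))

print(f"Submitted {len(batch_ids)} batches")

One hundred submissions of 1,000 URLs each, and your own code never grows beyond this loop.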

Adding Page Interactions for Scroll and Click

With Nimble, you can define page interactions per URL inside the batch request without writing browser automation code:

  • Infinite scroll
  • Click “Load more”
  • Click through tabs
  • Expand sections
  • Fill simple forms
  • Render JavaScript-heavy pages

You remain inside a single batch request, still using the same /batch/web endpoint.

import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

data = {
    "requests": [
        # URL 1: multiple interactions, infinite scroll and click on load more
        {
            "url": "https://news.example.com/infinite-feed",
            "render": True,
            "render_flow": [
                {
                    "infinite_scroll": {
                        "duration": 15000,
                        "delay_after_scroll": 1000
                    }
                },
                {
                    "wait_and_click": {
                        "selector": ".load-more",
                        "timeout": 10000,
                        "scroll": True
                    }
                }
            ],
        },
        # URL 2: a single scroll interaction
        {
            "url": "https://shop.example.com/category/widgets",
            "render": True,
            "render_flow": [
                {
                    "scroll": {
                        "x": 0,
                        "y": 2500,
                        "timeout": 15000
                    }
                }
            ],
        },
        # URL 3: no interactions
        {
            "url": "https://static.example.com/landing",
            "render": False
        },
    ],
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

print("HTTP status:", response.status_code)
print(json.dumps(response.json(), indent=2))

This pattern lets you crawl, inside a single batch:

  • Infinite scroll news pages
  • Ecommerce category pages
  • Search result pages
  • Long social feeds

All with precise, per-URL interaction recipes.

If your target site has several page “types”, you can define one page-interaction workflow per type, share it across every URL of that type, and split your batch requests by page type, as sketched below.
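
A minimal sketch of that split, using two hypothetical page types with their own shared render_flow recipes (the type names, URLs, and bucket layout are placeholders):

import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIAL_STRING}",
    "Content-Type": "application/json",
}

# One shared interaction recipe per page type (hypothetical examples)
page_types = {
    "infinite_feed": {
        "urls": [
            "https://news.example.com/feed/politics",
            "https://news.example.com/feed/tech",
        ],
        "render_flow": [
            {"infinite_scroll": {"duration": 15000, "delay_after_scroll": 1000}}
        ],
    },
    "category_page": {
        "urls": [
            "https://shop.example.com/category/widgets",
            "https://shop.example.com/category/gadgets",
        ],
        "render_flow": [
            {"scroll": {"x": 0, "y": 2500, "timeout": 15000}}
        ],
    },
}

for page_type, spec in page_types.items():
    data = {
        # Every URL of this type shares the same render_flow recipe
        "requests": [
            {"url": u, "render": True, "render_flow": spec["render_flow"]}
            for u in spec["urls"]
        ],
        "storage_type": "s3",
        "storage_url": f"s3://your-bucket-name/{page_type}/",
        "callback_url": "https://your.service.com/nimble-callback",
    }
    resp = requests.post(API_URL, headers=headers, json=data)
    resp.raise_for_status()
    print(page_type, "batch_id:", resp.json().get("batch_id"))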

Network Capture for Hidden API Calls

Most modern sites hydrate their pages through client-side API calls.

With Nimble, you can enable network capture and get:

  • XHR and Fetch requests
  • JSON payloads
  • API endpoints and parameters
  • Response bodies

In many cases, this is cleaner than parsing HTML.

In the Web API, this is done with the network_capture field, which is supported for real-time, async, and batch Web requests and requires render to be enabled for that request.

import json
import requests

API_URL = "https://api.webit.live/api/v1/batch/web"
CREDENTIAL_STRING = "<base64-encoded-credential-string>"

headers = {
    "Authorization": f"Basic {CREDENTIALSTRING}",
    "Content-Type": "application/json",
}

data = {
    "requests": [
        # Request with network capture enabled
        {
            "url": "https://www.walmart.com/search?q=iphone",
            "render": True,
            "network_capture": [
                {
                    "method": "GET",
                    "resource_type": ["xhr", "fetch"]
                }
            ],
        },
        # Request without network capture
        {
            "url": "https://www.example.com/static",
        },
        # Another rendered request without capture
        {
            "url": "https://www.example.com/products",
            "render": True
        },
    ],
    "storage_type": "s3",
    "storage_url": "s3://your-bucket-name/your/prefix/",
    "callback_url": "https://your.service.com/nimble-callback",
}

response = requests.post(API_URL, headers=headers, json=data)
response.raise_for_status()

print("HTTP status:", response.status_code)
print(json.dumps(response.json(), indent=2))

This unlocks use cases such as:

  • Product grids loaded through Ajax
  • Price and stock availability endpoints
  • Pagination APIs
  • Hidden JSON data behind a rendered view
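
Once the results land in your bucket, consuming the captured data is ordinary object-storage work. A sketch using boto3, where the object layout under the prefix and the shape of each delivered payload are assumptions about your own delivery configuration, not a documented format:

import json
import boto3

BUCKET = "your-bucket-name"
PREFIX = "your/prefix/"

s3 = boto3.client("s3")

# List whatever Nimble delivered under the configured prefix
# (paginate if you expect more than 1,000 objects)
listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)

for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    try:
        result = json.loads(body)
    except ValueError:
        continue  # skip non-JSON deliveries such as raw HTML
    # Inspect the top-level keys before wiring this into a pipeline;
    # the exact schema depends on the options used in the request.
    print(obj["Key"], "->", list(result.keys()))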

Monitoring Progress at Scale and Replaying Failed Requests

Once the batch is running, you have multiple visibility options.

Check progress for the entire batch

import time
import requests

API_KEY = "<base64-encoded-credential-string>"
BATCH_ID = "<batch_id>"

headers = {
    "Authorization": f"Basic {API_KEY}",
}

url = f"https://api.webit.live/api/v1/batches/{BATCH_ID}/progress"

while True:
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    progress = resp.json()
    print("Batch progress:", progress)
    if progress.get("progress") == 1:
        print("Batch completed")
        break
    time.sleep(10)

Check individual tasks

import time
import requests
API_KEY = "<base64-encoded-credential-string>"
headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json",
}

# Each URL in a batch becomes its own task, and Nimble returns a
# `status_url` for every task; that is the URL you poll.
# You do NOT construct this URL yourself.

# 1. `batch_data` is the parsed JSON response from the batch
#    submission request (see the first example).
status_url = batch_data["tasks"][0]["status_url"]

# 2. Poll the task using the returned status_url
while True:
    task_resp = requests.get(status_url, headers=headers)
    task_resp.raise_for_status()
    task = task_resp.json()
    state = task.get("task", {}).get("state")
    print("Task state:", state)
    if state in ("success", "failed"):
        print("Final task result:")
        print(task)
        break
    time.sleep(5)

Replay failed tasks

import requests

API_KEY = "<base64-encoded-credential-string>"
BATCH_ID = "<batch_id>"

headers = {
    "Authorization": f"Basic {API_KEY}",
    "Content-Type": "application/json",
}

url = f"https://api.webit.live/api/v1/replay/batch/{BATCH_ID}"
resp = requests.post(url, headers=headers)
resp.raise_for_status()

print("Replay triggered:")
print(resp.json())

This gives you full observability and retry control without building your own monitoring layer or inventing custom infrastructure endpoints.
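
Tying those three snippets together, a small driver can wait for each batch to finish, check the final state of its tasks, and trigger a replay when anything failed. A sketch that assumes you kept each batch_id and tasks list from the submission responses:

import time
import requests

API_KEY = "<base64-encoded-credential-string>"
headers = {"Authorization": f"Basic {API_KEY}"}

# Hypothetical bookkeeping built while submitting batches: each entry holds
# the batch_id and the tasks list returned by the submission response.
submitted = [
    {"batch_id": "<batch_id>", "tasks": [{"status_url": "<status_url>"}]},
]

for batch in submitted:
    progress_url = f"https://api.webit.live/api/v1/batches/{batch['batch_id']}/progress"

    # 1. Wait for the whole batch to finish
    while True:
        resp = requests.get(progress_url, headers=headers)
        resp.raise_for_status()
        if resp.json().get("progress") == 1:
            break
        time.sleep(10)

    # 2. Check each task's final state via its status_url
    any_failed = False
    for task in batch["tasks"]:
        task_resp = requests.get(task["status_url"], headers=headers)
        task_resp.raise_for_status()
        if task_resp.json().get("task", {}).get("state") == "failed":
            any_failed = True
            break

    # 3. Replay the batch if anything failed
    if any_failed:
        replay_url = f"https://api.webit.live/api/v1/replay/batch/{batch['batch_id']}"
        requests.post(replay_url, headers=headers).raise_for_status()
        print("Replay triggered for", batch["batch_id"])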

When to Use Batch vs. Real Time

Think of these as two gears you can switch between, not two completely different systems.

Batch

Best for:

  • High volume, from hundreds to hundreds of thousands of URLs
  • Single page extraction per URL
  • Large autonomous jobs with async cloud delivery

Real time

Best for:

  • Multi step interactive sequences
  • Load a page, click, wait, click, parse
  • Programmatic workflows that need an immediate response
  • When you want deterministic behavior per request

Conclusion

Scraping at scale isn’t glamorous work. Most teams don’t struggle because they chose the “wrong” browser or the “wrong” proxy. They struggle because scaling scraping is fundamentally a distributed-systems problem disguised as a scripting problem.

Batch and async models give you a different way to think about the job: instead of scaling browsers, you scale requests. Instead of orchestrating retries, you let the system handle them. Instead of designing for failure, you inherit a workflow that already expects it.

Whether you’re collecting 10,000 URLs or 100,000, the goal is the same: less infrastructure to babysit, more time spent on the work that matters.

The fewer moving parts you operate, the fewer things you have to fix.

That’s the real win.
