Web Scraping Error Handling Checklist

A practical checklist for handling retries, timeouts, blocks, and fallbacks in production web scrapers.

A scraper that works once is not the same as a scraper you can trust in production. This checklist is designed to help you build more reliable crawlers by handling the failures that appear again and again in real projects: slow responses, temporary blocks, broken selectors, missing data, JavaScript rendering issues, and downstream storage errors. Use it before you launch a new scraper, when a crawl starts failing, or when your workflow changes. The aim is simple: make failures predictable, recoverable, and visible instead of silent and expensive.

Overview

Web scraping error handling is less about catching exceptions and more about designing a controlled response to uncertainty. Websites change. Networks fail. Proxies go bad. Dynamic pages render inconsistently. Rate limits appear without warning. A robust web scraper assumes these problems will happen and decides in advance what to do.

The most useful way to think about scraper reliability is to separate failures into a few operational categories:

Request failures: connection errors, DNS issues, TLS failures, socket resets, timeout errors
Response failures: HTTP 403, 404, 429, 500-series errors, empty responses, malformed content
Rendering failures: JavaScript not finished, navigation stalls, missing elements, detached nodes in browser automation
Parsing failures: selectors stop matching, JSON is incomplete, HTML structure changes
Data quality failures: fields are blank, formats change, duplicate items appear, records become inconsistent
Pipeline failures: queues back up, databases reject writes, scheduled jobs overlap, alerts do not fire

Your checklist should cover all six. If it only covers retries, it is incomplete. Retrying the wrong failure type can make a crawler slower, louder, and easier to block.

A practical rule is to define four decisions for every scraper:

What counts as a transient failure?
How many times should it retry?
When should it fall back to another method?
When should it stop and raise an alert?

If your team can answer those four questions per target site, you already have the foundation of a scraper reliability checklist.
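
One way to make those four answers concrete is to store them as data per target site. The sketch below is illustrative only: the class name SitePolicy, its field names, the thresholds, and the hostnames are all assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SitePolicy:
    """Per-site answers to the four questions (all values are illustrative)."""

    # 1. What counts as a transient failure?
    transient_statuses: frozenset = frozenset({429, 500, 502, 503, 504})
    # 2. How many times should it retry?
    max_retries: int = 3
    # 3. When should it fall back to another method?
    fallback_after_failures: int = 2
    # 4. When should it stop and raise an alert?
    alert_failure_rate: float = 0.20

    def next_action(self, status: int, attempt: int) -> str:
        """Return 'retry', 'fallback', or 'give_up' for a failed attempt (1-based)."""
        if status not in self.transient_statuses:
            return "give_up"
        if attempt >= self.max_retries:
            return "give_up"
        if attempt >= self.fallback_after_failures:
            return "fallback"
        return "retry"

    def should_alert(self, failed: int, total: int) -> bool:
        """True when the share of failed URLs reaches the alert threshold."""
        return total > 0 and failed / total >= self.alert_failure_rate


# Hypothetical hostnames; the second site gets a stricter policy.
POLICIES = {
    "example.com": SitePolicy(),
    "fragile.example.org": SitePolicy(max_retries=2, fallback_after_failures=1),
}
```

Keeping the policy as data rather than scattering constants through the code makes it easier to review per site and to change after an incident.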

Checklist by scenario

Use the scenarios below as a reusable preflight and incident-response list. They are written so you can adapt them to Python requests and Beautiful Soup, Scrapy, Playwright, Puppeteer, or a mixed pipeline.

1. Network and connection failures

What you are protecting against: request timeouts, broken connections, DNS errors, SSL problems, and temporary packet loss.

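Set explicit connect timeouts and read timeouts. Do not rely on library defaults, which can mean no timeout at all: Python requests, for example, waits indefinitely unless you pass a timeout.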
Use different timeout values for lightweight HTML pages and heavier browser sessions.
Retry only idempotent requests unless you know the target action is safe to repeat.
Apply exponential backoff with jitter so retries do not cluster.
Cap total retry attempts per URL and per job.
Log the final exception type, target hostname, and elapsed time.
Track timeout rate by domain so you can spot infrastructure drift early.

Fallback strategy: if a plain HTTP fetch fails repeatedly, queue the URL for a slower second-pass worker instead of hammering it immediately.
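
A minimal sketch of capped retries with exponential backoff and jitter follows. It uses only the standard library and takes the fetch function as a parameter, so the names fetch_with_retries and backoff_delay, and the default numbers, are assumptions for illustration. If you use Python requests, you would pass its own exception classes (for example requests.exceptions.Timeout and requests.exceptions.ConnectionError) as the transient tuple.

```python
import logging
import random
import time

log = logging.getLogger("scraper")


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def fetch_with_retries(fetch, url, max_attempts=4,
                       transient=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call fetch(url), retrying only on exception types listed as transient."""
    started = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except transient as exc:
            if attempt == max_attempts - 1:
                elapsed = time.monotonic() - started
                # Log the final exception type, the target, and the elapsed time.
                log.warning("giving up on %s: %s after %.2fs",
                            url, type(exc).__name__, elapsed)
                raise
            sleep(backoff_delay(attempt))
```

Any exception type not listed as transient propagates immediately, which keeps parser bugs and similar permanent errors out of the retry loop.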

2. HTTP status code handling

What you are protecting against: scraping logic that treats every non-200 response the same.

Treat 404 as a likely terminal result unless the site is known to return temporary 404s.
Treat 429 as a rate limiting event, not a parsing failure.
Treat 403 as a possible permissions, headers, fingerprint, or proxy issue.
Retry 500, 502, 503, and 504 with backoff, but set a hard ceiling on attempts.
Record the response body for a sample of failures so you can inspect block pages and edge cases.
Respect the Retry-After header if present. It can carry either a delay in seconds or an HTTP date.

Fallback strategy: on repeated 429s, lower concurrency, pause the host queue, or switch to a different crawl window. For a broader guide, see “Rate Limiting for Web Scrapers: How to Crawl Responsibly Without Getting Blocked”.
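
The status rules in this section can be sketched as a small classifier plus a Retry-After parser. The function names and the decision labels ("rate_limited", "investigate", and so on) are assumptions chosen for this example; only the standard library is used.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def classify_status(status):
    """Map an HTTP status code to a handling decision."""
    if 200 <= status < 300:
        return "ok"            # transport success; content still needs validation
    if status == 429:
        return "rate_limited"  # slow down the host queue before retrying
    if status in (500, 502, 503, 504):
        return "retry"         # likely transient; retry with backoff and a ceiling
    if status == 403:
        return "investigate"   # permissions, headers, fingerprint, or proxy
    if status == 404:
        return "terminal"
    return "unexpected"        # log it and decide per site


def retry_after_seconds(value, now=None):
    """Parse a Retry-After value given as delay seconds or as an HTTP date.

    Returns None when the header is missing or cannot be parsed.
    """
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A caller would typically wait for the larger of the parsed Retry-After value and its own backoff delay, and still apply the retry ceiling.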

3. Bot detection and blocking signals

What you are protecting against: being blocked quietly rather than explicitly. Many sites serve a challenge page or a degraded template instead of a clear error code.

Check for block indicators such as CAPTCHA markup, challenge pages, unexpected redirects, or repeated empty templates.
Compare page title and canonical URL to what you expected.
Hash response bodies so that suspiciously short or identical ones stand out.
Monitor sudden drops in extracted item counts per page.
Store a small sample of failed HTML for debugging.
Separate target-level blocks from proxy-level failures in your logs.

Fallback strategy: move the request to a different session, identity, or proxy group only after you confirm the issue is not a bad selector or incomplete rendering. If proxy strategy is part of your stack, see “How to Use Proxies for Web Scraping: Rotation, Sessions, and Common Pitfalls”.
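
As a rough illustration of the checks above, the function below collects block signals from a response that came back without an error code. The marker strings, the length threshold, and the function name block_signals are assumptions; real block pages differ by site and vendor, so treat the output as a hint to investigate, not a verdict.

```python
import hashlib
import re

# Illustrative markers only; tune them per target site.
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human", "unusual traffic")
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def block_signals(html, final_url, expected_url_prefix, expected_title_part,
                  min_length=2000, seen_hashes=None):
    """Return a list of reasons this response looks like a block page."""
    reasons = []
    lowered = html.lower()
    for marker in BLOCK_MARKERS:
        if marker in lowered:
            reasons.append(f"marker:{marker}")
    # An unexpected final URL usually means a redirect to a challenge or login page.
    if not final_url.startswith(expected_url_prefix):
        reasons.append("unexpected_redirect")
    match = TITLE_RE.search(html)
    title = match.group(1).strip() if match else ""
    if expected_title_part.lower() not in title.lower():
        reasons.append("unexpected_title")
    if len(html) < min_length:
        reasons.append("short_body")
    # Identical bodies across different URLs often indicate a shared block template.
    if seen_hashes is not None:
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            reasons.append("duplicate_body")
        seen_hashes.add(digest)
    return reasons
```

Logging the returned reasons alongside the proxy or session identifier helps separate target-level blocks from proxy-level failures.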

4. Dynamic pages and browser automation failures

What you are protecting against: pages that technically load but are not ready to parse.

Wait for a meaningful selector or network condition, not a fixed sleep alone.
Set separate limits for navigation timeout and selector timeout.
Detect whether content is loaded via API calls, embedded JSON, or lazy-loaded DOM nodes.
Handle infinite scroll, pagination, and load-more patterns deliberately rather than relying on repeated scrolling.
Close pages and browser contexts reliably to avoid memory leaks.
Take screenshots or save HTML on failure in a small sample of cases.

Fallback strategy: if browser automation is only needed for a subset of URLs, first try direct API or HTML extraction, then escalate to Playwright or Puppeteer only when required. Related reading: “How to Scrape JavaScript-Rendered Websites With Playwright”, “Puppeteer Web Scraping Guide: Extract Data From Modern Web Apps”, and “How to Handle Pagination, Infinite Scroll, and Load More Buttons When Scraping”.
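
The fallback strategy above can be sketched as an escalation ladder that tries the cheapest extractor first. Everything here is a hypothetical shape: extract_with_escalation is not a library function, and the extractors you pass in (an API client, an HTML parser, a browser-based renderer) are assumed to be your own callables.

```python
def extract_with_escalation(url, strategies, is_complete):
    """Try (name, extractor) pairs in order, cheapest first.

    Returns (strategy_name, record, errors). strategy_name and record are
    None when every strategy fails; errors lists what went wrong per tier.
    """
    errors = []
    for name, extractor in strategies:
        try:
            record = extractor(url)
        except Exception as exc:  # broad on purpose: each tier fails differently
            errors.append((name, repr(exc)))
            continue
        if record is not None and is_complete(record):
            return name, record, errors
        # The tier ran but did not produce the critical fields.
        errors.append((name, "incomplete"))
    return None, None, errors
```

Recording which tier produced each record shows how often the browser tier is really needed, which is useful when escalation starts to hurt throughput.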

5. Parsing failures and selector drift

What you are protecting against: code that runs successfully but extracts the wrong thing.

Validate the presence of critical fields before marking a page successful.
Use multiple extraction paths where appropriate, such as JSON-LD first and CSS selectors second.
Keep selectors tied to stable attributes rather than presentation classes when possible.
Log parser version or extraction strategy name with every record.
Save representative fixtures so parser changes can be tested offline.
Alert on sudden drops in field completion rate, not just hard exceptions.

Fallback strategy: if a primary selector fails, try a secondary parser before requeuing the URL. If both fail, store the raw response for review rather than silently returning partial data.
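
Here is a small sketch of "JSON-LD first, markup second" with a required-field check before a page counts as parsed. To stay dependency-free it uses regular expressions, which is a simplification; in a real project you would more likely use Beautiful Soup or a similar parser. The data-testid attribute, the field names, and the function names are assumptions about a hypothetical product page.

```python
import json
import re

JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.IGNORECASE | re.DOTALL,
)
# Fallback patterns for a hypothetical page layout; adjust per site.
PRICE_RE = re.compile(r'data-testid=["\']price["\'][^>]*>\s*([^<]+?)\s*<', re.IGNORECASE)
NAME_RE = re.compile(r"<h1[^>]*>\s*([^<]+?)\s*</h1>", re.IGNORECASE)
REQUIRED = ("name", "price")


def _present(value):
    return value is not None and str(value).strip() != ""


def from_json_ld(html):
    for block in JSON_LD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # incomplete or malformed JSON: try the next block
        if isinstance(data, dict) and data.get("@type") == "Product":
            offers = data.get("offers")
            price = offers.get("price") if isinstance(offers, dict) else None
            return {"name": data.get("name"), "price": price}
    return None


def from_markup(html):
    name, price = NAME_RE.search(html), PRICE_RE.search(html)
    return {
        "name": name.group(1) if name else None,
        "price": price.group(1) if price else None,
    }


def parse_product(html):
    """Return (record, strategy), or (None, 'failed') if required fields are missing."""
    for strategy, parser in (("json_ld", from_json_ld), ("markup", from_markup)):
        record = parser(html)
        if record and all(_present(record.get(f)) for f in REQUIRED):
            # Tag each record with the strategy that produced it.
            return {**record, "strategy": strategy}, strategy
    return None, "failed"
```

When parse_product returns "failed", the caller would store the raw response for review instead of emitting a partial record.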

6. Data validation and quality checks

What you are protecting against: subtle bad data entering your pipeline unnoticed.

Define required, optional, and derived fields for each record type.
Check for impossible values, blank identifiers, malformed prices, and invalid dates.
Normalize whitespace, encoding, and number formats consistently.
Deduplicate by stable keys or content signatures.
Track null rates and field coverage over time.
Flag unusual spikes in record count, duplicate count, or empty content.

Fallback strategy: quarantine questionable records into a review queue instead of letting them overwrite trusted data.
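
The checks in this section might look like the sketch below, which validates each record and routes failures to a quarantine list. The field names (id, price, updated) and the price format (comma as thousands separator, dot as decimal point) are assumptions for illustration; a real pipeline would use its own schema and locale rules.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation


def normalize_price(raw):
    """Turn strings like ' $1,299.00 ' into Decimal; return None if malformed.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    cleaned = re.sub(r"[^\d.,-]", "", str(raw)).replace(",", "")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return value if value >= 0 else None


def validate(record):
    """Return a list of problems; an empty list means the record is trusted."""
    problems = []
    if not str(record.get("id") or "").strip():
        problems.append("blank_id")
    if normalize_price(record.get("price")) is None:
        problems.append("bad_price")
    try:
        date.fromisoformat(str(record.get("updated")))
    except ValueError:
        problems.append("bad_date")
    return problems


def route(records):
    """Split records into trusted and quarantined, deduplicating by id."""
    trusted, quarantined, seen = [], [], set()
    for record in records:
        problems = validate(record)
        if not problems and record["id"] in seen:
            problems.append("duplicate_id")
        if problems:
            # Keep the reasons with the record so reviewers do not have to guess.
            quarantined.append({**record, "problems": problems})
        else:
            seen.add(record["id"])
            trusted.append(record)
    return trusted, quarantined
```

Only the trusted list would be written to the main store; the quarantined list goes to a review queue with its problem labels attached.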

7. Queue, storage, and downstream failures

What you are protecting against: the scrape succeeding but the output disappearing or corrupting later.

Make database writes idempotent where possible.
Use retry rules for transient storage failures, but do not retry malformed payloads indefinitely.
Record whether a URL failed at fetch, parse, validation, or write stage.
Prevent scheduled jobs from overlapping unless they are designed for parallel runs.
Persist enough metadata to replay failed items safely.
Set alerts for queue age, backlog growth, and write error rate.

Fallback strategy: if the primary sink is unavailable, write to durable intermediate storage and replay later.
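
A minimal sketch of an idempotent write with a replayable fallback, using SQLite from the standard library. The table layout and function names are assumptions, and the spool here is a plain in-memory list only to keep the example short; in production it would be a durable file or queue.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " url TEXT PRIMARY KEY, payload TEXT NOT NULL, stage TEXT NOT NULL)"
    )
    return conn


def save_item(conn, url, payload, stage="written"):
    """Idempotent write: replaying the same URL overwrites instead of duplicating."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO items (url, payload, stage) VALUES (?, ?, ?)",
            (url, json.dumps(payload, sort_keys=True), stage),
        )


def save_with_spool(conn, spool, url, payload):
    """If the primary sink fails, keep enough metadata to replay the item later."""
    try:
        save_item(conn, url, payload)
        return "written"
    except sqlite3.OperationalError:
        spool.append({"url": url, "payload": payload, "failed_stage": "write"})
        return "spooled"
```

Because the write is keyed on the URL, replaying spooled items after an outage cannot create duplicates.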

8. Tool-specific reliability checks

For lightweight Python scrapers:

Use sessions for connection reuse.
Set user-agent and headers deliberately.
Pair timeout settings with parser validation, not just exception handling.
Review your approach against “Python Web Scraping Tutorial for Beginners: Requests and Beautiful Soup”.

For Scrapy or larger crawlers:

Use per-domain concurrency and download delay controls.
Separate retry middleware rules by status code and exception type.
Export crawl stats and inspect them after every meaningful release.
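
For Scrapy, the first two points map onto project settings. The fragment below uses standard Scrapy setting names, but the values are illustrative starting points only and should be tuned per target site.

```python
# settings.py (fragment): values are illustrative, not recommendations
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # per-domain concurrency
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same site
DOWNLOAD_TIMEOUT = 30                # seconds before the downloader gives up

AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

RETRY_ENABLED = True
RETRY_TIMES = 3                      # retries in addition to the first attempt
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # 404 left out on purpose
```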

For Playwright or Puppeteer:

Reuse browser instances carefully, but isolate unstable sessions.
Limit page count per browser to reduce memory creep.
Capture console and request failures when debugging.
If you are choosing a stack, compare trade-offs in “Selenium vs Playwright vs Puppeteer for Web Scraping”.

What to double-check

Before you mark a scraper as production-ready, pause and verify these points. This is where many reliability issues are introduced.

Timeouts are explicit: there is a documented connect timeout, read timeout, navigation timeout, and overall job deadline.
Retries are selective: you are not retrying parsing bugs, authentication failures, or permanent 404s as though they were transient outages.
Backoff is real: retries spread out over time with jitter rather than firing instantly.
Success is defined by data quality: a page is not “successful” just because your HTTP client returned 200.
Fallbacks have boundaries: escalating every failed request to a headless browser can destroy throughput and mask parser issues.
Observability exists: logs, metrics, and sample failure artifacts are available without reproducing the bug from scratch.
Runbooks exist: someone on the team knows what to do when block rates jump, selectors fail, or queue lag grows.
robots.txt, permissions, and usage expectations are reviewed: your workflow should reflect the legal and operational context of the target site.
Scheduling is controlled: cron jobs or worker schedules will not pile up and hit the same site in bursts.
Internal consistency checks are in place: if a category normally returns 200 items and now returns 3, that should be visible quickly.
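
The last check in the list can be sketched as a comparison of the current item count against the median of recent runs. The function name, the ratio thresholds, and the minimum history length are assumptions; pick values that match how noisy each category normally is.

```python
from statistics import median


def count_anomaly(history, current, min_ratio=0.5, max_ratio=2.0, min_history=5):
    """Compare the current item count with the median of recent runs.

    Returns 'drop', 'spike', or None. With too little history, returns None.
    """
    if len(history) < min_history:
        return None
    baseline = median(history)
    if baseline == 0:
        return "spike" if current > 0 else None
    ratio = current / baseline
    if ratio < min_ratio:
        return "drop"
    if ratio > max_ratio:
        return "spike"
    return None
```

With a history of roughly 200 items per run, a run that returns 3 is flagged as a drop, while normal run-to-run variation is ignored.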

A useful test is to pick five recent failures and ask, “Could we tell within five minutes what happened, what the scraper did, and whether it should have done something else?” If the answer is no, your error handling is still too implicit.

Common mistakes

Most scraper reliability problems come from a handful of repeated design choices. Avoiding them is often more valuable than adding another library.

Retrying everything

Blind retries create noise, wasted bandwidth, and more blocking risk. A parser bug will not heal on the third attempt. Split failures into transient, conditional, and terminal categories.

Using one timeout for everything

A tiny JSON endpoint and a JavaScript-heavy product page should not share the same expectations. Separate timeout budgets by request type.

Sleeping instead of waiting for signals

Fixed delays are simple but brittle: too short and the content is not there yet, too long and every page pays the cost. They are one of the most common reasons people struggle with dynamic sites. Prefer waiting for selectors, responses, or app-specific readiness signals.

Trusting status code 200 too much

Many block pages, consent pages, and empty templates still return 200. Validate content, not just transport success.

Escalating too early to a headless browser

Headless browser scraping is useful, but it is also slower and more expensive. Check first whether the data is already available in the initial HTML, in embedded JSON, or in a public API response.

Ignoring partial failures

If one field starts failing quietly while the rest of the page parses, you still have an incident. Completion rates matter.

Keeping too little failure evidence

When every failing response is discarded, debugging turns into guesswork. Save a controlled sample of raw HTML, screenshots, redirect chains, or response metadata.
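
One way to keep a controlled sample is to sample deterministically by URL and store a truncated body next to a small metadata file. The function names, the 5% default rate, and the file layout below are assumptions for illustration.

```python
import hashlib
import json
import time
from pathlib import Path


def should_sample(url, rate=0.05):
    """Deterministic sampling: the same URL always gets the same decision."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 < rate


def save_failure(directory, url, status, body, reason, rate=0.05, max_chars=200_000):
    """Write a truncated body plus metadata for a sample of failures.

    Returns the path of the metadata file, or None if the URL was not sampled.
    """
    if not should_sample(url, rate):
        return None
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Truncate so that one huge response cannot fill the disk.
    (directory / f"{name}.html").write_text(body[:max_chars], encoding="utf-8")
    meta = {"url": url, "status": status, "reason": reason, "saved_at": time.time()}
    meta_path = directory / f"{name}.json"
    meta_path.write_text(json.dumps(meta), encoding="utf-8")
    return meta_path
```

Hash-based sampling keeps the evidence for a given URL stable across retries, so repeated failures overwrite one file instead of piling up.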

Not distinguishing target issues from infrastructure issues

A failing proxy, a blocked user agent, and a changed selector may all look like “scraper broken” unless you label failures properly. Your logs should help you separate these paths quickly.

Letting alerts depend only on exceptions

The nastiest scraper failures are often silent degradations: fewer records, more blanks, wrong fields, or stale pages. Alert on business-level outputs as well as application errors.
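
A simple business-level signal is the field completion rate per batch, compared against a known baseline. The sketch below assumes records are dictionaries and that you maintain baseline rates yourself; the function names and the 10-point tolerance are illustrative.

```python
def _filled(value):
    return value is not None and str(value).strip() != ""


def completion_rates(records, fields):
    """Share of records with a non-blank value for each field."""
    total = len(records)
    rates = {}
    for field in fields:
        filled = sum(1 for record in records if _filled(record.get(field)))
        rates[field] = filled / total if total else 0.0
    return rates


def degraded_fields(records, baselines, tolerance=0.10):
    """Fields whose completion rate fell more than `tolerance` below baseline."""
    rates = completion_rates(records, list(baselines))
    return {
        field: rate
        for field, rate in rates.items()
        if rate < baselines[field] - tolerance
    }
```

An alert built on degraded_fields fires when a single field starts coming back blank, even though no exception was ever raised.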

If you are still selecting libraries or planning a new stack, the comparisons in “Best Python Libraries for Web Scraping: Updated Comparison” and “Best Node.js Libraries for Web Scraping and Browser Automation” can help you match reliability features to your workflow.

When to revisit

This checklist should not live in a project wiki untouched for a year. Revisit it whenever the inputs around your crawler change.

Before seasonal planning cycles: traffic patterns, product ranges, and crawl volumes often change, which affects rate limits, queue depth, and timeouts.
When workflows or tools change: moving from requests to Playwright, changing proxies, or altering storage backends should trigger a review.
After every meaningful site redesign: especially if the target moves from server-rendered HTML to client-side rendering.
After repeated incidents: if the same failure happens more than once, the checklist needs a new rule.
When adding new domains: do not assume one site’s retry and timeout profile will fit another.
When data quality expectations change: new required fields mean new validation and new failure modes.

For a practical review process, end each release or incident with five actions:

List the top three failure modes seen this period.
Decide which were transient, which were permanent, and which were caused by your own code.
Update retry, timeout, or fallback rules accordingly.
Add one metric or alert that would have shortened detection time.
Test one stored failure fixture before the next deployment.

To get lasting value from this checklist, treat it as an operating document rather than a one-time read. A robust web scraper is rarely built by adding more code. It is built by deciding, in advance, how the system should behave when the web is slow, inconsistent, and occasionally hostile.
