How to Clean Scraped Data: Deduplication, Normalisation, and Validation

Code Scrape Hub Editorial
2026-06-11

A practical guide to cleaning scraped data with repeatable rules for deduplication, normalisation, and validation.

Cleaning is where scraped data becomes useful. A scraper can collect thousands of rows, but if the output contains duplicates, inconsistent formats, broken values, and missing fields, it will be difficult to search, analyse, or feed into downstream systems. This guide explains a practical approach to cleaning scraped data with repeatable rules for deduplication, normalisation, and validation. The aim is not to build a perfect universal pipeline, but to give you a framework you can reuse as sites change, fields evolve, and new edge cases appear.

Overview

If you scrape websites regularly, data quality problems tend to show up in predictable ways. The same product appears under several URLs. Prices mix symbols, commas, and text labels. Dates arrive in several formats. Empty strings are mixed with null values. Titles include stray whitespace or tracking text. A field you expected to be numeric suddenly contains “Contact for price”.

The important point is that cleaning scraped data is not a final tidy-up step you do once. It is part of the scraper design. Reliable data extraction usually needs four stages:

  1. Capture raw data exactly as it appeared.
  2. Transform the fields into consistent internal formats.
  3. Validate the transformed records against rules.
  4. Review exceptions so your rules improve over time.

This matters whether you use a simple Python script with requests and Beautiful Soup, a Scrapy project, or browser automation with Playwright or Puppeteer. The extraction method changes, but the cleaning principles stay mostly the same.

A useful way to think about cleaning is that every field needs answers to three questions:

  • What counts as the same record?
  • What is the standard format for this field?
  • What values are acceptable and what should be flagged?

Once those rules are explicit, your pipeline becomes easier to maintain, test, and revisit.

Core framework

A strong cleaning process starts with a small set of repeatable rules. For most scraping projects, the following framework is enough to keep your data usable without overengineering it.

1. Keep the raw version before cleaning

Do not overwrite the original extraction. Store the raw HTML-derived values or JSON fields alongside the cleaned output, or in a separate raw table or file. This gives you an audit trail when a selector changes or a normalisation rule turns out to be too aggressive.

For example, keep both:

  • raw_price = "£1,299.00 inc VAT"
  • clean_price = 1299.00

If you only keep the cleaned field, you lose the context needed to debug parsing errors later. For storage options, it helps to choose a format that matches the size and complexity of your pipeline. See Store Scraped Data in CSV, JSON, SQLite, or Postgres: What to Choose.
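
A minimal sketch of keeping both versions side by side is shown here. The PriceField class and capture_price function are hypothetical names used for illustration, and the parsing is deliberately naive (it assumes "." is the decimal separator).

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass(frozen=True)
class PriceField:
    """Hypothetical container that keeps the raw text next to the parsed value."""
    raw: Optional[str]        # exactly what the page showed
    clean: Optional[Decimal]  # parsed value, or None if parsing failed


def capture_price(raw_text: Optional[str]) -> PriceField:
    # The raw text is stored untouched; only a copy is cleaned.
    if raw_text is None:
        return PriceField(raw=None, clean=None)
    digits = "".join(ch for ch in raw_text if ch in "0123456789.")
    try:
        clean = Decimal(digits)
    except InvalidOperation:
        clean = None  # parsing failed, but the raw value is still available
    return PriceField(raw=raw_text, clean=clean)
```

With this shape, a value such as "Contact for price" ends up with clean set to None while the raw text remains available for debugging.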

2. Define a canonical schema

Before you deduplicate scraped data, decide what each field should look like in its final form. A schema can be lightweight, but it should be explicit. For each field, define:

  • Field name
  • Type: string, integer, decimal, boolean, date, URL
  • Required or optional
  • Allowed values or ranges
  • Cleaning rule
  • Deduplication role, if any

A simple product schema might include:

  • product_id: string, required
  • title: string, required, trimmed
  • price_gbp: decimal, optional, non-negative
  • currency: string, optional, ISO-style internal code
  • availability: enum, optional
  • product_url: URL, required, canonicalised
  • scraped_at: datetime, required

The schema becomes your contract between scraping, cleaning, storage, and reporting.
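
One lightweight way to make the schema explicit is to write it down as data. The sketch below is an assumption about how that could look: FieldSpec, PRODUCT_SCHEMA, and missing_required are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one field in the canonical schema."""
    name: str
    type: str                                  # "string", "decimal", "enum", "url", "datetime"
    required: bool = False
    allowed: Optional[Tuple[str, ...]] = None  # approved values for enum fields
    dedup_key: bool = False                    # whether the field takes part in deduplication


PRODUCT_SCHEMA = (
    FieldSpec("product_id", "string", required=True, dedup_key=True),
    FieldSpec("title", "string", required=True),
    FieldSpec("price_gbp", "decimal"),
    FieldSpec("currency", "string"),
    FieldSpec("availability", "enum",
              allowed=("in_stock", "out_of_stock", "preorder", "unknown")),
    FieldSpec("product_url", "url", required=True),
    FieldSpec("scraped_at", "datetime", required=True),
)


def missing_required(record: dict) -> list:
    """Return the names of required fields that are absent or None."""
    return [f.name for f in PRODUCT_SCHEMA if f.required and record.get(f.name) is None]
```

Because the schema is plain data, the same definition can drive cleaning, validation, and documentation without being restated in each place.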

3. Deduplicate in layers

Duplicate records usually come from pagination overlap, parameterised URLs, session-specific links, repeated cards on the same page, or repeated runs over time. The cleanest approach is to deduplicate in layers rather than rely on one rule.

Common layers include:

  • Exact row duplicates: every field is identical.
  • Same canonical URL: tracking parameters removed, fragments ignored, host normalised where appropriate.
  • Same source identifier: SKU, listing ID, or embedded product ID.
  • Near duplicates: same title plus same price plus same source domain.

In practice, a stable site-specific identifier is best. If the page exposes a product ID, use it. If not, create a deterministic fingerprint from the fields that tend to remain stable, such as normalised title + canonical URL path + source site.

Be careful with title-only matching. Similar items often share names, especially in job listings, property pages, and marketplace feeds.
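
The layering can be sketched as a fingerprint function that uses the strongest key available and falls back only when it has to. The field names (source, sku, canonical_url, title, price) are assumptions for illustration. Note that this simple version does not link a record that has a SKU to one that only has a URL; that would need a separate matching pass.

```python
import hashlib


def fingerprint(record: dict) -> str:
    """Pick the strongest available key, falling back layer by layer."""
    if record.get("sku"):
        key = ("sku", record["source"], record["sku"])
    elif record.get("canonical_url"):
        key = ("url", record["source"], record["canonical_url"])
    else:
        # Weakest layer: normalised title plus price on the same source.
        title = " ".join((record.get("title") or "").lower().split())
        key = ("near", record["source"], title, str(record.get("price")))
    return hashlib.sha256("|".join(key).encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record seen for each fingerprint, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fp = fingerprint(record)
        if fp not in seen:
            seen.add(fp)
            unique.append(record)
    return unique
```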

4. Normalise one field type at a time

Normalisation means converting extracted values into consistent internal representations. The goal is not to preserve display formatting. The goal is to make the data comparable and machine-friendly.

Useful normalisation rules include:

  • Text: trim whitespace, collapse repeated spaces, decode entities, preserve meaningful case only where needed.
  • URLs: resolve relative links, remove fragments, decide how to handle query parameters, standardise trailing slashes.
  • Numbers: strip currency symbols and separators carefully, convert to numeric types.
  • Dates: convert to a single output format such as ISO 8601.
  • Booleans: map variants like “In stock”, “Available”, “Yes” to a clear internal value.
  • Nulls: standardise empty strings, “N/A”, “-”, and missing fields into a single null representation.

This is where many web scraping tutorials stop too early. Extracting data from HTML is only half the job; making fields consistent is what makes the output reusable.
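
As one example of a field-type rule, dates can be normalised by trying a short list of known formats. The formats below are assumptions: they treat numeric dates as day-first and rely on English month names (the default C locale). A real source list should be built from the formats your sites actually use.

```python
from datetime import datetime
from typing import Optional

# Formats assumed for illustration; "%d/%m/%Y" reads 11/06/2026 as 11 June.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%d %b %Y")


def normalise_date(value: Optional[str]) -> Optional[str]:
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no known format matches."""
    if not value:
        return None
    text = " ".join(value.split())  # collapse stray whitespace first
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning None for unparsed values, rather than guessing, leaves the decision to the validation step.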

5. Validate after normalisation, not before

Validation answers a simple question: does the cleaned record make sense? It should happen after transformation, because many raw values look invalid until they are parsed.

Typical validation checks include:

  • Required fields are present.
  • URLs are absolute and parse correctly.
  • Prices are numeric and non-negative.
  • Dates parse successfully and fall within a sensible range.
  • Enum fields only use approved values.
  • Record keys are unique where expected.

It helps to separate validation results into:

  • Error: the record should be rejected or quarantined.
  • Warning: the record is usable but should be reviewed.
  • Info: a non-blocking note, such as a fallback parser being used.

This makes exception handling far more manageable, especially if your scraper runs on a schedule. If you run recurring jobs, pair cleaning rules with operational checks from Web Scraping Error Handling Checklist: Retries, Timeouts, and Fallbacks and Schedule a Web Scraper With Cron, GitHub Actions, and Cloud Functions.
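
The three severity levels can be sketched as a function that returns a list of issues instead of raising on the first problem. The field names and the choice of which check maps to which severity are assumptions; treat them as a starting point rather than a fixed policy.

```python
from decimal import Decimal
from urllib.parse import urlsplit

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "preorder", "unknown"}


def validate(record: dict) -> list:
    """Return (severity, message) pairs for one cleaned record."""
    issues = []
    if not record.get("title"):
        issues.append(("error", "title is missing"))
    url = urlsplit(record.get("product_url") or "")
    if url.scheme not in ("http", "https") or not url.netloc:
        issues.append(("error", "product_url is not an absolute http(s) URL"))
    price = record.get("price_gbp")
    if price is None:
        issues.append(("warning", "price is missing"))
    elif not isinstance(price, Decimal) or not price.is_finite() or price < 0:
        issues.append(("error", "price is not a non-negative decimal"))
    if record.get("availability") not in ALLOWED_AVAILABILITY:
        issues.append(("warning", "availability is outside the approved values"))
    if record.get("used_fallback_parser"):
        issues.append(("info", "fallback parser was used"))
    return issues
```

A caller can then quarantine any record with at least one error and pass the rest through with their warnings attached.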

6. Make rules deterministic

A cleaning rule should produce the same result every time for the same input. Avoid vague manual logic such as “fix obvious title issues later”. If a rule matters, encode it.

Good examples:

  • Remove UTM parameters from URLs.
  • Map “out of stock”, “sold out”, and “unavailable” to a single internal value such as out_of_stock.
  • Convert prices with commas and currency symbols to decimals.

Deterministic rules are easier to test and safer to update.
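
A lookup table is one simple way to keep a rule deterministic. In this sketch the AVAILABILITY_MAP contents and the internal values are assumptions; labels that are not in the table become "unknown" so they stay visible instead of being guessed.

```python
from typing import Optional

# Lookup table assumed for illustration; extend it as new labels appear.
AVAILABILITY_MAP = {
    "in stock": "in_stock",
    "available": "in_stock",
    "out of stock": "out_of_stock",
    "sold out": "out_of_stock",
    "unavailable": "out_of_stock",
    "pre-order": "preorder",
    "preorder": "preorder",
}


def normalise_availability(label: Optional[str]) -> str:
    """Map a display label to a fixed internal value; unmapped labels become "unknown"."""
    if label is None:
        return "unknown"
    key = " ".join(label.lower().split())  # case- and whitespace-insensitive lookup
    return AVAILABILITY_MAP.get(key, "unknown")
```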

Practical examples

Here is what the framework looks like in common scraping scenarios.

Cleaning ecommerce product data

Suppose you scrape product listings from category pages and individual detail pages. Common issues include duplicate products across categories, inconsistent availability labels, and price strings with tax or shipping notes.

A practical flow might look like this:

  1. Extract raw fields: title, URL, image URL, price text, SKU, availability text.
  2. Canonicalise the product URL by removing tracking parameters.
  3. Normalise title by trimming and collapsing whitespace.
  4. Parse price text into numeric value and store currency separately.
  5. Map availability labels to a fixed internal set.
  6. Deduplicate by SKU where available, otherwise by canonical URL.
  7. Validate that title and URL exist and that price is not negative.

Internal values could become:

  • availability = in_stock | out_of_stock | preorder | unknown
  • price_gbp = Decimal("1299.00")
  • product_url = https://example.com/product/widget-123

If you monitor product pages over time, keep snapshots separate from the master entity. The product itself may be unique, but the price and availability are time-based observations.
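
The split between entity and observation can be sketched with two in-memory stores. The names products, observations, and record_snapshot are hypothetical, and a real pipeline would more likely use two database tables, but the shape is the same.

```python
from datetime import datetime, timezone
from decimal import Decimal

products = {}      # master entities keyed by SKU: attributes that rarely change
observations = []  # one row per crawl: values that vary over time


def record_snapshot(sku, title, url, price, availability, scraped_at):
    # Insert the entity once; later crawls do not create a second product row.
    products.setdefault(sku, {"sku": sku, "title": title, "product_url": url})
    observations.append({
        "sku": sku,
        "price_gbp": price,
        "availability": availability,
        "scraped_at": scraped_at,
    })


# Two crawls of the same product on different days.
record_snapshot("SKU-123", "Widget", "https://example.com/product/widget-123",
                Decimal("1299.00"), "in_stock",
                datetime(2026, 6, 1, tzinfo=timezone.utc))
record_snapshot("SKU-123", "Widget", "https://example.com/product/widget-123",
                Decimal("1199.00"), "out_of_stock",
                datetime(2026, 6, 2, tzinfo=timezone.utc))
```

After both calls there is still one product, with two dated observations attached to it.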

Cleaning lead generation or directory data

Directories and profile pages often contain messy contact details. Business names may vary slightly. Phone numbers may include spaces, punctuation, or international prefixes. Addresses may have line breaks and optional county fields.

For this kind of data:

  • Strip surrounding whitespace from each field.
  • Standardise phone numbers into a consistent internal format that suits your region and use case.
  • Lowercase and trim email addresses.
  • Split multi-line addresses into structured components only if you can do it reliably; otherwise keep both raw and cleaned single-line versions.
  • Deduplicate using domain, email, phone, or a composite fingerprint instead of business name alone.

A business called “Acme Ltd”, “ACME Limited”, and “Acme” may be the same entity, but exact name matching will miss that. At the same time, aggressive fuzzy matching can merge different businesses incorrectly. Start with conservative rules and review collisions manually.
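
A conservative starting point might look like the sketch below. The suffix list and function names are assumptions, and the phone rule only strips formatting: it does not convert national numbers to international form or handle notations such as "+44 (0)20".

```python
import re
from typing import Optional

# Legal suffixes assumed for illustration; adjust for the jurisdictions you cover.
COMPANY_SUFFIXES = {"ltd", "limited", "plc", "llp", "inc", "llc"}


def normalise_email(value: Optional[str]) -> Optional[str]:
    """Lowercase and trim; empty results become None."""
    value = (value or "").strip().lower()
    return value or None


def normalise_phone(value: Optional[str]) -> Optional[str]:
    """Keep digits and a leading "+"; no country-specific rewriting is attempted."""
    if not value:
        return None
    text = value.strip()
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        return None
    return ("+" if text.startswith("+") else "") + digits


def company_key(name: str) -> str:
    """Conservative match key: lowercase, drop punctuation and trailing legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while words and words[-1] in COMPANY_SUFFIXES:
        words.pop()
    return " ".join(words)
```

Here company_key gives "acme" for all three spellings in the example above, while "Acme Tools Ltd" keeps a distinct key. Records that share a key are candidates for review, not automatic merges.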

Cleaning article or SERP-style data

Content and SEO datasets bring a different pattern of issues: repeated URLs with different tracking parameters, title variations between list pages and detail pages, and timestamps in human-readable text.

A sensible process is:

  • Canonicalise URLs first.
  • Store page title and listing title separately if both are useful.
  • Normalise publication dates into ISO format.
  • Use a content hash or canonical URL as the primary key.
  • Treat ranking position or snippet text as observations tied to a crawl time, not as permanent attributes.

This is especially important when you scrape on a schedule, because a ranking page can produce many near-duplicate rows across runs.
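
A content hash can be sketched as follows. The normalisation applied before hashing (collapsing whitespace and lowercasing) is an assumption about what counts as "the same content"; ranking_observation is a hypothetical helper showing position and snippet stored against a crawl time.

```python
import hashlib
import re


def content_hash(title: str, body: str) -> str:
    """Hash whitespace- and case-normalised text so trivial layout changes keep the same key."""
    normalised = re.sub(r"\s+", " ", f"{title}\n{body}").strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def ranking_observation(url: str, position: int, snippet: str, crawled_at: str) -> dict:
    # Position and snippet belong to the crawl, not to the page entity itself.
    return {"url": url, "position": position, "snippet": snippet, "crawled_at": crawled_at}
```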

Example cleaning logic in Python

If you build your scraper in Python, keep cleaning logic in small functions rather than one large block. Even a lightweight structure helps:

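from decimal import Decimal, InvalidOperation
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters that only carry tracking information.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content"}

def clean_text(value):
    """Collapse runs of whitespace and return None for empty values."""
    if value is None:
        return None
    value = re.sub(r"\s+", " ", value).strip()
    return value or None

def canonical_url(url):
    """Lowercase the host, drop tracking parameters and the fragment, strip trailing slashes."""
    if not url:
        return None
    parts = urlsplit(url)
    # keep_blank_values=True keeps parameters such as "?flag=" that parse_qsl drops by default.
    filtered_query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), urlencode(filtered_query), ""))

def parse_price(value):
    """Parse a price such as "£1,299.00 inc VAT"; assumes "," is a thousands separator."""
    if not value:
        return None
    cleaned = re.sub(r"[^0-9.,]", "", value).replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Empty, or not a single number (for example a price range): leave it for validation.
        return None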

The details will vary by site and locale, but the pattern is stable: one function per field type, then a validation step after transformation.

Example validation checklist

Whether you use Python or Node, a good validation pass can be as simple as a list of assertions:

  • Title exists and is at least a few characters long.
  • Canonical URL exists and starts with http.
  • Price is either null or numeric.
  • Availability value is inside the allowed enum.
  • Record fingerprint is unique in the batch.

Save failed rows for inspection rather than dropping them silently. Silent failures are one of the main reasons scraping datasets become hard to trust.
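
The checklist and the quarantine step can be sketched together. The field names, the three-character title threshold, and the split_batch helper are assumptions for illustration.

```python
from decimal import Decimal

ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "preorder", "unknown"}
MIN_TITLE_LENGTH = 3  # assumed threshold for "a few characters"


def check_record(record: dict, seen_fingerprints: set) -> list:
    """Return the reasons a record fails the checklist (an empty list means it passes)."""
    reasons = []
    if len(record.get("title") or "") < MIN_TITLE_LENGTH:
        reasons.append("title missing or too short")
    if not (record.get("canonical_url") or "").startswith("http"):
        reasons.append("canonical URL missing or not http")
    price = record.get("price")
    if price is not None and not isinstance(price, (int, float, Decimal)):
        reasons.append("price is not numeric")
    if record.get("availability") not in ALLOWED_AVAILABILITY:
        reasons.append("availability outside the allowed enum")
    if record.get("fingerprint") in seen_fingerprints:
        reasons.append("fingerprint repeated in batch")
    return reasons


def split_batch(records):
    """Separate passing rows from failed rows, keeping the failure reasons."""
    accepted, quarantined, seen = [], [], set()
    for record in records:
        reasons = check_record(record, seen)
        seen.add(record.get("fingerprint"))
        if reasons:
            quarantined.append({"record": record, "reasons": reasons})
        else:
            accepted.append(record)
    return accepted, quarantined
```

The quarantined list can be written to a file or table as it is, so every rejected row carries the reason it was rejected.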

Common mistakes

Most cleaning problems come from a few recurring mistakes. If you avoid these, your pipeline will be much easier to maintain.

Cleaning too early in the extraction step

It is tempting to parse and fix every field inside the scraper callback or page loop. That works for small experiments, but it quickly becomes hard to debug. Prefer a clearer split between extraction and cleaning so you can inspect raw values when something changes.

Using weak deduplication keys

Titles, names, and snippets are often unstable. If a site provides IDs, use them. If not, canonical URLs and composite fingerprints are usually safer than display text alone.

Over-normalising and losing meaning

Lowercasing all text, stripping punctuation, or trimming query parameters indiscriminately can destroy useful distinctions. Some URL parameters change tracking only; others identify the actual content. Some punctuation is noise; some is part of the product model or company name.

Ignoring null handling

Empty string, whitespace, “N/A”, “None”, and missing keys should not be treated as separate values unless you have a clear reason. Pick one internal null representation and apply it consistently.
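
A minimal sketch of a single null rule, assuming None is the internal representation and that the placeholder list below matches what your sources emit:

```python
# Placeholder strings assumed for illustration; compare case-insensitively after trimming.
NULL_TOKENS = {"", "n/a", "na", "none", "null", "-", "--"}


def normalise_null(value):
    """Return None for missing-value placeholders; leave everything else unchanged."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def normalise_record(record: dict, fields) -> dict:
    # Missing keys and placeholder strings end up as the same None.
    return {name: normalise_null(record.get(name)) for name in fields}
```

Note that the string "0" and the number 0 are left alone: they are values, not placeholders.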

Assuming one parser fits every site

Price parsing, date parsing, and URL rules often vary by source. A shared base layer is useful, but site-specific overrides are normal. Treat them as configuration, not as a failure of your general approach.
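
Site-specific overrides can be expressed as a small configuration layer on top of shared defaults. In this sketch the domain names and separator settings are made up for illustration, and parse_price_for is a hypothetical helper.

```python
from decimal import Decimal, InvalidOperation
import re

# Hypothetical per-site settings layered over shared defaults.
DEFAULT_RULES = {"decimal_sep": ".", "thousands_sep": ","}
SITE_RULES = {
    "shop.example.de": {"decimal_sep": ",", "thousands_sep": "."},
}


def parse_price_for(site: str, text: str):
    """Parse the first number in text using the separators configured for the site."""
    rules = {**DEFAULT_RULES, **SITE_RULES.get(site, {})}
    match = re.search(r"[0-9][0-9.,]*", text or "")
    if not match:
        return None
    number = match.group(0).rstrip(".,")  # drop sentence punctuation after the number
    number = number.replace(rules["thousands_sep"], "").replace(rules["decimal_sep"], ".")
    try:
        return Decimal(number)
    except InvalidOperation:
        return None
```

Adding a new source then means adding one entry to SITE_RULES rather than editing the parser.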

Dropping bad rows without logging them

Rejected records are useful signals. They tell you when a layout changed, when your parser became too strict, or when the source data itself is inconsistent. Keep an error log or quarantine table so you can inspect failed records later.

Forgetting operational causes of dirty data

Some data quality issues begin before cleaning. Incomplete pages from timeouts, bot challenges, blocked requests, or failed pagination can create misleading gaps and duplicates. That is why cleaning should be considered alongside crawl strategy, rate limiting, and proxy handling. Related reading: Rate Limiting for Web Scrapers: How to Crawl Responsibly Without Getting Blocked, How to Use Proxies for Web Scraping: Rotation, Sessions, and Common Pitfalls, and How to Handle Pagination, Infinite Scroll, and Load More Buttons When Scraping.

When to revisit

The best cleaning rules are living rules. Revisit your pipeline when the source site changes, when your storage model changes, or when the downstream use of the data becomes more demanding.

In practical terms, review your cleaning logic when:

  • A site redesign changes field formats or identifiers.
  • Your primary extraction method changes, such as moving from requests-based scraping to headless browser scraping.
  • You add a new source with different naming, date, or price conventions.
  • You start storing history instead of latest-state records.
  • You add analytics, alerts, or reporting that depend on stricter validation.
  • Your exception queue grows and the same failures keep recurring.

A simple maintenance routine works well:

  1. Review a sample of raw rows from each source.
  2. Check validation failures and warning counts.
  3. Inspect duplicate groups to see whether your keys are still correct.
  4. Update site-specific rules only where needed.
  5. Add tests for every new edge case you fix.

If you want one practical rule to take away, make it this: every time you patch a cleaning bug, turn that patch into a named rule and a test. Over time, that turns an ad hoc scraper into a dependable data tool.
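
As a sketch of that habit, here is a named rule with its tests written as plain test functions, which a runner such as pytest can collect. The rule name strip_utm_params and the example URLs are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_utm_params(url: str) -> str:
    """Named rule: remove utm_* tracking parameters and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.lower().startswith("utm_")
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


def test_removes_tracking_but_keeps_content_params():
    url = "https://example.com/item?id=42&utm_source=news&UTM_Medium=email"
    assert strip_utm_params(url) == "https://example.com/item?id=42"


def test_same_input_gives_same_output():
    url = "https://example.com/item?utm_campaign=x"
    assert strip_utm_params(url) == strip_utm_params(url) == "https://example.com/item"
```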

As your stack grows, it also helps to review the extraction side. Different libraries expose different failure modes and data shapes, especially on dynamic sites. For comparison and tooling context, see Best Python Libraries for Web Scraping: Updated Comparison, Best Node.js Libraries for Web Scraping and Browser Automation, Selenium vs Playwright vs Puppeteer for Web Scraping, and Puppeteer Web Scraping Guide: Extract Data From Modern Web Apps.

To put this into action, document your schema, write field-level cleaning functions, validate after transformation, and keep failed rows visible. That workflow is simple enough for a small script and durable enough for larger scraping systems. Clean data rarely happens by accident; it comes from explicit rules that are easy to inspect, test, and improve.


2026-06-09T21:58:17.983Z