From HTML to Tables: Building a Pipeline to Turn Unstructured Web Data into Tabular Foundation-Ready Datasets
Practical ETL to turn scraped HTML into validated, foundation-ready tables—schema design, normalisation, ClickHouse ingestion, and code examples.
Turn messy web pages into production-ready tables for tabular foundation models — fast
You're fighting broken HTML, rate limits, and noisy values while your stakeholders want clean, schema-validated tables that feed analytics, ML, and the new generation of tabular foundation models. This guide walks through a reproducible, production-grade ETL pipeline: extract with Python and Node.js, normalise values, validate against a formal table schema, and load into columnar systems like ClickHouse so your datasets are foundation-ready.
Why this matters in 2026
Tabular foundation models accelerated in 2025–26 as organisations realised that structured data unlocks many enterprise AI workflows. Investors doubled down on analytical databases: ClickHouse raised large rounds and gained traction as a low-latency OLAP backend for real-time feature stores and analytics. At the same time, model builders demand high-quality, consistent tables with clear types, units, and provenance. That means you need more than a scraper: you need a disciplined pipeline that enforces schema, normalises values, and logs provenance for auditing.
Overview: the pipeline in four steps
- Extract — scrape pages reliably and respectfully (Playwright/Requests + concurrency, proxy rotation).
- Transform & Normalise — parse strings to typed fields (dates, currencies, units), canonicalise categories, deduplicate and enrich.
- Validate — apply a formal table schema (Frictionless Table Schema, Pandera, Great Expectations) with clear error handling.
- Load & Store — write to CSV/Parquet and ingest into ClickHouse or cloud OLAP for fast queries and model pipelines.
Step 1 — Extract: resilient scraping that preserves provenance
Start with extraction that captures both the raw HTML and metadata that matter downstream: source URL, fetch timestamp, HTTP status, and a content hash. This keeps provenance and lets you re-run normalisation without re-crawling.
Python example (Playwright + BeautifulSoup)
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import hashlib, time

def fetch(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        response = page.goto(url, timeout=30000)  # goto returns the main Response
        html = page.content()
        status = response.status if response else None
        browser.close()
    fetched_at = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
    content_hash = hashlib.sha256(html.encode('utf-8')).hexdigest()
    return {'url': url, 'html': html, 'status': status,
            'fetched_at': fetched_at, 'content_hash': content_hash}
# then parse the html with BeautifulSoup to extract fields
Important: use rotating proxies, respect robots.txt and rate limits. Capture raw HTML in object storage (S3, GCS) and store a pointer in your dataset rows.
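One way to keep raw HTML out of your table while preserving a stable pointer is to derive the object-store key from the content hash. A minimal sketch, assuming an S3-style layout (the `raw-html` prefix and key shape are illustrative choices, not a fixed convention):

```python
import hashlib

def raw_html_key(content_hash: str, prefix: str = "raw-html") -> str:
    """Derive a deterministic object-store key from the content hash.

    Sharding by the first two hex chars keeps any one "directory" small,
    and identical pages naturally deduplicate to the same key.
    """
    return f"{prefix}/{content_hash[:2]}/{content_hash}.html"

# The dataset row stores only this pointer, not the HTML itself.
h = hashlib.sha256(b"<html>...</html>").hexdigest()
pointer = raw_html_key(h)
# The upload itself would be e.g. boto3's
# s3.put_object(Bucket='my-scrapes', Key=pointer, Body=html)
```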
Node.js example (Playwright + Cheerio)
const { chromium } = require('playwright');
const crypto = require('crypto');

async function fetchPage(url) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { timeout: 30000 });
  const html = await page.content();
  await browser.close();
  return {
    url,
    html,
    fetched_at: new Date().toISOString(),
    content_hash: crypto.createHash('sha256').update(html).digest('hex')
  };
}
Step 2 — Transform & normalise: the critical difference
Scraped text is inconsistent. Normalisation converts heterogeneous strings into reliable, typed columns that tabular models and analytics pipelines can consume. This includes:
- Date parsing — multiple formats, timezones, and relative strings.
- Numeric parsing — currencies, thousands separators, negatives written in parentheses (e.g. (1,200) for -1200).
- Categorical canonicalisation — map synonyms and typos to canonical categories.
- Unit standardisation — sizes, weights, speeds converted to consistent units.
- Provenance & confidence — store source, parsing confidence, and raw value.
Python: example normaliser using pandas
import pandas as pd
from dateutil import parser as date_parser
import re

def parse_price(s):
    if not s or pd.isna(s):
        return None
    s = re.sub(r'[^\d.,\-]', '', s)  # strip currency symbols, spaces (incl. NBSP) and other noise
    s = s.replace(',', '')           # thousands separators
    try:
        return float(s)
    except ValueError:
        return None

def parse_date(s):
    try:
        return date_parser.parse(s, fuzzy=True).isoformat()
    except (TypeError, ValueError, OverflowError):
        return None

# apply to dataframe
# df['price_normalised'] = df['price_raw'].apply(parse_price)
# df['date_normalised'] = df['date_raw'].apply(parse_date)
Design your normalisers to be deterministic and testable. Package them into functions, keep small mapping tables for categories, and record when a value was coerced.
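One pattern for recording coercions is to return the parsed value together with the raw input and a flag. A minimal sketch; the `NormalisedValue` shape here is an assumption for illustration, not a fixed API:

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class NormalisedValue:
    value: Optional[float]   # typed value, or None if unparseable
    raw: str                 # original string, kept for provenance
    coerced: bool            # True if the input had to be stripped or repaired

def normalise_price(raw: str) -> NormalisedValue:
    cleaned = re.sub(r'[^\d.,\-]', '', raw).replace(',', '')
    try:
        return NormalisedValue(float(cleaned), raw, coerced=(cleaned != raw))
    except ValueError:
        return NormalisedValue(None, raw, coerced=True)
```

Because the function is deterministic, unit tests are trivial: `normalise_price('£1,299.00')` yields `value=1299.0` with `coerced=True`, and the raw string survives for auditing.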
Handling high-cardinality categorical fields
For columns with thousands of unique values (e.g., product SKUs, vendor names), store the canonical string plus a mapped ID when possible. Tabular foundation models prefer stable string tokens but many downstream ML pipelines need numeric encodings — provide both.
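A minimal sketch of providing both forms is a stable hash-derived ID alongside the canonical string. The 64-bit truncation below is an assumption for illustration; a managed ID registry is the more rigorous option:

```python
import hashlib

def stable_id(canonical: str) -> int:
    """Map a canonical string to a stable 64-bit integer ID.

    Unlike Python's built-in hash(), this is stable across processes
    and runs, so the mapping can be reproduced at any time.
    """
    digest = hashlib.sha256(canonical.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big')

# The row carries both: the string token for tabular models,
# the numeric encoding for downstream ML pipelines.
row = {'vendor_name': 'Acme Ltd'}
row['vendor_id'] = stable_id(row['vendor_name'])
```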
Step 3 — Schema design & validation
Before saving a single CSV row, define a formal schema. A schema ensures consistent types, units, constraints and is essential for auditing and model training.
What to include in your schema
- Column name and descriptive title.
- Type (integer, number, string, datetime, boolean).
- Format hints (iso-datetime, uri, email).
- Constraints — required, min/max, regex patterns.
- Unit & semantic tags (currency: GBP, unit: metres).
- Provenance fields — source_url, fetched_at, content_hash, scrape_id.
- Confidence — parser confidence or a quality score.
Frictionless Table Schema example (JSON)
{
  "fields": [
    {"name": "id", "type": "integer", "constraints": {"required": true}},
    {"name": "title", "type": "string"},
    {"name": "price_gbp", "type": "number", "unit": "GBP"},
    {"name": "available_at", "type": "datetime", "format": "iso"},
    {"name": "source_url", "type": "string", "format": "uri"},
    {"name": "fetched_at", "type": "datetime", "format": "iso"},
    {"name": "content_hash", "type": "string"}
  ]
}
Use the frictionless-py toolkit in Python to validate and fix tables, or Great Expectations for richer assertions and integrations with data warehouses.
Validate in Python with frictionless and pandera
from frictionless import validate
from pandera import DataFrameSchema, Column, Check

# frictionless quick validation against the Table Schema
report = validate('data.csv', schema='schema.json')
print(report)

# Pandera for stricter runtime checks on the DataFrame itself
schema = DataFrameSchema({
    'id': Column(int, Check.greater_than(0), nullable=False),
    'price_gbp': Column(float, nullable=True),
    'available_at': Column(object, nullable=True),
})
validated = schema.validate(df)
Step 4 — Storage & table design for ClickHouse
When your table is validated, store it in a columnar format (Parquet) for long-term storage and push it to ClickHouse for fast analytical queries and model feature materialisation. ClickHouse is especially good for high-ingest, low-latency workloads and can serve as a feature store for tabular models.
Recommended ClickHouse table design
- Use MergeTree engines for time-series or incremental ingestion.
- Partition by date (fetched_at) for efficient pruning.
- Use native numeric types (UInt64, Float64, DateTime) for numeric features.
- Use LowCardinality(String) for medium-cardinality categorical columns; it dictionary-encodes values and compresses well.
- Keep provenance columns (source_url, content_hash) as Nullable(String).
ClickHouse DDL example
CREATE TABLE IF NOT EXISTS web_data.products (
id UInt64,
title String,
price_gbp Float64,
available_at DateTime,
source_url String,
fetched_at DateTime,
content_hash String,
parser_confidence Float32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(fetched_at)
ORDER BY (id, fetched_at);
Ingesting from Python
from clickhouse_driver import Client
import pandas as pd

client = Client('clickhouse-host')

# convert dataframe types to match the DDL, then insert as a list of dicts
records = df.to_dict('records')
client.execute(
    'INSERT INTO web_data.products VALUES',
    records,
    types_check=True,  # fail fast on type mismatches instead of inserting bad data
)
For high-volume ingestion, use ClickHouse's native bulk formats (TabSeparated, Native) or the HTTP interface for streaming writes. ClickHouse's scalability makes it a natural backend for feature stores and for serving tabular datasets to model training pipelines.
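For the HTTP interface, the core work is serialising rows into a bulk format. A hedged sketch of building a TabSeparated payload (the host, table, and column names are placeholders; the POST itself goes to ClickHouse's HTTP port, typically 8123):

```python
def to_tsv(rows, columns):
    """Serialise dict rows into ClickHouse TabSeparated format.

    Tabs, newlines, and backslashes inside values must be escaped,
    and NULL is written as \\N.
    """
    def cell(v):
        if v is None:
            return r'\N'
        return str(v).replace('\\', '\\\\').replace('\t', '\\t').replace('\n', '\\n')
    return '\n'.join('\t'.join(cell(r[c]) for c in columns) for r in rows) + '\n'

rows = [{'id': 1, 'title': 'Widget', 'price_gbp': 9.99}]
payload = to_tsv(rows, ['id', 'title', 'price_gbp'])
# The streaming write would then be e.g.:
# requests.post('http://clickhouse-host:8123/',
#               params={'query': 'INSERT INTO web_data.products (id, title, price_gbp) '
#                                'FORMAT TabSeparated'},
#               data=payload.encode('utf-8'))
```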
Quality gates & monitoring
Automate checks at the end of each ETL run:
- Row counts and delta from previous runs.
- Null percentage per column (alert if > threshold).
- Value distribution checks (min/max outliers, new categories).
- Schema drift detection — new columns or type changes.
- Downstream sampling — load sample rows into a notebook to sanity check before training.
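The first two checks above can be sketched with pandas; the thresholds here are illustrative defaults, not recommendations:

```python
import pandas as pd

def quality_gate(df: pd.DataFrame, prev_row_count: int,
                 max_null_pct: float = 0.05, max_delta_pct: float = 0.5):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    # Row-count delta vs the previous run
    if prev_row_count and abs(len(df) - prev_row_count) / prev_row_count > max_delta_pct:
        failures.append(f'row count moved from {prev_row_count} to {len(df)}')
    # Null percentage per column
    for col, pct in df.isna().mean().items():
        if pct > max_null_pct:
            failures.append(f'{col}: {pct:.1%} null')
    return failures

df = pd.DataFrame({'id': [1, 2, 3], 'price_gbp': [9.99, None, None]})
print(quality_gate(df, prev_row_count=3))  # flags the null-heavy price column
```

Fail the ETL run loudly when the list is non-empty, rather than publishing a degraded dataset.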
Integrate with observability
Emit metrics to Prometheus or DataDog: scraped_pages, parse_failures, validation_errors, ingest_latency. Store failed rows for manual remediation. In 2026, teams increasingly attach lineage metadata to each row to meet audit and compliance requirements for AI models.
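At its simplest, the counters can be accumulated in-process and rendered in Prometheus' text exposition format. The metric names below mirror the ones listed above; in production you would use the official prometheus_client library rather than this sketch:

```python
from collections import Counter

metrics = Counter()

def record(metric: str, n: int = 1):
    metrics[metric] += n

# During the run:
record('scraped_pages', 100)
record('parse_failures', 3)
record('validation_errors', 1)

def to_prometheus_text(metrics: Counter) -> str:
    """Render counters as 'name value' lines, Prometheus text-format style."""
    return '\n'.join(f'etl_{name}_total {value}' for name, value in sorted(metrics.items()))

print(to_prometheus_text(metrics))
```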
Advanced strategies for tabular foundation readiness
- Column metadata — include column descriptions, expected ranges, cardinality estimates and tags. Tabular models and feature stores can ingest this to choose encoders and imputation strategies.
- Missingness handling — record explicit missingness codes and impute only downstream. Some tabular foundation models perform better when missingness is encoded as a separate boolean column.
- Sampling and balancing — create stratified samples for training to preserve rare categories without exploding dataset size.
- Privacy-preserving transforms — tokenise or hash identifiers, apply k-anonymity or differential privacy where needed for PII.
- Feature materialisation — compute moving averages, time-windowed aggregates in ClickHouse for fast training data extraction.
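The missingness pattern from the list above can be sketched with pandas, adding an explicit boolean flag per column and deferring imputation downstream:

```python
import pandas as pd

def add_missingness_flags(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Add a boolean <col>_missing column for each listed column.

    Imputation is left to the downstream consumer; the flags preserve
    the information that a value was absent at scrape time.
    """
    out = df.copy()
    for col in columns:
        out[f'{col}_missing'] = out[col].isna()
    return out

df = pd.DataFrame({'price_gbp': [9.99, None], 'title': ['Widget', 'Gadget']})
flagged = add_missingness_flags(df, ['price_gbp'])
```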
Node.js pipeline snapshot: validation and CSV output
const { writeToPath } = require('@fast-csv/format');
const Ajv = require('ajv');

const ajv = new Ajv();
const schema = { /* JSON Schema equivalent of the Table Schema */ };
const validate = ajv.compile(schema);

function validateRow(r) {
  const ok = validate(r);
  return { ok, errors: validate.errors };
}

// write validated rows to CSV (`rows` is your array of validated row objects)
writeToPath('data.csv', rows, { headers: true });
Common pitfalls and how to avoid them
- Relying on fragile CSS selectors — prefer structural extraction with XPath or text heuristics and maintain mappings per site.
- Writing CSV without a schema — leads to type drift and silent failures. Always create and validate against a schema before publishing a dataset.
- Discarding provenance — you’ll need raw HTML and fetch metadata for debugging and legal audits.
- Ignoring character encodings — normalise to UTF-8 early to avoid later corruption.
- Skipping unit tests — normalisation functions must have deterministic tests with edge cases (empty, malformed, locale variants).
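The unit-test discipline from the last bullet, sketched as pytest-style assertions against a price parser like the one earlier in this guide (pandas dropped here so the block is self-contained):

```python
import re

def parse_price(s):
    if not s:
        return None
    s = re.sub(r'[^\d.,\-]', '', s).replace(',', '')
    try:
        return float(s)
    except ValueError:
        return None

def test_parse_price_edge_cases():
    # empty, malformed, and locale/format variants
    assert parse_price('') is None            # empty input
    assert parse_price('call us') is None     # no digits at all
    assert parse_price('£1,299.00') == 1299.0 # symbol + thousands separator
    assert parse_price('-5') == -5.0          # plain negative
    assert parse_price('£1 299') == 1299.0    # space-grouped digits

test_parse_price_edge_cases()
```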
Case study: building a product price dataset for tabular models
Scenario: you need a high-quality table of product prices across UK retailers. Key requirements: GBP price (Float), retailer (canonical ID), product_title (string), scraped_at (datetime), url, and parser_confidence. Steps:
- Extract pages with Playwright and store HTML blobs.
- Parse candidate price strings and canonicalise currency — convert all to GBP using a recent FX rate snapshot.
- Map retailer names using a small lookup table to canonical IDs.
- Validate each row using frictionless and log failures to Sentry for manual review.
- Save validated data to Parquet and batch-insert into ClickHouse partitioned by month.
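The currency canonicalisation step can be sketched with a pinned FX snapshot. The rates and date below are illustrative placeholders, not real data; in production, load a dated snapshot from your FX provider and record its date in provenance:

```python
# Pinned FX snapshot: rate = GBP per 1 unit of the source currency.
# Values are placeholders for illustration only.
FX_SNAPSHOT = {
    'date': '2026-01-01',
    'rates_to_gbp': {'GBP': 1.0, 'EUR': 0.85, 'USD': 0.79},
}

def to_gbp(amount: float, currency: str) -> float:
    # KeyError on an unknown currency = fail loudly rather than guess
    rate = FX_SNAPSHOT['rates_to_gbp'][currency]
    return round(amount * rate, 2)

row = {'price_raw': 100.0, 'currency': 'EUR'}
row['price_gbp'] = to_gbp(row['price_raw'], row['currency'])
row['fx_snapshot_date'] = FX_SNAPSHOT['date']  # provenance for auditability
```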
Result: a stable table that feeds both business dashboards and tabular foundation models for pricing prediction and elasticity analysis.
Future trends to design for in 2026 and beyond
- Tabular foundation models will expect standardized column metadata; pipelines that provide rich metadata will be preferred.
- Hybrid ingestion — real-time and batch pipelines converge: ClickHouse and similar OLAP engines will be used as both feature stores and model serving backends.
- Model governance — automated lineage, schema registries and validation will become compliance requirements for enterprise AI.
- Tooling consolidation — expect more integrations between scraping stacks and data quality tools (Great Expectations, Frictionless) as the industry matures.
"Structured, high-quality tables are the bedrock of scalable enterprise AI in 2026. Without reliable schemas and provenance, models fail or produce unverifiable results."
Actionable checklist
- Define a Frictionless Table Schema before scraping a single page.
- Store raw HTML and fetch metadata for every row.
- Implement deterministic parsers with unit tests for dates, prices and categories.
- Validate every output file with automated checks and fail loudly on schema drift.
- Ingest validated tables into ClickHouse with partitioning and LowCardinality where appropriate.
Final notes on compliance and ethics
Always check terms of service and robots.txt. For UK/EU customers, consider GDPR obligations when you store PII. Apply pseudonymisation where needed and maintain an access control policy for dataset exports used for model training.
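A common pseudonymisation sketch is keyed hashing of identifiers, so the mapping cannot be reversed without the secret. The key handling below is an assumption for illustration; in practice the key lives in a secrets manager and never in the dataset:

```python
import hashlib
import hmac

def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256).

    Deterministic for the same key, so joins across tables still work,
    but irreversible without the key.
    """
    return hmac.new(secret_key, identifier.encode('utf-8'), hashlib.sha256).hexdigest()

key = b'load-from-secrets-manager'  # placeholder, never hard-code in production
token = pseudonymise('user@example.com', key)
```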
Takeaways
Turning HTML into high-quality tabular datasets is a multidisciplinary task: robust extraction, deterministic normalisation, formal schema validation, and careful storage design. In 2026, teams that invest in these engineering practices will unlock real value from tabular foundation models and build datasets that are auditable, reusable, and production-ready.
Next steps (practical)
- Draft a minimal Frictionless Table Schema for your target dataset.
- Implement one deterministic parser (price or date) with tests.
- Run a small end-to-end: scrape 100 pages, normalise, validate and ingest into ClickHouse.
Ready to build a production pipeline? If you want a runnable starter repo (Python + ClickHouse + schema examples and a Playwright scraper) tailored to your site, request a customised blueprint and we’ll produce code and a deployment plan you can run in 48 hours.