Starter Project: Deploying a Raspberry Pi 5 Scraper Node Image (Systemd, Chromium, AI HAT Support)
A production-ready Pi 5 starter image: headless Chromium, systemd auto-updates, and AI HAT+ 2 inference — flash, boot, deploy.
Stop wrestling with flaky scraper nodes: deploy a production-ready Pi 5 image that just works.
If you need reliable, low-cost scraper capacity for dynamic sites but keep losing time to setup, browser reliability, and OS drift, this starter project is for you. In 2026, the Raspberry Pi 5 plus the new AI HAT+ 2 is an affordable edge scraping and inference platform — but only if the OS image, browser, and services are configured for reliability, auto-updates, and secure operation.
Why this starter image matters in 2026
Late 2025 and early 2026 saw two trends converge: better, affordable on-device NPUs (AI HAT+ 2 family) and an increased defensive posture from modern websites (bot detection, fingerprinting, rate limits). That makes a few things essential for production scraper nodes:
- Headless Chromium tuned for stealth and stability — modern anti-bot systems react to incorrect flags, missing fonts, or GPU settings.
- Systemd-managed services and auto-update workflows — unattended nodes must update code, apply security fixes, and reboot gracefully.
- AI HAT+ 2 support for local inference — handle lightweight parsing, OCR, or on-device LLM ranking to reduce network calls and make scraping more resilient.
- Opinionated defaults — security-first user, minimal packages, automated logging and healthchecks.
What you get: the starter image and scripts (opinionated)
This project provides a ready-to-flash Raspberry Pi 5 image and a supporting Git repository with scripts and systemd units. High-level contents:
- Base image: Raspberry Pi OS (64-bit) tuned for Pi 5 and kernel 6.x
- Headless Chromium (stable 2026 build) with patched flags and sandboxing
- Node/Playwright and Python/selenium examples for scraping
- AI HAT+ 2 drivers and a small inference agent (llama.cpp or vendor runtime) + example model pipeline
- Systemd units: scraper.service, updater.service + updater.timer, hat-agent.service, watchdog.service
- Auto-update scripts: git-based code pulls, package updates, and optional A/B image swap scripts
- Security: unprivileged scraper user, firewall (ufw), mandatory logging to /var/log/scraper
Where to get the image and scripts
Clone the starter repository to inspect everything before flashing. The repo contains checksums and an automated flasher script.
git clone https://github.com/webscraper-uk/pi5-scraper-starter.git
cd pi5-scraper-starter
# Inspect files, then download the signed image
./download-image.sh --verify
Flash using Raspberry Pi Imager or dd:
# example: on macOS/Linux (verify target device carefully)
gzip -d pi5-scraper-2026-01.img.gz
sudo dd if=pi5-scraper-2026-01.img of=/dev/sdX bs=4M status=progress conv=fsync
sync
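Before flashing, you can also verify the checksum by hand. The .sha256 filename below is an assumption; the repo documents the exact artefact names:
sha256sum -c pi5-scraper-2026-01.img.gz.sha256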
Key design decisions (why we chose these defaults)
- Raspberry Pi OS 64-bit — best device/vendor support and timely kernel updates for Pi 5's USB-C, PCIe, and NPU bridge.
- Headless Chromium (stable 2026) — chosen for compatibility with Playwright/Puppeteer and wide support for modern JS-heavy sites.
- systemd-first orchestration — systemd services, timers, and unit sandboxing are robust on constrained nodes.
- Git-based auto-updates — deterministic, auditable updates for scraper code (plus package auto-upgrades using unattended-upgrades).
- On-device inference option — where latency, privacy, or bandwidth matters, running lightweight models on the AI HAT+ 2 cuts round trips to cloud APIs.
Installation & first boot: step-by-step
1. Flash and first-boot
- Flash the image (see commands above).
- On first boot the image runs init scripts that:
- Create the unprivileged user scraper.
- Register the device in your fleet (optional — a lightweight registration to your management endpoint).
- Install the AI HAT+ 2 kernel module if connected.
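A few quick checks confirm first boot completed (paths are the image defaults described above):
id scraper                 # unprivileged user created by the init scripts
ls /var/log/scraper        # mandatory log directory
dmesg | grep -i hat        # HAT initialisation messages, if attached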
2. Verify Chromium and Playwright
sudo -u scraper /usr/bin/chromium-browser --headless=new --disable-gpu --remote-debugging-port=9222 &
# Run a sample Playwright test
cd /home/scraper/examples/playwright
npm ci
node run-sample.js
Notes: We pin --headless=new explicitly (the only headless mode left in current Chromium builds); it renders much closer to headed Chrome and improves stealth when combined with proper flags and fonts.
3. Attach and enable AI HAT+ 2
- Connect the AI HAT+ 2 via the recommended PCIe or header interface (refer to your HAT vendor instructions).
- Check kernel modules and device nodes:
lsmod | grep ai_hat
# or
ls /dev | grep ai_hat
The image includes a systemd unit hat-agent.service which starts a small REST/gRPC agent that exposes an inference endpoint on localhost. The agent uses the vendor runtime where available or falls back to llama.cpp for CPU-only inference.
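To confirm the agent is up before wiring it into your pipeline (port 8001 is the image default used in the examples below):
sudo systemctl status hat-agent.service
ss -ltn | grep 8001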
Important systemd units (copy-ready)
scraper.service
[Unit]
Description=Scraper Node
After=network-online.target hat-agent.service
Wants=network-online.target
[Service]
User=scraper
Group=scraper
WorkingDirectory=/home/scraper/app
ExecStart=/usr/bin/node /home/scraper/app/index.js
Restart=on-failure
RestartSec=10
# Sandbox and resource limits
PrivateTmp=yes
ProtectSystem=full
NoNewPrivileges=yes
LimitNOFILE=4096
[Install]
WantedBy=multi-user.target
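The image ships these units preinstalled; if you install or edit them by hand, reload systemd and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable --now scraper.service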
updater.service + updater.timer (auto-update)
[Unit]
Description=Auto-update scraper code
[Service]
Type=oneshot
User=scraper
WorkingDirectory=/home/scraper/app
ExecStart=/usr/local/bin/pi-updater.sh
# pi-updater.sh pulls code, runs migrations, restarts services as needed
# No [Install] section: this oneshot service is activated by updater.timer, not enabled at boot
# Timer unit (updater.timer): run daily or on boot
[Unit]
Description=Run pi-updater daily
[Timer]
OnBootSec=5min
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
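Enable the timer (not the service) and confirm the schedule:
sudo systemctl enable --now updater.timer
systemctl list-timers | grep updater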
pi-updater.sh is opinionated: it runs 'git fetch && git reset --hard origin/main', then installs dependencies and runs a smoke-test script before swapping to the new version. Failures roll back via git reflog or by retaining the previous release directory.
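For reference, here is a minimal sketch of that flow (the smoke-test path and restart mechanism are illustrative assumptions; the shipped script is more thorough):
#!/bin/bash
# Minimal sketch of /usr/local/bin/pi-updater.sh (illustrative, not the shipped script)
set -euo pipefail

APP=/home/scraper/app
cd "$APP"

PREV=$(git rev-parse HEAD)        # remember the current release for rollback
git fetch origin
git reset --hard origin/main
npm ci                             # install pinned dependencies

# Smoke-test before restarting anything; roll back on failure
if ! ./scripts/smoke-test.sh; then # hypothetical test script path
  echo "smoke test failed; rolling back to $PREV" >&2
  git reset --hard "$PREV"
  exit 1
fi

# Restarting units as the scraper user needs a sudoers or polkit rule
sudo systemctl restart scraper.service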
Headless Chromium configuration: flags & tuning
Modern bot detectors examine flags, GPU usage, and even WebGL signatures. Our image ships Chromium with a tuned launcher script that sets secure and stealthy defaults while keeping stability.
#!/bin/bash
# /usr/local/bin/chrome-headless.sh
exec /usr/bin/chromium --no-first-run --no-default-browser-check \
--headless=new --disable-gpu --disable-dev-shm-usage \
--enable-features=NetworkService,VaapiVideoDecoder \
--remote-debugging-address=127.0.0.1 --remote-debugging-port=9222 \
--disable-blink-features=AutomationControlled "$@"
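With the launcher running, a quick probe confirms the DevTools endpoint is answering locally:
curl -s http://127.0.0.1:9222/json/version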
Other recommended measures (applied in the image):
- Install a basic font set to avoid font fingerprint anomalies.
- Use ephemeral user profiles and clear browser state between runs (see the snippet after this list).
- Run Chromium under the unprivileged scraper user and in a PID/namespace sandbox where possible.
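A minimal pattern for the ephemeral-profile point, using the launcher shipped in the image:
# Launch with a throwaway profile, then discard all browser state
PROFILE=$(mktemp -d)
/usr/local/bin/chrome-headless.sh --user-data-dir="$PROFILE" &
CHROME_PID=$!
# ... run the scrape job against the DevTools port ...
kill "$CHROME_PID"
rm -rf "$PROFILE"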
AI HAT+ 2 integration: a practical example
Use the AI HAT for small classification or summarisation tasks to reduce overall scrape traffic. Example flow:
- Scraper fetches page HTML and images using headless Chromium.
- It calls the local hat-agent endpoint: POST /v1/infer with the page text or screenshot.
- The hat-agent returns structured outputs (entities, summary, OCR text) which are stored locally and forwarded to your pipeline.
# Example Python call to hat-agent
import requests
payload = {"type":"ocr","image_base64": "..."}
r = requests.post("http://127.0.0.1:8001/v1/infer", json=payload)
print(r.json())
The starter image includes a tiny benchmark that runs a lightweight model on the HAT and reports throughput. In many cases, moving simple classification on-device cuts cloud costs and improves throughput for high-volume scraping fleets.
Operational best practices & advanced strategies
Fleet registration & monitoring
- On first boot the image optionally registers with your fleet control plane (webhook + device token).
- Ship metrics to your monitoring stack (Prometheus exporter + pushgateway or agent). The image exposes /metrics for CPU, memory, Chromium sessions, and hat-agent latency.
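A quick local probe of the metrics endpoint (the exporter port here is an assumption; check the image docs for the configured value):
curl -s http://127.0.0.1:9100/metrics | head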
Auto-updates with safety
- Use updater.timer to pull code daily.
- Updater runs smoke-tests: starts the new code in a temporary environment and runs a short scrape job against a test endpoint. If the test passes, it swaps symlinks and restarts services.
- Keep revert hooks configured: if healthchecks fail post-update, the revert hook restores the previous release, restarts services, and marks the node for manual inspection.
Anti-detection measures
- Rotate TLS fingerprints by using a small middleware that injects region-appropriate headers and supported ciphers.
- Throttle concurrency per target site using a site-specific policy file in /etc/scraper/policies (example after this list).
- Leverage the AI HAT to do on-device content parsing instead of heavy JavaScript execution for many pages.
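The policy format below is illustrative (field names are hypothetical); the shipped schema is documented in the repo:
# Create a per-site policy
sudo tee /etc/scraper/policies/example.com.conf >/dev/null <<'EOF'
max_concurrency=2
min_delay_ms=1500
respect_retry_after=true
EOF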
Security hardening (opinionated)
- Create an unprivileged user (scraper), disable SSH root logins, and use SSH keys.
- Enable ufw with default deny incoming and explicit allowed ports (SSH; metrics only on localhost or through a VPN); see the commands after this list.
- Run services under systemd sandboxing flags (ProtectSystem, NoNewPrivileges, PrivateTmp).
- Sign and verify images, and keep an A/B image fallback if you run full firmware upgrades across the fleet. See notes on remote attestation and sovereign cloud controls when compliance matters.
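The ufw baseline from the list above, as commands:
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp    # SSH; restrict to a management subnet or VPN where possible
sudo ufw enable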
Troubleshooting common issues
Chromium fails to start or crashes
- Check /var/log/scraper/chrome.log for crash stacks.
- Try disabling --enable-features entries that enable special GPU paths; the Pi 5's GPU stack is still evolving in 2026 kernels.
- Ensure /dev/shm is large enough; the image sets tmpfs to 256MB by default.
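To check and temporarily enlarge it (persist the change via the tmpfs entry in /etc/fstab):
df -h /dev/shm
sudo mount -o remount,size=512M /dev/shm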
AI HAT agent reports device not found
- Check lsmod and dmesg for device initialisation messages.
- Verify vendor runtime is installed: /opt/ai-hat/bin/hat-runtime --version
- If the vendor runtime is unavailable, the daemon falls back to the CPU path (llama.cpp) and logs a warning; check /var/log/scraper/hat.log
2026 trends & future-proofing your scraper nodes
In 2026 the edge compute landscape is maturing. Practical points to consider:
- On-device models will become standard: small LLMs and transformer-based classifiers running on NPUs (like AI HAT+ 2) will be the default way to preprocess and de-duplicate scraped content before sending it upstream.
- Browsers will keep introducing anti-automation signals: plan for continual maintenance of Chromium flags and Playwright adapters. The image's auto-update strategy must include smoke tests to detect breaking changes early.
- Edge OS management will shift to declarative A/B images and remote attestation; adopt signed images and remote health attestation to meet compliance needs.
Practical rule: keep scraping logic declarative and small on-device — use the Pi for fetching, parsing coarse structure, and local inference. Move heavy transformation to centralized pipelines.
Example: end-to-end workflow (code snippets)
- Start Chromium service:
sudo systemctl start scraper.service
sudo systemctl status scraper.service
- Run a sample scraping job that uses the hat-agent for summarisation:
// Node example (simplified)
const puppeteer = require('puppeteer-core');
const axios = require('axios');
(async () => {
  // Connect to the systemd-managed Chromium via its DevTools HTTP endpoint;
  // browserURL resolves the WebSocket endpoint for us.
  const browser = await puppeteer.connect({ browserURL: 'http://127.0.0.1:9222' });
  const page = await browser.newPage();
  await page.goto('https://example.com', { waitUntil: 'networkidle2' });
  const screenshot = await page.screenshot({ encoding: 'base64' });
  // Hand the screenshot to the local hat-agent for OCR
  const r = await axios.post('http://127.0.0.1:8001/v1/infer', { type: 'ocr', image_base64: screenshot });
  console.log(r.data);
  // Disconnect rather than close(): the browser is a shared, long-running service
  await browser.disconnect();
})();
Checklist before deploying at scale
- Verify image checksum and signature for each build.
- Run the included smoke test suite on a staging node.
- Ensure updater.timer is enabled and smoke tests are passing.
- Configure fleet registration and monitoring endpoints.
- Audit and tune Chromium flags vs your target sites.
Actionable takeaways
- Use the provided, opinionated image to cut setup time and reduce drift.
- Make updates safe: use systemd timers, smoke-tests, and revert hooks.
- Leverage AI HAT+ 2 for on-device inference to save bandwidth and speed up pipelines.
- Harden nodes with unprivileged users, systemd sandboxing, and signed images.
Next steps & call-to-action
Ready to try it? Clone the starter repo, inspect the image, and flash one Pi 5 as a staging node. The repo includes detailed docs, a signed image, and example pipelines that integrate with common orchestration systems. If you want bespoke tuning (site-specific Chromium flags, custom hat models, or fleet onboarding scripts), open an issue or pull request — the template is designed to be forked and extended for enterprise fleets.
Download the starter repo and image: https://github.com/webscraper-uk/pi5-scraper-starter
Credits & sources
This starter project is informed by recent hardware and software trends (AI HAT+ 2 hardware acceleration, evolving Chromium headless modes in 2025–2026, and improved edge orchestration best practices). For details on the AI HAT+ 2 and Pi 5 updates, consult vendor documentation and community testing notes included in the repo.
Final note
Scraping at scale in 2026 requires more than scripts — it needs a reproducible, secure image and update strategy. This starter image provides an opinionated platform to move fast without compromising reliability. Flash a Pi 5, run the smoke tests, and you’ll have a resilient scraper node with headless Chromium, systemd orchestration, auto-updates, and AI HAT support within minutes.