Reducing Memory Use in Large-Scale JS Scrapers: Patterns and Code Snippets
Practical Node.js + Puppeteer patterns — streaming, lazy DOM parsing and worker pools — to stop memory growth in long-running crawlers.
If your crawler's memory footprint grows over days or weeks, brings servers to their knees, or forces you to buy more RAM (expensive in 2026 as AI demand spikes), this guide is for you. Below are concrete, production-ready patterns, with code, for streaming, lazy DOM parsing, and worker pools that keep long-running Puppeteer crawlers stable and efficient.
Why memory matters in 2026 (and why you should optimise now)
Late 2025 and early 2026 saw memory prices rise because of surging demand for AI training hardware. That hit cloud costs and on-prem capacity planning. For teams running large-scale crawlers, memory leaks or inefficient scraping patterns now translate directly into higher infrastructure bills and more frequent restarts. Efficient memory usage isn't just optimisation — it's cost control and reliability.
High-level patterns to control memory growth
- Reuse browsers, limit pages: A single Chromium instance with many short-lived pages is often better than many browser processes.
- Pool and recycle pages or contexts: Reuse pages to avoid repeated allocation and native leaks; recycle periodically.
- Offload large downloads via streaming: Never buffer large responses into Node buffers when you can stream to disk or S3.
- Lazy DOM parsing: Extract only what you need inside the page context; avoid grabbing full HTML or large blobs.
- Worker isolation: Use child processes or a worker pool to contain leaks and reclaim memory via process restarts.
- Instrument and collect heap snapshots: Track trends and automate restarts when memory grows past thresholds.
1) Browser & page lifecycle: reuse, recycle, and avoid leaks
Puppeteer scripts often create a browser per worker and a page per task. That’s OK, but memory grows when pages accumulate listeners, timers, or unreleased handles. Use a page pool and a periodic recycle strategy.
Page pool (single-process) — pattern
Keep one browser instance, maintain a pool of pages, and reuse pages for tasks. Recreate a page after N uses or when memory metrics grow.
// pagePool.js (simplified)
const puppeteer = require('puppeteer');

class PagePool {
  constructor(opts = {}) {
    this.maxPages = opts.maxPages || 5;
    this.recreateAfter = opts.recreateAfter || 100; // uses before a page is recycled
    this.browser = null;
    this.pool = [];
  }

  async init(launchOpts = {}) {
    this.browser = await puppeteer.launch(launchOpts);
  }

  async acquire() {
    for (;;) {
      const pageData = this.pool.pop();
      if (pageData) return pageData;
      if (this._pagesCount() < this.maxPages) {
        const page = await this.browser.newPage();
        return { page, uses: 0 };
      }
      // all pages are checked out; wait briefly for a release
      await new Promise(r => setTimeout(r, 100));
    }
  }

  async release(pageData) {
    pageData.uses += 1;
    // clean up Node-side event listeners; a common source of leaks
    pageData.page.removeAllListeners();
    if (pageData.uses >= this.recreateAfter) {
      await pageData.page.close();
      const page = await this.browser.newPage();
      this.pool.push({ page, uses: 0 });
    } else {
      this.pool.push(pageData);
    }
  }

  _pagesCount() {
    // browser.targets() is synchronous in Puppeteer
    return this.browser.targets().filter(t => t.type() === 'page').length;
  }

  async close() {
    await Promise.all(this.pool.map(p => p.page.close()));
    await this.browser.close();
  }
}

module.exports = PagePool;
Key notes: call page.removeAllListeners() and close pages you recreate. Node-side listeners are a common source of leaks.
Isolate long-running tasks with incognito contexts
When sessions or site state could persist, use browser.createIncognitoBrowserContext() (renamed to browser.createBrowserContext() in recent Puppeteer releases), run a few pages inside it, then dispose the context to clear cookies and caches in one go.
// quick example
const ctx = await browser.createIncognitoBrowserContext();
const page = await ctx.newPage();
// ... tasks ...
await ctx.close(); // clears context state
2) Streaming large responses — avoid buffering in memory
Downloading large files or images via Puppeteer's response.buffer() easily swamps RAM. Use Node HTTP streaming or the Chromium CDP to stream directly to disk or cloud storage.
Preferred: fetch resource URL with native Node stream
When you can get a direct URL (images, PDFs, CSVs), fetch with node's HTTP(s) stream or undici and pipe to disk.
const fs = require('fs');
const { pipeline } = require('stream/promises');
const { request } = require('undici');

async function streamToFile(url, destPath) {
  const { statusCode, body } = await request(url);
  if (statusCode >= 400) {
    await body.dump(); // drain the body so the connection can be reused
    throw new Error(`download failed: HTTP ${statusCode}`);
  }
  // pipeline propagates errors from either side and handles backpressure
  await pipeline(body, fs.createWriteStream(destPath));
}
This avoids copying the whole body into Node buffers. Use an S3 streaming uploader for cloud storage.
When you must get network body from the page: use CDP streaming
Sometimes Puppeteer’s high-level API isn't enough and you must drop down to the DevTools Protocol. Enable the Fetch domain and handle response bodies by requestId to avoid buffering huge payloads in memory.
// CDP fetch example (simplified)
const client = await page.target().createCDPSession();
await client.send('Fetch.enable', { patterns: [{ requestStage: 'Response' }] });
client.on('Fetch.requestPaused', async ev => {
  const { requestId } = ev;
  // Fetch.getResponseBody buffers the whole body as base64;
  // for large bodies use Fetch.takeResponseBodyAsStream instead
  const resp = await client.send('Fetch.getResponseBody', { requestId });
  await client.send('Fetch.continueRequest', { requestId });
});
CDP approaches are advanced and brittle across Chromium versions; prefer node-side streaming when possible.
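When a body really must come through CDP, the protocol's Fetch.takeResponseBodyAsStream plus IO.read lets you consume it in chunks instead of one base64 blob. A minimal sketch, assuming a CDP session object; the helper name is ours:

```javascript
// Sketch: stream a paused response body chunk-by-chunk via CDP.
// `client` is any CDP session; `writable` is any sink with write()/end().
async function streamBodyViaCDP(client, requestId, writable) {
  const { stream } = await client.send('Fetch.takeResponseBodyAsStream', { requestId });
  let eof = false;
  while (!eof) {
    const chunk = await client.send('IO.read', { handle: stream });
    // IO.read may return raw text or base64-encoded bytes
    writable.write(chunk.base64Encoded ? Buffer.from(chunk.data, 'base64') : chunk.data);
    eof = chunk.eof;
  }
  await client.send('IO.close', { handle: stream });
  writable.end();
}
```

Because only one chunk is in flight at a time, peak memory stays bounded regardless of body size.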
3) Lazy DOM parsing — only serialize what you need
Many scrapers call page.content() or grab big JSON blobs. That increases memory. Instead, run lightweight selectors inside the page and return minimal serializable values.
Prefer page.$eval / page.$$eval over page.evaluate + querySelectorAll copies
// Good: runs in browser and returns small data
const title = await page.$eval('h1.title', el => el.textContent.trim());

// For lists use $$eval and map to primitives
const items = await page.$$eval('.product', nodes =>
  nodes.map(n => ({
    id: n.getAttribute('data-id'),
    price: n.querySelector('.price')?.textContent.trim(),
  }))
);
These methods execute in the browser and only send back small JSON results across the DevTools channel. Avoid returning DOM nodes (ElementHandles) unless you explicitly dispose them.
Dispose ElementHandles and JSHandles
If you use page.$ or page.evaluateHandle, call handle.dispose() to release the remote object.
const handle = await page.$('.huge-list');
// do something
await handle.dispose(); // important
Avoid page.content() for large pages
page.content() sends the entire HTML back to Node and will multiply memory usage under concurrency. If you need specific parts, query them inside the page and return only strings or objects.
4) Worker pools and process isolation
A single Node process is convenient but leaks and native memory fragmentation can accumulate. Use a worker pool pattern where each worker is a separate process that launches a browser (or shares one via remote connect). Restart workers periodically to reclaim memory.
Why processes, not threads
Puppeteer uses native Chromium and can leak native memory that Node GC won't free. Child processes or containers allow the OS to reclaim memory fully on exit.
Simple worker pool with child_process.fork
// master.js
const { fork } = require('child_process');
const poolSize = require('os').cpus().length;

const workers = [];
for (let i = 0; i < poolSize; i++) {
  workers.push(fork('./worker.js'));
}

// round-robin dispatch
let idx = 0;
function dispatch(task) {
  workers[idx].send(task);
  idx = (idx + 1) % workers.length;
}

// worker.js (separate file)
process.on('message', async (task) => {
  const result = await doTask(task); // launches its own browser/page or connects
  process.send({ id: task.id, result });
});

// periodic restart (in master); in production, drain in-flight tasks first
setInterval(() => {
  const w = workers.shift();
  workers.push(fork('./worker.js'));
  w.kill(); // OS reclaims all memory, including native Chromium allocations
}, 1000 * 60 * 60); // restart every hour (tune as needed)
Restart cadence should be informed by metrics (see below). Restarting every X tasks or Y minutes provides a practical balance between throughput and memory safety.
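The round-robin dispatch above is fire-and-forget; in practice the master usually needs the result back. A sketch of correlating replies by task id, assuming workers reply with the { id, result } shape shown in worker.js (the Dispatcher class is ours):

```javascript
// Sketch: promise-based dispatch that matches worker replies to tasks by id.
class Dispatcher {
  constructor(workers) {
    this.workers = workers;
    this.idx = 0;
    this.pending = new Map(); // task id -> resolve fn
    for (const w of workers) {
      w.on('message', msg => {
        const resolve = this.pending.get(msg.id);
        if (resolve) {
          this.pending.delete(msg.id);
          resolve(msg.result);
        }
      });
    }
  }

  dispatch(task) {
    return new Promise(resolve => {
      this.pending.set(task.id, resolve);
      this.workers[this.idx].send(task);
      this.idx = (this.idx + 1) % this.workers.length;
    });
  }
}
```

The pending map also gives you a natural drain point: before killing a worker, wait until none of its task ids remain unresolved.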
5) Observability: measure memory and automate responses
Visibility matters. Expose both Node and Chromium memory metrics and make decisions based on them.
Node-level memory sampling
setInterval(() => {
  const m = process.memoryUsage();
  // rss, heapTotal, heapUsed, external
  console.log('mem', m);
}, 60000);
Chromium heap snapshots and GC
Use the DevTools Protocol to trigger GC and take heap snapshots for debugging. In production, prefer HeapProfiler.collectGarbage as part of a health-check.
const client = await page.target().createCDPSession();
await client.send('HeapProfiler.collectGarbage');
// for snapshots (expensive):
await client.send('HeapProfiler.takeHeapSnapshot');
Automate restarts when either Node or Chromium metrics exceed thresholds. For example, if rss > 1GB or Chromium privateBytes grows by 20% over an hour, recycle the worker.
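Those thresholds can be captured in a small predicate the health check calls each tick. The limits below are the illustrative numbers from the text, not recommendations; tune them to your workload:

```javascript
// Sketch: decide when to recycle a worker; limits are illustrative.
const RSS_LIMIT = 1024 ** 3;   // 1 GB Node RSS
const GROWTH_LIMIT = 0.2;      // 20% Chromium growth over the sample window

function shouldRecycle({ rss, chromiumBytes, chromiumBaseline }) {
  if (rss > RSS_LIMIT) return true;
  if (chromiumBaseline > 0 &&
      (chromiumBytes - chromiumBaseline) / chromiumBaseline > GROWTH_LIMIT) {
    return true;
  }
  return false;
}
```

Feed it process.memoryUsage().rss and whatever Chromium memory stat you collect; when it returns true, exit the worker and let the master respawn it.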
6) Practical tips and gotchas
- Disable heavy features you don't need: images, fonts, or unnecessary JS. Enable page.setRequestInterception(true) and abort matching requests.
- Clear intervals/timeouts inside page context: timers set by page scripts remain running. Inject a cleanup script before close or use incognito contexts.
- Avoid global singletons that hold DOM handles: storing ElementHandles or page references in caches will leak memory.
- Run Node with --expose-gc and call global.gc() occasionally in workers if you have detectable JS heap pressure — but don't rely on it to solve native leaks.
- Prefer the new headless mode (headless: 'new') and lean browser flags that reduce overhead; tune --js-flags and Chrome args (e.g. --disable-dev-shm-usage).
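The first tip, blocking heavy resource types, boils down to a small allow/deny decision plus interception wiring. The blocked set below is an example; tune it per site:

```javascript
// Sketch: block heavy resource types during crawling; the set is illustrative.
const BLOCKED_TYPES = new Set(['image', 'font', 'media', 'stylesheet']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Wiring inside a Puppeteer worker:
//   await page.setRequestInterception(true);
//   page.on('request', req =>
//     shouldBlock(req.resourceType()) ? req.abort() : req.continue());
```

Blocking images and fonts alone often cuts both bandwidth and Chromium's per-page memory noticeably, since decoded images live in the renderer's heap.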
7) Example: a production-ready scraping flow
Combine the patterns into a flow: worker pool (processes) → shared browser or per-worker browser → page pool within worker → stream assets with undici → lazy DOM extraction → metrics and recycle on thresholds.
// simplified flow pseudocode
master: spawn N workers
  for each task: send to worker

worker: on startup -> launch browser, create local pagePool
  on message(task):
    page = await pagePool.acquire()
    try {
      await page.goto(task.url, { waitUntil: 'domcontentloaded' });
      const items = await page.$$eval('.item', nodes =>
        nodes.map(n => ({ id: n.dataset.id, assetUrl: n.querySelector('img')?.src })));
      for (const item of items) {
        await streamToS3(item.assetUrl); // stream via node-side undici
      }
      send result back
    } finally {
      await pagePool.release(page)
    }

  // periodic health check inside worker
  if (process.memoryUsage().rss > WORKER_RSS_LIMIT || chromiumMemory > CHROME_LIMIT) {
    process.exit(0) // master restarts worker
  }
8) When to use shared browser vs one browser per worker
Shared browser (connect via WebSocket) can reduce memory footprint because Chromium is a single process. However, a single large browser can become a single point of failure and harder to recycle. Best practice in 2026: use a hybrid approach — a small number of browser pools (e.g., 2–4 browsers per host) and multiple worker processes connecting to them. This balances memory efficiency and recoverability.
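In the hybrid setup, each worker connects to an existing browser with puppeteer.connect({ browserWSEndpoint }) and the master tracks how many workers each browser serves. A sketch of the least-loaded pick; the pool shape here is ours:

```javascript
// Sketch: choose the least-loaded browser endpoint for a new worker.
// Each entry tracks a Chromium WebSocket endpoint and its connection count.
function pickEndpoint(pools) {
  if (pools.length === 0) throw new Error('no browser pools available');
  return pools.reduce((min, p) => (p.connections < min.connections ? p : min));
}

// Worker side (illustrative):
//   const browser = await puppeteer.connect({ browserWSEndpoint: chosen.endpoint });
```

Increment the chosen pool's count on connect and decrement on worker exit, so recycling one browser only disrupts the workers attached to it.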
9) Advanced: native memory debugging and heap analysis
When leaks are subtle, take heap snapshots from Chromium and Node and compare over time. Tools in 2025–2026 have improved: automated diffing of heap snapshots and integrations with observability platforms. Export snapshots and use tools like Chrome DevTools or community tools to find detached DOM trees or retained nodes.
10) Quick checklist before you deploy
- Limit concurrent pages per browser (start small — 3–10).
- Use streaming for all large downloads.
- Use page.$eval / $$eval and dispose handles.
- Run workers as processes and restart periodically.
- Instrument Node and Chromium memory and set automated restarts.
- Test for leaks with long-running load tests (72+ hours) before scaling.
“Measure, then automate safe restarts.” — practical rule-of-thumb for stable scrapers in 2026
Actionable takeaways
- Immediately: stop using page.content() and response.buffer() for large payloads; switch to streaming where possible.
- This week: implement a page pool and ensure JSHandles are disposed.
- This month: move to process-isolated workers with health checks and automated restart logic.
- Ongoing: collect and alert on memory metrics (Node rss, Chromium private bytes) and tune thresholds based on real traffic.
Closing — planning for the future
Memory-conscious design is no longer optional. With memory costs rising in 2026 and modern crawlers running longer and at larger scale, you must treat memory as a first-class performance metric. The patterns above are pragmatic: streaming, lazy DOM parsing, pooled pages, and worker isolation keep crawlers stable, predictable, and cost-efficient.
If you want a ready-to-deploy starter: build a small proof-of-concept that uses a child-process worker pool, a page pool inside each worker, and undici streaming for assets. Run it for 72 hours against a realistic workload, collect memory metrics, and tune restart thresholds.
Get help or share your results
Try these patterns in your stack and report back with metrics (heap trends, rss, throughput). If you need a review of your crawler architecture, reach out for a practical audit that includes a leak-hunt, recommended thresholds, and a restart/recycle policy tuned to your workload.
Ready to reduce memory waste and stabilise your scrapers? Export your crawler metrics (process.memoryUsage and Chromium stats) and run the checklist above — then contact us for a production audit and a customised restart policy tuned to real-world traffic patterns.