Reducing Memory Use in Large-Scale JS Scrapers: Patterns and Code Snippets
Practical Node.js + Puppeteer patterns — streaming, lazy DOM parsing and worker pools — to stop memory growth in long-running crawlers.
If your crawler's memory footprint grows over days or weeks, brings servers to their knees, or forces you to buy more RAM (expensive in 2026 as AI demand spikes), this guide is for you. Below are concrete, production-ready patterns, with code, for streaming, lazy DOM parsing, and worker pools that keep long-running Puppeteer crawlers stable and efficient.
Why memory matters in 2026 (and why you should optimise now)
Late 2025 and early 2026 saw memory prices rise because of surging demand for AI training hardware. That hit cloud costs and on-prem capacity planning. For teams running large-scale crawlers, memory leaks or inefficient scraping patterns now translate directly into higher infrastructure bills and more frequent restarts. Efficient memory usage isn't just optimisation — it's cost control and reliability.
High-level patterns to control memory growth
- Reuse browsers, limit pages: A single Chromium instance with many short-lived pages is often better than many browser processes.
- Pool and recycle pages or contexts: Reuse pages to avoid repeated allocation and native leaks; recycle periodically.
- Offload large downloads via streaming: Never buffer large responses into Node buffers when you can stream to disk or S3.
- Lazy DOM parsing: Extract only what you need inside the page context; avoid grabbing full HTML or large blobs.
- Worker isolation: Use child processes or a worker pool to contain leaks and reclaim memory via process restarts.
- Instrument and collect heap snapshots: Track trends and automate restarts when memory grows past thresholds.
1) Browser & page lifecycle: reuse, recycle, and avoid leaks
Puppeteer scripts often create a browser per worker and a page per task. That’s OK, but memory grows when pages accumulate listeners, timers, or unreleased handles. Use a page pool and a periodic recycle strategy.
Page pool (single-process) — pattern
Keep one browser instance, maintain a pool of pages, and reuse pages for tasks. Recreate a page after N uses or when memory metrics grow.
// pagePool.js (simplified)
const puppeteer = require('puppeteer');

class PagePool {
  constructor(opts = {}) {
    this.maxPages = opts.maxPages || 5;
    this.recreateAfter = opts.recreateAfter || 100; // uses before a page is recycled
    this.browser = null;
    this.pool = [];
  }

  async init(launchOpts = {}) {
    this.browser = await puppeteer.launch(launchOpts);
  }

  async acquire() {
    for (;;) {
      const pageData = this.pool.pop();
      if (pageData) return pageData;
      if (this._pagesCount() < this.maxPages) {
        const page = await this.browser.newPage();
        return { page, uses: 0 };
      }
      // all pages are checked out; wait briefly for a release
      await new Promise(r => setTimeout(r, 100));
    }
  }

  async release(pageData) {
    pageData.uses += 1;
    // clean up Node-side event listeners; a common source of leaks
    pageData.page.removeAllListeners();
    if (pageData.uses >= this.recreateAfter) {
      await pageData.page.close();
      const page = await this.browser.newPage();
      this.pool.push({ page, uses: 0 });
    } else {
      this.pool.push(pageData);
    }
  }

  _pagesCount() {
    // browser.targets() is synchronous in Puppeteer
    return this.browser.targets().filter(t => t.type() === 'page').length;
  }

  async close() {
    await Promise.all(this.pool.map(p => p.page.close()));
    await this.browser.close();
  }
}

module.exports = PagePool;
Key notes: call page.removeAllListeners() and close pages you recreate. Node-side listeners are a common source of leaks.
Isolate long-running tasks with incognito contexts
When sessions or site state could persist, use browser.createIncognitoBrowserContext() (renamed to browser.createBrowserContext() in recent Puppeteer releases), run a few pages inside it, then dispose the context to clear cookies and caches in one go.
// quick example
const ctx = await browser.createIncognitoBrowserContext();
const page = await ctx.newPage();
// ... tasks ...
await ctx.close(); // clears context state
2) Streaming large responses — avoid buffering in memory
Downloading large files or images via Puppeteer's response.buffer() easily swamps RAM. Use Node HTTP streaming or the Chromium CDP to stream directly to disk or cloud storage.
Preferred: fetch resource URL with native Node stream
When you can get a direct URL (images, PDFs, CSVs), fetch with node's HTTP(s) stream or undici and pipe to disk.
const fs = require('fs');
const { pipeline } = require('stream/promises');
const { request } = require('undici');

async function streamToFile(url, destPath) {
  const { statusCode, body } = await request(url);
  if (statusCode >= 400) {
    await body.dump(); // drain the body so the connection can be reused
    throw new Error(`download failed: HTTP ${statusCode}`);
  }
  // pipeline propagates errors from either side and handles backpressure
  await pipeline(body, fs.createWriteStream(destPath));
}
This avoids copying the whole body into Node buffers. Use an S3 streaming uploader for cloud storage.
When you must get network body from the page: use CDP streaming
Sometimes Puppeteer’s high-level API isn't enough and you must drop down to the DevTools Protocol. Enable the Fetch domain and handle response bodies by requestId to avoid buffering huge payloads in memory.
// CDP fetch example (simplified)
const client = await page.target().createCDPSession();
await client.send('Fetch.enable', { patterns: [{ requestStage: 'Response' }] });
client.on('Fetch.requestPaused', async ev => {
  const { requestId } = ev;
  // Fetch.getResponseBody buffers the whole body as base64;
  // for large bodies use Fetch.takeResponseBodyAsStream instead
  const resp = await client.send('Fetch.getResponseBody', { requestId });
  await client.send('Fetch.continueRequest', { requestId });
});
CDP approaches are advanced and brittle across Chromium versions; prefer node-side streaming when possible.
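When a body really must come through CDP, the protocol's Fetch.takeResponseBodyAsStream plus IO.read lets you consume it in chunks instead of one base64 blob. A minimal sketch, assuming a CDP session object; the helper name is ours:

```javascript
// Sketch: stream a paused response body chunk-by-chunk via CDP.
// `client` is any CDP session; `writable` is any sink with write()/end().
async function streamBodyViaCDP(client, requestId, writable) {
  const { stream } = await client.send('Fetch.takeResponseBodyAsStream', { requestId });
  let eof = false;
  while (!eof) {
    const chunk = await client.send('IO.read', { handle: stream });
    // IO.read may return raw text or base64-encoded bytes
    writable.write(chunk.base64Encoded ? Buffer.from(chunk.data, 'base64') : chunk.data);
    eof = chunk.eof;
  }
  await client.send('IO.close', { handle: stream });
  writable.end();
}
```

Because only one chunk is in flight at a time, peak memory stays bounded regardless of body size.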
3) Lazy DOM parsing — only serialize what you need
Many scrapers call page.content() or grab big JSON blobs. That increases memory. Instead, run lightweight selectors inside the page and return minimal serializable values.
Prefer page.$eval / page.$$eval over page.evaluate + querySelectorAll copies
// Good: runs in browser and returns small data
const title = await page.$eval('h1.title', el => el.textContent.trim());

// For lists use $$eval and map to primitives
const items = await page.$$eval('.product', nodes =>
  nodes.map(n => ({
    id: n.getAttribute('data-id'),
    price: n.querySelector('.price')?.textContent.trim(),
  }))
);
These methods execute in the browser and only send back small JSON results across the DevTools channel. Avoid returning DOM nodes (ElementHandles) unless you explicitly dispose them.
Dispose ElementHandles and JSHandles
If you use page.$ or page.evaluateHandle, call handle.dispose() to release the remote object.
const handle = await page.$('.huge-list');
// do something
await handle.dispose(); // important
Avoid page.content() for large pages
page.content() sends the entire HTML back to Node and will multiply memory usage under concurrency. If you need specific parts, query them inside the page and return only strings or objects.
4) Worker pools and process isolation
A single Node process is convenient but leaks and native memory fragmentation can accumulate. Use a worker pool pattern where each worker is a separate process that launches a browser (or shares one via remote connect). Restart workers periodically to reclaim memory.
Why processes, not threads
Puppeteer uses native Chromium and can leak native memory that Node GC won't free. Child processes or containers allow the OS to reclaim memory fully on exit.
Simple worker pool with child_process.fork
// master.js
const { fork } = require('child_process');
const poolSize = require('os').cpus().length;

const workers = [];
for (let i = 0; i < poolSize; i++) {
  workers.push(fork('./worker.js'));
}

// round-robin dispatch
let idx = 0;
function dispatch(task) {
  workers[idx].send(task);
  idx = (idx + 1) % workers.length;
}

// worker.js (separate file)
process.on('message', async (task) => {
  const result = await doTask(task); // launches its own browser/page or connects
  process.send({ id: task.id, result });
});

// periodic restart (in master); in production, drain in-flight tasks first
setInterval(() => {
  const w = workers.shift();
  workers.push(fork('./worker.js'));
  w.kill(); // OS reclaims all memory, including native Chromium allocations
}, 1000 * 60 * 60); // restart every hour (tune as needed)
Restart cadence should be informed by metrics (see below). Restarting every X tasks or Y minutes provides a practical balance between throughput and memory safety.
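The round-robin dispatch above is fire-and-forget; in practice the master usually needs the result back. A sketch of correlating replies by task id, assuming workers reply with the { id, result } shape shown in worker.js (the Dispatcher class is ours):

```javascript
// Sketch: promise-based dispatch that matches worker replies to tasks by id.
class Dispatcher {
  constructor(workers) {
    this.workers = workers;
    this.idx = 0;
    this.pending = new Map(); // task id -> resolve fn
    for (const w of workers) {
      w.on('message', msg => {
        const resolve = this.pending.get(msg.id);
        if (resolve) {
          this.pending.delete(msg.id);
          resolve(msg.result);
        }
      });
    }
  }

  dispatch(task) {
    return new Promise(resolve => {
      this.pending.set(task.id, resolve);
      this.workers[this.idx].send(task);
      this.idx = (this.idx + 1) % this.workers.length;
    });
  }
}
```

The pending map also gives you a natural drain point: before killing a worker, wait until none of its task ids remain unresolved.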
5) Observability: measure memory and automate responses
Visibility matters. Expose both Node and Chromium memory metrics and make decisions based on them.
Node-level memory sampling
setInterval(() => {
  const m = process.memoryUsage();
  // rss, heapTotal, heapUsed, external
  console.log('mem', m);
}, 60000);
Chromium heap snapshots and GC
Use the DevTools Protocol to trigger GC and take heap snapshots for debugging. In production, prefer HeapProfiler.collectGarbage as part of a health-check.
const client = await page.target().createCDPSession();
await client.send('HeapProfiler.collectGarbage');
// for snapshots (expensive):
await client.send('HeapProfiler.takeHeapSnapshot');
Automate restarts when either Node or Chromium metrics exceed thresholds. For example, if rss > 1GB or Chromium privateBytes grows by 20% over an hour, recycle the worker.
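Those thresholds can be captured in a small predicate the health check calls each tick. The limits below are the illustrative numbers from the text, not recommendations; tune them to your workload:

```javascript
// Sketch: decide when to recycle a worker; limits are illustrative.
const RSS_LIMIT = 1024 ** 3;   // 1 GB Node RSS
const GROWTH_LIMIT = 0.2;      // 20% Chromium growth over the sample window

function shouldRecycle({ rss, chromiumBytes, chromiumBaseline }) {
  if (rss > RSS_LIMIT) return true;
  if (chromiumBaseline > 0 &&
      (chromiumBytes - chromiumBaseline) / chromiumBaseline > GROWTH_LIMIT) {
    return true;
  }
  return false;
}
```

Feed it process.memoryUsage().rss and whatever Chromium memory stat you collect; when it returns true, exit the worker and let the master respawn it.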
6) Practical tips and gotchas
- Disable heavy features you don't need: images, fonts, or unnecessary JS. Enable page.setRequestInterception(true) and abort matching requests.
- Clear intervals/timeouts inside page context: timers set by page scripts remain running. Inject a cleanup script before close or use incognito contexts.
- Avoid global singletons that hold DOM handles: storing ElementHandles or page references in caches will leak memory.
- Run Node with --expose-gc and call global.gc() occasionally in workers if you have detectable JS heap pressure — but don't rely on it to solve native leaks.
- Prefer the new headless mode (headless: 'new') and lean browser flags that reduce overhead; tune --js-flags and Chrome args (e.g. --disable-dev-shm-usage).
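The first tip, blocking heavy resource types, boils down to a small allow/deny decision plus interception wiring. The blocked set below is an example; tune it per site:

```javascript
// Sketch: block heavy resource types during crawling; the set is illustrative.
const BLOCKED_TYPES = new Set(['image', 'font', 'media', 'stylesheet']);

function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Wiring inside a Puppeteer worker:
//   await page.setRequestInterception(true);
//   page.on('request', req =>
//     shouldBlock(req.resourceType()) ? req.abort() : req.continue());
```

Blocking images and fonts alone often cuts both bandwidth and Chromium's per-page memory noticeably, since decoded images live in the renderer's heap.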
7) Example: a production-ready scraping flow
Combine the patterns into a flow: worker pool (processes) → shared browser or per-worker browser → page pool within worker → stream assets with undici → lazy DOM extraction → metrics and recycle on thresholds.
// simplified flow pseudocode
master: spawn N workers
  for each task: send to worker

worker: on startup -> launch browser, create local pagePool
  on message(task):
    page = await pagePool.acquire()
    try {
      await page.goto(task.url, { waitUntil: 'domcontentloaded' });
      const items = await page.$$eval('.item', nodes =>
        nodes.map(n => ({ id: n.dataset.id, assetUrl: n.querySelector('img')?.src })));
      for (const item of items) {
        await streamToS3(item.assetUrl); // stream via node-side undici
      }
      send result back
    } finally {
      await pagePool.release(page)
    }

  // periodic health check inside worker
  if (process.memoryUsage().rss > WORKER_RSS_LIMIT || chromiumMemory > CHROME_LIMIT) {
    process.exit(0) // master restarts worker
  }
8) When to use shared browser vs one browser per worker
Shared browser (connect via WebSocket) can reduce memory footprint because Chromium is a single process. However, a single large browser can become a single point of failure and harder to recycle. Best practice in 2026: use a hybrid approach — a small number of browser pools (e.g., 2–4 browsers per host) and multiple worker processes connecting to them. This balances memory efficiency and recoverability.
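In the hybrid setup, each worker connects to an existing browser with puppeteer.connect({ browserWSEndpoint }) and the master tracks how many workers each browser serves. A sketch of the least-loaded pick; the pool shape here is ours:

```javascript
// Sketch: choose the least-loaded browser endpoint for a new worker.
// Each entry tracks a Chromium WebSocket endpoint and its connection count.
function pickEndpoint(pools) {
  if (pools.length === 0) throw new Error('no browser pools available');
  return pools.reduce((min, p) => (p.connections < min.connections ? p : min));
}

// Worker side (illustrative):
//   const browser = await puppeteer.connect({ browserWSEndpoint: chosen.endpoint });
```

Increment the chosen pool's count on connect and decrement on worker exit, so recycling one browser only disrupts the workers attached to it.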
9) Advanced: native memory debugging and heap analysis
When leaks are subtle, take heap snapshots from Chromium and Node and compare over time. Tools in 2025–2026 have improved: automated diffing of heap snapshots and integrations with observability platforms. Export snapshots and use tools like Chrome DevTools or community tools to find detached DOM trees or retained nodes.
10) Quick checklist before you deploy
- Limit concurrent pages per browser (start small — 3–10).
- Use streaming for all large downloads.
- Use page.$eval / $$eval and dispose handles.
- Run workers as processes and restart periodically.
- Instrument Node and Chromium memory and set automated restarts.
- Test for leaks with long-running load tests (72+ hours) before scaling.
“Measure, then automate safe restarts.” — practical rule-of-thumb for stable scrapers in 2026
Actionable takeaways
- Immediately: stop using page.content() and response.buffer() for large payloads; switch to streaming where possible.
- This week: implement a page pool and ensure JSHandles are disposed.
- This month: move to process-isolated workers with health checks and automated restart logic.
- Ongoing: collect and alert on memory metrics (Node rss, Chromium private bytes) and tune thresholds based on real traffic.
Closing — planning for the future
Memory-conscious design is no longer optional. With memory costs rising in 2026 and modern crawlers running longer and at larger scale, you must treat memory as a first-class performance metric. The patterns above are pragmatic: streaming, lazy DOM parsing, pooled pages, and worker isolation keep crawlers stable, predictable, and cost-efficient.
If you want a ready-to-deploy starter: build a small proof-of-concept that uses a child-process worker pool, a page pool inside each worker, and undici streaming for assets. Run it for 72 hours against a realistic workload, collect memory metrics, and tune restart thresholds.
Get help or share your results
Try these patterns in your stack and report back with metrics (heap trends, rss, throughput). If you need a review of your crawler architecture, reach out for a practical audit that includes a leak-hunt, recommended thresholds, and a restart/recycle policy tuned to your workload.
Ready to reduce memory waste and stabilise your scrapers? Export your crawler metrics (process.memoryUsage and Chromium stats) and run the checklist above — then contact us for a production audit and a customised restart policy tuned to real-world traffic patterns.