Scraping Pagination, Infinite Scroll, and Load More

Learn how to scrape pagination, infinite scroll, and load more buttons using a practical framework that works across static and dynamic sites.

Pagination, infinite scroll, and load more buttons are three of the most common reasons a scraper works on page one and fails everywhere else. This guide shows how to recognise each pattern, choose the simplest extraction method, and build a scraper that keeps collecting complete results even when the site uses JavaScript, background API calls, or changing front-end frameworks. If you scrape catalogue pages, job boards, search listings, or article archives, this is the part of a web scraping tutorial that usually determines whether your project stays reliable.

Overview

The core problem is simple: the data you want is rarely exposed in a single HTML document. Websites split content across numbered pages, reveal more items after a click, or keep appending results as the user scrolls. The page looks continuous to a person, but under the hood the site may be doing one of several different things.

For scraping, that distinction matters more than the visual design. Two sites can both look like infinite scroll, but one may fetch JSON from a clean API endpoint while the other renders fragments inside a browser session. A “load more” button may just request the next batch with a cursor token, or it may trigger a GraphQL call that changes every session. Traditional pagination may use stable query parameters, or it may hide state inside JavaScript variables.

A practical scraper starts by answering one question: what actually causes the next batch of records to appear?

In most cases, you will be dealing with one of these patterns:

Pagination: separate URLs such as ?page=2, /page/3, or category paths with numbered segments.
Load more button: a button triggers additional requests and appends new items to the same document.
Infinite scroll: scrolling near the bottom triggers more requests automatically.
Hybrid patterns: the site scrolls, then exposes a load more button, or uses paginated API calls behind an infinite scroll interface.

The good news is that the method for handling all three is more consistent than it first appears. Whether you use a Python web scraper with Requests and Beautiful Soup, Scrapy, or browser automation with Playwright or Puppeteer, the same workflow applies:

Inspect the network activity.
Find the underlying data source if possible.
Fall back to browser actions only when necessary.
Track stopping conditions carefully.
Deduplicate and validate records as you go.

If you are new to HTTP parsing, pair this article with our Python Web Scraping Tutorial for Beginners: Requests and Beautiful Soup. If your target is heavily JavaScript-rendered, see How to Scrape JavaScript-Rendered Websites With Playwright.

Core framework

Use this framework whenever you need to scrape multiple pages or collect dynamically loaded content reliably.

1. Identify the navigation pattern, but verify the data path

Do not assume that the visible interaction is the right automation target. Open developer tools and watch the network tab while moving to the next batch of results.

Look for:

XHR or Fetch requests returning JSON
GraphQL requests with variables such as page, offset, or cursor
HTML fragment responses
URL changes in the browser address bar
hidden form fields or tokens

If you can request structured data directly, that is usually more stable and faster than scraping rendered HTML. Browser automation is excellent, but it should not be your first choice when a clean data endpoint already exists.
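Once a candidate request shows up in the network tab, its query string usually hints at how the site advances through results. The sketch below is one rough way to classify a captured URL. The parameter names in the three sets are assumptions for illustration; real sites use many variations, so treat the result as a starting point rather than a verdict.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter names; extend these for the sites you inspect.
CURSOR_KEYS = {"cursor", "after", "next_token"}
OFFSET_KEYS = {"offset", "start", "skip"}
PAGE_KEYS = {"page", "p", "page_number"}


def guess_progression(url: str) -> str:
    """Guess how a captured request advances through the result set."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    keys = {name.lower() for name in query}
    # Check cursors first: some APIs send a cursor alongside a page size.
    if keys & CURSOR_KEYS:
        return "cursor"
    if keys & OFFSET_KEYS:
        return "offset"
    if keys & PAGE_KEYS:
        return "page"
    return "unknown"
```

For a captured URL such as /api/search?query=laptops&offset=48&limit=24 this returns "offset". An "unknown" result often means the state travels in a POST body or a GraphQL variable instead of the query string.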

2. Prefer the simplest workable method

A useful rule is:

Use HTTP requests when the next page is accessible through predictable URLs or network calls.
Use a headless browser when the content only appears after JavaScript execution, scrolling, clicking, or authenticated session logic.

For Python projects, that often means Requests plus Beautiful Soup for standard pagination, and Playwright for JavaScript-heavy interfaces. For Node.js, Puppeteer and Playwright are common choices. If you are comparing tooling, our Selenium vs Playwright vs Puppeteer for Web Scraping article gives a useful overview.

3. Decide how you will advance through the result set

There are three main progression models:

Page numbers: increment page=1,2,3...
Offset and limit: increment offset=0,20,40...
Cursor or token: use a returned cursor from one response in the next request

Cursor-based systems are common in modern applications because they are more flexible than fixed page numbers. They also tend to break simple scripts if you do not capture the exact token the site expects.
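The three models differ only in how the next request is derived from the previous one. Here is a minimal sketch, assuming a response shaped like {"items": [...], "next_cursor": "..."}. Those key names, and the function name next_params, are illustrative and not taken from any particular API.

```python
from typing import Optional


def next_params(model: str, params: dict, response: dict) -> Optional[dict]:
    """Return query parameters for the next request, or None when finished.

    Assumes `response` holds an "items" list and, for cursor APIs,
    a "next_cursor" value (both key names are hypothetical).
    """
    items = response.get("items", [])
    if not items:
        return None  # an empty batch ends every model

    updated = dict(params)
    if model == "page":
        updated["page"] = params.get("page", 1) + 1
    elif model == "offset":
        # Advance by what was actually returned, not by the requested limit.
        updated["offset"] = params.get("offset", 0) + len(items)
    elif model == "cursor":
        cursor = response.get("next_cursor")
        if not cursor:
            return None  # null or missing cursor: nothing left to fetch
        updated["cursor"] = cursor  # pass the token back exactly as received
    else:
        raise ValueError(f"unknown progression model: {model}")
    return updated
```

Advancing the offset by the number of items received, rather than by the limit, is a deliberate choice here: it avoids skipping records when a server returns a short batch.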

4. Define a stop condition before you start

Many scraping bugs come from weak stopping logic. Build one or more of these checks into your scraper:

No next page link exists
The API returns an empty list
The load more button becomes disabled or disappears
The cursor is null or missing
No new item IDs appear after a scroll or click
A maximum page or item limit is reached for safety

That last point matters in production. Even if a site appears endless, your scraper should not be.
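One way to keep these checks from scattering across the scraper is to evaluate them in a single place after every batch. The following is a sketch only: the Batch fields and the default limits are assumptions chosen for illustration, and you would fill them from whatever your scraper observes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Batch:
    item_count: int  # records returned in this batch
    new_ids: int     # records whose ID was not seen earlier in the run
    has_next: bool   # next link, enabled button, or non-null cursor present


def stop_reason(batch: Batch, pages_done: int, items_done: int,
                max_pages: int = 200, max_items: int = 10_000) -> Optional[str]:
    """Return why the run should stop, or None to keep going."""
    if batch.item_count == 0:
        return "empty batch"
    if batch.new_ids == 0:
        return "no new item IDs"
    if not batch.has_next:
        return "no next page, button, or cursor"
    if pages_done >= max_pages:
        return "page limit reached"
    if items_done >= max_items:
        return "item limit reached"
    return None
```

Returning a reason string instead of a bare boolean makes logs more useful: a run that ended on "page limit reached" deserves a closer look than one that ended on "empty batch".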

5. Track identity, not just position

When scraping dynamic websites, item order can change during collection. New records may be inserted at the top, sponsored entries may appear, and lazy-loaded content may render twice. Relying on “the twentieth card on the page” is brittle. Instead, capture a stable identifier for each record where possible:

product URL
listing ID
article slug
job ID
canonical link

Store those IDs in a set or database table during the run. If the next page or next scroll batch contains only duplicates, that is often a useful signal that you are done or that the site is recycling visible results.
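A small helper can make that identity check explicit. This sketch assumes each record is a dict with an optional "id" and a "url" field, which are hypothetical names. It also assumes the URL path alone identifies an item; that does not hold for sites that put the ID in the query string.

```python
from typing import Iterable, Iterator
from urllib.parse import urlsplit


def stable_key(record: dict) -> str:
    """Pick a stable identifier, preferring an explicit ID over a URL."""
    if record.get("id") is not None:
        return f"id:{record['id']}"
    # Fall back to the URL path so tracking parameters and trailing
    # slashes do not make an old item look new.
    return f"path:{urlsplit(record['url']).path.rstrip('/')}"


def only_new(records: Iterable[dict], seen: set) -> Iterator[dict]:
    """Yield records whose key is not yet in `seen`, updating `seen`."""
    for record in records:
        key = stable_key(record)
        if key not in seen:
            seen.add(key)
            yield record
```

If only_new yields nothing for a whole batch, that is the "only duplicates" signal described above.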

6. Build extraction and navigation as separate steps

This sounds small, but it keeps scrapers maintainable. One function should collect the item data from the current state. Another should move to the next state. That separation makes it easier to debug whether a failure came from selectors, timing, or navigation logic.

In Scrapy, this often means one callback for parsing results and another request for the next page. In Playwright or Puppeteer, it means extracting cards after each successful scroll or click, then checking whether another action is needed.
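Independent of any framework, the split can be sketched as a loop that takes two callables. The names crawl, extract, and advance are illustrative, and the "state" can be whatever your tool uses: a page number, a URL, a cursor, or a browser page object.

```python
from typing import Any, Callable, Iterator, Optional


def crawl(start: Any,
          extract: Callable[[Any], list],
          advance: Callable[[Any], Optional[Any]],
          max_steps: int = 100) -> Iterator[Any]:
    """Alternate between extracting from the current state and advancing."""
    state = start
    steps = 0
    while state is not None and steps < max_steps:
        yield from extract(state)  # selector and parsing failures surface here
        state = advance(state)     # navigation and timing failures surface here
        steps += 1


# Usage with an in-memory stand-in for a paginated site.
PAGES = {1: ["a", "b"], 2: ["c"], 3: []}
records = list(crawl(
    1,
    extract=lambda n: PAGES[n],
    advance=lambda n: n + 1 if PAGES.get(n + 1) else None,
))
```

Because each callable can be tested on its own, a broken selector and a broken "next" step show up as failures in different places.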

If you want a deeper library comparison before choosing a stack, see Best Python Libraries for Web Scraping: Updated Comparison and Best Node.js Libraries for Web Scraping and Browser Automation.

Practical examples

These examples focus on the decision-making process rather than site-specific selectors, so you can adapt them to new targets.

Example 1: Standard pagination with Requests and Beautiful Soup

This is the cleanest case for a Python scraping workflow. You inspect the category page and notice the URL changes from ?page=1 to ?page=2. The HTML already contains the data you need.

Your approach:

Request the first page.
Parse all result cards.
Look for the next page link or increment the page parameter.
Stop when there is no next page or the results are empty.

This method is ideal for article archives, forums, and some ecommerce categories. It is lightweight, cheap to run, and easy to schedule with cron.

A minimal Python pattern looks like this:

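import requests
from bs4 import BeautifulSoup

MAX_PAGES = 500  # safety cap so a pagination loop cannot run forever
page = 1
seen = set()

while page <= MAX_PAGES:
    url = f"https://example.com/products?page={page}"
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly instead of parsing an error page
    soup = BeautifulSoup(r.text, "html.parser")

    cards = soup.select(".product-card")
    if not cards:
        break  # empty page: past the end of the results

    new_count = 0
    for card in cards:
        link = card.select_one("a")
        href = link.get("href") if link else None
        if not href or href in seen:
            continue  # skip cards without a link and repeats
        seen.add(href)
        new_count += 1
        # extract fields here

    if new_count == 0:
        break  # only duplicates: the site is recycling results

    next_link = soup.select_one("a[rel='next']")
    if not next_link:
        break  # no next page link

    page += 1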

Even here, note the duplicate check. It protects you from pagination loops and inconsistent result ordering.

Example 2: Load more button with Playwright

You inspect the page and find that the initial HTML only contains the first 24 items. Clicking “Load more” appends more cards without changing the URL. The network panel shows additional fetch requests, but the simplest reliable route is to automate the click and extract after each batch.

Your approach:

Open the page in Playwright.
Extract current items.
Click the load more button if visible.
Wait for either network completion or a rise in item count.
Repeat until the button disappears or no new items are added.

A sketch:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/listings", wait_until="networkidle")

    seen = set()

    while True:
        # Extract everything currently rendered; the seen set skips repeats.
        items = page.locator(".listing-card")
        count_before = items.count()

        for i in range(count_before):
            # .first avoids a strict-mode error when a card holds several links.
            href = items.nth(i).locator("a").first.get_attribute("href")
            if href and href not in seen:
                seen.add(href)
                # extract fields

        button = page.locator("button:has-text('Load more')")
        # Stop when the button is gone, or present but disabled.
        if not button.is_visible() or not button.is_enabled():
            break

        button.click()
        page.wait_for_timeout(1500)  # fixed pause, kept simple for this sketch

        # No growth after the click means nothing new arrived.
        count_after = page.locator(".listing-card").count()
        if count_after <= count_before:
            break

    browser.close()

In a real project, waiting on a specific response or a selector change is better than a fixed sleep. But the pattern is the key: extract, click, verify that new content arrived, repeat.

Example 3: Infinite scroll scraping by monitoring growth

Infinite scroll often looks harder than it is. In many cases, the browser sends predictable API calls while the page scrolls. If you can call that endpoint directly, use it. If not, browser automation works well.

Your approach:

Load the page and extract the first visible batch.
Scroll to the bottom.
Wait until either the item count increases or a loader disappears.
Repeat until several consecutive scrolls produce no new items.

A reliable stop condition is important because some pages keep scrolling while repeating the same records.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com/feed', { waitUntil: 'networkidle' });

  const seen = new Set();
  let stableRounds = 0; // consecutive scrolls that added nothing new

  while (stableRounds < 3) {
    // Collect the links rendered so far.
    const links = await page.locator('.feed-item a').evaluateAll(nodes =>
      nodes.map(n => n.href).filter(Boolean)
    );

    // The Set ignores repeats, so its size only grows for new links.
    const before = seen.size;
    links.forEach(link => seen.add(link));

    // Scroll to the bottom to trigger the next batch, then pause.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(1500);

    if (seen.size === before) {
      stableRounds += 1;
    } else {
      stableRounds = 0;
    }
  }

  console.log([...seen]);
  await browser.close();
})();

Using several stable rounds instead of one avoids stopping too early on slow pages.

Example 4: Hidden API behind a dynamic interface

This is often the best outcome. The site appears to require headless browser scraping, but the network tab reveals a JSON endpoint like:

/api/search?query=laptops&offset=48&limit=24

At that point, you may not need a browser at all. Replaying the API request can make the scraper simpler, faster, and easier to monitor. Watch for:

required headers
session cookies
anti-CSRF tokens
cursor values in the response body

If the API requires a browser-derived token that changes often, browser automation may still be the practical option. But always check the network layer first. It is one of the most useful habits in any scraping project.
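Replaying an endpoint like the one above can be sketched as a short generator. The fetch_json callable is injected so that headers, cookies, and tokens stay in one place. The "items" response key, the function name iter_search, and the default limits are assumptions for illustration; the real response shape has to be read from the network tab.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode


def iter_search(fetch_json: Callable[[str], dict], query: str,
                limit: int = 24, max_items: int = 5000) -> Iterator[dict]:
    """Walk an offset/limit endpoint until it returns an empty batch."""
    offset = 0
    while offset < max_items:  # safety cap on total records
        params = urlencode({"query": query, "offset": offset, "limit": limit})
        payload = fetch_json(f"/api/search?{params}")
        items = payload.get("items", [])  # "items" is an assumed key
        if not items:
            break
        yield from items
        offset += len(items)
```

With Requests, fetch_json could be a small function that calls get on a requests.Session carrying the required headers and cookies, prefixes the site's base URL, and returns response.json().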

For a modern browser-led workflow, our Puppeteer Web Scraping Guide: Extract Data From Modern Web Apps goes deeper into rendered applications.

Common mistakes

Most failures when scraping pagination and infinite scroll come from a small set of avoidable issues.

Scraping the page source too early

If you collect HTML before the page has rendered new content, you will only scrape the initial batch. Wait for a meaningful signal: increased item count, a specific response, or the disappearance of a loader.

Using fixed delays everywhere

sleep(2) is easy, but it is not robust. Slow networks, heavy pages, or intermittent API delays will break that assumption. Prefer event-driven waits where possible.
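Browser tools ship their own event-driven waits (Playwright, for example, has page.wait_for_function and page.expect_response), and those are usually the better choice. Where nothing built in fits, a generic polling helper is one fallback. This is a sketch; the name wait_until and the default timings are illustrative.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `condition` until it is true or `timeout` seconds pass.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A typical condition compares the current item count with the count before the scroll or click. Unlike a fixed sleep, this returns immediately on fast pages and gives slow ones the full timeout.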

Ignoring duplicate records

Dynamic result sets often repeat cards across batches. Without deduplication, you may think the scraper is finding more data when it is just re-reading the same items.

Clicking without checking state

A load more button may be present but disabled. Or it may be covered by a cookie banner. Verify visibility, enabled state, and post-click changes.
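Those pre-click checks can be wrapped in one small function. The sketch below expects an object that behaves like a Playwright Locator, using only its count, is_visible, is_enabled, and click methods. The function name click_if_ready is hypothetical.

```python
def click_if_ready(button) -> bool:
    """Click `button` only when it exists, is visible, and is enabled.

    `button` is expected to behave like a Playwright Locator.
    Returns True if a click was issued, False otherwise.
    """
    if button.count() == 0:
        return False  # the button has been removed from the page
    if not button.is_visible() or not button.is_enabled():
        return False  # present but hidden or disabled
    button.click()
    return True
```

If the selector can match several buttons, pass locator.first to avoid a strict-mode error. Playwright's click also waits until the element can receive pointer events, so an overlay such as a cookie banner usually surfaces as a timeout rather than a silent miss. The post-click check (did the item count grow?) still belongs in the calling loop.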

Missing API calls in the network tab

Many developers jump straight to Playwright when a simpler JSON request would do. The browser is useful, but it should not hide easier paths from you.

Not planning for rate limiting and retries

Even when the extraction logic is correct, repeated page loads or API calls can trigger throttling. Add backoff, reasonable concurrency, and idempotent retries. For larger projects, this becomes part of scraper reliability rather than just parsing logic.
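A retry wrapper with exponential backoff and jitter is one common way to handle throttling. The sketch below retries on any exception to stay short; a real scraper would likely narrow that to timeouts and throttling responses. The name with_retries and the default delays are assumptions, and the wrapped call must be safe to repeat, such as a plain GET.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T], attempts: int = 4,
                 base_delay: float = 1.0, max_delay: float = 30.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying failures with exponential backoff plus jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so parallel workers do not sync up.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The sleep parameter exists so the delay can be replaced in tests; in normal use the default time.sleep applies.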

Assuming one site pattern generalises perfectly

One ecommerce site may paginate by page number, another by offset, another by cursor, and another by a signed request body. Reuse the framework, not the assumptions.

Forgetting that list pages can change mid-run

This is common on news, jobs, and marketplaces. If new items are inserted while you scrape, page 2 may no longer be the same page 2 you expected. Stable IDs and timestamps help you manage this.
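One possible way to combine the two is to freeze a cutoff time when the run starts, skip anything newer, and deduplicate by ID. The sketch assumes each item carries an "id" and a timezone-aware "posted_at" datetime. Both field names are hypothetical, and many sites expose no usable timestamp at all.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Iterator


def snapshot_filter(items: Iterable[dict], run_started: datetime,
                    seen: set) -> Iterator[dict]:
    """Keep items that existed at run start and were not already collected."""
    for item in items:
        if item["posted_at"] > run_started:
            continue  # arrived mid-run; leave it for the next run
        if item["id"] in seen:
            continue  # pushed onto this page from an earlier one
        seen.add(item["id"])
        yield item


# Usage with made-up records.
run_started = datetime.now(timezone.utc)
batch = [
    {"id": "a1", "posted_at": run_started - timedelta(hours=2)},
    {"id": "b2", "posted_at": run_started + timedelta(seconds=5)},
]
seen: set = set()
kept = list(snapshot_filter(batch, run_started, seen))
```

Here only "a1" is kept. Items that were pushed down a page by new arrivals are caught by the ID check, and the new arrivals themselves are picked up on the following run.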

When to revisit

The best scraper for pagination or dynamic content is never truly finished. Revisit your approach when the site changes how it reveals records, when your extraction volume grows, or when your maintenance burden starts to creep up.

In practice, update your scraper when:

the next page URL structure changes
the site replaces page numbers with cursors
a previously public JSON endpoint moves behind authenticated requests
load more interactions become scroll-based
your browser script becomes slow enough that an API-based rewrite is worth it
new anti-bot behaviour appears and your old timing assumptions stop working

A practical review checklist looks like this:

Reinspect the network tab. Sites often change front-end code before they change data endpoints.
Retest your stop conditions. Confirm that your scraper still knows when to end cleanly.
Validate record counts. Compare a manual browse with scraper output to catch silent misses.
Review selectors. Dynamic interfaces change class names often; prefer stable attributes where available.
Reconsider the tool choice. A requests-based scraper may now need Playwright, or a browser workflow may now be replaceable by direct API calls.

If you are maintaining several projects, it helps to keep a short runbook for each target: how navigation works, what request pattern drives the next batch, what the stop condition is, and what fields act as stable identifiers. That documentation saves time the next time the site shifts from pagination to infinite scroll, which happens more often than most teams expect.

The enduring lesson is this: do not think in terms of buttons or scrolling first. Think in terms of state changes and data flow. Once you know what event reveals the next records and what signal proves they arrived, you can usually scrape multiple pages with confidence, whether the site uses classic links, JavaScript buttons, or a modern app shell.

For broader tooling guidance, you may also want to read the Scrapy resources in our Python coverage, along with our browser automation comparisons. But the framework in this guide should stay useful even as interfaces change: inspect the underlying requests, choose the simplest extraction path, track stable identifiers, and stop only when you can prove there is nothing new left to collect.

Code Scrape Hub Editorial
