Pagination, infinite scroll, and load more buttons are three of the most common reasons a scraper works on page one and fails everywhere else. This guide shows how to recognise each pattern, choose the simplest extraction method, and build a scraper that keeps collecting complete results even when the site uses JavaScript, background API calls, or changing front-end frameworks. If you scrape catalogue pages, job boards, search listings, or article archives, this is the part of a web scraping tutorial that usually determines whether your project stays reliable.
Overview
The core problem is simple: the data you want is rarely exposed in a single HTML document. Websites split content across numbered pages, reveal more items after a click, or keep appending results as the user scrolls. The page looks continuous to a person, but under the hood the site may be doing one of several different things.
For scraping, that distinction matters more than the visual design. Two sites can both look like infinite scroll, but one may fetch JSON from a clean API endpoint while the other renders fragments inside a browser session. A “load more” button may just request the next batch with a cursor token, or it may trigger a GraphQL call that changes every session. Traditional pagination may use stable query parameters, or it may hide state inside JavaScript variables.
A practical scraper starts by answering one question: what actually causes the next batch of records to appear?
In most cases, you will be dealing with one of these patterns:
- Pagination: separate URLs such as ?page=2, /page/3, or category paths with numbered segments.
- Load more button: a button triggers additional requests and appends new items to the same document.
- Infinite scroll: scrolling near the bottom triggers more requests automatically.
- Hybrid patterns: the site scrolls, then exposes a load more button, or uses paginated API calls behind an infinite scroll interface.
The good news is that the method for handling all three is more consistent than it first appears. Whether you use a Python web scraper with Requests and Beautiful Soup, Scrapy, or browser automation with Playwright or Puppeteer, the same workflow applies:
- Inspect the network activity.
- Find the underlying data source if possible.
- Fall back to browser actions only when necessary.
- Track stopping conditions carefully.
- Deduplicate and validate records as you go.
If you are new to HTTP parsing, pair this article with our Python Web Scraping Tutorial for Beginners: Requests and Beautiful Soup. If your target is heavily JavaScript-rendered, see How to Scrape JavaScript-Rendered Websites With Playwright.
Core framework
Use this framework whenever you need to scrape multiple pages or handle dynamically loaded content reliably.
1. Identify the navigation pattern, but verify the data path
Do not assume that the visible interaction is the right automation target. Open developer tools and watch the network tab while moving to the next batch of results.
Look for:
- XHR or Fetch requests returning JSON
- GraphQL requests with variables such as page, offset, or cursor
- HTML fragment responses
- URL changes in the browser address bar
- hidden form fields or tokens
If you can request structured data directly, that is usually more stable and faster than scraping rendered HTML. Browser automation is excellent, but it should not be your first choice when a clean data endpoint already exists.
2. Prefer the simplest workable method
A useful rule is:
- Use HTTP requests when the next page is accessible through predictable URLs or network calls.
- Use a headless browser when the content only appears after JavaScript execution, scrolling, clicking, or authenticated session logic.
For Python projects, that often means Requests plus Beautiful Soup for standard pagination, and Playwright for JavaScript-heavy interfaces. For Node.js, Puppeteer and Playwright are common choices. If you are comparing tooling, our Selenium vs Playwright vs Puppeteer for Web Scraping article gives a useful overview.
3. Decide how you will advance through the result set
There are three main progression models:
- Page numbers: increment page=1,2,3...
- Offset and limit: increment offset=0,20,40...
- Cursor or token: use a returned cursor from one response in the next request
Cursor-based systems are common in modern applications because they are more flexible than fixed page numbers. They also tend to break simple scripts if you do not capture the exact token the site expects.
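The three progression models can be sketched as small functions that turn the current request parameters into the next ones. This is only an illustration: the parameter names and the `next_cursor` response field are assumptions, and a real site will use its own names.

```python
from typing import Optional


def next_page_params(params: dict) -> dict:
    """Page-number model: bump the page counter by one."""
    return {**params, "page": params.get("page", 1) + 1}


def next_offset_params(params: dict) -> dict:
    """Offset/limit model: advance the offset by one batch."""
    limit = params["limit"]
    return {**params, "offset": params.get("offset", 0) + limit}


def next_cursor_params(params: dict, response: dict) -> Optional[dict]:
    """Cursor model: reuse the token from the last response, or stop."""
    cursor = response.get("next_cursor")  # hypothetical field name
    if not cursor:
        return None  # null or missing cursor means no further batches
    return {**params, "cursor": cursor}
```

Note that only the cursor model needs the previous response. That dependency is why cursor-driven sites cannot be fetched in parallel the way numbered pages sometimes can.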
4. Define a stop condition before you start
Many scraping bugs come from weak stopping logic. Build one or more of these checks into your scraper:
- No next page link exists
- The API returns an empty list
- The load more button becomes disabled or disappears
- The cursor is null or missing
- No new item IDs appear after a scroll or click
- A maximum page or item limit is reached for safety
That last point matters in production. Even if a site appears endless, your scraper should not be.
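One way to keep these checks in a single place is a small function that returns the reason for stopping. The sketch below is a hypothetical helper, not a library API: `BatchResult`, `should_stop`, and the default limit of 500 are all names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    """What one page, click, or scroll produced (illustrative shape)."""
    item_ids: list
    has_next: bool = True  # next link present, button enabled, or cursor not null


def should_stop(batch, seen_ids, pages_done, max_pages=500):
    """Return a reason string if the crawl should end, else None."""
    if pages_done >= max_pages:
        return "safety limit reached"
    if not batch.item_ids:
        return "empty batch"
    if not any(item_id not in seen_ids for item_id in batch.item_ids):
        return "no new item IDs"
    if not batch.has_next:
        return "no next page, button, or cursor"
    return None
```

Call it before adding the batch to `seen_ids`, and still record the batch before breaking out of the loop: the final page usually carries new items even though nothing follows it. Logging the returned reason also makes it easier to tell a clean finish from a premature one.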
5. Track identity, not just position
When scraping dynamic websites, item order can change during collection. New records may be inserted at the top, sponsored entries may appear, and lazy-loaded content may render twice. Relying on “the twentieth card on the page” is brittle. Instead, capture a stable identifier for each record where possible:
- product URL
- listing ID
- article slug
- job ID
- canonical link
Store those IDs in a set or database table during the run. If the next page or next scroll batch contains only duplicates, that is often a useful signal that you are done or that the site is recycling visible results.
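A minimal sketch of identity tracking follows. It assumes each record is a dict with either an `id` field or a `url` field, and that the URL path alone identifies the item once the query string is removed; both are assumptions you would need to check against the real site.

```python
from urllib.parse import urlsplit, urlunsplit


def record_key(record: dict) -> str:
    """Pick a stable identifier: explicit ID first, else a normalised URL."""
    if record.get("id"):
        return f"id:{record['id']}"
    parts = urlsplit(record["url"])
    # Drop query string and fragment so tracking parameters do not create
    # false "new" records. Assumes the path alone identifies the item.
    clean = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path.rstrip("/"), "", "")
    )
    return f"url:{clean}"


def add_new(seen: set, batch: list) -> list:
    """Return only records not seen before, and remember their keys."""
    fresh = []
    for record in batch:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh
```

If `add_new` returns an empty list for a whole batch, that is the "only duplicates" signal described above.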
6. Build extraction and navigation as separate steps
This sounds small, but it keeps scrapers maintainable. One function should collect the item data from the current state. Another should move to the next state. That separation makes it easier to debug whether a failure came from selectors, timing, or navigation logic.
In Scrapy, this often means one callback for parsing results and another request for the next page. In Playwright or Puppeteer, it means extracting cards after each successful scroll or click, then checking whether another action is needed.
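The separation can be expressed as a loop that knows nothing about selectors or clicks. In this sketch, `crawl`, `extract`, and `advance` are hypothetical names: `extract` reads records from the current state, and `advance` returns the next state or `None` when there is nowhere left to go.

```python
from typing import Callable, Iterator, Optional, TypeVar

State = TypeVar("State")


def crawl(
    start: State,
    extract: Callable[[State], list],
    advance: Callable[[State], Optional[State]],
    max_steps: int = 1000,
) -> Iterator[dict]:
    """Yield records from each state, then ask `advance` for the next one."""
    state: Optional[State] = start
    steps = 0
    while state is not None and steps < max_steps:
        yield from extract(state)  # selectors and parsing live here
        state = advance(state)     # URLs, clicks, or scrolls live here
        steps += 1
```

For numbered pages the state might be a page number or URL; for a browser session it might be the page object itself. Because the two callables are independent, you can test `extract` against saved HTML and `advance` against a stub without running a full crawl.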
If you want a deeper library comparison before choosing a stack, see Best Python Libraries for Web Scraping: Updated Comparison and Best Node.js Libraries for Web Scraping and Browser Automation.
Practical examples
These examples focus on the decision-making process rather than site-specific selectors, so you can adapt them to new targets.
Example 1: Standard pagination with Requests and Beautiful Soup
This is the cleanest case for a Python scraping workflow. You inspect the category page and notice the URL changes from ?page=1 to ?page=2. The HTML already contains the data you need.
Your approach:
- Request the first page.
- Parse all result cards.
- Look for the next page link or increment the page parameter.
- Stop when there is no next page or the results are empty.
This method is ideal for article archives, forums, and some ecommerce categories. It is lightweight, cheap to run, and easy to schedule with cron.
A minimal Python pattern looks like this:

    import requests
    from bs4 import BeautifulSoup

    page = 1
    seen = set()  # hrefs already collected, used for deduplication

    while True:
        url = f"https://example.com/products?page={page}"
        r = requests.get(url, timeout=30)
        soup = BeautifulSoup(r.text, "html.parser")

        cards = soup.select(".product-card")
        if not cards:
            break  # empty page: we are past the last page

        new_count = 0
        for card in cards:
            link = card.select_one("a")
            href = link.get("href") if link else None
            if not href or href in seen:
                continue
            seen.add(href)
            new_count += 1
            # extract fields here

        if new_count == 0:
            break  # only duplicates: probably a pagination loop

        next_link = soup.select_one("a[rel='next']")
        if not next_link:
            break  # no next page link

        page += 1

Even here, note the duplicate check. It protects you from pagination loops and inconsistent result ordering.
Example 2: Load more button with Playwright
You inspect the page and find that the initial HTML only contains the first 24 items. Clicking “Load more” appends more cards without changing the URL. The network panel shows additional fetch requests, but the simplest reliable route is to automate the click and extract after each batch.
Your approach:
- Open the page in Playwright.
- Extract current items.
- Click the load more button if visible.
- Wait for either network completion or a rise in item count.
- Repeat until the button disappears or no new items are added.
A sketch:

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/listings", wait_until="networkidle")

        seen = set()
        while True:
            items = page.locator(".listing-card")
            count_before = items.count()
            for i in range(count_before):
                # .first avoids a strict-mode error when a card holds several links
                href = items.nth(i).locator("a").first.get_attribute("href")
                if href and href not in seen:
                    seen.add(href)
                    # extract fields

            button = page.locator("button:has-text('Load more')")
            if not button.is_visible():
                break  # button gone: nothing left to load

            button.click()
            page.wait_for_timeout(1500)  # fixed pause, kept simple for the sketch
            count_after = page.locator(".listing-card").count()
            if count_after <= count_before:
                break  # the click added nothing new

        browser.close()

In a real project, waiting on a specific response or a selector change is better than a fixed sleep. But the pattern is the key: extract, click, verify that new content arrived, repeat.
Example 3: Infinite scroll scraping by monitoring growth
Infinite scroll often looks harder than it is. In many cases, the browser sends predictable API calls while the page scrolls. If you can call that endpoint directly, use it. If not, browser automation works well.
Your approach:
- Load the page and extract the first visible batch.
- Scroll to the bottom.
- Wait until either the item count increases or a loader disappears.
- Repeat until several consecutive scrolls produce no new items.
A reliable stop condition is important because some pages keep scrolling while repeating the same records.

    const { chromium } = require('playwright');

    (async () => {
      const browser = await chromium.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://example.com/feed', { waitUntil: 'networkidle' });

      const seen = new Set();
      let stableRounds = 0;

      while (stableRounds < 3) {
        const links = await page.locator('.feed-item a').evaluateAll(nodes =>
          nodes.map(n => n.href).filter(Boolean)
        );
        const before = seen.size;
        links.forEach(link => seen.add(link));

        await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
        await page.waitForTimeout(1500);

        if (seen.size === before) {
          stableRounds += 1;
        } else {
          stableRounds = 0;
        }
      }

      console.log([...seen]);
      await browser.close();
    })();

Using several stable rounds instead of one avoids stopping too early on slow pages.
Example 4: Hidden API behind a dynamic interface
This is often the best outcome. The site appears to require headless browser scraping, but the network tab reveals a JSON endpoint like:

    /api/search?query=laptops&offset=48&limit=24

At that point, you may not need a browser at all. Replaying the API request can make the scraper simpler, faster, and easier to monitor. Watch for:
- required headers
- session cookies
- anti-CSRF tokens
- cursor values in the response body
If the API requires a browser-derived token that changes often, browser automation may still be the practical option. But always check the network layer first. It is one of the most useful habits in any data extraction tutorial.
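Replaying an offset-based endpoint like the one above could look like the following sketch. It uses only the standard library. The `items` response key, the `iter_api_items` helper, and the 10,000-item safety cap are assumptions for illustration; the real endpoint may need extra headers, cookies, or a different response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(url: str, headers: dict) -> dict:
    """Plain HTTP GET returning parsed JSON (standard library only)."""
    with urlopen(Request(url, headers=headers), timeout=30) as resp:
        return json.load(resp)


def iter_api_items(base_url, query, limit=24, max_items=10_000,
                   headers=None, fetch=fetch_json):
    """Walk an offset/limit endpoint until it returns an empty batch."""
    offset = 0
    while offset < max_items:
        params = urlencode({"query": query, "offset": offset, "limit": limit})
        data = fetch(f"{base_url}?{params}", headers or {})
        items = data.get("items", [])  # "items" is an assumed response key
        if not items:
            break
        yield from items
        offset += limit
```

Passing `fetch` as a parameter keeps the paging logic testable offline: you can hand it a function that returns canned responses and confirm the loop stops where you expect before pointing it at a live site.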
For a modern browser-led workflow, our Puppeteer Web Scraping Guide: Extract Data From Modern Web Apps goes deeper into rendered applications.
Common mistakes
Most failures when scraping pagination and infinite scroll come from a small set of avoidable issues.
Scraping the page source too early
If you collect HTML before the page has rendered new content, you will only scrape the initial batch. Wait for a meaningful signal: increased item count, a specific response, or the disappearance of a loader.
Using fixed delays everywhere
sleep(2) is easy, but it is not robust. Slow networks, heavy pages, or intermittent API delays will break that assumption. Prefer event-driven waits where possible.
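As a tool-agnostic illustration of waiting on a signal rather than a fixed pause, the hypothetical helper below polls an item counter until it grows or a timeout passes. The name `wait_for_growth` and its defaults are illustrative, not part of any library.

```python
import time
from typing import Callable


def wait_for_growth(get_count: Callable[[], int], previous: int,
                    timeout: float = 10.0, interval: float = 0.25) -> bool:
    """Poll until get_count() exceeds `previous`; False if the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_count() > previous:
            return True
        time.sleep(interval)
    return get_count() > previous  # one last check at the deadline
```

With Playwright's sync API, this could be called as `wait_for_growth(lambda: page.locator(".listing-card").count(), count_before)`. Playwright also ships its own event-driven waits, such as `page.wait_for_function` and `page.expect_response`, which are usually the better fit when you are already in a browser session.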
Ignoring duplicate records
Dynamic result sets often repeat cards across batches. Without deduplication, you may think the scraper is finding more data when it is just re-reading the same items.
Clicking without checking state
A load more button may be present but disabled. Or it may be covered by a cookie banner. Verify visibility, enabled state, and post-click changes.
Missing API calls in the network tab
Many developers jump straight into Playwright when a simpler JSON request would do. The browser is useful, but it should not hide easier paths from you.
Not planning for rate limiting and retries
Even when the extraction logic is correct, repeated page loads or API calls can trigger throttling. Add backoff, reasonable concurrency, and idempotent retries. For larger projects, this becomes part of scraper reliability rather than just parsing logic.
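A rough sketch of retry with exponential backoff and jitter is shown below. `RetryableError` and `with_backoff` are hypothetical names: the assumption is that your own fetch function raises `RetryableError` when it sees a throttling response or a transient network failure, and lets other errors propagate.

```python
import random
import time


class RetryableError(Exception):
    """Raised by the caller's fetch function for throttling or transient errors."""


def with_backoff(fetch, retries=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fetch(); on RetryableError wait base_delay * 2**attempt (plus jitter)."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except RetryableError:
            if attempt == retries:
                raise  # out of attempts: surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Retries are only safe when the request is idempotent, which is normally true for read-only page or API fetches. The jitter spreads out retries so that several workers throttled at the same moment do not all come back at once.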
Assuming one site pattern generalises perfectly
One ecommerce site may paginate by page number, another by offset, another by cursor, and another by a signed request body. Reuse the framework, not the assumptions.
Forgetting that list pages can change mid-run
This is common on news, jobs, and marketplaces. If new items are inserted while you scrape, page 2 may no longer be the same page 2 you expected. Stable IDs and timestamps help you manage this.
When to revisit
The best scraper for pagination or dynamic content is never truly finished. Revisit your approach when the site changes how it reveals records, when your extraction volume grows, or when your maintenance burden starts to creep up.
In practice, update your scraper when:
- the next page URL structure changes
- the site replaces page numbers with cursors
- a previously public JSON endpoint moves behind authenticated requests
- load more interactions become scroll-based
- your browser script becomes slow enough that an API-based rewrite is worth it
- new anti-bot behaviour appears and your old timing assumptions stop working
A practical review checklist looks like this:
- Reinspect the network tab. Sites often change front-end code before they change data endpoints.
- Retest your stop conditions. Confirm that your scraper still knows when to end cleanly.
- Validate record counts. Compare a manual browse with scraper output to catch silent misses.
- Review selectors. Dynamic interfaces change class names often; prefer stable attributes where available.
- Reconsider the tool choice. A requests-based scraper may now need Playwright, or a browser workflow may now be replaceable by direct API calls.
If you are maintaining several projects, it helps to keep a short runbook for each target: how navigation works, what request pattern drives the next batch, what the stop condition is, and what fields act as stable identifiers. That documentation saves time the next time the site shifts from pagination to infinite scroll, which happens more often than most teams expect.
The enduring lesson is this: do not think in terms of buttons or scrolling first. Think in terms of state changes and data flow. Once you know what event reveals the next records and what signal proves they arrived, you can usually scrape multiple pages with confidence, whether the site uses classic links, JavaScript buttons, or a modern app shell.
For broader tooling guidance, you may also want to read the Scrapy tutorial-style resources in our Python coverage, along with our browser automation comparisons. But the framework in this guide should stay useful even as interfaces change: inspect the underlying requests, choose the simplest extraction path, track stable identifiers, and stop only when you can prove there is nothing new left to collect.