
Advanced Crawling Actions

Advanced · 9 minutes

Master the available crawler actions to efficiently extract data from complex websites.


Understanding Crawler Actions

Scrapify allows you to create sophisticated crawler sequences by combining different actions. This tutorial will cover all available actions and how to use them effectively to extract data from even the most complex websites.

Important Note

When creating crawlers, always practice ethical web scraping by respecting robots.txt files, implementing reasonable delays between requests, and avoiding excessive server load. Scrapify provides built-in tools to help you scrape responsibly.

Navigation Actions

Click Button Action

The Click Button action simulates a user clicking on an element on the webpage. This is essential for navigating through pages, submitting forms, or interacting with the website.

  1. When you add a Click Button action, you'll be prompted to select the element to click
  2. The crawler identifies clickable elements (buttons, links, etc.) as you hover over them
  3. Once selected, the element's XPath and other identifying attributes are stored for reliable selection during crawling
  4. This action is useful for navigating through pagination, opening dropdown menus, or clicking on product details
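Scrapify configures this action visually, so you never write code for it. Still, it can help to see what the equivalent browser automation looks like. The sketch below (and the ones later in this tutorial) uses Playwright for Python purely as an illustration; it is not Scrapify's internal engine, and the URL and selectors are placeholders.

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/products")   # placeholder starting URL

        # Click the element the crawler identified, using its stored XPath
        page.click("xpath=//a[@rel='next']")

        # Give the next page a chance to render before the next action runs
        page.wait_for_selector(".product-card")     # placeholder selector
        browser.close()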

Hover Action

The Hover action simulates a user hovering over an element, which can be essential for websites that reveal content or navigation options on hover.

  1. Select any element on the page to hover over
  2. This action is useful for dropdown menus that only appear on hover
  3. Can be combined with a subsequent Click action to navigate multi-level menus
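For illustration, here is a rough Playwright equivalent of hovering to reveal a menu and then clicking an entry that only appears while hovering; the selectors are hypothetical.

    from playwright.sync_api import Page

    def open_submenu_item(page: Page) -> None:
        # Hover over the top-level menu so the hidden submenu is rendered
        page.hover("nav .menu-electronics")                            # placeholder selector
        # Then click an entry that is only visible while hovering
        page.click("nav .menu-electronics .submenu >> text=Laptops")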

Scroll to Element Action

This action scrolls the viewport until the selected element is visible. Useful for websites with lazy-loading content.

  1. Select any element on the page to scroll to
  2. Especially useful for "infinite scroll" websites where content loads as you scroll down
  3. Ensures that elements are in view before attempting to interact with them
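In browser-automation terms this corresponds to scrolling an element into view before touching it; a small sketch with placeholder selectors:

    from playwright.sync_api import Page

    def scroll_to_reviews(page: Page) -> None:
        # Scroll until the target element is inside the viewport
        page.locator("#reviews-section").scroll_into_view_if_needed()   # placeholder selector
        # Lazily loaded children typically appear shortly after the scroll
        page.wait_for_selector("#reviews-section .review")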

Scroll to Bottom Action

This action scrolls the page all the way to the bottom, useful for triggering lazy-loading or infinite scrolling mechanisms.

  1. No element selection needed - simply scrolls to the bottom of the current page
  2. Particularly useful for social media feeds or product listings that load more items when you reach the bottom
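Because new items keep arriving as the page grows, a code equivalent usually scrolls repeatedly until the document height stops changing. A minimal sketch, assuming the site loads items on scroll:

    from playwright.sync_api import Page

    def scroll_to_bottom(page: Page, max_rounds: int = 10) -> None:
        """Keep scrolling to the bottom until the page height stops growing."""
        last_height = 0
        for _ in range(max_rounds):
            page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            page.wait_for_timeout(1500)                    # let lazily loaded items arrive
            height = page.evaluate("document.body.scrollHeight")
            if height == last_height:
                break                                      # nothing new loaded; real bottom reached
            last_height = height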

Input Actions

Enter Text Action

The Enter Text action allows you to input text into form fields, search boxes, or any text input element.

  1. Select an input element on the page
  2. Specify the text you want to enter
  3. This action is essential for search forms, login credentials, or filter inputs
  4. Combine with Click Button to submit forms after entering text
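As an illustration of the enter-then-submit pattern, here is a Playwright sketch; the input and button selectors are placeholders:

    from playwright.sync_api import Page

    def search_for(page: Page, query: str) -> None:
        # Type into the selected input element
        page.fill("input[name='q']", query)          # placeholder selector
        # Submit by clicking the search button, then wait for results to appear
        page.click("button[type='submit']")
        page.wait_for_selector(".search-result")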

Keypress Action

The Keypress action simulates pressing a specific key on the keyboard, which can trigger various website behaviors.

  1. Specify which key to press (e.g., Enter, Escape, Space)
  2. Useful for submitting forms with Enter, closing modals with Escape, or advancing carousels with arrow keys
  3. No element selection is needed as this action applies to the entire page
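In code, a page-level keypress goes through the browser's keyboard API rather than a specific element; for example, dismissing a pop-up with Escape:

    from playwright.sync_api import Page

    def dismiss_popup(page: Page) -> None:
        # The keypress targets the page as a whole; no element selection is required
        page.keyboard.press("Escape")
        page.wait_for_timeout(500)      # brief pause while the overlay animates away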

Timing and Control Actions

Wait Action

The Wait action pauses the crawler for a specified amount of time. This is crucial for several reasons:

  1. Specify a duration in seconds (up to 30 seconds)
  2. Allows time for dynamic content to load after actions like clicking or scrolling
  3. Helps the crawler appear more human-like and avoid detection
  4. Reduces server load by spacing out requests
  5. Essential when navigating single-page applications where content loads dynamically

Pro Tip

Add Wait actions with varying durations (2-5 seconds) between navigation actions to make your crawler behave more like a human user. This both improves reliability by giving pages time to load and helps avoid triggering anti-bot measures on websites.
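In automation code a Wait is simply a timed pause; the randomised variant suggested in the tip can be sketched like this:

    import random
    from playwright.sync_api import Page

    def human_like_wait(page: Page, low: float = 2.0, high: float = 5.0) -> None:
        """Pause for a random duration (in seconds) between navigation actions."""
        page.wait_for_timeout(random.uniform(low, high) * 1000)   # Playwright waits in milliseconds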

Loop Action

The Loop action allows you to repeat a sequence of actions over a set of similar elements or URLs, essential for handling pagination or processing lists of items.

Scrapify supports several loop types:

  • Single Element Loop: Repeatedly perform actions on a single element (useful for clicking "Load More" buttons multiple times)
  • Fixed List Loop: Iterate through a pre-defined list of similar elements (like product cards)
  • Variable List Loop: Dynamically identify and loop through similar elements based on a reference element
  • URL List Loop: Crawl multiple pages from a list of URLs

Loop actions can contain nested actions that will be executed for each iteration of the loop.
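To make the loop types concrete, here is a rough sketch of a Single Element loop and a URL List loop; the button selector, URLs, and nested actions are placeholders:

    from playwright.sync_api import Page

    def click_load_more(page: Page, times: int = 5) -> None:
        """Single Element Loop: click the same "Load More" button several times."""
        for _ in range(times):
            button = page.locator("button.load-more")     # placeholder selector
            if button.count() == 0:
                break                                     # button disappears once everything is loaded
            button.click()
            page.wait_for_timeout(2000)                   # let the new items render

    def crawl_url_list(page: Page, urls: list[str]) -> None:
        """URL List Loop: run the nested actions once for each starting URL."""
        for url in urls:
            page.goto(url)
            page.wait_for_timeout(3000)
            # ...nested actions (clicks, scrapes) go here...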

Data Extraction

Scrape Action

The Scrape action is at the core of data extraction, allowing you to select and extract content from elements on the page.

  1. Select individual elements or tables to extract data from
  2. Use "Similar Elements" mode to automatically identify and scrape similar elements across the page
  3. Choose from different selection strategies for similar elements:
    • Highest Frequency: Select elements with the most common pattern (default)
    • Lowest Frequency: Select elements with less common patterns
    • Longest Path: Prioritize elements with more specific XPaths
    • Shortest Path: Use broader XPaths for selection
    • ClassList: Match elements with identical CSS classes
  4. Specify custom column headers for the extracted data
  5. Extract data from tables with automatic row/column detection
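A rough code analogue of "Similar Elements" scraping is to collect every element matching a shared pattern and pull the same fields out of each. The selectors and column names below are placeholders, and the selection strategies listed above are Scrapify-specific with no direct Playwright equivalent:

    from playwright.sync_api import Page

    def scrape_products(page: Page) -> list[dict]:
        """Extract one row per product card, using custom column headers."""
        rows = []
        cards = page.locator(".product-card")             # placeholder "similar elements" pattern
        for i in range(cards.count()):
            card = cards.nth(i)
            rows.append({
                "name":  card.locator(".title").inner_text(),
                "price": card.locator(".price").inner_text(),
            })
        return rows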

Building Advanced Crawler Workflows

Combining Actions for Complex Navigation

Real-world crawling tasks often require combining multiple actions into a logical sequence. Here's an example workflow for extracting product data across multiple pages:

  1. Page Navigation Setup:
    • Start with a Loop (URL List) to process multiple starting URLs
    • Within the loop, add a Wait action (3 seconds) to ensure page loads completely
  2. Category Selection:
    • Add a Click Button action to select a product category
    • Add a Wait action (2 seconds) for the category page to load
  3. Filter Application:
    • Add a Click Button action to open a filter dropdown
    • Add a Click Button action to select a specific filter option
    • Add a Wait action (3 seconds) for filtered results to load
  4. Data Extraction:
    • Add a Scrape action with Similar Elements enabled to extract all product listings
  5. Pagination:
    • Add a Loop (Single Element) to iterate through the pagination control
    • Inside the loop, add a Click Button action targeting the "Next Page" button
    • Add a Wait action (3 seconds) for the new page to load
    • Add another Scrape action to extract products from each page

Best Practice

When building complex workflows, always test with a small subset of pages first. This allows you to identify and fix any issues before scaling up to the full dataset. Remember to include adequate Wait actions between steps to ensure reliable crawling.

Anti-Detection Strategies

Many websites implement measures to detect and block automated crawlers. Here are strategies to make your crawler more human-like and avoid detection:

  • Variable wait times: Add Wait actions with different durations (2-7 seconds) between actions
  • Natural navigation paths: Include occasional clicks on non-target elements before returning to your main crawling path
  • Scroll actions: Add scroll actions between clicks to simulate human reading behavior
  • Session management: Maintain consistent sessions rather than creating new ones for each request
  • Rate limiting: Limit the number of pages you crawl per minute

Troubleshooting Common Crawler Issues

Even well-designed crawlers can encounter issues. Here are solutions to common problems:

  • Elements not found: Use more robust element identification by combining XPath with other attributes like ID, text content, or class lists
  • Dynamic content not loading: Increase the duration of Wait actions to allow more time for JavaScript rendering
  • Inconsistent scraping results: Try different selection strategies for Similar Elements to find the most reliable pattern
  • Navigation failures: Add conditional checks and retry logic using Loop actions to handle unexpected site behaviors
  • Being blocked or rate-limited: Implement longer Wait times and more human-like browsing patterns
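For the "elements not found" case in particular, one workable pattern is to try several identifying attributes in order and fall back gracefully. A hedged sketch with placeholder selectors:

    from typing import Optional
    from playwright.sync_api import Locator, Page

    def find_next_button(page: Page) -> Optional[Locator]:
        """Try progressively broader selectors until one matches."""
        candidates = [
            "#pagination-next",        # most specific: an element ID (placeholder)
            "a[rel='next']",           # a stable attribute
            "text=Next page",          # visible text as a last resort
        ]
        for selector in candidates:
            locator = page.locator(selector)
            if locator.count() > 0:
                return locator.first
        return None                    # caller can retry, wait longer, or skip the page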

Conclusion

By mastering the various crawler actions in Scrapify and combining them effectively, you can build powerful data extraction workflows for even the most complex websites. Remember to maintain ethical scraping practices by:

  • Respecting robots.txt files and website terms of service
  • Implementing reasonable delays between requests
  • Only extracting publicly available data
  • Limiting the frequency and volume of your crawling

In our next tutorial, we'll cover how to effectively organize and export the data you've collected using Scrapify's data processing features.