WebCrawler Crawl activity

Introduction

A WebCrawler Crawl activity, using its WebCrawler connection, crawls websites and is intended to be used as a target to consume data in an operation.

Create a WebCrawler Crawl activity

An instance of a WebCrawler Crawl activity is created from a WebCrawler connection using its Crawl activity type.

To create an instance of an activity, drag the activity type to the design canvas or copy the activity type and paste it on the design canvas. For details, see Creating an activity instance in Component reuse.

An existing WebCrawler Crawl activity can be edited from the design canvas or from the project pane.

Configure a WebCrawler Crawl activity

Follow these steps to configure a WebCrawler Crawl activity:

  • Step 1: Enter a name and specify settings
    Provide a name for the activity and configure settings including the seed website URLs, output content format, crawl depth and page limits, crawl delay, URL filter logic, CSS selector tag list, path restriction, meta tag retrieval, and error handling.

  • Step 2: Review the data schemas
    Any request or response schemas are displayed.

Step 1: Enter a name and specify settings

In this step, provide a name for the activity and configure settings including the seed website URLs, output content format, crawl depth and page limits, crawl delay, URL filter logic, CSS selector tag list, path restriction, meta tag retrieval, and error handling. Each user interface element of this step is described below.

Tip

Fields with a variable icon support using global variables, project variables, and Jitterbit variables. Begin either by typing an open square bracket [ into the field or by clicking the variable icon to display a list of the existing variables to choose from.

  • Endpoint menu: If you have multiple endpoints of the same connector type configured, a menu at the top of the screen displays the current endpoint name. Click the menu to switch to a different endpoint. For more information, see Change the assigned endpoint in Configuration screens.

    • Edit endpoint: Appears when you hover over the current endpoint name. Click to edit the currently selected endpoint's connection configuration.
  • Name: Enter a name to identify the activity. The name must be unique for each WebCrawler Crawl activity and must not contain forward slashes / or colons :.

  • Website URLs: Click the add icon to add a row to the table and enter a URL for each seed URL to use as a starting point for the crawl.

    To save the row, click the submit icon in the rightmost column.

    To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.

    To delete all rows, click Clear All.

  • Output content format: Select the output content format to be used, either Text or HTML.

  • Maximum depth: Enter the link depth for a crawl (max_depth). The default value is 1; there is no upper limit on the depth you can enter.

  • Maximum pages: Enter the maximum number of pages to retrieve during a crawl (items_limit). The default value is 10.

  • Crawl delay (ms): Enter the crawl delay in milliseconds. The default value is 5.
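The depth, page, and delay settings above interact during a crawl. The following sketch illustrates that interaction with a breadth-first traversal; it is not Jitterbit's implementation, and the `get_links` callback (which stands in for fetching and parsing a page) is a hypothetical name used only for illustration.

```python
import time
from collections import deque

def crawl(seed_urls, get_links, max_depth=1, items_limit=10, crawl_delay_ms=5):
    """Breadth-first crawl sketch honoring max_depth, items_limit, and delay.

    get_links(url) returns the URLs linked from a page (a stand-in for
    fetching and parsing; this helper is illustrative, not Jitterbit's API).
    """
    visited = []
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)  # (url, depth from seed)
    while queue and len(visited) < items_limit:
        url, depth = queue.popleft()
        visited.append(url)                  # "retrieve" the page
        time.sleep(crawl_delay_ms / 1000)    # pause between requests
        if depth < max_depth:                # only follow links within depth
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited

# Tiny in-memory site: the seed links to two pages, one of which links deeper.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/deep"],
}
pages = crawl(["https://example.com/"], lambda u: site.get(u, []),
              max_depth=1, items_limit=10, crawl_delay_ms=5)
# With max_depth=1, the depth-2 page /a/deep is never queued.
```

With Maximum depth set to 1, only the seed pages and the pages they link to directly are retrieved; raising Maximum pages has no effect once the depth limit exhausts the queue.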

  • Regex URL filter logic: Select the regex filter mode to apply to URLs discovered during the crawl:

    • Include: Only crawl URLs that match the patterns specified in Include regex URLs:

      • Include regex URLs: Click the add icon to add a row to the table and enter an Include URL for each regular expression pattern to match against discovered URLs. Only URLs matching at least one pattern are crawled.

        To save the row, click the submit icon in the rightmost column.

        To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.

        To delete all rows, click Clear All.

    • Exclude: Skip URLs that match the patterns specified in Exclude regex URLs:

      • Exclude regex URLs: Click the add icon to add a row to the table and enter an Exclude URL for each regular expression pattern to match against discovered URLs. URLs matching any pattern are skipped during the crawl.

        To save the row, click the submit icon in the rightmost column.

        To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.

        To delete all rows, click Clear All.
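The Include and Exclude modes described above can be sketched with standard regular expressions. This minimal example is only an illustration of the filter logic, not the connector's implementation; the function name and its parameters are assumptions.

```python
import re

def filter_urls(urls, patterns, mode):
    """Apply Include/Exclude regex logic to discovered URLs (sketch).

    mode="include": keep only URLs matching at least one pattern.
    mode="exclude": drop URLs matching any pattern.
    """
    compiled = [re.compile(p) for p in patterns]
    if mode == "include":
        return [u for u in urls if any(p.search(u) for p in compiled)]
    return [u for u in urls if not any(p.search(u) for p in compiled)]

found = [
    "https://example.com/blog/post-1",
    "https://example.com/login",
    "https://example.com/blog/post-2?print=1",
]
filter_urls(found, [r"/blog/"], "include")    # keeps only the two /blog/ URLs
filter_urls(found, [r"\?print="], "exclude")  # drops the printable-view URL
```

Note that a pattern such as `/blog/` matches anywhere in the URL; anchor patterns (for example `^https://example\.com/blog/`) when you need to match from the start of the URL.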

  • Tag list (CSS selectors): Click the add icon to add a row to the table and enter a CSS selector Tag List for each page element to exclude from the scraped output. Use standard CSS selector syntax to target specific elements (for example, .ads or footer).

    To save the row, click the submit icon in the rightmost column.

    To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.

    To delete all rows, click Clear All.
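The effect of excluding elements by CSS selector can be sketched as follows. This example handles only the two selector forms shown above (a bare tag name like `footer` and a class selector like `.ads`); a real implementation supports full CSS selector syntax, and this class is an illustration, not the connector's code.

```python
from html.parser import HTMLParser

class ElementStripper(HTMLParser):
    """Drop elements matching simple CSS selectors: 'tag' or '.class' (sketch)."""

    def __init__(self, selectors):
        super().__init__()
        self.selectors = selectors
        self.out = []        # text kept in the scraped output
        self.skip_depth = 0  # > 0 while inside an excluded element

    def _matches(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        return any(
            s == tag or (s.startswith(".") and s[1:] in classes)
            for s in self.selectors
        )

    def handle_starttag(self, tag, attrs):
        # Once inside an excluded element, count nesting so we know
        # when its matching end tag closes the excluded region.
        if self.skip_depth or self._matches(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

page = '<div>Article text<div class="ads">Buy now!</div></div><footer>(c) 2024</footer>'
p = ElementStripper([".ads", "footer"])
p.feed(page)
"".join(p.out)  # → 'Article text'
```

The advertisement block and the footer are removed, leaving only the article text in the output.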

  • Restrict to path: Select to restrict the crawl to URLs that share the same path prefix as the seed URL. For example, if the seed URL is https://example.com/blog/, only URLs under /blog/ are crawled.
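The path restriction described above amounts to a host and path-prefix check, which can be sketched like this (the function name is a hypothetical for illustration):

```python
from urllib.parse import urlparse

def within_seed_path(seed_url, candidate):
    """True if candidate shares the seed URL's host and path prefix (sketch)."""
    seed, cand = urlparse(seed_url), urlparse(candidate)
    return cand.netloc == seed.netloc and cand.path.startswith(seed.path)

seed = "https://example.com/blog/"
within_seed_path(seed, "https://example.com/blog/2024/post")  # True: under /blog/
within_seed_path(seed, "https://example.com/about")           # False: outside /blog/
```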

  • Retrieve meta tags: Select to retrieve meta tags during the crawl.

  • Continue on error: Select to continue the activity execution if an error is encountered for a dataset in a batch request. If any errors are encountered, they are written to the operation log.

  • Save & Exit: If enabled, click to save the configuration for this step and close the activity configuration.

  • Next: Click to temporarily store the configuration for this step and continue to the next step. The configuration will not be saved until you click the Finished button on the last step.

  • Discard Changes: After making changes, click to close the configuration without saving changes made to any step. A message asks you to confirm that you want to discard changes.

Step 2: Review the data schemas

Any request or response schemas are displayed. Each user interface element of this step is described below.

  • Data schema: These data schemas are inherited by adjacent transformations and are displayed again during transformation mapping.

    Note

    Data supplied in a transformation takes precedence over the activity configuration.

  • Refresh: Click the refresh icon or the word Refresh to regenerate schemas from the WebCrawler endpoint. This action also regenerates a schema in other locations throughout the project where the same schema is referenced, such as in an adjacent transformation.

  • Back: Click to temporarily store the configuration for this step and return to the previous step.

  • Finished: Click to save the configuration for all steps and close the activity configuration.

  • Discard Changes: After making changes, click to close the configuration without saving changes made to any step. A message asks you to confirm that you want to discard changes.

Next steps

After configuring a WebCrawler Crawl activity, complete the configuration of the operation by adding and configuring other activities, transformations, or scripts as operation steps. You can also configure the operation settings, which include the ability to chain operations together that are in the same or different workflows.

Menu actions for an activity are accessible from the project pane and the design canvas. For details, see Activity actions menu in Connector basics.

WebCrawler Crawl activities can be used as a target with these operation patterns:

To use the activity with scripting functions, write the data to a temporary location and then use that temporary location in the scripting function.

When ready, deploy and run the operation and validate behavior by checking the operation logs.