WebCrawler Scrape Page activity
Introduction
A WebCrawler Scrape Page activity, using its WebCrawler connection, scrapes a page and is intended to be used as a target to consume data in an operation.
Create a WebCrawler Scrape Page activity
An instance of a WebCrawler Scrape Page activity is created from a WebCrawler connection using its Scrape Page activity type.
To create an instance of an activity, drag the activity type to the design canvas or copy the activity type and paste it on the design canvas. For details, see Creating an activity instance in Component reuse.
An existing WebCrawler Scrape Page activity can be edited from these locations:
- The design canvas (see Component actions menu in Design canvas).
- The project pane's Components tab (see Component actions menu in Project pane Components tab).
Configure a WebCrawler Scrape Page activity
Follow these steps to configure a WebCrawler Scrape Page activity:
- Step 1: Enter a name and specify settings. Provide a name for the activity and configure settings including the website URL, output content format, CSS selector tag list, metadata inclusion, and error handling.
- Step 2: Review the data schemas. Any request or response schemas are displayed.
Step 1: Enter a name and specify settings
In this step, provide a name for the activity and configure settings including the website URL, output content format, CSS selector tag list, metadata inclusion, and error handling. Each user interface element of this step is described below.
Tip
Fields with a variable icon support using global variables, project variables, and Jitterbit variables. Begin either by typing an open square bracket [ into the field or by clicking the variable icon to display a list of the existing variables to choose from.
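To illustrate how a bracket-style variable reference in a field value resolves, the Python sketch below mimics the substitution. This is not the connector's implementation, and the variable names (base_url, page) are invented for the example:

```python
import re

# Invented example variables; in the product these would be global,
# project, or Jitterbit variables defined elsewhere.
variables = {"base_url": "https://example.com", "page": "news"}

def resolve(field_value, variables):
    """Replace each [name] reference with the named variable's value."""
    return re.sub(r"\[(\w+)\]", lambda m: variables[m.group(1)], field_value)

print(resolve("[base_url]/[page]", variables))  # https://example.com/news
```

A field value such as `[base_url]/[page]` would thus resolve at runtime to the concatenated URL.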
- Endpoint menu: If you have multiple endpoints of the same connector type configured, a menu at the top of the screen displays the current endpoint name. Click the menu to switch to a different endpoint. For more information, see Change the assigned endpoint in Configuration screens.
- Edit endpoint: Appears when you hover over the current endpoint name. Click to edit the currently selected endpoint's connection configuration.
- Name: Enter a name to identify the activity. The name must be unique for each WebCrawler Scrape Page activity and must not contain forward slashes (/) or colons (:).
- Website URL: Enter the URL of the page to scrape.
- Output content format: Specify the output content format to be used, one of Text or HTML.
- Tag list (CSS selectors): Click the add icon to add a row to the table and enter a CSS selector for each page element to exclude from the scraped output. Use standard CSS selector syntax to target specific elements (for example, .ads or footer). To save the row, click the submit icon in the rightmost column.
To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.
To delete all rows, click Clear All.
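To show what excluding elements by selector means in practice, here is a minimal stdlib Python sketch that drops subtrees matching class selectors before extracting text. It is an illustration of the concept, not the connector's implementation, and it handles only class selectors (no void-element bookkeeping):

```python
from html.parser import HTMLParser

class ExcludingTextExtractor(HTMLParser):
    """Collect page text while skipping any subtree whose class is in
    the exclusion list (class selectors only, e.g. '.ads')."""
    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.skip_depth = 0          # > 0 while inside an excluded element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or any(c in self.excluded for c in classes):
            self.skip_depth += 1     # track nesting inside the excluded subtree

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

# Invented sample markup for the demonstration.
html = ('<div><p>Article body</p>'
        '<div class="ads">Buy now!</div>'
        '<footer class="footer">Legal</footer></div>')
parser = ExcludingTextExtractor(["ads", "footer"])
parser.feed(html)
print(" ".join(parser.parts))  # Article body
```

With `.ads` and `.footer` excluded, only the article text survives; this mirrors the effect of listing those selectors in the Tag list.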
- Include metadata: Select to scrape metadata found in the page.
- Continue on error: Select to continue the activity execution if an error is encountered for a dataset in a batch request. If any errors are encountered, they are written to the operation log.
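The continue-on-error behavior described above can be sketched as a batch loop in which failures are recorded rather than aborting the run. The function and page names below are invented for illustration; they are not part of the product:

```python
def scrape(url):
    """Stand-in for a per-page scrape; fails on a deliberately bad URL."""
    if "bad" in url:
        raise ValueError(f"cannot reach {url}")
    return f"content of {url}"

def run_batch(urls, continue_on_error=True):
    """Scrape each URL; on error, either log and continue or re-raise."""
    results, errors = [], []
    for url in urls:
        try:
            results.append(scrape(url))
        except ValueError as exc:
            errors.append(str(exc))   # in the product, written to the operation log
            if not continue_on_error:
                raise
    return results, errors

results, errors = run_batch(["a", "bad-page", "c"])
print(len(results), len(errors))  # 2 1
```

With the option enabled, the failed page is logged and the remaining pages are still processed; with it disabled, the first failure stops the batch.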
- Save & Exit: If enabled, click to save the configuration for this step and close the activity configuration.
- Next: Click to temporarily store the configuration for this step and continue to the next step. The configuration will not be saved until you click the Finished button on the last step.
- Discard Changes: After making changes, click to close the configuration without saving changes made to any step. A message asks you to confirm that you want to discard changes.
Step 2: Review the data schemas
Any request or response schemas are displayed. Each user interface element of this step is described below.
- Data schema: These data schemas are inherited by adjacent transformations and are displayed again during transformation mapping.
Note
Data supplied in a transformation takes precedence over the activity configuration.
- Refresh: Click the refresh icon or the word Refresh to regenerate schemas from the WebCrawler endpoint. This action also regenerates a schema in other locations throughout the project where the same schema is referenced, such as in an adjacent transformation.
- Back: Click to temporarily store the configuration for this step and return to the previous step.
- Finished: Click to save the configuration for all steps and close the activity configuration.
- Discard Changes: After making changes, click to close the configuration without saving changes made to any step. A message asks you to confirm that you want to discard changes.
Next steps
After configuring a WebCrawler Scrape Page activity, complete the configuration of the operation by adding and configuring other activities, transformations, or scripts as operation steps. You can also configure the operation settings, which include the ability to chain operations together that are in the same or different workflows.
Menu actions for an activity are accessible from the project pane and the design canvas. For details, see Activity actions menu in Connector basics.
WebCrawler Scrape Page activities can be used as a target with these operation patterns:
- Transformation pattern
- Two-transformation pattern (as the first or second target)
To use the activity with scripting functions, write the data to a temporary location and then use that temporary location in the scripting function.
When ready, deploy and run the operation and validate behavior by checking the operation logs.