WebCrawler connection
Introduction
A WebCrawler connection, created using the WebCrawler connector, allows for the crawling and scraping of information from targeted websites and pages. Once a connection is configured, you can create instances of WebCrawler activities associated with that connection to be used either as sources (to provide data in an operation) or as targets (to consume data in an operation).
Create or edit a WebCrawler connection
A new WebCrawler connection is created using the WebCrawler connector from one of these locations:
- The design component palette's Project endpoints and connectors tab (see Design component palette).
- The Global Endpoints page (see Create a global endpoint in Global Endpoints).
An existing WebCrawler connection can be edited from these locations:
- The design component palette's Project endpoints and connectors tab (see Design component palette).
- The project pane's Components tab (see Component actions menu in Project pane Components tab).
- The Global Endpoints page (see Edit a global endpoint in Global Endpoints).
Configure a WebCrawler connection
Each user interface element of the WebCrawler connection configuration screen is described below.
Tip
Fields with a variable icon support using global variables, project variables, and Jitterbit variables. Begin either by typing an open square bracket [ into the field or by clicking the variable icon to display a list of the existing variables to choose from.
- Connection name: Enter a name to use to identify the connection. The name must be unique for each WebCrawler connection and must not contain forward slashes (/) or colons (:). This name is also used to identify the WebCrawler endpoint, which refers to both a specific connection and its activities.
- Base URL: Enter a base URL to point WebCrawler activities to by default. This URL can be overridden by activity configuration settings.
- Referrer: Enter the URL to use as the HTTP Referer request header. This identifies the origin page of the request.
- Authentication: Select the authentication method to use when connecting to a website, one of API Key, Bearer Token, Basic Auth, or No Auth.
  - API Key: Select this option to authenticate using an API key:
    - Key: Enter the API key header or query parameter name.
    - Value: Enter the API key to use for authentication.
    - Add to: Select where to include the API key in the request, either Headers or Query params.
      - Headers: Includes the API key as a request header with the name set for Key.
      - Query params: Includes the API key as a URL query parameter with the name set for Key.
  - Bearer Token: Select this option to authenticate using a bearer token:
    - Bearer token: Enter the bearer token to use for authentication.
      Important
      Do not include a Bearer prefix when authenticating with a bearer token. The connector automatically adds it to the header when using this authentication method.
  - Basic Auth: Select this option to authenticate using a username and password:
    - Username: Enter the username.
    - Password: Enter the password.
      Important
      Do not include a Basic prefix when authenticating with a username and password. The connector automatically adds it to the header when using this authentication method.
      When using a private agent, additional configuration may be required for basic authentication over HTTPS.
  - No Auth: Select this option if authentication is not required.
- Optional settings: Click to expand additional optional settings:
  - Use Proxy Settings (Private Agent Only): When using a private agent, select this setting to use the private agent's proxy settings.
  - Follow redirects: Select to follow HTTP redirects when the target URL returns a redirect response.
  - Enforce Robot.txt: Select to honor the target website's robots.txt directives. When selected, pages disallowed by robots.txt are not crawled or scraped.
  - SSL certificate verification: Select to verify the SSL certificate of the target server.
  - User Agent: Enter the User-Agent string to include in the request headers when making requests to target websites.
  - Timeout: Enter the request timeout duration in milliseconds.
  - Only applicable when using HTTPS: Select the TLS protocol version to use for HTTPS connections, one of Negotiate, Use TLSv1.3, Use TLSv1.2, Use TLSv1.1, or Use TLSv1.0.
  - Request Headers: Click the add icon to add a row to the table below and enter a Name and Value for each custom request header to include in all requests made through this connection.
    To save the row, click the submit icon in the rightmost column.
    To edit or delete a single row, hover over the rightmost column and use the edit icon or delete icon.
    To delete all rows, click Clear All.
    - Name: Enter the name of the request header.
    - Value: Enter the value of the request header.
  - Send request headers in activity execution: Select to include the connection-level request headers when executing activities associated with this connection.
- Test: Click to verify the connection using the specified configuration. When the connection is tested, the latest version of the connector is downloaded by the agent(s) in the agent group associated with the current environment. This connector supports suspending the download of the latest connector version by using the Disable Auto Connector Update organization policy.
- Save Changes: Click to save and close the connection configuration.
- Discard Changes: After making changes to a new or existing configuration, click to close the configuration without saving. A message asks you to confirm that you want to discard changes.
- Delete: After opening an existing connection configuration, click to permanently delete the connection from the project and close the configuration (see Component dependencies, deletion, and removal). A message asks you to confirm that you want to delete the connection.
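To make the authentication options above more concrete, the following Python sketch shows how each option typically translates into an HTTP request. It is an illustration of standard HTTP conventions, not the connector's actual implementation. The helper name build_auth and its parameters are hypothetical, and the use of the Authorization header for the Bearer Token and Basic Auth options is an assumption based on common practice.

```python
import base64
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_auth(url, method, *, key=None, value=None, add_to="Headers",
               token=None, username=None, password=None):
    """Return (url, headers) for one of the four authentication options.

    Hypothetical helper for illustration only; it is not part of the connector.
    """
    headers = {}
    if method == "API Key":
        if add_to == "Headers":
            # The Key field becomes the header name, Value the header value.
            headers[key] = value
        else:  # "Query params"
            # The Key field becomes a URL query parameter name instead.
            parts = urlsplit(url)
            query = parse_qsl(parts.query, keep_blank_values=True)
            query.append((key, value))
            url = urlunsplit(parts._replace(query=urlencode(query)))
    elif method == "Bearer Token":
        # The "Bearer " prefix is added here, so the token is entered without it.
        headers["Authorization"] = f"Bearer {token}"
    elif method == "Basic Auth":
        # The "Basic " prefix is added here; credentials are base64-encoded.
        raw = f"{username}:{password}".encode("utf-8")
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode("ascii")
    elif method != "No Auth":
        raise ValueError(f"unknown authentication method: {method}")
    return url, headers
```

For example, calling build_auth with method set to "Bearer Token" and token set to "abc" yields an Authorization header of "Bearer abc". This is why entering the prefix in the Bearer token field would duplicate it.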
Next steps
After a WebCrawler connection has been created, you place an activity type on the design canvas to create activity instances to be used either as sources (to provide data in an operation) or as targets (to consume data in an operation).
Menu actions for a connection and its activity types are accessible from the project pane and design component palette. For details, see Actions menus in Connector Basics.
These activity types are available:
- Crawl: Crawls websites and is intended to be used as a target in an operation.
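When Enforce Robot.txt is selected on the connection, pages disallowed by the target website's robots.txt are not crawled or scraped. As a rough illustration of that kind of check, the sketch below uses Python's standard urllib.robotparser module. It is not the connector's implementation. The sample robots.txt content, the ExampleBot/1.0 user agent, and the is_allowed helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these sample rules, a URL under /private/ is reported as disallowed for any user agent, while other paths on the site remain allowed.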