Jitterbit Document Compliance Agent

Overview

Jitterbit provides the Document Compliance Agent to customers through Jitterbit Marketplace. This agent automates compliance-focused document processing by retrieving PDF files from Amazon S3, extracting their text content using optical character recognition (OCR), and applying AI-powered analysis to detect and remove personally identifiable information (PII), helping organizations meet data privacy requirements without manual document review.

The agent retrieves PDF files from a configured source bucket, submits each file for asynchronous OCR processing, and collects the full extracted text across all pages. An LLM then analyzes the extracted text in two passes: once to identify and store PII data, and again to produce a sanitized version with all PII removed. Both output files are written to an SFTP server, the original PDF is archived to a separate S3 bucket, and email notifications are sent to configured recipients after each processing run.

The agent performs the following tasks:

Retrieves a list of PDF files from a configured Amazon S3 source bucket.
Reads each PDF file and submits it to Amazon Textract for asynchronous OCR text extraction.
Collects extracted text across all pages, handling multi-page documents using Textract pagination tokens.
Sends extracted text to an LLM to detect and extract PII data, and writes the output to a file on an SFTP server.
Sends extracted text to the LLM to produce a cleaned version with all PII removed, and writes the output to a file on an SFTP server.
Archives processed PDF files to a configured destination S3 bucket.
Sends email notifications to configured recipients after processing completes.

This document explains how to set up and operate this AI agent. It covers architecture, prerequisites, and steps to install, configure, and operate the AI agent.

AI agent architecture

This AI agent connects Amazon S3, Amazon Textract, an LLM, and an SFTP server to extract and sanitize text from PDF documents. A typical processing run follows these steps:

The Initial Controller operation initializes processing variables and triggers the file listing step.
The agent connects to the configured Amazon S3 source bucket and retrieves a list of available PDF files.
For each file, the agent reads the PDF content from Amazon S3 and submits it to Amazon Textract for asynchronous OCR processing.
The agent waits for the Textract job to complete, retrieves the extracted text using the job ID, and follows pagination tokens until all pages are collected.
The agent sends the full extracted text to the LLM with a PII-detection prompt. The identified PII data is written to a file on the SFTP server.
The agent sends the extracted text to the LLM with a data-cleaning prompt to produce a sanitized version. The cleaned text is written to a file on the SFTP server.
The original PDF is moved to the destination S3 archive bucket.
An email notification is sent to configured recipients confirming processing completion.

Workflow diagram

The following diagram shows the main processing pipeline for the Document Compliance Agent.

--- config: flowchart: padding: 20 nodeSpacing: 80 --- flowchart LR classDef default fill:white, stroke:black, stroke-width:3px, rx:15px, ry:15px JSP@{ shape: hex, label: "
Document Compliance
Agent" } S3SRC[fab:fa-aws
Amazon S3
Source Bucket] TXTRACT[fab:fa-aws
Amazon Textract] LLM[fas:fa-brain
LLM] SFTP[fas:fa-server
SFTP Server] S3ARC[fab:fa-aws
Amazon S3
Archive Bucket] EMAIL[fas:fa-envelope
Email] JSP <-->|1. List and get PDFs| S3SRC JSP <-->|2. OCR request / extracted text| TXTRACT JSP <-->|3. PII detection prompt / PII data| LLM JSP <-->|4. Clean data prompt / cleaned text| LLM JSP -->|5. Write output files| SFTP JSP -->|6. Archive PDF| S3ARC JSP -->|7. Processing notification| EMAIL

Prerequisites

You need the following components to use this AI agent.

Harmony components

You must have a Jitterbit Harmony license with access to the following components:

Jitterbit Studio
Document Compliance Agent purchased as a license add-on

Supported endpoints

The AI agent connects to the following endpoints. You can accommodate other systems by modifying the project's endpoint configurations and workflows.

Large language model (LLM)

The agent uses Amazon Bedrock to access large language models for PII detection and data sanitization. Amazon Bedrock is a managed service that provides access to foundation models from providers including Anthropic, Amazon, and Meta. The project is configured to use Amazon Nova Lite by default. You can substitute another Bedrock-supported model by updating the model ID in the Bedrock activity configuration. You must have an AWS account with Amazon Bedrock access enabled in your region and the selected model enabled.

Amazon S3

The agent uses Amazon S3 as both the PDF source and the archive destination. You must have an AWS account with IAM credentials that have AmazonS3FullAccess permissions and two buckets configured: one for incoming PDF files and one for archiving processed files.

Amazon Textract

The agent uses Amazon Textract for asynchronous OCR extraction from PDF files. Your IAM credentials must include AmazonTextractFullAccess permissions. The source S3 bucket must have a resource policy that allows Amazon Textract to read from it (see Configure AWS resources).

SFTP

The agent writes processed output files (PII data and cleaned text) to an SFTP server. You must have an SFTP server accessible from Jitterbit with valid connection credentials.

Email

The agent sends processing notifications via SMTP email. The default configuration uses Gmail (smtp.gmail.com). You must have a sender email account with SMTP access enabled and, if using Gmail, an app password configured.

Installation, configuration, and operation

Follow these steps to install, configure, and operate this AI agent:

Download and install the project
Configure AWS resources
Configure project variables
Test connections
Deploy the project
Review project workflows
Trigger the project workflows

For troubleshooting guidance, see Troubleshooting.

Download and install the project

Follow these steps to install the Studio project for the AI agent:

Log in to the Harmony portal at https://login.jitterbit.com and open Marketplace.
Locate the AI agent named Document Compliance Agent. To locate the agent, use the search bar or, in the Filters pane under Type, select AI Agent to limit the display to AI agents.
Click the agent's Documentation link to open its documentation in a separate tab. Keep the tab open to refer back to after starting the project.
Click Start Project to open a configuration dialog.

Note

If you have not yet purchased the AI agent, Get this agent is displayed instead. Click it to open an informational dialog, then click Submit to have a representative contact you about purchasing the AI agent.
In the Create a New Project dialog, select an environment where the Studio project will be created, then click Create Project.
After the progress dialog indicates the project is created, use the dialog link Go to Studio or open the project directly from the Studio Projects page.

Configure AWS resources

Before configuring project variables, set up the required AWS resources.

Create an IAM user and access keys

In the AWS Management Console, open IAM and select Users in the left sidebar.
Select an existing user or click Create user to create a new one. Ensure the user will have permissions for Amazon S3, Amazon Textract, and Amazon Bedrock.
Open the user's Security credentials tab, scroll to Access keys, and click Create access key.
Select the appropriate use case, click Next, then copy and store the Access Key ID and Secret Access Key securely. The secret key is shown only once.
On the user's Permissions tab, click Add permissions and attach the following policies: AmazonBedrockFullAccess, AmazonS3FullAccess, AmazonTextractFullAccess.

Create S3 buckets

In the AWS Management Console, open S3 and click Create bucket.
Create the source bucket where PDF files will be placed for processing. Note the bucket name for the AmazonBucket project variable.
Create a second bucket to serve as the archive destination for processed files. Note its name for the DestinationAmazonBucket project variable.
Ensure both buckets are in the same AWS region.

Configure the source bucket policy

Amazon Textract requires read access to the source S3 bucket. Apply the following resource policy to grant that access, replacing <source-bucket-name> with your actual bucket name:

In Amazon S3, select the source bucket and open the Permissions tab.

Under Bucket policy, click Edit and paste the following policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "textract.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<source-bucket-name>",
                "arn:aws:s3:::<source-bucket-name>/*"
            ]
        }
    ]
}

Click Save changes.

Note

This policy grants Amazon Textract read-only access to the source bucket. No write permissions are granted.

Configure project variables

In the Studio project installed from Marketplace, set values for the following project variables.

To configure project variables, use the project's actions menu and select Project Variables to open the configuration drawer.

Amazon Web Services

Variable name	Description
`AmazonS3AccessKey`	AWS access key ID for authenticating Amazon S3 and Textract API calls.
`AmazonS3SecretKey`	AWS secret access key for authenticating Amazon S3 and Textract API calls.
`AWS_Region`	AWS region for Amazon S3, Textract, and Bedrock services (for example, `us-east-2`).
`AmazonBucket`	Name of the S3 source bucket where PDF files are placed for processing.
`DestinationAmazonBucket`	Name of the S3 archive bucket where processed files are moved after parsing.
`Textract_Base_URL`	Base URL for the Amazon Textract API endpoint. Update the region to match your AWS region (for example, `https://textract.us-east-2.amazonaws.com`).

SFTP

Variable name	Description
`sftp_URL`	URL of the SFTP server where processed output files are written.
`sftp_username`	Username for SFTP authentication.
`sftp_password`	Password for SFTP authentication.

Email

Variable name	Description
`From_Email`	Sender email address for processing notification messages.
`To_Email`	Recipient email address for processing notification messages.
`Email_Username`	Username for authenticating with the SMTP email server.
`Email_Password`	App password for the sender email account. For Gmail, generate an app password in your Google Account security settings.
`Email_Server`	SMTP server address for sending email notifications (for example, `smtp.gmail.com`).
`Email_Subject`	Subject line for notification emails.
`EmailMessage`	Body text for notification emails. Leave empty to use the default message.

Test connections

Test the endpoint configurations to verify connectivity using the defined project variable values.

To test connections, go to the design component palette's Project endpoints and connectors tab, hover over each endpoint, and click Test.

Deploy the project

Deploy the Studio project.

To deploy the project, use the project's actions menu and select Deploy.

Review project workflows

The Studio project contains one workflow that implements the Document Compliance Agent processing pipeline.

PDF Parser

Operation	Description
Initial Controller	Initializes processing variables and starts the pipeline.
List Files From Amazon S3	Retrieves a list of available PDF files from the source S3 bucket.
Read Files From Amazon S3	Reads PDF file content from Amazon S3.
Textract Pdf Data	Submits the PDF to Amazon Textract for asynchronous OCR processing.
Get Data from Job Id	Retrieves OCR results from Textract using the job ID.
Get Data from Next Token	Handles multi-page OCR results using Textract pagination tokens.
Prompt Bedrock for PII Data	Sends extracted text to the LLM to detect and extract PII.
Prompt Bedrock for Clean Data	Sends extracted text to the LLM to produce a PII-free version.
Move File to archive	Moves the processed PDF to the destination archive S3 bucket.

Initial Controller

The Initial Controller operation serves as the entry point for the workflow. It runs the Controller Script, which initializes the lineTexts and gv_extractedText variables to a clean state, then triggers the List Files From Amazon S3 operation.

List Files From Amazon S3

The List Files From Amazon S3 operation connects to the configured source S3 bucket using the Amazon S3 adapter and retrieves a list of available PDF file names. The file names are stored in a variable and logged for reference before the workflow proceeds to read each file.

Read Files From Amazon S3

The Read Files From Amazon S3 operation retrieves the binary content of each PDF from S3. The response payload is transformed and stored in a variable for submission to Amazon Textract.

Textract Pdf Data

The Textract Pdf Data operation submits the PDF data to Amazon Textract via HTTP for asynchronous OCR processing. The operation retrieves a job ID from the Textract response, logs it, waits one minute for the job to complete, then triggers the Get Data from Job Id operation.

Get Data from Job Id

The Get Data from Job Id operation sends the job ID to Amazon Textract to retrieve OCR results. The response is transformed to extract line-level text, which is appended to the global extracted-text variable. If a pagination token is present in the response, the operation branches to Get Data from Next Token; otherwise, it proceeds to the PII detection step.

Get Data from Next Token

The Get Data from Next Token operation handles multi-page Textract results by using the pagination token to fetch the remaining page data. Each page's line-level text is appended to the global text variable. The operation continues fetching pages until no more tokens are returned, then branches to the PII detection step.

Prompt Bedrock for PII Data

The Prompt Bedrock for PII Data operation sends the full extracted text to Amazon Nova Lite via Amazon Bedrock with a prompt to detect personally identifiable information. The response is parsed to extract PII data as JSON, which is logged and written to a file on the SFTP server.

Prompt Bedrock for Clean Data

The Prompt Bedrock for Clean Data operation sends the extracted text to Amazon Nova Lite via Amazon Bedrock with a prompt to produce a sanitized version with all PII removed. The response is post-processed using regex to remove any residual sensitive data, and the cleaned text is written to a file on the SFTP server.

Move File to archive

The Move File to archive operation moves the processed PDF from the source S3 bucket to the destination archive bucket, ensuring the file is not reprocessed on subsequent runs.

Trigger the project workflows

To run the Document Compliance Agent, deploy and run the Initial Controller operation. In Studio, hover over the operation and click the Deploy and Run icon in the top-right corner of the operation tile.

To automate the pipeline, configure operation schedules on the Initial Controller operation to run at your preferred frequency.

Troubleshooting

If you encounter issues, review the operation logs for detailed troubleshooting information.

For additional assistance, contact Jitterbit support.