SOP-202: Processing Legacy PDFs

<aside>

Objective: To provide a standardized procedure for converting batches of external or legacy PDFs (e.g., from partners, scanned archives) into text-searchable files that are ready for ingestion into the ADAM knowledge base pipeline.

Applies To:

Any PDF document that was not created following SOP-101.

Responsible Role:

Data Steward / Content Manager

</aside>

Process Overview

This SOP covers the end-to-end process of taking a collection of raw PDF files, processing them using the company's OCR tool, performing a quality check, and handing them off to the next stage of the data pipeline.

Phase 1: Preparation & Staging

Objective: To organize source files into a dedicated workspace before processing.

Step 1. Collect Source PDFs

Gather all the PDF files for a single ingestion batch (e.g., "Q2 Partner Research Papers," "Archived Grow Guides").

Step 2. Create a Working Folder

On your local machine or a designated network drive, create a new folder. Name this folder using the convention: YYYY-MM-DD_Batch-Description.

Example: 2025-06-23_Partner-Grow-Guides
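
The naming convention can also be generated rather than typed by hand. The following is a minimal sketch, not part of the official tooling: the helper name `batch_folder_name` is hypothetical, and it assumes the description should be reduced to letters and digits joined by hyphens, as in the example above.

```python
import re
from datetime import date


def batch_folder_name(description: str, day: date) -> str:
    """Build a folder name in the form YYYY-MM-DD_Batch-Description.

    Hypothetical helper: the SOP only defines the naming pattern,
    not how the name is produced.
    """
    # Keep runs of letters/digits and join them with hyphens,
    # so "Partner Grow Guides" becomes "Partner-Grow-Guides".
    words = re.findall(r"[A-Za-z0-9]+", description)
    if not words:
        raise ValueError("description must contain at least one letter or digit")
    # date.isoformat() yields YYYY-MM-DD.
    return f"{day.isoformat()}_{'-'.join(words)}"


print(batch_folder_name("Partner Grow Guides", date(2025, 6, 23)))
# prints: 2025-06-23_Partner-Grow-Guides
```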

Step 3. Stage the Files

Move all the collected source PDFs from Step 1 into this new working folder. This folder is now your input directory.
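
For larger batches, Steps 2 and 3 can be scripted. The sketch below is illustrative only: the function name `stage_pdfs` and both directory arguments are assumptions, and it moves (rather than copies) every file with a `.pdf` extension, refusing to overwrite a file that already exists in the working folder.

```python
import shutil
from pathlib import Path


def stage_pdfs(source_dir: Path, working_dir: Path) -> list[Path]:
    """Move all PDFs from source_dir into working_dir and return their new paths.

    Hypothetical helper; the SOP does not prescribe a script for staging.
    """
    # Create the working folder (Step 2) if it does not exist yet.
    working_dir.mkdir(parents=True, exist_ok=True)

    staged = []
    for candidate in sorted(source_dir.iterdir()):
        # Skip subfolders and anything that is not a PDF by extension.
        if not candidate.is_file() or candidate.suffix.lower() != ".pdf":
            continue
        target = working_dir / candidate.name
        # Never silently overwrite a file that was already staged.
        if target.exists():
            raise FileExistsError(f"{target} already exists in the working folder")
        shutil.move(str(candidate), str(target))
        staged.append(target)
    return staged
```

A call such as `stage_pdfs(Path("downloads"), Path("2025-06-23_Partner-Grow-Guides"))` would leave the working folder ready to serve as the input directory; the `downloads` path here is only a placeholder.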

Phase 2: Automated Processing

Objective: To run the OCR tool on the staged files using the standard configuration.