PDF Data Extraction Package

<aside>

1. Intro

Hi. This document outlines the project to develop the new PDF Data Extraction Package. Its purpose is to provide you with all the necessary information to understand the project's history, goals, and technical requirements, and to hit the ground running. The goal is to create a robust and reliable tool for our data ingestion pipeline.

</aside>

<aside>

2. Project Overview & Objectives

To design, build, and test a robust Python package that reliably extracts clean text and structured table data from a variety of PDF documents (text-based, image-based/scanned, and hybrid). This package will serve as the core PDF processing component within the Unified Ingestion Engine.

Guiding Principles

To ensure the project stays aligned with its goals, please adhere to the following principles:

Data-First, Not Layout-First: The primary goal is to extract high-quality, machine-readable data (prose and tables), not to perfectly replicate the visual layout of the PDF.
Right Tool for the Job: Use specialized libraries for distinct tasks—PDF parsing, OCR, and table extraction—rather than a single tool that handles all tasks sub-optimally.
Intelligent Triage: The system must first identify the type of PDF to apply the correct extraction strategy, avoiding unnecessary and error-prone OCR. </aside>

<aside>

3. Background & Rationale for Redesign

The initial proof-of-concept, built on OCRmyPDF, successfully demonstrated basic OCR capabilities. However, it also revealed critical limitations:

Low-Quality Data Output: The text layer embedded by OCRmyPDF is often unstructured and contains significant artifacts that would require intensive and costly post-processing.
Failed Table Extraction: The tool flattens tables into unusable blocks of text, losing critical structured data.

Testing revealed that preserving a PDF's original structure post-OCR is counterproductive to the core objective: supplying ADAM with high-quality, reliable data. While extracting all PDF content and rebuilding a document's layout after an intensive cleaning process is technically feasible, the significant labor required may not justify the labor cost and align the product to the business objectives of GOEden — to free up human resources by utilizing AI and automation.

Therefore, a strategic shift is required: we must move from simply making PDFs searchable to a more intelligent content extraction process. This involves deconstructing the PDF into its constituent parts (text, tables, images) and rebuilding the data in a standardized, machine-readable format.

</aside>

<aside>

4. Proposed Architecture & Technology Stack

The new architecture will be a modular pipeline that intelligently processes PDFs based on their type.

Existing Artifacts to be Leveraged

The following modules, already developed, may be used for the new package:

Existing Module	Role in New Architecture	Recommended Action
PDF Triage Module	Initial Classifier: The first step in the pipeline.	Integrate Directly: This can be the first step to classify which PDFs will require undergoing the OCR engine and which don’t.
File Operations Module	Pipeline Utility: Manages file system tasks.	Utilize as needed for moving processed files or managing temporary artifacts (e.g., page images).
Multithread Batch Class	Performance Scaler: The mechanism for high-throughput processing.	Apply in Final Phase: Wrap the complete, single-file processing logic within this class to enable batch operations.
Google Tesseract Engine	Core OCR Service: The designated tool for converting page images to text.	Integrate Directly: Use via the `pytesseract` wrapper for direct, raw text output from images.

New Artifacts to be Developed

New Module	Purpose	Key Responsibilities
`PDF Extractor` Package	PDF COntent Extraction	To accurately parse PDF files to differentiate and extract content into distinct, structured elements like prose paragraphs and tables for the pipeline to process.

Responsibilities also include PDF-specific cleaning such as removing headers/footers and correcting common OCR artifacts. |

<aside>

Recommended Libraries

PyMuPDF

A powerful and fast library for PDF handling, including text and image extraction. This should be the primary tool for parsing text-based PDFs and extracting pages as images for OCR.

Img2Table

A specialized library to extract tables from images, preserving their structure. This is crucial for handling tables in scanned documents.

</aside>

<aside>

5. Getting Started

Clone the Repository: git clone <https://github.com/gabegtrrz/data-ingestion-automation-lib.git>

Create a Virtual Environment:

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\\Scripts\\activate`

Install Dependencies: Update requirements.txt file listing the necessary packages not yet there (pytesseract, Img2Table, etc.) and install them:
```
pip install -r requirements.txt.
```
Review Existing Modules: Familiarize yourself with the code and functionality of the existing artifacts listed in section 4.1. </aside>

6. Proposed Development Roadmap & Key Tasks

Here is a recommended, phased approach to developing the PDF Extractor package.

Phase 1: Core Text Extraction

Integrate PDF Triage: Use the PDF Triage Module as the entry point to determine if a PDF is text-based, image-based, or hybrid. This is found from PDF-to-PDF OCR Package v1
Extract raw text content directly from text-based portions of the documents.
Extract specific image objects, including tables that are formatted as images, from the PDF pages.
Perform Optical Character Recognition (OCR) on the extracted images to produce machine-readable text. The Google Tesseract Engine may be useful here.

Phase 2: Structured Data (Table) Extraction

Detect and isolate regions within PDF pages that contain tables.
Extract structured data from the detected table regions and from images of tables.