<aside>

1. Intro

Hi. This document outlines the project to develop the new PDF Data Extraction Package. It provides the information you need to understand the project's history, goals, and technical requirements so you can hit the ground running. The goal is to create a robust and reliable tool for our data ingestion pipeline.

</aside>

<aside>

2. Project Overview & Objectives

The objective is to design, build, and test a robust Python package that reliably extracts clean text and structured table data from a variety of PDF documents (text-based, image-based/scanned, and hybrid). This package will serve as the core PDF processing component within the Unified Ingestion Engine.

Guiding Principles

To ensure the project stays aligned with its goals, please adhere to the following principles:

</aside>

<aside>

3. Background & Rationale for Redesign

The initial proof-of-concept, built on OCRmyPDF, successfully demonstrated basic OCR capabilities. However, it also revealed a critical limitation:

Testing revealed that preserving a PDF's original structure post-OCR is counterproductive to the core objective: supplying ADAM with high-quality, reliable data. While extracting all PDF content and rebuilding a document's layout after an intensive cleaning process is technically feasible, the significant labor required may not justify the cost, and it runs counter to GOEden's business objective of freeing up human resources through AI and automation.

Therefore, a strategic shift is required: we must move from simply making PDFs searchable to a more intelligent content extraction process. This involves deconstructing the PDF into its constituent parts (text, tables, images) and rebuilding the data in a standardized, machine-readable format.
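As a sketch of what "standardized, machine-readable" could look like, each extracted element might be represented as a simple record. The field names below are illustrative assumptions, not a finalized schema:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ExtractedElement:
    """One constituent part of a PDF in a standardized form.
    Field names are illustrative; the real schema is still to be decided."""
    kind: str                 # "paragraph", "table", or "image"
    page: int                 # 1-based page number the element came from
    content: Any              # str for prose, list-of-rows for tables
    meta: dict = field(default_factory=dict)  # e.g., bounding box, OCR confidence

# Example: a prose paragraph and a small table, both from page 3
elements = [
    ExtractedElement(kind="paragraph", page=3, content="Quarterly revenue grew..."),
    ExtractedElement(kind="table", page=3, content=[["Q1", "Q2"], ["10", "12"]]),
]
```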

</aside>

<aside>

4. Proposed Architecture & Technology Stack

The new architecture will be a modular pipeline that intelligently processes PDFs based on their type.

Existing Artifacts to be Leveraged

The following modules, already developed, may be used for the new package:

| Existing Module | Role in New Architecture | Recommended Action |
| --- | --- | --- |
| PDF Triage Module | Initial Classifier: the first step in the pipeline. | Integrate Directly: use this as the pipeline's entry point to classify which PDFs require the OCR engine and which do not. |
| File Operations Module | Pipeline Utility: manages file system tasks. | Utilize as needed for moving processed files or managing temporary artifacts (e.g., page images). |
| Multithread Batch Class | Performance Scaler: the mechanism for high-throughput processing. | Apply in Final Phase: wrap the complete, single-file processing logic within this class to enable batch operations. |
| Google Tesseract Engine | Core OCR Service: the designated tool for converting page images to text. | Integrate Directly: use via the pytesseract wrapper for direct, raw text output from images. |
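Since the Tesseract integration goes through pytesseract, here is a minimal sketch of that call, assuming the Tesseract binary is installed on the system and using a placeholder image name:

```python
import pytesseract
from PIL import Image

# Assumes the Tesseract binary is installed and on PATH.
# "page_001.png" is a placeholder for an image exported from a PDF page.
raw_text = pytesseract.image_to_string(Image.open("page_001.png"), lang="eng")
print(raw_text)
```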

New Artifacts to be Developed

| New Module | Purpose | Key Responsibilities |
| --- | --- | --- |
| PDF Extractor Package | PDF Content Extraction | Accurately parse PDF files to differentiate and extract content into distinct, structured elements, such as prose paragraphs and tables, for the pipeline to process. Responsibilities also include PDF-specific cleaning, such as removing headers/footers and correcting common OCR artifacts. |
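As a rough illustration of the cleaning responsibilities, the sketch below drops lines that repeat on most pages (likely headers/footers) and normalizes a couple of common OCR artifacts. The heuristics are assumptions for illustration, not a spec:

```python
from collections import Counter

# A few artifacts OCR commonly produces, mapped to their intended characters.
OCR_FIXES = {"ﬁ": "fi", "ﬂ": "fl"}

def clean_pages(pages: list[str]) -> list[str]:
    """Drop lines repeated on most pages, then fix OCR artifacts.
    Heuristic sketch only: real headers/footers may vary slightly per page."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    threshold = max(2, int(0.8 * len(pages)))  # "appears on most pages"
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if line_counts[l.strip()] < threshold]
        text = "\n".join(kept)
        for bad, good in OCR_FIXES.items():
            text = text.replace(bad, good)
        cleaned.append(text)
    return cleaned
```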

<aside>

Recommended Libraries

PyMuPDF

A powerful and fast library for PDF handling, including text and image extraction. This should be the primary tool for parsing text-based PDFs and extracting pages as images for OCR.
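A minimal sketch of the two PyMuPDF operations this project needs, using a placeholder file name:

```python
import fitz  # PyMuPDF's import name

doc = fitz.open("sample.pdf")   # "sample.pdf" is a placeholder path
page = doc[0]                   # first page
text = page.get_text()          # embedded text layer (empty for scanned pages)
pix = page.get_pixmap(dpi=300)  # render the page as an image for the OCR branch
pix.save("page_001.png")
doc.close()
```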

Img2Table

A specialized library to extract tables from images, preserving their structure. This is crucial for handling tables in scanned documents.
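A short sketch of Img2Table on a scanned page image, assuming Tesseract as the OCR backend and a placeholder file name:

```python
from img2table.document import Image
from img2table.ocr import TesseractOCR

ocr = TesseractOCR(lang="eng")  # requires the Tesseract binary
page = Image("page_001.png")    # placeholder scanned-page image

# Each extracted table exposes its cell contents as a pandas DataFrame.
for table in page.extract_tables(ocr=ocr):
    print(table.df)
```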

</aside>

</aside>

<aside>

5. Getting Started

  1. Clone the Repository: git clone https://github.com/gabegtrrz/data-ingestion-automation-lib.git

  2. Create a Virtual Environment:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install Dependencies: Update requirements.txt with any necessary packages not yet listed (pytesseract, img2table, etc.), then install them:

    pip install -r requirements.txt
    
  4. Review Existing Modules: Familiarize yourself with the code and functionality of the existing artifacts listed under "Existing Artifacts to be Leveraged" in section 4.

</aside>

6. Proposed Development Roadmap & Key Tasks

Here is a recommended, phased approach to developing the PDF Extractor package.

Phase 1: Core Text Extraction

  1. Integrate PDF Triage: Use the PDF Triage Module as the entry point to determine whether a PDF is text-based, image-based, or hybrid. This module comes from the PDF-to-PDF OCR Package v1.
  2. Extract raw text content directly from text-based portions of the documents.
  3. Extract specific image objects, including tables that are formatted as images, from the PDF pages.
  4. Perform Optical Character Recognition (OCR) on the extracted images to produce machine-readable text. The Google Tesseract Engine may be useful here; a minimal end-to-end sketch follows this list.
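To make Phase 1 concrete, here is a hedged end-to-end sketch. In place of the PDF Triage Module (whose interface isn't shown here), it uses a simple per-page check for a text layer; the PyMuPDF and pytesseract calls are standard, and the file path is a placeholder:

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_phase1(pdf_path: str) -> list[str]:
    """Per-page raw text: native extraction where a text layer exists,
    OCR via Tesseract where it does not (image-based/scanned pages)."""
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if text.strip():
                pages.append(text)  # text-based page: use embedded text
            else:
                pix = page.get_pixmap(dpi=300)  # render page for OCR
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                pages.append(pytesseract.image_to_string(img))
    return pages
```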

Phase 2: Structured Data (Table) Extraction

  1. Detect and isolate regions within PDF pages that contain tables.
  2. Extract structured data from the detected table regions and from images of tables; see the sketch below.
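For table regions on text-based pages, PyMuPDF ships a built-in table finder that covers both detection and extraction. A minimal sketch, assuming a recent PyMuPDF version and a placeholder file name (image-based tables would instead go through Img2Table, as in the section 4 example):

```python
import fitz  # PyMuPDF (table detection requires a reasonably recent version)

with fitz.open("sample.pdf") as doc:  # "sample.pdf" is a placeholder
    for page_number, page in enumerate(doc, start=1):
        for table in page.find_tables().tables:
            rows = table.extract()  # list of rows, each a list of cell strings
            print(f"Page {page_number}, table at {table.bbox}: {len(rows)} rows")
```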