<aside>
Hi. This document outlines the project to develop the new PDF Data Extraction Package. Its purpose is to provide you with all the necessary information to understand the project's history, goals, and technical requirements, and to hit the ground running. The goal is to create a robust and reliable tool for our data ingestion pipeline.
</aside>
<aside>
To design, build, and test a robust Python package that reliably extracts clean text and structured table data from a variety of PDF documents (text-based, image-based/scanned, and hybrid). This package will serve as the core PDF processing component within the Unified Ingestion Engine.
Guiding Principles
To ensure the project stays aligned with its goals, please adhere to the following principles:
<aside>
The initial proof-of-concept, built on OCRmyPDF, successfully demonstrated basic OCR capabilities. However, it also revealed critical limitations: OCRmyPDF's output is often unstructured and contains significant artifacts that would require intensive and costly post-processing. Testing also revealed that preserving a PDF's original structure post-OCR is counterproductive to the core objective: supplying ADAM with high-quality, reliable data. While extracting all PDF content and rebuilding a document's layout after an intensive cleaning process is technically feasible, the significant labor required may not justify the cost, nor does it align with GOEden's business objective of freeing up human resources through AI and automation.
Therefore, a strategic shift is required: we must move from simply making PDFs searchable to a more intelligent content extraction process. This involves deconstructing the PDF into its constituent parts (text, tables, images) and rebuilding the data in a standardized, machine-readable format.
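As a concrete (and purely illustrative) sketch of what a "standardized, machine-readable format" could look like, the snippet below models deconstructed PDF content as typed elements serialized to JSON. All class and field names here are hypothetical, not part of the existing codebase:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExtractedElement:
    """One deconstructed piece of a PDF (hypothetical schema)."""
    kind: str        # "paragraph" or "table"
    page: int        # 1-based page number the element came from
    content: object  # str for prose, list of rows for tables

@dataclass
class ExtractedDocument:
    """Standardized, machine-readable output for one source PDF."""
    source: str
    elements: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

doc = ExtractedDocument(source="report.pdf")
doc.elements.append(ExtractedElement("paragraph", 1, "Quarterly revenue grew."))
doc.elements.append(ExtractedElement("table", 2, [["Region", "Sales"], ["APAC", "1.2M"]]))
print(doc.to_json())
```

A schema like this lets downstream pipeline stages consume prose and tables uniformly, regardless of whether the content came from a text layer or from OCR.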
</aside>
<aside>
The new architecture will be a modular pipeline that intelligently processes PDFs based on their type.
The following modules, already developed, may be used for the new package:
Existing Module | Role in New Architecture | Recommended Action |
---|---|---|
PDF Triage Module | Initial Classifier: the first step in the pipeline. | Integrate Directly: use it to classify which PDFs require the OCR engine and which do not. |
File Operations Module | Pipeline Utility: manages file system tasks. | Utilize as needed for moving processed files or managing temporary artifacts (e.g., page images). |
Multithread Batch Class | Performance Scaler: the mechanism for high-throughput processing. | Apply in Final Phase: wrap the complete, single-file processing logic within this class to enable batch operations. |
Google Tesseract Engine | Core OCR Service: the designated tool for converting page images to text. | Integrate Directly: use via the pytesseract wrapper for direct, raw text output from images. |
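The pytesseract integration mentioned above might look like the sketch below. The wrapper name `ocr_page_image` is hypothetical; it assumes `pip install pytesseract pillow` and a Tesseract binary on the system PATH:

```python
def ocr_page_image(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on a single page image and return raw text.

    Hypothetical wrapper; assumes pytesseract and Pillow are installed
    along with a system Tesseract binary.
    """
    import pytesseract  # deferred import: OCR is only needed for image-based PDFs
    from PIL import Image

    with Image.open(image_path) as img:
        # image_to_string is pytesseract's call for direct, raw text output
        return pytesseract.image_to_string(img, lang=lang)
```

Keeping the OCR call behind a thin wrapper like this makes it easy to swap engines later without touching the rest of the pipeline.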
New Module | Purpose | Key Responsibilities |
---|---|---|
PDF Extractor Package | PDF Content Extraction | Accurately parse PDF files to differentiate and extract content into distinct, structured elements (e.g., prose paragraphs and tables) for the pipeline to process. Responsibilities also include PDF-specific cleaning, such as removing headers/footers and correcting common OCR artifacts. |
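The cleaning responsibility could be sketched as follows. The helper name and the specific rules (majority-vote header/footer detection, ligature repair, hyphen re-joining) are assumptions to be tuned against real sample documents:

```python
import re

def clean_page_text(pages: list[str]) -> list[str]:
    """PDF-specific cleaning sketch (hypothetical helper).

    Drops lines that repeat as the first/last line of most pages
    (likely headers/footers) and fixes a couple of common OCR artifacts.
    """
    def repeated(lines):
        counts = {}
        for ln in lines:
            counts[ln] = counts.get(ln, 0) + 1
        # a line appearing on a majority of pages is treated as boilerplate
        return {ln for ln, n in counts.items() if n > len(pages) // 2}

    split = [p.splitlines() for p in pages]
    headers = repeated([ls[0] for ls in split if ls])
    footers = repeated([ls[-1] for ls in split if ls])

    cleaned = []
    for ls in split:
        ls = [l for l in ls if l not in headers | footers]
        text = "\n".join(ls)
        text = text.replace("ﬁ", "fi")          # common OCR ligature artifact
        text = re.sub(r"-\n(\w)", r"\1", text)   # re-join words hyphenated at line breaks
        cleaned.append(text)
    return cleaned
```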
<aside>
Recommended Libraries
- A powerful and fast PDF-handling library for text and image extraction. This should be the primary tool for parsing text-based PDFs and for rendering pages as images for OCR.
- Img2Table: a specialized library for extracting tables from images while preserving their structure. This is crucial for handling tables in scanned documents.
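A table-extraction call built on Img2Table might look like the sketch below. The wrapper name is hypothetical, and the exact class and parameter names should be checked against the Img2Table documentation:

```python
def extract_tables_from_image(image_path: str):
    """Extract structured tables from a scanned-page image.

    Sketch assuming the Img2Table library with its Tesseract OCR
    backend (`pip install img2table`); verify API details against
    the library's own documentation.
    """
    from img2table.document import Image as TableImage
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(lang="eng")
    doc = TableImage(image_path)
    # Returns table objects whose cell content preserves row/column structure
    return doc.extract_tables(ocr=ocr)
```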
</aside>
</aside>
<aside>
Clone the Repository:
git clone https://github.com/gabegtrrz/data-ingestion-automation-lib.git
Create a Virtual Environment:
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
Install Dependencies: Update the requirements.txt file with any necessary packages not yet listed (pytesseract, Img2Table, etc.) and install them:
pip install -r requirements.txt
Review Existing Modules: Familiarize yourself with the code and functionality of the existing artifacts listed in section 4.1. </aside>
Here is a recommended, phased approach to developing the PDF Extractor package.
Phase 1: Core Text Extraction
Use the PDF Triage Module as the entry point to determine whether a PDF is text-based, image-based, or hybrid. This module comes from the PDF-to-PDF OCR Package v1.
Phase 2: Structured Data (Table) Extraction
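The Phase 1 triage decision could be sketched as a simple heuristic over per-page extractable text. The function name and the 50-character threshold are assumptions; the per-page character counts would come from a text-layer parse of the PDF:

```python
def classify_pdf(chars_per_page: list[int], min_chars: int = 50) -> str:
    """Triage heuristic: label a PDF from per-page extractable text.

    `chars_per_page` holds the number of characters recovered from each
    page's text layer; the 50-character threshold is an assumption.
    """
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    if text_pages == len(chars_per_page):
        return "text-based"   # every page has a usable text layer
    if text_pages == 0:
        return "image-based"  # pure scan: route all pages to OCR
    return "hybrid"           # OCR only the pages lacking a text layer
```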