Installation

Clone the Repository:

git clone https://github.com/gabegtrrz/data-extraction-goeden.git
cd data-extraction-goeden

Install Python Libraries: It's highly recommended to use a virtual environment.

python -m venv venv
.\venv\Scripts\activate  # On Windows
source venv/bin/activate  # On macOS/Linux
pip install -r requirements.txt

Prerequisites

To run this script, you'll need the following external Python libraries installed, along with the internal modules that ship with the repository.

External Libraries:
- ocrmypdf: The core engine used for performing OCR.
- pymupdf: Used by the triage module to analyze PDF content.

Internal Modules:
- triage.py: Contains the PdfTriage class for determining if a PDF needs OCR.
- file_operations.py: Contains the FileOps class for organizing files into output folders.
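
After installing, it can help to confirm that the external libraries are importable from the active virtual environment. The snippet below is a minimal sketch and not part of the repository: the helper name missing_packages and the REQUIRED mapping are hypothetical. It assumes PyMuPDF can be imported as either "pymupdf" (recent releases) or "fitz" (older releases).

```python
import importlib.util

# Hypothetical mapping: pip package name -> import names to probe.
# Assumption: PyMuPDF is importable as "pymupdf" in recent releases
# and as "fitz" in older ones, so either one counts as installed.
REQUIRED = {
    "ocrmypdf": ("ocrmypdf",),
    "pymupdf": ("pymupdf", "fitz"),
}


def missing_packages(required=REQUIRED):
    """Return the package names for which none of the import names is found."""
    missing = []
    for package, import_names in required.items():
        found = any(
            importlib.util.find_spec(name) is not None for name in import_names
        )
        if not found:
            missing.append(package)
    return missing


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All external libraries are importable.")
```

If anything is reported missing, re-running pip install -r requirements.txt inside the activated environment is the usual fix.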