Clone the Repository:
git clone https://github.com/gabegtrrz/data-extraction-goeden.git
cd data-extraction-goeden
Install Python Libraries: It's highly recommended to use a virtual environment.
python -m venv venv
.\venv\Scripts\activate # On Windows
source venv/bin/activate # On macOS/Linux
pip install -r requirements.txt
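After installing, a quick check can confirm that the virtual environment is active and the libraries are importable. The following is a minimal sketch using only the standard library. The helper names `in_virtualenv` and `missing_modules` are hypothetical and not part of this repository, and the import names `ocrmypdf` and `pymupdf` are assumptions based on the dependency list (older PyMuPDF releases are imported as `fitz` instead).

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    # Assumed import names; adjust if requirements.txt pins other packages.
    print("missing modules:", missing_modules(["ocrmypdf", "pymupdf"]))
```

If the second line lists any module, re-run `pip install -r requirements.txt` with the virtual environment activated.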
To run this script, you'll need the following installed on your system:
This script relies on external Python libraries and internal modules.
ocrmypdf: The core engine used for performing OCR.
pymupdf: Used by the triage module to analyze PDF content.
triage.py: Contains the PdfTriage class for determining if a PDF needs OCR.
file_operations.py: Contains the FileOps class for organizing files into output folders.
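To illustrate how the triage and file-organizing steps might fit together, here is a rough sketch. It is not the actual PdfTriage or FileOps code. The function names, the `min_chars_per_page` threshold, and the `needs_ocr` / `text_ready` folder names are all assumptions made for illustration. In practice the per-page text would come from pymupdf (for example, `page.get_text()` on each page of an opened document); here it is passed in as plain strings so the sketch runs without any third-party library.

```python
import shutil
from pathlib import Path


def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True if any page has too little extractable text.

    page_texts: one string of extracted text per page (hypothetical input).
    min_chars_per_page: assumed threshold, not taken from the project.
    """
    return any(len(text.strip()) < min_chars_per_page for text in page_texts)


def sort_into_folder(pdf_path, ocr_needed, output_root):
    """Copy the file into a subfolder chosen by the triage result.

    The folder names below are placeholders for illustration only.
    """
    subfolder = "needs_ocr" if ocr_needed else "text_ready"
    target_dir = Path(output_root) / subfolder
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created copy.
    return Path(shutil.copy2(pdf_path, target_dir))
```

A scanned document whose pages yield little or no text would be routed to the OCR folder, while a PDF with an existing text layer would skip OCR.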