PDF Triage (triage.py)

Overview

This module provides a specialized PdfTriage class designed to rapidly analyze and classify PDF files. Its sole purpose is to determine if a PDF requires Optical Character Recognition (OCR) or if it is already a text-searchable document that can be skipped.

By acting as a gatekeeper in front of the resource-intensive OCR step, this module keeps batch-processing workflows efficient: files that already contain searchable text are never sent through OCR, which saves processing time and compute.

Core Components

1. `OcrRequirement` Enum

An enumeration representing the classification result for a given PDF.

OCR_REQUIRED: The file is primarily image-based or contains a mix of text and images, requiring OCR to become fully searchable.
OCR_NOT_REQUIRED: The file is predominantly text-based and can be skipped by the OCR process.
EMPTY_OR_CORRUPT: The file could not be opened, is unreadable, or contains no pages.
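As a rough illustration, the three values could be declared with Python's standard enum module. The use of auto() for the member values is an assumption; the text above names the members but not the values they carry.

```python
from enum import Enum, auto


class OcrRequirement(Enum):
    """Classification result for a single PDF (member values are assumed)."""

    OCR_REQUIRED = auto()      # image-based or mixed content; needs OCR
    OCR_NOT_REQUIRED = auto()  # predominantly text-based; OCR can be skipped
    EMPTY_OR_CORRUPT = auto()  # could not be opened, unreadable, or no pages
```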

2. `PdfTriage` Class

This class contains the logic for analyzing and classifying the PDF files.

Purpose: To encapsulate the entire triage process.
Key Method: classify(pdf_path): This is the single entry point for the class. It takes the path to a PDF file and returns an OcrRequirement value based on its analysis.
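A caller might use the result of classify to sort a batch of files before OCR starts. The sketch below is only an illustration: plan_batch is a hypothetical helper that is not part of the module, the enum is redeclared with assumed values so the block stands alone, and the classify argument stands in for a bound PdfTriage.classify method.

```python
from enum import Enum, auto
from typing import Callable, Dict, Iterable, List


class OcrRequirement(Enum):
    # Redeclared here so the sketch is self-contained; values are assumed.
    OCR_REQUIRED = auto()
    OCR_NOT_REQUIRED = auto()
    EMPTY_OR_CORRUPT = auto()


def plan_batch(
    pdf_paths: Iterable[str],
    classify: Callable[[str], OcrRequirement],
) -> Dict[OcrRequirement, List[str]]:
    """Group paths by classification result (hypothetical helper)."""
    plan: Dict[OcrRequirement, List[str]] = {req: [] for req in OcrRequirement}
    for path in pdf_paths:
        # classify is expected to behave like PdfTriage.classify(pdf_path)
        plan[classify(path)].append(path)
    return plan
```

Only the paths grouped under OCR_REQUIRED would then be handed to the OCR step; the other two groups can be skipped or logged.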

How It Works: The Classification Logic

The classify method makes its determination in a fixed sequence of steps, relying on inexpensive text extraction rather than image analysis.

File Integrity Check: The script first attempts to open the PDF file using the PyMuPDF library. If the file cannot be opened or is found to have zero pages, it is immediately classified as EMPTY_OR_CORRUPT.
Page-by-Page Analysis: If the file is valid, the script iterates through every page.
Textual Page Test: For each page, it extracts any existing text and counts the number of characters. If the character count is greater than a set threshold (MIN_CHARS_FOR_TEXTUAL_PAGE, default: 150), the page is considered "textual." This check is fast because it only reads the PDF's existing text layer and does not render the page.
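The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the module's actual code: the function names is_textual_page and textual_page_flags are hypothetical, stripping surrounding whitespace before counting is an assumption, and catching every exception as "unreadable" is a simplification.

```python
from typing import List, Optional

MIN_CHARS_FOR_TEXTUAL_PAGE = 150  # default threshold named in the text


def is_textual_page(text: str, min_chars: int = MIN_CHARS_FOR_TEXTUAL_PAGE) -> bool:
    """Return True when the extracted text is longer than the threshold."""
    # Assumption: surrounding whitespace is ignored when counting characters.
    return len(text.strip()) > min_chars


def textual_page_flags(pdf_path: str) -> Optional[List[bool]]:
    """Return one flag per page, or None if the file is empty or unreadable."""
    # PyMuPDF is imported lazily so is_textual_page can be used without it.
    import fitz

    try:
        with fitz.open(pdf_path) as doc:
            if doc.page_count == 0:
                return None  # corresponds to EMPTY_OR_CORRUPT
            # get_text() returns the page's existing text layer as plain text.
            return [is_textual_page(page.get_text()) for page in doc]
    except Exception:
        # Simplification: any failure to open or read counts as corrupt.
        return None
```

How the per-page flags are combined into a final OcrRequirement is not covered by the steps above, so the sketch stops at the flags.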
