This module provides a specialized PdfTriage class designed to rapidly analyze and classify PDF files. Its sole purpose is to determine whether a PDF requires Optical Character Recognition (OCR) or is already a text-searchable document that can be skipped.
By acting as a gatekeeper in front of the resource-intensive OCR step, the module keeps batch-processing workflows efficient: files that are already searchable never reach OCR, which saves processing time and compute.

OcrRequirement Enum
A simple enumeration that represents the classification result for a given PDF.
OCR_REQUIRED: The file is primarily image-based or contains a mix of text and images, requiring OCR to become fully searchable.
OCR_NOT_REQUIRED: The file is predominantly text-based and can be skipped by the OCR process.
EMPTY_OR_CORRUPT: The file could not be opened, is unreadable, or contains no pages.

PdfTriage Class
This class contains the logic for analyzing and classifying PDF files.
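
As a point of reference before the method details, the three OcrRequirement values described above could be declared roughly as follows. This is a minimal sketch: the string member values and the should_run_ocr helper are assumptions made for illustration and are not taken from the module itself.

```python
from enum import Enum


class OcrRequirement(Enum):
    """Classification result for a single PDF (member values are assumed)."""

    OCR_REQUIRED = "ocr_required"          # image-based or mixed content
    OCR_NOT_REQUIRED = "ocr_not_required"  # predominantly text-based
    EMPTY_OR_CORRUPT = "empty_or_corrupt"  # unopenable, unreadable, or zero pages


def should_run_ocr(result: OcrRequirement) -> bool:
    """Hypothetical helper: only OCR_REQUIRED files go on to the OCR step."""
    return result is OcrRequirement.OCR_REQUIRED
```

Keeping EMPTY_OR_CORRUPT separate from OCR_NOT_REQUIRED lets a batch job log or quarantine bad files instead of silently skipping them.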
classify(pdf_path): The single entry point for the class. It takes the path to a PDF file and returns an OcrRequirement value based on its analysis.
The classify method follows a fixed sequence of checks so that it can reach its determination quickly.
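The file is opened with the PyMuPDF library. If the file cannot be opened or is found to have zero pages, it is immediately classified as EMPTY_OR_CORRUPT.
The text of each page is then extracted and its character count is compared against a threshold (MIN_CHARS_FOR_TEXTUAL_PAGE, default: 150); when the threshold is met, the page is considered "textual." This check is extremely fast.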
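
The two checks described above can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code. The function names is_textual_page and textual_page_flags are hypothetical. The sketch also assumes that "textual" means at least MIN_CHARS_FOR_TEXTUAL_PAGE characters after trimming surrounding whitespace, and that any failure to open or read the file counts as the EMPTY_OR_CORRUPT case.

```python
from typing import List, Optional

MIN_CHARS_FOR_TEXTUAL_PAGE = 150  # default named in the text above


def is_textual_page(page_text: str,
                    min_chars: int = MIN_CHARS_FOR_TEXTUAL_PAGE) -> bool:
    """Assumed rule: textual if the trimmed text has at least min_chars characters."""
    return len(page_text.strip()) >= min_chars


def textual_page_flags(pdf_path: str) -> Optional[List[bool]]:
    """Return one flag per page, or None for the EMPTY_OR_CORRUPT case."""
    # PyMuPDF is imported here so is_textual_page stays usable without it.
    import fitz

    try:
        with fitz.open(pdf_path) as doc:
            if doc.page_count == 0:
                return None
            # get_text() reads the existing text layer; no OCR is performed.
            return [is_textual_page(page.get_text()) for page in doc]
    except Exception:
        # Assumption: any open or read failure is treated as corrupt.
        return None
```

The sketch stops at per-page flags because the rule that turns them into OCR_REQUIRED or OCR_NOT_REQUIRED is not covered in this section.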