This module provides a specialized PdfTriage class designed to rapidly analyze and classify PDF files. Its sole purpose is to determine whether a PDF requires Optical Character Recognition (OCR) or is already a text-searchable document that can be skipped.
By acting as an intelligent "gatekeeper" to the resource-intensive OCR process, this module is critical for creating an efficient batch-processing workflow, saving significant time and computational resources.

OcrRequirement Enum
A simple enumeration representing the classification result for a given PDF.

OCR_REQUIRED: The file is primarily image-based or contains a mix of text and images, requiring OCR to become fully searchable.
OCR_NOT_REQUIRED: The file is predominantly text-based and can be skipped by the OCR process.
EMPTY_OR_CORRUPT: The file could not be opened, is unreadable, or contains no pages.

PdfTriage Class
This class contains the logic for analyzing and classifying the PDF files.

classify(pdf_path): This is the single entry point for the class. It takes the path to a PDF file and returns an OcrRequirement value based on its analysis.

The classify method follows a precise, multi-step logic to make its determination very quickly.
The file is opened with the PyMuPDF library. If the file cannot be opened or is found to have zero pages, it is immediately classified as EMPTY_OR_CORRUPT.
If a page's extracted text meets the character threshold (MIN_CHARS_FOR_TEXTUAL_PAGE, default: 150), the page is considered "textual." This check is extremely fast.
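
The enum and the two checks described so far can be sketched roughly as follows. This is a minimal illustration, not the module's actual implementation. Several details are assumptions: the helper classify_page_texts and its min_textual_ratio parameter are hypothetical, the character count is taken after stripping surrounding whitespace, the threshold comparison is ">=", and the final verdict treats a file as OCR_NOT_REQUIRED only when every page is textual (the text above says only "predominantly").

```python
from enum import Enum, auto
from typing import Iterable

# Default threshold named in the text above.
MIN_CHARS_FOR_TEXTUAL_PAGE = 150


class OcrRequirement(Enum):
    OCR_REQUIRED = auto()
    OCR_NOT_REQUIRED = auto()
    EMPTY_OR_CORRUPT = auto()


def classify_page_texts(
    page_texts: Iterable[str],
    min_chars: int = MIN_CHARS_FOR_TEXTUAL_PAGE,
    min_textual_ratio: float = 1.0,  # hypothetical: share of pages that must be textual
) -> OcrRequirement:
    """Classify a document from the text already extracted from each page."""
    texts = list(page_texts)
    if not texts:
        # Zero pages is treated the same as an unreadable file.
        return OcrRequirement.EMPTY_OR_CORRUPT

    # Assumption: a page is "textual" when its stripped text has at least min_chars characters.
    textual_pages = sum(1 for text in texts if len(text.strip()) >= min_chars)

    # Assumption: the aggregation rule is a simple ratio of textual pages.
    if textual_pages / len(texts) >= min_textual_ratio:
        return OcrRequirement.OCR_NOT_REQUIRED
    return OcrRequirement.OCR_REQUIRED


class PdfTriage:
    def classify(self, pdf_path) -> OcrRequirement:
        """Open the PDF with PyMuPDF and classify it."""
        import fitz  # PyMuPDF, imported lazily so the pure logic above has no dependency

        try:
            with fitz.open(str(pdf_path)) as doc:
                page_texts = [page.get_text() for page in doc]
        except Exception:
            # Any failure to open or read the file counts as corrupt.
            return OcrRequirement.EMPTY_OR_CORRUPT
        return classify_page_texts(page_texts)
```

Keeping the decision logic in a function that takes plain strings makes it testable without real PDF files. In a batch run, only the paths for which PdfTriage().classify(path) returns OcrRequirement.OCR_REQUIRED would be passed on to the OCR step.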