Overview

This module provides the PdfTriage class, which quickly analyzes and classifies PDF files. Its sole purpose is to determine whether a PDF requires Optical Character Recognition (OCR) or is already text-searchable and can be skipped.

By acting as a gatekeeper in front of the resource-intensive OCR step, the module lets a batch-processing workflow skip files that already contain text, which saves processing time and compute.


Core Components

1. OcrRequirement Enum

An enumeration that represents the classification result for a given PDF.
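
As a rough illustration, the enum might look like the sketch below. Only EMPTY_OR_CORRUPT is named in this document; the other two member names (OCR_REQUIRED, TEXT_SEARCHABLE) and the helper should_run_ocr are hypothetical placeholders for the "needs OCR" and "can be skipped" outcomes described in the overview.

```python
from enum import Enum, auto


class OcrRequirement(Enum):
    """Possible triage results for a single PDF (illustrative sketch)."""

    EMPTY_OR_CORRUPT = auto()  # named in this document
    OCR_REQUIRED = auto()      # hypothetical name: PDF has no usable text layer
    TEXT_SEARCHABLE = auto()   # hypothetical name: PDF already contains text


def should_run_ocr(result: OcrRequirement) -> bool:
    """Hypothetical helper: only files flagged as needing OCR are processed."""
    return result is OcrRequirement.OCR_REQUIRED
```

A batch job could then branch on the returned member instead of comparing strings, which keeps the set of possible outcomes explicit.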

2. PdfTriage Class

This class contains the logic for analyzing and classifying the PDF files.


How It Works: The Classification Logic

The classify method applies the following steps in order and is designed to reach a result quickly.

  1. File Integrity Check: The method first attempts to open the PDF file using the PyMuPDF library. If the file cannot be opened or has zero pages, it is immediately classified as EMPTY_OR_CORRUPT.
  2. Page-by-Page Analysis: If the file is valid, the method iterates through every page.
  3. Textual Page Test: For each page, it extracts any existing text and counts the characters. If the count is greater than a set threshold (MIN_CHARS_FOR_TEXTUAL_PAGE, default: 150), the page is considered "textual." This check is fast because it reads only the text already embedded in the PDF.
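
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function names is_textual_page and extract_page_texts are hypothetical, and the sketch assumes that every extracted character (whitespace included) counts toward the threshold, which this document does not specify.

```python
from typing import List

MIN_CHARS_FOR_TEXTUAL_PAGE = 150  # default value named in the text


def is_textual_page(page_text: str,
                    min_chars: int = MIN_CHARS_FOR_TEXTUAL_PAGE) -> bool:
    """Step 3: a page is textual when its text is longer than the threshold."""
    # Assumption: all extracted characters count, including whitespace.
    return len(page_text) > min_chars


def extract_page_texts(pdf_path: str) -> List[str]:
    """Steps 1-2: open the file and collect the extracted text of every page.

    Returns an empty list when the file cannot be opened or has no pages,
    which a caller could map to EMPTY_OR_CORRUPT.
    """
    import fitz  # PyMuPDF, imported lazily so is_textual_page has no dependency

    try:
        with fitz.open(pdf_path) as doc:
            return [page.get_text() for page in doc]
    except Exception:
        # Assumption: any failure to open or read is treated as a corrupt file.
        return []
```

With these helpers, a caller could compute `[is_textual_page(t) for t in extract_page_texts(path)]` to see which pages pass the test. Note that the comparison is strict, so a page with exactly 150 characters does not count as textual.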