Overview

This module is a command-line tool for running Optical Character Recognition (OCR) on a batch of PDF files. It builds on the ocrmypdf library and adds two features:

  1. Intelligent Triage: Each PDF is analyzed first to determine whether OCR is actually necessary. Files that are already text-searchable are skipped, so no OCR time is spent on them.
  2. Parallel Processing: Python's multiprocessing module runs OCR on multiple files simultaneously, using the available CPU cores to shorten the overall run.
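
The two features above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the names needs_ocr and run_parallel, the 20-character threshold, and the "more than half the pages" rule are all assumptions, and the per-page text is assumed to have been extracted already by a PDF library of your choice.

```python
import multiprocessing
from typing import Callable, List, Sequence


def needs_ocr(page_texts: Sequence[str], min_chars_per_page: int = 20) -> bool:
    """Triage heuristic (hypothetical): OCR is needed when most pages have little text.

    page_texts holds the text extracted from each page beforehand.
    The threshold and the majority rule are illustrative assumptions.
    """
    if not page_texts:
        return True  # nothing extractable at all: treat the file as a scan
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def run_parallel(paths: List[str], worker: Callable[[str], object], workers: int) -> list:
    """Apply worker to every path using a pool of worker processes.

    worker must be a module-level function so that multiprocessing can pickle it.
    """
    if not paths:
        return []
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(worker, paths)
```

In a real run, only the files for which needs_ocr returns True would be passed to run_parallel, with a worker function that calls ocrmypdf on a single file.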

The goal is to take a folder of mixed PDFs, convert only the non-searchable ones into searchable documents, and sort the results into categorized output folders.
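
As an illustration of the sorting step, the sketch below files one PDF into a category folder, either by copying or by moving it. The function name file_into_category and the category names used in the usage note are hypothetical; the script's real folder names are not given here.

```python
import shutil
from pathlib import Path


def file_into_category(src: Path, output_root: Path, category: str, move: bool = False) -> Path:
    """Place src under output_root/category and return the new path.

    Copies by default; moves the file when move is True.
    """
    dest_dir = output_root / category
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the category folder on demand
    dest = dest_dir / src.name
    if move:
        shutil.move(str(src), str(dest))
    else:
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
    return dest
```

For example, file_into_category(Path("scan.pdf"), Path("output"), "already_searchable") would copy scan.pdf to output/already_searchable/scan.pdf and leave the original in place.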

Command-Line Usage

The script is run from a terminal. The basic command requires only the input folder.

Note: The input path must be a folder (directory).

python ocr.py -i "C:/path/to/your/pdf_folder"

Core Arguments

These are the fundamental arguments for controlling the script's behavior.

| Argument | Purpose | Example / Common Use |
|---|---|---|
| --input_pdf or -i | Required. Specifies the source folder to process. | -i "C:/path/to/pdfs" |
| -l LANG, --language LANG | Specifies the language(s) for Tesseract OCR. Accuracy depends heavily on this. | -l eng (English)<br>--language eng+fil (English + Filipino) |
| --workers N | Sets the number of parallel worker processes. | --workers 4 (Default: CPU cores minus 2) |
| --move | Moves the original files to the categorized output folders instead of copying them. | Add --move to enable. (Default: copy) |
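
The core arguments could be declared with argparse along the following lines. This is a sketch under assumptions, not the script's actual parser: the function names are hypothetical, the language default is left unset because none is documented here, and the worker default is clamped to at least 1 so that machines with one or two cores still get a worker.

```python
import argparse
import os


def default_workers() -> int:
    """CPU cores minus 2, but never fewer than 1 (the clamp is an assumption)."""
    return max(1, (os.cpu_count() or 1) - 2)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Batch OCR for a folder of PDFs.")
    parser.add_argument("-i", "--input_pdf", required=True,
                        help="Source folder containing the PDFs to process.")
    parser.add_argument("-l", "--language", metavar="LANG", default=None,
                        help="Tesseract language(s), e.g. eng or eng+fil.")
    parser.add_argument("--workers", type=int, metavar="N", default=default_workers(),
                        help="Number of parallel worker processes.")
    parser.add_argument("--move", action="store_true",
                        help="Move originals into the output folders instead of copying.")
    return parser
```

Parsing ["-i", "C:/path/to/pdfs", "--language", "eng+fil", "--workers", "4"] with this parser yields input_pdf "C:/path/to/pdfs", language "eng+fil", workers 4, and move False.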

Advanced Usage & Arguments

For a complete list of ocrmypdf options, run ocrmypdf --help.
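
The ocrmypdf command line takes its options first, followed by the input file and the output file. A minimal sketch of assembling such a command for one file is shown below. The helper name build_ocrmypdf_command is hypothetical, and whether ocr.py forwards extra options in this way is an assumption; the flags in the usage note (--deskew, --optimize) are ocrmypdf's own options.

```python
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    language: Optional[str] = None,
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build an argument list suitable for subprocess.run (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if language:
        cmd += ["-l", language]      # e.g. "eng" or "eng+fil"
    cmd += list(extra_args)          # any additional ocrmypdf options, passed through as-is
    cmd += [input_pdf, output_pdf]   # positional arguments come last
    return cmd
```

For example, build_ocrmypdf_command("in.pdf", "out.pdf", "eng+fil", ["--deskew", "--optimize", "1"]) produces a list that can be handed to subprocess.run(cmd, check=True).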

OCR Workflow Control

These arguments control if and how OCR is applied to pages.

Image Processing and Quality

These arguments manipulate images before they are sent to the OCR engine, which can dramatically improve accuracy.

PDF Output and Optimization

These arguments control the properties and format of the final output PDF.