This module is a command-line tool for running Optical Character Recognition (OCR) on a batch of PDF files. It builds on the ocrmypdf library and extends it with two key features:
The ultimate goal is to take a folder of mixed PDFs and efficiently convert only the non-searchable ones into fully searchable documents, which are then sorted into organized output folders.
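The sorting step described above can be sketched as follows. This is a minimal illustration, not the script's actual implementation: the function name `sort_pdfs`, the folder names `already_searchable` and `needs_ocr`, and the `has_text` callback (which stands in for whatever text-layer check the script performs) are all hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict


def sort_pdfs(
    input_dir: Path,
    output_dir: Path,
    has_text: Callable[[Path], bool],
    move: bool = False,
) -> Dict[str, str]:
    """Route each PDF in input_dir into a category folder under output_dir.

    has_text is a caller-supplied predicate; how the real script decides
    whether a PDF is already searchable is not shown here.
    """
    # Copy by default; move only when explicitly requested.
    transfer = shutil.move if move else shutil.copy2
    routed: Dict[str, str] = {}
    # sorted() materializes the listing first, so moving files is safe.
    for pdf in sorted(input_dir.glob("*.pdf")):
        # Folder names are hypothetical placeholders.
        category = "already_searchable" if has_text(pdf) else "needs_ocr"
        target_dir = output_dir / category
        target_dir.mkdir(parents=True, exist_ok=True)
        transfer(str(pdf), str(target_dir / pdf.name))
        routed[pdf.name] = category
    return routed
```

Passing the check in as a callback keeps the routing logic independent of any particular PDF library, so the same function works whether the check is done with a PDF parser or by some other means.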
The script is executed from your terminal. The basic command requires specifying the input folder.
*Input path must be a folder/directory.*
python ocr.py -i "C:/path/to/your/pdf_folder"
These are the fundamental arguments for controlling the script's behavior.
| Argument | Purpose | Example / Common Use |
|---|---|---|
| --input_pdf or -i | Required. Specifies the source folder to process. | -i "C:/path/to/pdfs" |
| -l LANG, --language LANG | Specifies the language(s) for Tesseract OCR. Accuracy depends heavily on this. | -l eng (English)<br>--language eng+fil (English + Filipino) |
| --workers N | Sets the number of parallel worker processes. | --workers 4 (Default: CPU cores minus 2) |
| --move | Moves the original files to the categorized output folders instead of copying them. | Add --move to enable. (Default: copy) |
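For readers who want to see how the arguments in the table fit together, here is a minimal `argparse` sketch. It is an illustration only and may differ from the script's real parser: the helper names `default_workers` and `build_parser` are hypothetical, the default language `eng` is an assumption (the table does not state one), and so is the floor of one worker on machines with two or fewer cores.

```python
import argparse
import os


def default_workers() -> int:
    """CPU cores minus 2; the minimum of 1 is an assumption."""
    return max(1, (os.cpu_count() or 1) - 2)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="ocr.py", description="Batch OCR for a folder of PDFs."
    )
    parser.add_argument(
        "-i", "--input_pdf", required=True, help="Source folder to process."
    )
    # The default of "eng" is assumed; the table does not give one.
    parser.add_argument(
        "-l", "--language", default="eng",
        help="Tesseract language(s), e.g. eng or eng+fil.",
    )
    parser.add_argument(
        "--workers", type=int, default=default_workers(),
        help="Number of parallel worker processes.",
    )
    parser.add_argument(
        "--move", action="store_true",
        help="Move originals instead of copying them.",
    )
    return parser


# Parse an explicit argument list rather than sys.argv, for demonstration.
args = build_parser().parse_args(
    ["-i", "C:/path/to/pdfs", "-l", "eng+fil", "--workers", "4"]
)
print(args.input_pdf, args.language, args.workers, args.move)
```

Because `--move` is a `store_true` flag, leaving it out gives the copy behavior listed as the default in the table.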
For a complete list of ocrmypdf options, run ocrmypdf --help.
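As a rough illustration of how a language setting and additional options end up in an ocrmypdf call, the sketch below assembles a command line in the form `ocrmypdf [options] input.pdf output.pdf`. Whether ocr.py uses the ocrmypdf command line or its Python API is not stated here, so treat this as an assumption. The helper `build_ocrmypdf_command` and the file names are hypothetical; `-l` and `--skip-text` are existing ocrmypdf options.

```python
import shlex
from pathlib import Path
from typing import List, Sequence


def build_ocrmypdf_command(
    src: Path,
    dst: Path,
    language: str = "eng",
    extra: Sequence[str] = (),
) -> List[str]:
    """Assemble an ocrmypdf invocation as an argument list."""
    # Options go first, then the input file, then the output file.
    return ["ocrmypdf", "-l", language, *extra, str(src), str(dst)]


cmd = build_ocrmypdf_command(
    Path("scan.pdf"),
    Path("scan_ocr.pdf"),
    language="eng+fil",
    extra=["--skip-text"],  # skip pages that already contain text
)
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True);
# this requires ocrmypdf to be installed and on PATH.
```

Building the command as a list instead of a single string avoids shell quoting problems with paths that contain spaces.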
These arguments control if and how OCR is applied to pages.
These arguments manipulate images before they are sent to the OCR engine, which can dramatically improve accuracy.
These arguments control the properties and format of the final output PDF.