Limitations of the PDF-to-PDF Package in the Ingestion Pipeline Context

Accuracy & Cleaning Limitations: The OCR engine may have difficulty with special characters, complex tables, and text embedded in artistic elements or logos.
- This first version cannot extract or clean text. It only makes the document searchable, and it does not retain where the text sits on the page (its 2D layout).
Column Layouts: Complex column layouts can cause text to be displaced or misaligned in the OCR output.
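
To make the column problem concrete, here is a minimal sketch of how text from two columns can end up interleaved when position is reduced to a simple top-to-bottom, left-to-right order. The word boxes, coordinates, and the split_x boundary are hypothetical values chosen for illustration; this is not the package's actual logic.

```python
# Hypothetical word boxes as (text, x, y), with y growing downward.
# The left column starts near x=10, the right column near x=300.
words = [
    ("Revenue", 10, 10), ("grew", 60, 10),
    ("Costs", 300, 10), ("fell", 350, 10),
    ("in", 10, 30), ("2023.", 30, 30),
    ("last", 300, 30), ("year.", 340, 30),
]


def naive_reading_order(boxes):
    """Sort top-to-bottom, then left-to-right, ignoring column boundaries."""
    ordered = sorted(boxes, key=lambda b: (b[2], b[1]))
    return " ".join(text for text, _x, _y in ordered)


def column_aware_order(boxes, split_x):
    """Read everything left of split_x first, then everything right of it."""
    left = [b for b in boxes if b[1] < split_x]
    right = [b for b in boxes if b[1] >= split_x]
    return naive_reading_order(left) + " " + naive_reading_order(right)


naive_text = naive_reading_order(words)
column_text = column_aware_order(words, split_x=200)

print(naive_text)   # the two columns are interleaved line by line
print(column_text)  # each column is read as a unit
```

Every word is still present and searchable in the naive output, but the sentences from the two columns are mixed together, which is the kind of displacement described above.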
Tabular Data: Converting tables into structured, searchable data is challenging for OCR. This script cannot accurately preserve table structure, so tables come out as unstructured text.
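
As a rough illustration of why unstructured output matters for tables, the sketch below flattens a small table into plain lines of text. The table contents and the flatten helper are hypothetical, and the real output of the script may differ; the point is only that once cell boundaries are gone, an empty cell makes the remaining values ambiguous.

```python
# Hypothetical table; the Q1 cell for "Widgets" is empty.
table = [
    ["Item", "Q1", "Q2"],
    ["Widgets", "", "40"],
    ["Gadgets", "15", "25"],
]


def flatten(rows):
    """Join the non-empty cells of each row with single spaces."""
    return [" ".join(cell for cell in row if cell) for row in rows]


flat = flatten(table)
for line in flat:
    print(line)

# The "Widgets" line keeps the number 40 but no longer says
# whether it belongs to Q1 or Q2.
```

A human reader looking at the page can still see which column the value sits under, but a program reading only the flattened text cannot recover that association.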
Context Loss: Because of the limitations above, the text is searchable by human users but is prone to losing significant context when read by a machine.