get_textpage_ocr()

<aside>

PyMuPDF's OCR feature is designed to work with Tesseract-OCR, but it doesn't use a specific, bundled version of Tesseract. Instead, it relies on your system's installed Tesseract OCR engine.

</aside>

get_textpage_ocr(flags=3, language='eng', dpi=72, full=False, tessdata=None)

Optical Character Recognition (OCR) can extract text from documents whose pages contain text only as raster images.

Use this method to OCR a page for text extraction. This method returns a TextPage for the page that includes OCRed text.

Calling this method makes MuPDF invoke Tesseract-OCR. In every other respect the result is a normal TextPage object.
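The workflow described above can be sketched as follows. This is a minimal illustration, not an official recipe: the helper names `ocr_page_text` and `ocr_document` are hypothetical, `dpi=300` is an illustrative choice rather than a library default, and the `import pymupdf` spelling assumes a recent PyMuPDF release (older releases are imported as `fitz`). Running `ocr_document` also assumes a working Tesseract installation.

```python
def ocr_page_text(page, language="eng", dpi=300, full=True):
    """Return the plain text of `page`, including text recognized by OCR.

    `page` is expected to be a PyMuPDF Page. The defaults chosen here are
    illustrative assumptions, not the library defaults.
    """
    # Build the OCR text page; this is the call that invokes Tesseract.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=full)
    # Hand the OCR text page to the extraction call explicitly, otherwise
    # the extraction would not see the OCRed text.
    return page.get_text("text", textpage=textpage)


def ocr_document(path, **ocr_options):
    """Yield (page_number, text) pairs for every page of the file at `path`."""
    # Imported lazily so the helper above stays usable without PyMuPDF loaded.
    import pymupdf

    with pymupdf.open(path) as doc:
        for page in doc:
            yield page.number, ocr_page_text(page, **ocr_options)
```

A call such as `for number, text in ocr_document("scan.pdf", language="eng+spa"): ...` would then process a scanned file page by page; the file name is a placeholder.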

Parameters:

• flags (int) – indicator bits controlling the content available for subsequent text extractions and searches – see the flags parameter of Page.get_text().

• language (str) – the expected language(s). Use “+”-separated values if multiple languages are expected, for example “eng+spa” for English and Spanish.

• dpi (int) – the desired resolution in dots per inch. Influences recognition quality (and execution time).

• full (bool) – whether to OCR the full page, or just the displayed images.

• tessdata (str) – the name of Tesseract’s language support folder tessdata. If omitted, this information must be present as environment variable TESSDATA_PREFIX. Can be determined by function get_tessdata().

Note: This method does not support a clip parameter – OCR will always happen for the complete page rectangle.

Returns: a TextPage. Execution may take significantly longer than Page.get_textpage().

For a full page OCR, all text will have the font “GlyphlessFont” from Tesseract. In case of partial OCR, normal text will keep its properties, and only text coming from images will have the GlyphlessFont.

Note: OCRed text is only available to PyMuPDF’s text extractions and searches if their TextPage parameter specifies the output of this method.

This Jupyter notebook walks through an example for using OCR textpages.
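Because OCRed text is reported with the font “GlyphlessFont”, the origin of each piece of text can be told apart after a partial OCR. The sketch below works on the dictionary returned by `page.get_text("dict", textpage=...)`. The helper name `split_spans_by_origin` is hypothetical, and the exact-match comparison against the font name is an assumption based on the description above.

```python
# Font name that the documentation attributes to Tesseract output.
OCR_FONT = "GlyphlessFont"


def split_spans_by_origin(page_dict):
    """Split span texts of a get_text("dict") result into (ocr, native) lists."""
    ocr, native = [], []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" entry, so they contribute nothing here.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                # Assumption: OCRed spans report exactly this font name.
                target = ocr if span.get("font") == OCR_FONT else native
                target.append(span["text"])
    return ocr, native
```

With a PyMuPDF page, a typical call would look like `split_spans_by_origin(page.get_text("dict", textpage=page.get_textpage_ocr(full=False)))`.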

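The parameter rules above (a “+”-separated language string, and tessdata falling back to the TESSDATA_PREFIX environment variable) could be collected in a small helper like the one below. This is only a sketch: `build_ocr_kwargs` is a hypothetical name, not part of PyMuPDF, and passing the environment value explicitly as `tessdata` is a design choice made here for transparency.

```python
import os


def build_ocr_kwargs(languages, dpi=72, full=False, tessdata=None):
    """Assemble keyword arguments for Page.get_textpage_ocr().

    `languages` is a sequence of Tesseract language codes such as
    ["eng", "spa"]. The dpi and full defaults mirror the method signature.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # Multiple expected languages are joined with "+", e.g. "eng+spa".
    kwargs = {"language": "+".join(languages), "dpi": dpi, "full": full}
    # An explicit folder wins; otherwise fall back to TESSDATA_PREFIX.
    folder = tessdata or os.environ.get("TESSDATA_PREFIX")
    if folder:
        kwargs["tessdata"] = folder
    return kwargs
```

The result can be unpacked into the call, for example `page.get_textpage_ocr(**build_ocr_kwargs(["eng", "spa"], dpi=150))`. When neither source provides a folder, the `tessdata` key is left out and the method's own default applies.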