Text Extraction Flags

Option bits controlling the amount of data that is parsed into a TextPage.

For the PyMuPDF programmer, some combination of these values (built with Python’s | operator, or simply +) is aggregated in the flags integer, a parameter of all text search and text extraction methods. Different methods use different default combinations. Choose a value that fits your situation. In particular, switch off image extraction unless you really need the images: its impact on performance and memory is significant!
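As a minimal sketch of how such a combination can be built and how image extraction can be switched off, the snippet below copies a few of the numeric values listed in this section into local constants so that it runs without PyMuPDF installed. In real code these names are attributes of the pymupdf module (e.g. pymupdf.TEXT_PRESERVE_IMAGES). The helper name without_images is hypothetical.

```python
# Local copies of values documented in this section (assumption: real code
# would use pymupdf.TEXT_PRESERVE_LIGATURES and friends instead).
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_MEDIABOX_CLIP = 64


def without_images(flags: int) -> int:
    """Clear the image bit and leave every other bit untouched."""
    return flags & ~TEXT_PRESERVE_IMAGES


# Combine individual bits with the | operator.
flags = (TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE
         | TEXT_PRESERVE_IMAGES | TEXT_MEDIABOX_CLIP)

# Same combination, but with image extraction switched off.
text_only = without_images(flags)
```

The result would then be passed as the flags argument of an extraction method, for example page.get_text("dict", flags=text_only). Note that + gives the same result as | only as long as no bit is added twice, so | is the safer habit.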

TEXT_PRESERVE_LIGATURES (1) – If set, ligatures are passed through to the application in their original form. Otherwise ligatures are expanded into their constituent parts, e.g. the ligature “ffi” is expanded into the three separate characters f, f and i. Default is “on” in PyMuPDF. MuPDF supports the following 7 ligatures: “ff”, “fi”, “fl”, “ffi”, “ffl”, “ft”, “st”.
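To picture what expanding a ligature means, the following sketch uses Python's standard unicodedata module instead of MuPDF. Compatibility normalization (NFKD) is only a stand-in chosen for illustration; it is not MuPDF's implementation and may treat some ligatures differently.

```python
import unicodedata

# U+FB03 is the single code point LATIN SMALL LIGATURE FFI.
ligature = "\ufb03"
expanded = unicodedata.normalize("NFKD", ligature)

# The word "office" typeset with the ffi ligature has only four code points.
word_with_ligature = "o\ufb03ce"
word_expanded = unicodedata.normalize("NFKD", word_with_ligature)
```

With the flag set, extracted text keeps the form of word_with_ligature; with the flag cleared, it resembles word_expanded, which is usually easier to search.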

TEXT_PRESERVE_WHITESPACE (2) – If set, whitespace is passed through. Otherwise any type of horizontal whitespace (including horizontal tabs) will be replaced with space characters of variable width. Default is “on” in PyMuPDF.

TEXT_PRESERVE_IMAGES (4) – If set, then images will be stored in the TextPage. This causes the presence of (usually large!) binary image content in the output of text extractions of types “blocks”, “dict”, “json”, “rawdict”, “rawjson”, “html”, and “xhtml” and is the default there. If used with “blocks” however, only image metadata will be returned, not the image itself.

TEXT_INHIBIT_SPACES (8) – If set, MuPDF will not try to add missing space characters where there are large gaps between characters. PDF creators often do not insert space characters to advance to the next character’s position, but specify that position directly instead. The default in PyMuPDF is “off” – so spaces will be generated.

TEXT_DEHYPHENATE (16) – Ignore hyphens at line ends and join with the next line. Used internally by the text search functions. However, it is generally available: if on, text extractions will return joined text lines (or spans) with the ending hyphen of the first line eliminated. So two separate spans “first meth-” and “od leads to wrong results” on different lines will be joined into one span “first method leads to wrong results” with correspondingly updated bboxes: the characters of the resulting span will no longer have identical y-coordinates.
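The effect on the text can be pictured with a small pure-Python sketch. It only imitates the joining rule on a list of line strings and is not MuPDF's implementation, which works on spans and also merges their bounding boxes. The function name join_hyphenated is made up for this illustration.

```python
def join_hyphenated(lines):
    """Join every line that ends in '-' with the line that follows it."""
    result = []
    pending = None  # text of a hyphenated line waiting for its continuation
    for line in lines:
        if pending is not None:
            line = pending + line
            pending = None
        if line.endswith("-"):
            pending = line[:-1]  # drop the hyphen and wait for the next line
        else:
            result.append(line)
    if pending is not None:
        # The last line ended with a hyphen: nothing to join, keep it as it was.
        result.append(pending + "-")
    return result


joined = join_hyphenated(["first meth-", "od leads to wrong results"])
```

Here joined holds the single string from the example above. A real extraction differs in that it cannot tell a line-break hyphen from one that belongs to the word, which is one reason the flag is optional.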

TEXT_PRESERVE_SPANS (32) – Generate a new line for every span. Not used (“off”) in PyMuPDF, but available for your use. Every line in “dict”, “json”, “rawdict”, “rawjson” will contain exactly one span.

TEXT_MEDIABOX_CLIP (64) – Characters entirely outside a page’s mediabox or contained in other “clipped” areas will be ignored. This is the default in PyMuPDF.

TEXT_USE_CID_FOR_UNKNOWN_UNICODE (128) – Use raw character codes instead of U+FFFD. This is the default for text extraction in PyMuPDF. If you want to detect when encoding information is missing or uncertain, toggle this flag and scan for the presence of U+FFFD (= chr(0xfffd)) code points in the resulting text.
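A hedged sketch of that check: the function below accepts any object offering a PyMuPDF-style get_text(option, flags=...) method, clears the flag, and counts the replacement characters. The numeric values are copied from this section. The function name count_unknown_chars and the FakePage stub are hypothetical and exist only to make the sketch self-contained.

```python
TEXT_USE_CID_FOR_UNKNOWN_UNICODE = 128   # value as listed in this section
TEXTFLAGS_TEXT = 1 | 2 | 64 | 128        # default for plain text, per this section


def count_unknown_chars(page, flags: int = TEXTFLAGS_TEXT) -> int:
    """Extract text with the CID flag cleared and count U+FFFD code points."""
    flags &= ~TEXT_USE_CID_FOR_UNKNOWN_UNICODE
    text = page.get_text("text", flags=flags)
    return text.count(chr(0xFFFD))


class FakePage:
    """Stand-in for a real page object so the sketch runs without a PDF."""

    def get_text(self, option, flags=0):
        self.seen_flags = flags          # remember what the caller asked for
        return "ab\ufffdc\ufffd"         # pretend two characters were unknown


page = FakePage()
unknown = count_unknown_chars(page)
```

With a real document, page would come from an opened PDF, and a non-zero count would suggest that some fonts lack usable Unicode mapping information.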

TEXT_COLLECT_STRUCTURE (256) – Not supported.

TEXT_ACCURATE_BBOXES (512) – Ignore metric values of all fonts when computing character boundary boxes – most prominently the ascender and descender values. Instead, follow the drawing commands of each character’s glyph and compute its rectangle hull. This is the smallest rectangle wrapping all points used for drawing the visual appearance - see the Shape class for understanding the background. As a result, each character gets its own individual height. For instance a (white) space will have a bbox of height 0 (because nothing is drawn) – in contrast to the non-zero boundary box generated when using font metrics. This option may be useful for getting meaningful boundary boxes even for fonts containing errors. Its use will slow down text extraction somewhat because of the additional computational effort. Note that this has no effect by default - one must also disable the global quad corrections setting with pymupdf.TOOLS.unset_quad_corrections(True).
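The rectangle hull idea can be sketched in a few lines of plain Python. This illustrates the geometry only and is not MuPDF's code; the function name rect_hull and the sample coordinates are invented.

```python
def rect_hull(points):
    """Smallest axis-parallel rectangle (x0, y0, x1, y1) containing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Invented outline points of a low glyph: it covers only part of the line
# height that a font-metrics bbox (ascender to descender) would span.
glyph_points = [(10.0, 105.0), (15.0, 100.0), (16.0, 110.0), (10.5, 110.0)]
hull = rect_hull(glyph_points)
height = hull[3] - hull[1]
```

Because the hull depends on the points actually drawn, two characters of the same font and size can end up with different heights, which is the behavior described above.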

TEXT_COLLECT_VECTORS (1024) – Not supported.

TEXT_IGNORE_ACTUALTEXT (2048) – Ignore built-in differences between text appearing in e.g. PDF viewers versus text stored in the PDF. See Adobe PDF References, page 615 for background. If set, the stored (“replacement”) text is ignored in favor of the displayed text.

TEXT_SEGMENT (4096) – Attempt to segment the page into different regions.

The following constants represent the default combinations of the above for text extraction and searching:

TEXTFLAGS_TEXT = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_USE_CID_FOR_UNKNOWN_UNICODE

TEXTFLAGS_WORDS = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_USE_CID_FOR_UNKNOWN_UNICODE

TEXTFLAGS_BLOCKS = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_USE_CID_FOR_UNKNOWN_UNICODE

TEXTFLAGS_DICT = TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_MEDIABOX_CLIP | TEXT_PRESERVE_IMAGES | TEXT_USE_CID_FOR_UNKNOWN_UNICODE
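When a flags integer shows up in a debugger or log, it can help to translate it back into names. The sketch below mirrors the values from this section in a standard-library IntFlag. The class TextFlag and the helper flag_names are hypothetical and not part of PyMuPDF, which exposes these values as plain module-level integers.

```python
from enum import IntFlag


class TextFlag(IntFlag):
    """Local mirror of the bit values listed in this section."""
    PRESERVE_LIGATURES = 1
    PRESERVE_WHITESPACE = 2
    PRESERVE_IMAGES = 4
    INHIBIT_SPACES = 8
    DEHYPHENATE = 16
    PRESERVE_SPANS = 32
    MEDIABOX_CLIP = 64
    USE_CID_FOR_UNKNOWN_UNICODE = 128
    COLLECT_STRUCTURE = 256
    ACCURATE_BBOXES = 512
    COLLECT_VECTORS = 1024
    IGNORE_ACTUALTEXT = 2048
    SEGMENT = 4096


def flag_names(value: int) -> list:
    """Return the names of all bits that are set in value."""
    return [flag.name for flag in TextFlag if value & flag]


# The default for plain text, as given above.
text_default = (TextFlag.PRESERVE_LIGATURES | TextFlag.PRESERVE_WHITESPACE
                | TextFlag.MEDIABOX_CLIP | TextFlag.USE_CID_FOR_UNKNOWN_UNICODE)

# The "dict" default differs from it only by the image bit.
dict_default = text_default | TextFlag.PRESERVE_IMAGES
difference = flag_names(int(dict_default) ^ int(text_default))
```

Calling flag_names on an arbitrary integer lists the options that are switched on in it, which makes it easier to see why two extractions of the same page differ.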