A Python library for extracting and analyzing text, images, and metadata from PDF documents.
Pdfminer.six is a Python library for extracting information from PDF documents, with a focus on retrieving and analyzing text. It reads text directly from each page's PDF source, together with its exact location, font, and color. It can also extract images and metadata, and convert documents to HTML or hOCR.
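As a rough illustration of reading text together with its position and font, the sketch below walks the layout tree returned by `extract_pages` from `pdfminer.high_level`. The helper names `iter_chars` and `chars_in_pdf` are hypothetical. The duck-typed check on `fontname` is an assumption made for this example, not the library's own traversal, and the color lookup is guarded because not every version may expose it.

```python
from typing import Any, Dict, Iterator


def iter_chars(node: Any) -> Iterator[Dict[str, Any]]:
    """Walk a layout tree depth-first and yield one record per character.

    Assumption for this sketch: any object exposing `fontname` is a
    character (pdfminer.six's LTChar does), and any iterable object is
    a container such as a page, text box, or text line.
    """
    if hasattr(node, "fontname"):
        graphic_state = getattr(node, "graphicstate", None)
        yield {
            "text": node.get_text(),
            "font": node.fontname,
            "size": node.size,
            "bbox": node.bbox,  # (x0, y0, x1, y1), origin at bottom-left
            # Fill color, or None if the object does not carry one.
            "color": getattr(graphic_state, "ncolor", None),
        }
        return
    try:
        children = iter(node)
    except TypeError:
        return  # leaf without font data, e.g. a rule or an inserted newline
    for child in children:
        yield from iter_chars(child)


def chars_in_pdf(pdf_path: str) -> Iterator[Dict[str, Any]]:
    """Yield character records for every page of the PDF at pdf_path."""
    # Imported lazily so iter_chars stays usable without the library.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(pdf_path):
        yield from iter_chars(page_layout)
```

Calling `list(chars_in_pdf("example.pdf"))` on a real file (the path is a placeholder) would give one dictionary per character, which can then be filtered by font or grouped by position.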
Developers and data scientists working with PDF document parsing, text extraction, and data analysis in Python environments.
It offers a modular, extensible architecture that allows customization beyond standard text extraction, with near-complete support for the PDF-1.7 specification, CJK languages, and a range of compression filters and encryption schemes.
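To make the modular design more concrete, here is a minimal sketch that wires the lower-level pieces together by hand instead of using the high-level helpers. It uses `PDFResourceManager`, `PDFPageInterpreter`, `PDFPage`, and `PDFPageAggregator` from pdfminer.six. The function names `iter_layouts` and `element_counts` are hypothetical, and the aggregator is only one possible device; a custom device class could presumably take its place in the same slot.

```python
from collections import Counter
from typing import Any, Iterable, Iterator


def iter_layouts(pdf_path: str) -> Iterator[Any]:
    """Yield one analyzed layout object per page of the PDF at pdf_path."""
    # Imported lazily so element_counts can be used without the library.
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()  # shared fonts and other resources
    # The device receives rendering events; this one collects a page layout.
    device = PDFPageAggregator(resource_manager, laparams=LAParams())
    interpreter = PDFPageInterpreter(resource_manager, device)

    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
            yield device.get_result()


def element_counts(layout: Iterable[Any]) -> Counter:
    """Count the top-level layout elements of a page by class name."""
    return Counter(type(element).__name__ for element in layout)
```

Running `element_counts` over each layout from `iter_layouts` gives a quick picture of what a page contains (text boxes, figures, lines) before deciding which component, if any, needs to be customized.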
Community maintained fork of pdfminer - we fathom PDF
Supports most of the PDF-1.7 specification, several font types (Type1, TrueType, Type3, CID), and extraction of embedded images (JPG, JBIG2, and bitmaps), so it handles a wide range of PDFs.
Built with a modular architecture allowing easy replacement of components for custom interpreters or rendering devices, as stated in the README.
Extracts text with exact location, font, and color, and supports CJK languages and vertical writing, enabling precise document analysis.
Decodes several compression filters (e.g., FlateDecode) and supports RC4 and AES encryption, so it can read compressed or encrypted PDFs.
The README admits only 'almost' full support for PDF-1.7, which can lead to compatibility issues with some PDF documents.
Limited maintainer availability means slower issue resolution and reliance on community contributions, as noted in the contributing section.
Requires extra dependencies for image extraction (e.g., 'pip install pdfminer.six[image]'), adding complexity to installation and deployment.