A Java library and command-line tool for extracting tables from PDF files.
tabula-java is a Java library and command-line application that extracts tables from PDF files. It solves the problem of accessing structured tabular data trapped in PDF documents by converting them into usable formats like CSV, JSON, or TSV. The library powers the open-source Tabula application and can be integrated into JVM-based projects for automated data extraction.
tabula-java is aimed at data analysts, researchers, and developers who need to programmatically extract tabular data from PDF reports, academic papers, or financial documents. It is particularly useful for those working with JVM languages (Java, Scala, JRuby) or command-line automation scripts.
Developers choose tabula-java for its accurate table detection algorithms, flexible command-line interface, and robust Java API. Unlike generic PDF text extractors, it specifically identifies table structures and preserves their layout, making it a reliable open-source alternative to proprietary PDF data extraction tools.
Implements both lattice and stream extraction modes to handle different table layouts. The CLI can force either mode with -l (lattice) or -t (stream).
Offers a command-line tool for batch processing and a Java API for programmatic use, making it adaptable to different workflows, with examples provided in the README for Java integration.
Includes a -b option that processes every PDF in a directory, so multiple files can be extracted in one run without repeating the command for each file.
Exports extracted tables to CSV, TSV, or JSON formats via the -f option, enabling easy data manipulation in common analysis tools.
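The options described above (-l or -t, -f, and -b) can be combined in a single batch invocation. The following is a minimal sketch, not part of tabula-java, that only assembles the argument list from Java using the standard library. The class name TabulaCommand, the method name batch, and the jar path are hypothetical and chosen for illustration; the real jar file name depends on the release you download or build.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds an argument list for the tabula-java CLI. It does not run the tool. */
final class TabulaCommand {
    private TabulaCommand() {}

    /**
     * @param jarPath   path to the tabula-java jar (assumption: adjust to your local file name)
     * @param lattice   true adds -l (lattice mode), false adds -t (stream mode)
     * @param format    value passed to -f; must be CSV, TSV, or JSON
     * @param directory directory of PDFs passed to -b
     */
    static List<String> batch(String jarPath, boolean lattice, String format, String directory) {
        // Reject anything other than the three formats named in this entry.
        if (!List.of("CSV", "TSV", "JSON").contains(format)) {
            throw new IllegalArgumentException("Unsupported format: " + format);
        }
        List<String> args = new ArrayList<>();
        args.add("java");
        args.add("-jar");
        args.add(jarPath);
        args.add(lattice ? "-l" : "-t"); // force one extraction mode
        args.add("-f");
        args.add(format);
        args.add("-b");
        args.add(directory);
        return args;
    }
}
```

The resulting list can be handed to new ProcessBuilder(args).inheritIO().start() to launch the CLI. Keeping the argument construction separate from process launching makes the flag combination easy to check without a PDF or the jar on hand.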
The README explicitly warns that JVM start-up time is significant, so each CLI invocation pays that cost. For many small extractions, it suggests workarounds such as drip or one of the language bindings.
Limited to text-based PDFs and cannot handle scanned documents, a significant gap for real-world extraction scenarios that require OCR.
The -n and -r command-line options are deprecated in favor of -t and -l, so older scripts and examples that use them can cause confusion, as noted in the usage examples.
Requires Maven and manual compilation to build from source, which can be a hurdle for developers wanting to customize or contribute, as described in the 'Building from Source' section.