A Ruby library for extracting text and metadata from various document formats using Apache Tika.
Yomu is a Ruby library that extracts text and metadata from various document formats, including Microsoft Office, OpenDocument, Apple iWork, RTF, and PDF files. It wraps the Apache Tika toolkit behind a single interface, so a Ruby application can parse many file types through one API.
Ruby developers who need to parse and extract content from multiple document formats in their applications, for example in file upload handling, content management systems, or data processing pipelines.
Developers choose Yomu for its clean Ruby API over Apache Tika: it provides Tika's broad format support without requiring them to manage Tika directly, which keeps document parsing consistent across formats.
Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
Leverages Apache Tika to support dozens of formats, including Microsoft Office, PDF, and Apple iWork (see the format section of the README), so one library covers diverse document types.
Offers a clean interface with methods like `Yomu.read` and `Yomu.new`, abstracting Tika's complexity and making it easy to extract text, metadata, and MIME types in Ruby code.
Reads from local files, URLs, streams, or any object responding to `read`, as shown in the usage examples, which makes it straightforward to handle file uploads in web frameworks such as Rails or Sinatra.
Extracts document metadata as a hash and identifies MIME types using Tika's detection, reducing the need for additional libraries and simplifying content analysis workflows.
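The features above can be sketched in a few lines of Ruby. This is a minimal illustration, not code from the Yomu project: `summarize_document` and its `extractor:` parameter are hypothetical names introduced here, and the default path assumes the `yomu` gem and a working JRE are installed. The calls it relies on (`Yomu.new`, `text`, `metadata`, and `mimetype.content_type`) are the ones described in Yomu's README.

```ruby
# Hypothetical helper (not part of Yomu): collects text, metadata, and the
# detected content type for one document.
#
# source    - a file path, a URL string, or any object responding to #read
# extractor - a class with the same interface as Yomu; defaults to Yomu.
#             It is a parameter only so a stand-in can be used where the
#             gem or a JRE is unavailable.
def summarize_document(source, extractor: nil)
  unless source.is_a?(String) || source.respond_to?(:read)
    raise ArgumentError, 'expected a path, a URL, or an object responding to #read'
  end

  # Load the real library only when no stand-in is supplied.
  extractor ||= begin
    require 'yomu' # assumes the yomu gem and a working JRE are installed
    Yomu
  end

  doc = extractor.new(source)
  {
    text: doc.text,                          # extracted plain text
    metadata: doc.metadata,                  # Hash of document metadata
    content_type: doc.mimetype.content_type  # MIME type as detected by Tika
  }
end
```

With the gem installed, `summarize_document('sample.pages')` and `summarize_document(File.open('sample.pages'))` would go through the same code path, which is the point of the unified interface. For data already held in memory, the README also shows a class-level form, `Yomu.read :text, data`.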
Requires a working JRE, as stated in the installation section, which adds deployment complexity and may not be suitable for lightweight or Java-free Ruby environments.
Relies on Apache Tika, which can be memory-intensive and slower than native Ruby libraries, potentially impacting high-throughput applications or real-time processing needs.
As a wrapper, it depends on Tika's capabilities and version updates, so users have less control over parsing behavior and must wait for Yomu updates to fix Tika-related bugs.
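Because of the JRE requirement noted above, it can be useful to check for Java before an application attempts any extraction. The following is a minimal sketch under one assumption: that the runtime is reachable as a `java` executable on the `PATH`. `jre_available?` is a hypothetical helper, not part of Yomu.

```ruby
# Hypothetical helper: returns true only when the given Java command can be
# executed and exits successfully.
def jre_available?(java_cmd = 'java')
  # Kernel#system returns nil when the command cannot be executed at all and
  # false on a non-zero exit status, so compare against true to get a strict
  # boolean. Output is discarded because `java -version` prints to stderr.
  system(java_cmd, '-version', out: File::NULL, err: File::NULL) == true
end
```

An application could call this at boot and disable document parsing, or fail with a clear message, when it returns `false`.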