A high-performance, functional tabular data processing library for Clojure, similar to Python's Pandas or R's data.table.
tech.ml.dataset is a Clojure library for tabular data processing that provides high-performance, functional alternatives to tools like Python's Pandas or R's data.table. It solves data-intensive problems on the JVM with efficient columnar storage and immutable datasets. The library focuses on pragmatic data work with abstractions that simplify implementing real-world solutions.
It is aimed at Clojure developers and data scientists working with tabular data on the JVM who need efficient, functional alternatives to Python or R tools. Java developers can also use it through its Java API.
Developers choose tech.ml.dataset for its functional design, which makes data transformations easier to reason about, and its high performance through memory-efficient columnar storage. It provides a pragmatic, JVM-native alternative to popular data processing libraries.
A Clojure high performance data processing system
Uses columnar storage with primitive arrays and packed datetime types to significantly reduce memory footprint, as highlighted in performance benchmarks linked in the README.
Datasets are immutable, making data transformations predictable and easier to debug compared to mutable alternatives like Pandas, which aligns with the library's functional design philosophy.
Optimized for speed with independent benchmarks showing it competes well against tools like data.table and Pandas, as referenced in the related projects section.
Includes a full Java API and sample program, allowing seamless integration into Java-based applications without requiring deep Clojure knowledge.
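To make the memory claim above more concrete, here is a minimal Java sketch of the general idea behind a packed date column: each date is stored as one primitive int (days since 1970-01-01) in a single array, rather than as one object per row. The class name `PackedDateColumn` and its methods are hypothetical and written only for illustration; this is not tech.ml.dataset's implementation or its Java API.

```java
import java.time.LocalDate;

// Illustrative sketch only: NOT tech.ml.dataset's implementation or API.
// All names here (PackedDateColumn, of, get, packedAt) are hypothetical.
public class PackedDateColumn {
    // One 4-byte int per row (days since 1970-01-01),
    // instead of a reference to a separate LocalDate object per row.
    private final int[] epochDays;

    private PackedDateColumn(int[] epochDays) {
        this.epochDays = epochDays;
    }

    // Pack dates into a primitive array; the column is never mutated afterwards.
    public static PackedDateColumn of(LocalDate... dates) {
        int[] packed = new int[dates.length];
        for (int i = 0; i < dates.length; i++) {
            // toIntExact throws if a date falls outside the int range.
            packed[i] = Math.toIntExact(dates[i].toEpochDay());
        }
        return new PackedDateColumn(packed);
    }

    public int size() {
        return epochDays.length;
    }

    // Unpack on read: rebuild the LocalDate from the stored day count.
    public LocalDate get(int row) {
        return LocalDate.ofEpochDay(epochDays[row]);
    }

    // Expose the raw packed value for inspection.
    public int packedAt(int row) {
        return epochDays[row];
    }

    public static void main(String[] args) {
        PackedDateColumn col = of(LocalDate.of(1970, 1, 2), LocalDate.of(2024, 2, 29));
        System.out.println(col.packedAt(0)); // day count for 1970-01-02
        System.out.println(col.get(1));      // round-trips back to a LocalDate
    }
}
```

The trade-off in this sketch is a small unpacking cost on each read in exchange for a dense, contiguous array with no per-row object overhead.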
The README acknowledges that an alternative API, tablecloth, offers some important extra features, which suggests that tech.ml.dataset (TMD) on its own may lack some advanced capabilities.
Compared to Python's Pandas, the Clojure data science ecosystem is smaller, which can mean fewer tutorials, community support, and third-party integrations for complex workflows.
Despite the Java API, using the library effectively requires familiarity with Clojure and functional programming, which can be a significant barrier for teams accustomed to imperative languages like Python or R.