An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.
Geni is a Clojure dataframe library that runs on Apache Spark and provides an idiomatic interface for large-scale data processing and machine learning. It replaces cumbersome Java/Scala interop with a Clojure-native API, so data wrangling and ML workflows can use Spark's distributed computing capabilities directly from Clojure.
Geni is aimed at Clojure developers and data engineers who need to run distributed data processing, ETL pipelines, or machine learning tasks on Apache Spark without leaving the Clojure ecosystem.
Developers choose Geni because it eliminates the friction of Spark's Java/Scala APIs, offering a seamless Clojure experience with dynamic argument handling, functional composition via threading macros, and full access to Spark's features for distributed computing.
Uses Clojure's threading macro (`->`) for composing Spark operations, replacing Scala's method chaining with a more readable, functional style as demonstrated in the data wrangling examples.
Allows mixed-type arguments like columns, strings, and keywords in single function calls, reducing boilerplate and increasing flexibility, as highlighted in the overview.
Provides full access to Spark's data wrangling, SQL operations, ML pipelines, and optional XGBoost support, enabling complex distributed workflows without Java/Scala interop.
Supports running on Dataproc, Kubernetes, or locally with database connectors, simplifying deployment across various environments as documented in the resources.
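The threading-macro composition and mixed-type arguments described above can be sketched roughly as follows. This is an illustrative sketch, not a verified session: it assumes Geni and its Spark dependencies are on the classpath, and the file path and the column names (`Suburb`, `Rooms`, `Price`) are placeholders for whatever dataset is at hand.

```clojure
;; Sketch only: needs Geni plus its Spark dependencies on the classpath.
(require '[zero-one.geni.core :as g])

;; Hypothetical input path; substitute any Parquet file you have.
(def dataframe (g/read-parquet! "data/housing.parquet"))

;; Compose dataframe operations with the threading macro instead of
;; Scala-style method chaining. :Suburb is an assumed column name.
(-> dataframe
    (g/group-by :Suburb)
    g/count
    (g/order-by (g/desc :count))
    (g/limit 5)
    g/show)

;; Mixed argument types in a single call: a keyword, a string, and a
;; Column object. "Rooms" and "Price" are assumed column names.
(-> dataframe
    (g/select :Suburb "Rooms" (g/col "Price"))
    (g/limit 3)
    g/show)
```

Each step takes the dataframe as its first argument and returns a new one, which is what lets `->` stand in for method chaining.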
Requires users to declare a long list of `provided` dependencies themselves, including Spark, Arrow, and database drivers, which complicates project setup and increases resource usage, as shown in the installation section.
The Clojure wrapper adds a layer of abstraction over Spark, which may introduce minor overhead compared to the native Scala or Java APIs; the project provides benchmarks for comparison.
Being Clojure-specific, it has a smaller community and fewer third-party extensions compared to PySpark or Scala Spark, which might limit tooling and support.