Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.
Kotlin for Apache Spark is an open-source library that provides Kotlin bindings and extensions for Apache Spark. It enables developers to write Spark applications using Kotlin's concise syntax, null safety, and functional features, bridging the gap between Kotlin and Spark's native Scala/Java APIs. The project aims to make Spark more accessible to Kotlin developers while maintaining full compatibility with Spark's ecosystem.
It is aimed at Kotlin developers working on big data processing, as well as data engineers and data scientists who prefer Kotlin's modern language features and want to use Apache Spark for distributed computation.
Developers choose this library because it provides an idiomatic Kotlin API for Spark, reducing boilerplate and improving type safety compared to using Spark's Java API directly. It offers seamless integration with Kotlin features like data classes, lambdas, and null safety, along with extras like Jupyter notebook support and enhanced UDF creation.
This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this become a part of Apache Spark 3.x.
Enables direct use of data classes, lambda expressions, and method references in Spark operations, reducing boilerplate code. The README shows examples like dsOf("a" to 1) for creating Datasets with Kotlin Pairs.
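As a rough illustration of what this looks like in practice, the sketch below builds a Dataset from a data class and transforms it with lambdas. It assumes the library is on the classpath under the `org.jetbrains.kotlinx.spark.api` package and that `withSpark`, `dsOf`, and the lambda-taking `filter` and `map` extensions behave as in the README; the `Person` class and its values are made up for the example.

```kotlin
import org.jetbrains.kotlinx.spark.api.*

// Hypothetical data class used only for this example.
data class Person(val name: String, val age: Int)

fun main() = withSpark {
    // dsOf creates a Dataset<Person> directly from data class instances.
    dsOf(Person("Ada", 36), Person("Grace", 45))
        .filter { it.age > 40 }   // plain Kotlin lambda, no explicit FilterFunction
        .map { it.name }          // result is a Dataset<String>
        .show()
}
```

Exact signatures may differ between library versions, so treat this as a starting point rather than a reference.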
Provides aliases such as leftJoin that enforce nullability, returning Dataset<Pair<LEFT, RIGHT?>> to prevent NullPointerException in distributed computations. This is highlighted in the Null safety section with practical examples.
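To see why a `Pair<LEFT, RIGHT?>` result type helps, here is a minimal sketch that models a left join on plain Kotlin lists rather than on Spark Datasets. The `leftJoinLocal` helper and the `Customer` and `Order` classes are hypothetical stand-ins, not part of the library; the point is only that the nullable right side forces callers to handle unmatched rows at compile time.

```kotlin
// Hypothetical row types for the illustration.
data class Customer(val id: Int, val name: String)
data class Order(val customerId: Int, val total: Double)

// Local stand-in for a left join: every left row is kept, and rows with
// no match on the right are paired with null.
fun <L, R> List<L>.leftJoinLocal(
    right: List<R>,
    matches: (L, R) -> Boolean
): List<Pair<L, R?>> = flatMap { l ->
    val hits: List<R> = right.filter { r -> matches(l, r) }
    if (hits.isEmpty()) listOf<Pair<L, R?>>(Pair(l, null))
    else hits.map { r -> Pair<L, R?>(l, r) }
}

fun main() {
    val customers = listOf(Customer(1, "Ada"), Customer(2, "Grace"))
    val orders = listOf(Order(1, 42.0))

    val joined = customers.leftJoinLocal(orders) { c, o -> c.id == o.customerId }
    for ((customer, order) in joined) {
        // order has type Order?, so order.total would not compile
        // without a safe call or an explicit null check.
        println("${customer.name}: ${order?.total ?: "no orders"}")
    }
}
```

Running this prints `Ada: 42.0` followed by `Grace: no orders`.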
Offers Kotlin-esque functions like withSparkStreaming for Spark Streaming, automating context management and checkpointing. The README includes a full streaming example with automatic JavaStreamingContext handling.
Supports seamless use in Kotlin Jupyter notebooks with %use spark magic for automatic Spark session initialization and HTML rendering of Datasets, as detailed in the Jupyter section with configuration examples.
Requires precise alignment of Spark, Scala, and library versions, with artifact names like kotlin-spark-api_3.3.2_2.13, making setup error-prone. The README's configuration section emphasizes this dependency matching.
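The naming scheme can be made explicit with a small helper. This is only a sketch inferred from the single artifact name quoted above (`kotlin-spark-api_<Spark version>_<Scala version>`); the function name `kotlinSparkArtifact` is hypothetical, and the group ID and library version still need to be taken from the README.

```kotlin
// Builds the artifact name from the Spark and Scala versions the cluster uses.
// The pattern is an assumption based on the example kotlin-spark-api_3.3.2_2.13.
fun kotlinSparkArtifact(sparkVersion: String, scalaVersion: String): String =
    "kotlin-spark-api_${sparkVersion}_${scalaVersion}"

fun main() {
    println(kotlinSparkArtifact("3.3.2", "2.13")) // kotlin-spark-api_3.3.2_2.13
}
```

Deriving the name from the same two version values used for the Spark dependencies themselves is one way to keep the three from drifting apart in a build script.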
Some Spark functions are exposed under different names (e.g., reduceGroupsK instead of reduceGroups) to avoid overload resolution ambiguity, which can confuse developers familiar with the standard Spark APIs. The README acknowledges this in its Overload Resolution Ambiguity section.
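The sketch below shows one way such an ambiguity can arise, using plain Kotlin with no Spark dependency. All types here (`ReduceFn`, `ScalaLikeFn2`, `Grouped`) are hypothetical stand-ins, and the local `reduceGroupsK` only mimics the naming idea; it is not the library's implementation.

```kotlin
// Two single-method interfaces with the same shape, standing in for a
// Java-style and a Scala-style function type.
fun interface ReduceFn<V> { fun call(a: V, b: V): V }
fun interface ScalaLikeFn2<V> { fun apply(a: V, b: V): V }

class Grouped<V>(private val values: List<V>) {
    // Two overloads that a bare lambda could match equally well.
    fun reduceGroups(f: ReduceFn<V>): V = values.reduce { a, b -> f.call(a, b) }
    fun reduceGroups(f: ScalaLikeFn2<V>): V = values.reduce { a, b -> f.apply(a, b) }
}

// A distinctly named wrapper that takes a Kotlin function type and picks
// one overload explicitly, so callers can pass a plain lambda.
fun <V> Grouped<V>.reduceGroupsK(f: (V, V) -> V): V =
    reduceGroups(ReduceFn<V> { a, b -> f(a, b) })

fun main() {
    val grouped = Grouped(listOf(1, 2, 3))
    // grouped.reduceGroups { a, b -> a + b }
    // The call above is expected to be rejected by the compiler,
    // because the lambda fits both overloads.
    println(grouped.reduceGroupsK { a, b -> a + b }) // 6
}
```

The cost of this approach is the one noted above: the renamed function no longer matches what developers know from the standard Spark API.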
While it bridges Kotlin and Spark, integration with third-party Spark libraries or advanced Spark features may require additional workarounds, as the project focuses on core API compatibility rather than full ecosystem coverage.