The "Awesome Apache Spark" project is a curated list of resources for developers and data engineers who use Apache Spark, the open-source unified analytics engine for large-scale data processing. It covers libraries, frameworks, tutorials, and tools for data processing, machine learning, and stream processing, and is useful to beginners and experienced practitioners alike.
The "Awesome Public Datasets" project is a curated collection of publicly available datasets across domains including government, healthcare, finance, and the social sciences. It lists datasets in multiple formats, along with links to tools and platforms for data analysis and visualization. Researchers, data scientists, and students can use it to find real-world data for projects and studies.
The "Awesome Big Data" project is a curated collection of resources on the technologies and practices used to process and analyze very large datasets. It is organized into categories such as frameworks, tools, libraries, databases, and tutorials, and covers data storage, processing, analytics, and visualization. It serves data scientists, engineers, and researchers at any experience level, whether they are building big data skills or choosing tools for a project.
The "Awesome Data Engineering" project is a curated collection of resources for data engineers, who design and build systems for collecting, storing, and analyzing data. It covers data pipelines, ETL tools, data warehousing solutions, frameworks, and best practices, as well as tutorials and community resources. The list is aimed at both beginners learning the fundamentals and experienced engineers looking for advanced techniques.
The "Awesome Network Analysis" project is a curated collection of resources for studying and analyzing networks, that is, structures made up of interconnected elements. It includes tools, libraries, datasets, and tutorials covering network theory, graph analysis, and visualization techniques. It is aimed at researchers, data scientists, and enthusiasts who study complex systems, social networks, and relationships in data, from beginners learning the basics to experienced analysts looking for advanced methods.
Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.
.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.
An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.
A library for writing Apache Spark applications in Haskell, enabling resilient analytics that scale to thousands of nodes.
An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.
An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.
A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.
A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.
A collection of libraries for large-scale data processing in Hadoop ecosystems, including Spark, Pig, and incremental MapReduce.
A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.
A library for parsing and querying XML data with Apache Spark SQL and DataFrames.
A connector that enables Apache Spark to read from and write to Apache Cassandra databases for distributed data processing.
Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.
An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.
An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.
A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.
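The last three entries describe table formats that offer transactional commits, upserts, deletes, and time-travel queries. As a rough illustration of that idea only, the Python sketch below keeps one immutable in-memory snapshot per commit. `ToyVersionedTable` and its methods are hypothetical names invented for this example. They are not the API of any listed project, and the sketch does not model file storage or concurrent writers.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyVersionedTable:
    """Hypothetical in-memory table; not the API of any real table format."""

    # One full snapshot (key -> row) per commit; version 0 is the empty table.
    _snapshots: List[Dict[Any, Dict[str, Any]]] = field(
        default_factory=lambda: [{}]
    )

    @property
    def latest_version(self) -> int:
        return len(self._snapshots) - 1

    def _commit(self, snapshot: Dict[Any, Dict[str, Any]]) -> int:
        # Appending a new snapshot never modifies earlier versions.
        self._snapshots.append(snapshot)
        return self.latest_version

    def upsert(self, key_column: str, rows: List[Dict[str, Any]]) -> int:
        """Insert rows with new keys, overwrite rows with existing keys."""
        snapshot = dict(self._snapshots[-1])  # shallow copy of the key map
        for row in rows:
            snapshot[row[key_column]] = dict(row)
        return self._commit(snapshot)

    def delete(self, keys: List[Any]) -> int:
        """Commit a new version without the given keys."""
        doomed = set(keys)
        current = self._snapshots[-1]
        return self._commit({k: v for k, v in current.items() if k not in doomed})

    def read(self, version: Optional[int] = None) -> List[Dict[str, Any]]:
        """Return rows as of `version` (time travel); default is the latest."""
        index = self.latest_version if version is None else version
        snapshot = self._snapshots[index]
        return [dict(snapshot[k]) for k in sorted(snapshot)]


# Usage: three commits, then a read of the latest and of an older version.
table = ToyVersionedTable()
v1 = table.upsert("id", [{"id": "a", "n": 1}, {"id": "b", "n": 2}])
v2 = table.upsert("id", [{"id": "b", "n": 20}, {"id": "c", "n": 3}])
v3 = table.delete(["a"])

latest_rows = table.read()          # rows "b" (n=20) and "c"
rows_at_v1 = table.read(version=1)  # rows "a" and "b" (n=2), as first written
```

Storing a full copy per commit keeps the sketch short. It is a simplification chosen for readability and says nothing about how the listed formats store their data.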