A curated list of awesome ETL frameworks, libraries, and software for data integration and pipeline development.
A high-performance Go driver for ClickHouse offering both native and standard database/sql interfaces.
High-performance datastore optimized for time series and tick data storage and retrieval.
Fast tool for comparing datasets within or across SQL databases to identify differences.
A curated list of awesome streaming frameworks, applications, readings, and resources for stream processing.
A declarative code-first data integration engine that unlocks 600+ APIs and databases, eliminating the need to write and maintain custom API integrations.
A Python library for defining portable, modular, and testable data transformation DAGs with built-in lineage and metadata.
A lightweight Python library for creating portable, expressive, and testable data transformation DAGs with built-in lineage and metadata.
A Python framework and Rust-based distributed processing engine for stateful event and stream processing.
A curated list of awesome Apache Spark packages, libraries, and resources for data engineers and scientists.
A Ruby framework for writing reliable, concise, and maintainable ETL (Extract-Transform-Load) data processing jobs.
A Python-powered SQL lineage analysis tool that extracts source and target tables from SQL commands without requiring deep knowledge of SQL parsers.
An open-source Reverse ETL platform for syncing data from warehouses to business tools like Salesforce, HubSpot, and Slack.
A Python framework for building real-time data pipelines and event-driven microservices on Apache Kafka using a Streaming DataFrame API.
Base classes for writing Apache Spark tests in Scala and Python, simplifying test setup and teardown.
Python library providing clean, chainable functions for data cleaning and manipulation with pandas DataFrames.
A transactional catalog for data lakes with Git-like semantics, enabling version control and branching for data assets.
A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.
Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.
A polyglot workflow automation platform that orchestrates self-contained agents written in any language, enabling periodic execution, polling, and event-driven orchestration.
A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.
A PySpark testing library providing fast helper methods with descriptive, color-coded error messages for DataFrame and column comparisons.
A Scala library providing essential Spark extensions, helper methods, and custom transformations to maximize developer productivity.
An open-source, cloud-native streaming database designed for real-time data processing and IoT applications.
A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.
A PySpark library providing helper methods for DataFrame validation, column transformations, and schema utilities to boost developer productivity.
A command-line tool for launching Apache Spark clusters on AWS EC2 with fast, configurable deployments.
A high-performance data profiler for discovering and validating complex patterns in datasets, enabling data cleaning and quality analysis.
A high-performance data profiler for discovering and validating complex patterns like functional dependencies, inclusion dependencies, and association rules.
A visualization framework for Apache Pig workflows that combines graphical depictions with real-time execution information.
A fast Apache Spark testing helper library with beautifully formatted error messages for Scala applications.
A fast, fully-featured, and developer-friendly Clojure API for Apache Spark.
A Python data validation toolkit that finds data quality issues and generates beautiful, shareable reports for team collaboration.
WarcDB is an SQLite-based file format that makes web crawl data easier to share and query.
A free, open-source alternative to Spark UI and Spark History Server with enhanced CPU and memory metrics visualizations.
A curated list of the best resources, tools, libraries, and documentation for the Apache Cassandra database ecosystem.