Data Engineering

#database #apache #developer-tools

Cassandra (HTML)

A curated list of the best resources, tools, libraries, and documentation for the Apache Cassandra database ecosystem.

Stars: 319

Forks: 61

#brewer-theorem #database-architecture #cap-theorem

NoSQL Guides

A curated collection of resources and guides for understanding, selecting, and using NoSQL databases effectively.

Stars: 301

Forks: 30

Last commit: 4 years ago

More SQL Parsing! (Python)

A Python SQL parser that converts SQL queries into JSON-izable parse trees for translation to non-SQL datastores.

#ast-generation #sql-flavors #python-library

Stars: 294

Forks: 56

Last commit: 8 months ago

Geni (Clojure)

An idiomatic Clojure dataframe library that runs on Apache Spark, providing a seamless interface for data processing and machine learning.

#apache-spark #high-performance-computing #spark

Stars: 294

Forks: 26

#unit-testing #klarna-featured #hive-sql

HiveRunner (Java)

An open-source unit test framework for Hive SQL queries, enabling TDD without installed dependencies via JUnit 4 and 5.

Stars: 262

Forks: 78

Last commit: 1 year ago

Csvlint.go (Makefile)

A Go library and CLI tool for validating CSV files against RFC 4180 standards.

#csv-validation #go-library #command-line-tool

Stars: 208

Forks: 21

Last commit: 8 months ago

DQOps (Java)

A DataOps-friendly data quality monitoring platform with customizable checks, dashboards, and incident management for multiple data sources.

#data-quality-report #data-observability #data-quality-checks

Stars: 194

Forks: 37

Last commit: 6 months ago

HBase

A curated list of awesome HBase projects, clients, frameworks, tools, and resources.

#data-storage #data-integration #hbase

Stars: 180

Forks: 41

Last commit: 2 months ago

Streamline (Java)

A visual development platform for building, deploying, and managing streaming analytics applications with multiple engine bindings.

#stream-processing #flink #storm

Stars: 167

Forks: 95

#apache-spark #data-engineering #gcp

Spark-BigQuery (Scala)

A Spark library for reading from and writing to Google BigQuery using DataFrames and SQL.

Stars: 156

Forks: 50

Last commit: 6 years ago

elusion (Rust)

A Rust DataFrame and data engineering library with PySpark/SQL-like syntax, built for business data pipelines with Microsoft stack integration.

#pyspark-alternative #sql-like #data-science

Stars: 143

Forks: 4

Last commit: 3 months ago

The "Database as Code" Manifesto

A manifesto advocating for treating database interactions, queries, and lifecycle management as plain code with SQL as the primary language.

#database #version-control #devops

Stars: 117

Forks: 5

Last commit: 3 months ago

spark-connect-rs (Rust)

An experimental Rust client for Apache Spark Connect, providing a DataFrame API to interact with Spark clusters.

#spark-connect #apache-spark #spark

Stars: 116

Forks: 24

Last commit: 1 year ago

Datajob (Python)

A Python framework for building and deploying serverless data and ML pipelines on AWS using AWS CDK.

#glue #sagemaker #stepfunctions

Stars: 111

Forks: 19

#database-driver #swoole #tdengine-client

php-tdengine (PHP)

A PHP client extension for the TDengine big data engine, with Swoole coroutine support.

Stars: 77

Forks: 9

#unit-testing #apache-hive #data-engineering

Beetest (Java)

A simple utility for testing Apache Hive scripts locally without requiring Java development skills.

Stars: 73

Forks: 23

Last commit: 9 years ago

ByteHub (Python)

An easy-to-use Python feature store for machine learning, optimized for timeseries data and built on Dask.

#parquet #data-science #data-engineering

Stars: 61

Forks: 4

Last commit: 5 years ago

Parquet (PHP)

A pure PHP library for reading and writing Parquet columnar storage files without external dependencies.

#parquet #file-format #data-engineering

Stars: 60

Forks: 3

Last commit: 11 days ago

Julia (Julia)

A Julia client interface for reading from and writing to the TypeDB knowledge graph database.

#typedb-client #julia #typedb-osi

Stars: 54

Forks: 4

Last commit: 4 years ago

cl-influxdb (Common Lisp)

A native Common Lisp interface for the InfluxDB time series database.

#quicklisp #data-engineering #monitoring

Stars: 24

Forks: 3

Last commit: 9 years ago

Provero (Python)

A vendor-neutral, declarative data quality engine that defines checks in YAML and runs anywhere.

#data-testing #airflow #yaml

Stars: 17

Forks: 2

Last commit: 4 days ago

SnackFS (Scala)

A lightweight, HDFS-compatible file system built over Cassandra with a fat driver design for easy deployment.

#distributed-filesystem #hdfs-compatible #storage

Stars: 13

Forks: 5

Last commit: 11 years ago

Cassandra.Lunch

A weekly online meetup and resource hub for Apache Cassandra topics, featuring talks, tutorials, and community discussions.

#airflow #devops #akka

Stars: 10

Forks: 9

#batch-processing #deduplication #zero-dependencies

datatrax (Go)

A pure Go toolkit for data engineering and classic machine learning with zero external dependencies.

Stars: 10

#crypto #eos #cryptocurrency

Last commit: 2 months ago

eos-etl (Python)

Extract, transform, and load (ETL) scripts for exporting and streaming EOS blockchain data.

Stars: 8

Forks: 7

#parquet #high-performance #simd

koala-diff (Python)

A blazingly fast data comparison tool for Python that instantly compares massive CSV/Parquet datasets, powered by Rust.

Stars: 7

#parquet #network-traffic #pcap

Last commit: 4 months ago

pcaptoparquet (Python)

A Python package for converting PCAP network capture files to Parquet, CSV, or JSON formats.

Stars: 6

Forks: 1

Last commit: 8 months ago

crdt-merge (Python)

A CRDT-based merge library that guarantees mathematical convergence for DataFrames, JSON, ML models, and distributed agents.

#federated-learning #python-library #collaborative-editing

Stars: 5

#shell-integration #workflow-automation #oh-my-zsh-zsh-plugin-databricks-cli-productivity

Last commit: 13 days ago

databricks (Shell)

A Zsh plugin that enhances Databricks CLI with convenient aliases, profile management, and job run analysis using the 'dbrs' prefix.

Stars: 3