Bigdata

21 projects

A high-performance, S3-compatible distributed object storage system built in Rust, optimized for data lakes and AI workloads.

#openstack-swift #filesystem #amazon-s3

An enterprise distributed database ecosystem that enhances heterogeneous databases with sharding, scalability, and security via JDBC and Proxy access layers.

#jdbc-driver #database #sql-federation

A curated list of awesome big data frameworks, resources, and tools across various categories.

#database #data-science #distributed-systems

Stars: 14.5k · Forks: 2.6k · Last commit: 2 months ago

Big Data

A curated list of awesome big data frameworks, resources, and tools across various categories.

#database #data-storage #open-source

Stars: 14.5k · Forks: 2.6k · Last commit: 2 months ago

Databend · Rust

An open-source enterprise data warehouse built in Rust for AI agents, analytics, vector search, and full-text search.

#ai #database #serverless

A high-performance Python DataFrame library for lazy out-of-core processing and visualization of billion-row datasets at interactive speeds.

#out-of-core #python-dataframe #apache-arrow

Stars: 8.5k · Forks: 603 · Last commit: 3 months ago

Apache Hudi · Java

An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.

#apache-flink #upsert-delete #stream-processing

Stars: 6.2k · Forks: 2.5k · Last commit: 2 days ago

.NET for Apache Spark · C#

.NET for Apache Spark provides high-performance .NET APIs for Apache Spark, enabling C# and F# developers to work with structured and streaming data.

#apache-spark #spark #dataframe

Stars: 2.1k · Forks: 332 · Last commit: 2 months ago

Poli · Java

An easy-to-use, self-hosted SQL reporting application for creating interactive business intelligence dashboards.

#reporting #self-hosted-bi #sql-reporting

Stars: 2.0k · Forks: 336 · Last commit: 3 years ago

Genie · Java

A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.

#data-orchestration #spark #netflixoss

Stars: 1.8k · Forks: 375 · Last commit: 7 days ago

Optimus · Python

A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.

#data-cleaning #cudf #spark

Stars: 1.5k · Forks: 232 · Last commit: 1 year ago

Kube Batch · Go

A Kubernetes batch scheduler for high-performance workloads such as AI/ML, big data, and HPC.

#high-performance-computing #batch-scheduler #kubernetes

Stars: 1.1k · Forks: 260 · Last commit: 3 years ago

Livy · Scala

A REST interface for interacting with Apache Spark from anywhere, enabling remote code execution and job submissions.

#apache-spark #spark #interactive-computing

Stars: 958 · Forks: 625 · Last commit: 12 days ago

Gearpump · Scala

A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.

#stream-processing #akka #cluster-computing

Stars: 756 · Forks: 150 · Last commit: 7 days ago

BigARTM · C++

A fast, open-source platform for topic modeling using Additive Regularization of Topic Models (ARTM).

#additive-regularization #sparse-modeling #python-library

Stars: 675 · Forks: 119 · Last commit: 5 months ago

Kotlin for Apache Spark · Kotlin

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481 · Forks: 37 · Last commit: 1 month ago

athenacli · Python

A command-line interface for AWS Athena with auto-completion and syntax highlighting.

#dbcli #autocompletion #database

Stars: 228 · Forks: 36 · Last commit: 2 months ago

FastWARC · Rust

A collection of robust and fast Python tools for parsing, extracting, and analyzing web archive data, including a high-performance WARC parser.

#cython #batch-processing #content-extraction

Stars: 144 · Forks: 18 · Last commit: 1 month ago

NewLife.XCode · C#

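A high-performance .NET data middleware refined over 20 years, focused on extreme performance, massive data volumes, automatic modeling and migration, multi-level caching, and automatic table and database sharding. Supports MySQL, SQLite, SQL Server, Oracle, PostgreSQL, Dameng (达梦), and others.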

#orm #bigdata

Stars: 102 · Forks: 39 · Last commit: 2 days ago

NebulaStream · C++

An end-to-end data management system for IoT, optimizing stream processing across cloud, edge, and sensor deployments.

#stream-processing #sql-engine #streamprocessing

Stars: 88 · Forks: 42 · Last commit: 2 days ago

Docker for beginners · Jupyter Notebook

A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.

#google-colab #mapreduce-bash #apache-spark

Stars: 87 · Forks: 27 · Last commit: 2 months ago
