Big Data

258 projects

Showing 36 of 258 projects

An organized reading list of patterns, case studies, and articles on building scalable, reliable, and performant large-scale systems.

#devops #case-studies #reliability

Stars 72.6k

Forks 7.0k

Last commit 6 months ago

SeaweedFS (Go)

A distributed storage system for object storage (S3), file systems, and Iceberg tables, optimized for billions of files with O(1) disk access.

#high-scalability #tiered-file-system #distributed-storage

Stars 33.6k

Forks 2.9k

Last commit

Apache Kafka Streams (Java)

A distributed event streaming platform for building high-performance data pipelines, streaming analytics, and data integration.

#stream-processing #message-queue #data-integration

Stars 33.3k

Forks 15.4k

Last commit 6 hours ago

data-science-ipython-notebooks (Python)

A comprehensive collection of data science Python notebooks covering deep learning, machine learning, big data, visualization, and essential tools.

#data-science #deep-learning #python

A scalable, portable, and distributed gradient boosting library for efficient machine learning across multiple languages and platforms.

#parallel-computing #gbdt #data-science

Stars 28.6k

Forks 8.9k

Last commit 2 days ago

gun (JavaScript)

A decentralized graph database and synchronization protocol for building real-time, offline-first applications with end-to-end encryption.

#decentralized-database #crypto #graph

Stars 19.1k

Forks 1.2k

Last commit 4 months ago

ScyllaDB (C++)

A high-performance NoSQL database compatible with Apache Cassandra and Amazon DynamoDB, built on a shared-nothing architecture.

#real-time-database #database #seastar

Stars 15.7k

Forks 1.5k

Last commit 5 hours ago

awesome-bigdata

A curated list of awesome big data frameworks, resources, and tools across various categories.

#database #data-science #distributed-systems

Stars 14.5k

Forks 2.6k

Last commit 2 months ago

Big Data

A curated list of awesome big data frameworks, resources, and tools across various categories.

#database #data-storage #open-source

Stars 14.5k

Forks 2.6k

Last commit 2 months ago

JuiceFS (Go)

A high-performance distributed POSIX file system for cloud-native environments, storing data in object storage and metadata in databases.

#filesystem #data-storage #high-performance

Stars 14.2k

Forks 1.3k

Last commit 19 hours ago

Druid (Java)

A high-performance real-time analytics database designed for fast queries and ingest to reduce time to insight.

#apache #high-performance #real-time-analytics

Stars 14.0k

Forks 3.8k

Last commit 2 days ago

Trino (Java)

A fast distributed SQL query engine for big data analytics, enabling interactive queries across diverse data sources.

#database #distributed-systems #query-engine

Stars 13.1k

Forks 3.7k

Last commit 2 days ago

PredictionIO (Scala)

An open source machine learning server for developers and data scientists, supporting event collection, algorithm deployment, and REST API queries.

#event-collection #spark #hbase

Stars 12.5k

Forks 1.9k

Last commit 5 years ago

kafka-manager (Scala)

A web-based tool for managing Apache Kafka clusters, enabling cluster inspection, topic management, and partition operations.

#devops #distributed-systems #kafka

Stars 11.9k

Forks 2.5k

Last commit 3 years ago

Quickwit-oss/quickwit (Rust)

A cloud-native search engine optimized for observability data like logs and traces, offering sub-second search on cloud storage.

#open-source #observability #logs

Stars 11.4k

Forks 567

Last commit 2 days ago

cython (Cython)

The most widely used Python-to-C compiler.

#cython #cpython #c

Stars 10.8k

Forks 1.6k

Last commit 2 days ago

modin (Python)

A drop-in replacement for pandas that scales data analysis workflows to use all CPU cores and handle out-of-memory datasets.

#parallel-computing #distributed #data-science

Stars 10.4k

Forks 676

Last commit 5 months ago

Apache Iceberg (Java)

A high-performance table format for huge analytic datasets, enabling multiple engines to safely work with the same tables simultaneously.

#apache-flink #hacktoberfest #apache-spark

Stars 9.1k

Forks 3.4k

Last commit 2 days ago

datafusion (Rust)

An extensible SQL query engine written in Rust, using Apache Arrow as its in-memory format for building fast database and analytic systems.

#columnar-database #apache-arrow #dataframe

Stars 9.0k

Forks 2.3k

Last commit 2 days ago

Delta Lake (Scala)

An open-source storage framework that enables building a Lakehouse architecture with ACID transactions and scalable metadata handling.

#apache-spark #parquet #data-versioning

Stars 8.9k

Forks 2.1k

Last commit 2 days ago

awesome-data-engineering

A curated list of data engineering tools, frameworks, databases, and resources for software developers.

#stream-processing #workflow-orchestration #awesome-list

Stars 8.9k

Forks 1.6k

Last commit 5 days ago

vaex (Python)

A high-performance Python DataFrame library for lazy out-of-core processing and visualization of billion-row datasets at interactive speeds.

#out-of-core #python-dataframe #apache-arrow

Stars 8.5k

Forks 603

Last commit 3 months ago

BigCache (Go)

A fast, concurrent, evicting in-memory cache for Go designed to store gigabytes of data with minimal GC overhead.

#eviction-cache #hacktoberfest #http-server

Stars 8.1k

Forks 610

Last commit 3 days ago

h2o (Jupyter Notebook)

An open-source, in-memory platform for distributed and scalable machine learning with support for a wide range of algorithms and big data technologies.

#h2o #ensemble-learning #random-forest

Stars 7.5k

Forks 2.0k

Last commit 2 days ago

Moloch (C)

An open-source, large-scale network packet capture, indexing, and analysis system for security and network monitoring.

#pcap #network-forensics #pcap-indexing

Stars 7.4k

Forks 1.2k

Last commit 2 days ago

Arkime (C)

An open-source, large-scale network packet capture, indexing, and analysis system with a web interface.

#pcap #network-forensics #pcap-indexing

Stars 7.4k

Forks 1.2k

Last commit 2 days ago

Alluxio (Java)

A distributed caching platform that bridges computation frameworks and storage systems for large-scale analytics and ML workloads.

#data-orchestration #spark #memory-speed

Stars 7.2k

Forks 2.9k

Last commit 1 year ago

Feast - A Feature Store for ML for GCP by Gojek/Google (Python)

An open-source feature store for managing and serving machine learning features for training and online inference.

#features #batch-processing #data-science

A seamless multi-primary syncing database with an intuitive HTTP/JSON API, designed for reliability.

#database #couchdb #http

Stars 6.9k

Forks 1.1k

Last commit 2 days ago

Hazelcast (Java)

A unified real-time data platform combining stream processing with a fast data store for instant action on data-in-motion.

#stream-processing #hacktoberfest #hazelcast

Stars 6.6k

Forks 1.9k

Last commit 2 days ago

Apache Hudi (Java)

An open data lakehouse platform for incremental data processing with upserts, deletes, and time-travel queries.

#apache-flink #upsert-delete #stream-processing

Stars 6.2k

Forks 2.5k

Last commit 2 days ago

SqlSugar (C#)

A high-performance, multi-database compatible .NET ORM framework with low-code features and enterprise-ready solutions.

#orm #database #high-performance

Stars 5.8k

Forks 1.4k

Last commit 2 days ago

JanusGraph (Java)

An open-source, distributed graph database optimized for storing and querying large graphs with billions of vertices and edges.

#tinkerpop #graph #hbase

Stars 5.8k

Forks 1.2k

Last commit 11 hours ago

Mesos (C++)

A cluster manager that provides efficient resource isolation and sharing across distributed applications on a shared pool of nodes.

#resource-isolation #container-orchestration #distributed-systems

Stars 5.4k

Forks 1.7k

Last commit 2 months ago

Ignite (Java)

A distributed database for high-performance computing with in-memory speed, ACID compliance, and ANSI SQL support.

#iot #data-grid #mapreduce

Stars 5.1k

Forks 1.9k

Last commit 2 days ago

vue-virtual-scroll-list (JavaScript)

A Vue component for rendering large lists with high performance using virtual scrolling.

#virtual-scrolling #data-rendering #large-lists

Stars 4.5k

Forks 596

Last commit 2 years ago
