Data Pipeline

131 projects

A distributed event streaming platform for building high-performance data pipelines, streaming analytics, and data integration.

#stream-processing#message-queue#data-integration

Stars33.3k

Forks15.4k

Last commit11 hours ago

vectorRust

A high-performance, end-to-end observability data pipeline for collecting, transforming, and routing logs and metrics.

#stream-processing#hacktoberfest#pipelines

Stars22.2k

Forks2.2k

Last commit11 hours ago

Awesome Production Machine Learning

A curated list of awesome open-source libraries for deploying, monitoring, versioning, and scaling production machine learning systems.

#ai-infrastructure#open-source#explainability

Stars20.8k

Forks2.6k

Last commit9 days ago

TelegrafGo

A plugin-driven agent for collecting, processing, aggregating, and writing metrics, logs, and arbitrary data.

#plugin-system#observability#logs

Stars17.7k

Forks5.8k

Last commit13 hours ago

LogstashJava

A server-side data processing pipeline that ingests, transforms, and ships logs and events from multiple sources.

#event-processing#jruby#server-side

Stars14.9k

Forks3.5k

Last commit10 hours ago

awesome-bigdata

A curated list of awesome big data frameworks, resources, and tools across various categories.

#database#data-science#distributed-systems

Stars14.5k

Forks2.6k

Last commit2 months ago

FluentdRuby

An open-source log collector that unifies logging infrastructure by collecting events from various sources and routing them to multiple destinations.

#event-processing#devops#observability

Stars13.6k

Forks1.4k

Last commit2 days ago

DebeziumJava

A low-latency platform for change data capture (CDC) that streams row-level changes from databases to applications.

#database#event-driven-architecture#cqrs

Stars12.9k

Forks3.0k

Last commit15 hours ago

Machine Learning InterviewsHTML

A practical booklet covering the four main steps of designing machine learning systems with 27 interview questions.

#data-science#machine-learning-production#production-ml

BenthosGo

A high-performance, declarative stream processor that connects various sources and sinks with built-in data transformation capabilities.

#stream-processing#cqrs#message-queue

Stars8.7k

Forks952

Last commit9 hours ago

KreuzbergRust

A polyglot document intelligence framework with a Rust core for extracting text, metadata, and structured data from 91+ file formats.

#text-extraction#document-intelligence#batch-processing

Stars8.7k

Forks526

Last commit4 hours ago

Pentaho Data IntegrationJava

An open-source ETL (Extract, Transform, Load) tool for data integration and migration.

#plugin-system#data-integration#business-intelligence

Stars8.4k

Forks3.6k

Last commit5 hours ago

snowplowScala

Open-source customer data infrastructure that collects, validates, and enriches behavioral event data for AI and analytics.

#snowplow-events#data-warehouse-integration#event-tracking

Stars7.0k

Forks1.2k

Last commit27 days ago

CloudQueryGo

Open-source data pipelines to sync cloud infrastructure metadata from AWS, Azure, GCP, and 70+ sources into your data warehouse.

#sql-queryable#multi-cloud#apache-arrow

Stars6.5k

Forks548

Last commit6 days ago

Apache NiFiJava

An easy-to-use, powerful, and reliable system to process and distribute data across cybersecurity, observability, and AI pipelines.

#hacktoberfest#apache#observability

Stars6.2k

Forks3.0k

Last commit13 hours ago

kcatC

A lightweight command-line tool for producing, consuming, and inspecting Apache Kafka messages, similar to netcat for Kafka.

#devops#message-queue#command-line-tool

Stars5.8k

Forks500

Last commit2 years ago

fluvioRust

A distributed data streaming engine with stateful stream processing for building responsive data-intensive applications.

#stream-processing#event-driven#webassembly

Stars5.2k

Forks532

Last commit7 days ago

Clidey WhoDBGo

A lightweight, fast, and beautiful database management tool with AI-powered chat interface for PostgreSQL, MySQL, SQLite, MongoDB, Redis, and more.

#data-lineage#database#explorer

Stars4.9k

Forks222

Last commit14 hours ago

RudderStackGo

An open-source, privacy-focused customer data platform (CDP) that collects, processes, and routes event data to warehouses and tools.

#event-collection#segment-alternative#warehouse-management

Stars4.5k

Forks57

Last commit6 hours ago

DotnetSpiderC#

A lightweight, efficient, and fast high-level web crawling and scraping framework for .NET.

#web-crawling#distributed#redis

Stars4.1k

Forks1.1k

Last commit3 months ago

tengoGo

A fast, embeddable scripting language for Go applications, compiled to bytecode and executed on a stack-based VM.

#programming-language#compiler#rules-engine

Stars3.8k

Forks334

Last commit2 months ago

ingestrGo

A CLI tool to copy data between any databases and platforms with a single command, no code required.

#dlt#mssql#no-code

Stars3.8k

Forks143

Last commit16 hours ago

databusJava

A source-agnostic distributed change data capture system for reliably capturing and streaming primary data changes.

#linkedin#oracle#change-data-capture

Stars3.7k

Forks737

Last commit2 years ago

DaguGo

Local-first workflow engine for ops automation and AI-assisted operations. Open source and self-hostable: single binary, no DBMS. Define DAGs in declarative YAML. Built-in MCP server so AI agents can manage your DAGs.

#task-automation#devops#task-scheduler

Stars3.7k

Forks295

Last commit3 hours ago

Ensemble-StrategyPython

An AI-native modular infrastructure for quantitative trading, featuring a weight-centric architecture for building, testing, and deploying algorithmic strategies.

#backtesting#algorithmic-trading#finrl

Stars3.5k

Forks1.0k

Last commit