Data Pipeline

131 projects

Showing 36 of 131 projects

go-streams (Go)

A lightweight and efficient stream processing library for Go, providing a declarative DSL to build data pipelines.

#stream-processing #pulsar #redis

Stars: 2.2k

Forks: 174

Last commit: 6 months ago

doit (Python)

CLI task management & automation tool

#workflow-management #workflow #data-science

Stars: 2.1k

Forks: 193

Last commit: 5 months ago

rust-rdkafka (Rust)

A fully asynchronous, futures-based Apache Kafka client library for Rust built on librdkafka.

#stream-processing #futures #message-queue

Stars: 2.0k

Forks: 356

Last commit: 8 days ago

Elasticsearch Hadoop (Java)

Native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.

#apache-spark #mapreduce #data-integration

Stars: 2.0k

Forks: 1.0k

Last commit: 15 hours ago

Secor (Java)

A fault-tolerant service that persists Kafka log data to cloud storage like S3, GCS, Azure Blob Storage, and OpenStack Swift.

#distributed-systems #data-archiving #hadoop-ecosystem

Stars: 1.9k

Forks: 537

Last commit: 4 months ago

Embulk (Java)

A parallel bulk data loader that transfers data between various storages, databases, NoSQL, and cloud services via plugins.

#gradle #jruby #data-transfer

Stars: 1.8k

Forks: 204

Last commit: 1 month ago

Kiba (Ruby)

A Ruby framework for writing reliable, concise, and maintainable ETL (Extract-Transform-Load) data processing jobs.

#rubydatascience #etl-ruby #ruby-gem

Stars: 1.8k

Forks: 90

Last commit: 6 months ago

Genie (Java)

A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.

#data-orchestration #spark #netflixoss

Stars: 1.8k

Forks: 375

Last commit: 10 days ago

underscore-cli (JavaScript)

A command-line utility for processing JSON and JavaScript data, inspired by Perl and Unix tools like sed and awk.

#cli-tool #nodejs #json-processing

Stars: 1.7k

Forks: 80

Last commit: 5 years ago

Jolt (Java)

A Java library for declarative JSON-to-JSON transformations using JSON-based specifications.

#java-library #declarative-spec #data-restructuring

Stars: 1.7k

Forks: 351

Last commit: 29 days ago

Multiwoven (Ruby)

An open-source Reverse ETL platform for syncing data from warehouses to business tools like Salesforce, HubSpot, and Slack.

#open-source #reverse-etl #data-integration

Stars: 1.7k

Forks: 92

Last commit: 9 hours ago

Bruin (Go)

A unified data pipeline tool for ingestion, transformation with SQL/Python/R, and data quality checks across major platforms.

#data-modeling #data-quality #python

Stars: 1.7k

Forks: 85

Last commit: 18 hours ago

mongo-hadoop (Java)

A library enabling MongoDB to serve as an input source or output destination for Hadoop MapReduce tasks and ecosystem tools.

#mapreduce #bson #spark

Stars: 1.6k

Forks: 588

Last commit: 4 years ago

smarter_csv (Ruby)

A high-performance CSV ingestion and generation library for Ruby with C acceleration, designed for real-world data with intelligent defaults.

#sidekiq #csv-processing #batch-processing

A Reactive Streams connector for Apache Kafka built on Akka Streams, enabling back-pressured integration for Java and Scala.

#stream-processing #akka #kafka-connector

Stars: 1.4k

Forks: 370

Last commit: 3 days ago

DeepOSM (Python)

Train neural networks with OpenStreetMap data and satellite imagery to classify roads and map features.

#geospatial #deep-learning #neural-networks

Stars: 1.3k

Forks: 184

Last commit: 9 years ago

Dataflow Templates (Java)

A collection of pre-built Google Cloud Dataflow templates for common data import/export, backup, and bulk API operations.

#stream-processing #batch-processing #data-integration

A high-performance Rust stream processing engine with integrated AI capabilities for real-time data processing and intelligent analysis.

#stream-processing #event-driven #ai-integration

Stars: 1.3k

Forks: 46

Last commit: 8 days ago

Hazelcast Jet (Java)

An open-source, in-memory, distributed batch and stream processing engine for Java applications.

#stream-processing #event-processing #hacktoberfest

Stars: 1.1k

Forks: 203

Last commit: 1 year ago

s3-lambda (JavaScript)

Run lambda functions over S3 objects with concurrency control for data pipelining and analytics.

#batch-processing #serverless-patterns #s3

Stars: 1.1k

Forks: 47

Last commit: 9 years ago

s3renity (JavaScript)

Run lambda functions over S3 objects with concurrency control for data pipelining and analytics.

#cloud-storage #s3 #nodejs

Stars: 1.1k

Forks: 47

Last commit: 9 years ago

magrittr (R)

An R package providing the %>% pipe operator to improve code readability by structuring data operations left-to-right.

#functional-programming #pipe-operator #r-package

A scalable n:m message multiplexer written in Go for routing messages from multiple sources to multiple destinations.

#stream-processing #log-router #message-queue

Stars: 940

Forks: 74

Last commit: 9 months ago

jedisct1/flowgger (Rust)

A fast, secure, and standalone log collector written in Rust that parses, validates, and forwards log data.

#graylog #syslog #observability

Stars: 882

Forks: 60

Last commit: 1 year ago

camus (Java)

LinkedIn's previous-generation Kafka-to-HDFS pipeline for batch data ingestion.

#batch-processing #linkedin #kafka

Stars: 881

Forks: 451

Last commit: 5 years ago

suro (Java)

A distributed data pipeline service for collecting, aggregating, and dispatching large volumes of application events and log data.

#message-queue #netflixoss #distributed-systems

Stars: 796

Forks: 170

Last commit: 3 years ago

VAST (C++)

A data pipeline engine for security teams to collect, transform, enrich, and route telemetry data at scale.

#stream-processing #security-analytics #siem

Stars: 750

Forks: 106

Last commit: 17 hours ago

easy-batch (Java)

A simple, lightweight batch processing framework for Java designed for ETL jobs.

#batch-processing #batch #file-processing

Stars: 622

Forks: 197

Last commit: 3 years ago

Streamiz (C#)

A .NET stream processing library for Apache Kafka, providing a Kafka Streams-like API for building real-time applications.

#stream-processing #event-driven #kafka-streams-dotnet

Stars: 541

Forks: 81

Last commit: 12 hours ago

data-pipeline-samples (Python)

Sample AWS Data Pipeline templates for automating data movement and transformation workflows.

#devops #workflow-automation #infrastructure-as-code

Stars: 472

Forks: 259

Last commit: 6 years ago

Lipstick (JavaScript)

A visualization framework for Apache Pig workflows that combines graphical depictions with real-time execution information.

#hadoop-ecosystem #data-engineering #big-data

Stars: 466

Forks: 133

Last commit: 3 years ago

Tributary (Python)

A Python library for constructing reactive dataflow graphs and streaming computations as data models.

#real-time-processing #data-modeling #python-data-streams

Stars: 465

Forks: 39

Last commit: 1 month ago

DM (Go)

A unified data replication platform for TiDB, providing MySQL/MariaDB migration and change data capture to downstream systems.

#tidb #change-data-capture #data-replication

A serverless toolkit for routing, normalizing, and enriching security event and audit logs in AWS.

#aws-serverless #observability #log-enrichment

Stars: 403

Forks: 35

Last commit: 6 months ago

Dray (Go)

A RESTful engine for orchestrating sequential Docker container workflows, marshaling data between steps.

#devops #batch-processing #redis

Stars: 385

Forks: 37

Last commit: 6 years ago

amazon-kinesis-client-python (Python)

A Python interface to the Amazon Kinesis Client Library for building distributed applications that process streaming data reliably at scale.

#stream-processing #kinesis #python-library

Stars: 376

Forks: 228

Last commit

