Page 3 - Data Pipeline Open Source Projects | Open Awesome

Data Pipeline

131 projects

Showing 36 of 131 projects

amazon-kinesis-connectors (Java)

A Java library for building data pipelines that connect Amazon Kinesis streams to AWS and non-AWS services like DynamoDB, Redshift, S3, and Elasticsearch.

#stream-processing #java-library #batch-processing

A Kotlin library for extracting path-based code representations and ASTs from multiple languages to prepare code for machine learning models.

#multi-language #research-tool #ast-extraction

Last commit: 8 months ago

Detection and Response Pipeline

A curated reference hub of tools and real-world examples for designing effective threat detection and response pipelines.

#security-reference #self-hosted-security #security-automation

Last commit: 2 years ago

A Go library for declarative JSON-to-JSON transformations using JSON specifications.

#jsonpath #golang-library #json-format

Last commit: 4 years ago

scicloj.ml (Clojure)

An idiomatic Clojure machine learning library providing a unified interface for classification, regression, and unsupervised models.

#metamorph #tech-ml-dataset #hyperparameter-optimization

Last commit: 8 months ago

docker-logstash (Shell)

Docker image for Logstash 1.4.5 with optional Elasticsearch 1.7.0 and Kibana 3.1.2 integration.

#elk-stack #logstash #monitoring

Last commit: 10 years ago

A Go-based toolkit for fast ETL and feature extraction on Hadoop, optimized for rapid development and execution.

#hive #pig #feature-extraction

Last commit: 11 years ago

An R package providing a toolbox of pipeline-friendly functions for manipulating and querying non-tabular data stored in list objects.

#functional-programming #r-package #r-language

Last commit: 3 years ago

A monolith codebase that powers the Iowa Environmental Mesonet's environmental data ingest, processing, and web services.

#iowa-mesonet #environmental-data #meteorology

Last commit: 1 day ago

An R package providing multiple pipeline styles (operator, object, function) for readable function chaining and data transformation.

#workflow-tools #r-package #r-language

Last commit: 10 years ago

geojsonio-cli (JavaScript)

A CLI tool to send GeoJSON files to geojson.io for instant visualization and editing.

#geojson #geospatial #nodejs

Last commit: 8 years ago

A Go library for building data processing workflows and pipelines with functional operations, cycles, and fan-out capabilities.

#pipeline-framework #stream-processing #functional-programming

Last commit: 8 days ago

A visual development platform for building, deploying, and managing streaming analytics applications with multiple engine bindings.

#stream-processing #flink #storm

Last commit: 2 years ago

aws-pdf-textract-pipeline (TypeScript)

Serverless data pipeline for crawling PDFs from the web and extracting structured data using AWS Textract.

#lambda #web-crawling #aws-textract

Last commit: 2 years ago

s3-multipart (Python)

Python utilities for parallel uploads and downloads to Amazon S3 using multipart uploads and range requests.

#cli-tool #python #file-transfer

Last commit: 10 years ago

ros2_data_collection (C++)

Collect, validate, and send ROS 2 data to build APIs and dashboards with reliable data pipelines.

#robotics #ros2 #api-builder

Last commit: 21 hours ago

Covid-19 Google (Python)

An open-source data pipeline that aggregates and standardizes heterogeneous public COVID-19 data from multiple global sources.

#data-standardization #python #covid-19

Last commit: 5 years ago

A universal data converter that translates JSON, BSON, YAML, CSV, XML, and MT940 to any format using Go templates.

#bson #lua-scripting #go-templates

Last commit: 8 months ago

A Python framework for building and deploying serverless data and ML pipelines on AWS using AWS CDK.

#glue #sagemaker #stepfunctions

Last commit: 3 years ago

An open-source framework for developing large-scale anomaly detection models using Apache Spark.

#statistical-models #apache-spark #security-analytics

Last commit: 6 years ago

fluent-plugin-influxdb (Ruby)

A buffered output plugin for Fluentd that sends time-series data to InfluxDB.

#ruby-gem #monitoring #fluentd-plugin

Last commit: 1 year ago

/context-prime (TypeScript)

A modern analytics pipeline for tracking and analyzing GitHub contributions across repositories with AI-powered summaries and leaderboards.

#mmo #scoring #nextjs

Last commit: 23 hours ago

logstash-input-dynamodb (Ruby)

A Logstash input plugin that reads data from DynamoDB tables via table scans and DynamoDB Streams for near real-time data processing.

#stream-processing #logstash-plugin #ruby-gem

kafka-sparkstreaming-cassandra (Jupyter Notebook)

A Docker container providing a complete streaming environment for experimenting with Kafka, Spark Streaming, and Cassandra.

#apache-spark #experimentation #real-time-processing

Flume MongoDB Sink (Java)

A Flume NG sink that writes event data to MongoDB with flexible configuration for collection mapping and data transformation.

#batch-processing #log-aggregation #java

Last commit: 3 years ago

kinesis-poster-worker (Python)

A multi-threaded Python example demonstrating how to produce and consume records from Amazon Kinesis streams.

#stream-processing #multi-threading #aws-sdk

Last commit: 11 years ago

history-tools (C++)

fill-pg is a tool that populates PostgreSQL with blockchain data from EOSIO's State History Plugin for monitoring applications.

#database #blockchain-indexing #eosio

Last commit: 4 years ago

kinesis-log4j-appender (Java)

A Log4J appender for publishing Java application logs to Amazon Kinesis streams with buffering and retry capabilities.

#aws-sdk-java #log4j-appender #amazon-kinesis

Last commit: 8 years ago

A Go package providing a simple implementation of pipelines using goroutines for concurrent data processing.

#stream-processing #pipeline #golang-library

Last commit: 4 years ago

psql-streamer (Go)

Streams PostgreSQL database events to Kafka using logical replication and can also consume events from Kafka.

#logical-replication #event-driven #replication

Last commit: 6 years ago

vstream (JavaScript)

A Node.js module for instrumenting streams to provide debugging, monitoring, and provenance tracking in data pipelines.

#object-mode-streams #backpressure-visualization #performance-analysis

Last commit: 4 years ago

logstash-output-influxdb (Ruby)

A Logstash output plugin that sends metrics from Logstash pipelines to InfluxDB time-series databases.

#logstash-plugin #jruby #metrics-collection

Last commit: 1 month ago

A blazing-fast, fully-automated tool for loading large CSV files into MySQL or PostgreSQL databases with parallel processing.

#csv-reader #database #elixir

Last commit: 6 days ago

A Python framework for building serverless data pipelines on AWS with Airflow-like elegance.

#step-functions #pydantic #workflow-orchestration

Last commit: 2 years ago

ordered-concurrently (Go)

A Go library for concurrent processing that returns output in the same order as the input via channels.

#parallel-computing #queue-processing #concurrent

Last commit: 3 years ago

A modular framework for ingesting and processing Algorand blockchain data into external applications.

#indexer #plugin-system #real-time-data

Last commit: 29 days ago

Page 3 of 4

Related Tags

#Stream Processing (30)

#Data Integration (18)

#Data Processing (14)

#Distributed Systems (13)

#Data Ingestion (12)