ETL

106 projects

A C++20 library for fast serialization, deserialization, and validation using reflection, supporting JSON, Avro, CSV, Parquet, and more.

#validation #c++20 #data-formats

Stars: 1.9k

Forks: 185

Last commit: 15 hours ago

Embulk (Java)

A parallel bulk data loader that transfers data between various storages, databases, NoSQL, and cloud services via plugins.

#gradle #jruby #data-transfer

Stars: 1.8k

Forks: 204

Last commit: 1 month ago

Kiba (Ruby)

A Ruby framework for writing reliable, concise, and maintainable ETL (Extract-Transform-Load) data processing jobs.

#rubydatascience #etl-ruby #ruby-gem

Stars: 1.8k

Forks: 90

Last commit: 6 months ago

Multiwoven (Ruby)

An open-source Reverse ETL platform for syncing data from warehouses to business tools like Salesforce, HubSpot, and Slack.

#open-source #reverse-etl #data-integration

Stars: 1.7k

Forks: 92

Last commit: 9 hours ago

csvq (Go)

A command-line tool that provides an SQL-like query language for reading, updating, and deleting CSV records.

#spreadsheet-alternative #csv-processing #command-line-tool

Stars: 1.6k

Forks: 68

Last commit: 2 years ago

Optimus (Python)

A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.

#data-cleaning #cudf #spark

Stars: 1.5k

Forks: 232

Last commit: 1 year ago

Apache InLong (Java)

A one-stop, full-scenario integration framework for massive data, supporting data ingestion, synchronization, and subscription.

#massive-data-integration #stream-processing #batch-processing

Stars: 1.5k

Forks: 571

Last commit: 2 days ago

pg_timetable (Go)

A standalone, database-driven job scheduler for PostgreSQL with advanced features like task chains, YAML configuration, and built-in operations.

#task-automation #database #cron-alternative

Stars: 1.4k

Forks: 73

Last commit: 17 hours ago

amphi-etl (TypeScript)

A visual, low-code data preparation tool that generates Python code for ETL, reporting, and AI-assisted workflows.

#jupyterlab-extension #analytics-automation #datatransformation

A command-line tool to import CSV and JSON files into PostgreSQL with automatic table generation.

#command-line-tool #database-migration #postgresql

Stars: 1.3k

Forks: 127

Last commit: 5 years ago

Dex (JavaScript)

A Java/Groovy/JavaFX data visualization tool for ETL, machine learning, and publishing web visualizations.

#desktop-application #datavis #dataviz

Stars: 1.3k

Forks: 307

Last commit: 7 years ago

Dataflow Templates (Java)

A collection of pre-built Google Cloud Dataflow templates for common data import/export, backup, and bulk API operations.

#stream-processing #batch-processing #data-integration

A logical replication extension for PostgreSQL that enables high-performance, cross-version data replication and upgrades.

#logical-replication #cross-version-upgrades #data-synchronization

Stars: 1.2k

Forks: 177

Last commit: 11 days ago

omniparser (Go)

A native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, and custom formats.

#schema-driven #fixed-length #x12

Stars: 1.1k

Forks: 81

Last commit: 1 year ago

s3renity (JavaScript)

Run lambda functions over S3 objects with concurrency control for data pipelining and analytics.

#cloud-storage #s3 #nodejs

Stars: 1.1k

Forks: 47

Last commit: 9 years ago

sparklyr (R)

An R interface for Apache Spark that enables distributed data processing, machine learning, and SQL queries using familiar R syntax.

#apache-spark #distributed #dplyr

Stars: 971

Forks: 308

Last commit: 22 days ago

camus (Java)

LinkedIn's previous generation Kafka to HDFS pipeline for batch data ingestion.

#batch-processing #linkedin #kafka

Stars: 881

Forks: 451

Last commit: 5 years ago

Cinchoo ETL (C#)

A simple, fast, and flexible ETL framework for .NET with built-in readers and writers for CSV, JSON, XML, Parquet, and more.

#parquet #cinchoo-etl #flat

Stars: 859

Forks: 141

Last commit: 1 month ago

Wexflow (C#)

A cross-platform workflow automation engine for developers and sysadmins to automate file operations, system tasks, and scheduled jobs.

#wexflow #devops #task-scheduler

Stars: 838

Forks: 190

Last commit: 6 days ago

Mongo-Spark (Java)

Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.

#apache-spark #connector #spark

Stars: 730

Forks: 320

Last commit: 3 days ago

koop (JavaScript)

A JavaScript toolkit for translating, querying, and integrating geospatial data from any API into multiple formats.

#spatial-api #api #arcgis

Stars: 709

Forks: 135

Last commit: 3 months ago

easy-batch (Java)

A simple, lightweight batch processing framework for Java designed for ETL jobs.

#batch-processing #batch #file-processing

Stars: 622

Forks: 197

Last commit: 3 years ago

aws-lambda-redshift-loader (JavaScript)

An AWS Lambda function that automatically loads files from S3 into Amazon Redshift clusters with zero server administration.

#aws-cloudformation #batch-processing #serverless

A Rust-based data transfer suite for ultra-fast replication between MySQL, PostgreSQL, Redis, MongoDB, Kafka, and ClickHouse.

#redis #data-transfer #disaster-recovery

Stars: 586

Forks: 96

Last commit: 19 hours ago

SmartCode (C#)

A .NET Core code generation and ETL tool that builds projects from data sources using configurable templates and tasks.

#database-first #code-generator #smartcode

Stars: 578

Forks: 169

Last commit: 2 years ago

PigPen (Clojure)

A Clojure library for writing map-reduce queries that compile to Apache Pig or Cascading, enabling distributed data processing with Clojure syntax.

#cascading #clojure #big-data

Stars: 564

Forks: 52

Last commit: 3 years ago

Integration (Markdown)

A curated list of awesome system integration software, patterns, and resources.

#api-gateway #bpm #workflow

Stars: 546

Forks: 89

Last commit: 3 days ago

Spark XML (Scala)

A library for parsing and querying XML data with Apache Spark SQL and DataFrames.

#apache-spark #dataframe #xml-parser

Stars: 513

Forks: 223

Last commit: 1 year ago

Kotlin for Apache Spark (Kotlin)

Kotlin bindings and extensions for Apache Spark, enabling idiomatic Kotlin development with data classes, lambdas, and null safety.

#apache-spark #spark #nullability

Stars: 481

Forks: 37

Last commit: 1 month ago

sparklling (Clojure)

A fast, fully-featured, and developer-friendly Clojure API for Apache Spark.

#apache-spark #functional-programming #data-engineering

Stars: 447

Forks: 68

Last commit: 4 years ago

Smooks (Java)

An extensible Java framework for building event-driven applications that break up XML and non-XML data into chunks for data integration.

#stream-processing #event-driven #pipelines

Stars: 417

Forks: 356

Last commit: 8 months ago

QuackOSM (Python)

An open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB and exporting to GeoParquet.

#osm-pbf #geospatial #python-cli

A unified platform for big data stream and batch processing on Hadoop YARN with enterprise-grade operability.

#stream-processing #batch-processing #real-time-analytics

Stars: 350

Forks: 170

Last commit: 5 years ago

mongo_fdw (C)

A PostgreSQL foreign data wrapper that enables querying and manipulating MongoDB data directly from PostgreSQL.

#database-integration #nosql #foreign-data-wrapper

Stars: 342

Forks: 77

Last commit: 21 hours ago

amazon-kinesis-connectors (Java)

A Java library for building data pipelines that connect Amazon Kinesis streams to AWS and non-AWS services like DynamoDB, Redshift, S3, and Elasticsearch.

#stream-processing #java-library #batch-processing

Stars: 327

Forks: 186

Last commit

neo4j-spark-connector (Scala)

A bi-directional connector enabling Apache Spark to read from and write to Neo4j graph databases using Spark DataSource APIs.

#hacktoberfest #apache-spark #neo4j-driver

Stars: 322

Forks: 119

Last commit: 19 hours ago
