Data Processing

#rubydatascience#etl-ruby#ruby-gem

KibaRuby

A Ruby framework for writing reliable, concise, and maintainable ETL (Extract-Transform-Load) data processing jobs.

Stars1.8k

Forks90

Last commit6 months ago

OpenXLSXC++

A C++ library for reading, writing, creating, and modifying Microsoft Excel .xlsx files.

#library#spreadsheet#cpp17

Stars1.8k

Forks390

Last commit1 month ago

jqlRust

A fast, lightweight JSON Query Language CLI tool built with Rust for querying and transforming JSON data.

#json-query#shell-integration#tool

Stars1.7k

Forks32

Last commit4 months ago

DiscoErlang

A distributed map-reduce framework for parallel computations over large datasets on unreliable computer clusters.

#parallel-computing#cluster-computing#fault-tolerance

Stars1.6k

Forks242

Last commit8 years ago

flowElixir

A computational parallel flow library for Elixir built on top of GenStage for parallel processing of collections.

#stream-processing#functional-programming#parallel-computing

Stars1.6k

Forks89

#penetration-testing#security-tools#password-analysis

hashcat-utilsC

A collection of small, chainable command-line utilities for advanced password cracking operations.

Stars1.6k

Forks410

Last commit8 months ago

OptimusPython

A Python library for agile data preparation workflows that works with Pandas, Dask, cuDF, Dask-cuDF, Vaex, and PySpark.

#data-cleaning#cudf#spark

Forks232

#browserify#csv-spectrum#cli-tool

csv-parserJavaScript

A high-performance streaming CSV parser for Node.js that converts CSV to JSON at up to 90,000 rows per second.

#data-cleaning#hacktoberfest#python-library

Forks142

Last commit2 months ago

pyjanitorPython

A Python library providing clean, chainable functions for data cleaning and manipulation with pandas DataFrames.

#stream-processing#api#high-performance

Forks189

Last commit4 days ago

WallarooPony

A fast, resilient distributed stream processing framework that simplifies real-time data applications with high performance and easy scaling.

#apache-spark#performance-testing#distributed-systems

Forks67

Last commit5 years ago

HiBenchJava

A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.

#delimited-files#command-line-tools#data-science

Forks766

Last commit7 months ago

eBay's TSV utilitiesD

A suite of high-performance command line tools for filtering, summarizing, joining, and manipulating large tabular data files.

#csv-reader#csv-writer#high-performance

Forks83

Last commit3 years ago

SepC#

A modern, minimal, and high-performance .NET library for reading and writing CSV/TSV files with zero allocations and SIMD-accelerated parsing.

#hacktoberfest#serde#high-performance

Forks52

Last commit4 days ago

simd-jsonRust

A high-performance Rust JSON parser porting simdjson's SIMD techniques, with Serde compatibility.

Stars1.4k

Forks102

Last commit9 days ago

eo-learnPython

A Python framework for processing spatio-temporal satellite imagery and extracting features for machine learning applications.

#geospatial#eo-data#copernicus

Stars1.2k

Forks305

Last commit6 months ago

pyshpPython

A pure Python library for reading and writing ESRI Shapefiles, the popular GIS vector data format.

#python-library#geojson#geospatial

A Ruby library for reading, writing, and modifying Microsoft Excel-compatible spreadsheet documents (XLS format).

#spreadsheet#xls#ruby-gem

#robotics#icp#scientific-computing

Forks234

Last commit3 months ago

cilantroC++

A lean and fast C++ library for 3D point cloud data processing with efficient implementations of common operations.

Forks204

#awesome-list#data-engineering#big-data

Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.

#gzip#backend-agnostic#deflate

Forks254

Last commit2 years ago

flate2-rsRust

A Rust library providing streaming compression and decompression for DEFLATE, zlib, and gzip formats with multiple backend options.

#csv-reader#tab-separated#high-performance

Forks199

Last commit1 day ago

Vince's CSV ParserC++

A high-performance, fully-featured CSV parser and serializer for modern C++ with streaming, random access, and robust format handling.

A Ruby gem for normalizing, formatting, and splitting E164 international phone numbers.

#formatting#phone-numbers#telephony

#iot#hacktoberfest#database

Forks238

Last commit2 months ago

wordposJavaScript

A collection of extra nodes for Node-RED, extending its capabilities with hardware, I/O, social, storage, and utility functions.

#iot#hacktoberfest#database

Forks612

Last commit2 days ago

badwordsJavaScript

A collection of extra nodes for Node-RED, extending its capabilities with hardware I/O, social APIs, data parsing, and utility functions.

#utf16#library#c-plus-plus-11

Forks612

Last commit2 days ago

rapidcsvC++

A header-only C++11 CSV parser library with easy-to-use API for reading and writing CSV files.

Forks200

Last commit1 month ago

s3renityJavaScript

Run lambda functions over S3 objects with concurrency control for data pipelining and analytics.

#cloud-storage#s3#nodejs

#batch-processing#serverless-patterns#s3

Forks47

Last commit9 years ago

s3-lambdaJavaScript

Run lambda functions over S3 objects with concurrency control for data pipelining and analytics.

#multi-core#parallel-computing#memory-efficiency

Forks47

Last commit9 years ago

ParaTextC++

A C++ library for parallel text file reading with CSV support and Python bindings.

readrR

Forks99

Last commit2 years ago

A fast and friendly R package for reading rectangular data from delimited files like CSV and TSV.

#parsing#fwf#r-package

A fast, idiomatic, and dependency-free Go library for mapping between CSV and Go values.

#unmarshal#fast#marshal

Stars1.0k

Forks69