Data Processing

#parser-framework#high-performance#data-integration

uniVocity-parsersJava

A suite of extremely fast and reliable parsers for Java with a consistent interface for multiple file formats.

Stars935

Forks250

Last commit1 year ago

ConduitHaskell

A Haskell library for streaming data processing with constant memory usage, deterministic resource handling, and easy composition.

#haskell#functional-programming#conduit

Stars915

Forks201

Last commit1 year ago

sonic-rsRust

A high-performance Rust JSON library leveraging SIMD for parsing and serialization.

#serde#high-performance#simd

Stars903

Forks62

#code-samples#aws-services#data-engineering

aws-big-data-blogJava

Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.

Stars893

Forks613

#data-science#developer-resources#tooling

gopherdata

A curated collection of resources for Go-based data analysis, visualization, machine learning, and data science.

Stars889

Forks83

Last commit2 years ago

normalize-urlJavaScript

A JavaScript library for normalizing URLs by adding protocols, removing duplicates, sorting parameters, and stripping unnecessary components.

#deduplication#npm-package#sanitize-url

Stars878

Forks122

#python-hdfs-client#python-library#distributed-storage

SnakebitePython

A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.

Stars857

Forks213

#elixir#csv-parsing#binary-patterns

nimble_csvElixir

A simple and extremely fast CSV parsing and dumping library for Elixir with customizable parsers.

Stars818

Forks57

Last commit8 months ago

lz4netC#

A high-performance LZ4 and LZ4HC compression library for .NET, offering fast block and stream compression.

#lz4#stream-compression#high-speed

Stars808

Forks90

#apache-spark#containerization#cluster-computing

docker-sparkShell

A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.

Stars757

Forks277

Last commit5 years ago

GearpumpScala

A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.

#stream-processing#akka#cluster-computing

Stars756

Forks150

Last commit2 days ago

tech.ml.datasetClojure

A high-performance, functional tabular data processing library for Clojure, similar to Python's Pandas or R's data.table.

#etl-pipeline#functional-programming#high-performance

Stars751

Forks33

#apache-spark#connector#spark

Mongo-SparkJava

Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.

Stars730

Forks319

#jq-alternative#java-library#json-query

JSLTJava

A complete query and transformation language for JSON, inspired by jq, XPath, and XQuery.

A quick reference guide to the most commonly used patterns and functions in PySpark SQL.

#apache-spark#reference-guide#data-science

Stars694

Forks211

Last commit3 years ago

Carefully Curated 70 Spark Questions with Additional Optimization Guides (First in the series)

A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.

#apache-spark#spark#performance-optimization

Stars690

Forks80

#csv-reader#java-library#open-source

FastCSVJava

A fast, lightweight, and RFC 4180 compliant CSV library for Java with zero dependencies and a ~90 KiB footprint.

Stars686

Forks105

Last commit6 days ago

VcflibC++

A C++ library and command-line toolkit for parsing, manipulating, and analyzing VCF (Variant Call Format) files in bioinformatics.

#structural-variants#vcf-manipulation#command-line-tools

Stars682

Forks222

#delimited-files#csv-reader#high-performance

vroomC++

The fastest delimited file reader for R, using lazy loading and multi-threading to achieve speeds over 1 GB/sec.

Stars642

Forks73

SparkR R

An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.

#apache-spark#r-package#data-science

A fast, header-only CSV parser and writer library for modern C++ with memory-mapped file support.

#csv-reader#single-threaded#high-performance

Stars624

Forks107

Last commit2 years ago

pyroSARPython

A Python framework for scalable organization and processing of SAR satellite data, integrating SNAP and GAMMA.

#synthetic-aperture-radar#satellite-data#geospatial

Stars608

Forks122

Last commit6 days ago

flamboClojure

A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.

#rdd#apache-spark#mapreduce

Stars600

Forks83

Last commit8 years ago

ramda-cliLiveScript

A CLI tool for processing JSON and text data with functional pipelines using Ramda, supporting both command-line and interactive browser modes.

#stream-processing#functional-programming#pipeline

Stars583

Forks12

Last commit3 years ago

eternalJavaScript

A visual node-based programming environment for creating generative audio-visual art in the browser.

#audio-synthesis#music#procedural-music

Stars580

Forks35

Last commit11 months ago

Big QueueJava

A high-performance, disk-backed queue library using memory-mapped files for fast, persistent, and thread-safe data processing.

#java-library#message-queue#persistent-queue

Stars567

Forks217

#cascading#clojure#big-data

PigPenClojure

A Clojure library for writing map-reduce queries that compile to Apache Pig or Cascading, enabling distributed data processing with Clojure syntax.

Stars564

Forks51

Last commit3 years ago

libosmiumC++

A fast and flexible C++ library for working with OpenStreetMap data.

#geospatial#c-plus-plus-14#gis

Stars549

Forks133

#scientific-computing#cheminformatics#python-library

datamolPython

A Python library for molecular processing built on RDKit with a simple API and good defaults.

Stars541

Forks63

#functional-programming#go-library#golang

go-functionalGo

A Go library providing functional-style iterators and consumers to augment the standard library's iter.Seq.

Stars538

Forks26

Last commit23 days ago

gis-tools-for-hadoop

A collection of GIS tools for spatial analysis of big data using Hadoop, integrating with ArcGIS Geoprocessing.

#arcgis#geospatial#apache-hive

Stars524

Forks251