Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Data Processing

258 projects

Showing 36 of 258 projects

lua-zlib (C)

A Lua library providing a functional, streaming interface to zlib for compression and decompression.

#gzip #inflate #deflate
Stars: 286
Forks: 110
Last commit: 5 months ago
bitio (Go)

Optimized bit-level Reader and Writer for Go, enabling efficient reading and writing of arbitrary bit lengths.

#bitstream #writer #bit
Stars: 257
Forks: 23
Last commit: 3 years ago
NewsQA (Python)

Tools for compiling and using the Maluuba NewsQA dataset, a machine reading comprehension dataset based on CNN articles.

#question-answering #python #reading-comprehension
Stars: 257
Forks: 56
Last commit: 3 years ago
Lazy JSON (PHP)

Framework-agnostic PHP package to load JSON of any size into Laravel lazy collections with minimal memory usage.

#laravel #memory-efficiency #dot-notation
Stars: 254
Forks: 4
Last commit: 2 years ago
spark-connect-go (Go)

An experimental Go client for Apache Spark Connect, enabling Go applications to interact with Spark clusters via gRPC.

#spark-connect #apache-spark #protocol-buffers
Stars: 253
Forks: 50
Last commit: 24 days ago
Joblib Apache Spark Backend (Python)

A joblib backend that enables Python parallel computing tasks to run on Apache Spark clusters.

#apache-spark #parallel-computing #joblib
Stars: 250
Forks: 24
Last commit: 2 months ago
geojson-merge (JavaScript)

A Node.js utility to merge multiple GeoJSON files into a single FeatureCollection, supporting both in-memory and streaming modes.

#geojson #geospatial #cli-tool
Stars: 245
Forks: 33
Last commit: 1 year ago
ChEMBL_Structure_Pipeline (formerly standardiser) (Python)

Standardizes and processes chemical molecule structures for the ChEMBL database using RDKit.

#molecule-standardization #cheminformatics #python-library
Stars: 242
Forks: 44
Last commit
ruby-spark (Ruby)

A Ruby wrapper for Apache Spark, enabling large-scale data processing with Ruby's expressive syntax.

#rdd #apache-spark #distributed
Stars: 226
Forks: 28
Last commit: 8 years ago
pycascading (Python)

A Python wrapper for Cascading that enables building and controlling Hadoop data processing workflows entirely in Python.

#cascading #mapreduce #workflow-engine
Stars: 221
Forks: 35
Last commit: 6 years ago
hadoop-pcap (Java)

A Hadoop library for reading and processing packet capture (PCAP) files in MapReduce jobs and Hive queries.

#mapreduce #serde #pcap
Stars: 216
Forks: 101
Last commit: 3 years ago
Crunch (Go)

A Go-based toolkit for fast ETL and feature extraction on Hadoop, optimized for rapid development and execution.

#hive #pig #feature-extraction
Stars: 212
Forks: 16
Last commit: 11 years ago
Dataset (Python)

A Python library for building lazy data processing and machine learning workflows that handle datasets larger than memory.

#pipeline-framework #batch-processing #workflow
Stars: 206
Forks: 45
Last commit: 20 days ago
jackson-dataformat-csv (Java)

Jackson extension for reading and writing CSV data as JSON-like data structures.

#library #jackson #java
Stars: 195
Forks: 73
Last commit: 8 years ago
iem (Python)

A monolithic codebase that powers the Iowa Environmental Mesonet's environmental data ingest, processing, and web services.

#iowa-mesonet #environmental-data #meteorology
Stars: 189
Forks: 73
Last commit: 4 days ago
psql2csv (Shell)

A command-line tool that runs PostgreSQL queries and outputs the results directly as CSV.

#psql #postgres #homebrew-formula
Stars: 186
Forks: 21
Last commit: 4 years ago
trans (JavaScript)

A JavaScript library for transforming complex JSON objects using intuitive field path syntax and chained transformations.

#functional-programming #object-manipulation #trans
Stars: 178
Forks: 2
Last commit: 10 years ago
godal (Go)

An idiomatic Go wrapper for the GDAL library, providing efficient raster and vector geospatial data processing.

#raster-data #cgo #geospatial
Stars: 177
Forks: 36
Last commit: 18 days ago
xlsx (Go)

A fast and reliable Go library for reading, writing, and manipulating Microsoft Excel XLSX files.

#office-open-xml #microsoft #spreadsheet
Stars: 177
Forks: 23
Last commit: 5 years ago
CSV Reader (Ruby)

A Ruby gem for reading CSV files with best practices out-of-the-box and zero configuration.

#csvrecord #csvhash #humanitarian-data
Stars: 176
Forks: 7
Last commit: 1 year ago
serde-aux (Rust)

A Rust library providing helper functions for serde serialization and deserialization of containers, struct fields, and other common patterns.

#serde #utility-library #deserialization
Stars: 173
Forks: 29
Last commit: 8 months ago
CSwiftV (Swift)

A CSV parser for Swift that conforms to RFC 4180 standards for reliable CSV file handling.

#macos-development #file-parsing #rfc4180
Stars: 172
Forks: 45
Last commit: 3 years ago
bzip2-rs (C)

Rust bindings for libbz2 providing streaming bzip2 compression and decompression.

#bzip2 #bindings #streaming
Stars: 169
Forks: 68
Last commit: 4 months ago
Big Data For Chimps (Ruby)

A practical guide to exploratory data analytics using Hadoop with Pig and Ruby for terabyte-scale data processing.

#exploratory-analysis #data-science #terabyte-processing
Stars: 169
Forks: 63
Last commit
machine (Go)

A Go library for building data processing workflows and pipelines with functional operations, cycles, and fan-out capabilities.

#pipeline-framework #stream-processing #functional-programming
Stars: 168
Forks: 12
Last commit: 12 days ago
ArchiveSpark (Scala)

An Apache Spark framework for efficient data processing, extraction, and derivation from web archives and archival collections.

#data-lineage #apache-spark #web-archives
Stars: 161
Forks: 19
Last commit: 8 months ago
Spark-BigQuery (Scala)

A Spark library for reading from and writing to Google BigQuery using DataFrames and SQL.

#apache-spark #data-engineering #gcp
Stars: 156
Forks: 50
Last commit: 6 years ago
filter (Go)

A Go library for filtering, sanitizing, and converting data with built-in rules and functions.

#sanitization #filter #golang-package
Stars: 150
Forks: 12
Last commit: 5 days ago
IterTools PHP (PHP)

A PHP library providing Python-inspired iteration tools for efficient data processing with loops and streams.

#stream-processing #generator #functional-programming
Stars: 149
Forks: 12
Last commit: 1 month ago
ExcelProvider (F#)

A .NET type provider for reading Excel files with static type safety and IntelliSense support.

#type-provider #spreadsheet #static-typing
Stars: 148
Forks: 49
Last commit: 7 months ago
json-transforms (JavaScript)

A recursive, pattern-matching framework for transforming JSON data using JSPath queries, inspired by XSLT.

#declarative-programming #jspath #pattern-matching
Stars: 146
Forks: 6
Last commit: 1 year ago
OneBusAway GTFS Modules (Java)

A Java library for reading, writing, and transforming public transit data in the GTFS format.

#database #library #java
Stars: 144
Forks: 110
Last commit: 6 days ago
binarypig (JavaScript)

A scalable malware processing and analytics platform built on Hadoop Pig for binary data extraction and analysis.

#security-analytics #malware-analysis #binary-analysis
Stars: 144
Forks: 42
Last commit: 12 years ago
Pycytominer (Python)

A Python package for processing and normalizing high-dimensional morphological feature data from high-throughput cell imaging experiments.

#image-analysis #biomedical-data #microscopy
Stars: 142
Forks: 40
Last commit: 2 days ago
node-osmium (C++)

JavaScript bindings for libosmium to work with OpenStreetMap data, suitable for small extracts and prototyping.

#geospatial #javascript-bindings #prototyping
Stars: 140
Forks: 30
Last commit: 2 years ago
Apex (Java)

Operator and codec library for building real-time streaming applications on Apache Apex.

#apex #java #operator-library
Stars: 135
Forks: 142
Last commit: 6 years ago
Page 6 of 8

Related Tags

#Big Data (58)
#Python (37)
#Apache Spark (30)
#Csv Parser (28)
#Csv (28)
#Json (28)
#Distributed Computing (27)
#Functional Programming (26)
#Performance (25)
#High Performance (24)
#Machine Learning (23)
#Streaming (22)