Showing 36 of 258 projects
C# and F# language bindings and extensions for Apache Spark, enabling .NET developers to write Spark driver programs and data processing operations.
A suite of extremely fast and reliable parsers for Java with a consistent interface for multiple file formats.
A curated list of awesome tools, libraries, and resources for working with CSV files.
A C library for reading and writing high-throughput sequencing data formats like SAM, CRAM, and VCF.
A Haskell library for streaming data processing with constant memory usage, deterministic resource handling, and easy composition.
Code samples and examples from AWS Big Data Blog posts for implementing data analytics solutions on AWS.
A high-performance Rust JSON library leveraging SIMD for parsing and serialization.
A curated collection of resources for Go-based data analysis, visualization, machine learning, and data science.
A JavaScript library for normalizing URLs by adding protocols, removing duplicates, sorting parameters, and stripping unnecessary components.
A pure Python HDFS client and Hadoop minicluster wrapper for interacting with Hadoop Distributed File System.
A simple and extremely fast CSV parsing and dumping library for Elixir with customizable parsers.
A high-performance LZ4 and LZ4HC compression library for .NET, offering fast block and stream compression.
A lightweight real-time big data streaming engine built on Akka for high-throughput, low-latency data processing.
A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.
A high-performance, functional tabular data processing library for Clojure, similar to Python's Pandas or R's data.table.
Official connector for integrating Apache Spark with MongoDB, enabling distributed data processing on MongoDB data.
A complete query and transformation language for JSON, inspired by jq, XPath, and XQuery.
A comprehensive learning guide and interview refresher for Apache Spark, covering core concepts, architecture, and performance optimization.
A quick reference guide to the most commonly used patterns and functions in PySpark SQL.
A C++ library and command-line toolkit for parsing, manipulating, and analyzing VCF (Variant Call Format) files in bioinformatics.
A fast, lightweight, RFC 4180-compliant CSV library for Java with zero dependencies and a ~90 KiB footprint.
The fastest delimited file reader for R, using lazy loading and multi-threading to achieve speeds over 1 GB/sec.
An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.
A fast, header-only CSV parser and writer library for modern C++ with memory-mapped file support.
A Python framework for scalable organization and processing of SAR satellite data, integrating SNAP and GAMMA.
A Clojure DSL for Apache Spark that enables distributed data processing using idiomatic Clojure.
A CLI tool for processing JSON and text data with functional pipelines using Ramda, supporting both command-line and interactive browser modes.
A visual node-based programming environment for creating generative audio-visual art in the browser.
A high-performance, disk-backed queue library using memory-mapped files for fast, persistent, and thread-safe data processing.
A Clojure library for writing map-reduce queries that compile to Apache Pig or Cascading, enabling distributed data processing with Clojure syntax.
A fast and flexible C++ library for working with OpenStreetMap data.
A Python library for molecular processing built on RDKit with a simple API and good defaults.
A Go library providing functional-style iterators and consumers to augment the standard library's iter.Seq.
A collection of GIS tools for spatial analysis of big data using Hadoop, integrating with ArcGIS Geoprocessing.
Cross-platform C library for reading and writing .xlsx files with minimal dependencies and a simple interface.