Hadoop

59 projects

White Elephant (Java)

A Hadoop log aggregator and dashboard for visualizing cluster utilization across users.

#jruby #dashboard #log-aggregation

Stars: 190

Forks: 61

Last commit: 12 years ago

Big Data For Chimps (Ruby)

A practical guide to exploratory data analytics using Hadoop with Pig and Ruby for terabyte-scale data processing.

#exploratory-analysis #data-science #terabyte-processing

Stars: 169

Forks: 62

Last commit

Archives Unleashed Toolkit (Scala)

An open-source toolkit for analyzing web archives at scale using Apache Spark.

#apache-spark #web-archives #cultural-heritage

Stars: 158

Forks: 33

Last commit: 7 months ago

binarypig (JavaScript)

A scalable malware processing and analytics platform built on Hadoop Pig for binary data extraction and analysis.

#security-analytics #malware-analysis #binary-analysis

Stars: 144

Forks: 42

Last commit: 12 years ago

avatica (Go)

A Go database/sql driver for Apache Avatica server, enabling Go applications to connect to Phoenix and other Avatica-backed databases.

#database-driver #geospatial #hbase

Stars: 126

Forks: 35

Last commit: 1 month ago

Apache DataFu (Java)

A collection of libraries for large-scale data processing in Hadoop ecosystems, including Spark, Pig, and incremental MapReduce.

#apache-spark #mapreduce #user-defined-functions

Stars: 124

Forks: 66

Last commit: 6 days ago

RHive (R)

An R extension for distributed computing using Apache Hive, enabling HQL queries in R and R functions in Hive.

#cluster-computing #apache-hive #rserve

Stars: 122

Forks: 62

Last commit: 9 years ago

ganitha (Scala)

A Scalding library for machine learning and statistical analysis, featuring Mahout vector integration, K-Means clustering, and Naive Bayes classifiers.

#statistical-analysis #classification #scalding

Stars: 109

Forks: 12

Last commit: 11 years ago

Docker for beginners (Jupyter Notebook)

A collection of interactive Jupyter notebooks for learning Hadoop, Spark, and MapReduce with hands-on tutorials and demos.

#google-colab #mapreduce-bash #apache-spark

Stars: 87

Forks: 27

Last commit: 2 months ago

HBase ORM (Java)

A production-grade HBase ORM library for clean, fast, and fun object-oriented data access, also compatible with Google Cloud Bigtable.

#hbase-orm #object-mapping #mapreduce

Stars: 83

Forks: 41

Last commit: 3 years ago

akela (Java)

Mozilla's utility library for Hadoop, HBase, Pig, and related big data technologies.

#mapreduce #hbase #java

Stars: 77

Forks: 31

Last commit: 12 years ago

Beetest (Java)

A simple utility for testing Apache Hive scripts locally without requiring Java development skills.

#unit-testing #apache-hive #data-engineering

Stars: 73

Forks: 23

Last commit: 9 years ago

Hive_test (Java)

A unit test framework for Hive scripts that provides an embedded Hive environment with Derby database and HiveThriftService.

#unit-testing #apache-hive #java

Stars: 64

Forks: 47

Last commit: 4 years ago

emr-sample-apps (Java)

Code samples demonstrating how to use popular applications on Amazon Elastic MapReduce (EMR).

#mapreduce #educational #code-samples

Stars: 63

Forks: 51

Last commit: 11 years ago

webarchive-indexing (Python)

MapReduce tools for bulk indexing of web archive WARC/ARC files into ZipNum sharded CDX clusters on Hadoop, EMR, or local systems.

#mapreduce #mrjob #zipnum-cluster

Stars: 46

Forks: 12

Last commit: 8 years ago

Impala (C++)

A massively-parallel C++ SQL query engine for lightning-fast analytics on petabytes of data in Hadoop clusters.

#real-time-queries #sql-query-engine #c-plus-plus

Stars: 34

Forks: 32

Last commit: 3 years ago

hdfs-rs (Rust)

Rust bindings and safe wrapper APIs for Hadoop's libhdfs, enabling HDFS access from Rust applications.

#distributed-storage #ffi #rust-bindings

Stars: 31

Forks: 10

Last commit: 10 years ago

Sparkling (Scala)

A Scala/Spark library for efficient processing, extraction, and derivation of web archive data (CDX/WARC).

#apache-spark #jupyter-integration #cdx

Stars: 17

Forks: 2

Last commit: 2 months ago

SnackFS (Scala)

A lightweight, HDFS-compatible file system built over Cassandra with a fat driver design for easy deployment.

#distributed-filesystem #hdfs-compatible #storage

Stars: 13

Forks: 5

Last commit: 11 years ago

Cascading (Java)

Provides HBase adapters for reading and writing data within Cascading data processing workflows on Hadoop clusters.

#cascading #batch-processing #lingual

Stars: 10

Forks: 11

Last commit: 8 years ago

HadoopConcatGz (Java)

A splittable Hadoop InputFormat for efficiently processing concatenated GZIP files and web archive (*.warc.gz) data in distributed systems.

#apache-spark #distributed-processing #java-library

Stars: 9

Forks: 2

Last commit: 8 years ago

WarcPartitioner (Java)

A Hadoop/MapReduce tool that splits and partitions web archive records in (W)ARC files by MIME type and year.

#mapreduce #data-partitioning #warc

Stars: 1

Forks: 1

Last commit: 9 years ago
