Hadoop

59 projects

Luigi is a Python module for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, and visualization, and it ships with built-in Hadoop support.

#orchestration-framework#python#luigi

Stars 18.7k

Forks 2.5k

Last commit 6 days ago
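
As a rough illustration of what "dependency resolution" means for a batch pipeline, here is a minimal sketch using only Python's standard library (`graphlib`, Python 3.9+). It is not Luigi's API: the task names, the `PIPELINE` mapping, and the `resolve_order` and `run` helpers are hypothetical and exist only for this example. Luigi itself expresses dependencies through task classes and decides whether a task is complete by checking its outputs.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}


def resolve_order(pipeline):
    """Return tasks in an order where every dependency precedes its dependents."""
    return list(TopologicalSorter(pipeline).static_order())


def run(pipeline, done=None):
    """Walk tasks in dependency order, skipping ones already done.

    Returns the names of the tasks that would actually be executed.
    """
    done = set(done or ())
    executed = []
    for task in resolve_order(pipeline):
        if task in done:
            continue  # this task's result already exists, so skip it
        executed.append(task)
        done.add(task)
    return executed


if __name__ == "__main__":
    print(run(PIPELINE))
```

Passing `done={"extract", "clean"}` to `run` leaves only the downstream tasks to execute, which mirrors how a scheduler can resume a partially finished pipeline without repeating completed work.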

APIJSON (Java)

A real-time, no-code ORM that provides APIs and documentation automatically, allowing frontend clients to customize JSON responses.

#crud#no-code#orm

Stars 18.4k

Forks 2.3k

Last commit 2 days ago

Apache Hadoop

Last commit 22 hours ago

Deeplearning4j (Java)

A comprehensive JVM-based deep learning ecosystem for building, training, and deploying models with support for model import and distributed training.

#distributed-training#intellij#spark-integration

A high-performance distributed POSIX file system for cloud-native environments, storing data in object storage and metadata in databases.

#filesystem#data-storage#high-performance

A fast distributed SQL query engine for big data analytics, enabling interactive queries across diverse data sources.

#database#distributed-systems#query-engine

Stars 13.1k

Forks 3.7k

Last commit 13 hours ago

PredictionIO (Scala)

An open source machine learning server for developers and data scientists, supporting event collection, algorithm deployment, and REST API queries.

#event-collection#spark#hbase

Stars 12.5k

Forks 1.9k

Last commit 5 years ago

Alluxio (Java)

A distributed caching platform that bridges computation frameworks and storage systems for large-scale analytics and ML workloads.

#data-orchestration#spark#memory-speed

Stars 7.2k

Forks 2.9k

Last commit 1 year ago

Azkaban (Java)

Azkaban is a batch workflow job scheduler created at LinkedIn to manage Hadoop jobs.

#hacktoberfest#gradle#batch-processing

Stars 4.5k

Forks 1.6k

Last commit 2 years ago

Scalding (Scala)

A Scala API for Cascading that simplifies writing Hadoop MapReduce jobs.

#cascading#mapreduce#functional-programming

Stars 3.5k

Forks 699

Last commit 3 years ago

Apache Kyuubi (Scala)

A distributed, multi-tenant gateway providing serverless SQL on data warehouses and lakehouses.

#hiveserver2-alternative#hacktoberfest#spark

Stars 2.4k

Forks 1.0k

Last commit 19 hours ago

Elasticsearch Hadoop (Java)

Native integration library for using Elasticsearch with Hadoop, Spark, and Hive for real-time search and analytics on big data.

#apache-spark#mapreduce#data-integration

Stars 2.0k

Forks 1.0k

Last commit 21 hours ago

Gaffer (Java)

A graph database framework for storing and querying large-scale graphs with rich properties and in-database aggregation.

#apache-spark#parquet#entity-relation

Stars 1.8k

Forks 363

Last commit 1 year ago

Genie (Java)

A federated Big Data orchestration service that simplifies job execution across distributed clusters by abstracting infrastructure complexity.

#data-orchestration#spark#netflixoss

Stars 1.8k

Forks 375

Last commit 10 days ago

HiBench (Java)

A comprehensive benchmark suite for evaluating speed, throughput, and resource utilization of big data frameworks like Hadoop, Spark, and streaming engines.

#apache-spark#performance-testing#distributed-systems

Stars 1.5k

Forks 766

Last commit 7 months ago

hdfs - A native Go client for HDFS (Go)

A native Go client library and command-line tool for HDFS that connects directly to the namenode via protocol buffers.

#distributed-storage#command-line-tool#protocol-buffers

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources.

#awesome-list#data-engineering#big-data

Stars 1.1k

Forks 254

Last commit 2 years ago

camus (Java)

LinkedIn's previous-generation Kafka-to-HDFS pipeline for batch data ingestion.

#batch-processing#linkedin#kafka

Stars 881

Forks 451

Last commit 5 years ago

Snakebite (Python)

A pure Python HDFS client and Hadoop minicluster wrapper for interacting with the Hadoop Distributed File System (HDFS).

#python-hdfs-client#python-library#distributed-storage

Stars 857

Forks 213

Last commit 4 years ago

RHadoop

A collection of R packages for interacting with Hadoop ecosystems, enabling big data analysis from R.

#mapreduce#data-science#hbase

Stars 760

Forks 275

Last commit 10 years ago

docker-spark (Shell)

A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.

#apache-spark#containerization#cluster-computing

Stars 757

Forks 277

Last commit 5 years ago

SparkR (R)

An R package providing a lightweight frontend to use Apache Spark for distributed data processing from R.

#apache-spark#r-package#data-science

An open-source security analytics platform that integrates big data technologies for centralized security monitoring, threat detection, and investigation.

#security-analytics#real-time-processing#behavioral-analytics

Stars 584

Forks 187

Last commit 6 years ago