Open-Awesome

Crunch

Go

A Go-based toolkit for fast ETL and feature extraction on Hadoop, optimized for rapid development and execution.

212 stars · 16 forks · 0 contributors

What is Crunch?

Crunch is a Go-based toolkit for building ETL (Extract, Transform, Load) and feature extraction pipelines on Hadoop. It allows developers to define data transformations and feature computations using a simple API, then generates Hadoop-compatible scripts and binaries for processing semi-structured data like JSON logs at scale.
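The core idea of turning semi-structured JSON log lines into flat rows can be sketched in plain Go. This is a minimal illustration using only the standard library, not Crunch's actual API: the helper names `lookup` and `toRow`, the dotted-path convention, and the sample log line are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// lookup walks a dotted path such as "head.user-agent" through decoded JSON.
// It returns "" when a segment is missing, is not an object, or is null.
// (Hypothetical helper for illustration; not part of Crunch.)
func lookup(doc map[string]interface{}, path string) string {
	var cur interface{} = doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return ""
		}
		cur, ok = obj[key]
		if !ok {
			return ""
		}
	}
	switch v := cur.(type) {
	case nil:
		return ""
	case string:
		return v
	case float64:
		// JSON numbers decode to float64; format without an exponent.
		return strconv.FormatFloat(v, 'f', -1, 64)
	default:
		// Booleans, arrays, and objects fall back to Go's default formatting.
		return fmt.Sprint(v)
	}
}

// toRow flattens one JSON log line into tab-separated columns, one per path.
func toRow(line string, paths []string) (string, error) {
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(line), &doc); err != nil {
		return "", err
	}
	cols := make([]string, len(paths))
	for i, p := range paths {
		cols[i] = lookup(doc, p)
	}
	return strings.Join(cols, "\t"), nil
}

func main() {
	// Sample log line invented for this sketch.
	line := `{"ts": 1400000000, "head": {"user-agent": "curl/7.30"}, "action": "checkout"}`
	row, err := toRow(line, []string{"ts", "head.user-agent", "action", "missing.field"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", row)
}
```

Missing fields become empty columns rather than errors, which keeps every output row the same width for downstream Pig or Hive jobs.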

Target Audience

Data engineers and developers working with Hadoop ecosystems who need to build or optimize ETL pipelines for processing large volumes of semi-structured data efficiently.

Value Proposition

Developers choose Crunch for its rapid development cycle, seamless Hadoop integration, and the ability to embed custom Go code directly into data workflows, reducing the complexity typically associated with big data processing.

Overview

A fast-to-develop, fast-to-run, Go-based toolkit for ETL and feature extraction on Hadoop.

Use Cases

Best For

  • Processing semi-structured JSON logs in Hadoop environments
  • Building custom feature extraction pipelines for machine learning data preparation
  • Rapid prototyping of ETL jobs with minimal boilerplate
  • Integrating Go-based business logic into Hadoop data workflows
  • Generating Pig and Hive scripts automatically from data transformation definitions
  • Deploying standalone data processors as single binaries to Hadoop clusters

Not Ideal For

  • Teams using cloud-native data platforms like AWS Glue or Databricks without Pig/Hive dependencies
  • Projects requiring real-time data streaming, as Crunch is optimized for batch processing of static logs
  • Organizations where data engineers primarily work with Python or Java, not Go

Pros & Cons

Pros

Fast Development Cycle

The concise API and minimal boilerplate allow quick iteration on data transformations, as the row field definitions in the README's Quick Start example demonstrate.

Automated Hadoop Integration

Generates Pig scripts and Hive DDL automatically via the -crunch.stubs flag, reducing manual scripting errors and deployment time.

Embedded Go Logic

Enables custom feature extraction with Go functions, making it easy to incorporate complex business logic directly into pipelines, as shown with IP-to-location in the README.
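A feature function of this kind might look like the following sketch, which maps an IP address to a location label through a small in-memory CIDR table. The `Feature` signature, the `newLocationFeature` constructor, and the labels are assumptions for illustration and are not taken from Crunch; the address ranges are reserved documentation ranges.

```go
package main

import (
	"fmt"
	"net"
)

// Feature is a hypothetical signature: derive one output value from a record.
type Feature func(record map[string]string) string

// cidrEntry pairs a parsed network with the label it maps to.
type cidrEntry struct {
	network *net.IPNet
	label   string
}

// newLocationFeature builds a Feature that looks up record[ipField] in a
// CIDR-to-label table. The ranges are assumed not to overlap, because map
// iteration order is unspecified.
func newLocationFeature(ipField string, table map[string]string) (Feature, error) {
	entries := make([]cidrEntry, 0, len(table))
	for cidr, label := range table {
		_, network, err := net.ParseCIDR(cidr)
		if err != nil {
			return nil, err
		}
		entries = append(entries, cidrEntry{network, label})
	}
	return func(record map[string]string) string {
		ip := net.ParseIP(record[ipField])
		if ip == nil {
			return "unknown" // missing or malformed address
		}
		for _, e := range entries {
			if e.network.Contains(ip) {
				return e.label
			}
		}
		return "unknown"
	}, nil
}

func main() {
	// Documentation address ranges with made-up labels.
	location, err := newLocationFeature("ip", map[string]string{
		"192.0.2.0/24":    "zone-a",
		"198.51.100.0/24": "zone-b",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(location(map[string]string{"ip": "198.51.100.7"}))
}
```

Parsing the table once and returning a closure keeps the per-record work to a lookup, which matters when the function runs over every line of a large log.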

Single Binary Deployment

Compiles into a standalone executable, simplifying distribution and execution across Hadoop clusters without dependency management.

Cons

Go-Only Ecosystem

Restricts usage to teams comfortable with Go, excluding popular data science languages like Python, which are more common in data engineering.

Batch Processing Limitation

Lacks support for streaming data, focusing solely on batch ETL jobs for static files like JSON logs, limiting use in real-time scenarios.

Documentation Incompleteness

The README notes that the 'Extending Crunch' section is a work in progress, indicating potential gaps in advanced usage guides and customization.

Quick Stats

Stars: 212
Forks: 16
Contributors: 0
Open Issues: 1
Last commit: 11 years ago
Created: 2014

Tags

#hive #feature-extraction #big-data #data-processing #data-pipeline #json-parsing #hadoop #etl #go

Built With

Go

Included in

Hadoop · 1.1k

Related Projects

Elasticsearch Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop

Stars: 1,975
Forks: 995
Last commit: 2 days ago

Genie

Distributed Big Data Orchestration Service

Stars: 1,763
Forks: 372
Last commit: 4 months ago

hdfs-du

Visualize your HDFS cluster usage

Stars: 228
Forks: 82
Last commit: 5 years ago

White Elephant

Hadoop log aggregator and dashboard

Stars: 190
Forks: 61
Last commit: 12 years ago