Question 1

How do I create a custom feature extraction function in Crunch?

Accepted Answer

Define a Go function that takes a crunch.DataReader and a crunch.Row, then register it with row.Feature(). The function runs during processing, so it can compute new fields from each record. The Quick Start includes an example built around IP-to-location logic.
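To make the shape of such a function concrete, here is a minimal, self-contained sketch. It does not import Crunch: DataReader, Row, and FeatureFunc below are hypothetical stand-ins for the library's types, and the real signatures and registration call may differ. The lookup is a toy rule, not a real IP-to-location database.

```go
package main

import (
	"fmt"
	"strings"
)

// DataReader and Row are hypothetical stand-ins for crunch.DataReader and
// crunch.Row. The real types and their methods may differ.
type DataReader map[string]string

type Row map[string]string

// FeatureFunc is an assumed shape for a feature function: read from the raw
// record, write computed fields onto the row.
type FeatureFunc func(r DataReader, row Row) error

// ipToRegion is a toy feature: it labels an address as private or public
// based on its prefix. It is not a real geolocation lookup.
func ipToRegion(r DataReader, row Row) error {
	ip, ok := r["ip"]
	if !ok || ip == "" {
		return fmt.Errorf("record has no ip field")
	}
	switch {
	case strings.HasPrefix(ip, "10."), strings.HasPrefix(ip, "192.168."):
		row["region"] = "private"
	default:
		row["region"] = "public"
	}
	return nil
}

func main() {
	var feature FeatureFunc = ipToRegion

	row := Row{}
	if err := feature(DataReader{"ip": "10.1.2.3"}, row); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(row["region"]) // prints "private"
}
```

Keeping the feature a plain function over the record and the row means it can be unit-tested without a cluster before it is wired into a real processor.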

Question 2

Crunch vs Apache Spark for ETL on Hadoop?

Accepted Answer

Crunch is lighter and Go-focused, which suits rapid development of custom transformations that integrate with Hadoop. Spark offers a broader ecosystem, with Scala and Python support and built-in streaming, but may carry more overhead for simple pipelines.

Question 3

Can Crunch handle data from sources other than JSON?

Accepted Answer

Yes, with some effort. Crunch is optimized for JSON through crunch.ProcessJson, but its extensible API allows custom processors for other formats. Documentation for this is limited, so expect to write more Go code yourself.
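If writing a custom processor is more than you need, one workaround is to convert the other format to JSON lines before they reach the processor. The sketch below does this for CSV using only the Go standard library. The csvLineToJSON helper and the column names are made up for illustration and are not part of Crunch.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strings"
)

// csvLineToJSON parses one CSV line, pairs its values with the given column
// names, and returns the record as a JSON object string.
func csvLineToJSON(header []string, line string) (string, error) {
	fields, err := csv.NewReader(strings.NewReader(line)).Read()
	if err != nil {
		return "", err
	}
	if len(fields) != len(header) {
		return "", fmt.Errorf("expected %d fields, got %d", len(header), len(fields))
	}
	record := make(map[string]string, len(header))
	for i, name := range header {
		record[name] = fields[i]
	}
	out, err := json.Marshal(record)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	header := []string{"ip", "path"}
	out, err := csvLineToJSON(header, `10.1.2.3,"/a,b"`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // prints {"ip":"10.1.2.3","path":"/a,b"}
}
```

Every value stays a string here. Any typing or defaults would still be applied later, wherever the row's fields are declared.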

Question 4

How do I deploy a Crunch processor to a Hadoop cluster?

Accepted Answer

Build the Go binary, generate Pig and Hive stubs with -crunch.stubs, then upload the binary and scripts to the cluster. Use hive -f for table creation and pig commands for job execution, as detailed in the README.

Question 5

What are the performance trade-offs of using Crunch?

Accepted Answer

Go is compiled, so per-record processing is fast, but each binary processes its input on a single thread. Parallelism comes from Hadoop streaming, which runs many copies of the binary across the cluster. Custom functions should therefore be optimized so they do not become a bottleneck in large-scale jobs.
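The execution model can be pictured with a plain streaming loop: Hadoop streaming feeds records to the executable on standard input and collects results from standard output, one process per task. The sketch below is a generic Go mapper written for illustration, not Crunch's own processing loop, and transformLine is a hypothetical placeholder for your per-record logic.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// transformLine is a hypothetical per-record transformation. It trims the
// line, lowercases it, and reports false for lines that should be dropped.
func transformLine(line string) (string, bool) {
	line = strings.TrimSpace(line)
	if line == "" {
		return "", false
	}
	return strings.ToLower(line), true
}

func main() {
	in := bufio.NewScanner(os.Stdin)
	// Raise the per-line limit above the 64 KiB default for large records.
	in.Buffer(make([]byte, 0, 64*1024), 1024*1024)

	// Buffer output so each record does not cost a separate write call.
	out := bufio.NewWriter(os.Stdout)

	// Records are handled one at a time on a single goroutine.
	for in.Scan() {
		if rec, ok := transformLine(in.Text()); ok {
			fmt.Fprintln(out, rec)
		}
	}
	out.Flush()

	if err := in.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```

Because every record passes through the same sequential loop, any slow call inside the per-record function (a network lookup, an uncached file read) is paid once per record in every task, which is why the function itself is the place to optimize.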

Question 6

Is Crunch suitable for machine learning feature engineering?

Accepted Answer

Yes. Its custom feature extraction functions are well suited to computing ML features from raw data, especially when combined with Hadoop for scalable batch processing. It does not include built-in ML libraries, though.
