A Docker image for Apache Spark on YARN, built on Hadoop and CentOS for easy deployment.
SequenceIQ's docker-spark is a Docker image that packages Apache Spark with Hadoop and YARN for simplified deployment. It provides a pre-configured environment to run Spark applications in containerized setups, eliminating manual installation and configuration hassles. This is particularly useful for developers and data engineers who need a reproducible Spark stack for testing or development.
Data engineers, developers, and DevOps professionals who want to quickly deploy and test Apache Spark in Docker containers, especially for YARN-based cluster environments.
It offers a ready-to-use, Dockerized Spark setup that reduces deployment time and ensures consistency across environments, building on a stable Hadoop base for reliable big data processing.
This project provides a Dockerized version of Apache Spark, pre-configured to run on YARN with Hadoop 2.6.0. It simplifies setting up a Spark environment by packaging everything into a container, making it ideal for development, testing, and reproducible deployments.
The project focuses on providing a streamlined, containerized Spark environment that leverages Docker for consistency and ease of use, building on existing Hadoop Docker images to ensure compatibility.
Packages Apache Spark 1.6.0 with Hadoop 2.6.0 on YARN, ready to run in yarn-client or yarn-cluster mode out of the box.
Publishes prebuilt Docker images that can be pulled and run with the commands in the README, reducing setup time and keeping development and testing environments consistent.
Supports submitting Spark jobs from outside the container using YARN_CONF_DIR and HADOOP_USER_NAME environment variables, enhancing integration flexibility for external workflows.
Provides sample commands, such as estimating Pi in both YARN modes, so users can quickly verify the setup and see a Spark job run end to end.
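As a rough illustration of the Pi example in both YARN modes, the sketch below builds the `spark-submit` argument list in Python without running it. The helper name `spark_pi_command`, the `/usr/local/spark` install prefix, the examples jar path, and the memory and core settings are assumptions for illustration, not values confirmed by this listing. Check them against the image before use.

```python
import shlex

# Assumption: Spark install prefix inside the container; verify in the image.
DEFAULT_SPARK_HOME = "/usr/local/spark"
# Assumption: examples jar name for a Spark 1.6.0 / Hadoop 2.6.0 build.
EXAMPLES_JAR = "lib/spark-examples-1.6.0-hadoop2.6.0.jar"
VALID_MODES = ("yarn-client", "yarn-cluster")


def spark_pi_command(mode, spark_home=DEFAULT_SPARK_HOME, partitions=None):
    """Build the spark-submit argument list for the bundled SparkPi example.

    mode       -- "yarn-client" or "yarn-cluster" (Spark 1.x master names)
    partitions -- optional number of slices passed to SparkPi
    """
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s" % (VALID_MODES,))
    argv = [
        "spark-submit",
        "--class", "org.apache.spark.examples.SparkPi",
        "--master", mode,
        # Resource settings below are illustrative, not tuned values.
        "--driver-memory", "1g",
        "--executor-memory", "1g",
        "--executor-cores", "1",
        "%s/%s" % (spark_home.rstrip("/"), EXAMPLES_JAR),
    ]
    if partitions is not None:
        argv.append(str(partitions))
    return argv


if __name__ == "__main__":
    # Print one shell-ready command line per YARN mode.
    for mode in VALID_MODES:
        print(" ".join(shlex.quote(arg) for arg in spark_pi_command(mode)))
```

In yarn-client mode the driver runs in the submitting process, so the Pi estimate appears on the console. In yarn-cluster mode the driver runs inside YARN, so the result ends up in the application logs.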
Ships Apache Spark 1.6.0 (released in early 2016) and Hadoop 2.6.0 (released in late 2014), which lack the features, performance improvements, and security patches of newer releases.
Preconfigured only for YARN. Running on Mesos or in Spark's standalone mode would require reconfiguring the image, and Spark 1.6.0 predates Spark's native Kubernetes support, which limits deployment flexibility.
Requires additional configuration such as setting HADOOP_USER_NAME=root for HDFS access from non-root users outside the container, adding complexity for integration scenarios.
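The external-submission setup described above relies on two environment variables, `YARN_CONF_DIR` and `HADOOP_USER_NAME`. A minimal sketch of preparing them in Python follows. The helper name `external_submit_env` and the configuration directory path are hypothetical, and the directory is assumed to already hold client-side YARN configuration pointing at the container.

```python
import os


def external_submit_env(yarn_conf_dir, hadoop_user="root", base_env=None):
    """Return a copy of the environment prepared for an external spark-submit.

    yarn_conf_dir -- directory with the client-side YARN/Hadoop config files
    hadoop_user   -- user name Hadoop should act as (the listing suggests root)
    base_env      -- environment to extend; defaults to the current process's
    """
    env = dict(os.environ if base_env is None else base_env)
    env["YARN_CONF_DIR"] = yarn_conf_dir
    env["HADOOP_USER_NAME"] = hadoop_user
    return env


if __name__ == "__main__":
    # Hypothetical path; replace with the real client configuration directory.
    env = external_submit_env("/path/to/yarn-client-conf")
    for key in ("YARN_CONF_DIR", "HADOOP_USER_NAME"):
        print("%s=%s" % (key, env[key]))
    # The dictionary can then be passed to a child process, for example:
    #   subprocess.run(["spark-submit", ...], env=env, check=True)
```

Returning a copy rather than mutating `os.environ` keeps the override scoped to the submitted job, so other Hadoop tools in the same session keep their own user identity.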