A library enabling MongoDB to serve as an input source or output destination for Hadoop MapReduce jobs and ecosystem tools.
MongoDB Connector for Hadoop is a library that enables MongoDB to function as an input source or output destination for Hadoop MapReduce jobs and other Hadoop ecosystem tools. It allows data stored in MongoDB, or in its BSON backup files, to be processed with tools such as Spark, Pig, Hive, and Flume. The connector solves the problem of integrating NoSQL document data into big data processing pipelines without extensive up-front data transformation.
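A minimal job-driver sketch shows how the connector plugs into a MapReduce job: the connector's `MongoInputFormat` and `MongoOutputFormat` classes and the `mongo.input.uri` / `mongo.output.uri` / `mongo.input.query` properties come from the mongo-hadoop documentation, while the hosts, database names, and query are placeholders. This is a configuration sketch, not a complete runnable job (it needs the Hadoop and connector JARs plus mapper/reducer classes).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connector properties; URIs and collection names are placeholders.
        conf.set("mongo.input.uri", "mongodb://localhost:27017/db.events");
        conf.set("mongo.output.uri", "mongodb://localhost:27017/db.results");
        // Optional: push a MongoDB query down to the source collection
        // so only matching documents reach the mappers.
        conf.set("mongo.input.query", "{\"year\": {\"$gte\": 2015}}");

        Job job = Job.getInstance(conf, "mongo-hadoop-example");
        job.setJarByClass(MongoJobDriver.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```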
The connector is aimed at data engineers and developers working in Hadoop ecosystems who need to process MongoDB data in distributed computing workflows. It is particularly useful for teams that use MongoDB for operational data and Hadoop for analytical processing.
Developers choose this connector because it provides native integration between MongoDB and Hadoop tools, supports flexible data sourcing from various MongoDB deployments, and enables query-based filtering for efficient data processing. Its ability to work with BSON backup files on cloud storage like S3 adds deployment flexibility.
MongoDB Connector for Hadoop
Supports multiple Hadoop ecosystem tools (MapReduce, Pig, Spark, Hive, and Flume), as listed in the README, so the same MongoDB data can feed several processing platforms.
Can read data directly from MongoDB or from BSON backup files stored on S3, HDFS, or local filesystems, reducing data movement barriers as highlighted in the features.
Allows filtering source data using MongoDB's query language for targeted processing, which can improve efficiency by limiting data transfers.
Can output data in BSON format for easy import back into MongoDB using mongorestore, facilitating data pipeline round-tripping.
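The BSON output feature above enables a round trip back into MongoDB: a sketch of the workflow, with all paths, hosts, and database/collection names as placeholders. It assumes the job wrote BSON part files to HDFS and that `mongorestore` is on the path; it is not runnable without a cluster.

```shell
# Hypothetical round trip: copy a BSON output part from HDFS,
# then load it straight back into MongoDB with mongorestore.
hadoop fs -copyToLocal /output/results/part-r-00000.bson .
mongorestore --host localhost --db analytics --collection results \
    part-r-00000.bson
```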
The project is officially EOL with no further development, bug fixes, or documentation updates, as stated in the README notice, making it risky for long-term use.
The connector was only tested against older releases such as Hadoop 2.4 and Spark 1.4 per the stated requirements, so it may not work with contemporary distributions and tools.
Requires manually copying the JAR files to each node in the Hadoop cluster per the build instructions, adding setup overhead and maintenance burden.
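The manual distribution step amounts to placing the built connector JAR on every node. A sketch of that step, simulated here with local directories standing in for cluster nodes; on a real cluster you would `scp` into each host's Hadoop lib directory, and the JAR name below is a stand-in for whatever the build produces.

```shell
# Simulate distributing the connector JAR to each node's Hadoop lib dir.
mkdir -p build/libs node1/hadoop/lib node2/hadoop/lib
touch build/libs/mongo-hadoop-core.jar   # stand-in for the built JAR
for node in node1 node2; do
    cp build/libs/mongo-hadoop-core.jar "$node/hadoop/lib/"
done
ls node1/hadoop/lib node2/hadoop/lib
```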