A Python wrapper for Cascading that enables building and controlling Hadoop data processing workflows entirely in Python.
PyCascading is a Python wrapper for the Cascading framework that enables developers to build and control Hadoop data processing workflows entirely in Python. It allows defining data pipelines with Python operators, writing user-defined functions in Python, and managing distributed computations without Java boilerplate. The tool bridges Python's simplicity with Hadoop's scalability for big data processing tasks.
Data engineers and Python developers who need to build and manage Hadoop data processing workflows but prefer working in Python rather than Java. It's particularly useful for teams already invested in Python's data ecosystem.
Developers choose PyCascading because it eliminates the need to write Java code for Hadoop workflows while maintaining full access to Cascading's capabilities. Its Python-native approach reduces development time and makes complex data pipelines more accessible to Python-centric teams.
A Python wrapper for Cascading
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.
Enables building entire data pipelines with Python operators and chaining syntax, as shown in the word count example where operations like split_words and group_by are piped together.
Supports writing user-defined functions in Python using decorators like @udf_map, eliminating Java coding for custom logic and making it accessible to Python developers.
Implements caching of interim results in pipes, allowing for quicker replay and debugging during development cycles, as highlighted in the key features.
Offers a local Hadoop mode via local_run.sh for executing scripts without remote deployment, facilitating easier testing and validation on local files.
The README explicitly states it is no longer maintained, posing significant risks for production use due to lack of bug fixes, security updates, and support.
Relies on Jython 2.5.2 and specific old versions of Cascading and Hadoop (e.g., 0.20.2+), limiting compatibility with modern tools and introducing potential security vulnerabilities.
Requires manual configuration of dependencies in properties files, building with Ant, and SSH-based remote deployment via scripts, making initial setup and ongoing use cumbersome.