A language for distributed deep learning that simplifies model parallelism by letting users specify how tensor computations are laid out across hardware meshes.
Mesh TensorFlow is a distributed deep learning framework that enables model parallelism by specifying how tensor computations are split across multi-dimensional processor meshes. It solves the problem of training massive neural networks that don't fit on a single device by allowing tensor dimensions, such as the batch dimension or a hidden-layer dimension, to be split across multiple processors. The framework provides a higher-level abstraction over TensorFlow for defining distributed computation strategies.
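The core idea can be illustrated with a plain-NumPy sketch (not Mesh TensorFlow's actual API): splitting a large tensor along one dimension so that each processor stores and computes on only a shard of it.

```python
import numpy as np

def shard_along(tensor, axis, num_devices):
    """Split a tensor into equal shards along one axis, one per device.

    A toy stand-in for model parallelism: each device holds only a
    slice of the full tensor instead of a complete replica.
    """
    if tensor.shape[axis] % num_devices != 0:
        raise ValueError("dimension size must divide evenly across devices")
    return np.split(tensor, num_devices, axis=axis)

# A hidden-layer weight matrix too large for one device, split
# column-wise across 4 hypothetical devices.
weights = np.zeros((1024, 4096), dtype=np.float32)
shards = shard_along(weights, axis=1, num_devices=4)
print(len(shards), shards[0].shape)  # 4 shards of shape (1024, 1024)
```

Mesh TensorFlow generalizes this single-axis split to multi-dimensional meshes and inserts the necessary communication automatically.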
Machine learning researchers and engineers working with extremely large models (billions of parameters) that require distributed training across multiple GPUs or TPUs. Also suitable for those needing low-latency parallel inference or handling massive activation tensors.
Developers choose Mesh TensorFlow because it provides a principled, flexible approach to model parallelism that goes beyond simple data parallelism, with intuitive layout rules and automatic communication handling. Its unique selling point is the ability to specify complex distribution strategies through dimension naming and mesh mapping while maintaining compatibility with standard TensorFlow operations.
Mesh TensorFlow: Model Parallelism Made Easier
Tensors have named dimensions like 'batch' and 'hidden', allowing clear mapping to mesh dimensions for intuitive parallelism, as shown in the MNIST example where dimensions are explicitly defined.
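A minimal stand-in for this idea (a plain namedtuple rather than the real `mtf.Dimension`) shows how naming dimensions makes shapes self-describing:

```python
from collections import namedtuple

# Stand-in for mtf.Dimension: a (name, size) pair.
Dimension = namedtuple("Dimension", ["name", "size"])

batch = Dimension("batch", 128)
hidden = Dimension("hidden", 1024)
classes = Dimension("classes", 10)

# Shapes are lists of named dimensions, so code can refer to
# "the batch dimension" rather than the positional "axis 0".
activation_shape = [batch, hidden]
logits_shape = [batch, classes]

def size_of(shape):
    """Total number of elements in a named shape."""
    n = 1
    for d in shape:
        n *= d.size
    return n

print([d.name for d in activation_shape], size_of(activation_shape))
```

Because dimensions carry names, a layout rule can say "split `batch` across mesh rows" once, and it applies to every tensor that has a `batch` dimension.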
Users can define layout rules to split tensor dimensions across mesh dimensions, enabling data-parallelism, model-parallelism, or hybrid approaches, illustrated by the ability to combine splits on a 2D mesh.
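A toy model of layout rules (assuming a hypothetical 2D mesh with axes named "rows" and "cols", not Mesh TensorFlow's actual data structures) shows how a layout determines the per-processor shard shape:

```python
# A 2D processor mesh: mesh-axis name -> number of processors.
mesh_shape = {"rows": 4, "cols": 2}

# Layout rules: tensor-dimension name -> mesh axis it is split over.
# Dimensions not listed are replicated on every processor.
layout_rules = {"batch": "rows", "hidden": "cols"}

def shard_shape(tensor_shape, layout_rules, mesh_shape):
    """Per-processor shape of a tensor under the given layout.

    tensor_shape is a dict of dimension name -> size.
    """
    result = {}
    for name, size in tensor_shape.items():
        axis = layout_rules.get(name)
        divisor = mesh_shape[axis] if axis else 1
        assert size % divisor == 0, f"{name} must divide evenly"
        result[name] = size // divisor
    return result

# Hybrid parallelism on the 2D mesh: batch split over mesh rows
# (data parallelism), hidden split over mesh columns (model
# parallelism) at the same time.
full = {"batch": 128, "hidden": 1024}
print(shard_shape(full, layout_rules, mesh_shape))
```

An empty `layout_rules` dict would give pure replication, while mapping only "batch" gives pure data parallelism; the combined mapping above is the hybrid case.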
Works on CPU, GPU, and TPU clusters with different implementation strategies, such as PlacementMeshImpl for CPU/GPU and SimdMeshImpl for TPU, providing versatility across hardware.
The auto_mtf subpackage can automatically choose efficient layouts based on model structure and hardware configuration, reducing manual tuning effort for performance.
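The idea behind automatic layout selection can be sketched as a brute-force search (a toy stand-in, not auto_mtf's actual algorithm, which also models communication cost): enumerate assignments of tensor dimensions to mesh axes and keep the one that minimizes per-processor storage.

```python
import itertools

def per_processor_size(tensor_shapes, layout, mesh_shape):
    """Total elements stored per processor for a list of named tensors.

    Toy cost model: counts storage only, ignoring communication.
    """
    total = 0
    for shape in tensor_shapes:
        elems = 1
        for name, size in shape.items():
            axis = layout.get(name)
            elems *= size // mesh_shape[axis] if axis else size
        total += elems
    return total

def search_layout(dim_names, tensor_shapes, mesh_shape):
    """Try every assignment of tensor dimensions to mesh axes (or None)."""
    axes = [None] + list(mesh_shape)
    best, best_cost = {}, float("inf")
    for choice in itertools.product(axes, repeat=len(dim_names)):
        # Each mesh axis may be used by at most one tensor dimension.
        used = [a for a in choice if a is not None]
        if len(used) != len(set(used)):
            continue
        layout = {d: a for d, a in zip(dim_names, choice) if a}
        cost = per_processor_size(tensor_shapes, layout, mesh_shape)
        if cost < best_cost:
            best, best_cost = layout, cost
    return best, best_cost

# Two-layer model: activations [batch, hidden], weights [hidden, classes].
mesh = {"rows": 4, "cols": 2}
tensors = [{"batch": 64, "hidden": 256}, {"hidden": 256, "classes": 8}]
layout, cost = search_layout(["batch", "hidden", "classes"], tensors, mesh)
print(layout, cost)
```

Even this crude search recovers a sensible hybrid layout; the real auto_mtf does the equivalent job without the user hand-tuning layout rules.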
Running on cloud TPU requires manual VM and TPU instance creation with specific TensorFlow versions, as detailed in the lengthy setup instructions, adding operational overhead.
Experimental features like new input pipelines are not tested on GPUs and may require debugging, per the README note, making GPU adoption less straightforward.
Users must understand mesh dimensions, layout rules, and Einsum operations, which adds complexity compared to standard distributed frameworks, as evidenced by the detailed MNIST example code.
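The Einsum-centric style can be previewed with NumPy (the real `mtf.einsum` takes a list of named output dimensions rather than a subscript string): many layers are written as a single generalized contraction, and it is this contraction pattern, combined with the layout rules, that determines which processors must communicate.

```python
import numpy as np

# Subscripts standing in for named dimensions:
#   b = batch, h = hidden, c = classes
batch, hidden, classes = 32, 64, 10
activations = np.random.rand(batch, hidden)  # [batch, hidden]
weights = np.random.rand(hidden, classes)    # [hidden, classes]

# A dense layer expressed as an einsum: contract the shared "hidden"
# dimension. If "hidden" were split across a mesh axis, the framework
# would insert a reduction (e.g. an allreduce) over that axis here.
logits = np.einsum("bh,hc->bc", activations, weights)
print(logits.shape)  # (32, 10)
```

This is the learning curve the con refers to: users reason about contractions over named dimensions rather than about explicit device placement.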
Default APIs can become bottlenecks for large inputs, forcing reliance on experimental features like input_reader.py, which are less stable and primarily tested on TPUs.