A flow-based generative network for fast, high-quality speech synthesis from mel-spectrograms.
WaveGlow is a flow-based generative neural network for speech synthesis that converts mel-spectrograms into high-quality audio. It solves the problem of slow autoregressive audio generation by providing fast, efficient synthesis while maintaining audio quality comparable to state-of-the-art WaveNet implementations. The model uses a single network architecture trained with maximum likelihood estimation for simplicity and stability.
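The key mechanism behind such a flow can be illustrated with a toy affine coupling layer in NumPy (the function names and random weights below are illustrative, not the repo's API): half the channels pass through unchanged and parameterize an affine transform of the other half, so the transform inverts exactly and its Jacobian log-determinant is cheap to compute.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(x, weight, bias):
    # Split channels: the first half conditions an affine transform of the second.
    xa, xb = np.split(x, 2)
    log_s = np.tanh(weight @ xa)  # stand-in for WaveGlow's conditioning network
    t = bias @ xa
    yb = np.exp(log_s) * xb + t
    # The Jacobian is triangular, so its log-determinant is just sum(log s).
    return np.concatenate([xa, yb]), log_s.sum()

def coupling_inverse(y, weight, bias):
    # Because ya == xa, the same conditioning values can be recomputed exactly.
    ya, yb = np.split(y, 2)
    log_s = np.tanh(weight @ ya)
    t = bias @ ya
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb])

x = rng.standard_normal(8)
W, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
y, log_det = coupling_forward(x, W, B)
x_rec = coupling_inverse(y, W, B)
print(np.allclose(x, x_rec))  # True: the transform inverts exactly
```

Exact invertibility is what lets all samples be generated in parallel: inference just runs the inverse flow on Gaussian noise, with no sample-by-sample autoregressive loop.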
AI researchers and engineers working on speech synthesis, text-to-speech systems, and generative audio models who need fast inference without sacrificing quality.
Developers choose WaveGlow for its combination of fast inference speeds (1200 kHz on V100 GPUs) and high audio quality, achieved through an elegant flow-based architecture that eliminates the complexity of autoregressive models while maintaining competitive Mean Opinion Scores.
A Flow-based Generative Network for Speech Synthesis
Generates audio at 1200 kHz on an NVIDIA V100 GPU, enabling real-time speech synthesis applications as highlighted in the README.
Achieves Mean Opinion Scores comparable to the best WaveNet implementations, ensuring professional audio quality without autoregressive complexity.
Uses a single network with one cost function for likelihood maximization, making training stable and straightforward, as described in the paper summary.
Flow-based architecture combines insights from Glow and WaveNet to eliminate autoregression, reducing computational overhead for faster generation.
Includes FP16 training and inference options, optimizing performance on compatible NVIDIA hardware, with configurable settings in config.json.
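A sketch of the relevant fragment, assuming the repo's config.json groups training options under a train_config block (the exact field names may differ between versions of the repo):

```json
{
  "train_config": {
    "fp16_run": true
  }
}
```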
Requires git submodule initialization and installation of NVIDIA Apex, which can be challenging and platform-specific, adding setup overhead.
Optimized for NVIDIA V100 GPUs with FP16 support, limiting portability and performance on other GPUs or CPU-only environments.
Inference depends on pre-computed mel-spectrograms, adding an extra step compared to end-to-end TTS systems, as seen in the generation instructions.
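Those mel-spectrograms are typically produced by a windowed STFT followed by a mel filterbank projection. A self-contained NumPy sketch of that step (the 22050 Hz / 1024-point FFT / 80-mel settings are common Tacotron 2-style defaults assumed here, not values read from the repo's config):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / (right - center)
    return fb

def mel_spectrogram(audio, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Frame the signal, apply a Hann window, take the power spectrum,
    # then project each frame onto the mel filterbank.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)

# One second of a 440 Hz tone yields an (80, 83) mel-spectrogram at these settings.
audio = np.sin(2 * np.pi * 440.0 * np.arange(22050) / 22050.0)
mel = mel_spectrogram(audio)
print(mel.shape)  # (80, 83)
```

In practice a TTS front end such as Tacotron 2 produces these features directly; the sketch just shows why inference needs them as a separate input rather than raw text.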
README has a TODO for dataset download instructions and limited testing for multi-GPU training, indicating gaps that may hinder newcomers.