A FlashAttention 2 implementation for JAX with block-wise document mask optimization and context parallelism for efficient long-sequence training.
Kvax is a FlashAttention 2 implementation for the JAX machine learning framework, optimized for efficient attention computation on long sequences. It solves the quadratic memory and computational bottlenecks of standard attention by implementing block-wise document mask optimization and context parallelism, enabling faster training of transformer models with extended contexts.
Machine learning researchers and engineers using JAX for training large transformer models on long sequences, particularly those working on distributed training setups with FSDP/HSDP sharding.
Developers choose Kvax for its JAX-native implementation of FlashAttention 2 with specialized optimizations for document masks and context parallelism, offering better performance and memory efficiency compared to generic attention implementations when working with packed sequences and distributed training.
A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.
Implements block-wise attention masks built once per forward-backward pass using a Triton kernel, preventing cross-sequence contamination in packed sequences and avoiding O(seq_len²) GPU memory usage, as highlighted in the Key Concepts.
Leverages an all-gather based method from the Llama 3 training paper to distribute sequence processing across GPUs, reducing activation memory and balancing computational load, which is critical for distributed training setups.
Automatically skips blocks consisting entirely of padding tokens to save computation, with clear guidelines in the How to Use section for marking padding tokens with PADDING_SEGMENT_ID.
Uses FlashAttention 2 algorithms implemented in Triton for optimized GPU performance, as demonstrated in the benchmarks showing speedups in forward and backward passes with long sequences.
Does not support bias, sliding window, or custom masks like ALiBi, which are common in advanced transformer models, restricting its use cases as admitted in the Limitations section.
The README warns that automatically installed versions of Triton and JAX-Triton may be incompatible, forcing users to manually manage dependencies, adding friction to setup.
Requires precise marking of padding tokens with segment IDs and careful use of context managers like attention_specs for sharding, increasing integration overhead compared to drop-in alternatives.