A parallel gzip decompressor with fast random access, utilizing multi-core CPUs for high-speed decompression of standard gzip files.
Rapidgzip is a parallel gzip decompression tool and library that accelerates decompression of standard gzip files on multi-core machines. It solves the problem of slow single-threaded gzip decompression by utilizing all available CPU cores, and adds fast random access capabilities for seeking within compressed files without full decompression.
It is aimed at developers and data engineers who work with large gzip-compressed datasets, such as genomic FASTQ files, log archives, or scientific data, and who need faster decompression or random access to parts of those files.
Unlike tools such as bgzip or pigz, which only parallelize their own formats, Rapidgzip works with almost any standard gzip file. It offers significant speedups (up to 55x with 128 cores) and random access without requiring re-compression.
Gzip Decompression and Random Access for Modern Multi-Core Machines
Leverages multiple CPU cores to achieve up to 24 GB/s decompression bandwidth with an index, significantly faster than single-threaded gzip.
Works with almost any gzip file from common tools like GNU gzip, bgzip, pigz, and igzip, avoiding format lock-in and re-compression needs.
Enables seeking within gzip files without full decompression, using a block offset map together with caching and prefetching to speed up access.
Supports importing and exporting precomputed indices, which speed up subsequent seeks and allow decompression to be delegated to optimized libraries such as ISA-L.
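The idea behind a block offset map and a reusable index can be illustrated with a minimal, standard-library-only sketch. It is not Rapidgzip's implementation or API: to keep things simple it assumes the file consists of independently compressed gzip members (similar in spirit to bgzip output), whereas Rapidgzip is described as handling ordinary gzip files. The function names, the `CHUNK_SIZE` value, and the JSON index format are all hypothetical choices made for this illustration.

```python
import bisect
import gzip
import json
import zlib

CHUNK_SIZE = 64 * 1024  # uncompressed bytes per member (arbitrary value for this sketch)


def compress_in_members(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Concatenate independently compressed gzip members (still a valid gzip file)."""
    return b"".join(
        gzip.compress(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)
    )


def build_offset_map(blob: bytes) -> list:
    """Scan the file once, recording [uncompressed_offset, compressed_offset] per member."""
    offsets = []
    comp_pos = 0
    uncomp_pos = 0
    while comp_pos < len(blob):
        decoder = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
        out = decoder.decompress(blob[comp_pos:])
        offsets.append([uncomp_pos, comp_pos])
        uncomp_pos += len(out)
        # Bytes after the end of this member are left in unused_data.
        comp_pos = len(blob) - len(decoder.unused_data)
    return offsets


def read_at(blob: bytes, offsets: list, start: int, size: int) -> bytes:
    """Return `size` bytes from uncompressed position `start`, decoding only needed members."""
    if not offsets:
        return b""
    starts = [uncomp for uncomp, _ in offsets]
    i = bisect.bisect_right(starts, start) - 1  # member that contains `start`
    skip = start - offsets[i][0]
    out = bytearray()
    while i < len(offsets) and len(out) < skip + size:
        comp_start = offsets[i][1]
        comp_end = offsets[i + 1][1] if i + 1 < len(offsets) else len(blob)
        out += gzip.decompress(blob[comp_start:comp_end])
        i += 1
    return bytes(out[skip:skip + size])


data = bytes(range(256)) * 1024           # 256 KiB of sample data
blob = compress_in_members(data)
index = build_offset_map(blob)            # the slow first pass
saved_index = json.dumps(index)           # "export" the index
restored_index = json.loads(saved_index)  # "import" it later and skip the scan
chunk = read_at(blob, restored_index, 100_000, 50)
```

The first pass over the file is the expensive step; once the map is saved, later reads can jump straight to the member that holds the requested position.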
Uses an LRU cache and a parallel prefetcher, which increase RAM consumption, as noted in the README's performance trade-offs.
First seek or random access requires building the block offset map, which can be time-consuming for large files without a precomputed index.
On NUMA architectures, it may fail to scale efficiently across sockets due to data transfer costs, as mentioned in FASTQ data benchmarks.
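The memory trade-off of combining an LRU cache with a parallel prefetcher can be seen in a small sketch. It is a generic illustration, not Rapidgzip's code: the class name `PrefetchingChunkCache`, its parameters, and the `load_chunk` callback are hypothetical, and the sketch assumes a single consumer thread calling `get`.

```python
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingChunkCache:
    """LRU cache of chunk futures that also schedules the next few chunks in the background."""

    def __init__(self, load_chunk, capacity=4, prefetch=2, workers=2):
        self._load = load_chunk          # hypothetical callback: chunk index -> bytes
        self._capacity = capacity        # at most this many chunks are kept in RAM
        self._prefetch = prefetch        # how many following chunks to request early
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._entries = OrderedDict()    # chunk index -> Future, oldest first

    def _submit(self, index):
        if index in self._entries:
            self._entries.move_to_end(index)  # mark as most recently used
        else:
            self._entries[index] = self._pool.submit(self._load, index)
            while len(self._entries) > self._capacity:
                self._entries.popitem(last=False)  # evict the least recently used chunk
        return self._entries[index]

    def get(self, index):
        future = self._submit(index)
        for ahead in range(index + 1, index + 1 + self._prefetch):
            self._submit(ahead)          # decoded in parallel while the caller works
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)


chunks = [bytes([i]) * 8 for i in range(10)]
loads = []  # records which chunk indices were actually loaded


def load(i):
    loads.append(i)
    return chunks[i] if i < len(chunks) else b""


cache = PrefetchingChunkCache(load)
first = cache.get(0)   # loads chunk 0 and schedules chunks 1 and 2
second = cache.get(1)  # already scheduled by the prefetcher, so not loaded again
cache.close()
```

Peak memory in this sketch grows with `capacity` times the chunk size, which is the kind of cost the trade-off above refers to: larger caches and deeper prefetching reduce waiting but hold more decompressed data at once.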