A suite of high-performance command line tools for filtering, summarizing, joining, and manipulating large tabular data files.
TSV Utilities is a set of command line tools for manipulating large tabular data files, such as those found in machine learning and data mining environments. It provides fast filtering, statistics, sampling, and join operations, often outperforming similar tools. The toolkit targets data too large to load comfortably into memory with tools like R or Pandas, yet not so large as to require distributed systems like Hadoop.
Data scientists, machine learning engineers, and developers working with large tabular datasets who need efficient command line tools for data preparation and analysis. It is especially useful for those transitioning data between tools like R, Pandas, and Unix utilities.
Developers choose TSV Utilities for its high performance and ease of integration into Unix pipelines. It offers specialized tools that are faster than alternatives like awk, cut, and sort for many tabular data tasks, with features like named field support, Unicode readiness, and header-aware processing.
eBay's TSV Utilities: Command line tools for large, tabular data files. Filtering, statistics, sampling, joins and more.
Documented performance studies show tools like tsv-filter and tsv-join significantly outperform traditional Unix utilities like awk and sort for large tabular data, with optimizations like streaming algorithms and in-memory lookups.
Designed to follow Unix conventions, tools read from files or stdin and write to stdout, enabling seamless chaining with existing command-line workflows and complementing tools like cut and grep.
Offers specialized utilities for complex tasks such as weighted sampling in tsv-sample, statistical summaries with grouping in tsv-summarize, and header-aware processing across multiple files, reducing script complexity.
Supports named fields, wildcard matching, and bash completion for faster command construction, and includes utilities like keep-header to simplify common tasks like sorting with headers intact.
Tools like tsv-join and tsv-uniq rely on in-memory hash tables; the README notes that performance degrades beyond roughly 10 million unique entries, limiting their effectiveness for very large or high-cardinality datasets.
Building from source requires installing a D compiler (DMD or LDC) and optionally enabling LTO/PGO for optimal performance, which is more involved than using pre-built binaries or standard package managers like apt or brew.
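A minimal build sketch (a setup fragment, assuming git, make, and an LDC install are already present; the DCOMPILER variable is from the project's Makefile):

```shell
# Clone the repository and build with LDC, which generally produces
# the fastest binaries; plain `make` falls back to the default D compiler.
git clone https://github.com/eBay/tsv-utils.git
cd tsv-utils
make DCOMPILER=ldc2
```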
Primarily focused on TSV and CSV formats; lacks native support for other common data serialization formats like JSON, Parquet, or Avro, necessitating additional conversion steps in modern data workflows.