A cross-platform CLI tool for cleaning and improving text datasets for machine learning, with fast operations and LLM-based filtering.
Ambrosia is a cross-platform command-line utility designed to enhance the quality of text datasets used for training machine learning models. It provides fast, standard data cleaning operations like deduplication and filtering, alongside a unique capability to leverage large language models (LLMs) for intelligent sorting and filtering of dataset entries.
Ambrosia is aimed at machine learning engineers and data scientists who need to preprocess and clean text datasets efficiently, particularly those working with instruction-tuning datasets or other structured text data that requires quality filtering.
Developers choose Ambrosia for its combination of high-speed traditional data hygiene tools and programmable LLM-based filtering in a single, dependency-free binary, enabling both foundational cleaning and intelligent, prompt-driven dataset curation.
clean up your LLM datasets
Offers deduplication, whitespace trimming, and string/length filtering through commands such as 'dedupe' and 'length'; the README emphasizes that these operations are built for speed.
The 'psort' command uses OpenAI models (GPT-3.5 and GPT-4) to classify and sort entries according to custom prompts, with configurable concurrency and rate limits ('--rpm', '--tpm') for batch processing.
Most commands support specifying fields (e.g., '--fields input,output') for targeted operations, enabling complex multi-field comparisons and prompts without modifying entire datasets.
Includes ROUGE-L deduplication to identify near-duplicates based on content similarity, a step beyond exact matching, with configurable thresholds via '--rl-threshold'.
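The idea behind ROUGE-L near-duplicate detection can be sketched in a few lines of Python. This is an illustrative sketch only, not Ambrosia's implementation: the function names, the whitespace tokenization, the lowercasing, the use of the F1 form of ROUGE-L, and the inclusive threshold comparison are all assumptions made for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(text_a, text_b):
    """ROUGE-L F1 score between two strings (assumed tokenization: lowercase, whitespace)."""
    a, b = text_a.lower().split(), text_b.lower().split()
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision = lcs / len(b)
    recall = lcs / len(a)
    return 2 * precision * recall / (precision + recall)


def dedupe_near(records, fields, threshold):
    """Keep the first record of each group whose selected fields score >= threshold.

    `records` is a list of dicts; `fields` names the keys to compare, in order.
    Both the signature and the first-wins policy are hypothetical.
    """
    kept, kept_texts = [], []
    for record in records:
        text = " ".join(str(record.get(name, "")) for name in fields)
        if any(rouge_l_f1(text, seen) >= threshold for seen in kept_texts):
            continue  # near-duplicate of something already kept
        kept.append(record)
        kept_texts.append(text)
    return kept
```

Because every candidate is compared against every record kept so far, this sketch is quadratic in the dataset size; it is meant to show what a similarity threshold does, not how a fast tool would do it. A lower threshold removes more loosely related entries, while a threshold of 1.0 only removes entries whose token sequences match exactly.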
Currently only supports OpenAI's GPT-3.5 and GPT-4, with no mention of open-source or local LLMs, restricting flexibility and potentially increasing costs for users.
The README explicitly marks Ambrosia as pre-1.0 software, so users may encounter bugs and breaking changes; this makes it less suitable for production-critical workflows.
Effective use of 'psort' requires careful prompt design to ensure sortable LLM responses, and the README warns that field order isn't guaranteed without explicit specification, adding setup complexity.
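To make the point about sortable responses concrete, here is a minimal Python sketch of one way to design such a prompt. It is not how 'psort' works internally: the prompt wording, the 1 to 10 scale, the `ask_model` callable (a stand-in for whatever sends the prompt to a model), and every function name are hypothetical. The sketch lists fields in an explicit order and asks for a bare number, so that replies can be parsed and sorted.

```python
import re

# Matches a standalone integer from 1 to 10 (assumed scoring scale).
SCORE_PATTERN = re.compile(r"\b(10|[1-9])\b")


def build_prompt(record, fields):
    """Build a scoring prompt with fields in the exact order given."""
    parts = [f"{name}: {record.get(name, '')}" for name in fields]
    return (
        "Rate the quality of this example from 1 to 10. "
        "Reply with the number only.\n\n" + "\n".join(parts)
    )


def parse_score(response):
    """Return the first 1-10 integer in the reply, or None if there is none."""
    match = SCORE_PATTERN.search(response)
    return int(match.group(1)) if match else None


def sort_by_score(records, fields, ask_model):
    """Sort records from highest to lowest score; unparseable replies go last.

    `ask_model` is a hypothetical callable taking a prompt string and
    returning the model's reply as a string.
    """
    scored = []
    for record in records:
        reply = ask_model(build_prompt(record, fields))
        scored.append((parse_score(reply), record))
    scored.sort(key=lambda pair: (pair[0] is None, -(pair[0] or 0)))
    return [record for _, record in scored]
```

Constraining the reply to a single number keeps parsing simple, and placing unparseable replies at the end makes prompt failures easy to spot rather than silently mixing them into the ranking.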