Question 1

How does Ambrosia compare to Hugging Face datasets for cleaning?

Accepted Answer

Ambrosia excels as a lean CLI tool for fast operations and LLM-based filtering in a single binary, while Hugging Face datasets offer broader ML ecosystem integration but may be heavier. Choose Ambrosia for targeted cleaning with LLMs.

Question 2

Can I use Ambrosia with local LLMs like Llama?

Accepted Answer

No, the README specifies support only for GPT-3.5 and GPT-4 via API, so local or open-source LLMs aren't currently supported. You'd need cloud-based access for the 'psort' feature.

Question 3

How to set up rate limits for LLM filtering in Ambrosia?

Accepted Answer

Use the '--rpm' (requests per minute) and '--tpm' (tokens per minute) flags with the 'psort' command to control API usage and avoid rate limits, as detailed in the options section.

Question 4

Does Ambrosia filter by token count instead of bytes?

Accepted Answer

No, the 'length' command filters by byte count only, as the README states tokenizers vary and byte count is more reliable. You'd need external tools for token-based filtering.

Question 5

What's the cost of using Ambrosia with GPT-4 for large datasets?

Accepted Answer

Ambrosia is free, but 'psort' incurs costs based on OpenAI's API pricing per request and token. Use '--max-tokens' and rate limits to manage expenses, but budgets can escalate with large datasets.

Question 6

How to handle prompt injection in Ambrosia's LLM filtering?

Accepted Answer

Use the '--end-instruction' flag to append text after the data, helping mitigate unintentional prompt injection, as suggested in the README's 'psort' command description for certain use cases.

Ambrosia

What is Ambrosia?

Overview

Use Cases

Best For

Related Projects

Found a gem we're missing?

Not Ideal For

Pros & Cons

Pros

Cons

Frequently Asked Questions