A Python tool for extracting malware family names and tags from antivirus engine labels, designed for large-scale malware analysis.
AVClass is a Python-based tool that extracts structured tags—such as malware family names, classes, behaviors, and file properties—from antivirus engine labels. It addresses the challenge of inconsistent and noisy AV labeling by providing automated, vendor-agnostic analysis suitable for large-scale malware datasets.
AVClass is aimed at security researchers, malware analysts, and threat intelligence teams who need to label large volumes of malware samples consistently, especially those working with AV reports from sources such as VirusTotal.
AVClass offers a scalable, automated solution with quantified accuracy, eliminating manual labeling efforts and providing reliable tags across diverse AV engines and platforms without requiring executable files.
Processes millions of samples without manual intervention, as highlighted in the README's examples with large datasets like malheurReference_lb.json.
Works with labels from any set of AV engines, supporting inputs from VirusTotal v2/v3, OPSWAT MetaDefender, and custom formats without dependency on specific vendors.
Operates on sample hashes and AV labels alone, eliminating the need for executable files, which is ideal for analyzing samples where binaries are unavailable.
Reports precision, recall, and F1 scores against ground truth, as shown in the README's ground truth evaluation section, where the public Malheur dataset yields over 90% precision.
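To make the accuracy metrics concrete, the following is a minimal sketch of one common way to score family labels against ground truth, treating each family as a cluster of samples. Whether AVClass computes its scores exactly this way is an assumption; the function name `clustering_scores` and the sample data are hypothetical and not part of the tool.

```python
from collections import Counter, defaultdict


def clustering_scores(predicted, truth):
    """Return (precision, recall, f1) for two {sample_hash: family} mappings.

    Illustrative only: this is a generic clustering-style evaluation,
    not code taken from AVClass.
    """
    common = predicted.keys() & truth.keys()
    if not common:
        raise ValueError("no samples in common")

    by_pred = defaultdict(Counter)  # predicted family -> counts of true families
    by_true = defaultdict(Counter)  # true family -> counts of predicted families
    for sample in common:
        by_pred[predicted[sample]][truth[sample]] += 1
        by_true[truth[sample]][predicted[sample]] += 1

    n = len(common)
    # Precision: how pure each predicted family is with respect to the truth.
    precision = sum(max(c.values()) for c in by_pred.values()) / n
    # Recall: how well each true family is kept together by the prediction.
    recall = sum(max(c.values()) for c in by_true.values()) / n
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example data (hashes shortened to single letters).
predicted = {"a": "zbot", "b": "zbot", "c": "zbot", "d": "virut"}
truth = {"a": "zeus", "b": "zeus", "c": "sality", "d": "sality"}
print(clustering_scores(predicted, truth))
```

Note that this style of scoring compares groupings rather than names, so a predicted family called `zbot` can match a ground-truth family called `zeus` as long as the same samples are grouped together.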
Output quality is limited by the noise and sparsity of the input AV labels; AVClass cannot extract tags when the AV engines provide only generic tokens, as the Limitations section acknowledges.
Outputs only tags that appear in the labels of at least two AV engines, so it may miss relevant information from single detections or from emerging threats with limited AV coverage.
Designed for large-scale dataset analysis rather than real-time or interactive use, making it unsuitable for scenarios requiring immediate tagging of individual samples.
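The two-engine threshold and the dependence on non-generic tokens can be illustrated with a small sketch that works from AV labels alone. This is a deliberately simplified stand-in, not AVClass's actual tokenization or tagging logic: the `GENERIC` list, the `tokens` and `rank_tokens` helpers, and the engine names and labels are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical list of generic tokens that carry no family information.
GENERIC = {"trojan", "win32", "w32", "generic", "malware", "agent", "gen", "variant"}


def tokens(label):
    """Split one AV label into a set of lowercase, non-generic tokens."""
    parts = re.split(r"[^a-z0-9]+", label.lower())
    return {
        t for t in parts
        if len(t) > 2 and not t.isdigit() and t not in GENERIC
    }


def rank_tokens(av_labels, min_engines=2):
    """Rank tokens by how many engines mention them.

    av_labels maps engine name -> label string for a single sample.
    Tokens seen in fewer than min_engines labels are discarded.
    """
    votes = Counter()
    for label in av_labels.values():
        votes.update(tokens(label))  # a set, so each engine votes once per token
    kept = [(t, n) for t, n in votes.items() if n >= min_engines]
    # Sort by vote count (descending), then alphabetically for stable output.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Hypothetical labels for one sample, keyed by engine.
labels = {
    "EngineA": "Trojan.Win32.Zbot.abc",
    "EngineB": "W32/Zbot.XY!tr",
    "EngineC": "Trojan:Win32/Zeus",
    "EngineD": "Generic.Malware",
}
print(rank_tokens(labels))                   # [('zbot', 2)]
print(rank_tokens({"EngineA": "Trojan.Zeus"}))  # []
```

In this sketch `zeus` is dropped because only one engine reports it, and a sample detected by a single engine yields nothing at all, which mirrors the trade-off described above.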