An open dataset and toolkit for training static PE malware machine learning models, featuring millions of labeled Windows executable samples.
EMBER is an open-source dataset and toolkit for training static machine learning models to detect malware in Windows Portable Executable (PE) files. It provides labeled features from millions of PE samples, along with scripts for feature extraction, model training, and classification. The project solves the problem of inconsistent benchmarking in malware detection research by offering a standardized, reproducible framework.
Cybersecurity researchers, data scientists, and machine learning engineers focused on malware detection and static analysis of Windows executables. It's particularly valuable for academic institutions and security teams developing or evaluating ML-based threat detection systems.
Researchers choose EMBER because it provides a large, curated, and versioned dataset with reproducible tooling, eliminating the need to collect and label PE files manually. Its open nature and benchmark models enable direct comparison of new techniques against established baselines, accelerating research in the field.
Elastic Malware Benchmark for Empowering Researchers
Includes labeled features extracted from over 2 million PE files collected in 2017 and 2018, providing a substantial foundation for model training without manual data collection.
Offers scripts like train_ember.py and classify_binaries.py to train LightGBM models and classify new binaries, ensuring consistent experimental results across studies.
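A typical invocation of those scripts looks roughly like the following; the dataset and model paths are placeholders, and exact flags should be checked against the repository's README:

```shell
# Train the benchmark LightGBM model on an extracted dataset directory
python train_ember.py /data/ember2018/

# Classify new PE binaries with the trained model (-m points at the model file)
python classify_binaries.py -m /data/ember2018/model.txt ./samples/*.exe
```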
Documents feature versions (1 and 2) tied to LIEF library releases, allowing researchers to track changes and maintain reproducibility in feature extraction.
Converts raw JSON feature records into vectorized numeric arrays via functions like create_vectorized_features, simplifying pipeline integration.
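The core idea behind that vectorization step can be sketched as follows. This is a simplified illustration, not EMBER's actual implementation: the field names (histogram, general, size) mirror the shape of EMBER's raw JSON records, but the real create_vectorized_features handles many more feature groups and writes the result to memory-mapped arrays.

```python
import json

def vectorize(record):
    # Flatten one raw JSON feature record into a fixed-length numeric vector.
    # Normalize the byte histogram so bins sum to 1, making the feature
    # independent of file size, then append the size itself as its own feature.
    hist = record["histogram"]
    total = sum(hist) or 1
    vec = [h / total for h in hist]
    vec.append(float(record["general"]["size"]))
    return vec

# A toy record with a 4-bin histogram (EMBER's real histograms have 256 bins)
raw = json.loads('{"histogram": [1, 3, 0, 0], "general": {"size": 4096}}')
print(vectorize(raw))  # -> [0.25, 0.75, 0.0, 0.0, 4096.0]
```

Each raw JSON line becomes one row of a feature matrix this way, which is what LightGBM consumes during training.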
The README notes that different sample selection criteria were used for the 2017 and 2018 datasets, which can skew longitudinal studies and requires careful handling.
Feature extraction depends on specific LIEF versions; models trained with one version may yield unpredictable results with another, breaking reproducibility.
Datasets are from 2017-2018, making them less relevant for detecting contemporary malware trends unless supplemented with newer data.
LIEF 0.9.0 fails to install on Apple M1 machines, forcing users to rely on Docker workarounds, which adds setup complexity and potential inconsistencies.