An open dataset and toolkit for training static PE malware machine learning models, with features extracted from millions of Windows executable files.
EMBER is an open-source dataset and toolkit for training static malware detection models on Windows Portable Executable (PE) files. It provides extracted features from millions of PE files, along with scripts to train benchmark machine learning models and classify new samples. The project addresses the need for standardized, reproducible benchmarks in malware research.
EMBER serves security researchers, data scientists, and malware analysts working on machine learning-based detection systems. It's particularly valuable for academics and industry professionals who need reproducible baselines for comparing malware classification approaches.
EMBER provides a curated, version-controlled dataset with consistent feature extraction, enabling direct comparison of different machine learning techniques. Unlike proprietary datasets, it's fully open and includes tools to reproduce benchmark results, accelerating research in malware detection.
EMBER is short for Elastic Malware Benchmark for Empowering Researchers.
Uses the LIEF library to extract both raw and vectorized features from PE files, yielding a rich, structured dataset for machine learning models, as described in the README.
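As an illustration, here is a minimal sketch of vectorizing a single PE file with the extractor class from the ember package; the file path is a placeholder, and feature_version=2 corresponds to the 2018-era feature set:

```python
# Minimal sketch: vectorize one PE file with EMBER's LIEF-backed extractor.
# Assumes the ember package is installed; "sample.exe" is a placeholder path.
from ember.features import PEFeatureExtractor

with open("sample.exe", "rb") as f:
    file_data = f.read()

extractor = PEFeatureExtractor(feature_version=2)       # 2018-era feature set
feature_vector = extractor.feature_vector(file_data)    # flat numpy float array
print(feature_vector.shape)
```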
Includes scripts like train_ember.py to train LightGBM models and classify_binaries.py for predictions, making benchmark results straightforward to reproduce and verify.
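For orientation, a hedged sketch of the equivalent workflow through the Python API that those scripts wrap; the paths are placeholders, and the first call assumes the raw dataset has been downloaded and unpacked:

```python
# Sketch of the library-level workflow behind train_ember.py and
# classify_binaries.py; "/data/ember2018/" and "putty.exe" are placeholders.
import ember

# One-time step: turn the raw JSONL feature files into numpy arrays on disk
ember.create_vectorized_features("/data/ember2018/")

# train_ember.py equivalent: fit the benchmark LightGBM model
lgbm_model = ember.train_model("/data/ember2018/")

# classify_binaries.py equivalent: score a single PE file
with open("putty.exe", "rb") as f:
    score = ember.predict_sample(lgbm_model, f.read())
print(f"malicious score: {score:.3f}")
```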
Ties each feature version to a specific LIEF release (e.g., LIEF 0.9.0 for feature version 2), reducing variability and enabling exact replication of experiments.
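A small defensive guard, sketched here rather than taken from the toolkit, can make that pinning explicit before any extraction runs; the version string follows the README's pairing of LIEF 0.9.0 with feature version 2:

```python
# Illustrative guard: refuse to extract if the installed LIEF release does not
# match the one the feature version was calibrated against.
import lief

EXPECTED_LIEF = "0.9.0"  # paired with EMBER feature version 2 per the README
installed = lief.__version__.split("-")[0]
if installed != EXPECTED_LIEF:
    raise RuntimeError(
        f"LIEF {installed} installed; feature version 2 expects {EXPECTED_LIEF}"
    )
```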
Spans more than 2 million samples across the 2017 and 2018 releases, enabling studies of malware evolution, though with noted inconsistencies in how samples were selected between years.
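For example, loading one year's vectorized training split might look like the following sketch; the directory layout is assumed to match the published archives, and unlabeled samples carry a label of -1:

```python
# Sketch: load the 2018 vectorized training split; the path is a placeholder.
import ember

X_train, y_train = ember.read_vectorized_features("/data/ember2018/", subset="train")

# The training split includes unlabeled samples marked with y == -1;
# filter them out before any supervised training or evaluation.
labeled = y_train != -1
X_train, y_train = X_train[labeled], y_train[labeled]
```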
The datasets are frozen snapshots from 2017 and 2018, so they may not reflect current malware trends, limiting their relevance for detecting modern threats without supplemental data.
The LIEF library has compatibility issues, especially on Apple M1 Macs, where the README notes that Docker is required for installation, adding overhead for some users.
The README warns that selection criteria differ between the 2017 and 2018 datasets (the 2018 samples were chosen to be harder to classify), potentially biasing multi-year studies.
Feature values depend on the pinned LIEF version; extracting with a different version can produce unpredictable results, restricting flexibility in library updates.
A collection of awesome penetration testing resources, tools and other shiny things
A curated list of awesome hacking tutorials, tools and resources
A collection of awesome software, libraries, documents, books, resources and cool stuff about security.
A curated list of CTF frameworks, libraries, resources and software
Open-Awesome is built by the community, for the community. Submit a project, suggest an awesome list, or help improve the catalog on GitHub.