An open dataset and toolkit for training static PE malware machine learning models, featuring millions of labeled Windows executable samples.
EMBER is an open-source dataset and toolkit for training static machine learning models to detect malware in Windows Portable Executable (PE) files. It provides labeled features from millions of PE samples, along with scripts for feature extraction, model training, and classification. The project addresses inconsistent benchmarking in malware detection research by offering a standardized, reproducible framework.
EMBER is aimed at cybersecurity researchers, data scientists, and machine learning engineers working on malware detection and static analysis of Windows executables. It is particularly valuable for academic institutions and security teams developing or evaluating ML-based threat detection systems.
Researchers choose EMBER because it provides a large, curated, and versioned dataset with reproducible tooling, eliminating the need to collect and label PE files manually. Its open nature and benchmark models enable direct comparison of new techniques against established baselines, accelerating research in the field.
Elastic Malware Benchmark for Empowering Researchers
Includes features extracted from over 2 million PE files across the 2017 and 2018 releases, together with labels, providing a substantial foundation for model training without manual data collection.
Offers scripts like train_ember.py and classify_binaries.py to train LightGBM models and classify new binaries, ensuring consistent experimental results across studies.
Documents feature versions (1 and 2) tied to LIEF library releases, allowing researchers to track changes and maintain reproducibility in feature extraction.
Converts raw JSON features into vectorized features via functions like create_vectorized_features; the results can then be converted to other formats (e.g., CSV, dataframes), simplifying pipeline integration.
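To make the vectorization step concrete, here is a minimal sketch of turning one raw JSON record into a fixed-length numeric vector. It is an illustration only and not EMBER's own implementation: the field names (`label`, `histogram`, `imports`), the `vectorize` and `hash_tokens` helpers, and the 8-slot hashing width are all assumptions chosen for the example.

```python
import hashlib
import json


def hash_tokens(tokens, dim):
    """Feature-hash a list of strings into a fixed-length count vector."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest to pick a slot.
        idx = int.from_bytes(digest[:4], "little") % dim
        vec[idx] += 1.0
    return vec


def vectorize(raw_line, import_dim=8):
    """Turn one JSON record into (label, flat feature vector).

    Assumed record layout (hypothetical, for illustration):
      label     -> int class label
      histogram -> 256 byte-value counts
      imports   -> {library name: [imported function names]}
    """
    record = json.loads(raw_line)
    histogram = record["histogram"]
    total = sum(histogram) or 1  # avoid dividing by zero on empty files
    normalized = [count / total for count in histogram]
    imports = [
        f"{lib.lower()}:{fn}"
        for lib, fns in sorted(record["imports"].items())
        for fn in fns
    ]
    return record["label"], normalized + hash_tokens(imports, import_dim)


# A made-up record, not taken from the dataset.
sample = json.dumps({
    "label": 1,
    "histogram": [1] * 256,
    "imports": {"KERNEL32.dll": ["CreateFileA", "WriteFile"]},
})
label, features = vectorize(sample)
print(label, len(features))
```

Hashing variable-length fields such as import lists into a fixed number of slots keeps every sample the same width, which is what gradient-boosted tree libraries expect as input.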
The README notes that the 2017 and 2018 datasets were built with different sample selection criteria, which can skew longitudinal studies and calls for careful handling.
Feature extraction depends on specific LIEF versions; models trained with one version may yield unpredictable results with another, breaking reproducibility.
Datasets are from 2017-2018, making them less relevant for detecting contemporary malware trends without supplementation with newer data.
LIEF 0.9.0 fails to install on Mac M1 chips, forcing users to rely on Docker workarounds, adding setup complexity and potential inconsistencies.
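Because extracted features depend on the LIEF release, one defensive option is to check the installed version before extracting features or scoring binaries. The sketch below is hypothetical: the `EXPECTED_LIEF` mapping and the `check_lief` helper are not part of EMBER, and the version numbers in the mapping are assumptions that should be confirmed against the EMBER README.

```python
from importlib import metadata

# Assumed mapping from EMBER feature version to the LIEF release it was
# computed with; confirm both entries against the EMBER README before use.
EXPECTED_LIEF = {1: "0.8.3", 2: "0.9.0"}


def installed_lief_version():
    """Return the installed LIEF version string, or None if LIEF is missing."""
    try:
        return metadata.version("lief")
    except metadata.PackageNotFoundError:
        return None


def check_lief(feature_version, installed):
    """Compare an installed LIEF version with the one a feature version expects.

    Returns (ok, message) so callers can decide whether to warn or abort.
    """
    expected = EXPECTED_LIEF[feature_version]
    if installed is None:
        return False, (
            f"LIEF not installed; feature version {feature_version} "
            f"expects {expected}"
        )
    if installed != expected:
        return False, (
            f"LIEF {installed} found, but feature version {feature_version} "
            f"expects {expected}; extracted features may not match the dataset"
        )
    return True, f"LIEF {installed} matches feature version {feature_version}"


ok, message = check_lief(2, installed_lief_version())
print(message)
```

Returning a message instead of raising leaves the policy to the caller: a training pipeline might abort on a mismatch, while an exploratory notebook might only log a warning.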
A repository of LIVE malwares for your own joy and pleasure. theZoo is a project created to make the possibility of malware analysis open and available to the public.
Malware samples, analysis exercises and other interesting resources.
EMBER2024 is an updated malware dataset designed for researchers to explore a variety of classification tasks, including malicious/benign detection, malware family classification, and behavior prediction. It provides raw features and multiple label types for 3.2 million files, enabling holistic evaluation of machine learning models in cybersecurity.

## Key Features

- **Multi-File Type Support**: Includes Win32, Win64, .NET, APK, ELF, and PDF files for cross-platform analysis.
- **Temporal Split**: Training and test sets are separated by time to simulate detection of newer malware.
- **Challenge Set**: Contains 6,315 evasive malicious files initially undetected by antivirus products.
- **Feature Version 3**: Re-implemented feature vector format using the stable pefile library, with additions like DOS header and Authenticode signature features.
- **Extended Labels**: Seven types of labels and tags support diverse classification tasks beyond simple detection.
- **Capa Integration**: Includes malware behavior analysis results (ATT&CK techniques, MBC behaviors) for Win32, Win64, .NET, and ELF files.

## Philosophy

EMBER2024 aims to provide a comprehensive, realistic benchmark that reflects the evolving malware landscape, enabling robust evaluation of classifier performance on novel and evasive threats.
Code for our DLS'21 paper - BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS.