Open-Awesome

Ember

License: NOASSERTION · Language: Jupyter Notebook

An open dataset and toolkit for training static PE malware machine learning models, featuring millions of labeled Windows executable samples.

GitHub: 1.2k stars · 313 forks · 0 contributors

What is Ember?

EMBER is an open-source dataset and toolkit for training static machine learning models to detect malware in Windows Portable Executable (PE) files. It provides labeled features from millions of PE samples, along with scripts for feature extraction, model training, and classification. The project solves the problem of inconsistent benchmarking in malware detection research by offering a standardized, reproducible framework.

Target Audience

Cybersecurity researchers, data scientists, and machine learning engineers focused on malware detection and static analysis of Windows executables. It's particularly valuable for academic institutions and security teams developing or evaluating ML-based threat detection systems.

Value Proposition

Researchers choose EMBER because it provides a large, curated, and versioned dataset with reproducible tooling, eliminating the need to collect and label PE files manually. Its open nature and benchmark models enable direct comparison of new techniques against established baselines, accelerating research in the field.

Overview

EMBER stands for Elastic Malware Benchmark for Empowering Researchers.

Use Cases

Best For

  • Benchmarking new malware detection algorithms against established models
  • Training static ML models for PE file classification
  • Studying the evolution of malware features over time (2017-2018)
  • Extracting structured features from Windows executables for research
  • Reproducing academic malware detection experiments
  • Developing feature engineering techniques for PE files
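For the benchmarking use case above, one common way to compare detectors is the detection rate at a fixed false-positive budget. The following is a minimal sketch, not EMBER's own evaluation code: the function name `tpr_at_fpr` and the 1% default budget are assumptions chosen for illustration.

```python
from typing import Sequence


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float = 0.01) -> float:
    """Best true-positive rate achievable while keeping FPR <= max_fpr.

    labels: 1 = malicious, 0 = benign; any other value is ignored.
    A higher score means "more likely malicious".
    The name and the 1% default are illustrative assumptions.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one malicious and one benign sample")

    # Lower the threshold step by step, starting from the highest score.
    # Samples with equal scores are admitted together, so a threshold
    # never splits a tie.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            elif ranked[j][1] == 0:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here, so stop.
        best_tpr = tp / n_pos
        i = j
    return best_tpr
```

Reporting this number for a new model and for a baseline on the same test split gives a like-for-like comparison at the same false-alarm cost.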

Not Ideal For

  • Real-time malware detection systems needing dynamic analysis or behavioral features
  • Research focused on non-Windows platforms (e.g., Android APKs, macOS executables)
  • Projects requiring up-to-date malware samples from the past 3-5 years
  • Teams unwilling to reconcile dataset inconsistencies between 2017 and 2018 releases

Pros & Cons

Pros

Large Labeled Datasets

Includes features extracted from over 2 million PE files across the 2017 and 2018 releases, providing a substantial foundation for model training without manual data collection.

Reproducible Benchmarking

Offers scripts like train_ember.py and classify_binaries.py to train LightGBM models and classify new binaries, ensuring consistent experimental results across studies.

Feature Versioning Support

Documents feature versions (1 and 2) tied to LIEF library releases, allowing researchers to track changes and maintain reproducibility in feature extraction.

Easy ML Integration

Converts raw JSON features to vectorized formats (e.g., CSV, dataframes) via functions like create_vectorized_features, simplifying pipeline integration.
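The idea of turning raw JSON features into fixed-length vectors can be sketched as follows. This is an illustration only and does not reproduce what `create_vectorized_features` does internally: the record fields used here (`histogram`, `general.size`, `general.imports`, `label`) and both function names are hypothetical.

```python
import json
from typing import Dict, Iterable, List, Tuple

HIST_BINS = 256  # one bin per byte value (assumed layout)


def vectorize_record(record: Dict) -> List[float]:
    """Flatten one raw-feature record into a fixed-length numeric vector.

    The field names are hypothetical and chosen for illustration.
    """
    histogram = record.get("histogram", [0] * HIST_BINS)
    if len(histogram) != HIST_BINS:
        raise ValueError(f"expected {HIST_BINS} histogram bins")
    total = sum(histogram) or 1
    # Normalize counts so files of different sizes are comparable.
    vector = [count / total for count in histogram]
    general = record.get("general", {})
    vector.append(float(general.get("size", 0)))
    vector.append(float(general.get("imports", 0)))
    return vector


def vectorize_jsonl(lines: Iterable[str]) -> Tuple[List[List[float]], List[int]]:
    """Vectorize JSON-lines input; returns (feature rows, labels)."""
    rows, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        rows.append(vectorize_record(record))
        # Treat a missing label as -1 (unlabeled) in this sketch.
        labels.append(int(record.get("label", -1)))
    return rows, labels
```

Every row has the same length (258 values here), which is what lets the output feed directly into array-based libraries.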

Cons

Dataset Selection Inconsistencies

The README acknowledges that the 2017 and 2018 datasets used different sample selection criteria, which can skew longitudinal studies and requires careful handling when the two releases are compared.

LIEF Version Lock-in

Feature extraction depends on specific LIEF versions; models trained with one version may yield unpredictable results with another, breaking reproducibility.
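One way to guard against this is to check the installed library version before extracting features. The sketch below uses only the Python standard library; the helper names `installed_version` and `check_pin` are hypothetical, and the pinned version is passed in by the caller rather than taken from the project.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(installed: Optional[str], expected: str) -> str:
    """Compare an installed version string with the pinned one.

    Returns "ok", "missing", or "mismatch". Exact string equality is a
    deliberately strict assumption for this sketch.
    """
    if installed is None:
        return "missing"
    return "ok" if installed.strip() == expected.strip() else "mismatch"


if __name__ == "__main__":
    # "0.9.0" is the LIEF release named in this listing; substitute the
    # version that matches the feature version you are working with.
    status = check_pin(installed_version("lief"), "0.9.0")
    print(f"lief pin status: {status}")
```

Failing fast on "missing" or "mismatch" is cheaper than discovering later that extracted features do not match the ones a model was trained on.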

Outdated Malware Samples

Datasets are from 2017-2018, making them less relevant for detecting contemporary malware trends without supplementation with newer data.

Platform-Specific Limitations

LIEF 0.9.0 fails to install on Mac M1 chips, forcing users to rely on Docker workarounds, adding setup complexity and potential inconsistencies.

Quick Stats

Stars: 1,163
Forks: 313
Contributors: 0
Open Issues: 35
Last commit: 1 year ago
Created: 2018

Tags

#malware-detection #reproducible-research #lightgbm #pe-files #feature-engineering #cybersecurity #dataset #machine-learning #windows-executables #static-analysis

Built With

  • LIEF
  • LightGBM
  • scikit-learn
  • pandas
  • Python
  • NumPy

Included in

Executable Packing (1.6k)

Related Projects

theZoo

A repository of LIVE malwares for your own joy and pleasure. theZoo is a project created to make the possibility of malware analysis open and available to the public.

Stars: 13,129
Forks: 2,757
Last commit: 2 months ago

Malware Archive

Malware samples, analysis exercises and other interesting resources.

Stars: 1,642
Forks: 240
Last commit: 2 years ago

Ember2024

EMBER2024 is an updated malware dataset designed for researchers to explore a variety of classification tasks, including malicious/benign detection, malware family classification, and behavior prediction. It provides raw features and multiple label types for 3.2 million files, enabling holistic evaluation of machine learning models in cybersecurity.

Key Features

  • Multi-File Type Support: Includes Win32, Win64, .NET, APK, ELF, and PDF files for cross-platform analysis.
  • Temporal Split: Training and test sets are separated by time to simulate detection of newer malware.
  • Challenge Set: Contains 6,315 evasive malicious files initially undetected by antivirus products.
  • Feature Version 3: Re-implemented feature vector format using the stable pefile library, with additions like DOS header and Authenticode signature features.
  • Extended Labels: Seven types of labels and tags support diverse classification tasks beyond simple detection.
  • Capa Integration: Includes malware behavior analysis results (ATT&CK techniques, MBC behaviors) for Win32, Win64, .NET, and ELF files.

Philosophy

EMBER2024 aims to provide a comprehensive, realistic benchmark that reflects the evolving malware landscape, enabling robust evaluation of classifier performance on novel and evasive threats.

Stars: 123
Forks: 26
Last commit: 10 months ago

BODMAS

Code for our DLS'21 paper - BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware. BODMAS is short for Blue Hexagon Open Dataset for Malware AnalysiS.

Stars: 94
Forks: 18
Last commit: 2 years ago