Open-Awesome
© 2026 Open-Awesome. Curated for the developer elite.


Ember

License: NOASSERTION · Language: Jupyter Notebook

An open dataset and toolkit for training static PE malware machine learning models, featuring millions of labeled Windows executable samples.

GitHub
1.1k stars · 311 forks · 0 contributors

What is Ember?

EMBER is an open-source dataset and toolkit for training static machine learning models to detect malware in Windows Portable Executable (PE) files. It provides features extracted from millions of PE samples, together with their labels, along with scripts for feature extraction, model training, and classification. The project addresses inconsistent benchmarking in malware detection research by offering a standardized, reproducible framework.

Target Audience

Cybersecurity researchers, data scientists, and machine learning engineers focused on malware detection and static analysis of Windows executables. It's particularly valuable for academic institutions and security teams developing or evaluating ML-based threat detection systems.

Value Proposition

Researchers choose EMBER because it provides a large, curated, and versioned dataset with reproducible tooling, eliminating the need to collect and label PE files manually. Its open nature and benchmark models enable direct comparison of new techniques against established baselines, accelerating research in the field.

Overview

Elastic Malware Benchmark for Empowering Researchers

Use Cases

Best For

  • Benchmarking new malware detection algorithms against established models
  • Training static ML models for PE file classification
  • Studying the evolution of malware features over time (2017-2018)
  • Extracting structured features from Windows executables for research
  • Reproducing academic malware detection experiments
  • Developing feature engineering techniques for PE files
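For the longitudinal use case in the list above, a common first step is to tally samples per month and label. The following is a minimal sketch, assuming the raw features are stored as JSON lines in which each record carries an `appeared` month string and an integer `label`. The field names, the label convention (1 malicious, 0 benign, -1 unlabeled), and the `count_by_month` helper are assumptions made for illustration and are not confirmed by this page.

```python
import json
from collections import Counter, defaultdict

# Assumed label convention (not confirmed by this page).
LABEL_NAMES = {1: "malicious", 0: "benign", -1: "unlabeled"}

def count_by_month(lines):
    """Tally records per (month, label name) from an iterable of JSON lines."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        month = record.get("appeared", "unknown")
        label = LABEL_NAMES.get(record.get("label"), "other")
        counts[month][label] += 1
    # Return plain dicts, with months in sorted order.
    return {month: dict(c) for month, c in sorted(counts.items())}

# Tiny in-memory sample standing in for a real features file.
sample_lines = [
    '{"sha256": "aa", "appeared": "2017-01", "label": 1}',
    '{"sha256": "bb", "appeared": "2017-01", "label": 0}',
    '{"sha256": "cc", "appeared": "2017-02", "label": -1}',
    '{"sha256": "dd", "appeared": "2017-01", "label": 1}',
]
monthly = count_by_month(sample_lines)
print(monthly)
```

With a real file, the same function could be fed an open file handle instead of the in-memory list, since it only iterates over lines.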

Not Ideal For

  • Real-time malware detection systems needing dynamic analysis or behavioral features
  • Research focused on non-Windows platforms (e.g., Android APKs, macOS executables)
  • Projects requiring up-to-date malware samples from the past 3-5 years
  • Teams unwilling to reconcile dataset inconsistencies between 2017 and 2018 releases

Pros & Cons

Pros

Large Labeled Datasets

Includes features extracted from over 2 million PE files collected in 2017 and 2018, together with their labels, providing a substantial foundation for model training without manual data collection.

Reproducible Benchmarking

Offers scripts such as train_ember.py and classify_binaries.py to train LightGBM models and classify new binaries, so that different studies can run the same pipeline and compare results directly.

Feature Versioning Support

Documents feature versions (1 and 2) tied to LIEF library releases, allowing researchers to track changes and maintain reproducibility in feature extraction.

Easy ML Integration

Converts raw JSON features into vectorized form on disk via functions like create_vectorized_features, and exposes sample metadata in tabular form (e.g., CSV, dataframes), simplifying pipeline integration.
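To make the idea of vectorizing raw JSON features concrete, here is a minimal standard-library sketch. It is not EMBER's implementation of create_vectorized_features. The record layout (`histogram`, `imports`, `label`), the helper names, and the bucket count are all assumptions chosen for illustration, and the toy histogram has 4 bins rather than one per byte value.

```python
import hashlib
import json

def hash_bucket(token, n_buckets):
    """Map a string to a stable bucket index (deterministic across runs)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def vectorize_record(record, n_buckets=8):
    """Turn one raw-feature dict into a flat list of floats."""
    hist = record.get("histogram", [])
    total = sum(hist)
    # Normalize byte counts so files of different sizes are comparable.
    hist_part = [c / total for c in hist] if total else [0.0] * len(hist)
    # Hashing trick: count imported library names into a fixed number of buckets.
    import_part = [0.0] * n_buckets
    for library in record.get("imports", {}):
        import_part[hash_bucket(library.lower(), n_buckets)] += 1.0
    return hist_part + import_part

# Hypothetical raw record; a real one would be one line of a features file.
raw_line = (
    '{"label": 1, "histogram": [2, 1, 1, 0], '
    '"imports": {"KERNEL32.dll": ["CreateFileA"], "USER32.dll": ["MessageBoxA"]}}'
)
record = json.loads(raw_line)
vector = vectorize_record(record)
label = record["label"]
print(len(vector), vector[:4], label)
```

The fixed output length is the point of the design: every file maps to a vector of the same size regardless of how many libraries it imports, which is what tree-based learners such as LightGBM expect.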

Cons

Dataset Selection Inconsistencies

The README notes that samples for the 2017 and 2018 datasets were selected using different criteria, which can skew longitudinal studies and requires careful handling.

LIEF Version Lock-in

Feature extraction depends on specific LIEF versions; models trained with one version may yield unpredictable results with another, breaking reproducibility.
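One way to guard against this kind of drift is to check the installed library version before extracting features. The sketch below uses only `importlib.metadata` from the standard library. The `check_pinned_version` helper is hypothetical, and the pinned value "0.9.0" is a placeholder borrowed from the LIEF version mentioned on this page; which LIEF release corresponds to which feature version should be taken from the project's own documentation.

```python
from importlib import metadata

def check_pinned_version(package, expected, installed=None):
    """Return (ok, message) comparing an installed package version to a pin.

    `installed` can be passed explicitly (useful for tests); otherwise it is
    looked up from the current environment.
    """
    if installed is None:
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            return False, f"{package} is not installed (expected {expected})"
    if installed != expected:
        return False, (
            f"{package} {installed} found, but features were built with {expected}"
        )
    return True, f"{package} {installed} matches the pinned version"

# Explicit versions are passed here so the example runs without LIEF installed.
ok, message = check_pinned_version("lief", "0.9.0", installed="0.9.0")
print(ok, message)
mismatch_ok, mismatch_message = check_pinned_version(
    "lief", "0.9.0", installed="0.12.3"
)
print(mismatch_ok, mismatch_message)
```

In a real pipeline the `installed` argument would be omitted, and a failed check would typically stop feature extraction rather than just print a message.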

Outdated Malware Samples

Datasets are from 2017-2018, making them less relevant for detecting contemporary malware trends without supplementation with newer data.

Platform-Specific Limitations

LIEF 0.9.0 fails to install on M1 Macs, forcing users to rely on Docker workarounds, which adds setup complexity and potential inconsistencies.

Quick Stats

Stars: 1,149
Forks: 311
Contributors: 0
Open Issues: 35
Last commit: 1 year ago
Created: 2018

Tags

#malware-detection #reproducible-research #lightgbm #pe-files #feature-engineering #cybersecurity #dataset #machine-learning #windows-executables #static-analysis

Built With

  • LightGBM
  • scikit-learn
  • pandas
  • Python
  • NumPy

Included in

Executable Packing (1.6k)

Related Projects

theZoo

A repository of LIVE malwares for your own joy and pleasure. theZoo is a project created to make the possibility of malware analysis open and available to the public.

Stars: 12,959
Forks: 2,722
Last commit: 27 days ago

Malware Archive

Malware samples, analysis exercises and other interesting resources.

Stars: 1,629
Forks: 238
Last commit: 2 years ago