Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

Ember

License: NOASSERTION · Language: Jupyter Notebook

An open dataset and toolkit for training static PE malware machine learning models, featuring extracted features from millions of Windows executable files.

GitHub · 1.1k stars · 311 forks · 0 contributors

What is Ember?

EMBER is an open-source dataset and toolkit for training static malware detection models on Windows Portable Executable (PE) files. It provides extracted features from millions of PE files, along with scripts to train benchmark machine learning models and classify new samples. The project addresses the need for standardized, reproducible benchmarks in malware research.

Target Audience

Security researchers, data scientists, and malware analysts working on machine learning-based detection systems. It's particularly valuable for academics and industry professionals who need reproducible baselines for comparing malware classification approaches.

Value Proposition

EMBER provides a curated, version-controlled dataset with consistent feature extraction, enabling direct comparison of different machine learning techniques. Unlike proprietary datasets, it's fully open and includes tools to reproduce benchmark results, accelerating research in malware detection.

Overview

Elastic Malware Benchmark for Empowering Researchers

Use Cases

Best For

  • Training benchmark malware detection models for academic research
  • Comparing machine learning algorithms on static PE file analysis
  • Conducting longitudinal studies of malware feature evolution
  • Reproducible experiments in cybersecurity machine learning
  • Extracting structured features from PE files for custom models
  • Educational projects on malware classification and static analysis

Not Ideal For

  • Real-time malware detection in production environments
  • Analysis of non-Windows or non-PE file formats
  • Projects requiring up-to-date malware samples without additional data collection
  • Teams without infrastructure for Docker or handling large datasets

Pros & Cons

Pros

Comprehensive Feature Extraction

Uses the LIEF library to extract detailed raw and vectorized features from PE files, providing a rich, structured dataset for machine learning models as described in the README.
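
To make "vectorized features" concrete, here is a minimal sketch that turns raw file bytes into a fixed-length numeric vector using a normalized byte histogram. It is an illustration only: it uses neither LIEF nor EMBER's own extractor, and the function name byte_histogram and the sample bytes are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return a 256-element vector of byte-value frequencies.

    The vector sums to 1.0 for non-empty input, so files of different
    sizes map to vectors of the same length and scale.
    """
    if not data:
        return [0.0] * 256
    counts = Counter(data)  # iterating over bytes yields ints in 0..255
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


# Hypothetical stand-in for file contents: two header bytes plus padding.
# This is NOT a valid PE file; it only exercises the function.
sample = b"MZ" + bytes(62)
vector = byte_histogram(sample)
print(len(vector), vector[0], vector[ord("M")])
```

A model trained on such vectors never sees the file itself, only the numbers, which is why consistent extraction matters for comparing results.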

Reproducible Benchmark Models

Includes scripts like train_ember.py to train LightGBM models and classify_binaries.py for predictions, ensuring consistent and verifiable results in research.

Versioned Dataset Consistency

Pins feature calculation to specific LIEF versions (e.g., 0.9.0 for feature version 2), which reduces variability and lets experiments be replicated exactly.
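
One way to guard against a mismatched extractor is to check the installed LIEF version before computing features. The sketch below is an assumption about how such a check could look, not part of EMBER: the names EXPECTED_LIEF, installed_lief_version, and lief_matches are hypothetical, and only the 0.9.0 pairing for feature version 2 comes from the text above.

```python
from importlib import metadata
from typing import Optional

# From the text above: feature version 2 was computed with LIEF 0.9.0.
# Other feature versions are deliberately omitted rather than guessed.
EXPECTED_LIEF = {2: "0.9.0"}


def installed_lief_version() -> Optional[str]:
    """Return the installed version of the 'lief' package, or None if absent."""
    try:
        return metadata.version("lief")
    except metadata.PackageNotFoundError:
        return None


def lief_matches(feature_version: int, installed: Optional[str]) -> bool:
    """True only if `installed` is the release expected for `feature_version`."""
    expected = EXPECTED_LIEF.get(feature_version)
    if expected is None or installed is None:
        return False
    # Accept the exact release or the same release with a build suffix.
    return installed == expected or installed.startswith(expected + "-")


if __name__ == "__main__":
    found = installed_lief_version()
    status = "ok" if lief_matches(2, found) else "mismatch"
    print(f"LIEF installed: {found!r}, feature version 2 check: {status}")
```

Failing fast on a mismatch is cheaper than discovering later that a model was trained on features that do not line up with the published dataset.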

Large-Scale Longitudinal Data

Combines over 2 million samples from 2017 and 2018, allowing studies on malware evolution, though with noted inconsistencies in sample selection.

Cons

Outdated Malware Samples

The datasets are frozen snapshots from 2017 and 2018 and may not reflect current malware trends, which limits their relevance for detecting modern threats without supplemental data.

Platform-Specific Setup Complexity

The LIEF library has compatibility issues, especially on Mac M1, where installation requires Docker, as noted in the README. This adds setup overhead for some users.

Inconsistent Dataset Criteria

The README warns that selection criteria differ between 2017 and 2018 datasets (e.g., 2018 samples are harder to classify), potentially biasing multi-year studies.

Model Version Lock-in

Features depend on specific LIEF versions; using different versions can lead to unpredictable results, restricting flexibility in library updates.

Quick Stats

Stars: 1,149
Forks: 311
Contributors: 0
Open Issues: 35
Last commit: 1 year ago
Created: 2018

Tags

#malware-detection #reproducible-research #lightgbm #pe-files #security-research #cybersecurity #dataset #machine-learning #static-analysis

Built With

  • LightGBM
  • scikit-learn
  • pandas
  • Python
  • NumPy

Included in

Malware Analysis · 13.6k

Related Projects

Awesome Pentest

A collection of awesome penetration testing resources, tools and other shiny things

Stars: 25,972
Forks: 4,794
Last commit: 3 months ago

Awesome Hacking

A curated list of awesome Hacking tutorials, tools and resources

Stars: 16,210
Forks: 1,685
Last commit: 1 year ago

Security

A collection of awesome software, libraries, documents, books, resources and cool stuff about security.

Stars: 14,259
Forks: 2,215
Last commit: 3 months ago

Awesome CTF

A curated list of CTF frameworks, libraries, resources and software

Stars: 11,485
Forks: 1,614
Last commit: 1 year ago