Open-Awesome

© 2026 Open-Awesome. Curated for the developer elite.

astminer

MIT · Kotlin · v0.9.0

A Kotlin library for extracting path-based code representations and ASTs from multiple languages to prepare code for machine learning models.

GitHub: 300 stars · 78 forks · 0 contributors

What is astminer?

astminer is a library for mining path-based representations of code and raw abstract syntax trees (ASTs) from source code. It parses code in several languages, including Java, Python, and C++, and converts it into formats suitable for training machine learning models, which addresses the need for structured code data in software engineering research.

Target Audience

Researchers and developers working on machine learning for code, software engineering tools, or code analysis pipelines who need to preprocess source code into machine-readable representations.

Value Proposition

Developers choose astminer for its multi-language support, its extensible design, and its focus on path-based code representations, which models such as code2vec take as input. It can be used either as a Kotlin library or as a CLI tool.

Overview

A library for mining of path-based representations of code (and more)

Use Cases

Best For

  • Extracting path-based code representations for training models like code2vec
  • Preprocessing source code for machine learning in software engineering research
  • Building custom code mining pipelines for multiple programming languages
  • Analyzing code style and patterns across large codebases
  • Converting raw source code into structured AST formats for analysis
  • Creating labeled datasets for code classification or generation tasks

Not Ideal For

  • Projects requiring real-time or live code analysis, as astminer is designed for batch processing in ML pipelines
  • Teams seeking an actively maintained tool with ongoing support and updates
  • Developers needing simple, out-of-the-box code parsing without Kotlin/Java integration overhead

Pros & Cons

Pros

Multi-language Parser Support

Supports Java, Python, C/C++, JavaScript, and PHP using parsers such as ANTLR and Tree-sitter, enabling cross-language code analysis for diverse research projects.

Path-based Code Mining

Extracts path-based representations as per the code2vec paper, which are crucial for training modern code understanding and similarity detection models.
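
To make the idea of a path-based representation concrete, here is a minimal sketch in Python. It is not astminer's API or output format: the `Node` class, the `path_contexts` function, and the `^` / `_` separators for upward and downward steps are all assumptions made for illustration. The sketch pairs up the leaves of a toy AST and, for each pair, records the two leaf tokens together with the sequence of node types that connects them through their lowest common ancestor.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional

UP, DOWN = "^", "_"  # illustrative separators, chosen for this sketch


@dataclass
class Node:
    """Toy AST node (hypothetical): leaves carry a token, inner nodes carry children."""
    type: str
    token: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaves_with_ancestors(node, ancestors=()):
    """Yield (leaf, chain of nodes from the root down to that leaf)."""
    chain = ancestors + (node,)
    if not node.children:
        yield node, chain
    for child in node.children:
        yield from leaves_with_ancestors(child, chain)


def path_contexts(root):
    """Return (start_token, path, end_token) for every pair of leaves."""
    contexts = []
    for (a, chain_a), (b, chain_b) in combinations(leaves_with_ancestors(root), 2):
        # Count the shared prefix of the two root-to-leaf chains;
        # the last shared node is the lowest common ancestor.
        shared = 0
        while (shared < min(len(chain_a), len(chain_b))
               and chain_a[shared] is chain_b[shared]):
            shared += 1
        up = [n.type for n in reversed(chain_a[shared:])]  # leaf a up to below the ancestor
        top = chain_a[shared - 1].type                     # the common ancestor itself
        down = [n.type for n in chain_b[shared:]]          # below the ancestor down to leaf b
        path = UP.join(up + [top]) + DOWN + DOWN.join(down)
        contexts.append((a.token, path, b.token))
    return contexts
```

For a hand-built tree of `x = y + 1`, that is `Node("Assign", children=[Node("Name", "x"), Node("BinOp", children=[Node("Name", "y"), Node("Num", "1")])])`, the function returns three triples, the first being `("x", "Name^Assign_BinOp_Name", "y")`. Real extractors typically also cap path length and width and hash or index the paths, which this sketch leaves out.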

Extensible Architecture

Designed for easy extension to new languages and custom processing modules, allowing researchers to adapt the pipeline for specific mining tasks.

Flexible Data Storage

Offers configurable storage formats for processed data, facilitating integration with various machine learning frameworks and output requirements.

Cons

Unmaintained Project

The README explicitly states it is no longer maintained, meaning no updates, bug fixes, or community support, which limits long-term viability.

Uneven Language Coverage

Language support is inconsistent; for example, C/C++ parsing relies only on the Fuzzy parser, which may have limited features compared to more robust parsers for other languages.

Complex Integration Setup

Requires building from source or using Docker, and integration into Kotlin/Java pipelines can be cumbersome for developers not familiar with JVM ecosystems.

Quick Stats

Stars: 300
Forks: 78
Contributors: 0
Open issues: 1
Last commit: 6 months ago
Created: 2018

Tags

#multi-language #research-tool #kotlin-library #data-pipeline #antlr #code-analysis #mining #machine-learning #software-engineering

Built With

  • Kotlin
  • Tree-sitter
  • Docker
  • ANTLR
  • Gradle

Included in

Empirical Software Engineering (475)
Auto-fetched 1 day ago

Related Projects

PyDriller

Python Framework to analyse Git repositories

Stars: 955
Forks: 155
Last commit: 4 months ago

RefactoringMiner

RefactoringMiner is a Java library and API designed to automatically identify refactoring operations within code changes across multiple programming languages. It analyzes commits, pull requests, and commit ranges to detect over 100 refactoring types, from simple renames to complex structural changes. The tool also generates detailed Abstract Syntax Tree (AST) diffs, providing a deeper understanding of code evolution beyond traditional line-based diffs.

Key features:

  • Refactoring Detection: Identifies 40+ classic refactorings from Fowler's catalog, 52 API-level changes, 8 migration patterns, and 5 test-specific refactorings.
  • Multi-Language Support: Works with Java, Python, and Kotlin codebases, with TypeScript support planned.
  • AST Diff Generation: Produces syntax-aware diffs at commit, pull request, and commit range levels.
  • Visualization Tools: Includes a Chrome extension for refactoring-aware commit reviews and interactive diff visualization in browsers.
  • Advanced Diff Features: Supports refactoring-aware tooltips, single-page views, embedded GitHub comments, and handling of code moved between files.

Philosophy: RefactoringMiner aims to make code evolution transparent and understandable by precisely tracking structural changes, helping developers and researchers analyze refactoring practices and improve code review processes.

Stars: 487
Forks: 158
Last commit: 2 days ago

Perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.

Stars: 319
Forks: 185
Last commit: 2 days ago

DesigniteJava

Detects smells and computes metrics of Java code

Stars: 192
Forks: 68
Last commit: 1 year ago