A Kotlin library for extracting path-based code representations and ASTs from multiple languages to prepare code for machine learning models.
astminer is a library for mining path-based representations of code and raw abstract syntax trees (ASTs) from source code. It processes code in multiple languages, such as Java, Python, and C++, and converts it into formats suitable for training machine learning models, which addresses the need for structured code data in software engineering research.
It is aimed at researchers and developers who work on machine learning for code, software engineering tools, or code analysis pipelines and need to preprocess source code into machine-readable representations.
Developers choose astminer for its multi-language support, its extensible design, and its focus on path-based code representations, which models such as code2vec take as input. It can be used either as a Kotlin library or as a CLI tool.
A library for mining of path-based representations of code (and more)
Supports Java, Python, C/C++, JavaScript, and PHP using parsers like ANTLR and TreeSitter, enabling cross-language code analysis for diverse research projects.
Extracts path-based representations as per the code2vec paper, which are crucial for training modern code understanding and similarity detection models.
Designed for easy extension to new languages and custom processing modules, allowing researchers to adapt the pipeline for specific mining tasks.
Offers configurable storage formats for processed data, facilitating integration with various machine learning frameworks and output requirements.
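To make the path-based representation mentioned above concrete, here is a minimal Python sketch of the idea described in the code2vec paper: every pair of AST leaves yields a triple of (start token, path through the tree, end token). This is not astminer's API. The `Node` class, the `path_contexts` function, and the toy tree for `x = y + 1` are all hypothetical names introduced for illustration, and the arrow notation for the path string is just one possible encoding.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Toy AST node (hypothetical): a syntactic type, a token for leaves, and children."""
    type: str
    token: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_chains(root: Node) -> List[List[Node]]:
    """For every leaf, return the chain of nodes from the root down to that leaf."""
    chains: List[List[Node]] = []

    def walk(node: Node, chain: List[Node]) -> None:
        chain = chain + [node]
        if not node.children:
            chains.append(chain)
        for child in node.children:
            walk(child, chain)

    walk(root, [])
    return chains


def path_contexts(root: Node) -> List[Tuple[str, str, str]]:
    """Return (start token, path, end token) for every pair of leaves, left to right."""
    contexts = []
    for a, b in combinations(leaf_chains(root), 2):
        # Walk down both chains while they share nodes; the last shared node
        # is the lowest common ancestor of the two leaves.
        i = 0
        while i < min(len(a), len(b)) and a[i] is b[i]:
            i += 1
        lca = i - 1
        up = [n.type for n in reversed(a[lca + 1:])]  # leaf a, climbing toward the ancestor
        down = [n.type for n in b[lca + 1:]]          # descending from the ancestor to leaf b
        path = "↑".join(up + [a[lca].type]) + "".join("↓" + t for t in down)
        contexts.append((a[-1].token, path, b[-1].token))
    return contexts


# Toy AST for the statement `x = y + 1`.
tree = Node("Assign", children=[
    Node("Name", "x"),
    Node("BinOp", children=[Node("Name", "y"), Node("Num", "1")]),
])

for context in path_contexts(tree):
    print(context)
```

The sketch enumerates all leaf pairs, so the number of contexts grows quadratically with the number of leaves. The code2vec paper limits paths by maximum length and width to keep this manageable; that filtering is omitted here for brevity.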
The README explicitly states that the project is no longer maintained, so there are no updates, bug fixes, or community support, which limits its long-term viability.
Language support is uneven; for example, C/C++ parsing relies only on the Fuzzy parser, which may offer fewer features than the parsers available for other languages.
It requires building from source or using Docker, and integrating it into Kotlin/Java pipelines can be cumbersome for developers unfamiliar with the JVM ecosystem.
Python Framework to analyse Git repositories
RefactoringMiner is a Java library and API designed to automatically identify refactoring operations within code changes across multiple programming languages. It analyzes commits, pull requests, and commit ranges to detect over 100 refactoring types, from simple renames to complex structural changes. The tool also generates detailed Abstract Syntax Tree (AST) diffs, providing a deeper understanding of code evolution beyond traditional line-based diffs.

## Key Features

- **Refactoring Detection**: Identifies 40+ classic refactorings from Fowler's catalog, 52 API-level changes, 8 migration patterns, and 5 test-specific refactorings.
- **Multi-Language Support**: Works with Java, Python, and Kotlin codebases, with TypeScript support planned.
- **AST Diff Generation**: Produces syntax-aware diffs at commit, pull request, and commit range levels.
- **Visualization Tools**: Includes a Chrome extension for refactoring-aware commit reviews and interactive diff visualization in browsers.
- **Advanced Diff Features**: Supports refactoring-aware tooltips, single-page views, embedded GitHub comments, and handling of code moved between files.

## Philosophy

RefactoringMiner aims to make code evolution transparent and understandable by precisely tracking structural changes, helping developers and researchers analyze refactoring practices and improve code review processes.
Send Sir Perceval on a quest to retrieve and gather data from software repositories.
Detects smells and computes metrics of Java code