A universal code formatter that uses machine learning to learn formatting patterns from a corpus of existing code.
CodeBuff is a research project and tool that uses machine learning to automatically format source code. It learns formatting patterns—such as indentation, spacing, and line breaks—from a corpus of existing, well-formatted code, then applies those patterns to new, unformatted code. It aims to solve the problem of building and maintaining language-specific formatters by providing a universal, corpus-driven approach.
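To make the corpus-driven idea concrete, here is a minimal sketch in Python. It is not CodeBuff's actual algorithm: CodeBuff works on ANTLR parse trees, while this toy uses a hypothetical regex tokenizer and one simple feature, the kinds of two adjacent tokens. The names `train`, `format_code`, `kind`, and `tokens_with_gaps` are all invented for illustration. The sketch records which whitespace most often separates each pair of token kinds in the corpus, then replays those choices on new input.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy tokenizer: identifiers, integers, or any single non-space char.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def kind(tok):
    """Coarse token category used as the only learning feature (an assumption)."""
    if tok[0].isalpha() or tok[0] == "_":
        return "word"
    if tok[0].isdigit():
        return "num"
    return tok  # each punctuation character is its own category

def tokens_with_gaps(text):
    """Yield (whitespace_before_token, token) pairs for a piece of source text."""
    pos = 0
    for m in TOKEN.finditer(text):
        yield text[pos:m.start()], m.group()
        pos = m.end()

def train(corpus):
    """Learn, per pair of adjacent token kinds, the most common whitespace gap."""
    votes = defaultdict(Counter)
    for text in corpus:
        prev = None
        for gap, tok in tokens_with_gaps(text):
            if prev is not None:
                votes[(kind(prev), kind(tok))][gap] += 1
            prev = tok
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}

def format_code(model, text, default=" "):
    """Discard the input's own whitespace and re-insert the learned gaps."""
    out = []
    prev = None
    for _, tok in tokens_with_gaps(text):
        if prev is not None:
            out.append(model.get((kind(prev), kind(tok)), default))
        out.append(tok)
        prev = tok
    return "".join(out)

corpus = ["x = f(a, b);\ny = g(c, d);\n"]
model = train(corpus)
print(format_code(model, "z=h( p ,q ) ;"))  # prints: z = h(p, q);
```

The point of the sketch is that no spacing rule is written by hand: changing the corpus changes the output style.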
CodeBuff is aimed at researchers in programming languages and software engineering, and at developers interested in automated code formatting tools, particularly those working with multiple languages or large codebases where consistent formatting is hard to maintain.
CodeBuff offers a novel, data-driven alternative to traditional rule-based formatters, potentially reducing the manual effort needed to create and tune formatters for different languages. Its ability to learn from any code corpus allows it to adapt to project-specific or team-specific coding styles.
Language-agnostic pretty-printing through machine learning (uh, like, is this possible? YES, apparently).
Infers formatting rules from code corpora using machine learning, which, as the abstract describes, removes the need for manual rule configuration and makes it flexible enough to handle custom styles.
Works with any language that has an ANTLR grammar; the sample output demonstrates this with Java, SQL, and ANTLR grammar files.
Can learn and enforce a project's specific formatting style by training on its existing codebase, as shown in the corpus-driven approach and leave-one-out validation.
Includes validation techniques and performance benchmarks backed by an academic paper, with detailed speed tests and graph generation for empirical analysis.
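The leave-one-out validation mentioned above can be sketched as follows. This is an illustration under stated assumptions, not CodeBuff's code: the learner is a deliberately trivial stand-in (a majority vote on the whitespace after a comma), the error metric (one minus `difflib`'s similarity ratio) is chosen here for convenience, and the names `train`, `reformat`, and `leave_one_out` are hypothetical. Each file is held out in turn, the model is trained on the remaining files, and the held-out file is reformatted and compared with its original text.

```python
import difflib
import re
from collections import Counter

def train(files):
    """Toy learner (assumption): majority vote on the whitespace after a comma."""
    votes = Counter(
        m.group(1) for text in files for m in re.finditer(r",([ \t]*)", text)
    )
    return votes.most_common(1)[0][0] if votes else " "

def reformat(model, text):
    """Rewrite the whitespace after every comma to the learned choice."""
    return re.sub(r",[ \t]*", lambda _: "," + model, text)

def leave_one_out(corpus):
    """For each file, train on all other files and score the reformatted result.

    Returns {filename: error}, where 0.0 means the formatter reproduced the
    held-out file exactly and larger values mean it diverged from the original.
    """
    scores = {}
    for name, original in corpus.items():
        rest = [text for other, text in corpus.items() if other != name]
        model = train(rest)
        formatted = reformat(model, original)
        similarity = difflib.SequenceMatcher(None, original, formatted).ratio()
        scores[name] = 1.0 - similarity
    return scores

# Hypothetical four-file corpus; c.txt deviates from the majority style.
corpus = {
    "a.txt": "f(a, b)",
    "b.txt": "g(c, d)",
    "c.txt": "h(e,f)",
    "d.txt": "k(x, y)",
}
for name, error in leave_one_out(corpus).items():
    print(name, round(error, 3))
```

A file that already follows the corpus style scores zero, while an outlier such as `c.txt` scores above zero, so the scores indicate how consistent the corpus is as well as how well the learner captures its style.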
Labeled as an experimental formatter in the README, it lacks the stability and polish of production-ready tools, with potential inconsistencies in output.
Requires compiling ANTLR grammars, setting up CLASSPATH, and preparing training corpora, as detailed in the formatting instructions, making it cumbersome compared to plug-and-play formatters.
Speed tests show slow load times (e.g., 2.5 minutes for Java8 grammar) and formatting times that scale with corpus size, limiting suitability for large-scale or real-time use.
Lacks integrations, plugins, and community support compared to mature formatters, with only a command-line interface and no out-of-the-box IDE support.