A tool for evaluating the quality of web code generated by Large Language Models (LLMs) using configurable checks and automated repair.
Web Codegen Scorer is a specialized evaluation tool designed to assess the quality of web code produced by Large Language Models. It enables developers to make evidence-based decisions by providing consistent, repeatable measurements across different models, prompts, and frameworks, moving beyond trial-and-error approaches.
Developers and teams using LLMs to generate web application code, particularly those who need to systematically compare models, optimize prompts, or monitor code quality over time.
It focuses specifically on web code and uses well-established quality metrics like build success, runtime errors, accessibility, and security, rather than relying on generic benchmarks. It also offers automated repair attempts and a visual report viewer for comparison.
Allows setting up evaluations with different LLM models, web frameworks, and tooling, as detailed in the command-line flags and environment config reference (see the sketch after this list).
Assesses generated code for build success, runtime errors, accessibility, security, and coding best practices, providing comprehensive, empirical metrics beyond generic benchmarks.
Can automatically fix issues detected during code generation, with configurable repair attempts via the --max-build-repair-attempts flag, reducing manual intervention.
Focuses specifically on web code with established quality metrics, as emphasized in the project's philosophy, making it more relevant to web development work than broad coding benchmarks.
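To make the configuration model concrete, here is a minimal sketch of what an environment config and eval invocation might look like. Only the --max-build-repair-attempts flag comes from this listing; the config field names, file path, and command shape are assumptions for illustration, so consult the project's environment config reference for the actual schema.

```ts
// my-env.config.ts -- illustrative only; these field names are assumptions,
// not the tool's documented schema.
export default {
  displayName: "Angular + Gemini baseline",          // label shown in reports
  model: "gemini-2.5-pro",                           // LLM used for generation
  framework: "angular",                              // target web framework
  prompts: ["prompts/todo-app.md"],                  // prompts to evaluate
  checks: ["build", "runtime", "a11y", "security"],  // quality checks to run
};

// Hypothetical invocation (the command shape is an assumption; the repair
// flag is mentioned in this listing):
//   web-codegen-scorer eval --env=my-env.config.ts --max-build-repair-attempts=2
```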
Requires setting up multiple API keys as environment variables and configuring evaluations through files (see the sketch after this list), which can be a barrier for quick adoption or casual use.
The README notes that more checks are coming soon, and key features such as interaction testing are still on the roadmap, so there are gaps in what it can currently assess.
Relies on external LLM APIs for both code generation and rating, leading to potential costs and vendor lock-in, with no built-in cost controls or offline alternatives.
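To illustrate the setup burden mentioned above, the sketch below shows the kind of pre-flight check a team might script before running an eval. GEMINI_API_KEY appears in the project's documentation; the other key names are assumptions based on the providers the tool targets, so adjust them to whatever your environment actually requires.

```ts
// Hypothetical pre-flight check run before invoking the CLI (e.g. from an npm script).
// Key names other than GEMINI_API_KEY are assumptions, not the tool's documented list.
const requiredKeys = ["GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"];

// Collect any provider keys that are missing from the environment.
const missing = requiredKeys.filter((key) => !process.env[key]);

if (missing.length > 0) {
  console.error(`Missing API keys: ${missing.join(", ")}`);
  process.exit(1);
}

console.log("All provider keys present; safe to run web-codegen-scorer.");
```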