Software Engineering

Bug-Fix and Refactor

Evaluates models on reading, refactoring, and fixing real-world code snippets with tests. Weighted scoring for safety and style.

Alex Chen · Chainforge.ai · 9/26/2025 · 4 models evaluated

Sample Prompt

Fix the following Python function that has a bug: def calculate_average(nums): return sum(nums) / len(nums) if nums else 0. The function fails when nums contains non-numeric values.
Scoring criteria: domain-specific metrics per evaluation.
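
One possible answer to the sample prompt is sketched below. The prompt does not say whether non-numeric values should be skipped or rejected, so this sketch assumes they are skipped. It also assumes that booleans do not count as numbers and that an input with no numeric values returns 0, matching the original empty-list behavior. These choices are illustrative and are not the evaluation's reference solution.

```python
from numbers import Real


def calculate_average(nums):
    """Average the numeric entries of nums, ignoring everything else.

    Assumption: non-numeric values are skipped rather than raising an error.
    """
    # bool is a subclass of int in Python, so it is excluded explicitly.
    # This is an assumption about intent, not something the prompt states.
    numeric = [x for x in nums if isinstance(x, Real) and not isinstance(x, bool)]
    # Keep the original behavior of returning 0 when there is nothing to average.
    return sum(numeric) / len(numeric) if numeric else 0


print(calculate_average([1, 2, "three", None, 4.5]))  # 2.5
```

A stricter variant could raise a TypeError on the first non-numeric value instead. Which behavior is preferable depends on how the caller is expected to handle bad data.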

Chainforge.ai Evaluation Flow

Model Performance Rankings
