Leaderboard · Chainforge

Chainforge.ai

Software Engineering

Bug-Fix and Refactor

Evaluates models on reading, refactoring, and fixing real-world code snippets with tests. Weighted scoring for safety and style.

Alex Chen – Chainforge.ai · 9/26/2025 · 4 models evaluated

Evaluation Abstract

Models are evaluated on reading, refactoring, and fixing real-world code snippets with accompanying tests. Scores are weighted for safety and style.

Sample Prompt

Fix the following Python function that has a bug:

def calculate_average(nums):
    return sum(nums) / len(nums) if nums else 0

The function fails when nums contains non-numeric values.
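For illustration, here is a minimal sketch of one response a model might give to the sample prompt. The prompt does not say whether non-numeric values should be skipped or rejected, so this sketch assumes they are skipped. It also assumes that booleans should not count as numbers. Neither choice comes from the benchmark's reference solution, which is not shown here.

```python
from numbers import Real


def calculate_average(nums):
    """Return the mean of the numeric entries in nums, or 0 if there are none."""
    # Assumption: non-numeric entries (strings, None, ...) are skipped, not rejected.
    # bool is excluded explicitly because it is a subclass of int in Python.
    numeric = [x for x in nums if isinstance(x, Real) and not isinstance(x, bool)]
    return sum(numeric) / len(numeric) if numeric else 0
```

Under these assumptions, calculate_average([1, 2, "a", 3]) averages only 1, 2, and 3. A stricter variant could raise a TypeError on the first non-numeric entry instead; which behavior is preferable depends on the caller's expectations.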

Scoring criteria: domain-specific metrics per evaluation.

Chainforge.ai Evaluation Flow

Model Performance Rankings