Private BETA

Expert-Driven Leaderboards

Community-owned LLM rankings from domain experts across industries — beyond automated benchmarks.

Featured evaluations
Healthcare · 9/10/2025

Medical Diagnosis Chain-of-Thought Evaluation

Comprehensive evaluation of LLM diagnostic reasoning capabilities using real clinical cases. Measures diagnostic accuracy, reasoning coherence, and consideration of alternative diagnoses.

Dr. Sarah Chen, Stanford Medical

Top 3 Models

#1 openai/gpt-4o: 86.0%
#2 anthropic/claude-3-5-sonnet: 84.0%
#3 google/gemini-1.5-pro: 82.0%
1 flow · 4 models evaluated
Software Engineering · 9/26/2025

Bug-Fix and Refactor

Evaluates models on reading, refactoring, and fixing real-world code snippets with tests. Weighted scoring for safety and style.

Alex Chen, Chainforge.ai

Top 3 Models

#1 openai/gpt-4.1: 91.0%
#2 anthropic/claude-3-opus: 88.0%
#3 mistral-large-2: 83.0%
1 flow · 4 models evaluated
Finance · 10/1/2025

Earnings Call Summarization

Ranks models on accuracy and fidelity for summarizing earnings calls and extracting KPIs, with hallucination penalties.

Priya Sharma, QuantLeaf

Top 3 Models

#1 openai/gpt-4o-mini: 82.0%
#2 google/gemini-1.5-flash: 79.0%
#3 cohere/command-r+: 76.0%
1 flow · 3 models evaluated
Creative · 10/5/2025

Style Transfer

Measures stylistic fidelity, coherence, and constraint following for author-style transfer tasks.

Morgan Lee, LitWorks Collective

Top 3 Models

#1 anthropic/claude-3-5-haiku: 88.0%
#2 openai/gpt-4o: 86.0%
#3 google/gemini-1.5-pro: 83.0%
1 flow · 4 models evaluated