Overview

Total Runs
${ runs.length }
Test Sets
${ testSets.length }
Avg Score (All Time)
${ calculateGlobalAvg() }
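The template calls `calculateGlobalAvg()` without showing its body. One plausible reading, sketched below under stated assumptions, is the mean of `avg_score` over every run that has a numeric score, formatted to one decimal place. Taking `runs` as a parameter, skipping unscored runs, and the `'-'` fallback are all assumptions made for illustration, not the dashboard's confirmed behavior.

```javascript
// Hypothetical sketch: mean of avg_score across runs that actually have a score.
// Assumes each run looks like { id, avg_score }, where avg_score may be null
// while a run is pending or has failed.
function calculateGlobalAvg(runs) {
  const scores = runs
    .map((run) => run.avg_score)
    .filter((score) => typeof score === "number" && Number.isFinite(score));

  // No scored runs yet: return the same placeholder the history table uses.
  if (scores.length === 0) return "-";

  const total = scores.reduce((sum, score) => sum + score, 0);
  return (total / scores.length).toFixed(1);
}

// Usage with made-up sample data.
const sampleRuns = [
  { id: 1, avg_score: 8 },
  { id: 2, avg_score: 7 },
  { id: 3, avg_score: null }, // unscored run, ignored by the average
];
console.log(calculateGlobalAvg(sampleRuns)); // "7.5"
```

Excluding unscored runs keeps a pending or failed run from dragging the all-time average toward zero.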

Recent Evaluation Trends

Test Set Management

${ set.name }

${ set.description }

ID: ${ set.id } | Created: ${ formatDate(set.created_at) }

Evaluation History

ID | Test Set | Model | Status | Score | Date | Action
#${ run.id } | ${ run.test_set_name } | ${ run.model_name } | ${ run.status } | ${ run.avg_score != null ? run.avg_score.toFixed(1) : '-' }/10 | ${ formatDate(run.created_at) } |
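The Score column falls back to a placeholder when a run has no score. A minimal sketch of a helper that could hold that logic in one place follows; the name `formatScore` is hypothetical and not part of the original template. It treats `0` as a valid score, which a plain truthiness check would render as `-`.

```javascript
// Hypothetical helper: format a 0-10 score for display, or '-' when missing.
function formatScore(score) {
  // Explicit null/undefined check so that a legitimate score of 0 is shown.
  if (score === null || score === undefined) return "-";
  return score.toFixed(1);
}

console.log(formatScore(9));    // "9.0"
console.log(formatScore(0));    // "0.0"
console.log(formatScore(null)); // "-"
```

In the template this could be used as `${ formatScore(run.avg_score) }/10`, assuming the helper is exposed to the view.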

Evaluation Details #${ activeRun.run.id }

Summary

Model ${ activeRun.run.model_name }
Status ${ activeRun.run.status }
Avg Score ${ activeRun.run.avg_score != null ? activeRun.run.avg_score.toFixed(1) : '-' }

Case Details

Score: ${ res.judge_score }/10 Latency: ${ res.latency_ms }ms
Prompt
${ res.prompt }
Expected Criteria
${ res.criteria }
Model Output
${ res.model_output }
Judge Reasoning
${ res.judge_reasoning }

Create Test Set

Manage Cases: ${ activeSet.name }

Prompt: ${ c.prompt }
Expected Output: ${ c.expected_output }
Criteria: ${ c.criteria }