Last updated:

Evaluation Framework Overview

Supported Evaluation Frameworks

The platform provides three mainstream evaluation frameworks: lm-evaluation-harness, OpenCompass, and EvalScope.

Framework Comparison

Feature	lm-evaluation-harness	OpenCompass	EvalScope
Task Scope	Global (primarily English)	Special focus on Chinese	Special focus on Chinese
Model Support	Open-source + API models	Open-source + domestic commercial models	Open-source + domestic commercial models
Best For	Academic research, English model comparison	Chinese tasks, domestic/international model comparison	Chinese tasks, newer datasets
Extensibility	High, more technical	User-friendly	User-friendly

lm-evaluation-harness

A Python evaluation tool developed by EleutherAI, providing standardized evaluation pipelines and supporting multiple NLP tasks (text generation, fill-in-the-blank, Q&A, translation). Best suited for English or global academic research.

Built-in benchmarks (partial): MMLU, HellaSwag, ARC, TruthfulQA, WinoGrande, and more.

OpenCompass

An open-source evaluation framework specifically optimized for Chinese large language model evaluation, supporting Chinese benchmarks (e.g., CEVAL, CLUE) and compatible with mainstream domestic models. Suitable for Chinese task evaluation.

Built-in benchmarks (partial): C-Eval, CMMLU, MMLU, GSM8K, HumanEval, and more.

EvalScope

An evaluation framework from the ModelScope community with built-in MMLU, CMMLU, C-Eval, GSM8K, and other benchmarks. Supports large language models, multimodal models, Embedding models, and AIGC models. Suitable for Chinese tasks with newer datasets.

Built-in benchmarks (partial): C-Eval, CMMLU, MMLU, GSM8K, ARC, HellaSwag, and more.

How to Choose a Framework

Evaluating English models or conducting academic comparisons → Recommend lm-evaluation-harness
Evaluating Chinese models focused on domestic benchmarks → Recommend OpenCompass or EvalScope
Using custom datasets → All three frameworks support this; refer to Custom Evaluation Datasets