Custom Evaluation Datasets
Overview
The platform's model evaluation tool supports custom evaluation datasets. Users can upload their own datasets and evaluate model performance against them, so that assessments reflect their specific business scenarios.
EvalScope Custom Datasets
Multiple Choice Questions (MCQ)
Supports CSV and JSONL formats:
| Field | Required | Description |
|---|---|---|
| id | No | Question ID |
| question | Yes | Question content |
| A / B / C / D | Yes | Option content |
| answer | Yes | Correct answer (e.g., A) |
CSV example:
id,question,A,B,C,D,answer
1,What is the capital of France?,Paris,London,Berlin,Madrid,A
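Before uploading, it can help to check a CSV file against the field table above. The following is a minimal sketch using only the Python standard library; the function name `validate_mcq_rows` is hypothetical, and the rule that `answer` must be exactly one of the letters A to D is an assumption inferred from the example, not a documented platform requirement.

```python
import csv
import io

# "id" is optional per the field table, so it is not checked here.
REQUIRED_FIELDS = ("question", "A", "B", "C", "D", "answer")
OPTION_LABELS = ("A", "B", "C", "D")


def validate_mcq_rows(csv_text):
    """Return a list of human-readable problems found in MCQ CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_columns = [f for f in REQUIRED_FIELDS if f not in header]
    if missing_columns:
        return [f"missing columns: {', '.join(missing_columns)}"]

    problems = []
    # Data starts on line 2 because line 1 is the header row.
    for line_no, row in enumerate(reader, start=2):
        for field in REQUIRED_FIELDS:
            if not (row[field] or "").strip():
                problems.append(f"line {line_no}: empty field '{field}'")
        # Assumption: the answer is a single option letter.
        if (row["answer"] or "").strip() not in OPTION_LABELS:
            problems.append(f"line {line_no}: answer must be one of A/B/C/D")
    return problems


sample = (
    "id,question,A,B,C,D,answer\n"
    "1,What is the capital of France?,Paris,London,Berlin,Madrid,A\n"
)
print(validate_mcq_rows(sample))
```

An empty list means no problems were detected by these checks; it does not guarantee that the platform will accept the file.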
Open-ended Q&A
Supports JSONL format:
| Field | Required | Description |
|---|---|---|
| system | No | System prompt |
| query | Yes | Question content |
| response | Yes | Reference answer |
JSONL example:
{"query": "What is a large language model?", "response": "A large language model is a large-scale neural network model based on the Transformer architecture..."}
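If the Q&A records already exist in Python, a JSONL file in this shape can be produced with the standard `json` module. This is an illustrative sketch only: the helper name `to_qa_jsonl` is hypothetical, and placing `system` first in each object is a stylistic choice rather than a requirement stated here.

```python
import json


def to_qa_jsonl(records):
    """Serialize Q&A records to JSONL text, one JSON object per line."""
    lines = []
    for index, record in enumerate(records, start=1):
        # "query" and "response" are the required fields in the table.
        for field in ("query", "response"):
            if not str(record.get(field, "")).strip():
                raise ValueError(f"record {index}: missing required field '{field}'")
        row = {"query": record["query"], "response": record["response"]}
        if record.get("system"):
            # The optional system prompt is only written when present.
            row = {"system": record["system"], **row}
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        lines.append(json.dumps(row, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


print(to_qa_jsonl([
    {"query": "What is a large language model?",
     "response": "A large-scale neural network based on the Transformer architecture."},
]))
```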
OpenCompass Custom Datasets
Multiple Choice Questions (MCQ)
Supports .jsonl and .csv formats. JSONL example:
{"question": "What is the capital of France?", "A": "Paris", "B": "London", "C": "Berlin", "D": "Madrid", "answer": "A"}
Open-ended Q&A
Supports .jsonl and .csv formats. JSONL example:
{"question": "What is a large language model?", "answer": "A large language model is..."}
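The open-ended Q&A examples for EvalScope and OpenCompass differ only in field names (`query`/`response` versus `question`/`answer`), so one file can be derived from the other. The sketch below assumes that this one-to-one mapping is all that is needed; the function name is hypothetical, and the optional `system` field is dropped because the OpenCompass example shows no equivalent.

```python
import json


def evalscope_qa_to_opencompass(jsonl_text):
    """Map {"query", "response"} records to {"question", "answer"} records."""
    converted = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        converted.append(json.dumps(
            {"question": record["query"], "answer": record["response"]},
            ensure_ascii=False,
        ))
    return "".join(row + "\n" for row in converted)


source = '{"query": "What is a large language model?", "response": "A large language model is..."}\n'
print(evalscope_qa_to_opencompass(source))
```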
lm-evaluation-harness Custom Datasets
lm-evaluation-harness uses task configuration files (YAML) to define evaluation tasks. The dataset must include a task YAML file.
Example dataset directory structure:
my_custom_dataset/
├── task.yaml        ← Task configuration file
└── data/
    └── test.jsonl   ← Test data
Example task.yaml:
task: my_custom_qa
dataset_path: my_custom_dataset
dataset_name: default
output_type: generate_until
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{answer}}"
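The `doc_to_text` and `doc_to_target` values are templates whose `{{field}}` placeholders are filled from each record. lm-evaluation-harness renders them with Jinja2; the sketch below is a simplified standard-library stand-in that handles only plain `{{field}}` placeholders, intended for previewing a prompt before upload. It assumes each line of test.jsonl carries `question` and `answer` fields, as the templates imply; the helper names are hypothetical.

```python
import json
import re

# Templates copied from the example task.yaml; "\n" is a real newline here.
DOC_TO_TEXT = "Question: {{question}}\nAnswer:"
DOC_TO_TARGET = "{{answer}}"

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, doc):
    """Replace each {{field}} placeholder with the matching value from doc.

    Raises KeyError if the record lacks a field the template refers to.
    """
    return _PLACEHOLDER.sub(lambda m: str(doc[m.group(1)]), template)


def preview(jsonl_line):
    """Return the (prompt, target) pair for one line of test.jsonl."""
    doc = json.loads(jsonl_line)
    return render(DOC_TO_TEXT, doc), render(DOC_TO_TARGET, doc)


prompt, target = preview('{"question": "What is the capital of France?", "answer": "Paris"}')
print(prompt)
print(target)
```

A `KeyError` from `preview` is a quick signal that a record and the task configuration disagree on field names.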
Using Custom Datasets
- Upload the custom dataset to the platform's dataset repository.
- When creating an evaluation task, search for and select your uploaded dataset in the Dataset Selection field.
- Select the corresponding evaluation framework, ensuring the dataset format matches the framework's requirements.