Datasets
What are Datasets
The Dataset Hub provides unified data management, making it easy to access and share many kinds of data. Datasets are categorized by type: text, image, audio, video, multimodal, scientific computing, and others. They support model training and inference for tasks such as text generation, speech recognition, image classification, and relation extraction.
Dataset Types
The platform supports dataset types including the following:
| Type | Example Tasks |
|---|---|
| Text | Text classification, sentiment analysis, Q&A, text generation |
| Image | Image classification, object detection, image segmentation |
| Audio | Speech recognition, audio classification |
| Video | Video understanding, action recognition |
| Multimodal | Image-text understanding, visual Q&A |
Core Features
- Dataset Upload: Upload data files in various formats via the Web interface or Git.
- Dataset Preview: Preview files in Parquet, CSV, JSONL, and other formats online.
- Version Control: Manage datasets with Git versioning.
- Access Control: Set each dataset's visibility to public or private.
- Processing Integration: Use datasets directly in the platform's fine-tuning and evaluation tasks.
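
As a rough illustration of two of the previewable formats named above, the following sketch serializes a few records as JSONL and CSV using only the Python standard library. The field names `text` and `label`, the sample records, and the output file names `train.jsonl` and `train.csv` are assumptions made for this example; the platform is not known to require any of them.

```python
import csv
import io
import json

# Hypothetical sample records; the field names "text" and "label" are
# illustrative assumptions, not a schema required by the platform.
RECORDS = [
    {"text": "The battery lasts all day.", "label": "positive"},
    {"text": "The screen cracked within a week.", "label": "negative"},
]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def to_csv(records):
    """Serialize records as CSV, taking the header row from the first record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    # Write both files locally; uploading them is a separate step.
    with open("train.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(RECORDS))
    # newline="" keeps the csv module's own line endings intact.
    with open("train.csv", "w", encoding="utf-8", newline="") as f:
        f.write(to_csv(RECORDS))
```

Files produced this way could then be uploaded through the Web interface or committed with Git, as described under Dataset Upload.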
Related Operations
Dataset Card
Learn about Dataset Cards: their purpose, their metadata format, and how to write one that helps users understand and use your dataset.
Create Dataset
How to create a dataset repository on the platform, including the entry point, form parameters, and next steps.
Upload Datasets
How to upload dataset files using the Web interface, Git, CLI tools, or Python SDK.
Update Datasets
How to edit dataset files, modify repository settings, and delete datasets.
Download Datasets
How to download datasets using Git, CLI tools, or Python SDK.