< img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=3131724&fmt=gif" />
Last updated:

    Dataset Card

    What is a Dataset Card

    A Dataset Card is the README.md file in the root directory of a dataset repository. It uses YAML front-matter metadata + Markdown body to describe a dataset's basic information, purpose, structure, and usage.

    The Dataset Card is the key document that helps users quickly understand a dataset. A well-written Dataset Card should include:

    • Name and overview: A basic description and purpose of the dataset
    • Use cases: Task types and industry scenarios the dataset applies to
    • Compatible models: Model types the dataset can be used to train or fine-tune
    • Data structure: Fields, formats, and sample entries
    • Usage instructions: Code examples for loading and using the dataset
    • Creation information: Data sources, collection methods, and annotation process
    • Risks and limitations: Known biases, limitations, and usage considerations

    Metadata Format

    Dataset Cards use a YAML front-matter block to define structured information. The platform automatically parses and displays this metadata. The YAML block is placed at the beginning of the file, delimited by ---:

    ---
    language:
      - zh
      - en
    tags:
      - text-classification
      - sentiment-analysis
    task_categories:
      - text-classification
    size_categories:
      - 10K<n<100K
    license: apache-2.0
    ---
    

    Dataset Card Template

    Below is the recommended full template for a Dataset Card:

    ---
    language:
      - zh
    tags:
      - text-classification
    task_categories:
      - text-classification
    size_categories:
      - 1K<n<10K
    license: apache-2.0
    ---
    
    # Dataset Name
    
    ## Dataset Description
    
    Briefly describe the content, source, and purpose of the dataset.
    
    ## Use Cases
    
    - Text classification
    - Sentiment analysis
    - Model fine-tuning
    
    ## Data Structure
    
    | Field | Type | Description |
    |-------|------|-------------|
    | text | string | Input text |
    | label | int | Category label |
    
    Sample data:
    
    \```json
    {
      "text": "This product has a great user experience",
      "label": 1
    }
    \```
    
    ## Usage
    
    \```python
    from pycsghub.snapshot_download import snapshot_download
    
    dataset_path = snapshot_download(
        repo_id="<namespace>/<dataset-name>",
        repo_type="dataset",
        endpoint="https://<platform-host>",
        token="<access-token>"
    )
    \```
    
    ## Creation Information
    
    - **Data source**: Where the data was collected from
    - **Collection method**: How the data was gathered and processed
    - **Annotation method**: How the data was labeled and quality-checked
    - **Dataset size**: Total number of entries
    
    ## Risks and Limitations
    
    Describe known biases, limitations, and important considerations for using this dataset.
    

    Note

    Code blocks in the template above use ``` escape characters to avoid rendering conflicts. When writing your actual Dataset Card, use standard triple backticks.

    Dataset Task Tags

    Use the following standard task tags in the task_categories metadata field:

    Tag Description
    text-classification Text classification
    token-classification Token classification (e.g., named entity recognition)
    question-answering Question answering
    summarization Text summarization
    translation Machine translation
    text-generation Text generation
    text2text-generation Text-to-text generation
    fill-mask Fill mask
    sentence-similarity Sentence similarity
    conversational Conversational
    table-question-answering Table question answering
    image-classification Image classification
    object-detection Object detection
    image-segmentation Image segmentation
    image-to-text Image to text
    text-to-image Text to image
    visual-question-answering Visual question answering
    automatic-speech-recognition Automatic speech recognition
    audio-classification Audio classification
    text-to-speech Text to speech
    video-classification Video classification
    reinforcement-learning Reinforcement learning
    feature-extraction Feature extraction

    Industry Tags

    Use industry tags in the tags metadata field to identify the dataset's application domain:

    Tag Description
    finance Finance
    healthcare Healthcare
    education Education
    legal Legal
    government Government
    manufacturing Manufacturing
    retail Retail
    energy Energy
    transportation Transportation
    agriculture Agriculture
    general General purpose

    Tip

    Setting appropriate task tags and industry tags helps other users discover and filter datasets more easily.