
    Upload Datasets

    Upload Methods

    The platform supports four methods for uploading dataset files:

    Method        Use Case
    ----------    -------------------------------------------------------
    Web Upload    Quickly upload a single small file (max 5MB per file)
    Git Upload    General-purpose method for batch and large file uploads
    csghub-cli    Command-line batch upload with resumable transfers
    Python SDK    Programmatic upload for automation workflows

    Prerequisites

    • A dataset repository has been created
    • An access token has been generated in User Settings → Access Tokens (required for Git / CLI / SDK methods)

    Git Upload

    Clone the Dataset Repository

    git lfs install
    git clone https://<username>:<access-token>@<platform-host>/<namespace>/<dataset-name>
    cd <dataset-name>
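
    Usernames or access tokens that contain characters such as "/" or "@" can break the clone URL shown above unless they are percent-encoded. The following is a minimal sketch of one way to build the URL safely. The function name build_clone_url and the sample host, namespace, and token are hypothetical placeholders, not part of the platform.

```python
from urllib.parse import quote


def build_clone_url(host: str, namespace: str, dataset: str,
                    username: str, token: str) -> str:
    """Return an HTTPS clone URL with percent-encoded credentials."""
    # safe="" also encodes "/" and "@", which would otherwise be read
    # as URL delimiters.
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{namespace}/{dataset}"


if __name__ == "__main__":
    # Placeholder values only; substitute your own host, namespace, and token.
    print(build_clone_url("hub.example.com", "alice", "demo-data",
                          "alice", "tok/en@1"))
```

    The resulting string can be passed to git clone in place of the hand-written URL. Note that a URL with an embedded token may end up in shell history and in .git/config, so treat it as a secret.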
    

    Add Files and Push

    # Copy data files into the repository directory
    cp /path/to/your/data.parquet .
    
    # Add and commit
    git add .
    git commit -m "add dataset files"
    git push
    

    Note

    The platform automatically enables Git LFS tracking based on file extensions. Files with the following extensions are managed via LFS automatically: .parquet, .arrow, .csv, .jsonl, .json, .tsv, .txt, .bin, .safetensors, .msgpack, .h5, .hdf5, .tflite, .tar.gz, .zip, .zst, .png, .jpg, .jpeg, .gif, .bmp, .wav, .mp3, .flac, .mp4, .avi, .webm, and others.

    To manually track additional formats:

    git lfs track "*.your_extension"
    git add .gitattributes
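
    Before committing, it can help to check which files fall outside the automatically tracked extensions. The sketch below copies only the extensions named in the note above; the platform's real list is longer ("and others"), and the case-sensitive comparison is an assumption. The names AUTO_LFS_SUFFIXES, needs_manual_tracking, and untracked_patterns are hypothetical.

```python
from pathlib import Path

# Partial copy of the extensions listed in the note; the platform's full
# list is not published here, so this set is illustrative only.
AUTO_LFS_SUFFIXES = (
    ".parquet", ".arrow", ".csv", ".jsonl", ".json", ".tsv", ".txt",
    ".bin", ".safetensors", ".msgpack", ".h5", ".hdf5", ".tflite",
    ".tar.gz", ".zip", ".zst", ".png", ".jpg", ".jpeg", ".gif", ".bmp",
    ".wav", ".mp3", ".flac", ".mp4", ".avi", ".webm",
)


def needs_manual_tracking(filename: str) -> bool:
    """True if the name does not end with one of the listed suffixes."""
    # Assumption: matching is case-sensitive.
    return not filename.endswith(AUTO_LFS_SUFFIXES)


def untracked_patterns(filenames):
    """Return sorted `git lfs track` patterns for unlisted extensions."""
    patterns = set()
    for name in filenames:
        suffix = Path(name).suffix
        # Files without an extension (for example README) are skipped.
        if suffix and needs_manual_tracking(name):
            patterns.add(f"*{suffix}")
    return sorted(patterns)


if __name__ == "__main__":
    for pattern in untracked_patterns(["a.npy", "b.parquet", "c.npz"]):
        print(f'git lfs track "{pattern}"')
```

    Each printed line is a command in the form shown above; whether a given format actually needs LFS depends on file size and on the platform's complete extension list.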
    

    Web Upload

    1. Navigate to the dataset detail page and switch to the Files tab
    2. Click the Upload File button
    3. Select the file to upload (max 5MB per file)
    4. Enter a commit message and confirm to complete the upload

    Tip

    Web upload is best suited for quickly uploading small configuration files or README files. For large data files, use the Git or CLI methods instead.
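
    The size rule above can be turned into a quick local check before deciding on a method. This is a sketch under stated assumptions: "5MB" is interpreted as 5 * 1024 * 1024 bytes (the platform may count decimal megabytes), and the names WEB_UPLOAD_LIMIT_BYTES and suggest_upload_method are hypothetical.

```python
import os
import tempfile

# Assumption: the 5MB limit is binary megabytes; adjust if the platform
# uses 5,000,000 bytes instead.
WEB_UPLOAD_LIMIT_BYTES = 5 * 1024 * 1024


def suggest_upload_method(paths):
    """Suggest "web" for a single file within the limit, else "git-or-cli"."""
    paths = list(paths)
    if len(paths) == 1 and os.path.getsize(paths[0]) <= WEB_UPLOAD_LIMIT_BYTES:
        return "web"
    return "git-or-cli"


if __name__ == "__main__":
    # Demonstrate with a small temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        readme = os.path.join(tmp, "README.md")
        with open(readme, "w") as fh:
            fh.write("# demo\n")
        print(suggest_upload_method([readme]))
```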

    Upload via csghub-cli

    Install csghub-cli, which is distributed as part of the csghub-sdk package:

    pip install csghub-sdk
    

    Configure the access token:

    export CSG_TOKEN=<access-token>
    

    Upload dataset files:

    # Upload a single file
    csghub-cli upload <namespace>/<dataset-name> /path/to/data.parquet --repo_type dataset
    
    # Upload an entire directory
    csghub-cli upload <namespace>/<dataset-name> /path/to/dataset_dir --repo_type dataset
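
    For automation, the same command can be assembled and run from a script. The sketch below uses only the arguments shown above (upload, the repository ID, a local path, and --repo_type) and the CSG_TOKEN variable from the previous step; the wrapper functions build_upload_command and upload are hypothetical helpers, not part of csghub-cli. It defaults to a dry run that only returns the command line.

```python
import os
import shlex
import subprocess


def build_upload_command(repo_id: str, local_path: str,
                         repo_type: str = "dataset") -> list:
    """Build the argv list for the csghub-cli upload command shown above."""
    return ["csghub-cli", "upload", repo_id, local_path,
            "--repo_type", repo_type]


def upload(repo_id: str, local_path: str, dry_run: bool = True) -> str:
    """Return the shell-quoted command; execute it only if dry_run is False."""
    cmd = build_upload_command(repo_id, local_path)
    printable = shlex.join(cmd)
    if not dry_run:
        # Fail early if the token from the configuration step is missing.
        if "CSG_TOKEN" not in os.environ:
            raise RuntimeError("CSG_TOKEN is not set")
        subprocess.run(cmd, check=True)
    return printable


if __name__ == "__main__":
    # Placeholder repository ID and path.
    print(upload("alice/demo-data", "data/train.parquet"))
```

    Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.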
    

    Upload via Python SDK

    from pycsghub.repository import Repository
    
    repo = Repository(
        repo_id="<namespace>/<dataset-name>",
        repo_type="dataset",
        endpoint="https://<platform-host>",
        token="<access-token>"
    )
    
    # Clone to local
    repo.clone()
    
    # After copying data files into the local repository directory, push to remote
    repo.push()
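
    The copy step between clone and push is left to the caller. One way to sketch it with the standard library is shown below. The source does not state where repo.clone() places the working copy, so repo_dir is a hypothetical path you would need to supply, and stage_files is an illustrative helper rather than an SDK function.

```python
import shutil
import tempfile
from pathlib import Path


def stage_files(sources, repo_dir):
    """Copy each source file into repo_dir and return the copied paths."""
    repo_dir = Path(repo_dir)
    repo_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if not src.is_file():
            raise FileNotFoundError(f"not a file: {src}")
        # copy2 keeps timestamps; a file with the same name is overwritten.
        copied.append(Path(shutil.copy2(src, repo_dir / src.name)))
    return copied


if __name__ == "__main__":
    # Demonstrate with temporary directories standing in for real paths.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "data.parquet"
        data.write_bytes(b"example")
        print(stage_files([data], Path(tmp) / "local-repo"))
```

    In a real workflow, stage_files would run after the clone call and before the push call, with repo_dir pointing at the cloned working copy.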
    

    Note

    Access tokens can be generated in User Settings → Access Tokens. Use a token with write permissions.