Upload Datasets
Upload Methods
The platform supports four methods for uploading dataset files:
| Method | Use Case |
|---|---|
| Web Upload | Quickly upload a single small file (max 5 MB per file) |
| Git Upload | General-purpose method for batch and large file uploads |
| csghub-cli | Command-line batch upload with resumable transfers |
| Python SDK | Programmatic upload for automation workflows |
Prerequisites
- A dataset repository has been created
- An access token has been generated in User Settings → Access Tokens (required for Git / CLI / SDK methods)
Git Upload
Clone the Dataset Repository
git lfs install
git clone https://<username>:<access-token>@<platform-host>/<namespace>/<dataset-name>
cd <dataset-name>
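If the username or access token contains characters that are reserved in URLs (such as `@` or `/`), they need to be percent-encoded before being placed in the clone URL. The following is a minimal sketch of assembling the URL in the shape shown above; the function name `build_clone_url` and the example values are hypothetical and not part of the platform.

```python
from urllib.parse import quote


def build_clone_url(host: str, username: str, token: str,
                    namespace: str, dataset: str) -> str:
    """Assemble an HTTPS clone URL with percent-encoded credentials."""
    # safe="" also encodes "/" and "@", which would otherwise break the URL.
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{namespace}/{dataset}"


# Placeholder values for illustration only.
print(build_clone_url("hub.example.com", "alice", "tok/en@1",
                      "alice", "my-dataset"))
```

Keep in mind that a URL built this way contains the token in clear text, so avoid writing it to logs or shell history.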
Add Files and Push
# Copy data files into the repository directory
cp /path/to/your/data.parquet .
# Add and commit
git add .
git commit -m "add dataset files"
git push
Note
The platform automatically tracks files with Git LFS based on their extension. This covers, among others: .parquet, .arrow, .csv, .jsonl, .json, .tsv, .txt, .bin, .safetensors, .msgpack, .h5, .hdf5, .tflite, .tar.gz, .zip, .zst, .png, .jpg, .jpeg, .gif, .bmp, .wav, .mp3, .flac, .mp4, .avi, and .webm.
To manually track additional formats:
git lfs track "*.your_extension"
git add .gitattributes
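To see which files might need a manual `git lfs track` entry, a short script can compare file names against the extensions listed in the note. This is only a sketch: the list in the note is not exhaustive, so a file reported here may still be tracked automatically, and the case-sensitive match is an assumption. The helper names are hypothetical.

```python
import os

# Extensions listed in the note above. The platform's full list is longer
# ("and others"), so a miss here is not conclusive.
LISTED_LFS_EXTENSIONS = (
    ".parquet", ".arrow", ".csv", ".jsonl", ".json", ".tsv", ".txt",
    ".bin", ".safetensors", ".msgpack", ".h5", ".hdf5", ".tflite",
    ".tar.gz", ".zip", ".zst", ".png", ".jpg", ".jpeg", ".gif", ".bmp",
    ".wav", ".mp3", ".flac", ".mp4", ".avi", ".webm",
)


def is_listed_lfs_file(filename: str) -> bool:
    """Return True if the file name ends with one of the listed extensions."""
    # str.endswith accepts a tuple, which also covers ".tar.gz".
    # Assumption: matching is case-sensitive.
    return filename.endswith(LISTED_LFS_EXTENSIONS)


def untracked_patterns(filenames):
    """Suggest `git lfs track` patterns for files not covered by the list."""
    patterns = set()
    for name in filenames:
        if is_listed_lfs_file(name):
            continue
        suffix = os.path.splitext(name)[1]
        if suffix:  # skip files without an extension, e.g. README
            patterns.add(f"*{suffix}")
    return sorted(patterns)
```

Review the suggested patterns before applying them: small text files such as scripts or Markdown usually belong in plain Git rather than LFS.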
Web Upload
- Navigate to the dataset detail page and switch to the Files tab
- Click the Upload File button
- Select the file to upload (max 5 MB per file)
- Enter a commit message and confirm to complete the upload
Tip
Web upload is best suited for quickly uploading small configuration files or README files. For large data files, use the Git or CLI methods instead.
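Before choosing web upload, the size limit can be checked locally. The sketch below assumes that "5MB" means 5 × 1024 × 1024 bytes; the platform may count 5,000,000 bytes instead, and the names `WEB_UPLOAD_LIMIT_BYTES` and `fits_web_upload` are hypothetical.

```python
import os

# Assumption: the limit is 5 MiB. Adjust if the platform counts 5,000,000 bytes.
WEB_UPLOAD_LIMIT_BYTES = 5 * 1024 * 1024


def fits_web_upload(path: str, limit: int = WEB_UPLOAD_LIMIT_BYTES) -> bool:
    """Return True if the file at `path` is within the web upload limit."""
    return os.path.getsize(path) <= limit
```

Files for which this returns `False` are candidates for the Git or CLI methods.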
Upload via csghub-cli
Install csghub-cli:
pip install csghub-sdk
Configure the access token:
export CSG_TOKEN=<access-token>
Upload dataset files:
# Upload a single file
csghub-cli upload <namespace>/<dataset-name> /path/to/data.parquet --repo_type dataset
# Upload an entire directory
csghub-cli upload <namespace>/<dataset-name> /path/to/dataset_dir --repo_type dataset
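For automation, the same commands can be driven from a script. The sketch below only wraps the invocation shown above (`upload`, the repository ID, a local path, and `--repo_type dataset`) and passes the token through the `CSG_TOKEN` variable; the function names are hypothetical, and `csghub-cli` is assumed to be on `PATH`.

```python
import os
import subprocess


def build_upload_command(repo_id: str, local_path: str) -> list:
    """Build the csghub-cli argument list for a dataset upload."""
    return ["csghub-cli", "upload", repo_id, local_path,
            "--repo_type", "dataset"]


def upload_all(repo_id: str, paths, token: str) -> None:
    """Run one upload per path, stopping at the first failure."""
    # Same variable as the `export` line above, set only for the child process.
    env = dict(os.environ, CSG_TOKEN=token)
    for path in paths:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(build_upload_command(repo_id, path), env=env, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.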
Upload via Python SDK
from pycsghub.repository import Repository
repo = Repository(
    repo_id="<namespace>/<dataset-name>",
    repo_type="dataset",
    endpoint="https://<platform-host>",
    token="<access-token>",
)
# Clone to local
repo.clone()
# After copying data files into the local repository directory, push to remote
repo.push()
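The copy step between `repo.clone()` and `repo.push()` can also be scripted. This sketch uses only the Python standard library; `copy_into_repo` is a hypothetical helper, and `repo_dir` stands for whichever local directory the clone created, which is not specified here.

```python
import shutil
from pathlib import Path


def copy_into_repo(sources, repo_dir):
    """Copy each source file into repo_dir and return the new paths."""
    target = Path(repo_dir)
    if not target.is_dir():
        raise NotADirectoryError(f"{target} is not a directory; clone first")
    copied = []
    for src in map(Path, sources):
        # copy2 preserves timestamps; a file with the same name is overwritten.
        copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

Call it after the clone has finished and before pushing, for example `copy_into_repo(["/path/to/your/data.parquet"], local_repo_dir)`.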
Note
Access tokens can be generated in User Settings → Access Tokens. Use a token with write permissions.