Data Tools
What are Data Tools
Data Tools (DataFlow) is a one-stop data processing platform that manages the full lifecycle from data to model and supports continuous optimization. It accepts multiple data formats and sources, including local files, cloud data, and databases, and provides transformation and processing tools that keep data consistent. Customizable Pipelines handle complex data cleaning and transformation and run in parallel for efficiency. An integrated labeling system supports collaborative annotation, with permission and review mechanisms that safeguard labeling accuracy and data quality.
Data Tools is deeply integrated with the platform's dataset management: data processing requires an existing platform dataset as the target, and all processing results are saved directly to dataset repositories.
Feature Modules
| Module | Description |
|---|---|
| System Dashboard | View overall running status of data collection, format conversion, and data processing tasks |
| Data Collection | Import data from MySQL, MongoDB, files, Hive, and other sources |
| Format Conversion | Convert Excel→JSON/CSV/Parquet, Word/PPT→Markdown, and more |
| Data Processing | Configure operator Pipelines via a visual workflow editor for data cleaning and augmentation |
| Algorithm Templates | Built-in and custom data processing algorithm templates for reuse and sharing |
| Tools | A collection of specialized data processing tools for analysis, conversion, and generation |
| Data Labeling | Integrated Label Studio for multi-modal annotation of text, images, audio, and video |
Quick Start
- Create or select a dataset on the platform to use as the data source and result storage target
- Navigate to Data Tools in the top navigation
- Choose the appropriate feature based on your needs:
  - Import external data → Data Collection
  - Convert file formats → Format Conversion
  - Clean/augment data → Data Processing
  - Label data → Data Labeling
System Dashboard
The System Dashboard provides an overview of the running status of data collection, format conversion, and data processing tasks.
Data Collection
The data collection feature supports importing data from MySQL, MongoDB, files, Hive, and other sources into platform datasets.
Format Conversion
Format conversion supports converting Excel files to JSON, CSV, or Parquet, and Word or PPT files to Markdown, among other conversions.
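As a rough illustration of what a tabular conversion such as Excel→JSON/CSV involves, the sketch below serializes a set of rows to both formats using only the Python standard library. It is not the platform's converter: the function names `rows_to_csv` and `rows_to_json` and the sample data are hypothetical, and the rows are assumed to have already been read from a sheet into dictionaries (reading `.xlsx` files directly would need a third-party library such as openpyxl).

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text.

    The header is taken from the keys of the first row (an assumption of
    this sketch: every row has the same columns).
    """
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Serialize the same rows to a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical rows, standing in for the contents of one spreadsheet sheet.
sample_rows = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world"},
]

csv_text = rows_to_csv(sample_rows)
json_text = rows_to_json(sample_rows)
```

Keeping the rows as plain dictionaries means the same in-memory data can be written to either format without a second read of the source file.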
Data Processing
The data processing module lets you configure Pipelines of Mapper, Filter, and Deduplicator operators in a visual workflow editor, for cleaning, augmenting, and deduplicating training data for large models.
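To make the three operator roles concrete, here is a minimal Python sketch of a Pipeline, assuming the usual meaning of the terms: a Mapper transforms each sample, a Filter decides whether a sample is kept, and a Deduplicator removes repeated samples. The platform itself is configured through the visual editor; every name below (`normalize_whitespace`, `min_length_filter`, `deduplicate`, `run_pipeline`) is hypothetical and not part of the platform's API.

```python
import hashlib


def normalize_whitespace(sample):
    """Mapper: return a copy of the sample with runs of whitespace collapsed."""
    return {**sample, "text": " ".join(sample["text"].split())}


def min_length_filter(min_chars):
    """Filter factory: keep samples whose text has at least min_chars characters."""
    def keep(sample):
        return len(sample["text"]) >= min_chars
    return keep


def deduplicate(samples):
    """Deduplicator: drop samples whose text exactly matches an earlier one."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


def run_pipeline(samples, mappers, filters):
    """Apply all mappers, then all filters, to each sample; deduplicate at the end."""
    kept = []
    for sample in samples:
        for mapper in mappers:
            sample = mapper(sample)
        if all(keep(sample) for keep in filters):
            kept.append(sample)
    return deduplicate(kept)


# Hypothetical input samples.
raw = [
    {"text": "  Hello   world  "},
    {"text": "Hello world"},
    {"text": "hi"},
    {"text": "A longer, distinct sentence."},
]

cleaned = run_pipeline(raw, [normalize_whitespace], [min_length_filter(5)])
```

In this sketch the mapper runs before deduplication, so the first two samples become identical after whitespace normalization and only one of them survives; the order of operators in a Pipeline affects the result.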
Algorithm Templates
Algorithm templates allow users to save common data processing Pipelines as reusable templates, with support for both built-in and custom templates.
Tools
The Tools module collects specialized data processing tools for analysis, format conversion, data generation, and more.
Data Labeling
The data labeling feature integrates Label Studio to support annotation tasks for text, images, audio, video, and other multi-modal data.