
    Data Tools

    What are Data Tools

    Data Tools (DataFlow) is a one-stop data processing platform that manages the full lifecycle from data to model and supports continuous optimization. It accepts multiple data formats and sources, including local files, cloud data, and databases, and provides transformation and processing tools that keep data consistent. Customizable Pipelines handle complex data cleaning and transformation, with parallel processing for efficiency. An integrated labeling system supports collaborative annotation, with permission and review mechanisms that safeguard labeling accuracy and data quality.

    Data Tools is deeply integrated with the platform's dataset management: data processing requires an existing platform dataset as the target, and all processing results are saved directly to dataset repositories.

    Feature Modules

    Module              | Description
    System Dashboard    | View overall running status of data collection, format conversion, and data processing tasks
    Data Collection     | Import data from MySQL, MongoDB, files, Hive, and other sources
    Format Conversion   | Convert Excel→JSON/CSV/Parquet, Word/PPT→Markdown, and more
    Data Processing     | Configure operator Pipelines via a visual workflow editor for data cleaning and augmentation
    Algorithm Templates | Built-in and custom data processing algorithm templates for reuse and sharing
    Tools               | A collection of specialized data processing tools for analysis, conversion, and generation
    Data Labeling       | Integrated Label Studio for multi-modal annotation of text, images, audio, and video

    Quick Start

    1. Create or select a dataset on the platform to use as the data source and result storage target
    2. Navigate to Data Tools in the top navigation
    3. Choose the feature that fits your needs; each one is described in the sections that follow

    System Dashboard

    The system dashboard provides an overview of the running status of data collection, format conversion, and data processing tasks.

    Data Collection

    The data collection feature supports importing data from MySQL, MongoDB, files, Hive, and other sources into platform datasets.

    Format Conversion

    Format conversion converts Excel files to JSON, CSV, or Parquet, and Word and PPT files to Markdown, among other conversions.
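
    To illustrate what a tabular conversion of this kind involves, here is a minimal sketch in Python that is independent of the platform. It assumes the spreadsheet rows have already been read into a list of dictionaries; reading .xlsx files and writing Parquet both need third-party libraries and are not shown. The function names and sample rows are hypothetical and are not part of Data Tools.

```python
import csv
import io
import json

# Hypothetical sample rows, standing in for the contents of one spreadsheet sheet.
rows = [
    {"id": 1, "name": "alpha", "score": 0.91},
    {"id": 2, "name": "beta", "score": 0.78},
]


def rows_to_json(rows):
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def rows_to_csv(rows):
    """Serialize the rows as CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


json_text = rows_to_json(rows)
csv_text = rows_to_csv(rows)
```

    Both outputs carry the same records; JSON preserves value types such as numbers, while CSV stores every cell as text.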

    Data Processing

    The data processing module lets you assemble Mapper, Filter, and Deduplicator operators into Pipelines via a visual workflow editor, for cleaning, augmenting, and deduplicating large model training data.
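
    The three operator roles can be sketched in plain Python to show how they compose. This is a conceptual illustration only, not the platform's implementation: the helper names (mapper, filter_op, deduplicator, run_pipeline) and the sample records are hypothetical, and it assumes a Mapper transforms each record, a Filter drops records that fail a condition, and a Deduplicator keeps the first record for each key.

```python
def mapper(fn):
    """Build a stage that transforms every record with fn."""
    def stage(records):
        for record in records:
            yield fn(record)
    return stage


def filter_op(predicate):
    """Build a stage that keeps only records for which predicate is true."""
    def stage(records):
        for record in records:
            if predicate(record):
                yield record
    return stage


def deduplicator(key):
    """Build a stage that keeps the first record seen for each key value."""
    def stage(records):
        seen = set()
        for record in records:
            k = key(record)
            if k not in seen:
                seen.add(k)
                yield record
    return stage


def run_pipeline(records, stages):
    """Chain the stages in order and collect the surviving records."""
    for stage in stages:
        records = stage(records)
    return list(records)


def normalize_whitespace(record):
    """Collapse runs of whitespace in the text field to single spaces."""
    return {**record, "text": " ".join(record["text"].split())}


pipeline = [
    mapper(normalize_whitespace),                 # clean
    filter_op(lambda r: len(r["text"]) >= 5),     # drop very short samples
    deduplicator(lambda r: r["text"].lower()),    # case-insensitive dedup
]

samples = [
    {"text": "Hello   world"},
    {"text": "hello world"},
    {"text": "hi"},
    {"text": "  Data   tools  "},
]
cleaned = run_pipeline(samples, pipeline)
# cleaned == [{"text": "Hello world"}, {"text": "Data tools"}]
```

    Operator order matters: normalizing before deduplicating lets records that differ only in spacing be recognized as duplicates.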

    Algorithm Templates

    Algorithm templates allow users to save common data processing Pipelines as reusable templates, with support for both built-in and custom templates.
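
    One way to picture a reusable template is as a serialized list of operator names and parameters that can be stored, shared, and replayed on new data. The sketch below assumes a JSON layout and an operator registry invented for illustration; the field names (name, steps, op, params) and the operators are hypothetical and do not reflect the platform's actual template format.

```python
import json


def strip_text(records):
    """Trim leading and trailing whitespace from each record's text."""
    return [{**r, "text": r["text"].strip()} for r in records]


def min_length(records, n=1):
    """Keep records whose text has at least n characters."""
    return [r for r in records if len(r["text"]) >= n]


# Hypothetical registry mapping operator names in a template to functions.
REGISTRY = {"strip_text": strip_text, "min_length": min_length}

# A template is plain data, so it can be saved and shared as JSON.
template = {
    "name": "basic_clean",
    "steps": [
        {"op": "strip_text", "params": {}},
        {"op": "min_length", "params": {"n": 3}},
    ],
}
saved = json.dumps(template)


def apply_template(template_json, records):
    """Load a saved template and run its steps in order."""
    loaded = json.loads(template_json)
    for step in loaded["steps"]:
        operator = REGISTRY[step["op"]]
        records = operator(records, **step.get("params", {}))
    return records


result = apply_template(saved, [{"text": "  ok  "}, {"text": " fine "}])
# result == [{"text": "fine"}]
```

    Keeping the template as data rather than code is what makes it reusable: the same saved steps can be applied to any dataset with a matching field.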

    Tools

    The tool pool aggregates specialized data processing tools for analysis, format conversion, data generation, and more.

    Data Labeling

    The data labeling feature integrates Label Studio to support annotation tasks for text, images, audio, video, and other multi-modal data.