Data Processing

Overview

The data processing module lets users combine data processing operators (Mapper, Filter, Deduplicator, Selector) into Pipelines in a visual workflow editor, in order to clean, augment, deduplicate, and quality-filter training data for large models.

Task Types

Type	Description
Data Cleaning (data_refine)	Clean data through deduplication, desensitization, and other operators to meet usage requirements
Data Augmentation (data_enhancement)	Automatically generate more training data from seed data, supporting custom parameters and Prompts
Data Generation (data_generation)	Use models to generate specific types of training data

Creating a Data Processing Task

In Data Tools → Data Processing, click Create Task and configure as follows:

Step 1: Basic Configuration

Parameter	Description
Task Name	Custom task identifier
Task Type	Operator task or tool task
Data Source	Select source dataset and branch
Data Destination	Select the target dataset for processed results
Destination Branch	Target branch for results (auto-created if not present)

Step 2: Workflow Configuration

Use the visual workflow editor to configure the processing Pipeline:

Drag operator nodes from the left panel to the canvas
Click a node's connection point and drag to another node to establish processing order
Click a node to configure its parameters (some operators have required parameters)
Use zoom, reset view, and clear canvas controls to manage the workspace

Other canvas operations:

Drag a node to reposition it
Hover over a node to reveal its delete button
Press Delete to remove the selected node
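
The connections drawn on the canvas determine the order in which operators run. The platform's internal representation is not documented here, so the following is only a minimal sketch of the idea, assuming the workflow can be modeled as a directed graph; the node names and the `processing_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node on the canvas, and each value is
# the set of nodes that must run before it (its incoming connections).
workflow = {
    "text_normalization": set(),
    "text_length_filter": {"text_normalization"},
    "md5_deduplication": {"text_length_filter"},
}


def processing_order(graph):
    """Return node names in an order that respects every connection."""
    # static_order() raises graphlib.CycleError if the links form a loop.
    return list(TopologicalSorter(graph).static_order())


print(processing_order(workflow))
```

A graph whose connections form a cycle has no valid processing order, which is why the sketch relies on a topological sort rather than on the order in which nodes were dropped onto the canvas.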

Step 3: Data Export Configuration

Configure how processed data should be saved:

Push to original dataset: Push as a new commit to the original dataset repository
Push to new dataset: Save processed results to a specified new dataset

Operator Types

The platform supports four operator types:

Type	Description
Mapper	Transform each data sample (text normalization, format conversion, etc.)
Filter	Filter data by conditions, retaining samples that meet criteria
Deduplicator	Remove duplicate data
Selector	Select a specific subset of samples from the dataset
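
To make the four operator types concrete, here is a minimal sketch in Python. It is not the platform's implementation: the helper names (`mapper`, `filter_op`, `deduplicator`, `selector`, `run_pipeline`) and the assumption that a sample is a dict with a `text` field are illustrative only.

```python
import random


def mapper(fn):
    """Mapper: transform every sample and keep all of them."""
    return lambda samples: [fn(s) for s in samples]


def filter_op(predicate):
    """Filter: retain only the samples that meet the criterion."""
    return lambda samples: [s for s in samples if predicate(s)]


def deduplicator(key):
    """Deduplicator: keep the first sample seen for each key."""
    def run(samples):
        seen, kept = set(), []
        for s in samples:
            k = key(s)
            if k not in seen:
                seen.add(k)
                kept.append(s)
        return kept
    return run


def selector(k, seed=0):
    """Selector: pick a subset (here, up to k samples at random)."""
    def run(samples):
        rng = random.Random(seed)
        return rng.sample(samples, min(k, len(samples)))
    return run


def run_pipeline(samples, operators):
    """Apply the operators one after another, in order."""
    for op in operators:
        samples = op(samples)
    return samples


data = [{"text": " Hello "}, {"text": "hello"}, {"text": "hi"}, {"text": "Hello world"}]
pipeline = [
    mapper(lambda s: {**s, "text": s["text"].strip().lower()}),
    filter_op(lambda s: len(s["text"]) >= 3),
    deduplicator(lambda s: s["text"]),
    selector(2),
]
result = run_pipeline(data, pipeline)
print(sorted(s["text"] for s in result))
```

Order matters in this sketch: normalizing before deduplicating lets " Hello " and "hello" collapse into a single sample.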

Common operators:

Operator	Type	Function
Text Normalization	Mapper	Unicode text normalization and traditional-to-simplified Chinese conversion
Special Content Removal	Mapper	Remove URLs, invisible characters, HTML tags, etc.
Chinese Character Conversion	Mapper	Convert between traditional/simplified Chinese and Japanese kanji
Text Replacement	Mapper	Replace content matching regex patterns
Sentence Split	Mapper	Split text into sentences
Text Length Filter	Filter	Keep samples within specified length range
Special Character Ratio Filter	Filter	Filter out samples with an excessive ratio of special characters
N-Gram Repetition Filter	Filter	Filter out samples with an excessive n-gram repetition rate
Language Confidence Filter	Filter	Retain samples in a specific language
MD5 Deduplication	Deduplicator	Exact-match deduplication
SimHash Deduplication	Deduplicator	Similarity-based deduplication
MinHashLSH Deduplication	Deduplicator	Efficient approximate deduplication
Random Selector	Selector	Randomly select data samples
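
The logic behind several of these operators can be sketched with the Python standard library. The functions below are illustrative approximations, not the platform's operators: the choice of NFKC normalization, the regular expressions, the definition of a "special" character, and the word-level n-gram repetition measure are all assumptions.

```python
import hashlib
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")


def normalize_text(text):
    """Text normalization (NFKC is an assumed choice of Unicode form)."""
    return unicodedata.normalize("NFKC", text)


def remove_special_content(text):
    """Remove URLs, HTML tags, and invisible format characters."""
    text = URL_RE.sub("", text)
    text = TAG_RE.sub("", text)
    # Unicode category "Cf" covers zero-width and other format characters.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def within_length(text, min_len, max_len):
    """Text length filter: True if the sample should be kept."""
    return min_len <= len(text) <= max_len


def special_char_ratio(text):
    """Share of characters that are neither alphanumeric nor whitespace."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return special / len(text)


def ngram_repetition_rate(text, n=2):
    """Share of word n-grams that repeat an n-gram already seen."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)


def md5_deduplicate(texts):
    """Exact-match deduplication keyed on the MD5 digest of each text."""
    seen, kept = set(), []
    for t in texts:
        digest = hashlib.md5(t.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(t)
    return kept
```

MD5 deduplication only catches byte-identical texts, so it is usually worth running the normalization and removal Mappers first. Near-duplicates are what the SimHash and MinHashLSH operators target; those are not sketched here.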

Viewing Task Status and Results

After submitting, the task list shows:

Field	Description
Task Name	Task identifier
Task Type	Operator task / Tool task
Running Status	Queued / Processing / Completed / Failed
Data Source	Source dataset information
Data Destination	Target dataset information

Click Details to view before/after Session comparisons and processed data counts
Click Logs to review execution logs
After completion, click Processing Results to view the processed data

Tip

Validate your Pipeline configuration on a small data sample before processing the full dataset. Pipelines can be saved as Algorithm Templates for future reuse.
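
The same small-sample check can be done offline before building the Pipeline on the platform. The sketch below is only an illustration: `dry_run` and `toy_pipeline` are hypothetical names, and the pipeline is modeled as a plain function over a list of texts.

```python
import random


def dry_run(samples, pipeline, sample_size=100, seed=0):
    """Run `pipeline` on a random subset and report before/after counts."""
    rng = random.Random(seed)
    subset = rng.sample(samples, min(sample_size, len(samples)))
    result = pipeline(subset)
    return {"input": len(subset), "output": len(result)}


def toy_pipeline(texts):
    """Hypothetical pipeline: strip whitespace, drop empties, exact dedup."""
    cleaned = [t.strip() for t in texts]
    non_empty = [t for t in cleaned if t]
    # dict.fromkeys keeps the first occurrence of each text, in order.
    return list(dict.fromkeys(non_empty))


data = ["a", " a ", "", "b"] * 50
report = dry_run(data, toy_pipeline, sample_size=20)
print(report)
```

A large gap between the input and output counts on the sample is a cue to revisit filter thresholds before committing to a full run.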