Tools
Overview
The DataFlow tool pool is a collection of multi-modal data processing tools that improve the quality and value of data and make it better suited to large model processing. Compared with operators in algorithm templates, each tool in the pool targets a specific data processing scenario and supports more complex preprocessing and postprocessing operations.
Tool Categories
Tools are divided into Internal Tools (built into the platform) and External Tools (externally integrated).
Built-in Tool List
| Tool Name | Type | Description |
|---|---|---|
| General Analysis Tool | Analysis | Analyzes a dataset, calculates statistics for all filter operations, applies various analyses, and generates statistical tables and distribution charts |
| Dataset Language Split Tool | Preprocessing | Loads a dataset, applies language identification filtering, and splits/saves the dataset by language |
| Dataset from Code Repository Tool | Preprocessing | Prepares a dataset from a code repository in the format: repo name, file path, file content |
| Raw arXiv to JSONL Tool | Preprocessing | Converts raw arXiv data (gzipped tar files) to JSONL format |
| Raw Stack Exchange to JSONL Tool | Preprocessing | Converts Stack Exchange archive data to multiple JSONL files |
| CSV NaN Value Reformatter | Preprocessing | Reformats CSV/TSV files that contain NaN values |
| JSONL NaN Value Reformatter | Preprocessing | Reformats JSONL files containing NaN values for consistent loading |
| Metadata Serializer | Preprocessing | Serializes all non-specified fields in JSONL files to ensure consistent data loading |
| MD to JSONL Tool | Preprocessing | Converts Markdown files to JSONL format with chunking support |
| Token Counter | Postprocessing | Counts tokens in a dataset using a specified tokenizer (JSONL format only) |
| Data Mixing Tool | Postprocessing | Mixes multiple datasets into one; supports JSONL, JSON, Parquet formats |
| Metadata Deserializer | Postprocessing | Deserializes specified fields in JSONL files |
| General Quality Classifier | Analysis | Predicts document scores in a dataset, providing score and should_keep columns for each row |
| URL Data Scraping Tool | Preprocessing | LLM-based data scraping for websites and local documents (XML, HTML, JSON, etc.) |
| PDF Data Extraction Tool | Preprocessing | High-quality PDF to Markdown and JSON conversion tool |
| Text Value Evaluation Tool | Analysis | Scores and filters data based on user-defined criteria, with Bloom filter deduplication |
| High-Quality Dialogue Generation Tool | Generation | Generates multi-turn dialogues using fixed prompts and retains the highest quality conversations |
| Enhanced Text Description Tool | Generation | Uses a large model to generate detailed descriptions from source data |
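
To make two of the preprocessing entries above more concrete, the following is a minimal sketch of what NaN reformatting and metadata serialization on a JSONL line might look like. It is not the platform's implementation: the function names, the `text` field kept as-is, and the `meta` target field are all assumptions chosen for illustration.

```python
import json
import math


def replace_nan(value):
    """Recursively replace float NaN with None so it serializes as JSON null."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, dict):
        return {key: replace_nan(item) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_nan(item) for item in value]
    return value


def serialize_metadata(record, keep_fields=("text",), target_field="meta"):
    """Keep the specified fields and pack every other field into one JSON string.

    `keep_fields` and `target_field` are hypothetical parameter names.
    """
    kept = {key: item for key, item in record.items() if key in keep_fields}
    rest = {key: item for key, item in record.items() if key not in keep_fields}
    kept[target_field] = json.dumps(rest, sort_keys=True)
    return kept


def reformat_line(line):
    """Clean one JSONL line: drop NaN values, then serialize non-kept fields."""
    # json.loads accepts the bare NaN token by default and returns float("nan").
    record = replace_nan(json.loads(line))
    # allow_nan=False makes the output strict JSON; it raises if a NaN slipped through.
    return json.dumps(serialize_metadata(record), allow_nan=False)
```

Packing the non-specified fields into a single string gives every record the same top-level schema, which is one plausible reading of "consistent data loading"; a deserializer would reverse the step with `json.loads` on the `meta` field.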
Using Tools
- In Data Tools → Tools, browse or search for the desired tool.
- Click the Use Tool button on the tool card.
- The system redirects to the new task page with the tool pre-selected.
- Configure task parameters (data source, target dataset, tool parameters, etc.).
- Submit for execution.
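
The parameters configured in the steps above can be pictured as a small structured record. The sketch below is purely illustrative: `ToolTask`, its field names, and the example values are hypothetical and do not describe the platform's actual task schema or API.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToolTask:
    """Hypothetical container for the parameters entered on the new task page."""

    tool_name: str        # tool pre-selected by the Use Tool button
    data_source: str      # dataset the tool reads from
    target_dataset: str   # dataset the tool writes to
    tool_params: dict = field(default_factory=dict)  # tool-specific settings

    def validate(self):
        """Raise ValueError if a required parameter is empty."""
        required = ("tool_name", "data_source", "target_dataset")
        missing = [name for name in required if not getattr(self, name)]
        if missing:
            raise ValueError("missing task parameters: " + ", ".join(missing))

    def to_payload(self):
        """Return the task as a plain dict after validating it."""
        self.validate()
        return asdict(self)


# Example values are placeholders, not real dataset or tokenizer names.
task = ToolTask(
    tool_name="Token Counter",
    data_source="corpus.jsonl",
    target_dataset="corpus_token_counts",
    tool_params={"tokenizer": "example-tokenizer"},
)
payload = task.to_payload()
```

Validating before submission mirrors the order of the steps: the tool is fixed first, and the remaining parameters must be filled in before the task can run.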
Search and Filtering
Tools can be found using:
- Search by tool name: Enter keywords to match against tool names
- Filter by category: Filter by tool type (Analysis / Preprocessing / Postprocessing / Generation)
- Internal/External tabs: Switch between platform built-in tools and externally integrated tools
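
The three filters above combine naturally: a tool is shown only if it passes every filter that is set. The following sketch shows that logic on a few entries from the built-in tool list. The `Tool` class, the `find_tools` function, and the case-insensitive substring match are assumptions for illustration, not the platform's actual search behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    category: str          # Analysis / Preprocessing / Postprocessing / Generation
    internal: bool = True  # True for built-in tools, False for external ones


# A few rows taken from the built-in tool list.
CATALOG = [
    Tool("General Analysis Tool", "Analysis"),
    Tool("MD to JSONL Tool", "Preprocessing"),
    Tool("JSONL NaN Value Reformatter", "Preprocessing"),
    Tool("Token Counter", "Postprocessing"),
    Tool("Data Mixing Tool", "Postprocessing"),
]


def find_tools(tools, keyword="", category=None, internal=None):
    """Return the tools that pass every filter that is set."""
    keyword = keyword.strip().lower()
    results = []
    for tool in tools:
        if keyword and keyword not in tool.name.lower():
            continue  # name does not contain the keyword
        if category is not None and tool.category != category:
            continue  # wrong tool type
        if internal is not None and tool.internal != internal:
            continue  # wrong Internal/External tab
        results.append(tool)
    return results


jsonl_tools = [tool.name for tool in find_tools(CATALOG, keyword="jsonl")]
postprocessing = [tool.name for tool in find_tools(CATALOG, category="Postprocessing")]
```

Leaving a filter unset (`None` or an empty keyword) disables it, so calling `find_tools(CATALOG)` with no arguments returns the full list.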