Data Collection

Overview

The data collection module imports data from multiple external sources and syncs the results to dataset repositories on the platform, where they serve as the data foundation for downstream processing and model training.

Supported Data Source Types

Type	Description
Relational Database (MySQL)	Bulk import of database tables with custom table and field selection
Non-relational Database (MongoDB)	Import non-relational data with collection, field selection, and schema transformation
File Data Import	Import files in CSV, Excel, JSON, and other formats
Hive System Import	Efficiently read data stored in Hive systems

Adding a Data Source

In Data Tools → Data Collection → Data Source Management, click Add Data Source and fill in the connection details:

Parameter	Description
Data Source Name	Custom name for easy identification
Data Source Type	MySQL / MongoDB / File Data / Hive
Server Address	Database server IP or hostname
Port	Database connection port
Database Name	Target database name
Username / Password	Database authentication credentials
Auth Type	None / LDAP / Kerberos

After filling in the details, click Test Connection to verify connectivity.
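
The connection form can also be pictured as a small record with a few sanity checks. The sketch below is illustrative only: `DataSourceConfig`, `validate`, and `can_reach` are hypothetical names, not part of the platform, and the allowed values are copied from the table above. The reachability helper only opens a TCP connection; the platform's Test Connection button presumably checks credentials as well.

```python
import socket
from dataclasses import dataclass
from typing import List

# Allowed values taken from the parameter table above.
SOURCE_TYPES = {"MySQL", "MongoDB", "File Data", "Hive"}
AUTH_TYPES = {"None", "LDAP", "Kerberos"}


@dataclass
class DataSourceConfig:
    """Hypothetical mirror of the Add Data Source form (not a platform API)."""
    name: str
    source_type: str
    host: str
    port: int
    database: str
    username: str = ""
    password: str = ""
    auth_type: str = "None"

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the form looks complete."""
        problems = []
        if not self.name.strip():
            problems.append("Data Source Name is empty")
        if self.source_type not in SOURCE_TYPES:
            problems.append(f"unknown Data Source Type: {self.source_type!r}")
        if not self.host.strip():
            problems.append("Server Address is empty")
        if not 1 <= self.port <= 65535:
            problems.append(f"Port out of range: {self.port}")
        if self.auth_type not in AUTH_TYPES:
            problems.append(f"unknown Auth Type: {self.auth_type!r}")
        return problems


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Check only that a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `validate()` before submitting the form, and `can_reach(cfg.host, cfg.port)` from a machine on the same network, can help separate typos in the form from firewall or routing problems.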

Creating a Data Collection Task

1. After a successful connection, the system automatically queries all tables in the database.
2. Select the tables and fields to import.
3. Configure the data destination (target dataset) and branch (created automatically if it doesn't exist).
4. Choose an execution mode:
   - Execute Immediately: Run as soon as submitted
   - Scheduled Task: Set a specific time for automatic execution
   - Save Configuration Only: Trigger manually later
5. Click Save and Execute or Save Configuration.
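
To make the three execution modes concrete, here is a minimal sketch of how a task definition might be modeled. Everything in it is an assumption for illustration: `CollectionTask`, `ExecutionMode`, and `should_run_now` are hypothetical names, and the platform's actual scheduling logic is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional


class ExecutionMode(Enum):
    IMMEDIATE = "Execute Immediately"
    SCHEDULED = "Scheduled Task"
    SAVE_ONLY = "Save Configuration Only"


@dataclass
class CollectionTask:
    """Hypothetical model of a data collection task (not a platform API)."""
    name: str
    tables: Dict[str, List[str]]  # table name -> selected fields
    target_dataset: str
    branch: str
    mode: ExecutionMode
    run_at: Optional[datetime] = None  # only meaningful for SCHEDULED

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the task is complete."""
        problems = []
        if not self.tables:
            problems.append("no tables selected")
        for table, fields in self.tables.items():
            if not fields:
                problems.append(f"no fields selected for table {table!r}")
        if not self.target_dataset:
            problems.append("no target dataset")
        if not self.branch:
            problems.append("no branch")
        if self.mode is ExecutionMode.SCHEDULED and self.run_at is None:
            problems.append("Scheduled Task needs a run time")
        return problems


def should_run_now(task: CollectionTask, now: datetime) -> bool:
    """Decide whether a saved task would start at the given moment."""
    if task.mode is ExecutionMode.IMMEDIATE:
        return True
    if task.mode is ExecutionMode.SCHEDULED:
        return task.run_at is not None and now >= task.run_at
    return False  # SAVE_ONLY waits for a manual trigger
```

The point of the sketch is that only Scheduled Task carries a time, and Save Configuration Only never starts on its own.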

Viewing Task Status

In the Data Collection Tasks list, each task shows:

Field	Description
Task Name	Custom task identifier
Data Source Type	Type of the source (MySQL, MongoDB, File Data, or Hive)
Connection Status	Normal / Pending Test / Anomaly
Last Updated	Most recent sync time
Records Imported / Total	Data import progress

Click Details to view task configuration, or Logs to review execution output and errors.
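
As a rough illustration of how the status fields relate, the sketch below turns the Records Imported / Total pair into a percentage and flags rows whose connection status is not Normal. `TaskRow`, `progress_percent`, and `needs_attention` are hypothetical names; the platform may compute or display these values differently.

```python
from dataclasses import dataclass


@dataclass
class TaskRow:
    """Hypothetical copy of one row in the Data Collection Tasks list."""
    name: str
    connection_status: str  # "Normal", "Pending Test", or "Anomaly"
    imported: int
    total: int


def progress_percent(row: TaskRow) -> float:
    """Import progress as a percentage, clamped to the range 0..100."""
    if row.total <= 0:
        return 0.0
    done = max(0, min(row.imported, row.total))
    return done / row.total * 100


def needs_attention(row: TaskRow) -> bool:
    """True when the connection has not been verified as Normal."""
    return row.connection_status != "Normal"
```

For a row with 250 of 1000 records imported, `progress_percent` returns 25.0. A row marked Anomaly is a natural candidate for opening Logs first.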

Viewing Import Results

After a collection task completes successfully, the imported data files are available under Personal Datasets on the platform.

Note

Ensure you have created a target dataset repository on the platform before running data collection. If the specified destination branch doesn't exist, it will be created automatically.