AI Compute Management Platform: Simplify Complexity, 3–10x Better Utilization
Pooling and virtualizing heterogeneous GPU/CPU compute can improve resource utilization by 3–10x. AI clusters are scheduled and managed in one place, bringing the underlying hardware under platform-level management.
The platform is fully compatible with domestic GPU/CPU/NPU hardware, providing the foundation for a secure and controllable local compute infrastructure.
Four Core Values to Fully Unleash AI Compute Potential
-
Compute Pooling for Maximum Utilization
Pooling and virtualizing heterogeneous GPU/CPU compute improves utilization by 3–10x.
- Heterogeneous GPU virtualization supporting NVIDIA, Cambricon, Huawei Ascend, Iluvatar, and more
- Unified compute resource pooling to manage GPU clusters, CPU clusters, and file storage
- Native Kubernetes integration for seamless compute resource and service management
- Harbor image registry integration, with out-of-the-box application management through Helm charts
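
To make the utilization claim concrete, here is a minimal sketch comparing whole-card allocation with pooled, sub-card allocation. It is not the platform's scheduler: the function names, the first-fit-decreasing packing strategy, and the sample workload are all assumptions chosen for illustration, and requests are expressed as integer percentages of one card to keep the arithmetic exact.

```python
from typing import List

CARD_CAPACITY = 100  # one card = 100 percent (illustrative unit)


def cards_needed_exclusive(requests: List[int]) -> int:
    """Each job occupies a whole card, however little of it the job uses."""
    return len(requests)


def cards_needed_pooled(requests: List[int]) -> int:
    """Pack fractional requests onto shared cards (first-fit decreasing)."""
    free: List[int] = []  # remaining capacity per card, in percent
    for req in sorted(requests, reverse=True):
        for i, cap in enumerate(free):
            if req <= cap:
                free[i] = cap - req
                break
        else:
            # No existing card has room, so open a new one.
            free.append(CARD_CAPACITY - req)
    return len(free)


# Hypothetical workload: eight jobs, each needing part of a card.
requests = [25, 25, 50, 10, 40, 50, 25, 25]
exclusive = cards_needed_exclusive(requests)  # 8 cards
pooled = cards_needed_pooled(requests)        # 3 cards

util_exclusive = sum(requests) / (exclusive * CARD_CAPACITY)
util_pooled = sum(requests) / (pooled * CARD_CAPACITY)
print(f"exclusive: {exclusive} cards, {util_exclusive:.0%} utilized")
print(f"pooled:    {pooled} cards, {util_pooled:.0%} utilized")
```

For this made-up workload, pooling serves the same jobs on 3 cards instead of 8. The actual gain depends on how small and how varied real requests are.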
-
On-Demand Allocation and Flexible Scheduling
Supports compute oversubscription and fine-grained slicing, with flexible allocation by tenant, department, or project and elastic scaling on demand.
- Multi-tenant isolation with standard authentication, custom authentication, and multi-tenant permission management
- Fine-grained compute slicing with GPU scheduling at sub-card granularity
- Elastic scaling to dynamically adjust compute resources based on workload demands
- Comprehensive cluster management covering nodes, groups, containers, storage, monitoring, and logging
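
Allocation by tenant and project with sub-card granularity can be pictured as a quota ledger. The sketch below is a simplified assumption, not the platform's API: `QuotaLedger`, its methods, and the tenant and project names are hypothetical, and quotas are counted in percent of one card.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Key = Tuple[str, str]  # (tenant, project)


@dataclass
class QuotaLedger:
    """Tracks sub-card GPU usage against a per-(tenant, project) quota."""
    quotas: Dict[Key, int]                      # limit, in percent of a card
    used: Dict[Key, int] = field(default_factory=dict)

    def allocate(self, tenant: str, project: str, amount: int) -> bool:
        """Grant the request only if it stays within the quota."""
        key = (tenant, project)
        current = self.used.get(key, 0)
        if amount <= 0 or current + amount > self.quotas.get(key, 0):
            return False
        self.used[key] = current + amount
        return True

    def release(self, tenant: str, project: str, amount: int) -> None:
        """Return capacity to the pool when a workload scales down."""
        key = (tenant, project)
        self.used[key] = max(0, self.used.get(key, 0) - amount)


# Hypothetical tenant with a quota of 1.5 cards.
ledger = QuotaLedger(quotas={("research", "nlp"): 150})
first = ledger.allocate("research", "nlp", 100)   # True: 100 <= 150
second = ledger.allocate("research", "nlp", 75)   # False: 175 > 150
ledger.release("research", "nlp", 50)             # scale down to 50
third = ledger.allocate("research", "nlp", 75)    # True: 125 <= 150
```

Keeping the ledger separate from placement means the same check works whether the capacity comes from one card or from slices spread across several.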
-
Metering, Billing, and Fine-grained Cost Management
Supports compute usage monitoring and cost accounting to help enterprises manage IT costs with precision.
- Real-time compute usage monitoring for accurate tracking of GPU/CPU resource consumption
- Multi-dimensional billing with flexible statistics by tenant, project, and time period
- Cost visualization for intuitive display of compute usage and cost allocation across departments
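
Multi-dimensional billing amounts to grouping usage records by the chosen dimensions and applying a rate. The following is a minimal sketch under stated assumptions: the `UsageRecord` fields, the `bill` function, the rates, and the sample records are all invented for illustration and do not describe the platform's data model or pricing.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class UsageRecord:
    tenant: str
    project: str
    day: str              # ISO date, e.g. "2024-01-01"
    gpu_hours: float
    cpu_core_hours: float


def bill(
    records: Iterable[UsageRecord],
    gpu_rate: float,
    cpu_rate: float,
    by: Tuple[str, ...] = ("tenant",),
) -> Dict[Tuple[str, ...], float]:
    """Sum cost per group, where `by` names the record fields to group on."""
    totals: Dict[Tuple[str, ...], float] = defaultdict(float)
    for r in records:
        key = tuple(getattr(r, dim) for dim in by)
        totals[key] += r.gpu_hours * gpu_rate + r.cpu_core_hours * cpu_rate
    return dict(totals)


# Made-up sample data and rates.
records = [
    UsageRecord("research", "nlp", "2024-01-01", 4.0, 8.0),
    UsageRecord("research", "vision", "2024-01-01", 2.0, 0.0),
    UsageRecord("ops", "infra", "2024-01-02", 0.0, 10.0),
]
by_tenant = bill(records, gpu_rate=2.0, cpu_rate=0.5)
by_project_day = bill(records, gpu_rate=2.0, cpu_rate=0.5,
                      by=("tenant", "project", "day"))
print(by_tenant)  # {('research',): 16.0, ('ops',): 5.0}
```

Passing a different `by` tuple gives the per-project or per-period view from the same records, which is what a cost dashboard would plot.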
-
Intelligent Scheduling and Operations
Distributed scheduling at thousand-card scale, with built-in job queues, service deployment, monitoring and alerting, and log management.
- Thousand-card distributed scheduling supporting large-scale parallel AI training jobs
- Built-in job queues with priority scheduling and resource reservation policies
- Monitoring and alerting for real-time cluster health awareness and automatic fault notification
- Log management for unified collection and retrieval of container logs to quickly locate issues
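
Priority scheduling with resource reservation can be sketched as a priority queue plus a rule that keeps some cards out of reach of low-priority jobs. This is an illustrative toy, not the platform's scheduler: `Job`, `JobQueue`, the reservation rule, and the job names are assumptions, and a real scheduler would also handle preemption, gang scheduling, and starvation of large jobs.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Job:
    name: str
    priority: int  # higher runs first
    cards: int     # whole cards requested


class JobQueue:
    def __init__(self, total_cards: int, reserved_cards: int,
                 reserved_min_priority: int) -> None:
        self.free = total_cards
        # Jobs below `reserved_min_priority` must leave this many cards free.
        self.reserved = reserved_cards
        self.reserved_min_priority = reserved_min_priority
        self._heap: List[Tuple[int, int, Job]] = []
        self._seq = itertools.count()  # tie-breaker: FIFO within a priority

    def submit(self, job: Job) -> None:
        heapq.heappush(self._heap, (-job.priority, next(self._seq), job))

    def dispatch(self) -> List[str]:
        """Start queued jobs in priority order; jobs that do not fit wait."""
        started: List[str] = []
        waiting: List[Tuple[int, int, Job]] = []
        while self._heap:
            entry = heapq.heappop(self._heap)
            job = entry[2]
            if job.priority >= self.reserved_min_priority:
                usable = self.free
            else:
                usable = self.free - self.reserved
            if job.cards <= usable:
                self.free -= job.cards
                started.append(job.name)
            else:
                waiting.append(entry)
        for entry in waiting:
            heapq.heappush(self._heap, entry)
        return started


queue = JobQueue(total_cards=8, reserved_cards=2, reserved_min_priority=10)
queue.submit(Job("train-large", priority=5, cards=4))
queue.submit(Job("batch-small", priority=1, cards=2))
queue.submit(Job("urgent-infer", priority=10, cards=2))
started = queue.dispatch()
print(started)  # ['urgent-infer', 'train-large']
```

Here `batch-small` stays queued even though 2 cards are free, because those cards are held back for jobs at priority 10 or above.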