GPU Card Management

Overview

The GPU Card Management page provides a unified view of all physical GPU cards in the cluster. You can filter cards by node, vendor, model, and status, and view the resource allocation and usage of each card.

Viewing the GPU Card List

In the left navigation pane, select Node Management → GPU Cards to access the cluster GPU card list.

Overview Panel

The top of the page displays the cluster GPU resource status:

Total GPU Count: Total number of physical GPU cards in the cluster
Health Status Distribution: Pie chart showing the ratio of healthy to sub-healthy GPU cards
Vendor/Model Distribution: GPU card counts by vendor and model
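
The three overview figures are plain aggregations over the card list. The following is a minimal sketch of that aggregation, not the product's implementation: the `GpuCard` record, its field names, and the `summarize` helper are hypothetical, and the sample UUIDs are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuCard:
    """Hypothetical record mirroring a few columns of the GPU card table."""
    uuid: str
    vendor: str
    model: str
    status: str  # "Healthy" or "Sub-healthy", as shown in the Status column


def summarize(cards):
    """Compute the overview figures: total count, status split, vendor/model split."""
    return {
        "total": len(cards),
        "by_status": Counter(card.status for card in cards),
        "by_vendor_model": Counter((card.vendor, card.model) for card in cards),
    }


# Illustrative data only; real UUIDs come from the cluster.
cards = [
    GpuCard("GPU-aaa", "NVIDIA", "A100", "Healthy"),
    GpuCard("GPU-bbb", "NVIDIA", "A100", "Sub-healthy"),
    GpuCard("GPU-ccc", "Huawei Ascend", "910B", "Healthy"),
]

summary = summarize(cards)
print(summary["total"])
print(dict(summary["by_status"]))
print(dict(summary["by_vendor_model"]))
```

The status counts would feed the pie chart, and the `(vendor, model)` counts would feed the distribution panel.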

GPU Card Table

The table displays detailed information for each GPU card:

Column	Description
Node	Name of the compute node where the GPU is installed; click to navigate to node details
Card Index	GPU index number within the node
Card UUID	Globally unique GPU identifier; click to navigate to GPU card details
Status	Healthy or Sub-healthy
Model	GPU model, e.g., A100, V100, 910B
NUMA	NUMA node number the GPU belongs to
Vendor	GPU vendor: NVIDIA, Huawei Ascend, Cambricon, etc.
Virtualization Mode	Current GPU virtualization running mode
vGPU	vGPU used / total
Compute	GPU compute used / total
Memory	VRAM used / total (GiB)

Filtering and Search

The following filters are supported:

Filter by Node: Select a specific node from the dropdown
Search by UUID: Enter a GPU UUID for exact matching
Label Selector: Filter using Kubernetes labels
Filter by Vendor: Select NVIDIA, Huawei Ascend, Cambricon, etc.
Filter by Model: Enter GPU model keywords
Filter by Status: Healthy / Sub-healthy
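
The filters above differ in how they match: UUID search is exact, while model filtering works on keywords. The sketch below illustrates those semantics under stated assumptions. It assumes that active filters are combined with AND, that model keywords match as case-insensitive substrings, that labels are available on each card record, and that the label selector uses only the equality form `key=value` (Kubernetes selectors also support other operators, which are not handled here). `GpuCard`, `parse_selector`, and `filter_cards` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class GpuCard:
    """Hypothetical record; field names are illustrative, not the product's schema."""
    node: str
    uuid: str
    vendor: str
    model: str
    status: str
    labels: dict = field(default_factory=dict)


def parse_selector(selector):
    """Parse an equality-based selector such as 'zone=a,tier=gpu' into a dict."""
    pairs = {}
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def filter_cards(cards, node=None, uuid=None, selector=None,
                 vendor=None, model=None, status=None):
    """Return the cards that satisfy every filter that is set (logical AND)."""
    wanted = parse_selector(selector) if selector else {}
    result = []
    for card in cards:
        if node is not None and card.node != node:
            continue
        if uuid is not None and card.uuid != uuid:  # exact match only
            continue
        if vendor is not None and card.vendor != vendor:
            continue
        if status is not None and card.status != status:
            continue
        if model is not None and model.lower() not in card.model.lower():
            continue  # keyword match: case-insensitive substring (assumption)
        if any(card.labels.get(k) != v for k, v in wanted.items()):
            continue
        result.append(card)
    return result


# Illustrative data only.
cards = [
    GpuCard("node-1", "GPU-aaa", "NVIDIA", "A100", "Healthy", {"zone": "a"}),
    GpuCard("node-1", "GPU-bbb", "NVIDIA", "A100", "Sub-healthy", {"zone": "a"}),
    GpuCard("node-2", "GPU-ccc", "Huawei Ascend", "910B", "Healthy", {"zone": "b"}),
]

nvidia_a100 = filter_cards(cards, vendor="NVIDIA", model="a100")
zone_b = filter_cards(cards, selector="zone=b")
partial_uuid = filter_cards(cards, uuid="GPU-aa")

print([card.uuid for card in nvidia_a100])
print([card.uuid for card in zone_b])
print([card.uuid for card in partial_uuid])
```

In this sketch a partial UUID such as `GPU-aa` returns no rows, which is the practical consequence of exact matching: paste the full UUID when searching.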

Viewing GPU Card Details

Click a GPU card's UUID in the list to navigate to its detail page.

Basic Attributes

Attribute	Description
Status	Current health status
Node	Physical node where the GPU is installed
Card UUID	Globally unique identifier
Card Index	Index within the node
Model	GPU model
Vendor	GPU vendor
NUMA	NUMA node number
Virtualization Mode	Virtualization mode inherited from the node
Total Memory	Physical VRAM size (GiB)
Driver Version	GPU driver version (if available)
Max Device Split Count	Maximum number of concurrent tasks the card can run after vGPU slicing (if configured)

Resource Summary

Displays current GPU card resource usage:

Compute: Used / Total
Memory: Used / Total (GiB)
vGPU: Allocated / Total

Detail Tabs

Tab	Content
Running Status	Real-time monitoring charts for GPU utilization, memory usage, allocation rate, power, and temperature, plus GPU topology
Pods	List of all Pods currently scheduled to this GPU
Monitoring	Historical monitoring metric charts with custom time range support
Events	Kubernetes events related to this GPU

GPU Cards in Node Details

In Node Management → Cluster Nodes, click a node to enter its detail page. For nodes equipped with GPUs, the running status page displays:

Total GPUs: Number of GPU cards on the node
Total VRAM: Combined VRAM across all GPUs on the node
Total vGPUs: Total vGPU count on the node
GPU Card List: Status, model, vGPU/compute/memory usage for each GPU card
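
The node-level totals can be read as sums over the cards installed on that node. The sketch below shows one way to derive them from per-card records. It is an illustration under assumptions, not the product's implementation: the `GpuCard` fields and the `node_totals` helper are hypothetical, and the sample capacities are invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GpuCard:
    """Hypothetical per-card record; field names are illustrative."""
    node: str
    uuid: str
    memory_total_gib: float  # physical VRAM of the card, in GiB
    vgpu_total: int          # number of vGPUs the card is split into


def node_totals(cards):
    """Group cards by node and sum GPU count, VRAM, and vGPU count."""
    totals = defaultdict(lambda: {"gpus": 0, "vram_gib": 0.0, "vgpus": 0})
    for card in cards:
        entry = totals[card.node]
        entry["gpus"] += 1
        entry["vram_gib"] += card.memory_total_gib
        entry["vgpus"] += card.vgpu_total
    return dict(totals)


# Illustrative data only.
cards = [
    GpuCard("node-1", "GPU-aaa", 80.0, 10),
    GpuCard("node-1", "GPU-bbb", 80.0, 10),
    GpuCard("node-2", "GPU-ccc", 64.0, 4),
]

totals = node_totals(cards)
for node, entry in sorted(totals.items()):
    print(node, entry)
```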

Pod Scheduling Information

For Pods that use GPU resources, the Pod detail page shows "Scheduled to GPU" information, including:

Allocated GPU UUID with a link to the physical GPU card
Allocated vendor type
Allocated VRAM size
Allocated compute percentage
