
    GPU Card Management

    Overview

    The GPU Card Management page provides a unified view of all physical GPU cards in the cluster. You can filter the cards by node, vendor, model, and status, and view the resource allocation and usage of each card.

    Viewing the GPU Card List

    In the left navigation pane, select Node Management → GPU Cards to access the cluster GPU card list.

    Overview Panel

    The top of the page displays the cluster GPU resource status:

    • Total GPU Count: Total number of physical GPU cards in the cluster
    • Health Status Distribution: Pie chart showing the ratio of healthy to sub-healthy GPU cards
    • Vendor/Model Distribution: GPU card counts by vendor and model
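
    The overview figures are simple aggregations over the card list. The following is a minimal Python sketch of that idea, not the product's implementation: it assumes each card is a plain dictionary with hypothetical status, vendor, and model keys, and the sample records are illustrative only.

```python
from collections import Counter


def summarize_cards(cards):
    """Aggregate GPU card records into overview-style figures.

    Each record is assumed (hypothetically) to be a dict with
    'status', 'vendor', and 'model' keys.
    """
    return {
        "total": len(cards),
        "by_status": Counter(c["status"] for c in cards),
        "by_vendor_model": Counter((c["vendor"], c["model"]) for c in cards),
    }


# Illustrative sample data, not real cluster output.
cards = [
    {"status": "Healthy", "vendor": "NVIDIA", "model": "A100"},
    {"status": "Healthy", "vendor": "NVIDIA", "model": "A100"},
    {"status": "Sub-healthy", "vendor": "Huawei Ascend", "model": "910B"},
]

summary = summarize_cards(cards)
print(summary["total"])
print(dict(summary["by_status"]))
print(dict(summary["by_vendor_model"]))
```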

    GPU Card Table

    The table displays detailed information for each GPU card:

    Column               Description
    Node                 Name of the compute node where the GPU is installed; click to navigate to node details
    Card Index           GPU index number within the node
    Card UUID            Globally unique GPU identifier; click to navigate to GPU card details
    Status               Healthy or Sub-healthy
    Model                GPU model, e.g., A100, V100, 910B
    NUMA                 NUMA node number the GPU belongs to
    Vendor               GPU vendor: NVIDIA, Huawei Ascend, Cambricon, etc.
    Virtualization Mode  Current GPU virtualization running mode
    vGPU                 vGPU used / total
    Compute              GPU compute used / total
    Memory               VRAM used / total (GiB)

    The following filters are supported:

    • Filter by Node: Select a specific node from the dropdown
    • Search by UUID: Enter a GPU UUID for exact matching
    • Label Selector: Filter using Kubernetes labels
    • Filter by Vendor: Select NVIDIA, Huawei Ascend, Cambricon, etc.
    • Filter by Model: Enter GPU model keywords
    • Filter by Status: Healthy / Sub-healthy
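
    The filters above combine as a logical AND: a card is listed only if it satisfies every filter that is set. A minimal Python sketch of that behavior follows. It is an illustration under stated assumptions rather than the page's actual logic: the record keys (node, uuid, vendor, model, status, labels) and the sample UUIDs are hypothetical, the model filter is assumed to be a case-insensitive substring match, and the label selector handles only the equality form key=value, whereas Kubernetes selectors also support inequality and set-based expressions.

```python
def parse_selector(selector):
    """Parse an equality-based selector such as 'a=b,c=d' into a dict."""
    pairs = {}
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs


def filter_cards(cards, node=None, uuid=None, selector=None,
                 vendor=None, model=None, status=None):
    """Return the cards that satisfy every filter that is not None."""
    wanted = parse_selector(selector) if selector else {}
    result = []
    for card in cards:
        if node is not None and card["node"] != node:
            continue
        if uuid is not None and card["uuid"] != uuid:  # exact match
            continue
        if vendor is not None and card["vendor"] != vendor:
            continue
        # Assumption: model keywords match as a case-insensitive substring.
        if model is not None and model.lower() not in card["model"].lower():
            continue
        if status is not None and card["status"] != status:
            continue
        labels = card.get("labels", {})
        if any(labels.get(k) != v for k, v in wanted.items()):
            continue
        result.append(card)
    return result


# Illustrative sample data; keys, UUIDs, and labels are hypothetical.
cards = [
    {"node": "node-1", "uuid": "GPU-0001", "vendor": "NVIDIA",
     "model": "A100", "status": "Healthy", "labels": {"pool": "train"}},
    {"node": "node-1", "uuid": "GPU-0002", "vendor": "NVIDIA",
     "model": "V100", "status": "Sub-healthy", "labels": {"pool": "infer"}},
    {"node": "node-2", "uuid": "GPU-0003", "vendor": "Huawei Ascend",
     "model": "910B", "status": "Healthy", "labels": {"pool": "train"}},
]

matches = filter_cards(cards, selector="pool=train", status="Healthy")
print([card["uuid"] for card in matches])
```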

    Viewing GPU Card Details

    Click a GPU card's UUID in the list to navigate to its detail page.

    Basic Attributes

    Attribute               Description
    Status                  Current health status
    Node                    Physical node where the GPU is installed
    Card UUID               Globally unique identifier
    Card Index              Index within the node
    Model                   GPU model
    Vendor                  GPU vendor
    NUMA                    NUMA node number
    Virtualization Mode     Virtualization mode inherited from the node
    Total Memory            Physical VRAM size (GiB)
    Driver Version          GPU driver version (if available)
    Max Device Split Count  Maximum concurrent tasks after vGPU slicing (if configured)

    Resource Summary

    Displays current GPU card resource usage:

    • Compute: Used / Total
    • Memory: Used / Total (GiB)
    • vGPU: Allocated / Total

    Detail Tabs

    Tab             Content
    Running Status  Real-time monitoring charts for GPU utilization, memory usage, allocation rate, power, and temperature, plus GPU topology
    Pods            List of all Pods currently scheduled to this GPU
    Monitoring      Historical monitoring metric charts with custom time range support
    Events          Kubernetes events related to this GPU

    GPU Cards in Node Details

    In Node Management → Cluster Nodes, click a node to enter its detail page. For nodes equipped with GPUs, the running status page displays:

    • Total GPUs: Number of GPU cards on the node
    • Total VRAM: Combined VRAM across all GPUs on the node
    • Total vGPUs: Total vGPU count on the node
    • GPU Card List: Status, model, vGPU/compute/memory usage for each GPU card
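
    The node-level totals above can be thought of as a roll-up of the per-card values. A minimal Python sketch, assuming hypothetical record keys (node, memory_total_gib, vgpu_total) and illustrative sample values; how the product actually computes these figures is not documented here.

```python
from collections import defaultdict


def node_totals(cards):
    """Roll up per-card values into per-node totals.

    Each record is assumed (hypothetically) to carry 'node',
    'memory_total_gib', and 'vgpu_total' keys.
    """
    totals = defaultdict(lambda: {"gpus": 0, "vram_gib": 0, "vgpus": 0})
    for card in cards:
        entry = totals[card["node"]]
        entry["gpus"] += 1
        entry["vram_gib"] += card["memory_total_gib"]
        entry["vgpus"] += card["vgpu_total"]
    return dict(totals)


# Illustrative sample data, not real cluster output.
cards = [
    {"node": "node-1", "memory_total_gib": 80, "vgpu_total": 10},
    {"node": "node-1", "memory_total_gib": 80, "vgpu_total": 10},
    {"node": "node-2", "memory_total_gib": 32, "vgpu_total": 4},
]

totals = node_totals(cards)
for node, entry in sorted(totals.items()):
    print(node, entry)
```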

    Pod Scheduling Information

    For Pods that use GPU resources, the Pod detail page shows "Scheduled to GPU" information, including:

    • Allocated GPU UUID with a link to the physical GPU card
    • Allocated vendor type
    • Allocated VRAM size
    • Allocated compute percentage