< img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=3131724&fmt=gif" />
Last updated:

    Features

    Overview

    Kube AI Hub is a heterogeneous compute management platform designed for GPU/AI workloads. It covers the full management lifecycle from hardware onboarding to workload delivery, with Kubernetes as its core engine. Through a unified web console, the platform provides visual operations across three major product modules: the Compute Platform, the Industry AI Model Platform, and System Management.

    Compute Platform

    Resource Overview

    Core metrics display for all cluster compute resources, including compute/VRAM allocation and utilization, node count, GPU card count, total CPU, total memory, total VRAM, container-level resource usage, and monitoring alerts.

    Node Management

    Cluster Node Management

    Monitor and manage compute node status. Supports enabling or disabling scheduling on nodes, viewing physical GPU cards and GPU card topology, configuring GPU virtualization modes, and monitoring all tasks running on each node.

    Sharing/Slicing Policies

    Configure GPU sharing and slicing policies at the node level, setting vGPU count, compute power, and VRAM sharing/overcommit ratios for all GPU cards on a node.

    GPU Card Management

    Monitor physical GPU card status, including allocation and usage information, and view all tasks running on each physical GPU card.

    Service Components

    Displays the health status and runtime metrics of Kubernetes core service components (API Server, Scheduler, Controller Manager, etcd, etc.), helping ops teams quickly identify component-level failures.

    Projects (Namespace Management)

    Project Management

    Organize cluster resources by project, view project resources and pods, and assign projects to tenants.

    Resource Quotas

    A mechanism to limit GPU and other resource usage, ensuring fair resource allocation and preventing overuse. Supports setting quotas for GPU, CPU, memory, pods, and jobs.

    App Workloads

    Workloads

    Manages cluster-level workloads, including:

    • Deployments: Stateless application management with rolling upgrades, rollback, and autoscaling (HPA)
    • StatefulSets: Stateful service management with stable Pod identity and persistent storage guarantees
    • DaemonSets: Ensures a specified Pod runs on every node, suitable for node monitoring and log collection

    Jobs and CronJobs

    • Jobs: Manage one-time batch tasks with configurable parallelism and completion policies
    • CronJobs: Periodically triggered batch tasks based on Cron expressions, suitable for data cleanup and scheduled reporting

    Pods

    View pods across all namespaces at cluster level, with filtering by node, status, and labels. Supports direct access to container terminals for command execution.

    Services

    Manages ClusterIP, NodePort, and LoadBalancer type Kubernetes services, displaying port mappings and backend Endpoint status.

    Routes (Ingress)

    Manages cluster-level Ingress rules, supporting HTTP/HTTPS routing based on domain names and paths.

    Configuration Management

    • Secrets: Securely store sensitive information (passwords, certificates, tokens) with RBAC-controlled access
    • ConfigMaps: Manage application configuration in key-value or file format
    • Service Accounts: Manage API access credentials for Pods

    Network Management

    • Network Policies (NetworkPolicy): Define Pod-to-Pod communication rules to enforce network-layer isolation
    • Pod IP Pools (IPPool): Manage Pod IP address allocation ranges with Calico integration support

    Storage Management

    • Persistent Volume Claims (PVC): Declare persistent storage needs with support for dynamic provisioning via StorageClass
    • Storage Classes: Define storage backend policies supporting Ceph, NFS, LocalPV, JuiceFS, and other storage solutions
    • Volume Snapshots: Create point-in-time snapshots of persistent volumes for quick recovery
    • Volume Snapshot Classes: Define snapshot drivers and deletion policies

    Custom Resources (CRDs)

    Provides a visual management interface for all Custom Resource Definitions (CRDs) in the cluster, available to platform administrators. Supports viewing custom resource instances and their details.

    Monitoring and Alerting

    Cluster Status Monitoring

    Provides cluster-level monitoring at second-level precision, including CPU utilization, memory utilization, load average, disk usage, I/O throughput, network traffic, and API Server request rates.

    GPU Resource Monitoring

    A monitoring view designed specifically for GPU compute scenarios. Displays real-time metrics per node and per GPU card: GPU health status, GPU utilization and VRAM usage, GPU allocation, and resource usage rankings.

    Enable spec.monitoring.gpu.enabled in ClusterConfiguration and let ks-installer reconcile before using this view. Deploy GPU metric exporters such as NVIDIA DCGM separately in the cluster. See the Monitoring FAQ.

    Alert Messages

    Displays alerts triggered in the cluster in real time, including node anomalies, Pod crashes, and storage alerts. Supports filtering by severity level (Critical, Warning, Info) with full alert history.

    Alert Rules

    Define custom alerting rules for nodes, workloads, and pods. Set metric thresholds and trigger conditions, configure alert repeat intervals and notification recipients. Supported notification channels include email, WeCom, DingTalk, and Slack.

    Log Management

    Log Receivers

    Configure unified log collection policies at the cluster level, supporting log storage backends including Elasticsearch, Kafka, and Fluentd. Multi-tenant log isolation is enforced — tenants can only view logs belonging to their own workloads.

    Log Query

    Provides log collection, query, and management capabilities from the user's perspective. Supports viewing and searching logs at the project, application, and container levels.

    Events and Audit

    Event Query

    Supports Kubernetes event queries. Events can be searched by message, tenant, project, resource type, resource name, reason, and category.

    Audit Log Query

    Records all activities by users, administrators, and other components that impact the system, generating events that can be queried by cluster, project, tenant, resource type, resource name, action, operator, and time range.

    Cluster Settings

    • Basic Information: View cluster name, Kubernetes version, node count, and other basic details
    • Visibility: Control which workspaces can access the cluster
    • Member Management: Manage cluster-level user and role assignments
    • Role Management: Define custom cluster-level access roles and permission sets
    • Log Receivers: Configure log collection backends in cluster settings
    • Gateway Settings: Manage cluster or project-level ingress gateways

    Tenant Management (Workspaces)

    Tenant (Workspace) List

    View the tenant list, edit tenant information, and add or remove tenants.

    Tenant Projects

    Manage projects under a tenant, view project resource usage, and create or delete projects.

    Application Management

    Provides Helm-based application lifecycle management. Tenant administrators can upload or create new application templates, perform rapid testing, and publish tested applications to the App Store for other tenants to deploy with one click.

    Tenant Settings

    Configure tenant resource quotas, manage tenant members, roles, and departments. Platform users can be added to tenants via invitation.

    Tenant User Management

    User Management

    Add and manage users within the organization, configure permissions, roles, and member operation privileges.

    Platform Roles

    Control user access to platform resources, including cluster management, workspace management, and platform user management.

    Resource Specifications

    Create and manage instance specifications including CPU, memory, disk, and GPU configurations. Users can select pre-defined specifications when creating container instances.

    Third-party Login

    Provides a built-in OAuth service, supporting external user access through multiple methods including LDAP, OIDC, CAS, and OAuth 2.0.

    Industry AI Model Platform

    Model Hub

    Model Creation

    Create model repositories with a card-style UI.

    Model Card Management

    Provides a card-style UI for model list management, enabling users to intuitively view model information with multi-criteria quick search support.

    Model Filtering

    Supports tag-based model card filtering (innovative version updates are in progress; the standard version is currently available).

    Model File Download

    Provides model file download capability for secondary customization and deployment.

    Model File Upload

    Supports 4 upload methods: clone repo source files, sync files from existing repos, and sync models from Hugging Face or ModelScope to internal repos.

    Model Integration

    Supports importing various large and small model types, including Qwen, DeepSeek, ChatGLM, Kimi, Llama, and other well-known domestic and international models, with ongoing expansion to more model platforms.

    Model Inference Management

    Public Model Inference Service

    Supports deploying and managing public model inference service platforms.

    Dedicated Model Instance Deployment

    Supports deploying user-dedicated model inference instances with compute resource management, service online/offline control, and deletion.

    Service Details

    Provides a dialog interface for conversation and model interaction, displaying API/SDK method documentation for calling model inference services.

    Service Invocation

    Supports one-click model service deployment, with API/SDK calling support for published model inference services.

    Model Training and Evaluation

    Model Fine-tuning

    The platform provides GPU-accelerated fine-tuning instance hosting services, supporting mainstream fine-tuning frameworks (LLaMA-Factory, MS-Swift). Users only need to select compute resources and datasets to quickly fine-tune large models without writing complex training code.

    Model Evaluation

    Supports visual interface model evaluation with three mainstream evaluation frameworks: lm-evaluation-harness, OpenCompass, and EvalScope. Supports custom evaluation datasets — users can upload their own datasets for model performance evaluation to meet specific business scenario assessment needs.

    Data Management

    Provides dataset collection creation, creation, deletion, and version management capabilities. Supports uploading or downloading datasets via Git or SDK.

    Development Tools

    Development Environment

    A unified interface for creating and managing Notebook instances. Users can directly use platform compute resources for data analysis, model training, and experimentation without manual environment setup.

    User Spaces

    Provides rapid service deployment through code repositories with compute resource configuration, offering three interactive development environments: JupyterLab, VS Code, and Eclipse Theia.

    System Management

    User Information Management

    Personal Information

    Supports user profile configuration, including viewing personal models, code, and user spaces.

    Account Settings

    Configure user Access Tokens and SSH Keys.

    Admin Console

    Resource Management Console

    Manage personal platform resources including Notebook instances, fine-tuning instances, dedicated inference instances, model evaluation tasks, and Notebook image configurations.

    Runtime Framework and Image Management

    Configure inference engines, model fine-tuning training, model evaluation, and Notebook image configurations.

    Compute Specification Management

    Supports system administrators in managing compute specifications through a card-style interface, including Notebook image configurations.