Last updated:

Features

Overview

Kube AI Hub is a heterogeneous compute management platform designed for GPU/AI workloads. It covers the full management lifecycle from hardware onboarding to workload delivery, with Kubernetes as its core engine. Through a unified web console, the platform provides visual operations across three major product modules: the Compute Platform, the Industry AI Model Platform, and System Management.

Compute Platform

Resource Overview

Core metrics display for all cluster compute resources, including compute/VRAM allocation and utilization, node count, GPU card count, total CPU, total memory, total VRAM, container-level resource usage, and monitoring alerts.

Node Management

Cluster Node Management

Monitor and manage compute node status. Supports enabling or disabling scheduling on nodes, viewing physical GPU cards and GPU card topology, configuring GPU virtualization modes, and monitoring all tasks running on each node.

Sharing/Slicing Policies

Configure GPU sharing and slicing policies at the node level, setting vGPU count, compute power, and VRAM sharing/overcommit ratios for all GPU cards on a node.

GPU Card Management

Monitor physical GPU card status, including allocation and usage information, and view all tasks running on each physical GPU card.

Service Components

Displays the health status and runtime metrics of Kubernetes core service components (API Server, Scheduler, Controller Manager, etcd, etc.), helping ops teams quickly identify component-level failures.

Projects (Namespace Management)

Project Management

Organize cluster resources by project, view project resources and pods, and assign projects to tenants.

Resource Quotas

A mechanism to limit GPU and other resource usage, ensuring fair resource allocation and preventing overuse. Supports setting quotas for GPU, CPU, memory, pods, and jobs.

App Workloads

Workloads

Manages cluster-level workloads, including:

Deployments: Stateless application management with rolling upgrades, rollback, and autoscaling (HPA)
StatefulSets: Stateful service management with stable Pod identity and persistent storage guarantees
DaemonSets: Ensures a specified Pod runs on every node, suitable for node monitoring and log collection

Jobs and CronJobs

Jobs: Manage one-time batch tasks with configurable parallelism and completion policies
CronJobs: Periodically triggered batch tasks based on Cron expressions, suitable for data cleanup and scheduled reporting

Pods

View pods across all namespaces at cluster level, with filtering by node, status, and labels. Supports direct access to container terminals for command execution.

Services

Manages ClusterIP, NodePort, and LoadBalancer type Kubernetes services, displaying port mappings and backend Endpoint status.

Routes (Ingress)

Manages cluster-level Ingress rules, supporting HTTP/HTTPS routing based on domain names and paths.

Configuration Management

Secrets: Securely store sensitive information (passwords, certificates, tokens) with RBAC-controlled access
ConfigMaps: Manage application configuration in key-value or file format
Service Accounts: Manage API access credentials for Pods

Network Management

Network Policies (NetworkPolicy): Define Pod-to-Pod communication rules to enforce network-layer isolation
Pod IP Pools (IPPool): Manage Pod IP address allocation ranges with Calico integration support

Storage Management

Persistent Volume Claims (PVC): Declare persistent storage needs with support for dynamic provisioning via StorageClass
Storage Classes: Define storage backend policies supporting Ceph, NFS, LocalPV, JuiceFS, and other storage solutions
Volume Snapshots: Create point-in-time snapshots of persistent volumes for quick recovery
Volume Snapshot Classes: Define snapshot drivers and deletion policies

Custom Resources (CRDs)

Provides a visual management interface for all Custom Resource Definitions (CRDs) in the cluster, available to platform administrators. Supports viewing custom resource instances and their details.

Monitoring and Alerting

Cluster Status Monitoring

Provides cluster-level monitoring at second-level precision, including CPU utilization, memory utilization, load average, disk usage, I/O throughput, network traffic, and API Server request rates.

GPU Resource Monitoring

A monitoring view designed specifically for GPU compute scenarios. Displays real-time metrics per node and per GPU card: GPU health status, GPU utilization and VRAM usage, GPU allocation, and resource usage rankings.

Enable spec.monitoring.gpu.enabled in ClusterConfiguration and let ks-installer reconcile before using this view. Deploy GPU metric exporters such as NVIDIA DCGM separately in the cluster. See the Monitoring FAQ.

Alert Messages

Displays alerts triggered in the cluster in real time, including node anomalies, Pod crashes, and storage alerts. Supports filtering by severity level (Critical, Warning, Info) with full alert history.

Alert Rules

Define custom alerting rules for nodes, workloads, and pods. Set metric thresholds and trigger conditions, configure alert repeat intervals and notification recipients. Supported notification channels include email, WeCom, DingTalk, and Slack.

Log Management

Log Receivers

Configure unified log collection policies at the cluster level, supporting log storage backends including Elasticsearch, Kafka, and Fluentd. Multi-tenant log isolation is enforced — tenants can only view logs belonging to their own workloads.

Log Query

Provides log collection, query, and management capabilities from the user's perspective. Supports viewing and searching logs at the project, application, and container levels.

Events and Audit

Event Query

Supports Kubernetes event queries. Events can be searched by message, tenant, project, resource type, resource name, reason, and category.

Audit Log Query

Records all activities by users, administrators, and other components that impact the system, generating events that can be queried by cluster, project, tenant, resource type, resource name, action, operator, and time range.

Cluster Settings

Basic Information: View cluster name, Kubernetes version, node count, and other basic details
Visibility: Control which workspaces can access the cluster
Member Management: Manage cluster-level user and role assignments
Role Management: Define custom cluster-level access roles and permission sets
Log Receivers: Configure log collection backends in cluster settings
Gateway Settings: Manage cluster or project-level ingress gateways

Tenant Management (Workspaces)

Tenant (Workspace) List

View the tenant list, edit tenant information, and add or remove tenants.

Tenant Projects

Manage projects under a tenant, view project resource usage, and create or delete projects.

Application Management

Provides Helm-based application lifecycle management. Tenant administrators can upload or create new application templates, perform rapid testing, and publish tested applications to the App Store for other tenants to deploy with one click.

Tenant Settings

Configure tenant resource quotas, manage tenant members, roles, and departments. Platform users can be added to tenants via invitation.

Tenant User Management

User Management

Add and manage users within the organization, configure permissions, roles, and member operation privileges.

Platform Roles

Control user access to platform resources, including cluster management, workspace management, and platform user management.

Resource Specifications

Create and manage instance specifications including CPU, memory, disk, and GPU configurations. Users can select pre-defined specifications when creating container instances.

Provides a built-in OAuth service, supporting external user access through multiple methods including LDAP, OIDC, CAS, and OAuth 2.0.

Industry AI Model Platform

Model Hub

Model Creation

Create model repositories with a card-style UI.

Model Card Management

Provides a card-style UI for model list management, enabling users to intuitively view model information with multi-criteria quick search support.

Model Filtering

Supports tag-based model card filtering (innovative version updates are in progress; the standard version is currently available).

Model File Download

Provides model file download capability for secondary customization and deployment.

Model File Upload

Supports 4 upload methods: clone repo source files, sync files from existing repos, and sync models from Hugging Face or ModelScope to internal repos.

Model Integration

Supports importing various large and small model types, including Qwen, DeepSeek, ChatGLM, Kimi, Llama, and other well-known domestic and international models, with ongoing expansion to more model platforms.

Model Inference Management

Public Model Inference Service

Supports deploying and managing public model inference service platforms.

Dedicated Model Instance Deployment

Supports deploying user-dedicated model inference instances with compute resource management, service online/offline control, and deletion.

Service Details

Provides a dialog interface for conversation and model interaction, displaying API/SDK method documentation for calling model inference services.

Service Invocation

Supports one-click model service deployment, with API/SDK calling support for published model inference services.

Model Training and Evaluation

Model Fine-tuning

The platform provides GPU-accelerated fine-tuning instance hosting services, supporting mainstream fine-tuning frameworks (LLaMA-Factory, MS-Swift). Users only need to select compute resources and datasets to quickly fine-tune large models without writing complex training code.

Model Evaluation

Supports visual interface model evaluation with three mainstream evaluation frameworks: lm-evaluation-harness, OpenCompass, and EvalScope. Supports custom evaluation datasets — users can upload their own datasets for model performance evaluation to meet specific business scenario assessment needs.

Data Management

Provides dataset collection creation, creation, deletion, and version management capabilities. Supports uploading or downloading datasets via Git or SDK.

Development Tools

Development Environment

A unified interface for creating and managing Notebook instances. Users can directly use platform compute resources for data analysis, model training, and experimentation without manual environment setup.

User Spaces

Provides rapid service deployment through code repositories with compute resource configuration, offering three interactive development environments: JupyterLab, VS Code, and Eclipse Theia.

System Management

User Information Management

Personal Information

Supports user profile configuration, including viewing personal models, code, and user spaces.

Account Settings

Configure user Access Tokens and SSH Keys.

Admin Console

Resource Management Console

Manage personal platform resources including Notebook instances, fine-tuning instances, dedicated inference instances, model evaluation tasks, and Notebook image configurations.

Runtime Framework and Image Management

Configure inference engines, model fine-tuning training, model evaluation, and Notebook image configurations.

Compute Specification Management

Supports system administrators in managing compute specifications through a card-style interface, including Notebook image configurations.

Previous : Kube AI Hub Ecosystem Tools Next : Architecture

What’s on this Page