Using Dedicated Inference Instances

Overview Page

After the instance is deployed and reaches the Running state, click the instance name to open the overview page, which displays the following information:

Field	Description
Inference API URL	The inference service URL provided by the instance, usable directly for API calls
Running Status	Current instance status (Running, Stopped, Error, etc.)
Inference Framework	The inference framework selected during creation (e.g., vLLM, SGLang, TGI)
Resource Config	Allocated GPU/CPU/memory resource specifications
Replica Count	Number of currently running inference service replicas

Playground Testing

The platform provides an interactive Playground (sandbox) for testing model inference without writing code:

1. On the instance details page, switch to the Playground tab.
2. Enter a prompt in the input box.
3. Adjust inference parameters (e.g., Temperature, Top-P, Max Tokens).
4. Click the Send button to view the model inference results.

Tip

The Playground is suitable for quick model validation and prompt debugging. For production use, call the API directly.
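When moving from the Playground to API calls, the same inference parameters can be carried over as request fields. The sketch below assumes the Playground's Temperature, Top-P, and Max Tokens settings correspond to the `temperature`, `top_p`, and `max_tokens` fields of an OpenAI-compatible chat completion request. The helper name `build_sampling_params` and its range checks are illustrative assumptions, not part of the platform.

```python
def build_sampling_params(temperature=0.7, top_p=1.0, max_tokens=512):
    """Return sampling settings as keyword arguments for a chat completion call.

    The range checks are assumptions for illustration; the limits actually
    accepted depend on the inference framework behind the instance.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {"temperature": temperature, "top_p": top_p, "max_tokens": max_tokens}


# Example: settings tuned in the Playground, reused for an API call.
params = build_sampling_params(temperature=0.2, top_p=0.9, max_tokens=256)
print(params)
```

The resulting dictionary can be unpacked into an OpenAI SDK call, for example `client.chat.completions.create(model=..., messages=..., **params)`.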

API Documentation

On the instance details page, switch to the API tab to view complete API documentation and multi-language code examples.

Python Example

from openai import OpenAI

# Point the OpenAI client at the instance's inference API URL.
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://<your-endpoint-url>/v1"
)

# Send a chat completion request with a system message and a user message.
response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    temperature=0.7,
    max_tokens=512
)

# Print the text of the first returned choice.
print(response.choices[0].message.content)

JavaScript Example

import OpenAI from "openai";

// Point the OpenAI client at the instance's inference API URL.
const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://<your-endpoint-url>/v1",
});

// Top-level await requires an ES module context.
const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello, please introduce yourself." },
  ],
  temperature: 0.7,
  max_tokens: 512,
});

// Print the text of the first returned choice.
console.log(response.choices[0].message.content);

cURL Example

curl -X POST "https://<your-endpoint-url>/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'

Note

Replace <your-endpoint-url> with the inference API URL shown on the overview page, YOUR_API_KEY with the API key generated by the platform, and your-model-name with the name of the model served by the instance.
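To keep the endpoint URL and API key out of source code, the placeholders can be filled from environment variables. The sketch below uses only the Python standard library and builds the same request as the cURL example without sending it. The variable names `INFERENCE_BASE_URL` and `INFERENCE_API_KEY` and the helper `build_chat_request` are hypothetical, chosen here for illustration.

```python
import json
import os
import urllib.request


def build_chat_request(base_url, api_key, model, user_message):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical environment variable names; the fallbacks are the page's placeholders.
base_url = os.environ.get("INFERENCE_BASE_URL", "https://<your-endpoint-url>/v1")
api_key = os.environ.get("INFERENCE_API_KEY", "YOUR_API_KEY")

request = build_chat_request(base_url, api_key, "your-model-name", "Hello!")
print(request.get_method(), request.full_url)
```

To send the request, pass it to `urllib.request.urlopen(request)` and decode the JSON response body.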

Real-Time Monitoring

On the instance details page, switch to the Analysis tab to view real-time performance metrics:

Metric	Description
CPU Utilization	CPU usage percentage for each replica
GPU Utilization	GPU compute core usage
Memory Usage	System memory consumption
VRAM Usage	GPU memory allocation and usage
Inference Latency	Average response time from request to result
Throughput	Number of inference requests processed per second

Monitoring data helps assess service health and informs resource scaling decisions.
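For comparison with the dashboard, latency and throughput can also be estimated on the client side from recorded request timings. The sketch below assumes latency is the time from sending a request to receiving the full response, and throughput is completed requests divided by the observed time window. The platform's own definitions may differ, and the `RequestTiming` type and `summarize` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float  # seconds, when the request was sent
    end: float    # seconds, when the full response arrived


def summarize(timings):
    """Return (average latency in seconds, throughput in requests per second)."""
    if not timings:
        raise ValueError("no timings recorded")
    avg_latency = sum(t.end - t.start for t in timings) / len(timings)
    # Observed window: first request sent to last response received.
    window = max(t.end for t in timings) - min(t.start for t in timings)
    throughput = len(timings) / window if window > 0 else float("inf")
    return avg_latency, throughput


# Illustrative timings only, not measurements from a real instance.
timings = [RequestTiming(0.0, 0.5), RequestTiming(0.25, 1.0), RequestTiming(1.0, 2.0)]
avg_latency, throughput = summarize(timings)
print(f"average latency: {avg_latency:.2f} s, throughput: {throughput:.2f} req/s")
# prints: average latency: 0.75 s, throughput: 1.50 req/s
```

Client-side figures include network time, so they will typically be higher than latency measured inside the instance.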

View Runtime Logs

The Analysis tab also shows runtime logs for each replica, including model loading information, request processing records, and error messages. Use these logs to troubleshoot inference service issues.

View Billing Details

On the instance details page, switch to the Billing tab to view resource usage and cost breakdown:

Field	Description
Billing Start Time	When the instance started consuming compute resources
Billing End Time	When the instance was stopped or resources were released
Resource Spec	The compute configuration in use
Accumulated Cost	Total cost to date
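As a rough way to sanity-check the accumulated cost, the billed time can be multiplied by a price per hour. The sketch below assumes a flat hourly rate applied linearly to each billing interval; the actual pricing model, rounding rules, and rate are defined by the platform and are not stated here. The helper `accumulated_cost`, the dates, and the rate are all made-up illustration values.

```python
from datetime import datetime


def accumulated_cost(intervals, hourly_rate):
    """Sum billed hours over (start, end) intervals and multiply by hourly_rate.

    Assumes flat, linear pricing; this is an illustration, not the
    platform's billing formula.
    """
    total_seconds = sum((end - start).total_seconds() for start, end in intervals)
    return round(total_seconds / 3600 * hourly_rate, 2)


# Two billing intervals separated by a stop (illustrative values only).
intervals = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 12, 0)),    # 3 hours
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),  # 1.5 hours
]
print(accumulated_cost(intervals, hourly_rate=2.0))  # prints 9.0
```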

Stop and Delete Instances

Warning

Inference instances incur charges while running. To avoid unnecessary costs, stop instances when not in use. If the instance is no longer needed, delete it to permanently release resources. Deletion is irreversible.

Stop: Pauses the inference service and billing while preserving the instance configuration. The instance can be restarted at any time.
Delete: Permanently removes the instance and all associated resources. This action cannot be undone.