Using Dedicated Inference Instances
Overview Page
After the instance is deployed and reaches the Running state, click the instance name to open the overview page, which displays the following information:
| Field | Description |
|---|---|
| Inference API URL | The inference service URL provided by the instance, usable directly for API calls |
| Running Status | Current instance status (Running, Stopped, Error, etc.) |
| Inference Framework | The inference framework selected during creation (e.g., vLLM, SGLang, TGI) |
| Resource Config | Allocated GPU/CPU/memory resource specifications |
| Replica Count | Number of currently running inference service replicas |
Playground Testing
The platform provides an interactive Playground (sandbox) for testing model inference without writing code:
- On the instance details page, switch to the Playground tab.
- Enter a prompt in the input box.
- Adjust inference parameters (e.g., Temperature, Top-P, Max Tokens).
- Click the Send button to view the model inference results.
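Once a prompt and parameter set work well in the Playground, the same values can be carried over to an API request. The following is a minimal sketch, assuming the instance exposes an OpenAI-compatible chat completions API (as the code examples in this document suggest). The helper name `build_chat_payload` is hypothetical, and the value ranges follow common OpenAI API conventions; the deployed inference framework may accept different limits.

```python
def build_chat_payload(model, prompt, temperature=0.7, top_p=1.0, max_tokens=512):
    """Build a chat completions request body from Playground-style settings.

    The range checks below are assumptions based on OpenAI API conventions,
    not limits documented by the platform.
    """
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature is expected to be between 0 and 2")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p is expected to be in the range (0, 1]")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")

    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # Playground "Temperature"
        "top_p": top_p,              # Playground "Top-P"
        "max_tokens": max_tokens,    # Playground "Max Tokens"
    }


payload = build_chat_payload("your-model-name", "Hello", temperature=0.2, top_p=0.9)
print(payload)
```

The resulting dictionary can be sent as the JSON body of a chat completions request, or unpacked as keyword arguments to an OpenAI-compatible client.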
API Documentation
On the instance details page, switch to the API tab to view complete API documentation and multi-language code examples.
Python Example
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://<your-endpoint-url>/v1"
)

response = client.chat.completions.create(
    model="your-model-name",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
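Rather than hard-coding the key and URL, they can be read from the environment. The following is a minimal sketch: the variable names `INFERENCE_API_URL` and `INFERENCE_API_KEY` and the helpers `build_base_url` and `load_settings` are hypothetical, and it assumes the URL shown on the overview page may or may not already end in `/v1`.

```python
import os


def build_base_url(endpoint: str) -> str:
    """Normalize an endpoint so that it ends with exactly one '/v1' suffix."""
    url = endpoint.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # assumption: the service is reachable over HTTPS
    if not url.endswith("/v1"):
        url += "/v1"
    return url


def load_settings(env=os.environ) -> dict:
    """Read the API key and endpoint from (hypothetical) environment variables."""
    return {
        "api_key": env["INFERENCE_API_KEY"],
        "base_url": build_base_url(env["INFERENCE_API_URL"]),
    }


print(build_base_url("https://host.example/"))
```

The returned dictionary can be passed to the client as `OpenAI(**load_settings())`, which keeps the API key out of source control.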
JavaScript Example
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://<your-endpoint-url>/v1",
});

const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Hello, please introduce yourself." },
  ],
  temperature: 0.7,
  max_tokens: 512,
});

console.log(response.choices[0].message.content);
cURL Example
curl -X POST "https://<your-endpoint-url>/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-name",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
Note
Replace <your-endpoint-url> with the inference API URL shown on the overview page, and YOUR_API_KEY with the API key generated by the platform.
Real-Time Monitoring
On the instance details page, switch to the Analysis tab to view real-time performance metrics:
| Metric | Description |
|---|---|
| CPU Utilization | CPU usage percentage for each replica |
| GPU Utilization | GPU compute core usage |
| Memory Usage | System memory consumption |
| VRAM Usage | GPU memory allocation and usage |
| Inference Latency | Average response time from request to result |
| Throughput | Number of inference requests processed per second |
Monitoring data helps assess service health and informs resource scaling decisions.
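To illustrate how latency and throughput figures relate to a scaling decision, here is a rough sketch. It is not how the platform computes its metrics: the functions `summarize` and `replicas_needed`, the `headroom` parameter, and the sample timings are all hypothetical, and the replica estimate assumes throughput scales linearly with replica count.

```python
import math


def summarize(requests):
    """Return (average latency in seconds, throughput in requests/second).

    `requests` is a list of (start_time, end_time) pairs in seconds,
    for example timings recorded on the client side.
    """
    if not requests:
        raise ValueError("at least one request is required")
    latencies = [end - start for start, end in requests]
    window = max(end for _, end in requests) - min(start for start, _ in requests)
    if window <= 0:
        raise ValueError("the observation window must be longer than zero")
    return sum(latencies) / len(latencies), len(requests) / window


def replicas_needed(target_rps, per_replica_rps, headroom=0.2):
    """Estimate the replica count for a target load, with spare capacity.

    Assumes linear scaling across replicas, which real deployments
    only approximate.
    """
    return max(1, math.ceil(target_rps * (1 + headroom) / per_replica_rps))


avg_latency, throughput = summarize([(0.0, 0.5), (1.0, 1.3), (2.0, 2.4)])
print(f"average latency: {avg_latency:.2f} s, throughput: {throughput:.2f} req/s")
print("replicas for 10 req/s at 4 req/s each:", replicas_needed(10, 4, headroom=0.0))
```

An estimate like this is only a starting point; the GPU utilization and VRAM usage shown on the Analysis tab indicate whether the existing replicas are actually saturated.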
View Runtime Logs
Under the Analysis tab, you can also view runtime logs for each replica, including model loading information, request processing records, and error messages, which help troubleshoot inference service issues.
View Billing Details
On the instance details page, switch to the Billing tab to view resource usage and cost breakdown:
| Field | Description |
|---|---|
| Billing Start Time | When the instance started consuming compute resources |
| Billing End Time | When the instance was stopped or resources were released |
| Resource Spec | The compute configuration in use |
| Accumulated Cost | Total cost to date |
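As a worked example of how these fields relate, the sketch below estimates a cost from the billing start and end times. It assumes simple linear per-hour pricing; the actual billing granularity and rates are defined by the platform, and both the `estimate_cost` helper and the hourly rate used here are hypothetical.

```python
from datetime import datetime


def estimate_cost(start: datetime, end: datetime, hourly_rate: float) -> float:
    """Estimate cost as elapsed hours times an hourly rate (assumed pricing model)."""
    if end < start:
        raise ValueError("billing end time must not precede the start time")
    hours = (end - start).total_seconds() / 3600
    return round(hours * hourly_rate, 2)


# 2.5 hours of usage at a hypothetical rate of 4.0 per hour
cost = estimate_cost(
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 2, 30),
    hourly_rate=4.0,
)
print(cost)  # 10.0
```

Comparing an estimate like this against the Accumulated Cost field is a quick sanity check; the Billing tab remains the authoritative figure.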
Stop and Delete Instances
Warning
- Stop: Pauses the inference service, preserves instance configuration, pauses billing, and allows restart at any time.
- Delete: Permanently removes the instance and all associated resources. This action cannot be undone.