
    Using Dedicated Inference Instances

    Overview Page

    After the instance is deployed and reaches the Running state, click the instance name to open the overview page, which displays the following information:

    Field                  Description
    Inference API URL      The inference service URL provided by the instance, usable directly for API calls
    Running Status         Current instance status (Running, Stopped, Error, etc.)
    Inference Framework    The inference framework selected during creation (e.g., vLLM, SGLang, TGI)
    Resource Config        Allocated GPU/CPU/memory resource specifications
    Replica Count          Number of currently running inference service replicas

    Playground Testing

    The platform provides an interactive Playground (sandbox) for testing model inference without writing code:

    1. On the instance details page, switch to the Playground tab.
    2. Enter a prompt in the input box.
    3. Adjust inference parameters (e.g., Temperature, Top-P, Max Tokens).
    4. Click the Send button to view the model inference results.

    Tip

    The Playground is suitable for quick model validation and prompt debugging. For production use, call the API directly.
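
    When moving from the Playground to direct API calls, the Playground parameters (Temperature, Top-P, Max Tokens) map onto fields of the request. The following is a minimal sketch, assuming the instance exposes an OpenAI-compatible chat completions endpoint as in the API examples in this document. The helper name build_chat_request and the range checks are illustrative assumptions, not part of the platform; the ranges a given inference framework accepts may differ.

```python
def build_chat_request(model, prompt, temperature=0.7, top_p=1.0, max_tokens=512):
    """Return keyword arguments for client.chat.completions.create().

    Hypothetical helper: the range checks below follow common
    OpenAI-style conventions and are assumptions, not platform rules.
    """
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be greater than 0 and at most 1")
    if max_tokens < 1:
        raise ValueError("max_tokens must be a positive integer")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
```

    With an OpenAI client already configured, the result could be passed as client.chat.completions.create(**build_chat_request("your-model-name", "Hello")).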

    API Documentation

    On the instance details page, switch to the API tab to view complete API documentation and multi-language code examples.

    Python Example

    from openai import OpenAI
    
    client = OpenAI(
        api_key="YOUR_API_KEY",
        base_url="https://<your-endpoint-url>/v1"
    )
    
    response = client.chat.completions.create(
        model="your-model-name",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        temperature=0.7,
        max_tokens=512
    )
    
    print(response.choices[0].message.content)
    

    JavaScript Example

    import OpenAI from "openai";
    
    const client = new OpenAI({
      apiKey: "YOUR_API_KEY",
      baseURL: "https://<your-endpoint-url>/v1",
    });
    
    const response = await client.chat.completions.create({
      model: "your-model-name",
      messages: [
        { role: "system", content: "You are a helpful assistant." },
        { role: "user", content: "Hello, please introduce yourself." },
      ],
      temperature: 0.7,
      max_tokens: 512,
    });
    
    console.log(response.choices[0].message.content);
    

    cURL Example

    curl -X POST "https://<your-endpoint-url>/v1/chat/completions" \
      -H "Authorization: Bearer YOUR_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "your-model-name",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        "temperature": 0.7,
        "max_tokens": 512
      }'
    

    Note

    Replace <your-endpoint-url> with the inference API URL shown on the overview page, and YOUR_API_KEY with the API key generated by the platform.
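
    Rather than hard-coding these two values, they can be read from the environment. The sketch below is one way to do that. The variable names INFERENCE_API_URL and INFERENCE_API_KEY and the helper load_client_settings are hypothetical. It also assumes the URL on the overview page may or may not include the https:// scheme or the /v1 suffix, so it normalizes both; adjust this to match what your overview page actually shows.

```python
import os


def load_client_settings(env=os.environ):
    """Build OpenAI client settings from environment variables.

    INFERENCE_API_URL and INFERENCE_API_KEY are hypothetical names
    chosen for this sketch; the platform does not define them.
    """
    url = env.get("INFERENCE_API_URL", "").strip().rstrip("/")
    key = env.get("INFERENCE_API_KEY", "").strip()

    # Reject empty values and placeholders left over from the examples.
    if not url or "<" in url:
        raise ValueError("set INFERENCE_API_URL to the inference API URL")
    if not key or key == "YOUR_API_KEY":
        raise ValueError("set INFERENCE_API_KEY to your API key")

    # Assumption: add the scheme and the /v1 suffix only when missing.
    if "://" not in url:
        url = "https://" + url
    if not url.endswith("/v1"):
        url += "/v1"
    return {"base_url": url, "api_key": key}
```

    The returned dictionary can be unpacked into the client constructor, for example OpenAI(**load_client_settings()).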

    Real-Time Monitoring

    On the instance details page, switch to the Analysis tab to view real-time performance metrics:

    Metric               Description
    CPU Utilization      CPU usage percentage for each replica
    GPU Utilization      GPU compute core usage
    Memory Usage         System memory consumption
    VRAM Usage           GPU memory allocation and usage
    Inference Latency    Average response time from request to result
    Throughput           Number of inference requests processed per second

    Monitoring data helps assess service health and informs resource scaling decisions.
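
    To cross-check the dashboard from the caller's side, latency and throughput can also be measured in the client. The sketch below uses hypothetical helper names (timed_call, summarize). It measures wall-clock time including network overhead, so its numbers will likely differ from the platform's metrics, whose exact definitions are not documented here.

```python
import time
from statistics import mean


def timed_call(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) as seen by the client."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize(latencies, window_seconds):
    """Average latency and requests per second over a measurement window.

    latencies: per-request durations in seconds.
    window_seconds: total wall-clock time in which those requests completed.
    """
    if not latencies or window_seconds <= 0:
        raise ValueError("need at least one latency and a positive window")
    return {
        "avg_latency_s": mean(latencies),
        "throughput_rps": len(latencies) / window_seconds,
    }
```

    For example, wrapping each client.chat.completions.create call in timed_call and collecting the elapsed times gives the input for summarize.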

    View Runtime Logs

    Under the Analysis tab, you can also view runtime logs for each replica, including model loading information, request processing records, and error messages, which help troubleshoot inference service issues.

    View Billing Details

    On the instance details page, switch to the Billing tab to view resource usage and cost breakdown:

    Field                 Description
    Billing Start Time    When the instance started consuming compute resources
    Billing End Time      When the instance was stopped or resources were released
    Resource Spec         The compute configuration in use
    Accumulated Cost      Total cost to date
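
    As a rough sanity check on these fields, the accumulated cost can be approximated from the billing interval. The sketch below assumes a flat hourly rate billed in proportion to running time. Both the rate and that billing model are assumptions made for illustration, and estimate_cost is a hypothetical helper. The Billing tab remains the authoritative figure.

```python
from datetime import datetime, timezone


def estimate_cost(start, end=None, hourly_rate=0.0):
    """Estimate cost for one billing interval.

    start, end: timezone-aware datetimes; end defaults to the current time
    for an instance that is still running.
    hourly_rate: assumed price per hour for the resource spec (hypothetical).
    """
    end = end or datetime.now(timezone.utc)
    if end < start:
        raise ValueError("end must not be earlier than start")
    hours = (end - start).total_seconds() / 3600
    return round(hours * hourly_rate, 2)
```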

    Stop and Delete Instances

    Warning

    Inference instances incur charges while running. To avoid unnecessary costs, stop instances that are not in use. If an instance is no longer needed, delete it to permanently release its resources.

    • Stop: Pauses the inference service and billing while preserving the instance configuration. The instance can be restarted at any time.
    • Delete: Permanently removes the instance and all associated resources. This action cannot be undone.