< img height="1" width="1" style="display:none;" alt="" src="https://px.ads.linkedin.com/collect/?pid=3131724&fmt=gif" />
Last updated:

    Image Text to Text

    Task Overview

    Image Text to Text tasks take image and text as joint inputs and generate text output. This is suitable for Vision Language Models (VLMs) in scenarios such as image Q&A, image captioning, and OCR understanding.

    API Usage

    Use the multimodal chat completions endpoint, passing both image and text in the messages content array:

    curl https://<instance-address>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <access-token>" \
      -d '{
        "model": "<model-name>",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "image_url",
                "image_url": {
                  "url": "https://example.com/image.jpg"
                }
              },
              {
                "type": "text",
                "text": "Describe the content of this image."
              }
            ]
          }
        ]
      }'
    

    Passing Base64-Encoded Images

    curl https://<instance-address>/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <access-token>" \
      -d '{
        "model": "<model-name>",
        "messages": [
          {
            "role": "user",
            "content": [
              {
                "type": "image_url",
                "image_url": {
                  "url": "data:image/jpeg;base64,<base64-encoded-content>"
                }
              },
              {
                "type": "text",
                "text": "What text appears in this image?"
              }
            ]
          }
        ]
      }'
    

    Python Example

    import base64
    from openai import OpenAI
    
    client = OpenAI(
        base_url="https://<instance-address>/v1",
        api_key="<access-token>"
    )
    
    # Read and encode local image
    with open("image.jpg", "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")
    
    response = client.chat.completions.create(
        model="<model-name>",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}
                    },
                    {
                        "type": "text",
                        "text": "Describe the content of this image."
                    }
                ]
            }
        ]
    )
    print(response.choices[0].message.content)