Chat & Generate#

Learn how to chat with LLMs in Xinference.

Introduction#

Models with the chat or generate ability are usually referred to as large language models (LLMs) or text generation models. These models are designed to respond with text output based on the input they receive, commonly called a “prompt”. Generally, you can steer these models toward a task with specific instructions or by providing concrete examples.
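As a minimal sketch of steering a model with an instruction plus concrete examples, the helper below assembles a few-shot prompt as plain text. The function name `build_few_shot_prompt` and the `Input:`/`Output:` layout are illustrative assumptions, not part of Xinference; any consistent layout would serve the same purpose.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a plain-text prompt from an instruction, worked examples, and a new query.

    Hypothetical helper for illustration only; the layout is an assumption.
    """
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # Leave the final "Output:" open so the model completes it.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate each English word to French.",
    [("cat", "chat"), ("dog", "chien")],
    "bird",
)
print(prompt)
```

The resulting string can be used as the prompt of a generate request or as the content of a user message in a chat request.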

Models with the generate ability are usually pre-trained large language models. Models with the chat ability, on the other hand, are fine-tuned and aligned LLMs optimized for conversational scenarios. In most cases, models whose names end in “chat” (such as llama-2-chat and qwen-chat) have the chat ability.

The Chat API and the Generate API offer two different ways of interacting with LLMs:

The Chat API (similar to OpenAI's Chat Completion API) supports multi-turn conversations.
The Generate API (similar to OpenAI's Completions API) generates text from a text prompt.

Model Ability	API Endpoint	OpenAI-compatible Endpoint
chat	Chat API	/v1/chat/completions
generate	Generate API	/v1/completions
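The two endpoints in the table differ mainly in the request body they expect. The sketch below builds both bodies side by side without sending anything over the network. The helper names `chat_request` and `generate_request` are hypothetical, and only the `model`, `messages`, and `prompt` fields shown elsewhere on this page are used.

```python
import json


def chat_request(model_uid, user_text, system_text=None):
    """Return (path, body) for a Chat API call; the body carries a list of messages."""
    messages = []
    if system_text is not None:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return "/v1/chat/completions", {"model": model_uid, "messages": messages}


def generate_request(model_uid, prompt):
    """Return (path, body) for a Generate API call; the body carries one prompt string."""
    return "/v1/completions", {"model": model_uid, "prompt": prompt}


chat_path, chat_body = chat_request(
    "<MODEL_UID>", "What is the largest animal?", "You are a helpful assistant."
)
gen_path, gen_body = generate_request("<MODEL_UID>", "What is the largest animal?")

print(chat_path, json.dumps(chat_body, indent=2))
print(gen_path, json.dumps(gen_body, indent=2))
```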

Supported Models List#

You can check all the built-in LLMs and their abilities in Xinference.

Chat Model#

Chat API#

Try the Chat API using cURL, the OpenAI client, or Xinference's Python client:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "What is the largest animal?"
        }
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {
            "content": "What is the largest animal?",
            "role": "user",
        }
    ],
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is the largest animal?"}]
model.chat(
    messages,
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
)

{
  "id": "chatcmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "chat.completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life."
      },
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}
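For a multi-turn conversation, the assistant message from each response is typically appended to the message list before the next user turn is sent. The sketch below shows that bookkeeping, assuming the response is a plain dict shaped like the sample JSON. The helper name `append_assistant_reply` is hypothetical, and the reply text is a shortened placeholder rather than real model output.

```python
def append_assistant_reply(messages, response):
    """Return a new history that ends with the assistant message from `response`.

    Assumes `response` is a dict with the choices[0].message layout of the sample JSON.
    """
    reply = response["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


# Response trimmed to the fields used here; the content is a placeholder.
response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "The blue whale."},
        }
    ]
}

history = [{"role": "user", "content": "What is the largest animal?"}]
history = append_assistant_reply(history, response)
# The follow-up question is sent together with the full history.
history.append({"role": "user", "content": "How long can it grow?"})

for message in history:
    print(f"{message['role']}: {message['content']}")
```

The updated `history` list is what would be passed as `messages` in the next chat request.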

You can find more Chat API examples in the tutorial notebook.

Gradio Chat

An example of using the Xinference Chat API with the Python client.

https://github.com/xorbitsai/inference/blob/main/examples/gradio_chatinterface.py

Hybrid Thinking Model#

Some large language models are marked as hybrid, which lets you choose whether thinking (reasoning) mode is enabled.

Added in version v1.17.0: The request-level enable_thinking switch is supported starting from v1.17.0.

Xinference provides a request-level enable_thinking switch that works across different model templates (for example, Qwen uses enable_thinking, while some DeepSeek templates use thinking).

Usage example:

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "messages": [
        {"role": "user", "content": "What is the largest animal?"}
    ],
    "enable_thinking": false
  }'

import openai

client = openai.Client(
    api_key="cannot be empty",
    base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
)
client.chat.completions.create(
    model="<MODEL_UID>",
    messages=[
        {"role": "user", "content": "What is the largest animal?"}
    ],
    extra_body={"enable_thinking": False}
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    enable_thinking=False,
)

model.chat(
    [{"role": "user", "content": "What is the largest animal?"}],
    generate_config={"chat_template_kwargs": {"enable_thinking": False}},
)
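When requests are assembled by hand, it can help to add the switch only when the caller sets it explicitly. The sketch below does that for a raw `/v1/chat/completions` body. The helper name `build_chat_body` is hypothetical, and the idea that omitting the field leaves the model's default behavior in place is an assumption rather than something this page states.

```python
def build_chat_body(model_uid, messages, enable_thinking=None):
    """Build a chat request body, adding enable_thinking only when it is set explicitly.

    Assumption: leaving the field out lets the server apply the model's default.
    """
    body = {"model": model_uid, "messages": list(messages)}
    if enable_thinking is not None:
        body["enable_thinking"] = bool(enable_thinking)
    return body


question = [{"role": "user", "content": "What is the largest animal?"}]

default_body = build_chat_body("<MODEL_UID>", question)
no_thinking_body = build_chat_body("<MODEL_UID>", question, enable_thinking=False)

print(default_body)
print(no_thinking_body)
```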

Generate Model#

Generate API#

The Generate API mirrors OpenAI's Completions API.

The main difference between the Generate API and the Chat API is the input format. The Chat API takes a list of messages as input, while the Generate API takes a free-form text string called a “prompt”.
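To make the input-format difference concrete, the sketch below flattens a message list into a single prompt string. This is a deliberately naive layout chosen for illustration. It is not the chat template of any particular model, nor what Xinference applies internally, and the name `messages_to_prompt` is hypothetical.

```python
def messages_to_prompt(messages):
    """Flatten chat messages into one free-form prompt string (naive illustration only)."""
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    # End with an open assistant turn for the model to continue.
    lines.append("assistant:")
    return "\n".join(lines)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the largest animal?"},
]

prompt = messages_to_prompt(messages)
print(prompt)
```

With the Chat API the structured `messages` list is sent as is. With the Generate API the caller is responsible for producing a string such as `prompt`.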

curl -X 'POST' \
  'http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "<MODEL_UID>",
    "prompt": "What is the largest animal?",
    "max_tokens": 512,
    "temperature": 0.7
  }'

import openai

client = openai.Client(api_key="cannot be empty", base_url="http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1")
# The Generate API maps to the Completions endpoint, which takes a prompt string.
client.completions.create(
    model="<MODEL_UID>",
    prompt="What is the largest animal?",
    max_tokens=512,
    temperature=0.7
)

from xinference.client import RESTfulClient

client = RESTfulClient("http://<XINFERENCE_HOST>:<XINFERENCE_PORT>")
model = client.get_model("<MODEL_UID>")
print(model.generate(
    prompt="What is the largest animal?",
    generate_config={
      "max_tokens": 512,
      "temperature": 0.7
    }
))

{
  "id": "cmpl-8d76b65a-bad0-42ef-912d-4a0533d90d61",
  "model": "<MODEL_UID>",
  "object": "text_completion",
  "created": 1688919187,
  "choices": [
    {
      "index": 0,
      "text": "The largest animal that has been scientifically measured is the blue whale, which has a maximum length of around 23 meters (75 feet) for adult animals and can weigh up to 150,000 pounds (68,000 kg). However, it is important to note that this is just an estimate and that the largest animal known to science may be larger still. Some scientists believe that the largest animals may not have a clear \"size\" in the same way that humans do, as their size can vary depending on the environment and the stage of their life.",
      "finish_reason": "None"
    }
  ],
  "usage": {
    "prompt_tokens": -1,
    "completion_tokens": -1,
    "total_tokens": -1
  }
}

FAQ#

Does Xinference provide integration methods for LangChain or LlamaIndex?#

Yes, you can refer to the Xinference sections in their respective official documentation. Here are the links: