Configure Xorbits Inference

Xorbits Inference (Xinference) is a powerful open-source model inference framework that supports LLM, Embedding, Rerank, and multimodal models. It provides distributed inference, one-click deployment, OpenAI-compatible API, and other features.

1. Install and Deploy Xinference

1.1 Access Xinference Official Website

Visit the Xinference official website and check the documentation: https://inference.readthedocs.io/

GitHub repository: https://github.com/xorbitsai/inference

1.2 System Requirements

  • Operating System: Linux, macOS, Windows
  • Python: 3.8-3.11
  • GPU: Optional (NVIDIA GPU with CUDA support)
  • Memory: Depends on model size, 16GB+ recommended
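
Before installing, you can run a quick environment check (a minimal sketch; the nvidia-smi check only applies to machines with an NVIDIA GPU):

bash
# Confirm the Python version is within the supported range (3.8-3.11)
python --version

# Optional: confirm the NVIDIA driver and GPU are visible (only needed for CUDA acceleration)
nvidia-smi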

1.3 Install Xinference

Install using pip

bash
# Install Xinference
pip install "xinference[all]"

# Verify installation
xinference --version

Expected Result:

xinference, version 0.x.x

Common Errors:

  • If you see command not found: xinference, the installation failed or it's not in PATH
  • If Python version is incompatible, ensure you're using Python 3.8-3.11

Install using Docker

bash
# Pull Xinference official image
docker pull xprobe/xinference:latest

# Run Xinference service
docker run -p 9997:9997 -v $HOME/.xinference:/root/.xinference xprobe/xinference:latest

Expected Result:

  • The image pull succeeds and displays Status: Downloaded newer image for xprobe/xinference:latest
  • After the container starts, its logs are shown, including Starting Xinference at http://0.0.0.0:9997

Common Errors:

  • If the pull fails, it may be a network issue; try configuring a Docker registry mirror
  • If port 9997 is occupied, change the -p mapping to use a different host port, as shown below
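
For example (a minimal sketch; host port 9998 is just an illustration, the container still listens on 9997 internally):

bash
# Map a different host port (9998) to the container's 9997
docker run -p 9998:9997 -v $HOME/.xinference:/root/.xinference xprobe/xinference:latest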

1.4 Start Xinference Service

Start Local Service

bash
# Start Xinference
xinference-local --host 0.0.0.0 --port 9997

Expected Result:

Starting Xinference at http://0.0.0.0:9997
Xinference service started successfully
You can now access the web UI at http://localhost:9997

Common Errors:

  • If you see Address already in use, the port is occupied, use --port to specify a different port
  • If you see permission errors, try using sudo or check file permissions

Start Cluster Mode

bash
# Start supervisor (master node)
xinference-supervisor --host 0.0.0.0 --port 9997

# Start worker (worker node) on other machines
xinference-worker --endpoint http://supervisor_host:9997

Expected Result:

  • Supervisor starts successfully, displays listening address and port
  • Worker connects successfully, displays Connected to supervisor at http://supervisor_host:9997
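
To confirm the worker has registered, you can query the supervisor from any machine (a sketch, assuming the xinference client's --endpoint flag as used elsewhere in this guide; replace supervisor_host with your actual host):

bash
# List models known to the cluster via the supervisor endpoint
xinference list --endpoint http://supervisor_host:9997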

1.5 Deploy Model

Deploy models through Web UI or command line:

Using Web UI

  1. Visit: http://localhost:9997
  2. Click the Launch Model button
  3. Select a model (e.g., qwen2.5-7b-instruct)
  4. Configure parameters and launch

Expected Result:

  • Web UI displays list of available models
  • After clicking Launch, the model starts downloading (if not available locally)
  • After model loads successfully, status shows "Running"
  • You can see the model's access address, e.g., http://localhost:9997/v1

Common Issues:

  • If the model list is empty, check your network connection or download the models manually
  • The first deployment downloads the model files, which can take a while depending on model size
  • If memory is insufficient, choose a model with fewer parameters

Using Command Line

bash
# Deploy Qwen 2.5 7B model
xinference launch --model-name qwen2.5-7b-instruct --size-in-billions 7

# View deployed models
xinference list

Expected Result:

Model launched successfully
Model UID: qwen2.5-7b-instruct-xxxxx
Model is now available at: http://localhost:9997/v1/models/qwen2.5-7b-instruct-xxxxx

Example output of viewing deployed models:

UID                                   Name                      Type    Status
qwen2.5-7b-instruct-xxxxx            qwen2.5-7b-instruct       LLM     Running

Common Errors:

  • If the model doesn't exist, use xinference registrations to view available models (see the example below)
  • If the launch fails, check the logs for the specific error
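
For example, a sketch of listing the built-in LLM registrations (assuming the --model-type filter is available in your version):

bash
# Show models that can be launched, filtered to LLMs
xinference registrations --model-type LLM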

1.6 Verify Service is Running

bash
# Check service status
curl http://localhost:9997/v1/models

Correct Return Result Example:

json
{
  "object": "list",
  "data": [
    {
      "id": "qwen2.5-7b-instruct-xxxxx",
      "object": "model",
      "created": 1699234567,
      "owned_by": "xinference",
      "permission": []
    }
  ]
}

If the above JSON is returned, the Xinference service has started successfully and the model is deployed.

Error Cases:

  1. Connection Failed:

    bash
    curl: (7) Failed to connect to localhost port 9997: Connection refused

    The service is not running or the port is misconfigured; check that the service started properly.

  2. Empty List Returned:

    json
    {
      "object": "list",
      "data": []
    }

    Service is running normally but no models have been deployed yet. You need to deploy a model first.

  3. 404 Error:

    json
    {"detail": "Not Found"}

    The access path is incorrect; confirm you're using the correct API endpoint /v1/models.
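
Beyond listing models, you can send a minimal chat completion request to confirm the deployed model actually responds. This is a sketch using the OpenAI-compatible endpoint; replace the model value with the UID returned by xinference list:

bash
curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct-xxxxx",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'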

2. Configure Xinference Model in CueMate

2.1 Navigate to Model Settings

After logging into CueMate, click Model Settings in the dropdown menu at the top right corner.

Navigate to Model Settings

2.2 Add New Model

Click the Add Model button in the upper right corner.

Click Add Model

2.3 Select Xorbits Inference Provider

In the dialog that appears:

  1. Provider Type: Select Xorbits Inference
  2. After clicking, it will automatically proceed to the next step

Select Xinference

2.4 Fill in Configuration Information

Fill in the following information on the configuration page:

Basic Configuration

  1. Model Name: Give this model configuration a name (e.g., Xinference Qwen 2.5)
  2. API URL: Keep the default http://localhost:9997/v1, or change it to your Xinference service address
  3. API Key: Optional; fill it in only if the --api-key parameter was configured when the Xinference service was started
  4. Model Version: Enter the name of the deployed model

Recommended Models for 2025:

  • Qwen 2.5 Series:
    • qwen2.5-72b-instruct: Qwen 2.5 72B conversational model
    • qwen2.5-32b-instruct: Qwen 2.5 32B conversational model
    • qwen2.5-14b-instruct: Qwen 2.5 14B conversational model (recommended)
    • qwen2.5-7b-instruct: Qwen 2.5 7B conversational model
  • Qwen 2.5 Coder Series:
    • qwen2.5-coder-32b-instruct: Code generation 32B
    • qwen2.5-coder-7b-instruct: Code generation 7B
  • DeepSeek R1 Series:
    • deepseek-r1-8b: DeepSeek R1 8B reasoning-enhanced model
  • Llama 3 Series:
    • llama-3.3-70b-instruct: Llama 3.3 70B conversational model
    • llama-3.1-70b-instruct: Llama 3.1 70B conversational model
    • llama-3.1-8b-instruct: Llama 3.1 8B conversational model
  • Other Recommendations:
    • mistral-7b-instruct-v0.3: Mistral 7B conversational model
    • gemma-2-27b-it: Gemma 2 27B conversational model
    • gemma-2-9b-it: Gemma 2 9B conversational model
    • glm-4-9b-chat: GLM-4 9B conversational model

Note: Model version must be a model already deployed in Xinference.

Fill in Basic Configuration

Advanced Configuration (Optional)

Expand the Advanced Configuration panel to adjust the following parameters:

Parameters Adjustable in CueMate Interface:

  1. Temperature: Controls output randomness

    • Range: 0-2 (most models), 0-1 (Qwen series)
    • Recommended Value: 0.7
    • Effect: Higher values produce more random and creative outputs, lower values produce more stable and conservative outputs
    • Usage Recommendations:
      • Creative writing/brainstorming: 1.0-1.5
      • Regular conversation/Q&A: 0.7-0.9
      • Code generation/precise tasks: 0.3-0.5
    • Note: Qwen series models have a maximum temperature of 1, not 2
  2. Max Tokens: Limits single output length

    • Range: 256 - 131072 (depending on the model)
    • Recommended Value: 8192
    • Effect: Controls the maximum number of tokens in a single model response
    • Model Limits:
      • Qwen2.5 series: Max 32K tokens
      • Llama 3.1/3.3 series: Max 131K tokens
      • DeepSeek R1 series: Max 65K tokens
      • Mistral series: Max 32K tokens
      • Gemma 2 series: Max 8K tokens
    • Usage Recommendations:
      • Short Q&A: 1024-2048
      • Regular conversation: 4096-8192
      • Long text generation: 16384-32768
      • Ultra-long documents: 65536-131072 (supported models only)

Advanced Configuration

Additional Advanced Parameters Supported by Xinference API:

Although the CueMate interface only provides temperature and max_tokens adjustments, you can also use the following advanced parameters when calling Xinference directly through its OpenAI-compatible API (see the example request after the scenario table below):

  1. top_p (nucleus sampling)

    • Range: 0-1
    • Default Value: 1
    • Effect: Samples from the smallest set of candidates whose cumulative probability reaches p
    • Relationship with temperature: Usually only adjust one of them
    • Usage Recommendations:
      • Maintain diversity but avoid extremes: 0.9-0.95
      • More conservative output: 0.7-0.8
  2. top_k

    • Range: 0-100
    • Default Value: 50
    • Effect: Samples from the top k candidates with highest probability
    • Usage Recommendations:
      • More diverse: 50-100
      • More conservative: 10-30
  3. frequency_penalty

    • Range: -2.0 to 2.0
    • Default Value: 0
    • Effect: Reduces the probability of repeating the same words (based on word frequency)
    • Usage Recommendations:
      • Reduce repetition: 0.3-0.8
      • Allow repetition: 0 (default)
  4. presence_penalty

    • Range: -2.0 to 2.0
    • Default Value: 0
    • Effect: Reduces the probability that words which have already appeared are generated again (based on presence)
    • Usage Recommendations:
      • Encourage new topics: 0.3-0.8
      • Allow repeated topics: 0 (default)
  5. stop (stop sequences)

    • Type: String or array
    • Default Value: null
    • Effect: Stops generation as soon as the generated content contains one of the specified strings
    • Example: ["###", "User:", "\n\n"]
    • Use Cases:
      • Structured output: Use separators to control format
      • Dialogue systems: Prevent the model from speaking on behalf of the user
  6. stream

    • Type: Boolean
    • Default Value: false
    • Effect: Enables SSE streaming return, generating and returning content progressively
    • In CueMate: Handled automatically, no manual setting required
  7. repetition_penalty

    • Type: Float
    • Range: 1.0-2.0
    • Default Value: 1.0
    • Effect: Xinference-specific parameter, penalizes already generated tokens to reduce repetition
    • Usage Recommendations:
      • Reduce repetitive content: 1.1-1.3
      • Normal output: 1.0 (default)
The following table suggests parameter combinations for common scenarios:

| No. | Scenario | temperature | max_tokens | top_p | top_k | frequency_penalty | presence_penalty |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Creative Writing | 1.0-1.2 | 4096-8192 | 0.95 | 50 | 0.5 | 0.5 |
| 2 | Code Generation | 0.2-0.5 | 2048-4096 | 0.9 | 40 | 0.0 | 0.0 |
| 3 | Q&A System | 0.7 | 1024-2048 | 0.9 | 50 | 0.0 | 0.0 |
| 4 | Summary | 0.3-0.5 | 512-1024 | 0.9 | 30 | 0.0 | 0.0 |
| 5 | Brainstorming | 1.2-1.5 | 2048-4096 | 0.95 | 60 | 0.8 | 0.8 |
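
As referenced above, here is a hedged example of passing several of these parameters directly to the OpenAI-compatible chat completions endpoint (the model UID and parameter values are placeholders for illustration):

bash
curl http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b-instruct-xxxxx",
    "messages": [{"role": "user", "content": "Write a short poem about autumn"}],
    "temperature": 1.0,
    "max_tokens": 4096,
    "top_p": 0.95,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.5,
    "stop": ["###"]
  }'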

2.5 Test Connection

After filling in the configuration, click the Test Connection button to verify the configuration is correct.

Test Connection

If the configuration is correct, a test success prompt will be displayed, along with a sample response from the model.

Test Success

If the configuration is incorrect, a test error log will be displayed, and you can view specific error information through log management.

2.6 Save Configuration

After successful testing, click the Save button to complete the model configuration.

Save Configuration

3. Use the Model

Open the dropdown menu in the top right corner, navigate to the system settings page, and select the model configuration you want to use in the large model provider section.

After configuration, you can use this model in interview training, question generation, and other features. You can also select a separate model configuration for a specific interview in the interview options.

Select Model

4. Supported Model List

4.1 Qwen 2.5 Series

| No. | Model Name | Model ID | Parameters | Max Output | Use Cases |
| --- | --- | --- | --- | --- | --- |
| 1 | Qwen 2.5 72B Instruct | qwen2.5-72b-instruct | 72B | 32K tokens | Ultra-large scale tasks |
| 2 | Qwen 2.5 32B Instruct | qwen2.5-32b-instruct | 32B | 32K tokens | Large-scale tasks |
| 3 | Qwen 2.5 14B Instruct | qwen2.5-14b-instruct | 14B | 32K tokens | Medium-scale tasks |
| 4 | Qwen 2.5 7B Instruct | qwen2.5-7b-instruct | 7B | 32K tokens | General scenarios, cost-effective |
| 5 | Qwen 2.5 Coder 32B | qwen2.5-coder-32b-instruct | 32B | 32K tokens | Large code generation |
| 6 | Qwen 2.5 Coder 7B | qwen2.5-coder-7b-instruct | 7B | 32K tokens | Medium code generation |

4.2 DeepSeek R1 Series

| No. | Model Name | Model ID | Parameters | Max Output | Use Cases |
| --- | --- | --- | --- | --- | --- |
| 1 | DeepSeek R1 8B | deepseek-r1-8b | 8B | 65K tokens | Reasoning-enhanced conversation |

4.3 Llama 3 Series

| No. | Model Name | Model ID | Parameters | Max Output | Use Cases |
| --- | --- | --- | --- | --- | --- |
| 1 | Llama 3.3 70B Instruct | llama-3.3-70b-instruct | 70B | 131K tokens | Ultra-long context |
| 2 | Llama 3.1 70B Instruct | llama-3.1-70b-instruct | 70B | 131K tokens | High-quality conversation |
| 3 | Llama 3.1 8B Instruct | llama-3.1-8b-instruct | 8B | 131K tokens | General conversation |

4.4 Other Models

| No. | Model Name | Model ID | Parameters | Max Output | Use Cases |
| --- | --- | --- | --- | --- | --- |
| 1 | Mistral 7B Instruct | mistral-7b-instruct-v0.3 | 7B | 32K tokens | Multilingual conversation |
| 2 | Gemma 2 27B IT | gemma-2-27b-it | 27B | 8K tokens | Google flagship model |
| 3 | Gemma 2 9B IT | gemma-2-9b-it | 9B | 8K tokens | Google medium model |
| 4 | GLM-4 9B Chat | glm-4-9b-chat | 9B | 131K tokens | Zhipu GLM latest version |

4.5 Model Management

  • One-click Deployment: Supports 100+ open-source models
  • Version Management: Multiple versions of the same model
  • Auto Download: Automatically downloads models on first use

4.6 Distributed Inference

bash
# Start cluster supervisor
xinference-supervisor -H 0.0.0.0 -p 9997

# Start worker on other machines
xinference-worker -e http://supervisor_ip:9997

4.7 Built-in Embedding

bash
# Deploy embedding model
xinference launch --model-name bge-large-zh --model-type embedding
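
Once the embedding model is running, it can be queried through the OpenAI-compatible embeddings endpoint. A minimal sketch (the model field assumes the model was launched under the name bge-large-zh; use the UID from xinference list if needed):

bash
# Request an embedding vector for a piece of text
curl http://localhost:9997/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-large-zh",
    "input": "Hello, Xinference"
  }'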

5. Common Issues

5.1 Model Download Failed

Symptom: Download fails when deploying model for the first time

Solution:

  1. Check network connection, ensure access to HuggingFace
  2. Set mirror acceleration: export HF_ENDPOINT=https://hf-mirror.com
  3. Pre-download model to ~/.xinference/cache
  4. Use local model path
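
A minimal sketch combining the mirror setting from step 2 with the deployment command shown earlier (the model name is just an illustration):

bash
# Use the HuggingFace mirror for this shell session, then deploy the model
export HF_ENDPOINT=https://hf-mirror.com
xinference launch --model-name qwen2.5-7b-instruct --size-in-billions 7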

5.2 Port Conflict

Symptom: Port occupied prompt when starting service

Solution:

  1. Modify startup port: xinference-local --port 9998
  2. Check and close processes occupying the port
  3. Use lsof -i :9997 to see which process is occupying the port (see the example below)
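
For example (a minimal sketch; port 9998 is just an illustration):

bash
# Find the process occupying port 9997
lsof -i :9997

# Or start Xinference on a different port instead
xinference-local --host 0.0.0.0 --port 9998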

5.3 Insufficient Memory

Symptom: Insufficient memory prompt when deploying model

Solution:

  1. Choose a model with fewer parameters
  2. Use quantized version
  3. Configure GPU acceleration
  4. Increase system memory

5.4 Empty Model List

Symptom: Accessing /v1/models returns empty list

Solution:

  1. Confirm at least one model has been deployed
  2. Use xinference list to check deployment status
  3. Check Xinference service logs
  4. Restart Xinference service

5.5 GPU Acceleration

bash
# Auto-detect and use GPU
xinference launch --model-name qwen2.5-7b-instruct

# Specify GPU device
CUDA_VISIBLE_DEVICES=0,1 xinference-local

5.6 Quantization Acceleration

bash
# Use 4-bit quantization
xinference launch --model-name qwen2.5-7b-instruct --quantization 4-bit

5.7 Batch Processing Optimization

Adjust xinference.toml configuration file:

toml
[inference]
max_batch_size = 32
max_concurrent_requests = 256

5.8 Ease of Use

  • Web UI management interface
  • One-click model deployment
  • OpenAI-compatible API

5.9 Rich Features

  • Supports LLM, Embedding, Rerank
  • Multimodal model support
  • Distributed inference cluster

5.10 Production Ready

  • High availability architecture
  • Monitoring and logging
  • Load balancing
