
Configure Local Large Model

Local large models are open-source large language models deployed on a personal computer or private server, with no reliance on cloud APIs. They can be served by multiple inference frameworks (Ollama, vLLM, Xinference, etc.) and provide data privacy protection and fully offline operation.

1. Deploy Local Model Service

Local large models can be served by several inference frameworks, including Ollama, vLLM, and Xinference. This document describes how to configure a local model service in a way that applies to all of them.

1.1 Choose Inference Framework

Choose the appropriate inference framework based on your needs:

  • Ollama: Easy to use, suitable for individual developers
  • vLLM: High-performance inference, suitable for production environments
  • Xinference: Supports multiple models, feature-rich

For detailed installation instructions, please refer to each framework's own documentation.

1.2 Start Local Service

Taking Ollama as an example:

bash
# Download model
ollama pull deepseek-r1:7b

# Ollama will automatically start the service, listening on http://localhost:11434 by default
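
If you use vLLM or Xinference instead, the service is started differently. The commands below are a rough sketch: the model ID, ports, and CLI flags shown are examples and vary by installed version, so treat each framework's own documentation as authoritative.

bash
# vLLM: start an OpenAI-compatible server (model ID is a Hugging Face repo, shown as an example)
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct --port 8000

# Xinference: start the local service (models are then launched from its web UI or CLI)
xinference-local --host 0.0.0.0 --port 9997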

1.3 Verify Service Running

bash
# Check service status
curl http://localhost:11434/api/version
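
The command above checks Ollama. vLLM and Xinference expose an OpenAI-compatible API, so you can verify the service and list the deployed models in one call (the ports below are the defaults used in this document):

bash
# List models served by an OpenAI-compatible endpoint (vLLM / Xinference)
curl http://localhost:8000/v1/models
curl http://localhost:9997/v1/models

# List models pulled into Ollama
curl http://localhost:11434/api/tags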

2. Configure Local Model in CueMate

2.1 Enter Model Settings Page

After logging into CueMate, click Model Settings in the dropdown menu in the upper right corner.

Enter Model Settings

2.2 Add New Model

Click the Add Model button in the upper right corner.

Click Add Model

2.3 Select Local Model Provider

In the pop-up dialog:

  1. Provider Type: Select Local Model
  2. After you click it, the dialog automatically proceeds to the next step

Select Local Model

2.4 Fill in Configuration Information

Fill in the following information on the configuration page:

Basic Configuration

  1. Model Name: Give this model configuration a name (e.g., Local DeepSeek R1)
  2. API URL: Fill in the local service address
    • Ollama default: http://localhost:11434
    • vLLM default: http://localhost:8000/v1
    • Xinference default: http://localhost:9997/v1
  3. Model Version: Enter the deployed model name

2025 Recommended Models:

  • DeepSeek R1 Series: deepseek-r1:1.5b, deepseek-r1:7b, deepseek-r1:14b, deepseek-r1:32b
  • Llama 3.3 Series: llama3.3:70b (latest version)
  • Llama 3.2 Series: llama3.2:1b, llama3.2:3b, llama3.2:11b, llama3.2:90b
  • Llama 3.1 Series: llama3.1:8b, llama3.1:70b, llama3.1:405b
  • Qwen 2.5 Series: qwen2.5:0.5b, qwen2.5:1.5b, qwen2.5:3b, qwen2.5:7b, qwen2.5:14b, qwen2.5:32b, qwen2.5:72b

Note: The model version must match a model already deployed in the local service. Model naming differs slightly between inference frameworks, so adjust it to match what your service actually reports.

Fill in Basic Configuration
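
Before entering these values in CueMate, you can sanity-check that the API URL and model version work together by sending a minimal request directly to the local service. The sketch below assumes Ollama's OpenAI-compatible endpoint and the deepseek-r1:7b model pulled earlier; substitute your own base URL and model name.

bash
# Minimal chat completion against the local service (Ollama shown; vLLM/Xinference use their own base URLs)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:7b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'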

Advanced Configuration (Optional)

Expand the Advanced Configuration panel to adjust the following parameters:

Parameters Adjustable in CueMate Interface:

  1. Temperature: Controls output randomness

    • Range: 0-2 (depending on model)
    • Recommended Value: 0.7
    • Function: Higher values produce more random and creative output, lower values produce more stable and conservative output
    • Model Range:
      • DeepSeek/Llama Series: 0-2
      • Qwen Series: 0-1
    • Usage Suggestions:
      • Creative writing: 0.8-1.2
      • Regular conversation: 0.6-0.8
      • Code generation: 0.3-0.5
  2. Max Tokens: Limits single output length

    • Range: 256 - 8192
    • Recommended Value: 4096
    • Function: Controls the maximum number of tokens in a single response from the model
    • Usage Suggestions:
      • Short Q&A: 1024-2048
      • Regular conversation: 4096-8192
      • Long text generation: 8192 (maximum)

Advanced Configuration

Other Parameters Supported by Local Model API:

Local model services (Ollama, vLLM, Xinference) typically expose an OpenAI-compatible API and support the following additional parameters; a request sketch that combines them follows the list below:

  1. top_p (nucleus sampling)

    • Range: 0-1
    • Default Value: 0.9
    • Function: Samples from the minimum candidate set where cumulative probability reaches p
    • Usage Suggestions: Keep default 0.9, usually only adjust one of temperature or top_p
  2. top_k

    • Range: 1-100
    • Default Value: 40 (Ollama), 50 (vLLM)
    • Function: Samples only from the k highest-probability candidate tokens
    • Usage Suggestions: Usually keep the default value
  3. frequency_penalty (frequency penalty)

    • Range: -2.0 to 2.0
    • Default Value: 0
    • Function: Reduces the probability of repeating the same tokens
    • Usage Suggestions: Set to 0.3-0.8 to reduce repetition
  4. presence_penalty (presence penalty)

    • Range: -2.0 to 2.0
    • Default Value: 0
    • Function: Lowers the probability of tokens that have already appeared, nudging the model toward new content
    • Usage Suggestions: Set to 0.3-0.8 to encourage new topics
  5. stream (streaming output)

    • Type: Boolean
    • Default Value: false
    • Function: Returns output incrementally as it is generated instead of waiting for the full response
    • In CueMate: Automatically handled, no manual setting required
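
As a sketch of how these parameters are passed, the OpenAI-compatible request below sets several of them explicitly (the vLLM default port and an example model name are assumed). Whether a given parameter is honored depends on the framework and model, so treat this as illustrative rather than a guaranteed interface.

bash
# OpenAI-compatible request with advanced sampling parameters (example values)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Summarize the benefits of local LLM deployment."}],
    "temperature": 0.7,
    "max_tokens": 1024,
    "top_p": 0.9,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.3,
    "stream": false
  }'
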
Recommended parameter combinations by scenario:

Scenario | temperature | max_tokens | top_p | Recommended Model
Creative Writing | 0.8-1.0 | 4096-8192 | 0.9 | DeepSeek R1 7B/14B
Code Generation | 0.3-0.5 | 2048-4096 | 0.9 | Qwen 2.5 7B/14B
Q&A System | 0.7 | 1024-2048 | 0.9 | Llama 3.2 11B
Technical Interview | 0.6-0.7 | 2048-4096 | 0.9 | DeepSeek R1 7B/14B
Fast Response | 0.5 | 1024-2048 | 0.9 | Llama 3.2 3B

2.5 Test Connection

After filling in the configuration, click the Test Connection button to verify that the configuration works.

Test Connection

If the configuration is correct, a successful test message will be displayed, along with a sample response from the model.

Test Success

If the configuration is incorrect, the test errors are displayed, and you can check the detailed error information in log management.

2.6 Save Configuration

After a successful test, click the Save button to complete the model configuration.

Save Configuration

3. Use Model

Open System Settings from the dropdown menu in the upper right corner, then select the model configuration you want to use in the LLM Provider section.

Once configured, the model can be used in interview training, question generation, and other features. You can also choose a model configuration for a specific interview in the interview options.

Select Model

4. Supported Model Series

DeepSeek R1 Series

Model Name | Model ID | Parameters | Max Output | Use Case
DeepSeek R1 1.5B | deepseek-r1:1.5b | 1.5B | 8K tokens | Lightweight reasoning
DeepSeek R1 7B | deepseek-r1:7b | 7B | 8K tokens | Reasoning enhanced, technical interviews
DeepSeek R1 14B | deepseek-r1:14b | 14B | 8K tokens | High-performance reasoning
DeepSeek R1 32B | deepseek-r1:32b | 32B | 8K tokens | Ultra-strong reasoning capability

Llama 3 Series

Model Name | Model ID | Parameters | Max Output | Use Case
Llama 3.3 70B | llama3.3:70b | 70B | 8K tokens | Latest version, high performance
Llama 3.2 90B | llama3.2:90b | 90B | 8K tokens | Ultra-large scale reasoning
Llama 3.2 11B | llama3.2:11b | 11B | 8K tokens | Medium-scale tasks
Llama 3.2 3B | llama3.2:3b | 3B | 8K tokens | Small-scale tasks
Llama 3.2 1B | llama3.2:1b | 1B | 8K tokens | Ultra-lightweight
Llama 3.1 405B | llama3.1:405b | 405B | 8K tokens | Ultra-large scale reasoning
Llama 3.1 70B | llama3.1:70b | 70B | 8K tokens | Large-scale tasks
Llama 3.1 8B | llama3.1:8b | 8B | 8K tokens | Standard tasks

Qwen 2.5 Series

Model Name | Model ID | Parameters | Max Output | Use Case
Qwen 2.5 72B | qwen2.5:72b | 72B | 8K tokens | Ultra-large scale tasks
Qwen 2.5 32B | qwen2.5:32b | 32B | 8K tokens | Large-scale tasks
Qwen 2.5 14B | qwen2.5:14b | 14B | 8K tokens | Medium-scale tasks
Qwen 2.5 7B | qwen2.5:7b | 7B | 8K tokens | General scenarios, cost-effective
Qwen 2.5 3B | qwen2.5:3b | 3B | 8K tokens | Small-scale tasks
Qwen 2.5 1.5B | qwen2.5:1.5b | 1.5B | 8K tokens | Lightweight tasks
Qwen 2.5 0.5B | qwen2.5:0.5b | 0.5B | 8K tokens | Ultra-lightweight

5. Common Issues

Service Connection Failed

Symptom: The connection test cannot reach the local service

Solution:

  1. Confirm that the local inference service is running
  2. Check that the API URL is correct
  3. Make sure the port is not occupied by another process
  4. Check firewall settings

Model Not Deployed

Symptom: A "model does not exist" error is returned

Solution:

  1. Confirm that the model has been deployed in the local service
  2. Check that the model name is spelled correctly
  3. List the models available in the inference service (see the commands in section 1.3)

Performance Issues

Symptom: The model responds slowly

Solution:

  1. Choose a model with fewer parameters
  2. Ensure sufficient GPU memory or system memory
  3. Optimize inference framework configuration
  4. Consider using quantized models

Insufficient Memory

Symptom: The model fails to load or the system lags

Solution:

  1. Choose a model with fewer parameters
  2. Use a quantized version (e.g., 4-bit or 8-bit; see the example below)
  3. Increase system memory or use a GPU
  4. Adjust the inference framework's memory configuration
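
If you are using Ollama, pulling a quantized variant is often the quickest fix. The tag below is only illustrative; the quantization tags actually available differ per model, so check the model's page in the Ollama library first.

bash
# Pull a 4-bit quantized variant (tag shown is an example; available tags vary by model)
ollama pull qwen2.5:7b-instruct-q4_K_M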

Minimum Configuration

Model Parameters | CPU | Memory | GPU
0.5B-3B | 4 cores | 8GB | Optional
7B-14B | 8 cores | 16GB | Recommended
32B-70B | 16 cores | 64GB | Required

Recommended Configuration

Model Parameters | CPU | Memory | GPU
0.5B-3B | 8 cores | 16GB | GTX 1660
7B-14B | 16 cores | 32GB | RTX 3060
32B-70B | 32 cores | 128GB | RTX 4090

Data Privacy

  • All data processing is completed locally
  • No dependency on external API services
  • Full control over data security

Cost Control

  • No API call fees
  • One-time hardware investment
  • Low long-term usage cost

Flexibility

  • Support custom models
  • Adjustable inference parameters
  • Full control over service configuration

Use Cases

  • Enterprise internal deployment
  • Sensitive data processing
  • Offline environment usage
  • Development and testing environments
