
Performance Tuning

Hardware profiles

Raspberry Pi / Limited Hardware

// config.json
{
  "router": {
    "model_path": "",
    "keyword_fallback": true,
    "confidence_threshold": 0.3
  }
}

# modules/onnx_runner.py
MAX_LOADED_MODELS = 1
IDLE_TIMEOUT = 60
INTRA_THREADS = 1
INTER_THREADS = 1

Recommendation: use only Ollama or API backends, and avoid local ONNX models.


Server with dedicated CPU (no GPU)

// config.json
{
  "router": {
    "router_type": "embedding",
    "model_path": "intfloat/multilingual-e5-base",
    "confidence_threshold": 0.45,
    "softmax_temperature": 0.12
  }
}

# modules/onnx_runner.py
MAX_LOADED_MODELS = 3
IDLE_TIMEOUT = 300
INTRA_THREADS = 4
INTER_THREADS = 2
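
To make the interaction between softmax_temperature and confidence_threshold more concrete, here is a minimal sketch. It assumes the router turns per-expert similarity scores into probabilities with a temperature-scaled softmax and rejects the best match when its probability falls below the threshold; this page does not confirm that this is what l3mcore does internally. The function names and the similarity values are hypothetical.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax; a lower temperature sharpens the result."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_expert(similarities, temperature=0.12, threshold=0.45):
    """Return (expert, confidence); expert is None below the threshold."""
    names = list(similarities)
    probs = softmax([similarities[n] for n in names], temperature)
    best = max(range(len(names)), key=probs.__getitem__)
    confidence = probs[best]
    return (names[best] if confidence >= threshold else None), confidence


# Hypothetical cosine similarities between a prompt and three experts.
sims = {"code": 0.82, "math": 0.74, "chat": 0.70}

expert, confidence = pick_expert(sims)            # "code", confidence about 0.53
rejected, _ = pick_expert(sims, temperature=1.0)  # None: scores too flat
```

With these made-up scores, a temperature of 0.12 spreads the probabilities enough for the top expert to clear 0.45, while a temperature of 1.0 leaves all three near one third, so nothing passes. Under this assumption, the two settings should be tuned together.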

Server with GPU

For maximum speed, run the expert models on GPU-backed Ollama or vLLM backends:

  • l3mcore's router is lightweight and runs on the CPU only
  • Expert models run on the GPU via Ollama/vLLM
  • You can increase MAX_LOADED_MODELS if you have enough VRAM
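
The sketch below illustrates one plausible reading of MAX_LOADED_MODELS and IDLE_TIMEOUT: a pool that keeps at most N models resident, evicts the least recently used one when full, and drops models that have been idle for longer than the timeout (assumed to be in seconds). The ModelPool class, its methods, and the loader callback are hypothetical and are not taken from modules/onnx_runner.py.

```python
import time
from collections import OrderedDict

MAX_LOADED_MODELS = 3
IDLE_TIMEOUT = 300  # assumed to be seconds


class ModelPool:
    """Hypothetical LRU pool with idle eviction (not l3mcore's actual code)."""

    def __init__(self, loader, max_loaded=MAX_LOADED_MODELS,
                 idle_timeout=IDLE_TIMEOUT, clock=time.monotonic):
        self._loader = loader          # callable: name -> loaded model
        self._max = max_loaded
        self._timeout = idle_timeout
        self._clock = clock            # injectable for testing
        self._models = OrderedDict()   # name -> (model, last_used)

    def get(self, name):
        now = self._clock()
        self.evict_idle(now)
        if name in self._models:
            model, _ = self._models.pop(name)
        else:
            # Make room by dropping the least recently used entries.
            while self._models and len(self._models) >= self._max:
                self._models.popitem(last=False)
            model = self._loader(name)
        self._models[name] = (model, now)  # most recently used goes last
        return model

    def evict_idle(self, now=None):
        now = self._clock() if now is None else now
        stale = [n for n, (_, used) in self._models.items()
                 if now - used > self._timeout]
        for n in stale:
            del self._models[n]

    def loaded(self):
        return list(self._models)
```

Seen this way, a larger MAX_LOADED_MODELS trades memory (RAM or VRAM) for fewer reloads, and a longer IDLE_TIMEOUT keeps rarely used experts warm at the cost of holding that memory longer.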

Router Speed

The router's decision time varies depending on the embedding model:

Model                    Typical time/request    RAM
No ML (keywords only)    < 1 ms                  ~0
multilingual-e5-small    ~15 ms                  ~120 MB
multilingual-e5-base     ~30 ms                  ~280 MB
multilingual-e5-large    ~80 ms                  ~560 MB

The model is loaded into RAM once, at startup. The routing time is negligible compared to the expert model's inference time.
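
The figures above are typical values, so it can be worth measuring on your own hardware. The following is a minimal timing sketch using only the standard library. The measure_ms helper and the keyword_route stand-in are hypothetical; replace the stand-in with whatever callable performs routing in your setup.

```python
import statistics
import time


def measure_ms(fn, prompts, warmup=3):
    """Return the median wall-clock time of fn(prompt) in milliseconds."""
    for p in prompts[:warmup]:
        fn(p)  # warm up caches before timing
    samples = []
    for p in prompts:
        start = time.perf_counter()
        fn(p)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def keyword_route(prompt):
    # Stand-in router for illustration only (hypothetical).
    return "code" if "python" in prompt.lower() else "chat"


prompts = ["How do I sort a list in Python?", "Hello!"] * 50
median_ms = measure_ms(keyword_route, prompts)
print(f"median routing time: {median_ms:.3f} ms")
```

The median is used rather than the mean so that a single slow call, such as a first-call model load, does not dominate the result.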

Embedding cache

For repetitive prompts (e.g. a bot with a few frequent questions), you can implement caching in a plugin:

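# plugins/embedding_cache.py
from functools import lru_cache


@lru_cache(maxsize=512)
def cached_route(prompt: str) -> str:
    # Placeholder body: returns the prompt unchanged. A cache hit only
    # occurs when the normalized prompt is identical to an earlier one.
    return prompt


def before_routing(prompt: str) -> str:
    # Trim and lowercase so trivially different prompts share one entry.
    return cached_route(prompt.strip().lower())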
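
The hook above caches only the normalized string itself. For the cache to save time, the cached function has to wrap the expensive step. Here is a minimal sketch of that idea, assuming routing can be expressed as a pure function of the prompt; expensive_route is a hypothetical stand-in for the embedding and scoring work, and how such a function would be wired into l3mcore's plugin hooks is not covered here.

```python
from functools import lru_cache


def expensive_route(prompt: str) -> str:
    """Hypothetical stand-in for the embedding + scoring step."""
    expensive_route.calls += 1
    return "code" if "python" in prompt else "chat"


expensive_route.calls = 0  # counts how often the slow path really runs


@lru_cache(maxsize=512)
def _route_normalized(prompt: str) -> str:
    return expensive_route(prompt)


def route(prompt: str) -> str:
    # Lowercase and collapse whitespace so near-identical prompts
    # map to the same cache key.
    return _route_normalized(" ".join(prompt.lower().split()))


first = route("How do I use Python?")
second = route("  how do I use   python? ")  # served from the cache
info = _route_normalized.cache_info()        # hits=1, misses=1
```

Because lru_cache keys on the exact argument, the normalization step decides how often the cache is hit. More aggressive normalization raises the hit rate but risks merging prompts that should be routed differently.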

Monitoring

Regularly check logs/app.log to detect:

  • Experts that never receive traffic (poorly chosen keywords)
  • Consistently very low scores (threshold too high or insufficient keywords)
  • Timeouts on external backends
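
These checks can be partly automated. The sketch below counts routed requests per expert, low-scoring decisions, and timeout lines. The log line format (expert=... score=... and the word "timeout") is an assumption made for illustration; the actual format of logs/app.log is not documented on this page, so the regular expressions will need to be adapted.

```python
import re
from collections import Counter

# Assumed line shapes; adjust both patterns to the real log format.
ROUTE_RE = re.compile(r"expert=(?P<expert>\S+)\s+score=(?P<score>\d*\.?\d+)")
TIMEOUT_RE = re.compile(r"timeout", re.IGNORECASE)


def summarize(lines, known_experts, low_score=0.3):
    """Summarize routing activity from an iterable of log lines."""
    hits = Counter()
    low = 0
    timeouts = 0
    for line in lines:
        match = ROUTE_RE.search(line)
        if match:
            hits[match["expert"]] += 1
            if float(match["score"]) < low_score:
                low += 1
        elif TIMEOUT_RE.search(line):
            timeouts += 1
    return {
        "hits": dict(hits),
        "idle_experts": sorted(set(known_experts) - set(hits)),
        "low_scores": low,
        "timeouts": timeouts,
    }


# Hypothetical sample lines.
sample = [
    "2024-05-01 12:00:00 INFO route expert=code score=0.62",
    "2024-05-01 12:00:05 INFO route expert=code score=0.21",
    "2024-05-01 12:00:09 WARNING backend=ollama timeout after 30s",
]
report = summarize(sample, known_experts=["code", "math"])
```

To run it against the real file, pass an open file handle instead of the sample list, for example summarize(open("logs/app.log"), known_experts=[...]) with your own expert names.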