System Architecture

l3mcore is designed as a modular middleware system. It sits between client applications and the actual AI models.

High-Level Flow

sequenceDiagram
    participant Client
    participant API Server
    participant Router
    participant Dispatcher
    participant Backend

    Client->>API Server: POST /v1/chat/completions
    API Server->>API Server: Validate size, Rate Limit, Sanitize
    API Server->>Router: Analyze Prompt
    Router-->>API Server: Returns Expert Label
    API Server->>Dispatcher: Routes request to Expert
    Dispatcher->>Backend: Executes Inference (Ollama/API/Local)
    Backend-->>Dispatcher: Transmits tokens (stream)
    Dispatcher-->>Client: Transmits tokens to client

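Client Request: the client sends an HTTP request to the l3mcore API server
Security and Validation: the API server validates the request size, applies rate limiting, and sanitizes the input
Routing: the Router analyzes the prompt and returns the label of the expert that should handle it
Dispatch: the Expert Dispatcher forwards the request to that expert's backend
Response: tokens are streamed back to the client as the backend produces them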
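
The flow above can be sketched as a single request handler. This is a minimal illustration and not l3mcore's actual code: `MAX_PROMPT_CHARS`, `sanitize`, `route`, `dispatch`, and `handle_request` are hypothetical names, the router and backend are stubs, and rate limiting is left out for brevity.

```python
from typing import Iterator

MAX_PROMPT_CHARS = 8_000  # hypothetical size limit


def sanitize(prompt: str) -> str:
    """Drop non-printable control characters (an assumed sanitization rule)."""
    kept = (ch for ch in prompt if ch.isprintable() or ch in "\n\t")
    return "".join(kept).strip()


def route(prompt: str) -> str:
    """Stub router: returns an expert label for the prompt."""
    return "code" if "python" in prompt.lower() else "general"


def dispatch(expert: str, prompt: str) -> Iterator[str]:
    """Stub backend: yields tokens one at a time, as a streaming backend would."""
    for token in f"[{expert}] echo: {prompt}".split():
        yield token


def handle_request(prompt: str) -> Iterator[str]:
    # Security and Validation (runs when the caller starts iterating)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt too large")
    clean = sanitize(prompt)
    # Routing
    expert = route(clean)
    # Dispatch and Response: tokens pass straight through to the caller
    yield from dispatch(expert, clean)
```

Because `handle_request` is a generator, the caller receives each token as soon as the backend yields it, which mirrors the streaming response described above.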

The Routing Engine (3 levels)

graph TD
    A[User Request] --> B{ML Router Available?}
    B -- Yes --> C[Calculate Embeddings and Softmax]
    B -- No --> F
    C --> D{Score >= Threshold?}
    D -- Yes --> E[Dispatch to Expert]
    D -- No --> F[Keyword and Fuzzy Fallback]
    F --> G{Is there a Match?}
    G -- Yes --> E
    G -- No --> H[General Fallback Model]

Level 1: Machine Learning (Main)

Uses SentenceTransformers text embeddings to convert each prompt into a vector, then compares that vector against the vectors precomputed for each expert.

Hybrid Scoring System (per expert):

Pre-calculates individual vectors for each keyword
Pre-calculates the normalized centroid of all keywords
Pre-calculates the vector of the expert's description

On each request, the router compares the prompt vector against these precomputed vectors and combines four signals:

| Signal | Default weight | What it measures |
| --- | --- | --- |
| `max_keyword` | 40% | Maximum similarity with any individual keyword |
| `description` | 30% | Similarity with the expert's description |
| `mean_keyword` | 20% | Mean similarity with all keywords |
| `top3_vote` | 10% | Consensus: fraction of top-3 keywords above threshold 0.4 |
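
As a rough sketch of how the four signals might combine into one score per expert, the function below uses cosine similarity and the default weights listed above. The names `cosine` and `hybrid_score` are hypothetical, and two details are assumptions: `mean_keyword` is computed as the plain average of the per-keyword similarities (the precomputed centroid may be used instead in practice), and "above threshold" is read as strictly greater than 0.4. In a real setup the vectors would come from a SentenceTransformers model; toy vectors are enough to show the arithmetic.

```python
import math
from typing import Sequence

Vector = Sequence[float]

# Default weights from the signal table.
WEIGHTS = {"max_keyword": 0.4, "description": 0.3, "mean_keyword": 0.2, "top3_vote": 0.1}
VOTE_THRESHOLD = 0.4


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_score(prompt_vec: Vector, keyword_vecs: Sequence[Vector], description_vec: Vector) -> float:
    """Weighted sum of the four signals (assumes at least one keyword)."""
    sims = sorted((cosine(prompt_vec, k) for k in keyword_vecs), reverse=True)
    top3 = sims[:3]
    signals = {
        "max_keyword": sims[0],
        "description": cosine(prompt_vec, description_vec),
        "mean_keyword": sum(sims) / len(sims),
        "top3_vote": sum(s > VOTE_THRESHOLD for s in top3) / len(top3),
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())
```

For a prompt vector `[1, 0]`, keywords `[1, 0]` and `[0, 1]`, and description `[1, 0]`, the signals are 1.0, 1.0, 0.5, and 0.5, giving 0.4 + 0.3 + 0.1 + 0.05 = 0.85.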

Scores are then normalized with softmax, so the per-expert values are non-negative, sum to 1, and can be read as a probability distribution over the experts.
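
A minimal softmax over per-expert scores could look like the following. The function name and the dictionary layout are assumptions for illustration; subtracting the maximum before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-expert scores into values that are positive and sum to 1."""
    peak = max(scores.values())
    exps = {name: math.exp(score - peak) for name, score in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


probabilities = softmax({"code": 0.85, "math": 0.35, "general": 0.10})
best_expert = max(probabilities, key=probabilities.get)
```

Softmax preserves the ranking of the raw scores, so `best_expert` is always the expert with the highest hybrid score. When the raw scores lie in a narrow range, the resulting probabilities are close together; whether l3mcore rescales scores before this step is not stated here.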

Level 2: Keyword and Fuzzy Fallback

If the ML router is not available, or its score is below `confidence_threshold`, the router falls back to keyword matching with rapidfuzz, using two checks:

Exact token overlap: identical words
Fuzzy matching: partial match for typos or conjugations
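
The two checks above can be illustrated with a small scorer. To keep the sketch dependency-free, it uses `difflib.SequenceMatcher` from the standard library as a stand-in for rapidfuzz (whose `fuzz.ratio` returns a comparable similarity on a 0 to 100 scale). The names `keyword_score` and `keyword_fallback`, the cutoff of 0.8, and the threshold of 0.3 are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

FUZZY_CUTOFF = 0.8  # hypothetical similarity cutoff in [0, 1]


def keyword_score(prompt: str, keywords: List[str]) -> float:
    """Fraction of keywords matched exactly or approximately by a prompt token."""
    tokens = prompt.lower().split()
    hits = 0
    for keyword in keywords:
        keyword = keyword.lower()
        if keyword in tokens:
            hits += 1  # exact token overlap
        elif any(SequenceMatcher(None, keyword, t).ratio() >= FUZZY_CUTOFF for t in tokens):
            hits += 1  # fuzzy match, for example a typo
    return hits / len(keywords) if keywords else 0.0


def keyword_fallback(prompt: str, experts: Dict[str, List[str]], threshold: float = 0.3) -> Optional[str]:
    """Return the best-matching expert, or None if nobody reaches the threshold."""
    scores = {name: keyword_score(prompt, kws) for name, kws in experts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

With experts `{"code": ["python", "function", "debug"], "math": ["integral", "equation"]}`, the prompt "help me debug this pyhton function" still resolves to `code`, because the misspelled "pyhton" is close enough to "python" to count as a fuzzy hit.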

Level 3: General Fallback

If no expert passes the fallback threshold, the request goes to the model marked `"fallback": true` in experts.json (usually a general-purpose model).
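
Putting the three levels together, the cascade might be sketched as follows. Only the `"fallback": true` flag comes from this document; the rest of the experts.json layout shown here (`name`, `keywords`), the function names, and the default `confidence_threshold` value are assumptions made for illustration. The ML and keyword routers are passed in as plain callables so the control flow stays visible.

```python
import json
from typing import Callable, List, Optional, Tuple

# Hypothetical experts.json content; only the "fallback" flag is from the docs.
EXPERTS_JSON = """
[
  {"name": "code", "keywords": ["python", "debug"]},
  {"name": "general", "fallback": true}
]
"""


def pick_fallback(experts: List[dict]) -> str:
    """Return the name of the expert marked as the general fallback."""
    for expert in experts:
        if expert.get("fallback") is True:
            return expert["name"]
    raise LookupError("no expert is marked as fallback")


def route(
    prompt: str,
    experts: List[dict],
    ml_router: Optional[Callable[[str], Tuple[str, float]]],
    keyword_router: Callable[[str], Optional[str]],
    confidence_threshold: float = 0.5,  # assumed default
) -> str:
    # Level 1: ML router, if available and confident enough
    if ml_router is not None:
        label, score = ml_router(prompt)
        if score >= confidence_threshold:
            return label
    # Level 2: keyword and fuzzy fallback
    label = keyword_router(prompt)
    if label is not None:
        return label
    # Level 3: general fallback model
    return pick_fallback(experts)


experts = json.loads(EXPERTS_JSON)
```

Calling `route(prompt, experts, None, keyword_router)` skips Level 1 entirely, which corresponds to the "ML Router Available? No" branch of the diagram.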

The Expert Dispatcher

graph LR
    A[Expert Dispatcher] --> B{Expert Type}
    B -- api --> C[LiteLLM Provider]
    C --> D[OpenAI / Anthropic / Gemini]
    B -- ollama --> E[Ollama Instance]
    B -- local --> F[SpecificModelRunner]
    F --> G[ONNX / GGUF in RAM]

The Dispatcher abstracts the complexity of each backend and instantiates the correct runner according to the expert's type.
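
One common way to implement this kind of type-based selection is a registry that maps each expert type to a runner class. The sketch below is an assumption about the shape of such code, not l3mcore's implementation: `ApiRunner`, `OllamaRunner`, `make_runner`, the `model` field, and the `run` method are hypothetical, and `SpecificModelRunner` borrows its name from the diagram while its interface here is invented. Each runner is a stub that yields a single token instead of calling a real backend.

```python
from typing import Dict, Iterator, Type


class BaseRunner:
    """Common interface: every runner streams tokens for a prompt."""

    prefix = "base"

    def __init__(self, expert: dict) -> None:
        self.model = expert.get("model", "unknown")

    def run(self, prompt: str) -> Iterator[str]:
        # Stub: a real runner would call its backend and yield tokens.
        yield f"{self.prefix}:{self.model}"


class ApiRunner(BaseRunner):
    """Placeholder for a LiteLLM-backed provider (OpenAI, Anthropic, Gemini)."""
    prefix = "api"


class OllamaRunner(BaseRunner):
    """Placeholder for a call to an Ollama instance."""
    prefix = "ollama"


class SpecificModelRunner(BaseRunner):
    """Placeholder for an ONNX or GGUF model held in RAM."""
    prefix = "local"


RUNNERS: Dict[str, Type[BaseRunner]] = {
    "api": ApiRunner,
    "ollama": OllamaRunner,
    "local": SpecificModelRunner,
}


def make_runner(expert: dict) -> BaseRunner:
    """Instantiate the runner that matches the expert's type."""
    try:
        runner_cls = RUNNERS[expert["type"]]
    except KeyError:
        raise ValueError(f"unknown expert type: {expert.get('type')!r}") from None
    return runner_cls(expert)
```

With a registry like this, supporting a new backend means adding one class and one dictionary entry, and the calling code stays unchanged.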
