REFERENCE ARCHITECTURE

On-prem AI, from request to audited answer

One reference for how a private enterprise AI platform moves a request through identity, orchestration, model routing, private retrieval, and on-prem inference — while keeping every byte inside your perimeter and every action in your audit trail.

Evaluate against the RFP checklist

Reference architecture diagram

Your perimeter — on-premises, private cloud, or air-gapped
Who Identity & access
  • SSO (Entra ID / Okta / Keycloak)
  • RBAC policy engine
  • Per-team & per-environment scoping
How Orchestration
  • Multi-agent planner
  • Tool & connector broker
  • Human-in-the-loop approval gates
Which model Model routing
  • Policy-based model selection
  • Sensitivity & residency rules
  • Cost / latency / quality trade-off
Grounding Private retrieval (RAG)
  • In-perimeter vector store
  • Local embedding models
  • Document-level access control
Inference Model runtime
  • Your GPUs / accelerators
  • vLLM / Ollama / TGI serving
  • Bring-your-own-models
Evidence Observability & audit
  • SIEM log streaming
  • Full request lineage
  • Model-governance reporting
Egress to external model providers is optional and can be fully disabled.
THE LAYERS

Six layers, one governed path

01

Identity & access

Requests authenticate against your existing identity provider over SAML or OIDC. Role-based access control resolves which agents, tools, models, and data sources the user is entitled to before any work begins.

02

Orchestration

A multi-agent orchestrator turns a request into a governed plan: which agents run, which tools they may call, and where human approval is required. Sensitive workflows route through review gates.

03

Model routing

The router selects a model per task based on your policy — domain, data sensitivity, latency, cost, and residency. High-sensitivity workloads can be pinned to specific on-prem models; nothing is sent to an external API unless you allow it.

04

Private retrieval (RAG)

Retrieval runs against your own document stores and vector indexes inside the perimeter. Embeddings are generated locally; access to source documents is enforced by the same RBAC that governs the user.

05

Model runtime

Open-weight or licensed models run on your hardware via your preferred serving stack. The platform is model-agnostic, so you can add, retire, or swap models without rewriting workflows.

06

Observability & audit

Every prompt, retrieval, tool call, model route, and response is captured with actor and context, then streamed to your SIEM. This is the evidence layer for audits, incident response, and model-governance reporting.

DESIGN PRINCIPLES

Why this architecture holds up in a review

Data never crosses the boundary

Prompts, documents, embeddings, and responses stay inside your perimeter. External model APIs are optional, not required — and can be disabled entirely.

Your controls, not ours

Identity, secrets, encryption keys, network segmentation, and logging are the ones you already operate. The platform integrates with them rather than replacing them.

Policy is enforced, not advisory

Model choice, tool access, and data reach are governed by policy the router and orchestrator enforce at runtime — the same rules an auditor can inspect.

Adapt it to your environment.

Our architects will map this reference to your identity provider, network zones, serving stack, and compliance obligations — and hand your security team a concrete deployment diagram. Start at the Trust Center.