stefandango.dev

Self-Reliance · Series · Geography is not jurisdiction · Part 2 of 3

What "on-premises AI" actually means in 2026

Part 2 of a series on what on-premises and EU-jurisdiction AI actually mean for regulated organisations in 2026.

Legal landscape, benchmark figures, and hardware prices as of May 2026. Pricing references reflect public API rates and retail hardware prices at time of writing; both move, and the hardware figures will date faster than the legal reasoning. Re-check before making procurement decisions.

Part 1 (Why your AI pilot shouldn't leave the EU) argued that "EU region" is a statement about geography, not jurisdiction. The same hollowing-out has happened to "on-premises." A vendor selling you a managed cloud appliance, a vendor selling you a sovereign-cloud SKU, and a vendor selling you a rack you operate yourself will all use the phrase, and they mean three different things with three very different cost profiles.

This post is about what's actually on the menu. Four tiers, what each one buys you, what each one costs, and where each one breaks. The framing is for organisations weighing real procurement decisions -- public sector bodies, regulated private-sector businesses, anyone with workloads that landed in the harder categories from Part 1.

How to read this post: the main sections are written in plain language for procurement and leadership. The indented blockquotes give architects and CTOs the technical detail behind each claim. Skim past them or read them -- the argument lands either way.

The short version, for the impatient: if you're doing retrieval-augmented generation over internal documents -- which covers most "AI over our own data" projects -- Tier 3 is probably where you'll land. The rest of this post is the reasoning for why that's the sensible default and, just as importantly, where it stops being the right answer. If you're not doing RAG, the tier that fits depends on the workload, and the framework at the end sorts that out.

The cost axes nobody itemises

Before the tiers, the axes. Most procurement conversations collapse "on-premises AI" to a single dimension -- usually "control" -- and that's how organisations end up surprised by what Tier 4 actually costs to run. There are at least five axes that move independently:

Jurisdictional control. Whose courts can compel the data. The thing Part 1 was about. Tiers 2, 3, and 4 are all defensible here; Tier 1 is the one with the open question.

Operational sovereignty. Who can technically reach the data during normal operations -- vendor engineers, support staff, subcontractors. This is the axis where "sovereign cloud" SKUs improve over standard cloud, and where Tier 4 is the only architecture that fully closes the loop.

Capability ceiling. What models you can actually run. The closed frontier is only reachable at Tier 1. Everywhere else the ceiling is the open-weights ecosystem -- limited at Tier 2 (and therefore Tier 3) by what EU vendors actually serve, and at Tier 4 by what your hardware fits. The gap between closed and open is closing but real as of May 2026.

Operational burden. Who keeps the lights on. Cloud APIs are near-zero burden; a fully on-prem rack is a continuous-cost line item with a name attached to it.

Cost shape. Not just hardware or API spend -- staffing, model upgrades, replacement cycles, downtime. The axis where Tier 4 surprises organisations who scoped on rack price alone.

The four tiers below sit at different points on all five axes. That's the whole reason there are four of them rather than one.

Tier 1: US provider, "EU region"

Data residency in the EU, jurisdiction in the US. Azure OpenAI in EU regions, AWS Bedrock with EU model endpoints, Google Vertex AI with EU residency commitments, and the major SaaS layers built on top of them.

This is the path of least resistance. The platforms are mature, the documentation is good, the integrations exist for every framework you might already be using, and the EU-residency offerings are real engineering work -- not theatre. The EU-US Data Privacy Framework gives the transfer a valid legal basis as of May 2026, and for many workloads the calculation comes out in favour of Tier 1 even after the analysis in Part 1.

What Tier 1 gets right. Frontier-quality models without operational burden. Predictable pricing that scales linearly with use. A vendor ecosystem that means hiring is easy. Compliance certifications -- ISO 27001, SOC 2, BSI C5, France's SecNumCloud -- that genuinely address security posture and operational discipline.

What Tier 1 gets wrong. It doesn't address jurisdictional control, which is a separate axis from the certifications above. Microsoft France's Director of Public and Legal Affairs confirmed under oath before the French Senate in June 2025 that the company could not guarantee that data held for French customers would never be handed to US authorities; it's now on the public record. For workloads where that axis doesn't matter -- open-source code completion, public-document Q&A, anything where the worst-case CLOUD Act exposure is acceptable -- Tier 1 is the right answer. For workloads where it does matter, no number of certifications closes the gap.

For the architect: the integration pattern here is well-trodden -- Azure OpenAI SDK, AWS Bedrock client libraries, Google's Vertex AI SDK, or the provider-neutral abstractions like LangChain, Semantic Kernel, or Microsoft.Extensions.AI's IChatClient. The cost calculation is dominated by token volume; per-million-token rates for the EU-region endpoints of frontier models sit roughly at parity with their US-region siblings as of May 2026. If you build the integration through an abstraction, swapping providers later is configuration; if you bake the vendor SDK into application code, it's a rewrite. Sovereignty insurance is cheap on day one.
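To make "swapping providers later is configuration" concrete, here is a minimal sketch in Python rather than the .NET abstractions named above. Everything in it is hypothetical: `ChatClient`, `EchoClient`, the `PROVIDERS` registry, and the config key are illustrative names, and the stand-in client only echoes its input where a real one would wrap a vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Protocol


class ChatClient(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoClient:
    """Stand-in provider for this sketch; a real one would wrap a vendor SDK."""

    label: str

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# Hypothetical registry: configuration value -> factory for a client.
PROVIDERS: Dict[str, Callable[[], ChatClient]] = {
    "tier1-us-eu-region": lambda: EchoClient("tier1"),
    "tier2-eu-api": lambda: EchoClient("tier2"),
}


def client_from_config(config: Dict[str, str]) -> ChatClient:
    """Pick the provider from configuration, never from application code."""
    name = config["chat_provider"]
    if name not in PROVIDERS:
        raise ValueError(f"unknown chat provider: {name!r}")
    return PROVIDERS[name]()


def summarise(client: ChatClient, text: str) -> str:
    """Application logic: knows about ChatClient, nothing vendor-specific."""
    return client.complete(f"Summarise: {text}")
```

Moving a workload from Tier 1 to Tier 2 in this shape means changing the `chat_provider` value; `summarise` and everything like it stays untouched.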

Cost shape. Pay-per-use, no fixed minimum, scales with traffic. A modest internal assistant might cost €50--€200/month in API spend; a heavily-used production application can run to tens of thousands. The dominant cost line is API calls, not infrastructure.

Operational burden. Near zero. You're consuming a service. The vendor handles model upgrades, GPU procurement, capacity planning, and uptime. The only thing your team operates is the application layer.

Tier 2: EU-jurisdiction API provider

An inference endpoint operated under EU jurisdiction. Two kinds of vendor sell this. EU labs serving their own models -- Mistral, Aleph Alpha -- and EU infrastructure providers serving open-weights models from any lab: Scaleway's Generative APIs and OVHcloud's AI Endpoints both host DeepSeek, Qwen, GLM, and Llama models on EU infrastructure, under EU corporate structures. The same offer is appearing outside the EU proper too -- Telenor's AI Factory hosts under Norwegian jurisdiction, which is EEA rather than EU; the argument works the same way, but you're trusting Norwegian law rather than EU law.

Either way, sovereignty comes as a property of the vendor, not as a feature you have to negotiate. The provider isn't subject to extraterritorial US law because it isn't a US company. You give up some control -- the catalogue is whatever the vendor chooses to serve, upgrade cadence isn't yours to set, and you depend on the provider's uptime -- but the procurement story is dramatically simpler. There's no transfer mechanism question. There's no parent-company-jurisdiction question. The data flows through an EU legal entity, governed by EU law, end to end.

The second kind of vendor changes what the tier's capability question is. Once EU endpoints serve anyone's open weights, the comparison that matters is not "how good is the European model" -- it's closed frontier versus the open-weights ecosystem, and the ecosystem's best model is on the menu regardless of which lab trained it. A single lab's benchmark deck, Mistral's included, is one input to that comparison -- it doesn't settle it. The sovereign-hosting vendors behave accordingly: they serve whatever open weights make the capability case rather than defaulting to the European lab. Two filters gate the tier in practice: availability -- whether the model and version you want is actually on an EU endpoint, since catalogues lag releases and vary by vendor -- and provenance, because hosting someone else's weights buys data-jurisdiction sovereignty, not model-provenance sovereignty. The provenance filter gets its own section below.

For most regulated organisations doing pure generation -- without an internal corpus to protect -- this is the sensible middle ground. The open-weights field has closed enough of the gap that "EU-jurisdiction endpoints can't deliver good results" is no longer a credible blanket objection; what's left to work out is which model you pick, whether your vendor actually serves it, and how it behaves.

What Tier 2 gets right. The jurisdictional question becomes uninteresting -- it's answered by the choice of provider. Cost is comparable to Tier 1, often better on output-token rates. The capability ceiling tracks the whole open-weights ecosystem rather than one lab's roadmap. And any open-weights model on the menu can be self-hosted later (Tier 4) without re-implementing the application -- a useful insurance policy.

What Tier 2 gets wrong. Smaller vendors and less mature SDK ecosystems. Then there's availability lag: a new open-weights release is not on EU endpoints on day one, catalogues differ by vendor, and the model-and-version you validated against can be deprecated on someone else's schedule. The capability comparison needs care model by model: labs publish the benchmarks that flatter them -- on the suites a lab does publish, the picture looks good by construction; on the ones it doesn't, you're guessing. That gap-where-the-numbers-aren't applies to every model on offer, Mistral included, and it's workload-specific -- Mistral's own deck, for instance, is strong for coding and agentic tasks and silent on general reasoning. And the tier's other gate, provenance, travels with every non-EU entry on the menu.

For the architect: treat the tier's capability ceiling as the open-weights ecosystem, filtered by what your chosen EU endpoint actually serves -- check the catalogue for the exact model and version before designing around a launch announcement, and expect catalogues to trail releases by weeks. Of the models actually served, Mistral Medium 3.5 (128B dense, open weights, 256K context) is the reference point where provenance and jurisdiction align: 77.6% on SWE-Bench Verified on Mistral's own published figures -- ahead of the previous-generation proprietary flagship, and capable enough for most production coding and agentic workloads. The claim worth defending is that it's good enough for the work, not that it matches the frontier -- and a tier decision only needs good enough. The caveat worth flagging in any procurement deck is selective publication, and it applies to every lab. Mistral did not publish the standard general-reasoning suites (MMLU, GPQA, and similar) alongside the release, so the strong coding and agentic numbers don't automatically transfer to document-reasoning or open-ended analysis workloads; the Chinese-origin entries that lead some published leaderboards carry the provenance considerations from the note below. Validate on your own task either way. Mistral's pricing is public at $1.50/$7.50 per million input/output tokens, which undercuts the hosted proprietary frontier on both input and output; for agent loops that generate a lot of text, the output rate is the one that dominates the bill. The open-weights story is the quiet insurance policy regardless of lab: the same weights that run on the EU endpoint self-host later -- Mistral Medium 3.5 on as few as four GPUs -- so a workload that outgrows its risk tolerance for cloud can move to Tier 4 without an application rewrite.
Tool-use reliability on the mid-tier models is good enough to build against in 2026 -- run a brief validation pass on your own tool surface before committing, but the failure modes from a year ago (eager hallucinated tool calls, malformed JSON) are now much less common.
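A "brief validation pass on your own tool surface" can be as small as the sketch below. It assumes the model returns a tool call as a JSON object with `name` and `arguments` keys, which is a common shape but not a given for every provider, and the two tools in `TOOLS` are made up for illustration. Replace both with your real surface and recorded model outputs.

```python
import json
from typing import List, Optional

# Hypothetical tool surface: tool name -> required argument names and types.
TOOLS = {
    "search_documents": {"query": str, "limit": int},
    "get_case": {"case_id": str},
}


def check_tool_call(raw: str) -> Optional[str]:
    """Return None if the call is usable, else a short reason for the failure."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed JSON"
    if not isinstance(call, dict):
        return "not an object"
    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return "unknown tool"  # the "eager hallucinated tool call" case
    args = call.get("arguments")
    if not isinstance(args, dict):
        return "missing arguments"
    for arg, expected_type in TOOLS[name].items():
        if arg not in args:
            return f"missing argument: {arg}"
        if not isinstance(args[arg], expected_type):
            return f"wrong type for argument: {arg}"
    return None


def failure_rate(raw_calls: List[str]) -> float:
    """Share of recorded model outputs that would not have been usable."""
    if not raw_calls:
        return 0.0
    failures = sum(1 for raw in raw_calls if check_tool_call(raw) is not None)
    return failures / len(raw_calls)
```

Run it over a few hundred recorded outputs per candidate model; the number that comes out is specific to your tools and prompts, which is the point.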

Cost shape. Same pay-per-use model as Tier 1, broadly similar headline rates, often cheaper on the dimensions that dominate real bills (output tokens, batch inference). A working prototype against an EU-hosted mid-tier model in the low tens of euros per month is realistic for internal use.
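To see why output tokens dominate, it helps to run the arithmetic. The sketch below uses the $1.50/$7.50 per-million rates quoted above for Mistral; the workload shape (100 requests a day, 1,500 input and 500 output tokens each) is purely an assumption for illustration, so substitute your own measurements.

```python
from typing import Dict


def monthly_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    *,
    input_rate: float = 1.50,   # USD per million input tokens (rate quoted above)
    output_rate: float = 7.50,  # USD per million output tokens (rate quoted above)
    days: int = 30,
) -> Dict[str, float]:
    """Split a month's API bill into its input and output components."""
    requests = requests_per_day * days
    input_cost = requests * input_tokens * input_rate / 1_000_000
    output_cost = requests * output_tokens * output_rate / 1_000_000
    return {
        "input": input_cost,
        "output": output_cost,
        "total": input_cost + output_cost,
    }


# Assumed workload: a small internal prototype.
bill = monthly_cost_usd(requests_per_day=100, input_tokens=1_500, output_tokens=500)
output_share = bill["output"] / bill["total"]
```

With those assumed numbers the month comes to about $18, and output tokens account for well over half of it even though each request sends three times as many tokens as it receives.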

Operational burden. Same as Tier 1 -- you operate the application, the vendor operates everything else. The smaller vendor footprint means slightly more weight on your own monitoring (you can't lean on Azure's status page being everyone's status page), but in practice the burden is indistinguishable.

Tier 3: Hybrid -- retrieval in your perimeter, generation via EU API

Embeddings, retrieval, and document storage stay inside your perimeter. The generation step calls out to an EU-jurisdiction API.

For organisations doing RAG over internal documents -- which is the common case for "AI over our own data" -- this is the tier I'd reach for first in 2026, and the one I run myself at personal scale. The reasoning is workload-specific, not universal: the data-protection question for this kind of workload is dominated by what crosses your network boundary in raw form. If the only thing that ever leaves is a retrieved chunk -- which you can scope, redact, filter, or refuse to retrieve at all -- the surface area shrinks dramatically. The full corpus stays where it always was. The model never sees documents it wasn't authorised to see, because retrieval is the gate.

It's a Tier 2 generation story with a Tier 4 storage story. Both the legal exposure and the operational burden land somewhere in between, but the curve is asymmetric -- you get most of the sovereignty benefit of Tier 4 at a fraction of the operational cost. That asymmetry is why it's the default for this workload class; it stops being the answer the moment the workload isn't RAG, or the moment even chunked content can't cross the boundary (see Tier 4).

What Tier 3 gets right. The sensitive data -- the full text of contracts, case files, patient notes, internal correspondence -- never leaves your perimeter in bulk. Only chunks selected by your retrieval logic do, and you decide what's retrievable. Generation runs against a frontier-adjacent model without you having to host one. Operational burden is moderate -- you run a vector store and an embedder, both of which are mature commodity infrastructure.

What Tier 3 gets wrong. It's still a hybrid. Retrieved chunks do leave your perimeter at generation time, so for workloads where no portion of the source material can travel under foreign jurisdiction -- some defence work, certain healthcare datasets, privileged legal material -- Tier 3 isn't enough. The architecture also has more moving parts than Tier 1 or Tier 2: vector store, embedder, generation client, all in one application loop. More to monitor, more to upgrade.
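Since retrieval is the gate in this tier, it is worth seeing how little code the gate itself needs. This is a sketch under assumptions, not a reference implementation: `Chunk`, `RetrievalGate`, the group-based access model, and the email-only redaction rule are all illustrative, and a real deployment would plug in its own identity system and redaction policy.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, FrozenSet, List, Set

# Illustrative redaction rule only: mask anything that looks like an email.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    return EMAIL.sub("[redacted-email]", text)


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: FrozenSet[str]


@dataclass
class RetrievalGate:
    """Decides which retrieved chunks may leave the perimeter, and logs it."""

    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def release(self, user: str, user_groups: Set[str], candidates: List[Chunk]) -> List[str]:
        released = []
        for chunk in candidates:
            permitted = bool(chunk.allowed_groups & user_groups)
            # Log every decision, including refusals, for the audit trail.
            self.audit_log.append({
                "at": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "doc_id": chunk.doc_id,
                "released": permitted,
            })
            if permitted:
                released.append(redact(chunk.text))
        return released
```

Only the strings returned by `release` should ever be placed in the prompt sent to the generation API; everything the vector store returned but the gate withheld stays inside.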

For the architect: the pattern is mature enough to assemble from off-the-shelf parts. Vector store: Qdrant, Weaviate, or pgvector -- whichever your team is comfortable operating. Qdrant is the easiest to deploy if you're starting fresh; pgvector wins if you're already running Postgres and want one fewer service. Embedding model: sentence-transformers/all-MiniLM-L6-v2 for general English RAG runs comfortably on CPU at ingestion speed and stays competitive on quality with much larger embedders; multilingual workloads want something like intfloat/multilingual-e5-large. Generation client: any Tier 2 EU provider via their HTTP API, abstracted behind IChatClient or equivalent so swapping is configuration. The retrieval call is the only place content leaves the network -- apply chunk-level access controls there, log every retrieval for audit, and you have the audit trail DORA and similar regulations want to see. One detail that bites people: the embedder used at query time must match the one used at ingestion exactly, including quantisation, or vectors drift and recall quietly degrades. The cleanest way to enforce parity is to run the embedder as a single service and have both the indexer and the query path call it, rather than loading the model twice and hoping the weights match.
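Running one embedder service is the primary defence against drift. A cheap second guard, and this is my addition rather than a standard pattern, is to store a fingerprint of the embedder configuration next to the index and refuse to query when it no longer matches. The field names and the `rev-placeholder` revision below are hypothetical; 384 is the output dimension of all-MiniLM-L6-v2.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbedderConfig:
    model: str
    revision: str       # pinned model revision; value below is a placeholder
    quantisation: str   # e.g. "fp32" or "int8"
    dimensions: int
    normalise: bool

    def fingerprint(self) -> str:
        """Stable hash over every setting that changes the output vectors."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class EmbedderMismatch(RuntimeError):
    """Raised when the query-time embedder differs from the one used to index."""


def assert_parity(index_fingerprint: str, query_config: EmbedderConfig) -> None:
    if query_config.fingerprint() != index_fingerprint:
        raise EmbedderMismatch(
            "query-time embedder does not match the one used at ingestion; "
            "re-index or restore the original configuration"
        )


# At ingestion: store this string alongside the collection metadata.
ingest_config = EmbedderConfig(
    model="sentence-transformers/all-MiniLM-L6-v2",
    revision="rev-placeholder",
    quantisation="fp32",
    dimensions=384,
    normalise=True,
)
stored_fingerprint = ingest_config.fingerprint()
```

Calling `assert_parity(stored_fingerprint, current_config)` at service start-up turns silent recall degradation into a loud failure, which is the better trade.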

Cost shape. Tier 2's API spend on the generation side, plus your own infrastructure for storage and embedding. The generation API line dominates the running bill, and the reason is structural rather than incidental: each document is embedded once, at ingestion, and query-time embedding runs on a small model, while generation recurs on every query against a far larger model. For a small internal deployment -- on the order of 100,000 documents and a few hundred queries per day -- the storage and embedding side runs on commodity hardware costing less than a single mid-range laptop, amortised over years. Scale the query volume and it's the generation tokens, not the infrastructure, that move the invoice.

Operational burden. Moderate. You're running two services (vector store and embedder) on top of the application. Both have well-understood failure modes. The main operational discipline is keeping the ingestion pipeline running -- new documents need to be embedded and indexed without manual intervention, or the system rots quietly.
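One way to keep ingestion hands-off is to make each run idempotent: hash the content, compare against what the index already holds, and only embed what changed. The sketch below is one possible shape, assuming you can list current documents and keep a `doc_id -> hash` map next to the index; the function names are illustrative.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_ingestion(documents: Dict[str, str], indexed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare current documents with the hashes recorded at last ingestion.

    documents: doc_id -> current text
    indexed:   doc_id -> content hash stored when the document was embedded
    """
    to_embed = sorted(
        doc_id for doc_id, text in documents.items()
        if indexed.get(doc_id) != content_hash(text)  # new or changed
    )
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in documents)
    return {"embed": to_embed, "delete": to_delete}
```

Scheduled on a timer, a plan with empty `embed` and `delete` lists is the normal case, and a plan that is unexpectedly large is a useful alert in its own right.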

Tier 4: Fully on-premises

Your hardware, your network, your inference. Nothing leaves the building.

Maximum control, maximum operational burden. Justified when the data genuinely cannot leave -- some healthcare workloads, defence work, certain categories of legal material, environments under works council agreements that prohibit external processing of employee data, or anywhere the audit trail itself becomes evidence.

The hardware story has improved sharply since 2024. A single workstation with enough unified memory or VRAM now hosts models that would have needed a small cluster two years ago. A 30B-class model -- capable enough for many knowledge-work and retrieval tasks, if below the current frontier on the hardest reasoning -- fits comfortably on a 128GB unified-memory-class workstation (as of May 2026, the Framework Desktop with Strix Halo is one concrete example of that class; comparable Nvidia workstation builds exist). One such machine serves a handful of concurrent users; a pair with a load balancer in front serves a small team. For organisations that need both the privacy and reasonably modern capability, the curve has bent in a useful direction.

It has not, however, become trivial. Tier 4 means owning the model upgrade cycle, the GPU procurement cycle, the capacity planning, the on-call rotation, and the operational tail of keeping inference reliably available. For frontier-adjacent generation -- the largest open-weights models, served at production latency and throughput -- you're into hardware investments that compete with senior-engineer salaries, and you need a team that knows how to operate them.

What Tier 4 gets right. Nothing leaves. No transfer mechanism question, no chunk-level access policy to write, no audit trail for outbound calls because there are no outbound calls. The strongest possible answer to "what happens in the worst case" for the workloads where the worst case actually matters.

What Tier 4 gets wrong. Cost and burden are both real. The hardware bill is the visible part; the staffing bill is the invisible one. Model upgrades that arrive on Tier 1 as "version bumped overnight" arrive on Tier 4 as "schedule a maintenance window and re-benchmark." Capability is workload-dependent: for retrieval and agent workloads where most of the value lives in tool-use and orchestration, modern mid-size open-weights models are excellent; for the hardest reasoning tasks, the frontier-quality gap is still present.

For the architect: viable Tier 4 setups in 2026 fall into three rough classes (hardware prices are May-2026 ballparks and will date quickly -- treat them as orders of magnitude, not quotes):

  1. Single unified-memory workstation, single model -- a 128GB unified-memory-class machine running a 30B-class model like Qwen3-Coder 30B or Mistral Small. Serves a few concurrent users; single point of failure; low five figures of euros.
  2. Small multi-GPU server or workstation pair running 70B-plus models with redundancy -- tens of thousands of euros depending on configuration; serves a department.
  3. Multi-GPU server for the largest open-weights models -- Mistral Medium 3.5 self-hosts on as few as four GPUs, and that class of deployment runs into six figures with real data-centre or colocation requirements; serves an organisation.

Across all three, the inference stack itself is mature -- vLLM or TGI for throughput, Ollama for smaller deployments -- but the operational disciplines (model upgrade pipeline, capacity planning, on-call rotation) are entirely your team's problem. Don't underestimate the staffing line; that's where Tier 4 budgets break, not the hardware.
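A back-of-envelope check for "does this model fit this box" is weights-only arithmetic: parameters times bits per weight, divided by eight, gives gigabytes. The sketch below adds a flat 20% allowance for KV cache and runtime overhead, which is an assumption of mine and not a measured figure; real headroom depends on context length and concurrency, and unified-memory machines cannot hand all of their RAM to the GPU.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (1B params at 8 bits is about 1 GB)."""
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: int,
    memory_gb: float,
    overhead: float = 1.2,  # assumed allowance for KV cache and runtime
) -> bool:
    """Rough feasibility check; validate with a real load test before buying."""
    return weights_gb(params_billion, bits_per_weight) * overhead <= memory_gb


# A 30B-class model at 4-bit quantisation on a 128GB unified-memory machine.
small_on_workstation = fits(30, 4, 128)

# A 128B dense model at 16-bit on the same machine, then on four GPUs.
# The 80GB-per-GPU figure is an assumption for illustration.
large_on_workstation = fits(128, 16, 128)
large_on_four_gpus = fits(128, 16, 4 * 80)
```

Under these assumptions the 30B model fits the workstation with room to spare, while the 128B dense model at 16-bit needs roughly 256GB for weights alone, which is why it lands in the multi-GPU class.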

Cost shape. Heavy upfront capital expenditure, modest ongoing operating cost (power, cooling, periodic hardware refresh), significant staff time. For a small-scale deployment the hardware is manageable; at production scale serving a meaningful user base, the total cost of ownership lands well above Tier 3, and usually above Tier 1 too once you're honest about staff time.

Operational burden. High. You are now in the business of operating inference infrastructure. That's fine if you already operate other production infrastructure with similar discipline; it's a category change if you don't.

A note on "sovereign cloud" SKUs

The cloud hyperscalers have all introduced sovereign-cloud offerings as a response to the residency-vs-sovereignty gap -- Microsoft Cloud for Sovereignty, AWS European Sovereign Cloud, Google Cloud's sovereign solutions, and partnerships like Bleu (Microsoft + Capgemini/Orange in France) and S3NS (Google + Thales). These are improvements over standard cloud, but they sit between Tier 1 and Tier 2 rather than collapsing the distinction.

What they get right. Operational sovereignty improves meaningfully. Local operators handle physical access, local staff handle support, the control plane gains EU-jurisdiction overlays. For organisations that need to satisfy operational-sovereignty audit requirements without abandoning the hyperscaler ecosystem, they're a real option.

What they get wrong. The underlying corporate structure usually remains -- the joint venture or operator agreement layers EU governance on top of a substrate still operated by the foreign parent. The jurisdictional question doesn't fully resolve unless the EU partner holds majority control and the foreign parent's involvement is limited to licensing rather than operations. Read the actual corporate structure of any specific offering before treating it as equivalent to a fully EU-domiciled provider; the variation between SKUs is significant, and the marketing copy isn't a reliable guide.

For most organisations, the question between "sovereign cloud" and Tier 2 comes down to whether the ecosystem you're already in (Microsoft 365 integrations, AWS service breadth) is worth the residual jurisdictional ambiguity. For some it clearly is. For workloads where Part 1's analysis pushed you off Tier 1 in the first place, it usually isn't.

A note on model provenance

Open weights split a question that hosted proprietary APIs keep bundled: where your data goes, and where the model came from. An EU provider serving open weights answers the first completely: EU legal entity, EU infrastructure, and model weights that are just files -- nothing phones home to the lab that trained them. Hosted in Paris, DeepSeek is exactly as sovereign as Mistral on the data axis.

Moving the hosting doesn't change what's baked into the model. Weights carry the training and alignment choices of the lab that produced them, and some of those choices are political. The best-documented case is DeepSeek R1, where researchers found refusal and deflection on politically sensitive topics embedded in the openly distributed weights themselves, persisting in fully private deployment -- not just in the hosted chatbot. Every lab bakes behavioural policy into its weights, US labs included; the state-aligned tuning of some Chinese-origin models is simply the variant most visible, and most awkward, in a European deployment.

Whether it matters is workload-specific, like everything else in this post. A coding assistant or a RAG system over internal case files rarely puts geopolitics in front of the model. A citizen-facing chatbot, a policy-analysis tool, or anything answering open-ended questions does -- and baked-in political alignment is a behaviour risk that benchmarks don't surface.

This is the residual argument for EU-origin weights beyond the jurisdiction one: with Mistral, provenance and jurisdiction point at the same place. Where they don't, provenance narrows what Tier 2 can actually offer you for an exposed workload -- treat it as its own line in the assessment, separate from hosting, and validate against your own prompt set before committing. The distinction cuts across tiers -- it applies equally to an EU-hosted open-weights endpoint and to the same weights self-hosted under Tier 4.
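Validating against your own prompt set does not need tooling beyond a loop. The sketch below flags answers that look like refusal or deflection so a human can review them; the marker phrases are crude, hypothetical examples, and `model` is any callable you supply that sends a prompt to the candidate endpoint and returns its text.

```python
from typing import Callable, List

# Crude, hypothetical markers; flagged answers still need human review.
DEFLECTION_MARKERS = (
    "i cannot discuss",
    "i can't help with that",
    "let's talk about something else",
)


def flag_deflections(model: Callable[[str], str], prompts: List[str]) -> List[str]:
    """Return the prompts whose answers look like refusal or deflection."""
    flagged = []
    for prompt in prompts:
        answer = model(prompt).lower()
        if any(marker in answer for marker in DEFLECTION_MARKERS):
            flagged.append(prompt)
    return flagged
```

Run the same prompt set against each candidate model and compare which prompts get flagged; differences between models on identical prompts are where provenance is showing through.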

A decision framework that fits on a napkin

Putting the tiers next to each other on the axes that matter:

| Tier | Jurisdictional control | Operational sovereignty | Capability ceiling | Operational burden | Cost shape |
| --- | --- | --- | --- | --- | --- |
| 1 -- US "EU region" | US (CLOUD Act) | Medium | Frontier | Near zero | Pay-per-use |
| 2 -- EU API | EU | Medium-high | Open-weights ceiling, gated by vendor catalogue and provenance | Near zero | Pay-per-use |
| 3 -- Hybrid (RAG) | EU (for stored data) | High (storage) / Medium (retrieved chunks) | Tier 2's ceiling | Moderate | Low fixed + Tier 2 API |
| 4 -- Fully on-prem | EU | Maximum | Open-weights ceiling, gated by your hardware | High | High capital + staff |

A pragmatic shortlist for the choice, rough order:

  1. Is the workload special-category data, privileged material, or governed by sector-specific rules (DORA, NIS2, GDPR Article 9, national administrative law)? If no, Tier 1 is defensible -- do the calculation honestly and move on. If yes, continue.
  2. Does the data need to stay inside your perimeter in raw form, end to end? If yes, Tier 4. If chunked retrieved content under EU jurisdiction is acceptable, continue.
  3. Is RAG the dominant pattern? If yes, Tier 3. Frontier-adjacent capability via Tier 2 generation, sensitive corpus in your perimeter, well-understood operational shape. For many organisations doing AI over internal documents, this is where the trade-offs land most favourably.
  4. Pure generation use case without internal documents -- chatbot, classification, summarisation of public material? Tier 2. The hybrid story doesn't buy you anything when there's no corpus to protect.
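The shortlist above is small enough to write down as a function, which is a handy way to make the order of the questions explicit in a procurement document. The field names are mine, and the function encodes only the four questions as stated; it does not capture time horizon, provenance, or the sovereign-cloud middle ground.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    # Q1: special-category data, privileged material, or sector-specific rules?
    regulated_or_sensitive: bool
    # Q2: must the data stay inside the perimeter in raw form, end to end?
    nothing_may_leave_perimeter: bool
    # Q3: is RAG over internal documents the dominant pattern?
    rag_dominant: bool


def recommend_tier(workload: Workload) -> int:
    """Walk the four questions in order and return the first tier that fits."""
    if not workload.regulated_or_sensitive:
        return 1  # defensible; do the calculation honestly and move on
    if workload.nothing_may_leave_perimeter:
        return 4
    if workload.rag_dominant:
        return 3
    return 2  # pure generation, no internal corpus to protect
```

For example, `recommend_tier(Workload(True, False, True))` returns 3: a regulated workload where chunked retrieval under EU jurisdiction is acceptable and RAG dominates.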

The trap in this list is treating Tier 4 as the "serious" answer for sovereignty-sensitive workloads. In many RAG systems the corpus is the primary sensitive asset, not the generated output -- and where that holds, the remaining gap from Tier 3 to Tier 4 is real but narrow, only worth crossing when even retrieved chunks can't leave.

What's next in this series

This post laid out the tiers. The next one is a reference architecture.

  • Part 3 walks through a working Tier 3 RAG system: vector store, embedding pipeline, retrieval API, generation through an EU provider, audit logging, the integration points that matter in real procurement. .NET-centric but the architectural decisions are language-neutral. The point is to show what the moving parts actually look like in working code, not to advocate for a specific stack.

The harder truth across the series is that the right answer is workload-specific, time-horizon-specific, and data-class-specific. There is no single tier that's correct for an organisation -- there's a portfolio of choices, one per workload, each defensible on its own terms. The procurement question isn't "which tier do we standardise on" -- it's "for this specific workload, with this specific data class, with this specific platform lifespan, which tier matches the risk?"

That sounds like more work than picking one. It is more work. It's also the work that actually addresses the legal exposure, instead of optimising for procurement convenience and finding out about the gap during an audit.

If your organisation is working through this -- public body, regulated private-sector business, or somewhere in between -- I'd be interested to hear what tier choices are landing where, and where the friction is showing up. Email's in the footer.


Previous: Why your AI pilot shouldn't leave the EU -- and what "EU" actually means
Next: Part 3 -- a working EU-jurisdiction RAG reference architecture (forthcoming)