Self-hosted vs cloud data tools: a 2026 decision framework

Of every seven cloud-platform debates we mediate, six go like this: the platform team wants managed SaaS; the security team wants self-hosted; the CFO wants the cheaper option; the CIO wants to look modern. Two months in, nothing has been decided, the budget cycle is closing, and the team is about to default to whichever vendor sent the most expensive lunch.

The frustrating part is that the decision is rarely close once you ask the right questions. Most enterprises will end up on managed SaaS for 80% of their data stack and self-hosted for a clearly defined 20%. The trick is identifying which is which before three months go by.

The seven questions that resolve it

Run any data-platform decision through these. They are deliberately ordered — an honest "yes" to any of the first three flips the default to self-hosted.

1. Is there a regulatory text that explicitly forbids data leaving the country / network? Not "guidance suggests" or "spirit of the rule", but an actual law or regulator letter. MAS Notice 644 in Singapore, RBI data localisation in India, NDMO sovereign-data classifications in Saudi Arabia, ADHICS in the UAE health sector. If any of these apply, you are likely self-hosting, full stop.
2. Is operating without internet access a continuity requirement? Defence, OT in critical infrastructure, sovereign-government sandboxes. If "the model must keep working when the perimeter is sealed" is in the BCP, you are self-hosting.
3. Does the SaaS vendor, or a business it funds, compete with you? A regional bank doesn't want its data inside a US hyperscaler that is also funding a fintech competing for the bank's deposits. Procurement won't sign.
4. Is the workload primarily an API, or primarily ad-hoc analyst use? APIs are usually fine on SaaS because the consumption is predictable. Ad-hoc analyst exploration on consumption-priced SaaS is where the surprise bills hide.
5. Do you have an in-house platform team that already runs Postgres / Kubernetes at scale? If yes, the operational delta of self-hosting one more thing is small. If no, the operational delta is the whole conversation.
6. Is the cost predictability requirement stronger than the elasticity requirement? Boards hate variable bills more than they love elasticity. If the CFO needs the number to land within ±5% of the budget every quarter, self-hosted is structurally a better fit.
7. What's the worst case if the vendor disappears / is acquired / changes terms? Snowflake, Databricks, and Fabric will not disappear next year. A smaller SaaS vendor in the long tail might. Self-hosted gives you a guaranteed runway equal to your hardware life; managed SaaS gives you a runway equal to your contract end-date.

If you're answering "yes" to questions 1, 2, or 3, the decision is made; the rest of the meeting is just acknowledging it. If those three are no, then 4-7 are weighing exercises — usually with managed SaaS as the right answer.
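For teams that want to record the outcome in a repeatable way, the gate-then-weigh logic above can be sketched in a few lines of Python. This is a minimal illustration, not a scoring tool we ship: the `Answers` class, its field names, and the `recommend` function are hypothetical, and the equal weighting with a "three of four" threshold for questions 4-7 is an assumption chosen only to reflect that managed SaaS is usually the right answer once the first three are "no".

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """Hypothetical container for the seven answers (True means "yes")."""
    explicit_regulatory_ban: bool         # Q1: a law or regulator letter forbids egress
    offline_continuity_required: bool     # Q2: must keep working with the perimeter sealed
    vendor_conflict_of_interest: bool     # Q3: vendor (or its portfolio) competes with you
    adhoc_analyst_heavy: bool             # Q4: mostly ad-hoc exploration, not an API
    has_platform_team: bool               # Q5: already runs Postgres / Kubernetes at scale
    predictability_over_elasticity: bool  # Q6: budget must land within a tight band
    vendor_exit_risk_high: bool           # Q7: vendor could disappear or change terms


def recommend(a: Answers) -> str:
    # Questions 1-3 are gates: a single honest "yes" settles the decision.
    if (a.explicit_regulatory_ban
            or a.offline_continuity_required
            or a.vendor_conflict_of_interest):
        return "self-hosted"

    # Questions 4-7 are weighed. Equal weights and the 3-of-4 threshold
    # are assumptions for illustration, biased towards managed SaaS.
    signals = [
        a.adhoc_analyst_heavy,
        a.has_platform_team,
        a.predictability_over_elasticity,
        a.vendor_exit_risk_high,
    ]
    return "self-hosted" if sum(signals) >= 3 else "managed SaaS"


if __name__ == "__main__":
    workload = Answers(False, False, False, True, False, True, False)
    print(recommend(workload))
```

In practice the weighing step is a conversation rather than a sum; the value of writing it down is that the three gates are checked first and cannot be outvoted by the other four.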

The three regional contexts where it's already obvious

We work across Singapore, India, the UAE, Saudi Arabia, Australia, and the rest of ASEAN. Three regional patterns surface so consistently that we treat them as defaults rather than evaluations:

Singapore financial services. MAS Notice 644, the Technology Risk Management guidelines, and the third-party risk obligations make sending raw transaction data to an off-shore SaaS a controlled exception, not a default. Most banks we work with default to "self-hosted unless we can show why SaaS is safer". Wekams Lens-style architectures fit naturally.
India under DPDP and RBI localisation. The DPDP Act lets the government restrict where personal data of Indian residents may be transferred, and the RBI has explicit local-storage requirements for payment data. Most managed SaaS satisfies this through Indian regions; some customers are still uncomfortable with a foreign-controlled control plane. Self-hosted in a regional colo is increasingly the answer for the largest customers.
Gulf sovereign-cloud customers. Saudi NDMO, UAE TDRA, and emerging frameworks across the GCC have moved from "data residency" to "operator sovereignty". Even if the data is in-region, the workload must be operable by domestic staff with no foreign technical dependency. That's a self-hosting argument by definition.

None of this means cloud-native tools are wrong in these markets. It means a defensible architecture in these markets usually involves a self-hosted layer for the regulated workloads, with managed SaaS used selectively for the unregulated remainder.

What "self-hosted" doesn't mean

Two clarifications that come up in every steering-committee debate:

Self-hosted does not mean owning the hardware. Running the workload on commodity hardware in a regional colocation, or inside a sovereign-cloud region operated by a domestic provider, counts as self-hosted for almost every regulatory framework. You are the operator. You control the data. The hardware bill is somebody else's.

Self-hosted does not mean rejecting AI. Open-weight LLMs such as Qwen, Llama, DBRX, and DeepSeek are now competent enough that the gap between "frontier model in the cloud" and "open-weight model on your hardware" matters for some tasks and is irrelevant for many. The tasks where it matters are mostly the public-facing ones (research, creative writing). The tasks where it doesn't (SQL generation, classification, summarisation, structured extraction) are exactly the ones enterprise data agents are doing.

Where the decision usually goes wrong

Two failure modes account for most regret:

Buying managed SaaS because the regulator hasn't explicitly forbidden it yet. Three regional regulators in our markets have tightened their position on cross-border data flow inside the last 18 months. Buying a platform whose contract you will have to re-paper within 18 months is buying future pain.
Self-hosting something operationally complex without an SRE team for it. The cost of running a Snowflake-equivalent yourself is real. Don't underestimate the human bandwidth needed to do this well. If you don't have it, the right answer is to self-host the lightest possible product (a DuckDB-based agent, a Postgres-backed catalog, a single VM) and SaaS the rest.

The pragmatic stack we recommend

For most regulated-industry customers in our regions, the stack we end up advising is:

Storage in the customer's preferred cloud, in-region. AWS Singapore, Azure UAE Central, GCP Mumbai, etc.
Source systems stay where they are. Postgres / Oracle / SAP / Mainframe — whatever the customer has.
The data agent — the natural-language interface, the SQL-generating LLM, the conversation history — self-hosted on a small Kubernetes cluster or even a single VM in the customer's network. Air-gap-capable. Open-weight model bundled.
Frontier models used selectively, for unregulated tasks, on the customer's terms.

This is precisely the architecture Wekams Lens is built around: the storage and source systems stay yours; the agent runs in your network; the model is open-weight; the LLM compute runs on your hardware. Customers can layer a frontier model on top for non-sensitive workloads if they want, but the spine of the system stays under their control.
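The "open-weight by default, frontier model by opt-in" rule can be made concrete with a small routing sketch. It is illustrative only and does not describe the Wekams Lens API: the `Task` class, the `pick_model` function, and the two model labels are hypothetical names, and the rule that anything touching regulated data always stays local is our reading of the stack described above.

```python
from dataclasses import dataclass

# Hypothetical labels; a real deployment would map these to actual endpoints.
LOCAL_MODEL = "open-weight-local"
FRONTIER_MODEL = "frontier-external"


@dataclass(frozen=True)
class Task:
    name: str
    touches_regulated_data: bool


def pick_model(task: Task, frontier_enabled: bool = False) -> str:
    """Route a task to the local open-weight model unless the customer
    has opted in to a frontier model AND the task is unregulated."""
    # Default is local, so an air-gapped install works with no extra config.
    if task.touches_regulated_data or not frontier_enabled:
        return LOCAL_MODEL
    return FRONTIER_MODEL


if __name__ == "__main__":
    sql_task = Task("generate SQL over transactions", touches_regulated_data=True)
    copy_task = Task("draft a public blog post", touches_regulated_data=False)
    print(pick_model(sql_task, frontier_enabled=True))
    print(pick_model(copy_task, frontier_enabled=True))
```

Making the frontier path an explicit opt-in, rather than a fallback, keeps the failure mode safe: if the flag is never set, nothing leaves the customer's network.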

Not a religion. Not an ideology. Just the architecture that satisfies the regulator while still letting the analyst ask a question.