Cross-source SQL: joining a database and an object store in one query.

The hardest business questions cross systems. The platforms charging you the most to answer them are also the ones least able to reach across vendor boundaries. Here's why that gap exists, and what a one-query cross-source JOIN actually looks like.

Every consulting engagement on data eventually runs into the same wall. The CFO asks "why did revenue drop last week?", and the answer turns out to live in:

  • The production Postgres database (orders, customers).
  • A monthly Parquet export from Stripe sitting in an S3 bucket.
  • An error spike in the checkout-service logs, currently in CloudWatch.

Three systems. Three tools. Three logins. Three flavours of SQL or query language. By the time the analyst has stitched the answer together, the executive has moved on to the next question, or worse, made a decision on the assumption that the drop was demand-side.

Cross-source SQL — the ability to write one query that joins across all three — should be a solved problem in 2026. It isn't, at least not in the platforms most enterprises have spent the last decade buying.

What "federated" really means in the marketing

Every major data cloud advertises some flavour of federation:

  • Snowflake external tables and Iceberg. Lets Snowflake compute read Parquet files in S3. Works well if S3 sits inside an AWS account Snowflake has been granted access to. Does not help if the Parquet sits in someone else's S3, in Azure Blob, or in GCS.
  • Databricks Unity Catalog Federation. Lets Databricks compute reach into a number of external systems — including Snowflake, BigQuery, and Postgres. Works if you are already a Databricks shop and willing to register every external source in Unity Catalog. Does not help your analyst who wants to JOIN a Parquet file with a Splunk index.
  • BigQuery Omni. Lets BigQuery compute reach into AWS S3 and Azure ADLS. Tied to BigQuery for query planning. Does not help if BigQuery itself is not the centre of gravity of your data stack.

Each of these is a real capability inside the vendor's walls. None of them lets you write one statement that does JOIN postgres_orders ON s3_stripe_export ON splunk_errors when those three systems belong to three different vendors.

Why this is hard, architecturally

The technical reason most vendors don't do this is that their query engine is co-designed with their storage. Snowflake's optimiser was designed around micro-partition statistics it controls; Databricks' Photon engine was designed around Delta Lake metadata. Neither was designed to push down predicates into a system that pre-dates them by ten years.

The commercial reason is more honest: vendors don't want to make it easy to query data that lives in a competitor's storage. The longer you keep data in their warehouse, the larger the consumption bill.

The third reason — the one that's most underappreciated — is that real cross-source SQL needs the engine to be small enough to live next to the data, not the other way around. Sending 50 GB of Parquet through a hyperscaler-billed network so a hosted warehouse can JOIN it with 200 KB of Postgres rows is the wrong architecture. The right architecture pulls the engine to the data.

DuckDB is the engine you don't have to host

DuckDB has quietly become the cross-source query engine of choice for new tooling. Three reasons:

  • It is a single binary. There is no cluster to provision, no Kubernetes operator to configure. You drop it into your process and it works.
  • It speaks Postgres natively (via its postgres extension), reads Parquet / CSV / JSON-lines natively, and reads S3 / Azure Blob / ADLS Gen2 / GCS via httpfs and the azure extension.
  • It has a competent query planner that pushes filters and projections into the source where it can.

A real DuckDB session, attached to a Postgres database and pointed at an S3 bucket of Parquet files, looks like this:

INSTALL postgres; LOAD postgres;
INSTALL httpfs;  LOAD httpfs;

CREATE SECRET wekams_s3 (
  TYPE S3, KEY_ID 'AKIA...', SECRET '...', REGION 'ap-southeast-1'
);

ATTACH 'host=db.internal dbname=app user=ro' AS prod (TYPE POSTGRES, READ_ONLY);

CREATE VIEW exports.stripe AS
  SELECT * FROM read_parquet('s3://finance-exports/stripe/*.parquet');

-- Now this works:
SELECT  c.country,
        COUNT(DISTINCT o.id)              AS orders,
        SUM(s.amount_cents) / 100         AS stripe_amount_usd
FROM    prod.public.customers   c
JOIN    prod.public.orders      o ON o.customer_id = c.id
JOIN    exports.stripe          s ON s.customer_email = c.email
WHERE   o.placed_at >= NOW() - INTERVAL '7 days'
GROUP BY c.country
ORDER BY orders DESC;

One statement. Two systems. One result set. The Postgres side does the filter and the projection (DuckDB pushes them down). The S3 side returns only the Parquet row groups that overlap the date range. The JOIN happens in-process inside DuckDB.

What this changes for the agent layer

The reason cross-source SQL matters now in particular is that the natural-language interface layer is finally here. A user asks the question once, in English; the agent generates the SQL above; the engine runs it; the result comes back as a table and a sentence. The user never has to know which system is which.

Wekams Lens, the data agent we built for this category, uses DuckDB as its federation engine for exactly this reason. The LLM is told what sources are attached and what tables each one exposes; it writes the federated SQL; DuckDB attaches both Postgres and the relevant S3 bucket; the answer comes back as a row table. Same UX whether the question hits one source or four.

Where this still doesn't work

To be honest about the limits: DuckDB-style federation works extremely well when the working set fits in memory at the JOIN step. Joining a 50-million-row Postgres table with a 200-million-row Parquet table is fine. Joining two 5-billion-row sources is not what this engine was designed for, and you should not throw it at that problem.

For the petabyte-scale internal warehouse, Snowflake and Databricks remain the right answer. They earn their compute bill on those workloads.

For the 80% of cross-source questions that come up in the average enterprise — mid-sized JOINs across vendor boundaries, the kind that the CFO asks at the Monday meeting — DuckDB-driven federation is now the better answer. And it's the only answer that runs in the customer's own network.

What to try this week

If you have an analyst stuck stitching answers across systems, give them an afternoon with DuckDB. The official extensions for Postgres, MySQL, S3, and Azure cover almost every modern stack. They will write their first cross-source SQL inside an hour. Whether you then layer a natural-language agent on top — we obviously think you should — is a separate decision. The federation engine working at all is the unlock.