Logs deserve to be a first-class data source

There is a category boundary in the modern data stack that is past its expiry date. On one side: structured data, analytics, BI, "talk to your data" agents. On the other: observability, logs, traces, "how is the system behaving". The two categories share customers, share questions, and share tooling underneath — but the tools sold to the two sides have stayed stubbornly separate.

That separation is now actively hurting the business questions the analytics side is trying to answer.

The question that crosses the line

Try asking any one of the major data agents this:

"Compare orders placed in the last 5 days vs the 9 days before that, and check the checkout-service logs for any failure spike in the same windows."

Snowflake Cortex will compare the orders. It cannot reach the logs. Databricks Genie will compare the orders. It cannot reach the logs unless you have already ingested them into a Delta table, which most enterprises haven't. Fabric Copilot will compare the orders. The logs live in Application Insights, which is a different product surface.

And yet that question is the most common variety of executive question we see in real engagements. The number moved. Why? The answer is rarely "demand dropped". The answer is almost always "the payment gateway is returning more timeouts" or "the search service was returning empty results all weekend" or "the email-verification step started bouncing 30% of new signups". Those answers live in the application log.

Why the wall exists

The observability vendors built large, valuable companies around the idea that logs are a different kind of data. They use a different storage engine (Lucene-based or columnar-on-blob), a different query language (KQL, SPL, Elasticsearch query DSL), a different UI, and a different pricing model (priced per GB ingested per day rather than per compute hour).

The analytics vendors built equally large companies around the idea that the warehouse is the centre of gravity. Their tools are designed around tables with schemas, not around streams of semi-structured events.

Both are right within their domain. The wall between them is what's wrong — specifically, the wall is wrong inside the questions a business actually asks. No CFO has ever asked a question that fit neatly inside one category.

What "logs as a first-class source" actually means

Five concrete things have to be true for log data to feel like a real participant in the analytics stack, not a side-quest:

1. The query language is shared, or the abstraction layer hides the difference. The analyst should not have to learn KQL to ask "how many errors yesterday".
2. Logs JOIN with warehouse tables. The error count by service has to be joinable with the orders table by time bucket, and the user_email field in the log has to JOIN with the customers table.
3. Schema is inferred, not pre-declared. Application logs evolve constantly. The data layer has to handle a new field showing up next week without a migration.
4. Time is a first-class dimension. Most useful log questions are "per X over time Y". The query engine needs date_histogram-style aggregation natively.
5. Cost scales with use, not ingest. The economics of log analytics fall apart if you have to pay to ingest every event before you know whether you'll want to query it.

JSON-lines log files in S3 already meet most of these criteria for free. So do Elasticsearch and OpenSearch indices, provided the agent layer speaks their query DSL. The mistake is treating them as second-class because they're not in the warehouse.
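To make the "schema is inferred" and "time is a first-class dimension" points concrete, here is a minimal sketch using only the Python standard library. The log lines, field names (ts, level, code, retry) and function names are hypothetical and purely illustrative; a real connector would read files from object storage rather than an inline string.

```python
import json
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical JSON-lines checkout logs. Note that fields appear and
# disappear between events, as they do in real application logs.
LOG_LINES = """\
{"ts": "2024-05-01T09:15:00", "service": "checkout", "level": "ERROR", "code": "GATEWAY_TIMEOUT"}
{"ts": "2024-05-01T09:40:00", "service": "checkout", "level": "INFO"}
{"ts": "2024-05-02T11:05:00", "service": "checkout", "level": "ERROR", "code": "GATEWAY_TIMEOUT", "user_email": "a@example.com"}
{"ts": "2024-05-02T11:30:00", "service": "checkout", "level": "ERROR", "code": "CARD_DECLINED", "retry": 2}
"""


def parse_events(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def infer_schema(events):
    """Map every field seen to the set of Python type names observed for it."""
    schema = defaultdict(set)
    for event in events:
        for key, value in event.items():
            schema[key].add(type(value).__name__)
    return dict(schema)


def daily_histogram(events, level="ERROR"):
    """Count events of the given level per calendar day (a date histogram)."""
    counts = Counter()
    for event in events:
        if event.get("level") == level:
            day = datetime.fromisoformat(event["ts"]).date().isoformat()
            counts[day] += 1
    return dict(sorted(counts.items()))


events = parse_events(LOG_LINES)
schema = infer_schema(events)          # picks up "retry" and "user_email" with no migration
errors_per_day = daily_histogram(events)
```

The schema is simply the union of fields observed so far, so a field that first appears next week shows up as a new column rather than a failed load.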

The simple test

In our engagements we use one test to see whether a "unified data" product is genuinely unified: ask it the canonical mixed question above. If the response involves a separate tool, a screenshot from Splunk, or a paragraph that begins "unfortunately the logs are in…", the product is unified only inside its own walls.

A unified data product handles the question end-to-end:

1. Picks up the orders table from the warehouse.
2. Picks up the checkout-service log stream from wherever it lives: S3 files, an Elasticsearch index, or a structured log volume.
3. Buckets both sources into the same time windows.
4. Returns a single table that shows order count and error count side by side, both per time bucket.
5. Lets the user follow up with "which error code spiked the most?" without re-explaining the context.
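The bucketing and side-by-side steps can be sketched for the canonical question (last 5 days vs the 9 days before that) in a few lines of standard-library Python. Everything here is an assumption made for illustration: the function names, the sample dates, and the choice of half-open windows that end at, and exclude, "today".

```python
from datetime import date, timedelta


def comparison_windows(today, recent_days=5, prior_days=9):
    """Return (prior, recent) as half-open [start, end) date ranges.

    The recent window ends at `today` (exclusive); the prior window
    ends where the recent one starts, so the two never overlap.
    """
    recent_start = today - timedelta(days=recent_days)
    prior_start = recent_start - timedelta(days=prior_days)
    return (prior_start, recent_start), (recent_start, today)


def count_in(days, window):
    """Count how many dates fall inside a half-open window."""
    start, end = window
    return sum(1 for day in days if start <= day < end)


def compare(order_days, failure_days, today):
    """One row per window, with order and failure counts side by side."""
    prior, recent = comparison_windows(today)
    return [
        {
            "window": label,
            "orders": count_in(order_days, window),
            "failures": count_in(failure_days, window),
        }
        for label, window in (("prior 9 days", prior), ("last 5 days", recent))
    ]


# Hypothetical data: one date per order and one per checkout failure.
TODAY = date(2024, 5, 15)
ORDER_DAYS = [date(2024, 5, d) for d in (2, 4, 6, 8, 11, 14)]
FAILURE_DAYS = [date(2024, 5, d) for d in (3, 11, 12, 13)]

result = compare(ORDER_DAYS, FAILURE_DAYS, TODAY)
```

Because both sources are reduced to the same two windows before they are combined, the result is a single table rather than two result sets the analyst has to reconcile by hand.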

How we built it in Wekams Lens

This is the specific gap that drove us to build Wekams Lens. We treat JSON-lines log files and Elasticsearch / OpenSearch indices as connectors, registered the same way as Postgres or S3. The catalog presents them to the LLM with the same shape (schema, columns, sample rows). The federation engine attaches them to the same DuckDB session as the warehouse tables, so JOINs across orders and logs are one SQL query, not three tools.

The result is that the answer to "why did orders drop" in our demo — against a Postgres orders table and a folder of JSON-lines checkout logs — is one query, one result table, one paragraph. The LLM writes the SQL with CTEs that bucket both sources by day; DuckDB executes it; the analyst sees that orders went from 3 to 1 while checkout failures went from 59 to 144, and the diagnosis writes itself.
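For a sense of the shape of such a query, here is a minimal sketch, not the SQL Wekams Lens actually generates. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The table names, column names, and sample rows are hypothetical, and in DuckDB the daily bucketing would more likely be written with date_trunc.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, placed_at TEXT)")
conn.execute("CREATE TABLE checkout_logs (ts TEXT, level TEXT, code TEXT)")

# Hypothetical warehouse rows.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2024-05-01T10:00:00"),
        (2, "2024-05-01T12:30:00"),
        (3, "2024-05-02T09:00:00"),
    ],
)

# Hypothetical JSON-lines log events, loaded into the same session.
log_lines = [
    '{"ts": "2024-05-01T09:15:00", "level": "ERROR", "code": "GATEWAY_TIMEOUT"}',
    '{"ts": "2024-05-01T09:40:00", "level": "INFO"}',
    '{"ts": "2024-05-02T11:05:00", "level": "ERROR", "code": "GATEWAY_TIMEOUT"}',
    '{"ts": "2024-05-02T11:30:00", "level": "ERROR", "code": "CARD_DECLINED"}',
    '{"ts": "2024-05-03T08:00:00", "level": "ERROR", "code": "GATEWAY_TIMEOUT"}',
]
conn.executemany(
    "INSERT INTO checkout_logs VALUES (?, ?, ?)",
    [(e["ts"], e["level"], e.get("code")) for e in map(json.loads, log_lines)],
)

# One CTE per source, each bucketed by day, then joined on the bucket.
QUERY = """
WITH orders_by_day AS (
    SELECT date(placed_at) AS day, COUNT(*) AS order_count
    FROM orders
    GROUP BY date(placed_at)
),
failures_by_day AS (
    SELECT date(ts) AS day, COUNT(*) AS failure_count
    FROM checkout_logs
    WHERE level = 'ERROR'
    GROUP BY date(ts)
),
days AS (
    SELECT day FROM orders_by_day
    UNION
    SELECT day FROM failures_by_day
)
SELECT d.day,
       COALESCE(o.order_count, 0)   AS order_count,
       COALESCE(f.failure_count, 0) AS failure_count
FROM days AS d
LEFT JOIN orders_by_day   AS o ON o.day = d.day
LEFT JOIN failures_by_day AS f ON f.day = d.day
ORDER BY d.day
"""

rows = conn.execute(QUERY).fetchall()
for day, order_count, failure_count in rows:
    print(day, order_count, failure_count)
```

The days CTE takes the union of both sets of buckets, so a day with failures but no orders still appears in the result with an order count of zero instead of silently dropping out.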

What this means for the analytics vendor lock-in story

The honest read is that this is the biggest blind spot the big data clouds have right now. Each built its AI agent on top of its own warehouse because that is its centre of gravity. Their inability to reach into the logs (which they don't own, don't ingest, and don't bill for) isn't a bug; it's a strategic choice not to invest in customer questions that don't drive their consumption.

The customer pays for that choice. The analyst still has to switch tools, still has to re-paste the question, still has to mentally reconcile two result sets. Multiply that by every meaningful business question and you get the dirty secret of "unified data": it isn't.

The first generation of data agents that take logs seriously won't have to be technically better than Cortex or Genie at their core task. They just have to refuse to stop at the warehouse wall.