Gone Fishing, Why The Data Lakehouse Offers A Smarter Catch

Data repositories don’t always have clear waters.

Adrian Bridgwater

Companies watch their data. It’s a fundamental activity that enables any business to function. This is often more formally referred to as data observability (rather than data watching), and modern digital organizations use a number of software services and tools in an attempt to gain a view into the data traversing the workflows and operational systems upon which they have established their business.

We have largely moved beyond using human eyes to watch data channels and look for outliers that could pose threats, trends that could direct a change in business policy or anomalies that might lead us to tweak the mix of fuel in the digital engine room – but there’s still a major challenge.

Blindness on the data estate

We are at a point in data generation where even if we have the technology to handle data ingestion, data storage and several degrees of data management, many organizations will still be blind to some parts of their data estate. Given the widespread proliferation of smart devices, edge computing endpoints (e.g. sensors, gauges and smaller embedded computing devices such as set-top boxes and even kiosk computers) feeding the Internet of Things (IoT) and the rise of more sophisticated smart autonomous machines, it’s safe to say that now is something of a tipping point in the world of data observability.

Data scientists like to talk about their approach to data observability within the context of the metrics, logs and traces that offer them an almost secret view into how our machines are running. But this view is getting cloudy, which makes it tougher to find the clues needed to run IT operations and security effectively.

Every time a user taps, clicks, or swipes on an app, or a developer releases a new code deployment or makes an architecture change in their container platform, it generates more observability data that needs to be captured and analyzed to understand what’s going on beneath the surface.

Sending data back to rehab

Now, there’s so much of that data it’s become impossible to store and use it all cost-effectively. Storage costs money – so the economics of being a hoarder just don’t add up for most organizations. That’s forcing them to be more selective about which data they keep. As a result, the vast majority of observability data is being tossed out, or it’s locked away in ‘just in case’ cheaper storage layers where it can’t be analyzed without lengthy and expensive retrieval, rehydration, and reindexing – data rehabilitation, if you will.

Nobody has the time or inclination to do that (‘lengthy and expensive retrieval processes’ doesn’t exactly scream ‘real-time actionable insights’), so most organizations just make do with whatever they have in their observability and log analytics tools at any one time. As a result, they only get pieces of the puzzle to base their decisions on, rather than a complete picture.

As well as being incomplete, most of the data organizations do keep is stored and analyzed in silos, using a host of different monitoring and analytics tools (one for a bit of infrastructure here, one for a bit of infrastructure there etc.), so it lacks the crucial ingredient that ties it all together: context. That means any answers organizations can hope to get from data analytics are often incomplete, imprecise, or even (dare we say it) downright wrong, which limits their value. If you automate processes on bad data, you can’t expect things to keep running smoothly.

If this all sounds like an insurmountable challenge, that’s because it is. But the reason we’re able to detail this data meltdown scenario is that the IT industry is usually pretty good at being introspective. So much so that it can look at itself and see how the operations layers it builds are performing, even if that means knowing there are murky waters beneath. The answer might be to move out of the weeds and into the lakehouse.

Software intelligence company Dynatrace thinks it can provide some (if not many, or perhaps all) of the answers here. The company has now come forward with a scalable data lakehouse solution known as Grail, which represents a new core technology of the Dynatrace Software Intelligence Platform. Dynatrace promises that Grail will revolutionize data analytics and data management by unifying observability and security data from cloud-native and multi-cloud environments, retaining its context and delivering instant, precise and cost-efficient AI-powered answers and automation.

What is a data lakehouse?

First, there’s data, then there are databases and then there are data warehouses, the last of these being a facilitating technology for Business Intelligence (BI). A real-world warehouse is a place of designated functional work by operatives where many items are brought together from different suppliers and places. Similarly, a data warehouse is a central repository of integrated data from more than one source (usually transactional systems, different databases and these days also machine logs) where we can use data mining tools to perform analysis and reporting functions. Not all data ends up in the database or the data warehouse; some of it ends up getting dumped in the data lake, a low-cost, uncharted storage space typically populated with unstructured data. Following all that logic then, a data lakehouse combines the structure, management and querying capabilities of a data warehouse with the low-cost benefits of a data lake.

So why has Dynatrace placed such faith in the scalable data lakehouse concept for Grail and how does it work?

Grail uses a Massively Parallel Processing (MPP) analytics engine, which enables organizations to run thousands of queries simultaneously rather than processing them sequentially. We can think of this a little like panning for gold, but with a multi-tool panning system capable of spanning every inch of the lake at the same time.
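For readers who like to see the idea in code, here is a minimal Python sketch of the scatter-and-gather pattern that parallel analytics engines are generally built around. It is not Grail's implementation: the log records, the partitioning and the function names (count_errors, parallel_error_counts) are all hypothetical, and a thread pool on one machine only stands in for what a real MPP engine would spread across many nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log records, split into partitions the way a parallel
# engine might shard data across workers. Not real Grail data.
PARTITIONS = [
    [{"service": "checkout", "level": "ERROR"}, {"service": "cart", "level": "INFO"}],
    [{"service": "checkout", "level": "INFO"}, {"service": "search", "level": "ERROR"}],
    [{"service": "cart", "level": "ERROR"}, {"service": "checkout", "level": "ERROR"}],
]


def count_errors(partition):
    """Scan one partition and count ERROR records per service."""
    counts = {}
    for record in partition:
        if record["level"] == "ERROR":
            counts[record["service"]] = counts.get(record["service"], 0) + 1
    return counts


def parallel_error_counts(partitions):
    """Scatter the scan across workers, then gather and merge the partial counts."""
    merged = {}
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        for partial in pool.map(count_errors, partitions):
            for service, n in partial.items():
                merged[service] = merged.get(service, 0) + n
    return merged


result = parallel_error_counts(PARTITIONS)
print(result)
```

Each worker scans only its own slice of the lake, and the partial answers are merged at the end, which is the same shape as the multi-tool gold pan described above.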

“Sprawling and dynamic cloud-native and multi-cloud environments are an ecosystem of various technologies and services and the composition changes by the second,” said IDC analyst Stephen Elliot. “This paradigm makes it critical for organizations to acquire a platform with advanced AI, analytics and automation capabilities. The platform must be able to ingest all observability, business, and security data, put it in an accurate context in real-time, and facilitate access to data-backed insights when needed.”

What is causal AI?

Grail is actually a ‘causational’ data lakehouse that uses Dynatrace’s own AI engine Davis. Yes, more terminology, but worth understanding now that AI is a part of so many of the apps that we all touch every day. According to the Stanford Social Innovation Review, “Causal AI identifies the underlying web of causes of a behavior or event and furnishes critical insights that predictive models fail to provide.”

Further, causal inference is the process of determining the independent, actual effect of a particular thing, when that thing (or event) is a component of a larger system.
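To make that definition a little more concrete, the following Python sketch compares a naive comparison with one that adjusts for a confounding factor by stratifying on it. This is a textbook illustration, not a description of how Davis works: the request counts are invented, and the scenario (a new release that happens to be deployed mostly during high traffic) is an assumption made purely for the example.

```python
# Hypothetical request counts: (load, new_release, failures, total).
# The new release mostly runs under high load, which confounds the comparison.
OBSERVATIONS = [
    ("low", False, 10, 100),
    ("low", True, 2, 20),
    ("high", False, 6, 20),
    ("high", True, 30, 100),
]


def failure_rate(rows):
    """Failures divided by total requests across the given rows."""
    return sum(r[2] for r in rows) / sum(r[3] for r in rows)


def naive_effect(rows):
    """Difference in failure rate, new release minus old, ignoring load."""
    treated = [r for r in rows if r[1]]
    control = [r for r in rows if not r[1]]
    return failure_rate(treated) - failure_rate(control)


def adjusted_effect(rows):
    """Compare like with like inside each load level, then weight by traffic share."""
    grand_total = sum(r[3] for r in rows)
    effect = 0.0
    for load in sorted({r[0] for r in rows}):
        stratum = [r for r in rows if r[0] == load]
        weight = sum(r[3] for r in stratum) / grand_total
        effect += weight * naive_effect(stratum)
    return effect


naive = naive_effect(OBSERVATIONS)
adjusted = adjusted_effect(OBSERVATIONS)
print(f"naive: {naive:.3f}, adjusted: {adjusted:.3f}")
```

On these made-up numbers, the naive comparison blames the new release for roughly 13 percentage points more failures, while the adjusted figure is zero: within each load level the two releases fail at the same rate, so the apparent effect comes from traffic, not from the release.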

“Organizations are painfully in need of a revolutionary take on observability and security data analytics that transcends speed and scale limits by 100X while relieving existing cost constraints for managing cloud-native and multi-cloud environments,” said Bernd Greifeneder, founder and chief technical officer at Dynatrace. “Grail delivers by boosting the Dynatrace-approach to causational AI, which retains data-context with precision and at a massive scale. Starting with logs, Grail makes it possible for teams to leverage instant analytics for any query or question, cost-effectively.”

Gone fishing (for data)

Using the parallel in our story title here (gone fishing), we can think of the standard real-world fishing trip as a fairly random and unknown process, however much skill, equipment and technique the angler possesses. From the relative comfort of the lakehouse though, there’s proximity, shelter and resources, plus there might even be a fridge full of cold beers.

Fishing for data insights in the web-scale cloud-centric world of the total information universe is also a more random process, somewhat like taking to the open high seas. Working from the causational data lakehouse, we can enjoy structure, management and a far more intelligent ability to query the waters in front of us. It’s a smarter catch for sure; there’s just less chance of a cold beer at the end of it.