Navigating The Stream Between Data At Rest And Data In Motion

Technology is now a stack. You don’t need to look far to find evidence of this idea: any number of technology blogs, websites and newsletters use this word to describe the combined, layered and interwoven nature of the technology components that we engineer into the IT stack that any business now needs to operate.

If we accept that the IT function now runs on a multi-tiered stack of coalesced technologies, then what shape should it be and from what base materials should it be formed? Chet Kapoor, CEO of cloud database-as-a-service (DBaaS) company DataStax, says that the first consideration should always be data; he insists that we can’t form any type of functional stack unless we think about the needs of the data it serves.

Old fashioned data

With a key focus on real-time data, DataStax is not averse to referring to old-fashioned data, i.e. the now comparatively static way we used to gather information, filter it, store it and then, now and again, access it, deduplicate, normalize and parse it so that we could plug it into an analysis engine and attempt to get some insights out of it.

Today we live in a world where customers want instantaneous experiences and Kapoor says that this immediacy is being further accelerated by the growth of Artificial Intelligence (AI) and Machine Learning (ML), both of which are helping to increase the cadence of user expectations.

The question now comes down to what happens in the deeper recesses of the IT stack engine room i.e. the place where a lot of our corporate data often gets dumped to buy us time to think about when we can perform jobs with it. That place is the data lake.

“We need to move on from the data lake and start navigating the data river,” says Kapoor. “Data really is like water in that it’s everywhere, it comes in various stable states and forms, it is an essential part of our lifeblood and people want to consume it. But most of all, data is like water because it is constantly in motion and it needs to be ubiquitous.”

All of which is cute enough as an illustrative analogy, but we have to realize that data is a whole lot more unwieldy than water; it is unstructured, variegated and certainly doesn’t all fit in the same size cup. To really stretch the analogy, getting data to where it needs to be often involves complex Extract, Transform & Load (ETL) processes… and this is not just a question of turning on a tap.

Databases, in too many flavours

Kapoor reflects on his many years working in the uber-distributed world of middleware and admits that the database world is highly fragmented. It’s almost as if there is a database for every use case, every analytics function, every process style, every software license option and every data type. But at the end of the day, he says, customers just want to analyze data in real time and be able to work with data at rest and data in motion more fluidly.

Will this be a wholesale move to this new cadence that sees us never look back? Kapoor says yes, no, maybe and it depends. There will still be a valid case for batch data processing, nightly builds and the old way of doing things.

But those old fashioned (there’s that term again) scenarios will happen according to a more prescribed menu of data urgency i.e. where jobs are related to historical data, where workloads are classified as not necessarily mission critical and – perhaps most significantly in the context of this discussion – where the data being handled does not form any functional part of a real time application where users (customers) demand immediacy.

Data at rest & in motion

But data at rest is only part of the story. Data has to be actionable for a wide range of consumers (by which we mean users, machines, API connections and any other entity that exists across the fabric of the cloud), so that means we need to be able to access real-time data in motion as well.

How we do this involves some complex technology, but we can explain it in boardroom terms. Getting value from all available data involves capturing events and data points from customers, processes, or machines – as well as data stored in a database – to fuel that real-time application experience.

These applications can then deliver what the business needs in the moment – from an enhanced in-the-moment customer experience, to an automated operational process or the instant insight required by a business user. A specific example is DataStax’s version of Change Data Capture (CDC), a systems developer technology practice designed to enable us to apply database analytics functions to moving, living, live-changing data.

While our source database data will always exist and can continue to have changes applied to it which ‘persist’ and so effectively become part of the data at rest side of this argument, we can also create a secondary target area of data. This secondary element of data is created as an ‘event’ that is written to a messaging platform so that we can take actions on it, almost as if it were at rest, but in the knowledge that it is part of our live data stream.
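The idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DataStax’s CDC implementation: the names SourceTable, ChangeEvent and drain are hypothetical, a plain dictionary stands in for the source database, and an in-memory queue stands in for the messaging platform.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical shape, for illustration only)."""
    op: str                  # "insert", "update" or "delete"
    key: str
    before: Optional[Any]    # value before the change, if any
    after: Optional[Any]     # value after the change, if any


@dataclass
class SourceTable:
    """Stand-in for the source database: every write persists and emits an event."""
    rows: Dict[str, Any] = field(default_factory=dict)          # data at rest
    log: Deque[ChangeEvent] = field(default_factory=deque)      # stand-in for the messaging platform

    def upsert(self, key: str, value: Any) -> None:
        before = self.rows.get(key)
        op = "update" if key in self.rows else "insert"
        self.rows[key] = value                                   # the change persists at rest
        self.log.append(ChangeEvent(op, key, before, value))     # and is also captured as an event

    def delete(self, key: str) -> None:
        if key in self.rows:
            before = self.rows.pop(key)
            self.log.append(ChangeEvent("delete", key, before, None))


def drain(table: SourceTable) -> List[ChangeEvent]:
    """Consume all pending change events in the order they were written."""
    events: List[ChangeEvent] = []
    while table.log:
        events.append(table.log.popleft())
    return events
```

A consumer that calls drain() sees each insert, update and delete in order and can act on it, while the rows dictionary continues to hold only the latest persisted state.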

DataStax does this using its Astra Streaming technology built on open source Apache Pulsar, a cloud-native, multi-tenant, high-performance solution for server-to-server messaging and queuing built on the publish-subscribe (pub-sub) pattern. Astra Streaming is integrated with Astra DB, the company’s managed database service built on Apache Cassandra, to connect the data in Astra DB to other data systems in real time.
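For readers unfamiliar with the pub-sub pattern mentioned above, here is a minimal sketch of the idea. It is an in-memory toy, not the Apache Pulsar or Astra Streaming API: the Broker class and its subscribe and publish methods are hypothetical names chosen for illustration, and a real broker would add persistence, acknowledgements and network transport.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class Broker:
    """Hypothetical in-memory broker illustrating publish-subscribe."""

    def __init__(self) -> None:
        # Maps each topic name to the handlers subscribed to it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

The point of the pattern is that the publisher does not know who is listening: an analytics service and an alerting service can both subscribe to the same topic, and each receives every message without the publisher changing.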

In terms of his style of innovation, Kapoor took over from previous CEO Billy Bosworth in 2019 with a somewhat different approach. Driving the company with what is an arguably more engineering-led developer-first groove, Kapoor has continued to champion DataStax’s open source credentials in relation to its core remit to provide commercial support to the Apache Cassandra database.

“Of course we are active contributors to Apache Cassandra, but I view open source as a bazaar – it’s an open market of exchange and barter. But you know, you can’t do everything out in public in the bazaar, so some of what DataStax has created sits at a different tier of the stack. This is part of why we built DataStax Serverless, to make the already eminently scalable Cassandra database truly elastic for all users.”

The ‘opinionated’ open data stack

All of this leads us to what CEO Kapoor likes to call an ‘opinionated’ open data stack.

This is a universe where an organization’s IT stack is comprised of data, obviously, but that data includes both data at rest and real-time data in motion. It is an open data stack (not as in pure open source or freely exchanged public data marketplace openness), meaning data can be securely accessed (and input) by the stakeholders that need it.

But most of all, it is opinionated, i.e. it is a data stack built with best-of-breed components according to the opinion of the users who need it. This sits in contrast to the ‘database for every use case’ bloating that we started with. This is DataStax’s suggestion that you (dear user) only need one database if you have the ability to exert an opinion on how it functions.

With some companies, you need to ask about the history behind the corporate name. We don’t really have that problem with DataStax, do we?

