
The Inference Difference: Why Clunky Data Engineering Unhinges AI

AI has a shiny front end. As everyone who’s used an artificial intelligence service knows, we can now ask for pictures of ourselves standing on top of the Rocky Mountains, we can have ourselves superimposed onto a classic Beatles album cover… and we can ask for a text-based interpretation and summary of a 147-page document to be delivered in 147 words.

Impressive as the end results are, it’s the work that goes on at the backend, rather like in any greasy engine room, that really determines the intelligence the user experiences. As we start to push agentic AI services into areas beyond summarizing notes and putting us on mountains, their use in mission-critical (and life-critical) scenarios means they will need to reason through inference, adapt and act in real time.

Permissioned Data Perfection?

“We know that enterprises today want to be able to train AI models for specific use cases, but they can’t truly benefit from agentic AI without the inference process being able to draw on fluid, well-governed and permissioned data supply. It’s a question of inference engines being able to execute the crucial process of pointing trained models to brand new data to draw conclusions from… and right now, inference is hitting a bottleneck,” said Stuart Abbott, managing director for UK and Ireland at Vast Data.

Vast is known for its work in core data management capabilities and deep learning computing infrastructures. Abbott and the Vast team suggest that “few organizations have the architecture in place” to support the AI inference requirements of tomorrow (and by tomorrow, he means this week) at speed, at scale and with a commensurate degree of trust.

RAG Retrieval Evils

Because retrieval-augmented generation is what most agentic AI systems are built on, it is (perhaps paradoxically) the “retrieval” part of RAG that represents the problem beneath the surface. More acutely informed than the generalist (although massive) world of large language models and their small language model counterparts, RAG enables enterprises to draw upon additional external, internal or domain-specific information sources in pursuit of a smarter and more sharp-angled end result.

That’s a lot of extra data, so for practicality’s sake, rather than storing all knowledge within the model itself, these systems fetch data from enterprise sources as needed. Each request depends on retrieving relevant context at the time of the prompt, which places increased pressure on the data infrastructure.

“This makes perfect sense from a design point of view. It’s what allows agents to be flexible, current and scalable. But it also introduces a huge dependency on the underlying infrastructure,” argued Abbott, speaking at a data symposium this week. “Imagine asking an AI service to summarize a contract, but the system has to search a document store, index it, apply permissions and return results… all before it can start writing. Now multiply that by every task an agent does across a business.”
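To make Abbott’s contract example concrete, here is a minimal sketch of the retrieval step that has to finish before a model can start writing. Everything in it is an assumption for illustration: the `Document` class, the `retrieve` function, the in-memory `STORE` and the keyword-overlap scoring are hypothetical stand-ins, not any vendor’s API. A real system would typically use a search index or vector database, but the shape of the work (search, apply permissions, return results, all on the clock) is the same.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset  # groups permitted to read this document


def retrieve(query, store, user_groups, top_k=2):
    """Filter by permission, score by keyword overlap, return (docs, seconds)."""
    start = time.perf_counter()
    terms = set(query.lower().split())
    scored = []
    for doc in store:
        if not (doc.allowed_groups & user_groups):
            continue  # caller may not see this document, so skip it entirely
        overlap = len(terms & set(doc.text.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    docs = [doc for _, doc in scored[:top_k]]
    return docs, time.perf_counter() - start


# Hypothetical in-memory "document store", for illustration only.
STORE = [
    Document("contract-001",
             "supplier contract renewal terms and termination clause",
             frozenset({"legal"})),
    Document("memo-007",
             "internal memo on contract pricing",
             frozenset({"finance"})),
    Document("faq-003",
             "public faq on office opening hours",
             frozenset({"legal", "finance", "staff"})),
]

docs, seconds = retrieve("summarize the contract termination clause",
                         STORE, frozenset({"legal"}))
print([d.doc_id for d in docs], f"retrieval took {seconds:.6f}s")
```

Whatever time `retrieve` takes is added to every answer the agent gives, which is why the permission check and the search sit on the critical path rather than off to one side.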

What Slows AI Down

When the “request-response loop” between an AI service and the data stores and repositories that serve it takes far longer than expected, the AI fails to deliver. The illusion of real-time AI fades when agents are held up by clunky storage, outdated indexes or disconnected access policies. Abbott says latency in data service ultimately becomes latency in judgment.

“The inference layer is often seen as super fast, especially with the rise of graphical processing units and model acceleration techniques. But the real bottleneck is often elsewhere in the total IT stack,” said Abbott. “Inference isn’t just about how fast any given AI model runs, it’s about how quickly it can get the right, permissioned data to the model and then get an answer back.”

According to Vast, even with state-of-the-art models, traditional infrastructure can result in time-to-first-token delays of up to 11 seconds (time-to-first-token is a metric that expresses how long a language model takes to produce its first output token after receiving a query or prompt). But by adopting persistent key-value caching (a process where the attention keys and values already computed for earlier tokens are stored and reused rather than recomputed) through technologies such as vLLM (an open source inference and serving engine for language models), LMCache (a serving engine extension for language models designed to reduce time-to-first-token and increase throughput in long-context scenarios) and Nvidia GPUDirect Storage (technology that enables a direct data path for direct memory access transfers between a GPU’s memory and storage), Abbott suggests that enterprises can cut this delay down to 1.5 seconds, sometimes less.
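The idea behind key-value caching can be shown with a toy model. The sketch below is not the vLLM or LMCache API; the `PrefixCache` class, the `prefill` function and the block size are all hypothetical, and token IDs are just integers. It assumes a simple block-based scheme in which work done on a previously seen prompt prefix can be reused, so only the new tokens need to be processed.

```python
class PrefixCache:
    """Toy stand-in for a persistent KV cache keyed by prompt prefix blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self._blocks = set()

    def _block_keys(self, tokens):
        # One key per full block; each key covers the whole prefix up to
        # the end of that block, so a hit implies all earlier blocks match.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def cached_tokens(self, tokens):
        """Number of leading tokens whose results are already cached."""
        reused = 0
        for key in self._block_keys(tokens):
            if key not in self._blocks:
                break
            reused = len(key)
        return reused

    def store(self, tokens):
        self._blocks.update(self._block_keys(tokens))


def prefill(tokens, cache):
    """Return how many tokens must actually be computed for this prompt."""
    reused = cache.cached_tokens(tokens)
    cache.store(tokens)
    return len(tokens) - reused


cache = PrefixCache(block_size=4)
contract = list(range(12))          # stand-in for a tokenized document
first = prefill(contract + [100, 101], cache)
second = prefill(contract + [200, 201, 202], cache)
print(first, second)  # 14 3
```

The second question about the same contract only pays for its own three tokens, because the twelve document tokens were cached by the first request. That reuse is the kind of saving that shows up as a shorter time-to-first-token.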

All very technical, yes. Although you might need to be a senior data science engineer to instantiate and operate those functions, anyone can understand the point of this technology: if we are able to be smarter about how data is stored, spliced, sorted, sieved and served at the backend, then we can make it more useful at the front end as it moves into its role in AI.

Chunky (Not Clunky) Prefill

“This [above set of data mechanics] is not simply a performance tweak. For agentic systems that revisit the same source material, caching reduces repeated prefill processes and helps deliver faster, more responsive interactions,” explained Vast’s Abbott. “Pair this with continuous batching, chunked prefill and disaggregated decode (more deep tech I admit, but hopefully the context is clear by now) and the gap between enterprise data and ‘agentic thought’ starts to close. But only if infrastructure allows for it.”
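For readers who want a feel for chunked prefill, the following is a simplified sketch under stated assumptions rather than any serving engine’s actual scheduler. The `plan_steps` function and its parameters are hypothetical. It assumes each processing step has a fixed token budget, that requests already generating output each need one token per step, and that whatever budget is left goes to a slice of the new, long prompt.

```python
def plan_steps(prompt_tokens, active_decodes, token_budget):
    """Split one long prefill into chunks that share each step with decodes.

    prompt_tokens:  length of the new prompt still to be prefilled
    active_decodes: requests already generating, one token each per step
    token_budget:   maximum tokens processed in a single step
    """
    if token_budget <= active_decodes:
        raise ValueError("token_budget must exceed active_decodes "
                         "so that prefill can make progress")
    steps = []
    remaining = prompt_tokens
    while remaining > 0:
        # Decodes are served first; the leftover budget goes to prefill.
        chunk = min(remaining, token_budget - active_decodes)
        steps.append({"decode_tokens": active_decodes,
                      "prefill_tokens": chunk})
        remaining -= chunk
    return steps


for step in plan_steps(prompt_tokens=10, active_decodes=2, token_budget=6):
    print(step)
```

With these numbers the ten-token prompt is prefilled in chunks of four, four and two, while the two conversations already in flight keep receiving a token every step instead of stalling behind the newcomer.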

Today, many organizations still rely on traditional extract, transform and load data pipelines, separate vector databases and batch-based access controls. The Vast team argue that these systems weren’t designed for agentic workloads; they weren’t built for real-time semantic search, for identity-bound access or for multi-turn inference at scale.

“The smartest AI model in the world can’t help you if it’s stuck waiting on glued-up code and batch data processing jobs,” Abbott noted. “That’s why enterprises are starting to re-evaluate how and where inference happens. The move is toward AI-native infrastructure. A combination of storage and software platforms where data, permissions, compute, and AI workflows live together, not in silos. The future of enterprise AI isn’t just about smarter agents. If we’re asking agents to help guide decisions, they can’t be working from memory… they need to see what we see, when we see it, securely.”

Competitive Analysis, Data Storage Infrastructure

Vast Data makes much of its approach to building its unified architecture, which coalesces storage intelligence into a centrally managed space to deliver on deep learning computing infrastructures. But Vast is not alone in this market. Pure Storage is never far behind when enterprise organizations of any size line up a beauty parade of potential vendors to engage with.

Also eating from the same table is NetApp. With the company’s heritage in storage intelligence, the NetApp ONTAP Data Platform for AI goes very much head-to-head in this market. HPE has an Nvidia partnership hinged around providing data services for AI, Dell Technologies has its PowerScale, which eats up unstructured data (a favourite meal for language models) for breakfast… and then there are the cloud hyperscalers.

All of that said, DataDirect Networks, IBM Storage Scale, Weka (all storage vendors, basically) and so on are all now positioning themselves as AI-first and AI-friendly, i.e. it’s the default message that the entire technology industry obviously has to pay lip service to and resonate with.

With cloud-native machine learning platforms at the heart of some of their most progressive deployments, Microsoft Azure, Google Cloud and Amazon Web Services all represent obvious (perhaps more expensive) options for businesses looking to sharpen up their AI data backbones. Vast execs would probably prefer to be compared to the cloud hyperscalers (well, they would, wouldn’t they?) as opposed to the above-noted storage specialists, but that’s an argument to chew out in a well-stocked bar, or at least a technology conference break-out session.

What might matter most in terms of competitive analysis in this field is whether, under the hood, any given data storage specialist essentially remains a file system with performance enhancements, rather than being a platform aimed at integrating full AI workflows. Vast aims to set itself apart from that tier of data engineering by saying that its approach with its Insight Engine service is to build a native vector capability and structured data layer with additional capabilities designed to enable policy-aware, real-time inference inside the storage layer.

After The AI Enthusiasm

There’s an additional challenge around data sovereignty to bear in mind here. Agentic systems will need to enforce permissions dynamically, explain how and why they reached decisions and prove that data access was compliant every single time.
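One way to picture “proving” compliant access is a tamper-evident audit trail. The sketch below is an illustration only and does not describe how Vast or any other vendor does it. The `AuditLog` class and its field names are hypothetical, and it assumes a simple hash chain in which each entry’s SHA-256 digest covers the previous entry’s digest, so editing an old record breaks every check after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class AuditLog:
    """Append-only record of access decisions, linked by hashes."""

    FIELDS = ("user", "doc_id", "allowed", "prev")

    def __init__(self):
        self.entries = []

    def record(self, user, doc_id, allowed):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        body = {"user": user, "doc_id": doc_id, "allowed": allowed, "prev": prev}
        self.entries.append({**body, "hash": _digest(body)})

    def verify(self):
        """Recompute every digest and check that the chain links hold."""
        prev = GENESIS
        for entry in self.entries:
            body = {name: entry[name] for name in self.FIELDS}
            if entry["prev"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


log = AuditLog()
log.record("agent-7", "contract-001", allowed=False)
log.record("agent-7", "faq-003", allowed=True)
print(log.verify())  # True

log.entries[0]["allowed"] = True  # simulate someone rewriting history
print(log.verify())  # False
```

A production system would also need timestamps, secure storage and an explanation of why each decision was made, but the principle of recording every access decision in a form that can be checked later is the same.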

Abbott’s parting words on this subject were that this calls for more than just AI enthusiasm; it calls for AI infrastructure maturity.
