Why WEKA’s Nvidia DGX SuperPOD Certification Matters

One of the most significant takeaways from Nvidia’s recent GTC event is that AI changes our thinking about infrastructure. Servers are built differently, racks of power-hungry GPUs require new cooling methods, networking is reaching nearly unimaginable speeds, and keeping clusters of AI computing resources fed with data disrupts traditional approaches to storage.

While every major storage vendor had a presence at GTC, those bringing novel solutions to the data path challenges inherent in scalable AI grabbed the most attention. VAST Data, WEKA, and even Hammerspace were showing off the art of the possible in highly scalable data solutions.

Rethinking Storage for AI

Most traditional storage vendors can deliver what’s needed for nearly any modest enterprise AI application. Pure Storage, NetApp, VAST, and Dell all deliver Nvidia-validated products targeted at this market. Whomever you’re currently buying your storage from can likely deliver a solution that works for you for most applications.

A different class of challenge arises, however, when providing storage for AI training clusters that can scale to hundreds or even thousands of nodes. This environment is more akin to supercomputing than what’s traditionally found in an enterprise data center. Whether the environment is a GPU-cloud provider or a generative AI training cluster, a stall in the data path can have a dramatic impact.

Traditional storage solutions are not always adept at handling the extensive bandwidth provided by modern networks, which now approach 800 Gb/s, or at managing the small files common in AI workflows. This can hamper AI development.

Data movement is often a performance-impacting bottleneck when scaling models across nodes. Likewise, the speed at which a training cluster can save and restore checkpoint data can also gate the system’s overall performance.
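To make the checkpoint point concrete, here is a rough back-of-the-envelope sketch. The checkpoint size, write bandwidths, and checkpoint interval below are hypothetical numbers chosen only for illustration, and the model assumes the cluster sits idle while a checkpoint is written, which is not true of every training setup.

```python
def checkpoint_stall_seconds(checkpoint_bytes: float,
                             write_bandwidth_bytes_per_s: float) -> float:
    """Time to write one checkpoint at a sustained write bandwidth."""
    if write_bandwidth_bytes_per_s <= 0:
        raise ValueError("write bandwidth must be positive")
    return checkpoint_bytes / write_bandwidth_bytes_per_s


def stalled_fraction(stall_s: float, interval_s: float) -> float:
    """Share of wall-clock time spent checkpointing, assuming training
    pauses for the full duration of each checkpoint write."""
    return stall_s / (interval_s + stall_s)


# Hypothetical inputs: a 1 TB checkpoint taken after every 30 minutes of training.
CHECKPOINT_BYTES = 1e12
INTERVAL_S = 30 * 60

for label, bandwidth in [("10 GB/s", 10e9), ("100 GB/s", 100e9)]:
    stall = checkpoint_stall_seconds(CHECKPOINT_BYTES, bandwidth)
    share = stalled_fraction(stall, INTERVAL_S)
    print(f"{label}: {stall:.0f} s per checkpoint, {share:.1%} of wall-clock time")
```

Under these assumptions, a tenfold increase in sustained write bandwidth shrinks each pause from 100 seconds to 10 seconds, and the time the GPUs spend waiting falls accordingly.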

Scalable AI training storage solutions, like the ones offered by WEKA and integrated into systems like Nvidia’s DGX SuperPOD, are built from the ground up to address the intense demands of AI workloads. They provide ultra-high performance, can handle massive data inflows, and support the intensive read and write operations required by AI training and model checkpointing.

WEKA’s SuperPOD Certification

The Nvidia DGX SuperPOD and BasePOD are complete turnkey architectures for AI training. Nvidia’s rigorous testing and certification process provides performance guarantees for SuperPOD installations, which isn’t assured with large cluster deployments that don’t utilize core components from the SuperPOD or BasePOD architecture.

WEKA’s new WEKApod is a data platform appliance certified for the Nvidia DGX SuperPOD with Nvidia DGX H100 systems. The solution is engineered to integrate WEKA’s high-performance storage software with top-tier storage hardware for a seamless AI data management environment.

WEKApod delivers exceptional storage performance, capable of supporting up to 18,300,000 IOPS in its starting configuration. This high-speed storage capability is crucial for feeding Nvidia DGX SuperPOD’s compute nodes with data, ensuring that the GPUs are efficiently utilized without being bottlenecked by data access speeds.

Utilizing Nvidia ConnectX-7 network cards, WEKApod drives 400 Gb/s network connections using InfiniBand with the DGX SuperPOD. This advanced networking feature facilitates rapid data transfer rates between the storage and compute nodes, enhancing overall system performance and efficiency.

Starting with a one-petabyte configuration that includes eight storage nodes, WEKApod can scale up to hundreds of nodes to meet organizations’ growing data storage demands. This scalability ensures that enterprises can expand their data storage capacity as their AI projects grow in complexity and size.
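For a sense of scale, the quoted figures can be broken down with simple arithmetic. This sketch assumes the 18.3 million IOPS figure refers to the same eight-node, one-petabyte starting configuration, that load and capacity are spread evenly across nodes, and that one petabyte means 1,000 terabytes. The per-node numbers are therefore averages derived here, not vendor specifications.

```python
def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed in gigabits per second to gigabytes per second."""
    return gbit_per_s / 8


def per_node(total: float, nodes: int) -> float:
    """Average share of a cluster-wide total per node (assumes an even split)."""
    if nodes <= 0:
        raise ValueError("node count must be positive")
    return total / nodes


# Figures quoted in the article for the starting configuration.
LINK_GBIT_PER_S = 400        # ConnectX-7 InfiniBand link speed
TOTAL_IOPS = 18_300_000      # quoted IOPS
TOTAL_CAPACITY_TB = 1_000    # one petabyte, decimal units (assumption)
NODES = 8

print(f"Link speed: {gbit_to_gbyte_per_s(LINK_GBIT_PER_S):.0f} GB/s per 400 Gb/s link")
print(f"Average IOPS per node: {per_node(TOTAL_IOPS, NODES):,.0f}")
print(f"Average capacity per node: {per_node(TOTAL_CAPACITY_TB, NODES):.0f} TB")
```

The bits-versus-bytes conversion is worth keeping in mind when comparing network and storage numbers: a 400 Gb/s link carries at most 50 GB/s before protocol overhead.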

WEKApod enables organizations to reduce deployment time and complexity by providing a pre-configured environment tailored for AI applications. This optimized AI-native architecture ensures fast access to data, speeding up computational tasks and leading to quicker development of advanced AI solutions.

Analyst’s Take

The importance of storage in AI workloads cannot be overstated. WEKA’s claim to address the inefficiencies inherent in legacy storage systems with its AI-native architecture is a crucial differentiator for the company. Such innovations are necessary to fully leverage the potential of modern compute and network capabilities.

In the competitive AI infrastructure market, companies like Dell, DDN, Hitachi Vantara, HPE, NetApp, Pure Storage, and VAST Data are all aligning their offerings with Nvidia’s solutions. As AI becomes more pervasive across industries, suppliers that fail to offer solutions compatible with Nvidia’s hardware risk losing out on opportunities.

The reality is that most enterprises will never deploy an Nvidia DGX SuperPOD. These solutions are targeted at high-performance, highly scalable AI training workloads and are overkill in most environments. What’s important to note is that the software stack certified on WEKApod is the same one WEKA delivers across all of its solutions.

WEKA’s SuperPOD certification, along with its previously announced BasePOD certification, tells you that, no matter what, its software can keep up with the most demanding AI workloads you can throw at it; WEKA’s storage software is not a bottleneck. It also clearly demonstrates that WEKA understands how to manage data for AI at the highest rungs of the performance ladder.

Underscoring this, WEKA also announced results from testing its software on the SPECStorage 2020 benchmark suite. Its WEKA Data Platform consistently ranked at the top of multiple benchmarks, demonstrating its ability to handle diverse IO profiles, from read and write to metadata-intensive and IOPS-driven tasks, without any tuning changes. This included maintaining the top position in SPEC_ai_image on AWS, SPEC_vda on AWS, and multiple other blended benchmarks on AWS.

By enabling faster, more efficient AI data pipelines, WEKA is laying the groundwork for a new era in enterprise AI, where the speed of innovation is matched by the infrastructure’s ability to scale and provide the needed performance. This company has proven its worth in demanding GPU-cloud and hyperscale environments, offering new proof points with its Nvidia DGX SuperPOD certification.

Disclosure: Steve McDowell is an industry analyst, and NAND Research is an industry analyst firm that engages in, or has engaged in, research, analysis and advisory services with many technology companies, including those mentioned in this article. Mr. McDowell does not hold any equity positions with any company mentioned in this article.
