How Data Centers Use AI Workloads to Improve Cloud, LLM, and Inference Capabilities
Change is sweeping the data center industry as performance capabilities and the speed of service delivery continue to grow. At the heart of this change is AI, along with the capabilities and infrastructure needed to deliver it to customers. October 2024 alone brought a wave of AI-driven industry announcements that will help shape the future of the data center.
From infrastructure to storage design, the common denominator is a focus on AI and how AI-driven services are delivered to customers. Supporting AI applications has become the de facto mandate across the industry, and this AI expansion will shape the data center business for the foreseeable future. To meet the growing demand, cloud and data infrastructure providers are expanding their capabilities for future AI training and inference workloads. While this is not an exhaustive list, below are some recent announcements from Oracle, Nvidia, Cerebras, DigitalOcean, and Lightbits Labs, each offering distinct solutions, flexible infrastructure, and scalability for different AI applications.

Standardizing AI Infrastructure
To address the challenges of deploying large-scale AI clusters, the Open Compute Project (OCP) has launched its Open Systems for AI initiative, which promotes a collaborative, multi-vendor ecosystem aimed at developing standard infrastructure for AI data centers. Contributions from Nvidia and Meta, such as Nvidia’s MGX-based GB200-NVL72 platform and Meta’s Catalina AI rack architecture, are instrumental in promoting common standards for AI compute clusters, reducing costs and operational silos for data centers. Equipment vendors such as Vertiv have announced dedicated support for AI data center buildouts, and Nvidia has published reference architectures for deploying enterprise-class AI. These collaborations target key barriers such as power density, cooling, and specialized computing hardware, with liquid-cooled compute racks and trays that support efficient, high-density operations. By creating an interoperable, multi-vendor supply chain, OCP enables faster adoption and a lower barrier to entry for organizations looking to implement AI infrastructure, and reference architectures, from OCP and others, enable these deployments on shorter timeframes.

Scaling AI with Zettascale Superclusters
Oracle’s launch of the Oracle Cloud Infrastructure (OCI) Supercluster, in collaboration with Nvidia, represents a leap in scale and performance: the new zettascale OCI Supercluster supports up to 131,072 Blackwell GPUs and a peak performance of 2.4 zettaFLOPS, with the goal of providing efficient processing capabilities at that scale.
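As a back-of-the-envelope check on those headline figures (assuming “zettaFLOPS” means 10^21 floating-point operations per second, almost certainly a low-precision peak rate rather than FP64), the implied per-GPU throughput works out as follows:

```python
# Sanity check on the announced cluster figures.
# Assumption: "2.4 zettaFLOPS" = 2.4e21 FLOPS (low-precision peak).
total_flops = 2.4e21   # claimed peak performance of the cluster
gpu_count = 131_072    # Blackwell GPUs in the largest configuration

per_gpu_pflops = total_flops / gpu_count / 1e15
print(f"Implied peak per GPU: {per_gpu_pflops:.1f} PFLOPS")
# → Implied peak per GPU: 18.3 PFLOPS
```

Roughly 18 PFLOPS per GPU is only plausible at low precision, which suggests the zettascale figure refers to reduced-precision peak throughput rather than traditional HPC (FP64) performance.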
DigitalOcean’s new 1-Click Models feature aims to simplify the complex process of deploying AI and ML models in the cloud, enabling developers to stand up inference endpoints with minimal configuration. By eliminating the need for complex configuration and security settings, 1-Click Models democratize access to powerful AI models, with the goal of making them available to a broader audience. Integrated with Hugging Face Generative AI Services (HUGS), DigitalOcean’s 1-Click Models receive continuous updates and optimizations, ensuring users have access to the latest performance improvements in AI models.

Intelligent AI Solutions in the Cloud
Proving that AI infrastructure needs go far beyond AI/ML hardware performance, Lightbits Labs, a pioneer in NVMe over TCP storage, has partnered with Crusoe Energy Systems, the self-proclaimed “world’s favorite AI cloud,” to expand high-performance, climate-aware AI infrastructure with software-defined storage.
Crusoe data centers are powered by a combination of clean, untapped energy sources, reducing the environmental impact of AI workloads. Lightbits software-defined storage delivers high performance with low latency, ideal for AI workloads that require constant, high-speed access to data.
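Because Lightbits presents its storage over standard NVMe/TCP, a Linux client can attach a remote volume as an ordinary local block device using the stock nvme-cli tool. A minimal sketch of that attach step (the target address and subsystem NQN below are illustrative placeholders, not real Lightbits or Crusoe values):

```python
# Sketch: attaching an NVMe/TCP volume on a Linux client via nvme-cli.
# The target address and NQN are hypothetical placeholders.
import subprocess

def nvme_connect_cmd(traddr: str, nqn: str, trsvcid: int = 4420) -> list:
    """Build the nvme-cli command that attaches a remote NVMe/TCP subsystem."""
    return [
        "nvme", "connect",
        "-t", "tcp",         # transport type
        "-a", traddr,        # target IP address
        "-s", str(trsvcid),  # NVMe/TCP service port (4420 is the default)
        "-n", nqn,           # subsystem NQN to connect to
    ]

def attach_volume(traddr: str, nqn: str) -> None:
    # Requires root and the nvme-cli package; once connected, the volume
    # shows up as /dev/nvmeXnY and behaves like any local NVMe disk.
    subprocess.run(nvme_connect_cmd(traddr, nqn), check=True)
```

This is what makes the model attractive for AI workloads: the client sees a fast local disk, while the actual capacity lives in a disaggregated, software-defined pool that can scale independently of the GPU servers.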
Crusoe’s extensive use of Lightbits storage meets the needs of AI developers by providing flexible, scalable infrastructure with high availability and durability. The partnership enables Crusoe to offer its AI cloud users an optimized environment with storage that scales on demand, particularly for workloads like LLMs and generative AI.
Each of these solutions contributes to a stronger and more accessible AI ecosystem and addresses the challenges of scale, efficiency, and ease of use. These innovations pave the way for future developments by creating an infrastructure that will lead to the widespread adoption of AI technologies across various business sectors.