The rise of artificial intelligence is transforming the global data center industry, with enterprises facing critical decisions about how to build, deploy and optimize their AI infrastructure. With AI becoming the cornerstone of competitive advantage, enterprises must adopt new strategies to ensure they are not left behind.
Over the next decade, AI-driven computing will account for nearly 90% of all data center spending, fundamentally reshaping IT strategies, according to Dave Vellante, chief analyst of theCUBE Research. However, many organizations are struggling with challenges such as skills gaps, underutilized GPU clusters and the complexity of integrating AI into existing systems.

Penguin Solutions’ Trey Layton talks with theCUBE’s Dave Vellante about AI infrastructure solutions.
“We are witnessing the rise of a completely new computing era,” Vellante said. “Within the next decade, a trillion-dollar-plus data center business is poised for transformation, powered by what we refer to as extreme parallel computing, or as some prefer to call it, accelerated computing. While artificial intelligence is the primary accelerant, the effects ripple across the entire technology stack.”
During the “Mastering AI: The New Infrastructure Rules” event, Vellante spoke with Penguin Solutions Inc.’s Pete Manca (pictured), president of the AI infrastructure provider, and Trey Layton, vice president of software and product management, about how organizations can successfully navigate AI adoption. (* Disclosure below.)
1. Building AI infrastructure requires a fundamental rethink.
One of the biggest hurdles enterprises face when implementing AI is that traditional IT infrastructures are not designed for the demands of AI workloads. Many companies are experimenting with AI in the cloud but are hesitant to move their proprietary data there due to security and cost concerns. As a result, organizations are looking for ways to build AI infrastructure in-house, according to Manca.
“Traditional infrastructures are very different than AI infrastructures, and so, they have to rethink how they do IT,” he said. “Building in-house is probably the preferred way to go, but it means literally a soup-to-nuts transformation from data center build-out, power, cooling, all the way through the architecture of their system.”
Enterprises must consider key architectural decisions, such as whether to use liquid cooling, direct-to-chip cooling or traditional air cooling. They also need to decide on the best combination of chip vendors, storage solutions and networking technologies. With numerous options to consider, organizations often seek expert guidance from companies such as Penguin Solutions to design AI environments that can scale effectively, according to Manca.
“A lot of the technology you can pick and choose depending upon your use case,” he said. “The trick is designing it right up front. What you don’t want to do is put these pieces together.”
Unlike conventional IT setups, which may involve managing hundreds or thousands of servers in-house, AI workloads — particularly large language model training — demand highly sophisticated networking solutions and specialized hardware configurations. This includes direct-to-chip connections, GPU-to-GPU communication technologies such as NVLink and advanced optical networking to bypass traditional CPU bottlenecks. By partnering with experts in AI infrastructure, enterprises can avoid costly missteps and ensure their AI deployments are built for long-term success, Manca explained.
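To see why interconnect choice dominates these designs, consider a rough back-of-the-envelope model of gradient synchronization in distributed training. The sketch below is illustrative only — the bandwidth figures are assumed, ballpark values for NVLink-class and PCIe-class links, not vendor specifications — but it shows how the same all-reduce step can differ by an order of magnitude depending on the fabric:

```python
def allreduce_time_s(tensor_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Approximate time for a ring all-reduce: each GPU sends and
    receives roughly 2*(N-1)/N of the tensor over its link."""
    traffic_bytes = 2 * (n_gpus - 1) / n_gpus * tensor_bytes
    return traffic_bytes / (link_gb_per_s * 1e9)

GRAD_BYTES = 1e9  # 1 GB of gradients per training step (illustrative)

# Assumed, ballpark per-GPU bandwidths for comparison purposes only.
nvlink_t = allreduce_time_s(GRAD_BYTES, 8, 900.0)  # NVLink-class link
pcie_t = allreduce_time_s(GRAD_BYTES, 8, 64.0)     # PCIe Gen5 x16-class link

print(f"NVLink-class: {nvlink_t * 1e3:.1f} ms per step")
print(f"PCIe-class:   {pcie_t * 1e3:.1f} ms per step")
```

Even this crude model shows a roughly 14x gap per synchronization step, which compounds over millions of training iterations — the kind of architectural consequence Manca argues must be designed for up front.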
Here’s theCUBE’s complete video interview with Pete Manca:
2. AI infrastructure must be designed for peak performance and resilience.
Unlike traditional IT environments that focus on uptime and high availability, AI infrastructure must be optimized for maximum performance at all times. Enterprises often struggle with underutilized GPU clusters, leading to inefficiencies that drive up costs. To address this, organizations must implement intelligent compute environments that optimize workloads and minimize downtime, according to Layton.
“You’re talking about a massively scalable parallel processing infrastructure that’s designed to run at peak performance all the time — that’s different than what organizations of the past have built,” he said.
One of the biggest challenges in AI infrastructure is ensuring that GPU clusters operate efficiently. GPUs fail 33 times more often than general-purpose CPUs because they run at full capacity continuously, according to Layton. Organizations need predictive failure analysis tools that can proactively identify and address potential failures before they impact operations.
Penguin Solutions has developed software solutions, such as ICE ClusterWare AIM, to tackle these challenges. The service enhances AI and HPC infrastructure by leveraging over 2 billion hours of GPU runtime expertise, using patent-pending software to prevent failures, automate maintenance and optimize performance at any cluster size, Layton added.
“We’re actually monitoring for nominal variations in temperatures in the GPUs themselves,” he said. “We’re doing latency throughput testing on the InfiniBand fabric and any deviation outside of nominal parameters, we’ll begin to institute automation that will attempt to remediate that in software. If we can’t, then we remove that device from the production workload so it doesn’t actually result in production outages.”
By integrating AI-driven monitoring and remediation capabilities, enterprises can maintain high performance while reducing downtime, ensuring that their AI infrastructure operates as efficiently as possible, Layton added.
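The monitor-remediate-drain loop Layton describes can be sketched in a few lines. This is a minimal illustration, not Penguin Solutions’ implementation: the thresholds, the `try_remediate` stub and the function names are all assumptions, and a real agent would derive nominal baselines from fleet telemetry and call vendor tooling for remediation:

```python
import statistics

# Illustrative thresholds; real baselines would come from fleet telemetry
# (the "nominal parameters" Layton describes), not hard-coded constants.
NOMINAL_TEMP_C = 65.0
MAX_DEVIATION_C = 10.0

def try_remediate(gpu_id: int) -> bool:
    """Hypothetical software remediation hook (e.g., a clock reset).
    A real agent would invoke vendor tooling; this sketch assumes failure."""
    return False

def check_gpu(gpu_id: int, recent_temps: list[float], drained: set[int]) -> str:
    """Compare a GPU's recent mean temperature against its nominal baseline;
    remediate in software, or drain the device from production if that fails."""
    mean_t = statistics.mean(recent_temps)
    if abs(mean_t - NOMINAL_TEMP_C) <= MAX_DEVIATION_C:
        return "healthy"
    if try_remediate(gpu_id):
        return "remediated"
    drained.add(gpu_id)  # remove the device from the production workload
    return "drained"

drained: set[int] = set()
print(check_gpu(0, [64.0, 66.0, 65.5], drained))  # within nominal range
print(check_gpu(1, [82.0, 85.0, 88.0], drained))  # deviates, gets drained
```

The key design point is the fallback ordering: automated software remediation is attempted first, and only when it fails is the device pulled from the workload, so transient deviations never escalate into production outages.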
Here’s theCUBE’s complete video interview with Trey Layton:
3. Bridging the AI skills gap is essential for building sustainable AI environments.
One of the most significant obstacles to AI adoption is the lack of in-house expertise. AI infrastructure requires a unique skill set that blends traditional enterprise IT knowledge with HPC expertise. Many IT professionals are accustomed to managing virtualization and cloud environments but lack experience in designing high-performance AI clusters, Layton pointed out.
“The high-performance computing world needs to understand the problems of IT, and the IT world needs to understand the problems of high-performance computing,” he said. “And in that we get a convergence of those two skills and that will be the future artificial intelligence infrastructure engineer … one who gets both worlds.”
To address this challenge, organizations must invest in training and seek out AI-focused technology partners. Companies such as Penguin Solutions provide AI-optimized architecture models and modular infrastructure solutions that allow businesses to scale their AI environments while maintaining operational flexibility, Layton pointed out.
Future-proofing AI infrastructure is another critical consideration. Given the rapid advancements in AI hardware and software, companies need modular architectures that can adapt to new technologies. Designing for long-term scalability is crucial for sustainable growth and efficiency.
“The reality is that there’s a blistering pace of development with the underlying hardware that’s out there,” Layton said. “You need an underlying architecture that is deployed in an environment that can accommodate those changes and also find ways to utilize some of those technologies.”
By adopting a modular, adaptable approach and leveraging the expertise of AI infrastructure specialists, enterprises can ensure that their AI investments remain viable and competitive in the long term, Layton concluded.
Here’s theCUBE’s continuing conversation with Trey Layton:
Watch the complete event episode here:
Plus, find theCUBE’s complete video playlist here:
https://www.youtube.com/watch?v=videoseries
(* Disclosure: TheCUBE is a paid media partner for the “Mastering AI: The New Infrastructure Rules” event. Neither Penguin Solutions Inc., the sponsor of theCUBE’s event coverage, nor other sponsors have editorial control over content on theCUBE or SiliconANGLE.)
Photo: SiliconANGLE