According to a report from TrendForce, NVIDIA is considering a new slot design for at least some of its forthcoming Blackwell B300 GPUs, intended for artificial intelligence (AI) and high-performance computing (HPC) applications. This consideration arises from concerns about failure rates under high loads, motherboard replacement costs, and thermal management issues associated with AI GPUs. By opting for a slot design instead of soldering GPUs directly onto the motherboard, NVIDIA and other GPU designers could enhance maintenance and upgrade flexibility.
However, slot-based designs typically come with tighter power-delivery and thermal constraints, which is why the most power-hungry GPUs use BGA (Ball Grid Array) packaging. According to supply-chain checks by Chen Shuowen, an analyst at CLSA, NVIDIA has been designing GPU slots, potentially starting with the GB200 Ultra.
Chen also mentioned a four-way GPU design paired with a CPU motherboard, notably in the context of DGX servers, which currently combine eight GPUs with two CPU boards. While this configuration seems ambitious, it underscores NVIDIA's willingness to rethink its data center architecture.
NVIDIA's data center naming conventions separate its GPUs (such as A100, H100, B100/B200) from its Grace CPU + GPU platforms (GH200, GB200). Presently, both the CPU and GPU in the GB200 platform use BGA packaging; it remains uncertain whether this will change with the anticipated GB200 Ultra update later this year.
Standard CPU-style slots offer advantages in ease of maintenance and upgrades. However, they typically require more space and impose greater power-delivery and thermal constraints than BGA packaging or SXM/OAM modules. SXM/OAM modules are repairable, but the process varies with motherboard design, and the modules require careful handling, making them less user-friendly than slot-based designs.
Additionally, manufacturing challenges and high costs associated with add-in cards and SXM/OAM modules remain significant. Currently, most of NVIDIA's SXM modules are produced by Foxconn. Transitioning from board or module designs to slot configurations could lower costs, although this may result in performance limitations.
NVIDIA has officially launched the B200 GPU, which draws over 1000W and uses BGA packaging on the GB200 boards: Bianca, featuring one Grace CPU and two Blackwell GPUs, and Ariel, with one Grace CPU and one Blackwell GPU. NVIDIA also offers the Umbriel GPU board, which supports eight SXM modules in either B200 (1000W) or B100 (700W) form.
Reports also indicate the development of the Miranda platform, which aims to enhance performance with higher TDP (Thermal Design Power), PCIe 6.0 support, and 800G networking, alongside the Oberon platform based on the GB200 architecture.
While NVIDIA has introduced H100 and even H200 add-in cards based on the Hopper architecture that reduce performance to fit within conventional server power and thermal budgets, the company has yet to announce any add-in cards based on the Blackwell GPU architecture.
Furthermore, NVIDIA is reportedly preparing a product codenamed B200A, based on a single B102 processor and employing TSMC's CoWoS-S packaging technology to connect four HBM3E memory stacks. This design contrasts sharply with the dual-chip B100/B200 configuration, which uses TSMC's CoWoS-L packaging to connect eight HBM3E memory stacks.