As integrated circuit (IC) design firms push the boundaries of chip size to boost processing power, the demands on chip manufacturing technology have escalated. NVIDIA's new AI chip, Blackwell, is set to redefine industry standards. Described by CEO Jensen Huang as a "very, very large GPU," Blackwell is currently the largest GPU by area. The chip is built by joining two GPU dies and is fabricated on TSMC's advanced 4nm-class process, packing 208 billion transistors. This innovation, however, brings complex packaging challenges.
Taiwan, renowned for its cutting-edge chip manufacturing capabilities, is emerging as a pivotal center for AI chip production. The CoWoS-L (Chip-on-Wafer-on-Substrate) packaging employed for Blackwell, where the "-L" denotes the variant using local silicon interconnect (LSI) bridges together with a redistribution layer (RDL) to connect the dies, achieves die-to-die data transfer speeds of approximately 10 TB/s. The packaging process demands extreme precision, however: a single minor defect can render a $40,000 chip unusable, hurting yield rates and profitability.
Industry sources indicate that discrepancies in the coefficient of thermal expansion (CTE) among GPU dies, LSI bridges, RDL layers, and the main substrate can lead to chip warping and system failures. To address these issues and enhance yield rates, NVIDIA has redesigned the top metal layer and bumps of its GPU chips. This redesign effort extends beyond AI chips; the forthcoming RTX 50 series graphics cards will also undergo redesigns, potentially delaying their release. Reports suggest that NVIDIA's partners anticipate the GeForce RTX 5090 to be available in the fourth quarter of 2024.
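To see why CTE mismatch matters at this scale, a back-of-envelope calculation helps. The sketch below uses the linear thermal expansion formula ΔL = α·L·ΔT with typical textbook material values and an assumed package span and temperature swing; none of the numbers are NVIDIA or TSMC figures.

```python
# Illustrative CTE-mismatch estimate for a large package.
# Material CTEs are approximate textbook values, not vendor data.
ALPHA_SILICON = 2.6e-6     # CTE of silicon, 1/degC (approx.)
ALPHA_SUBSTRATE = 17e-6    # CTE of an organic substrate, 1/degC (approx.)

span_mm = 50.0             # assumed lateral span of the package, mm
delta_t = 100.0            # assumed temperature swing during assembly, degC

# Linear expansion: dL = alpha * L * dT
expansion_si = ALPHA_SILICON * span_mm * delta_t     # mm
expansion_sub = ALPHA_SUBSTRATE * span_mm * delta_t  # mm
mismatch_um = (expansion_sub - expansion_si) * 1000  # mm -> micrometres

print(f"Silicon expands   {expansion_si * 1000:.1f} um")
print(f"Substrate expands {expansion_sub * 1000:.1f} um")
print(f"Mismatch          {mismatch_um:.1f} um")
```

Even with these rough numbers the substrate grows tens of micrometres more than the silicon over the same span, which is far larger than typical micro-bump pitch tolerances, so the stack bows or the joints shear unless the layers are co-designed to absorb the strain.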
These design challenges are not unique to NVIDIA. The industry is witnessing a growing number of similar issues. Adapting chip designs to address defects or improve yield is a common practice. AMD CEO Lisa Su has highlighted that increasing chip sizes will inevitably complicate manufacturing processes. Future chip developments must achieve breakthroughs in performance and power efficiency to meet the growing demands of AI data centers.
Cerebras, a leader in developing large-scale AI chips, has noted that the complexity of multi-chip integration is rising exponentially. They emphasize their Wafer-Scale Engine (WSE) series, which represents a "wafer-scale processor" approach to AI computing. TSMC, for its part, is advancing wafer-level system integration with InFO-SoW (Integrated Fan-Out System-on-Wafer). Tesla's Dojo supercomputer training tile, built on InFO-SoW, is the first such solution to reach mass production.
In response to the trend toward larger chips and the need for more HBM in AI applications, TSMC plans to combine its InFO-SoW and SoIC technologies into CoW-SoW (Chip-on-Wafer on System-on-Wafer), which stacks memory or logic chips directly on a wafer, with mass production expected by 2027. The future is poised to showcase even more colossal AI chips assembled on entire wafers.