According to reports from Tom's Hardware and other outlets, Huawei's latest AI processor, the Ascend 910C, has reached 60% of the inference performance of NVIDIA's H100 GPU. While a significant gap remains against NVIDIA's latest Blackwell-series AI chips, the result marks a key step in reducing China's dependence on NVIDIA GPUs.
Testing by DeepSeek researchers shows that the Ascend 910C achieves 60% of the NVIDIA H100's inference performance when running DeepSeek models. Despite U.S. sanctions that cut off access to TSMC's advanced process nodes, Huawei's AI processors continue to progress rapidly.
The Ascend 910C utilizes a chiplet packaging design, integrating approximately 53 billion transistors. Unlike its predecessor, the Ascend 910, which was manufactured using TSMC's N7+ process, the Ascend 910C is fabricated using SMIC's second-generation 7nm process (N+2).
DeepSeek's software support gives Huawei's chips two key advantages. First, DeepSeek has supported Ascend processors from day one and maintains its own PyTorch repository, allowing CUDA code to be converted to CANN with a single line of code, which eases the integration of Huawei hardware into existing AI workflows. Second, there is substantial headroom for optimization: hand-tuning CANN kernels can push the Ascend 910C's performance higher still.
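As a rough illustration of what that one-line change can amount to (this is a toy helper, not DeepSeek's actual tooling; the "npu" device string follows the convention of Ascend's torch_npu adapter and is an assumption here), porting PyTorch code from CUDA to Ascend largely means retargeting the device:

```python
# Toy sketch (not DeepSeek's repository): in PyTorch, moving a model between
# backends is mostly a device-string change, e.g. "cuda:0" -> "npu:0" once an
# Ascend adapter such as torch_npu is loaded. This helper rewrites quoted CUDA
# device strings in a source snippet to their NPU equivalents.

def retarget_cuda_to_npu(source: str) -> str:
    """Rewrite quoted CUDA device strings to Ascend NPU device strings."""
    return source.replace('"cuda', '"npu').replace("'cuda", "'npu")

snippet = 'model.to(torch.device("cuda:0"))'
print(retarget_cuda_to_npu(snippet))  # model.to(torch.device("npu:0"))
```

The point of the sketch is only that the porting surface is small when the framework abstracts the backend; the heavy lifting happens inside CANN's kernels, not in user code.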
According to Huawei's official website, CANN (Compute Architecture for Neural Networks) is a heterogeneous computing architecture designed for AI scenarios: it supports a variety of AI frameworks on top while serving Ascend AI processors and programming underneath, playing a crucial role in improving the computational efficiency of Huawei's AI processors. CANN offers efficient, easy-to-use programming interfaces that let users quickly build AI applications and services on the Ascend platform.
CANN ships in two editions: a community edition, which gives developers early access to new features for testing, and a commercial edition, which meets commercial stability standards. The community edition has reached version 8.0.0.alpha003, featuring enhancements to Ascend C, while the commercial release, CANN 8.0.RC3, adds support for seven operating systems and an optimized installation process.
Yuchen Jin from DeepSeek noted, "The biggest challenge for Chinese chips is the stability of long-cycle training." This challenge arises from the deep integration of NVIDIA's hardware and software ecosystem, with CUDA having been developed over the past two decades. While inference performance can continue to improve, sustained training workloads will require Huawei to further enhance its hardware and software stack.
Experts predict that as AI models converge on the Transformer architecture (GPT, BERT, and the like), the importance of CUDA and of PyTorch's compiler stack will diminish. Moreover, DeepSeek's expertise in hardware-software co-optimization could significantly reduce reliance on NVIDIA's CUDA and thereby cut costs.
Earlier reports indicated that DeepSeek trained on NVIDIA's H800 chips using PTX (Parallel Thread Execution), NVIDIA's lower-level hardware instruction language, rather than the higher-level CUDA programming interface. This suggests that DeepSeek bypassed CUDA in favor of lower-level optimizations.
For developers, CUDA is a comparatively user-friendly, high-level language: programmers focus on logic and algorithms without worrying about how the program executes on the GPU, which reduces development complexity. PTX, by contrast, operates near the assembly level, allowing fine-grained optimizations such as register allocation and thread/warp scheduling. That style of programming is complex and hard to maintain, which is why high-level languages like CUDA remain the industry norm.
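To make the abstraction gap concrete, the sketch below simply prints one line of high-level CUDA C next to the kind of PTX instructions it typically compiles down to. The PTX shown is hand-written for a float addition to illustrate the style, not actual compiler output:

```python
# Illustrative comparison (hand-written, not real nvcc output): one high-level
# CUDA C statement versus the near-assembly PTX that performs the same
# element-wise float addition, with explicit loads, registers, and stores.

cuda_source = 'c[i] = a[i] + b[i];  // one high-level CUDA C statement'

ptx_equivalent = """\
ld.global.f32  %f1, [%rd1];   // load a[i] from global memory
ld.global.f32  %f2, [%rd2];   // load b[i] from global memory
add.f32        %f3, %f1, %f2; // add the two registers
st.global.f32  [%rd3], %f3;   // store the result to c[i]"""

print("CUDA level:\n" + cuda_source)
print("\nPTX level:\n" + ptx_equivalent)
```

At the PTX level the programmer deals with registers (%f1, %rd1, ...) and memory spaces explicitly, which is exactly the fine-grained control, and the maintenance burden, described above.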