China’s AI Chips Enter the Training Stack
LongCat shows China can now run frontier-scale training on domestic chips. It does not yet show China can do it efficiently.
For 2 years, the working assumption among investors and policymakers has been clear: Chinese AI chips could handle inference, the computationally lighter task of generating outputs from models that have already been built. Training, the far more demanding process of building a model in the first place, still required Nvidia.
That assumption now has a LongCat problem.
In late June, Meituan, China’s largest on-demand services platform, released LongCat-2.0. The company says the trillion-parameter model was pre-trained from scratch on a 50,000-chip cluster powered entirely by domestically produced processors. It did not name the chip supplier, but technical indicators, including its use of HCCL, Huawei’s collective communication library, strongly point to the Ascend ecosystem. The model ingested over 35 trillion tokens during pre-training, according to its model card, without a single rollback or loss spike. The company highlighted engineering reliability measures including deterministic operators, bit-level consistency checks, and automated failure recovery.
Separately, a research consortium including Huawei completed full-parameter post-training of DeepSeek-V4-Pro, a 1.6-trillion-parameter mixture-of-experts model from Chinese AI lab DeepSeek, on a cluster of over 1,000 Ascend 910C chips. The run finished more than 1,500 training iterations with zero interruptions, according to the Shenzhen Loop Area Institute, one of the participating research bodies.
These are not isolated experiments. Huawei’s own Pangu Ultra 135B, a dense model, was pre-trained on 8,192 Ascend NPUs using 13.2 trillion tokens. A larger mixture-of-experts variant, Pangu Ultra MoE 718B, was trained on roughly 6,000 NPUs and reported 30% model FLOPs utilization (MFU), the share of the hardware’s theoretical peak compute that a training run actually uses. Baidu, the search and AI group, says a key version of its ERNIE 5.1 model was trained on a cluster powered by its Kunlunxin chip unit, though it did not specify which stage of training was involved. Chinese industry reports link the effort to Baidu’s Kunlun P800 chip and cite a 97% effective training rate.
Each data point, read alone, is a progress report. Read together, they point toward a structural change in where domestic chips sit in the AI development pipeline: they have entered the training stack, and the pace of entry appears to be increasing. That a local-services company can now pre-train a model at frontier scale on domestic hardware suggests the capability is no longer confined to one or two national champions running showcase experiments.
The assumption under pressure
US export controls rest on a specific bottleneck theory. Inference requires less compute. Training requires far more. By restricting access to Nvidia’s highest-performance chips, the controls aimed to keep Chinese labs dependent on foreign hardware for the compute-heavy work of building models.
That theory is now under test. According to Morgan Stanley data compiled by the Financial Times, Chinese vendors’ AI chips already outperform the Nvidia H20, the most capable chip approved for export to China, on several metrics including processing performance and memory bandwidth. Huawei has begun marketing systems that pair its Ascend 950DT chips with DeepSeek’s V4 model for customers in the Middle East and Central Asia, according to people cited by the FT. Bernstein analysts described Huawei’s recent chip advances as another “DeepSeek moment.” The firm’s analyst Lin Qingyuan said the development “breaks the core narrative that because of export controls China’s semiconductor is dead at 7nm.”
But crossing the training threshold and closing the training efficiency gap are different questions. The first now has meaningful evidence behind it. The second remains open. Its resolution could shape the valuation ceiling for Chinese AI chip companies, the long-term revenue risk for Nvidia’s China business, and whether export controls function as an effective speed brake or merely an expensive detour.
The headlines read as if China has solved its training problem. The data tells a more layered story. Below, we map the evidence onto a 3-tier capability ladder, quantify the gap between post-training on 1,000 chips and pre-training on 50,000, and ask the question the efficiency data cannot yet answer.



