Unraveling China's AI Deception: What Remains Hidden Behind Open Source
In the rapidly evolving world of AI, China is making strides in the open source community. Developers are working diligently to clean Common Crawl's growing open source dataset and make it safer for wider use. This effort reflects a broader trend in China, where AI giants like Baidu, Alibaba, and Moonshot are embracing openness.
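Cleaning a web-scale text dump typically means filtering records against length and safety heuristics before reuse. The sketch below is purely illustrative: the blocklist, threshold, and function names are assumptions for demonstration, not any project's actual Common Crawl pipeline.

```python
import re

# Illustrative stand-in for a real list of unsafe or unwanted terms.
# Real pipelines use far larger blocklists plus classifiers and deduplication.
BLOCKLIST = {"badword1", "badword2"}

def is_clean(record: str, min_len: int = 20) -> bool:
    """Keep a text record only if it is long enough and blocklist-free."""
    text = record.strip()
    if len(text) < min_len:
        return False
    tokens = set(re.findall(r"\w+", text.lower()))
    return not (tokens & BLOCKLIST)

def clean_corpus(records):
    """Filter an iterable of raw text records down to the retained subset."""
    return [r for r in records if is_clean(r)]
```

In practice, such keyword filtering is only a first pass; production dataset-cleaning efforts layer on URL blocklists, perceptual hashing for known abusive media, and model-based quality scoring.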
Baidu, for instance, has encouraged adoption of its ERNIE 4.5 models, potentially spurring innovation and collaboration with developers. Alibaba, too, has joined the fray with its Qwen3 suite, launched in early 2025, open-sourcing weights alongside its proprietary models. DeepSeek, Kimi K2, and Qwen3 have been positioned as commodity models of choice, offering open weights to external developers.
However, the openness doesn't extend to all aspects. Fully open-source models that include both open weights and comprehensive release of training datasets remain rare in China. Chinese companies tend to balance transparency and control, often withholding detailed training data due to regulatory constraints and competitive considerations.
This nuanced approach to data sharing is evident in the cases of DeepSeek and Alibaba's Qwen3, which have gained notable momentum by offering openly accessible model weights to foster collaboration with developers worldwide. Yet neither fully shares its training data.
The push for open-source Large Language Models (LLMs) in China is influenced by wider goals of technological self-reliance, scaling adoption under hardware access constraints, and geopolitical dynamics. China advocates international AI collaboration and open sharing, especially toward developing nations, but this does not yet imply wholesale open data sharing from all Chinese LLM providers.
Many Chinese AI firms pursue dual strategies, developing both open and closed models. The shape and extent of data-sharing vary, with some favoring closed data and proprietary components alongside open weights to retain technical and strategic control.
The call for fully open source AI is a tall order, as it requires sharing seven components, including the training dataset. The AI community must choose between embracing radical transparency through genuine open source or risk building tomorrow's critical systems on today's black boxes.
The recent revelation of over 1,000 URLs containing verified Child Sexual Abuse Material in LAION-5B, an open source dataset used to train AI text-to-image generation models, underscores the importance of this choice. Establishing trust by creating unbiased, reliable, and safe AI has become crucial now that AI systems drive cars and offer medical assessments.
The Android story serves as a powerful reminder of the potential of open source. Android, a start-up acquired by Google in 2005, followed the path of open-source participation to victory in the smartphone market. Its open-source software accelerated innovation, democratized the playing field, and drove down prices.
As the AI industry angles for its models to become the next Android or iOS, the question of openness versus control remains a pressing one. The recently released video of a fatal 2023 Tesla Full Self-Driving (FSD) crash exposed the dangers of incomplete and biased datasets in AI models. Much of the early training data for AI models was drawn from web scraping performed by Common Crawl, whose corpus also underpins models such as ChatGPT and Llama.
In summary, China's leading open source LLMs like DeepSeek and Qwen3 share model weights openly and foster developer access, but withholding full training data remains common. This reflects a nuanced and evolving approach to data sharing, shaped by regulatory constraints and competitive considerations. Explicit open sharing of training data across all major Chinese LLMs, including ERNIE 4.5 and Kimi K2, has not been reported in detail, suggesting the practice remains limited or selective.