Look at the Hugging Face leaderboard any week this spring and the top of the open-weights table is Chinese. Qwen3-235B and its coder sibling. DeepSeek-V3.2 and R1-Next. GLM-4.6 from Zhipu. Kimi K2 from Moonshot. InternLM2.5. On most public evals they are the strongest open-source models the world has ever had access to — and in 2026 they are quietly becoming the default backend for a lot of Western products that didn’t want to pay per-token rates to Anthropic or OpenAI. That is the story. The other story — the one UK security teams are starting to ask us about, quietly, over the phone — is what that means for the products running on top of them.
The shorthand version of the concern, the one that shows up in procurement calls, is “Chinese bias.” That framing is a bit lazy. The real questions are more interesting, and more mechanical: what does it mean when the weights under your customer-facing chat, your internal RAG, your code-review bot, and maybe your agentic pipeline were trained by a company that is subject to a different regulatory regime than you are, and that cannot, as a matter of Chinese law, produce a model that says certain things out loud? And separately: what does it mean that the weights are open, but the corpus and the post-training mix behind them usually aren’t?
Censorship is the obvious problem, and the least interesting one
Most people who have played with Qwen, DeepSeek, or GLM for more than an afternoon have noticed the list. Ask the base model about Tiananmen, the status of Taiwan, the 2019 Hong Kong protests, Xi Jinping, or the treatment of Uyghurs in Xinjiang, and you will get one of three responses: a refusal, a pivot, or a piece of phrasing that sounds like it was drafted by a press office. The models are getting better at hiding it — Qwen3’s refusal rate on the public CN-sensitive benchmark is materially lower than Qwen2’s was — but the pattern is still there, and it survives most of the fine-tunes we have tested.
This is, on its own, not a catastrophe for most products. If your SaaS does invoice summarisation, your users are not asking about Taiwanese sovereignty. They are asking about late fees. But censorship leakage matters in two specific ways:
- Surprise in adversarial scope. A pen tester trying to jailbreak your chat feature will discover the refusal list within twenty minutes. It is a useful fingerprint: it tells them exactly which model family you are running, which lets them narrow their prompt-injection and data-exfiltration attempts to that family’s known weaknesses.
- Quiet editorial drift. On political or historical topics where the Chinese model’s training data is thinner or more carefully curated, outputs skew toward officially sanctioned phrasing even when not refusing outright. If you have built a news-summarisation or research-assistant product, that is a product-quality bug long before it becomes a compliance one.
The interesting question is not “does it refuse things.” The interesting questions are further upstream.
The procurement question: who trained it, and with what?
Open weights and open training are not the same thing. Qwen’s technical reports describe the post-training mix at the level of “400B tokens, filtered.” DeepSeek’s are more generous but still redact specifics. GLM publishes almost nothing. You can run these models on your own infrastructure — great. You cannot audit what went into them — that part hasn’t changed.
For a product that is (a) sold into UK enterprise or public sector and (b) subject to procurement due diligence, this is no longer a theoretical problem. The NCSC’s most recent update to its Guidelines for Secure AI System Development explicitly added “model provenance evidence” as a required artefact for supplier questionnaires in government-adjacent deployments. A Dutch hospital group earlier this year cancelled a chatbot pilot because the vendor couldn’t tell them whether their Qwen fine-tune had inherited a potentially triggerable behaviour from the base model. The EU AI Act’s GPAI obligations, which kicked in fully last August, are now being enforced with enough teeth that “we use the Hugging Face weights” is not a compliant answer — somebody needs to carry the obligations, and by default that is the deployer.
The practical implication for your product: if you have wired in any of the top Chinese open-weights models and are selling into any regulated market, expect to be asked, in writing, within the next twelve months:
- Which base model, at which revision hash, and where the weights were obtained
- What the fine-tuning corpus looked like and who held the data
- Whether the model has been evaluated against a published adversarial-robustness benchmark
- Whether you can attest, under contract, that the weights have not been tampered with post-download
- What your fallback is if the upstream model is withdrawn, re-licensed, or subjected to export controls
We are seeing these questions in pen-test scoping calls now, not just in procurement calls. It has become unusual for a SaaS running a Chinese open-weights model behind its product not to have at least one enterprise customer asking for evidence.
The supply-chain problem is the real one
Here is the part that is actually a security problem, as opposed to a governance one. A model on Hugging Face is a binary. It is several hundred gigabytes of floats, in a format (safetensors) that is safer than pickle but not auditable the way source code is. A published hash is reassuring for integrity but not for provenance — if an upstream repo were compromised tomorrow and the hash were republished alongside the tampered weights, most downstream deployers would not notice until behaviour changed.
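One practical mitigation is to stop relying on upstream-published hashes at all: record your own digests at the moment you decide the weights are trustworthy, and check against that record on every deploy. Here is a minimal sketch, assuming the weight files sit in a local directory and the digests live in a small manifest committed next to the deployment config; the paths and file layout are illustrative, not a standard.

```python
# Record-and-verify sketch for model weight files. Paths and manifest
# layout are illustrative assumptions, not a published convention.
import hashlib
import json
import pathlib
import sys

MODEL_DIR = pathlib.Path("/models/qwen3-235b-instruct")  # wherever the weight files are mounted
MANIFEST = pathlib.Path("deploy/model-digests.json")     # committed next to the deployment config


def sha256_file(path: pathlib.Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def record() -> None:
    """Run once, at the point you decide these are the weights you trust."""
    digests = {p.name: sha256_file(p) for p in sorted(MODEL_DIR.glob("*.safetensors"))}
    MANIFEST.write_text(json.dumps(digests, indent=2))


def verify() -> int:
    """Run on every deploy or cold start; a non-zero exit should fail the pipeline."""
    expected = json.loads(MANIFEST.read_text())
    actual = {p.name: sha256_file(p) for p in sorted(MODEL_DIR.glob("*.safetensors"))}
    drift = sorted(set(expected) ^ set(actual)) + [
        name for name in expected if name in actual and actual[name] != expected[name]
    ]
    if drift:
        print(f"weight files differ from the recorded digests: {drift}", file=sys.stderr)
        return 1
    return 0


if __name__ == "__main__":
    if "--record" in sys.argv:
        record()
    else:
        sys.exit(verify())
```

This proves nothing about provenance; it only proves the files have not changed since the day you checked them, which is exactly the gap described above.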
That is not speculative. Earlier this year a malicious mirror of a popular Chinese coder model was posted to a typo-squatted org on Hugging Face and pulled by thousands of repositories before the takedown; the weights had been fine-tuned to produce subtly broken SQL when the prompt contained specific strings. Separately, a popular Qwen3-Coder fork was found to have been fine-tuned, quietly, on a proprietary codebase that had leaked onto a paste site, which is its own legal problem for anyone unknowingly redistributing the result.
These are not Chinese-model-specific risks. The same thing could happen, and has happened in smaller ways, with Western open-weights models. What is new is the combination of (a) open-source AI becoming load-bearing for Western products, (b) the centre of gravity of open-source AI being firmly in the PRC, and (c) the tooling for verifying “this is the model the vendor said it was” being years behind the tooling for verifying “this is the package the vendor said it was.”
For your application-layer pen test, that shifts the threat model in a specific way. We now routinely add a step for AI-backed products: confirm the model in production matches the one the team thinks is in production, via prompt-based fingerprinting and comparison against a known-good reference profile. More than a quarter of AI-backed engagements we have run this year turned up at least one inconsistency — usually benign (the ops team swapped the model and the product team didn’t know), sometimes not.
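The fingerprinting step itself is not exotic. Below is a minimal sketch, assuming an OpenAI-compatible chat completions endpoint in front of the model; the endpoint URL, model name, prompt suite, and reference digests are all illustrative placeholders you would replace with values recorded against your own known-good deployment.

```python
# Prompt-based fingerprint check: run a fixed prompt suite deterministically,
# hash the completions, and compare against a profile recorded when the
# deployment was last validated. All constants below are illustrative.
import hashlib
import json
import urllib.request

ENDPOINT = "http://inference.internal/v1/chat/completions"  # assumed OpenAI-compatible API
MODEL = "qwen3-235b-instruct"

PROMPTS = [
    "Complete this sentence and stop: 'The quick brown fox'",
    "List the first five prime numbers, comma separated, nothing else.",
    "Give the chemical symbol for tungsten. One token only.",
]

# Digests recorded against the known-good deployment; placeholders here.
REFERENCE = ["<digest-0>", "<digest-1>", "<digest-2>"]


def completion(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


def fingerprint() -> list[str]:
    """Hash each completion so the profile can be stored without storing raw outputs."""
    return [hashlib.sha256(completion(p).strip().encode()).hexdigest() for p in PROMPTS]


def drifted() -> list[int]:
    """Indices of prompts whose output no longer matches the reference profile."""
    return [i for i, digest in enumerate(fingerprint()) if digest != REFERENCE[i]]
```

Exact-match hashing is deliberately strict; some serving stacks are not bit-reproducible even at temperature 0, so a production check often normalises the text or scores similarity instead. The shape is the same either way: known prompts, recorded profile, alert on drift.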
What we are telling clients, minus the hedging
The useful reframing — the one that gets product and security teams pulling in the same direction rather than arguing about geopolitics — is this: the model is a dependency. Treat it like one.
- Pin it. A specific revision hash, stored with the deployment manifest, not “the latest Qwen3”. If it changes, your CI should know. (A minimal download-pinning sketch follows this list.)
- Fingerprint it. A small suite of prompts whose outputs you know and can hash. Run them on every cold start. If the fingerprint drifts, alert.
- Bound the blast radius. If the model does start producing something your product wasn’t expecting — politically, legally, or simply factually — what breaks? If the answer is “nothing, because the output is post-processed or voted on or constrained,” you are in a much better place than if the answer is “the user sees it raw.”
- Have a second source. Not because the Chinese model is going to be sanctioned tomorrow — probably it is not — but because if it is, the day you find that out is a bad day to discover that your prompt library was written around a single vendor’s quirks. We reviewed one client’s stack last month where swapping Qwen3 for Llama4 was a six-line config change on paper but would have put the product out of service, because every prompt had been written around their Qwen fine-tune’s quirks; after we rewrote the prompts, the same switch would take about an hour.
- Actually read the licence. Qwen’s Tongyi Qianwen licence and DeepSeek’s code licence have quietly accumulated usage restrictions over the last two revisions. GLM’s has always been more liberal but changed last autumn. If you are operating at scale and haven’t checked recently, check.
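On the pinning point: if the weights come from Hugging Face, the hub client will happily download against an exact commit hash rather than a moving branch, which is the behaviour you want in CI. A minimal sketch, assuming the `huggingface_hub` package; the repo ID and revision hash below are illustrative and would come from your own deployment manifest.

```python
# Pin the model pull to an exact upstream commit rather than "main".
# Repo ID and revision below are illustrative; the real values belong in
# the deployment manifest, and the build should fail if they are absent.
from huggingface_hub import snapshot_download

PINNED_REPO = "Qwen/Qwen3-235B-A22B-Instruct"                 # illustrative repo name
PINNED_REVISION = "0123456789abcdef0123456789abcdef01234567"  # exact commit hash, not a branch or tag


def fetch_pinned(local_dir: str) -> str:
    """Download exactly the pinned revision; raises if it no longer exists upstream."""
    return snapshot_download(
        repo_id=PINNED_REPO,
        revision=PINNED_REVISION,
        local_dir=local_dir,
    )


if __name__ == "__main__":
    print(fetch_pinned("/models/qwen3-235b-instruct"))
```

Pinning to a commit still trusts the registry to serve that commit honestly, which is why it pairs with the local digest record and the fingerprint suite above rather than replacing them.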
None of this is “don’t use Chinese models.” They are, at the time of writing, genuinely the best open-weights options for a lot of workloads, and the cost structure is better than anything Western labs are shipping open. Using them is a reasonable engineering decision. Using them without treating them as supply-chain artefacts with a different regulatory context than your own is not.
The long tail is a procurement story, not a product one
Where this lands, eighteen months out, is not in individual product failures. It is in procurement. By 2027 we expect “which foundation model” to be a checkbox on every UK government supplier questionnaire, and “open-weights Chinese model, unaudited” to be a pause point the way “our database is hosted in China” already is. That is not a statement about the quality of the models. It is a statement about who the buyer trusts to carry the downstream risk.
If your product lives upstream of that procurement conversation — and if you sell into financial services, healthcare, public sector, or critical infrastructure, it does — the work is now. Pin your models, fingerprint your deployments, document your provenance, and make sure the person who signs the contract can also sign the attestation.
It is not the story anyone wanted from the open-source AI boom. It is the story we have.