US Government Says China's Best AI Models Lag Behind. Experts Aren't So Sure

TL;DR

The U.S. government claims China's DeepSeek V4 Pro AI model lags eight months behind U.S. models, based on a new evaluation. However, experts argue the performance gap is narrowing, with Stanford reporting only a 2.7% difference on public leaderboards.

Key points

DeepSeek V4 Pro is evaluated as eight months behind U.S. AI models
Cost comparison shows DeepSeek cheaper than GPT-5.4 mini
Stanford's AI Index indicates a 2.7% performance gap
CAISI's evaluation uses an IRT-based scoring system

Mentioned in this story

CAISIStanford

DeepSeek V4 ProGPT-5.4 miniClaude Opus 4.6

In brief

CAISI's evaluation ranked DeepSeek V4 Pro eight months behind the U.S. frontier, using an IRT-based scoring system across nine benchmarks including two private, unverifiable datasets.
The cost comparison excluded all U.S. models deemed too expensive or too weak—leaving only GPT-5.4 mini, against which DeepSeek was still cheaper on five out of seven benchmarks.
Stanford's 2026 AI Index found the U.S.-China performance gap on public leaderboards had collapsed to 2.7%. A U.S. government institute published its verdict on China's most powerful AI: eight months behind, and the more time passes, the wider the gap gets. The internet read the methodology and started asking questions. CAISI—the Center for AI Standards and Innovation, a unit inside NIST—released its evaluation of DeepSeek V4 Pro on May 1. The conclusion: DeepSeek's open-weight flagship "lags behind the frontier by about 8 months." CAISI also calls it the most capable Chinese AI model it has evaluated to date.

The scoring system

CAISI doesn't average benchmark scores like most evaluators do. Instead, it applies Item Response Theory—a statistical method from standardized testing—to estimate each model's latent capability by tracking which problems it solves and which it doesn't, across nine benchmarks in five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and math. The IRT-estimated Elo scores: GPT-5.5 at 1,260 points, Anthropic's Claude Opus 4.6 at 999. DeepSeek V4 Pro scores around 800 (±28), which is very close to GPT-5.4 mini at 749. In CAISI's system, DeepSeek sits closer to the old generation of GPT mini than to Opus. The points system in benchmarks score models the way standardized tests score students—not by raw percentage correct, but by weighting which problems they solve and which they miss, producing a points estimate that only means something relative to other models in the same evaluation. The more points, the better the model is in general terms, with the best model’s score becoming the reference point to see how capable a model is.

It’s impossible to reproduce CAISI’s results because two of the nine benchmarks are non-public, and in those two benchmarks is where the gap is widest. For example, GPT-5.5 scored 71% on CTF-Archive-Diamond, one of CAISI’s cybersecurity tests with DeepSeek registering around 32%. On public benchmarks, the picture shifts. GPQA-Diamond—PhD-level science reasoning, scored as percentage correct—placed DeepSeek at 90%, one point behind Opus 4.6's 91%. Math olympiad benchmarks (OTIS-AIME-2025, PUMaC 2024, SMT 2025) put DeepSeek at 97%, 96%, and 96%. On SWE-Bench Verified—real GitHub bug fixes, scored as percentage resolved—DeepSeek scored 74% to GPT-5.5's 81%. DeepSeek's own technical report claims V4 Pro matches Opus 4.6 and GPT-5.4. For cost comparison, CAISI filtered out any U.S, model that performed significantly worse or cost significantly more per token than DeepSeek. Only one model cleared the bar: GPT-5.4 mini. That's the entire U.S. frontier, filtered to a single entry. DeepSeek came out cheaper on 5 of 7 benchmarks even beating OpenAI’s tiniest and least capable AI model.

Q&A

What did the U.S. government say about China's DeepSeek V4 Pro AI model?

The U.S. government stated that DeepSeek V4 Pro is eight months behind U.S. AI models based on a recent evaluation.

How does DeepSeek V4 Pro compare to U.S. AI models in terms of cost?

DeepSeek V4 Pro was found to be cheaper than GPT-5.4 mini on five out of seven benchmarks, despite being the only U.S. model considered for comparison.

What does the Stanford 2026 AI Index report about the U.S.-China AI performance gap?

The Stanford 2026 AI Index reports that the performance gap between U.S. and Chinese AI models has narrowed to just 2.7% on public leaderboards.

What methodology did CAISI use to evaluate DeepSeek V4 Pro?

CAISI used an Item Response Theory-based scoring system across nine benchmarks to evaluate DeepSeek V4 Pro's capabilities.

US Government Says China's Best AI Models Lag Behind. Experts Aren't So Sure

TL;DR

Key points

In brief

The scoring system

Q&A

Related Articles

Haun Ventures Raises $1 Billion Fund for the Intersection of Crypto and AI Agents

DeepClaude Lets You Run Claude Code With DeepSeek's Brain for 17x Cheaper

GameStop's $55.5 billion eBay takeover bid puts its $368 million bitcoin stash in the crosshairs

DTCC Tokenized Securities Roadmap: Pilot In July, Scale Up In October—With Big Names Like Ripple

Does The Ethereum 300% Boost In Capacity Mean Price Can Rise 3x To $6,000?

DPRK Calls Cyber Theft Accusations ‘Absurd Slander’ Driven by Reptile Media

More from Crypto

The counterargument: Is the gap bigger or smaller?