DeepSeek says software change speeds AI responses
DeepSeek is seeking to turn software efficiency into a competitive advantage, saying a new serving framework can make its large language models respond as much as 85% faster without changing the underlying model or adding new chips.
The Chinese artificial intelligence startup said its DSpark framework has been deployed for live traffic on variants of its V4-Flash and V4-Pro models. The company describes DSpark not as a new model, but as an additional speculative decoding module attached to the same model checkpoint. In practical terms, that means DeepSeek is trying to get more output from existing hardware, an increasingly important goal as AI companies face rising costs for running chatbots, coding assistants and agentic systems.
According to DeepSeek, the deployed DSpark-5 configuration improves per-user generation speed by 60% to 85% on V4-Flash and by 57% to 78% on V4-Pro. The company says the improvement comes from changing how generated text is produced and verified during inference, the stage when an AI model responds to a user rather than being trained.
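A percentage speedup is not the same as the share of waiting time saved. As a rough illustration, assuming the reported figures describe increases in tokens generated per second for each user (an interpretation, not something DeepSeek's statement spells out), the arithmetic below converts them into reductions in time per token. The function name is hypothetical.

```python
def time_per_token_reduction(speedup_pct: float) -> float:
    """Percent less time per token, given a percent increase in tokens per second.

    Assumes "X% faster" means throughput rises by a factor of (1 + X/100).
    """
    return 100.0 * (1.0 - 1.0 / (1.0 + speedup_pct / 100.0))


# Ranges reported by DeepSeek for the DSpark-5 configuration.
reported = [("V4-Flash", 60, 85), ("V4-Pro", 57, 78)]

for model, low, high in reported:
    print(
        f"{model}: {low}% to {high}% faster is about "
        f"{time_per_token_reduction(low):.1f}% to "
        f"{time_per_token_reduction(high):.1f}% less time per token"
    )
```

Under that reading, an 85% speed gain trims the time for each token by a little under half, not by 85%.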
Large language models typically generate text one token at a time, a sequential process that can create bottlenecks even on powerful graphics processors. Speculative decoding uses a smaller, faster draft component to propose several tokens ahead. The larger model then checks those proposed tokens together in a single pass. When the guesses are accepted, several tokens are produced for roughly the cost of one step of the larger model, increasing speed and reducing latency.
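The draft-then-verify loop described above can be sketched in a few lines. This is a minimal toy of generic greedy speculative decoding, not DeepSeek's DSpark implementation: the two "models" are made-up arithmetic rules, and every name here (target_model, draft_model, speculative_decode) is hypothetical.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]


def target_model(context: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (context[-1] * 3 + len(context)) % 11


def draft_model(context: List[Token]) -> Token:
    """Toy stand-in for the draft: matches the target except at every 4th position."""
    guess = target_model(context)  # a real draft would be a separate, cheaper model
    return (guess + 1) % 11 if len(context) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of target verification passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target gives its own choice at each of the k + 1 positions.
        #    A real system does this in one batched forward pass.
        target_passes += 1
        checks = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which draft and target agree.
        accepted = 0
        while accepted < k and proposal[accepted] == checks[accepted]:
            accepted += 1
        # 4. Keep the accepted tokens plus the target's token at the next position.
        out.extend(proposal[:accepted])
        out.append(checks[accepted])
    return out[:goal], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
spec, passes = speculative_decode(target_model, draft_model, prompt, 20)
print("same output as baseline:", spec == baseline)
print("target passes:", passes, "for 20 new tokens")
```

Because every emitted token is one the target itself would have chosen, the output matches ordinary decoding; the saving comes from needing fewer passes of the large model.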
DeepSeek said DSpark uses a semi-autoregressive design, confidence-based verification and scheduling techniques intended to keep GPUs busy while avoiding wasted checks on weak draft tokens. The company has also released DeepSpec, a codebase for training and evaluating speculative decoding draft models, under an open-source MIT license.
Efficiency race reshapes AI competition
The announcement fits a broader pattern for DeepSeek, which rose to global prominence by arguing that advanced AI systems can be built and operated more cheaply than many Western competitors assumed. Its earlier V3 technical report described a mixture-of-experts model with 671 billion total parameters but only 37 billion activated for each token, along with architectural choices such as multi-head latent attention designed to reduce memory demands during inference.
Those claims drew intense attention because they challenged the idea that leadership in AI depends mainly on access to the largest clusters of the most advanced chips. DSpark extends that argument from model architecture into serving infrastructure, where operating costs can become a decisive factor for companies running AI products at scale.
If the reported gains hold up outside DeepSeek’s own systems, they could lower the cost of each generated token and allow more users to be served on the same hardware. That would matter for businesses experimenting with AI agents, reasoning models and coding tools, all of which can consume far more tokens than earlier chatbots.
Independent verification remains limited, however. DeepSeek’s headline figures are based on its own benchmarks and production data, and actual gains will depend on workload, model size, prompt mix and how often draft tokens are accepted. Speculative decoding is also not unique to DeepSeek, though the company’s decision to open-source tooling could help researchers and developers test similar approaches across other model families.
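How much the acceptance rate matters can be estimated with a standard back-of-the-envelope formula. The sketch below assumes each draft token is accepted independently with the same probability, a simplification that real workloads do not strictly follow, and it ignores the cost of running the draft itself. The draft length of 4 is an arbitrary illustrative value, not a DeepSeek setting, and the function name is hypothetical.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass of the large model.

    Assumes each of the draft_len proposed tokens is accepted independently
    with probability accept_rate, and that the pass always yields one token
    from the large model itself (a correction or a bonus token).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric sum: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


for rate in (0.5, 0.7, 0.9):
    tokens = expected_tokens_per_pass(rate, draft_len=4)
    print(f"acceptance {rate:.0%}: about {tokens:.2f} tokens per pass")
```

Under these assumptions, a workload where the draft is often wrong gains little over plain decoding, while one with predictable text, such as boilerplate code, gains far more.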
For China’s AI sector, the message is strategic as well as technical. With U.S. export controls limiting access to some advanced chips, software methods that squeeze more performance from available hardware are becoming central to competition. DeepSeek’s latest claim suggests the next phase of the AI race may be fought not only over larger models, but over how efficiently they can be served.