Model Drop

Kimi K2.6

Kimi K2.6
Kimi K2.6

Model: Kimi K2.6 (kimi-k2.6)

Model type: Text + vision, with native image and video input

Ship date: April 20, 2026

Maker: Moonshot AI (Beijing)

Pricing: $0.60 / $2.50 per million input / output tokens on the Moonshot API. $0.60 / $2.80 on OpenRouter. Free weights on Hugging Face for self-hosting.

Available on: Kimi.com, the Kimi App, Kimi API, Kimi Code, Hugging Face (open weights), OpenRouter, and Vercel AI Gateway

Headline benchmarks: SWE-Bench Pro 58.6% (leads GPT-5.4 and Claude Opus 4.6), HLE-Full with tools 54.0% (leads every model Moonshot tested against), BrowseComp 83.2% (with Agent Swarm: 86.3%), DeepSearchQA F1 92.5%, Terminal-Bench 2.0 (Terminus-2) 66.7%, SWE-Bench Verified 80.2%.

Other info: 256K context window. Mixture-of-experts: 1 trillion total parameters, 32B active per token, 384 experts (8 selected + 1 shared), 61 transformer layers, Multi-head Latent Attention, SwiGLU activation, 160K vocab, 15.5T training tokens. Knowledge cutoff April 2025. Agent Swarm scales to 300 concurrent sub-agents across 4,000 coordinated steps (up from 100 / 1,500 on K2.5). License: Modified MIT (free commercial use; visible “Kimi K2.6” credit required on products with 100M+ MAU or $20M+/month revenue).

More details: Kimi K2.6 tech blog

What shipped

Moonshot AI dropped Kimi K2.6 yesterday, as an open-weight successor to K2.5 aimed squarely at long-horizon coding, agent swarms, and autonomous execution. It’s a mixture-of-experts model (at the same 1T / 32B-active parameter budget as K2.5), with a 256K context window, native multimodal input including video, and a Modified MIT license that lets you use it commercially.

Moonshot claims frontier-grade coding and agent performance at roughly 88% less than Claude Opus 4.7. The headline numbers support the framing on specific benchmarks. SWE-Bench Pro at 58.6% beats GPT-5.4 (57.7%) and Opus 4.6 (53.4%). Humanity’s Last Exam with tools at 54.0% leads every frontier model Moonshot compared against. And, Moonshot shipped workload proofs that are hard to fake: a 13-hour autonomous rewrite of exchange-core (8-year-old open-source financial matching engine) that produced a 185% throughput gain across 4,000+ lines of code and 1,000+ tool calls, plus a 12-hour port of Qwen 0.8B inference to Zig on a Mac.

Math (AIME 2026, HMMT), general reasoning (HLE without tools), and vision (MMMU-Pro, MathVision) still trail the closed frontier by 3-6 points.

What’s new

K2.6 is an iteration on the K2 MoE family with a handful of capabilities that don’t have clean analogues in the closed frontier.

  • Agent Swarm, scaled out. K2.6 can orchestrate up to 300 concurrent sub-agents across 4,000 steps, tripling K2.5’s 100-agent / 1,500-step ceiling. This is the closest thing the open ecosystem has to a “manager agent plus specialist workforce” primitive.

  • Sustained autonomous execution. Moonshot shipped a 5-day continuous-ops agent trace (monitoring, incident response, scheduled tasks) alongside the 12-hour Zig port and 13-hour exchange-core refactor.

  • Native multimodal input, now including video. K2 Thinking was text-only. K2.5 added vision. K2.6 adds video input at the same parameter budget.

  • Claw Groups (research preview). A new orchestration layer where humans and agents running on different devices, different models, and different vendor stacks operate in a shared space. K2.6 acts as the coordinator, matches tasks to agents by skill profile, and reassigns when an agent stalls.

  • Skills from documents. Upload a PDF, a spreadsheet, or a slide deck and K2.6 extracts the structural and stylistic DNA as a reusable “Skill.” The McKinsey-deck reproduction is the obvious demo, the less obvious use is reproducing a regulator’s filing format or a brand deck.

How and where to use it

Where it runs, what it actually does well, and where you’ll regret reaching for it.

  • Where it’s available:

  • What it’s good at:

    • Long-horizon coding across Rust, Go, Python, and front-end

    • Multi-file refactors on large codebases

    • Agent orchestration where you actually want 100+ parallel sub-agents

    • Tool-heavy browsing and deep research

    • Workloads where the cost-per-token ratio dominates the decision and you need near-Opus-class output at a fraction of the price

  • What it’s bad at / shouldn’t be used for:

    • Anything where mathematical correctness is load-bearing

    • Complex tool scheduling

    • Vision-heavy workloads

    • Regulated workloads where a Chinese-jurisdiction model is a non-starter regardless of capability

    • Anything where the K2.5 family’s documented hallucination tendency is a dealbreaker (Moonshot hasn’t published a K2.6 system card yet and nothing in the public materials claims that tendency has been fixed)

First impressions

The positives

Clement Delangue at Hugging Face framed K2.6 as the standout open-source model at launch. Simon Paxton’s writeup captured where that framing actually lands:

“Kimi K2.6 sets a new bar for open-source. It excels on coding tasks at a level comparable to leading closed source models... In early testing, it sustains long multi-step sessions with impressive stability, far beyond typical models.”

The single most-cited community signal: the exchange-core rewrite demo. Thirteen hours of unsupervised work, 1,000+ tool calls, 4,000+ lines of code, 185% throughput gain on an 8-year-old matching engine that was already operating near its performance limits. Described by Simon Paxton at dev.to as the kind of workload proof that distinguishes “actual long-horizon work” from “benchmark wins.”

The ComputeLeap cost analysis boiled the structural case down to a line every procurement team will run with:

“Kimi K2.6, the latest open-weight model from Beijing-based Moonshot AI, runs at $0.60 per million input tokens on the official API. Claude Opus 4.7, Anthropic’s frontier model, costs $5.00 per million input tokens. That’s an 8.3× difference — or roughly 88% cheaper.”

Eight-times-cheaper with OpenAI-compatible SDK support means the switching cost for an A/B is a one-line base URL change.

The negatives

Hacker News user nikcub posted the honest capability summary from someone with no skin in the game:

“Below sonnet and opus 4.0 on capability... better than gemini 2.5 pro on tool calling.”

That’s the working mental model most independent reviewers arrived at: K2.6 is not the best model available, it’s a price-for-capability tradeoff that works for specific workloads and breaks down on others. The same HN thread flagged a second concern, that K2.6 “does only slightly better than Kimi K2.5” on day-to-day work, and “struggles with domain-specific tasks.”

Blockchain.news kept returning to the same gap every independent reviewer is naming:

“Open-weights models underperform in real-world usage compared with closed models such as Claude Opus 4.6.”

Moonshot’s vendor benchmark table shows K2.6 winning on several agentic metrics. Third-party evaluations, where they exist, still put Claude Opus ahead on sustained multi-step reliability.

On safety, the independent evaluation of K2.5 documented significantly fewer refusals on CBRNE-adjacent prompts than GPT-5.2 and Claude Opus 4.5, plus elevated compliance on disinformation and copyright-infringement requests, plus political bias in Chinese-language outputs. Moonshot has not published a K2.6 system card. Until an independent red team retests on K2.6, the working assumption should be that the safety profile has not meaningfully changed.

Jake’s take

From the K2-family, K2.6 is the first open-weights release where the price-per-capability math starts hurting the closed frontier in an obvious way. Sixty cents in and two-fifty out against Opus 4.7’s five dollars and twenty-five is significant (especially as frontier labs continue to raise prices across the board).

For the long-horizon coding, K2.6 may eat a lot of volume out of the Claude and GPT-5 tier. If you’re spending five figures a month on Opus for code generation, you owe it to your budget to run the A/B. The OpenAI-compatible endpoint makes the test a one-line change.

The safety profile inherited from K2.5 is real; it’s a standard red-team doc showing the model will help with CBRNE (Chemical, Biological, Radiological, Nuclear, and high-yield Explosives)-adjacent prompts that Claude and GPT-5 refuse. Moonshot’s answer to that has been “it’s open weights, that’s the tradeoff,” which is honest and also a non-answer for anyone running K2.6 inside a regulated workload. Stack that on top of the data-jurisdiction question (Beijing-based lab, nation-state interest in agentic infrastructure, political censorship findings baked into the K2.5 safety paper), the hallucination inheritance community testers keep flagging, and you get a model that is an unambiguously great deal for the right workload and a liability for the wrong one.

The interesting question for the rest of the year is whether Moonshot ships a system card that changes the safety calculus, and whether anyone outside China is willing to trust the answer when they do.