Sonnet 5 and Mythos 5.1 are teased, as frontier labs sit down with world leaders

🐯 Before we get started
Right now a data center is being planned within shouting distance of the Nashville Zoo. While water and energy concerns around data centers are typically overblown, their sound pollution and growing CO2 emissions directly affect our world and animals, making construction near a zoo a nonstarter.
I urge you to sign the petition to prevent the Nashville Zoo build, and keep an eye out for unwise construction plans in your local community (data center or otherwise).
what to know for now
🔮 The Sonnet 5 and Mythos 5.1 rumor mill is running hot after a partner-platform leak. A claude-sonnet-5 identifier surfaced on Anthropic’s partner platform, and AI watcher Andrew Curran reported the company has finished training a higher-performance Mythos checkpoint (that may ship as Mythos 5.1, get renamed Mythos 6, or stay locked inside as an R&D engine). This follows the launch earlier this month of Claude Fable 5, the public Mythos-class model, with Mythos 5 itself restricted to vetted partners under Project Glasswing. Read more
🌍 The frontier labs sat down with Trump and the G7. At the June 17 summit in Évian, France, Dario Amodei, Sam Altman, and Demis Hassabis joined roughly a dozen executives for a closed-door lunch with G7 heads of state. Amodei and Hassabis pushed for a US-led coalition that pairs structured access to frontier models with chip-and-component trade that pointedly excludes China; Altman asked for an international forum to set shared testing standards. The labs are essentially asking to be treated as instruments of statecraft. Read more
🎨 Zhipu’s GLM-5.2 took the open-weight design crown. The open-weight 744B mixture-of-experts model shipped June 13 with a 1M-token context window and an MIT license, and it landed at number one on the crowdsourced Design Arena leaderboard with an Elo of 1360, edging out Claude Fable 5. It also posts 62.1 on SWE-bench Pro against GPT-5.5’s 58.6 and beats it on long-horizon coding for roughly a sixth of the cost.
🚪 Google lost a Nobel laureate and the man who co-wrote the Transformer paper in the same week. John Jumper, who shared the 2024 Nobel Prize in Chemistry with Demis Hassabis for AlphaFold, is leaving DeepMind after nearly nine years to join Anthropic. Two days earlier, Gemini co-lead Noam Shazeer, co-author of 2017’s “Attention Is All You Need,” announced he’s leaving for OpenAI. Read more
🖼️ Getty struck a display deal with OpenAI and its stock more than doubled overnight. The multi-year agreement puts Getty’s licensed photos and illustrations inside ChatGPT’s search and discovery results, so an answer can now come with a properly licensed image attached. Getty’s images explicitly won’t be used to train OpenAI’s models, no financials were disclosed, and shares spiked nearly 150% in premarket trading. Read more
🧪 AI Research of the Week
LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation
From Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, and Naohiko Matsuda
Jake’s Take: Half the AI industry now grades AI with AI, which is weird. You point a “judge” model at the answers and trust its verdicts. This paper borrows a term from physics, “dark current” (the noise a camera sensor produces even in total darkness), and shows that these judges have one. Feed a judge two answers of genuinely identical quality and it still picks a winner; a lot of what looks like a quality call turns out to be position bias: the model favors whichever answer it read first.
The reason this matters is that LLM-as-a-judge sits underneath a huge share of the benchmark numbers we read every week. If the ruler has a bias baked in even when there’s nothing to measure, the leaderboards might be ranking the judge’s quirks instead of the models. A companion June paper drove the point home from the other direction: across 20 models, more capable judges were often no better (and actually sometimes worse) at resisting the urge to favor their own answers.
Anyone running evals should stop treating a smart model as a neutral grader and start treating it as a measurement instrument that needs calibrating. Humans should still be in the loop on the calls that count.
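For anyone who wants to see what "calibrating" could look like in practice, here is a minimal sketch of a swap test: show the judge each pair of answers in both orders and count how often it just picks the first slot. This is not the paper's method, only a common sanity check. The `judge` callable and its "first"/"second" return values are assumptions for illustration; you would wrap your own judge model to match.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge interface (an assumption, not any real library's API):
# takes (prompt, answer_shown_first, answer_shown_second) and
# returns the string "first" or "second".
Judge = Callable[[str, str, str], str]


def position_bias_report(
    judge: Judge, items: List[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Run each (prompt, answer_a, answer_b) pair in both orders."""
    if not items:
        raise ValueError("need at least one (prompt, answer_a, answer_b) item")

    first_picks = 0  # how often the judge chose whatever sat in slot one
    consistent = 0   # how often the same answer won in both orderings

    for prompt, answer_a, answer_b in items:
        forward = judge(prompt, answer_a, answer_b)  # A shown first
        reverse = judge(prompt, answer_b, answer_a)  # B shown first
        first_picks += (forward == "first") + (reverse == "first")
        # If the same answer wins both times, the winning slot must flip.
        consistent += forward != reverse

    n = len(items)
    return {
        "first_slot_rate": first_picks / (2 * n),  # 0.5 means no slot preference
        "consistency": consistent / n,             # 1.0 means order never mattered
    }
```

A first-slot rate well above 0.5, or a consistency score well below 1.0, suggests the verdicts depend on ordering rather than quality. On pairs a human has already rated as equal, low consistency is the "dark current" idea in miniature: a signal with nothing to measure.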
what to know for later
💧 Nvidia says AI’s water problem is “largely solved.” At London Climate Week, chief sustainability officer Josh Parker said the data-center water challenge is mostly behind us, pointing to a recirculated water-and-glycol coolant that runs at 113°F (45°C) and a GB200 rack that hits 300x the water efficiency of air cooling. The coming Vera Rubin platform is built to take inlet water up to 45°C, warm enough that a Microsoft data-center exec said it could let most regions drop mechanical chillers entirely. “Largely solved” is a strong claim from the company that profits most from everyone building more data centers. Read more