OpenAI and Anthropic fight for coding supremacy with GPT-5.3 Codex and Opus 4.6


🎆 Tired of having to explain AI stuff to your coworkers? Share Handy AI with them so that they can get the most important AI news delivered weekly to their inbox (in addition to our high-quality editorials).


last week’s top stories

🏈 Super Bowl ads turn into an OpenAI vs Anthropic brand knife fight. OpenAI went earnest, Anthropic went snarky, and the whole thing spilled into public accusations about honesty and commercialization. The two leading AI companies are competing to define what “responsible AI” feels like to mainstream users. Bonus implication: AI firms are now buying cultural real estate, with numerous other AI-based companies purchasing Super Bowl ad space alongside the big two. Read me

💻 Codex gets a macOS app. OpenAI finally shipped a native desktop client focused on agentic coding to compete with the likes of Cursor, Lovable, Replit, and others. The app is built around parallel agents, long-running tasks, and persistent context, so you can hand off work, interrupt it mid-flight, and keep steering without resetting the model’s brain every five minutes. Read me

🧠 GPT-5.3 Codex launches and admits it helped build itself. The release frames GPT-5.3-Codex as a broader “computer-doing” agent that can handle the full software lifecycle, not just code generation. Early versions were used to debug training, manage deployment, and diagnose evals, so the model contributed to its own shipping pipeline (yes, that sentence should make you squint). Also: faster iteration loops, stronger agent benchmarks, and a more explicit push toward being a general collaborator on a computer. Read me

🧩 Claude Opus 4.6 ships with a 1M-token context window. Anthropic is pitching Opus 4.6 as a step up in long-horizon work: bigger codebases, longer tasks, stronger review and debugging, and more reliable planning with a 1M context window. Included in the release is a “Fast” mode: the same model running with higher-speed inference, meaning output tokens arrive up to 2.5× faster at premium pricing ($$$). Read me

🎬 Kling 3.0 launches with longer, more consistent video generation. Kuaishou is upgrading Kling with better temporal consistency, more photorealistic output, longer clips (up to 15 seconds), and native audio support. That combo matters because the hardest part of video gen tends to be coherence across frames, followed closely by coherence between sound and motion. Read me

🎿 Olympics opening ceremony sparks backlash over an AI segment. Viewers complained about an AI-styled animated piece during the opening, calling it low-effort “AI slop” (which is a brutally honest label for a lot of generated media). While we’re seeing more and more major events using generative AI, audiences still demand taste, restraint, and craft. Read me

🚀 SpaceX acquires xAI in a mega-deal that fuses rockets with models. The move ties frontier model development to satellite infrastructure and a Musk-style vertical stack, with obvious implications for distribution and compute. Regulators will stare at this for a while, for reasons that do not require imagination. Read me


🧪 AI Research of the Week

Artificial Intelligence and the Interpretation of the Past
From University of Maine, University of Chicago

Jake’s Take: This paper measures how far AI answers drift from current scholarship in an interesting way. The authors used OpenAI models to generate hundreds of Neanderthal “day in the life” images (using DALL-E 3) and one-paragraph narratives (ChatGPT API, GPT-3.5), repeating prompts 100 times and toggling whether the prompt demanded “expert knowledge,” plus whether DALL-E’s prompt-rewrite feature got to “help.” Then they compared those outputs against published archaeological writing by embedding both the generated content and the scholarly text into the same vector space (CLIP embeddings), clustering themes (inspired by BERTopic), and checking how closely the AI’s content aligned with the literature.

The generated stuff often matches older eras of scientific framing, with the authors dating GPT-3.5’s narratives to an early-1960s vibe and DALL-E 3’s imagery to late-1980s/early-1990s assumptions. So if you ask a chatbot for “what Neanderthals were like,” you can get confident output that’s basically a retro exhibit label, unless the model has deep access to modern, machine-readable scholarship.
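For the curious, here is a minimal sketch of the "which era does this sound like?" idea. It is not the paper's pipeline: the three-number vectors below are made-up stand-ins for real embeddings, the era labels are illustrative, and `nearest_era` is a hypothetical helper. It only shows the general technique of comparing a generated text's embedding against the average embedding of scholarship from each era.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def nearest_era(generated, scholarship_by_era):
    """Return the era whose average embedding is closest to `generated`,
    plus the similarity score for every era."""
    scores = {
        era: cosine(generated, centroid(vectors))
        for era, vectors in scholarship_by_era.items()
    }
    return max(scores, key=scores.get), scores

# Toy data (assumption): real embeddings would have hundreds of dimensions
# and come from an embedding model, not from hand-typed numbers.
scholarship_by_era = {
    "1960s": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "1990s": [[0.4, 0.6, 0.1], [0.3, 0.7, 0.2]],
    "2020s": [[0.1, 0.3, 0.9], [0.0, 0.2, 0.8]],
}
generated_narrative = [0.8, 0.2, 0.1]

era, scores = nearest_era(generated_narrative, scholarship_by_era)
print(era)  # prints "1960s" for this toy data
```

With real embeddings, the same comparison would show which decade of writing a chatbot's answer most resembles, which is roughly the kind of result the authors report.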

Critically, the authors used highly outdated models in this study. So while the result is true, and expected, for these old models, the study is moot without an analysis of modern models that prioritize up-to-date knowledge access and fewer hallucinations.


and then, even more news…

🛠️ Anthropic shows “agent teams” building a C compiler. The engineering write-up is a practical stress test of multi-agent coordination: break a serious systems task into parallel sub-jobs, supervise lightly, and see where things snap. The takeaway is that coordinated tooling around these modern agents matters more than raw model IQ for staying sane over long tasks. Read me
