A flurry of minor model launches from Google, OpenAI, and... Meta?

December 30, 2025

🎄Happy holidays from Handy AI! Celebrate the season by grabbing a month-long trial of Handy AI Premium, which includes an extended weekly update and some other festive benefits.

Claim now!

last week’s top stories

🧠 Gemini 3 Flash lands as Google’s speed model. Gemini 3 Flash pushes latency and cost down while keeping strong multimodal throughput, which matters for agents that call tools and stream partial outputs. Google is positioning it as the “default” workhorse for apps that need fast reasoning, high QPS, and predictable spend. The competitive angle is simple: Flash-class models decide who wins distribution, because they are the ones developers ship at scale. Read more

🖼️ GPT Image 1.5 arrives inside ChatGPT. OpenAI shipped a new image model directly in ChatGPT, tightening the loop between prompting, iterative edits, and tool-assisted generation. Expect better text rendering, cleaner composition control, and fewer “model forgot my constraint” moments, especially when you keep context inside a single thread. Read more

🎧 Meta ships SAM Audio for promptable sound segmentation. SAM Audio extends the Segment Anything idea into audio, aiming for “select this sound” interactions across mixes and environments. If it works well, it becomes infrastructure for voice UX, assistive tech, and content tools that need separation, masking, and event extraction. Read more

🧾 Meta moves to buy Manus for over $2B. Meta is paying an estimated $2–3B to pick up Manus, a general-purpose agent company that has been chasing “do the work for me” workflows instead of chat. The near-term play is plugging Manus-style autonomy into Meta AI, WhatsApp SMB flows, and creator tooling where task completion matters more than personality. The deal drags China-ties scrutiny straight into Meta’s consumer stack. Read more

🧱 Nvidia licenses Groq inference tech in a $20B agreement. Nvidia is taking a nonexclusive license to Groq’s inference technology and bringing over key Groq talent, which signals serious urgency around inference economics. Nvidia is buying options on alternative inference architectures while aiming to keep its GPU platform dominant. Read more

🧑‍💻 OpenAI releases GPT-5.2-Codex for agentic coding. GPT-5.2-Codex is an update to OpenAI’s line of coding-specialized models, built specifically for long context, multi-file edits, repo navigation, and tool-driven workflows rather than single-shot snippets. The Codex line’s success comes from training the models on how teams build software today: search, plan, change, test, repeat. Read more

🧾 Claude Tasks mode shows up in testing. Early reports point to a Tasks mode that splits planning from execution, turning Claude into a workflow runner with visible steps instead of a single chat blob. If Anthropic leans into this, it becomes a native agent surface where users can intervene mid-run and audit what happened. Read more

🧰 ChatGPT opens app submissions to third parties. OpenAI is moving from “plugins were a thing” to a real app review and directory pipeline, which is the foundation for an ecosystem. The process pushes developers toward clean auth, scoped permissions, and repeatable UX inside ChatGPT rather than link-spam wrappers. Read more

📉 YouTube’s feed gets flooded with “AI slop”. A study found over 20% of videos shown to new users fall into low-quality AI-generated sludge optimized for clicks and monetization. This is what happens when generation cost goes to near-zero while ranking incentives stay crude. Platforms will either invest in provenance and quality signals or accept a recommendation layer that trains users to leave. Read more

🧪 Anthropic publishes phase two of its shopkeeping test. Project Vend 2 keeps pushing an agent into a messy real-world task: running a small shop with constraints, inventory drift, and social engineering attempts. The value here is the failure analysis, because it exposes where agents lose money, mishandle edge cases, or follow bad incentives. Treat it as a warning label for “autonomous operations” claims in 2026. Read more

🛡️ OpenAI recruits a head of Preparedness. OpenAI is staffing a senior Preparedness role, signaling more formal gating around frontier releases, eval design, and incident response. This is the job that decides what gets shipped, what gets slowed, and what needs mitigations before broad access. Read more

🦾 Microsoft pushes AI-assisted translation from C and C++ to Rust. Microsoft is exploring AI plus program analysis to migrate legacy C/C++ codebases into Rust, aiming to cut memory-safety bugs at the root. If this works, it becomes a new template for modernization (with AI as the compiler engineer’s multiplier). Read more

🧪 AI Research of the Week

Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
From Wang et al.

Jake’s Take: Scores on ARC and ARC-AGI (two of the leading benchmarks for tracking how well models perform compared to humans) keep getting treated as simple proof of reasoning, then used to glamorize product narratives. Wang and team see this as a problem and propose splitting evaluation into two stages: a perception stage that converts each puzzle image into a natural-language description in isolation, followed by a reasoning stage that induces the rule and produces the output from those descriptions alone. Describing each image in isolation keeps misleading induction signals out of the perception step.

Across Mini-ARC, ACRE, and Bongard-LOGO, the two-stage pipeline outperforms standard evaluation, and the authors’ trace review pins roughly 80% of failures on perception mistakes (like missing objects or misreading colors).

Benchmark owners and model labs should treat these findings as an opportunity to measure and report perception and reasoning separately, because a “reasoning progress” claim that does not account for perception errors says little about reasoning.
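
For readers who want to see the shape of a two-stage evaluation, here is a minimal Python sketch. It is not the authors’ code: the `Puzzle` fields, the toy `perceive` and `reason` functions, and the rule for blaming a failure on perception (the description differs from a reference description) are all assumptions made for illustration. In a real harness both stages would be model calls.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Puzzle:
    images: List[str]             # stand-ins for puzzle images (hypothetical IDs)
    true_descriptions: List[str]  # reference descriptions, used only to attribute errors
    answer: str


def evaluate_two_stage(
    puzzle: Puzzle,
    perceive: Callable[[str], str],
    reason: Callable[[List[str]], str],
) -> str:
    # Stage 1: describe each image in isolation, so no image can leak
    # hints about the rule into another image's description.
    descriptions = [perceive(img) for img in puzzle.images]
    # Stage 2: induce the rule from text only.
    prediction = reason(descriptions)
    if prediction == puzzle.answer:
        return "correct"
    # Assumed attribution rule: a wrong answer built on a wrong
    # description is counted as a perception error.
    if descriptions != puzzle.true_descriptions:
        return "perception_error"
    return "reasoning_error"


# Toy perception stage: a lookup table that misreads the color of image "d".
TOY_PERCEPTION = {
    "a": "red square", "b": "red circle",
    "c": "blue square", "d": "green circle",  # reference says "blue circle"
    "e": "red square", "f": "blue square",
}


def perceive(image_id: str) -> str:
    return TOY_PERCEPTION[image_id]


def reason(descriptions: List[str]) -> str:
    # Toy reasoner that only understands color, so shape rules defeat it.
    colors = {d.split()[0] for d in descriptions}
    return "same_color" if len(colors) == 1 else "different"


puzzles = [
    Puzzle(["a", "b"], ["red square", "red circle"], "same_color"),
    Puzzle(["c", "d"], ["blue square", "blue circle"], "same_color"),
    Puzzle(["e", "f"], ["red square", "blue square"], "same_shape"),
]

results = [evaluate_two_stage(p, perceive, reason) for p in puzzles]
tally = Counter(results)
print(tally)
```

The point of the split is the tally at the end: a single accuracy number would report one correct answer out of three, while the two-stage version shows that one miss came from misreading a color and the other from the reasoner itself.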

and then, even more news…

🧩 Google launches FunctionGemma for on-device tool calling. FunctionGemma is a small Gemma variant tuned for function calling, meant to translate intent into structured tool invocations on edge hardware. This is a practical counterweight to “everything in the cloud,” especially for privacy, latency, and offline agents. The goal is to make tool use reliable through a constrained format and to let developers fine-tune the model into domain agents. Read more
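
To make “structured tool invocations” concrete, here is a small Python sketch of the receiving side: the app parses the model’s output, checks it against a declared tool schema, and only then runs anything. The JSON shape (`name` plus `arguments`), the `set_timer` tool, and the `dispatch` helper are hypothetical and are not FunctionGemma’s actual output format.

```python
import json
from typing import Any, Callable, Dict


def set_timer(minutes: int) -> str:
    # Hypothetical on-device tool.
    return f"timer set for {minutes} min"


# Declared tools: the callable plus the expected type of each argument.
TOOLS: Dict[str, Dict[str, Any]] = {
    "set_timer": {"fn": set_timer, "params": {"minutes": int}},
}


def dispatch(raw: str) -> str:
    """Validate a model-emitted tool call and run it, or return an error string."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return "error: not valid JSON"
    if not isinstance(call, dict):
        return "error: expected a JSON object"

    name = call.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return f"error: unknown tool {name!r}"

    spec = TOOLS[name]
    args = call.get("arguments", {})
    # Reject missing or extra arguments instead of guessing.
    if not isinstance(args, dict) or set(args) != set(spec["params"]):
        return "error: wrong arguments"
    for key, expected_type in spec["params"].items():
        if not isinstance(args[key], expected_type):
            return f"error: {key} must be {expected_type.__name__}"

    fn: Callable[..., str] = spec["fn"]
    return fn(**args)


good = dispatch('{"name": "set_timer", "arguments": {"minutes": 5}}')
bad_type = dispatch('{"name": "set_timer", "arguments": {"minutes": "five"}}')
free_text = dispatch("sure, I will set a timer")
print(good, bad_type, free_text, sep="\n")
```

A constrained format pays off here because every failure mode turns into an explicit error the app can retry or surface, rather than a tool call made on a guess.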