Thinking Machines has been in the news more for its co-founder exits to OpenAI than for its technical breakthroughs, but it has finally revealed some interesting advances in how humans interact with AI.
The San Francisco-based startup, founded by former OpenAI CTO Mira Murati in February 2025, has announced a research preview of what it calls “interaction models” — AI systems that handle real-time audio, video, and text natively, without bolting on external scaffolding. The timing is notable: half of Thinking Machines’ six-person founding team has left, including former CTO Barret Zoph, who returned to OpenAI under a cloud of controversy. Amid that turbulence, this announcement is the company’s clearest technical statement yet.

The Problem With Turn-Based AI
The core argument Thinking Machines is making is straightforward: current AI interfaces are designed around the model’s convenience, not the human’s. Today’s frontier models operate in a single thread — they wait for the user to finish speaking or typing, generate a complete response, and only then perceive the world again. This creates what the company calls a “collaboration bottleneck.”
The analogy they draw is apt: it’s like trying to resolve a crucial disagreement over email instead of in person. The bandwidth is simply too narrow. In real work, users rarely have requirements fully specified upfront — good outcomes require staying in the loop, course-correcting, and giving feedback along the way. The current turn-based paradigm pushes humans out not because the work doesn’t need them, but because the interface has no room for them.
What Interaction Models Do Differently
Thinking Machines’ proposed solution is a model architected around a 200ms “micro-turn” design. Rather than consuming a complete user turn and then generating a complete response, the model treats both input and output as continuous streams. Every 200 milliseconds, it processes a chunk of incoming audio or video while simultaneously generating a chunk of output. There are no artificial turn boundaries.
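To make the mechanics concrete, here is a rough sketch of what a 200ms micro-turn loop could look like. Thinking Machines has not published an API, so every object and method name below is hypothetical; only the cadence and the consume-while-producing structure follow the company’s description.

```python
# Hypothetical sketch of a 200ms micro-turn loop (all names are invented).
# On every tick the model ingests the latest chunk of audio/video AND emits
# the next chunk of output, so perception never pauses while it speaks.
import time

MICRO_TURN_SECONDS = 0.2  # the 200ms cadence described above

def run_interaction_loop(model, mic, camera, speaker):
    state = model.initial_state()              # rolling conversational context
    while True:
        tick_start = time.monotonic()

        # 1. Ingest whatever arrived during the last 200ms window.
        audio_chunk = mic.read(MICRO_TURN_SECONDS)
        video_frame = camera.latest_frame()

        # 2. A single step consumes the new input chunk and produces the next
        #    output chunk; staying silent is just emitting an empty chunk.
        state, output_chunk = model.step(state, audio_chunk, video_frame)
        if output_chunk:
            speaker.play(output_chunk)

        # 3. Sleep off the remainder of the window to hold the cadence.
        elapsed = time.monotonic() - tick_start
        time.sleep(max(0.0, MICRO_TURN_SECONDS - elapsed))
```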
This enables a set of capabilities that would otherwise require a harness of external components.
Dialog management becomes implicit — the model tracks whether you’re thinking, yielding, or inviting a response, with no separate component doing that work. It can also interject mid-sentence rather than waiting for you to finish, and it can speak at the same time as the user, which opens up things like real-time translation or live sports commentary.
Two capabilities stand out as genuinely new. The first is visual proactivity: the model can respond to what it sees changing, not just what it hears. Ask it to count your push-up reps, and it will actually do so — rather than saying “Sure!” and then falling silent while waiting for an audio cue that never arrives. The second is time awareness — the model has a direct sense of elapsed time, enabling things like timed breathing reminders or answering “how long did it take me to write that function?” without any special instrumentation.
The architecture also avoids large standalone encoders. Audio is ingested as dMel signals via a lightweight embedding layer; images are split into 40×40 patches through an hMLP. Everything is co-trained from scratch with the transformer, rather than stitched together.
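Neither the dMel front end nor the hMLP stem is publicly specified, but a simplified patch-embedding stem along these lines shows how 40×40 patches can reach the transformer without a large standalone vision encoder; the layer sizes and structure here are assumptions, not the actual design.

```python
# Simplified patch-embedding stem (an assumption-laden stand-in, not the
# actual hMLP): flatten 40x40 patches and project them with a small MLP.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, patch=40, channels=3, d_model=1024):
        super().__init__()
        self.patch = patch
        in_dim = patch * patch * channels          # one flattened 40x40 patch
        self.mlp = nn.Sequential(                  # lightweight, no big encoder
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, images):  # images: (B, C, H, W), H and W divisible by 40
        B, C, H, W = images.shape
        patches = (images
                   .unfold(2, self.patch, self.patch)
                   .unfold(3, self.patch, self.patch))        # (B, C, H/40, W/40, 40, 40)
        patches = (patches
                   .permute(0, 2, 3, 1, 4, 5)
                   .reshape(B, -1, C * self.patch * self.patch))
        return self.mlp(patches)                   # (B, num_patches, d_model)

tokens = PatchEmbed()(torch.randn(1, 3, 400, 640))  # -> (1, 160, 1024) patch tokens
```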
The Two-Model Architecture
For tasks requiring deeper reasoning than can be produced instantaneously, the interaction model delegates to an asynchronous background model. The interaction model stays present throughout — holding the conversation thread, taking new input — and weaves in background results as they arrive. This split gives users the responsiveness of a non-thinking model together with the reasoning depth of a thinking one.
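The delegation mechanism itself has not been described, but the foreground/background split can be sketched as a simple asynchronous loop in which the fast model keeps answering every micro-turn while a slower reasoning call runs concurrently. Everything named here is hypothetical.

```python
# Hypothetical sketch of the two-model split: the interaction loop keeps
# ticking while a slower background model reasons, and its result is woven
# into the conversation once it completes.
import asyncio

async def interaction_loop(fast_model, slow_model, io):
    pending = None                                   # in-flight background task
    while True:
        chunk = await io.next_input_chunk()          # arrives every ~200ms

        if pending is None and fast_model.needs_deep_reasoning(chunk):
            # Hand the hard part to the background model without blocking.
            pending = asyncio.create_task(slow_model.reason(chunk))

        background_result = None
        if pending is not None and pending.done():
            background_result = pending.result()     # weave it in this turn
            pending = None

        # The fast model always responds within the micro-turn budget,
        # folding in background results whenever they are ready.
        await io.emit(fast_model.step(chunk, background_result))
```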

The current production model, TML-Interaction-Small, is a 276B-parameter Mixture-of-Experts model with 12B active parameters. The company plans to release larger models later this year, though serving them within the strict latency constraints of 200ms micro-turns remains an open engineering challenge.
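“Active parameters” here refers to standard Mixture-of-Experts routing: each token is sent to only a few experts, so only a fraction of the total weights run per step. The toy layer below illustrates generic top-k routing and says nothing about Thinking Machines’ actual expert layout.

```python
# Generic top-k Mixture-of-Experts routing (a toy illustration of "active
# parameters", not Thinking Machines' architecture; all sizes are made up).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=512, n_experts=64, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                            # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # only k of n_experts run per token
            for e in idx[:, slot].unique().tolist():
                sel = idx[:, slot] == e
                out[sel] += weights[sel, slot, None] * self.experts[e](x[sel])
        return out
```

In TML-Interaction-Small the same principle operates at far larger scale: roughly 12B of the 276B total parameters are exercised per token.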
Benchmarks
On the interactivity-focused FD-bench v1.5, Thinking Machines claims a score of 77.8 — significantly ahead of GPT Realtime-2.0 (minimal) at 46.8 and Gemini-3.1-flash-live-preview at 54.3. On turn-taking latency (FD-bench v1), TML-Interaction-Small clocks in at 0.40 seconds, faster than any competitor listed.
Intelligence benchmarks are more nuanced. On Audio MultiChallenge, the model scores 43.4% — ahead of GPT Realtime-2.0 minimal (37.6%) and Gemini live preview (26.8%), though behind GPT Realtime-2.0 at its highest quality setting (48.5%).
For the novel proactive capabilities — visual counting, time-aware speech, cue-triggered responses — the company reports that no competing model can meaningfully perform them: they either stay silent or give incorrect answers.

The Bigger Picture
The AI industry has largely treated autonomous operation as the primary goal, with interactivity as an afterthought. Thinking Machines is making a different bet: that intelligence and interactivity need to scale together, and that the way to achieve that is to make interaction native to the model rather than patched on top of it. Whether that bet pays off commercially remains to be seen, but as a technical direction, it is one of the more coherent critiques of how current AI interfaces are built — and the first serious attempt to do something about it at scale.