Reliability Is the New Moat: What OpenAI Dev Day Really Signaled

At OpenAI Dev Day 2025, building agents got easier. The next challenge is making them reliable.

The moment Sam Altman said, “It’s never been faster to go from idea to product,” the entire room at Dev Day nodded in agreement. With Apps SDK, AgentKit, and Codex, OpenAI just turned ChatGPT into the operating system for AI agents. Anyone can now design, deploy, and distribute AI-powered apps in minutes.

But there’s a quiet truth beneath the excitement: when building agents gets this easy, making them reliable becomes the hard part. What breaks AI agents isn’t the code; it’s the context.

In a world where agent creation is accelerating, reliability becomes the differentiator.

The Dev Day had three major releases. Each one unlocks huge capability and introduces new reliability challenges.

  1. The first was the Apps SDK, which allows developers to build apps directly within ChatGPT. This means agent builders must now ensure their apps communicate reliably with users based on the context flowing in from ChatGPT, and, even more importantly, across conversations that span multiple apps. The future increasingly points toward conversations becoming the new operating system, replacing Android or iOS as the primary interface. App builders will therefore need to use context to its fullest and make fewer assumptions in order to deliver the most relevant outcomes for the end user.
  2. The second was AgentKit, where developers can build and evaluate agents. While the demo showed the creation and testing of a Dev Day landing page agent, making these agents reliable for production is an entirely different challenge, which we’ll explore in the next section.
  3. The third was Codex, OpenAI’s latest coding model. The most exciting part was how developers can now use MCP servers to simulate users and evaluate agents directly within code-based environments like IDEs or SDKs, essentially building self-improving agents early in the development cycle. Let’s dive into each of these releases.

ChatGPT Apps SDK: When Context Becomes the New Interface

When Alex Christakis from OpenAI demonstrated a user asking ChatGPT to teach them about machine learning through Coursera, it was fascinating to see Coursera take the context of that query and return the right suggestions within ChatGPT. However, if the user had not explicitly mentioned “Coursera,” how would the suggested apps work? What kinds of queries from what types of users should trigger your app to appear as a relevant suggestion, and more importantly, how do you make that happen?

Now imagine the context switching for different user personas. The user asks for a short video summarizing a concept, but the explanation would differ for a 21-year-old engineer compared to a 35-year-old product leader. These are new “skills” that the Coursera agent would need to acquire to ensure users receive the most relevant and engaging experience.

Similarly, when Alex showed the demo of the Zillow app, what stood out was how the query “How far is this home from the dog park?” automatically took context from the UI to identify exactly which “home” the user was referring to and then used maps to generate the answer. The question is, how do you ensure that context transferred between apps consistently produces accurate and satisfying results that keep users coming back to your app?

Ultimately, these experiences, especially how context is transferred and interpreted, will determine how users perceive your app and how personalized it feels. Under the hood, this will depend on how effectively you can maximize the use of contextual signals you receive and the information shared in the output window, which is often limited to just a few sentences, a handful of images, or 10 seconds of interaction.

AgentKit: Where Building Agents Meets Evaluating Them

For the second release, when Christina Huang came in to demo Agent Builder, she built an agent for the OpenAI Dev Day website in just eight minutes, a task that would have seemed impossible before this release. Kudos to the team for making agent building and evaluation so seamless.

Let’s take this a level further. Once you’ve built the agent for the Dev Day website, how do you ensure it performs well across different scenarios? One approach is to design targeted evaluations for each use case. For example, a user might start by asking, “What session should I attend to learn about agents?” and then follow up with, “Who’s speaking at that one?” or “Add it to my calendar.” The agent must ensure it makes the right tool call with the correct parameters while handling tool call failures gracefully. When a user asks, “Show me all afternoon sessions on AgentKit,” but the schedule file is unavailable or outdated, the agent should handle the situation smoothly. It should communicate uncertainty transparently, for example by saying, “I can’t access the latest schedule, but here’s what I found in the stored version,” and offer fallback paths, such as suggesting the Dev Day website link.
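That recovery pattern can be made concrete with a small wrapper around the tool call: on failure, degrade to cached data and say so transparently. A minimal sketch in Python — `fetch_schedule`, the cached list, and the exact wording are all illustrative, not part of AgentKit:

```python
# Minimal sketch of graceful tool-call fallback (all names hypothetical).
from dataclasses import dataclass

@dataclass
class ToolResult:
    text: str        # what the agent says to the user
    degraded: bool   # True when we fell back to stale data

def get_sessions(fetch_schedule, cached_schedule, topic: str) -> ToolResult:
    """Try the live schedule tool; on failure, fall back to a cached copy
    with a transparent notice and a pointer to the Dev Day website."""
    try:
        sessions = fetch_schedule(topic)  # live tool call
        return ToolResult(f"Here are the {topic} sessions: {sessions}", degraded=False)
    except Exception:
        stale = [s for s in cached_schedule if topic.lower() in s.lower()]
        notice = ("I can't access the latest schedule, but here's what I found "
                  f"in the stored version: {stale}. For the most current times, "
                  "please check the Dev Day website.")
        return ToolResult(notice, degraded=True)
```

The point is not the wrapper itself but the design choice: failure produces a *useful, honest* answer, and the `degraded` flag gives your evaluation harness something to assert on.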

How would your agent handle failures? Would it respond gracefully without breaking user trust? These edge cases and moments of recovery will be the true differentiators in a market where building agents has become easy, but making them reliable is what will set winners apart.

Codex SDK: MCPs Enabling User-Level Testing at Dev Time

Romain Huet coded a Dev Day camera controller using voice commands live on stage with Codex, and it worked perfectly for the demo. It was an interesting addition to the growing ecosystem of autonomous coding tools such as Claude Code and Cursor.

One of the key aspects I liked was how Codex dynamically adjusted its thinking time based on the complexity of the task. When Romain asked the voice bot to light up the room, the lights responded instantly.

But a demo is not production.

In the real world, a conversational agent controlling physical hardware has to behave correctly when things don’t work. The light might not turn on. A tool call might fail. The device could return malformed intensity data. And asking the user, “What should the intensity be?” is not ideal, since the user may not know that beforehand. Instead, automatically lighting the room to an optimal intensity and then asking, “Does this look good, or would you like it brighter or dimmer?” would create a far better user experience.
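That “act first, then confirm” behavior fits in a few lines. The device calls (`set_light`, `read_intensity`) and the default level below are assumptions for illustration, not a real hardware API:

```python
# Sketch of an "act, then confirm" pattern for a light-control agent.
# set_light and read_intensity are hypothetical device calls, not a real API.
DEFAULT_INTENSITY = 0.7  # assumed sensible default on a 0.0-1.0 scale

def light_room(set_light, read_intensity) -> str:
    """Set a reasonable default, validate what the device reports back,
    and only then ask the user for a brighter/dimmer adjustment."""
    try:
        set_light(DEFAULT_INTENSITY)
    except Exception:
        return "I couldn't reach the lights. Want me to retry or check the connection?"
    raw = read_intensity()
    # The device could return malformed data; validate before trusting it.
    try:
        level = float(raw)
        if not 0.0 <= level <= 1.0:
            raise ValueError
    except (TypeError, ValueError):
        return "The lights are on, but I can't verify the level. Does this look good?"
    return f"I've set the lights to {int(level * 100)}%. Brighter or dimmer?"
```

Every branch still ends in a conversational turn the user can act on, which is exactly the difference between a demo that works once and an agent that recovers.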

These are the kinds of edge cases that never surface in perfect demos but always surface in production. That is exactly what user-level simulation during development is designed to catch.

Simulation: The Missing Layer and the Next Differentiator

Apps SDK made context the new interface, AgentKit made building agents visual, and Codex SDK made automation effortless. Yet what connects all three is simulation, the invisible infrastructure that ensures reliability before release. Simulation means replaying real user traces, generating edge cases, and testing how context drifts across apps, tools, and personas.
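In code, that loop can be as small as: replay recorded user turns, randomly mutate some of them into edge cases, and score the agent’s replies against behavioral checks. A toy sketch — the agent, traces, and checks here are illustrative, not any real framework:

```python
# Toy user-level simulation loop (all names illustrative).
import random

def simulate(agent, traces, mutate, checks, seed=0):
    """Replay recorded user traces, inject edge-case mutations on roughly
    half the turns, and report which behavioral checks each reply failed."""
    rng = random.Random(seed)  # seeded so failures are reproducible
    failures = []
    for trace in traces:
        for turn in trace:
            query = mutate(turn, rng) if rng.random() < 0.5 else turn
            reply = agent(query)
            for name, check in checks.items():
                if not check(query, reply):
                    failures.append((query, name))
    return failures
```

Checks might include “never reply with an empty string,” “always surface uncertainty when data is stale,” or “never invent a session that isn’t in the schedule” — the same edge cases discussed above, but run automatically before release.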

In a world where speed is no longer the constraint, reliability becomes the moat.

Simulation is the new Turing test for pre-production AI, where agents must not only work but behave as real users expect them to.

Building Reliable AI Agents with UserTrace

At UserTrace, we are building this missing simulation layer for AI agents, helping teams test context, uncover failure paths, and validate reliability before production.

Whether your agent lives inside ChatGPT, AgentKit, or Codex, UserTrace helps you move from “it works in the demo” to “it works in production.”
Book a demo →