Dheeraj Mundhra - UserTrace

clinical evaluation

How do you confidently know which AI model is best for your use case?

Benchmarking GPT-4o, Claude Sonnet 4.6, MedGemma 4B, and MedGemma 27B across 500+ simulated patient conversations on healthcare AI.

AI agents

Why “Average Accuracy” Is a Dangerous Metric for Healthcare AI Agents

If you are building AI agents for the healthcare industry, you have likely already accepted that “average accuracy” is a misleading comfort metric. 💡 In healthcare AI agents, hospitals, patients, and buyers are not purchasing technology alone; they are purchasing trust. One unsafe response in a single conversation can escalate and

ChatGPT Apps

Reliability Is the New Moat: What OpenAI Dev Day Really Signaled

The moment Sam Altman said, “It’s never been faster to go from idea to product,” the entire room at Dev Day nodded in agreement. With Apps SDK, AgentKit, and Codex, OpenAI just turned ChatGPT into the operating system for AI agents. Anyone can now design, deploy, and distribute AI-powered

AI user experience

Evaluating AI Agents Like Products, Not Prompts

In the age of agentic AI, shipping without deeply understanding user experience isn’t just a risk; it’s an existential threat to your product. Product development used to be a fast loop: ship, track, fix, repeat. A few A/B tests, some bugs logged, and you were iterating in

AI agents

How Pre-Release Simulation Makes AI Agents Reliable

One of my biggest fears while releasing AI agents was their unpredictable behaviour in production. You can test in staging, run evals on golden datasets, and even have your team dogfood the agent, yet the moment real users arrive, everything breaks. As AI agents move into customer-facing roles, these unpredictable