What AI Builders Get Wrong About Evals
Building AI Products That Last: Evaluations, Product Management, and the Future of AI Development
The success of AI products depends not just on better models, but on robust evaluation systems, thoughtful product management, and understanding the evolving landscape of AI capabilities. In this episode, we explore how AI product managers should approach building, measuring, and iterating on AI agents in an era of rapidly advancing capabilities.
Guest Introduction
Aman Khan is the Head of Product at Arize AI, a platform for developing, observing, and evaluating AI agents. With deep experience in both traditional ML and modern LLM systems, Aman brings unique insights into the product management challenges of building reliable AI systems. Having witnessed the evolution from traditional ML classification models to the current LLM agent landscape, he offers practical guidance on evaluation strategies and product development approaches.
Why AI Product Management is Different
New Skill Requirements: AI PMs need to understand model capabilities, prompt engineering, and evaluation systems - not just traditional product metrics and user research
Prototyping Revolution: Tools like Cursor, Bolt, and Replit have made prototyping so fast that engineers and designers are building working demos faster than those ideas can make it onto a roadmap
Context Engineering: The core skill is now expressing ideas clearly to AI systems and iterating based on feedback - similar to human communication but with different constraints
Domain Integration: AI PM isn't mutually exclusive with other specializations - being an AI PM in FinTech or HealthTech creates powerful domain-specific advantages
The Evolution of AI Evaluation
LLM-as-Judge Revolution: Traditional ML required expensive human labeling or long feedback loops; LLM systems can be evaluated by other LLMs using natural-language criteria (see the sketch after this list)
Human-Readable Outputs: Unlike numerical scores from traditional models, LLM outputs can be quickly assessed by humans, enabling faster iteration cycles
Beyond Simple Classification: Modern evals can assess hallucination detection, tool usage, context utilization, and business logic - not just sentiment or helpfulness
Reasoning Tokens for Evals: Evaluation systems can provide explanations for their judgments, making eval results more transparent and trustworthy than traditional metrics
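The LLM-as-judge pattern is easiest to see in code. Below is a minimal sketch, not Arize's implementation: `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt template, criteria string, and pass/fail labels are illustrative assumptions.

```python
# Minimal LLM-as-judge sketch: grade an agent response against
# natural-language criteria and return a label plus an explanation.
# NOTE: call_llm is a hypothetical placeholder -- wire it to your provider.

import json


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("Connect this to your LLM provider.")


JUDGE_TEMPLATE = """You are evaluating an AI assistant's answer.

Criteria: {criteria}

Question: {question}
Answer: {answer}

Respond in JSON with two keys:
  "explanation": one or two sentences justifying your judgment
  "label": either "pass" or "fail"
"""


def judge(question: str, answer: str, criteria: str) -> dict:
    """Ask a judge model to grade one answer; returns a dict with explanation and label."""
    prompt = JUDGE_TEMPLATE.format(criteria=criteria, question=question, answer=answer)
    raw = call_llm(prompt)
    return json.loads(raw)  # in practice: validate, retry on malformed output


# Example usage once call_llm is wired up:
# verdict = judge(
#     question="What is our refund window?",
#     answer="Refunds are available within 30 days of purchase.",
#     criteria="The answer must state a concrete timeframe supported by policy.",
# )
# print(verdict["label"], "-", verdict["explanation"])
```

Asking the judge for an explanation alongside the label is what makes these results more transparent than a bare numeric score, which is the "reasoning tokens for evals" point above.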
Building Effective Evaluation Systems
Product-First Approach: Start with defining product success metrics before writing evals - evaluation should serve business objectives, not exist for its own sake
Data Quality Foundation: The most important thing PMs can do is ensure high-quality labeled data and involve domain experts in the evaluation process
Iterative Playground Setup: Success requires systems that enable rapid experimentation across prompts, context, tools, and model parameters
Multi-Dimensional Scorecards: Effective evals measure both business logic (tool selection, reasoning) and subjective qualities (tone, helpfulness, correctness) - a scorecard sketch follows this list
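One way to make the scorecard idea concrete is a small sketch like the one below. Every field name, dimension, and the 0.7 threshold are assumptions chosen for illustration, not a schema from the episode or from Arize.

```python
# Illustrative multi-dimensional scorecard: deterministic business-logic
# checks from the agent trace plus judge-scored subjective qualities.
# Field names, dimensions, and the threshold are assumptions for this sketch.

from dataclasses import dataclass


@dataclass
class Scorecard:
    # Business logic: derived programmatically from the agent trace.
    correct_tool_selected: bool
    required_steps_completed: bool
    # Subjective qualities: scored 0-1 by an LLM judge or a human labeler.
    tone: float
    helpfulness: float
    correctness: float
    notes: str = ""

    def passes(self, threshold: float = 0.7) -> bool:
        """Pass only if the business logic holds and every subjective
        dimension clears the threshold."""
        logic_ok = self.correct_tool_selected and self.required_steps_completed
        subjective_ok = min(self.tone, self.helpfulness, self.correctness) >= threshold
        return logic_ok and subjective_ok


# One graded agent response:
card = Scorecard(
    correct_tool_selected=True,
    required_steps_completed=True,
    tone=0.9,
    helpfulness=0.8,
    correctness=0.65,
    notes="Cited the wrong policy section.",
)
print(card.passes())  # False -- correctness fell below the threshold
```

Separating the programmatic checks from the judge-scored dimensions keeps the scorecard aligned with product success metrics rather than a single opaque number.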
The AI Product Manager's Evolving Role
Code Familiarity: PMs should use tools like Cursor to ask questions about codebases and understand technical constraints without becoming developers
Prompt Collaboration: Active involvement in prompt iteration and agent development - the "secret sauce" often lives in carefully crafted system prompts
Evaluation Ownership: PMs should be hands-on with labeling data, writing evaluation criteria, and understanding why systems succeed or fail
Infrastructure Thinking: Building the right observability and experimentation systems is as important as the AI features themselves
Practical Evaluation Strategies
Start Simple, Scale Smart: Begin with basic helpful/unhelpful classifications, then expand to domain-specific evals like hallucination detection
Context-Aware Grading: Use available context (documents, conversation history, tool outputs) to assess whether agents are using information correctly
Explanation-Driven Evals: Include reasoning in evaluation outputs to understand why systems make certain judgments (see the sketch after this list)
Real-World Data Focus: Prioritize production data and edge cases over toy examples when building evaluation datasets
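Context-aware grading and explanation-driven evals combine naturally into a single grounding check. The sketch below assumes a hypothetical `call_llm` helper and an illustrative prompt template; the names in the commented usage (`flag_for_review`, `trace_id`) are placeholders for your own review workflow.

```python
# Sketch of a context-aware, explanation-first eval: the judge sees the
# retrieved context alongside the answer and must explain its reasoning
# before assigning a grounded / not_grounded label.
# NOTE: call_llm is a hypothetical placeholder for your model client.

import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to your LLM provider.")


GROUNDING_TEMPLATE = """Context retrieved for the agent:
{context}

User question: {question}
Agent answer: {answer}

First explain, in one or two sentences, whether every claim in the answer
is supported by the context above. Then give a final label.

Respond in JSON: {{"explanation": "...", "label": "grounded" or "not_grounded"}}
"""


def grade_grounding(context: str, question: str, answer: str) -> dict:
    """Grade whether an answer is supported by the context the agent actually saw."""
    prompt = GROUNDING_TEMPLATE.format(context=context, question=question, answer=answer)
    return json.loads(call_llm(prompt))  # validate / retry in production


# Example usage on a production trace (placeholders for your own workflow):
# result = grade_grounding(retrieved_docs, user_question, agent_answer)
# if result["label"] == "not_grounded":
#     flag_for_review(trace_id, result["explanation"])
```

Running a check like this over production traces rather than toy examples is what surfaces the edge cases worth adding to the evaluation dataset.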
Navigating Model Capability Evolution
Infrastructure Investment: While models will improve, the scaffolding around them (evaluation, observability, safety systems) will determine product success
Waymo Analogy: Like self-driving cars, AI systems need robust data collection and training pipelines to handle increasingly complex scenarios
Capability Expectation Management: As models improve, user expectations will rise, requiring more sophisticated failure detection and recovery systems
Platform vs. Application Strategy: Success comes from building robust infrastructure for learning from real-world data, not just riding model improvements
Developer and PM Adoption Patterns
Tool Experimentation: More PMs are moving from basic tools (Bolt, Lovable) to development environments (Cursor) for deeper AI integration
Community Learning: The "leaked system prompts" phenomenon shows how much product differentiation lives in prompt engineering and system design
AI Intuition Development: Successful AI PMs develop instincts for how AI products work by reverse-engineering successful applications
Expression as Core Skill: The fundamental capability is clearly communicating intent and requirements to AI systems
Future of AI Product Development
PM as Bottleneck: As Andrew Ng noted, product managers are becoming the critical path in AI development - knowing what to build is the key constraint
Enterprise Adoption Patterns: B2B AI products will need sophisticated evaluation and observability systems to meet enterprise security and compliance requirements
Integration Complexity: AI systems will need to work across dozens of tools and data sources, requiring robust evaluation across all integration points
Human-AI Collaboration: The biggest workforce change will be teaching people to delegate effectively to AI agents, similar to executive-assistant relationships
This episode provides practical frameworks for product managers building in the AI space, emphasizing that success comes from thoughtful evaluation strategies and robust product development practices rather than just access to better models.
Interested in being a guest on Future Proof? Reach out to forrest.herlick@useparagon.com