What AI Builders Get Wrong About Evals
Building AI Products That Last: Evaluations, Product Management, and the Future of AI Development
The success of AI products depends not just on better models, but on robust evaluation systems, thoughtful product management, and understanding the evolving landscape of AI capabilities. In this episode, we explore how AI product managers should approach building, measuring, and iterating on AI agents in an era of rapidly advancing capabilities.
Guest Introduction
Aman Khan is the Head of Product at Arize AI, a platform for developing, observing, and evaluating AI agents. With deep experience in both traditional ML and modern LLM systems, Aman brings unique insights into the product management challenges of building reliable AI systems. Having witnessed the evolution from traditional ML classification models to the current LLM agent landscape, he offers practical guidance on evaluation strategies and product development approaches.
Why AI Product Management is Different
New Skill Requirements: AI PMs need to understand model capabilities, prompt engineering, and evaluation systems - not just traditional product metrics and user research
Prototyping Revolution: Tools like Cursor, Bolt, and Replit have made prototyping so fast that engineers and designers are building working demos faster than those ideas can make it onto a roadmap
Context Engineering: The core skill is now expressing ideas clearly to AI systems and iterating based on feedback - similar to human communication but with different constraints
Domain Integration: AI PM isn't mutually exclusive with other specializations - being an AI PM in FinTech or HealthTech creates powerful domain-specific advantages
The Evolution of AI Evaluation
LLM-as-Judge Revolution: Traditional ML required expensive human labeling or long feedback loops; LLM systems can be evaluated by other LLMs using natural-language criteria (see the sketch after this list)
Human-Readable Outputs: Unlike numerical scores from traditional models, LLM outputs can be quickly assessed by humans, enabling faster iteration cycles
Beyond Simple Classification: Modern evals can assess hallucination detection, tool usage, context utilization, and business logic - not just sentiment or helpfulness
Reasoning Tokens for Evals: Evaluation systems can provide explanations for their judgments, making eval results more transparent and trustworthy than traditional metrics
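The LLM-as-judge pattern is easiest to see in code. Below is a minimal sketch, not Arize's implementation: `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt template, criteria string, and pass/fail labels are illustrative assumptions.

```python
# Minimal LLM-as-judge sketch: grade an agent response against
# natural-language criteria and return a label plus an explanation.
# NOTE: call_llm is a hypothetical placeholder -- wire it to your provider.

import json


def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("Connect this to your LLM provider.")


JUDGE_TEMPLATE = """You are evaluating an AI assistant's answer.

Criteria: {criteria}

Question: {question}
Answer: {answer}

Respond in JSON with two keys:
  "explanation": one or two sentences justifying your judgment
  "label": either "pass" or "fail"
"""


def judge(question: str, answer: str, criteria: str) -> dict:
    """Ask a judge model to grade one answer; returns a dict with explanation and label."""
    prompt = JUDGE_TEMPLATE.format(criteria=criteria, question=question, answer=answer)
    raw = call_llm(prompt)
    return json.loads(raw)  # in practice: validate, retry on malformed output


# Example usage once call_llm is wired up:
# verdict = judge(
#     question="What is our refund window?",
#     answer="Refunds are available within 30 days of purchase.",
#     criteria="The answer must state a concrete timeframe supported by policy.",
# )
# print(verdict["label"], "-", verdict["explanation"])
```

Asking the judge for an explanation alongside the label is what makes these results more transparent than a bare numeric score, which is the "reasoning tokens for evals" point above.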
Building Effective Evaluation Systems
Product-First Approach: Start with defining product success metrics before writing evals - evaluation should serve business objectives, not exist for its own sake
Data Quality Foundation: The most important thing PMs can do is ensure high-quality labeled data and involve domain experts in the evaluation process
Iterative Playground Setup: Success requires systems that enable rapid experimentation across prompts, context, tools, and model parameters
Multi-Dimensional Scorecards: Effective evals measure both business logic (tool selection, reasoning) and subjective qualities (tone, helpfulness, correctness) - a scorecard sketch follows this list
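One way to make the scorecard idea concrete is a small sketch like the one below. Every field name, dimension, and the 0.7 threshold are assumptions chosen for illustration, not a schema from the episode or from Arize.

```python
# Illustrative multi-dimensional scorecard: deterministic business-logic
# checks from the agent trace plus judge-scored subjective qualities.
# Field names, dimensions, and the threshold are assumptions for this sketch.

from dataclasses import dataclass


@dataclass
class Scorecard:
    # Business logic: derived programmatically from the agent trace.
    correct_tool_selected: bool
    required_steps_completed: bool
    # Subjective qualities: scored 0-1 by an LLM judge or a human labeler.
    tone: float
    helpfulness: float
    correctness: float
    notes: str = ""

    def passes(self, threshold: float = 0.7) -> bool:
        """Pass only if the business logic holds and every subjective
        dimension clears the threshold."""
        logic_ok = self.correct_tool_selected and self.required_steps_completed
        subjective_ok = min(self.tone, self.helpfulness, self.correctness) >= threshold
        return logic_ok and subjective_ok


# One graded agent response:
card = Scorecard(
    correct_tool_selected=True,
    required_steps_completed=True,
    tone=0.9,
    helpfulness=0.8,
    correctness=0.65,
    notes="Cited the wrong policy section.",
)
print(card.passes())  # False -- correctness fell below the threshold
```

Separating the programmatic checks from the judge-scored dimensions keeps the scorecard aligned with product success metrics rather than a single opaque number.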
The AI Product Manager's Evolving Role
Code Familiarity: PMs should use tools like Cursor to ask questions about codebases and understand technical constraints without becoming developers
Prompt Collaboration: Active involvement in prompt iteration and agent development - the "secret sauce" often lives in carefully crafted system prompts
Evaluation Ownership: PMs should be hands-on with labeling data, writing evaluation criteria, and understanding why systems succeed or fail
Infrastructure Thinking: Building the right observability and experimentation systems is as important as the AI features themselves
Practical Evaluation Strategies
Start Simple, Scale Smart: Begin with basic helpful/unhelpful classifications, then expand to domain-specific evals like hallucination detection
Context-Aware Grading: Use available context (documents, conversation history, tool outputs) to assess whether agents are using information correctly
Explanation-Driven Evals: Include reasoning in evaluation outputs to understand why systems make certain judgments (see the sketch after this list)
Real-World Data Focus: Prioritize production data and edge cases over toy examples when building evaluation datasets
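Context-aware grading and explanation-driven evals combine naturally into a single grounding check. The sketch below assumes a hypothetical `call_llm` helper and an illustrative prompt template; the names in the commented usage (`flag_for_review`, `trace_id`) are placeholders for your own review workflow.

```python
# Sketch of a context-aware, explanation-first eval: the judge sees the
# retrieved context alongside the answer and must explain its reasoning
# before assigning a grounded / not_grounded label.
# NOTE: call_llm is a hypothetical placeholder for your model client.

import json


def call_llm(prompt: str) -> str:
    raise NotImplementedError("Connect this to your LLM provider.")


GROUNDING_TEMPLATE = """Context retrieved for the agent:
{context}

User question: {question}
Agent answer: {answer}

First explain, in one or two sentences, whether every claim in the answer
is supported by the context above. Then give a final label.

Respond in JSON: {{"explanation": "...", "label": "grounded" or "not_grounded"}}
"""


def grade_grounding(context: str, question: str, answer: str) -> dict:
    """Grade whether an answer is supported by the context the agent actually saw."""
    prompt = GROUNDING_TEMPLATE.format(context=context, question=question, answer=answer)
    return json.loads(call_llm(prompt))  # validate / retry in production


# Example usage on a production trace (placeholders for your own workflow):
# result = grade_grounding(retrieved_docs, user_question, agent_answer)
# if result["label"] == "not_grounded":
#     flag_for_review(trace_id, result["explanation"])
```

Running a check like this over production traces rather than toy examples is what surfaces the edge cases worth adding to the evaluation dataset.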
Navigating Model Capability Evolution
Infrastructure Investment: While models will improve, the scaffolding around them (evaluation, observability, safety systems) will determine product success
Waymo Analogy: Like self-driving cars, AI systems need robust data collection and training pipelines to handle increasingly complex scenarios
Capability Expectation Management: As models improve, user expectations will rise, requiring more sophisticated failure detection and recovery systems
Platform vs. Application Strategy: Success comes from building robust infrastructure for learning from real-world data, not just riding model improvements
Developer and PM Adoption Patterns
Tool Experimentation: More PMs are moving from basic tools (Bolt, Lovable) to development environments (Cursor) for deeper AI integration
Community Learning: The "leaked system prompts" phenomenon shows how much product differentiation lives in prompt engineering and system design
AI Intuition Development: Successful AI PMs develop instincts for how AI products work by reverse-engineering successful applications
Expression as Core Skill: The fundamental capability is clearly communicating intent and requirements to AI systems
Future of AI Product Development
PM as Bottleneck: As Andrew Ng noted, product managers are becoming the critical path in AI development - knowing what to build is the key constraint
Enterprise Adoption Patterns: B2B AI products will need sophisticated evaluation and observability systems to meet enterprise security and compliance requirements
Integration Complexity: AI systems will need to work across dozens of tools and data sources, requiring robust evaluation across all integration points
Human-AI Collaboration: The biggest workforce change will be teaching people to delegate effectively to AI agents, similar to executive-assistant relationships
This episode provides practical frameworks for product managers building in the AI space, emphasizing that success comes from thoughtful evaluation strategies and robust product development practices rather than just access to better models.
Interested in being a guest on Future Proof? Reach out to forrest.herlick@useparagon.com