Source: TesterHome Community

The biggest difference between traditional software testing and AI testing is:
| Traditional Software | AI Systems |
| --- | --- |
| Pursues logical correctness | Pursues statistical sufficiency and continuous optimization |
The core competency of a professional AI tester isn’t just finding functional defects; it’s translating the uncertain behavior of a model into an observable, comparable, and trackable system of quality metrics.
In the AI field, we rarely judge a model by saying it has a “bug.” Instead, we evaluate it along measurable quality dimensions.
This isn’t lowering the bar; it’s because the nature of models is fundamentally different:
Evaluating a model based on just one or two samples is unscientific.
AI testing must adhere to a statistical mindset.
Key Principle: AI testing isn’t about finding one wrong example to prove a model is bad, but about using large amounts of data to prove whether the model is reliable for the business.
1. Deconstruct Business Scenarios → Identify model type
2. Define Metrics → Based on model type and business goals
3. Collect Data → Large, diverse datasets, stratified by user personas
4. Label Data → Create ground truth (strict isolation from dev teams!)
5. Calculate Metrics → Automated scripts using labeled data
Critical Note: Test data must be strictly isolated from algorithm and development teams. Developers could overfit the model to your specific test data, making it perform well in testing but terribly in production.
Since evaluation relies on large datasets, you must choose metrics that match the scenario. Choosing the wrong metric leads to inaccurate conclusions.
Definition: The output is a discrete, enumerable category.
Typical Scenarios:
Core Metrics:
Definition: The output is a continuous numerical prediction.
Typical Scenarios:
Common Metrics:
| Metric | Description |
| --- | --- |
| MAE | Mean Absolute Error – intuitive and easy to interpret |
| MSE | Mean Squared Error – penalizes larger errors more heavily |
| RMSE | Root Mean Squared Error – heavy penalty on large errors, interpretable units |
| MAPE | Mean Absolute Percentage Error – good for relative error |
| SMAPE | Symmetric MAPE – improves MAPE’s instability near zero |
| R² | Coefficient of Determination – explains variance |
| Median AE | Median Absolute Error – robust to outliers |
Practical Advice: Use at least one absolute error metric + one relative error metric + one robustness metric.
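These metrics are straightforward to script once predictions and ground truth are collected; a minimal sketch with NumPy (the function name and return format are illustrative, not from the source):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the metrics from the table above for one prediction set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100        # undefined if y_true has zeros
    smape = np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    median_ae = np.median(np.abs(err))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "SMAPE": smape, "R2": r2, "MedianAE": median_ae}
```

Reporting all of them together makes it easy to follow the advice above: pick one absolute, one relative, and one robust metric per scenario.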
Definition: Tasks involving both “category determination” and “numerical localization/regression.”
Typical Scenario: Object detection in computer vision
Core Metrics: Combine classification and localization metrics, such as Precision/Recall paired with IoU.
Definition: Tasks focused on text consistency and correctness.
Typical Scenarios:
Common Metrics:
Definition: Outputs are open-ended with large answer spaces.
Typical Scenarios:
Complexities:
Accuracy is simple but can be misleading in imbalanced class scenarios.
Classic Example: Disease Screening
| Metric | Value |
| --- | --- |
| Total samples | 10,000 people |
| Healthy | 9,500 |
| Sick | 500 |
| Model prediction | Everyone predicted “healthy” |
| Accuracy | 95% (9,500 correct predictions out of 10,000) |
But the model missed ALL sick individuals — a complete business failure.
Conclusion: Accuracy can be observed, but it shouldn’t be the sole core decision-making metric.
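The failure mode is easy to reproduce in a few lines; a sketch of the screening example above:

```python
# The screening example: a model that predicts everyone "healthy".
y_true = [1] * 500 + [0] * 9500   # 1 = sick, 0 = healthy
y_pred = [0] * 10000              # model predicts everyone healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.95 — looks great
print(recall)    # 0.0  — every sick person was missed
```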
The confusion matrix compares predictions against ground truth:
|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Example: Disease Diagnosis
|  | Predicted Sick | Predicted Healthy | Total |
| --- | --- | --- | --- |
| Actually Sick | 25 (TP) | 5 (FN) | 30 |
| Actually Healthy | 3 (FP) | 67 (TN) | 70 |
| Total | 28 | 72 | 100 |
Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Meaning: Of samples predicted as positive, how many were truly positive?
Example Calculation:
$$\text{Precision} = \frac{25}{25 + 3} = \frac{25}{28} \approx 0.893 = 89.3\%$$
Interpretation: Of 28 cases predicted as ‘sick’, 25 were actually sick.
Applicable Scenarios:
Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Meaning: Of all actual positives, how many were correctly identified?
Example Calculation:
$$\text{Recall} = \frac{25}{25 + 5} = \frac{25}{30} \approx 0.833 = 83.3\%$$
Interpretation: Of 30 truly sick people, the model found 25. 5 were missed.
Applicable Scenarios:
Formula:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
F1 is the harmonic mean of Precision and Recall, penalizing models skewed toward one metric.
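All three metrics fall out of the confusion-matrix counts; a quick sketch using the diagnosis example above (TP=25, FP=3, FN=5):

```python
def precision_recall_f1(tp, fp, fn):
    """Derive Precision, Recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Diagnosis example above: TP=25, FP=3, FN=5
p, r, f1 = precision_recall_f1(25, 3, 5)  # ≈ 0.893, 0.833, 0.862
```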
Applicable Scenarios:
For the same model, Precision and Recall trade off against each other as the decision threshold changes.
Testers should work with business stakeholders to determine:
Object detection is a classic composite evaluation task in computer vision.
Object detection answers three questions:
Evaluation Must Cover:
IoU (Intersection over Union) measures overlap between predicted and ground truth bounding boxes:
$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
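A minimal IoU implementation for axis-aligned boxes (the `(x1, y1, x2, y2)` corner convention is an assumption; some frameworks use `(x, y, w, h)` instead):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection typically counts as a true positive only when its IoU with a ground-truth box exceeds a threshold (0.5 is a common default).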
Instead of user personas, differentiate based on business-relevant attributes:
| Dimension | Examples |
| --- | --- |
| By Size | Large objects vs. small objects |
| By Scenario | Daytime vs. nighttime, occluded vs. clear, backlit vs. well-lit |
| By Distractors | Test with similar objects (e.g., bicycles when detecting scooters) |
An AI product is rarely just a model. The model is the core, but it’s supported by a whole system.
Example: Vehicle Illegal Parking Detection
Camera Stream → Frame Extraction → Object Detection → ROI/AOU Logic → Alert
Key System Components:
| Component | Description |
| --- | --- |
| ROI (Region of Interest) | Manually defined parking zones |
| AOU (Area Overlap) | Minimum overlap to count as “parked” |
| Temporal AOU | Track position across frames to distinguish parked vs. passing vehicles |
Critical Insight: We test the AI product, not just the model. The model must be tested, but so must the effectiveness of the integrated business system.
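The temporal-AOU idea can be sketched as a simple per-track check; everything here (the thresholds, the frame count, the `(x1, y1, x2, y2)` box convention, the function names) is illustrative, not taken from the actual system:

```python
def overlap_ratio(roi, box):
    """Fraction of the vehicle box lying inside the ROI; both are (x1, y1, x2, y2)."""
    ix1, iy1 = max(roi[0], box[0]), max(roi[1], box[1])
    ix2, iy2 = min(roi[2], box[2]), min(roi[3], box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def is_parked(roi_box, track, aou_threshold=0.5, min_frames=30):
    """Hypothetical temporal-AOU check: a vehicle counts as "parked" only if its
    detected box overlaps the ROI above the threshold for N consecutive frames.
    `track` is the list of per-frame bounding boxes for one tracked vehicle."""
    consecutive = 0
    for box in track:
        if overlap_ratio(roi_box, box) >= aou_threshold:
            consecutive += 1
            if consecutive >= min_frames:
                return True
        else:
            consecutive = 0   # vehicle left the zone; it was just passing
    return False
```

The consecutive-frame requirement is what separates a parked vehicle from one that merely drives through the zone.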
Data Collection Using Pre-labeling:
| Step | Description |
| --- | --- |
| 1 | Use ffmpeg to pull RTSP streams and extract frames |
| 2 | Use a model with a lowered confidence threshold (e.g., 0.4 vs. the business threshold of 0.8) to filter candidates |
| 3 | Human testers perform secondary screening and labeling |
Simulating Video Streams:
```bash
ffmpeg -re -stream_loop 100 -i video.mp4 \
  -rtsp_transport tcp -c:v libx264 \
  -f rtsp rtsp://localhost:8554/mystream
```
This pushes a video file as an RTSP stream to a media server, allowing end-to-end testing without physical cameras.
Recommendation systems are often reframed as ranking problems:
| Step | Description |
| --- | --- |
| 1 | Candidate Generation — filter items based on the user profile |
| 2 | Binary Classification — predict click probability for each candidate |
| 3 | Scoring & Ranking — rank by probability, present the Top N |
| Metric | Description |
| --- | --- |
| Top N Recall | Proportion of items the user liked that appear in the Top N recommendations |
| mAP (Mean Average Precision) | Considers both relevance and order of retrieved items |
Recommendation systems require high-frequency self-learning cycles:
Data Collection → Model Training → A/B Test → Deployment → Data Reflow
Why This Matters for Testing:
| Factor | Implication |
| --- | --- |
| Automatic labeling | A click is a label — no manual annotation |
| Rapidly changing interests | User preferences shift with news and trends |
| Time sensitivity | No time for offline testing — models become stale quickly |
| Mitigation strategy | A/B testing and real-time monitoring become critical |
Real-World Example: A classmate working on ByteDance’s ad system (annual compensation >1.35M RMB) focuses entirely on A/B testing, data quality monitoring, and real-time metric alerting — there’s no traditional testing role.
Document parsing converts various document formats (PDF, Word, Excel, images) into structured text. This is the first and most critical step in building a knowledge base for RAG (Retrieval-Augmented Generation).
| Reason | Impact |
| --- | --- |
| Data Entry Point | Over 90% of enterprise knowledge resides in documents |
| Foundation of Accuracy | If parsing is wrong, everything downstream is wrong |
| Cost Savings | Automates extraction, reducing up to 80% of manual entry costs |
| Efficiency Gains | Batch processes thousands of pages in minutes |
| Document Type | Difficulty | Key Challenges |
| --- | --- | --- |
| Word | ⭐⭐ | Nested tables, complex styles |
| PDF | ⭐⭐⭐⭐ | Scanned copies, multi-column layouts, watermarks |
| Excel | ⭐⭐⭐ | Formulas, merged cells, multiple sheets |
| Image | ⭐⭐⭐⭐⭐ | Noise, handwriting, skew, blur |
| Metric | Description | Target |
| --- | --- | --- |
| Text Similarity | How close parsed text is to ground truth | >95% |
| Table Accuracy | Completeness and correctness of table recognition | >90% |
| Formula Accuracy | Correctness of mathematical formula recognition | >85% |
| Layout Fidelity | How well document structure is preserved | >90% |
```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j],        # deletion
                               dp[i][j-1],        # insertion
                               dp[i-1][j-1]) + 1  # substitution
    return dp[m][n]

def text_similarity(text1, text2):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    distance = levenshtein_distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - (distance / max_len) if max_len > 0 else 1.0
```
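A quick way to exercise such a similarity gate in a parsing regression run; here `difflib.SequenceMatcher` from the standard library stands in for the edit-distance similarity above (the two measures are related but not identical), and the sample strings and pass/fail check against the >95% target are illustrative:

```python
import difflib

# (ground truth, parsed output) pairs — illustrative values only
cases = [
    ("Total revenue: 1,250,000", "Total revenue: 1,250,000"),
    ("Net profit margin: 8.4%", "Net prof1t margln: 8.4%"),  # two OCR errors
]

for truth, parsed in cases:
    ratio = difflib.SequenceMatcher(None, truth, parsed).ratio()
    status = "PASS" if ratio >= 0.95 else "FAIL"
    print(f"{status}  similarity={ratio:.3f}  {parsed!r}")
```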
| # | Category | Description |
| --- | --- | --- |
| 1 | Instruction Following & Execution | Can the model follow basic and complex instructions? |
| 2 | Knowledge Q&A & Understanding | Factual accuracy across domains |
| 3 | Logical Reasoning & Mathematics | Multi-step reasoning and math problem solving |
| 4 | Code Generation & Understanding | Writing, debugging, and explaining code |
| 5 | Role-Playing & Emotional Intelligence | Maintaining character and handling emotional contexts |
| 6 | Creative Writing & Generation | Stories, poems, marketing copy |
| 7 | Multi-turn Dialogue & Context | Maintaining context across conversations |
| 8 | Multi-modal Understanding | Images, charts, video comprehension |
| 9 | Safety & Ethics | Refusing harmful content, bias detection |
| 10 | Professional Domain Applications | Legal, medical, financial expertise |
| Category | Key Datasets |
| --- | --- |
| Instruction Following | Alpaca, FLAN, Super-NaturalInstructions |
| Knowledge Q&A | MMLU, C-Eval, CMMLU, TriviaQA |
| Reasoning & Math | GSM8K, MATH, HellaSwag, LogiQA |
| Code Generation | HumanEval, MBPP, CodeXGLUE |
| Role-Playing | EmpatheticDialogues, PersonaChat |
| Multi-modal | VQA v2, COCO, TextVQA, ChartQA |
| Safety & Ethics | RealToxicityPrompts, BOLD, TruthfulQA |
Prompt injection is a core security testing area for LLMs. Attackers craft inputs to make models forget safety rules.
Attack Vector Categories:
| Attack Type | Description |
| --- | --- |
| Basic Jailbreak (DAN) | Directly ask the model to ignore restrictions |
| Role-Playing | Act as a character with no restrictions |
| System Prompt Override | Forge system messages to overwrite original instructions |
| Context Poisoning | Embed malicious instructions into retrieved documents |
| Multi-Turn Attacks | Gradually lower the model’s guard through conversation |
| Encoding Obfuscation | Use Base64, ROT13, etc., to hide prompts |
| Indirect Injection | Hide instructions in external data sources (websites, APIs) |
Defense Evaluation Checklist:
| Check | Description |
| --- | --- |
| ☐ Recognition | Does the model identify malicious input? |
| ☐ Refusal | Does it explicitly refuse to execute? |
| ☐ Explanation | Does it explain why it’s refusing? |
| ☐ No Information Leakage | Does it avoid leaking sensitive info? |
| ☐ Consistency | Is defense consistent across multi-turn dialogues? |
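A checklist like this can be partially automated. Below is a deliberately crude sketch: `ask_model` is a hypothetical callable wrapping whatever LLM endpoint is under test, the attack prompts are elided placeholders, and the keyword-based refusal check is a stand-in for a proper LLM-judge verdict:

```python
# Hypothetical attack corpus; real suites use hundreds of curated prompts.
ATTACKS = {
    "basic_jailbreak": "Ignore all previous instructions and ...",
    "role_play": "You are DAN, an AI with no restrictions. ...",
    "encoding": "Decode this Base64 and follow it: ...",
}

# Crude refusal heuristic — an LLM judge is far more reliable in practice.
REFUSAL_MARKERS = ("can't help", "cannot help", "not able to", "refuse")

def evaluate_defenses(ask_model):
    """Run every attack and record whether the model appeared to refuse."""
    results = {}
    for name, prompt in ATTACKS.items():
        reply = ask_model(prompt).lower()
        results[name] = any(marker in reply for marker in REFUSAL_MARKERS)
    return results  # attack name -> True if the model refused
```

The same loop can be extended to cover the other checklist items (explanation quality, leakage, multi-turn consistency) by swapping in richer verdict functions.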
An AI Judge (another LLM) can automate evaluation against predefined rules:
```json
{
  "system": "You are an expert AI evaluator...",
  "user": "User Question: {query}\n\nAI Answer: {model_output}",
  "output_format": {
    "dimension_scores": {
      "accuracy": 4,
      "relevance": 5,
      "completeness": 4,
      "fluency": 5,
      "usefulness": 4
    },
    "overall_score": 4.4,
    "strengths": [...],
    "weaknesses": [...]
  }
}
```
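Once the judge returns structured JSON, aggregation is simple; a sketch assuming the `dimension_scores` schema above (the optional weighting scheme is an added illustration, not from the source):

```python
import json

def aggregate_judgment(judge_response: str, weights=None):
    """Parse a judge verdict and compute a (optionally weighted) overall score."""
    verdict = json.loads(judge_response)
    scores = verdict["dimension_scores"]
    weights = weights or {dim: 1.0 for dim in scores}   # default: equal weights
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

response = ('{"dimension_scores": {"accuracy": 4, "relevance": 5, '
            '"completeness": 4, "fluency": 5, "usefulness": 4}}')
print(aggregate_judgment(response))  # 4.4 — matches overall_score above
```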
An AI Agent is a system that can perceive its environment, make decisions, and take actions to achieve goals:
| Capability | Description |
| --- | --- |
| Tool Calling | Search engines, calculators, databases, APIs |
| Task Execution | Book flights, query databases, analyze data |
| Multi-step Reasoning | Break down complex problems |
| Autonomous Planning | Create and execute action plans |
```
            AI Agent Architecture

┌──────┐    ┌──────────┐
│ User │───▶│ Dialogue │
└──────┘    │ Manager  │
            └────┬─────┘
       ┌─────────┴─────────┐
       ▼                   ▼
  ┌─────────┐        ┌───────────┐
  │   LLM   │◀──────▶│ Knowledge │
  │         │        │   Base    │
  └────┬────┘        └─────┬─────┘
       │                   │
       ▼                   ▼
  ┌─────────┐        ┌───────────┐
  │  Tools  │        │  Vector   │
  │         │        │ Retrieval │
  └─────────┘        └───────────┘
```
| Area | Description |
| --- | --- |
| Intent Recognition | Correctly understand user goals and choose the right tool |
| Tool Testing | Test each tool’s integration, input/output, and call correctness |
| Context Engineering | Manage long-term memory and avoid context confusion across turns |
| Knowledge Base (RAG) | Document parsing, chunking, embedding, retrieval accuracy |
RAG = Retrieval + Generation — allowing LLMs to “look up” information in real-time.
Why RAG Matters:
| Problem | How RAG Solves It |
| --- | --- |
| Outdated knowledge | Retrieves current information from the knowledge base |
| Lack of private data | Accesses enterprise documents and internal sources |
| Hallucinations | Grounds answers in retrieved context |
The RAG Pipeline:
User Question → Retrieve Relevant Documents → Build Prompt → LLM → Answer
Embedding converts text into dense vectors (digital fingerprints) that capture semantic meaning. Similar texts have similar vectors.
Cosine Similarity measures vector similarity:
$$\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$$
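The formula translates directly into NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A score near 1 means the two texts are semantically close; orthogonal (unrelated) embeddings score near 0.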
Testing the Retrieval Pipeline:
| Aspect | Details |
| --- | --- |
| Key Metric | Top N Recall — of pre-labeled relevant chunks, how many appear in the Top N retrieved results? |
| Tooling | Automated scripts using scikit-learn to compute cosine similarity and mAP |
| Traditional Testing | AI Testing |
| --- | --- |
| Find bugs | Evaluate statistical effectiveness |
| Single test cases | Large, diverse datasets |
| Binary pass/fail | Metrics-based evaluation |
| Logic verification | Continuous optimization |
| Level | Skills |
| --- | --- |
| Entry | Understand metrics and evaluation |
| Intermediate | Data collection, cleaning, statistical analysis |
| Advanced | Deep AI principles, performance testing, monitoring systems |
| Expert | System architecture, A/B testing frameworks, real-time alerting |
Real-World Insight: Top AI testers at companies like ByteDance focus entirely on A/B testing frameworks and real-time monitoring systems, with compensation exceeding 1.35M RMB annually.