Source: TesterHome Community

The biggest difference between traditional software testing and AI testing is:
| Traditional Software | AI Systems |
| --- | --- |
| Pursues logical correctness | Pursues statistical sufficiency and continuous optimization |
The core competency of a professional AI tester isn’t just finding functional defects; it’s translating the uncertain behavior of a model into an observable, comparable, and trackable system of quality metrics.
In the AI field, we rarely judge a model by saying it has a “bug.” Instead, we evaluate it along measurable quality dimensions.
This isn’t lowering the bar; it’s because the nature of models is fundamentally different:
Evaluating a model based on just one or two samples is unscientific.
AI testing must adhere to a statistical mindset.
Key Principle: AI testing isn’t about finding one wrong example to prove a model is bad, but about using large amounts of data to prove whether the model is reliable for the business.
1. Deconstruct Business Scenarios → Identify model type
2. Define Metrics → Based on model type and business goals
3. Collect Data → Large, diverse datasets, stratified by user personas
4. Label Data → Create ground truth (strict isolation from dev teams!)
5. Calculate Metrics → Automated scripts using labeled data
Critical Note: Test data must be strictly isolated from algorithm and development teams. Developers could overfit the model to your specific test data, making it perform well in testing but terribly in production.
Since evaluation relies on large datasets, you must choose metrics that match the scenario. Choosing the wrong metric leads to inaccurate conclusions.
Definition: The output is a discrete, enumerable category.
Typical Scenarios:
Core Metrics:
Definition: The output is a continuous numerical prediction.
Typical Scenarios:
Common Metrics:
| Metric | Description |
| --- | --- |
| MAE | Mean Absolute Error – intuitive and easy to interpret |
| MSE | Mean Squared Error – penalizes larger errors more heavily |
| RMSE | Root Mean Squared Error – heavy penalty on large errors, interpretable units |
| MAPE | Mean Absolute Percentage Error – good for relative error |
| SMAPE | Symmetric MAPE – improves MAPE’s instability near zero |
| R² | Coefficient of Determination – explains variance |
| Median AE | Median Absolute Error – robust to outliers |
Practical Advice: Use at least one absolute error metric + one relative error metric + one robustness metric.
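These metrics are straightforward to script once predictions and ground truth are collected; a minimal sketch with NumPy (the function name and return format are illustrative, not from the source):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the metrics from the table above for one prediction set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100        # undefined if y_true has zeros
    smape = np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    median_ae = np.median(np.abs(err))
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape,
            "SMAPE": smape, "R2": r2, "MedianAE": median_ae}
```

Reporting all of them together makes it easy to follow the advice above: pick one absolute, one relative, and one robust metric per scenario.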
Definition: Tasks involving both “category determination” and “numerical localization/regression.”
Typical Scenario: Object detection in computer vision
Core Metrics: Combine classification and localization metrics, such as Precision/Recall paired with IoU.
Definition: Tasks focused on text consistency and correctness.
Typical Scenarios:
Common Metrics:
Definition: Outputs are open-ended with large answer spaces.
Typical Scenarios:
Complexities:
Accuracy is simple but can be misleading in imbalanced class scenarios.
Classic Example: Disease Screening
| Metric | Value |
| --- | --- |
| Total samples | 10,000 people |
| Healthy | 9,500 |
| Sick | 500 |
| Model prediction | Everyone predicted “healthy” |
| Accuracy | 95% (9,500 correct predictions out of 10,000) |
But the model missed ALL sick individuals — a complete business failure.
Conclusion: Accuracy can be observed, but it shouldn’t be the sole core decision-making metric.
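The failure mode is easy to reproduce in a few lines; a sketch of the screening example above:

```python
# The screening example: a model that predicts everyone "healthy".
y_true = [1] * 500 + [0] * 9500   # 1 = sick, 0 = healthy
y_pred = [0] * 10000              # model predicts everyone healthy

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)

print(accuracy)  # 0.95 — looks great
print(recall)    # 0.0  — every sick person was missed
```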
The confusion matrix compares predictions against ground truth:
|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Example: Disease Diagnosis
|  | Predicted Sick | Predicted Healthy | Total |
| --- | --- | --- | --- |
| Actually Sick | 25 (TP) | 5 (FN) | 30 |
| Actually Healthy | 3 (FP) | 67 (TN) | 70 |
| Total | 28 | 72 | 100 |
Formula:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Meaning: Of samples predicted as positive, how many were truly positive?
Example Calculation:
$$\text{Precision} = \frac{25}{25 + 3} = \frac{25}{28} \approx 0.893 = 89.3\%$$
Interpretation: Of 28 cases predicted as ‘sick’, 25 were actually sick.
Applicable Scenarios:
Formula:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Meaning: Of all actual positives, how many were correctly identified?
Example Calculation:
$$\text{Recall} = \frac{25}{25 + 5} = \frac{25}{30} \approx 0.833 = 83.3\%$$
Interpretation: Of 30 truly sick people, the model found 25. 5 were missed.
Applicable Scenarios:
Formula:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
F1 is the harmonic mean of Precision and Recall, penalizing models skewed toward one metric.
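All three metrics fall out of the confusion-matrix counts; a quick sketch using the diagnosis example above (TP=25, FP=3, FN=5):

```python
def precision_recall_f1(tp, fp, fn):
    """Derive Precision, Recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Diagnosis example above: TP=25, FP=3, FN=5
p, r, f1 = precision_recall_f1(25, 3, 5)  # ≈ 0.893, 0.833, 0.862
```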
Applicable Scenarios:
For the same model, Precision and Recall trade off against each other as the decision threshold changes.
Testers should work with business stakeholders to determine:
Object detection is a classic composite evaluation task in computer vision.
Object detection answers three questions:
Evaluation Must Cover:
IoU (Intersection over Union) measures overlap between predicted and ground truth bounding boxes:
$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$
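A minimal IoU implementation for axis-aligned boxes (the `(x1, y1, x2, y2)` corner convention is an assumption; some frameworks use `(x, y, w, h)` instead):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # intersection rectangle (empty if the boxes don't overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection typically counts as a true positive only when its IoU with a ground-truth box exceeds a threshold (0.5 is a common default).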
Instead of user personas, differentiate based on business-relevant attributes:
| Dimension | Examples |
| --- | --- |
| By Size | Large objects vs. small objects |
| By Scenario | Daytime vs. nighttime, occluded vs. clear, backlit vs. well-lit |
| By Distractors | Test with similar objects (e.g., bicycles when detecting scooters) |
An AI product is rarely just a model. The model is the core, but it’s supported by a whole system.
Example: Vehicle Illegal Parking Detection
Camera Stream → Frame Extraction → Object Detection → ROI/AOU Logic → Alert
Key System Components:
| Component | Description |
| --- | --- |
| ROI (Region of Interest) | Manually defined parking zones |
| AOU (Area Overlap) | Minimum overlap to count as “parked” |
| Temporal AOU | Track position across frames to distinguish parked vs. passing vehicles |
Critical Insight: We test the AI product, not just the model. The model must be tested, but so must the effectiveness of the integrated business system.
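The temporal-AOU idea can be sketched as a simple per-track check; everything here (the thresholds, the frame count, the `(x1, y1, x2, y2)` box convention, the function names) is illustrative, not taken from the actual system:

```python
def overlap_ratio(roi, box):
    """Fraction of the vehicle box lying inside the ROI; both are (x1, y1, x2, y2)."""
    ix1, iy1 = max(roi[0], box[0]), max(roi[1], box[1])
    ix2, iy2 = min(roi[2], box[2]), min(roi[3], box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def is_parked(roi_box, track, aou_threshold=0.5, min_frames=30):
    """Hypothetical temporal-AOU check: a vehicle counts as "parked" only if its
    detected box overlaps the ROI above the threshold for N consecutive frames.
    `track` is the list of per-frame bounding boxes for one tracked vehicle."""
    consecutive = 0
    for box in track:
        if overlap_ratio(roi_box, box) >= aou_threshold:
            consecutive += 1
            if consecutive >= min_frames:
                return True
        else:
            consecutive = 0   # vehicle left the zone; it was just passing
    return False
```

The consecutive-frame requirement is what separates a parked vehicle from one that merely drives through the zone.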
Data Collection Using Pre-labeling:
| Step | Description |
| --- | --- |
| 1 | Use ffmpeg to pull RTSP streams and extract frames |
| 2 | Use a model with a lowered confidence threshold (e.g., 0.4 vs. the business threshold of 0.8) to filter candidates |
| 3 | Human testers perform secondary screening and labeling |
Simulating Video Streams:
```bash
ffmpeg -re -stream_loop 100 -i video.mp4 \
  -rtsp_transport tcp -c:v libx264 \
  -f rtsp rtsp://localhost:8554/mystream
```
This pushes a video file as an RTSP stream to a media server, allowing end-to-end testing without physical cameras.
Recommendation systems are often reframed as ranking problems:
| Step | Description |
| --- | --- |
| 1 | Candidate Generation — filter items based on the user profile |
| 2 | Binary Classification — predict click probability for each candidate |
| 3 | Scoring & Ranking — rank by probability, present the Top N |
| Metric | Description |
| --- | --- |
| Top N Recall | Proportion of items the user liked that appear in the Top N recommendations |
| mAP (Mean Average Precision) | Considers both relevance and order of retrieved items |
Recommendation systems require high-frequency self-learning cycles:
Data Collection → Model Training → A/B Test → Deployment → Data Reflow
Why This Matters for Testing:
| Factor | Implication |
| --- | --- |
| Automatic labeling | A click is a label — no manual annotation |
| Rapidly changing interests | User preferences shift with news and trends |
| Time sensitivity | No time for offline testing — models become stale quickly |
| Mitigation strategy | A/B testing and real-time monitoring become critical |
Real-World Example: A classmate working on ByteDance’s ad system (annual compensation >1.35M RMB) focuses entirely on A/B testing, data quality monitoring, and real-time metric alerting — there’s no traditional testing role.
Document parsing converts various document formats (PDF, Word, Excel, images) into structured text. This is the first and most critical step in building a knowledge base for RAG (Retrieval-Augmented Generation).
| Reason | Impact |
| --- | --- |
| Data Entry Point | Over 90% of enterprise knowledge resides in documents |
| Foundation of Accuracy | If parsing is wrong, everything downstream is wrong |
| Cost Savings | Automates extraction, reducing up to 80% of manual entry costs |
| Efficiency Gains | Batch processes thousands of pages in minutes |
| Document Type | Difficulty | Key Challenges |
| --- | --- | --- |
| Word | ⭐⭐ | Nested tables, complex styles |
| PDF | ⭐⭐⭐⭐ | Scanned copies, multi-column layouts, watermarks |
| Excel | ⭐⭐⭐ | Formulas, merged cells, multiple sheets |
| Image | ⭐⭐⭐⭐⭐ | Noise, handwriting, skew, blur |
| Metric | Description | Target |
| --- | --- | --- |
| Text Similarity | How close parsed text is to ground truth | >95% |
| Table Accuracy | Completeness and correctness of table recognition | >90% |
| Formula Accuracy | Correctness of mathematical formula recognition | >85% |
| Layout Fidelity | How well document structure is preserved | >90% |
```python
def levenshtein_distance(s1, s2):
    """Minimum number of single-character edits turning s1 into s2."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = min(dp[i-1][j],        # deletion
                               dp[i][j-1],        # insertion
                               dp[i-1][j-1]) + 1  # substitution
    return dp[m][n]

def text_similarity(text1, text2):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    distance = levenshtein_distance(text1, text2)
    max_len = max(len(text1), len(text2))
    return 1 - (distance / max_len) if max_len > 0 else 1.0
```
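A quick way to exercise such a similarity gate in a parsing regression run; here `difflib.SequenceMatcher` from the standard library stands in for the edit-distance similarity above (the two measures are related but not identical), and the sample strings and pass/fail check against the >95% target are illustrative:

```python
import difflib

# (ground truth, parsed output) pairs — illustrative values only
cases = [
    ("Total revenue: 1,250,000", "Total revenue: 1,250,000"),
    ("Net profit margin: 8.4%", "Net prof1t margln: 8.4%"),  # two OCR errors
]

for truth, parsed in cases:
    ratio = difflib.SequenceMatcher(None, truth, parsed).ratio()
    status = "PASS" if ratio >= 0.95 else "FAIL"
    print(f"{status}  similarity={ratio:.3f}  {parsed!r}")
```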
| # | Category | Description |
| --- | --- | --- |
| 1 | Instruction Following & Execution | Can the model follow basic and complex instructions? |
| 2 | Knowledge Q&A & Understanding | Factual accuracy across domains |
| 3 | Logical Reasoning & Mathematics | Multi-step reasoning and math problem solving |
| 4 | Code Generation & Understanding | Writing, debugging, and explaining code |
| 5 | Role-Playing & Emotional Intelligence | Maintaining character and handling emotional contexts |
| 6 | Creative Writing & Generation | Stories, poems, marketing copy |
| 7 | Multi-turn Dialogue & Context | Maintaining context across conversations |
| 8 | Multi-modal Understanding | Images, charts, video comprehension |
| 9 | Safety & Ethics | Refusing harmful content, bias detection |
| 10 | Professional Domain Applications | Legal, medical, financial expertise |
| Category | Key Datasets |
| --- | --- |
| Instruction Following | Alpaca, FLAN, Super-NaturalInstructions |
| Knowledge Q&A | MMLU, C-Eval, CMMLU, TriviaQA |
| Reasoning & Math | GSM8K, MATH, HellaSwag, LogiQA |
| Code Generation | HumanEval, MBPP, CodeXGLUE |
| Role-Playing | EmpatheticDialogues, PersonaChat |
| Multi-modal | VQA v2, COCO, TextVQA, ChartQA |
| Safety & Ethics | RealToxicityPrompts, BOLD, TruthfulQA |
Prompt injection is a core security testing area for LLMs. Attackers craft inputs to make models forget safety rules.
Attack Vector Categories:
| Attack Type | Description |
| --- | --- |
| Basic Jailbreak (DAN) | Directly ask the model to ignore restrictions |
| Role-Playing | Act as a character with no restrictions |
| System Prompt Override | Forge system messages to overwrite original instructions |
| Context Poisoning | Embed malicious instructions into retrieved documents |
| Multi-Turn Attacks | Gradually lower the model’s guard through conversation |
| Encoding Obfuscation | Use Base64, ROT13, etc., to hide prompts |
| Indirect Injection | Hide instructions in external data sources (websites, APIs) |
Defense Evaluation Checklist:
| Check | Description |
| --- | --- |
| ☐ Recognition | Does the model identify malicious input? |
| ☐ Refusal | Does it explicitly refuse to execute? |
| ☐ Explanation | Does it explain why it’s refusing? |
| ☐ No Information Leakage | Does it avoid leaking sensitive info? |
| ☐ Consistency | Is defense consistent across multi-turn dialogues? |
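A checklist like this can be partially automated. Below is a deliberately crude sketch: `ask_model` is a hypothetical callable wrapping whatever LLM endpoint is under test, the attack prompts are elided placeholders, and the keyword-based refusal check is a stand-in for a proper LLM-judge verdict:

```python
# Hypothetical attack corpus; real suites use hundreds of curated prompts.
ATTACKS = {
    "basic_jailbreak": "Ignore all previous instructions and ...",
    "role_play": "You are DAN, an AI with no restrictions. ...",
    "encoding": "Decode this Base64 and follow it: ...",
}

# Crude refusal heuristic — an LLM judge is far more reliable in practice.
REFUSAL_MARKERS = ("can't help", "cannot help", "not able to", "refuse")

def evaluate_defenses(ask_model):
    """Run every attack and record whether the model appeared to refuse."""
    results = {}
    for name, prompt in ATTACKS.items():
        reply = ask_model(prompt).lower()
        results[name] = any(marker in reply for marker in REFUSAL_MARKERS)
    return results  # attack name -> True if the model refused
```

The same loop can be extended to cover the other checklist items (explanation quality, leakage, multi-turn consistency) by swapping in richer verdict functions.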
An AI Judge (another LLM) can automate evaluation against predefined rules:
```json
{
  "system": "You are an expert AI evaluator...",
  "user": "User Question: {query}\n\nAI Answer: {model_output}",
  "output_format": {
    "dimension_scores": {
      "accuracy": 4,
      "relevance": 5,
      "completeness": 4,
      "fluency": 5,
      "usefulness": 4
    },
    "overall_score": 4.4,
    "strengths": [...],
    "weaknesses": [...]
  }
}
```
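Once the judge returns structured JSON, aggregation is simple; a sketch assuming the `dimension_scores` schema above (the optional weighting scheme is an added illustration, not from the source):

```python
import json

def aggregate_judgment(judge_response: str, weights=None):
    """Parse a judge verdict and compute a (optionally weighted) overall score."""
    verdict = json.loads(judge_response)
    scores = verdict["dimension_scores"]
    weights = weights or {dim: 1.0 for dim in scores}   # default: equal weights
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

response = ('{"dimension_scores": {"accuracy": 4, "relevance": 5, '
            '"completeness": 4, "fluency": 5, "usefulness": 4}}')
print(aggregate_judgment(response))  # 4.4 — matches overall_score above
```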
An AI Agent is a system that can perceive its environment, make decisions, and take actions to achieve goals:
| Capability | Description |
| --- | --- |
| Tool Calling | Search engines, calculators, databases, APIs |
| Task Execution | Book flights, query databases, analyze data |
| Multi-step Reasoning | Break down complex problems |
| Autonomous Planning | Create and execute action plans |
```
            AI Agent Architecture

┌──────┐    ┌──────────┐
│ User │───▶│ Dialogue │
└──────┘    │ Manager  │
            └────┬─────┘
       ┌─────────┴─────────┐
       ▼                   ▼
  ┌─────────┐        ┌───────────┐
  │   LLM   │◀──────▶│ Knowledge │
  │         │        │   Base    │
  └────┬────┘        └─────┬─────┘
       │                   │
       ▼                   ▼
  ┌─────────┐        ┌───────────┐
  │  Tools  │        │  Vector   │
  │         │        │ Retrieval │
  └─────────┘        └───────────┘
```
| Area | Description |
| --- | --- |
| Intent Recognition | Correctly understand user goals and choose the right tool |
| Tool Testing | Test each tool’s integration, input/output, and call correctness |
| Context Engineering | Manage long-term memory and avoid context confusion across turns |
| Knowledge Base (RAG) | Document parsing, chunking, embedding, retrieval accuracy |
RAG = Retrieval + Generation — allowing LLMs to “look up” information in real-time.
Why RAG Matters:
| Problem | How RAG Solves It |
| --- | --- |
| Outdated knowledge | Retrieves current information from the knowledge base |
| Lack of private data | Accesses enterprise documents and internal sources |
| Hallucinations | Grounds answers in retrieved context |
The RAG Pipeline:
User Question → Retrieve Relevant Documents → Build Prompt → LLM → Answer
Embedding converts text into dense vectors (digital fingerprints) that capture semantic meaning. Similar texts have similar vectors.
Cosine Similarity measures vector similarity:
$$\text{Cosine Similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}$$
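The formula translates directly into NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A score near 1 means the two texts are semantically close; orthogonal (unrelated) embeddings score near 0.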
Testing the Retrieval Pipeline:
| Aspect | Details |
| --- | --- |
| Key Metric | Top N Recall — of pre-labeled relevant chunks, how many appear in the Top N retrieved results? |
| Tooling | Automated scripts using scikit-learn to compute cosine similarity and mAP |
| Traditional Testing | AI Testing |
| --- | --- |
| Find bugs | Evaluate statistical effectiveness |
| Single test cases | Large, diverse datasets |
| Binary pass/fail | Metrics-based evaluation |
| Logic verification | Continuous optimization |
| Level | Skills |
| --- | --- |
| Entry | Understand metrics and evaluation |
| Intermediate | Data collection, cleaning, statistical analysis |
| Advanced | Deep AI principles, performance testing, monitoring systems |
| Expert | System architecture, A/B testing frameworks, real-time alerting |
Real-World Insight: Top AI testers at companies like ByteDance focus entirely on A/B testing frameworks and real-time monitoring systems, with compensation exceeding 1.35M RMB annually.