Intelligent Test Grading & Release Risk Assessment | Quality Score Model

Learning Hub 2026-06-30 11:25 216

Learn how Baidu’s Quality Score Model enables intelligent test grading, release risk assessment, and data-driven QA automation to boost software delivery efficiency & quality control.

Source: TesterHome

1. Editor’s Preface

QA and R&D teams repeatedly run into the same unresolved quality and efficiency bottlenecks in day-to-day release testing. Four questions come up most often:

For minor code changes, is full regression testing mandatory? How to control risks if test scope is reduced?
Even after complete pre-release testing, how to quantify undiscovered hidden defects and residual online risks?
As automated testing matures, many QA workflows run unattended. Current evaluation standards only rely on case pass rate and coverage; there is no way to judge real automation effectiveness.
Some long-running automated test suites rarely detect new defects, becoming redundant overhead. Teams dare not delete them due to unknown potential risks.

To address these challenges, the Intelligent Testing Team began developing the Quality Score Model, with three core goals:

Use full-link R&D logs, developer self-test records and automation data to support data-based test scope decisions. Cut redundant testing work and shorten iteration cycles.
Aggregate all development and test metadata to predict project release risks, identify latent defects in advance and stabilize overall risk control capacity.
Dynamically schedule automated test tasks based on model risk output, match optimal test cases and execution cycles, and maximize testing input-output ratio.

After more than a year of algorithm research, system development and repeated verification in production, the team deployed the model on a unified quality efficiency platform with standardized data collection and automatic risk modeling. The solution was piloted and then fully deployed across multiple Baidu commercial business lines.

This article is the first chapter of a serialized technical series, focusing on intelligent test grading and release risk sign-off logic. Subsequent serial articles will detail unattended delivery and intelligent automation scheduling.

Full Serial Article Roadmap

Quality Score Model Series 1: Intelligent Test Grading & Risk Assessment – Commercial Platform Landing Practice
Quality Score Model Series 2: Large-Scale Model Deployment Based on Unified Quality Efficiency Platform
Quality Score Model Series 3: Research & Production Practice of Release Risk Prediction Algorithms
Quality Score Model Series 4: Multi-Dimensional Feature Mining Framework for Quality Scoring
Quality Score Model Series 5: Risk-Driven Fully Unattended Software Delivery Mechanism

2. Article Overview

This paper elaborates the two core business scenarios supported by the Quality Score Model: test tier classification and release risk gate review.

The model replaces subjective manual experience judgment with objective full-link data and standardized algorithm risk scoring. It avoids two extreme losses caused by human evaluation: risk leakage due to over-optimism, and efficiency waste due to over-conservative testing. The model balances delivery speed and product quality, and achieves unified quality & efficiency targets for all R&D projects.

3. Research Background & Traditional Testing Pain Points

For all testing teams, the long-term core optimization target is to quickly deliver high-value business features without sacrificing online product quality. This target is also the core demand of Agile development and continuous delivery systems.

3.1 Definition of Self-Service Testing Workflow

Under traditional R&D processes, product, front-end, back-end and QA staff will jointly confirm development cycles and test scope after requirement review.

For small iteration updates, most teams adopt lightweight release logic: developers finish self-testing and pass automated test suites, then directly launch to production. This manpower-saving mode is critical for fast iteration under high developer-to-QA staffing ratios.

With mature, product-level automated testing infrastructure widely deployed, testing work is largely covered by machines, and developers bear more pre-release verification responsibilities. Based on this trend, Baidu launched the Self-Service Testing system in 2019:

Self-Service Testing: A standardized R&D workflow where developers rely on internal unified testing services and stable automation assets to complete full quality verification. The Quality Score risk assessment engine or QA reviewer finally issues a pass/fail release judgment.

3.2 Two Core Bottlenecks of Manual Risk Evaluation

Large-scale promotion of self-service testing exposed two fatal defects brought by manual subjective evaluation:

Bottleneck 1: Manual test eligibility judgment carries obvious subjective bias

Teams only rely on staff experience to decide whether a project can enter self-service release, which triggers two types of loss:

Risk leakage (false negative): Underestimated project risks. In microservice and central platform architectures, modifying a single core function may link dozens of external APIs and third-party components, easily triggering unforeseen online failures.
Efficiency loss (false positive): Overestimated change risks. Even trivial code revisions require full QA regression testing, greatly slowing overall release throughput.

Manual judgment standards cannot be unified, especially for teams with a large number of new employees lacking cross-system business cognition.

Bottleneck 2: Manual release gate review leads to inconsistent risk sign-off standards

A developer passing self-test results does not represent sufficient quality verification. Current release approval fully depends on QA manually analyzing test reports, resulting in two outcomes:

Uncovered residual risks: Human negligence misses critical untested paths, causing defects to leak online.
Wasted R&D resources: Fully verified low-risk projects face redundant long-cycle regression testing under conservative manual rules.

3.3 Root Cause of Inconsistent Release Risk Control

All test grading and release decisions depend on personal experience, business familiarity and individual risk tolerance of evaluators. Different people produce completely different judgments for similar projects, resulting in uneven quality control standards across the enterprise.

The team drew design inspiration from risk control logic in the financial industry to build a fully data-driven alternative solution.

4. Core Design Idea: Reference Credit Risk Scoring from Financial Industry

4.1 Internet financial lending platforms use big-data credit scoring models to replace manual loan review.

They aggregate multi-dimensional user consumption, transaction and payment data to generate objective credit scores, which reduce bad debt risks while speeding up audit efficiency. The team transplanted this mature logic to software release risk control:

R&D iteration projects = loan applicants with differentiated risk attributes
Full lifecycle R&D behavior data = user identity and behavior data
Quality Score Model = credit risk scoring engine, outputting quantitative risk values to support automatic release decisions

The core binary question solved by the model: Can this project skip exclusive manual QA testing and complete release via self-service verification?

Low-risk signals: Small code modification volume, stable experienced developers & product managers, complete self-test regression coverage.
High-risk signals: Concealed core logic changes, inexperienced project owners, numerous downstream API and third-party service dependencies.

Enterprises generate massive structured R&D monitoring data in the digital transformation process. The solution collects full-lifecycle development data to construct fine-grained project risk portraits, trains machine learning scoring models, and outputs interpretable quantitative risk scores for automatic test grading and release gate control, fully visualizing invisible quality risks.

4.2 Official Definition of Project Quality Score Model

A project-centered quantitative analysis framework that aggregates structured metadata from requirement planning, coding development, test submission and verification phases to generate unified standardized risk scores. The scores support objective automated decisions for test tier division and release approval, and quantify hidden quality risks of each software iteration.

5. End-to-End Technical Implementation Framework

5.1 Precondition: Standardized Unified R&D Workflow

Stable deployment of the Quality Score Model relies on a standardized end-to-end development process, which provides two indispensable benefits:

5.1.1 Unified Data Aggregation Foundation

Unified unique identifiers (requirement ID, commit ID, branch tag) connect scattered data from requirement management, task tracking, code submission, CI pipeline, test environment and automation platforms to corresponding projects. Each manual or automated test action can be traced back to the source requirement, supporting accurate calculation of test coverage corresponding to each code change.

If the tool chain lacks unified association standards (e.g., one requirement ticket contains multiple unrelated feature developments, mixed code branches for different projects), cross-stage data cannot be aggregated by project dimension, and post-fault root cause analysis will fail.
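The aggregation step described above can be sketched as follows. This is a minimal illustration, not the platform's actual implementation: the record layout, the field names (`source`, `requirement_id`, `commit_id`) and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical records emitted by different tools; all field names are assumptions.
events = [
    {"source": "requirement", "requirement_id": "REQ-1", "title": "Add export button"},
    {"source": "commit", "requirement_id": "REQ-1", "commit_id": "a1b2c3", "changed_lines": 42},
    {"source": "ci", "requirement_id": "REQ-1", "commit_id": "a1b2c3", "passed": True},
    {"source": "commit", "requirement_id": None, "commit_id": "d4e5f6", "changed_lines": 7},
]


def group_by_requirement(records):
    """Group tool-chain records by requirement ID; keep unlinked records separately."""
    projects = defaultdict(list)
    orphans = []
    for record in records:
        req_id = record.get("requirement_id")
        if req_id:
            projects[req_id].append(record)
        else:
            # Without a shared identifier the record cannot be attributed to a project.
            orphans.append(record)
    return dict(projects), orphans


projects, orphans = group_by_requirement(events)
```

Records that end up in `orphans` are exactly the data that cannot be aggregated by project, which is why the shared identifier has to be enforced at the tool-chain level.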

5.1.2 Closed-Loop Model Optimization Mechanism

Model pre-judgment acts as the primary risk filter, and manual review serves as the final safety barrier for edge cases. Manually calibrated release results are marked as training samples and fed back to the model for iterative retraining, forming a self-optimizing closed data loop to continuously improve prediction accuracy.
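A minimal sketch of this feedback loop, assuming an in-memory store and a label convention of 1 for low risk and 0 for high risk. The class name `FeedbackStore` and its fields are hypothetical and only illustrate how human-calibrated results could become training samples.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackStore:
    """In-memory stand-in for the training-sample store (hypothetical)."""
    samples: list = field(default_factory=list)

    def record(self, features, model_label, human_label=None):
        """Store one release decision; a human label, when present, overrides the model."""
        final_label = model_label if human_label is None else human_label
        self.samples.append({
            "features": features,
            "label": final_label,
            "overridden": final_label != model_label,
        })
        return final_label


store = FeedbackStore()
store.record({"changed_lines": 30}, model_label=1)                  # model accepted as-is
store.record({"changed_lines": 900}, model_label=1, human_label=0)  # reviewer overrides
overrides = [s for s in store.samples if s["overridden"]]
```

Samples where the reviewer disagreed with the model are the most informative ones for retraining, so flagging them explicitly makes them easy to inspect.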

5.2 Multi-Dimensional Feature Selection & Full-Link Data Collection Pipeline

All objective risk scoring results are supported by multi-dimensional structured metadata. The team divides risk features into three mutually independent categories corresponding to development stages.

5.2.1 Category 1: Requirement Dimension Risk Features

Metrics reflecting demand-side risk intensity:

Requirement scale, cross-team collaboration quantity, requirement modification frequency, PM seniority, historical delivery defect rate.

Risk increases with more collaborators, frequent scope changes and inexperienced product managers.

5.2.2 Category 2: Code Development Dimension Risk Features

Metrics measuring code change risk:

Changed code lines, cyclomatic complexity of modified functions, call chain impact scope, affected internal/external APIs, third-party service dependencies, development cycle length, code review results, static analysis alarms, historical defect density of developers.

Risk increases with large code diffs, core API changes, third-party component access and junior developers.

5.2.3 Category 3: Test Verification Dimension Risk Features

Metrics reflecting risk elimination effectiveness:

Executed test case quantity, CI pass rate, incremental code coverage, bug convergence curve, defects per thousand lines of code (KLOC bug density).

Lower release risk corresponds to complete test coverage, stable CI execution, high incremental coverage and rapidly converging bug volume.

5.2.4 Unified Platform Data Collection Logic + JaCoCo Coverage Acquisition Case

All core feature data is pulled from the enterprise unified R&D tool chain through API subscription and event stream. Raw data is split into two types:

Direct available basic metrics: Requirement scale, modification count, development cycle duration, total changed code lines.
Composite derived metrics requiring offline mining: Code cyclomatic complexity, KLOC bug density, daily bug convergence trend, incremental test coverage.
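Two of the derived metrics can be illustrated with a short sketch. The function names, the use of changed lines as the denominator and the three-day comparison window are assumptions for this example; the article does not specify how the platform defines them.

```python
def kloc_bug_density(bug_count, changed_lines):
    """Defects per thousand changed lines of code."""
    if changed_lines <= 0:
        raise ValueError("changed_lines must be positive")
    return bug_count / (changed_lines / 1000)


def is_converging(daily_new_bugs, window=3):
    """Treat the bug trend as converging when the mean of the last `window`
    days is lower than the mean of the `window` days before them."""
    if len(daily_new_bugs) < 2 * window:
        return False  # not enough history to compare two windows
    recent = daily_new_bugs[-window:]
    previous = daily_new_bugs[-2 * window:-window]
    return sum(recent) / window < sum(previous) / window


density = kloc_bug_density(bug_count=6, changed_lines=1500)
converging = is_converging([5, 4, 6, 2, 1, 0])
```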

The platform collects all data transparently, with no manual operation by QA or R&D teams. Manual data entry would impose heavy operational costs and hinder enterprise-wide adoption, while unstable automated pipelines with high false-alarm rates would generate noisy data and degrade model precision.

Typical implementation case: Automatic Java incremental coverage collection based on JaCoCo

All test environments are initialized with the JaCoCo agent attached by default. Environment module, branch and commit information is synchronized to the quality score database. Manual and automated test operations generate real-time coverage snapshots cached in server memory.
A lightweight model client periodically pulls raw .exec coverage files from all running service instances. Shutdown hooks that run before a service goes offline force a coverage export, so no data is lost across service restarts.
The client assembles coverage files, source code and compiled JAR packages, and carries out secondary encapsulation based on the native JaCoCo framework to generate project-level incremental coverage reports. Parsed coverage metrics are imported into the model data warehouse for subsequent feature calculation.
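The final calculation, incremental coverage, can be sketched independently of JaCoCo itself. The example below assumes that three inputs have already been extracted upstream: the lines changed by the commit (from the diff), the executable lines and the covered lines (both from a parsed coverage report). The function name and data layout are hypothetical.

```python
def incremental_coverage(changed, executable, covered):
    """Share of changed executable lines that were hit by tests.

    Each argument maps a file path to a set of line numbers.
    Returns None when the change touches no executable lines.
    """
    total = 0
    hit = 0
    for path, lines in changed.items():
        # Comments and blank lines in the diff are not executable and are ignored.
        relevant = lines & executable.get(path, set())
        total += len(relevant)
        hit += len(relevant & covered.get(path, set()))
    return hit / total if total else None


changed = {"Foo.java": {10, 11, 12, 13}}
executable = {"Foo.java": {10, 11, 12, 20}}
covered = {"Foo.java": {10, 11, 20}}
ratio = incremental_coverage(changed, executable, covered)  # 2 of 3 changed executable lines
```

Restricting the denominator to changed executable lines is what distinguishes incremental coverage from whole-project coverage: untouched legacy code neither helps nor hurts the score.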

5.3 Two-Stage Modeling Scheme: Rule Base + Machine Learning

The model R&D process is divided into two evolutionary phases to solve the cold start and high-precision prediction demands respectively.

5.3.1 Stage 1: Cold Start Rule-Based Model & Limitations

Adopt simple if-else threshold judgment logic to realize initial risk classification and accumulate labeled historical data in the early stage.
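A cold-start rule of this kind might look like the sketch below. The feature names and every threshold value are hypothetical; the article does not publish the rules the team actually used.

```python
# Hypothetical thresholds; in practice these come from business domain experience.
RULES = {
    "max_changed_lines": 200,
    "max_affected_apis": 3,
    "min_incremental_coverage": 0.80,
}


def rule_based_grade(project, rules=RULES):
    """Return 'self_service' only when every threshold check passes."""
    if project["changed_lines"] > rules["max_changed_lines"]:
        return "manual_qa"
    elif project["affected_apis"] > rules["max_affected_apis"]:
        return "manual_qa"
    elif project["incremental_coverage"] < rules["min_incremental_coverage"]:
        return "manual_qa"
    else:
        return "self_service"


small_change = {"changed_lines": 35, "affected_apis": 1, "incremental_coverage": 0.92}
large_change = {"changed_lines": 850, "affected_apis": 1, "incremental_coverage": 0.92}
grades = (rule_based_grade(small_change), rule_based_grade(large_change))
```

Each decision made this way, together with the eventual release outcome, becomes a labeled sample for the later machine learning stage.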

Main defects of rule-based models:

Threshold configuration highly relies on deep business domain experience.
Any rule adjustment requires additional R&D development, with poor flexibility.

The team divides intelligent testing maturity into three stages:

Human-led collaboration: Data provides auxiliary reference, humans make all final release decisions.
Automatic execution layer: Fixed static rules drive machine judgment with minimal manual intervention.
Full intelligence: Machine learning driven by mass data replaces rigid manual rule sets.

Rule models only serve as the cold start foundation. After accumulating sufficient labeled samples, the team upgrades to machine learning classification models.

5.3.2 Stage 2: Binary Classification Machine Learning Model

Judging whether a project supports self-service release is a standard binary classification task:

Label 0: High risk, mandatory full manual QA testing
Label 1: Low risk, allow direct self-service release

The team evaluated multiple mainstream classification algorithms: KNN, Naive Bayes, Decision Tree, Logistic Regression, SVM, ensemble models (Random Forest, XGBoost, AdaBoost).

Considering data volume, model interpretability and unbalanced defect sample distribution, the production environment selects Logistic Regression and financial credit scorecard models as core prediction schemes, with Logistic Regression as the primary online model.

5.3.2.1 Logistic Regression Risk Prediction Principle

Logistic Regression is a linear binary classification algorithm. Linear regression outputs unbounded continuous values, whereas Logistic Regression passes the linear output through the sigmoid function, which squashes it into the interval (0, 1) so that it can be read as a probability and split into two risk classes.

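The sigmoid function maps the linear combination of multi-dimensional features to a probability between 0 and 1. Because Label 1 denotes low risk, this value is the predicted probability that the project is low risk. The system compares it against a configurable threshold (default 0.5):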

Score above threshold: Low release risk
Score below threshold: High risk, need supplementary testing
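The scoring and threshold step can be sketched in a few lines. The weights, bias and feature values below are made up for illustration, and treating a score exactly equal to the threshold as high risk is a conservative assumption of this sketch rather than something the article states.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def low_risk_probability(features, weights, bias):
    """Probability of Label 1 (low risk) from a linear combination of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def decide(probability, threshold=0.5):
    """Scores strictly above the threshold are low risk; ties are treated as high risk."""
    return "low_risk" if probability > threshold else "high_risk"


# Hypothetical, already-normalized features and weights.
p = low_risk_probability(features=[1.0, 0.5], weights=[2.0, -1.0], bias=0.0)
decision = decide(p)
```

Raising the threshold above 0.5 trades throughput for safety: fewer projects qualify for self-service release, but fewer high-risk projects slip through.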

In actual release scenarios, 100% full coverage testing is uneconomical and impractical. The Logistic Regression model dynamically matches differentiated verification standards according to project risk grades:

Low-risk minor modifications: Relaxed incremental coverage and small regression scope standards
High-risk core module reconstruction: Mandatory high incremental coverage, cross-system joint testing and full third-party dependency regression
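One way to express such differentiated standards is a lookup table keyed by risk grade. The sketch below is illustrative only: the grade names, the coverage thresholds and the evidence fields are all assumptions, not values from the production system.

```python
# Hypothetical verification standards per risk grade.
STANDARDS = {
    "low_risk": {
        "min_incremental_coverage": 0.60,
        "cross_system_joint_test": False,
        "third_party_regression": False,
    },
    "high_risk": {
        "min_incremental_coverage": 0.90,
        "cross_system_joint_test": True,
        "third_party_regression": True,
    },
}


def unmet_requirements(grade, evidence):
    """List the verification standards a project has not yet satisfied."""
    standard = STANDARDS[grade]
    gaps = []
    if evidence.get("incremental_coverage", 0.0) < standard["min_incremental_coverage"]:
        gaps.append("incremental_coverage")
    for check in ("cross_system_joint_test", "third_party_regression"):
        if standard[check] and not evidence.get(check, False):
            gaps.append(check)
    return gaps


evidence = {"incremental_coverage": 0.70, "cross_system_joint_test": True}
low_gaps = unmet_requirements("low_risk", evidence)
high_gaps = unmet_requirements("high_risk", evidence)
```

The same evidence passes as a low-risk change but leaves two gaps as a high-risk one, which is the differentiated behavior the paragraph describes.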

5.3.2.2 Financial Credit-Style Scorecard Model Calculation Logic

The solution borrows the widely used loan application scorecard (A-Score) in financial risk control, which aggregates multi-dimensional user attributes into a single readable integer score to quantify default probability. Transplanted calculation formula for software quality:

Final Project Quality Score = Base Score + Requirement Risk Score + Code Modification Score + Test Coverage Adjustment Score

When the total score exceeds the preset low-risk threshold, the project can be directly released. If not, teams must add manual verification, third-party joint testing or automated regression suites to improve coverage and raise the total quality score.

Standard scorecard development flow:

Data Collection: Aggregate historical project metadata to build labeled risk samples, including offline mining of legacy R&D logs and manually marked release records during pilot launch.
Exploratory Data Analysis: Analyze feature missing values, outliers and distribution characteristics through statistical charts to formulate data preprocessing schemes.
Data Preprocessing: Missing value filling (mean/median/random forest prediction), outlier filtering based on business rules.
WOE binning + Logistic Regression training: Discretize continuous features through equal-frequency, equal-interval or K-means binning, then apply Weight of Evidence (WOE) encoding so that every bin of every feature is expressed on a common log-odds scale, which also reduces overfitting to noisy raw values.
Scorecard calibration: Convert Logistic Regression log-odds output into readable integer scores through fixed constants A and B, define PDO (Points to Double Odds) parameters to control score risk sensitivity. Each feature bin corresponds to a fixed score weight; the sum of all weights plus base score equals the final quality score. Higher total scores represent lower release risks.
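The WOE encoding and score calibration steps can be sketched as below, using the conventions common in credit scorecards: WOE as the log ratio of the "good" share to the "bad" share in a bin, and Score = A + B × ln(odds) with B = PDO / ln 2. Here "good" is taken to mean low risk. The sign convention, the base score of 600, the base odds of 1:1 and the PDO of 20 are assumptions for illustration; the article does not give the constants the team used.

```python
import math


def woe(good_in_bin, bad_in_bin, good_total, bad_total):
    """Weight of Evidence of one bin: ln(share of goods / share of bads).

    Assumes all four counts are positive; real pipelines usually smooth zero counts.
    """
    return math.log((good_in_bin / good_total) / (bad_in_bin / bad_total))


def scaling_constants(pdo, base_score, base_odds):
    """Derive A and B so that `base_odds` maps to `base_score`
    and every doubling of the odds adds `pdo` points."""
    b = pdo / math.log(2)
    a = base_score - b * math.log(base_odds)
    return a, b


def score_from_probability(p_low_risk, a, b):
    """Convert a low-risk probability into a scorecard score (higher = safer)."""
    odds = p_low_risk / (1.0 - p_low_risk)  # odds of low risk versus high risk
    return a + b * math.log(odds)


bin_woe = woe(good_in_bin=30, bad_in_bin=10, good_total=60, bad_total=40)
a, b = scaling_constants(pdo=20, base_score=600, base_odds=1.0)
even_score = score_from_probability(0.5, a, b)      # odds 1:1
doubled_score = score_from_probability(2 / 3, a, b)  # odds 2:1
```

With these constants a project at even odds scores 600 and one at 2:1 odds scores about 620, which is what PDO = 20 means in practice.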

5.3.3 Model Evaluation Indicators & Algorithm Performance Benchmark

The team adopts standard binary classification confusion matrix to measure core indicators: Precision, Recall, Overall Accuracy, AUC (Area Under ROC Curve).
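For reference, the three threshold-dependent indicators follow directly from the confusion matrix, as in the sketch below. The labels and predictions are toy data, and taking Label 1 (low risk) as the positive class is an assumption of this example. AUC is omitted because it requires ranking by predicted probability rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall and accuracy from binary labels and predictions."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true) if y_true else 0.0,
    }


# Toy data: 1 = low risk, 0 = high risk.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```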

All candidate algorithms are trained under identical business line labeled datasets for horizontal comparison:

Algorithm	Precision	Recall	Accuracy	AUC
KNN	90.00%	82.89%	79.27%	0.7686
Naive Bayes	82.16%	100.00%	82.90%	0.9278
SVM	97.52%	77.63%	80.83%	0.9020
Logistic Regression	91.46%	98.68%	91.71%	0.9437
CART Decision Tree	94.12%	94.74%	91.19%	0.9090

Benchmark result analysis:

Logistic Regression achieves the highest AUC and nearly full recall, matching the core demand of QA teams to avoid high-risk project leakage.
The model supports dynamic release threshold adjustment: negative risk factors (large code changes, high cyclomatic complexity, slow bug convergence) reduce low-risk probability prediction; positive verification factors (high incremental coverage, complete automation regression) offset risk scores automatically.
Scorecard models perform better when feature data is incomplete, but they require heavy manual binning work. Accuracy declines if the chosen bin boundaries break the monotonic relationship between a feature and risk, which remains an ongoing optimization direction.

5.4 5-Layer Production Deployment Architecture

The complete Quality Score Model online system is divided into five independent layers from front-end to back-end:

Visualization Platform (UI Layer)
- Unify multi-module project data display: code branches, commit records, bug logs, release status
- Centralize all testing capability entrances, recommend matching test tools according to project risk level
- Generate standardized risk score reports, interact with model strategy backends, and return human-audited release labels to the data warehouse for iterative training
Unified R&D Tool Chain Layer
- Source of all raw monitoring data: requirement management, code repository, CI/CD pipeline, test environment scheduling, case library, automation execution platform, impact analysis engine
Raw Data Collection Layer
- Extract basic indicators from tool chains and complete lightweight secondary calculation for composite features
Feature Engineering Layer
- Uniform preprocessing flow: outlier elimination, missing value filling, feature binning, WOE encoding to generate standardized training data
Model Training & Inference Service Layer
- Complete post-training verification through test sets before publishing models to the central strategy platform. The front-end queries real-time risk scores and targeted testing suggestions by transmitting project feature vectors. All human-verified release results are appended to training datasets for continuous model iteration.

6. Measurable Business Benefits After Online Deployment

The Quality Score Model was first piloted on Baidu commercial platforms, bringing double improvements in delivery efficiency and defect interception capacity with quantifiable data.

6.1 Higher Self-Service Testing Coverage & Delivery Efficiency

Before model deployment: Only 25% of small iterations passed manual evaluation and entered self-service release.

After algorithm risk scoring launch: Self-service release proportion rose to 60%.

With continuous feature iteration and model optimization, the team estimates an additional 10%–20% increase in self-service throughput, shortening average delivery cycles and lifting overall release volume.

6.2 Stronger Online Defect Interception Capability

Model automatic risk alarms guide teams to carry out targeted supplementary testing. During the pilot period, more than 40 latent defects that would have flowed online under manual review were intercepted in advance, realizing faster iteration without lowering quality safety standards.

7. Core Conclusion & Preconditions for Model Landing

All R&D teams pursuing faster delivery face a universal pain point: manual test grading and release sign-off rely on inconsistent personal experience, resulting in either defects escaping to production or redundant low-value regression testing.

The Quality Score Model solves this contradiction. It aggregates full-lifecycle R&D monitoring data to build refined project risk portraits, outputs objective algorithmic risk scores to realize automatic test classification and release gate control, and quantifies previously invisible software quality risks.

Three non-negotiable preconditions for successful enterprise-wide model promotion:

Fully standardized end-to-end R&D processes
Stable and reliable automated testing infrastructure
Unified testing platform supporting low-noise, full-coverage automatic data collection

Without the above basic construction, the model cannot obtain complete and accurate aggregated feature data to support reliable risk prediction.

The team welcomes industry QA and algorithm practitioners to exchange experience and co-research risk-driven intelligent testing solutions.

8. Follow-Up Serial Article Roadmap Recap

This paper is the first part of the Quality Score Model technical series. Subsequent articles will deeply unpack large-scale model deployment, risk prediction algorithms, feature mining frameworks and unattended delivery full-process solutions. Stay tuned for follow-up releases.
