Source: TesterHome
QA and R&D teams repeatedly run into unresolved quality and efficiency bottlenecks in day-to-day release testing. Four pain points are the most representative:
To address these challenges, the Intelligent Testing Team began developing the Quality Score Model with three core goals:
After more than a year of algorithm research, system development and repeated online verification, the team put the model into production on a unified quality and efficiency platform, which provides standardized data collection and automatic risk modeling. The solution was piloted and then fully deployed across multiple Baidu commercial business lines.
This article is the first chapter of a serialized technical series, focusing on intelligent test grading and release risk sign-off logic. Subsequent serial articles will detail unattended delivery and intelligent automation scheduling.
This paper elaborates the two core business scenarios supported by the Quality Score Model: test tier classification and release risk gate review.
The model replaces subjective manual experience judgment with objective full-link data and standardized algorithm risk scoring. It avoids two extreme losses caused by human evaluation: risk leakage due to over-optimism, and efficiency waste due to over-conservative testing. The model balances delivery speed and product quality, and achieves unified quality & efficiency targets for all R&D projects.
For all testing teams, the long-term core optimization target is to quickly deliver high-value business features without sacrificing online product quality. This target is also the core demand of Agile development and continuous delivery systems.
Under traditional R&D processes, product, front-end, back-end and QA staff will jointly confirm development cycles and test scope after requirement review.
For small iteration updates, most teams adopt lightweight release logic: developers finish self-testing and pass automated test suites, then directly launch to production. This manpower-saving mode is critical for fast iteration under high developer-to-QA staffing ratios.
With mature, product-level automated testing infrastructure widely deployed, testing work is largely covered by machines, and developers bear more pre-release verification responsibilities. Based on this trend, Baidu launched the Self-Service Testing system in 2019:
Self-Service Testing: A standardized R&D workflow where developers rely on internal unified testing services and stable automation assets to complete full quality verification. The Quality Score risk assessment engine or QA reviewer finally issues a pass/fail release judgment.
Large-scale promotion of self-service testing exposed two fatal defects brought by manual subjective evaluation:
Teams only rely on staff experience to decide whether a project can enter self-service release, which triggers two types of loss:
Manual judgment standards cannot be unified, especially for teams with a large number of new employees lacking cross-system business cognition.
A passing developer self-test does not by itself prove that quality verification was sufficient. Release approval currently depends entirely on QA manually analyzing test reports, which leads to two outcomes:
All test grading and release decisions depend on personal experience, business familiarity and individual risk tolerance of evaluators. Different people produce completely different judgments for similar projects, resulting in uneven quality control standards across the enterprise.
The team drew design inspiration from risk control logic in the financial industry to build a fully data-driven alternative solution.
Financial institutions aggregate multi-dimensional user consumption, transaction and payment data to generate objective credit scores, which reduce bad-debt risk while speeding up loan review. The team transplanted this mature logic to software release risk control:
The core binary question solved by the model: Can this project skip exclusive manual QA testing and complete release via self-service verification?
Enterprises generate massive structured R&D monitoring data in the digital transformation process. The solution collects full-lifecycle development data to construct fine-grained project risk portraits, trains machine learning scoring models, and outputs interpretable quantitative risk scores for automatic test grading and release gate control, fully visualizing invisible quality risks.
The Quality Score Model is a project-centered quantitative analysis framework that aggregates structured metadata from the requirement planning, coding, test submission and verification phases to generate a unified, standardized risk score. The score supports objective, automated decisions on test tier and release approval, and quantifies the hidden quality risk of each software iteration.
Stable deployment of the Quality Score Model relies on standardized end-to-end development processes, with two indispensable values:
Unified unique identifiers (requirement ID, commit ID, branch tag) connect scattered data from requirement management, task tracking, code submission, CI pipeline, test environment and automation platforms to corresponding projects. Each manual or automated test action can be traced back to the source requirement, supporting accurate calculation of test coverage corresponding to each code change.
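The linkage described above can be pictured as a grouping of records on the shared requirement ID. The following is a minimal sketch under stated assumptions: the record shapes, the field names (`requirement_id`, `commit_id`, `suite`) and the sample values are hypothetical and are not taken from the team's platform.

```python
from dataclasses import dataclass, field


@dataclass
class ProjectTrace:
    """All records linked to one requirement (hypothetical shape)."""
    requirement_id: str
    commits: list = field(default_factory=list)
    test_runs: list = field(default_factory=list)


def build_traces(commits, test_runs):
    """Group commit and test-run records by their requirement ID."""
    traces = {}
    for kind, records in (("commits", commits), ("test_runs", test_runs)):
        for rec in records:
            req = rec.get("requirement_id")
            if req is None:
                continue  # unlinked records cannot be attributed to a project
            trace = traces.setdefault(req, ProjectTrace(req))
            getattr(trace, kind).append(rec)
    return traces


# Hypothetical sample data: one commit lacks a requirement ID and is dropped.
sample_commits = [
    {"requirement_id": "REQ-1", "commit_id": "a1"},
    {"requirement_id": "REQ-1", "commit_id": "b2"},
    {"requirement_id": None, "commit_id": "c3"},
]
sample_test_runs = [
    {"requirement_id": "REQ-1", "suite": "regression", "passed": True},
]
sample_traces = build_traces(sample_commits, sample_test_runs)
```

The commit without a requirement ID is silently lost to the project view, which is the aggregation failure the next paragraph warns about.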
If the tool chain lacks unified association standards (e.g., one requirement ticket contains multiple unrelated feature developments, mixed code branches for different projects), cross-stage data cannot be aggregated by project dimension, and post-fault root cause analysis will fail.
Model pre-judgment acts as the primary risk filter, and manual review serves as the final safety barrier for edge cases. Manually calibrated release results are marked as training samples and fed back to the model for iterative retraining, forming a self-optimizing closed data loop to continuously improve prediction accuracy.
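As a rough illustration of this feedback loop, the sketch below turns manually reviewed release decisions into labeled samples for the next retraining round. The record fields (`features`, `model_verdict`, `manual_verdict`) are hypothetical names chosen for the example, not the platform's schema.

```python
def collect_training_samples(reviews):
    """Keep only manually calibrated decisions and use the human verdict as the label."""
    samples = []
    for review in reviews:
        label = review.get("manual_verdict")
        if label is None:
            continue  # not manually reviewed, so there is no calibrated label
        samples.append({
            "features": review["features"],
            "label": label,
            # Disagreements are the most informative cases for retraining.
            "model_disagreed": review["model_verdict"] != label,
        })
    return samples
```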
All objective risk scoring results are supported by multi-dimensional structured metadata. The team divides risk features into three mutually independent categories corresponding to development stages.
Metrics reflecting demand-side risk intensity:
Requirement scale, cross-team collaboration quantity, requirement modification frequency, PM seniority, historical delivery defect rate.
Risk increases with more collaborators, frequent scope changes and inexperienced product managers.
Metrics measuring code change risk:
Changed code lines, cyclomatic complexity of modified functions, call chain impact scope, affected internal/external APIs, third-party service dependencies, development cycle length, code review results, static analysis alarms, historical defect density of developers.
Risk increases with large code diffs, core API changes, third-party component access and junior developers.
Metrics reflecting risk elimination effectiveness:
Executed test case quantity, CI pass rate, incremental code coverage, bug convergence curve, defects per thousand lines of code (KLOC bug density).
Lower release risk corresponds to complete test coverage, stable CI execution, high incremental coverage and rapidly converging bug volume.
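Two of the test-stage metrics above reduce to simple ratios. The sketch below shows one plausible way to compute them; representing lines as `(file, line_number)` pairs and treating an empty change set as fully covered are assumptions made for the example.

```python
def incremental_coverage(changed_lines, covered_lines):
    """Share of changed lines that were executed by tests (0.0 to 1.0)."""
    changed = set(changed_lines)
    if not changed:
        return 1.0  # assumption: no changed lines means nothing is left uncovered
    return len(changed & set(covered_lines)) / len(changed)


def bugs_per_kloc(bug_count, changed_line_count):
    """Defects per thousand changed lines of code."""
    if changed_line_count <= 0:
        raise ValueError("changed_line_count must be positive")
    return bug_count * 1000 / changed_line_count
```

For example, if four lines changed and tests executed three of them, incremental coverage is 0.75, regardless of how much unchanged code the same tests also ran.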
All core feature data is pulled from the enterprise unified R&D tool chain through API subscription and event stream. Raw data is split into two types:
The platform collects all data transparently, with no manual operation required from QA or R&D teams. Manual data entry would add heavy operational cost and hinder company-wide rollout, while unstable automated pipelines with high false-alarm rates would produce noisy data and degrade the model's prediction precision.
Typical implementation case: Automatic Java incremental coverage collection based on JaCoCo
The model R&D process is divided into two evolutionary phases to solve the cold start and high-precision prediction demands respectively.
The first phase uses simple if-else threshold rules to perform an initial risk classification and to accumulate labeled historical data.
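A rule model of this kind might look like the sketch below. The metric names and every cut-off value are hypothetical and shown only to illustrate the if-else style; the article does not state which rules or thresholds the team used.

```python
from dataclasses import dataclass


@dataclass
class ProjectMetrics:
    changed_lines: int
    touches_core_api: bool
    incremental_coverage: float  # 0.0 to 1.0
    ci_pass_rate: float          # 0.0 to 1.0


# Hypothetical cut-offs for illustration; real values would be tuned per business line.
MAX_CHANGED_LINES = 200
MIN_INCREMENTAL_COVERAGE = 0.80
MIN_CI_PASS_RATE = 0.95


def rule_based_decision(m: ProjectMetrics) -> str:
    """Return 'self_service' or 'manual_qa' using fixed thresholds."""
    if m.touches_core_api:
        return "manual_qa"
    if m.changed_lines > MAX_CHANGED_LINES:
        return "manual_qa"
    if m.incremental_coverage < MIN_INCREMENTAL_COVERAGE:
        return "manual_qa"
    if m.ci_pass_rate < MIN_CI_PASS_RATE:
        return "manual_qa"
    return "self_service"
```

Each decision, together with its metrics, can be stored as a labeled record, which is how a rule phase builds up the history a later learned model needs.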
The team divides intelligent testing maturity into three stages:
Rule models only serve as the cold start foundation. After accumulating sufficient labeled samples, the team upgrades to machine learning classification models.
Judging whether a project supports self-service release is a standard binary classification task:
The team evaluated multiple mainstream classification algorithms: KNN, Naive Bayes, Decision Tree, Logistic Regression, SVM, ensemble models (Random Forest, XGBoost, AdaBoost).
Considering data volume, model interpretability and the imbalanced distribution of defect samples, the team selected Logistic Regression and the financial credit scorecard as the core prediction schemes for production, with Logistic Regression as the primary online model.
Logistic Regression is a linear binary classification algorithm. Linear regression outputs unbounded continuous values, whereas Logistic Regression passes the linear output through the sigmoid function, which compresses it into the probability interval (0, 1) so that a project can be assigned to one of two risk classes.
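The mapping from a linear score to a probability can be sketched in a few lines. The weights, bias and feature values are placeholders; in practice they come from training, and which class counts as the positive one is a modeling choice the article does not spell out.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def risk_probability(features, weights, bias):
    """Apply the sigmoid to the linear combination bias + sum(w_i * x_i)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)
```

A linear score of exactly zero maps to 0.5, and increasingly positive or negative scores approach 1 and 0 respectively without ever leaving that range.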
The sigmoid function maps the linear combination of multi-dimensional features into a 0–1 risk probability value. The system configures a threshold (default 0.5) for judgment:
In actual release scenarios, 100% full coverage testing is uneconomical and impractical. The Logistic Regression model dynamically matches differentiated verification standards according to project risk grades:
The solution borrows the widely used loan application scorecard (A-Score) in financial risk control, which aggregates multi-dimensional user attributes into a single readable integer score to quantify default probability. Transplanted calculation formula for software quality:
Final Project Quality Score = Base Score + Requirement Risk Score + Code Modification Score + Test Coverage Adjustment Score
When the total score exceeds the preset low-risk threshold, the project can be directly released. If not, teams must add manual verification, third-party joint testing or automated regression suites to improve coverage and raise the total quality score.
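The additive formula and the threshold check can be illustrated with a small worked example. All numbers here (base score, component points, threshold) are hypothetical; the sketch only assumes, as the text implies, that a higher total means lower risk.

```python
BASE_SCORE = 60          # hypothetical base score
LOW_RISK_THRESHOLD = 75  # hypothetical low-risk threshold


def quality_score(requirement_risk, code_modification, test_coverage_adjustment,
                  base=BASE_SCORE):
    """Final score = base + requirement + code modification + test coverage adjustment."""
    return base + requirement_risk + code_modification + test_coverage_adjustment


def can_release_directly(score, threshold=LOW_RISK_THRESHOLD):
    """Direct release only when the score strictly exceeds the threshold."""
    return score > threshold


# A risky change with thin testing: 60 - 5 - 10 + 20 = 65, below the threshold.
before = quality_score(-5, -10, 20)
# After adding a regression suite the coverage adjustment rises: 60 - 5 - 10 + 35 = 80.
after = quality_score(-5, -10, 35)
```

Because the score is a plain sum, a team can see which component is pulling the total down and how many points additional testing would need to contribute.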
Standard scorecard development flow:
The team evaluates candidates with standard binary-classification metrics: precision, recall and overall accuracy derived from the confusion matrix, plus AUC (area under the ROC curve).
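For readers less familiar with these metrics, the sketch below computes them from scratch. Labels are assumed to be 0 or 1, and AUC is computed with the pairwise-ranking definition (the fraction of positive/negative pairs ranked correctly, ties counted as half), which is equivalent to the area under the ROC curve; the sample data is made up.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_accuracy(tp, fp, fn, tn):
    """Derive the three confusion-matrix metrics; undefined ratios fall back to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Precision, recall and accuracy depend on the chosen decision threshold, while AUC summarizes ranking quality across all thresholds, which is why it is reported separately.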
All candidate algorithms are trained under identical business line labeled datasets for horizontal comparison:
| Algorithm | Precision | Recall | Accuracy | AUC |
| --- | --- | --- | --- | --- |
| KNN | 90.00% | 82.89% | 79.27% | 0.7686 |
| Naive Bayes | 82.16% | 100.00% | 82.90% | 0.9278 |
| SVM | 97.52% | 77.63% | 80.83% | 0.9020 |
| Logistic Regression | 91.46% | 98.68% | 91.71% | 0.9437 |
| CART Decision Tree | 94.12% | 94.74% | 91.19% | 0.9090 |
Benchmark result analysis:
The complete Quality Score Model online system is divided into five independent layers from front-end to back-end:
The Quality Score Model was first piloted on Baidu commercial platforms, bringing double improvements in delivery efficiency and defect interception capacity with quantifiable data.
Before model deployment: Only 25% of small iterations passed manual evaluation and entered self-service release.
After algorithm risk scoring launch: Self-service release proportion rose to 60%.
With continuous feature iteration and model optimization, the team estimates an additional 10%–20% increase in self-service throughput, shortening average delivery cycles and lifting overall release volume.
Model automatic risk alarms guide teams to carry out targeted supplementary testing. During the pilot period, more than 40 latent defects that would have flowed online under manual review were intercepted in advance, realizing faster iteration without lowering quality safety standards.
Every R&D team pursuing faster delivery faces the same pain point: manual test grading and release sign-off rely on inconsistent personal experience, which results either in defects escaping to production or in redundant, low-value regression testing.
The Quality Score Model solves this contradiction. It aggregates full-lifecycle R&D monitoring data to build refined project risk portraits, outputs objective algorithmic risk scores to realize automatic test classification and release gate control, and quantifies previously invisible software quality risks.
Three non-negotiable preconditions for successful enterprise-wide model promotion:
Without the above basic construction, the model cannot obtain complete and accurate aggregated feature data to support reliable risk prediction.
The team welcomes industry QA and algorithm practitioners to exchange experience and co-research risk-driven intelligent testing solutions.
This paper is the first part of the Quality Score Model technical series. Subsequent articles will deeply unpack large-scale model deployment, risk prediction algorithms, feature mining frameworks and unattended delivery full-process solutions. Stay tuned for follow-up releases.