Source: TesterHome Community
As DevOps practices spread, shift-left testing and developers owning quality have taken hold in many engineering teams. This article walks through the DevOps journey of a real-world project — LogReplay — to show how improving testability, embracing automated testing, and building a solid CI/CD pipeline lead to high-quality, continuous, and fully automated deployment of backend microservices.
Shift-left testing is a key part of developers taking genuine ownership of quality under DevOps. One effective tactic is to run meaningful automated tests early and often throughout development, so that issues are caught and feedback arrives as soon as possible. Section 2 covers this in detail.
Software testability is the foundation of high-quality, high-efficiency delivery. Poor testability raises testing costs, makes results harder to verify, and discourages developers from testing — or pushes testing later in the cycle. Improving testability must come before any serious automation effort. See Section 1.
With thorough automated testing, slow and error-prone manual validation becomes unnecessary. By plugging tests directly into a CI/CD pipeline, teams trigger builds and tests immediately after code commits, promote artifacts through environments only when all tests pass, and release to production automatically. Section 3 covers CI/CD.
Testability measures how easy it is to test a software system. Poor testability drives up costs, makes results hard to verify, and leads engineers to skip testing or push it later in the cycle.
At the API test level:
At the code-under-test level:
Observability means how easily a program’s behavior, inputs, and outputs can be observed — how easily external systems obtain important state and information.
Every operation or input should produce a clear, predictable response or output. That output must be both visible and queryable. Output that is invisible or cannot be queried is effectively undiscoverable, which harms observability and therefore testability.
Visibility starts with output. Improve observability by emitting more — structured event logs, distributed tracing information, aggregated metrics. Provide testability interfaces to expose internal state and report system self-checks. When something goes wrong, output should be easy to recognize through automated log analysis or UI highlighting.
In our project, we focused on:
1) Converging API return status codes
More downstream dependencies mean more potential failure points. Direct dependencies add failure points linearly. Indirect dependencies multiply them. Passing every downstream error verbatim to the client is impractical — clients rarely understand all errors or know how to react differently. Status codes must be converged.
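As a rough illustration of convergence, the sketch below collapses many downstream codes into a handful of client-facing ones. The constant names and all code values except the timeout code 111 are hypothetical and only show the shape of such a mapping, not the mapping used in LogReplay.

```go
package main

import "fmt"

// Client-facing codes. Names and values are hypothetical.
const (
	CodeOK              = 0
	CodeInvalidRequest  = 10001
	CodeRetryLater      = 10002
	CodeInternalFailure = 10003
)

// convergeCode maps a downstream error code to one of a few client-facing codes.
// The downstream values 20001, 20002 and 30007 are made up for illustration.
func convergeCode(downstream int) int {
	switch downstream {
	case 0:
		return CodeOK
	case 20001, 20002: // parameter validation failures in two dependencies
		return CodeInvalidRequest
	case 111, 30007: // timeout or throttling: the client may retry
		return CodeRetryLater
	default: // anything unknown collapses into one generic failure
		return CodeInternalFailure
	}
}

func main() {
	for _, c := range []int{0, 20002, 111, 45678} {
		fmt.Printf("downstream %d -> client %d\n", c, convergeCode(c))
	}
}
```

The client only has to handle four outcomes, while the precise downstream code can still be written to logs or traces for diagnosis.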
2) Always propagating failures upstream
The upstream caller does not need to know the exact failure point, and the information returned end to end may lack precision anyway. But it must be told that a failure occurred. Swallowing failures internally leaves callers unsure whether the request succeeded or what action to take.
In tRPC services, an error consists of a code and a msg string. Use the framework’s errs.New to return both. If a downstream service returns an error without errs.New, the upstream receives code 999.
func (s *helloServerImpl) SayHello(ctx context.Context, req *pb.HelloRequest, rsp *pb.HelloReply) error {
	if failed { // business logic fails
		// yourIntCode is a placeholder for your own integer business error code
		return errs.New(yourIntCode, "your business error message")
	}
	return nil // success
}
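To make the "code 999" behavior concrete, here is a standard-library model of a code-carrying error. It is not the tRPC implementation: `codedError`, `codeOf`, and the two helper functions are hypothetical names, and only the fallback to 999 for plain errors mirrors what the text describes.

```go
package main

import (
	"errors"
	"fmt"
)

// codedError stands in for a framework error that carries a code and a message.
type codedError struct {
	Code int
	Msg  string
}

func (e *codedError) Error() string { return fmt.Sprintf("code=%d msg=%s", e.Code, e.Msg) }

// codeUnknown mirrors the generic code 999 described above.
const codeUnknown = 999

// codeOf returns 0 for nil, the carried code for coded errors, and 999 otherwise.
func codeOf(err error) int {
	if err == nil {
		return 0
	}
	var ce *codedError
	if errors.As(err, &ce) {
		return ce.Code
	}
	return codeUnknown
}

// queryDownstream is a hypothetical downstream call that fails with a business code.
func queryDownstream() error { return &codedError{Code: 10042, Msg: "record not found"} }

// plainFailure returns an error without any code attached.
func plainFailure() error { return errors.New("plain error") }

// handle propagates the downstream failure instead of swallowing it.
func handle() error {
	if err := queryDownstream(); err != nil {
		return fmt.Errorf("handle: %w", err) // wrapping keeps the code reachable
	}
	return nil
}

func main() {
	fmt.Println(codeOf(handle()))       // 10042
	fmt.Println(codeOf(plainFailure())) // 999
	fmt.Println(codeOf(nil))            // 0
}
```

The point of the sketch is that the caller always learns a failure happened, and only failures created without a code degrade to the generic value.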
3) Integrating distributed log collection
Finding the exact failure point requires logs. Failure points can be recorded through logging, through distinct error messages under the same error code, or through distributed tracing. Distributed log collection retains as much diagnostic information as possible. For example, configure tRPC services to report logs to a centralized log system that can be searched through a UI such as Kibana.
4) Integrating a distributed tracing system
Status codes and messages are client-oriented. They may lack precision for failure location. Distributed tracing is immensely valuable. Any modern backend system should implement OpenTelemetry — its universal protocol ensures wide tool support. Every serious developer should understand tracing. When debugging a tough production issue, you will appreciate it.
After integrating tracing with an OpenTelemetry backend, print the Trace ID to test logs during API and end-to-end tests. When a test fails, use that Trace ID to pinpoint the root cause quickly.
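One way to get a Trace ID into test logs, assuming the service propagates the W3C Trace Context `traceparent` header and the test can read it from the response, is to parse the header value. The helper name below is hypothetical; only the header layout (`version-traceid-parentid-flags`) comes from the W3C specification.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the trace ID from a W3C traceparent value
// of the form "version-traceid-parentid-flags". It checks that the ID is
// 32 hex characters and not all zeros.
func traceIDFromTraceparent(header string) (string, bool) {
	parts := strings.Split(header, "-")
	if len(parts) < 4 {
		return "", false
	}
	id := parts[1]
	if len(id) != 32 || id == strings.Repeat("0", 32) {
		return "", false
	}
	if _, err := hex.DecodeString(id); err != nil {
		return "", false
	}
	return id, true
}

func main() {
	// Sample value in the format shown in the W3C Trace Context specification.
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	if id, ok := traceIDFromTraceparent(h); ok {
		fmt.Println("trace_id=" + id)
	}
}
```

Inside a Go test, the same value could be written with `t.Logf` so that it appears next to the failing assertion in the report.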
Understandability means how easily information about the system-under-test can be obtained, how complete that information is, and how easy it is to comprehend. For example, does the system have documentation, and is that documentation readable and up-to-date?
Key aspects include:
Our practical experience in this area remains limited.
Controllability means how easy it is to control a program’s behavior, inputs, and outputs — whether the system-under-test can be set to a desired state for testing. Highly controllable systems are easier to test and automate.
Controllability includes:
To improve middleware isolation and test data construction, we implemented:
1) Using naming services for addressing
In a microservice architecture, fixed ip:port addressing for middleware is inflexible — it cannot handle scaling or cluster management. Use naming services with uniform addressing via namespace + env, eliminating per-environment ip:port configuration.
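The idea of addressing by namespace + env can be sketched with an in-memory stand-in for a naming service. Everything here (the `target` type, the `registry` map, the service name, and the addresses) is hypothetical; a real naming service would also handle health checks and load balancing.

```go
package main

import (
	"errors"
	"fmt"
)

// target identifies a middleware instance by logical name, not by ip:port.
type target struct {
	Namespace string // e.g. "Production" or "Development"
	Env       string // environment name within the namespace
	Service   string // logical service name
}

// registry is an in-memory stand-in for a naming service.
type registry map[target][]string

var errNotFound = errors.New("no instance registered")

// resolve returns the first registered address for the target.
func (r registry) resolve(t target) (string, error) {
	addrs := r[t]
	if len(addrs) == 0 {
		return "", errNotFound
	}
	return addrs[0], nil
}

// demoRegistry returns sample registrations with made-up addresses.
func demoRegistry() registry {
	return registry{
		{Namespace: "Development", Env: "test", Service: "logreplay.redis"}: {"10.0.0.12:6379"},
		{Namespace: "Production", Env: "formal", Service: "logreplay.redis"}: {"10.1.0.7:6379"},
	}
}

func main() {
	r := demoRegistry()
	addr, err := r.resolve(target{Namespace: "Development", Env: "test", Service: "logreplay.redis"})
	fmt.Println(addr, err)
}
```

Callers keep the same logical service name in every environment; only the namespace and env change, so no per-environment ip:port list has to be maintained in configuration.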
2) Standardizing access clients
Use a consistent internal middleware client module (e.g., trpc-database). Benefits include covering most middleware types, reducing bugs from feature/usage variations across community implementations, providing built-in observability (monitoring, tracing), and allowing filters for flexible traffic manipulation like route modification.
3) Physically isolating middleware instances between production and test environments
Strictly separate middleware used in production (Production), baseline development (Development), and automated testing environments. Physical isolation is the only reliable way to prevent test behavior from affecting production.
We have further work to do on controllability and will share more as we gain experience.
In a microservice architecture, testing typically has three levels: unit tests, API tests, and end-to-end (E2E) tests.
Ease of implementation increases from E2E down to unit tests, but effectiveness decreases. E2E tests are the most expensive but provide the highest confidence when they pass. Unit tests are the easiest and fastest to write but cannot guarantee that the whole system works correctly.
No silver bullet exists. All three levels must be combined.
The real question: when should each type be written, and how many?
Our practice suggests:
We write unit tests by hand and also use tools such as TestOne to auto-generate unit test cases. Manual methods are well covered elsewhere. We follow five principles from the PCG Testability certification: focus on behavior, explicit dependencies, encapsulation, single responsibility, and readability.
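Two of these principles, explicit dependencies and a focus on behavior, can be illustrated with a small hand-written sketch. The `UserStore` interface, `AddUser` function, and `fakeStore` type are hypothetical and are not taken from LogReplay.

```go
package main

import (
	"errors"
	"fmt"
)

// UserStore is the explicit dependency of AddUser.
type UserStore interface {
	Exists(name string) (bool, error)
	Insert(name string) error
}

var (
	ErrDuplicate = errors.New("user already exists")
	ErrEmptyName = errors.New("name must not be empty")
)

// AddUser validates the name and inserts the user unless it already exists.
func AddUser(s UserStore, name string) error {
	if name == "" {
		return ErrEmptyName
	}
	exists, err := s.Exists(name)
	if err != nil {
		return err
	}
	if exists {
		return ErrDuplicate
	}
	return s.Insert(name)
}

// fakeStore is an in-memory UserStore used only by tests.
type fakeStore struct{ users map[string]bool }

func (f *fakeStore) Exists(name string) (bool, error) { return f.users[name], nil }
func (f *fakeStore) Insert(name string) error         { f.users[name] = true; return nil }

func main() {
	// Table-driven cases: each one states an input and the expected behavior.
	cases := []struct {
		name  string
		input string
		want  error
	}{
		{"new user", "bob", nil},
		{"duplicate user", "alice", ErrDuplicate},
		{"empty name", "", ErrEmptyName},
	}
	for _, c := range cases {
		store := &fakeStore{users: map[string]bool{"alice": true}}
		got := AddUser(store, c.input)
		fmt.Printf("%s: ok=%v\n", c.name, errors.Is(got, c.want))
	}
}
```

Because the store is passed in rather than created inside `AddUser`, each case can set up exactly the state it needs, and the assertions describe observable behavior rather than internal calls. In a real project the same table would live in a `_test.go` file and run under `go test`.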
For legacy codebases with few or no unit tests — code lacking regression safety nets when logic changes — we use tools like TestOne to improve unit test efficiency, quality, and automation coverage.
1) New code scenarios
For incremental new code, scaffolding tools generate unit test templates. Compared to basic generators like gotests, these provide dependency analysis, call chain analysis, mock generation, and pointer type assertion analysis. This simplifies test data, improves test effectiveness and readability, and boosts overall efficiency and quality.
Example: For business code adding a user, the generated scaffolding expands test data fields one by one for manual filling, analyzes dependencies and prompts assertions for input parameters that are written to, auto-generates mock frameworks for detected tRPC calls, and adds //FIXME comments to remind developers to verify test logic.
2) Legacy code scenarios
For legacy codebases where unit tests are scarce, auto-generation quickly builds a quality safety net. This provides basic protection when code changes later. LogReplay’s unit tests now cover most lines of code and run both locally and in CI pipelines daily.
Key lessons from our practice:
Getting started
Here is a simple API test example in Go using TestOne SDK to bridge internal network restrictions:
func TestDemo(t *testing.T) {
	var opts []client.Option // client options omitted
	request := &pb.HelloRequest{Msg: "my test message"}
	rsp, err := pb.NewHelloClientProxy().SayHello(context.Background(), request, opts...)
	if !assert.NoError(t, err) {
		return // rsp may be nil when the call fails
	}
	assert.NotEmpty(t, rsp.Msg)
}
Using mocks for stability
When running API tests in MR stages — where runs are frequent and failures highly visible — and when dependencies are under development or unstable, we encountered problems:
Solutions: improve test case quality, use sandboxed test environments (e.g., TestOne Sandbox), leverage mocking capabilities from the TestOne API Test SDK, and apply middleware governance.
Mocking an HTTP downstream:
m := mock.NewHTTP("hello.world.com", env)
err := m.URI("/path/hello").
	Rule(mock.Any()).
	Return(`{"status": "ok", "token": 1, "value": "2"}`)
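For readers without access to the TestOne SDK, a comparable effect can be approximated with Go's standard `net/http/httptest` package. This is a swapped-in technique, not the SDK's mechanism: it starts a local server, and the code under test must be pointed at its URL. The function names are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newHelloMock starts a local HTTP server that answers /path/hello with a fixed body.
func newHelloMock() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/path/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, `{"status": "ok", "token": 1, "value": "2"}`)
	})
	return httptest.NewServer(mux)
}

// fetch performs a GET request and returns the response body as a string.
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	srv := newHelloMock()
	defer srv.Close()

	body, err := fetch(srv.URL + "/path/hello")
	fmt.Println(body, err)
}
```

The mock returns the same payload on every call, so the test no longer depends on whether the real downstream is deployed or stable.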
Mocking a tRPC downstream: Configure mock rules so the downstream service interface always returns needed data, avoiding issues from unready or changing dependencies.
Mocking middleware (e.g., MySQL): When the test environment’s MySQL is unstable, data is frequently modified, or specific data (like a large count) is hard to trigger, mock the middleware (e.g., making count(*) return 9).
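When no mocking SDK is available, a similar result can be had by hiding the query behind a small interface and substituting a test double. The sketch below assumes hypothetical names (`RowCounter`, `fixedCounter`, `needsArchiving`, and the table name) and does not talk to MySQL at all.

```go
package main

import "fmt"

// RowCounter abstracts the single query the code under test needs.
type RowCounter interface {
	Count(table string) (int, error)
}

// fixedCounter is a test double that always reports the same count.
type fixedCounter struct{ n int }

func (f fixedCounter) Count(string) (int, error) { return f.n, nil }

// needsArchiving is hypothetical business logic: archive once a table
// holds more than limit rows.
func needsArchiving(c RowCounter, table string, limit int) (bool, error) {
	n, err := c.Count(table)
	if err != nil {
		return false, err
	}
	return n > limit, nil
}

func main() {
	// The double makes the count 9 regardless of what the real database holds.
	ok, err := needsArchiving(fixedCounter{n: 9}, "replay_log", 5)
	fmt.Println(ok, err) // true <nil>
}
```

A large or unusual count becomes trivial to trigger, and the test result no longer changes when someone else modifies shared test data.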
Sandboxed environments dramatically improve stability for high-frequency MR runs. Mocking solves the dependency-not-ready problem and enables earlier test writing.
Improving efficiency with auto-generation
Using API coverage to set strategy
Use API coverage metrics to set goals: prioritize high-call-volume interfaces, use traffic-to-case tools, and mock downstream dependencies for stability. Results: high API coverage, over half of cases using mocks or sandbox environments, significantly better stability for cases with mocks.
Writing E2E tests is similar to API tests with differences:
Challenges faced:
Solutions:
The bottom line: Do not write too many E2E tests. Cover only the most critical core scenarios. Replace everything else with simpler, more maintainable API tests. After adopting this principle, our E2E tests remain highly stable while covering most core scenarios.
All test types run directly with go test.
For API testing, a CLI automatically creates a stable sandbox environment, runs tests, destroys the environment, and generates a report. Define a TESTPLAN file (suite name, case path, plan details like type, sandbox config, app info, build method), then run:
guitar test -p //TESTPLAN -n api_test
An IDE plugin lets you run tests directly while writing code, without typing commands, and displays the test report automatically after execution.
When a test fails, first check logs. If the error originated downstream, use distributed tracing to find the last service returning an error. For frequent errors over time, aggregate error codes. For failures after refactoring, use request/response diffing.
Test execution logs show three error types:
In tRPC, business errors typically use codes > 10000. Framework errors use 1–200 and 999.
| Error Code | Meaning | Common Cause |
| --- | --- | --- |
| 141 | tcp client transport ReadFrame... | Protocol mismatch: client using tRPC to talk to an HTTP endpoint |
| 111 | service timeout | Service timeout, client timeout, or upstream context exhausted timeout |
| 999 | Generic error | Downstream returned errors.New(msg) without status code instead of errs.New(code, msg) |
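The code ranges described above can be turned into a small triage helper for test logs. This is a sketch based only on the ranges stated in this article (framework: 1 to 200 and 999; business: above 10000); the function name, the labels, and the treatment of 0 as success are assumptions.

```go
package main

import "fmt"

// classify labels an error code using the ranges described in the article.
func classify(code int) string {
	switch {
	case code == 0: // assumption: 0 means success
		return "success"
	case (code >= 1 && code <= 200) || code == 999:
		return "framework"
	case code > 10000:
		return "business"
	default:
		return "unclassified"
	}
}

func main() {
	for _, c := range []int{0, 111, 141, 999, 10042, 5000} {
		fmt.Printf("%d: %s\n", c, classify(c))
	}
}
```

A framework code usually points at the environment, protocol, or timeout settings, while a business code points at the service's own logic, so this split is a useful first step before reading the full log.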
With tracing integrated, the Trace ID prints to test logs. On failure, find the Trace ID in the report, click to jump to the tracing UI, and quickly locate the cause — for example, the last service returning an error, often an environment issue or version mismatch.
For frequent errors over a period, aggregate downstream errors by upstream calling interface to identify recurring downstream problems.
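A minimal sketch of such an aggregation follows. The `callError` record, its field names, and the sample interface and service names are hypothetical; in practice the records would come from logs or the test platform.

```go
package main

import (
	"fmt"
	"sort"
)

// callError is one failed downstream call observed during a test run.
type callError struct {
	UpstreamInterface string
	DownstreamService string
	Code              int
}

// bucket is one aggregated group and how often it occurred.
type bucket struct {
	Key   string
	Count int
}

// aggregate counts errors per upstream interface, downstream service and code,
// and returns the groups sorted by descending count, then by key.
func aggregate(errs []callError) []bucket {
	counts := map[string]int{}
	for _, e := range errs {
		key := fmt.Sprintf("%s -> %s code=%d", e.UpstreamInterface, e.DownstreamService, e.Code)
		counts[key]++
	}
	out := make([]bucket, 0, len(counts))
	for k, n := range counts {
		out = append(out, bucket{Key: k, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Key < out[j].Key
	})
	return out
}

// sampleErrors returns made-up records for demonstration.
func sampleErrors() []callError {
	return []callError{
		{"Replay.Start", "storage", 111},
		{"Replay.Stop", "auth", 141},
		{"Replay.Start", "storage", 111},
	}
}

func main() {
	for _, b := range aggregate(sampleErrors()) {
		fmt.Printf("%d x %s\n", b.Count, b.Key)
	}
}
```

Sorting by count puts the most frequent downstream problem at the top, which is usually the one worth fixing first.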
When a test passes before a refactor but fails after, use a diff tool to compare protocol requests/responses field by field across two runs. This often reveals subtle changes like an extra comma in a returned message.
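The field-by-field comparison can be approximated with the standard library when responses are JSON. The sketch below compares top-level fields only and uses a hypothetical function name; a real diff tool would recurse into nested objects and handle other protocols.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
	"sort"
)

// diffJSON compares two JSON objects and returns the sorted names of
// top-level fields that were added, removed, or changed.
func diffJSON(before, after []byte) ([]string, error) {
	var a, b map[string]interface{}
	if err := json.Unmarshal(before, &a); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(after, &b); err != nil {
		return nil, err
	}
	var changed []string
	for k, va := range a {
		vb, ok := b[k]
		if !ok || !reflect.DeepEqual(va, vb) {
			changed = append(changed, k) // removed or changed
		}
	}
	for k := range b {
		if _, ok := a[k]; !ok {
			changed = append(changed, k) // added
		}
	}
	sort.Strings(changed)
	return changed, nil
}

func main() {
	before := []byte(`{"status": "ok", "msg": "done", "token": 1}`)
	after := []byte(`{"status": "ok", "msg": "done,", "token": 1}`)
	fields, err := diffJSON(before, after)
	fmt.Println(fields, err) // [msg] <nil>
}
```

Here the only difference is the extra comma inside `msg`, which is easy to miss by eye but shows up immediately in the diff.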
Despite high coverage, some logic bugs still escaped — even when covered by automation. A review revealed:
Solutions:
Test code needs review that is as rigorous as the review of production code. Require CR approval before merging. Review rules include:
Review production defects and on-call tickets. Ask why detection did not happen earlier and why automated tests did not catch the issue. Then supplement or update test cases accordingly.
Use tools that detect ineffective tests upfront:
Run static scans in MR pipelines for quick feedback on incremental changes. Run scheduled dynamic injection for continuous improvement.
Work with your test platform to provide execution statistics: rates, counts, failure distribution. Review data regularly and optimize.
Unstable microservices cause random test failures that block CI/CD.
Steps taken:
We continuously optimize based on monitoring, and have achieved and sustained >99.99% stability.
Unit test stability:
API test stability:
Handling flaky tests (E2E/API)
Flaky tests — sometimes passing, sometimes failing for the same code — destroy confidence. Use a flakiness mitigation scheme (e.g., TestOne Flakiness): monitor each test’s reliability score. If below a threshold, automatically remove the test from the critical path (stop running it or stop treating its result as a gate). This boosted critical-path E2E test stability to over 99%.
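The gating decision can be sketched as follows. How TestOne Flakiness computes its reliability score is not described here, so this example assumes the simplest possible definition: the pass rate over recent runs against unchanged code. The function names and the 0.95 threshold are hypothetical.

```go
package main

import "fmt"

// reliability returns the fraction of passing runs.
// Assumption: with no history, the test is treated as reliable.
func reliability(results []bool) float64 {
	if len(results) == 0 {
		return 1
	}
	passed := 0
	for _, ok := range results {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(results))
}

// staysOnCriticalPath reports whether the test result should keep gating the pipeline.
func staysOnCriticalPath(results []bool, threshold float64) bool {
	return reliability(results) >= threshold
}

func main() {
	stable := []bool{true, true, true, true, true, true, true, true, true, true}
	flaky := []bool{true, false, true, true, true, true, true, true, true, true}

	fmt.Println(staysOnCriticalPath(stable, 0.95)) // true
	fmt.Println(staysOnCriticalPath(flaky, 0.95))  // false
}
```

A test removed from the critical path this way should still be tracked and repaired; the gate only stops it from blocking unrelated changes in the meantime.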
Standardize environments:
Define strict entry and exit criteria for promoting changes between environments.
| Environment | Entry Criteria | Exit Criteria |
| --- | --- | --- |
| Sandbox | Build succeeds, 100% unit tests pass | 100% API tests pass |
| Test | Code merged to trunk, 100% API tests pass | 100% API tests pass (regression) |
| Staging | 100% integration/E2E tests pass, on-call integrated | 100% integration/E2E tests pass, sufficient duration/traffic (e.g., 6h/100 accesses), no on-call tickets |
| Canary | Performance tests pass, on-call integrated | 100% integration/E2E tests pass, sufficient duration/traffic (e.g., 6h/100 accesses), no on-call tickets |
Following these criteria and sequential promotion (Test → Staging → Canary → Prod) keeps environments synchronized and prevents inconsistencies.
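The sequential-promotion rule can be expressed as a small check. This sketch covers only the ordering and a single "exit criteria met" flag; evaluating the criteria themselves is left out, and the function and variable names are hypothetical.

```go
package main

import "fmt"

// promotionOrder lists environments in the order changes move through them.
var promotionOrder = []string{"Test", "Staging", "Canary", "Prod"}

// canPromote allows a move only to the immediately following environment,
// and only when the exit criteria of the current one are met.
func canPromote(from, to string, exitCriteriaMet bool) bool {
	if !exitCriteriaMet {
		return false
	}
	for i := 0; i+1 < len(promotionOrder); i++ {
		if promotionOrder[i] == from {
			return promotionOrder[i+1] == to
		}
	}
	return false
}

func main() {
	fmt.Println(canPromote("Test", "Staging", true))    // true
	fmt.Println(canPromote("Test", "Canary", true))     // false: skips Staging
	fmt.Println(canPromote("Staging", "Canary", false)) // false: criteria not met
}
```

Rejecting skipped steps is what keeps every environment on a version that has already passed through the previous one.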
CI continuously merges code to trunk and uses builds plus automated tests to enforce quality.
Pipelines use a consistent CLI tool (e.g., TestOne Guitar), keeping configuration minimal — specify the testplan file after checkout. Our CI process is stable.
CD extends CI, continuously and automatically deploying microservices to test and production without manual intervention.
Grayscale strategy for Production:
| Node Count | Deployment Progression |
| --- | --- |
| < 10 nodes | 1-2 → 3-5 → 6-9 nodes |
| ≥ 10 nodes | 10% → 30% → 60% → remaining nodes |
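The progression in the table can be computed mechanically. The sketch below is one reading of it, not the team's actual rollout logic: for fewer than 10 nodes it treats 2, 5, and 9 as cumulative upper bounds capped at the node count, and for 10 or more nodes it rounds each percentage up. The function names are hypothetical.

```go
package main

import "fmt"

// ceilPct returns pct percent of n, rounded up.
func ceilPct(n, pct int) int { return (n*pct + 99) / 100 }

// grayscaleStages returns the cumulative number of nodes updated after each stage.
func grayscaleStages(n int) []int {
	if n <= 0 {
		return nil
	}
	var targets []int
	if n < 10 {
		targets = []int{2, 5, 9} // upper bounds of 1-2, 3-5, 6-9
	} else {
		targets = []int{ceilPct(n, 10), ceilPct(n, 30), ceilPct(n, 60), n}
	}
	var stages []int
	for _, t := range targets {
		if t > n {
			t = n // never target more nodes than exist
		}
		// Skip stages that would not update any additional node.
		if len(stages) == 0 || t > stages[len(stages)-1] {
			stages = append(stages, t)
		}
	}
	return stages
}

func main() {
	fmt.Println(grayscaleStages(4))  // [2 4]
	fmt.Println(grayscaleStages(9))  // [2 5 9]
	fmt.Println(grayscaleStages(20)) // [2 6 12 20]
}
```

Each stage would be followed by the monitoring and targeted tests described in this section before the next stage starts.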
Monitoring during grayscale:
Targeted testing: Run API tests safe for production data on grayscale nodes to verify service works correctly with production configurations and data.
Grayscale outcome:
Our CD process is stable. Past rollbacks were caused by deployment order issues (e.g., service A deploying before its dependency service B) or configuration changes requiring new production data.
With the LogReplay project, we have largely achieved continuous deployment for microservice code changes. After a code MR merges to trunk, the process runs fully automatically — extensive automated tests, a robust CI/CD pipeline, and auto-rollback when issues occur.
Work remains. While code changes are fully automated, configuration and database changes still require manual steps. We plan to explore continuous, automated deployment for those as well.
Different businesses and scenarios have different needs. Our practices may not apply universally. But the shared goal — higher quality and faster delivery — is universal, and both depend heavily on automation. We hope more teams explore, practice, and share their experiences with backend automated testing and continuous deployment.
Testing tools used — Most tools mentioned are proprietary internal Tencent products (e.g., TestOne: one-stop testing platform).
| Term | Definition |
| --- | --- |
| CI | Continuous Integration |
| CD | Continuous Deployment |
| Mock server | A service that implements mocking behavior for other services |
| Sandbox / Test / Staging / Canary / Production environments | Isolated, baseline, pre-release, canary, and production environments |
| Flaky test | A test with non-deterministic outcomes: for the same code, it sometimes passes and sometimes fails |