
CAP & BASE Theory: Distributed System High Availability & Chaos Engineering

Learn the CAP and BASE theories for distributed systems, including Consistency, Availability, Partition Tolerance, and practical chaos engineering testing strategies for Kubernetes and MySQL architectures.
 
Source: TesterHome Community
 

 

Introduction

I recently finished work on a Kubernetes (K8s) monitoring platform and have since shifted focus to building a chaos engineering platform and organizing associated testing projects. Coincidentally, I am currently writing a book that includes a chapter on chaos engineering. To consolidate my knowledge and keep key concepts clear, I have compiled and documented these core theories in this post.

 

Why Explore the CAP Theorem

This post covers the CAP theorem, alongside practical high-availability design patterns widely used in Kubernetes — content I will also detail in my upcoming book.

I have always adhered to one core principle: effective testing starts with thorough understanding of the system under test. Just as you must master a product’s features before validating its functionality, you need to fully grasp a system’s high-availability architecture when executing high-availability testing tasks. This foundational knowledge allows you to align testing goals accurately and design targeted test scenarios.

 

What Is the CAP Theorem

Many new technical terms in the internet industry can be confusing, and I even spent time deciding whether to title this content around chaos engineering or high-availability testing. Rather than copying generic online definitions, I will explain the CAP theorem based on practical experience.

CAP is an acronym for Consistency, Availability and Partition Tolerance. It defines the three core capabilities that underpin high-availability design for all distributed systems. A fundamental rule applies here: these three capabilities cannot all be fully satisfied at the same time. When a network partition occurs, the system must compromise on either consistency or availability.

Consistency

In this context, consistency refers strictly to data consistency. Distributed systems typically replicate data across multiple servers. Data consistency means every read request returns the most recently committed data, no matter which replica serves it. Note that uncommitted data within distributed transactions is not recognized as valid latest data.

Data inconsistency most commonly appears in MySQL master-slave architectures. All write requests are directed to the master node, while read requests can be processed by either master or slave nodes. After new data is written to the master node, an automated mechanism synchronizes the data to all slave nodes.

This synchronization process always carries latency. Under normal network conditions, the delay is only a few milliseconds. However, if network failures break communication between the master and slave nodes, the gap between datasets will continue to grow and cause severe inconsistency.
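To make the growing gap concrete, here is a minimal sketch in plain Python. It does not talk to a real MySQL server: `MasterSlaveCluster`, `replicate` and `lag` are hypothetical names I made up for illustration, and replication is modeled as a simple queue of pending writes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy database node that keeps key/value rows in memory."""
    name: str
    rows: dict = field(default_factory=dict)


class MasterSlaveCluster:
    """Hypothetical model: writes hit the master, then replicate asynchronously."""

    def __init__(self):
        self.master = Node("master")
        self.slave = Node("slave")
        self.pending = []          # writes not yet applied on the slave
        self.partitioned = False   # True while master and slave cannot talk

    def write(self, key, value):
        self.master.rows[key] = value
        self.pending.append((key, value))

    def replicate(self):
        """Apply queued writes to the slave unless the link is down."""
        if self.partitioned:
            return
        for key, value in self.pending:
            self.slave.rows[key] = value
        self.pending.clear()

    def lag(self):
        """Number of writes the slave has not seen yet."""
        return len(self.pending)


cluster = MasterSlaveCluster()
cluster.write("stock:42", 10)
cluster.replicate()                 # healthy network: the slave catches up
healthy_lag = cluster.lag()

cluster.partitioned = True          # inject the network failure
for qty in (9, 8, 7):
    cluster.write("stock:42", qty)
    cluster.replicate()             # no effect while partitioned
partition_lag = cluster.lag()

print(healthy_lag, partition_lag)                                   # 0 3
print(cluster.master.rows["stock:42"], cluster.slave.rows["stock:42"])  # 7 10
```

The backlog keeps growing for as long as the partition lasts, which is the widening dataset gap described above.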

Availability

Availability means healthy nodes can return valid responses within a reasonable timeframe when other nodes fail. Failed requests, error messages and timeouts are classified as invalid responses.

In plain terms, the system continues delivering services even when some nodes go offline. This logic guides standard service deployment strategies: when individual service instances malfunction, traffic is rerouted to healthy instances to maintain uninterrupted service for end users.
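The rerouting idea can be sketched in a few lines. This is only an illustration under my own assumptions: `Instance` and `route` are hypothetical names, and a real deployment would normally rely on a load balancer or on Kubernetes Service health checks rather than application code like this.

```python
class Instance:
    """A toy service instance that is either healthy or down."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


def route(instances, request):
    """Try instances in order and return the first valid response."""
    for instance in instances:
        try:
            return instance.handle(request)
        except ConnectionError:
            continue   # an error is an invalid response, so try the next one
    raise RuntimeError("no healthy instance available")


pool = [Instance("svc-a", healthy=False), Instance("svc-b"), Instance("svc-c")]
response = route(pool, "GET /orders")
print(response)   # svc-b served GET /orders
```

From the caller's point of view the failure of `svc-a` is invisible, which is the availability property in practice.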

Partition Tolerance

Partition Tolerance is the most complex concept within the CAP theorem. A network partition occurs when faults cut off communication between internal nodes, splitting the entire distributed system into multiple isolated node groups.

We can use the MySQL master-slave architecture as a typical example. A network failure may disconnect the master node from its slave nodes. Critically, both master and slave nodes remain fully operational and accessible to end users — only inter-node communication is broken. This is a standard network partition.

Partition Tolerance requires the entire system to keep operating normally even when such network partitions take place.

 

Choosing Two of the Three CAP Capabilities

The CAP theorem confirms that Consistency, Availability and Partition Tolerance cannot coexist perfectly. System architects must sacrifice at least one capability.

In real-world distributed systems, network partitions cannot be ruled out, so Partition Tolerance is not optional. The only real choice is whether to give up Consistency or Availability. When a network partition emerges, maintaining strict data consistency means rejecting write requests. Since cross-node data synchronization fails during a partition, allowing write operations would inevitably create inconsistent data. Rejecting writes, in turn, directly breaks system availability.

For this reason, only two viable combinations exist for production systems: CP and AP. A pure CA (Consistency + Availability) architecture is not achievable for distributed systems.

AP (Availability + Partition Tolerance)

The MySQL master-slave architecture is a classic AP implementation. During a network partition, the master node can no longer sync new data to slave nodes, leading to data inconsistency across nodes. Even so, users can still read data from both master and slave nodes, so availability and partition tolerance remain intact.

AP architectures are ideal for business scenarios where strict real-time data consistency is not a critical requirement.

CP (Consistency + Partition Tolerance)

CP architectures prioritize data consistency by sacrificing availability during failures.

Take the MySQL master-slave setup again. Once a network partition is detected, the system can return error responses for all requests sent to slave nodes. This prevents users from accessing unsynchronized stale data and preserves strong data consistency.
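The difference between the two choices shows up clearly when the same slave is asked for data during a partition. The sketch below is a simplified model, not MySQL behavior: `SlaveReplica`, its `mode` flag and the `Unavailable` exception are hypothetical names introduced for illustration.

```python
class Unavailable(Exception):
    """Raised when a CP replica refuses to answer during a partition."""


class SlaveReplica:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = "v1"          # last value synchronized from the master
        self.partitioned = False   # set by the test when the link is cut

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse rather than risk returning stale data.
            raise Unavailable("replica cannot confirm it has the latest data")
        # AP (or no partition): answer with whatever the replica holds.
        return self.value


def read_during_partition(mode):
    replica = SlaveReplica(mode)
    replica.partitioned = True     # the master now accepts "v2"; the replica never sees it
    try:
        return replica.read()
    except Unavailable:
        return "ERROR"


ap_result = read_during_partition("AP")   # "v1": available but stale
cp_result = read_during_partition("CP")   # "ERROR": consistent but unavailable
print(ap_result, cp_result)
```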

Another common CP design is data partitioning by user ID or order ID. For example, data with IDs from 0 to 999 is stored exclusively on Node 1, while data with IDs from 1000 to 1999 resides on Node 2. Since no data replicas exist across nodes, users must access the designated node to retrieve specific data, which enforces strong consistency.

The major tradeoff is reduced availability. If one partition node fails, all data stored on that node becomes completely inaccessible. This helps explain the partial service outages reported on major online platforms after data center faults, where only a subset of users or orders is affected: such symptoms are typical of strong-consistency CP designs.
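Range-based partitioning and its availability cost can be sketched as follows. The class `RangeShardRouter`, the node names and the ID bounds are assumptions chosen for illustration; real sharding middleware is considerably more involved.

```python
import bisect


class RangeShardRouter:
    """Maps a numeric ID to the single node that owns its range (no replicas)."""

    def __init__(self, upper_bounds, nodes):
        # upper_bounds[i] is the exclusive upper ID bound of nodes[i].
        if len(upper_bounds) != len(nodes) or upper_bounds != sorted(upper_bounds):
            raise ValueError("bounds must be sorted and match the node list")
        self.upper_bounds = upper_bounds
        self.nodes = nodes
        self.down = set()   # nodes currently failed

    def node_for(self, record_id):
        if record_id < 0 or record_id >= self.upper_bounds[-1]:
            raise KeyError(f"ID {record_id} is outside every shard range")
        return self.nodes[bisect.bisect_right(self.upper_bounds, record_id)]

    def read(self, record_id):
        node = self.node_for(record_id)
        if node in self.down:
            raise ConnectionError(f"{node} is down; ID {record_id} is unreachable")
        return f"row {record_id} from {node}"


# IDs 0..999 live on node-1, IDs 1000..1999 live on node-2.
router = RangeShardRouter([1000, 2000], ["node-1", "node-2"])
print(router.node_for(999), router.node_for(1000))   # node-1 node-2

router.down.add("node-2")        # simulate the failure of one shard
print(router.read(10))           # still served by node-1
# router.read(1500) would now raise ConnectionError
```

Only the IDs owned by the failed node become unreachable, which matches the partial outages described above.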

 

Critical Takeaways for Implementation and Testing

  1. The CAP theorem centers on data, not general system features. It can be applied at a fine-grained level across a single system. You do not need to enforce a universal CP or AP strategy for all data. Different datasets can adopt different rules.
    • For e-commerce platforms, mission-critical data such as product inventory and user account balances requires a CP strategy.
    • Non-essential data, including user profile details like gender, hobbies, addresses and phone numbers, can safely use the AP strategy.
  2. Systems can pursue both Consistency and Availability when no network partition exists. This point requires special attention during testing:
    • After injecting network partition faults, validate system performance against business rules. Verify data consistency for CP-focused services, and confirm continuous service availability for AP-focused services.
    • When no network partition is present, ensure both consistency and availability meet predefined standards.
  3. A perfect CP system does not exist in practice. Network latency is unavoidable, so cross-node data synchronization can never achieve 100% real-time replication, regardless of network speed.
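The takeaways above translate into a simple pass/fail rule for a fault-injection test. The sketch below is one possible way to encode it: the `POLICY` mapping, the `Observation` fields and the zero-tolerance thresholds are all my own assumptions, and a real project would replace them with its agreed business rules.

```python
from dataclasses import dataclass

# Hypothetical mapping from dataset to the strategy agreed with the business.
POLICY = {"inventory": "CP", "account_balance": "CP", "user_profile": "AP"}


@dataclass
class Observation:
    stale_reads: int       # reads that returned outdated data
    failed_requests: int   # errors or timeouts seen by clients


def passes(dataset, partitioned, obs):
    """Return True if the observed behavior matches the dataset's strategy."""
    strategy = POLICY[dataset]
    if not partitioned:
        # Without a partition, both consistency and availability are expected.
        return obs.stale_reads == 0 and obs.failed_requests == 0
    if strategy == "CP":
        return obs.stale_reads == 0        # errors are acceptable, stale data is not
    return obs.failed_requests == 0        # AP: stale data is acceptable, errors are not


print(passes("inventory", True, Observation(stale_reads=0, failed_requests=12)))    # True
print(passes("inventory", True, Observation(stale_reads=3, failed_requests=0)))     # False
print(passes("user_profile", True, Observation(stale_reads=5, failed_requests=0)))  # True
print(passes("user_profile", False, Observation(stale_reads=5, failed_requests=0))) # False
```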

 

The BASE Theory

The BASE theory is a complementary framework to CAP. I first learned this concept in an online technical course. Given that fully ideal systems are nearly impossible to build, BASE provides practical guidance for real-world distributed system design.

BASE stands for Basically Available, Soft State and Eventual Consistency. Its core principle is simple: if a system cannot maintain the strong consistency defined in CAP, teams can implement reasonable mechanisms to achieve eventual data consistency instead.

Basically Available

A system includes core business modules and secondary auxiliary features. When abnormal conditions occur, the system only needs to guarantee stable operation of core businesses. Trying to maintain full availability for every function will bring unnecessary operational costs.
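A common way to express this in code is a degradation switch that protects core features and sheds the rest. The sketch below assumes a hypothetical set of core features (`checkout`, `payment`) and a hypothetical `handle` function; it only shows the shape of the idea.

```python
# Hypothetical classification of features; a real system would load this from config.
CORE_FEATURES = {"checkout", "payment"}


def handle(feature, degraded):
    """Serve core features always; shed auxiliary ones while degraded."""
    if degraded and feature not in CORE_FEATURES:
        return {"status": "degraded", "body": None}
    return {"status": "ok", "body": f"{feature} result"}


print(handle("payment", degraded=True))          # core business keeps working
print(handle("recommendations", degraded=True))  # auxiliary feature is switched off
print(handle("recommendations", degraded=False)) # back to normal once the fault clears
```

During a chaos experiment, the check is then that core features still return `ok` while the fault is active.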

Soft State

Soft state allows systems to operate with intermediate states. Temporary data inconsistency is permitted in these states, and it will not impact overall service availability. This aligns with the earlier conclusion that flawless CP systems do not exist — data synchronization always comes with inherent latency.

Eventual Consistency

All data replicas across a distributed system will eventually synchronize and reach a consistent state after a certain period.

For AP-based systems, relaxing real-time consistency does not mean permanent data divergence. Systems must include dedicated mechanisms — either automated workflows or manual operations — to unify all data in the end.

MySQL master-slave and master-backup architectures serve as good examples. Network partitions or master node outages will cause the replicas to diverge. Operations teams can restore data using binary logs from the original master node and re-establish full data synchronization between nodes. This is a widely accepted way to realize eventual consistency.
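The replay idea behind this recovery can be illustrated with a toy log. This is a plain-Python model, not real MySQL binlog tooling: the log is a list of `(position, key, value)` tuples and `replay_missing` is a hypothetical helper.

```python
def replay_missing(master_log, replica_rows, replica_position):
    """Apply every log entry newer than replica_position, in log order.

    Returns the replica's new position.
    """
    for position, key, value in sorted(master_log):
        if position > replica_position:
            replica_rows[key] = value
            replica_position = position
    return replica_position


# The original master applied three changes; the replica only saw the first one.
master_log = [(1, "order:1", "paid"), (2, "order:2", "new"), (3, "order:2", "paid")]
master_rows = {"order:1": "paid", "order:2": "paid"}
replica_rows = {"order:1": "paid"}

new_position = replay_missing(master_log, replica_rows, replica_position=1)
print(new_position, replica_rows == master_rows)   # 3 True
```

Once the missing entries are applied in order, both nodes hold the same rows again, which is what eventual consistency promises.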

 

Summary

The BASE theory acts as a practical supplement to the CAP theorem, especially for AP-oriented architectures. The CAP theorem is built on ideal theoretical assumptions, and implementing pure CP or AP systems in production is extremely challenging. As a result, most real-world distributed systems operate according to the BASE model.

 

Practical Reflections for Testers

Understanding CAP and BASE is essential for testing teams. These theories guide overall testing strategies, help analyze system architecture and clarify high-availability goals.

Before designing test cases, testers must confirm business requirements: whether the system follows CP or AP rules, and whether strong consistency or eventual consistency is required. Below are two typical real-world testing scenarios I have encountered.

Scenario 1: Systems Using Raft Consensus Algorithm for Leader Election

Many distributed systems adopt the Raft algorithm to elect a leader node. The leader coordinates regular operations. If the leader fails, a follower that stops receiving heartbeats becomes a candidate and starts a new election to select a replacement.

Leader health checks rely on periodic heartbeat messages sent from the leader to followers. Because failure detection depends entirely on these heartbeats, Raft-based systems are particularly sensitive to network partition faults.

In a common failure scenario, the leader node remains functional, but network isolation cuts its connection to some of the follower nodes. The isolated followers will misjudge the leader as offline and trigger an unnecessary re-election. If they are in the minority they cannot win that election, but their repeated attempts raise the term number and can disrupt the cluster once the partition heals. For this reason, network partition fault testing is a mandatory item for all Raft-based systems.
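The misjudgment is easy to reproduce in a toy model. The sketch below is a heavily simplified, tick-based simulation and not a Raft implementation: it uses a fixed election timeout (real Raft randomizes it), it ignores voting entirely, and `Follower` and `simulate` are hypothetical names.

```python
class Follower:
    """A follower that turns candidate after too many ticks without a heartbeat."""

    def __init__(self, name, election_timeout):
        self.name = name
        self.election_timeout = election_timeout   # ticks allowed without a heartbeat
        self.ticks_since_heartbeat = 0
        self.term = 1
        self.state = "follower"

    def on_heartbeat(self, leader_term):
        if leader_term >= self.term:
            self.term = leader_term
            self.state = "follower"
            self.ticks_since_heartbeat = 0

    def tick(self):
        self.ticks_since_heartbeat += 1
        if self.ticks_since_heartbeat >= self.election_timeout:
            self.state = "candidate"   # assumes the leader is dead
            self.term += 1             # each election attempt starts a new term
            self.ticks_since_heartbeat = 0


def simulate(total_ticks, partition_start, election_timeout=4, leader_term=1):
    connected = Follower("f1", election_timeout)
    isolated = Follower("f2", election_timeout)
    for t in range(total_ticks):
        for follower in (connected, isolated):
            follower.tick()
        connected.on_heartbeat(leader_term)      # the leader is alive the whole time
        if t < partition_start:
            isolated.on_heartbeat(leader_term)   # heartbeats stop at the partition
    return connected, isolated


f1, f2 = simulate(total_ticks=10, partition_start=3)
print(f1.state, f1.term)   # follower 1
print(f2.state, f2.term)   # candidate 2
```

The leader never failed, yet the isolated node has moved to a higher term, which is the condition a partition test should look for.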

Scenario 2: MySQL Master-Backup / Master-Slave Architectures

Systems built on MySQL master-backup or master-slave modes with the default asynchronous replication inherently follow the BASE model and cannot guarantee strong data consistency.

We can simulate a complete failure chain for disaster recovery testing:

  1. Inject a network partition between master and backup nodes to amplify data inconsistency.
  2. Shut down the master node and trigger a failover to promote the backup node as the new master.

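In this scenario, the new master runs on outdated data. Any writes that reached the old master but were never replicated are missing, and they stay lost unless they are recovered from the old master. Most enterprises maintain official disaster recovery operation manuals for such incidents.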

As testers, we need to replicate the above failure sequence, then verify whether on-site engineers can restore data within the required time by following the manual. The core verification target is the eventual consistency defined by the BASE theory.
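The verdict of such a drill can be captured in a small helper. The sketch below assumes writes can be identified by simple labels and that the required recovery time is given in seconds; `evaluate_drill`, `DrillResult` and the sample numbers are hypothetical and only illustrate what to measure.

```python
from dataclasses import dataclass


@dataclass
class DrillResult:
    lost_before_recovery: list        # writes missing on the new master at failover
    recovered_in_time: bool           # manual recovery finished within the limit
    consistent_after_recovery: bool   # new master now holds every original write

    @property
    def passed(self):
        return self.recovered_in_time and self.consistent_after_recovery


def evaluate_drill(master_writes, backup_at_failover, backup_after_recovery,
                   recovery_seconds, limit_seconds):
    lost = [w for w in master_writes if w not in backup_at_failover]
    return DrillResult(
        lost_before_recovery=lost,
        recovered_in_time=recovery_seconds <= limit_seconds,
        consistent_after_recovery=set(master_writes) == set(backup_after_recovery),
    )


result = evaluate_drill(
    master_writes=["w1", "w2", "w3"],
    backup_at_failover=["w1"],                  # w2 and w3 arrived during the partition
    backup_after_recovery=["w1", "w2", "w3"],   # state after engineers follow the manual
    recovery_seconds=1200,
    limit_seconds=1800,
)
print(result.lost_before_recovery, result.passed)   # ['w2', 'w3'] True
```

Recording the lost writes separately from the final verdict keeps the evidence of the injected fault alongside the proof that recovery worked.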

 

 
