I recently finished work on a Kubernetes (K8s) monitoring platform and have since shifted focus to building a chaos engineering platform and organizing associated testing projects. Coincidentally, I am currently writing a book that includes a chapter on chaos engineering. To consolidate my knowledge and keep key concepts clear, I have compiled and documented these core theories in this post.
This post covers the CAP theorem, alongside practical high-availability design patterns widely used in Kubernetes — content I will also detail in my upcoming book.
I have always adhered to one core principle: effective testing starts with a thorough understanding of the system under test. Just as you must master a product’s features before validating its functionality, you need to fully grasp a system’s high-availability architecture before you test its high availability. This foundational knowledge allows you to align testing goals accurately and design targeted test scenarios.
Many new technical terms in the internet industry can be confusing, and I even spent time deciding whether to title this content around chaos engineering or high-availability testing. Rather than copying generic online definitions, I will explain the CAP theorem based on practical experience.
CAP is an acronym for Consistency, Availability and Partition Tolerance. It defines the three core capabilities that underpin high-availability design for all distributed systems. A fundamental rule applies here: these three capabilities cannot be fully satisfied simultaneously. When system failures occur, at least one capability must be compromised.
In this context, consistency refers strictly to data consistency. Distributed systems typically replicate data across multiple servers. Data consistency ensures every read request returns the most up-to-date data. Note that uncommitted data within distributed transactions is not recognized as valid latest data.
Data inconsistency most commonly appears in MySQL master-slave architectures. All write requests are directed to the master node, while read requests can be processed by either master or slave nodes. After new data is written to the master node, an automated mechanism synchronizes the data to all slave nodes.
This synchronization process always carries latency. Under normal network conditions, the delay is only a few milliseconds. However, if network failures break communication between the master and slave nodes, the gap between datasets will continue to grow and cause severe inconsistency.
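The growth of replication lag during a partition can be sketched with a toy model. This is only an illustration under simplifying assumptions: the `Node` and `Replication` classes and the `connected` flag are hypothetical names invented here, and real MySQL replication works on binary log events rather than Python lists.

```python
class Node:
    """A toy database node holding an append-only list of committed writes."""

    def __init__(self, name):
        self.name = name
        self.log = []


class Replication:
    """Hypothetical asynchronous master -> replica link (not a MySQL API)."""

    def __init__(self, master, replica):
        self.master = master
        self.replica = replica
        self.connected = True

    def write(self, value):
        # All writes go to the master first.
        self.master.log.append(value)

    def sync(self):
        # Copy the missing entries, but only while the link is up.
        if self.connected:
            missing = self.master.log[len(self.replica.log):]
            self.replica.log.extend(missing)

    def lag(self):
        # Number of writes the replica has not seen yet.
        return len(self.master.log) - len(self.replica.log)


link = Replication(Node("master"), Node("replica"))
link.write("a")
link.sync()
lag_healthy = link.lag()         # 0: the replica caught up immediately

link.connected = False           # simulate a network partition
for value in ["b", "c", "d"]:
    link.write(value)
    link.sync()                  # has no effect while partitioned
lag_partitioned = link.lag()     # 3: one missed write per master write

link.connected = True            # the partition heals
link.sync()
lag_healed = link.lag()          # 0: the replica is consistent again
```

The lag grows by one for every write the master accepts while the link is down, which is the widening gap described above.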
Availability means healthy nodes can return valid responses within a reasonable timeframe when other nodes fail. Failed requests, error messages and timeouts are classified as invalid responses.
In plain terms, the system continues delivering services even when some nodes go offline. This logic guides standard service deployment strategies: when individual service instances malfunction, traffic is rerouted to healthy instances to maintain uninterrupted service for end users.
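Rerouting traffic away from a failed instance can be sketched as follows. This is a minimal sketch, not how any particular load balancer or Kubernetes Service is implemented: `Instance`, `RoundRobinRouter` and the retry-once-per-instance policy are assumptions made for illustration.

```python
import itertools


class Instance:
    """A hypothetical service instance that fails every request when unhealthy."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class RoundRobinRouter:
    """Rotates through instances and skips the ones that fail."""

    def __init__(self, instances):
        self.instances = instances
        self._cursor = itertools.cycle(range(len(instances)))

    def route(self, request):
        # Try each instance at most once, starting from the next one in rotation.
        for _ in range(len(self.instances)):
            instance = self.instances[next(self._cursor)]
            try:
                return instance.handle(request)
            except ConnectionError:
                continue
        raise RuntimeError("no healthy instance available")


router = RoundRobinRouter([Instance("a"), Instance("b"), Instance("c")])
router.instances[1].healthy = False      # instance "b" malfunctions
responses = [router.route(f"req-{i}") for i in range(4)]
# Every request is still answered, by "a" or "c" only.
```

From the caller's point of view the service stays available as long as at least one instance is healthy; only when all of them fail does the router give up.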
Partition Tolerance is the most complex concept within the CAP theorem. A network partition occurs when faults cut off communication between internal nodes, splitting the entire distributed system into multiple isolated node groups.
We can use the MySQL master-slave architecture as a typical example. A network failure may disconnect the master node from its slave nodes. Critically, both master and slave nodes remain fully operational and accessible to end users — only inter-node communication is broken. This is a standard network partition.
Partition Tolerance requires the entire system to keep operating normally even when such network partitions take place.
The CAP theorem confirms that Consistency, Availability and Partition Tolerance cannot coexist perfectly. System architects must sacrifice at least one capability.
In real-world distributed systems, network partitions cannot be ruled out, so Partition Tolerance is effectively mandatory and developers can only choose to drop either Consistency or Availability. When a network partition emerges, maintaining strict data consistency means disabling all write requests, because cross-node data synchronization fails during a partition and any accepted write would create inconsistent data. Disabling writes, in turn, directly breaks system availability.
For this reason, only two viable combinations exist for production systems: CP and AP. A pure CA (Consistency + Availability) architecture is not achievable for distributed systems.
The MySQL master-slave architecture is a classic AP implementation. During a network partition, the master node can no longer sync new data to slave nodes, leading to data inconsistency across nodes. Even so, users can still read data from both master and slave nodes, so availability and partition tolerance remain intact.
AP architectures are ideal for business scenarios where strict real-time data consistency is not a critical requirement.
CP architectures prioritize data consistency by sacrificing availability during failures.
Take the MySQL master-slave setup again. Once a network partition is detected, the system can return error responses for all requests sent to slave nodes. This prevents users from accessing unsynchronized stale data and preserves strong data consistency.
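The difference between the AP and CP choices comes down to what a replica does with a read while it is cut off from the master. The sketch below models that single decision. The `Replica` class, its `mode` switch and the `ReplicaUnavailable` exception are hypothetical names; MySQL itself has no such switch, and in practice this policy would live in a proxy or in the application.

```python
class ReplicaUnavailable(Exception):
    """Raised in CP mode when the replica cannot prove its data is current."""


class Replica:
    def __init__(self, mode):
        assert mode in ("AP", "CP")
        self.mode = mode
        self.data = {}
        self.partitioned = False

    def apply(self, key, value):
        # Replication from the master; updates are lost during a partition.
        if not self.partitioned:
            self.data[key] = value

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise ReplicaUnavailable("link to master lost; refusing stale read")
        # AP: always answer, even if the value may be outdated.
        return self.data.get(key)


def run(mode):
    replica = Replica(mode)
    replica.apply("stock", 10)
    replica.partitioned = True
    replica.apply("stock", 9)        # the master-side update never arrives
    try:
        return replica.read("stock")
    except ReplicaUnavailable:
        return "error"


ap_result = run("AP")   # 10: available, but stale (the master already has 9)
cp_result = run("CP")   # "error": consistent, but unavailable
```

The same partition produces a stale answer in one mode and an error in the other, which is the tradeoff a tester needs to know in advance in order to decide which outcome counts as a pass.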
Another common CP design is data partitioning by user ID or order ID. For example, data with IDs from 0 to 999 is stored exclusively on Node 1, while data with IDs from 1000 to 1999 resides on Node 2. Since no data replicas exist across nodes, users must access the designated node to retrieve specific data, which enforces strong consistency.
The major tradeoff is reduced availability. If one partition node fails, all data stored on that node becomes completely inaccessible. This is one reason major online platforms sometimes report partial service outages after data center faults: only the users whose data lives on the failed nodes are affected.
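A range-partitioned store without replicas can be sketched in a few lines. This is an illustrative model only: `RangeShardedStore`, the node names and the half-open ranges (chosen so that a boundary ID belongs to exactly one node) are assumptions, not the layout of any specific product.

```python
class ShardUnavailable(Exception):
    """Raised when the only node that owns a record is down."""


class RangeShardedStore:
    def __init__(self, ranges):
        # ranges: list of (low_inclusive, high_exclusive, node_name)
        self.ranges = ranges
        self.data = {name: {} for _, _, name in ranges}
        self.down = set()

    def node_for(self, record_id):
        for low, high, name in self.ranges:
            if low <= record_id < high:
                return name
        raise KeyError(f"no shard owns id {record_id}")

    def put(self, record_id, value):
        node = self.node_for(record_id)
        if node in self.down:
            raise ShardUnavailable(node)
        self.data[node][record_id] = value

    def get(self, record_id):
        node = self.node_for(record_id)
        if node in self.down:
            # There is no replica to fall back to.
            raise ShardUnavailable(node)
        return self.data[node][record_id]


store = RangeShardedStore([(0, 1000, "node-1"), (1000, 2000, "node-2")])
store.put(42, "order-42")
store.put(1500, "order-1500")
store.down.add("node-2")     # node-2 fails

still_readable = store.get(42)   # served by node-1 as before
# store.get(1500) now raises ShardUnavailable
```

Each record has exactly one copy, so reads can never disagree, but every ID owned by the failed node is unreachable until that node comes back.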
The BASE theory is a complementary framework to CAP. I first learned this concept in an online technical course. Given that fully ideal systems are nearly impossible to build, BASE provides practical guidance for real-world distributed system design.
BASE stands for Basically Available, Soft State and Eventual Consistency. Its core principle is simple: if a system cannot maintain the strong consistency defined in CAP, teams can implement reasonable mechanisms to achieve eventual data consistency instead.
Basically available means the system protects what matters most. A system includes core business modules and secondary auxiliary features. When abnormal conditions occur, the system only needs to guarantee stable operation of core businesses. Trying to maintain full availability for every function will bring unnecessary operational costs.
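One common way to protect core business while letting auxiliary features fail is graceful degradation. The sketch below assumes a hypothetical order page whose core data is the order itself and whose auxiliary feature is a recommendation list; all function names are invented for illustration.

```python
def render_order_page(order_id, fetch_order, fetch_recommendations):
    """Core data is mandatory; the auxiliary feature degrades to an empty list."""
    page = {"order": fetch_order(order_id)}       # core: failures propagate
    try:
        page["recommendations"] = fetch_recommendations(order_id)
    except Exception:
        page["recommendations"] = []              # auxiliary: degrade quietly
        page["degraded"] = True
    return page


def fetch_order(order_id):
    # Stand-in for the core order service.
    return {"id": order_id, "status": "paid"}


def broken_recommendations(order_id):
    # Stand-in for an auxiliary service that is currently failing.
    raise TimeoutError("recommendation service unreachable")


page = render_order_page(7, fetch_order, broken_recommendations)
# The order is still shown; the recommendation block is simply empty.
```

For a tester, this translates into two different expectations during fault injection: a failure in the core path should surface as an error, while a failure in the auxiliary path should leave the page usable.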
Soft state allows systems to operate with intermediate states. Temporary data inconsistency is permitted in these states, and it will not impact overall service availability. This matches the earlier observation that data synchronization always comes with inherent latency, which is why a flawless CP system is so hard to build with replicated data.
Eventual consistency means all data replicas across a distributed system will synchronize and reach a consistent state after a certain period.
For AP-based systems, relaxing real-time consistency does not mean permanent data divergence. Systems must include dedicated mechanisms — either automated workflows or manual operations — to unify all data in the end.
MySQL master-slave and master-backup architectures serve as good examples. Network partitions or master node outages can leave the nodes with inconsistent data. Operations teams can restore the missing data using binary logs from the original master node and re-establish full data synchronization between nodes. This is a widely accepted way to realize eventual consistency.
The BASE theory acts as a practical supplement to the CAP theorem, especially for AP-oriented architectures. The CAP theorem is built on ideal theoretical assumptions, and implementing pure CP or AP systems in production is extremely challenging. As a result, most real-world distributed systems run following the BASE model.
Understanding CAP and BASE is essential for testing teams. These theories guide overall testing strategies, help analyze system architecture and clarify high-availability goals.
Before designing test cases, testers must confirm business requirements: whether the system follows CP or AP rules, and whether strong consistency or eventual consistency is required. Below are two typical real-world testing scenarios I have encountered.
Many distributed systems adopt the Raft algorithm to elect a leader node. The leader coordinates regular operations. If the leader fails, the followers stop receiving its heartbeats, and the first follower whose election timeout expires becomes a candidate and starts a new election to select a replacement.
Leader health checks rely on periodic heartbeat messages sent from the leader to followers. Because followers cannot tell a crashed leader from an unreachable one, systems built on Raft are highly sensitive to network partition faults.
In a common failure scenario, the leader node remains functional, but network isolation cuts its connection to some of the follower nodes. The isolated followers will misjudge the leader as offline and trigger an unnecessary election. If they form only a minority they cannot win it, but the election attempts can still disturb the cluster once the network recovers. For this reason, network partition fault testing is a mandatory item for all Raft-based systems.
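The heartbeat timeout behaviour behind this scenario can be sketched with a deliberately reduced model. It is not a Raft implementation: there are no vote requests and no log, and `RaftNode`, `simulate` and the tick-based timeouts are hypothetical names chosen for illustration. It only shows what an isolated follower does when heartbeats stop arriving.

```python
class RaftNode:
    def __init__(self, name, election_timeout):
        self.name = name
        self.election_timeout = election_timeout   # ticks allowed without a heartbeat
        self.ticks_since_heartbeat = 0
        self.term = 1
        self.role = "follower"

    def on_heartbeat(self, leader_term):
        # A heartbeat from a current leader resets the election timer.
        if leader_term >= self.term:
            self.term = leader_term
            self.role = "follower"
            self.ticks_since_heartbeat = 0

    def tick(self):
        self.ticks_since_heartbeat += 1
        if self.role != "leader" and self.ticks_since_heartbeat >= self.election_timeout:
            # Timeout: become a candidate and start an election in a new term.
            self.term += 1
            self.role = "candidate"
            self.ticks_since_heartbeat = 0


def simulate(ticks, isolated_names):
    leader = RaftNode("n1", election_timeout=3)
    leader.role = "leader"
    followers = [RaftNode("n2", 3), RaftNode("n3", 3)]
    for _ in range(ticks):
        for node in followers:
            if node.name not in isolated_names:
                node.on_heartbeat(leader.term)     # the heartbeat gets through
            node.tick()
    return {n.name: (n.role, n.term) for n in [leader] + followers}


state = simulate(ticks=6, isolated_names={"n3"})
# n1 is still the leader in term 1 and n2 still follows it,
# while the isolated n3 has timed out twice and is a candidate in term 3.
```

In a three-node cluster the isolated node holds only its own vote and needs two, so it keeps retrying with an ever higher term. A partition test can therefore check two things: that the majority side keeps serving during the partition, and that the cluster settles on a single leader after the network is restored.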
Systems built on MySQL master-backup or master-slave modes with the default asynchronous replication inherently follow the BASE model and cannot guarantee strong data consistency.
We can simulate a complete failure chain for disaster recovery testing:
In this scenario, the new master runs on outdated data, and any writes that never reached it are lost unless they are recovered from the old master. Most enterprises maintain official disaster recovery operation manuals for such incidents.
As testers, we need to replicate the above failure sequence, then verify whether on-site engineers can restore data within the required time by following the manual. The core verification target is the eventual consistency defined by the BASE theory.
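A test of this kind can be rehearsed against a toy model before touching real databases. The failure sequence below (partition, continued writes on the master, master crash, backup promotion, binlog replay) is an assumed chain chosen for illustration, not a sequence taken from a specific runbook, and `Db` and `replicate` are hypothetical helpers rather than MySQL tooling. The sketch also assumes the promoted node accepts no conflicting writes before recovery.

```python
class Db:
    """Toy database with a row store and an ordered change log."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.binlog = []

    def write(self, key, value):
        self.rows[key] = value
        self.binlog.append((key, value))


def replicate(source, target, from_pos=0):
    """Apply the source's logged changes, starting at from_pos, to the target."""
    for key, value in source.binlog[from_pos:]:
        target.write(key, value)


def missing_rows(reference, candidate):
    """Rows whose value on the candidate differs from the reference."""
    return {k: v for k, v in reference.rows.items() if candidate.rows.get(k) != v}


master, backup = Db("master"), Db("backup")

# Step 1 (assumed): normal operation, the backup is in sync.
master.write("order-1", "paid")
replicate(master, backup)
synced_pos = len(master.binlog)

# Step 2 (assumed): a partition starts; the master keeps accepting writes.
master.write("order-2", "paid")
master.write("order-1", "refunded")

# Step 3 (assumed): the master crashes and the backup is promoted.
new_master = backup
missing_before = missing_rows(master, new_master)

# Step 4 (assumed): recovery replays the old master's log from the last
# synced position onto the new master.
replicate(master, new_master, from_pos=synced_pos)
missing_after = missing_rows(master, new_master)
```

The verification target is that `missing_after` is empty: every write the old master accepted is present on the new master once recovery completes. In a real exercise the same comparison would be run against the actual databases, together with a measurement of how long the recovery took.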