VII. Decouple Space

Create flexibility by embracing the network

We can only create a resilient system if we allow it to live in multiple locations so that it can function when parts of the underlying hardware malfunction or are inaccessible; in other words, we need to distribute the parts across space. Once distributed, the now autonomous components collaborate, as loosely coupled as is possible for the given use-case, to make maximal use of the newly won independence from one specific location.

This spatial decoupling uses network communication to reconnect the potentially remote pieces. Since all networks function by passing messages between nodes, and since this message-passing takes time, spatial decoupling introduces asynchronous message-passing at a foundational level. Higher-level representations of this aspect include gRPC, NATS.io, Apache Kafka, and HTTP and REST. Whether and how the asynchronous nature of the network is surfaced in the local APIs is up to each component to decide.
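As a rough illustration of asynchronous message-passing, the following Python sketch models a component that owns an inbox and an outbox. Everything here (the names component and run_demo, the None sentinel, the in-process asyncio.Queue standing in for a network link) is an assumption made for the example, not something prescribed by the text.

```python
import asyncio
from typing import List, Optional


async def component(inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    # Hypothetical component: handles one message at a time until it
    # receives the None sentinel, which this sketch uses as "shut down".
    while True:
        message: Optional[str] = await inbox.get()
        if message is None:
            break
        await outbox.put(message.upper())


async def run_demo(messages: List[str]) -> List[str]:
    # In-process queues stand in for the network in this sketch.
    inbox: asyncio.Queue = asyncio.Queue()
    outbox: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(component(inbox, outbox))

    # Sending returns immediately; the sender does not wait for processing.
    for message in messages:
        inbox.put_nowait(message)
    inbox.put_nowait(None)

    # Wait for the component to drain its inbox, then collect the replies.
    await worker
    return [outbox.get_nowait() for _ in range(outbox.qsize())]
```

The sender and the component share nothing but the queues, so replacing the queues with a real transport would not change the shape of either side.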

A key aspect of asynchronous messaging and asynchronous APIs is that they make the network, with all its constraints, explicit and first-class in the design. This forces you to design for failure and uncertainty instead of pretending that the network is not there and hiding it behind a leaky local abstraction (e.g. network-attached disks), only to see that abstraction fall apart in the face of partial failures, message loss, or reordering.
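One way to make this uncertainty explicit in a local API is to bound every remote request with a timeout and return the failure as an ordinary value. The sketch below assumes a hypothetical flaky_remote function whose delay parameter simulates network latency; Reply and ask are likewise names invented for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reply:
    # Failure is part of the reply type, not an afterthought.
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


async def flaky_remote(request: str, delay: float) -> str:
    # Hypothetical stand-in for a remote component; `delay` simulates latency.
    await asyncio.sleep(delay)
    return f"echo:{request}"


async def ask(request: str, delay: float, timeout: float) -> Reply:
    # The caller never waits indefinitely: a missing reply becomes
    # an explicit value that the caller has to handle.
    try:
        value = await asyncio.wait_for(flaky_remote(request, delay), timeout)
        return Reply(ok=True, value=value)
    except asyncio.TimeoutError:
        return Reply(ok=False, error="timeout")
```

Note that a timeout only tells the caller that no reply arrived in time; the request may still have been processed remotely, which is exactly the kind of uncertainty the design has to account for.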

It also allows for location transparency, which gives you one single abstraction for all component interactions, regardless of whether the component is co-located on the same physical machine, in another rack, or even in another data center. Asynchronous APIs allow cloud infrastructure such as discovery services and load balancers to route requests to wherever the container or VM is running while embracing the likelihood of ever-changing latency and failure characteristics. This provides one programming model with a single set of semantics regardless of how the system is deployed or what topology it currently has (which can change with its usage).
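A minimal sketch of location transparency, assuming a hypothetical Router that maps logical names to whatever handler currently serves them. The local and remote handlers are simulated in-process here, and a real system would use a discovery service rather than a dictionary.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Handler = Callable[[str], Awaitable[str]]


class Router:
    """Hypothetical router: resolves a logical name to its current handler."""

    def __init__(self) -> None:
        self._routes: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        # Re-registering a name models the component moving elsewhere.
        self._routes[name] = handler

    async def send(self, name: str, message: str) -> str:
        return await self._routes[name](message)


async def local_greeter(message: str) -> str:
    return f"local:{message}"


async def remote_greeter(message: str) -> str:
    await asyncio.sleep(0.01)  # simulated network hop
    return f"remote:{message}"


async def client(router: Router) -> str:
    # The caller knows only the logical name "greeter", never the location.
    return await router.send("greeter", "hi")
```

The client code is identical whether "greeter" is registered as local_greeter or remote_greeter; only latency and failure characteristics differ, which is why the call is asynchronous in both cases.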

Spatial decoupling enables replication, which ultimately increases the resilience and availability of the system. Multiple instances of a component can run at once and share the load. Thanks to location transparency, the rest of the system does not need to know where these instances are located, so the capacity of the system can be increased transparently and on demand. If one instance crashes or is undeployed, the other replicas continue to operate and share the load. This ability to fail over is essential to avoid service disruption.
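Fail-over across replicas can be sketched as trying each replica in turn until one answers. The names below (ReplicaDown, make_replica, call_with_failover) and the fixed try-in-order policy are assumptions for illustration; production systems typically delegate this to a load balancer or client library with health checks.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

Replica = Callable[[str], Awaitable[str]]


class ReplicaDown(Exception):
    """Raised by a simulated replica that has crashed or been undeployed."""


def make_replica(name: str, healthy: bool) -> Replica:
    # Hypothetical factory for a simulated replica.
    async def handle(request: str) -> str:
        if not healthy:
            raise ReplicaDown(name)
        return f"{name} handled {request}"

    return handle


async def call_with_failover(
    replicas: Sequence[Replica], request: str, timeout: float = 0.5
) -> str:
    # Try replicas in order; a crash or a timeout moves on to the next one.
    last_error: Optional[BaseException] = None
    for replica in replicas:
        try:
            return await asyncio.wait_for(replica(request), timeout)
        except (ReplicaDown, asyncio.TimeoutError) as exc:
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error
```

The caller sees a failure only when every replica is unavailable. Retrying a request on another replica is only safe if the request is idempotent or the replicas deduplicate it, a constraint this sketch leaves out.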