← All insights

Architecture

Fault domains in hybrid cloud

Field note · 10 min read

“We are multi-AZ, so we are fine.” That statement ages poorly the day a shared control plane, global load balancer, or a common key management dependency fails-and every service that “should” be independent rediscovers a hidden edge. Fault domains are the art of drawing boundaries where a failure (deploy bug, backhoe, DDoS, or human error) stops-without re-architecting the world every quarter.

1. Multi-AZ is a floor, not a ceiling

Availability zones help with power and many classes of local hardware failure. They do not protect you from software bugs that go out in every copy, or from a regional API that your entire “microservices” map shares because everyone imported the same client default. The first pass is not “more boxes”; it is independence in deployment, credentials, and blast radius for state.

2. Cells, shards, and the human org chart

A cell (or other tenant/partitioning pattern) is not only a scaling trick. It is a policy object: a unit that can be upgraded in waves, throttled, or fail without taking every customer down. The catch: the team that owns a cell also must own a clear SLO, runbook, and paging story-see error budgets-or you get a beautiful diagram and an ugly on-call life.

3. Hybrid and multi-cloud: the network is a fault domain

When a workload spans a private data center, one hyperscaler, and a SaaS identity layer, the weakest link is often the path between them-not CPU in any one building. Resilience work here is:

4. Ownership and escalation

Large incidents become political when the on-call rotation does not line up with code ownership or the ability to revert a shared config. A fault domain strategy without an escalation map (who can flip what break-glass) is a diagram, not a plan. For regulated environments, pair this with evidence that the break-glass path was rehearsed, not only documented.

5. Do not outsmart your SLOs

Sometimes the right “architecture” is fewer moving parts: a boring monolith with a rock-solid SLO in one place can outlive a tangle of nanoservices that all share a flaky queue. Scale concerns from at-scale practice and resilience concerns meet here: choose boundaries that match the org’s ability to operate them.

Takeaway

Fault domains are a design conversation and an org conversation. The drawing is 10% of the work. The other 90% is getting teams to ship, page, and rehearse within the boundaries you picked-so the next outage is smaller, shorter, and less surprising. That is the definition of a resilient system, hybrid or not.

Architecture review →