Cloud Storage Replication: Seven Patterns and Their Real RPOs

Every cloud provider offers half a dozen replication options. Here is what each one actually gives you — in minutes of data loss, in dollars per month, and in the failure modes the sales deck leaves out.

John Lane 2023-06-29 6 min read

"Replication" is one of those words in cloud documentation that covers so many different things that using it without qualification is nearly meaningless. When a customer tells us their data is "replicated," the follow-up questions are always the same: replicated where, replicated how, with what recovery point objective, and what happens when the replica is corrupted along with the primary. The answers separate the deployments that actually survive an incident from the ones that just looked like they would.

Here are seven replication patterns we see in production, what each one gives you, and where it breaks.

Pattern One: Single-Region, Multi-AZ Redundant Storage

This is the default for most cloud object storage services. Azure LRS and ZRS, AWS S3 Standard, and GCP regional storage all write data to multiple devices within a single region (note the nuance: LRS keeps its copies within one datacenter, while ZRS and S3 Standard spread them across availability zones). The RPO is essentially zero (writes are acknowledged only after durable replication), the RTO is effectively zero (the service is continuously available), and you pay the base storage rate with no meaningful premium.

The failure mode is a region outage. If the entire region goes down — and hyperscaler regions do go down, including in well-publicized events in the last several years — a single-region deployment is offline until the region recovers. For most workloads that is a tolerable risk; for critical workloads it is not. Decide deliberately which category your workload is in, and do not assume "multi-AZ means multi-region" because it does not.

Pattern Two: Geo-Redundant Object Storage, Async

This is Azure GRS, AWS S3 Cross-Region Replication, and GCP dual-region or multi-region buckets. Data is asynchronously replicated to a second region: the provider-chosen paired region for Azure GRS, or a destination region you select for S3 CRR. RPO is typically minutes during normal operation (Azure documents a typical GRS RPO under 15 minutes, and S3 Replication Time Control targets replication within 15 minutes), and occasionally higher during provider events. RTO depends on whether the secondary is readable at all times (Azure RA-GRS exposes a read-only secondary endpoint) or only accessible after a failover event.

The honest caveat is that asynchronous geo-replication is not synchronous. In the worst case — a regional event that takes the primary offline mid-transaction — you can lose the last few minutes of writes. For most data that is acceptable. For financial transactions, ordering systems, or anything where "I confirmed the write but it was lost" is a compliance problem, it is not.

Cost for GRS-style replication is roughly 2x the base storage cost, plus inter-region bandwidth on the writes. Budget it explicitly.
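That budgeting exercise is simple enough to write down. A minimal sketch, with hypothetical per-GB rates standing in for whatever your provider actually charges:

```python
def geo_replication_monthly_cost(
    stored_gb: float,
    written_gb: float,
    storage_rate_per_gb: float = 0.02,  # hypothetical base rate, $/GB-month
    egress_rate_per_gb: float = 0.02,   # hypothetical inter-region rate, $/GB
) -> float:
    """Rough monthly cost of GRS-style geo-replication.

    Roughly 2x base storage (primary plus secondary copy), plus
    inter-region bandwidth on every replicated write.
    """
    storage = 2 * stored_gb * storage_rate_per_gb
    bandwidth = written_gb * egress_rate_per_gb
    return storage + bandwidth

# 10 TB stored, 1 TB of new writes per month, at the example rates:
# roughly $430/month, versus ~$205/month for single-region storage alone.
```

The rates here are placeholders, not quotes from any provider's price list; the point is that the 2x storage term usually dominates, but the bandwidth term scales with write volume and surprises write-heavy workloads.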

Pattern Three: Synchronous Block Storage Replication

For databases and applications that cannot tolerate async loss, synchronous replication across zones or short-distance regions is the pattern. Azure zone-redundant managed disks, AWS RDS Multi-AZ deployments with a synchronous standby, and the cloud-native managed database services (Azure SQL zone redundant, Aurora Multi-AZ, Cloud SQL HA) all fall into this category. (EBS Multi-Attach, often mentioned here, is shared attachment of a single volume to multiple instances, not replication, and does not belong on this list.)

Synchronous replication gives you zero RPO within the replication domain — if a zone fails, no committed data is lost. The tradeoff is latency. Every write must be confirmed on both primary and replica before the application sees success, which adds a few milliseconds to every write at best and tens of milliseconds at worst over longer distances. Synchronous cross-region replication generally is not offered because the latency penalty breaks application assumptions.
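The latency tradeoff is easy to see in a back-of-the-envelope model. A sketch, with illustrative RTT numbers rather than measurements from any particular provider:

```python
def sync_commit_latency_ms(local_write_ms: float, replica_rtt_ms: float) -> float:
    """Minimum commit latency under synchronous replication.

    The write is not acknowledged until the replica confirms it, so
    every commit pays at least one round trip to the replica on top
    of the local write itself.
    """
    return local_write_ms + replica_rtt_ms

# Cross-zone RTTs of a millisecond or two are usually tolerable;
# cross-region RTTs of tens of milliseconds usually are not:
# sync_commit_latency_ms(2.0, 1.5)  -> 3.5 ms per commit
# sync_commit_latency_ms(2.0, 60.0) -> 62.0 ms per commit
```

This is why synchronous cross-region replication is rarely offered: the model above is a floor, and real commits often pay the round trip more than once.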

The failure mode is correlated failure. If the primary and replica are in "different zones" that share underlying infrastructure — a power distribution unit, a top-of-rack switch, a network fabric — a single failure can take both down. Hyperscalers design zones to avoid this, but "designed to avoid" is not the same as "cannot happen." Real-world incident reports suggest correlated zone failures are rare but not zero.

Pattern Four: Database-Level Logical Replication

This is replication implemented by the database engine rather than the storage layer. PostgreSQL logical replication, MySQL GTID-based replication, SQL Server Availability Groups, MongoDB replica sets. The database engine writes a change log and applies it to replicas, which can be in the same region, a different region, or a different cloud.

The advantages are flexibility and portability: the replicas do not have to be the same storage tier, or the same provider, or even the same cloud. The disadvantages are operational complexity (replication lag monitoring, failover coordination, read-your-writes semantics) and the fact that logical replication is usually async, with the same potential data loss as other async patterns.

We recommend this pattern for customers who want cross-cloud resilience or who need to migrate between providers without a hard cutover. It is more work to operate than the managed alternatives, and the work never stops.
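The "replication lag monitoring" part of that operational work often reduces to a threshold check. A minimal sketch of the logic, assuming you have already pulled per-replica apply lag from something like PostgreSQL's pg_stat_replication view (the names and threshold here are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    name: str
    lag_seconds: float  # e.g. derived from pg_stat_replication.replay_lag


def lagging_replicas(replicas, max_lag_seconds: float = 30.0):
    """Return the names of replicas whose apply lag exceeds the threshold.

    With async logical replication, lag is your effective RPO: if you
    promoted a replica right now, you would lose roughly that many
    seconds of committed writes.
    """
    return [r.name for r in replicas if r.lag_seconds > max_lag_seconds]


replicas = [
    ReplicaStatus("dr-east", lag_seconds=4.2),
    ReplicaStatus("dr-west", lag_seconds=95.0),
]
```

In this example only "dr-west" would trip the alert; the useful part is framing the threshold as an RPO budget rather than an arbitrary number.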

Pattern Five: Immutable Backup as Replication

A pattern that does not get called "replication" but functions as one: scheduled backups to an immutable object storage tier, in a different region or account from the primary. Azure Backup with immutable vaults, AWS Backup with Vault Lock, or a third-party tool writing to a WORM-enabled bucket.

The RPO here is the backup interval — typically 15 minutes to 24 hours depending on the workload — and the RTO is the time to restore, which can be hours or days depending on the data volume. This is not the fastest recovery option, but it is the only one that protects against ransomware and insider attacks that a synchronous or async replica would faithfully propagate to the secondary.

Every serious backup strategy includes an immutable tier. If your "replication" strategy does not, you do not have a backup strategy, you have a mirror of your corrupted data.
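When stating the RPO of a backup-based pattern, remember that the worst case is not the backup interval alone. A sketch of the arithmetic:

```python
def worst_case_data_loss_hours(backup_interval_hours: float,
                               backup_duration_hours: float = 0.0) -> float:
    """Worst-case RPO for scheduled backups.

    Data written just after a backup captures its snapshot is not
    protected until the *next* backup completes, so the worst-case
    loss window is the interval plus the time the backup itself takes.
    """
    return backup_interval_hours + backup_duration_hours

# Daily backups that take two hours to complete:
# worst_case_data_loss_hours(24, 2) -> a 26-hour worst-case window
```

If your compliance target is "24 hours of loss, maximum," a daily backup that takes hours to run does not actually meet it.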

Pattern Six: Active-Active Multi-Region

The pattern everyone wants and few need. Writes are accepted in multiple regions simultaneously, and a conflict resolution strategy reconciles divergent updates. Cosmos DB multi-region writes, DynamoDB Global Tables, Cloud Spanner multi-region configurations.

The honest assessment is that this pattern works, but it requires the application to be designed around conflict resolution from day one. If your application assumes a single source of truth and does read-then-write operations, making it active-active after the fact is painful. The managed services handle the data layer for you, but the application layer still has to decide what to do when two regions both accepted a write against the same row.

Use this pattern when you have a genuine business requirement for zero-downtime, geographically distributed writes. Do not use it as a fancier version of DR, because the operational complexity is an order of magnitude higher than the simpler patterns.
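To make the conflict problem concrete, here is the simplest common resolution strategy, last-writer-wins, sketched in a few lines (the row shape and timestamps are made up for illustration):

```python
def last_writer_wins(a: dict, b: dict) -> dict:
    """Resolve two concurrent versions of the same row by timestamp.

    Simple and deterministic, but it silently discards the losing
    write. Whether that is acceptable is exactly the application-level
    decision active-active forces you to make up front.
    """
    return a if a["updated_at"] >= b["updated_at"] else b


# The same row was updated in two regions before replication converged:
region_a = {"row": 42, "qty": 3, "updated_at": 1718000100}
region_b = {"row": 42, "qty": 5, "updated_at": 1718000175}
# region_b's later write wins; region_a's qty=3 update is gone.
```

Last-writer-wins is fine for idempotent state like "latest profile photo" and catastrophic for counters or inventory decrements, which is why the managed services offer it as one option among several rather than the answer.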

Pattern Seven: Cross-Provider or Cloud-to-On-Prem

The most paranoid pattern, and the one we recommend more often than customers expect. Replicate critical data from one cloud provider to a different provider (AWS to Azure, Azure to GCP) or from cloud to an on-prem or colo-based object store (MinIO, Cloudian, NetApp StorageGRID). The goal is not just disaster recovery but provider independence — the ability to survive not just a regional outage but a prolonged provider-level incident, a billing dispute, or a compliance event that affects your ability to access the provider at all.

The cost is real (double storage, inter-provider egress) and the operational complexity is significant, but for genuinely critical data the protection against provider concentration risk is the point. We recommend at minimum that the "last line of defense" backup — the one you would restore from if everything else failed — lives outside your primary provider's control plane. This is unfashionable but defensible, and it has saved customers from incidents that the primary provider's published SLAs would not have covered.
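A cross-provider copy is only as good as your ability to prove it matches the primary. A minimal sketch of content-hash verification; in practice you would compare stored checksums or ETags rather than re-reading full objects, but the principle is the same:

```python
import hashlib


def object_digest(data: bytes) -> str:
    """Content hash used to verify the cross-provider copy."""
    return hashlib.sha256(data).hexdigest()


def copies_match(primary: bytes, secondary: bytes) -> bool:
    """True if the copy in the other provider is byte-identical.

    A copy you never verify is a copy you only *believe* you have;
    silent corruption during transfer or at rest is exactly the
    failure this check catches.
    """
    return object_digest(primary) == object_digest(secondary)
```

Run the verification on a schedule against a sample of objects, not once at setup time.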

What We Would Actually Recommend

For a typical mid-market customer with a mix of applications and data criticality, the pattern we recommend is a layered approach.

  • Default storage: multi-AZ or GRS-class. The base level for everything, almost for free.
  • Critical transactional data: synchronous replication within region, async cross-region, plus an immutable backup tier. Three layers, because each one protects against a different class of failure.
  • Immutable backups, always, in a different account or subscription from the primary. This is the ransomware line of defense. Do not skip it.
  • For genuinely critical data, a cross-provider or on-prem copy. Expensive, and the right answer for a smaller subset than customers expect — but for the right workload, the insurance is worth it.
  • Test restores quarterly, not annually. A replica you have never restored from is not a replica, it is a prayer.
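
The layered recommendation above can be sanity-checked mechanically: tag each copy with the failure classes it survives and confirm the union covers everything you care about. A sketch, with hypothetical copy names and tags:

```python
def failure_classes_covered(copies):
    """Return the set of distinct failure classes the copies survive.

    Two copies that die together under the same event add redundancy,
    not protection; only the union of failure classes counts.
    """
    return {c for copy in copies for c in copy["survives"]}


# A hypothetical layered deployment:
layered = [
    {"name": "zrs-primary", "survives": {"zone"}},
    {"name": "geo-replica", "survives": {"zone", "region"}},
    {"name": "immutable-vault", "survives": {"zone", "region", "ransomware"}},
]
# Covers zone, region, and ransomware failures; a provider-level
# incident would still take out all three copies.
```

Running this over a real inventory tends to reveal the gap quickly: most deployments have three copies and one failure class.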

The uncomfortable truth about replication is that every pattern has a failure mode, and the only real protection comes from layering patterns so that no single failure class takes out all your copies at once. "Replicated" is not a strategy. "Replicated three ways against three different failure classes and tested quarterly" is.
