Managed Cloud Solutions: Three Methods That Stop the 3 AM Pages

The point of managed cloud is not a nicer dashboard. It's fewer incidents. Here are the three methods that actually reduce the number of times you get paged at 3 AM.

John Lane · 2024-06-17 · 6 min read

When a customer starts a conversation about managed cloud, the stated reason is usually "we want to focus on our business, not infrastructure." The actual reason, almost always, is that someone on their team is tired of getting paged at 3 AM. The team has had too many weekends ruined, too many incidents where the fix was obvious in hindsight, and too many outages where the cause was something nobody had prioritized.

Managed cloud, done well, is a deliberate attempt to reduce the page count. It is not about prettier dashboards or fancier tools. It is about three specific methods that actually change the odds of your phone ringing at the wrong hour. Here they are.

Method 1: Ruthless elimination of silent failure modes

The biggest source of 3 AM pages is failure modes that sat hidden for months. Disks filled up because nobody was alerting at the 75 percent threshold. Backups silently stopped working two months ago, and nobody noticed until the day a restore was needed. Certificates expired because the renewal job had been misconfigured since installation. The database had been running in degraded mode since a node died three weeks ago, and nobody looked at the health metric.

These are not sexy incidents. They are boring. And they are the overwhelming majority of what happens in a poorly managed environment. The method for fixing them is equally boring — and it is the single most valuable thing a managed cloud provider can do for a customer.

Every managed environment we run goes through a systematic audit to find silent failure modes. We look for:

  • Anything that is supposed to run on a schedule. Backups, snapshots, cleanup jobs, certificate renewals, database maintenance, log rotation. Every one of these gets a "last successful run" metric and an alert that fires if it's too old (a minimal version of that check is sketched after this list).
  • Anything with a finite resource. Disk, memory, connection pools, inode tables, file descriptor limits, API rate quotas. Every one of these gets a threshold alert at 75 percent, not 95 percent. You want to know before it's a fire.
  • Anything redundant. If you have two of something for high availability, you need an alert when you have one of them. Because "we have two" quietly becomes "we have one" and nobody notices until the remaining one dies.
  • Anything that depends on an external service. DNS, certificate authorities, upstream APIs, package mirrors. These break in ways that only show up when the dependent thing tries to use them. Active synthetic checks are the only way to catch it early.
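
Most of these checks reduce to one tiny pattern: record when something last succeeded, and page when that record gets too old. Below is a minimal sketch of the first item in Python; the job names, marker-file paths, and thresholds are invented for illustration, and in a real environment the "last success" signal usually lives in the monitoring system rather than in local files:

    #!/usr/bin/env python3
    # Minimal freshness check: page if a scheduled job has not succeeded recently.
    # Job names, marker paths, and max ages below are illustrative.
    import sys
    import time
    from pathlib import Path

    # Each job touches its marker file only after a successful run.
    CHECKS = {
        "nightly-backup": ("/var/run/jobs/backup.last_success", 26 * 3600),        # daily, with slack
        "cert-renewal":   ("/var/run/jobs/cert_renewal.last_success", 8 * 86400),  # weekly, with slack
        "db-maintenance": ("/var/run/jobs/db_maint.last_success", 2 * 86400),
    }

    def stale_jobs():
        now = time.time()
        for name, (marker, max_age) in CHECKS.items():
            path = Path(marker)
            if not path.exists():
                yield name, "never succeeded (marker file missing)"
            elif now - path.stat().st_mtime > max_age:
                hours = (now - path.stat().st_mtime) / 3600
                yield name, f"last success {hours:.1f} hours ago"

    def main():
        failures = list(stale_jobs())
        for name, reason in failures:
            print(f"ALERT {name}: {reason}")
        sys.exit(1 if failures else 0)  # non-zero exit lets the monitor page on it

    if __name__ == "__main__":
        main()

Run something like this from the monitoring side every few minutes, and a cron job that quietly died two months ago becomes a daytime ticket instead of a 3 AM restore emergency.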

The first month of a new managed engagement is typically dominated by this work. It is not glamorous, but it closes the failure modes that were going to cause the next three outages before they happen.

Method 2: Playbooks for the things that will happen anyway

You cannot prevent every incident. Hardware fails, ISPs drop, third-party APIs have bad days, certificates expire in rare weird ways, and sometimes a deploy goes wrong and has to be rolled back at 2 AM. The question is not "how do we prevent every incident?" It is "when something does happen, how fast do we recover?"

The method here is playbooks — written, tested, specific procedures for the failure modes that are likely given the shape of the environment. Not generic "how to triage an incident" documents. Actual "if the database primary is unreachable, here are the exact steps, the exact commands, the exact decision points" documents.

A good playbook has three properties. It is specific enough that a tired engineer at 3 AM can follow it without having to think. It has been rehearsed recently enough that the commands still work. And it includes the judgment calls — when to fail over, when to wait, when to wake up the customer, when to roll back.
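
To make that concrete, here is a condensed, hypothetical excerpt of what one entry might look like, assuming a PostgreSQL primary/replica pair; the thresholds are invented for illustration, and the real document carries the exact commands and hostnames for that environment:

    DB-01: Database primary unreachable
    1. Confirm it is the database, not the monitoring: run pg_isready against the
       primary from two different hosts. If both fail, continue.
    2. Check the replica: confirm SELECT pg_is_in_recovery() returns true and that
       replication lag is under 30 seconds.
    3. Decision point: if the primary has been down for under 5 minutes and the
       cause looks transient (host reboot, brief network blip), wait and re-check.
       Otherwise, proceed to failover.
    4. Promote the replica, repoint the application's connection string, and verify
       that writes succeed.
    5. Wake the customer only if failover fails or data loss is possible. Otherwise,
       write the incident note and schedule the rebuild of the old primary as a
       replica for the next change window.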

We maintain playbooks for each managed environment covering the top twenty or so failure modes for that specific stack. When one of them fires, the on-call engineer does not have to reinvent the response. They open the playbook, follow the steps, and are back in bed in forty minutes instead of four hours.

There is a secondary effect that matters just as much. When a playbook exists, incidents get shorter, and when incidents get shorter, the on-call rotation is less exhausting, and engineers stay longer. High turnover on an on-call team is itself a cause of incidents — fresh engineers miss things. A thick playbook library is partially a retention tool.

Method 3: Capacity decisions made in advance, not during incidents

The worst time to decide whether to add capacity is during an outage. You are under pressure, you cannot see clearly, and whatever you provision in a panic is probably not the right size or the right configuration. But that is exactly when most ad-hoc cloud environments make capacity decisions, because nobody had the cycles to do it in advance.

The method is to do the capacity planning on Tuesday afternoon, not Saturday at midnight. Every managed environment gets a monthly capacity review. We look at utilization trends over the last thirty and ninety days, projected workload changes the customer has told us about, and the leading indicators that usually predict a scaling problem — connection pool saturation, garbage collection time, queue depth, tail latency. Based on that, we make capacity decisions before the problem becomes visible.
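
A sketch of what that trend review reduces to, assuming daily utilization samples are already being collected; the sample values, the limit, and the straight-line growth assumption are all illustrative:

    # Rough capacity projection: fit a linear trend to recent utilization samples
    # and estimate when the resource crosses its limit. Numbers are illustrative.
    from datetime import date, timedelta

    # Daily database size samples in GB, oldest first (a real review uses 90 days).
    samples = [412, 415, 419, 424, 430, 433, 439, 446, 451, 455]
    limit_gb = 2000  # e.g. the instance's storage ceiling

    # Least-squares slope over the sample window, in GB per day.
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
             / sum((x - mean_x) ** 2 for x in xs))

    current = samples[-1]
    if slope <= 0:
        print("No growth trend; nothing to plan.")
    else:
        days_left = (limit_gb - current) / slope
        hit_date = date.today() + timedelta(days=round(days_left))
        print(f"Growing ~{slope:.1f} GB/day; at this rate the {limit_gb} GB limit "
              f"is reached around {hit_date} (~{days_left / 30:.0f} months). "
              f"Plan the upgrade now, not then.")

The same projection is what feeds the "you'll hit the limit in about seven months" note described below.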

What this looks like in practice:

  • Scheduled headroom increases before known peak periods, not during them. If a retail customer's traffic doubles in Q4, the capacity is in place on November 1, not on Black Friday at noon.
  • Proactive replacement of saturated components before they become the binding constraint. If a database is at 80 percent of its connection pool limit on average, we upgrade it or add a read replica on a scheduled change window, not after it starts timing out.
  • Early warning when growth is outpacing capacity. The customer gets a monthly note that says "at current growth, you'll hit the current database size limit in about seven months — let's plan the upgrade now."

None of this prevents every scaling incident. But it moves the capacity decisions from the high-stress, low-visibility moments to the calm, well-informed ones, and that alone cuts the number of capacity-driven pages to nearly zero.

The thing nobody talks about

Managed cloud is sold on features — tools, dashboards, SLAs, certifications. But if you ask a customer two years into a good managed engagement what changed, they will not talk about any of those things. They will talk about the fact that they sleep through the night now. That the on-call rotation stopped being the thing everybody dreaded. That the team started taking real vacations again because nothing was going to catch fire while they were gone.

That is the benefit. The three methods above are how you get it. Everything else is packaging.

If you are evaluating a managed cloud provider, ask them not about their tooling but about their page count. How many incidents did they run last month for similar customers? How many of those were silent failures caught after the fact versus caught by alerting? How many playbooks do they maintain? How often do they do capacity reviews? The answers will tell you more than any product demo.

Talk with us about your infrastructure

Schedule a consultation with a solutions architect.
