Availability Engineering
Authored on 31 January, 2025 by Naresh V
Introduction
Availability is the measure of how long the site is "up" or online in a given period of time. The measurement period can be a week, a month, a quarter or a year. Usually expressed as a percentage, availability is one of the most essential metrics a business must direct its resources towards. For example, below are the uptime targets promised by some popular SaaS solutions:
  1. Google Workspace Productivity Suite: 99.9% monthly uptime [1]
  2. MS Office 365 Productivity Suite: At least 99.9% monthly uptime [2]
  3. Zoho Productivity Suite: 99.9% monthly uptime [3]
  4. Amazon S3 Object Storage: At least 99.9% monthly uptime [4]
The table below shows how much downtime is acceptable for each uptime target, rounded to the minute (or to the second where the budget is under a minute):
Uptime     Week         Month        Quarter       Year
99 %       1:40 hours   7:12 hours   21:36 hours   3 days, 15:36 hours
99.9 %     10 min       43 min       2:09 hours    8:45 hours
99.99 %    1 min        4 min        13 min        52 min
99.999 %   6 seconds    25 seconds   1 min         5 min
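The figures in the table come from multiplying the length of the period by the fraction of time the site is allowed to be down. The sketch below reproduces that arithmetic in Python. The period lengths (30-day month, 90-day quarter, 365-day year) are assumptions made here, and `downtime_budget` is a hypothetical helper name, not part of any library.

```python
from datetime import timedelta

# Assumed period lengths; adjust to match how your SLA defines each period.
PERIODS = {
    "week": timedelta(days=7),
    "month": timedelta(days=30),
    "quarter": timedelta(days=90),
    "year": timedelta(days=365),
}


def downtime_budget(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` for an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The allowed downtime is the complement of the uptime fraction.
    return period * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{name}: {downtime_budget(target, length)}"
            for name, length in PERIODS.items()
        )
        print(f"{target}% -> {row}")
```

For example, `downtime_budget(99.9, PERIODS["month"])` comes to roughly 43 minutes, in line with the table.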
A decrease in the availability of a site almost certainly translates to revenue loss. The size of the loss depends on the nature and volume of the business, but a dent is definitely made. Furthermore, a site that is down erodes customer satisfaction and pushes customers towards your competitors. Some businesses offer service credits to their customers as compensation for downtime.
However, it is impossible to achieve zero downtime. A fact that any IE engineer must always remember is that anything can fail at any time. The IE team must therefore implement workarounds for each probable cause of downtime so as to ensure business continuity.
What is High Availability?
High availability, put simply, is "having copies of everything elsewhere." Let us break down this statement.
Copies of everything - A website has several components depending on the functionality it provides, and each component has data. Any of these components can fail, and hence each must have a copy of its data. The scale of a failure can vary from a single virtual machine to an entire cloud provider region, impacting millions of VMs.
Store the copies elsewhere - This is the cloud-age version of not putting all your eggs in one basket. An online system whose data is constantly changing must not also be used to store its own backups. The backup must be stored in another system that is physically and logically separated from the primary system.
Planning High Availability
The degree of high availability is usually decided by the managers of the engineering teams together with the IE team. Not every system needs the same degree of availability. One of the major constraints on adding more nines to your availability target is money: having a redundant copy of each system increases your cloud expenditure. Hence, an organization's target availability is set based on the budget it allocates for site hosting and maintenance.
Once the availability discussions conclude, the agreed target must be recorded in your organization's Service Level Agreement (SLA) document.
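To weigh the cost of redundancy against the nines it buys, a rough estimate can help during planning. The sketch below uses the standard probability formulas for components in parallel (any one copy is enough) and in series (every component must be up). It assumes that failures are independent, which real systems only approximate, and the function names are hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(copy_availability: float, copies: int) -> float:
    """Availability of `copies` redundant copies when any single copy suffices.

    Assumes the copies fail independently of each other.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1 - (1 - copy_availability) ** copies


def serial_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a request path in which every component must be up."""
    return prod(component_availabilities)


if __name__ == "__main__":
    single_vm = 0.99
    print(parallel_availability(single_vm, 2))  # two VMs in different AZs
    print(serial_availability([0.999, 0.999]))  # e.g. load balancer + database
```

Under these assumptions, a second copy of a 99% component lifts the estimate to about 99.99%, while chaining two 99.9% components lowers the estimate for the whole path to about 99.8%.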
Availability Strategies in Brief
Before we venture into the strategies for achieving high availability, let us explore the two types of service configurations, based on where they keep their data.
Stateful services store data on the local disk and must inherently support availability features like data replication, hot standby and automatic failover.
Stateless services store their data, if any, in an external system like a DB or object storage.
The strategies for each type of component are as follows:
  1. Virtual Machines: The availability of a VM-hosted service depends on the application running inside it. Only stateless services and stateful distributed systems can be horizontally scaled. At least two VMs, each in a different availability zone (AZ), are required to achieve high availability (HA).
  2. Load Balancers: Load balancer software (Nginx, Apache HTTPD, HAProxy) is generally stateless and can be horizontally scaled.
  3. Databases: Databases are stateful. Hence, to achieve HA, one needs to use the DB's inbuilt replication features. Most popular RDBMS engines, like MySQL, PostgreSQL and Microsoft SQL Server, offer replication features for hot and cold standbys.
    In distributed databases like Cassandra or MySQL NDB Cluster, HA is achieved by adding more nodes to the cluster. Writes and reads can be served by any of the nodes, unlike in writer-reader style DBs.
  4. Caches: In-memory caches like Redis or Memcached are used to improve the performance of the web app. Caches can be considered both stateful and stateless. While they do hold state in their cache stores, this data is generally retrieved from a persistent database. Hence, if a cache instance crashes, the application can still fetch the data from the DB, albeit with an impact on website performance. DB-backed caches can therefore, in theory, be horizontally scaled behind a load balancer to achieve HA. However, the round-trip time (RTT) of requests may be erratic, as each request may encounter a cache hit or a cache miss, with the probability of a cache hit increasing over time.
  5. Object storage: Object storage services like Amazon S3, Azure Blob Storage and Google Cloud Storage are fully managed services and hence provide high availability within the region. To ensure availability despite a region failure, cross-region replication must be enabled.
  6. Zone Level HA: Zones are isolated data centres (DCs) within a region. DCs are typically built with redundant power supply, network connectivity and cooling, along with stringent security. However, in the event of an AZ/DC failure, all resources in that AZ/DC can go down, and data in that AZ can be permanently lost unless it was backed up elsewhere. Hence, applications must be spread across multiple AZs to guard against this possibility. [5][6][7]
  7. Region Level HA: Although rarer than a zone failure, a cloud region can fail in the case of larger events. Natural disasters like earthquakes and volcanic activity, or human activities like nuclear war, can cause a region failure. Adjacent regions are generally built in seismically isolated geographic areas.
    Cross region HA is achieved using a wide set of procedures collectively called the "Disaster Recovery Plan." [5][6][7]
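The cache behaviour described in item 4 can be illustrated with a small cache-aside sketch in Python. It is a simplified, in-process stand-in rather than a Redis or Memcached client: the `CacheAside` class, its `crash` method and the dictionary used as a "database" are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict


class CacheAside:
    """Toy DB-backed cache: serve from memory, fall back to the DB on a miss."""

    def __init__(self, load_from_db: Callable[[str], str]) -> None:
        self._load_from_db = load_from_db
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        if key in self._store:
            # Cache hit: fast path, no DB round trip.
            self.hits += 1
            return self._store[key]
        # Cache miss: slower path through the persistent database.
        self.misses += 1
        value = self._load_from_db(key)
        self._store[key] = value
        return value

    def crash(self) -> None:
        """Simulate losing the cache instance: every cached entry is gone."""
        self._store.clear()


if __name__ == "__main__":
    database = {"user:1": "Asha"}  # stand-in for the persistent DB
    cache = CacheAside(lambda key: database[key])

    cache.get("user:1")  # miss: loaded from the DB
    cache.get("user:1")  # hit: served from memory
    cache.crash()        # cache instance lost
    cache.get("user:1")  # miss again, but the data is still available
    print(f"hits={cache.hits} misses={cache.misses}")
```

After the simulated crash the data is still served correctly because the database remains the source of truth; only latency suffers until the cache warms up again, which is why hit probability rises over time.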
References
  1. https://workspace.google.com/terms/sla/
  2. https://www.microsoft.com/licensing/docs/view/Service-Level-Agreements-SLA-for-Online-Services
  3. https://www.zoho.com/security-faq.html
  4. https://aws.amazon.com/s3/sla/
  5. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-availability-zones
  6. https://cloud.google.com/compute/docs/regions-zones
  7. https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview