Understanding High Availability for VMs in Azure using Availability Sets

If our on-premises datacentre is a single point of failure for our business, we might consider moving our applications to Azure virtual machines (VMs). Let’s assume that we decide to migrate on-premises VMs to a single datacentre in an Azure region. Migrating correctly requires planning to minimize future downtime. Downtime includes unforeseen component failures as well as planned and unplanned server updates. There are also requirements that must be met to qualify for the Azure SLA.

The first step is to create an availability set, followed closely by creating each VM in that availability set. This helps us in two ways:

  1. VMs are physically distributed across separate hardware, known as fault domains. Within a fault domain, dependent resources such as CPU, network switches and power supplies are placed together as a unit in the datacentre.
  2. VMs are logically distributed across update domains so that Microsoft can perform planned and unplanned updates while carefully avoiding updating all VMs for a single application simultaneously.

Let’s imagine we have a very busy on-premises website that we want to migrate to Azure VMs with the following requirements:

  • 20 web servers
  • A maximum of 3 servers should be offline at any one time in the event of an unplanned component failure (not a regional or datacentre failure).
  • A maximum of 2 servers should be offline at any one time for planned maintenance.

If you read the documentation, then for each availability set, by default, there are 2 fault domains and 5 update domains. This means that if we create an availability set, the first 2 VMs are distributed on separate hardware, but any further VMs are distributed back across the same hardware. Likewise, the first 5 VMs are logically distributed such that they are never updated or rebooted at the same time. Any further VMs are evenly distributed alongside the previous 5.

If we have many VMs, we can configure an availability set with up to 3 fault domains and 20 update domains.
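To make the distribution concrete, here is a minimal Python sketch. It assumes a simple round-robin placement, where VM number i lands in fault domain i mod F and update domain i mod U. That is a simplified model for illustration, not a statement of how Azure actually schedules VMs, and the function names `place_vms` and `worst_case_offline` are made up for this example.

```python
from collections import Counter


def place_vms(vm_count, fault_domains=2, update_domains=5):
    """Assign each VM a (fault domain, update domain) pair.

    Assumption: simple round-robin placement, used here only as a model.
    """
    return [(i % fault_domains, i % update_domains) for i in range(vm_count)]


def worst_case_offline(placement):
    """Return the largest number of VMs sharing one fault domain
    and the largest number sharing one update domain."""
    fd_counts = Counter(fd for fd, _ in placement)
    ud_counts = Counter(ud for _, ud in placement)
    return max(fd_counts.values()), max(ud_counts.values())


# 20 web servers in one availability set with the default 2 and 5 domains
print(worst_case_offline(place_vms(20)))  # (10, 4)

# The same 20 servers with the maximum of 3 fault and 20 update domains
print(worst_case_offline(place_vms(20, fault_domains=3, update_domains=20)))  # (7, 1)
```

Under this model a single availability set can satisfy the planned maintenance limit of 2, but one hardware failure could still take 7 servers offline.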

So is it possible to meet our requirements, and if so, how many availability sets do we need to create?

A single availability set with 3 fault domains would place up to 7 of our 20 VMs on the same hardware, which exceeds our limit of 3. So let’s consider the option of 3 availability sets. If we assume that none of the VMs in availability sets q and r are in the same rack as VM1, then this appears to meet the requirement. Otherwise, having more than one availability set doesn’t seem to help us and we cannot meet the requirement.
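The arithmetic behind that conclusion can be sketched in Python. This is an illustration under stated assumptions: the 20 VMs are split as evenly as possible across the availability sets, each set uses 3 fault domains, and the `shared_racks` flag is a hypothetical switch for whether fault domains in different sets might sit on the same rack. None of these names come from Azure.

```python
import math


def split_evenly(vm_count, set_count):
    """Spread VMs across availability sets as evenly as possible."""
    base, extra = divmod(vm_count, set_count)
    return [base + (1 if i < extra else 0) for i in range(set_count)]


def worst_fault_domain_loss(vm_count, set_count, fault_domains=3, shared_racks=False):
    """Worst-case number of VMs lost when a single rack fails.

    Assumption: if racks may be shared between availability sets, the
    busiest fault domain of every set could fail together.
    """
    per_set = [math.ceil(n / fault_domains) for n in split_evenly(vm_count, set_count)]
    return sum(per_set) if shared_racks else max(per_set)


for sets in (1, 2, 3):
    isolated = worst_fault_domain_loss(20, sets)
    shared = worst_fault_domain_loss(20, sets, shared_racks=True)
    print(f"{sets} set(s): isolated racks -> {isolated}, shared racks -> {shared}")
```

With 3 availability sets and isolated racks the worst case is 3 servers, which meets the requirement. If racks can be shared, the worst case rises to 8 and the requirement is not met.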

Microsoft is now rolling out availability zones, which means that we can additionally distribute our VMs across multiple datacentres in the same region.

The purpose of this article is to begin to understand how Microsoft provides high availability to VMs in a datacentre. We have assumed that we actually need to manually provision 20 VMs. If these VMs are identical, then why not use VM Scale Sets instead? Or migrate to platform-as-a-service Web Apps and scale out on CPU demand? Most importantly, we have avoided considering regional or datacentre outages. If we are serious about availability, then we must include a multi-region strategy using technologies such as Traffic Manager.

Availability is generally discussed in terms of business continuity rather than the number of servers that we need in case of failure. Hence the discussion generally focuses on SLA uptime, usually 99.9% or 99.95%.
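To put those percentages in perspective, this small Python sketch converts an uptime figure into the downtime it permits. It assumes a 365-day year purely for illustration; a real SLA is typically measured over a billing period, so check the SLA terms for the exact calculation.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(sla_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted by an uptime percentage over a period."""
    return period_hours * (1 - sla_percent / 100)


for sla in (99.9, 99.95):
    print(f"{sla}% uptime allows {allowed_downtime_hours(sla):.2f} hours of downtime per year")
```

Moving from 99.9% to 99.95% halves the permitted downtime, from roughly 8.76 hours to roughly 4.38 hours per year.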

Thanks for reading!
