Terraform and Azure - Automated Deployment of Site To Site VPNs

The creation of an Azure Site to Site VPN is (even by Software Defined Networking standards)…involved. This isn’t a problem unique to Azure, and it isn’t helped by vendors’ desire to give every component an unusual name rather than use the terminology that already exists. Setup is a very manual and time-consuming process; however, Terraform can completely automate and codify it.

Example code for this post can be found on my GitHub here.

Before Jumping In…

We need to define the usual components: the Local Gateway (usually an on-premises firewall), the VPN Gateway (Azure’s VPN Gateway) and the Connection (the VPN connection between the two). All three of these need to be defined in Azure, which can lead to some confusion: on the surface you might assume that the Local Gateway has no business being defined in Azure since it’s not a Cloud item (not to mention the various SKU oddities that crop up along the way).

Despite the Local Gateway being defined in Azure, this isn’t some kind of magic self-configuring and self-routing VPN; you will still need to configure your actual local device(s) to do their part. Microsoft have tried to provide a good chunk of assistance with configuration guides for supported devices in their documentation (though I know from experience that “unsupported” devices will work with varying degrees of success as long as you can make the protocols and proposals match).

It is also critical to know that Azure requires a dedicated Transport Subnet named GatewaySubnet inside the Address Space your VNet has been created in. I’m giving it an entire /24 here, although Microsoft’s documentation recommends /27 or larger. If this isn’t in place when you attempt to create your first VPN you’ll get nowhere.
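
As a rough sketch of what that subnet looks like in Terraform (ideally managed in a separate configuration from the VPN itself), something like the below should do. The resource label gateway is my own, the names and address range are placeholders matching the defaults used later in this post, and address_prefix is the argument as I understand it for the 2.x AzureRM provider pinned here; newer provider versions use address_prefixes instead:

```hcl
#--Hypothetical sketch: the dedicated gateway subnet, created ahead of the VPN
resource "azurerm_subnet" "gateway" {
    name                 = "GatewaySubnet" #--Azure requires this exact name
    resource_group_name  = "tinfoil_network_rg"
    virtual_network_name = "tinfoil_vnet"
    address_prefix       = "10.0.3.0/24" #--Must sit inside the VNet's address space
}
```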

Finally, I’m assuming that authentication is going to be done with Pre-Shared Keys of a good length. Since the key needs to be pre-shared, I’m going to have it entered at run time rather than randomly generated using Terraform’s pseudorandom generation utilities.
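
For comparison, the randomly generated route I’m not taking would look roughly like this. It’s only a sketch: it assumes the hashicorp/random provider is available, and the resource label vpn_psk and the length of 32 are my own illustrative choices:

```hcl
#--Hypothetical alternative: let Terraform generate the PSK instead of entering it
resource "random_password" "vpn_psk" {
    length  = 32
    special = false #--Alphanumeric only, to sidestep character restrictions on the local device
}

#--The key would then be referenced as random_password.vpn_psk.result
```

You would still need to pull the generated value out of the state and hand it to whoever configures the local device, so it doesn’t really save a step here.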

How Does The VPN Look?

According to Microsoft, the VPN should look something like this:

Simple right?

…except that simplistic view of things isn’t exactly how anything works, how could it? The Local Network Gateway isn’t a real device, it’s just a digital representation of a real network appliance. We’re also not seeing any mention of our transport subnet. It’s more reasonable to say that the real setup looks like:

Not as pretty, but accurate

Let’s Try and Make Something

With all of this in mind, let’s try and make something.

The code can get a little long to read for a simple blog entry, so let’s just look at automating the creation of a single VPN entry. Adding loops and counts is simple enough but is only going to confuse the matter right now.

Below is the standard provider.tf, simple enough, just a single Provider for AzureRM:

#--provider.tf

provider "azurerm" {
    version = "=2.1.0"
    features {}
    subscription_id = var.subscription_id
    tenant_id       = var.tenant_id
    client_id       = var.client_id
    client_secret   = var.client_secret
}
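
One thing to be aware of: from Terraform 0.13 onwards, pinning the version inside the provider block is deprecated in favour of a required_providers block. If you’re on a newer release, a sketch of the equivalent pin (assuming the standard hashicorp/azurerm registry source, and a file name of my own choosing) would be:

```hcl
#--versions.tf (hypothetical file name)

terraform {
    required_providers {
        azurerm = {
            source  = "hashicorp/azurerm"
            version = "=2.1.0"
        }
    }
}
```

With this in place the version line can be dropped from the provider block.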

As usual, we want to define as much as possible in variables. This will aid with parameterisation and allow us to scale the routine if we want to add loops and counts later:

#--variables.tf

#--Primary Location
variable "location" {
    type        = string
    description = "Primary Location"
    default     = "uksouth"
}

#--Subscription
variable "subscription_id" {
    type        = string
    description = "Subscription id"
}

#--Tenant
variable "tenant_id" {
    type        = string
    description = "Tenant id"
}

##############################
#---Auth and Secret Params---#
##############################

#--Service Principal AppID
variable "client_id" {
    type        = string
    description = "Client id"
}

#--Service Principal Secret
variable "client_secret" {
    type        = string
    description = "Client secret"
}

#--VPN Pre-Shared Key
variable "vpn_psk" {
    type        = string
    description = "VPN PSK"
}

#####################
#---Deploy Params---#
#####################

#--Resource Groups
variable "resource_group" {
    description = "Resource Group"
    type        = string
    default     = "tinfoil_network_rg"
}

#--Base VNet
variable "vnet" {
    description = "Base vnet"
    type        = string
    default     = "tinfoil_vnet"
}

#--Subnet Address Spaces
variable "peer_subnet_address_spaces" {
    description = "All peer subnets"
    type        = list(string)
    default     = ["172.16.1.0/24"]
}

#--Transport Subnet Address Space
variable "transport_subnet_address_space" {
    description = "Transport subnet address space"
    type        = list(string)
    default     = ["10.0.3.0/24"]
}

#--VPN Gateway
variable "vpn_gateway" {
    description = "VPN Gateway"
    type        = string
    default     = "tinfoil_vpn_gateway"
}

#--Peer VPN Gateway
variable "peer_vpn_gateway" {
    description = "Peer VPN Gateway"
    type        = string
    default     = "madcaplaughs_vpn_gateway"
}

#--VPN Connection
variable "vpn_connection" {
    description = "VPN Connection"
    type        = string
    default     = "tinfoil_vpn_connection"
}

#--VPN Public IP
variable "vpn_public_ip" {
    description = "VPN Public IP"
    type        = string
    default     = "tinfoil_vpn_ip"
}
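
Since I’m relying on the PSK being of a good length, it’s worth knowing that newer Terraform releases can enforce that for you. The below is a sketch of a replacement for the vpn_psk declaration above rather than part of the original code: validation blocks need Terraform 0.13 or later, sensitive needs 0.14 or later, and the 32 character minimum is an arbitrary figure of mine:

```hcl
#--VPN Pre-Shared Key (hypothetical hardened version)
variable "vpn_psk" {
    type        = string
    description = "VPN PSK"
    sensitive   = true #--Keeps the key out of plan and apply output

    validation {
        condition     = length(var.vpn_psk) >= 32
        error_message = "The VPN PSK must be at least 32 characters long."
    }
}
```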

With everything in place, we can now use our main.tf for the deployment of the Azure VPN components. There are a few things to be aware of, so I’ve added comments in-line:

#--main.tf

data "azurerm_subnet" "tinfoilvpn" { #--We need to look this up as we need to get the ID of the Transport Subnet
    name                 = "GatewaySubnet" #--Subnets are looked up by name, not by CIDR, and Azure requires this exact name
    resource_group_name  = var.resource_group
    virtual_network_name = var.vnet
}

resource "azurerm_local_network_gateway" "madcaplaughs" {
    name                = var.peer_vpn_gateway
    location            = var.location
    resource_group_name = var.resource_group
    gateway_address     = "xx.xx.xx.xx" #--Your local device public IP here
    address_space       = var.peer_subnet_address_spaces
}

resource "azurerm_public_ip" "tinfoilvpn" {
    name                = var.vpn_public_ip
    location            = var.location
    resource_group_name = var.resource_group
    allocation_method   = "Dynamic" #--Dynamic means Azure will assign an IP to your Azure VPN Gateway
}

resource "azurerm_virtual_network_gateway" "tinfoilvpn" {
    name                    = var.vpn_gateway
    location                = var.location
    resource_group_name     = var.resource_group
    type                    = "Vpn" #--Other option is ExpressRoute, predictably for ExpressRoute circuits
    vpn_type                = "RouteBased" #--PolicyBased is also acceptable here, depending on your use case
    active_active           = false
    enable_bgp              = false
    sku                     = "Basic" #--A whole load of oddities occur around SKUs, see MS Docs for details
    ip_configuration {
        public_ip_address_id          = azurerm_public_ip.tinfoilvpn.id
        private_ip_address_allocation = "Dynamic"
        subnet_id                     = data.azurerm_subnet.tinfoilvpn.id #--There's that ID we needed, for the Transport Subnet
    }
}

resource "azurerm_virtual_network_gateway_connection" "tinfoilvpn" {
    name                       = var.vpn_connection
    location                   = var.location
    resource_group_name        = var.resource_group
    type                       = "IPsec"
    virtual_network_gateway_id = azurerm_virtual_network_gateway.tinfoilvpn.id
    local_network_gateway_id   = azurerm_local_network_gateway.madcaplaughs.id
    shared_key                 = var.vpn_psk #--Provided at run time
}
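
You’ll need the public IP of the Azure VPN Gateway to configure your local device, and because the allocation is Dynamic it isn’t known until the gateway exists. A sketch of one way to surface it, assuming a data source lookup once the gateway has been built (the output name vpn_gateway_public_ip is my own):

```hcl
#--Hypothetical addition: read the IP back once the gateway has been provisioned
data "azurerm_public_ip" "tinfoilvpn" {
    name                = azurerm_public_ip.tinfoilvpn.name
    resource_group_name = var.resource_group
    depends_on          = [azurerm_virtual_network_gateway.tinfoilvpn]
}

output "vpn_gateway_public_ip" {
    value = data.azurerm_public_ip.tinfoilvpn.ip_address
}
```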

Now when we terraform init we will download the AzureRM provider, and when we terraform apply, get ready for a very long wait as the provisioning of these resources takes a good long time. Seriously, expect it to take up to 30 minutes to provision the Azure Virtual Network Gateway, and then around 15-30 minutes further before Azure starts to show any traffic in or out. This isn’t a Terraform limitation, this is the speed of Azure:

Running all the way...

If we look in Azure now at our active VPN connections, we can see that the connection has been created, and our Remote and Local gateways are on either end of it (IP addresses redacted for privacy):

Future Considerations

I would also add that it’s ill-advised to link the creation of VNets, address spaces and subnets to the creation of the VPNs themselves. When you modify those configurations and reapply, the entire state will be modified and you will end up reprovisioning any and all VPNs defined by the configuration, and at around an hour per VPN that’s a tedious waste of time you could well do without.

After all, you don’t want to interrupt services or waste your time watching progress counters tick along forever!

Written on May 28, 2020