As with most sysadmins, one of my primary responsibilities is ensuring high availability in our environments. Recently, I’ve been working a lot more with AWS. Amazon recently began requiring new accounts to use VPC. When you create a VPC, an Internet Gateway must be provisioned to route traffic to the Internet. VPCs use subnets for virtual networking. Each subnet is assigned a routing table, and in the case of a public subnet, the default route of this table points at the Internet Gateway. Instances in a public subnet are assigned public, non-RFC 1918 Elastic IP addresses. At the moment, only 5 Elastic IP addresses may be requested per account. You can request more via support, but Amazon is obviously trying to wean people away from using them for everything. Consequently, NAT and supporting instances must be in place to facilitate external communication for non-public subnets.
In the case of these subnets, the default route should point at a NAT instance residing in the public subnet. This introduces a single point of failure: should the NAT instance go down, nothing in that subnet can reach the outside world, and the default route becomes a black hole. To combat this, multiple NAT instances can be provisioned in different availability zones and, with a little magic, configured to take over each other’s traffic-routing responsibilities on demand.
Amazon has furnished a document with a workaround for this situation. Essentially, a script running on each NAT instance performs a health check on the other NAT instance, and should the other instance go down, the healthy instance will take over. It does so by adjusting the routing tables via AWS API calls. The script will also attempt to bring the failed instance back online.
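The decision flow described above can be sketched roughly as follows. This is not Amazon’s script, only an illustration of its shape: every function name here (`peer_is_healthy`, `take_over_route`, `restart_peer`, `monitor_once`) is a hypothetical placeholder, and the two AWS-facing functions only echo what a real implementation would do through the API.

```shell
#!/bin/sh
# Sketch of the monitor's decision flow; NOT Amazon's actual script.
# All function names below are hypothetical placeholders.

# Health check: succeeds when the peer NAT instance answers pings.
# Override this function to dry-run the flow without touching the network.
peer_is_healthy() {
  ping -c 3 "$1" >/dev/null 2>&1
}

# Stand-ins for the AWS API calls (route replacement, stop/start).
# They only print the intended action.
take_over_route() { echo "replace-route: $1 -> this instance"; }
restart_peer()    { echo "restart: $1"; }

# One monitoring pass: if the peer is down, claim its route,
# then try to bring the peer back online.
monitor_once() {
  peer_ip=$1; peer_id=$2; peer_rtb=$3
  if peer_is_healthy "$peer_ip"; then
    echo "peer healthy"
  else
    take_over_route "$peer_rtb"
    restart_peer "$peer_id"
  fi
}
```

Each NAT instance would run a pass like `monitor_once` in a loop against the other instance’s address, instance ID, and routing table ID.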
UPDATE: The NAT Monitor script outlined by Amazon has a flaw. The ec2-describe-instances call that determines the state of the other NAT instance does not function properly. The documentation references using $5 instead of $4 to set the NAT_STATE variable. I have found $6 to work best, but test this, because your EC2 API tools version might yield different results. I also highly suggest the --show-empty-fields argument, because if the number of fields changes, the awk statement could grab the wrong field.
NAT_STATE=`/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL --show-empty-fields | grep INSTANCE | awk '{print $6;}'`
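Since the right column seems to depend on the tools version, one alternative is to pick the state out by value instead of by position. The helper below is a sketch under that assumption: `nat_state` is a hypothetical name, and it assumes no other field on the INSTANCE line (a key pair name, for example) happens to equal one of the EC2 state names.

```shell
# Pick the instance state out of ec2-describe-instances output by value
# rather than by column number. Reads the command's output on stdin.
# nat_state is a hypothetical helper, not part of Amazon's script.
nat_state() {
  awk '$1 == "INSTANCE" {
         for (i = 2; i <= NF; i++)
           if ($i ~ /^(pending|running|shutting-down|terminated|stopping|stopped)$/) {
             print $i
             exit
           }
       }'
}

# Possible use with the call from the monitor script:
# NAT_STATE=$(/opt/aws/bin/ec2-describe-instances $NAT_ID -U $EC2_URL --show-empty-fields | nat_state)
```

If no state name is found, the function prints nothing, so an empty NAT_STATE is worth treating as an error rather than as "stopped".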
There is one issue with the configuration outlined by the Amazon document: the IAM role’s permissions are too loose. Using the policy defined in the document, the NAT instance is granted permission to restart every instance belonging to the account. Additionally, the NAT instance could modify any and all routing tables, including those in other regions and VPCs. You probably don’t want your NAT instances in us-west-2 making any modifications whatsoever to us-east-1. The policy below is an attempt to restrict permissions as far as the supported IAM policy conditions permit. Just replace the region and VPC information with your own. Also, tag the NAT instances with ‘Type’ and ‘VPC’ tags, setting ‘Type’ to ‘NAT’ and ‘VPC’ to the VPC’s ID.
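As a sketch of the tagging step, the helper below prints the command for one instance instead of running it, so it can be reviewed first. It uses the newer `aws` CLI rather than the EC2 API tools used elsewhere in this post, which is my substitution; `nat_tag_command` and the IDs in the usage comment are hypothetical.

```shell
# Print (do not run) the command that applies the 'Type' and 'VPC' tags
# to one NAT instance. nat_tag_command is a hypothetical helper name.
nat_tag_command() {
  instance_id=$1; vpc_id=$2
  echo "aws ec2 create-tags --resources $instance_id --tags Key=Type,Value=NAT Key=VPC,Value=$vpc_id"
}

# Example with made-up IDs:
# nat_tag_command i-0abc1234 vpc-abcd1234
```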
Restricted IAM Policy
{
  "Statement":[
    {
      "Sid":"DescribeStuff",
      "Action":[
        "ec2:DescribeInstances"
      ],
      "Effect":"Allow",
      "Resource":"*",
      "Condition":{
        "StringLike":{
          "ec2:Region":"us-west-2",
          "ec2:ResourceTag/VPC":"vpc-abcd1234"
        }
      }
    },
    {
      "Sid":"RoutingTableAccess",
      "Action":[
        "ec2:CreateRoute",
        "ec2:ReplaceRoute"
      ],
      "Effect":"Allow",
      "Resource":"*",
      "Condition":{
        "StringEquals":{
          "ec2:Region":"us-west-2"
        }
      }
    },
    {
      "Sid":"NATInstanceControl",
      "Action":[
        "ec2:StartInstances",
        "ec2:StopInstances"
      ],
      "Effect":"Allow",
      "Resource":"arn:aws:ec2:us-west-2:*",
      "Condition":{
        "StringLike":{
          "ec2:ResourceTag/Type":"NAT",
          "ec2:ResourceTag/VPC":"vpc-abcd1234"
        }
      }
    }
  ]
}
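To substitute your own region and VPC ID, one option is to save the policy above to a file and run it through sed. This is a minimal sketch: `render_policy` and the file name in the usage comment are hypothetical, and it assumes the sample values `us-west-2` and `vpc-abcd1234` appear in the file exactly as shown.

```shell
# Replace the sample region and VPC ID in a saved copy of the policy
# and write the result to stdout. render_policy is a hypothetical helper.
render_policy() {
  region=$1; vpc_id=$2; template=$3
  sed -e "s/us-west-2/$region/g" \
      -e "s/vpc-abcd1234/$vpc_id/g" \
      "$template"
}

# Example with made-up values and file names:
# render_policy eu-west-1 vpc-0123abcd nat-policy.json > nat-policy-eu.json
```

Because the region appears in both the conditions and the resource ARN, a global replace keeps the two in step.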