Musings of a PC

Thoughts about Windows, TV and technology in general

Automated AWS EBS expansion with xMatters

The complete workflow in xMatters

Introduction

xMatters is a powerful incident management tool, used by Linaro to take various sources of alarms and coordinate them into alerts for the appropriate on-call staff. There are many such tools on the market; xMatters was picked mainly because custom steps can be written in JavaScript, which makes its workflows very flexible.

Linaro uses Amazon Web Services (AWS) for most of its infrastructure, predominantly EC2 instances. One of the tasks I often find myself dealing with is responding to a low free disc space alarm. The alarm is generated by CloudWatch Alarms as a result of metrics submitted by the CloudWatch Agent running on each instance. The alarm is delivered via an SNS topic directly to an xMatters webhook, and I then get notified on my phone and by email.

Increasing the size of an EBS volume isn’t hard or onerous – expand the EBS volume then get the OS to grow the corresponding partition (if required) and resize the filing system. Ideally, though, I wouldn’t have to do it manually – particularly if the alarm goes off at 3am!

This article looks at the challenges around automating the process and how it has been solved with an xMatters workflow. Whilst there are many ways that the process could be automated, e.g. writing a script in Lambda, I wanted to solve it entirely within xMatters. That way, if the process completes automatically, nothing further happens, but if it fails at any step, an xMatters event is still created.

Assumptions and prerequisites

  • This workflow has only been tested on Ubuntu instances. It should work on other variants of Linux but not with Windows due to the hard-wired commands being run at certain points.
  • For any non-NVMe volumes, there is an assumption that AWS sees the device as sdf but the operating system sees the device as xvdf.
  • For NVMe volumes, there is an assumption that only the root device will have a partition on it.
  • Instances will need to have the SSM agent running on them in order to be able to execute commands on the operating system.

Not quite a blank canvas

The workflow starts from the xMatters CloudWatch Integration. This provides the framework to receive the alarm via SNS and then process it.

Authentication

The first thing to deal with is credentials for any of the interactions with AWS. For a simple environment, an AWS IAM user could be created with appropriate permissions and the static access key and secret key then used. For Linaro’s environment, that isn’t going to work. We have multiple accounts so we’d need a user per account and that then becomes more unwieldy when it comes to using those credentials within xMatters.

To solve the multiple account issue, roles are used instead, with an account being able to assume the role. That still requires a user with static credentials, which is not ideal, particularly when the recommendation is to rotate those credentials on a regular basis. A suitable IAM policy for the role is:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Sid": "VisualEditor0",
             "Effect": "Allow",
             "Action": [
                 "ec2:DescribeInstances",
                 "ec2:DescribeVolumes",
                 "ec2:DescribeVolumesModifications",
                 "ec2:ModifyVolume",
                 "ssm:GetCommandInvocation",
                 "ssm:SendCommand"
             ],
             "Resource": "*"
         }
     ]
}

To avoid the need to have an AWS IAM user for the workflow, Linaro uses HashiCorp Vault instead. There is a single Vault AWS IAM user, with the Vault software rotating the access key regularly so that it is kept safe. To use this approach, a step was written in the workflow that has inputs for the Vault authentication values plus the desired Vault role to assume. The step outputs the STS-provided access key, secret key and session token.

Vault-AssumeAWSRole

By using xMatters’ ability to merge free text with values from other steps, a Vault role value can be provided that is a combination of the AWS account ID for the affected volume plus the fixed string -EBSResizeAutomationRole.
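As a rough illustration of that naming convention, the role name could be built like this. The function name is hypothetical, and the real workflow does this with xMatters' own value merging rather than code; the 12-digit check simply reflects the format of AWS account IDs.

```python
def vault_role_for_account(account_id, suffix="-EBSResizeAutomationRole"):
    """Build the Vault role name for the account that owns the volume.

    Hypothetical helper: the workflow itself merges these values in the
    xMatters step configuration instead of running code like this.
    """
    account_id = str(account_id).strip()
    # AWS account IDs are 12-digit numbers, so reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError(f"not an AWS account ID: {account_id!r}")
    return account_id + suffix
```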

Getting details from the SNS message

In order to grow the affected EBS volume, the following information is needed:

  • The AWS account ID
  • The AWS region
  • The EC2 instance ID
  • The device name on the instance for the volume

The account ID is provided as an output value (AWSAccountId) from the SNS step in the workflow. The other values need a further custom step script in order to extract the information from the Trigger Dimensions (Trigger.Dimensions from the SNS step) and the SNS Topic ARN (TopicArn).

Inputs on SNS-ExtractValues step
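The extraction logic is small enough to sketch. The real step is written in JavaScript inside xMatters; the Python below only illustrates the idea. It assumes the dimensions arrive as a list of name/value pairs and that the CloudWatch Agent's disk metrics carry dimensions called InstanceId and device; both names are assumptions here, and the function name is hypothetical.

```python
def extract_alarm_values(topic_arn, dimensions):
    """Pull the region, instance ID and device name out of alarm data."""
    # An SNS topic ARN has the form arn:aws:sns:<region>:<account-id>:<topic>.
    parts = topic_arn.split(":")
    if len(parts) < 6 or parts[2] != "sns":
        raise ValueError(f"not an SNS topic ARN: {topic_arn!r}")

    # Flatten the [{"name": ..., "value": ...}] list into a plain dict.
    dims = {d["name"]: d["value"] for d in dimensions}

    return {
        "region": parts[3],
        # Assumed dimension names; adjust to match the alarm's metric.
        "instance_id": dims["InstanceId"],
        "device": dims["device"],
    }
```

Taking the region from the topic ARN works as long as the SNS topic lives in the same region as the instance that raised the alarm.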

There is one further piece of information required – the volume ID – but that isn’t provided by CloudWatch as part of the alarm information. There are two potential options to get it – use the alarm description as an additional field and store it there or script a solution. The former is easier to use but has the drawback that if the underlying volume’s ID ever changes someone needs to remember to update the alarm description. The latter, as will be explained, is rather tricky to solve …

Getting the volume ID

On the face of it, getting the volume IDs for attached volumes on an EC2 instance looks like being a straightforward task. The describe-instances API call returns the details of the attached blocks, like this:

<blockDeviceMapping>
    <item>
        <deviceName>/dev/sda1</deviceName>
        <ebs>
            <volumeId>vol-0f5bc3e51714c5d5f</volumeId>
            <status>attached</status>
            <attachTime>2020-11-18T11:07:51.000Z</attachTime>
            <deleteOnTermination>true</deleteOnTermination>
        </ebs>
    </item>
</blockDeviceMapping>

So, for a given device name from the SNS topic, it should be simple enough to find the volume ID … except for the fact that the device name given in the SNS topic never matches the device name in the block device mapping information. Sometimes, it is quite straightforward to resolve – the block device mapping uses a name like “/dev/sda1” and the SNS topic’s device name is “xvda1”. Consistent and easy to code around, so long as that mapping is the correct assumption to make.
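For the straightforward case, the lookup might be sketched as follows. It assumes the operating system reports an xvd-prefixed name (in line with the sdf/xvdf assumption listed earlier) and that the block device mappings have the shape returned by boto3's describe_instances. The function name is hypothetical, and the fallback that strips a partition number is an addition for illustration, not something the workflow is known to do.

```python
def find_volume_id(os_device, block_device_mappings):
    """Map an OS device name such as 'xvda1' to an EBS volume ID."""
    if not os_device.startswith("xvd"):
        raise ValueError(f"unexpected device name: {os_device!r}")

    # Translate the OS name into the AWS-side name: xvda1 -> /dev/sda1.
    api_name = "/dev/sd" + os_device[len("xvd"):]

    # Try the exact name first, then (as an assumption) the same name
    # without a trailing partition number: /dev/sdf1 -> /dev/sdf.
    candidates = [api_name, api_name.rstrip("0123456789")]

    for candidate in candidates:
        for mapping in block_device_mappings:
            if mapping["DeviceName"] == candidate:
                return mapping["Ebs"]["VolumeId"]
    return None  # Nothing matched; better to raise an alert than guess.
```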

The introduction of NVMe block devices on Nitro systems is a completely different kettle of fish, though. For example, the block device mapping example above clearly states that the device name is “/dev/sda1”. What is provided in the SNS topic? “nvme0n1p1”

The implemented solution is to use Systems Manager (SSM) to run a command on the affected instance so that the NVMe device can be translated to the associated volume ID. This does require that the SSM agent is installed and running on the instance. If anyone knows a better solution that works on Ubuntu, do let me know!
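The command the workflow actually runs is not shown here, so the following is only one possible approach. On Nitro instances an EBS volume exposes its volume ID as the NVMe serial number, minus the hyphen. The sketch assumes that serial can be read from sysfs at the path shown, which is worth checking on your own instances; all three function names are hypothetical.

```python
import re


def nvme_base_device(device):
    """Strip any partition suffix: 'nvme0n1p1' -> 'nvme0n1'."""
    match = re.fullmatch(r"(nvme\d+n\d+)(p\d+)?", device)
    if match is None:
        raise ValueError(f"not an NVMe device name: {device!r}")
    return match.group(1)


def serial_command(device):
    """Shell command to send via SSM (assumed sysfs path, not verified here)."""
    return f"cat /sys/block/{nvme_base_device(device)}/device/serial"


def volume_id_from_serial(serial):
    """Convert a serial such as 'vol0f5bc3e5...' to 'vol-0f5bc3e5...'."""
    serial = serial.strip()  # Command output may carry padding or a newline.
    if serial.startswith("vol-"):
        return serial
    if not serial.startswith("vol"):
        raise ValueError(f"not an EBS volume serial: {serial!r}")
    return "vol-" + serial[len("vol"):]
```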

So, the workflow needs to look at the device name from the SNS alarm information and branch depending on that device name. Either route will then give us the corresponding volume ID. To make that branch operation simpler, the SNS-ExtractValues step also provides an output called NVMEdevice which is set to true if the device name starts with “nvme”, otherwise it is set to false.

Use the appropriate method to get the volume ID
Get the volume ID for an NVMe device
Get the volume ID for a non-NVMe device

Expanding the volume

Once the volume ID is known, the serious stuff can start.

AWS has the ability to allow an EBS volume to be expanded without any downtime … but it comes with the penalty that, after expansion has been requested, you have to wait for the background optimisation process to complete before you can request another expansion. Even then, there is a maximum modification rate per volume limit (which appears to be one), after which you have to wait at least 6 hours before trying to modify the volume again.

So, the first thing to be done is check the status of the volume and skip to raising an xMatters alert if the volume is still being optimised. If an already-expanded volume has run out of space that quickly, there may be a bigger problem for someone to investigate.

Check that the volume isn’t being optimised
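A minimal sketch of that check, assuming the response has the shape boto3 returns from describe_volumes_modifications (a VolumesModifications list whose entries carry a VolumeId and a ModificationState). The function name is hypothetical, and treating both "modifying" and "optimizing" as busy is a choice made for this example.

```python
# States in which a further modification request should not be attempted.
BUSY_STATES = {"modifying", "optimizing"}


def volume_is_busy(response, volume_id):
    """Return True if the volume still has a modification in progress."""
    for modification in response.get("VolumesModifications", []):
        if (
            modification.get("VolumeId") == volume_id
            and modification.get("ModificationState") in BUSY_STATES
        ):
            return True
    # No matching entry, or the last modification completed or failed.
    return False
```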

If the volume is not being optimised, the workflow moves on to modify the volume. The approach taken is to multiply the current size by a factor, e.g. 2 to double the volume’s size. Rather than hard-code that into the script, it can be configured as an input value.

EC2-ModifyVolume
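The size calculation itself can be illustrated in a few lines. This assumes sizes are expressed in whole GiB, so a fractional result is rounded up; the function name and the default factor are hypothetical.

```python
import math


def new_volume_size(current_gib, factor=2.0):
    """Return the target size in GiB for a volume being expanded."""
    if current_gib <= 0:
        raise ValueError("current size must be positive")
    if factor <= 1:
        # A factor of 1 or less would not grow the volume at all.
        raise ValueError("factor must be greater than 1")
    # Round up so the volume never ends up smaller than requested.
    return math.ceil(current_gib * factor)
```

For example, a 20 GiB volume with a factor of 2 becomes 40 GiB, and a 25 GiB volume with a factor of 1.5 becomes 38 GiB.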

If the request to modify the size fails because the modification rate per volume limit has been reached, an xMatters alert is raised.

Growing & resizing the filing system

To grow and resize the filing system, commands need to be run on the operating system of the affected instance. The “grow” part only needs to happen if the filing system is on a partition of a larger volume. This only seems to happen on the root device and, as such, there is already a tool installed on the server that can be used to grow the partition – cloud-init:

cloud-init single -n growpart

That command will grow the root partition if it needs to be grown. Otherwise, it will just exit without error.

Once that command finishes, the filing system itself is resized with:

resize2fs /dev/<device name>

Both of these commands are run by using the SSM Run Command functionality that was referenced earlier with regard to getting the volume ID for an NVMe device.

SSM-GrowAndResize
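As an illustration, the request for those two commands could be assembled like this. The sketch assumes the AWS-provided AWS-RunShellScript document is used and that the resulting dictionary is passed as keyword arguments to boto3's SSM send_command. The function name is hypothetical, and the device name check is an extra precaution added here rather than something the workflow is known to do.

```python
import re


def grow_and_resize_request(instance_id, device):
    """Build the send_command arguments to grow and resize a filing system."""
    # Only allow plain device names so nothing unexpected reaches the shell.
    if not re.fullmatch(r"[A-Za-z0-9]+", device):
        raise ValueError(f"unexpected device name: {device!r}")

    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
            "commands": [
                # Grows the root partition if it needs it, otherwise a no-op.
                "cloud-init single -n growpart",
                # Resizes the filing system to fill the device or partition.
                f"resize2fs /dev/{device}",
            ]
        },
    }
```

With a boto3 SSM client this would be used as `ssm.send_command(**grow_and_resize_request(instance_id, device))`, after which the command's status can be polled with get_command_invocation.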

If the commands do not succeed, an xMatters alert is created; otherwise the workflow ends and, eventually, CloudWatch will realise that the volume has been resized and clear the alarm.
