I have a SanDisk Sport Plus which I use to play audio books. It is a great little device but it has a couple of flaws:
When a book finishes, the player starts playing the same book all over again, rather than moving on to the next book.
There is a hard limit of 1,000 files of each filetype. If you exceed that limit, the player is unable to display all of the books, but you don’t get any errors.
So I’ve taken to combining the MP3 files for a single book into a single file, and then tagging each book for a series, e.g. James Bond, so that the device interprets each file as a separate “chapter” in the same book. This then results in the player playing one book (a single file) then moving on to the next book, and so on.
I found a really good primer for combining MP3 files losslessly but, when dealing with a lot of MP3 files, a simple cat command becomes unwieldy. Using a wildcard, e.g. *.mp3, only works if alphabetical order matches the intended playback order. With numbered files like “1”, “2”, …, “10”, “11”, the shell expands them alphabetically as “1”, “10”, “11”, “2”, etc.
So I’ve now devised a longer “recipe” which I’m documenting so that I don’t have to devise it all again in the future.
Create a list of all of the MP3 files: ls *.mp3 > list.txt
Use an editor to reorder the lines so that they are in numerical order
Combine the files: xargs -d "\n" -a list.txt cat | mp3cat - - > ~/tmp.mp3
Copy the metadata over from the first file: id3cp <first file> ~/tmp.mp3
This code should be fairly self-explanatory but, in summary, it takes the access key & secret key, region and request parameters and returns the appropriate request headers and the required host (endpoint) for the request.
AWS has different endpoints not just for each service but also for each region. So, for example, making an EC2 call for us-east-1 requires using ec2.us-east-1.amazonaws.com, while making a SSM call for eu-west-2 requires using ssm.eu-west-2.amazonaws.com. xMatters doesn’t allow scripts to dynamically reference endpoints. Instead, the endpoints must be separately configured as part of the workflow and the script then dynamically changes which xMatters endpoint is referenced:
So, we’re almost there. We can now execute an EC2 call like this:
var ec2Response = executeEc2Action(access_key, secret_key, region, "Action=ModifyVolume&Size="+volume_size+"&Version=2016-11-15&VolumeId="+volume_id);
A really important point to make here: the different keys used in the request_parameters function parameter must, repeat MUST, be in alphabetical order. In other words: Action, Size, Version, VolumeId. If they are not in alphabetical order, the call will fail with “AuthFailure – AWS was not able to validate the provided access credentials”.
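The ordering requirement comes from Signature v4 itself, which defines the canonical request as having the query parameters sorted by name. One way to sidestep the pitfall is to have the signing code sort a parameter dictionary itself when building the canonical query string. A Python sketch of the idea (the xMatters script is JavaScript, but the logic is the same):

```python
from urllib.parse import quote

def canonical_query_string(params):
    """Build the canonical query string for AWS Signature v4:
    parameters URL-encoded and sorted by key, so the caller
    doesn't have to remember the alphabetical-order rule."""
    return "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
```

Passing the ModifyVolume parameters in any order then yields the correctly ordered string Action=…&Size=…&Version=…&VolumeId=….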
In trying to troubleshoot that particular problem, I came across the AWS Signature v4 Calculator (com.s3-website-us-west-2.amazonaws.com), which shows the result of the signature calculations at each step, making it easier to pinpoint where the code might be wrong. If you find yourself debugging/troubleshooting in this area, just remember to keep the date & time the same on both the website and in the code: the signature calculations rely on them, so even a slight variance will give different results.
So, we now have the ability to call any AWS API so long as we present the parameters correctly. The final piece of the puzzle is decoding what comes back from AWS. If you’ve ever used boto3, you’ll know that it returns JSON. Curiously, the AWS APIs do not … they return XML! I’m not strong at parsing XML paths but, thankfully, xMatters includes a number of libraries for XML manipulation, including JXON, a library to convert XML to JSON.
var json_response = JXON.parse(ec2Response.body);
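For readers without access to JXON, the conversion is straightforward to approximate. This is an illustrative Python sketch using the standard library (it is not the JXON implementation, just the same idea):

```python
import xml.etree.ElementTree as ET

def xml_to_dict(element):
    """Recursively convert an XML element tree into nested dicts,
    roughly analogous to what JXON does in xMatters."""
    children = list(element)
    if not children:
        return element.text
    result = {}
    for child in children:
        tag = child.tag.split("}")[-1]  # strip any XML namespace
        value = xml_to_dict(child)
        if tag in result:               # repeated tags become a list
            if not isinstance(result[tag], list):
                result[tag] = [result[tag]]
            result[tag].append(value)
        else:
            result[tag] = value
    return result

# Example: a trimmed-down ModifyVolume-style response
doc = ET.fromstring(
    "<ModifyVolumeResponse><volumeModification>"
    "<volumeId>vol-123</volumeId><targetSize>100</targetSize>"
    "</volumeModification></ModifyVolumeResponse>"
)
```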
xMatters doesn’t allow one library to reference another library, unfortunately, which means that all of the AWS code needs to be duplicated in each script. Apart from that, though, it should now be quite straightforward to call any AWS API from within a xMatters workflow.
In part 1, I wrote about a workflow created for xMatters that reacted to CloudWatch alarms delivered via SNS when the free space on a server was running low.
Since writing that, a bug was discovered (now fixed) that prevented the filing system associated with the volume from being resized if the volume was not stored on a NVMe device. That bug resulted in the realisation that there was a “gap” in the workflow:
CloudWatch alarm goes off, triggering the workflow
Workflow expands the volume but doesn’t resize the filing system
CloudWatch alarm clears due to free space increasing
Time passes …
CloudWatch alarm goes off, triggering the workflow
… and around we go again
In other words, there was the risk that the workflow would continue to grow the volume without a corresponding resizing of the filing system, thereby never stopping the alarm loop.
To correct that behaviour, a new step has been added to the workflow:
The new step – SSM-CheckFS – takes the following actions:
Runs some commands (see below) on the host to determine the size of the filing system.
Compares that size with the size of the volume.
Sets an output to indicate whether or not to proceed with the workflow.
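The actual commands run by SSM-CheckFS aren’t reproduced here, but the decision logic amounts to something like this Python sketch (the df invocation and the 1 GiB slack threshold are illustrative assumptions, not the workflow’s exact values):

```python
def parse_df_size_gib(df_output):
    """Parse the size column from the output of a hypothetical
    `df -BG --output=size <mountpoint>` run via SSM."""
    lines = df_output.strip().splitlines()
    return int(lines[-1].strip().rstrip("G"))

def should_resize(fs_size_gib, volume_size_gib, slack_gib=1):
    """Proceed with the workflow only if the filing system is smaller
    than the volume by more than the allowed slack; otherwise the
    volume has already been resized and the loop must stop."""
    return volume_size_gib - fs_size_gib > slack_gib
```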
Linaro uses Amazon Web Services (AWS) for most of its infrastructure, predominantly EC2 instances. One of the tasks I often find myself dealing with is responding to a low free disc space alarm. The alarm is generated by CloudWatch Alarms as a result of metrics submitted by the CloudWatch Agent running on each instance. The alarm is fed by a SNS topic directly to a xMatters webhook and I then get notified on my phone and by email.
Increasing the size of an EBS volume isn’t hard or onerous – expand the EBS volume then get the OS to grow the corresponding partition (if required) and resize the filing system. Ideally, though, I wouldn’t have to do it manually – particularly if the alarm goes off at 3am!
This article looks at the challenges around automating the process and how it has been solved with a xMatters workflow. Whilst there are many ways that the process could be automated, e.g. writing a script in Lambda, I wanted to try and solve it entirely within xMatters so that if the process completes automatically, nothing happens but if the process fails at any step, a xMatters event is still created.
Assumptions and prerequisites
This workflow has only been tested on Ubuntu instances. It should work on other variants of Linux but not with Windows due to the hard-wired commands being run at certain points.
For any non-NVMe volumes, there is an assumption that AWS sees the device as sdf but the operating system sees the device as xvdf.
For NVMe volumes, there is an assumption that only the root device will have a partition on it.
Instances will need to have the SSM agent running on them in order to be able to execute commands on the operating system.
The first thing to deal with is credentials for any of the interactions with AWS. For a simple environment, an AWS IAM user could be created with appropriate permissions and the static access key and secret key then used. For Linaro’s environment, that isn’t going to work. We have multiple accounts so we’d need a user per account and that then becomes more unwieldy when it comes to using those credentials within xMatters.
To solve the multiple account issue, roles are used instead, with an account being able to assume the role. That still requires a user with static credentials, which is not ideal, particularly when the recommendation is to rotate those credentials on a regular basis. A suitable IAM policy for the role is:
To avoid the need to have an AWS IAM user for the workflow, Linaro uses Hashicorp Vault instead. There is a single Vault AWS IAM user, with the Vault software rotating the access key regularly so that it is kept safe. To use this approach, a step was written in the workflow that has inputs for the Vault authentication values plus the desired Vault role to assume. The step outputs the STS-provided access key, secret key and session token.
By using xMatters’ ability to merge free text with values from other steps, a Vault role value can be provided that is a combination of the AWS account ID for the affected volume plus the fixed string -EBSResizeAutomationRole.
Getting details from the SNS message
In order to grow the affected EBS volume, the following information is needed:
The AWS account ID
The AWS region
The EC2 instance ID
The device name on the instance for the volume
The account ID is provided as an output value (AWSAccountId) from the SNS step in the workflow. The other values need a further custom step script in order to extract the information from the Trigger Dimensions (Trigger.Dimensions from the SNS step) and the SNS Topic ARN (TopicArn).
There is one further piece of information required – the volume ID – but that isn’t provided by CloudWatch as part of the alarm information. There are two potential options to get it – use the alarm description as an additional field and store it there, or script a solution. The former is easier to use but has the drawback that if the underlying volume’s ID ever changes, someone needs to remember to update the alarm description. The latter, as will be explained, is rather tricky to solve …
Getting the volume ID
On the face of it, getting the volume IDs for attached volumes on an EC2 instance looks like being a straightforward task. The describe-instance API call returns the details of the attached blocks, like this:
So, for a given device name from the SNS topic, it should be simple enough to find the volume ID … except for the fact that the device name given in the SNS topic never matches the device name in the block device mapping information. Sometimes, it is quite straightforward to resolve – the block device mapping uses a name like “/dev/sda1” and the SNS topic’s device name is “xvda1”. Consistent and easy to code around, so long as that mapping is the correct assumption to make.
The introduction of NVMe block devices on Nitro systems is a completely different kettle of fish, though. For example, the block device mapping example above clearly states that the device name is “/dev/sda1”. What is provided in the SNS topic? “nvme0n1p1”
The implemented solution is to use Systems Manager (SSM) to run a command on the affected instance so that the NVMe device can be translated to the associated volume ID. This does require that the SSM agent is installed and running on the instance. If anyone knows a better solution that works on Ubuntu, do let me know!
So, the workflow needs to look at the device name from the SNS alarm information and branch depending on that device name. Either route will then give us the corresponding volume ID. To make that branch operation simpler, the SNS-ExtractValues step also provides an output called NVMEdevice which is set to true if the device name starts “nvme”, otherwise it is set to false.
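Expressed in Python terms, the branching flag and the non-NVMe name mapping described above amount to the following (the function names are mine, not the workflow’s):

```python
def is_nvme_device(device_name):
    """Mirror of the NVMEdevice output: true when the SNS alarm's
    device name starts with "nvme"."""
    return device_name.startswith("nvme")

def aws_device_name(os_device_name):
    """For non-NVMe devices, map the OS-side name (e.g. xvdf) back
    to the name AWS uses in the block device mapping (e.g. sdf),
    per the xvd<letter> / sd<letter> assumption stated earlier."""
    if os_device_name.startswith("xvd"):
        return "sd" + os_device_name[3:]
    return os_device_name
```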
Expanding the volume
Once the volume ID is known, the serious stuff can start.
AWS allows an EBS volume to be expanded without any downtime … but it comes with the penalty that, after an expansion has been requested, you have to wait for the background optimisation process to complete before you can request another one. Even then, there is a maximum modification rate per volume limit (which appears to be one), after which you have to wait at least 6 hours before trying to modify the volume again.
So, the first thing to be done is check the status of the volume and skip to raising a xMatters alert if the volume is still being optimised. If an already-expanded volume has run out of space that quickly, there may be a bigger problem for someone to investigate.
If the volume is not being optimised, the workflow moves on to modify the volume. The approach taken is to multiply the current size by a factor, e.g. 2 to double the volume’s size. Rather than hard-code that into the script, it can be configured as an input value.
If the request to modify the size fails because the modification rate per volume limit has been reached, a xMatters alert is raised.
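The size calculation and the “still optimising?” check are simple enough to sketch. This Python fragment shows the logic only; the workflow itself performs these checks via the signed EC2 API calls, and while the state names are the ones AWS documents for volume modifications, treat this as a sketch rather than the workflow’s code:

```python
import math

def is_still_optimising(modification_state):
    """"modifying" and "optimizing" both mean a previous expansion
    is still in progress, so the workflow should skip straight to
    raising an alert."""
    return modification_state in ("modifying", "optimizing")

def new_volume_size(current_size_gib, growth_factor=2):
    """Multiply the current size by the configured factor, rounding
    up to a whole GiB since EBS sizes are integral."""
    return math.ceil(current_size_gib * growth_factor)
```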
Growing & resizing the filing system
To grow and resize the filing system, commands need to be run on the operating system of the affected instance. The “grow” part only needs to happen if the filing system is on a partition of a larger volume. This only seems to happen on the root device and, as such, there is already a tool installed on the server that can be used to grow the partition – cloud-init:
cloud-init single -n growpart
That command will grow the root partition if it needs to be grown. Otherwise, it will just exit without error.
Once that command finishes, the filing system itself is resized with:
resize2fs /dev/<device name>
Both of these commands are run by using the SSM Run Command functionality that was referenced earlier in regards to getting the volume ID for a NVMe device.
If the commands do not succeed, a xMatters alert is created, otherwise the workflow ends and, eventually, CloudWatch will realise that the volume has been resized and clear the alarm.
Before I get into the nitty-gritty, a brief recap of how things are working before any of the changes described in this article …
A Linaro static website consists of one or more git repositories, with potentially one being hosted as a private repository on Linaro’s BitBucket server and the others being hosted on GitHub as public repositories. Bamboo, the CI/CD tool chosen by Linaro’s IT Services to build the sites, monitors these repositories for changes and, when a change is identified, runs the build plan for the website associated with the changed repositories. If the build plan is successful, the staging or production website gets updated, depending on which branch of the repository has been updated (develop or master, respectively).
All well and good but it does mean that if someone commits a breaking change to a repository (e.g. a broken link or some malformed YAML) then no other updates can be made to that website until that specific problem has been resolved.
To solve this required several changes being made that, together, helped to ensure that breaking changes couldn’t end up in the develop or master branches unless someone broke the rules by bypassing the protection. The changes we made were:
Using pull requests to carry out peer reviews of changes before they got committed into the develop or master branch.
Getting GitHub to trigger a custom build in Bamboo so that the proposed changes were used to drive a “test” build in Bamboo, thereby assisting the peer review by showing whether or not the test build would actually be successful.
Using branch protection rules in GitHub to enforce requirements such as needing the tests to succeed and needing code reviews.
Pull requests are not a native part of the git toolset but they have been implemented by a number of the git hosting platforms like GitHub, GitLab, BitBucket and others. They may vary in the approach taken but, essentially, one or more people are asked to look at the differences between the incoming changes and the existing files to see if anything wrong can be identified.
That, in itself, can be laborious and is not always successful at spotting problems, which is why there is an increasing use of automation to assist. GitHub’s approach is to have webhooks or apps trigger an external activity that might perform some testing and then report back on the results.
We opted to use webhooks to get GitHub to trigger the custom builds in Bamboo. They are called custom builds because one or more Bamboo variables are explicitly defined in order to change the behaviour of the build plan. I’ll talk more about them in a subsequent article.
The final piece of the puzzle was implementing branch protection rules. I’ve linked to the GitHub documentation above but I’ll pick out the key rules we’ve used:
Require pull request reviews before merging. When enabled, all commits must be made to a non-protected branch and submitted via a pull request with the required number of approving reviews.
Require status checks to pass before merging. Choose which status checks must pass before branches can be merged into a branch that matches this rule.
There is a further option that has been tried in the past which is “Include administrators”. This enforces all configured restrictions for administrators. Unfortunately, too many of the administrators have pushed back against this (normally because of the pull request review requirement) so we tend to leave it turned off now. That isn’t to say, though, that administrators get a “free ride”. If a pull request requires a review, an administrator can merge the pull request but GitHub doesn’t make it too easy:
Clicking on Merge pull request, highlighted in “warning red”, results in the expected merge dialog but with extra red bits:
So an administrator does have to tick the box to say they are aware they are using their admin privilege, after which they can complete the merge:
If an administrator pushes through a pull request that doesn’t build then they are in what I describe as the “you broke it, you fix it” scenario. After all, the protections are there for a good reason 😊.
In migrating the first Linaro site from WordPress to Jekyll, it quickly became apparent that part of the process of building the site needed to be a “check for broken links” phase. The intention was that the build plan would stop if any broken links were detected so that a “faulty” website would not be published.
Link-checking a website that is currently being built brings a potential problem: if you reference a new page, it won’t yet have been published, so if you rely on checking http(s) URLs alone, you won’t find the new page and an erroneous broken link is reported.
You want to be able to scan the pages that have been built by Jekyll, on the understanding that a relative link (e.g. /welcome/index.html instead of https://mysite.com/welcome/index.html) can be checked by looking for a file called index.html within a directory called welcome, and that anything that is an absolute link (i.e. it starts with http or https) is checked against an external site.
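That split (files on disk for relative links, HTTP requests for absolute ones) is the heart of the checker. A simplified Python sketch of the classification and resolution logic, following the index.html convention described above; this is not the in-house tool’s actual code:

```python
from pathlib import Path
from urllib.parse import urlparse

def classify_link(link):
    """Absolute links (http/https) are checked externally;
    everything else is checked against the built site."""
    return "external" if urlparse(link).scheme in ("http", "https") else "internal"

def resolve_internal_link(site_root, link):
    """Map a relative link onto the files Jekyll has built: a
    trailing slash (or bare site root) implies index.html inside
    that directory."""
    path = link.lstrip("/")
    if path.endswith("/") or path == "":
        path += "index.html"
    return Path(site_root) / path
```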
I cannot remember which tool we started using to try to solve this. I do remember that it had command-line flags for “internal” and “external” link checking but testing showed that it didn’t do what we wanted it to do.
So an in-house solution was created. It was probably, at the time, the most complex bit of Python code I’d written and involved learning about things like how to run multiple threads in parallel so that the external link checking doesn’t take too long. Some of our websites have a lot of external links!
Over time, the tool has gained various additional options to control the checking behaviour, like producing warnings instead of errors for broken external links, which allows the 96Boards team to submit changes/new pages to their website without having to spend time fixing broken external links first.
The tool is run as part of the Bamboo plan for all of the sites we build and it ensures that the link quality is as high as possible.
Triggering a test build on Bamboo now ensures that a GitHub Pull Request is checked for broken links before the changes are merged into the branch. We’ve also published the script as a standalone Docker container to make it easier for site contributors to run the same tool on their computer without needing to worry about which Python libraries are needed.
The private git server used was Atlassian BitBucket – the self-hosted version, not the cloud version. Although Linaro’s IT Services department is very much an AWS customer, we had already deployed BitBucket as an in-house private git service so it seemed to make more sense to use that rather than pay an additional fee for an alternative means of hosting private repositories like CodeCommit or GitHub.
So what to do about the build automation? An option would have been to look at CodeBuild but, as Linaro manages a number of Open Source projects, we benefit from Atlassian’s very kind support of the Open Source community, which meant we could use Atlassian Bamboo on the same server hosting BitBucket and it wouldn’t cost us any more money.
For each of the websites we build, there is a build plan. The plans are largely identical to each other and go through the following steps, essentially emulating what a human would do:
Check out the source code repositories
Merge the content into a single directory
Ensure that Jekyll and any required gems are installed
Build the site
Upload the site to the appropriate S3 bucket
Invalidate the CloudFront cache
Each of these is a separate task within the build plan and Bamboo halts the build process whenever a task fails.
There isn’t anything particularly magical about any of the above – it is what CI/CD systems are all about. I’m just sharing the basic details of the approach that was taken.
Most of the tasks in the build plan are what Bamboo calls a script task, where it executes a script. The script can either be written inline within the task or you can point Bamboo at a file on the server and it runs that. In order to keep the build plans as identical as possible to each other, most of the script tasks run files rather than using inline scripting. This minimises the duplication of scripting across the plans and greatly reduces the administrative overhead of changing the scripts when new functionality is needed or a bug is encountered.
To help those scripts work across different build plans, we rely on Bamboo’s plan variables, where you define a variable name and an associated value. Those are then accessible by the scripts as environment variables.
We then extended the build plans to work on both the develop and master branches. Here, Bamboo allows you to override the value of specified variables. For example, the build plan might default to specifying that jekyll_conf_file has a value of “_config.yml,_config-staging.yml”. The master branch variant would then override that value to be “_config.yml,_config-production.yml”.
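Bamboo exposes plan variables to scripts as environment variables prefixed with bamboo_, so a shared script can pick up whichever value the branch defines without hard-coding anything. A small Python sketch (the jekyll_conf_file variable name is from the example above; the helper function itself is hypothetical):

```python
import os

def jekyll_config_files(default="_config.yml"):
    """Read the jekyll_conf_file plan variable, as surfaced by
    Bamboo in the environment, and split its comma-separated
    value into a list of Jekyll config files."""
    value = os.environ.get("bamboo_jekyll_conf_file", default)
    return [f.strip() for f in value.split(",")]
```

A develop-branch build would thus see the staging config list while the master branch, via its variable override, sees the production one.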
The method used to trigger the builds automatically has changed over time because we’ve changed the repository structure, GitHub has changed its service offerings, and we’ve started doing more to tightly integrate Bamboo with GitHub, so I’m not going to go into the details on that just yet.
To go along with the series of posts on building websites with Jekyll, I thought I’d also collect together all of the websites built using Linaro’s tools and processes, and the GitHub repositories used to build them.
Back in 2014, the company I work for – Linaro – was using WordPress to host its websites. WordPress is a very powerful piece of software and very flexible but it did present some challenges to us:
Both WordPress and MySQL needed regular patching to minimise vulnerabilities.
It could be quite a resource hog if you were trying to get an optimal end-user experience from it.
It was difficult to make a WordPress site run across multiple servers (to avoid having single points of failure, resulting in an inaccessible site).
Towards the end of 2014, I attended the AWS re:Invent conference and happened to attend a session that would ultimately change how Linaro delivers its websites:
The basis of the idea presented in this session is to use a static site generator which takes your content, turns it into HTML pages and stores it in a S3 bucket from where it can be hosted/accessed by your customers.
By doing so, it eliminates the “retrieve the data from a database and convert it to a web page on the fly” process and thereby eliminates a database platform (e.g. MySQL) and the conversion software (e.g. WordPress). The up-front conversion is a one-off time hit, compared to the per-page time hit that a system like WordPress endures.
It is worth emphasising that although the session was at an Amazon conference, the underlying premise and the tools being discussed can be used on any cloud provider.
Earlier, I said that this session would ultimately change how Linaro delivers its websites because it took a bit of persuading … In fact, the following year, I shared this article with the staff who managed the content of the websites:
The challenge was that everyone was used to using WordPress and switching to a static site generator was going to be quite an upheaval in terms of workflow, content creation and management.
We got there, though.
We ended up choosing Jekyll as our static site generator. One of the reasons is that it is the technology used to drive GitHub Pages and, as such, gets a lot of use. For the rest of the infrastructure, we did use S3 and CloudFront to provide the hosting infrastructure and, as expected, this turned out to be a lot cheaper and a lot faster than using WordPress.
To migrate the websites to Jekyll, the Marketing team started by building out a Jekyll theme to manage the look and feel of the sites. Initially, this was kept in a private git repository on one of Linaro’s private git servers. The content was always managed as public git repositories on GitHub.
That split of repositories actually caused a couple of headaches for us:
Building the site required both repositories to be retrieved from the git servers and the content merged.
If we wanted to automate the building of the website, we’d need tools that could work with our private git server.
This is a collection of articles about how Linaro uses Jekyll and other tools to build its websites. This particular post will be the main index page and will link out to the other posts.
It should be noted that I will be focusing on the tools and technology, rather than tips on Jekyll itself (like how to build a theme). There are better qualified people than myself to write about such topics 😊