In migrating the first Linaro site from WordPress to Jekyll, it quickly became apparent that part of the process of building the site needed to be a “check for broken links” phase. The intention was that the build plan would stop if any broken links were detected so that a “faulty” website would not be published.
Link-checking a website that is still being built brings a problem: a newly added page has not been published yet, so a checker that relies on http(s) URLs alone cannot find it and wrongly reports a broken link.
You want to be able to scan the pages that have been built by Jekyll, on the understanding that a relative link (e.g. /welcome/index.html instead of https://mysite.com/welcome/index.html) can be checked by looking for a file called index.html within a directory called welcome, and that anything that is an absolute link (i.e. it starts with http or https) is checked against the external site.
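As a rough illustration of that split, here is a minimal sketch in Python. It is not the code from our script: the function names and the site_root argument (the directory Jekyll wrote the built site into) are assumptions made for the example, and it only handles root-relative links such as /welcome/index.html.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def is_external(href: str) -> bool:
    """Treat only links with an http or https scheme as external."""
    return urlsplit(href).scheme in ("http", "https")


def internal_link_exists(site_root: Path, href: str) -> bool:
    """Check a root-relative link against the files in the built site.

    site_root is a hypothetical name for the Jekyll output directory.
    """
    # urlsplit(...).path drops any ?query or #fragment part of the link.
    path = unquote(urlsplit(href).path)
    target = site_root / path.lstrip("/")
    # A link to a directory is assumed to be served by its index.html.
    if target.is_dir():
        target = target / "index.html"
    return target.is_file()
```

With a built site in a directory such as _site, internal_link_exists(Path("_site"), "/welcome/index.html") looks for _site/welcome/index.html on disk, so a brand new page is found even though it has not been published yet.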
I cannot remember which tool we started using to try to solve this. I do remember that it had command-line flags for “internal” and “external” link checking but testing showed that it didn’t do what we wanted it to do.
So an in-house solution was created. At the time, it was probably the most complex bit of Python code I’d written, and it involved learning how to run multiple threads in parallel so that the external link checking doesn’t take too long. Some of our websites have a lot of external links!
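To give a flavour of the threading side, here is a minimal sketch that checks external URLs in parallel using the standard library's ThreadPoolExecutor. It is an assumption about one way to do it rather than a copy of our script: the function names, the HEAD request, the User-Agent string and the worker count are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import Request, urlopen


def url_is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers without an HTTP error status."""
    request = Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, HTTPException):
        # urlopen raises HTTPError/URLError (both OSError subclasses) for
        # 4xx/5xx responses and network failures, ValueError for bad URLs.
        return False


def check_external_links(urls, checker=url_is_reachable, max_workers=8):
    """Check each distinct URL once, several at a time, and map URL -> result."""
    unique = sorted(set(urls))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        return dict(zip(unique, pool.map(checker, unique)))
```

Threads suit this job because each check spends nearly all of its time waiting on the network. Some servers reject HEAD requests, so a real checker may need to fall back to GET.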
Over time, the tool has gained various additional options to control the checking behaviour, like producing warnings instead of errors for broken external links, which allows the 96Boards team to submit changes/new pages to their website without having to spend time fixing broken external links first.
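The warnings-versus-errors behaviour can be sketched as follows. The function name and the warn_on_external parameter are hypothetical and do not reflect the real option names in the tool; the point is only that broken internal links always fail the build, while broken external links fail it only when they are treated as errors.

```python
def build_exit_code(broken_internal, broken_external, warn_on_external=False):
    """Report broken links and return the exit code for the build step."""
    for link in broken_internal:
        print(f"ERROR: broken internal link: {link}")
    label = "WARNING" if warn_on_external else "ERROR"
    for link in broken_external:
        print(f"{label}: broken external link: {link}")
    # Internal breakage always fails; external breakage fails unless downgraded.
    failed = bool(broken_internal) or (
        bool(broken_external) and not warn_on_external
    )
    return 1 if failed else 0
```

A build plan would then stop on a non-zero exit code, for example by passing the result to sys.exit().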
The tool is run as part of the Bamboo plan for all of the sites we build and it ensures that the link quality is as high as possible.
Triggering a test build on Bamboo now ensures that a GitHub Pull Request is checked for broken links before the changes are merged into the branch. We’ve also published the script as a standalone Docker container to make it easier for site contributors to run the same tool on their computer without needing to worry about which Python libraries are needed.
The script itself can be found in the git repo for the Docker container, so you can see for yourself how it works and contribute to its development if you want to.