Using a Kubernetes/Amazon EKS solution to reduce testing time from hours to minutes.
Background
Code Enigma’s client, the Welsh Government, performs Behat group testing on their Drupal site each time a feature branch is deployed. At the time, because the numerous tests ran in sequence, a full run took many hours to complete. We were challenged to reduce this testing time and to deploy feature branch sites more efficiently.
Goals
- Investigate an appropriate technology that provides a platform to offload the Behat tests to, where they can run in parallel on replica sites and complete faster.
- Avoid introducing any provider lock-in with the solution, and ensure it has potential for other use cases beyond the specific client brief.
Uncertainties
There were three critical uncertainties Code Enigma had to overcome:
1. Technology used
Kubernetes was chosen as an emerging technology, giving us an opportunity to explore and determine its suitability, not just for this scenario but in the wider context of deploying Drupal applications.
Kubernetes was chosen over Amazon ECS. As a platform-agnostic container orchestration system, Kubernetes is more “portable”: it is not tied to the AWS ecosystem and could potentially provide strategic advantages.
While most of our clients have infrastructure with AWS, we needed to ensure that the solution was not tied too tightly to the environment where the site was hosted. In the context of this client’s infrastructure, however, we chose to base the solution on another Amazon service, Amazon EKS, as it runs upstream Kubernetes and is certified Kubernetes-conformant.
Amazon EKS is just a Kubernetes control plane, which means the configuration is portable and workloads can be moved to on-premises, hybrid, or public cloud infrastructure. We identified that Amazon EKS would not fit the criteria out of the box, but it provided sufficient building blocks for us to extend its capabilities.
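As a rough illustration of that portability, pointing standard Kubernetes tooling at an EKS cluster is a single step, after which the same kubectl commands and manifests work as they would against any other conformant cluster. The cluster name and region below are hypothetical:

```bash
# Hypothetical cluster and region names for illustration only.
CLUSTER_NAME="behat-testing"
AWS_REGION="eu-west-2"

# Point kubectl at the Amazon EKS control plane. Because EKS runs upstream
# Kubernetes, the same kubectl commands and manifests work unchanged against
# any other conformant cluster (on-premises, hybrid or another cloud).
aws eks update-kubeconfig --name "$CLUSTER_NAME" --region "$AWS_REGION"

# Sanity-check that the cluster is reachable.
kubectl get nodes
```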
2. Testing limitations of the existing deployment process
Our existing way of deploying to Amazon EC2 instances imposed a limitation when running tests sequentially: the configuration (e.g. the number of tests that could be run) was restricted by the size of the server being deployed to. We therefore needed to design a testing solution that could scale beyond the size of a single server. Kubernetes/Amazon EKS removed this limitation by introducing parallelisation, enabling tests to run concurrently by separating them out into independent Docker containers.
The process deploys replica Drupal sites into Docker containers within the Amazon EKS cluster, and the tests then run concurrently against those replica sites. This, in turn, introduces more flexibility: the tests can be split across containers in configurable ways, so the run can be optimised to complete as efficiently as possible.
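One way to picture this split, as a minimal sketch rather than the client’s exact configuration, is an indexed Kubernetes Job whose parallelism setting runs one Behat suite per container; the image name, suite names and replica-site URL below are placeholders:

```bash
#!/usr/bin/env bash
# Minimal sketch: run several Behat suites concurrently as an indexed
# Kubernetes Job, one pod per suite. The image name, suite names and
# BASE_URL are placeholders, not the client's real configuration.
set -euo pipefail

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: behat-parallel
spec:
  completions: 4          # one pod per test suite
  parallelism: 4          # run all four suites at the same time
  completionMode: Indexed # JOB_COMPLETION_INDEX selects the suite
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: behat
          image: example.registry/behat-runner:latest  # hypothetical image
          env:
            - name: BASE_URL
              value: "http://replica-site.default.svc.cluster.local"
          command: ["/bin/sh", "-c"]
          args:
            - |
              SUITES="smoke content search user"
              SUITE=$(echo $SUITES | cut -d' ' -f$((JOB_COMPLETION_INDEX + 1)))
              vendor/bin/behat --suite="$SUITE"
EOF

# Wait for all suites to finish before deciding whether to deploy.
kubectl wait --for=condition=complete job/behat-parallel --timeout=2h
```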
We needed to develop custom bash scripts to initiate the deployment, ensuring everything is set up and configured to get all the web data, Solr data, and databases into place.
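A heavily simplified sketch of what such a deployment script might look like is shown below; the bucket, manifest paths, labels and drush commands are illustrative assumptions rather than the client’s actual implementation:

```bash
#!/usr/bin/env bash
# Simplified sketch of a feature-branch deployment script. All names and
# paths (bucket, dump file, manifests, labels) are hypothetical placeholders.
set -euo pipefail

BRANCH="$1"                          # feature branch being deployed
NAMESPACE="behat-${BRANCH}"

# 1. Create an isolated namespace for this feature branch's replica site.
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy the web, Solr and database containers for the replica site.
kubectl apply -n "$NAMESPACE" -f k8s/replica-site/   # hypothetical manifests

# 3. Wait until the pods are ready before loading data.
kubectl wait -n "$NAMESPACE" --for=condition=ready pod -l app=drupal --timeout=15m

# 4. Fetch the web files and database dump for this branch (placeholder bucket).
aws s3 sync "s3://example-bucket/${BRANCH}/files" /tmp/files
aws s3 cp   "s3://example-bucket/${BRANCH}/db.sql" /tmp/db.sql

# 5. Copy data into the web container, import the database and rebuild Solr.
POD=$(kubectl get pod -n "$NAMESPACE" -l app=drupal -o jsonpath='{.items[0].metadata.name}')
kubectl cp /tmp/files  "${NAMESPACE}/${POD}:/var/www/html/sites/default/files"
kubectl cp /tmp/db.sql "${NAMESPACE}/${POD}:/tmp/db.sql"
kubectl exec -n "$NAMESPACE" "$POD" -- bash -c \
  "drush sql-drop -y && drush sql-cli < /tmp/db.sql && drush search-api:index"
```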
In implementing this solution for the client, the final uncertainty to overcome was knowing how many containers to split the tests across. This required significant testing to determine the optimum values. A systematic approach was applied: incrementally increasing the number of tests sent to the container cluster and recording the results to identify the optimal number for the client, along the lines sketched below.
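A rough illustration of that benchmarking approach, assuming a hypothetical run-behat-parallel.sh wrapper that launches the Kubernetes Job with a given number of containers:

```bash
#!/usr/bin/env bash
# Sketch of the benchmarking approach: run the full Behat suite with an
# increasing number of parallel containers and record the wall-clock time.
# run-behat-parallel.sh is a hypothetical wrapper around the Kubernetes Job.
set -euo pipefail

echo "containers,seconds" > results.csv

for CONTAINERS in 2 4 6 8 10 12; do
  START=$(date +%s)
  ./run-behat-parallel.sh --containers "$CONTAINERS"
  END=$(date +%s)
  echo "${CONTAINERS},$((END - START))" >> results.csv
done

# Review results.csv to pick the container count with the best trade-off
# between total run time and cluster cost.
column -s, -t results.csv
```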
3. Amazon EKS API bug
During the development of this solution, we encountered a known API bug within Amazon EKS that required us to develop a proprietary solution for the client, one that would fit their specific workflow.
The issue occurred when running the API command to remove an Amazon EKS node group: sometimes the node group would not delete properly, leaving the whole set-up in AWS and a site in place when it shouldn’t be. This has information security implications, as it leaves unmonitored assets in place, and cost implications for the client, who continues to incur charges for running unnecessary node groups. To work around the issue, we developed a custom bash script, initiated from Jenkins, to list the node groups on the system. The list is currently reviewed manually, and any node groups that are no longer required can then be deleted.
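A cut-down sketch of that clean-up helper might look like the following; the cluster name and region are placeholders, and deletion deliberately remains a separate, manually reviewed step:

```bash
#!/usr/bin/env bash
# Sketch of the Jenkins-initiated clean-up helper. It only lists node groups
# for manual review; deletion is deliberately left as a separate, reviewed step.
# The cluster name and region are placeholders.
set -euo pipefail

CLUSTER_NAME="behat-testing"
AWS_REGION="eu-west-2"

echo "Node groups currently attached to ${CLUSTER_NAME}:"
aws eks list-nodegroups \
  --cluster-name "$CLUSTER_NAME" \
  --region "$AWS_REGION" \
  --query 'nodegroups' \
  --output table

# After manual review, a stale node group can be removed with, for example:
#   aws eks delete-nodegroup --cluster-name "$CLUSTER_NAME" \
#     --nodegroup-name <name-of-stale-node-group> --region "$AWS_REGION"
```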
Outcomes
This project represents a significant development for Code Enigma. We introduced another service to our offering, with a proven case study for its effectiveness. There is scope for automating further aspects of this solution (e.g. the Amazon EKS API bug workaround and the periodic update of the web container Docker image), but this was outside the client’s initial brief and budget. Even so, we have delivered significant value to the Welsh Government’s team, reducing the time to test and subsequently deploy a feature branch from hours to minutes.