When operating a single cluster, or small number of clusters, it’s possible to keep things fairly loosely defined: components can be created, updated and configured as required. However, when running at scale this becomes unsustainable; changes become laborious and configuration differences can lead to compatibility issues and other problems.
In order to manage a large fleet of Kubernetes clusters, some framework is required: A set of processes and tooling used to create, configure, and operate the clusters to achieve your goals.
This can be thought of as your own internal Kubernetes ‘platform’ which enables application teams to easily and efficiently run their software while also allowing the platform team to roll out changes.
The following recommendations are based on my experience working on the development and operation of a large Kubernetes platform built for a major bank. This project took place over several years, during which our team built the platform up to have around 200 teams using it in production.
The recommendations are separated into principles and patterns: The principles are the key high level ideas we identified which helped our platform to be successful. The patterns describe how we operate different aspects of the platform to follow these principles.
Each pattern covers a different recommendation in a generic way, with the aim of making the advice valuable and implementable for any platform. The patterns avoid describing specific tooling or implementation details; however, to give more context and background, some details about how our team implemented each recommendation are given in italics.
Principles
These are the key principles which we identified for effectively managing platform upgrades.
Processes and automation
Having well defined processes for operating the platform, particularly for upgrades, is vital for ensuring reliability and accommodating growth.
As a platform is complex and involves many components working together there will likely be many ways that changes could be made and rolled out to users. However, unless a consistent approach is established it will be difficult to coordinate and reason about these changes.
A standard process allows you to establish confidence in how you make changes, and iteratively improve on it; it ensures that everyone on the platform team knows what needs to happen, that correct steps are taken each time, and that users know what to expect. All of these are fundamental parts of ensuring reliability.
The benefits of these standard processes are even stronger as the platform scales up. As the platform engineering team grows, aligning on standard processes allows handing off between different team members, for example between team members in different locations for a ‘follow the sun’ operating model. The standard processes allow new engineers to make a positive impact much faster and reduce the need to develop a deeper understanding before they can contribute.
While the most important process to establish is the one for platform upgrades, this should be extended to include not just the upgrade itself, but also the supporting processes such as testing and security scans. A great way to build up these processes is to start with the upgrade, and gradually expand the scope to cover more of its dependencies to ensure these are managed consistently too.
While these processes can take many forms, such as runbooks and operational guides, the best way to implement them is via automation. Runbooks are a great way to specify how a process should work and share this with the platform team, and they’re relatively easy to iteratively change and update. For some infrequently used processes, or processes that involve manual reviews or approvals by people, automation may not be appropriate, but typically it’s best where possible to progress standard processes into full automation.
This automation is not just about making the processes faster and easier, but also ensures a higher level of consistency and reduces opportunities for mistakes. For example, someone following a manual process could still forget to upgrade one cluster from the fleet, or accidentally use a different version of a component in a cluster.
Consistency and configuration
To keep the complexity of the platform under control it’s important to maintain as much consistency as possible and carefully manage all configuration.
Consistency is what makes the platform a platform, rather than just many separate clusters or environments. Keeping the number of different ways each component can be configured to a minimum makes it easier to reason about and test the expected behaviour, as well as to support users with issues. This can be achieved by using common resource definitions which act as the single source of truth for how things should be configured; for example, all clusters in the platform using the same Terraform and Helm charts. These should be as ‘functional’ as possible, avoiding things like hardcoded projects and minimising conditional blocks.
Any differences which are required should be specified as configuration, again with a single source of truth. For example, specifying which regions and projects to deploy the platform in as a configuration file which can then be passed into the standard Terraform and Helm charts as parameters.
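As a minimal sketch of what such a configuration file could look like (the file name, keys, and values here are illustrative assumptions rather than details of the platform described in this article):

```yaml
# environments/dev.yaml -- hypothetical per-environment configuration.
# The shared Terraform and Helm charts consume these values as parameters;
# anything not listed here is identical across all clusters.
environment: dev
cloud_project: example-platform-dev   # illustrative project name
regions:
  - europe-west2
  - europe-west4
cluster_defaults:
  node_count: 3
  release_channel: REGULAR
```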
Communication and engagement
A large Kubernetes platform likely means a large number of engineers and other stakeholders are involved, which makes it vital to clearly communicate changes to the platform. This is especially true for upgrades, which have a wide impact and could require actions from other teams.
As well as the platform team communicating with application teams and other platform stakeholders, it’s important for those teams to be able to communicate back to the platform team.
This is not just about providing support for the platform, but also allows platform users to raise issues they have identified with new versions. These issues must be addressed and resolved, if necessary halting the upgrade process to prevent them from impacting production.
Design patterns
This is a selection of design patterns which aim to follow the principles stated above.
These are intended to be high level approaches to problems, so don’t generally include any specific implementation or tooling advice. Instead there is an example of how this pattern was implemented for the platform I was working on, to give some context of how it looks in practice.
Release changes as versions
To improve reliability and make changes easier to manage it helps to treat the entire platform like an application: Rather than making ad-hoc changes directly to clusters, any changes, such as component upgrades or configuration edits, should be made to the shared components of the platform itself. Changes can then be grouped together, properly tested, and released as a new version of the platform which is rolled out in a coordinated way.
Much like with application development, new platform versions should be released frequently with incremental and backwards compatible changes. Wherever possible breaking changes should be avoided, even if this requires more engineering effort to support, or requires changes to be rolled out gradually over several versions.
It’s best practice to release new versions as frequently as possible, to minimise the amount of change between each. However there are some limitations on this: Continuously releasing changes can expose users to more instability and make it harder to track the progression of features across different environments. Additionally in an enterprise setting there may be change management procedures in place which must be followed, limiting the rate changes can be rolled out.
This means that the version release and upgrade process is used for all changes to any part of the platform. Having a single, well-tested, and understood mechanism for making changes really simplifies the operation of the platform and streamlines the process.
Having this standardised process for making and rolling out changes not only makes it easier, but is also crucial to allow you to respond quickly to external issues. For example, fixing newly identified CVEs or scaling infrastructure in response to demand.
For the platform I’ve been working on we manage many separate Google Kubernetes Engine (GKE) clusters, each running a standard set of add-ons, for many different user teams. We create and manage these clusters using Terraform to deploy the cluster itself and supporting cloud resources, then using Flux to install various Helm charts and Kustomizations for the add-ons. User and environment specific configuration is passed in as parameters, allowing the rest of the configuration to be common to all clusters, keeping them as consistent as possible to ease the operational workload.
In order to then manage this common configuration as we make changes to it and allow it to be effectively rolled out and applied to user clusters, we group changes made over a period of a few weeks into versions. These versions can then be tested, announced to users, and rolled out across different environments.
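As a hedged illustration of how a cluster can track a released platform version when Flux manages the add-ons as described above (the repository URL, tag, and names are assumptions for the example):

```yaml
# Hypothetical Flux GitRepository pinning this cluster to platform release v1.42.0.
# Rolling a new version out to an environment is then a matter of updating ref.tag.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/platform   # illustrative repository URL
  ref:
    tag: v1.42.0   # the released platform version for this environment
```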
The user development environment is your production environment
It’s conventional for organisations to have separate development and production environments; however, when managing a platform, the development environment will also be used by application teams for their testing: This effectively makes it a production environment for the platform team.
While issues here are not as critical as in the actual production environment, it is the first place where application teams see new features, fixes, and updates. Issues here can impede and slow down their work as well as damage their confidence in the platform.
Therefore it’s recommended that additional environments are added to work on and test the platform before it reaches the organisation's development environment. These could be ‘sub-environments’ of the development environment, but some separation is required even if it’s only recognised by the platform development team.
We created several sub-environments for platform development and testing. These environments were effectively just collections of projects in the organisation’s development environment, but we structured our platform configuration to consider them as separate and used a naming convention to make this clear to members of the platform engineering team.
Progress versions through environments
Once a new platform version is released it should be thoroughly tested and progress through different staging environments before being rolled out in production.
In this operating model, when a new platform version is released the upgrade process is first end-to-end tested in the platform testing environment. Once this is complete and the platform engineers feel confident then the ‘real’ development environment can be upgraded.
Next the user teams can verify that their workloads still run correctly and make any changes or optimisations in the development environment. There should be some delay between upgrading development and production to give these teams time to complete any work required, and to soak test changes to the platform.
Finally, if there are no issues with the new platform version, it can be rolled out to the production environment.
Throughout this process it’s important for the platform team to communicate upcoming changes and have a mechanism for application teams to give feedback and get support. These areas are covered in the following sections.
Our platform development environment includes a project for each member of the engineering team for them to use while implementing changes, for example adding new features or component updates. The management project for this environment also runs a lot of the development processes, such as build and test pipelines.
The platform testing environments also include several projects and are designed to mirror the main organisation development environment with ‘dummy’ application workloads. These are used for end-to-end testing upgrades and assessing platform performance and stability.
Keep all platform components in a single repository
To make managing the version progression of the platform easier it can be beneficial to keep all components of it in a single Git repository. This would include all Terraform, scripts, code, YAML manifests, Helm charts, etc.
The repository should include not just the assets needed by the platform itself but also all parts used to build, test, and operate it. This means that all parts of the platform grow together and encourages using the platform to test itself. As new features are added, so are new end-to-end tests and the tooling to support them.
Using a single platform repository also makes it easier when developing to make coordinated changes. For example, if upgrading the version of a component also requires you to change a Kubernetes NetworkPolicy resource, all of this can be done on a new branch. The branch can be treated as a version and applied to a development environment for testing. The change can then be reviewed as a single pull request and easily merged.
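To make that example concrete, the NetworkPolicy change might be nothing more than a small manifest like the following, committed on the same branch as the component upgrade (all names and ports are hypothetical):

```yaml
# Hypothetical NetworkPolicy added alongside a component upgrade on the same
# branch, so both changes are reviewed, tested, and released together.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-scrape
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: example-exporter   # illustrative component name
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: prometheus
      ports:
        - protocol: TCP
          port: 9100
```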
Keeping everything together promotes the idea of the platform as a product which is developed and versioned as a single entity.
Note that this could be considered a monorepo for the platform team, however that term implies a much larger single repository which could be shared by the entire organisation. It doesn’t generally make sense to keep assets for all teams using your platform in the same repository, unless they are very tightly bound to the platform.
While using a single repository as much as possible is very effective it may be necessary to store configuration to instantiate the platform itself in a separate repository. This could include configuration such as which environments the platform runs in, which cloud provider projects are onboarded, or which users have access to it. The reason for separating these is that they may require a different permission model: Changes to the platform itself would affect all users and therefore need to be subject to a higher standard of scrutiny, requiring automated checks and peer review, whereas a configuration change like onboarding a new project doesn’t necessarily need the same restriction and could even be made ‘self-service’ for project owners.
We save all assets relating to the platform in a single Git repository. We then use Lighthouse to watch this repository and trigger builds or tests to run in Tekton: As everything is stored in the same repository it’s easy to get all required configuration for every commit.
The only configuration which is stored in separate repositories is the lists of Google Cloud projects in which the platform should deploy clusters. This is because we allow a service account to modify this list as part of our automation of onboarding and offboarding users. We did not want to allow this service account to modify the platform repository itself in case it made unexpected changes.
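As a rough sketch, such a list can be a very simple file in its own repository, which the onboarding service account is allowed to modify (the format and project names are assumptions):

```yaml
# Hypothetical onboarding list, stored outside the platform repository.
# Automation appends or removes entries here; the platform's Terraform reads
# the list to decide where to create and delete clusters.
clusters:
  - project: team-alpha-dev    # illustrative Google Cloud project IDs
    region: europe-west2
  - project: team-bravo-dev
    region: europe-west2
```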
Pin external dependencies
For platform artifacts that can’t be stored in the Git repository, such as container images or binaries, create configuration files in the repository which specify the exact version which is used. This ensures all the parts of the platform are explicitly defined and allows changes to be linked to Git commits and tags.
This is even more effective if all external artifacts and their versions are defined in a single central configuration file, which is then used to template other configurations such as Helm charts and Terraform. This makes the list of components used for each platform version very clear, as well as making it easier to update: If versions are instead set throughout the codebase, it can be difficult to identify all of them, leading to situations where different versions are used in different places, or dependencies are unknown.
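A minimal sketch of such a central versions file, shown here as YAML although the exact format is a matter of preference (the component names and tags are purely illustrative):

```yaml
# Hypothetical central versions file: the single place where external image
# versions are pinned. Helm values and Terraform variables are templated from
# this file rather than hardcoding tags throughout the codebase.
images:
  ingress-controller: "1.9.4"   # illustrative component names and tags
  cert-manager: "1.14.2"
  log-forwarder: "2.1.0"
```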
For the platform I worked on we would specify all container images used in a JSON file, with specific tags. The values in this file were then passed to Helm charts for templating, as well as into Tekton Pipelines. We also used Go, which tracks its dependencies in go.mod and go.sum files, and Python, which uses a requirements.txt file.
Manage unpinnable external dependencies
With some external dependencies it may not be possible or practical to pin them. In this case they should ideally still progress through the sequence of environments that a normal release would.
For example, a hardware upgrade may not be something that can be fixed in configuration but should still be coordinated to affect your platform development environment first, then the user development, then production.
These changes should also align with the progression of a single released version, to avoid there being multiple parallel changes occurring at the same time. This may require adjusting your release schedule or enacting a change freeze.
For the platform I have been working on we were required to use GKE Release Channels: This is mandated by the organisation’s security policy, but it also ensures the versions of the underlying Kubernetes clusters are continuously upgraded without increasing the operational burden on the platform team.
To ensure that our reduced control over the specific Kubernetes version didn’t cause disruption to users we configured environments so that the platform development environment would use the ‘rapid’ channel, user development used the ‘regular’ channel, and production used the ‘stable’ channel. This ensured the platform team would be working with new Kubernetes versions first, giving them time to identify and resolve compatibility issues before they impacted users.
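Keeping with the idea of differences living in configuration, the channel assignment can be captured in the platform’s environment configuration and fed into the shared Terraform; a minimal sketch with illustrative keys and environment names:

```yaml
# Hypothetical mapping of environments to GKE release channels, consumed by
# the shared Terraform when creating clusters.
release_channels:
  platform-dev: RAPID     # platform team meets new Kubernetes versions first
  user-dev: REGULAR
  production: STABLE
```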
Copy external artifacts internally
As well as pinning the version of external dependencies which the platform uses, it can improve security and reliability to copy any artifacts to your own private internal storage. This can include copying container images to your own registry, or binaries to your own storage buckets or artifact manager. It also includes copying artifacts like Helm charts, Kubernetes manifests, and Terraform modules.
Copying artifacts means you have control over the instance of that artifact used by your platform. This improves security by preventing tags or URLs from being hijacked to point to artifacts you don’t expect. It also allows you to scrutinise and scan things as they are brought into your environment.
NOTE: This is not a replacement for a full defence in depth approach to security but does add an additional checkpoint to better control what artifacts your platform uses.
Having better control over artifacts can also offer reliability improvements for the platform. You can ensure your artifact storage service meets your availability requirements: As the platform likely depends on being able to pull these artifacts, its availability is limited by the availability of your artifact storage. Additionally, these copies can be set up closer to where they are used, cutting down the time it takes to pull them compared to an external registry.
Any resources copied internally should be treated as immutable, just mirroring the upstream version. Any changes made to these artifacts, such as patching images or adding new fields to Helm charts, should be treated as creating a new artifact. It can be tempting, particularly with text based artifacts, to just directly edit them as required, however it then becomes tricky to manage these changes when pulling new versions of the artifact from upstream, and they could be lost. Instead the changes should be stored as separate patches, in a format that makes sense for the given artifact type.
The organisation we built the platform for already required that external artifacts were copied to internal storage and had processes for scanning and assessing these as they were ingested. They ran several on-premises Nexus servers hosting a range of different asset types, including container images, Go and Python packages, etc.
For artifacts required at runtime, like container images, we also copy these to Google Cloud Artifact Registry, configured in the same region as our clusters, for faster and more reliable pulling.
If we need to internally patch externally sourced container images, we would create new Dockerfiles, using the upstream image version as the base and then applying any required changes to create a new image. This Dockerfile is saved in our monorepo and built using our Tekton Pipelines, creating a new image tag which can be added to our image versions list for use in Helm charts and other manifests. The base image used can then be updated while reusing the same patch, and if the patch is no longer required it can be removed and the upstream image version can be used directly.
For text artifacts, like Helm charts and Kubernetes manifests, we would copy these from their upstream repositories into our monorepo. We then use Flux to manage these via layered Kustomizations, which allow patching manifests for things the upstream charts do not support. In some cases changes have been made directly to the upstream charts; however, this is hard to manage when updating, so it requires documentation.
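One hedged sketch of this layering, whether driven by Flux or plain Kustomize, is an overlay that leaves the mirrored upstream manifests untouched and applies local changes as patches (the paths, resource names, and patch contents are illustrative):

```yaml
# kustomization.yaml -- hypothetical overlay layering a local patch on top of
# an unmodified copy of the upstream manifests.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../upstream/example-component   # mirrored upstream manifests, unchanged
patches:
  - target:
      kind: Deployment
      name: example-controller
    patch: |-
      - op: add
        path: /spec/template/spec/priorityClassName
        value: platform-critical
```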
Maintain an upgrade runbook
In order to standardise the upgrade process a full runbook should be created which documents all the steps required. This should act as a guide that any member of the platform team can follow. Each step should have a comprehensive description and include details such as commands that need to be run, or dashboards that must be checked.
As described in the following section, it’s best to automate all upgrade processes where possible. However, even automated processes will likely need some manual initiation. As new steps are added to the process in response to changing requirements, it may not be possible to automate them before the next upgrade cycle, or at all. Therefore the upgrade runbook should be the main ‘entry point’ for starting an upgrade.
Documenting all the steps required for an upgrade can also help the process of automating it: Explicitly describing all the tasks and their dependencies can start to show what an automated process would look like and exactly what it needs to do.
We initially had a small team and a relatively simple platform, so we all knew the steps required to run an upgrade. However as the process became more complex and the team grew it was no longer sustainable to just rely on our implicit knowledge. It was easy to forget steps and challenging to onboard new team members to the process.
We started our upgrade automation before creating any kind of upgrade runbook, as a necessity to handle the large number of clusters our platform managed. But as there were still many manual steps, and many discrete automated processes, we realised we needed to create a full runbook to explain it all.
We now alternate which team member handles the upgrade process, so that everyone in the team is familiar with it. This also forces us to ensure that steps are properly documented so that any team member can consistently complete the process in an autonomous way. It also means that the process doesn’t depend on a small number of more senior engineers that are familiar with it, reducing the operational burden of the platform.
Test upgrades and rollback for each version
While a released version of the platform may work fine by itself, if there are issues upgrading to that version from other versions, or rolling back from that version, this can cause serious operational problems with user environments.
While iteratively developing a new feature or other change in a test environment it’s easy to miss details such as ordering requirements or manual steps. Tests may pass when run against your test environment because it’s already been configured correctly. However, when trying to apply this change from a previous version these missing details can cause errors.
Therefore it’s important that individual changes and new versions undergo upgrade testing to validate the transition from previous versions. Which previous versions should be tested depends on your version support model: For example, if you support the previous three versions and allow users to upgrade from any of those to the new version, then all three of these transitions should be tested.
Support for rollbacks is also important to allow downgrading from a newly released version back to a previous version. If a major issue is identified with a platform version after it has been rolled out to users a rollback can revert the platform to a known stable state. Without this option user environments may be left in a degraded or broken state while the platform team determines how to ‘fix-forward’ and release a new patch version to resolve the issue.
It might not always be possible for a version to support rollback, for example if it allows users to create a new type of resource which wasn’t previously available. Supporting rollback can involve additional engineering work to ensure that all platform changes are applied in a reversible manner. This may require rolling out major changes over multiple versions to reduce the size and complexity of each step, however this adds operational complexity which the platform team has to coordinate. For example, it may require pull requests to be broken up into several smaller pull requests and merged in the correct order as new versions are released. This also means these new features and fixes take longer to get to users.
Testing rollback from new versions back to the versions users are currently on establishes whether rollback is possible. If the tests show that a version cannot be rolled back, the platform team needs to decide whether this is acceptable or whether the version must be modified to support it. This in turn informs the operational plans for upgrading user environments: If there is no rollback option then there should be an alternative plan in place, such as rapidly rolling out a new fix release.
If it’s not possible or practical for a version to support rollback then the number of changes made in that version should be kept as small as possible. It’s always better to release smaller, more incremental changes, but this applies even more when those changes can’t be reverted.
Our platform would test both upgrading and rollback for a new version as part of our end-to-end tests, described in the next section.
We would try to always ensure that rollback was supported, so that we could go back to a stable version if problems were identified after upgrading the user environments. While this was rare, as we tested new versions extensively in our development environments, users would sometimes have edge-case configurations or workloads which we had not considered: If these caused problems then we would rollback their clusters to the previous version. If the issue seemed to affect a significant portion of users, then we would rollback the entire environment.
Maintaining rollback support did sometimes involve a lot of extra work, for example requiring creation of additional tooling to facilitate removal of new resources. This would often require making changes over several versions, initially adding recognition and cleanup of new resources, then in a subsequent version adding creation and use of them: almost like adding the feature in reverse.
When a version couldn’t support rollback, we would keep the number of changes included to a minimum, adjusting our release cadence if required to accommodate this. We would also ensure there was an operational runbook covering how to handle problems with the release, and the team would be ready to rapidly create a patch release to fix these problems if required.
Set up end-to-end testing
Testing is the most important way to establish the reliability and performance of new versions of the platform. As a platform is a complex collection of dependent components, full end-to-end testing is required to make sure everything is working together as expected.
End-to-end tests can highlight compatibility issues between different components, or other edge-case configuration clashes that wouldn’t be identified by individually testing components.
For a platform the end-to-end tests should enable or configure platform features in the same way a user would, and then verify that the expected outcome occurs in a reasonable amount of time. This ensures the platform behaves as expected from a user perspective. The tests should cover all features offered by the platform, and as many different configuration combinations as possible.
The tests should also check that upgrades from the version that users are currently on to the test version work correctly, and that it’s possible to roll back.
Running these kinds of tests can take a long time: It can take a while to enable or configure a new feature as it may require deploying new workloads or creating new cloud provider resources. Adding up all this waiting time can push the duration of end-to-end tests to several hours.
While this can make it impractical to run end-to-end tests with the same frequency as other testing, it’s still important to run these tests regularly to identify issues with newly added changes. A good balance is to run the end-to-end tests each night against the ‘main’ branch of the platform, checking the results and addressing any issues raised the next day.
Additionally the tests should be run against released versions to ensure the specific set of changes in that release are fully tested.
Our platform end-to-end tests would run against a specified test cluster; upgrading it to the version or branch being tested, deleting and recreating it, then enabling and disabling all of the features we offered. Each step included checks and timeouts to ensure that everything was happening as expected.
These tests were quite challenging to maintain, particularly when first set up, as there are many failure modes and it takes a while to calibrate sensible timeout durations. Over time we got better at designing tests, and importantly improved the reliability of the platform and its internal failure handling.
The tests are run in Tekton Pipelines, using a Go testing framework. These are triggered each night using a Kubernetes CronJob, as well as for each new version as part of the release process.
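A rough sketch of what such a nightly trigger could look like, assuming a job image containing the Tekton CLI (tkn) and an appropriately permissioned service account; the image, pipeline name, and schedule are illustrative:

```yaml
# Hypothetical CronJob that starts the end-to-end test pipeline each night.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-e2e
  namespace: ci
spec:
  schedule: "0 1 * * *"   # 01:00 every night
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: e2e-trigger   # needs RBAC to create PipelineRuns
          restartPolicy: Never
          containers:
            - name: start-pipeline
              image: example.registry/tools/tkn:latest   # any image with the tkn CLI
              command: ["tkn", "pipeline", "start", "platform-e2e", "--param", "branch=main"]
```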
Run unit and integration testing frequently
While end-to-end testing is important for validating new platform versions before upgrading, it is supported by unit and integration testing. As the end-to-end tests can be slow to run and complicated to set up it’s helpful to ‘shift left’ as much of the testing as possible.
Unit tests are performed on individual software components at a function level. Integration tests fall between unit tests and end-to-end tests, running at a component level.
These tests should be much quicker than end-to-end tests, allowing them to be run more frequently and earlier on in development to give quicker feedback to the platform engineering team. Using these tests to catch simple errors and failure modes will reduce the number of slower end-to-end test runs that fail, which then increases the speed and ease with which new versions of the platform can be released and rolled out.
For normal software development these kinds of tests are quite common. However, as platform components may be developed in a less standard way it can be easy to neglect them. For example, functionality can evolve from simple Bash scripts into small Python programs and then Go binaries with multiple packages: In this progression there might not be a clear point where developers felt that unit tests were required, but the result can be a large amount of untested code.
The platform code is also likely to contain a lot of ‘glue’: code which connects multiple APIs and heavily depends on them, for example programs which check one API for something and then inform another API about it. This code can be hard to unit test without fully mocking the APIs, which can be impractical if the API is quite complex and the code to test is fairly simple.
Wherever possible any platform specific logic should be in separate functions which don’t depend on any API, making them much easier to test independently. This should then make the functions which make API calls less complex, reducing the risk of errors. Testing of these API bound functions then effectively falls through to integration and end-to-end tests, which should still catch API specific errors.
As the platform will likely contain more than just code, it can help to identify ways to unit test other assets, such as configuration and manifests. For example, Kubernetes manifests can be syntactically validated with a dry-run ‘kubectl apply’, Helm chart values with a JSON schema for the values file, and Terraform definitions by running ‘terraform validate’. Custom configuration files should also have schemas defined to ensure these are validated.
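These checks are plain CLI invocations, so they can be wired into any CI system; as a hedged sketch, a Tekton-style task running them over a checked-out copy of the repository might look like this (the images, paths, and chart name are illustrative assumptions):

```yaml
# Hypothetical Tekton Task running the fast validation checks described above.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: validate-configs
spec:
  workspaces:
    - name: source   # the checked-out platform repository
  steps:
    - name: kubectl-dry-run
      image: bitnami/kubectl:latest   # any image providing kubectl
      script: |
        kubectl apply --dry-run=client -f $(workspaces.source.path)/manifests/
    - name: helm-lint
      image: alpine/helm:latest       # any image providing helm
      script: |
        helm lint $(workspaces.source.path)/charts/platform-addons
    - name: terraform-validate
      image: hashicorp/terraform:latest
      script: |
        cd $(workspaces.source.path)/terraform
        terraform init -backend=false
        terraform validate
```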
As with other procedures these tests should be automated. As they are relatively quick to run it’s recommended that they are used to gate which changes are merged into the platform repository. This should keep the quality of the main branch higher and reduce the amount of changes which have to later be reverted or patched.
We have a range of unit tests defined for our Go and Python code; we also use ShellCheck for Bash scripts, as well as validation for Helm charts, Terraform, etc. and custom JSON schemas for platform-specific configuration files. All of these checks are run as GitHub checks on every pull request, triggered by Lighthouse and run in Tekton Pipelines.
We also have integration tests which run after applying a pull request’s branch to a development cluster, checking the state of various components to make sure they are working correctly. These are slightly slower than the unit tests but can be manually triggered from the pull requests using a comment command which is picked up by Lighthouse.
Manage and communicate platform CVEs
As part of the operational security of a platform all components should go through some form of scanning to identify CVEs and other issues. This includes container images, binaries, dependent packages, and your own code, which should all be analysed to assess the risk they could present.
While it’s good practice to do this continually, in order to stay on top of CVEs and other findings, it’s also important to know which findings affect each version of your platform, as this is what will actually impact users.
Results of these scans for each version should be shared with the platform users, so they are aware of potential issues and can see when they have been resolved. One way to do this is to include them with release notes.
We run daily scans of container images, our code, and dependent packages against our repository’s ‘main’ branch, so that CVE information is relevant to the state of the platform as we are currently working on it. These scans are run in Tekton Pipelines, triggered by Kubernetes CronJobs. We also run code scan pipelines for each pull request branch, triggered by Lighthouse and configured as a pre-merge check to require any code issues to be resolved before they can impact the rest of the code base.
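As an illustrative sketch of what such a scan step can look like (the scanner shown here, Trivy, the image list path, and the severity threshold are all assumptions for the example):

```yaml
# Hypothetical scan task: iterate over the pinned images and fail if any
# HIGH or CRITICAL CVEs are found. Trivy is an assumed choice of scanner.
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: scan-images
spec:
  workspaces:
    - name: source
  steps:
    - name: trivy-scan
      image: aquasec/trivy:latest   # official Trivy image
      script: |
        # images.txt is assumed to be generated from the central versions file
        while read -r image; do
          trivy image --exit-code 1 --severity HIGH,CRITICAL "$image"
        done < $(workspaces.source.path)/build/images.txt
```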
However, as the ‘main’ branch is ahead of the versions running in our user-facing environment, we also have to run the same scans for these versions. These are also run in Tekton Pipelines, triggered as part of the upgrade runbook.
The organisation requires results of these scans to be recorded and accessible for audit purposes, so we save the logs and results from the scans to a Google Cloud Storage bucket, then have a Python script in another Tekton pipeline upload these to our release notes in Confluence.
Use standard channels for announcements
To facilitate effective communication to users there should be one or more standard methods of communication, such as emails or Slack channels. Any announcements are then made using these channels. This ensures that users know where to expect and check for announcements.
Whenever new teams onboard to the platform they should be added to these communication channels and advised how they can monitor them and get notifications.
The communication channels should also not be overused. Excessive or unnecessary announcements will cause fatigue and mean that important information is missed by users.
For the platform I’m working on we make announcements in a dedicated Microsoft Teams channel, as this is where our users currently communicate. The channel only allows members of the platform team to post, to avoid user discussion obscuring important announcements, but users can comment on posts, and we will share follow-up comments with more details about upgrades or issues.
We also send more infrequent but formal announcements as emails to the managers of all teams using the platform.
Announce upgrades with sufficient notice
Every upgrade must be announced on all standard channels with enough notice for teams to respond appropriately.
This notice time can be aligned with the version progression. For example, enforcing a 7 day waiting period between upgrading development environments and production, to give application teams some time to make and test any changes to their deployments which are required by the new platform version.
As soon as the new version has been released and tested, we announce this with a link to the release notes and advise platform users that we’re going to be upgrading the development environment. We also give an estimated date for the production upgrade.
Another announcement is made when the production upgrade date has been confirmed and a final announcement will be made just before the production upgrade begins.
We’ll add follow-up comments to these announcements about how upgrades are progressing, including any delays or issues, when they have completed, or if they have to be cancelled or reverted.
The Teams announcements are currently done manually but are part of our upgrades runbook. We’re investigating automating them but this has been delayed due to restrictions on API access to Teams within the organisation.
Provide detailed and understandable release notes
As well as announcing an upgrade, release notes should be provided which are detailed but still understandable to users.
These should include information about all changes which have been made. Extra warning and details should be given for any potentially breaking changes, and instructions should be provided on how teams can determine if they are affected and how they can remediate issues.
We generate our release notes using a Go script based on the one previously used by Kubernetes, which aggregates the title of all pull requests merged since the previous release and includes a link to each. This script is run in a Tekton Pipeline along with several other initial release tasks such as creating a tag and release page on GitHub.
We require all pull request titles to include a prefix to indicate the type of change it makes, for example ‘feat’ for new features or ‘fix’ for fixes. Pull request titles are then grouped by type in the release notes which makes it easier to see what kind of changes the new version contains, in particular if it is just fixes or also includes new features.
Where relevant we also include Jira ticket IDs in pull request titles and automate linking to these tickets when generating the release notes. These Jira ticket IDs can be from tickets in our engineering project or support project, which gives users visibility over changes made to implement features they have been waiting for or support issues they have raised.
The generated release notes are initially put into the GitHub release description. A Python script, also running in the Tekton release pipeline, then copies these notes to a Confluence page under a new heading. This allows users to watch the Confluence page and receive notifications whenever it is updated.
Prefer support tickets for user requests
To facilitate user requests, whether these are for support, features, or to report bugs, some form of ticketing system is recommended.
While direct communication over email or Slack is often preferred, this can become hard to manage as the platform grows. It’s difficult to balance support requests across the team, so individual platform engineers may be overwhelmed by large numbers of messages, and handing conversations over between team members is more difficult.
More importantly, if issues are only discussed in direct communication, then they may not be visible to the wider team. This could mean larger trends or patterns of issues which affect multiple platform users may not be identified correctly. A ticketing system allows for high level analysis of the rate and type of support requests that the platform team receives, which allows the team to more effectively prioritise their work.
We use Jira tickets for user support as this was already the standard for support requests in the organisation. We have separate Jira projects for support and engineering tickets, with different forms for users to raise onboarding or support requests. This helps manage the different work streams and makes it easy to see the operational and engineering workloads.
Do not separate support and engineering teams
While a ticketing system is recommended, splitting the engineering and support work is not. By keeping all of the team involved with both development and operations work they are all kept more informed about how the platform works, better able to give high quality support to platform users, and incentivised to design the platform in a way that minimises the support burden.
In order to keep the whole team aware of both operational and engineering issues we review both engineering and support tickets as part of stand-up calls. This also ensures operations tickets are being properly addressed and our team are able to respond to them, as they may contain more unpredictable questions and less complete information.
Facilitate user communities
In addition to a ticketing system, it’s recommended that some form of platform community is set up, for example a Slack channel or private forum. This allows platform users to communicate directly with each other and share solutions and advice.
This is not just about reducing the amount of support work the platform team needs to do. In many cases the issues users are facing are not directly with the platform but how best to integrate with it and run their workloads on it: For these kinds of questions other users often have the best advice as they share a similar perspective.
We have a Microsoft Teams group as a user forum for discussion, as well as an internally hosted ‘Stack Exchange’ style question and answer service. The platform team will engage with both of these and provide support, as well as trying to connect users to each other to collaboratively solve problems.
Many users asked questions about connecting to databases from their clusters, and while the platform team did investigate this none of us had much experience with it or regularly used database proxies or other tooling like this. Other users were able to offer more effective advice, which was then added to the platform’s user documentation to help future users.
Make changes based on feedback
Finally, it’s important to take feedback from support tickets, community discussion, and other communication channels and use this to update documentation. Frequently asked questions should be addressed in the documentation, and any processes which regularly cause confusion should have clear guides written. This also improves the platform user experience and reduces the support burden on the platform team.
It can be beneficial to include reviewing user support tickets and community questions as part of daily stand-ups, or in other regular sessions, to ensure that issues they raise are effectively addressed. These issues should be triaged and, where changes to processes or documentation are required, this should be integrated into the normal flow for planning work.
Within the team we would encourage team members to take time to follow up on support tickets, for example by updating user documentation. If more substantial work was required to address problems, then we would raise engineering tickets which would be prioritised and assigned as part of our normal workflow.
Conclusion
These patterns are intended to improve all aspects of a platform upgrade and the processes that support it. While they may not be simple to implement, they bring extensive operational benefits and support the platform as it scales up.
Here’s a brief summary of each pattern which can be used as a checklist or prompt:
- Release changes as versions - Group changes to platform components together and release them as a version.
- The user development environment is your production environment - Have separate environments for where the platform is developed and where users develop on the platform.
- Progress versions through environments - Move versions through testing environments, then user development environments, and finally production environments.
- Keep all platform components in a single repository - Group all code, resource definitions, manifests, etc. in a single repository so it’s easy to make coordinated changes and tag versions.
- Pin external dependencies - Use fixed versions of external packages, binaries, images, etc. to ensure their behaviour is predictable and tested.
- Copy external artifacts internally - Make copies of packages, binaries, images, etc. to improve reliability and security.
- Maintain an upgrade runbook - Have up-to-date and clear instructions on all the steps required to complete an upgrade in your documentation.
- Test upgrades and rollback for each version - Don’t just test each version on its own; ensure that the process of upgrading from an old version to the new one, and a new one back to an old one, all works correctly.
- Set up end-to-end testing - Run high level tests that make configuration changes in the same way a real user would, and verify these trigger the correct behaviour.
- Run unit and integration testing frequently - Have a suite of fast unit and integration tests which can be run with high frequency, for example on every pull request.
- Manage and communicate platform CVEs - Scan all components of the platform, track CVEs and resolve them in a reasonable time, and publish security information to users so they can assess risks for themselves.
- Use standard channels for announcements - Always make announcements in the same places so users know where to find them.
- Announce upgrades with sufficient notice - Let users know about upcoming changes with enough time for them to prepare and make any required changes.
- Provide detailed and understandable release notes - Document all changes included in a release in a way that is helpful for users.
- Prefer support tickets for user requests - Encourage users to raise important requests as tickets so these can be effectively tracked and shared by the platform team.
- Do not separate support and engineering teams - Ensure that everyone working on the platform is familiar with both the support and engineering sides of the work to encourage holistic thinking.
- Facilitate user communities - Set up spaces where platform users can communicate to help each other and share advice.
- Make changes based on feedback - Ensure that user tickets and comments are reviewed to identify trends and areas of improvement which feed into the platform team’s work.
Jetstack Consult has extensive experience building and scaling different platforms for customers looking to elevate their cloud-native and Kubernetes offerings. Get in touch to see how we can work together to overcome your challenges and build something great!