Practical Tips for Refactoring Release CI using GitHub Actions

Starting with a Release Pipeline

Since the very first day GreptimeDB went open source, it has embraced automated software builds with GitHub Actions, which led to the inaugural Release Pipeline.

For an open-source project, a stable and consistent Release Pipeline is paramount for the following reasons:

Delivering Ready-to-use Software Artifacts: As the upstream producer in the software supply chain, offering safe, trustworthy, and ready-to-use software artifacts (e.g., binaries, images, etc.) to various downstream users becomes imperative.
Enhancing Developer Experience: Users can acquire a readily executable software artifact for their respective platforms without excessive configurations or the need to set up and compile from scratch.
Automated Testing Around Release Workflow: Integrating different types of regression tests (like performance, stability, integration tests, etc.) with the automated Release process boosts software quality.

Although there are alternatives such as CircleCI, Travis CI, GitLab CI, or even self-hosted options built on open-source projects like Tekton or Argo Workflows, the reason for choosing GitHub Actions was straightforward: GitHub Actions, in conjunction with the GitHub ecosystem, offers a user-friendly experience and access to a rich software marketplace.

However, being user-friendly doesn't necessarily mean being easy to maintain. On the contrary, GitHub Actions workflows can deteriorate easily. The release.yml in GreptimeDB's initial open-source version was very concise, at only 183 lines. After numerous modifications by multiple contributors, though, this concise YAML progressively incorporated:

Artifact builds across diverse platforms;
Builds with different feature toggles enabled for the software artifacts;
Integration tests conducted before actual builds;
Pushes to varied software artifact repositories (DockerHub, ACR, S3, etc.);
Control under diverse Release conditions (like manual triggering, error tolerances, etc.);
and so on.

Additionally, for other reasons (debugging releases, daily builds, etc.), forks of similar pipelines with only minor differences emerged in different internal repositories, increasing the maintenance overhead.

As the fine-grained build requirements grew more complex, the release.yml quickly swelled, filled up with redundant configuration, and became hard to maintain. Without timely refactoring, the Release Pipeline risked imminent and total deterioration.

Reasons Why Release Pipelines Deteriorate

Examining the release.yml, I would like to identify several reasons why it degraded so rapidly. Only with a thorough understanding of how this happened can we develop an appropriate refactoring plan.

Language level: Being based on YAML, the Domain Specific Language (DSL) of GitHub Actions lacks the expressiveness of general-purpose languages. This limitation can lead to redundant and unmaintainable code.
Low debuggability: GitHub Actions workflows are notoriously difficult to debug. This challenge is exacerbated by the project's use of Rust, a language known for its long compile times, which further lengthens the debug cycle. While tools like act allow GitHub Actions to be executed locally, the actions must still actually run, so these tools cannot truly shorten the write-run-debug cycle.
Lack of modular decoupling between actions: GitHub Actions provides composite actions as a way to combine different actions. Due to a lack of experience, we did not break the logic down into separate actions; instead, we piled everything into a single YAML file, which inevitably became hard to maintain.
Ignoring Reproducible Builds: As GitHub currently lacks ARM64 VM instances, we chose to build both AMD64 and ARM64 software artifacts on GitHub's x86_64 VM instances, cross-compiling for ARM64, to get better compile performance. Although we could emulate an ARM64 build environment by using Docker Buildx to launch QEMU, its performance is much worse than on a native platform. Because we relied on the GitHub runner's host environment rather than a Dockerfile, achieving a consistent Reproducible Build was challenging.

Schema

Mastering GitHub Actions can be tricky; getting a GitHub Actions workflow to work flawlessly on the first go is impressive!

When we embarked on the refactoring process, we followed the principle that maintainability >> performance (build speed).

This approach is crucial because Release pipelines are bound to keep evolving as the project grows. If maintainability falls by the wayside, decay sets in and ultimately reduces R&D efficiency. Once maintainability is ensured, we will be more motivated to focus on performance. For compile/build scenarios, even before considering caching mechanisms, better build machines can typically boost performance.

Refactoring Plan

Refactoring a YAML file, unlike a typical programming project, is essentially a thorough review of various configuration processes. While it involves little program logic, it carries high accidental complexity. During the process, one might repeatedly fall into hidden traps and then face the arduous challenge of climbing back out. Here we summarize several practical tips for those undertaking a similar refactoring.

Standardize builds with Dockerfile: Though building from a Dockerfile may cost some performance, it improves maintainability and standardizes the build process across platforms, which helps ensure a Reproducible Build.
Unified Command Interface: Building on the previous point, try to consolidate the various build commands into a single make command. This keeps complex compilation contexts out of the YAML. Instead of hiding too many details in the Release phase, expose them in the Makefile or scripts used during the development phase. With a Makefile, users get the same build experience as the Release phase, which improves R&D efficiency.
Use AWS EC2: As mentioned earlier, since GitHub Actions currently doesn't offer ARM64 VM instances, we had to cross-compile. To standardize the build process for all platforms with a single Dockerfile, we used AWS EC2 ARM64 instances to build the software artifacts for the ARM64 platform.
Modular Decoupling: Split the release.yml so that it becomes a relatively straightforward and uncluttered collection of Jobs. Each individual action.yml file under the actions/ directory should be kept brief and to the point. This makes it easier to customize various pipelines based on the same actions, enhancing the adaptability and efficiency of the overall process. Note that, because GitHub Actions lacks a Group Job mechanism, we consider this approach the best available solution.
Keep Jobs Simple: Each Job should concentrate on a single, specific task, which makes it more idempotent. Should an error arise, this focused approach makes it easier to retry the Job. It also makes it easier to extract top-level control variables, allowing more precise control over manual triggering.
Avoid Overloading Shell Runs with Excessive Commands in Actions: Refrain from packing too many Shell commands into a single GitHub Actions Step. While this might appear to be a straightforward approach, it can be detrimental to maintainability. If you find yourself faced with numerous commands, consider converting them into external scripts and refining the input parameters. By doing so, you ensure that the scripts are independently executable and verifiable.
Introduce a Pre Job to Allocate Runners: Allocate Runners is the first Job to execute; it allocates Runners and creates the global Version markers for the Jobs that follow. For instance, if we choose to use EC2, the Allocate Runners Job allocates EC2 instances of the corresponding platform through the EC2 API (implemented by the ec2-github-runner Action). In the future, we plan to incorporate more sophisticated selection algorithms for allocating Runners, with the aim of reducing allocation costs.
Global Unified Pipeline: Avoid forking functionally similar GitHub Actions workflows, as it raises maintenance costs. To foster a more transparent open-source development process, we have consolidated all previously internal build pipelines into the main GreptimeDB repository. As long as the code is open source, both the software product and the build process should be as well.
Use Variables and Secrets properly in GitHub Repository: Previously, our CI treated most external parameters as Secrets, which wasn't appropriate. Non-secret external parameters should be configured as GitHub Variables so that they are easier to adjust later. Values that might need frequent adjustment shouldn't be hardcoded in the YAML; instead, they should be extracted into Variables. This reduces the number of low-information PRs that only modify configuration.
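
Several of the tips above can be combined into a skeleton like the one below. This is only a minimal sketch under stated assumptions, not GreptimeDB's actual release.yml: the workflow name, the use_ec2_arm64 input, the ARM64_RUNNER_LABEL and IMAGE_REGISTRY Variables, the REGISTRY_PASSWORD Secret, the docker-image Makefile target, the scripts/push-image.sh script, and the version format are all hypothetical. The EC2 allocation itself is left as a placeholder rather than guessed at; the Pre Job here only selects a runner label.

```yaml
# Hypothetical sketch: every name, input, path, and target below is an
# illustrative assumption, not GreptimeDB's real configuration.
name: release-sketch

on:
  workflow_dispatch:
    inputs:
      use_ec2_arm64: # top-level control variable for manual triggering
        description: Build ARM64 artifacts on a self-hosted ARM64 runner
        type: boolean
        default: false

jobs:
  # Pre Job: picks the runner for later Jobs and computes one version
  # string that every following Job shares.
  allocate-runners:
    runs-on: ubuntu-latest
    outputs:
      arm64_runner: ${{ steps.pick.outputs.arm64_runner }}
      version: ${{ steps.version.outputs.version }}
    steps:
      # Placeholder: a real pipeline would start an EC2 instance here and
      # use the label it reports. This sketch only selects a label.
      - id: pick
        env:
          # Non-secret, frequently tuned value -> repository Variable.
          RUNNER: ${{ inputs.use_ec2_arm64 && vars.ARM64_RUNNER_LABEL || 'ubuntu-latest' }}
        run: echo "arm64_runner=$RUNNER" >> "$GITHUB_OUTPUT"
      - id: version
        # Assumed version scheme: build date plus short commit SHA.
        run: echo "version=$(date +%Y%m%d)-${GITHUB_SHA::8}" >> "$GITHUB_OUTPUT"

  # One Job, one task: build and push the ARM64 image.
  build-arm64:
    needs: allocate-runners
    runs-on: ${{ needs.allocate-runners.outputs.arm64_runner }}
    env:
      VERSION: ${{ needs.allocate-runners.outputs.version }}
      REGISTRY: ${{ vars.IMAGE_REGISTRY }} # non-secret -> Variable
    steps:
      - uses: actions/checkout@v4
      # Unified command interface: the same make target a developer would
      # run locally, with the build details kept in the Makefile/Dockerfile.
      - name: Build image
        run: make docker-image VERSION="$VERSION" REGISTRY="$REGISTRY"
      # Longer shell logic lives in an external, independently runnable script.
      - name: Push image
        env:
          REGISTRY_PASSWORD: ${{ secrets.REGISTRY_PASSWORD }} # sensitive -> Secret
        run: ./scripts/push-image.sh "$REGISTRY" "$VERSION"
```

Because the YAML only wires Jobs together and passes parameters, the build logic can be exercised locally through the same make target and script, which is where most of the debugging time is saved.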

Future

The refactoring of the release pipeline is merely a small step in GreptimeDB's journey toward maturity. Going forward, we aim to build an even higher-quality and more powerful CI:

Expanding Platform Ecosystem: We are about to release software artifacts for the Windows platform and you're welcome to test and experience it upon its release.
Introducing more Automated Testing: Going forward, we aim to integrate an array of test types in our CI, such as chaos testing and performance testing, to further boost software quality.
Lowering CI Usage Costs: By assigning diverse types of Runners based on different use cases, we intend to make the overall CI usage more cost-effective.
Improving Building Performance: In fact, the refactoring of the release pipeline has somewhat reduced our build performance (#2113). By employing smarter build caching, we may further improve building performance.
Achieving a More Secure Software Supply Chain: In the management of modern software artifacts, securing the software supply chain is becoming more and more crucial. As an open-source project, we must ensure that the software artifacts we distribute are safe, trustworthy, and transparent. To this end, we need to integrate essential security measures into our existing release process. Practices like SBOM management and software artifact signing and verification are well worth adopting.

Fully leveraging GitHub Actions can be challenging and we're committed to continuous improvement. If you're intrigued and want to explore further, we warmly invite you to join our community discussions on Slack! Your insights and participation could be vital in shaping the next stage of our growth.
