The main activities developers perform in a codebase are the following: make some changes, run tests, package and upload artifacts, deploy the artifacts to a dev environment, and then run automated or manual tests against the deployed changes. All of these tasks are usually automated.

A developer might run ./gradlew test to execute tests, ./gradlew shadowJar to create an artifact for distribution, docker build and docker push to create and publish a Docker image, ansible or terraform to apply infrastructure changes, and curl for API testing.

This knowledge is usually captured in automation scripts (typically bash), in manual steps documented in README.md or other markdown files in the codebase, or, in the worst case, it resides only in your team members’ brains, waiting to be captured somewhere.

It is common to see a README.md containing a snippet such as ./scripts/do_something.sh, followed by some more text, then ./scripts/do_something_else.sh.

It can be tedious for developers to read through hundreds of lines of text to understand the correct usage of these scripts. Sometimes developers have to scan through a bash script to figure out why a provided parameter does not work. Let us be honest: it takes great effort to become proficient in shell scripting. Shell scripts also lack many modern scripting-language features, which makes it hard to write reusable, maintainable code.


In the last two years, I have experienced these pains in a few codebases I worked on. After repeated frustration with such scripts, I recalled some good examples I had seen in Ruby on Rails projects many years ago. Almost all of the tasks above were automated as Rake tasks. Running rake -T shows a list of all automation tasks in a codebase, and each task has a description of its responsibility, expected parameters, and so on.

Here are a few benefits of using Rake for task automation, in my opinion.

  • It integrates easily with the operating system. Simply put a command within backticks (e.g. `date`) and Rake will invoke it as a separate OS process.
  • It is written in the Ruby programming language, so developers can create classes and functions for better maintainability.

There are some downsides to using Rake: some Ruby packages (gems) require OS-specific native dependencies. These dependencies can break during OS updates and cause headaches for developers who have not worked with the Ruby ecosystem.

After shopping around, I found pyinvoke, which provides similar functionality to Rake but is a Python-based tool. invoke -l lists all tasks in a codebase, and each pyinvoke task is a Python function annotated with @task. Developers can define classes and functions and use them in pyinvoke tasks. We have now migrated all of our shell-script-based build scripts to pyinvoke tasks. The introduction of pyinvoke has helped us standardize our deployment process, and more developers feel comfortable creating new automation tasks.
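
To give a feel for what these tasks look like, here is a minimal tasks.py sketch (the task bodies, commands, and image name below are illustrative rather than our actual build code):

# tasks.py - an illustrative sketch of pyinvoke tasks
from invoke import task


@task
def test(c):
    """Run all unit tests locally."""
    c.run("./gradlew test")


@task
def package(c, version):
    """Create the jar and build a Docker image tagged with the given version."""
    c.run("./gradlew shadowJar")
    c.run(f"docker build -t my-app:{version} .")  # image name is made up for this example

Running invoke -l then lists these tasks together with the first line of each docstring.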

Following the concepts outlined in Unified interface for developer and Build Pipeline, we have implemented several pyinvoke tasks featuring a --local flag. This enables developers to test changes locally without needing to push to a branch, creating a quicker feedback loop.

The same tasks are used in Bitbucket Pipelines without the --local flag. This has made our lives much easier when dealing with pipeline failures.
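
As an illustration (the helper script and registry name below are made up), such a task can branch on the flag so that the same code path serves both a dev machine and the pipeline:

# tasks.py - illustrative sketch of a task shared by dev machines and Bitbucket Pipelines
from invoke import task


@task
def publish(c, version, local=False):
    """Publish the Docker image for the given version."""
    if local:
        # On a developer machine, authenticate with the registry first (hypothetical helper script).
        c.run("./scripts/docker_login_as_developer.sh")
    # On the build agent, the pipeline is already authenticated, so no extra step is needed.
    c.run(f"docker push my-registry/my-app:{version}")  # registry and image names are made up

A developer runs invoke publish --version=1.2.3 --local on their machine; the pipeline runs the same task without --local.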

Developer experience with pyinvoke

With these tasks automated with pyinvoke, here is what a developer performs in their daily work.

  • invoke test to run all unit tests locally;
  • invoke package --version=<version> to create the jar and build a Docker image with a specific version;
  • invoke publish --version=<version> to publish the image you just built;
  • invoke deploy --version=<version> --env=<env> to deploy that version to a specific environment;
  • invoke smoke-test --env=<env> to run some basic post-deployment validation against the service in an environment.

If developers forget which task to use, invoke -l will show the full list of existing automation tasks, and they can easily create a new one if none of the existing tasks fulfils their needs.

Today I was implementing the Logging Correlation ID feature. Ideally, I wanted to follow the AOP philosophy and, without touching any business-logic code, add the following behaviour to every processor in the Topology:

  • Extract the CorrelationID from the header
  • Set the CorrelationID into the Mapped Diagnostic Context (which is backed by ThreadLocal)
  • Add the CorrelationID to the logger's pattern
  • Clean up the Mapped Diagnostic Context

This way, the business code gains the ability to log the correlationID without any changes. However, since the Kafka Streams DSL does not expose the ConsumerRecord to us, manipulating headers is not particularly convenient. See the background and use cases of Kafka Record Headers.

Zipkin's tracing implementation for Kafka Streams is very similar to the SafeKafkaStream I built in a previous project: a wrapper implements the KafkaStream interface, delegates every operation to the delegatee inside the wrapper, and adds extra behaviour.

The approach I ended up taking is as follows:

  • Use the id in the original message's payload as the correlation id, extracted by a CorrelationIDExtractor that knows how to pull the correlation id out of each type of Record
  • Wrap each operator's argument (mostly a function) with withContext, and perform the setup and cleanup inside the decorated function, as sketched below
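
The real implementation targets the Kafka Streams DSL in Java; purely to illustrate the wrapping idea (all names below are hypothetical), a minimal Python sketch of withContext might look like this:

# Illustrative sketch only: the real withContext wraps Kafka Streams operators in Java.
import contextvars
from typing import Callable, TypeVar

R = TypeVar("R")

# contextvars plays the role of the Mapped Diagnostic Context (ThreadLocal) here.
correlation_id = contextvars.ContextVar("correlation_id", default="unknown")


def extract_correlation_id(record: dict) -> str:
    """Hypothetical stand-in for CorrelationIDExtractor: pull the id out of the record payload."""
    return str(record.get("id", "unknown"))


def with_context(operator: Callable[[dict], R]) -> Callable[[dict], R]:
    """Wrap an operator function so the logging context is set up before the call and cleaned up after it."""
    def wrapped(record: dict) -> R:
        token = correlation_id.set(extract_correlation_id(record))  # setup
        try:
            return operator(record)
        finally:
            correlation_id.reset(token)  # cleanup
    return wrapped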

This compromise still has the following advantages:

  • Obtaining the CorrelationID is centralised in a single place, the CorrelationIDExtractor, so if Kafka Streams later improves its header support, it will be easy to switch to a new approach.
  • withContext keeps the intrusion into the business code to a minimum.

Issue

I have seen many times that developers struggle to diagnose a failing build. “Why did the test/deployment pass on my machine but fail in the build pipeline?” is probably the question developers ask most often.
Back in the old days, a developer could ssh into the build agent, go straight to the build’s working directory, and start diagnosing.
Now, with pipeline-as-a-service offerings such as CircleCI and Buildkite, developers have far less access to build servers and agents than ever before. They can no longer perform this kind of diagnosis, not to mention the many other drawbacks of that approach.

What is the alternative? One common approach I have seen is making small tweaks and pushing changes furiously, hoping that one of the many attempted fixes will work or reveal the root cause. This is both inefficient and unreliable.

Solution

I tend to follow one principle when setting up the pipeline in a project.

Unified CLI interface for dev machine and CI agent.
Developers should be able to run a build task on their dev machine as well as on a CI agent, provided they are granted the correct permissions.

Examples

For example, a deployment script should have the following command-line interface:

./go deploy <env> [--local]

When this script is executed on a build agent, it will use the build agent's role to perform the deployment.

When it is executed from a developer's machine, the developer needs to provide their user id and is prompted for a password (potentially a one-time password) to acquire permission for the deployment.
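
Whether this ends up as a shell script or, as in our codebases, a pyinvoke task, the branching might look roughly like the following Python sketch (the helper function and deploy script here are hypothetical):

#!/usr/bin/env python3
"""Hypothetical sketch of the branching behind ./go deploy <env> [--local]."""
import getpass
import subprocess
import sys


def acquire_developer_session(user: str, password: str) -> None:
    """Hypothetical helper: exchange the developer's credentials for temporary deployment permissions."""
    ...


def deploy(env: str, local: bool) -> None:
    if local:
        # Developer machine: prompt for user id and a (one-time) password.
        user = input("User id: ")
        password = getpass.getpass("Password / OTP: ")
        acquire_developer_session(user, password)
    # On a build agent, the agent's role already grants deployment permission, so no prompt is needed.
    subprocess.run(["./scripts/deploy.sh", env], check=True)  # hypothetical deploy script


if __name__ == "__main__":
    deploy(env=sys.argv[1], local="--local" in sys.argv[2:])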

Benefits

There are many benefits to following this principle:

  • Improved developer experience.
    • The feedback loop for changes is much faster compared to testing every change on the pipeline.
    • Enabling developers to execute tasks on their local machine helps them trial new ideas and troubleshoot issues.
  • Knowledge is persisted.
    • Developers are smart; they will always find some trick to improve their troubleshooting efficiency. It could be temporarily commenting out or adding a few lines in a script, and this knowledge tends to get lost if it is not persisted as a pattern.
      This principle encourages developers to persist that knowledge in the build scripts, which benefits every developer working on the project.

Local Optimization and Its Impact:

Local optimization refers to optimizing specific parts of the process or codebase without considering the impact on the entire system or project. It’s essential to remember that software development is a collaborative effort, and every team member’s work contributes to the overall project’s success. While local optimizations might improve a specific area, they can hinder the project’s progress if they don’t align with the project’s goals or create dependencies that slow down overall development. For instance, optimizing a single service without considering its alignment with the entire system can cause unintended bottlenecks.

The impacts of local optimization include increased complexity, delayed delivery due to unforeseen consequences, and limited flexibility. Local optimizations can lead to complex and hard-to-maintain code, ultimately slowing down future development. Additionally, code that’s over-optimized for specific scenarios might be less adaptable to changing requirements, limiting its usefulness in the long run.

To address these issues, it’s crucial to measure the impact of optimizations on the entire system’s performance rather than focusing on isolated metrics, and to prioritize high-impact areas for optimization. By doing so, we ensure that our efforts align with the project’s overall success and deliver the most value to stakeholders.

Scoping, Prioritization, and Re-Prioritization:

Clearly defined scopes are essential for effective prioritization. Establishing frequent and fast feedback loops ensures that we can adjust our priorities as we receive new information. When dealing with technical debt, it’s wise to focus on high-impact areas and set clear boundaries for our goals. Breaking down larger goals into smaller milestones allows us to track progress and maintain a sense of accomplishment. Frequent re-prioritization based on newly learned context is a proactive approach. By doing so, we adapt quickly to changes and align our efforts with the evolving needs. It’s not just acceptable; it’s vital for our success. This practice ensures that our work remains aligned with our goals, continuously delivers value, and effectively responds to our dynamic environment.

Considering a Rewrite:

When technical debt reaches a high level, considering a rewrite might be a more efficient solution than extensive refactoring. A rewrite can result in a cleaner, more maintainable codebase, improved performance, and enhanced functionality. However, undertaking a rewrite requires a thorough exploration of effort, risks, and mitigation plans. A well-executed rewrite leverages the lessons learned from past mistakes, incorporates new technologies, and follows modern design patterns.

Prioritizing Simplicity over Flexibility:

Simplicity is a cornerstone of maintainability and readability within our codebase. Clear, straightforward code that follows consistent patterns is easier to understand and maintain. While flexibility might seem appealing for accommodating potential changes, it can introduce unnecessary complexity. Complex code paths and intricate component interactions hinder our ability to make changes efficiently. Prioritizing simplicity sets the foundation for a codebase that remains valuable over time. It ensures that we strike the right balance between adaptability and maintainability while avoiding unnecessary complications.

Terraform Tips: Multiple Environments

In the last post, we explored the idea of layered infrastructure and the problems it tries to solve.

One of the benefits of using Terraform to provision multiple environments is consistency. We can extract environment-specific configuration, such as CIDR ranges and instance sizes, as Terraform module variables, and create a separate variable file for each environment.

In this post, we will look at different options for provisioning multiple environments with Terraform.

In real-life infrastructure projects, a remote state store and state locking are widely adopted for ease of collaboration.

One backend, many terraform workspaces

I have seen some teams use terraform workspace to manage multiple environments, and many backends support workspaces.

Let’s have a look at an example using S3 as the backend.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.35.0"
    }
  }

  backend "s3" {
    bucket  = "my-app"
    key     = "00-network"
    region  = "ap-southeast-2"
    encrypt = "true"
    dynamodb_table = "my-app"
  }
}

It is pretty straightforward to create multiple workspaces and switch between them.

terraform workspace new dev
terraform workspace new test
terraform workspace new staging
terraform workspace new prod

terraform workspace select <workspace-name>

Each workspace's state will be stored under a separate prefix in the S3 bucket (the S3 backend defaults to the workspace_key_prefix env:).

e.g.

s3://my-app/env:/dev/00-network
s3://my-app/env:/test/00-network
s3://my-app/env:/staging/00-network

However, the downside is that both non-prod and prod environments’ states are stored in the same bucket. This makes it challenging to impose different levels of access control for prod and non-prod environments.

If you stay in this industry long enough, you will have heard stories of the serious consequences of “putting all your eggs in one basket.”

One backend per environment

If using one backend for all environments is risky, how about configuring one backend per environment?

parameterise backend config

One way to configure an individual backend for each environment is to parameterise the backend config block. Let's have a look at the backend configuration of the following project structure:

├ components
│  ├ 01-networks
│  │    ├ terraform.tf  # backend and providers config
│  │    ├ main.tf
│  │    ├ variables.tf
│  │    └ outputs.tf
│  │
│  ├ 02-computing

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.35.0"
    }
  }

  backend "s3" {
    bucket  = "my-app-${env}"
    key     = "00-network"
    region  = "ap-southeast-2"
    encrypt = "true"
    dynamodb_table = "my-app-${env}"
  }
}

Everything seems OK. However, when you run terraform init in the component, the following error tells the brutal truth.

Initializing the backend...
╷
│ Error: Variables not allowed
│
│   on terraform.tf line 10, in terraform:
│   10:     bucket         = "my-app-${env}"
│
│ Variables may not be used here.
╵

It turns out there is an open issue about supporting variables in the Terraform backend config block.

passing backend config via the CLI

terraform init supports partial configuration, which allows passing dynamic or sensitive configuration at init time. This seems like a perfect solution for dynamically passing bucket names based on the environment name.

We can create a wrapper script, go, for terraform init/plan/apply, which creates the backend config dynamically based on the environment and passes it as additional CLI arguments.

Then we can structure our project as follows.

├ components
│  ├ 01-networks
│  │    ├ terraform.tf  # backend and providers config
│  │    ├ main.tf
│  │    ├ variables.tf
│  │    ├ outputs.tf
│  │    └ go
│  │
│  ├ 02-computing
│  │     ├ terraform.tf
│  │     ├ main.tf
│  │     ├ variables.tf
│  │     ├ outputs.tf
│  │     └ go
│  ├ 03-my-service
│  │     ├ ....
│
├ envs
│  ├ dev.tfvars
│  ├ test.tfvars
│  ├ staging.tfvars
│  └ prod.tfvars

Let’s take a closer look at the go script.

#!/usr/bin/env bash
# go - a thin wrapper around terraform init/plan/apply with environment-specific backend config
set -e

_ACTION=$1
_ENV_NAME=$2

# PROJECT_ROOT should point at the repository root (where the envs/ folder lives);
# the default assumes this script sits at components/<component>/go.
PROJECT_ROOT=${PROJECT_ROOT:-$(cd "$(dirname "$0")/../.." && pwd)}

function init() {
  bucket="my-app-${_ENV_NAME}"
  key="01-networks/terraform.tfstate"
  dynamodb_table="my-app-${_ENV_NAME}"

  echo "+----------------------------------------------------------------------------------+"
  printf "| %-80s |\n" "Initialising Terraform with backend configuration:"
  printf "| %-80s |\n" "    Bucket:         $bucket"
  printf "| %-80s |\n" "    Key:            $key"
  printf "| %-80s |\n" "    Dynamodb_table: $dynamodb_table"
  echo "+----------------------------------------------------------------------------------+"

  terraform init \
    -backend=true \
    --backend-config "bucket=$bucket" \
    --backend-config "key=$key" \
    --backend-config "region=ap-southeast-2" \
    --backend-config "dynamodb_table=$dynamodb_table" \
    --backend-config "encrypt=true"
}

function plan() {
  init
  terraform plan -out plan.out --var-file="$PROJECT_ROOT/envs/$_ENV_NAME.tfvars" # use the env-specific var file
}

function apply() {
  init
  terraform apply plan.out
}

# Dispatch to the requested action, e.g. ./go plan dev, ./go apply prod
case "$_ACTION" in
  init|plan|apply) "$_ACTION" ;;
  *) echo "Usage: ./go <init|plan|apply> <env>"; exit 1 ;;
esac

Then, we can run ./go plan <env> and ./go apply <env> to provision the components for each environment with a separate backend config.