Today I was implementing a Logging Correlation ID feature. Ideally, I wanted to add the following behaviour to every processor in the Topology, in the spirit of AOP, without touching any business logic code:

  • Extract the correlation ID from the record header
  • Put the correlation ID into the Mapped Diagnostic Context (which is backed by a ThreadLocal)
  • Add the correlation ID to the logger pattern
  • Clear the Mapped Diagnostic Context

This way, business code gains the ability to log the correlation ID without any changes. However, since the Kafka Streams DSL does not expose the ConsumerRecord to us, manipulating headers is not particularly convenient. See Kafka Record Headers for the background and use cases.

Zipkin's tracing implementation for Kafka Streams is very similar to the SafeKafkaStream I built in a previous project: a wrapper implements the KafkaStreams interface, delegates each operation to the wrapped delegatee, and adds extra behaviour around it.

The approach I ended up taking is as follows:

  • Use the id in the original message payload as the correlation id, extracted by a CorrelationIDExtractor that knows how to pull a correlation id out of each type of record.
  • Wrap each operator's argument (mostly a function) with withContext, so that the setup and cleanup happen inside the decorated function.

This compromise still has the following advantages:

  • Correlation ID extraction is centralised in one place, CorrelationIDExtractor, so if Kafka Streams later ships better header support, it will be easy to switch to a new approach.
  • withContext keeps the intrusion into business code to a minimum.
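The decorator described above can be sketched as follows. This is a minimal illustration rather than the actual project code: CorrelationIDExtractor is reduced to a single-method interface, and a plain ThreadLocal stands in for SLF4J's MDC (which is itself backed by a ThreadLocal).

```java
import java.util.function.Function;

public class WithContextSketch {
    // Stand-in for SLF4J's MDC; the real MDC is also backed by a ThreadLocal.
    static final ThreadLocal<String> MDC = new ThreadLocal<>();

    // Hypothetical single-method version of the CorrelationIDExtractor from the post.
    interface CorrelationIDExtractor<V> {
        String extract(V value);
    }

    // withContext: decorate an operator's function so that the correlation id
    // is set up before the call and cleaned up afterwards, even on failure.
    static <V, R> Function<V, R> withContext(CorrelationIDExtractor<V> extractor,
                                             Function<V, R> mapper) {
        return value -> {
            MDC.set(extractor.extract(value));
            try {
                return mapper.apply(value);
            } finally {
                MDC.remove();   // cleanup, so the thread can be reused safely
            }
        };
    }

    public static void main(String[] args) {
        // Assume the id is the part of the payload before ':'.
        Function<String, String> decorated =
            withContext(v -> v.split(":")[0],
                        v -> "correlationId=" + MDC.get() + " payload=" + v);

        System.out.println(decorated.apply("abc123:hello"));
        System.out.println(MDC.get()); // null: the context was cleaned up
    }
}
```

In the real topology, the decorated function would be the argument passed to a DSL operator such as mapValues, and the body would call MDC.put/MDC.remove on the logging context instead of a local ThreadLocal.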

Issue

I’ve seen many times that developers struggle to diagnose a failing build. “Why did the test/deployment pass on my machine but fail in the build pipeline?” might be the question developers ask most frequently.
Back in the old days, developers could SSH into the build agent, go straight to the build’s working directory, and start diagnosing.
Now, with pipeline-as-a-service offerings such as CircleCI and Buildkite, developers have far less access to build servers and agents than ever before. They can no longer perform this kind of diagnosis, not to mention that the old approach had plenty of drawbacks of its own.

What are the alternatives? One common approach I have seen is making small tweaks and pushing changes furiously, hoping that one of the many fixes will work or reveal the root cause. This is both inefficient and frustrating.

Solution

I tend to follow one principle when setting up the pipeline in a project:

A unified CLI interface for the dev machine and the CI agent.
Developers should be able to run a build task on their dev machine as well as on the CI agent, provided they are granted the correct permissions.

Examples

For example, a deployment script should have the following command-line interface:

./go deploy <env> [--local]

When this script is executed on a build agent, it will try to use the build agent’s role to perform the deployment.

When it is executed from a developer’s machine, the developer needs to provide their user id and is prompted for a password (potentially a one-time password) to acquire permission for the deployment.

Benefits

There are many benefits to following this principle:

  • Improved developer experience.
    • The feedback loop is much faster compared to testing every change on the pipeline.
    • Enabling developers to execute tasks on their local machines helps them trial new ideas and troubleshoot.
  • Knowledge is persisted.
    • Developers are smart; they will always find tricks to improve their troubleshooting efficiency, such as temporarily commenting out or adding a few lines in the script. This knowledge tends to get lost if it is not persisted as a pattern.
      The principle encourages developers to persist this knowledge in the build script, which benefits everyone who works on the project.

Local Optimization and Its Impact:

Local optimization refers to optimizing specific parts of the process or codebase without considering the impact on the entire system or project. It’s essential to remember that software development is a collaborative effort, and every team member’s work contributes to the overall project’s success. While local optimizations might improve a specific area, they can hinder the project’s progress if they don’t align with the project’s goals or create dependencies that slow down overall development. For instance, optimizing a single service without considering its alignment with the entire system can cause unintended bottlenecks.

The impacts of local optimization include increased complexity, delayed delivery due to unforeseen consequences, and limited flexibility. Local optimizations can lead to complex and hard-to-maintain code, ultimately slowing down future development. Moreover, an obsession with optimizing a single part can cause delays due to unexpected consequences. Additionally, code that’s over-optimized for specific scenarios might be less adaptable to changing requirements, limiting its usefulness in the long run.

To address these issues, it’s crucial to measure the impact of optimizations on the entire system’s performance, rather than just focusing on isolated metrics. Prioritizing high-impact areas for optimization is another key strategy. By doing so, we ensure that our efforts align with the project’s overall success and deliver the most value to stakeholders.

Scoping, Prioritization, and Re-Prioritization:

Clearly defined scopes are essential for effective prioritization. Establishing frequent and fast feedback loops ensures that we can adjust our priorities as we receive new information. When dealing with technical debt, it’s wise to focus on high-impact areas and set clear boundaries for our goals. Breaking down larger goals into smaller milestones allows us to track progress and maintain a sense of accomplishment. Frequent re-prioritization based on newly learned context is a proactive approach. By doing so, we adapt quickly to changes and align our efforts with the evolving needs. It’s not just acceptable; it’s vital for our success. This practice ensures that our work remains aligned with our goals, continuously delivers value, and effectively responds to our dynamic environment.

Considering a Rewrite:

When technical debt reaches a high level, considering a rewrite might be a more efficient solution than extensive refactoring. A rewrite can result in a cleaner, more maintainable codebase, improved performance, and enhanced functionality. However, undertaking a rewrite requires a thorough exploration of effort, risks, and mitigation plans. A well-executed rewrite leverages the lessons learned from past mistakes, incorporates new technologies, and follows modern design patterns.

Prioritizing Simplicity over Flexibility:

Simplicity is a cornerstone of maintainability and readability within our codebase. Clear, straightforward code that follows consistent patterns is easier to understand and maintain. While flexibility might seem appealing for accommodating potential changes, it can introduce unnecessary complexity. Complex code paths and intricate component interactions hinder our ability to make changes efficiently. Prioritizing simplicity sets the foundation for a codebase that remains valuable over time. It ensures that we strike the right balance between adaptability and maintainability while avoiding unnecessary complications.

Terraform Tips: Multiple Environments

In the last post, we explored the idea of layered infrastructure and the problems it tries to solve.

One of the benefits of using Terraform to provision multiple environments is consistency. We can extract environment-specific configurations, such as CIDR ranges and instance sizes, into Terraform module variables, and create a separate variable file for each environment.

In this post, we will talk about different options for provisioning multiple environments with Terraform.

In a real-life infrastructure project, a remote state store and state locking are widely adopted for ease of collaboration.

One backend, many terraform workspaces

I have seen some teams using Terraform workspaces to manage multiple environments. Many backends support workspaces.

Let’s have a look at an example using S3 as the backend.

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.35.0"
    }
  }

  backend "s3" {
    bucket         = "my-app"
    key            = "00-network"
    region         = "ap-southeast-2"
    encrypt        = true
    dynamodb_table = "my-app"
  }
}

It is pretty straightforward to create multiple workspaces and switch into each workspace.

terraform workspace new dev
terraform workspace new test
terraform workspace new staging
terraform workspace new prod

terraform workspace select <workspace-name>

Each workspace’s states will be stored under a separate subfolder in the S3 bucket.

e.g.

s3://my-app/dev/00-network
s3://my-app/test/00-network
s3://my-app/staging/00-network

However, the downside is that the states of both non-prod and prod environments are stored in the same bucket. This makes it challenging to impose different levels of access control for prod and non-prod environments.

If you stay in this industry long enough, you must have heard stories of serious consequences of “putting all eggs in one basket.”

One backend per environment

If using one backend for all environments is risky, how about configuring one backend per environment?

parameterise backend config

One way to configure an individual backend for each environment is to parameterize the backend config block. Let’s have a look at the backend configuration of the following project structure:

├ components
│  ├ 01-networks
│  │    ├ terraform.tf  # backend and providers config
│  │    ├ main.tf
│  │    ├ variables.tf
│  │    └ outputs.tf
│  │
│  ├ 02-computing

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.35.0"
    }
  }

  backend "s3" {
    bucket         = "my-app-${env}"
    key            = "00-network"
    region         = "ap-southeast-2"
    encrypt        = true
    dynamodb_table = "my-app-${env}"
  }
}

Everything seems OK. However, when you run terraform init in the component, the following error tells the brutal truth.

Initializing the backend...
╷
│ Error: Variables not allowed
│
│   on terraform.tf line 10, in terraform:
│   10:     bucket         = "my-app-${env}"
│
│ Variables may not be used here.
╵

It turns out there is an open issue about supporting variables in terraform backend config block.

passing backend config via CLI

terraform init supports partial configuration, which allows passing dynamic or sensitive configurations. This seems like a perfect solution for dynamically passing bucket names based on the environment name.

We can create a wrapper script, go, for terraform init/plan/apply, which creates the backend config dynamically based on the environment and passes it as additional CLI arguments.

Then we can structure our project as follows.

├ components
│  ├ 01-networks
│  │    ├ terraform.tf  # backend and providers config
│  │    ├ main.tf
│  │    ├ variables.tf
│  │    ├ outputs.tf
│  │    └ go
│  │
│  ├ 02-computing
│  │     ├ terraform.tf
│  │     ├ main.tf
│  │     ├ variables.tf
│  │     ├ outputs.tf
│  │     └ go
│  ├ 03-my-service
│  │     ├ ....
│
├ envs
│  ├ dev.tfvars
│  ├ test.tfvars
│  ├ staging.tfvars
│  └ prod.tfvars

Let’s take a closer look at the go script.

#!/usr/bin/env bash
# go
set -euo pipefail

_ACTION=$1
_ENV_NAME=$2

function init() {
  bucket="my-app-${_ENV_NAME}"
  key="01-networks/terraform.tfstate"
  dynamodb_table="my-app-${_ENV_NAME}"

  echo "+----------------------------------------------------------------------------------+"
  printf "| %-80s |\n" "Initialising Terraform with backend configuration:"
  printf "| %-80s |\n" "    Bucket:         $bucket"
  printf "| %-80s |\n" "    Key:            $key"
  printf "| %-80s |\n" "    DynamoDB table: $dynamodb_table"
  echo "+----------------------------------------------------------------------------------+"

  terraform init \
    -backend=true \
    --backend-config "bucket=$bucket" \
    --backend-config "key=$key" \
    --backend-config "region=ap-southeast-2" \
    --backend-config "dynamodb_table=$dynamodb_table" \
    --backend-config "encrypt=true"
}

function plan() {
  init
  # use the env-specific var file
  terraform plan -out plan.out --var-file="$PROJECT_ROOT/envs/$_ENV_NAME.tfvars"
}

function apply() {
  init
  terraform apply plan.out
}

"${_ACTION}"  # dispatch to init, plan or apply

Then, we can run ./go plan <env> and ./go apply <env> to provision components for each environment with a separate backend config.

Terraform Tips: Layered Infrastructure

Terraform has been a significant player in the infrastructure-as-code field. Since its first release in 2014, it has been widely used in the industry. Terraform finally reached 1.0 on 8 June 2021.

It is dead simple to provision and manage resources via Terraform’s human-readable, declarative configuration language. However, you might only see the challenges when using it in anger in real-life projects. In this post, we’ll talk about the idea behind layered infrastructure, the problems it tries to solve, and how to adopt it in your project.

Technically, we can provision a whole environment, including networks, subnets, security groups, data stores, and EC2 instances, in one single Terraform file. See the example below.

└── infra
    ├── prod
    │   └── main.tf
    ├── qa
    │   └── main.tf
    ├── dev
    │   └── main.tf
    └── stage
        └── main.tf

However, this would lead to a slow deployment process. To apply any resource change, Terraform has to query and compare the state of every resource defined in main.tf.

We know that the frequency of changes to different types of resources varies drastically; for example, the number of EC2 instances changes far more often than a VPC CIDR. It would be a massive waste for Terraform to compare hundreds of nearly unchanged resources just to bump the instance count of an AutoScalingGroup.

And if we use a remote state store with locking, we can only apply infrastructure changes to the environment one at a time.

There’s room for improvement. In a standard application deployment, we can classify these resources into layers such as application, compute and networks; the higher layer can depend on resources in lower layers.

Resources such as docker containers, data store, SNS topics, SQS queue, Lambda function are usually owned by an application. Resources such as EC2 instances, ECS or EKS clusters, providing computing capabilities, are usually shared across different applications.

Resources such as VPC, subnets, internet gateway, Network Address Translation (NAT) Gateway, network peering are essential to provision resources mentioned above. With these layered infrastructures, we can provision resources in different layers independently.
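To make the dependency between layers concrete, a higher-layer component can read a lower layer’s outputs through a terraform_remote_state data source. This is a sketch; the bucket, key, variable and output names below are illustrative:

```hcl
# In 02-computing: read the outputs exported by 01-networks.
data "terraform_remote_state" "networks" {
  backend = "s3"

  config = {
    bucket = "my-app"        # illustrative backend bucket
    key    = "01-networks"   # state key of the lower layer
    region = "ap-southeast-2"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id          # assumed variable
  instance_type = var.instance_size   # assumed variable
  # assumes 01-networks declares a `subnet_ids` output
  subnet_id     = data.terraform_remote_state.networks.outputs.subnet_ids[0]
}
```

As long as the lower layer’s outputs don’t change, the higher layer can be planned and applied on its own.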

This is the idea of “layered infrastructure”. Here is the layout of a project adopting it.

├ components       # components for an environment
│  ├ 00-iam           # bootstrap roles which will be used in higher layers
│  ├ 01-networks
│  ├ 02-computing
│  ├ 03-application
├ modules          # in-house terraform modules

As you can see from the layout, prepending a number to each component’s name makes it easy to understand their dependencies.

Now let’s have a closer look at this layout. Layered infrastructure has three key concepts: module, component and environment.

Module

A Terraform module is a set of Terraform configuration files in a single directory intended to organise, encapsulate, and reuse configuration files, providing consistency and ensuring best practices. A terraform module usually has the following structure:

.
├── LICENSE
├── README.md
├── main.tf
├── variables.tf
└── outputs.tf    

For example, terraform-aws-vpc is a community module that can be used to provision VPC with subnets.

You can also maintain in-house Terraform modules to share code within your organisation.

Component

An environment component groups multiple closely related modules or resources together. It can be provisioned independently within an environment. A component might depend on other components; cyclic dependencies must be avoided. A component usually has the following structure:

.
├── terraform.tf // backend configuration
├── provider.tf
├── main.tf
├── variables.tf
├── outputs.tf
└── go           // entry point for `terraform plan`, `terraform apply` and `terraform destroy`

Example of a network component.

Environment

In the context of infrastructure as code, an environment is an isolated deployed instance of one or more components configured for a specific purpose, e.g. “dev”, “test”, “staging”, “production”.

All environments should have the same layout, with knobs that can be adjusted for each environment. The only differences between environments should be captured in environment-specific tfvars files. Let’s revisit the example project layout for an environment.

├ components          # components for an environment
│  ├ 00-iam           # bootstrap roles which will be used in higher layers
│  ├ 01-networks      # manage VPC, subnets, common security groups. output vpc/subnets/security group id.
│  ├ 02-computing     # manage computing resources into the vpc/subnets.
│  ├ 03-application   # manage application-specific resources 
├ modules             # in-house terraform modules
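To make the environment differences concrete, an environment-specific var file only captures the values that differ between environments. The variable names here are illustrative:

```hcl
# envs/dev.tfvars
vpc_cidr       = "10.0.0.0/16"
instance_size  = "t3.small"
instance_count = 1
```

prod.tfvars would declare exactly the same variables, just with production-grade values (a different CIDR, larger instances, a higher count).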


There are many benefits to adopting this approach, such as:

  • Independent provisioning of each component (as long as the component’s outputs don’t change).
  • Faster deployments, thanks to less state comparison.

Conclusion

We explored the problems layered infrastructure tries to solve, the benefits of this approach, and how to adopt it in your project.
This idea was inspired by Terraform Best Practices.