Someone's life journey.

Getting Started with Grafana Loki, Part 1: The Concepts


Our logging solution for Kubernetes cluster has been CloudWatch Logs for a long time, and we were ok with it. For applications with spcial requirements, we leveraged S3 for long-term, low-cost storage, then query with Athena.

However, as more and more services being containerized and moved into Kubernetes cluster, issues start to emerge:

  • The time from ingestion to ready for search is suboptimal. Take CloudWatch Insights as an example, it takes roughly 2 minutes to find the logs from my experience.

  • Searching logs at different places is inconvenient and slow for service team members, let alone comparing them.

  • CloudWatch Logs cost is going to increase significantly. We can put it into S3, but there will be no metric filters.

  • Colleagues from very different tech stacks need a lot of time to learn different things.

Monitor your bandwidth

2022/03/29 update

In terms of bandwidth, there are actually two metrics called NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded for ElastiCache like EC2. It’s a better metric to determine whether a node has already reached bandwidth limit.

If these values are not yet available or not increasing, it means the node probably either hasn’t exceeded burst bandwidth or burst time.

The Incident

Recently, we saw unexpectedly high traffic during a special event (well, the traffic itself was expected, just didn’t expect this much), and then service went down for a few minutes. The application didn’t show high CPU utilization or memory usage, but API latency was climbing. Checked upstream services, looked all good.

The service team checked the application’s logs and noticed there were many errors related to Redis.

Then checked the Redis’ metrics, CPU is low, memory usage is high, swap is slowly increasing, that looked not good, but shouldn’t cause connection problems. Redis latency is slightly unstable; however, it’s only a few microseconds higher.

What gives?

The Making of Admission Webhooks, Part 2: The Implementation

In part 1, we briefly went through the concept of admission webhooks. In this post, we are going to build one and deploy it to a cluster.

Let’s keep it simple: this webhook adds a throwaway Redis sidecar container when the pod has the following annotations (why use annotations?):

  • true
  • <user specified prot> (optional)
  • <user specified memory> (optional)

You can find the complete resources in this repo:

The Making of Admission Webhooks, Part 1: The Concept

Recently, we got an internal requirement to send logs to different destinations according to the content. Since our cluster level shipper’s config is already filled with settings to send logs based on the namespace by default, and we would like to make this new feature “pluggable”, leveraging admission webhooks seem to be a more reasonable choice. 1

Automatically Recover EC2 Instances That Failing Status Checks With Cloudwatch Events and Lambda

The Incident

Recently, some of our EKS worker nodes suddenly became unresponsive. When I was checking on the EC2 console, the status check showed “Insufficient Data”.

According to past experience, when underlying hardware somehow got impaired, we will get notifications. However, without much useful information this time, I only did some quick investigation and then had to manually terminate these instances.