Troubleshooting the Connection Reset Incident
The Incident
Recently, a colleague told me that one of their applications intermittently throws exceptions saying the connection was reset. The application sends HTTP requests to multiple service providers, but the error only happens with one specific provider.
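To illustrate how this kind of error typically surfaces, here is a minimal Python sketch (the provider names and URLs are hypothetical, not the actual services involved):

```python
import requests

# Hypothetical provider endpoints; only one of them resets connections.
PROVIDERS = {
    "provider-a": "https://api.provider-a.example/health",
    "provider-b": "https://api.provider-b.example/health",
}

for name, url in PROVIDERS.items():
    try:
        resp = requests.get(url, timeout=5)
        print(f"{name}: HTTP {resp.status_code}")
    except requests.exceptions.ConnectionError as exc:
        # A TCP RST from the peer surfaces here, typically wrapping
        # ConnectionResetError(104, "Connection reset by peer").
        print(f"{name}: connection error: {exc}")
```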
Getting Started with Grafana Loki, Part 2: Up and Running
- 2023/09/08: Added route_randomly configuration for Redis cluster.
This post will focus on the day-2 operations of Grafana Loki, providing general suggestions and some caveats when deploying on Kubernetes with AWS.
I highly recommend reading through the Operations section in Grafana’s documentation before diving into this article, as it serves as a complementary resource.
Additionally, this post will be updated whenever I come across noteworthy information.
Automatically Clean Up Dangling Jobs with Policy Engine
Preface
Last year, I was reading about the PSP deprecation and started wondering what the alternatives might be. Fortunately, there are already several policy engines available, such as OPA Gatekeeper and Kyverno.
Getting Started with Grafana Loki, Part 1: The Concepts
- 2023/07/09: Added the query-scheduler, table manager, and compactor.
Preface
Our logging solution for Kubernetes clusters has been CloudWatch Logs for a long time, and we were OK with it. For applications with special requirements, we leveraged S3 for long-term, low-cost storage, then queried the data with Athena.
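For reference, querying archived logs in S3 through Athena can be kicked off with a few lines of boto3; the database, table, and bucket names below are made up for illustration:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical Athena database/table backed by logs archived in S3.
response = athena.start_query_execution(
    QueryString=(
        "SELECT timestamp, message FROM app_logs "
        "WHERE level = 'ERROR' AND day = '2021-01-01' LIMIT 100"
    ),
    QueryExecutionContext={"Database": "logs_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll this ID for results
```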
How It Works: Cluster Log Shipper as a DaemonSet
Logs are essential for almost all programs. They provide valuable insights into application behavior, offer troubleshooting clues, and can even be transformed into metrics if needed.
Collecting logs for containers on a Kubernetes worker node is not much different from doing so on a regular VM. This post explains how it's done.
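As a rough illustration of the mechanism (not how any particular shipper is implemented), a DaemonSet pod can mount the node's log directory and tail the per-container log files that the kubelet and container runtime maintain:

```python
import glob
import time

# On a typical node, the container runtime writes one log file per
# container, and the kubelet symlinks them into /var/log/containers
# as <pod>_<namespace>_<container>-<id>.log.
LOG_GLOB = "/var/log/containers/*.log"

def tail_files():
    offsets = {}  # how far we've read in each file
    while True:
        for path in glob.glob(LOG_GLOB):
            with open(path, "r", errors="replace") as f:
                f.seek(offsets.get(path, 0))
                for line in f:
                    # A real shipper would parse and forward this line.
                    print(f"{path}: {line.rstrip()}")
                offsets[path] = f.tell()
        time.sleep(1)

if __name__ == "__main__":
    tail_files()
```

A real shipper also has to handle log rotation and checkpoint its offsets across restarts, which is omitted here.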
Monitor your bandwidth
In terms of bandwidth, there are two metrics, NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded, available for ElastiCache just as for EC2. They are better metrics for determining whether a node has actually hit its bandwidth limit.
If these values are not yet available or are not increasing, the node probably either hasn't exceeded its burst bandwidth or hasn't exhausted its burst time yet.
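A quick way to check these counters is to pull them from CloudWatch, for example with boto3 (the cluster ID below is hypothetical):

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Sum the allowance-exceeded counters over the last hour.
for metric in ("NetworkBandwidthInAllowanceExceeded",
               "NetworkBandwidthOutAllowanceExceeded"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric,
        Dimensions=[{"Name": "CacheClusterId", "Value": "my-redis-001"}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print(f"{metric}: {total}")
```

A non-zero, growing sum means packets were queued or dropped because the node exceeded its bandwidth allowance.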
The Incident
Recently, we saw unexpectedly high traffic during a special event (well, the traffic itself was expected, we just didn't expect this much), and then the service went down for a few minutes. The application didn't show high CPU utilization or memory usage, but API latency kept climbing. We checked the upstream services, and they all looked fine.
The service team checked the application’s logs and noticed there were many errors related to Redis.
Then we checked Redis's metrics: CPU was low, memory usage was high, and swap was slowly increasing. That didn't look good, but it shouldn't cause connection problems. Redis latency was slightly unstable; however, it was only a few microseconds higher than usual.
What gives?
The Making of Admission Webhooks, Part 2: The Implementation
In part 1, we briefly went through the concept of admission webhooks. In this post, we are going to build one and deploy it to a cluster.
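To set expectations before the build: at its core, an admission webhook is just an HTTPS endpoint that answers AdmissionReview requests from the API server. A bare-bones sketch in Python (TLS setup and real validation logic omitted; a production webhook must serve HTTPS) might look like this:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AdmissionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        review = json.loads(self.rfile.read(length))

        # Echo back the request UID and allow everything; a real
        # webhook would inspect review["request"]["object"] here.
        response = {
            "apiVersion": "admission.k8s.io/v1",
            "kind": "AdmissionReview",
            "response": {
                "uid": review["request"]["uid"],
                "allowed": True,
            },
        }
        body = json.dumps(response).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The API server requires TLS; wrap the socket with ssl in practice.
    HTTPServer(("0.0.0.0", 8443), AdmissionHandler).serve_forever()
```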
The Making of Admission Webhooks, Part 1: The Concept
Recently, we got an internal requirement to send logs to different destinations according to their content. Our cluster-level shipper's config is already filled with settings that route logs by namespace, and we would like this new feature to be "pluggable" and dynamically configurable by teams, so leveraging admission webhooks appears to be the more reasonable choice. 1
Query Stub Domains with CoreDNS and NodeLocal DNSCache
Recently, we started migrating our EKS 1.16 clusters to brand-new 1.19 ones.
Of all the changes, I am most excited about NodeLocal DNSCache. It not only greatly reduces DNS query latency (mostly because of these search domains), but also avoids issues like conntrack races, all at very little expense.
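To see why search domains matter so much here, the sketch below mimics how a glibc-style resolver expands a short name using the search list and ndots from /etc/resolv.conf; the example values reflect a typical pod's resolv.conf, and the exact expansion order is a simplification:

```python
def candidate_names(name, search_domains, ndots=5):
    """Mimic resolver expansion: names with fewer than `ndots`
    dots are tried against each search domain before being
    queried as-is."""
    if name.endswith(".") or name.count(".") >= ndots:
        return [name]
    return [f"{name}.{domain}" for domain in search_domains] + [name]

# Typical search list inside a pod (example values).
search = ["default.svc.cluster.local", "svc.cluster.local",
          "cluster.local", "ec2.internal"]

# A single lookup of an external name fans out into several upstream
# queries; NodeLocal DNSCache can answer the repeats locally.
for fqdn in candidate_names("api.example.com", search):
    print(fqdn)
```

Each of those candidate queries would otherwise travel through conntrack to the cluster DNS service, which is exactly the per-lookup cost a node-local cache cuts down.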