Someone's life journey.

Getting Started with Grafana Loki, Part 2: Up and Running

This post will focus on the day-2 operations of Grafana Loki, providing general suggestions and some caveats when deploying on Kubernetes with AWS.

I highly recommend reading through the Operations section in Grafana’s documentation before diving into this article, as it serves as a complementary resource.

Additionally, this post will be updated whenever I come across noteworthy information.

Getting Started with Grafana Loki, Part 1: The Concepts

Our logging solution for our Kubernetes clusters has been CloudWatch Logs for a long time, and we were OK with it. For applications with special requirements, we leveraged S3 for long-term, low-cost storage and queried the logs with Athena.

Monitor your bandwidth

2022/03/29 update

In terms of bandwidth, ElastiCache actually exposes two metrics, NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded, similar to the ones available for EC2. They are better indicators of whether a node has already reached its bandwidth limit.

If these values are not yet available or are not increasing, the node has probably not exceeded its burst bandwidth or its burst time.
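
As a rough illustration of how these two metrics could be checked programmatically, here is a minimal sketch using boto3's CloudWatch client. The helper names (`build_query`, `exceeded`, `check_node`), the one-hour window, the `CacheClusterId` dimension, and the rule "any non-zero sum means the limit was hit" are assumptions made for this example, not something taken from our actual setup.

```python
from datetime import datetime, timedelta, timezone

# The two ElastiCache metrics mentioned above.
METRICS = (
    "NetworkBandwidthInAllowanceExceeded",
    "NetworkBandwidthOutAllowanceExceeded",
)


def build_query(cluster_id, metric, start, end, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics.

    Assumption: the node is identified by the CacheClusterId dimension.
    """
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period,
        "Statistics": ["Sum"],
    }


def exceeded(datapoints):
    """Return True if any datapoint reports a non-zero sum.

    An empty list (metric not yet available) counts as "not exceeded".
    """
    return any(dp.get("Sum", 0) > 0 for dp in datapoints)


def check_node(cluster_id, hours=1):
    """Query both metrics for one node; needs boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without it

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    result = {}
    for metric in METRICS:
        response = cloudwatch.get_metric_statistics(
            **build_query(cluster_id, metric, start, end)
        )
        result[metric] = exceeded(response["Datapoints"])
    return result
```

Calling `check_node("my-redis-001")` (a hypothetical cluster ID) would return a dictionary mapping each metric name to a boolean for the last hour.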

The Incident

Recently, we saw unexpectedly high traffic during a special event (well, the traffic itself was expected, we just didn’t expect this much), and then the service went down for a few minutes. The application didn’t show high CPU utilization or memory usage, but API latency was climbing. We checked the upstream services, and they all looked good.

The service team checked the application’s logs and noticed there were many errors related to Redis.

Then we checked the Redis metrics: CPU was low, memory usage was high, and swap was slowly increasing. That didn’t look good, but it shouldn’t cause connection problems. Redis latency was slightly unstable; however, it was only a few microseconds higher.

What gives?

The Making of Admission Webhooks, Part 1: The Concept

Recently, we got an internal requirement to send logs to different destinations according to their content. Our cluster-level shipper’s config is already filled with settings that send logs based on the namespace by default, and we would like this new feature to be “pluggable” and dynamically configurable by teams, so leveraging admission webhooks appears to be the more reasonable choice.