Recently, a colleague told me that one of their applications intermittently throws "connection reset" exceptions. The application sends HTTP requests to multiple service providers, but the error only occurs with one specific provider.
- 2023/09/08: Added `route_randomly` configuration for Redis Cluster.
This post will focus on the day-2 operations of Grafana Loki, providing general suggestions and some caveats when deploying on Kubernetes with AWS.
I highly recommend reading through the Operations section of Grafana’s documentation before diving into this article, as this article is meant to complement it.
Additionally, this post will be updated whenever I come across noteworthy information.
Our logging solution for Kubernetes clusters has long been CloudWatch Logs, and we were OK with it. For applications with special requirements, we leveraged S3 for long-term, low-cost storage and queried the logs with Athena.
Logs are essential for almost all programs. They provide valuable insights into application behavior, offer troubleshooting clues, and can even be transformed into metrics if needed.
Collecting logs from containers on a Kubernetes worker node is not much different from doing so on a regular VM. This post explains how it’s done.
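As a concrete sketch: on most clusters, the container runtime writes each container’s stdout/stderr to files under `/var/log/containers/` (symlinks into `/var/log/pods/`), so a node-level shipper simply tails those files. A minimal Fluent Bit input for this might look like the following (the tag name is an illustrative choice, not from the original post):

```
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  cri
    Tag     kube.*
```

The `cri` parser assumes a containerd/CRI-O log format; clusters still running the legacy Docker runtime would use the `docker` (JSON) parser instead.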
In terms of bandwidth, there are actually two metrics, NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded, available for ElastiCache just as for EC2. They are better indicators of whether a node has already reached its bandwidth limit.
If these values are not yet available or are not increasing, the node probably either hasn’t exceeded its burst bandwidth or hasn’t exhausted its burst duration.
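To check this, you can pull the metric from CloudWatch. Below is a minimal sketch using boto3’s `get_metric_statistics`; the cache cluster ID and the one-hour window are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(cache_cluster_id: str) -> dict:
    """Build a CloudWatch GetMetricStatistics request for the
    NetworkBandwidthOutAllowanceExceeded metric of one ElastiCache node."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElastiCache",
        "MetricName": "NetworkBandwidthOutAllowanceExceeded",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cache_cluster_id}],
        "StartTime": now - timedelta(hours=1),  # look back one hour
        "EndTime": now,
        "Period": 60,            # one datapoint per minute
        "Statistics": ["Sum"],   # nonzero sums mean packets were dropped/queued
    }


# Usage (requires AWS credentials; "my-redis-001" is a hypothetical node ID):
# import boto3
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_statistics(**build_metric_query("my-redis-001"))
```

A consistently nonzero `Sum` here means the node is being throttled at the network level even when CPU and memory look healthy.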
Recently, we saw unexpectedly high traffic during a special event (well, the traffic itself was expected; we just didn’t expect this much), and then the service went down for a few minutes. The application didn’t show high CPU utilization or memory usage, but API latency kept climbing. We checked the upstream services, and they all looked fine.
The service team checked the application’s logs and noticed there were many errors related to Redis.
We then checked Redis’s metrics: CPU was low, memory usage was high, and swap was slowly increasing. That didn’t look good, but it shouldn’t cause connection problems. Redis latency was slightly unstable; however, it was only a few microseconds higher.
In part 1, we briefly went through the concept of admission webhooks. In this post, we are going to build one and deploy it to a cluster.
Recently, we got an internal requirement to route logs to different destinations according to their content. Our cluster-level log shipper’s config is already filled with settings that route logs by namespace, and we wanted this new feature to be “pluggable” and dynamically configurable by each team, so leveraging admission webhooks appeared to be the more reasonable choice.