- 2023/12/29: Added additional learning resources.
- 2023/12/23: Added paragraph "Prevent OOM using `GOMEMLIMIT`".
- 2023/09/08: Added paragraph "Enable `route_randomly` for Redis cluster".
This post will focus on the day-2 operations of Grafana Loki, providing general suggestions and some caveats when deploying on Kubernetes with AWS.
I highly recommend reading through the Operations section in Grafana’s documentation before diving into this article, as this post serves as a complementary resource.
Additionally, this post will be updated whenever I come across noteworthy information.
Keep up with the…official resources#
I spent a lot of time, and I mean a lot, on its documentation to really understand how Grafana Loki works and the corresponding configurations. In my opinion, the official webinars and blog posts explain things much better than the documentation does.
Here are some resources I highly recommend reading besides documents:
- Best practices for configuring Grafana Loki (it would be better if there were an updated version covering TSDB).
- Open source log monitoring: The concise guide to Grafana Loki
- The concise guide to Grafana Loki: Everything you need to know about labels
- The concise guide to Loki: How to get the most out of your query performance
If you are using RSS, you can subscribe to the following links:
- Grafana Labs blog: https://grafana.com/blog/index.xml
- Grafana Loki releases (GitHub): https://github.com/grafana/loki/releases.atom
General operation#
Consider using the officially supported Helm Chart#
From the official blog post:
Going forward, this will be the only Helm chart we use to run any Helm-deployed Loki clusters here at Grafana Labs.
We of course encourage the community to continue to create and maintain charts that work for their use cases, but we plan to play a less active role in those other charts as we focus our energy on this new one.
There are just way too many charts for Grafana Loki. While it’s OK to use a community-maintained chart, it’s easier to follow the official one.
If you have been using the loki-distributed chart for a long time like I did, you can refer to Migrate from loki-distributed Helm Chart.
A few notes for the migration:
It only supports `single binary` mode and `simple-scalable deployment` (aka “SSD”) mode. If you are looking for the microservice deployment and want to juggle all the components we mentioned in part 1, you can’t use this chart. How do you know if you need the microservice deployment? If you need to ask about it, you don’t need it.
Ensure you carefully review the values file between these two charts.
I believe 99.999% of people reading this article will use the simple-scalable deployment. You might as well check out Migrate To Three Scalable Targets.
By default, this chart will not only deploy the `read` (query frontend & querier) and `write` (distributor & ingester) components, but also a `backend` component, which runs components like the ruler, compactor, table manager, and query-scheduler. Using the three-targets mode allows us to deploy the `read` component as a Deployment.
Enabling compactor#
To borrow from part 1: an ingester writes an index every 15 minutes, which results in 96 indices per day. Since it’s unlikely that you only have one ingester in a production environment, there can be hundreds of indices in a single day.
Compacting these duplicate indices not only saves costs but also lowers the overall query latency.
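Assuming the official `grafana/loki` chart and the `loki.*` notation used throughout this post, enabling the compactor looks roughly like this (a sketch; the working directory and interval are illustrative values, so check the keys against your chart version):
```yaml
loki:
  compactor:
    working_directory: /var/loki/compactor  # local scratch space for compaction
    shared_store: s3                        # must match your object storage
    compaction_interval: 10m                # how often compaction runs
```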
Learn about log deletion#
Day-2 problems are sometimes underestimated (and sometimes “sometimes” is an understatement). Therefore, it’s helpful to know how to delete logs in case you ever need to.
At the time of this writing (2023/07/17), the official document says log deletion only works with `boltdb-shipper`, but I’ve tested it with the TSDB index and it works as well.
The following configurations assume you already use the official `grafana/loki` chart described above.
If you use a multi-tenant setup, ensure you include the `X-Scope-OrgID` header on the requests below. I recommend you read the document about log entry deletion first.
To delete selected logs, you have to:
- Set `loki.compactor.retention_enabled` to `true`. Make sure you have checked the following `loki.limits_config.retention_period` before enabling this.
- Check your `loki.limits_config.retention_period`. I assume you only want to “delete” the selected logs and don’t want to tell Loki to periodically delete old logs. On Grafana Loki 2.8 or later, you can set this value to `0s`, which disables data retention. On previous versions, you need to use a “really large value” according to the document.
- Set `loki.limits_config.deletion_mode` to either `filter` or `filter-and-delete`. The decision is yours.
- The logs will actually be deleted after `loki.compactor.delete_request_cancel_period`. Set this value before you start deleting logs.
- Determine the logs you want to delete:
{ my_service="foo", container="bar" } |= "this should be deleted!!1"
If you are sure the result is all you want to delete, perform the request to the compactor:
curl -X POST 'http://<YOUR_LOKI_ENDPOINT>/loki/api/v1/delete?query={%20my_service%3D%22foo%22%2C%20container%3D%22bar%22%20}%7C%3D%20%22this%20should%20be%20deleted!!1%22&start=1685548800&end=1688140799'
Where:
- The `query` is, of course, the query. It must be URL-encoded.
- `start` and `end` are epoch timestamps in seconds (as in, 10 digits).
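For reference, the deletion-related settings from the steps above would land in values.yaml roughly like this (a sketch following the post’s `loki.*` paths; the durations are illustrative and should be tuned to your needs):
```yaml
loki:
  compactor:
    retention_enabled: true
    delete_request_cancel_period: 24h  # deletions are actually applied after this period
  limits_config:
    retention_period: 0s               # 2.8+: disable periodic retention, keep manual deletion
    deletion_mode: filter-and-delete   # or "filter"
```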
How do you know if logs are deleted? There are several ways to make sure:
- Query these logs in the given time window after `loki.compactor.delete_request_cancel_period` has passed.
- Check the logs of the backend pod running the compactor (only one of the `backend` pods will actually run it).
- Perform the following command to see whether the status is `processed` (the document doesn’t mention the possible statuses, though):
curl -X GET 'http://<YOUR_LOKI_ENDPOINT>/loki/api/v1/delete'
That’s pretty easy isn’t it? Yes, that’s the problem. It’s too easy.
And this leads us to the next point: Prevent unwanted access.
Prevent unwanted access using basic auth and network policy#
Grafana Loki itself doesn’t come with built-in authentication. It relies on the loki-gateway component, which is effectively an nginx in front of the Loki cluster.
To enable authentication on loki-gateway, follow the steps below:
Ensure your logging agent (e.g., fluentbit) has a basic auth config set up beforehand. It won’t cause problems even if you haven’t set the basic auth in the loki-gateway.
Make sure Grafana’s Grafana Loki data source has basic auth config set up beforehand. Same as the logging agent; it won’t cause problems at this point.
Generate an htpasswd file, for example, in a one-off container:
```shell
$ docker run --rm -it ubuntu bash   # spin up a one-off ubuntu container
root@1473bb48ba60:/# apt update; apt install -y apache2-utils   # Installing...
root@1473bb48ba60:/# htpasswd -c /tmp/.htpasswd user56   # choose the name you like, of course
New password:            # you will be asked to type the password here
Re-type new password:    # and type again
Adding password for user user56
```
Get the file content at `/tmp/.htpasswd` and create a Secret in the namespace where Grafana Loki is installed. Call it whatever you like, probably `loki-gateway-auth`, with a format like this:
```yaml
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: loki-gateway-auth
stringData:
  .htpasswd: <THE_CONTENT_YOU_JUST_COPIED_AND_NO_NEED_TO_BASE64_ENCODE_BECAUSE_IT_IS_STRING_DATA>
```
You can use a base64-encoded value in the `data` field if you like.
Update your values.yaml:
- Set `gateway.basicAuth.enable` to `true`.
- Set `gateway.basicAuth.existingSecret` to `loki-gateway-auth`.
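In values.yaml form, that’s roughly the following (a sketch; double-check the exact key names, e.g. whether your chart version spells it `enable` or `enabled`):
```yaml
gateway:
  basicAuth:
    enable: true                       # turn on basic auth on the gateway (nginx)
    existingSecret: loki-gateway-auth  # the Secret created in the previous step
```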
Profit.
…Or does it?
Now that we have loki-gateway with basic auth set up, isn’t that supposed to be safe?
Usually, a request follows a flow like the one below (depending on whether you use an ingress or just in-cluster services):
It looks safe since we’ve got this pretty cool lock 🔒! But what about direct access through the cluster’s service FQDN? What about direct Pod IP access?
Consider the following situation:
A chain is only as strong as its weakest link. We have to strengthen it in a way that basic auth alone can’t.
Fortunately, we have network policies.
First, label the namespaces we want to allow to send requests to Grafana Loki, and then apply the network policy like the following:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-loki-access
  namespace: loki
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: ingress-nginx
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: loki
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: grafana-is-here
```
This reads “for all pods (`podSelector: {}`) in the `loki` namespace (line 5), accept only traffic from pods in namespaces that have the label `some-label-on-namespace` with a value of either `ingress-nginx`, `loki`, or `grafana-is-here`”.
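For the labeling step itself, the commands look like this (the namespace names are examples; use your own, matching the labels in the policy above):
```shell
kubectl label namespace ingress-nginx some-label-on-namespace=ingress-nginx
kubectl label namespace loki some-label-on-namespace=loki
kubectl label namespace <YOUR_GRAFANA_NAMESPACE> some-label-on-namespace=grafana-is-here
```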
You can combine more restrictive rules such as IP block, ports, protocol, and/or pod selectors based on your need.
Use TSDB index#
The TSDB index was introduced in Grafana Loki 2.7 as an experimental feature and graduated to a stable feature in 2.8. Simply put, TSDB enables better query planning and more stable resource usage, thereby reducing the possibility of OOM on the queriers.
I encourage you to read Loki’s new TSDB Index by Owen Diehl to learn more about the detail of TSDB index.
The official TSDB documentation also provides a great example of how to start using TSDB by setting the `schema_config`. Review the settings carefully and do not switch the index type for past dates; only enable the TSDB index starting from a future date (the `from` field in each schema element is interpreted as UTC+0).
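A sketch along the lines of the official example (the dates, schema version, and index prefixes here are illustrative; the second entry’s `from` must be a future date in UTC+0 when you apply it):
```yaml
loki:
  schema_config:
    configs:
      - from: "2022-01-01"      # existing boltdb-shipper period stays untouched
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h
      - from: "2023-08-01"      # a future date: TSDB takes over from here
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: tsdb_index_
          period: 24h
```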
Set proper timeout values#
Query execution time depends on your actual query and the time window. It can take minutes if it’s a massive query.
Let’s take a look at a query request’s journey:
To enable longer timeout, you will need to:
Set timeout value in Grafana’s Grafana Loki data source.
If your Grafana connects to Grafana Loki via a load balancer (either load balancer in front of ingress controller, or service load balancer), make sure you have set the timeout value on the LB itself (e.g., the default timeout value of Amazon ALB is only 60 seconds).
If you expose Grafana Loki through an Ingress Controller, you also need to increase the timeout there.
For example, I am using ingress-nginx as my ingress controller. I have the following annotations set in the values.yaml:
```yaml
# (...omitted)
gateway:
  ingress:
    annotations:
      nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
The loki-gateway (nginx) has `proxy_read_timeout` set to 600 seconds.
There are two things you need to consider on the `read` component:
- `loki.server.http_server_read_timeout` should be set high enough, for example, `600s` (or `10m`).
- `loki.limits_config.query_timeout` should be set high enough, for example, `600s` (or `10m`).
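In values.yaml, those two settings would look roughly like this (a sketch following the post’s `loki.*` paths):
```yaml
loki:
  server:
    http_server_read_timeout: 600s  # keep the HTTP connection open long enough
  limits_config:
    query_timeout: 600s             # allow individual queries to run this long
```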
A general rule of thumb: The further left the component is in the architecture, the higher the timeout value it requires.
Enable `route_randomly` for Redis cluster#
Earlier this year, we were setting up monitoring for the Redis cluster used by Grafana Loki. We noticed Loki didn’t utilize the Redis cluster’s read replicas as expected, so all of the read pressure had been on the primary all along. And you guessed it, this caused bandwidth issues as well.
After investigation, it turned out the Redis client doesn’t use read replicas because of the client’s default setting. I submitted a small pull request to allow users to increase the Redis cluster’s utilization as much as possible.
Now that Grafana Loki 2.9.0 has been released, you can enable `route_randomly` to spread read requests across all replicas! This option defaults to `false` to prevent a breaking change.
Observe the slow queries#
We mentioned that certain kinds of queries can be very slow. Grafana Labs provides a few ways to optimize a query (basically: line filters first, parsers later), but not every query written by your users will be optimal.
You can check the value of `loki.frontend.log_queries_longer_than` and the query frontend’s logs (`|= "component=frontend"`) to determine which queries are slow and need to be improved. The metric `loki_logql_querystats_latency_seconds_bucket` can also give you an overview of current query latency.
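For example, logging queries that take longer than 10 seconds would look roughly like this in values.yaml (a sketch following the post’s `loki.*` notation; the threshold is arbitrary):
```yaml
loki:
  frontend:
    log_queries_longer_than: 10s  # queries slower than this are logged by the query frontend
```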
Capacity planning & cost#
Measure the read component capacity#
Even though the write path is the most critical part of the cluster, it’s not the most difficult to maintain. The read path is where the real challenges reside.
While Grafana Loki’s data retrieval and line filtering are fast, the parsing itself is not. We often encountered slow queries when users applied the regex parser or the JSON parser. The query execution time with a JSON parser can even be tens of times slower.
Our queriers have 8 CPU cores with `60Gi` of RAM. We saw some performance improvement and more stable resource usage after switching to TSDB. However, it can’t help with the slow parsing, especially when it comes to the JSON parser with metric queries.
We tried to fine-tune `loki.limits_config.split_queries_by_interval`, `loki.querier.max_concurrent`, `loki.limits_config.max_query_parallelism`, and `loki.limits_config.tsdb_max_query_parallelism` with different possible values, but to no avail.
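For reference, these knobs sit roughly here in values.yaml (a sketch with illustrative values only; as noted above, tuning them didn’t solve the parsing slowness for us):
```yaml
loki:
  limits_config:
    split_queries_by_interval: 30m   # how a long time range is split into sub-queries
    max_query_parallelism: 32        # sub-queries scheduled in parallel per query (non-TSDB)
    tsdb_max_query_parallelism: 128  # same, for the TSDB index
  querier:
    max_concurrent: 8                # sub-queries a single querier processes at once
```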
Therefore, you’d better have enough query capacity for when the cache can’t save you. To lower the cost of the queriers, you can use spot instances (if you are on AWS), as they are much cheaper. There is also a PodDisruptionBudget for each role, so the trade-offs of running queriers on spot instances are acceptable.
Prevent OOM using `GOMEMLIMIT`#
I’ve followed OOM (out of memory) issues of components (mostly querier) for a while and it seems pretty…common.
In issue #6501, there was a comment mentioning `GOMEMLIMIT`. I am no Go expert, but after learning about Go’s garbage collection for a while, this seems to be the best solution for OOM at the moment.
We are running queriers with massive memory capacity (typically the `60Gi` mentioned above) on spot instances. But even at this size, OOM issues still occurred.
After introducing `GOMEMLIMIT` (`50Gi` in this case, but it can probably be higher) to the queriers, it never happened again. The downside is that performance will definitely be impacted since GC is triggered more frequently, but that still beats containers crashing (and the failed queries that follow).
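A sketch of how this could be wired up in values.yaml, assuming your chart version exposes `extraEnv` for the read pods (note that Go expects the `GiB` suffix, unlike Kubernetes’ `Gi`):
```yaml
read:
  extraEnv:
    - name: GOMEMLIMIT
      value: 50GiB   # soft limit below the 60Gi container memory limit, so GC kicks in before the OOM killer
```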
By the way, if you are deleting lots of logs, you should keep an eye on the `backend` component and apply `GOMEMLIMIT` to it when necessary.
As for the write path, I haven’t noticed any unreasonable memory spikes, but when you see one, you know what to do.
Measure cache capacity#
At the time of this writing, there are a few types of cache you can leverage:
Index cache (no longer needed if you are using the TSDB index, but still helpful when querying old logs that are not on the TSDB index).
Query results cache.
Chunk cache.
We started with all three types of caches enabled, using Amazon ElastiCache for Redis to serve them. However, just like the post Monitor your bandwidth mentioned, we suffered from a lack of bandwidth, which led to massive numbers of read/write operations timing out because of the networking bottleneck.
If you are facing these problems as well, you have a few options:
Just retrieve chunks from S3 directly. Do enable S3 VPC endpoint to prevent AWS bill shock.
Use memcached in the Grafana Loki chart. You should still calculate the possible bandwidth usage beforehand.
If you are considering Amazon ElastiCache, there is also a data tiering option available, which uses a local disk to extend the storage and provides a relatively cost-effective way to get more cache capacity, but the minimum node type is `r6gd.xlarge`.
And if you keep using Redis, remember to enable the `route_randomly` option mentioned earlier:
```yaml
# ...(omitted)
cache_config:
  redis:
    # ...(omitted)
    route_randomly: true
```
Use S3 gateway VPC endpoint#
This one is a must if you are running on AWS.
According to the NAT Gateway product page, there are three types of charge:
NAT Gateway Hourly Charge
NAT Gateway Data Processing Charge
Data Transfer Charge
Even though data transfer is free when your EC2 instances and S3 are in the same region, the data processing fee can still be high. Say you write 200 GB of logs to S3 and read 500 GB daily in `ap-northeast-1`: the monthly fee will be `(200 + 500) * 30 * 0.062 = $1,302` just for your basic Grafana Loki read/write operations.
Another minor impact is that when a huge query shows up, taking chunks out of S3 can consume some of the bandwidth of the NAT Gateway.
The good thing is that all of the problems above can be easily solved by enabling S3 gateway VPC endpoint. Gateway VPC endpoint won’t make Grafana Loki faster, but it can make costs lower.
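Creating the endpoint is a one-liner with the AWS CLI (the VPC and route table IDs here are placeholders; Terraform or CloudFormation equivalents work just as well):
```shell
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Gateway \
  --vpc-id vpc-0123456789abcdef0 \
  --service-name com.amazonaws.ap-northeast-1.s3 \
  --route-table-ids rtb-0123456789abcdef0
```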
Use appropriate storage class#
At the time of this writing (2023/07/17), a PVC is mandatory for components like `write` and `backend`. If you are on AWS, you can use EFS or EBS.
Here I will use EBS as it’s cheaper (but it comes with an availability-zone restriction: once the volume is created, the pod is stuck in that AZ).
You will need to:
Set `backend.persistence.storageClass` and `write.persistence.storageClass` to `gp3`. If you are using the three-targets mode, the `read` component will be a Deployment and doesn’t require PVCs.
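In values.yaml, that’s simply the following (assuming a `gp3` StorageClass already exists in your cluster):
```yaml
backend:
  persistence:
    storageClass: gp3
write:
  persistence:
    storageClass: gp3
```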