Getting Started with Grafana Loki, Part 2: Up and Running

This post will focus on the day-2 operations of Grafana Loki, providing general suggestions and some caveats when deploying on Kubernetes with AWS.

I highly recommend reading through the Operations section in Grafana’s documentation before diving into this article, as it serves as a complementary resource.

Additionally, this post will be updated whenever I come across noteworthy information.

Keep up with the…official resources

I spent a lot of time, and I mean a lot, reading the documentation to really understand how Grafana Loki works and what the corresponding configurations do. In my opinion, the official webinars and blog posts explain things much better.

Here are some resources I highly recommend reading besides the documentation:

If you are using RSS, you can subscribe to the following links:

General operation

Consider using the officially supported Helm Chart

From the official blog post:

Going forward, this will be the only Helm chart we use to run any Helm-deployed Loki clusters here at Grafana Labs.

We of course encourage the community to continue to create and maintain charts that work for their use cases, but we plan to play a less active role in those other charts as we focus our energy on this new one.

There are just way too many Helm charts for Grafana Loki. While it's perfectly fine to use a community chart, it's easier to follow the official one.

If you have been using the loki-distributed chart for a long time, as I did, you can refer to Migrate from loki-distributed Helm Chart to migrate.

A few notes for the migration:

  1. It only supports single binary mode and simple-scalable deployment (aka "SSD") mode. If you are looking for a microservices deployment and want to juggle all the components we covered in part 1, you can't use this chart.

    How do you know if you need the microservices deployment? If you have to ask, you don't need it.

  2. Ensure you carefully review the values file between these two charts.

  3. I believe 99.999% of people reading this article will use the simple-scalable deployment. You might as well check out Migrate To Three Scalable Targets.

    By default, this chart deploys not only the read (query frontend & querier) and write (distributor & ingester) targets, but also a backend target, which runs components like the ruler, compactor, table manager, and query scheduler. Using the three-target mode allows the read target to run as a Deployment instead of a StatefulSet. A rough values.yaml sketch follows.
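As a rough illustration, and assuming the read.legacyReadTarget flag the migration guide refers to at the time of writing (double-check the flag name and defaults against your chart version), the relevant part of values.yaml looks something like this:

    read:
      legacyReadTarget: false   # assumed flag name; opts in to the three-target layout
      replicas: 3
    write:
      replicas: 3
    backend:
      replicas: 3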

Enabling compactor

To borrow from part 1: an ingester writes an index every 15 minutes, which results in 96 indices per day. Since it's unlikely that you have only one ingester in a production environment, there can be hundreds of indices in a single day.

Compacting these duplicate indices not only saves costs but also lowers the overall query latency.
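If you manage the configuration through the chart's loki section, a minimal sketch of enabling the compactor looks something like this (shared_store should match your object store, and the exact nesting can differ between chart versions):

    loki:
      compactor:
        working_directory: /var/loki/compactor
        shared_store: s3              # match your object store
        compaction_interval: 10m
        # retention_enabled: true     # only needed for retention/deletion, covered below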

Learn about log deletion

Day-2 problems are sometimes underestimated (and "sometimes" is itself an understatement). It's therefore helpful to know how to delete logs in case you ever need to.

Note
  1. At the time of this writing (2023/07/17), the official documentation says log deletion only works with boltdb-shipper, but I've tested it with the TSDB index and it works as well.

  2. The following configurations assume you already use the official grafana/loki chart described above.

  3. If you use a multi-tenant setup, make sure to include the X-Scope-OrgID header in the requests below.

I recommend you read the document about log entry deletion first.

To delete selected logs, you have to:

  • Set loki.compactor.retention_enabled to true. Make sure you have reviewed loki.limits_config.retention_period (next bullet) before enabling this.

  • Check your loki.limits_config.retention_period. I assume you only want to "delete" the selected logs and don't want Loki to periodically delete old logs. On Grafana Loki 2.8 or later, you can set this value to 0s, which disables time-based retention entirely. On earlier versions, you need to use a "really large value," according to the documentation.

  • Set loki.limits_config.deletion_mode to either filter-only or filter-and-delete. The decision is yours.

  • The logs are actually deleted only after loki.compactor.delete_request_cancel_period has elapsed. Set this value before you start deleting logs.

  • Determine the logs you want to delete:

    { my_service="foo", container="bar" }
    |= "this should be deleted!!1"
  • If you are sure the result is all you want to delete, perform the request to the compactor:

    curl -X POST 'http://<YOUR_LOKI_ENDPOINT>/loki/api/v1/delete?query={%20my_service%3D%22foo%22%2C%20container%3D%22bar%22%20}%7C%3D%20%22this%20should%20be%20deleted!!1%22&start=1685548800&end=1688140799'

    Where:

    • the query is, of course, the query. It must be URL-encoded.

    • start and end are epoch timestamps in seconds (as in, 10-digit). If you'd rather not hand-encode the query, see the sketch right after this list.
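The same request can be built with curl's --data-urlencode instead of manual encoding, a sketch of which follows (-G puts the parameters in the query string while -X POST keeps the method the delete API expects; add the X-Scope-OrgID header if you run multi-tenant):

    curl -G -X POST 'http://<YOUR_LOKI_ENDPOINT>/loki/api/v1/delete' \
      --data-urlencode 'query={ my_service="foo", container="bar" } |= "this should be deleted!!1"' \
      --data-urlencode 'start=1685548800' \
      --data-urlencode 'end=1688140799'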

How do you know whether the logs were deleted? There are several ways to check:

  1. Query these logs in the given time window after loki.compactor.delete_request_cancel_period.

  2. Check the logs of the backend pod running the compactor (only one of the backend pods will actually run it).

  3. Run the following command and check whether the status is processed. (The documentation doesn't list the possible status values, though.)

    curl -X GET 'http://<YOUR_LOKI_ENDPOINT>/loki/api/v1/delete'

That's pretty easy, isn't it? Yes, and that's the problem. It's too easy.

And this leads us to the next point: Prevent unwanted access.

Prevent unwanted access using basic auth and network policy

Grafana Loki itself doesn't come with built-in authentication. It relies on the loki-gateway component, which is effectively an nginx reverse proxy in front of the Loki cluster.

To enable authentication on loki-gateway, follow the steps below:

  1. Ensure your logging agent (e.g., Fluent Bit) has its basic auth config set up beforehand. This won't cause problems even if you haven't enabled basic auth on loki-gateway yet.

  2. Make sure Grafana's Loki data source has its basic auth config set up beforehand. As with the logging agent, it won't cause problems at this point.

  3. Generate the .htpasswd file:

    $ docker run --rm -it ubuntu bash # spin up a one-off ubuntu container
    root@1473bb48ba60:/# apt update; apt install -y apache2-utils
    # Installing...
    root@1473bb48ba60:/# htpasswd -c /tmp/.htpasswd user56 # choose any username you like, of course
    New password: # you will be asked to type the password here
    Re-type new password: # and type it again
    Adding password for user user56
  4. Take the content of /tmp/.htpasswd and create a Secret in the namespace where Grafana Loki is installed. Call it whatever you like, say loki-gateway-auth, with a format like this:

    apiVersion: v1
    kind: Secret
    type: Opaque
    metadata:
      name: loki-gateway-auth
    stringData:
      .htpasswd: <THE_CONTENT_YOU_JUST_COPIED_AND_NO_NEED_TO_BASE64_ENCODE_BECAUSE_IT_IS_STRING_DATA>

    You can use a base64-encoded value in the data field instead if you prefer.

  5. Update your values.yaml (see the sketch after this list):

    • set gateway.basicAuth.enabled to true.
    • set gateway.basicAuth.existingSecret to loki-gateway-auth.
  6. Profit.
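Putting steps 4 and 5 together, the relevant part of values.yaml ends up looking roughly like this (key names as in the official grafana/loki chart at the time of writing; double-check against your chart version):

    gateway:
      basicAuth:
        enabled: true
        existingSecret: loki-gateway-auth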

…Or is it?

Now that we have loki-gateway with basic auth set up, isn’t that supposed to be safe?

Usually, a request follows a flow like the one below (depending on whether you use an ingress or just in-cluster services):

It looks safe since we’ve got this pretty cool lock šŸ”’! But what about direct access through the cluster’s service FQDN? What about direct Pod IP access?

Consider the following situation:

A chain is only as strong as its weakest link. We have to strengthen it in ways that basic auth alone can't.

Fortunately, we have network policies.

First, label the namespaces that should be allowed to send requests to Grafana Loki (an example kubectl command follows the policy), and then apply a NetworkPolicy like the following:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-loki-access
  namespace: loki
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: ingress-nginx
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: loki
        - namespaceSelector:
            matchLabels:
              some-label-on-namespace: grafana-is-here

This reads: "for all pods (podSelector: {}) in the loki namespace, accept only traffic from pods in namespaces whose some-label-on-namespace label is set to either ingress-nginx, loki, or grafana-is-here".
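For the namespaceSelector above to match anything, those namespaces need to carry the label. A sketch, assuming the namespaces are named ingress-nginx, loki, and monitoring (adjust to your own):

    kubectl label namespace ingress-nginx some-label-on-namespace=ingress-nginx
    kubectl label namespace loki some-label-on-namespace=loki
    kubectl label namespace monitoring some-label-on-namespace=grafana-is-here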

You can combine more restrictive rules such as IP blocks, ports, protocols, and/or pod selectors based on your needs.

Use TSDB index

The TSDB index was introduced in Grafana Loki 2.7 as an experimental feature and graduated to stable in 2.8. To put it simply, TSDB enables better query planning and more stable resource usage, thereby reducing the chance of OOM on the queriers.

I encourage you to read Loki's new TSDB Index by Owen Diehl to learn more about the details of the TSDB index.

The official TSDB documentation also provides a great example of how to start using TSDB by updating the schema_config. Review the settings carefully: do not change the index type for dates in the past; only enable the TSDB index starting from a future date (the from field in each schema entry is interpreted as UTC).
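For illustration, appending a TSDB period to an existing schema_config looks roughly like this (the dates and the old boltdb-shipper entry are placeholders; keep your real history untouched and only add the new entry with a future from date):

    schema_config:
      configs:
        - from: 2022-01-01          # existing entry, leave as-is
          store: boltdb-shipper
          object_store: s3
          schema: v12
          index:
            prefix: index_
            period: 24h
        - from: 2023-08-01          # a future date (UTC) when the switch happens
          store: tsdb
          object_store: s3
          schema: v12
          index:
            prefix: tsdb_index_
            period: 24h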

Set proper timeout values

Query execution time depends on your actual query and the time window. It can take minutes if it’s a massive query.

Let’s take a look at a query request’s journey:

To allow longer timeouts, you will need to:

  1. Set timeout value in Grafana’s Grafana Loki data source.

  2. If your Grafana connects to Grafana Loki via a load balancer (either a load balancer in front of your ingress controller or a Service of type LoadBalancer), make sure you have set the timeout on the LB itself (e.g., the default idle timeout of an Amazon ALB is only 60 seconds).

  3. If you expose Grafana Loki through an Ingress Controller, you also need to increase the timeout there.

    For example, I am using ingress-nginx as my ingress controller. I have the following annotations set in the values.yaml:

    # (...omitted)
    gateway:
      ingress:
        annotations:
          nginx.ingress.kubernetes.io/proxy-connect-timeout: "600"
          nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
          nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  4. The loki-gateway (nginx) has proxy_read_timeout set to 600 seconds.

  5. There are two things you need to consider on the read component (see the sketch after this list):

    1. loki.server.http_server_read_timeout should be set high enough, for example, 600s (or 10m).
    2. loki.limits_config.query_timeout should be set high enough, for example, 600s (or 10m).
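A sketch of those two settings as they sit under the chart's loki section (600s is just the example value from above):

    loki:
      server:
        http_server_read_timeout: 600s
      limits_config:
        query_timeout: 600s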

A general rule of thumb: The further left the component is in the architecture, the higher the timeout value it requires.

Enable route_randomly for Redis cluster

Earlier this year, we were setting up monitoring for the Redis cluster backing our Grafana Loki. We noticed Loki didn't utilize the Redis cluster's read replicas as expected, so all of the read pressure had been on the primary node all along. And, you guessed it, that caused bandwidth issues as well.

After some investigation, it turned out the Redis client wasn't using read replicas because of the client's default settings. I submitted a small pull request to allow users to increase the Redis cluster's utilization as much as possible.

Now that Grafana Loki 2.9.0 has been released, you can enable route_randomly to spread read requests across all replicas! This option defaults to false to avoid a breaking change.

Observe the slow queries

We mentioned that some kinds of queries can be very slow. Grafana Labs provides a few ways to optimize a query (basically: line filters first, parsers later), but not every query your users run will be optimal.

You can check the value of loki.frontend.log_queries_longer_than and the query frontend's logs (|="component=frontend") to determine which queries are slow and need improvement. The metric loki_logql_querystats_latency_seconds_bucket can also give you an overview of current query latency.
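For example, a minimal sketch of turning that logging on (10s is an arbitrary threshold; pick whatever "slow" means for you):

    loki:
      frontend:
        log_queries_longer_than: 10s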

Capacity planning & cost

Measure the read component capacity

Even though the write path is the most critical part of the cluster, it's not the most difficult to maintain. The read path is where the real challenges reside.

While Grafana Loki's data retrieval and line filtering are fast, parsing is not. We often encountered slow queries when users applied the regex or JSON parsers. Query execution time with a JSON parser can easily be tens of times slower.

Our queriers have 8 CPU cores and 60Gi of RAM each. We saw some performance improvement and more stable resource usage after switching to TSDB. However, it can't help with slow parsing, especially the JSON parser combined with metric queries.

We tried fine-tuning loki.limits_config.split_queries_by_interval, loki.querier.max_concurrent, loki.limits_config.max_query_parallelism, and loki.limits_config.tsdb_max_query_parallelism to various values, but to no avail.

Therefore, you'd better have enough query power for the cases the cache can't save you from. To lower the cost of the queriers, you can use spot instances (if you are on AWS), as they are much cheaper. There is also a PodDisruptionBudget for each role, so the trade-offs of running queriers on spot instances are acceptable.

Prevent OOM using GOMEMLIMIT

I've followed the OOM (out of memory) issues of various components (mostly the querier) for a while, and they seem pretty…common.

In issue #6501, there was a comment mentioning GOMEMLIMIT. I am no Go expert, but after reading about Go's garbage collection for a while, this seems like the best remedy for OOM at the moment.

We run our queriers with massive memory capacity (typically the 60Gi mentioned above) on spot instances. But even at this size, OOM issues still occurred.

After introducing GOMEMLIMIT (50Gi in this case, though it can probably be higher) on the queriers, OOM never happened again. The downside is that performance definitely takes a hit since GC is triggered more frequently, but that still beats container crashes (and the failed queries they cause).
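A sketch of wiring that up through the chart, assuming your chart version exposes an extraEnv hook on the read component (set the limit to roughly 80-90% of the container's memory):

    read:
      extraEnv:
        - name: GOMEMLIMIT
          value: 50GiB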

By the way, if you are deleting lots of logs, keep an eye on the backend component and apply GOMEMLIMIT to it when necessary.

As for the write path, I haven't noticed any unreasonable memory spikes, but when you see one, you know what to do.

Measure cache capacity

At the time of this writing, there are a few types of cache you can leverage:

  1. Index cache (no longer needed if you are using the TSDB index, but still helpful when querying older logs that aren't covered by it).

  2. Query results cache.

  3. Chunk cache.

We started with all three types of cache enabled, using Amazon ElastiCache for Redis to serve the data. However, just like the post Monitor your bandwidth mentioned, we suffered from a lack of bandwidth, which led to a massive number of read/write operations timing out because of the networking bottleneck.

If you are facing these problems as well, you have a few options:

  1. Just retrieve chunks from S3 directly. Do enable an S3 gateway VPC endpoint to prevent AWS bill shock.

  2. Use memcached via the Grafana Loki chart. You should still estimate the expected bandwidth usage beforehand.

If you are considering Amazon ElastiCache, there is also a data tiering node type available, which uses a local disk to extend storage and offers a relatively cost-effective option in terms of storage size, but the smallest supported node type is r6gd.xlarge.

If you do stay on Redis, remember to enable the route_randomly option mentioned earlier in the cache config:

# ...(omitted)
cache_config:
  redis:
    # ...(omitted)
    route_randomly: true

Use S3 gateway VPC endpoint

This one is a must if you are running on AWS.

According to the NAT Gateway product page, there are three types of charges:

  1. NAT Gateway Hourly Charge

  2. NAT Gateway Data Processing Charge

  3. Data Transfer Charge

Even though data transfer is free when your EC2 instances and S3 are in the same region, the data processing fee can still be high. Say you write 200GB of logs to S3 and read 500GB daily in ap-northeast-1. The monthly fee will be (200 + 500) * 30 * 0.062 = $1,302 just for basic Grafana Loki read/write operations.

Another minor impact is that when a huge query shows up, taking chunks out of S3 can consume some of the bandwidth of the NAT Gateway.

The good news is that all of the problems above can easily be solved by enabling an S3 gateway VPC endpoint. A gateway VPC endpoint won't make Grafana Loki faster, but it will make it cheaper.
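Creating one is a single call (a sketch with placeholder IDs; you can also do this from the VPC console or your IaC tool of choice):

    aws ec2 create-vpc-endpoint \
      --vpc-id vpc-0123456789abcdef0 \
      --vpc-endpoint-type Gateway \
      --service-name com.amazonaws.ap-northeast-1.s3 \
      --route-table-ids rtb-0123456789abcdef0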

Use appropriate storage class

At the time of this writing (2023/07/17), a PVC is mandatory for components like write and backend. If you are on AWS, you can use EFS or EBS.

Here I will use EBS as it's cheaper (but it comes with an availability-zone restriction: once the volume is created, the pod is pinned to that AZ).

You will need to:

  1. Install the EBS CSI Driver.

  2. Create a gp3 storageclass object.

  3. Set backend.persistence.storageClass and write.persistence.storageClass to gp3 (see the sketch below). If you are using the three-target mode, read will be a Deployment and doesn't require PVCs.
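A sketch of steps 2 and 3 (the StorageClass assumes the EBS CSI driver's default provisioner name):

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: gp3
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    volumeBindingMode: WaitForFirstConsumer
    allowVolumeExpansion: true

And in values.yaml:

    backend:
      persistence:
        storageClass: gp3
    write:
      persistence:
        storageClass: gp3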
