Contents

Getting Started with Grafana Loki, Part 1: The Concepts

Preface

Our logging solution for Kubernetes cluster has been CloudWatch Logs for a long time, and we were ok with it. For applications with spcial requirements, we leveraged S3 for long-term, low-cost storage, then query with Athena.

However, as more and more services being containerized and moved into Kubernetes cluster, issues start to emerge:

  • The time from ingestion to ready for search is suboptimal. Take CloudWatch Insights as an example, it takes roughly 2 minutes to find the logs from my experience.

  • Searching logs at different places is inconvenient and slow for service team members, let alone comparing them.

  • CloudWatch Logs cost is going to increase significantly. We can put it into S3, but there will be no metric filters.

  • Colleagues from very different tech stacks need a lot of time to learn different things.

This post is part of the Kubernetes logging series:

“Problems solved” is not enough

So I started to search for other feasible solutions. It’s not difficult to solve the problems above. There are solutions from Datadog, Splunk, and don’t forget the classic EFK stack.

I checked a few choices then stopped at Grafana Loki (hence this post). I heard this before but haven’t really looked into it. There are always too many new things going on, especially in the cloud native world. I think we can all agree that tools come and go. What’s important is the values they can provide.

After comparing the solutions a bit, I think Loki can solve our problems like the solutions mentioned above. Plus:

Grafana becomes an one-stop-shop

We are already using the Prometheus with Grafana stack for metrics monitoring; Loki will perfectly fit. Team members know how to use PromQL for metrics, and LogQL uses a similar concept. They will have a lower cognitive load when working on logs. Besides, they can figure things out quickly now that they don’t need to navigate between different places.

Logging is a critical part of our observability journey, along with the traces in the future (yeah, we are late to the party). So the experience and efficiency here are important.

It can store data in S3 (and others, of course)

This significantly lowers the cost for us. One thing I laughed so hard when watching the Getting started with logging and Grafana Loki session is that most of the logs are really “write once, read never”.

Yes, data storage costs for lying there, but only $0.025 per GB in ap-northeast-1. Let’s assume there is a self-hosted Elasticsearch cluster as an example; a single cheapest EBS (sc1, somehow impractical in this case) costs $0.018 per GB-month, let alone the cost of instances running all night.

Object store is much more cost-effective here, and you get the high availability and durability.

Better yet, we still get to transform logs into metrics.

Concepts

Logs, labels, streams, and chunks

./loki-seesion-1-indexing.png
source: Getting started with logging and Grafana Loki

A Log sent to Loki will be treated as two parts:

  1. metadata

  2. data (the log itself, duh)

Metadata consists of timestamp and the label set. The usage of labels is similar to Prometheus Labels, and it’s critical for both writing and reading.

When a log arrives, Loki hashes the tenant ID and the label set to calculate which stream it belongs to.

Logs of a specific stream are batched and stored as chunks. Loki is only efficient for large chunks.

The following label sets will be sorted and are considered as the same stream (given the same tenant):

1
2
{cluster="us-central-00", namespace="loki-1", container="ingester"}
{cluster="us-central-00", container="ingester", namespace="loki-1"}

However, the following are two separate steams (again, given the same tenant):

1
2
{cluster="us-central-00", namespace="loki-1", container="ingester"}
{cluster="us-central-01", namespace="loki-1", container="ingester"}
Be aware of high cardinality

A large number of labels or values (e.g., status code, user ID, IP address, etc.) can result in too many different streams and finally leads to bad performance. It should be avoided, especially if you’re coming from index-heavy solutions.

To quote the official document:

High cardinality causes Loki to build a huge index (read: $$$$) and to flush thousands of tiny chunks to the object store (read: slow). Loki currently performs very poorly in this configuration and will be the least cost-effective and least fun to run and use.

See Cardinality for more details.

A few rules of thumb about labels:

  • Labels should not be extracted from Logs. Unless streams generate throughput more than, say, ~3MB/s, then it makes sense to split into smaller streams.

  • Too many labels make indices too big, which means slow. This sounds counterintuitive, but it’s the small indices that makes Loki fast.

  • If you are not sure what to put in a label set, stick to the topology. It should just tell where the logs come from. For example: {cluster="us-central-00", namespace="loki-1", container="ingester"}. However, if it’s a supermassive Kubernetes cluster, you probably need to think twice before using container name.

  • Keep total unique streams per 24 hours less than 200,000 per tenant.

Loki, unlike the heavily-indexed solutions, only indexes metadata. With the “label rules” above, You might ask: how does it get fast? It turns out, Loki has to load all chunks that match labels and search window by breaking the “big query” into much smaller ones for queriers (querier is explained below) to run in parallel, and finally put them together.

Its brute way of fetching logs is different than other solutions in exchange for low cost. And you still get to decide the query power.

It’s incredible how this mechanism pulls off and how powerful cloud providers’ object storage is nowadays.

./loki-session-2-queries.png
source: Getting started with logging and Grafana Loki

Write path: distributor and ingester

The write path is the journey of how logs are ingested into your storage. There are two components involved: distributor and ingester.

Distributor

When a distributor receives a log request, it has several jobs to do:

  1. Validating the data for things like whether labels are valid, the timestamp is not too new or too old (out-of-orders are enabled by default), and log lines are too long.
  2. Normalizing (sorting) the labels. Same label set with different label order is identical like the examples above:
    1
    2
    
    {cluster="us-central-00", namespace="loki-1", container="ingester"}
    {cluster="us-central-00", container="ingester", namespace="loki-1"}
    
  3. Rate Limiting. To simply put, the distrubter’s rate limit is based on the per tenant limit and the number of distributers. Distributers know how many their friends are out there by using ring config (e.g., gossip ring with memberlist). If tenant A has rate limit of 10MB per second, and there are current 10 distributers, each distributor will have 10MB/10 = 1 MB per second for tenant A.
  4. Hashing. Distributor hashes the tenant and the label set to generate a stream ID and find ingesters with hash ring (more about hash ring later).
  5. Distributing. Finally! Distributers send the logs to appropriate ingesters using stream ID. However, sending requests to ingesters can fail. Everything fails. Fortunately, there is a config called replication factor, which means a quorum of writes must succeed. If the replication factor is 3, then 2 writes must not fail (floor(replication_factor / 2) + 1). If the request fails, the distributor will retry. When the ingesters successfully ingested logs into storage, distributor responses with success code.

As you can see, the distributor carries many works, and it also handles a large volume of requests, therefore lessening the burden of ingesters. Distributor is designed to be stateless so it can be scaled horizontally under different loads.

Ingester and WAL

Distributor distributes logs. And, of course, ingester ingests logs (duh) into storage like S3.

Ingester buffer chunks in memory — until either time is up (chunk_idle_period & max_chunk_age) or chunk size has reached limit (chunk_target_size).

But we know any operations can fail. Distributers have done their job by sending logs to multiple ingesters and hope a quorum of them will succeed; otherwise, they will proceed to retry.

What about ingesters?

Ingester uses write ahead log (WAL) to make sure that data in memory won’t just vanish after a crash. That is, the ingester will save incoming data to the file system. After restarting, the ingester will replay all the events, then registers itself into the ring and begins its full-time job with other ingesters again.

When the ingester is “replaying” the events, WAL has a backpressure mechanism to restrict the memory usage. Ingester knows how much data is being replayed and checks the ingester.wal-replay-memory-ceiling threshold simultaneously. Once memory usage hits the ingester.wal-replay-memory-ceiling (default: 4GB), it flushes data to the storage (e.g. S3). 4GB is not a small number, so it’s better to check whether the default setting fits.

Ingester should be able to replay stably with the backpressure mechanism. However, there are still a few caveats:

  1. Ingester won’t be able to replay all the WAL if it’s corrupted or partially deleted. The data might be loss but won’t stop ingester from starting. You can check the metric loki_ingester_wal_corruptions_total for such an event.
  2. When the disk is full, the ingester can still write to the storage (e.g., S3), but will not log them into the local file system. Availability takes precedence over durability. You can use loki_ingester_wal_disk_full_failures_total to know this event happens. But really, you should always monitor your remaining disk space no matter what. Don’t let this kind of thing drags you down.

Read path: querier, query frontend, and … ingester (again)

Read path is about reading logs (duh).

There are multiple components involved: querier, query frontend and once again, the ingester.

Qurier

Remember we talked about how Grafana Loki breaks a “big query” into smaller ones? Quriers are responsible for executing them.

There are two sources for the querier to fetch logs:

  1. Ingesters, for in-memory data:
  2. Backend storage (e.g. S3). Querier falls back to backend storage with the same query if nothing is found in the previous step.

Query frontend

Query frontend is an optional, stateless component sitting in front of all queriers. It provides the same API endpoint as queriers. When query frontend is enabled, it holds an internal FIFO queue, and queriers will act as queue consumers.

Query frontend has several responsibilities:

  • Splitting, queueing, and putting results together. The query frontend breaks a “big query” into much smaller ones. These smaller queries are enqueued for queriers to consume parallelly and make sure subqueries can be retried once some of them failed (e.g., OOM). Finally, the query frontend receives the results from all queriers and assembles them.

  • Caching. Grafana Loki supports metric queries. Query frontend caches and reuses them later if applicable. Again, when results are not available, it enqueues the queries for downstream queriers to execute. Query frontend can leverage in-memory cache, Redis, and memcached for storage. On the other hand, the cache of log queries, seems to be under active deployment at the time of writing.

Other component(s)

Ruler

Ruler is the component for continually evaluating rules and alerts when it exceeds the threshold. If you are familiar with Prometheus alerting rules, you understand it already.

Into the ring

Grafana Loki uses consistent hash rings to achieve availability, scalability, and log sharding. Hash rings connect instances differently depending on the deployment modes. In microservices mode, there are several rings among different components. The information of rings are stored in a key-value store which defauls to memberlist. Other options like etcd, consul or inmemory are available.

memberlist uses gossip protocol to sync information of all nodes of the same component type. There are two particular rings I’d like to share:

  • Distributor ring: Remember how rate limit is calculated with the number of distributers? Distributers meet their friends using their very own distributor ring.

  • Ingester ring: Ingester ring is also used by distributers since the latter needs to know where to send. As mentioned above, a combination of tenant and label set will be hashed to a stream ID. A stream ID will be used to determine the keyspace of the ring, and the correspondent ingester will be found. If the replication factor is 2, there will be floor(2 / 2) + 1 = 2 ingesters receive logs. Take the following image from Getting started with logging and Grafana Loki as an example, ingester 03 and 04 (the next ingester, clockwise in the ring) will both receive data of the given stream from the distributor.

./loki-session-3-ingester-ring-example.png
Example of the ingester ring
source: Getting started with logging and Grafana Loki

Check the Consistent Hash Rings document for other rings.

Further readings

Cover: https://unsplash.com/photos/YLSwjSy7stw