Automatically Recover EC2 Instances That Fail Status Checks With CloudWatch Events and Lambda

The Incident

Recently, some of our EKS worker nodes suddenly became unresponsive. When I checked the EC2 console, the status check showed “Insufficient Data”.

In my past experience, when the underlying hardware gets impaired, we receive notifications. This time, however, there was little useful information, so I only did some quick investigation and then had to manually terminate these instances.

From the EC2 perspective, here is what I noticed:

  • EC2 instances stopped publishing metrics like CPU utilization and network out after 14:26.
  • StatusCheckFailed_Instance and StatusCheckFailed became 1 (from 0) at 14:34.
  • StatusCheckFailed_System became 1 (from 0) at 14:36.

From Kubernetes perspective:

  • $ k get no -o wide showed these nodes were not ready.
  • I described the nodes via $ k describe node <NODE_NAME>, but the events were empty. What I forgot to do here was check the exact reason why these nodes were marked as not ready.
  • The pods on these nodes were marked as Terminating, possibly because kubelet stopped reporting. There were no events when I described these pods, either.

According to AWS support engineers, this incident was indeed caused by the underlying hardware. For rare occasions like this, they were not able to notify customers in advance.

One thing most people agree on is that everything breaks. This wasn't the first time I'd encountered an underlying hardware issue that left instances unable to work.

However, I hadn't created status check alarms for some workloads, and that's a really bad idea.

Before We Start

There are some caveats we need to know about:

  • There is a sizable gap between the time EC2 stopped publishing data (14:26) and the time the status check metrics reflected the situation (14:34). It's very likely the instances were already dead before 14:26.
  • StatusCheckFailed becomes 1 when either StatusCheckFailed_Instance or StatusCheckFailed_System fails.
  • If you want to recover an unresponsive instance, you can only use StatusCheckFailed_System for the alarm. But again, in this incident, the StatusCheckFailed_System metric became non-zero two minutes after StatusCheckFailed_Instance. Instances like EKS worker nodes are simply not worth preserving; they can be thrown away since you can just use a fresh one.

Automatically Recover Instances with CloudWatch Events and Lambda

In this post, we will leverage Serverless Framework, CloudWatch Events, and Lambda to recover unresponsive instances created by a limited set of auto scaling groups.

It’s also possible to create alarms for every single EC2 instance or instances created by all the auto scaling groups. But you need to consider the cost and the necessity.

This CloudWatch Events/Lambda combination is straightforward. It only:

  1. Creates an alarm when an instance of certain auto scaling groups is launched.
    • This alarm will trigger an EC2 action to recover/terminate/stop/reboot that instance.
  2. Deletes the alarm once the instance has been terminated.

For the complete sample code, please check github.com/wtchangdm/aws-samples/tree/master/asg-ec2-status-check-alarm.
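
If you just want the gist without opening the repository, below is a minimal sketch of what such a handler might look like. It is not the repository's alarmManager.js: the alarm parameters, the AWS SDK v2 client, and the assumption that the event detail exposes the instance ID as EC2InstanceId (see samplePayload.json) are all mine.

// A minimal sketch of an alarm manager handler, not the actual alarmManager.js.
// Assumes AWS SDK for JavaScript v2 and an auto scaling event whose detail
// carries the instance ID as EC2InstanceId.
const AWS = require('aws-sdk')

const cloudwatch = new AWS.CloudWatch()

const alarmName = (instanceId) => `ec2-status-check-${instanceId}`

exports.handler = async (event) => {
  const instanceId = event.detail.EC2InstanceId
  const region = event.region

  if (event['detail-type'] === 'EC2 Instance Launch Successful') {
    // Create an alarm that recovers the instance when the system status check fails.
    await cloudwatch.putMetricAlarm({
      AlarmName: alarmName(instanceId),
      Namespace: 'AWS/EC2',
      MetricName: 'StatusCheckFailed_System', // the only metric that supports the recover action
      Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
      Statistic: 'Maximum',
      Period: 60,
      EvaluationPeriods: 2,
      Threshold: 0,
      ComparisonOperator: 'GreaterThanThreshold',
      AlarmActions: [`arn:aws:automate:${region}:ec2:recover`]
    }).promise()
  } else if (event['detail-type'] === 'EC2 Instance Terminate Successful') {
    // The instance is gone, so its alarm is no longer needed.
    await cloudwatch.deleteAlarms({ AlarmNames: [alarmName(instanceId)] }).promise()
  }
}

This sketch uses the recover action; for throwaway workloads like the EKS worker nodes above, you would swap in terminate and could use the plain StatusCheckFailed metric instead.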

Folder structure

.
└── asg-ec2-status-check-alarm
   ├── alarmManager.js # The file containing the handler itself.
   ├── config # Here I stored some configs for different environments like qa, prod, etc.
   │  ├── config.js
   │  └── qa.js # Sample config for QA environment.
   ├── package.json
   ├── README.md
   ├── samplePayload.json # This file gives you an idea of what an event looks like.
   ├── serverless.yml # The file Serverless Framework looks for to do its work.
   └── yarn.lock

serverless.yml

# ...redacted
functions:
  alarmManager:
    handler: alarmManager.handler
    # Several values below actually reference a function in a config file.
    # Check the Serverless Framework documentation linked at the bottom.
    role: ${file(./config/${opt:env}.js):getLambdaRoleArn}
    description: This Lambda creates an alarm when an instance is launched by certain auto scaling groups, and deletes it when such instances are terminated.
    environment:
      NODE_ENV: production
      ENV: ${opt:env}
    events:
      - cloudwatchEvent:
          enabled: true
          event:
            source:
              - "aws.autoscaling"
            detail-type:
              # We only need these two events.
              # When an instance is launched, create an alarm.
              - "EC2 Instance Launch Successful"
              # When an instance is terminated, delete the alarm.
              - "EC2 Instance Terminate Successful"
            detail:
              # We only create alarms for certain auto scaling group instances.
              AutoScalingGroupName: ${file(./config/${opt:env}.js):getAsgList}
  • The ${opt:env} here stands for the --env SOMETHING we pass to the sls (or serverless) command.
  • The AutoScalingGroupName above is an array. However, we don't want to hard-code these auto scaling group names or split them across several serverless.yml files. Instead, we specify the environment and let Serverless Framework look up the corresponding config file.

We also expose the ENV environment variable to the Node.js runtime so it can require the correct config file at run time.
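
For reference, a per-environment config file that satisfies those two lookups can be as simple as the sketch below. The role ARN and auto scaling group names are placeholders rather than values from the repository; Serverless Framework resolves ${file(./config/qa.js):getLambdaRoleArn} by reading the exported property (and calling it if it is a function).

// config/qa.js — a hypothetical QA config; the ARN and ASG names are placeholders.
module.exports.getLambdaRoleArn = () => 'arn:aws:iam::123456789012:role/lambda-alarm-manager'

module.exports.getAsgList = () => [
  'eks-worker-nodes-qa',
  'another-asg-qa'
]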

config.js

// ...redacted
const EC2_ACTIONS = {
  REBOOT: 'reboot',
  RECOVER: 'recover',
  STOP: 'stop',
  TERMINATE: 'terminate'
}
// ...redacted

As mentioned before, we can just throw away these instances in this case. But for other workloads, an Elasticsearch cluster for example, you might want to recover the instance rather than terminate it. It's all up to you.

Just don't forget that if you need to recover it, you can only create the alarm with the StatusCheckFailed_System metric.
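
To make that constraint explicit in code, helpers along the following lines could derive the metric and the alarm action ARN from the chosen action. metricFor and alarmActionArn are hypothetical names, not part of the sample repository, and the snippet assumes config.js exports EC2_ACTIONS.

// Hypothetical helpers: only StatusCheckFailed_System supports the recover action,
// so the metric has to follow the action you pick.
const { EC2_ACTIONS } = require('./config/config') // assumes config.js exports EC2_ACTIONS

const metricFor = (action) =>
  action === EC2_ACTIONS.RECOVER ? 'StatusCheckFailed_System' : 'StatusCheckFailed'

// EC2 alarm actions use the arn:aws:automate:<region>:ec2:<action> format.
const alarmActionArn = (region, action) => `arn:aws:automate:${region}:ec2:${action}`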

IAM Policy for Lambda Role

Please note: if this is the first time you have used PutMetricAlarm in this account, make sure to read these two links first:

You can either create the service-linked role manually on the AWS Console (or via CreateServiceLinkedRole), or add the following policy for the one-time service-linked role creation:

{
  "Effect": "Allow",
  "Action": "iam:CreateServiceLinkedRole",
  "Resource": "arn:aws:iam::*:role/aws-service-role/events.amazonaws.com/AWSServiceRoleForCloudWatchEvents*",
  "Condition": {
    "StringLike": {
      "iam:AWSServiceName": "events.amazonaws.com"
    }
  }
}

along with these two alarm create/delete permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DeleteAlarms"
      ],
      "Resource": "*"
    }
  ]
}

These are the minimum permissions you will need besides AWSLambdaBasicExecutionRole.

Deployment and Testing

$ sls deploy --env qa # or other environments you've created.

Increase the desired capacity by 1 to see if it works.

$ aws autoscaling set-desired-capacity --auto-scaling-group-name <ASG_NAME> --desired-capacity <N+1>

Finally, let's see if the alarm is deleted after the instance is terminated.

$ aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <INSTANCE_ID> --should-decrement-desired-capacity
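
If you'd rather not touch a real auto scaling group right away, you can also exercise the handler locally against samplePayload.json. A hypothetical smoke test, assuming the handler returns a promise:

// localTest.js — a hypothetical smoke test that feeds the sample payload to the handler.
// Note: this still calls CloudWatch with your local AWS credentials.
const { handler } = require('./alarmManager')
const payload = require('./samplePayload.json')

handler(payload)
  .then(() => console.log('handler finished'))
  .catch((err) => console.error('handler failed', err))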

Further Readings
