The Incident#
Recently, some of our EKS worker nodes suddenly became unresponsive. When I checked the EC2 console, the status check showed “Insufficient Data”.
In my past experience, we would get notifications when the underlying hardware got impaired. This time, however, there wasn’t much useful information, so I only did some quick investigation and then had to manually terminate these instances.
From the EC2 perspective, here is what I noticed:
- EC2 instances stopped publishing metrics like CPU utilization and network out after 14:26.
- StatusCheckFailed_Instance and StatusCheckFailed became 1 (from 0) at 14:34.
- StatusCheckFailed_System became 1 (from 0) at 14:36.
From the Kubernetes perspective:
- $ k get no -o wide showed these nodes were not ready.
- I described the nodes via $ k describe node <NODE_NAME>, but the events were empty. What I forgot to do here was to check the exact reason why these nodes were marked as not ready.
- The pods on these nodes were marked as Terminating, possibly because kubelet stopped reporting. There were no events either when I described these pods.
After getting answers from AWS’s support engineer, I learned that this incident was indeed caused by the underlying hardware. For rare occasions like this, they are not able to notify customers in advance.
One thing most people agree on is that everything breaks. This isn’t the first time I’ve encountered an underlying hardware issue that left instances unable to work.
However, I hadn’t created status check alarms for some workloads, and that was a really bad idea.
Before We Start#
There are some caveats we need to know about:
- There is a huge gap between the time EC2 stopped publishing data (14:26) and the time the status check metrics reflected the situation (14:34). It’s very likely that the instances were already dead before 14:26.
- StatusCheckFailed becomes 1 when either StatusCheckFailed_Instance or StatusCheckFailed_System fails.
- If you want to recover an unresponsive instance, you can only use StatusCheckFailed_System for the alarm. But again, in this incident, the StatusCheckFailed_System metric became non-zero two minutes after StatusCheckFailed_Instance. Instances like EKS worker nodes just aren’t worth preserving and can be thrown away, since you can simply use a fresh one.
Automatically Recover Instances with CloudWatch Events and Lambda#
In this post, we will leverage Serverless Framework, CloudWatch Events, and Lambda to recover unresponsive instances created by a limited set of auto scaling groups.
It’s also possible to create alarms for every single EC2 instance, or for instances created by all auto scaling groups, but you need to consider the cost and whether it’s necessary.
This CloudWatch Events/Lambda combination is straightforward. It only:
- Creates an alarm when an instance of certain auto scaling groups is launched.
  - This alarm will trigger an EC2 action to recover/terminate/stop/reboot that instance.
- Deletes the alarm once the instance has been terminated.
For the complete sample code, please check github.com/wtchangdm/aws-samples/tree/master/asg-ec2-status-check-alarm.
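Conceptually, the handler only needs to branch on the event’s detail-type. Below is a minimal sketch of that flow, assuming the AWS SDK for JavaScript v2; the alarm name helper, the thresholds, and the terminate action are illustrative choices of mine, not the repo’s actual alarmManager.js:

// Minimal sketch of the create/delete flow (not the repo's actual implementation).
const AWS = require('aws-sdk')
const cloudwatch = new AWS.CloudWatch()

// Hypothetical naming scheme: derive the alarm name from the instance ID.
const alarmNameFor = (instanceId) => `status-check-failed-${instanceId}`

module.exports.handler = async (event) => {
  // Auto scaling lifecycle events carry the instance ID in event.detail.EC2InstanceId.
  const instanceId = event.detail.EC2InstanceId
  const alarmName = alarmNameFor(instanceId)

  if (event['detail-type'] === 'EC2 Instance Launch Successful') {
    // Create an alarm that terminates the instance when its status check fails.
    // Swap the metric to StatusCheckFailed_System and the action suffix to
    // "recover" if you want EC2 to recover the instance instead.
    await cloudwatch.putMetricAlarm({
      AlarmName: alarmName,
      Namespace: 'AWS/EC2',
      MetricName: 'StatusCheckFailed_Instance',
      Dimensions: [{ Name: 'InstanceId', Value: instanceId }],
      Statistic: 'Maximum',
      Period: 60,
      EvaluationPeriods: 3,
      Threshold: 0,
      ComparisonOperator: 'GreaterThanThreshold',
      AlarmActions: [`arn:aws:automate:${process.env.AWS_REGION}:ec2:terminate`]
    }).promise()
  } else if (event['detail-type'] === 'EC2 Instance Terminate Successful') {
    // The instance is gone; clean up its alarm.
    await cloudwatch.deleteAlarms({ AlarmNames: [alarmName] }).promise()
  }
}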
Folder structure#
.
└── asg-ec2-status-check-alarm
    ├── alarmManager.js # The file containing the handler itself.
    ├── config # Configs for different environments like qa, prod, etc.
    │   ├── config.js
    │   └── qa.js # Sample config for the QA environment.
    ├── package.json
    ├── README.md
    ├── samplePayload.json # This file gives you an idea of what an event looks like.
    ├── serverless.yml # The file Serverless Framework looks for to do its work.
    └── yarn.lock
serverless.yml#
# ...redacted
functions:
alarmManager:
handler: alarmManager.handler
# You will find several lines that actually link to a function.
# Check the Serverless Framework document at the bottom.
role: ${file(./config/${opt:env}.js):getLambdaRoleArn}
    description: This Lambda creates an alarm when an instance is launched by certain auto scaling groups, and deletes the alarm when such instances are terminated.
environment:
NODE_ENV: production
ENV: ${opt:env}
events:
- cloudwatchEvent:
enabled: true
event:
source:
- "aws.autoscaling"
detail-type:
# We only need these 2 events.
# When instance is launched, create an alarm.
- "EC2 Instance Launch Successful"
# When instance is terminated, delete the alarm.
- "EC2 Instance Terminate Successful"
detail:
              # We only create alarms for instances of certain auto scaling groups.
AutoScalingGroupName: ${file(./config/${opt:env}.js):getAsgList}
- The ${opt:env} here stands for the --env SOMETHING we passed to the sls (or serverless) command.
- The AutoScalingGroupName above is an array. However, we don’t want to leave these auto scaling group names hard-coded or split them into several serverless.yml files. Instead, we specify the environment and let Serverless Framework look up the corresponding config file.
We also expose the ENV environment variable to the Node.js runtime so it can require the correct config file we want to use.
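For reference, a per-environment config file might look roughly like the sketch below. The role ARN and auto scaling group names are placeholders of mine, not values from the actual repo; Serverless Framework calls the exported functions when it resolves the ${file(./config/${opt:env}.js):...} variables.

// config/qa.js - hypothetical example; replace the ARN and ASG names with your own.
module.exports.getLambdaRoleArn = () => 'arn:aws:iam::123456789012:role/alarm-manager-lambda-role'

// This array ends up in the CloudWatch Events rule as detail.AutoScalingGroupName,
// so only instances from these auto scaling groups trigger the Lambda.
module.exports.getAsgList = () => ['eks-worker-qa-a', 'eks-worker-qa-b']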
config.js#
// ...redacted
const EC2_ACTIONS = {
REBOOT: 'reboot',
RECOVER: 'recover',
STOP: 'stop',
TERMINATE: 'terminate'
}
// ...redacted
As mentioned before, we can just throw away these instances in this case. But for other workloads, for example an Elasticsearch cluster, you might want to recover the instance rather than terminate it. It’s all up to you.
Just don’t forget that if you need to recover it, you can only create the alarm with the StatusCheckFailed_System metric.
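If you make the action configurable, one way to wire it up (a sketch under my own assumptions, not necessarily how the repo does it) is to map the chosen action to the corresponding EC2 alarm action ARN and pick the metric accordingly:

// Sketch: derive alarm settings from the configured EC2 action.
// EC2_ACTIONS is the object from config.js above; region handling is simplified.
const buildAlarmSettings = (action, region) => ({
  // "recover" only works with the StatusCheckFailed_System metric.
  MetricName: action === EC2_ACTIONS.RECOVER
    ? 'StatusCheckFailed_System'
    : 'StatusCheckFailed_Instance',
  AlarmActions: [`arn:aws:automate:${region}:ec2:${action}`]
})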
IAM Policy for Lambda Role#
Please note: if this is the first time you use PutMetricAlarm in this account, make sure to read these two links first:
You can either create the service-linked role with CreateServiceLinkedRole manually (e.g., from the AWS Console), or add the following policy for one-time service-linked role creation:
{
"Effect": "Allow",
"Action": "iam:CreateServiceLinkedRole",
"Resource": "arn:aws:iam::*:role/aws-service-role/events.amazonaws.com/AWSServiceRoleForCloudWatchEvents*",
"Condition": {
"StringLike": {
"iam:AWSServiceName": "events.amazonaws.com"
}
}
}
along with these two alarm create/delete permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"cloudwatch:PutMetricAlarm",
"cloudwatch:DeleteAlarms"
],
"Resource": "*"
}
]
}
These are the minimum permissions you will need besides AWSLambdaBasicExecutionRole.
Deployment and Testing#
$ sls deploy --env qa # or other environments you've created.
Increase the desired instance count by 1 to see if it works:
$ aws autoscaling set-desired-capacity --auto-scaling-group-name <ASG_NAME> --desired-capacity <N+1>
Finally, let’s see if the alarm will be deleted after the instance is terminated:
$ aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <INSTANCE_ID> --should-decrement-desired-capacity