
Autoscaling in Kubernetes: Migrate from Spot Ocean to Karpenter

As the infrastructure team at adjoe, we have three main objectives: ensuring high availability of our production systems, reducing operational costs, and enhancing developer productivity.

One important approach for achieving this is automatic scaling, commonly known as autoscaling. It minimizes resources during low-traffic periods and automatically adds more resources as traffic increases.

In simple terms, it means: 

  • We don’t overpay during low traffic times.
  • We can handle increasing amounts of traffic without manual intervention. 

Nowadays, we are running most of our infrastructure on Kubernetes (EKS). Autoscaling in Kubernetes consists of three components.

  1. Vertical pod autoscaler (running bigger pods with more resources).
  2. Horizontal pod autoscaler (running more pods and distributing traffic between them).
  3. Cluster autoscaler (adding more nodes).
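
As a quick illustration of the pod-level side, this is roughly what a minimal Horizontal Pod Autoscaler manifest looks like (the workload name and the CPU target here are made up for the example):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api               # hypothetical Deployment to scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # add pods once average CPU utilization passes 70%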

We’ll focus on cluster autoscaling, which adds new nodes to the Kubernetes cluster so workloads can be scheduled on them.

How Does It Work? 

Whenever there are insufficient resources in a cluster to run all workloads, the cluster autoscaler contacts the cloud provider API and provisions new machines that join the cluster.

To make our operations more cost-efficient, we rely heavily on AWS Spot Instances. 

AWS Spot Instances are spare-capacity instances offered at huge discounts.

But they come with a twist: they can be taken away at any moment.

Nodes leaving and joining our cluster is something that happens all the time, and it has to be fast and seamless.

These are currently the three popular solutions for cluster autoscaling:

  • Cluster Autoscaler, referred to as CAS (open source; part of the Kubernetes project)
  • Spot Ocean (proprietary, by NetApp)
  • Karpenter (open source, by AWS)

[Image: cluster autoscaling solutions - Karpenter, Spot Ocean, Cluster Autoscaler]

Karpenter has some significant advantages over CAS, and there are already many blog posts and talks covering migrations from Cluster Autoscaler (CAS) to Karpenter.

But there aren’t many resources or documentation specifically about migrating from Spot Ocean to Karpenter. 

That’s why in this post we’ll walk through our journey of migrating from Spot Ocean to Karpenter and how we did it in production without causing any downtime.

Spot Ocean vs Karpenter

Let’s start by looking at the architecture of Spot Ocean and see how it compares to Karpenter.

[Diagram: Spot Ocean architecture]

What we can see here is that the so-called Ocean Controller is deployed inside our Kubernetes cluster. It constantly reports the state of the cluster to the Spot Ocean server, which lives outside of our infrastructure and is managed by NetApp (Flexera).

Now let’s check what the same setup with Karpenter would look like:

[Diagram: the same setup with Karpenter]

We can see that we have completely gotten rid of any third-party components: Karpenter runs entirely on our own infrastructure inside our Kubernetes cluster and communicates directly with AWS from there.

This has both advantages and disadvantages. On one side, we have gained more control over our infrastructure, and we no longer have to give a third party access to our AWS account so that it can optimize resource usage.

On the other hand, we now have to manage this component ourselves, and we cannot rely on any support or pre-made observability dashboards that we had before with Spot Ocean.

Why Did We Choose Karpenter? 

Many factors went into this decision, and the goal of this post is not to complain about or diminish the great product that the folks over at Spot Ocean have built; we probably couldn’t have gotten to where we are today without them. 

The following table shows some of the many points we had to consider. 

Karpenter is open source, which allows us to dig deeper into issues and to better understand the decisions it makes. We also no longer have to pay a percentage of our spot savings to NetApp, which was one of the main motivators behind this migration from a cost point of view.

What sealed the deal for us was seeing that Karpenter provisions new nodes faster than Spot Ocean. Since we try to use spot instances as much as possible, this matters to us, as nodes come and go all the time.

Category                  Spot Ocean              Karpenter
Open source               No                      Yes
Cost                      Percentage of savings   Free
Observability             Out of the box          Metrics
Platforms                 EKS & ECS               EKS
Resource recommendation   Yes                     No
Maintenance               Lower                   Higher
Speed                     Fast                    Very fast

Which Came First: The Node or the Karpenter?

Now that we have decided to adopt Karpenter, we need to solve a little conundrum: Karpenter will be managing all the nodes joining our Kubernetes cluster, but it also needs a node to run on itself.

The two most common ways to solve this on AWS are: 

  • Managed Nodegroup.
  • Fargate.

Here, the managed nodegroup option refers to us creating a separate nodegroup that is not handled by Karpenter and exists for the sole purpose of running Karpenter.

Fargate is an AWS serverless container offering that allows us to not think about nodes at all and instead just run individual pods. Ultimately, we decided to use Fargate, and so far we are very happy with it.

That said, there were some initial problems that we wouldn’t have had if we had opted for a managed nodegroup instead. 

One of them is daemonsets. 

The premise of Fargate is that you don’t have to worry about a node; there is no node, you just run individual pods. 

But from the perspective of your daemonsets, every Fargate “node” is still a node, and they will try to schedule onto them, which is not going to work.

To work around this, you need to add the following affinity to all daemonsets inside your cluster:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate
                

These days, we have found that many helm charts already provide this affinity as a default for their daemonsets. 

This made our lives a lot easier, and since we like to contribute back whenever possible, we also made a PR to the prometheus-node-exporter helm chart to have this nodeAffinity by default. 

The PR has been merged by now and might make your work a little easier if you decide to adopt Fargate in your cluster in the future.

👉  PR prevents node-exporter from being scheduled on fargate.

Scaling Down Test Environments Outside Working Hours

One of our requirements was the ability to scale down the cluster on weekends, nights, and holidays. 

Spot Ocean made it easy for the cluster to scale to zero, as Spot Ocean itself is running outside the cluster. Here is how that works:

When a period of running time ends, Ocean automatically scales down the entire cluster to 0. During the off time, all nodes are down, the Ocean Controller is down, and does not report information to the autoscaler.

When off-time ends, Ocean starts a single node from a virtual node group without taints. If all virtual node groups have taints, Ocean starts a node from the default virtual node group unless useAsTemplateOnly is defined, in which case no node is started. In the latter case, check that the controller is running, possibly on a node not managed by Ocean.

The above is taken from the Spot Ocean documentation for Shutdown Hours.

This approach has worked very well for us over the last couple of years, but now with Karpenter, we needed to rethink this process. This is because Karpenter is part of the cluster and running inside of it, meaning that if we truly scale down to zero, there will be no one there to bring up the cluster again.

So instead of a scale-to-zero, we need to go with a scale-to-minimum approach, where the minimum consists of Karpenter and CoreDNS.

As we have established before, we are running these components on Fargate nodes, which are not managed by Karpenter. Karpenter manages the nodes for all other scaling workloads in the cluster.

Here we introduce the concept of Nodepools. Every node that Karpenter spins up has to be part of a Nodepool; therefore, a cluster that is managed by Karpenter has to have at least one Nodepool but can also have many. 

The most important property of the Nodepool for us right now is limits, which define the maximum amount of resources (CPU and memory) that can be part of this Nodepool. 
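
To make this concrete, here is a trimmed-down sketch of what such a Nodepool could look like with the karpenter.sh/v1 API (the requirements and limits are illustrative, not our exact production values):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # allow both spot and on-demand capacity
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    # hard cap on the combined resources of all nodes in this Nodepool
    cpu: "1000"
    memory: 4000Gi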

For example, let’s assume the Kubernetes cluster has a single Nodepool called default that all nodes belong to.

Now, if we want to shut down the cluster because the weekend is coming up, we can simply set the resource limits for CPU to 0 for this Nodepool and drain and delete the existing nodes.

kubectl patch nodepools.karpenter.sh default --type merge --patch '{"spec": {"limits": {"cpu": "0"}}}'

kubectl drain -l karpenter.sh/nodepool=default --ignore-daemonsets --delete-emptydir-data --disable-eviction

kubectl delete node -l karpenter.sh/nodepool=default

With a CPU limit of 0, Karpenter won’t bring up any new nodes for the cluster even though all pods are now in pending state.

When we want the cluster to scale up again, we simply patch the CPU limit for the Nodepool back to its original value.

kubectl patch nodepools.karpenter.sh default --type merge --patch '{"spec": {"limits": {"cpu": "1000"}}}'

While you can run these commands manually, we went with an automated approach and created two cronjobs for this: the scale-down cronjob runs every evening, and the scale-up cronjob runs every morning except on weekends.

Hold up, where does the scale-up cronjob run if there are no nodes? Ah right, Fargate it is again.
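
For reference, here is a minimal sketch of what our scale-down cronjob could look like (schedule, namespace, image, and service account are placeholders; the service account needs RBAC permissions to patch nodepools and to drain and delete nodes):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-scale-down
  namespace: kube-system
spec:
  schedule: "0 20 * * *"                       # every evening at 20:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-scaler   # hypothetical SA with the required RBAC
          restartPolicy: Never
          containers:
            - name: scale-down
              image: bitnami/kubectl:latest    # any image that ships kubectl works
              command:
                - /bin/sh
                - -c
                - |
                  kubectl patch nodepools.karpenter.sh default --type merge --patch '{"spec": {"limits": {"cpu": "0"}}}'
                  kubectl drain -l karpenter.sh/nodepool=default --ignore-daemonsets --delete-emptydir-data --disable-eviction
                  kubectl delete node -l karpenter.sh/nodepool=default

The scale-up cronjob is the mirror image: it patches the CPU limit back to its original value and runs on a weekday-morning schedule such as "0 6 * * 1-5".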

Migration from Spot Ocean to Karpenter

Now let’s go over the steps we took on our production cluster to switch from Spot Ocean to Karpenter without causing any downtime in the process.

We can split this into two phases:

  1. Preparation
  2. Migration

We can do all the steps under the preparation phase without causing any disturbance to the running workloads.

The preparation steps are:

  1. Fargate Profile: as this is the first time we are using Fargate in our cluster, we have to add a profile for it (a sketch follows this list).
  2. Security Group Modifications: for Fargate nodes to be able to communicate with the regular EC2 nodes in the cluster, we have to modify their security groups.
  3. Restart CoreDNS: restart the CoreDNS pods one at a time and watch them come back up on Fargate.
  4. Install Karpenter: this entails using the official Helm chart to set up Karpenter on the cluster, as well as creating the SQS queues for spot interruption notifications and setting up all the IAM permissions Karpenter needs to operate.
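
As a rough sketch of the first step, assuming the cluster is managed with eksctl (cluster name, region, and namespaces are placeholders; the same can be expressed in Terraform or CloudFormation), the Fargate profile could be declared like this:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster              # placeholder
  region: eu-central-1          # placeholder
fargateProfiles:
  - name: karpenter
    selectors:
      # everything in the karpenter namespace runs on Fargate
      - namespace: karpenter
      # CoreDNS pods in kube-system, matched by their label, run on Fargate as well
      - namespace: kube-system
        labels:
          k8s-app: kube-dns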

With these steps done, we now have two cluster autoscalers running in our cluster. At the moment, all the nodes are still managed by Spot Ocean, but we can start the switch by following these steps:

  1. Reduce the maximum number of nodes for Spot Ocean.
  2. Watch pods go pending.
  3. Karpenter provisions new nodes.
  4. Pods schedule on Karpenter nodes.
  5. Repeat.

Broken Nodes

Of course, there were problems we only noticed after we had successfully migrated to Karpenter. The most notable one was that we started seeing nodes get stuck in a `NotReady` state indefinitely.

The worst part about these cases was that the pods on those nodes would also get stuck in a terminating state.

If such a pod is part of a StatefulSet, a new pod will not come up until the old one has terminated completely, which won’t happen until someone manually intervenes and terminates the node on the cloud provider side.

These cases can thus cause actual downtime for the company.

To tackle this problem, we created a custom controller that watches for nodes that have carried the unreachable taint for longer than 10 minutes and, if it finds one, terminates it.
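
For context, this is the taint Kubernetes puts on a node it can no longer reach (excerpt of a Node spec; the timestamp is an example):

spec:
  taints:
    - key: node.kubernetes.io/unreachable
      effect: NoExecute
      timeAdded: "2025-01-01T06:00:00Z"   # our controller acts once this is older than 10 minutes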

Our controller is open source and can be found at the link below:

👉 https://github.com/adjoeio/node-notready-controller

Results

  •   5-10% Faster Node Provisioning
  •   Reduced Cluster Cost

Note: The reduced cluster cost mostly stems from the fact that we don’t pay a percentage of our spot savings to Spot.io anymore. The cost of the workloads themselves hasn’t changed noticeably. 

AWS also recently announced EKS Auto Mode, which is built on Karpenter.

If you choose this solution, Karpenter becomes part of the control plane and is managed by AWS. It sounds like a great solution until you look at the cost, so we will keep managing Karpenter ourselves.

Lessons Learned

You want to have as many instance types in your nodepool as possible. For example, we are running Kafka on one of our clusters, and we had only two instance types in its nodepool:

  • m5n
  • m6in

This caused a huge number of “insufficient capacity” errors (the orange line on the graph below), where Karpenter was unable to find spot capacity for the request.

In these cases, we would switch to the more expensive on-demand instances. But as spot capacity for these instance types became available again, we would switch back to spot, causing yet another disruption of the workload.

[Graph: “insufficient capacity” errors, shown as the orange line]

Since our only requirement for this nodepool was that instance types should be network optimized, we added all of the following families to it (the resulting requirement is sketched after the list):

  • r5n
  • c5n
  • r6in
  • c6in
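
In Karpenter terms, widening the pool boils down to listing more instance families in the Nodepool requirements, roughly like this (families combined from both lists above):

requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: ["m5n", "m6in", "r5n", "c5n", "r6in", "c6in"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]   # allows the fallback to on-demand described above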

This resulted in the following drop-off in interruptions.

[Graph: interruptions after adding the additional instance families]

As you can see on the right, we still see some spikes of interruptions, but not nearly as bad as before, as Karpenter is now usually able to find some kind of capacity for our workloads to run on. 

What happened to the green line, though?

Well, there was one more strange behavior we experienced with the previous configuration: Karpenter would create a new node and then immediately disrupt it.

We figured that this happened because `consolidate_after` was set to `0s` and no workload could be scheduled on the node fast enough. We changed `consolidate_after` to `3m`, and these cases have completely disappeared since then. This has also become the default setting across all of our nodepools.
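
In the Nodepool spec (karpenter.sh/v1 API; the snake_case name above comes from our own configuration) this corresponds to the disruption settings, which now look roughly like this across our nodepools:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 3m   # was 0s; gives pods time to schedule before the node is considered for disruption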

End Note

Migrating from Spot Ocean to Karpenter gave us greater control, faster scaling, and reduced costs. It’s also worth noting that Karpenter is still being very actively developed, issues are being resolved fast, and new features keep coming out regularly. 

Also, the community is very active on Slack, which makes it easy to find help whenever needed. You can also stay up to date with us on the adjoe engineers’ blog.
