
Quick and Dirty Autoscaling of Nomad jobs on AWS

Matt McLaughlin
Nov 14, 2018


While HashiCorp's Nomad is fantastic at scheduling jobs across a static host deployment, it has no native solution for AWS Auto Scaling groups. We ran into this issue when one of our clients experienced a significant spike in requests and the number of running jobs wasn't enough to keep up. The AWS Auto Scaling group (based on CPU) kicked in and brought up more hosts, which helped a little in that the periodic jobs now had plenty of resources, but the primary long-running service (which runs one instance per host) was still overwhelmed.

As a short-term fix, we manually adjusted the number of running jobs to match the new host count, but that is a horrible way to manage a cluster. The desired solution would automatically change the job count based on the available hosts, using tools we already had deployed. It's worth mentioning that we wanted to solve this problem as fast as possible so as not to restrict the client's growth (we love it when our clients succeed).

The Solution

Built using the following tools:

  • Consul (an established cluster)
  • consul-template
  • Nomad
  • aws-cli
  • bash
  • jq (used for parsing the results of aws-cli calls)

A bash script runs every 15 seconds from a periodic Nomad job. It issues an aws-cli query that returns the list of EC2 instances where our applications can run, uses jq to get the length of that list, and writes the count to a key in the Consul K/V store. We then convert our long-running service's Nomad job file into a consul-template template that references that host count as its job count. When `consul-template` detects a change, it re-renders the job file and runs `nomad run`. Any time the Auto Scaling group adjusts the host inventory, there is at most a 15-second delay before the service scales with it.
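To make that concrete, here is a minimal sketch of what the templated service job might look like. The job, group, and task names, the Docker image, the K/V key `autoscale/app/host-count`, and the `distinct_hosts` constraint (to keep one allocation per host) are all assumptions, not taken from the original job file:

```hcl
# app.nomad.ctmpl: rendered by consul-template before being handed to `nomad run`.
# Names, image, and the K/V key are hypothetical placeholders.
job "primary-service" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    # The job count tracks whatever value the polling script last wrote
    # into the Consul K/V.
    count = {{ key "autoscale/app/host-count" }}

    # Spread the allocations so we end up with one per host.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/app:latest"
      }
    }
  }
}
```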

We have a Nomad service job that manages the consul-template application.
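The consul-template side is just a standard `template` stanza whose `command` re-submits the job whenever the rendered file changes. A minimal sketch, with hypothetical file paths:

```hcl
# consul-template configuration; file paths are placeholders.
consul {
  address = "127.0.0.1:8500"
}

template {
  # The templated job spec shown above.
  source      = "/etc/consul-template/app.nomad.ctmpl"
  destination = "/opt/nomad/jobs/app.nomad"

  # Re-submit the job whenever the rendered count changes.
  command = "nomad run /opt/nomad/jobs/app.nomad"
}
```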

The bash script counts the application servers and updates the Consul K/V.
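A minimal sketch of that script, assuming instances are tagged `Role=app-server` and a K/V key of `autoscale/app/host-count` (both placeholders; adjust to your environment):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Long-polling loop: every 15 seconds, count the running application servers
# and publish the count to the Consul K/V.
while true; do
  count=$(aws ec2 describe-instances \
    --filters "Name=tag:Role,Values=app-server" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[]' \
    --output json | jq 'length')

  consul kv put autoscale/app/host-count "${count}" >/dev/null

  sleep 15
done
```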
The periodic batch job runs the long-polling script.
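A rough sketch of what that periodic job might look like, assuming the script above is installed at `/usr/local/bin/count-app-hosts.sh` (a hypothetical path). Because the script loops forever, the periodic stanza mostly serves to re-launch it if it ever exits, and `prohibit_overlap` keeps a second copy from starting while one is still running:

```hcl
# Periodic batch job; names and paths are placeholders.
job "host-count-poller" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "*/5 * * * *"
    prohibit_overlap = true
  }

  group "poller" {
    task "count-hosts" {
      driver = "exec"

      config {
        command = "/usr/local/bin/count-app-hosts.sh"
      }
    }
  }
}
```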

Pros:

  • Uses the existing toolchain.
  • Responds quickly when the host count grows or shrinks.
  • It’s easy to read and trace the series of actions.
  • You don’t have to mess around with incrementing or decrementing the count; the value always reflects the current state.

Cons:

  • Long polling. I hate how chatty this script is. It's slightly abusive to both the AWS endpoints and our Consul API. Every 15 seconds isn't crazy, but it's not ideal.
  • Only useful in cases where you have a one-to-one mapping of services to servers.

Can We Do Better?

A better long-term solution would be to write a small application that listens to an SNS topic triggered by Auto Scaling events. When the listener receives an event, it can check the instance count and update the Consul K/V.
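That listener is left unwritten here. One low-effort way to prototype the idea with the same toolchain would be to subscribe an SQS queue to the Auto Scaling SNS topic and poll it with aws-cli; the SQS hop, the queue URL, the tag filter, and the K/V key below are all assumptions, not part of the original proposal:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical prototype: an SQS queue subscribed to the Auto Scaling SNS
# topic, long-polled with aws-cli. All identifiers below are placeholders.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/asg-events"

while true; do
  # Wait up to 20 seconds for an Auto Scaling notification.
  msg=$(aws sqs receive-message --queue-url "$QUEUE_URL" --wait-time-seconds 20)

  if [ -n "$msg" ]; then
    # An event arrived: recount the hosts, update Consul, delete the message.
    count=$(aws ec2 describe-instances \
      --filters "Name=tag:Role,Values=app-server" \
                "Name=instance-state-name,Values=running" \
      --query 'Reservations[].Instances[]' \
      --output json | jq 'length')
    consul kv put autoscale/app/host-count "${count}" >/dev/null

    receipt=$(echo "${msg}" | jq -r '.Messages[0].ReceiptHandle')
    aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "${receipt}"
  fi
done
```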

The good people at HashiCorp have acknowledged that this would be a great feature, but one that is likely to be included in an enterprise offering.

Need help scaling your infrastructure in the cloud? Looking for zero-downtime deployments with a simple CI pipeline? We would love to chat. Visit us at seriem.io


@mattmclaugh: Professional problem solver and unceasing builder. A bicycle-powered, full-stack developer focused on teams, architecture, and customer experience.