67% cost savings at an adtech company


Published on Mar 19, 2024 by Gergely Daroczi.

Some of our team members previously worked together at an adtech company, where they needed to run data science batch jobs with varying resource requirements, e.g.:

  • Training hierarchical models on a large dataset using R required several hundred GBs of RAM,
  • NLP models in Python required GPUs,
  • Parallelized time-series forecasting models required many CPUs,
  • Many small jobs, mostly waiting for SQL results, running on a shared host.

Managed service to the rescue

AWS Batch seemed like a good fit and a relatively convenient way to manage these tasks for us, but after a few months of trial we experienced unexpected costs: developers misconfigured resource requirements (overprovisioning to avoid errors, forgetting to update them, etc.), and AWS bin-packed smaller jobs onto already-running larger nodes (started for heavy jobs), which kept the expensive nodes running for longer.

FinOps

We did some planning and estimation (similar to the reference below, used as an industry standard in automation projects), and decided that it was worth our time to come up with a home-brew solution to reduce the ~$5k monthly EC2 costs in the long run:

The core concept was to:

  • Monitor the utilized resources for each job closely.
  • Find the optimal instance type for each job based on its historical resource usage.
  • Start a new spot instance for each job, and terminate ASAP after success (or failure).

This sounded relatively easy to implement, and after a couple of hundred lines of Python code to spin up new instances and monitor them, it also felt relatively stable. After a couple of weeks of experimentation and another 1-2k lines of code, it indeed became stable ... and actually brought the data science team's monthly AWS bill down to below $2k!

Reasoning

Proper resource usage monitoring was obviously an important factor in the success of the project, but we considered avoiding bin packing to be the key ingredient: AWS offers almost 800 instance types, with prices based on the included and attached resources, so it doesn't make much sense to run multiple jobs on a larger node; instead, each job should run on its own (optimal) instance and stop as soon as possible.

Demo

To achieve this, we started with a wrapper script that monitored Docker stats every few seconds (see the sketch after this list) to identify:

  • If the script can scale to multiple CPU cores (~max CPU percentage).
  • How consistently the script requires a high amount of compute power (~average CPU percentage).
  • The required memory (~max memory usage).
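A minimal sketch of such a monitoring loop, with names and details that are our assumptions rather than the original wrapper: it polls `docker stats` for one container every few seconds and keeps the raw samples, from which the three numbers above can be derived afterwards.

```python
import subprocess
import time


def parse_mem(text):
    """Convert docker's human-readable memory string (e.g. '1.5GiB') to MiB."""
    units = {"GiB": 1024, "MiB": 1, "KiB": 1 / 1024, "B": 1 / (1024 * 1024)}
    text = text.strip()
    for unit, factor in units.items():
        if text.endswith(unit):
            return float(text[: -len(unit)]) * factor
    return float(text)


def poll_docker_stats(container, interval=5):
    """Yield (timestamp, cpu_percent, mem_mib) samples until the container stops."""
    fmt = "{{.CPUPerc}};{{.MemUsage}}"
    while True:
        proc = subprocess.run(
            ["docker", "stats", "--no-stream", "--format", fmt, container],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:  # container gone -- simplified exit condition
            return
        cpu_raw, mem_raw = proc.stdout.strip().split(";")
        if cpu_raw in ("--", ""):  # container exists but is no longer running
            return
        yield (
            time.time(),
            float(cpu_raw.rstrip("%")),        # e.g. "153.42%" -> 153.42
            parse_mem(mem_raw.split("/")[0]),  # e.g. "1.5GiB / 30GiB"
        )
        time.sleep(interval)


samples = list(poll_docker_stats("my-batch-job"))
max_cpu = max(s[1] for s in samples)                 # can it use many cores?
avg_cpu = sum(s[1] for s in samples) / len(samples)  # how consistently?
max_mem = max(s[2] for s in samples)                 # required memory (MiB)
```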

This data was recorded in files on S3, with databases connected on top for easy querying:

CPU and memory usage of job monitored every few seconds
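The exact storage layout isn't covered here, but as a rough illustration (bucket, key and helper names are made up), the per-job samples could be shipped to S3 as plain CSV, which is easy to expose to a query engine later:

```python
# Hypothetical sketch: write the (timestamp, cpu%, mem MiB) samples to S3 as CSV.
# Bucket and key names are placeholders, not the original layout.
import csv
import io

import boto3


def upload_samples(samples, job_id, bucket="ds-batch-metrics"):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["timestamp", "cpu_percent", "mem_mib"])
    writer.writerows(samples)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"docker-stats/{job_id}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```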

Once the required/optimal vCPU count and memory were determined, we listed the matching instance types, ordered by cost per performance, using the cloudperf Python package that we developed for this purpose.
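For illustration only (this is not the cloudperf API, just the selection idea on a made-up candidate list with made-up benchmark scores and spot prices):

```python
# Illustrative only: filter candidates by the job's needs, then order by price
# per unit of performance (lower is better).
CANDIDATES = [
    # (instance type, vCPUs, memory GiB, benchmark score, hourly spot price USD)
    ("c5.9xlarge", 36, 72, 1400, 0.62),
    ("m5.8xlarge", 32, 128, 1250, 0.70),
    ("c5a.8xlarge", 32, 64, 1300, 0.55),
]


def rank_instances(min_vcpus, min_mem_gib):
    """Return matching instance types ordered by price per unit of performance."""
    matching = [c for c in CANDIDATES if c[1] >= min_vcpus and c[2] >= min_mem_gib]
    return sorted(matching, key=lambda c: c[4] / c[3])


for itype, vcpus, mem, score, price in rank_instances(32, 64):
    print(f"{itype}: ${price}/h, {price / score:.6f} USD per benchmark unit")
```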

Then the above-mentioned script iterated over the list of optimal instance types, trying to start one with enough capacity:

Trying to start the cheapest instance with 36 vCPUs
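A rough sketch of what such a launch loop can look like with boto3 (the AMI id is a placeholder and the real script did far more bookkeeping): walk the ranked instance types and take the first one that AWS can actually provide spot capacity for.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")


def launch_spot(instance_types, ami="ami-0123456789abcdef0"):
    for itype in instance_types:
        try:
            resp = ec2.run_instances(
                ImageId=ami,
                InstanceType=itype,
                MinCount=1,
                MaxCount=1,
                InstanceMarketOptions={
                    "MarketType": "spot",
                    "SpotOptions": {"SpotInstanceType": "one-time"},
                },
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as e:
            # No capacity (or price) for this type right now: try the next one.
            if e.response["Error"]["Code"] in (
                "InsufficientInstanceCapacity",
                "SpotMaxPriceTooLow",
            ):
                continue
            raise
    raise RuntimeError("No spot capacity for any candidate instance type")
```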

Once the job finished, the detailed statistics were written to S3 and a summary was printed by the wrapper script:

Finished job's resource usage summary stats
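As a simplified illustration of that summary step (field names are made up), the wrapper could reduce the collected samples like this:

```python
# `samples` are (timestamp, cpu%, mem MiB) tuples like the ones collected by the
# monitoring loop sketched above.
def summarize(samples):
    cpus = [s[1] for s in samples]
    mems = [s[2] for s in samples]
    return {
        "duration_s": samples[-1][0] - samples[0][0],
        "cpu_max_pct": max(cpus),
        "cpu_avg_pct": sum(cpus) / len(cpus),
        "mem_max_mib": max(mems),
    }


# e.g. three fake samples, 5 seconds apart
summary = summarize([(0, 80.0, 512.0), (5, 310.0, 4096.0), (10, 150.0, 3900.0)])
print(
    f"Job finished in {summary['duration_s']:.0f}s | "
    f"CPU max {summary['cpu_max_pct']:.0f}%, avg {summary['cpu_avg_pct']:.0f}% | "
    f"max memory {summary['mem_max_mib']:.0f} MiB"
)
```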

Then we could use the newly collected data to fine-tune the server-selection algorithm by finding new patterns (e.g. a model training job running every hour requires a high amount of memory only from time to time).

Overall, this solution managed around 4,000 batch jobs every single day: some low-resource jobs running on a single shared host, others on their own dedicated spot instance. The job-level data was stored in S3 just like the Docker stats time series:

Example remote job among the daily ~4k batch jobs

Further resources

If you are interested in more details, we presented the original solution at various conferences and meetups: