AWS Graviton5 Benchmarks

Published on Jun 12, 2026 by Gergely Daroczi.

Topics: #benchmark #performance #score #aws #vendor

AWS announced the general availability of the new Graviton5-powered (ARM) m9g and m9gd instance families, promising "up to 25% better compute performance", "2.6x more L3 cache", "faster memory speeds", "15% higher network bandwidth", and "30% higher IOPS" than the previous generation.

This already sounded very exciting back in December, when the new Graviton generation was announced at AWS re:Invent 2025, but at that time we only had marketing claims and no way to actually measure performance. So I was super happy to dig into the Spare Cores data we automatically collected overnight by starting all the new instance types and running 500+ benchmark workloads on each, along with detailed hardware discovery tools.

You can find the raw data under open-source licenses, but here are the direct links for easy human inspection of the related server sizes across 4 Graviton generations (m6g, m7g, m8g, and m9g):

Larger instance sizes are only available in the m8g and m9g families, as previous generations maxed out at 64 vCPUs:

And the metal versions (note that older generations had much lower vCPU/RAM):

metal (64 vCPUs & 256 GiB RAM @ m6g and m7g; 192 vCPUs & 768 GiB RAM @ m8g and m9g)

While I have already spent some time reviewing all this rich data, I'm highlighting the most important aspects below to get you up to speed 😄 For demo purposes, I'll refer to the 2xlarge instance size in the charts below.

The Specs

The newer generation of CPU indeed brings in clearly visible advantages over the previous generations -- even just looking at the hardware inspection results (although the hypervisor is sometimes just too shy to reveal all the details):

CPU specs of the 2xlarge instances of four Graviton families

Besides the higher frequency, this increase in CPU cache capacity can be beneficial for many workloads: AWS stated that the "chip includes a 5x larger L3 cache" and that "each Graviton5 core has access to 2.6x more L3 cache than Graviton4", while we saw a ~50% increase in the L3 cache amount at this server size.

Note that the recent metal versions do show a 73728 KiB -> 196608 KiB jump in that metric (roughly 2.7x): on m9g.metal-48xl, all 192 CPU cores (no hyper-threading) are split into two symmetric NUMA nodes of 96 vCPUs each, and each node shares 96 MiB of L3 cache:

CPU and System Topology of m9g.metal-48xl

Fun fact: the 2 MiB private L2 cache per core adds up to a massive 384 MiB, which is actually twice the aggregate L3 cache amount (192 MiB).
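The cache figures above are easy to double-check with a few lines of arithmetic. This is only a sketch using the numbers quoted in this post; the variable names are made up for illustration, and it assumes that the 73728 KiB figure belongs to the previous-gen (m8g) metal size with the same 192 cores.

```python
# Cache arithmetic for the metal sizes, using only figures quoted in the post.
KIB_PER_MIB = 1024

l3_prev_gen_kib = 73_728    # aggregate L3, assumed to be the m8g metal size
l3_m9g_kib = 196_608        # aggregate L3 on m9g.metal-48xl
cores = 192                 # physical cores, no hyper-threading
numa_nodes = 2              # two symmetric NUMA nodes
l2_per_core_mib = 2         # private L2 per core

l3_prev_gen_mib = l3_prev_gen_kib / KIB_PER_MIB
l3_m9g_mib = l3_m9g_kib / KIB_PER_MIB
l3_ratio = l3_m9g_kib / l3_prev_gen_kib
l3_per_numa_mib = l3_m9g_mib / numa_nodes
l2_total_mib = l2_per_core_mib * cores

print(f"L3: {l3_prev_gen_mib:.0f} MiB -> {l3_m9g_mib:.0f} MiB ({l3_ratio:.2f}x)")
print(f"L3 per NUMA node: {l3_per_numa_mib:.0f} MiB")
print(f"Aggregate L2: {l2_total_mib} MiB")
```

With the same core count on both metal sizes, the aggregate ratio is also the per-core ratio, which lines up with the "2.6x more L3 cache" per core claim.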

The other highly visible change in the specs is related to the network card's speed:

Memory and network specs of the 2xlarge Graviton instances

This is all in sync with the AWS announcement: "with up to 15% higher network bandwidth and 20% higher EBS bandwidth on average across instance sizes, and up to twice the network bandwidth for the largest instances".

Pricing & Cost Efficiency

One of the most important bits! By default, we show the best on-demand and spot prices for all selected instance types across the globe, which sometimes favors less mainstream regions with lower prices:

Pricing and CPU score of the m(6|7|8|9)g.2xlarge instances

The new generation instance is a clear winner when looking at both the single-core and multi-core "SCore" (basically a CPU-only stress metric based on div16 ops): a 16.5% improvement in the single-core score, and a 17.5% boost in the multi-core score at the same number of vCPUs.

But the price increase is also steep in the above table: while you can get the previous-gen instance sizes at 20-25 US cents per hour (on-demand), the most recent generation costs close to 40 US cents per hour at this instance size .. but note the difference in the related AWS regions: the newest generation is only available in 3 US and 1 EU regions. A fairer comparison is looking at the prices in the same (N. Virginia) region:

Pricing and CPU scores in the same example region (us-east-1)

Now this is much more promising: the ~39 US cents of the newest gen compares to the 31-36 US cents of the previous gens at much better performance, resulting in a higher "$Core" (SCore divided by the hourly price, i.e. the amount of SCore you can buy for $1/hr), so more performance per dollar. The low spot prices for previous-gen instances in various regions are still tempting, though -- when there's actually capacity available.
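To make the "$Core" idea concrete, here is a minimal sketch of the calculation. The `dollar_core` helper is a hypothetical name, the baseline score of 100 is a placeholder rather than a measured SCore, and pairing the 36-cent price with the previous generation and the 17.5% multi-core gain is an assumption based on the figures quoted above.

```python
def dollar_core(score: float, price_usd_per_hour: float) -> float:
    """Return the amount of SCore you can buy for $1/hr."""
    if price_usd_per_hour <= 0:
        raise ValueError("price_usd_per_hour must be positive")
    return score / price_usd_per_hour


# Placeholder score, normalized to 100 for the previous gen (not a measurement).
prev_gen_score = 100.0
# Assumption: the quoted 17.5% multi-core gain applies relative to that baseline.
new_gen_score = prev_gen_score * 1.175

# Approximate us-east-1 on-demand prices mentioned in the post (USD/hour).
prev_gen_dollar_core = dollar_core(prev_gen_score, 0.36)
new_gen_dollar_core = dollar_core(new_gen_score, 0.39)
change_pct = (new_gen_dollar_core / prev_gen_dollar_core - 1) * 100

print(f"previous gen: {prev_gen_dollar_core:.1f} SCore per $1/hr")
print(f"newest gen:   {new_gen_dollar_core:.1f} SCore per $1/hr ({change_pct:+.1f}%)")
```

Under these assumptions the newest generation still comes out ahead per dollar, although by a smaller margin than the raw score difference because of the higher hourly price.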

Benchmarks

We have run ~500 benchmark workloads across all these instance families and sizes, including memory bandwidth measurements, OpenSSL speed of hash functions and block ciphers, static web serving, key/value database operations, LLM inference speed, and general benchmarking suites -- such as Geekbench or PassMark. You can find all the related data and charts at the above URLs, but here are a few highlights:

Memory bandwidth measurements of the Graviton instances

The newest gen is the clear winner for all read, write, and mixed operations in terms of memory bandwidth at lower block sizes, but it surprisingly underperforms previous generations once the block size reaches the L3 cache size and the CPU is forced to go to RAM. This might be a real effect of the dual-NUMA design or an artifact of the methodology, so to cross-check it, we run not only bw_mem from LMbench, but also our own tailored tool (sc-membench), which scales better with many CPU cores and complex NUMA architectures. Unfortunately, we don't yet have the sc-membench measurements for the previous-gen instances due to funding (we would need to spin up already benchmarked servers again) -- I will follow up on this later. PS If you are from AWS, I'd appreciate any help with cloud credits for future measurements, as benchmarking thousands of instance types at scale is an expensive pleasure 😊
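To illustrate why the block size matters, here is a rough sketch of a copy-bandwidth sweep across block sizes. It is not bw_mem or sc-membench: it is single-threaded, the function name and the chosen sizes are made up for this example, and Python call overhead distorts the small block sizes, so treat the numbers as a shape-of-the-curve illustration only.

```python
import time


def copy_bandwidth_mib_s(block_bytes: int, min_seconds: float = 0.05) -> float:
    """Copy a buffer of block_bytes repeatedly and return MiB/s copied."""
    src = bytearray(block_bytes)
    dst = bytearray(block_bytes)
    copied = 0
    start = time.perf_counter()
    while True:
        dst[:] = src  # bulk copy of the whole block
        copied += block_bytes
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    return copied / elapsed / (1024 * 1024)


# Block sizes from cache-resident to (on most machines) RAM-bound.
block_sizes = [16 * 1024, 256 * 1024, 4 * 1024 * 1024, 32 * 1024 * 1024]
results = {size: copy_bandwidth_mib_s(size) for size in block_sizes}

for size, mib_s in results.items():
    print(f"{size // 1024:>8} KiB blocks: {mib_s:>10.0f} MiB/s")
```

On most machines the throughput drops once the two buffers no longer fit in the last-level cache, which is the regime where the surprising result above shows up.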

Benchmarking suites, such as PassMark, show the newest gen instance winning across the board with a 16-57% performance improvement, even when compared to the recent m8g.2xlarge:

	m6g.2xlarge	m7g.2xlarge	m8g.2xlarge	m9g.2xlarge
String Sorting	22.87K	31.62K	37.11K	43.05K
Single Threaded	1.11K	1.57K	1.94K	2.46K
Prime Numbers	60.27	92.45	138.82	162.59
Integer Maths	31.57K	38.16K	41.72K	49.01K
Floating Point Maths	23.96K	37.94K	48.48K	61.26K
Extended Instructions	4.98K	6.64K	7.37K	10.80K
Encryption	1.08K	1.12K	1.50K	2.36K
Compression	37.73K	42.25K	53.12K	74.64K
CPU Mark	5.22K	6.07K	7.68K	10.87K

The overall PassMark score shows that the performance has doubled since the m6g generation, and increased by 40% since the previous (m8g) gen.
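The generation-over-generation gains can be recomputed directly from the table. The snippet below is a small sketch that copies the m6g, m8g, and m9g columns as printed above (in the table's own units) and derives the relative improvements; the helper name is hypothetical.

```python
# PassMark CPU results copied from the table: (m6g, m8g, m9g) per test.
passmark_cpu = {
    "String Sorting": (22.87, 37.11, 43.05),
    "Single Threaded": (1.11, 1.94, 2.46),
    "Prime Numbers": (60.27, 138.82, 162.59),
    "Integer Maths": (31.57, 41.72, 49.01),
    "Floating Point Maths": (23.96, 48.48, 61.26),
    "Extended Instructions": (4.98, 7.37, 10.80),
    "Encryption": (1.08, 1.50, 2.36),
    "Compression": (37.73, 53.12, 74.64),
    "CPU Mark": (5.22, 7.68, 10.87),
}


def gain_pct(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new / old - 1) * 100


gains_vs_m8g = {test: gain_pct(m8g, m9g) for test, (_, m8g, m9g) in passmark_cpu.items()}
gains_vs_m6g = {test: gain_pct(m6g, m9g) for test, (m6g, _, m9g) in passmark_cpu.items()}

for test in passmark_cpu:
    print(f"{test:<22} vs m8g: {gains_vs_m8g[test]:+6.1f}%   vs m6g: {gains_vs_m6g[test]:+6.1f}%")
```

Since the table values are rounded, the computed percentages are approximate as well.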

The memory-related PassMark scores are similarly promising:

	m6g.2xlarge	m7g.2xlarge	m8g.2xlarge	m9g.2xlarge
Memory Write	12.53K	19.66K	21.24K	24.93K
Memory Read Uncached	9.17K	18.70K	19.51K	23.80K
Memory Read Cached	9.48K	19.66K	21.17K	24.95K
Memory Latency	71.56	52.49	48.88	30.71
Database Operations	5.17K	8.04K	12.12K	14.92K
Memory Mark	1.73K	2.87K	3.08K	4.06K

Note the massive reduction in the memory latency metric, which is well aligned with the AWS announcement. Overall, we measured 30+ percent improvement over the m8g.
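For the memory table, latency needs to be read the other way around, as lower values are better. A minimal sketch of that handling, using the m8g and m9g values from the table above (the function name and the `lower_is_better` flag are illustrative choices):

```python
def improvement_pct(old: float, new: float, lower_is_better: bool = False) -> float:
    """Percent improvement: a reduction for latency-like metrics, a gain otherwise."""
    if lower_is_better:
        return (old - new) / old * 100
    return (new - old) / old * 100


# (m8g, m9g, lower_is_better) copied from the memory table.
passmark_memory = {
    "Memory Write": (21.24, 24.93, False),
    "Memory Read Uncached": (19.51, 23.80, False),
    "Memory Read Cached": (21.17, 24.95, False),
    "Memory Latency": (48.88, 30.71, True),
    "Database Operations": (12.12, 14.92, False),
    "Memory Mark": (3.08, 4.06, False),
}

memory_improvements = {
    test: improvement_pct(m8g, m9g, lower)
    for test, (m8g, m9g, lower) in passmark_memory.items()
}

for test, pct in memory_improvements.items():
    print(f"{test:<22} {pct:+6.1f}%")
```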

Let's not forget about the elephant in the room of all tech articles/conference talks/restroom small talk conversations nowadays: LLM inference. Although CPU-only instances are usually not the best fit for serving LLMs, smaller models can perform at very reasonable speed for low-concurrency scenarios. That's what we measured by using llama.cpp:

LLM inference (text processing and text generation) speed of the m(6|7|8|9)g.2xlarge instances using gemma (2B)

The m9g outperformed previous generations by far, and even managed to complete tasks that older-generation machines timed out on. Although the above screenshot shows Gemma (a 2B-parameter LLM), these instances also managed to load and serve the 7B Llama model, with 20+ tokens/sec for prompt processing and 15+ tokens/sec for text generation -- well over a 30% improvement compared to m8g, and oftentimes a 2-3x speed boost compared to m6g.

Due to the limit on the number of images one can include in a post, I will not share all the other benchmark results here (e.g. compression and OpenSSL algos, web serving or key/value database ops), but please check the above URLs -- I'm sure you will find some additional interesting data points there.

Summary

I know this has been a long post, so TL;DR:

The new-gen servers seem to deliver what AWS claimed in the announcement 😊

I hope you enjoyed this write-up and found the standardized data on 4 generations of Graviton useful -- please let me know in the comments below!

PS This article was originally posted on the r/aws subreddit on June 12, 2026 -- but right after publishing, it was flagged NSFW and "removed by Reddit's filters". We still have no idea which benchmark score triggered that bot decision (probably still running on m6g) 🤐