Great Architectures, Stacks & DevOps at Webscale

By Chris Ueland

Building a Cloud Benchmarking System


Jason has built an amazing, highly extensible test harness. He can quickly benchmark cloud instances and easily add new providers. There are lots of lessons here, whether you end up rolling your own or using his data. Enjoy!

–Chris / ScaleScale / MaxCDN

How did you get started?

In 2010 I was managing applications for small businesses on co-located servers. I got interested in the cloud as a resource for backup and overflow capacity. I quickly discovered that, unlike with dedicated servers and hardware, substantive performance comparisons of cloud services were nearly nonexistent.

Most cloud providers seemed to embrace a model of opaqueness in describing their services and I had many questions. I began running benchmarks and blogging on my findings, and that’s how CloudHarmony got started.


Jason Read is the founder of CloudHarmony. Jason’s primary responsibilities include software development, benchmarking and client interaction.

What do your testing nodes look like? How many? What locations?

We use permanent and temporary test nodes.

Permanent test nodes are used for availability and network performance tests.

We currently have 243 such nodes from 102 different cloud providers: 159 compute instances, 20 CDN, 17 DNS, 37 object storage and 10 PaaS. All permanent test nodes host static web content – images and JavaScript files used to monitor availability and measure latency and throughput.

Compute instances use common network utilities (dig, curl, ping, traceroute) to measure network performance of other cloud services. We also let users test their connectivity to cloud services using a browser-based speedtest. Real-time status and availability stats derived from these test nodes are available on our cloud status page.
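As a rough sketch of the kind of probe those utilities enable (the hostname and object path below are placeholders for illustration, not CloudHarmony's real endpoints):

```shell
#!/bin/sh
# Hypothetical latency/throughput probe against a static test object.
# HOST and OBJECT are placeholders, not real CloudHarmony endpoints.
HOST="probe.example.com"
OBJECT="http://$HOST/static/test-1kb.jpg"

# Resolve the hostname and sample ICMP round-trip times
dig +short "$HOST"
ping -c 3 "$HOST"

# HTTP latency and throughput via curl's built-in timing variables
curl -s -o /dev/null \
  -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s speed=%{speed_download}B/s\n' \
  "$OBJECT"
```

Aggregating numbers like these across hundreds of permanent nodes is what turns simple utilities into a real-time status map.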

Temporary test nodes are compute instances we use for system performance tests.

These are usually a representative selection of instances from each compute service. These tests take about a week, and once complete, we capture the results and terminate the instance.

What kind of benchmark suites do you run? How are they configured?

We use multiple benchmarks to measure different compute performance characteristics including CPU, block storage and memory. These benchmarks include the following:


SPEC CPU 2006 – An industry-standard benchmark used by hardware vendors.

This is our preferred CPU benchmark. We compile using Intel’s compiler suites and run using the base/rate configuration with # of copies = # of CPU cores, and SSE flags where available. This benchmark is complex to set up, understand and run correctly.

SPEC CPU 2006 is a licensed benchmark and subject to run and reporting rules governed by the SPEC organization. One of these rules is repeatability of results, which, unlike on hardware, cannot be guaranteed in a virtualized cloud environment. Because of this, we report our SPEC CPU 2006 metrics as “estimates”.

UnixBench – An older benchmark (started in 1983) that is free and simple to download and run.

We use it mostly because of its accessibility and simplicity. We run it both single- and multi-threaded (# of threads = # of cores).
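For readers who want to reproduce this, a minimal sketch of the two passes, assuming the byte-unixbench distribution and its Run script (the clone URL is the commonly used mirror, not necessarily what CloudHarmony uses):

```shell
# Fetch and build UnixBench, then run it single-threaded and
# with one copy per CPU core (each -c flag adds a pass).
git clone https://github.com/kdlucas/byte-unixbench.git
cd byte-unixbench/UnixBench
./Run -c 1 -c "$(nproc)"
```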

Geekbench – A simple commercial benchmark that lots of people use for quick, simple performance tests.

It isn’t as robust as SPEC CPU 2006, but any user can download and run it with a few commands and have results in minutes. It is backed by a commercial organization and has an active community behind it, including the ability to upload and compare results in the Geekbench browser.


STREAM – A standard benchmark for measuring memory throughput.

Block Storage

fio and the SNIA SSS PTS Enterprise v1.1 – SNIA (Storage Networking Industry Association) is an industry backed organization.

They provide guidance on test methodologies in the form of various test specifications. The SSS (Solid State Storage) PTS (Performance Test Specification) Enterprise v1.1 is one such specification that defines 8 tests for measuring solid state performance.

We created an open source implementation of this specification here and use it as the basis for block storage tests.

Block storage is a complex performance characteristic because there are infinite testing possibilities and because compute services often offer many different block storage options (e.g. ephemeral, external, SSD, rotational, provisioned IOPS, etc.).

Relevance of block storage test metrics is highly dependent on workload. For simplicity, our block storage analysis focuses primarily on synthetic metrics including random IOPS, sequential throughput and latency.
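As an illustrative (not CloudHarmony's actual) fio job file covering those synthetic metrics — a 4k random-read pass for IOPS and a 128k sequential-write pass for throughput; the target file and sizes are placeholders, and the real SNIA PTS runs are far more elaborate:

```ini
; sketch.fio — hypothetical job file for illustration only
[global]
filename=/tmp/fio-testfile
size=256M
ioengine=libaio
direct=1
runtime=60
time_based=1

[rand-read-iops]
rw=randread
bs=4k
iodepth=32

[seq-write-throughput]
stonewall
rw=write
bs=128k
iodepth=8
```

Run with `fio sketch.fio`; fio reports IOPS, bandwidth and completion-latency percentiles per job.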

As new instances from providers come out, what do you notice?

One encouraging trend is increasing transparency. Amazon, Google, Microsoft and Rackspace are now publishing CPU architectures for new instance classes and providing better disclosures about performance expectations (e.g. block storage IOPS).

Amazon seems to have moved away from the (much criticized) days when you never knew what to expect when provisioning an m1 instance. I think cloud providers have come to understand that users want better disclosure, control and consistency over performance.

With cloud providers, what do you notice as you spin up lots of instances?

One observation is the difference in operational scale. Some services have liberal quotas and no problems handling large provisioning requests, while others are more restrictive or simply incapable of fulfilling such requests.

Another observation is the amount of time services take to fulfill provisioning requests, which can be different by an order of magnitude.

How do you manage all the VMs? Spin them up? Config Management?

For the more common providers where we test frequently, we have scripts that automate provisioning using CLI/API integration. However, with coverage of 70 compute services, each with a different API (and some without any), complete automation is impossible, so we provision many manually. Once VMs are provisioned, a centralized control system automates installation of benchmarking software and dependencies, execution of tests, and processing of results. The entire test run takes 3-5 days per VM.

How do you remotely trigger benchmarks? execute commands?

We use a centralized control system that communicates with VMs via SSH and expect scripts, handling benchmark execution, polling and results processing.
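A heavily simplified sketch of that push model (the hostnames, results path and use of sysbench here are placeholders for illustration; the real system is driven by expect scripts and a much larger benchmark suite):

```shell
#!/bin/sh
# Hypothetical controller loop: install a benchmark, run it, and
# collect results over SSH. nodes.txt holds one hostname per line.
mkdir -p results
while read -r host; do
  # </dev/null keeps ssh from swallowing the nodes.txt stream
  ssh "root@$host" 'apt-get -y install sysbench'        </dev/null
  ssh "root@$host" 'sysbench cpu run > /tmp/result.txt' </dev/null
  scp "root@$host:/tmp/result.txt" "results/$host.txt"
done < nodes.txt
```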

What do you use to measure DNS performance?

We use our global network of 170 VMs to measure non-recursive DNS performance using dig and direct authoritative queries. We also created a browser-based recursive test using wildcard DNS records. This test alternates downloading an 8 byte file using cached and non-cached hostnames and records the difference in time between them. We capture user location using MaxMind’s GeoIP database and aggregate the results of these tests by geographical region. We publish a summarized performance analysis monthly in a freemium report.
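The non-recursive side of this can be sketched with dig (the nameserver and zone below are placeholders):

```shell
#!/bin/sh
# Hypothetical authoritative-latency probe: query the nameserver
# directly with recursion disabled and extract dig's timing line,
# which looks like ";; Query time: 23 msec".
NS="ns1.example.com"
ZONE="example.com"

dig "@$NS" "$ZONE" A +norecurse +noall +stats \
  | awk '/Query time:/ {print $4 " ms"}'
```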

How do you measure CDN performance?

We created a browser-based test to measure CDN performance. This test lets users see network latency and throughput between their location and 20 different CDNs. The test uses files we host on each CDN to measure latency and throughput. These files include an 8 byte file for round-trip time tests and images (1KB – 10MB) for small and large file throughput tests. The test consists of a warmup phase to warm CDN caches, followed by test operations consisting of concurrent timed requests.
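One test operation might look roughly like this from the command line (the URLs are placeholders standing in for the per-CDN files described above; the real test runs in the browser with concurrent requests):

```shell
#!/bin/sh
# Hypothetical single-CDN probe: warm the cache, then time a small
# object for round-trip time and a larger one for throughput.
SMALL="http://cdn.example.com/probe-8b.txt"
LARGE="http://cdn.example.com/probe-1mb.jpg"

curl -s -o /dev/null "$SMALL"   # warmup request to populate the cache
curl -s -o /dev/null -w 'rtt=%{time_starttransfer}s\n' "$SMALL"
curl -s -o /dev/null -w 'throughput=%{speed_download} B/s\n' "$LARGE"
```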


Chris Ueland

Wanting to call out all the good stuff when it comes to scaling, Chris Ueland created this blog, ScaleScale.