Great Architectures, Stacks & DevOps at Webscale

By Chris Ueland

The Making of OnMetal

OnMetal Stack

Chassis: Quanta F03A
TOR switch: Cisco Nexus 3172PQ
PCI cards (for IO): LSI Nytro WarpDrive BLP4-1600
Hardware management: OpenStack
Nova integration: OpenStack Ironic
TOR integration: OpenStack Neutron ML2 plugin
Language: Python
Datastore: MySQL
Event bus: RabbitMQ


It’s my pleasure to bring you this ScaleScale post, “The Making of OnMetal”. At MaxCDN, we buy a lot of metal. We are obsessed with speed, and our philosophy is that the less abstraction between us and the hardware, the better. While OnMetal is not the first bare-metal provisioning platform, I’m excited about how it’s architected, how transparent it is, and what future iterations will bring. We also like that OnMetal machines are standardized, burned in, and priced per minute.
Whether we’re hanging out in the MaxCDN office or grabbing drinks at the OpenStack Summit, Paul Querna always pushes me to think differently about infrastructure design. Paul is a friend who architected Cloudkick and, most recently, OnMetal. When I heard he and Ev Kontsevoy (of Mailgun) had built OnMetal, I wanted to dig into the details with him. I am pleased to bring you this Q&A with Paul Querna of Rackspace.

–Chris / ScaleScale / MaxCDN

Can you tell me why you built OnMetal?

OnMetal Architect Paul Querna

In my career I have used many different kinds of infrastructure: virtualized clouds, DIY colocation, hyper-scale datacenters. They each had different advantages, but none was ever quite the right fit for many applications. As an engineer inside Rackspace, I still felt many of the same infrastructure pain points, but this time, because I work for an infrastructure provider, I was empowered to build a product that fixes them.

I wanted a platform for my own teams to deploy faster. To move faster. To combine the best of high performance hardware with the dynamic capabilities of a cloud environment. OnMetal is the platform I wanted as a user of infrastructure.

What does this compete with? Why is it better?

Talking to our customers, we see two major classes of competition:

Customers who are already in colocation. They like the cost efficiency, but are realizing that it is not a good use of their focus. Maintaining a supply chain, working with Open Compute, evaluating new switches, and operating multiple regions: these are all massive time sinks where many customers are seeing diminishing returns.

Customers in a Public Cloud. They are unhappy with the complexities of dealing with multi-tenancy and the implications it has on their operational expenses and software development.

Because terms like “bare metal cloud” are so overloaded, it is hard to cut through what everyone is providing. OnMetal is the first offering that provides opinionated, high-performance servers, without a hypervisor, in a few minutes via an API. We don’t believe this capability will be unique in the long term, especially because we open sourced all of the related software, but for now we have a lead.

Can it be used for short spurts? Which workloads are best?

All instances are billed per minute, so if you want 3.2 terabytes of PCIe storage for 15 minutes, have at it! Realistically, we see the IO and Memory instance types as something people keep around for longer periods: their purpose is to supply fast storage or RAM, and filling them with your data takes time after boot. We see the Compute type used in more ephemeral workloads; with so little local state, it makes a perfect scale-out web server.

Instance Type        CPU                                 RAM     IO
OnMetal IO v1        2x 2.8GHz 10-core E5-2680v2 Xeon    128GB   2x LSI Nytro WarpDrive BLP4-1600 (1.6TB each), plus boot device (32GB SATADOM)
OnMetal Memory v1    2x 2.6GHz 6-core E5-2630v2 Xeon     512GB   Boot device only (32GB SATADOM)
OnMetal Compute v1   1x 2.8GHz 10-core E5-2680v2 Xeon    32GB    Boot device only (32GB SATADOM)
3 Types of Instances

OnMetal launched with three instance types, optimized for high memory ($0.038/minute), heavy I/O ($0.041/minute), or general compute ($0.013/minute).
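Per-minute billing makes the cost arithmetic trivial. A quick sketch using the launch prices quoted above (the flavor names here are illustrative, not Rackspace's actual flavor IDs):

```python
# Launch prices quoted above, in dollars per minute.
# Flavor names are illustrative, not actual API flavor IDs.
PRICE_PER_MIN = {
    "onmetal-io":      0.041,
    "onmetal-memory":  0.038,
    "onmetal-compute": 0.013,
}

def cost(flavor, minutes):
    """Dollar cost of running one instance for the given number of minutes."""
    return PRICE_PER_MIN[flavor] * minutes

# 15 minutes of 3.2TB of PCIe flash, vs. a 30-day month of a Compute instance.
print(cost("onmetal-io", 15))
print(cost("onmetal-compute", 30 * 24 * 60))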

Why the PCI storage and not SSDs?

Our objective for storage was to balance price and performance. We found that once you move beyond a few hundred thousand IO operations per second, many common datastores like MySQL and Postgres hit other bottlenecks. So we set our sights on getting the best cost while still hitting north of 300,000 IO operations per second. When we analyzed our options, getting that kind of performance from SSDs meant using several of them, which in turn requires a disk chassis and reduces density. We started looking at PCIe flash cards because one or two devices could reach our target IOPS while keeping the smaller form factor. PCIe storage does have a few drawbacks, such as losing the ability to hot swap, but the combination of density, performance, and price is what drove the decision to use PCIe storage over SSDs.
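The device-count reasoning above can be sketched as a back-of-the-envelope calculation. The per-device IOPS figures below are illustrative assumptions, not vendor specifications:

```python
import math

# Target from the text: north of 300k IO operations per second.
TARGET_IOPS = 300_000

def devices_needed(per_device_iops, target=TARGET_IOPS):
    """How many storage devices it takes to reach the IOPS target."""
    return math.ceil(target / per_device_iops)

# Illustrative per-device numbers (assumptions):
print(devices_needed(75_000))    # SATA SSDs at ~75k IOPS each -> needs a disk chassis
print(devices_needed(200_000))   # PCIe flash cards at ~200k IOPS each -> fits in 1-2 slots
```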

LSI Nytro WarpDrive PCIe storage

These cards are found online for $12k+ each.

What hardware did you use? Why?

Our OnMetal hardware is based on the Open Compute Project (OCP). Our current deployment uses the Quanta F03A chassis and Cisco Nexus 3172PQ top-of-rack switches, but we expect the exact vendors and models to vary over time. Adopting Open Compute was important to our team and to Rackspace. We believe OCP provides a better model for both innovation and openness; it enabled us to iterate on things like the BIOS and firmware of the servers while still delivering a cost-effective solution.

Will other regions be added?

Yes! OnMetal is currently generally available in our Northern Virginia (IAD) region, and we will be adding more regions over the coming months.

What’s your stack?

Since OnMetal is powered by OpenStack, almost everything is written in Python, uses MySQL as its primary datastore, and relies on RabbitMQ as an event bus. Deep underneath OnMetal is OpenStack Ironic, which handles the nitty-gritty of managing physical machines. We then have an Ironic plugin for OpenStack Nova to integrate it with our normal virtualized public cloud; to Nova, Ironic is just another cell with a different instance type. To closely integrate the top-of-rack (TOR) switches, we modified Ironic and wrote an OpenStack Neutron ML2 plugin.
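Managing physical machines largely means driving each node through a provision-state machine. The sketch below is a heavily simplified, illustrative version of that idea; real Ironic has many more states and drives transitions asynchronously over the event bus:

```python
# A heavily simplified sketch of Ironic-style provision states.
# Real Ironic has many more states (cleaning, rescue, wait call-back, ...)
# and this is illustrative only, not Ironic's actual API.
ALLOWED = {
    "enroll":     {"manageable"},
    "manageable": {"available"},
    "available":  {"deploying"},
    "deploying":  {"active", "error"},
    "active":     {"deleting"},
    "deleting":   {"available"},
}

class Node:
    def __init__(self, name):
        self.name = name
        self.state = "enroll"

    def transition(self, new_state):
        # Reject transitions the state machine does not allow.
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state

node = Node("quanta-f03a-01")
for step in ("manageable", "available", "deploying", "active"):
    node.transition(step)
print(node.state)
```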

How do you do capacity planning?

Capacity planning for what is not just a new instance type, but really a new class of product, was difficult. We started with data from our launches of other instance types, then modeled our growth based on historical data for other large instance types. We also had input from large customers on their expected usage. From all of this we built basic models and went from there. One reason we started with a single region is that we wanted to guarantee enough capacity at launch in that region, rather than spreading capacity across multiple regions.
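A toy version of that kind of model: project demand from historical monthly growth plus committed customer usage, then size one region's rack count with some headroom. Every number below is made up for illustration:

```python
import math

def project_demand(current, monthly_growth, committed, months):
    """Organic compound growth on the current fleet, plus committed customer demand."""
    organic = current * (1 + monthly_growth) ** months
    return organic + committed

def racks_needed(instances, servers_per_rack=12, headroom=0.25):
    """Racks to buy for a region, with spare headroom so launch demand is covered."""
    return math.ceil(instances * (1 + headroom) / servers_per_rack)

# Illustrative inputs: 400 instances today, 15% monthly growth,
# 200 instances committed by large customers, 6-month horizon.
demand = project_demand(current=400, monthly_growth=0.15, committed=200, months=6)
print(racks_needed(demand))
```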

Have any benchmarks been done so far?

Many. KVM and Xen have definitely improved over the years, but certain operations still carry massive penalties. I am personally working on a scale-out Cassandra benchmark right now. We have seen significant performance gains in all kinds of workloads, from Postgres to rendering Ruby HAML templates. We don’t want to flood the world with our first-party benchmarks, though; we encourage our customers to try OnMetal out for themselves.

What is being done at the Top of Rack for OnMetal?

Our current top-of-rack switches are set up with MLAG (bonded links) to each server, and on top of that bond we expose VLANs. This means every instance gets 2x 10-gigabit connectivity in a high-availability configuration. I believe highly available, performant networking is a critical feature. Software developers want to believe that networks are infinite, fast, and reliable; none of that is true in the end, but we can still make reasonable investments in networking gear to provide a better experience.
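On the server side, that bond-plus-VLANs arrangement might look like the following Debian-style `/etc/network/interfaces` fragment. This is an illustrative sketch, not the actual OnMetal image configuration: the interface names, VLAN tags, and addresses are all assumptions.

```
# 802.3ad (LACP) bond across both 10GbE NICs, one to each MLAG'd TOR switch
auto bond0
iface bond0 inet manual
    bond-slaves eth0 eth1
    bond-mode 802.3ad
    bond-miimon 100

# Public internet VLAN (tag 100 is illustrative)
auto bond0.100
iface bond0.100 inet static
    address 203.0.113.10/24
    gateway 203.0.113.1

# ServiceNet VLAN: region-local 10.x network, no bandwidth billing
# (tag 200 and addressing are illustrative)
auto bond0.200
iface bond0.200 inet static
    address 10.184.0.10/19
```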

Today we expose just two VLANs: one for public internet traffic, and another we call “ServiceNet”. ServiceNet is a 10.x network local to each region with no bandwidth billing. On these VLANs we lock traffic down to specific MAC addresses and IPs to prevent attackers from hijacking another instance’s IP address. In the future we want to turn those VLANs into VTEPs (VXLAN gateways) as part of the VXLAN SDN we run for virtualized instances, but running VTEPs on a highly available MLAG on a TOR still pushes the edge of software capabilities in modern switches.
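The per-port lockdown works conceptually like the toy filter below: traffic is permitted only from the (MAC, IP) pair registered for that port, much as Neutron's port security pins instances to their assigned addresses. The data and helper names here are illustrative, not Neutron's actual API:

```python
# Toy anti-spoofing filter: each port permits only its registered (MAC, IP) pair.
# Port names, MACs, and addresses are illustrative.
ALLOWED_PAIRS = {
    "port-1": ("52:54:00:aa:bb:01", "10.184.0.10"),
    "port-2": ("52:54:00:aa:bb:02", "10.184.0.11"),
}

def permit(port, src_mac, src_ip):
    """Return True only if the source addresses match the port's registration."""
    return ALLOWED_PAIRS.get(port) == (src_mac, src_ip)

print(permit("port-1", "52:54:00:aa:bb:01", "10.184.0.10"))  # legitimate traffic
print(permit("port-1", "52:54:00:aa:bb:01", "10.184.0.11"))  # hijacked IP: dropped
```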

What’s next for OnMetal?

The core of OnMetal has been fairly stable, so I think our next priorities are feature parity with virtualized instances: improved operating system support, Isolated Networks, and Cloud Block Storage. After that, we will focus on growing new instance types. We launched OnMetal with three instance types focused on the needs of a multi-tier web application, but the performance improvements can be huge for many data-intensive applications, so we will keep expanding the instance types where their benefits are largest.

Have you seen any cool use cases so far?

We have several customers looking at packing the boxes with tens of thousands of containers. The density we are seeing is absolutely amazing; because the cost of infrastructure is so low, it enables whole new classes of businesses. Seeing entrepreneurial customers create entirely new kinds of business is very cool.


Chris Ueland

Wanting to call out all the good stuff when it comes to scaling, Chris Ueland created this blog, ScaleScale.