Stack at a glance: DNS & Routing | DNS: Python & C | Automation & Monitoring
I originally met him through Jason Read. Kris worked at Voxel/Internap and spent a lot of time in Asia. We first met for Indian food at 10AM in NYC, where I realized he was really smart and had great ideas for building NSONE. This article attempts to break down their stack and provide insight that might help you. Enjoy.
–Chris / ScaleScale / MaxCDN
What makes NSONE different?
NSONE is a modern DNS and traffic management platform. At the end of the day, our job is to get your eyeballs to the right place. That can be as simple as answering a plain old DNS query reliably and quickly on a global basis, and as complex as doing detailed computations on the fly to select the best of many potential datacenters, CDNs, or other endpoints to service a user based on all kinds of real-time telemetry about your infrastructure or the internet. To do all that we’ve architected and built a brand new DNS delivery stack from the ground up, deployed a world-class managed DNS network on 6 continents, and fundamentally rethought a lot of how DNS is traditionally used and managed.

NSONE’s also an awesome team of hardened infrastructure pros. We’ve all got deep backgrounds across the infrastructure spectrum: hosting, cloud, on-demand bare metal, colo, transit, CDN, you name it, we’ve built it. That breadth of expertise and perspective turns out to be really useful in answering that simple question: where should we send this traffic? It helps that we have a great time while we’re at it, and have a lot of respect for each other and our customers.
How do you think about DNS based routing, and what are some of the interesting ways you route traffic for your customers?
If you’re just delivering your application from a single datacenter, DNS based routing isn’t something you’re thinking about. But the moment you’re in multiple datacenters and you’ve got a choice — which endpoint is going to best service a user — things get interesting. “Best” is a totally overloaded term: it might be something like fastest response times, or highest throughput, or lowest packet loss, but it could also be any combination of performance and business metrics.
Most DNS based routing that’s being done today makes some simplistic assumptions, for example, that the datacenter physically closest to a user is the one that’s going to give the best service. But that’s not how the internet works, and it’s also a very myopic view of “best”.
Our view is that at decision making time, when we get a DNS query for some domain and have a bunch of potential answers to select from — usually service endpoints for your application — we should have as much information on hand as we possibly can about those endpoints.
Mostly this information falls into three categories:
- Static details: stuff like where your servers are located, how many cores they have, basic priorities and weights, etc.
- Infrastructure metrics: real-time information about what’s happening in your infrastructure, like load averages, connection counts on load balancers, how much of your commits you’ve used, and so on.
- Eyeball metrics: real-time information about what’s happening between end users and your endpoints, like granular latency, throughput, or similar metrics from the vantage point of end users, or information about the state of the global routing table, etc.
What we’ve built is a set of systems and a platform that’s meant to take in all this information, often at high frequency, massage it into something useful for routing, and get it out to our DNS delivery edges as close to real-time as possible so it can be used in making routing decisions.
It’s one thing to have a lot of data available to use in routing; it’s another thing to make it easy to use. This is what our Filter Chain tech is all about. At a high level, the Filter Chain is a sequence of little building-block “filter” algorithms. Each filter examines the set of answers we could give to a DNS query, along with all the static config and metrics we have, and manipulates the set of answers somehow: removing endpoints that are down, sorting them by response time for the requester’s network, shedding load from overloaded servers, and so on. When you connect a bunch of these filters in sequence, it turns out to be really easy to build complex turnkey routing setups driven by real-time data about your application infrastructure.
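To make the idea concrete, here is a minimal sketch of a filter chain in Python. The filter names, metadata fields, and thresholds are illustrative assumptions, not NSONE's actual API; the point is just the shape of the pattern: each filter takes the current candidate answers plus metadata and returns a possibly smaller, possibly reordered list.

```python
# Sketch of the Filter Chain pattern. All names and fields are hypothetical.

def up(answers, meta):
    """Drop endpoints whose health check is failing."""
    return [a for a in answers if meta[a].get("up", True)]

def shed_load(answers, meta, max_load=0.9):
    """Drop endpoints whose load is above a threshold (unless that empties the set)."""
    kept = [a for a in answers if meta[a].get("load", 0.0) <= max_load]
    return kept or answers

def sort_by_latency(answers, meta):
    """Prefer endpoints with the lowest observed latency for this requester."""
    return sorted(answers, key=lambda a: meta[a].get("latency_ms", float("inf")))

def run_chain(filters, answers, meta):
    """Pipe the answer set through each filter in sequence."""
    for f in filters:
        answers = f(answers, meta)
    return answers

meta = {
    "east.example.com": {"up": True, "load": 0.95, "latency_ms": 20},
    "west.example.com": {"up": True, "load": 0.40, "latency_ms": 55},
    "down.example.com": {"up": False, "load": 0.10, "latency_ms": 5},
}
chain = [up, shed_load, sort_by_latency]
print(run_chain(chain, list(meta), meta))  # ['west.example.com']
```

Note how the down endpoint is filtered first, then the overloaded one is shed, and whatever survives is sorted by latency; composing small filters like this is what makes complex routing policies easy to assemble.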
Our customers use the tech in all sorts of interesting ways, and are always finding new ways of mixing and matching data and algorithms to get powerful behaviors. It’s common for us to do something like: pick a region (e.g. US-WEST or US-EAST); within the region, send 95% of traffic to colo and 5% to AWS; make that split sticky, so most of the time the same 5% of users go to AWS to ensure good cache locality; and if the colo infrastructure gets overloaded, start shifting weight so more traffic goes to AWS with auto-scaling enabled. And on and on.
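One way to get the "sticky 95/5" behavior described above is to hash the requester's subnet into a fixed bucket space, so the same users deterministically land on the same side of the split. This is a sketch under stated assumptions (function names and the `shift` knob are made up for illustration), not NSONE's actual algorithm.

```python
import hashlib

def sticky_split(client_subnet, primary, overflow, overflow_pct=5, shift=0):
    """Deterministically send ~overflow_pct% of client subnets to `overflow`
    and the rest to `primary`. Because the bucket is a hash of the subnet,
    the same users always land on the same side, preserving cache locality.
    `shift` illustrates moving extra weight to the overflow target when the
    primary is overloaded. Hypothetical sketch, not a real API.
    """
    bucket = int(hashlib.sha256(client_subnet.encode()).hexdigest(), 16) % 100
    return overflow if bucket < overflow_pct + shift else primary
```

Calling `sticky_split("203.0.113.0/24", "colo", "aws")` returns the same answer every time for that subnet; raising `shift` moves additional buckets (and therefore additional users) over to the overflow target without reshuffling everyone else.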
These capabilities are naturally suited to hyper performance sensitive applications in many datacenters, so we do a lot of routing for CDNs, ad tech companies, major web properties, SaaS platforms, and the like. But really, the moment you graduate from a single datacenter app deployment into two or more datacenters, we can make a powerful difference.
What is your stack?
There’s a lot going on in our stack. On the delivery side, we’re touching everything from the hardware/NIC level (crazy packet filtering), to deep traffic engineering in BGP, to low-level kernel features that let us get as precise as routing DNS queries to specific cores to maximize cache locality, up to a totally custom-written nameserver that executes complex routing algorithms for every single request.

At a higher level, what we’ve built is a big, globally distributed real-time system, and we’ve tried to use the right tools for the right jobs. You’ll see some Mongo (it’s good at replication, but we’re lightweight with it on reads/writes), RabbitMQ (which helps us quickly propagate config changes and routing data to every nameserver in our network), Redis, etc. In our core facilities, where our portal, API, and other central systems live, we’re using things like HBase and OpenTSDB for metrics. We have a lot of our own software, much of it Python (very heavily Twisted): probably 20+ different roles. Everything is managed by Ansible, which is really crucial to the velocity with which we iterate and deploy. And we have a really solid QA/build framework and process in place, with a comprehensive functional testing suite. In something as mission-critical as DNS you need to be pedantic about QA.
What tools do you use for Performance Monitoring?
Catchpoint is our primary monitoring tool. We also spend a lot of time with route-views and various backbone looking glasses just to ensure we have a clear view of how traffic is getting to our network. Going forward, I anticipate we’ll be using CloudHelix more and more — it’s a powerful tool for doing ad hoc queries on flow metrics.
We generally avoid using some of the simple online ping and other tools: we frequently find they have incorrectly geolocated their nodes, or that they’re otherwise not producing very reliable results. We also have a very significant RUM infrastructure of our own as part of our higher end routing tech, and there are some unique ways we can leverage that to get very granular data on the performance of our DNS infrastructure across a breadth of end user networks.
How is your network laid out?
This is a fun question, because NSONE operates a whole variety of networks. One of the truly unique things we do is deploy private, dedicated managed DNS networks powered by our delivery stack, so there are all sorts of topologies at play. Our big, global, managed DNS network — the one most of our customers use — is spread across six continents and 17 POPs (with some more on the way). It’s anycasted, with around 7-8 carriers in the mix depending on the market. We make really heavy use of BGP communities and specifically work with providers that have solid BGP community policies in place, so we can do really precise traffic engineering to tune our anycasting. It took a lot of work, but now we have one of the fastest DNS networks on the planet.
When you’re building a global anycasted network, what you really care about is topological distance to carriers, especially to all the big backbones. When we announce our prefixes to a particular network, we need to tune the announcements very precisely to make sure all our global announcements are equidistant (same number of hops) from the backbones. This is where BGP communities come in: we hint to our upstreams to prepend hops to certain routes, or prevent export of our prefixes to certain NSPs. In some regions like South America or Africa, we need to be really careful to prevent export of our routes outside the region. And we always need to prevent route “leakage” via peering exchanges or other paths that some networks normally prefer.
- SEA01: Seattle, Washington, USA
- SJC01: San Jose, California, USA
- LAX01: Los Angeles, California, USA
- DAL01: Dallas, Texas, USA
- MIA01: Miami, Florida, USA
- ORD01: Chicago, Illinois, USA
- LGA01: New York, New York, USA
- IAD01: Herndon, Virginia, USA
- GRU01: Sao Paulo, Brazil
- CPT01: Cape Town, South Africa
- HKG01: Hong Kong
We’re adding locations all the time.
When we deploy private DNS networks, they can be pretty unique. Commonly we’ll deploy into a customer’s existing datacenter footprint and use their connectivity. Sometimes we build purely internal-facing DNS networks for service discovery or other purposes. We also run into quite a few customers who need a real globally distributed, anycasted network, but for regulatory, policy, or technology reasons they can’t use our managed DNS network, and they don’t already have their own global footprints to leverage. We can deploy those kinds of networks into a few clued-in public cloud providers like HostVirtual. We even deploy right into Black Lotus’s DDoS scrubbing centers for customers that have unique DDoS requirements.
We use ExaBGP and some custom code to automate some of the communities. ExaBGP also includes support for flowspec, which isn’t supported in a lot of networks; where it is, it’s a powerful, flexible way to get really fine-grained routing control.
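As a hedged illustration of the community-tagging idea (the addresses, ASNs, and community value below are all made up, and the meaning of any given community is defined by the upstream carrier, not by us), an ExaBGP static announcement of an anycast prefix tagged with a traffic-engineering community might look roughly like this:

```
neighbor 192.0.2.1 {                  # upstream carrier session (example address)
    router-id 198.51.100.1;
    local-address 198.51.100.1;
    local-as 64512;
    peer-as 64496;

    static {
        # Announce our (example) anycast prefix, tagged with a
        # carrier-defined community that might mean, e.g., "prepend twice"
        # or "do not export to a given NSP".
        route 203.0.113.0/24 {
            next-hop 198.51.100.1;
            community [ 64496:120 ];
        }
    }
}
```

The per-carrier meaning of communities like `64496:120` is exactly why it matters to pick providers with solid, well-documented BGP community policies.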
What is the makeup of your team?
We have an amazing team. A lot of us came out of a successful infrastructure company called Voxel that was bought by Internap a few years ago. We’ve worked together for close to a decade, specifically on high volume internet infrastructure. We’re a small team — a couple engineers, ops/devops/neteng to manage our global network, a bit of sales and marketing, and customer support. Everyone on our team spends time with customers — that’s something we really value.
What is the workload being processed?
On the delivery side, our fundamental unit of work is the DNS query. For every query, we’re executing a custom sequence of routing algorithms (which we call the Filter Chain) acting on a collection of answers (e.g., load balancer IPs or CDN hostnames) and realtime data about the answers (e.g. load or other infrastructure metrics, network metrics, etc). The other less visible workload for us is on the data side: ingesting, classifying, normalizing, aggregating, and distributing volumetric, real-time data that’s useful for routing. Sometimes, thousands of data points per second are flying into our systems, and we need to be really smart about how and when to send the data out to our far-flung edges for use by the Filter Chain.
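The data-side workload above is essentially a smoothing and change-detection problem: thousands of raw data points per second come in, but the edges only need updates when a feed has moved meaningfully. Here is a minimal sketch of that idea using an exponential moving average; the class name, fields, and thresholds are assumptions for illustration, not NSONE's actual pipeline.

```python
class FeedAggregator:
    """Sketch: smooth a high-frequency metric feed and only publish to the
    edge nameservers when the smoothed value changes materially.
    Hypothetical illustration, not a real NSONE component."""

    def __init__(self, alpha=0.2, threshold=0.1):
        self.alpha = alpha          # EWMA smoothing factor
        self.threshold = threshold  # relative change required to publish
        self.ewma = {}              # feed name -> smoothed value
        self.published = {}         # feed name -> last value sent to edges

    def ingest(self, feed, value):
        """Fold a raw data point into the EWMA; return the new value if it
        should be pushed to the edges, or None if it can be suppressed."""
        prev = self.ewma.get(feed, value)
        smoothed = self.alpha * value + (1 - self.alpha) * prev
        self.ewma[feed] = smoothed
        last = self.published.get(feed)
        if last is None or abs(smoothed - last) > self.threshold * max(abs(last), 1e-9):
            self.published[feed] = smoothed
            return smoothed   # would be fanned out to the POPs here
        return None           # not enough change to bother the edges
```

The payoff is bandwidth and churn control: a load average that jitters between 0.49 and 0.51 never wakes up the edges, while a real spike propagates immediately.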
How important is latency to you?
Latency is important in a lot of ways. The traditional managed DNS differentiator is basically how fast you spit out an answer. It’s table stakes in our industry to spit out an answer really, really fast.

What’s much more interesting to us, and where we think the DNS and traffic management industry needs to head, is spitting out the *best* answer. More and more, applications are built to be distributed from the ground up, whether for reliability (DR environments) or performance (pushing the application close to the edge). It’s not that hard anymore to spin up a new CDN of your own using a modern cloud provider and nginx. Nor is it that hard to solve the multi-datacenter data replication and consistency problem with a whole raft of modern databases. It’s even fairly easy to use the basic geoip features of most managed DNS providers to do some simple, decent routing to get eyeballs to the “closest” datacenter.

But if you’re really trying to optimize your application’s latency, or throughput, or really any business metric, second-order approximations like geographic proximity won’t do; that’s not how the internet actually works. You need to measure those metrics: what’s the response time of a user on Verizon in New York to each of your different datacenters, right now? And you need to suck that data in at high frequency, do a bunch of math, and use it to send users to the right place. Which is what we do. So yeah: latency is pretty important to us. But maybe not in the way you’d think, at least at first glance.
What’s the current state of edns-client-subnet on the internet?
For those that aren’t familiar, edns-client-subnet (ECS) is an extension to the DNS protocol. DNS resolvers that support it (like Google Public DNS and OpenDNS) will include additional details about the original requester when sending a query to an authority (like NSONE) that supports ECS — usually, the first three octets of the requester’s IP address are sent along. That information can be really useful in doing traffic management, especially if you’re doing geoip or some of the more advanced routing NSONE does, since you have some information about the actual end user instead of just some big centralized resolver they’re using.
NSONE is one of the few DNS providers that supports ECS, and we’ve been doing it for a while now. We’ve learned some interesting stuff.
We pretty much only see ECS-enabled queries from Google and OpenDNS. Very few other resolvers support it. But mostly, that’s okay: Google/OpenDNS are big anycasted global resolver networks with nodes that might not really be that related (geographically or otherwise) to actual end users, so it’s valuable to get additional information from them for routing. The majority of other DNS resolvers tend to be more correlated with actual end users so there’d be less benefit in enabling ECS on those.
About 10-15% of the DNS queries we get have ECS data attached. What’s interesting is the moment we start returning query responses that use ECS data, the number of ECS-enabled queries increases because resolvers can only cache responses with respect to the subnet (usually /24) the response applies to. If you have a DNS record like an ad pixel domain that’s hit from most of the subnets on the internet, ECS can lead to a lot more DNS traffic (but also much better routing accuracy). If you have a record that’s localized in its set of users, ECS won’t affect your DNS traffic much. In general, we can do much more precise routing for ECS-enabled queries than we’d otherwise be able to do, thanks to the prevalence of Google Public DNS and OpenDNS.
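The cache-fan-out effect described above falls out of how ECS-tagged responses must be cached: a resolver can no longer keep one entry per name, it needs one per (name, client subnet) pair. A tiny sketch (function and field names are illustrative, and real ECS scopes are negotiated per response rather than fixed at /24):

```python
def ecs_cache_key(qname, client_ip, scope_prefix=24):
    """Sketch of an ECS-aware resolver cache key: the response only applies
    to the client subnet it was computed for, so the subnet becomes part of
    the key. Illustrative only; assumes IPv4 and a fixed /24 scope."""
    octets = client_ip.split(".")
    subnet = ".".join(octets[: scope_prefix // 8]) + ".0/%d" % scope_prefix
    return (qname, subnet)

# Without ECS, all three clients below would share one cache entry for the
# name; with ECS, clients in different /24s each get their own entry (and
# their own, more precisely routed, answer).
keys = {ecs_cache_key("pixel.example.com", ip)
        for ip in ["203.0.113.7", "203.0.113.200", "198.51.100.9"]}
```

For a record hit from most subnets on the internet, that per-subnet keying is exactly why query volume at the authority goes up once ECS-aware responses start flowing.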
What’s next for NSONE?
A lot of top secret stuff! We’re hiring like mad (isn’t everybody?) because we’re growing so fast. We’ve found a really fun space to be working in and we’re trying as hard as we can to really push the boundaries of what’s possible in traffic management. That means things like true eyeball metrics based routing, bespoke software defined private DNS networks specifically tailored to application workloads, and a lot of new and powerful algorithms and tools for doing and thinking about DNS and traffic management.
More Info – Kris presents at Surge 2014
Kristopher Beevers | Data Driven DNS: Traffic Management for Distributed Applications