ScaleScaleScaleScale

Great Architectures, Stacks & DevOps at Webscale

By Chris Ueland - CEO of MaxCDN


The Stack Behind Netflix Scaling

profile

As we research and dig deeper into scaling, we keep running into Netflix. They are very public with their stories. This post is a round up that we put together with Bryan’s help. We collected info from all over the internet. If you’d like to reach out with more info, we’ll append this post. Otherwise, please enjoy!

–Chris / ScaleScale / MaxCDN

Application & Data
Languages Java, Python, Javascript
Database MySQL, Cassandra, Oracle
Frameworks Node.js
Cloud Hosting Amazon EC2
Javascript UI Library React
SQL Database-as-a-Service Amazon RDS
NoSQL Database-as-a-Service Amazon DynamoDB
Database Cluster Management Dynomite
Business Tools
Productivity Suite Google Apps
Project Management Confluence
Password Management OneLogin
Utilities
Transactional Email Amazon SES
Mobile Push Messaging Urban Airship
API Tools Falcor
DevOps
Code Collaboration & Version Control GitHub
Continuous Integration Jenkins
Server Management Apache Mesos
Log Management Sumo Logic
Mobile Error Monitoring Crittercism
Performance Monitoring Boundary, LogicMonitor


Open Connect CDN
Operating System FreeBSD
Server Nginx
Routing Bird daemon

A look at what we think is interesting about how Netflix Scales

Netflix was founded in 1997 by Marc Randolph and Reed Hastings in Scotts Valley, California and started with 30 employees with 925 working on pay-per-rent.Netflix, now the world’s leading Internet television network, has more than 69 million subscribers in 50 countries enjoying more than ten billion hours of TV shows and movies per month. They are very transparent and publish a lot of information online. We’ve collected it and are sharing the things we think are most interesting:

Scaling Culture
  Trust People, not Policies

Supporting Many titles with Amazon
  Having all storage on Amazon S3, lots of compute there..

Supporting Many different Devices
  Having 50+ Encoded titles

NetFlix Connect CDN
  Built their own CDN and multi-CDN capabilities

Scaling Open Source Projects
  Lots of goodies from security to testing to big data to Amazon tools.

Scaling Algorithms
  Deriving the best algorithms from NetFlix Prize Contest

Scaling Culture

NetFlix had a famous presentation about culture. The concepts are about re-thinking HR. A lot of their scaling of people is focused on the principles form this presentation. Here are some sample slides and the presentation. This gives some important context to the culture to understand how they scale their software stack and why it works.

The Full presentation is here

Supporting Many titles with Amazon

Netflix’s infrastructure is on Amazon EC2 with master copies of digital films from movie studios being stored on Amazon S3. Each film is encoded into over 50 different versions based on video resolution and audio quality using machines on the cloud. Over 1 petabyte of data is stored on Amazon. These data are sent to content delivery networks to feed the content to local ISPs.

Netflix uses a number of open-source software at the backend, including Java, MySQL, Gluster, Apache Tomcat, Hive, Chukwa, Cassandra and Hadoop.

Netflix Architecture
Netflix Infrastructure

Diagram showing Netflix viewable content, tech stack, distribution method and playback devices

Supporting Many Devices

The huge amount of codec and bitrate combinations on Netflix means “having to encode the same title 120 different times before it can be delivered to all streaming platforms”.

Although Netflix uses adaptive bitrate streaming technology to adjust the video and audio quality to match the customer’s download speed, they also provide users the ability to choose the quality of video on its website.

You can watch instantly from any Internet-connected device that offers a Netflix app, such as a computer, gaming console, DVD or Blu-ray player, HDTV, set-top box, home theater system, phone or tablet.

They support every title in the following Codecs with different bit rates to make them work on device and connection.


Netflix Open Connect CDN

The Netflix Open Connect CDN is provided for larger ISPs that have over 100,000 subscribers. A specially built low power high storage density appliance caches Netflix content within the ISPs’ data centers to reduce internet transit costs. This appliance runs the FreeBSD operating system, nginx and the Bird Internet routing daemon.



NetFlix Paris Open Connect – Photo Credit: @dtemkin twitter

Watch the Open Connect video here

Scaling Algorithms

In 2009, Netflix did a contest called the Netflix prize. They opened up a bunch of anonymized data and allowed teams to try and derive better algorithms. They got a 10.06% uplift of their existing algorithm from the winning team. Netflix was going to run another Netflix Prize but ultimately didn’t because of privacy concerns from the FTC.

The Netflix recommendation system consists of many algorithms. The two core algorithms used in their production system are Restricted Boltzmann Machines (RBM) and a form of Matrix Factorization called SVD++. These two algorithms are combined using a linear blend to produce a single higher accuracy estimate.

Restricted Boltzmann Machines are neural networks that have been modified to work in collaborative filtering. Each user has one RBM with the input node for each representing a movie the user has rated.

SVD++ is an asymmetric form of SVD (Singular Value Decomposition) that makes use of implicit information like RBMs. It was developed by the winning team in the Netflix Prize contest.

On their Engineering blog, the Netflix team covers Learning a Personalized Homepage

Open Source Projects

https://netflix.github.io/. Netflix has a great engineering blog and they recently did a post called The Evolution of Open Source at Netflix.

Big Data

Genie
A powerful, REST-based abstraction to our various data processing frameworks, notably Hadoop.
Inviso
provides detailed insights into the performance of our Hadoop jobs and clusters.
Lipstick
Shows the workflow of Pig jobs in a clear, visual fashion.
Aegisthus
Enables the bulk abstraction of data out of Cassandra for downstream analytic processing.

Build and Delivery Tools

Nebula
Effort at Netflix to share its internal build infrastructure.
Aminator
A tool for creating EBS AMIs.
Asgard
Web interface for application deployments and cloud management in Amazon Web Services (AWS).

Common Runtime Services & Libraries

Eureka
Service discovery for the Netflix cloud platform.
Archaius
Distributed configuration.
Ribbon
Resilent and intelligent inter-process and service communication.
Hystrix
Provides reliability beyond single service calls. Isolates latency and fault tolerance at runtime.
Karyon and Governator
JVM container services.
Prana sidecar
Prana provides proxy capabilities within an instance.
Zuul
Provides dyamically scriptable proxying at the edge of the cloud deployment.
Fenzo
Provides advanced scheduling and resource management for cloud native frameworks.

Data Persistence

EVCache and Dynomite
For using Memcached and Redis at scale.
Astyanax and Dyno
Client libraries to better consume datastores in the Cloud.

Insight, Reliability and Performance

Atlas
Time-series telemetry platform
Edda
Service to track changes in your cloud
Spectator
Easy integration of Java application code with Atlas
Vector
Exposes high-resolution host-level metrics with minimal overhead.
Ice
Exposes ongoing cost and and cloud utilization trends.
Simian Army
Tests Netflix instances for random failures.

Security

Security Monkey
Helps monitor and secure large AWS-based environments.
Scumblr
Leverages Internet-wide targeted searches to surface specific security issues for investigation.
MSL
An extensible and flexible secure messaging protocol that addresses a number of secure communications use cases and requirements.
Falcor
Represent remote data sources as a single domain model via a virtual JSON graph.
Restify
node.js REST framework specifically meant for web service APIs
RxJS
A reactive programming library for JavaScript

References

  1. https://en.wikipedia.org/wiki/Netflix
  2. http://gizmodo.com/this-box-can-hold-an-entire-netflix-1592590450
  3. http://edition.cnn.com/2014/07/21/showbiz/gallery/netflix-history/
  4. http://techblog.netflix.com/2015/01/netflixs-viewing-data-how-we-know-where.html
  5. https://gigaom.com/2013/03/28/3-shades-of-latency-how-netflix-built-a-data-architecture-around-timeliness/
  6. https://gigaom.com/2015/01/27/netflix-is-revamping-its-data-architecture-for-streaming-movies/
  7. http://stackshare.io/netflix/netflix
  8. https://www.quora.com/How-does-the-Netflix-movie-recommendation-algorithm-work
  9. https://netflix.github.io/

Popular search terms:

  • netflix tech stack
profile

Chris Ueland

http://www.ueland.com

Wanting to call out all the good stuff when it comes to scaling, Chris Ueland created this blog, ScaleScale. Chris is the CEO of MaxCDN.com

  • Hi – thanks for reading our post. If you have any knowledge of updates to the stack, please drop me a line chris at maxcdn com and we’ll update the article. Thanks!

  • We’ve updated this at several users requests to have Java listed as the primary language.

    There is also some talk on twitter of Netflix using the Spring framework. If anyone has some links we missed we’re happy to edit the post.

    Please drop me a line at chris at maxcdn com if you have other comments or changes. Appreciate all the feedback. We’d like to make this as accurate as possible.

  • Thanks to faizshah (HN) for providing this link https://www.youtube.com/watch?v=R2kKmMyqTfc which has Adrian Cockcroft gave a good overview of Netflix open source projects. https://www.youtube.com/watch?v=R2kKmMyqTfc

    He says that It’s two hours long but the first hour is sufficient to get an overview of their open source projects.

  • Thanks to fitzwatermellow (HN) for providing this link: https://www.youtube.com/watch?v=-mL3zT1iIKw

    fitzwatermellow says This AWS re:Invent 2015 talk by Dave Hahn is also really illuminating: A Day in the Life of a Netflix Engineer (Using 37% of the Internet)

    • David Houde

      Thanks for sharing this, not sure how I missed it!

  • Jonah Kowall

    I can guarantee you that Netflix is not using Boundary or LogicMonitor. Where did you get that data?

      • Jonah Kowall

        Ariel left Netflix in 2013, at that time they were much smaller than they are today. Boundary also ended up failing as a company and was liquidated. Here is my take on Boundary after covering and working with them for many years from start to finish : https://blog.appdynamics.com/devops/industry-insights-what-happened-to-boundary/

        Netflix also hired a new performance lead, who has very different ideas about monitoring http://www.brendangregg.com/ he doesn’t seem to care much about topology or system interaction, and instead focuses on single system performance tuning. I come from a different perspective, and the work Boundary did was revolutionary, but ultimately shifting focus many times they ended up losing what made them unique. The product which was built at the end of the journey was very different than the unique vision everyone saw initially. My blog explains that.

        What about LogicMonitor? They have a unique product as well, but they’ve been trying to grow the company for many years now with pretty limited success. I don’t think they will be able to sustain business long term based on the existing customer base.

        • steverfrancis

          While we cannot comment on whether Netflix has or has not been a LogicMonitor customer for years, they do list LogicMonitor as a desired skill in their jobs postings…

          https://jobs.netflix.com/jobs/532

          And I think you have a interesting definition of “limited success” (Admittedly I’m biased, but we’ve been growing about 100% year over year, continually, and have a marque list of whos-who customers, that, unfortunately, we mostly cannot mention…. A majority of public listed SaaS customers, many Fortune 500’s, etc..)

          • Jonah Kowall

            I guess you’ve been more successful than Boundary, but it’s been slowly building for sure. I think the product is great versus other infrastructure monitoring tools, but the problem is the market is 99% on premises when it comes to infrastructure monitoring. Even APM is 85% on premises today. Your growth has been impressive considering that fact.

        • Boundary also couldn’t handle any type of scale. They pitched hard to us at Limelight and got pretty sheepish when we pushed a small portion of our edge machines to it.

  • Jesper

    You miss one important component – Automic. Automic is used as the automation and orchestration tool and without this tool non of it would work. Automic is the back end engine for Netflix.

    • Jonah Kowall

      Only for ETL, they don’t use it across a large portion of systems…

  • David Ferguson

    Great post. Worth highlighting Spinnaker (http://spinnaker.io/) though as the replacement for Asgaard, which was open-sourced last week: http://techblog.netflix.com/2015/11/global-continuous-delivery-with.html