At MaxCDN, we held a hackathon for our new analytics platform (MaxCDN Insights v2.0), which is gradually rolling out. We built this platform in partnership with VoltDB, using their commercial product. I decided to participate in the hackathon and wanted to scan popular outgoing files from the MaxCDN network for malware.
New Malware Policy
We implemented a new Malware Policy and I wanted to come up with a tool using our MaxCDN Insights API. This would provide two benefits to us and the Internet as a whole:
- If a free-trial account was nefarious and distributing content it shouldn’t, we’d see it.
- If someone’s account got hacked, we’d have an early warning about malware being inserted.
At first, we looked at services like VirusTotal and a few open source options, but we found nothing that would run independently, download a file, and scan it quickly. So we decided to build something. Since my article Rolling Your Own CDN – Build A 3 Continent CDN For $25 In 1 Hour was popular, I decided to post about our experience here. We’ll respond to any feedback you have in the comment section below.
Special thanks to my friend Courtney Couch, who stepped in and helped build out the documentation and a more scalable infrastructure.
What are the design objectives?
- Be completely API driven.
- Break apart (on the backend) the process of downloading and scanning.
- Use ClamAV to start. Be able to add definitions or other scanners easily.
- Be low cost and have a low footprint when idle.
- Be able to scale up infinitely.
- Be able to be sold as a separate service (if we want to) in the future.
Our first idea was to be completely instance-less, but that wasn’t possible because large files could take longer than container services would allow us to run while downloading.
How is scalability achieved?
We use stateless API servers persisting data to an Aerospike data store. Individual stateless scanners retrieve items from the Aerospike queue and process them. The API servers, Aerospike database, and malware scanners can all be simply scaled by adding nodes as necessary.
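To illustrate why stateless scanners scale by simply adding nodes, here is a minimal sketch in Python (not our production Node.js code). It assumes an in-process `queue.Queue` as a stand-in for the Aerospike queue and threads as a stand-in for scanner nodes; `run_scanners` and `fake_scan` are hypothetical names invented for this example.

```python
import queue
import threading


def fake_scan(url):
    # Hypothetical stand-in for a real ClamAV scan of a downloaded file.
    return "infected" if "eicar" in url else "clean"


def run_scanners(urls, worker_count, scan):
    """Drain a shared queue with `worker_count` stateless workers."""
    pending = queue.Queue()  # stand-in for the shared queue in the data store
    for url in urls:
        pending.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        # A worker keeps no state of its own: it claims an item,
        # scans it, writes the verdict back, and repeats.
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to claim
            verdict = scan(url)
            with results_lock:
                results[url] = verdict

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker only talks to the shared queue, running one worker or many produces the same results; only the throughput changes.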
What is your infrastructure?
Everything runs in AWS, and we proxy the API through Mashape.
What is your stack? (Software, DBs etc)
- NodeJS for the scanners and API implementation
- ClamAV for the scanner
- Amazon ELB for the load balancer
- Mashape as an API proxy
- Aerospike for our datastore
How did you implement load balancing?
Amazon ELB. This may change depending on usage patterns. ELB is a bit on the expensive side for many small API requests.
What Data Store is used? Why did you choose it?
We used Aerospike because it scales horizontally very well and performs fantastically. For the type of data we need to store and the ways we need to access it, we don’t need the multi-key operations that Redis offers, the joins that SQL offers, or the expressive query language that MongoDB offers. The Lua-based User Defined Functions are more than ample for our needs. It’s also far easier to manage than something like Cassandra, which is a headache.
What Language and Framework is used and why?
We use NodeJS but there are a number of languages that would have worked fine for our needs here. We could have just as easily used Go, Scala, Erlang, and so on. But since the majority of the work is done by ClamAV or by Aerospike in Lua, the language choice for the actual API and handling the queue wasn’t a major factor for us.
Is there any caching done on the server?
We set up Aerospike in a way that makes caching unnecessary. We pull data from memory using a key much like we would with memcached so the performance is fantastic without any long-lived caches. That being said, we do use small microcaches, or caches that last a few seconds at most. These kinds of caches cover the really damaging edge cases where perhaps someone has a script hitting an API call thousands of times.
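A microcache of this kind can be sketched in a few lines. The Python below is an illustration rather than our actual Node.js code: `MicroCache`, `get_or_load`, and the two-second default TTL are assumptions made for the example, and the `loader` callback stands in for the real lookup against the data store.

```python
import time


class MicroCache:
    """Cache values for a few seconds to absorb bursts of identical requests."""

    def __init__(self, ttl_seconds=2.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._entries = {}          # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]         # still fresh: skip the backing store
        value = loader(key)         # expired or missing: hit the backing store
        self._entries[key] = (now + self.ttl, value)
        return value
```

With a cache like this in front of a status lookup, a script calling the same endpoint thousands of times within the TTL window would reach the data store only once per window.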
What info is returned immediately to the client? What is queued?
We check all the URLs to see what file sizes we can detect and which files are accessible. We also sum up and return the total size of the files queued and a list of any files that are inaccessible. We can’t always determine the size of files without retrieving them, so these files are just calculated as 0 bytes until they are scanned.
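The immediate response described above could be assembled roughly as follows. This is a hedged Python sketch, not our implementation: the `summarize` function, the `Inaccessible` exception, the response field names, and the idea of probing with an HTTP `HEAD` request are all assumptions for illustration.

```python
import urllib.error
import urllib.request


class Inaccessible(Exception):
    """Raised by a probe when a URL cannot be reached (hypothetical)."""


def head_probe(url, timeout=5):
    # One possible probe: ask for headers only and read Content-Length.
    # Returns the size in bytes, or None when the server does not report it.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            length = response.headers.get("Content-Length")
    except (urllib.error.URLError, ValueError) as exc:
        raise Inaccessible(url) from exc
    return int(length) if length is not None else None


def summarize(urls, probe=head_probe):
    queued, inaccessible, total_bytes = [], [], 0
    for url in urls:
        try:
            size = probe(url)
        except Inaccessible:
            inaccessible.append(url)
            continue
        queued.append(url)
        total_bytes += size or 0   # unknown sizes count as 0 until scanned
    return {
        "queued": queued,
        "inaccessible": inaccessible,
        "total_bytes": total_bytes,
    }
```

Passing the probe in as a parameter keeps the summary logic independent of the network, so it can be exercised with a fake probe.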
What automations have been done? How do you spin up new instances?
At the moment, since this is a new service, there is no automation, just alerts about resource utilization. Creating a new server merely means spinning up a new instance of the correct type and starting up the service.
How is Monitoring Done?
How does the Malware Detector Work?
1. Mashape sends to ELB which balances the HTTP requests to the API servers.
2. The API servers pump the scan with metadata into Aerospike.
3. The scanners monitor Aerospike for new scans and then process any pending files in the Aerospike queue.
4. As the scanners process items in the queue, they update Aerospike with the latest scan status, any errors, and so on.
5. The API servers check Aerospike for the current scan status when you make API calls. This way the scanners can scale completely independently – and even if the scanners were down, the API stays up.
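The steps above can be sketched as a small status lifecycle. The Python below is a simplified illustration under stated assumptions: `Store` is an in-memory stand-in for Aerospike (it does not use the Aerospike client API), and the function names and status strings (`queued`, `scanning`, `done`) are hypothetical.

```python
from collections import deque


class Store:
    """In-memory stand-in for the data store holding scans and the queue."""

    def __init__(self):
        self.scans = {}        # scan_id -> scan record
        self.queue = deque()   # scan_ids waiting for a scanner


def api_create_scan(store, scan_id, urls):
    # Step 2: the API server writes the scan and its metadata, then returns.
    store.scans[scan_id] = {
        "status": "queued",
        "files": {url: "pending" for url in urls},
    }
    store.queue.append(scan_id)
    return scan_id


def api_get_scan(store, scan_id):
    # Step 5: the API only reads the store; it never talks to a scanner.
    return store.scans[scan_id]


def scanner_step(store, scan_file):
    # Steps 3 and 4: a scanner claims one pending scan and records progress.
    if not store.queue:
        return False
    record = store.scans[store.queue.popleft()]
    record["status"] = "scanning"
    for url in list(record["files"]):
        try:
            record["files"][url] = scan_file(url)
        except Exception as exc:
            record["files"][url] = f"error: {exc}"
    record["status"] = "done"
    return True
```

Note that `api_get_scan` keeps answering with `queued` even if no scanner ever runs, which is the decoupling that lets the API stay up while the scanners are down.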
What does the final API look like?
We launched the site (where you can test the malware scanner and the API) at http://www.thumbsup.com. We’ll see if there’s uptake (besides our use case) and keep launching security tools for developers there.
Here is what the API looks like:
Create a Malware Scan
Creates a scan from a list of files
List all Malware Scans
Lists all the malware scans
List files for a scan
Load specific scan results
Loads the scan results for a specific malware scan