A Microscope on Microservices (netflix.com)
93 points by trickz on Feb 18, 2015 | hide | past | favorite | 8 comments


I'm interested by the application of Little's law as a tool for distilling a particular slice through performance down to a single number:

http://en.wikipedia.org/wiki/Little%27s_law

You have a request rate in requests per chronometer-second, and a response time in stopwatch-seconds per request, and you multiply them to get a "demand" figure in stopwatch-seconds per chronometer-second. It's sort of dimensionless, and sort of not, because the seconds on either side of the division operator are sort of orthogonal (very vaguely like how joules per newton-metre is not dimensionless).
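As a concrete illustration of that arithmetic (with made-up numbers), multiplying a request rate by a mean response time gives the average number of requests in flight, which is the "demand" figure being described:

```python
# Little's law: L = lambda * W
#   lambda = arrival rate (requests per wall-clock second)
#   W      = mean time each request spends in the system (seconds)
#   L      = average number of requests concurrently in the system,
#            i.e. stopwatch-seconds of work per chronometer-second.
# The numbers below are hypothetical, for illustration only.

def demand(request_rate_per_s: float, mean_response_time_s: float) -> float:
    """Average concurrency ("demand") implied by Little's law."""
    return request_rate_per_s * mean_response_time_s

# 200 req/s, each taking 50 ms on average -> about 10 requests in flight
print(demand(200.0, 0.050))
```

So a demand of 10 can be read as "this service is doing 10 seconds of in-service time per wall-clock second", which is one way to compare load across instances of the same system.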

How do I use a number like this? Does it make sense to compare the numbers from two different instances of the same system? From instances of two different systems? Should I worry if it goes up? If it goes down? What can I do about it, either way? Is it meaningful to calculate it for component parts of my system, and is there a way to critically relate the values in the parts to the whole? Is there a way to relate it to other quantities in my system?


In my experience Little's law is typically used to quantify the number of users in the system, a measure of concurrency. For the purpose of our tools we leverage the calculation to provide insight into "offered load", or the time spent within the service for a given interval.

We do have a challenge in that many of our downstream dependencies are called concurrently. At the current time this prevents us from easily decomposing the demand in a service cleanly among its dependencies. Some of this has to do with our transaction tracing framework and the granularity at which we require call behavior to be easily time-ordered. We believe we can solve this over time with an improved framework.

In the case of Mogul we leverage the demand calculation to understand which dependency is the largest contributor, pointing us in the direction of possible optimizations. If we are using the utility to triage an issue, we typically find that an increase in the demand or offered load within the problematic dependency correlates with the demand of the calling service. I think we are just at the beginning of leveraging this data more effectively, and getting away from having eyeballs on a dashboard is definitely a goal.
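The decomposition problem with concurrent calls can be sketched with made-up numbers: when a handler fans out to dependencies in parallel, the per-dependency demands no longer sum to the caller's demand, because the wall-clock time inside the dependencies overlaps. A minimal sketch, assuming a hypothetical service with two parallel downstream calls:

```python
# Hypothetical numbers: a service at 100 req/s whose handler calls two
# dependencies A and B concurrently, then does 10 ms of its own work.
rate = 100.0                            # requests per second
dep_time = {"A": 0.040, "B": 0.030}     # per-request seconds inside each dependency

# Each dependency's demand decomposes cleanly via Little's law:
dep_demand = {name: rate * t for name, t in dep_time.items()}  # A: 4.0, B: 3.0

# But the caller's response time is bounded by the SLOWER parallel call,
# not the sum of both, so caller demand != sum of dependency demands:
caller_response = max(dep_time.values()) + 0.010  # ~0.050 s
caller_demand = rate * caller_response            # ~5.0

# Summing per-dependency demand (7.0) overcounts the caller's in-service
# time (5.0) whenever calls overlap in time.
print(dep_demand, caller_demand)
```

This is why time-ordering call behavior in the tracing framework matters: without knowing which calls overlap, a clean attribution of the caller's demand to its dependencies isn't possible.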


That CPU flame graph is pretty cool. Haven't seen it displayed like that before. Shows process path/name contribution to volume of CPU spike, all in 1 graph. Neato.


Thanks, I summarized them on: http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html . They are really helpful for seeing the big picture of CPU usage, and quantifying the contribution by different code paths.

The flamegraph code is on GitHub (https://github.com/brendangregg/FlameGraph). There are other implementations too (see http://www.brendangregg.com/flamegraphs.html#Updates).

We're using them primarily to analyze CPU usage of the Linux and FreeBSD kernels, Java, and Node.js. We had an earlier post about the Node.js ones: http://techblog.netflix.com/2014/11/nodejs-in-flames.html
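For the curious, the FlameGraph scripts linked above consume a simple "folded" text format: one line per unique call stack, with frames joined by semicolons and followed by a sample count. A toy Python sketch of that aggregation step, using made-up stack samples in place of real profiler output:

```python
from collections import Counter

# Made-up stack samples, standing in for what a profiler
# (e.g. `perf script`) would emit, leaf-most frame last.
samples = [
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "parse_json"],
    ["main", "handle_request", "render"],
    ["main", "gc"],
]

# Fold identical stacks together and count them -- this is the
# format the flamegraph renderer takes as input.
folded = Counter(";".join(stack) for stack in samples)

for stack, count in sorted(folded.items()):
    print(stack, count)
```

Lines like `main;handle_request;parse_json 2` are then rendered as the nested rectangles in the SVG, with width proportional to the sample count.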


> At Netflix we pioneer new cloud architectures and technologies to operate at massive scale - a scale which breaks most monitoring and analysis tools.

Do we have any idea how massive Netflix's scale is, in terms of end-user requests per second, or some other metric?

And, probably more relevantly for me, how big a scale can one get to while using most monitoring and analysis tools?


We don't share our actual front-door requests-per-second numbers. We have mentioned that we run tens of thousands of instances across three AWS regions. Per the Atlas techblog post, these instances can generate in aggregate upwards of 1.2B time series, which are exposed at the minute level.


Part of why I ask is to get an idea of what those tens of thousands of instances are doing. How much of your leviathan scale is about the sheer mass of requests, how much is about the depth and sophistication of what you do to serve each request, and how much is about providing an environment which supports deployment and operation of the code which serves those requests?

I used carbon-relay in one job, and if you're using that, I'd guess you have 1,000 machines serving users, and 30,000 collecting metrics!


Our architecture and the sheer number of microservices contribute to much of the scale. We felt the explosion of instances that comes with this architecture was worth it in order to achieve our engineering velocity and reliability goals. If you consider the number of microservice instances serving user traffic (and include persistence tiers such as memcache and cassandra), you would still be in the tens of thousands.



