In this Tech Talk, Blake, a developer here at Fog Creek, talks about Netflix’s Hystrix, a latency and fault tolerance library for distributed systems. He gives an overview of the library, then a quick demo of a working Hystrix Dashboard, which offers a visual way to monitor your systems using Hystrix.
About Fog Creek Tech Talks
At Fog Creek, we have weekly Tech Talks from our own staff and invited guests. These are short, informal presentations on something of interest to those involved in software development. We try to share these with you whenever we can.
Content and Timings
- Introduction (0:00)
- The Circuit Breaker Pattern (0:57)
- Hystrix Dashboard (5:14)
I’m going to talk about Hystrix today. Hystrix is one of Netflix’s open-source initiatives. Netflix does a lot with microservices. They have services all over the place, and everyone here who has watched Netflix has seen the service go into a flaky mode where it’s low-res for a while and then pops back. You’re still watching the movie the whole time, so they’re able to handle little failures pretty well, and this is one of the tools they use for that. The textbook definition of Hystrix is “it’s a latency and fault tolerance library for distributed systems.”
Hystrix is a bigger effort. Today I’m going to just talk about the Hystrix Dashboard. The dashboard can be used with any language. It just reads from an API end-point, so you can use it to monitor your own software.
The Circuit Breaker Pattern
The root of all this is the Circuit Breaker pattern. This is the design pattern that I’ll describe here. You can imagine you have lots of different browsers, and maybe mobile apps, hitting your API end-point. If you move towards more of a microservice architecture, you might have lots of different services behind the firewall that your main API end-point is going to consume. Say the API is going to go off and fetch your favorite color from the favorite color service, and then it’ll go off to the web somewhere. It might do some other crazy stuff, and we all know what happens when one of these things fails.
We get a total failure, and what we’d rather see is a more graceful failure. Here you can see the bridge is clearly in trouble. There was a bad problem, obviously, but maybe nobody died here. I put together a demo here using some Go. There’s a library out there called Hystrix Go. I’m sure there are versions of this for .NET and for other languages too; this is just the one that I decided to focus on.
Hystrix Go is a library that you include in your project, and you’re basically going to give it a command to run that fetches remote data or interacts with remote systems somehow. You wouldn’t use this pattern everywhere. It’s just for when you’re aggregating different responses from remote systems and some of these end points aren’t crucial. Like I said, Get Favorite Colors. If that’s not available, or if it’s taking too long, you could just return “not available right now,” put that into the API response, and the user will just see “oh, that’s not available right now.” They hit refresh, they hit refresh again, and it eventually comes back. This is basically how it looks. I’m going to make my remote call and I’m going to give it two functions. The first one is the command that I want it to run and the second one is what to do if that fails.
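The two-function shape described above can be sketched with only the Go standard library. This is an illustration of the idea, not the Hystrix Go API: `runWithFallback`, `notAvailable`, the three sample color functions, and the 50 ms `callTimeout` are all hypothetical names and values chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// callTimeout is an arbitrary value picked for this sketch.
const callTimeout = 50 * time.Millisecond

// runWithFallback is a hypothetical helper, not part of any library.
// It runs command and returns its value, or returns what fallback
// produces if command fails or takes longer than timeout.
func runWithFallback(timeout time.Duration, command func() (string, error), fallback func(error) string) string {
	type result struct {
		value string
		err   error
	}
	// Buffered so a command that finishes late can still send and exit.
	done := make(chan result, 1)
	go func() {
		v, err := command()
		done <- result{v, err}
	}()

	select {
	case r := <-done:
		if r.err != nil {
			return fallback(r.err)
		}
		return r.value
	case <-time.After(timeout):
		return fallback(errors.New("timed out"))
	}
}

// notAvailable is the fallback: the user still gets an answer.
func notAvailable(err error) string { return "not available right now" }

// Three stand-ins for the favorite color service.
func healthyColor() (string, error) { return "blue", nil }

func failingColor() (string, error) {
	return "", errors.New("favorite color service returned 500")
}

func slowColor() (string, error) {
	time.Sleep(4 * callTimeout)
	return "blue", nil
}

func main() {
	fmt.Println(runWithFallback(callTimeout, healthyColor, notAvailable))
	fmt.Println(runWithFallback(callTimeout, failingColor, notAvailable))
	fmt.Println(runWithFallback(callTimeout, slowColor, notAvailable))
}
```

The caller never sees the error or the delay directly; it only sees either the real value or the fallback text, which is what lets the rest of the API response go out as normal.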
There are different ways to define failure. One way is if an end point is taking too long to respond. Maybe you set the threshold at one second for this specific end point. Or maybe it counts as a failure if it actually returns a 500 error, or if it returns a 500 error three times in a row. You can define your error thresholds, your percentages, and your time outs per end point if you want to.
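As a rough sketch of what per-end-point failure rules could look like, the snippet below keeps a time out and an error percentage for each end point. The `EndpointConfig` type, its field names, the end-point names, and the numbers are assumptions made for illustration; they do not mirror the configuration of any particular library.

```go
package main

import (
	"fmt"
	"time"
)

// EndpointConfig is a hypothetical per-end-point policy.
type EndpointConfig struct {
	Timeout            time.Duration // slower responses count as failures
	ErrorPercentToTrip int           // trip once this percentage of calls fail
}

// configs holds illustrative settings; the values are made up.
var configs = map[string]EndpointConfig{
	"GetFavoriteColor": {Timeout: time.Second, ErrorPercentToTrip: 50},
	"GetAccount":       {Timeout: 3 * time.Second, ErrorPercentToTrip: 10},
}

// isFailure reports whether one call counts against the end point:
// either it took longer than the time out or it returned a 5xx status.
func isFailure(cfg EndpointConfig, elapsed time.Duration, status int) bool {
	return elapsed > cfg.Timeout || status >= 500
}

// shouldTrip reports whether the failure rate has reached the threshold.
func shouldTrip(cfg EndpointConfig, failures, total int) bool {
	if total == 0 {
		return false
	}
	return failures*100/total >= cfg.ErrorPercentToTrip
}

func main() {
	cfg := configs["GetFavoriteColor"]
	fmt.Println("slow 200 is a failure:", isFailure(cfg, 2*time.Second, 200))
	fmt.Println("fast 500 is a failure:", isFailure(cfg, 100*time.Millisecond, 500))
	fmt.Println("trip at 5 of 10 failures:", shouldTrip(cfg, 5, 10))
}
```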
We take the previous slide and we turn it into this. Service B is now failing for whatever reason. For a little while, we’re going to be returning the fallback. If you’re familiar with circuit breakers, you have these in your electrical panel. If there’s a surge, the circuit breaker opens up and nothing gets through. No electricity gets through; it’s a fail-safe.
In the software world, when a circuit breaker is closed, communication’s going back and forth, and when the framework detects a problem it opens up the circuit breaker and no traffic will go through there. With this library, it will actually keep the circuit breaker open for five seconds, ten seconds, whatever you configure it to.
Once it’s tripped it stays open until the framework’s comfortable trying again. You can imagine how certain types of errors in production kind of spiral out of control. With this, if ten clients are going towards an end point that’s already a little bit overrun, each client will notice it’s not getting a good response and open its circuit. They’ll all back off and, at some point, they’ll start going back in there. Because this whole thing’s configurable, you can decide which end points you’re okay with opening circuits for and which you aren’t, so you have a more gracefully failing service.
This is what I was describing, so you have three actors here. You have the client, which is the browser, the mobile app, or any kind of consumer of your API. The circuit breaker, which, in this case, is the Hystrix Go library and the individual commands that I’m wrapping with it. The supplier is the end point that I’m getting data from to aggregate and ship out of the API.
You can see time is moving down in this graph here. You can see my arrow. In the first part here, things are great. In the second part we see a time out and the library will say, “Okay, I’m watching you.” Then there’s a second time out. One would have been okay, but at this point, two time outs in ten seconds? It’s going to say, “This is no good.”
It trips the breaker, which means it opens it up, and then the one, two, three, fourth request comes in. The breaker’s open, so we’re not even going to try the end point. It’s going to just return the “This is not available right now” page.
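The sequence in the diagram can be sketched as a small state machine. This is a simplified, single-goroutine illustration and not how Hystrix Go is implemented: the `Breaker` type, the `fakeClock` used to control time, and the specific settings (two failures within ten seconds trips, five seconds before a trial request) are assumptions taken from the numbers mentioned in the talk.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	errOpen    = errors.New("circuit open: not available right now")
	errTimeout = errors.New("supplier timed out")
)

// Breaker is a hypothetical, non-thread-safe circuit breaker.
type Breaker struct {
	maxFailures int              // failures within window that trip the breaker
	window      time.Duration    // how far back failures are counted
	sleep       time.Duration    // how long to stay open before a trial call
	now         func() time.Time // injectable clock
	failures    []time.Time
	open        bool
	openedAt    time.Time
}

// Call runs supplier unless the breaker is open, in which case it
// returns errOpen without touching the supplier at all.
func (b *Breaker) Call(supplier func() error) error {
	now := b.now()
	trial := false
	if b.open {
		if now.Sub(b.openedAt) < b.sleep {
			return errOpen
		}
		trial = true // sleep window elapsed: let one request through
	}
	err := supplier()
	if err == nil {
		if trial {
			b.open = false // the supplier recovered: close the circuit
			b.failures = nil
		}
		return nil
	}
	if trial {
		b.openedAt = now // still failing: stay open for another sleep window
		return err
	}
	// Keep only failures inside the window, then record this one.
	recent := b.failures[:0]
	for _, t := range b.failures {
		if now.Sub(t) <= b.window {
			recent = append(recent, t)
		}
	}
	b.failures = append(recent, now)
	if len(b.failures) >= b.maxFailures {
		b.open = true
		b.openedAt = now
	}
	return err
}

// fakeClock lets the example move time forward without sleeping.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(n int) { c.t = c.t.Add(time.Duration(n) * time.Second) }

func newDemoBreaker(c *fakeClock) *Breaker {
	return &Breaker{maxFailures: 2, window: 10 * time.Second, sleep: 5 * time.Second, now: c.Now}
}

func main() {
	clock := &fakeClock{}
	b := newDemoBreaker(clock)
	ok := func() error { return nil }
	timesOut := func() error { return errTimeout }
	// Success, time out, time out, then a fourth request while open.
	for i, supplier := range []func() error{ok, timesOut, timesOut, ok} {
		fmt.Printf("request %d: %v\n", i+1, b.Call(supplier))
		clock.AdvanceSeconds(1)
	}
}
```

The fourth request would have succeeded if it had reached the supplier, but the breaker short-circuits it. That is the trade being made: a few lost successes in exchange for giving an overrun end point room to recover.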
That’s what Hystrix does, and now to switch over to the actual dashboard. If I put this in front of anyone here who doesn’t know what this is, I could ask, “Which end point would you say gets the most traffic?” It’s pretty clear that this top-left one, VideoMetadataGetEpisode, does. It’s because of the size of the green circle. Or if I told you that, you’d know from then on.
I like this UI in that you can take a quick glance at it and see kind of what’s going on, and then you can look more closely once you figure out which of these nodes you want to focus on. When there’s a failure these circles turn red, and when there are lots of failures they’ll turn more and more red.
I also want to mention that Netflix has more than ten servers. In fact, here you can see in this first one that they have 476 different VideoMetadataGetEpisode services out there on their network. They show you the 90th percentile time, and the 99th and the 99.5th percentile times. They show the circuit’s closed, which is green, which is good. Let’s dive down more into exactly what you’re seeing here.
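For readers less familiar with percentile times, here is one common way to compute them, the nearest-rank method. How the dashboard itself calculates these figures is not covered in the talk, so treat the method, the `percentile` function, and the made-up sample data as assumptions for illustration.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 < p <= 100) of the
// given latencies in milliseconds. It returns 0 for an empty input.
func percentile(latenciesMs []int, p float64) int {
	if len(latenciesMs) == 0 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]int(nil), latenciesMs...)
	sort.Ints(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// sampleLatencies returns made-up latencies of 100, 99, ..., 1 ms.
func sampleLatencies() []int {
	out := make([]int, 0, 100)
	for ms := 100; ms >= 1; ms-- {
		out = append(out, ms)
	}
	return out
}

func main() {
	l := sampleLatencies()
	fmt.Println("90th:", percentile(l, 90))
	fmt.Println("99th:", percentile(l, 99))
	fmt.Println("99.5th:", percentile(l, 99.5))
}
```

The higher percentiles are the ones that expose a slow tail: an end point can look fine at the 90th percentile while its 99.5th shows the requests that are about to start timing out.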
You can look at this and see whatever you need to see. It’s cool to stare at this graph while it’s moving, but when you see a problem you get a lot more information from this. Like I said, the size and color of the circle tell you traffic volume and health. You have two minutes of request rates. You get to see how many hosts are running this service.
When you see a circuit closed, if they actually have 476 servers, it’s not all or nothing. Each node that’s hitting those services decides for itself whether its circuit is open or closed. If half of them were open, it would say, “200 and something open. 200 and something closed.”
It will show you that the app is starting to fail. Because, as I just mentioned, they have tons of services, they have a separate service called Turbine. If you have a single host, that host can feed right into the dashboard.
The dashboard actually pulls in data from the single host. Once you have 476 of these, or even just two, you can feed them into Turbine. Turbine aggregates them all and spits out a single combined feed to the dashboard.
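The aggregation step, including the per-host open and closed counts described earlier, can be sketched as a simple roll-up. This is only an illustration of the idea and says nothing about how Turbine is built or what its feed looks like: the `HostSnapshot` and `CommandTotals` types, the host names, and the numbers are all hypothetical.

```go
package main

import "fmt"

// HostSnapshot is a hypothetical per-host report for one command.
type HostSnapshot struct {
	Host        string
	Command     string
	Requests    int
	Errors      int
	CircuitOpen bool
}

// CommandTotals is the combined view across every reporting host.
type CommandTotals struct {
	Hosts    int
	Requests int
	Errors   int
	Open     int // hosts whose circuit for this command is open
	Closed   int // hosts whose circuit for this command is closed
}

// aggregate folds per-host snapshots into one total per command.
func aggregate(snaps []HostSnapshot) map[string]CommandTotals {
	totals := make(map[string]CommandTotals)
	for _, s := range snaps {
		t := totals[s.Command]
		t.Hosts++
		t.Requests += s.Requests
		t.Errors += s.Errors
		if s.CircuitOpen {
			t.Open++
		} else {
			t.Closed++
		}
		totals[s.Command] = t
	}
	return totals
}

// sampleSnapshots returns made-up data: four hosts report one command,
// half of them with an open circuit, and one host reports another.
func sampleSnapshots() []HostSnapshot {
	return []HostSnapshot{
		{"host-1", "VideoMetadataGetEpisode", 120, 0, false},
		{"host-2", "VideoMetadataGetEpisode", 110, 2, false},
		{"host-3", "VideoMetadataGetEpisode", 40, 30, true},
		{"host-4", "VideoMetadataGetEpisode", 35, 28, true},
		{"host-1", "GetFavoriteColor", 10, 0, false},
	}
}

func main() {
	totals := aggregate(sampleSnapshots())
	fmt.Printf("%+v\n", totals["VideoMetadataGetEpisode"])
}
```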
This is the Hystrix Dashboard right when you get here, when you have nothing, and it’s asking you for a Turbine stream. In this case I don’t have Turbine set up. I have a single Go service that’s just got the Go Hystrix library, or Hystrix Go, I can’t remember which it is.
I’m going to go ahead and show you. Now here, this is a lot of misbehaving services, but it’s a lot more interesting to look at than what you saw before. You can’t see my thresholds here, but you can see the APISlowestServiceEver.
I think I have that timing out at one second, and I think I have it set so every response takes two seconds, so that circuit’s open. If there’s a problem in production and you come look at this, you can see the SlowestServiceEver is being slow once again.
Then you can see relative volumes with everything else here. Some of these are failing and getting back into a good state. Failing again. Up here, you can sort by error then volume, alphabetical. When you have what, fifteen? ten, twelve? Maybe you don’t need to sort a lot but when you have a whole bunch of different services this becomes more useful.
Down at the bottom, this is more on thread pools, which I don’t know if this applies to Hystrix Go. I don’t think they implemented this, but the thread pool thing is where you can say, for example, that ten clients can access this end point at once. By that they mean ten threads from the same host. You can limit concurrent requests, I guess, to an end point.
That’s it. Getting the Hystrix Dashboard up and running takes no time at all. You just drop this thing in a Tomcat instance and it just starts working. In fact there’s really no configuration, because you just point it at the different end points that you want to watch.
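Go has no thread pools in the Java sense, but the same limit on concurrent requests can be sketched with a buffered channel used as a semaphore. Whether and how Hystrix Go does this is not established in the talk, so the `Limiter` type and its methods below are hypothetical and only show the general technique.

```go
package main

import "fmt"

// Limiter caps how many calls to one end point may be in flight at once.
type Limiter struct {
	slots chan struct{}
}

// NewLimiter allows at most max concurrent calls.
func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// TryAcquire takes a slot if one is free and reports whether it did.
// It never blocks.
func (l *Limiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot taken by a successful TryAcquire.
func (l *Limiter) Release() { <-l.slots }

// Do runs command if a slot is free; otherwise it rejects the call
// immediately and returns the fallback instead of queueing.
func (l *Limiter) Do(command, fallback func() string) string {
	if !l.TryAcquire() {
		return fallback()
	}
	defer l.Release()
	return command()
}

func fetchColor() string   { return "blue" }
func notAvailable() string { return "not available right now" }

func main() {
	l := NewLimiter(2)

	// Simulate two requests that are still in flight.
	l.TryAcquire()
	l.TryAcquire()
	fmt.Println(l.Do(fetchColor, notAvailable)) // no slot free

	// One of the in-flight requests finishes.
	l.Release()
	fmt.Println(l.Do(fetchColor, notAvailable)) // slot free again
}
```

Rejecting immediately rather than waiting for a slot keeps a slow end point from tying up every caller on the host.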