Fog Creek

Building Resilient Services with Go – Tech Talk

In this Tech Talk from GopherCon 2015, Blake Caldwell, a former Software Engineer here at Fog Creek who worked on the Kiln team, explains how he used Go to re-write and speed up KilnProxy, our SSH Reverse Proxy. Hear how he was able to re-write the service and reduce clone times by half, whilst making it more reliable and less noisy.

Blake writes about Go and software development on his blog. He has open-sourced the profiler mentioned and you can get the slides from the talk on his GitHub.

About Fog Creek Tech Talks

At Fog Creek, we have weekly Tech Talks from our own staff and invited guests. These are short, informal presentations on something of interest to those involved in software development. We try to share these with you whenever we can.

Content and Timings

  • Introduction (0:00)
  • Background (0:22)
  • About KilnProxy – SSH Reverse Proxy (1:33)
  • Results (3:35)
  • Handling Errors (5:20)
  • Channels (7:17)
  • Handling Panics (8:09)
  • Avoiding Race Conditions (10:00)
  • Implementing Timeouts (11:20)
  • Profiling (13:44)
  • Logging (21:20)

Transcript

Introduction

Blake:
Hello. This is another rewrite story that it went so well that I want to share with everybody. Like many talks, I’m going to be touching on a lot of things, I’m not going to deep dive on any, I just want to give you as much exposure as I can to things and tools and techniques that I found very helpful when writing my first Go production service.

Background

To give you some background, last year I was working at Fog Creek Software out of New York. If you guys have heard of us, you might be familiar with Kiwi. This is the FogBugz mascot. I was working on Kiln. Kiln is Fog Creek’s Git and Mercurial source code hosting service, it’s like a GitHub that works with Git and Mercurial. Before I worked there I hadn’t heard of Mercurial. I’m very familiar with it now. Most of my work was in C# and Python, I tried to gravitate more towards the back end stuff.

Also last year I was lucky to attend Google I/O. I went there having no idea what Go was, I actually had never heard of it. Now everywhere you look, it’s everywhere. I don’t have to explain to you why I fell in love with it so quickly. I went to every one of the Go talks I could, I met some of the Go authors, and it was awesome. I knew that I had a solution to a problem that might come up some day.

I needed to ship something awesome in Go. There’re hobbies, there’re fun things I could do in my spare time, but I wanted to prove something at work, I wanted to use Go to make something awesome for Kiln and to make Kiln better. I looked around for something to rewrite, there’s always something you can rewrite. I settled on our SSH Reverse Proxy. I also didn’t know what a reverse proxy was when I started working there.

About KilnProxy – SSH Reverse Proxy

To give you some background on what this is, when you’re dealing with Git or Mercurial there’re two ways to interact with a remote server. There’s HTTP and SSH. SSH uses public-private keys, it’s more secure. If you’re going to be cloning a repository from Kiln and you’re using SSH, you’re going to be talking to KilnProxy. You’re one of those people on the left there. When you talk to KilnProxy, KilnProxy has to make sure you are who you say you are by authenticating with your key and then it looks up to see where your data’s actually stored, we have lots of backend servers so it finds out which one is yours, where your repository sits. It opens up a connection to the backend server, it has a connection to you and then all it has to do at this point is proxy all the communication back and forth.

Why would I rewrite this? It was working. At the time, our SSH clones were a lot slower than HTTP and this really shouldn’t be. If anything, it feels like SSH should be faster. Also, we had some stability issues and because of this, our system administrators decided to just restart the service every day. You can imagine if you have a giant repository and you’re trying to clone it, you’re twenty minutes into a clone and then it’s time for us to restart the whole server. You’ve just lost everything, you have to start over, it’s not cool.

This actually turned out to be a perfect project for Go because there’s tons of concurrency. To give you an idea of this, when you first connect, we’re listening on a loop, we kick off a Go routine. That Go routine is responsible for authenticating your public and private key, or your private key. It then connects to the backend server and once it’s got that connection, it then proxies standard in, standard out and standard error, each on their own Go routine.

Results

How’d it go? It went well. Way better than I could’ve imagined. I just wanted to get it to work and things went very well. Let me show you how well they went. Every few minutes or so, we would clone a small repository and we would keep track of the timing. This right here is when we transitioned from our Python implementation to Go. I’ll let you guess where it is, it’s a little bit past the pink mark there. This is just at the launch of the rewrite. At this point now, SSH and HTTP have parity.

You could see before, this was like a 1-megabyte repository but before it was taking about one and a half seconds, afterward it was down to .75. It was almost exactly twice as fast as before. This does scale up as the repository gets bigger, so we saw some pretty good gains here. Also, you can notice that there’s a lot less noise also. What we have is a faster, more reliable, less noisy service.

Being my first Go service, I had to figure out how to write a stable service in Go and I’d like to share some of these tips here. What are we talking about with resiliency? We’re talking about writing a service that doesn’t crash, that doesn’t have to be restarted every day. It doesn’t leak memory and it doesn’t hang. Doesn’t get stuck. Obviously, this is a big process, there’s no magic bullet here. This is a process that spans development, profiling before launch and then after launch monitoring the service.

Handling Errors

Let’s start with error handling. Luckily, not too many speakers have talked about … Have had to say we should handle every error. I think everyone here understands we should handle every error. I came from Java where I’m used to very terse code with exceptions that I have no idea if they’re thrown or how to handle them. We just don’t worry about that in Java.

We’ve all seen this, this is the pattern. You get a resource or you get a return value from a function and its error and you don’t use it under score for the error. Then after we check the error, we break out of the function if there’s an error. We defer the clean-up or the close. One thing I always make sure is to put no new lines in this little block here as I visually scan code, I want to see that this is one unit that can’t be broken apart.

It looks like here we’re handling all the errors and so we’re good. We know that we should be checking for nil sometimes. Back to our example, what if OpenResourceA can return nil and what if it’s not an error condition? Maybe it’s a rare case that we’re trying to open the resource and for some reason it’s offline but this is not an error technically. Within our defer statement, could panic.

Not necessarily, of course we have the technology to avoid this and one way we can do this is with an inline function, an anonymous function. We can just do our little check there and if it’s not nil, then we’ll close. The one problem I find with this is that it’s gross. I don’t like this. Can everybody see this? I don’t know if that’s big enough.

One way I like to handle this is for methods that I know will be deferred, I like terse defers, I like that to be nice and clean. One thing I actually forgot until recently is that if a method on a struct receives its struct by a pointer. If that struct pointer is nil, that function’s still called and it’s passed in a nil for its pointer. In this case here, I’m making my deferred statement, the statement that I know will be the clean-up method. I actually check for nil here and then we get back to our original example where DeferResourceA.close is actually nil proof. I find this to be a good pattern.

Channels

Let’s talk about channels. We all are familiar with channels and how much fun they are and they are fun, they are awesome but if you don’t know what you’re doing, you’re going to cause problems. I’m not going to deep dive on this because deep diving on channels would require half a day but I like to reference Dave Cheney’s blog post here which has gotten me through some tough times. It’s entitled Channel Axioms and literally every time I touch a channel, I just review this. I had it written down somewhere because I want to make sure that I don’t misstep with channels.

The first three here are the ones I want to focus on is that if you have a nil channel and you try to write or read from it, it’s just going to block forever. When you block forever on a go-routine, that go-routine never exits and when a go-routine never exits it never frees up its resources and any resources like local variables it’s holding onto. When you send to a closed channel, you get a panic.

Handling Panics

Let’s talk about panics. They’re usually due, maybe always due, to program our error and if you get a panic it’s going to crash your service. I’m going to upset probably half the people in the room and say that I sometimes like to recover from panics. It’s a contentious issue, people say, “Well, it’s a programmer error. If you have a program error, you should let it crash and you should fix it.”

That’s true, but I do make errors and they do make it to production and I want to scale back their damage if possible. Without going into the details, you can recover from panics. You should not treat them like exceptions, it’s not what I’m doing. When I’m setting up a function, you saw that error block. I make sure that everything I’m setting up will be cleaned up when I leave the function. That way, if a panic happens, all those would be fired.

I try to limit areas of the code, I set aside some areas of the code where panics are allowed to happen, I catch it, I do log it and I do take it very seriously and try to fix those bugs. Let me give you an example. This SSH proxy was very complicated. If I’m being honest, I didn’t read the full SSH spec beforehand. We did have one customer who was using a certain build server that was using Git in a certain way that I didn’t know until afterwards. It was crashing and we had hundreds of other customers, thousands of other customers that had no problems, didn’t have this problem.

If I had let this panic creep all the way to production, we would’ve had a fire and we would’ve had outages and we would’ve had to restart the service over and over. In that example, I handled the panic on the go-routines I mentioned earlier where you have … Which is just handling one client’s request. At the high level, the top level, if something goes wrong in the main loop, then that’s fine, that will crash.

Avoiding Race Conditions

Let’s just pretend that the author of the race detector wasn’t just up here. I don’t need to say a whole lot about this, but race conditions, when you have all this concurrency, you’re going to have races. Again, I said I was from Java in the past and I’ve used all the concurrency stuff you can do there and I know all these tools exist for Java and despite that, in the ten years I was working with Java, I never bothered to look.

It’s super easy in Go because it’s part of the main tool suite here. Just like we heard earlier, in a report where a variable access is not synchronized, and when it does, it will crash for you and it will show you full stack trace, including exactly where the read and writes were and you should use this during your unit tests for sure but also development and integration testing.

Again, here’s the output. This is awesome, tracking down this type of bug would’ve taken forever if you’re able to even catch it happening at all. Here we could see that the error happened where race.go line 14 was trying to read while race.go line 15 was trying to write. This literally takes minutes to solve. Again, you can enable this with the -race command line option on both test, run, build and install.

This next bit isn’t specific to Go but I think it’s important to address if you’re trying to make a service that won’t crash and that could handle some problems at run time.

Implementing Timeouts

Implementing timeouts. We need to guard against some situations here. The big one is network timeouts. Our software, our programs are connecting to remote servers quite a bit. For that, we dial and then we connect and then we transfer data and then we’re all done.

Best practice is, if you are trying to dial, you should be dialling with a timeout. If you look at the standard library, those dial functions usually have a dial with timeouts. Let’s say it’s reasonable to expect a dial to another server to take two seconds. Make that time out twenty seconds, it doesn’t matter, just make it so that some point if there’s a problem you don’t hang forever. If you have twenty of these connections going a hundred, if all day long you’re trying to reach the server and it never answers, you’re just going to run out of memory or have a crash.

Once you have dialled, then you have to connect and you should also have a network connection and activity timeout. If it’s taking you … Actually, no, in this case, a connectivity timeout is if you’re transferring data back and forth and if you haven’t seen data in ten seconds, fifteen seconds, then you should just close connection and log it out. Also, you’re connected for a long time, maybe a minute’s reasonable, so a timeout after five minutes, a timeout after a number that’s ridiculous if it got that far.

This next one goes without saying. I’m not going to deep dive on this but let’s not skimp on the tests. Tests are super important. For me, I don’t like to remember things so after I solve a problem, I like to set up a back stop where the test has my back, and when I hand off the code to somebody else or a future me, it’s still protected against these bugs. I’ll give you an example here, I’m a strong believer in integration tests and you should look in the Docker if you haven’t for these kind of tests.

For an example in KilnProxy, there’s a lot going on with Mercurial and with Git when they’re talking over SSH. What I set up was an environment in a Docker container, Docker Image which I run as a Docker container, which will actually make it get commands to a running server through my KilnProxy code, so I’m trying git pool, git clone, git push, all those things. Same with Mercurial. The nice thing about this is now maybe a new version of Git comes out, we can just have a separate instance of that container with a new version of Git and we could have all these tests running in parallel, always making sure that our proxy’s not broken.

Profiling

We’ve just wrapped up development. We’re now starting to test out our service so we understand how it behaves. Let’s start with, “How does it use memory?” We profile it and like probably dozens of you in the audience here, I made a profiler. This is on Fog Creek’s page here if you want to take a look. It’s fun to watch, I don’t have an animation but it does actually update every second. This shows me here how much memory the service is using as far as the operating system is concerned. Within that it shows me how much is actually in use and how much the run-time is trying to give back to the system. In this case here, the purple line is memory that’s ready to be freed back to the system, sometimes the system takes it, sometimes it doesn’t.

You’re able to come here and see some pretty good information really quickly. The kind of things you want to watch are how much memory does the service use when nobody’s connected? This is basically your baseline. Then when you receive connection, how much memory does that require. When memory ramps up and then back down, does your system reclaim your memory? That’s going to be dependent on the system itself and the memory pressure that it’s under. If you want to deep dive, you can take a look at what the garbage collector is actually doing by running your program with the GODEBUG environment variable set to gctrace=1. This is super cool to watch.

Also, you would like to know where your memory’s allocated? The memory that’s in use, how did it get in use and for that, it’s PPROF. I know we’ve seen this already at the conference but if you guys haven’t used this yet, it’s going to change your life. It’s super cool. What PPROF lets you do is it lets you take a look at your service’s blocking profile. The goroutine count and stacktraces and the heap profile and stacktraces that lead to thread creations.

This is an intimidating tool but it’s super easy to use. All you have to do is import the PPROF package which will set up a whole bunch of HTTP endpoints for you. It’s your job to pick the port, pick the IP and listen and serve. Super easy to set up. Once you do, you hit this page. This is just the main page but that takes you to the real meat of the output here. Right away, you already see some useful information. Right here, we can see that we have 32 goroutines in use. Let’s say this is my baseline, let’s say no users are currently connected to my local dev system, my system requires 32 goroutines.

It’s important that we don’t leak goroutines as I mentioned earlier. We want to avoid these leaks because the memory leaks and … Using PPROF, we can tell how many goroutines are in use when nobody’s connected, we can then connect with one user and keep that connection open and look again. When it’s all done, when you’ve run a hundred connections or a thousand connections, you’ve run it overnight and everyone’s done talking to your service, you should see that first number again. You should see 32 again. If you see 33, you have a problem, you should fix it.

Let’s say I have 33. Where’s that extra goroutine? Back to that original page, you click on goroutines and you see a full stacktrace. We’ve all seen stacktraces and usually it’s with great sadness, our service crashed. Here you’re seeing a stacktrace of your service while it’s running. You refresh this page and you’ll see the next snapshot of what’s going on. This is awesome. You see which line each goroutine is sitting on.

PPROF is awesome from the webpage but where it really shines is the command line because you can run your command line, your PPROF command line locally, it’s going to connect to these web endpoints and most of these endpoints are made for computer consumption anyway. It’s going to interpret this for you. From the command line, we’re going to run go tool PPROF and then the location of the binary, that’s running out on the server. We’re going to give it the HTTP endpoint to the goroutine page.

Now we’re going to enter an interactive terminal. You type in top five to see the top five locations that these goroutines are waiting. You can do top ten or top nine, it’s pretty cool. This is fun, so if you have a problem, you’re going to track that down but then what I really like to show off is you type in web, and after you install a bunch of stuff, this works. It pops up a browser showing you your call stack and showing you how all these goroutines came to be. This is a simple example, once your server gets bigger, you’re going to have a lot more of these boxes. This is a cool SVG graphic that you can zoom in, pan around on on your browser. You’re going to track things down, you’re going to find out exactly how these goroutines respond.

Again, PPROF can help you out with your heap memory as well. Here I can see, by again, so running go tool PPROF, the server which is my binary and then the location of the heap endpoint. I run top five and I can see here, “Oh it looks like my biggest offender is my profiler.” Then I noticed that’s okay, it’s only using 1.8 megabytes of my total 2.3. That’s what I expect. You can dive down further there and just like before, you can type in the web command and you get your SVG opening up in a browser again. This is a super simple example, yours will look a lot more complicated. You see the call stack, the call trace, the call stack that shows you where this memory is being allocated.

This is great to see because a lot of times you’re using a lot of memory, you don’t know why because it seems to be out of your hands, it seems to be in the library that you’re using. You could dive down and maybe you can patch something, you can fix that library or maybe just understand how it works a little better.

At this point, we’ve built our system, we’ve profiled it in testing, so we know how it behaves and we’ve deployed its production. So at this point, we have to be able to tell what’s deployed is the first thing. Sometimes I don’t trust our deployment process. Sometimes I also don’t trust how we keep track of what code is in a release. I have an endpoint that I will expose to our intranet so it’s not exposed to the outside world. Called info and I showed the version, I showed the start time in epoch seconds. I showed the server’s current time in epoch seconds so that you don’t have to worry about time zones, you can do some pretty complicated math to figure out how many seconds have elapsed or you could just look at the up time, human readable form at the end there and I see that I have 167 hours and 10 minutes and 2 seconds and I love seeing that number go up.

I’m going to take a brief aside to show you about this version number because I think this kind of a cool tip. In this case here, the version number I like to use is a variant, there’s lots of different things you can do here. You have major, minor release. Then you have the commit number on the tree that I deployed. That should always go up as you deploy new code. You have the Git SHA of the commit that was deployed which is great because a lot of times, you tag stuff, there’s a build process but it’s not you doing it so you don’t trust it. If there’s a bug, if we just launch something, someone asks, “Did you fix that bug? Is that out in production yet?” I take that SHA, I can do a Git checkout or I could just look at the Git logs and find out where the fix made it. Maybe it already occurred at this point. At the end there I have the date and the time of the build.

To talk about how I’m able to generate this, obviously you can’t expect people to know the Git SHA of the upcoming commit, that’s not reasonable to expect. What we could do here in Go is you have a global variable in your code, call it service version, call it whatever you want and then from the Build Script, from your command line when you’re building, you can set this global variable right there from the command line. Our code’s already committed, the SHA already exists and then use the -ld flags to set main dot service version to that string that we build up in our Bash script.

That’s super useful when you’re deploying ten instances, a hundred instances, thousands of instances of your service on different machines. You could automate making sure that they’re all at the right version, the same version. Now we need to see what our service is doing, back to monitoring what the service is actually doing.

Logging

Obviously, keep good logs. We all like logs and one little tip I can share here and I’m sure this is not my invention, when you receive a new request, come up with a semi-random string, pass that all the way through all the functions or use a context or something to pass it around so you can include that as your prefix on every log entry. This is super useful because when you have thousands of users connecting with concurrency, you look at your logs, it’s a giant mess. You can use Grep to easily find out what’s going in with one connection.

Next, we want to know who’s currently connected. This might not be useful for many of your services which respond within 50 milliseconds. Kiln connections as I say can take minutes when you’re cloning a large repo. I noticed a lot of traffic, I want to see who’s connected. I have another endpoint called connections. Here I can see that I have one user, it’s the Aviato account. Erlich Bachman is connected. It looks like it says build server, people tend to name their keys build server laptop. He’s been connected for twenty-five minutes and four seconds so that might seem a little fishy. Maybe I’ve noticed some problems recently. I’m going to take that session key which is my log string, I’m going to go through the logs and find out exactly what’s been going on throughout this session.

Since I’m keeping track of how many users are currently connected, I can implement drain and die. I never thought of this, I talked to our system administrators and they told me about this. What this means is when system administrators want to kill the app, kill the service because of an upgrade or just a planned restart, I’m not going to let the process end until I’ve served all the current connections. It listens for the sigterm, which you can do in Go, you can listen for system events like sigterm and once I get my sigterm, I stop listening for new requests and I will just sit there and loop and sleep until all the existing requests are done.

Game day. All this work that went into making a responsible service was put into … This was the first time it was tested, it was kind of a pretty cool day. The system administrators got an alert from Nagios and they said, “Blake, KilnProxy’s broken. It’s using 40 MB more than it normally does.” We all know that KilnProxy wasn’t broken. Let me quickly view my profiler and not an actual screenshot but I was able to see yes, okay, there actually is … Memory usage is up there, the system is reporting what I see. I can see that the memory is currently still in use.

Let’s dive down a little bit more and I look at my connections page, my endpoint. I could see Initech is connected ten times. Peter Gibbons is doing something he shouldn’t be doing or he’s having a problem or we’re having a problem. To back up a little bit, during development, I know that each connection takes about 4 Megabytes of memory. Using PPROF, I know that most of that 4 Megabytes was out of my hands. I didn’t feel like diving down any deeper but it was mostly SSH library internals, encryption stuff.

Wolfram Alpha tells me that 4 megabytes times 10 is 40 megabytes roughly. Customer service reaches out to Initech and yes, sure enough, they were having a problem with their build server, I think, I don’t know, I forget. They turn off their build server, restart it, we work with them and they’re happy that we noticed their problem before they did. Because I implemented timeouts, I know that all those ten connections eventually will drain, will be closed. Because of my production profiling, I know that if the system is under memory pressure, all that extra memory I was using will be reclaimed by the system.

Uptime: preserved. As far as I know, I think the service has been, it’s been running for six months or so and I think it’s been restarted three or four times. I haven’t checked Wolfram Alpha, I believe that’s more than once a day. Things worked out really well and it was a good experience. Based on this, this was our first prototype, first production use of Go at Fog Creek. It was met with a lot of skepticism and since it’s been running for so long, months at a time without being restarted, it convinced everybody there that this is a technology worth exploring. It was a great experience for me and I want to thank you for listening.