Eight Fallacies of Distributed Computing – Tech Talk

In this Tech Talk, Stephen, a Software Engineer here at Fog Creek, explains the Eight Fallacies of Distributed Computing. He does so with recent events and personal experiences that illustrate each of the fallacies, showing how life, physics, and even sharks can conspire against us.

About Fog Creek Tech Talks

At Fog Creek, we have weekly Tech Talks from our own staff and invited guests. These are short, informal presentations on something of interest to those involved in software development. We try to share these with you whenever we can.

Content and Timings

  • Introduction (0:00)
  • The Network is Reliable (0:37)
  • Latency is Zero (2:17)
  • Bandwidth is Infinite (4:24)
  • The Network is Secure (6:10)
  • Topology Doesn’t Change (7:24)
  • There is One Administrator (9:25)
  • Transport Cost is Zero (10:29)
  • The Network is Homogeneous (12:53)

Transcript

Introduction

So the talk is The Eight Fallacies of Distributed Computing. The Eight Fallacies are something I heard about at a JavaOne conference a long time ago, from a guy named James Gosling. He attributed them to someone named Peter Deutsch, and basically a bunch of guys at Sun had come up with the list. What the fallacies are is the opposite of rules: a set of assumptions about distributed computing that people often forget are false.

The Network is Reliable

So, first fallacy: the Network is Reliable. When I first started at NeXT Computer, one of my first jobs was to go out and teach people how to program in Objective-C and NeXTSTEP. One day I was in Michigan, I was teaching the class, everything was good. Then everything goes off. We look around, the emergency generators come on, because back then they had mainframes, so they had to turn on this generator. And we look outside and there's a blackout right outside the building. The power was out. So, number one, blackouts are bad for computing. Bad for networking.

Now, there's something else that is bad: people. In April of 2009 someone crawled into a manhole and chopped through the fibre optic cables that fed San Jose, California, and several other areas. A few years later, also in April, so maybe it's something related to tech, I'm not sure. Boom! They did it again. Right, people are bad.

But there's something else you have to worry about for your network, OK, and that's sharks. Google and others wrap their cables in Kevlar, because most undersea cables carry both power and fibre. The power gives off electromagnetic radiation, the sharks think it's a fish that's freaking out, and they come to eat the fish, which destroys the cable, which then becomes very expensive to fix.

Erm, so, the network is not reliable.
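
The talk doesn't show code, but the practical consequence of an unreliable network is that anything which talks over it has to expect failure. Here is a minimal, hypothetical Python sketch of one common coping pattern, retrying with exponential backoff and jitter; fetch_status and the retry parameters are invented for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1):
    """Call `operation`; on failure, wait and retry with exponential backoff.

    `operation` is any zero-argument callable that raises OSError (or a
    subclass like ConnectionError / TimeoutError) when the network lets us down.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError as err:
            if attempt == max_attempts:
                raise  # out of patience: surface the failure to the caller
            # Exponential backoff with jitter: 0.1s, 0.2s, 0.4s, ... plus noise,
            # so a crowd of clients doesn't retry in lockstep.
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, base_delay)
            print(f"attempt {attempt} failed ({err!r}); retrying in {delay:.2f}s")
            time.sleep(delay)


if __name__ == "__main__":
    # A deliberately flaky stand-in for a real network call (hypothetical).
    def fetch_status():
        if random.random() < 0.7:
            raise ConnectionError("backhoe, blackout, or shark")
        return "ok"

    print(retry_with_backoff(fetch_status))
```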

Latency is Zero

What else?

The network is not instant. Latency is zero – fallacy number 2. There's a company, I think they are actually in Missouri, called Spread Networks, that bought the rights to lay fibre optic cable from New York to Chicago to shave a few milliseconds off the time it takes for the signal to get back and forth. And they did that so they could make more money selling the rights to use their cable to other companies. Now, sad thing for Spread Networks, another company is coming in now and trying to use microwaves and millimeter waves to shave another few milliseconds off, so that they can be even faster and sell that to traders. Nobody else cares that much at that level. But latency is zero? No!
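
Why can microwaves beat fibre at all? Light in glass travels at roughly two-thirds of its speed in air, and a line-of-sight microwave path can be straighter than a buried cable route. A rough back-of-the-envelope sketch, with approximate distances that are not Spread Networks' actual route:

```python
# Rough one-way latency floor, New York to Chicago (approximate figures).
C_VACUUM_KM_S = 299_792                 # speed of light in vacuum/air, km/s
C_FIBER_KM_S = C_VACUUM_KM_S / 1.47     # typical fibre refractive index ~1.47

fiber_route_km = 1_330      # a buried cable has to follow a route (approx.)
microwave_path_km = 1_150   # line-of-sight towers can hug the great circle (approx.)

fiber_ms = fiber_route_km / C_FIBER_KM_S * 1_000
microwave_ms = microwave_path_km / C_VACUUM_KM_S * 1_000

print(f"fibre:     ~{fiber_ms:.1f} ms one way")                  # ~6.5 ms
print(f"microwave: ~{microwave_ms:.1f} ms one way")              # ~3.8 ms
print(f"saving:    ~{fiber_ms - microwave_ms:.1f} ms each way")  # a few ms, worth millions to traders
```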

One of the sysadmins at a college is sitting in his office and a department professor comes in. It turns out to be the Statistics department in this story, that may be a lie, but the story I hope is true. And he says, 'we can't send e-mail more than 500 miles.' So the guy's like, 'I don't believe you', that's his first reaction, right, this can't be true. So he starts sending emails, and sure enough, he can't send emails more than 500 miles. So then he's thinking, is it really geographic, is it really 500 miles, or is it that the person is 500 miles away? So he tries sending email to someone who is local but whose email server was in Seattle; he, I think, was in South Carolina. No good. So it turns out some consultant had very smartly upgraded them to a Solaris server, this is an old-school story, and in the process he had downgraded the mail software on the server to a version that didn't understand the config file. So it no longer had a timeout big enough to talk to a server more than 500 miles away, because the timeout was set to about 3ms, and the time it takes for a signal to travel 500 miles and back is more than 3ms. Latency is not zero.
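
The arithmetic behind the punchline, sketched with the roughly 3ms timeout from the story: in 3 milliseconds a signal simply cannot get very far, so any mail server beyond that radius times out before it can answer.

```python
# How far can a signal get before a ~3 ms timeout fires?
timeout_s = 0.003
c_miles_per_s = 186_282                       # speed of light in vacuum
c_fiber_miles_per_s = c_miles_per_s / 1.47    # and in optical fibre

print(f"light in vacuum: {c_miles_per_s * timeout_s:.0f} miles in 3 ms")        # ~559 miles
print(f"light in fibre:  {c_fiber_miles_per_s * timeout_s:.0f} miles in 3 ms")  # ~380 miles

# So even in the theoretical best case, a mail server much more than
# ~500 miles away cannot complete the exchange before the timeout fires.
```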

Bandwidth is Infinite

Fallacy number 3: Bandwidth is Infinite. OK, so think about something like a message broker. There's a problem with bandwidth, and that is that bandwidth isn't just how much you can stick on your network, it's also how much the different parts of your network can handle. So if you have a message broker, and everything goes through the message broker, you can't go any faster than the message broker. Right, data can't flow any faster than the message broker can process it. The thing in the middle is holding everything else up. Fallacy 3, bandwidth is infinite? No. Because it's the same with a database: if you have one big database, your world is constrained by that one big database. If you can have multiple database shards, and they are independent, that's very important, then your system isn't dependent on any one thing.
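
Not from the talk, but here's a minimal sketch of the sharding idea in Python. The shard names and the hashing scheme are made up for illustration:

```python
import hashlib

# Hypothetical, independent shards: each one handles its own slice of the keys,
# so no single database (or broker) has to carry all of the traffic.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str) -> str:
    """Deterministically map a key (e.g. a user id) to one shard."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, "->", shard_for(user))
```

One design note: plain modulo hashing like this reshuffles most keys whenever you add a shard; real systems often use consistent hashing so that growing the cluster only moves a small fraction of the keys.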

So, bandwidth is not infinite. But! There's good news: you can add bandwidth. Latency isn't zero, and there's nothing you can do about that. Even if you were somehow a massively great programmer and your system ran right at the speed of light, you couldn't do any better; your only choice is to move things closer together. But with bandwidth, you can at least add more, and you can design systems differently. So there are ways to add bandwidth to a system.

So, what have we got? The network is not reliable, it's slow, it's very limited, [sarcasm] but it's secure. So that's good. The network is secure, that's good, we have that going for us. [/sarcasm]

The Network is Secure

The biggest example recently, I'm going to call it the second biggest now, of the network not being secure is the Heartbleed issue, where people were able to connect to servers running OpenSSL and get back random bits of memory from the server, even on a failed connection. And because those random bits of memory were often near the code that was authenticating users, sometimes they were getting user data just by hitting the server, even though the server was doing the appropriate thing and just saying 'sorry, you can't connect because you're not authenticated.' So, we can't assume the network is secure; the network is not secure.

Moreover, there are bad people out there. In the last ten years over 30 incidents have been reported where over 100,000 user records have been lost. But that’s just minor compared to eBay’s most recent one where there were over 145 million user records lost.

Ok, so, the network isn't secure, it's not reliable, it's slow, it has limited bandwidth, but at least [sarcasm] it always stays the same. That's the good thing. [/sarcasm]

Topology Doesn’t Change

Topology Doesn't Change. So, the CAP theorem: the part we want to think about here is the P, partitioning. In the old days, back when I was learning, there was a thing called a mainframe, and there was basically a wire from the mainframe to the client and that was it. So there really wasn't a concept of partitioning; the network was just up or it was down. But in the new world, where we start to think of things like database sharding, or just distributing servers around the place, we run into these problems where maybe some of the servers still have their buddies up and others don't, or maybe they're all up but the network connecting them isn't. So if, for example, you have two data centers and they both have multiple copies of the same thing running, those guys all think everything is happy dappy, but if the line between the data centers goes down, you can get into a situation where both sides are doing the wrong thing. And that's the whole point of the CAP theorem: trying to understand what you can and can't accept when the network partitions.
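
To make that trade-off concrete, here's a small sketch, not anything from Kiln or the talk: a write only succeeds if a strict majority of replicas acknowledge it, so the minority side of a partition refuses writes, giving up availability to keep the two sides consistent. The StubReplica class is just a stand-in for real replica clients.

```python
def quorum_write(replicas, key, value):
    """Try to write to every replica; succeed only if a strict majority acknowledge."""
    acks = 0
    for replica in replicas:
        try:
            if replica.write(key, value):
                acks += 1
        except ConnectionError:
            pass  # that replica is on the wrong side of the partition

    if acks * 2 > len(replicas):     # strict majority
        return True                  # a majority agree on the value
    raise RuntimeError(
        f"only {acks}/{len(replicas)} replicas reachable; "
        "refusing the write rather than diverging (choosing C over A)"
    )


class StubReplica:
    """Toy replica: pretends to be reachable or not (illustration only)."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.data = {}

    def write(self, key, value):
        if not self.reachable:
            raise ConnectionError("partitioned away")
        self.data[key] = value
        return True


if __name__ == "__main__":
    cluster = [StubReplica(), StubReplica(), StubReplica(reachable=False)]
    print(quorum_write(cluster, "answer", 42))   # 2/3 acks -> accepted
```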

So, sometimes topology change is good, right? Sometimes we upgrade servers and we get new things, and the topology can change in a good way. But even when that happens, we can have problems. For example, when I started on Kiln and was learning about deployment, we did a deployment where somebody had taken one of the servers out of the data center. And that was OK, I mean it was intentional, we didn't use that server, we meant to take it out. Unfortunately, all of our tools thought it was still there. So the deployment kept failing, because the tools were still trying to verify a deploy to a server we weren't even using. It wasn't a mistake; it's just that the tools weren't set up to deal with that topology.

So, topology can change. So what have we got? The network is not reliable, it's slow, it won't carry everything I want, it's not safe, it keeps changing, [sarcasm] but at least, here's the good news, people: we know who to call if there's a problem. [/sarcasm]

There is One Administrator

The other thing I ran into when I started on Kiln was that I would run this script and it would get a little way and then I wouldn't have permission. And then I'd get permission, and it would get a little further, and then I wouldn't have permission. Because there were different systems, with different permissions. So people do have ways of dealing with this, like Single Sign-On, or one version of that called Kerberos, but even that has issues because, you know, ultimately there are lots of administrators.

Luckily, all administrators do what you want. Right? So in May 2013, Edward Snowden, whether you like it or not, as an administrator at the NSA, decided to take a lot of data and give it away. I'm not saying anything good or bad, I'm just saying there are lots of administrators. If there had been only one administrator, it's unlikely that would have happened. But then the NSA couldn't do what it does.

The network is not safe, there are all these people in charge, we don't know who they all are, it's slow, it's unreliable, it won't carry everything we want, and it's constantly changing, but [sarcasm] at least it's free.

Transport Cost is Zero

Transport cost is zero for the network. And that’s why in early 2014 Netflix paid Comcast to get preferential access to its network for its customers, and then they paid Verizon to do the same thing. And then they paid AT&T to do the same thing. In none of those cases was it to get preferential treatment, that was just a deal where they thought it was important to pay the phone companies and the networks because why not, we should share the money, we’re almost profitable and we should share the wealth, right? [/sarcasm]

So, transport cost is zero. Latency, remember, is the time it takes for the signal to travel from one computer to the other computer. So, one of the projects I worked on at Tibco was a product called FTL. Now why is it named FTL? Because marketing people think you can go faster than light. So when you're trying to make something go fast, you name it Faster Than Light, even though the whole point of 'latency is not zero' is that you can't go faster than light. But OK, so they named it FTL. And FTL was actually a really cool product. We were able to send messages on multiple transports, so RDMA, TCP/IP, multicast and shared memory, which is all inside one box, with the same API. And you just change it through configuration; you can actually change it live. So, with shared memory, the goal was under a microsecond, and we achieved that. Under a microsecond, in fact under 600 nanoseconds, to send a message from one program to another program, with the same API you could use to send it over RDMA in the data center in a few microseconds, like 2 or 3. Admittedly slower, but still, 600 nanoseconds is not very long. At the time it was like 20 slots' worth of time in Linux. So the entire amount of work to pack a message up, put it on the transport, send it in a generic way and get a response back, well, you don't get the response back, but you time it so that you do, right? Under 600 nanoseconds. Transport cost isn't zero.
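
FTL itself isn't shown here, but the measurement trick at the end, timing a full round trip because you can't timestamp a one-way hop without perfectly synchronized clocks, is easy to sketch. Below is a toy Python version using two threads and in-process queues; the numbers it prints will be nowhere near 600 nanoseconds, it just shows the shape of the measurement: send, wait for the echo, divide by two.

```python
import queue
import threading
import time

ping_q: "queue.Queue[bytes]" = queue.Queue()
pong_q: "queue.Queue[bytes]" = queue.Queue()


def echo_server():
    """Bounce every message straight back (the 'pong' side)."""
    while True:
        msg = ping_q.get()
        if msg == b"stop":
            break
        pong_q.put(msg)


threading.Thread(target=echo_server, daemon=True).start()

samples = []
for _ in range(10_000):
    start = time.perf_counter_ns()
    ping_q.put(b"ping")          # 'send' the message
    pong_q.get()                 # wait for it to come back
    samples.append(time.perf_counter_ns() - start)
ping_q.put(b"stop")

samples.sort()
rtt_ns = samples[len(samples) // 2]          # median round trip
print(f"median round trip: {rtt_ns} ns, so ~{rtt_ns // 2} ns one way")
```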

Now, not everyone cares about ten nanoseconds, but, you know, it all adds up. And when you're doing things like sending data across the ocean, that's 150ms, that's what it is. Einstein said it had to be. So, nobody can go faster.
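
A hedged sanity check on that 150ms, using a round-number route length rather than any particular cable: light in fibre covers roughly 200,000 km per second, which puts a hard physical floor under a long ocean crossing.

```python
C_FIBER_KM_S = 299_792 / 1.47      # light in optical fibre, ~204,000 km/s

route_km = 10_000                  # a long transoceanic path, roughly (assumed)
one_way_ms = route_km / C_FIBER_KM_S * 1_000
print(f"one way: ~{one_way_ms:.0f} ms, round trip: ~{2 * one_way_ms:.0f} ms")
# ~49 ms one way, ~98 ms round trip before any routers, retransmits or detours,
# so a real-world figure in the 100-150 ms range is physics, not sloppiness.
```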

The Network is Homogeneous

So, last one. It's hard to believe that any programmer today has this fallacy, because with the advent of mobile and everything else, people know that there are all kinds of networks out there. I mean, WiFi, you plug in your thing, you use your phone, you use your tablet, people know. But this is something you have to keep in mind, right? Unless someone has a really solid fibre connection at their house, they're going to have vagaries in their network just from being at home. And the interesting thing about the network being homogeneous: Facebook recently spent several billion dollars buying a company called WhatsApp. And one of the things WhatsApp does, that is maybe worth buying, is that they didn't care about cool stuff. They actually wrote a messaging program that works on old stuff. And slow stuff. Anyway, the network is not the same everywhere, and the nice thing is people like Google and others are building tools that let you try that out. So there are tools that let you pretend the network isn't the same.
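
The speaker doesn't name those tools, but the idea behind them is easy to sketch: wrap whatever send function you have and inject the kind of latency and loss a bad connection would add. A toy, hypothetical Python version:

```python
import random
import time


def make_flaky(send, min_delay=0.05, max_delay=0.8, drop_rate=0.1):
    """Wrap a send function so it behaves like a slow, lossy network.

    Purely a test aid: adds a random delay to every call and raises
    TimeoutError for a fraction of them, so the code under test has to cope.
    """
    def flaky_send(payload):
        time.sleep(random.uniform(min_delay, max_delay))   # pretend the link is slow
        if random.random() < drop_rate:
            raise TimeoutError("simulated packet loss")     # pretend the link is lossy
        return send(payload)
    return flaky_send


if __name__ == "__main__":
    def real_send(payload):
        return f"delivered {payload!r}"

    slow_send = make_flaky(real_send)
    for i in range(5):
        try:
            print(slow_send(f"msg {i}"))
        except TimeoutError as err:
            print("lost:", err)
```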

So, those are the Eight Fallacies.