At Fog Creek, we like to keep innovating and developing new products. Having successfully spun off Trello, we were excited to play with some new ideas. We ran a number of Creek Weeks to try out a few, and one of them turned into HyperDev.
HyperDev is a developer playground for quickly building full-stack web apps. It removes all of the setup required for web development, makes apps instantly available via a URL, and auto-deploys code changes as you type them.
From a user’s perspective, the technical side of creating and deploying apps on HyperDev is deliberately opaque – we don’t want you to have to worry about all that stuff; it’s taken care of for you. But for those interested to know more about what’s happening under the hood, in time-honored tradition, here it is: The HyperDev Tech Stack.
There are 3 key components to HyperDev: a collaborative text editor, the hosted environment, and the quick deploy of code from the editor to that environment. In testing, we found that we had to get any code changes updated in the app in under 1.5 seconds for the experience to feel fluid. Any longer than that and you’re quickly longing for your local dev setup. What’s more, we needed to do this at scale – we think millions of developers, and would-be developers, can benefit from HyperDev, so it had to be able to do all of that for many thousands of projects at once.
The Initial Mock-Up
Our early mock-up used a simple multi-file editor that would bundle up the content and post the files to a GitHub repository. A polling event then pulled that content and restarted the Node.js service. This was enough to show how we could get characters changed in the editor to a live running application in fairly quick order. Once we demonstrated the concept we had the support of Joel and our fellow Creekers to build an MVP.
Kicking off the MVP
At a basic level, we knew we would want a single-page app for the frontend editor experience, and a hosted environment that could accept the code and run it in a Node.js environment dedicated to the application.
From a design perspective, we believed that by being conservative about adding UI, and not reinventing the wheel unless we had to, we could make HyperDev easy to pick up and easy to learn, and the skills you gained using it would be transferable.
The experience we were striving for was that as soon as you were in HyperDev, you could be productive and creative, and instantly have a running application. This meant we had to make the application load fast, and present an editor ready to create any application you desired. We decided to simplify the flow of the code and removed GitHub from it, even though we knew we would want Git integration down the road.
We also decided that the frontend code would be served to the user’s web browser in static form, and it would integrate with an Orchestration API (OAPI) that would be responsible for the backend deploying, running, and serving of the developer’s application on the Internet.
At this stage, our main philosophy when choosing tools was to always select something a member of the team was an expert with. This was to reduce technical risk and keep development velocity high. During the early stages of a new product there are plenty of other risks to contend with, like market and product risk, so we didn’t need to add technical issues to that. So we made the decision to match the tools to the team based on familiarity, which is something we could get away with whilst the team was small (at that time, 3 people).
The Frontend Client
- Amazon S3, CloudFront, Route 53
The choice of CoffeeScript was mostly an aesthetic one, as CoffeeScript and ES6/7 are roughly equivalent in terms of capabilities. Hamlet.coffee is something that one of our team, Daniel X Moore, previously created himself. It nicely solves the problem of facilitating the use of CoffeeScript and Jade-like syntax with reactive templating, without having to resort to the types of hacks that we had to use when trying out Knockout and Backbone.
From the outset, we decided not to write our own editor. In general, online editors are difficult to write and although there are complexities in depending on someone else’s design for an editor, we knew that the editor was not our core value. So we chose to use Ace and hook in our own minor modifications to interface into our editor model.
At the beginning we iterated from the push/pull model of GitHub and transferred that to REST calls into the OAPI, even keeping the same GitHub payload formats. That worked fairly well, and it allowed us to start identifying the design principles for the OAPI backend without prematurely over-designing it.
From this basic design, we got some pretty good feedback from our fellow Creekers on the usability and responsiveness, which showed we had to do much more. We also decided from that experience that we wanted collaborative editing in the same document, and we had no reasonable way to facilitate that with the push/pull model. So we decided to switch to using Operational Transformation (OT).
OT is the mathematical methodology that allows edits in 2 or more instances of the same document to be applied across all the clients and the backend, irrespective of the order in which they are generated and processed. To help jumpstart our implementation of OT on the frontend we used a fork of Firepad under the Ace editor, which uses the OT.js lib internally. This was then interfaced into our own model of the documents and the app’s Websocket implementation.
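To make the idea concrete, here is a minimal sketch of the core OT step, written in Go for illustration only. It is not the Firepad or OT.js API: the `Insert` type and the `Apply` and `Transform` functions are hypothetical names, the sketch handles insert-only operations, and it assumes each client has a unique `Site` number to break ties. It shows two clients editing the same document concurrently and converging on the same text whichever edit is applied first.

```go
package main

import "fmt"

// Insert is a hypothetical, insert-only operation: put Text at rune offset Pos.
// Site identifies the client that produced it and is used only to break ties.
type Insert struct {
	Pos  int
	Text string
	Site int
}

// Apply returns doc with op applied. Pos must lie within the document.
func Apply(doc string, op Insert) string {
	r := []rune(doc)
	out := make([]rune, 0, len(r)+len(op.Text))
	out = append(out, r[:op.Pos]...)
	out = append(out, []rune(op.Text)...)
	out = append(out, r[op.Pos:]...)
	return string(out)
}

// Transform rewrites op so that it can be applied after the concurrent
// operation "against" has already been applied to the same base document.
func Transform(op, against Insert) Insert {
	// If the other insert landed before ours (or at the same offset from a
	// lower-numbered site), our target position shifts right by its length.
	if against.Pos < op.Pos || (against.Pos == op.Pos && against.Site < op.Site) {
		op.Pos += len([]rune(against.Text))
	}
	return op
}

func main() {
	doc := "helo"
	a := Insert{Pos: 3, Text: "l", Site: 1} // client A fixes the typo
	b := Insert{Pos: 4, Text: "!", Site: 2} // client B appends punctuation

	// One replica sees A first, the other sees B first.
	aThenB := Apply(Apply(doc, a), Transform(b, a))
	bThenA := Apply(Apply(doc, b), Transform(a, b))

	fmt.Println(aThenB, bThenA, aThenB == bThenA)
}
```

A real OT implementation also has to handle deletions, compose operations, and track document revisions on the server, which is why reusing an existing library is attractive.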
Switching from REST to Websockets meant that we needed to find a balance between the number of Websockets used on the backend, and keeping the state in sync. With REST the connections are many but short-lived. With Websockets the connections could last hours and we needed to make sure we didn’t drain the backend of connection resources. In the end, we settled on 2 Websockets: one for OT and document management and the other for streaming logs back to the client. All other requests are done via REST calls.
We are really happy with the choice to leverage other people’s hard work and use Ace and Firepad. Finding an editor that you can live with is tricky, and none are perfect, but building on an existing one is a real boon when pushing towards MVP. So our recommendation is that if you need an online editor, try a few and see which one has bugs that you can work with or around.
The Backend Orchestration API
Our design goals for the backend were manifold. We wanted to be deployed on AWS so we could invest in the infrastructure costs relative to the value we were generating. We were entering a space where we didn’t know where our bottlenecks would be, nor where our sweet-spots in horizontal or vertical scaling would be. So that made AWS a natural choice – we didn’t have to commit too early on any given part of the stack, from hardware through to OS and infrastructure services.
We also knew that we wanted multi-language support from day one, even if we were starting with just Node.js as the only option customers could use. To this end, it was important not to put any language-specific knowledge into the backend that we could not easily back out. We compromised on this, adding a few code smells in our backend as shortcuts whilst we were still proving out our core value. But we made sure not to add anything that would prevent us from adding support for new languages down the road.
Handling Users’ Code
The proxies that accept requests from the frontend client, process them, and orchestrate the client’s running code are all written in Go.
We chose Go because it is strong in concurrent architectures, has powerful primitives and robust HTTP handling that we knew we could bend to our needs, even if it didn’t work out of the box for us. In addition, several of our stack components were written natively in Go which gave us confidence in the client APIs we would need. Go also had the benefit of being a good standalone binary generator so our dependencies would be minimal once we had the binary compiled for the appropriate architecture.
On the frontend, each of our proxies has a health endpoint, and a proxy is pulled out of Route 53 DNS if its health check fails. The proxies are distributed across our AWS availability zones. It was the responsibility of the proxies, written in Go, to route traffic either to an existing available instance of the user’s project or to place the project on a backend node and route to that. Since all the frontend proxies needed to know the state of project placement, which was fluid over time, we decided to experiment with etcd. Etcd is written in Go, so it has native client libraries, and it uses the Raft consensus algorithm to maintain state across all the availability zones. Each of the proxies is a node in the etcd cluster so that it has a local copy of the state. We were then able to use atomic compare-and-swap operations to consistently route to the right backend instance.
Etcd worked really well in the beginning. However, as we ramped up in the early beta, we noticed periodic hangs in servicing requests. It turned out that because etcd appends every change to a log, after a few thousand changes it needs to “flatten” its view of the data into a snapshot. Our increasingly busy set of user projects triggered this flattening regularly, which led to the hangs. This ultimately became too painful, so for now, we’ve moved over to PostgreSQL. Etcd is probably a good solution for less volatile data, and where lookups aren’t as time sensitive.
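As a rough illustration of the compare-and-swap routing described above, the Go sketch below uses an in-memory map guarded by a mutex as a stand-in for the shared store. It does not use the etcd client API; `PlacementStore`, `CompareAndSwap`, and `Route` are hypothetical names, and the node names are made up. The point is only that when two proxies race to place the same project, exactly one write wins and both end up routing to the same node.

```go
package main

import (
	"fmt"
	"sync"
)

// PlacementStore is a hypothetical in-memory stand-in for the shared
// state store. It maps a project ID to the backend node serving it.
type PlacementStore struct {
	mu    sync.Mutex
	nodes map[string]string
}

func NewPlacementStore() *PlacementStore {
	return &PlacementStore{nodes: make(map[string]string)}
}

// Get returns the node for a project, or "" if it has not been placed.
func (s *PlacementStore) Get(project string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.nodes[project]
}

// CompareAndSwap sets the project's node to next only if the current
// value is still prev, and reports whether the swap happened.
func (s *PlacementStore) CompareAndSwap(project, prev, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.nodes[project] != prev {
		return false
	}
	s.nodes[project] = next
	return true
}

// Route returns the node that should serve the project. If the project is
// unplaced, it tries to claim it for candidate; if another proxy wins the
// race, it routes to that proxy's choice instead.
func Route(s *PlacementStore, project, candidate string) string {
	if node := s.Get(project); node != "" {
		return node
	}
	if s.CompareAndSwap(project, "", candidate) {
		return candidate
	}
	return s.Get(project)
}

func main() {
	store := NewPlacementStore()
	results := make([]string, 2)
	var wg sync.WaitGroup
	// Two "proxies" race to place the same project on different nodes.
	for i, candidate := range []string{"node-a", "node-b"} {
		wg.Add(1)
		go func(i int, candidate string) {
			defer wg.Done()
			results[i] = Route(store, "project-42", candidate)
		}(i, candidate)
	}
	wg.Wait()
	fmt.Println(results[0] == results[1]) // prints true: both agree on one node
}
```

This sketch never removes a placement; a real system would also need to handle nodes going away and projects being moved.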
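The toy Go model below illustrates the general pattern behind that behavior. It is not etcd’s implementation; the `Store` type, its fields, and the threshold are all hypothetical. Every write is appended to a log, and once the log reaches a threshold it is folded into a snapshot in one go, so the more volatile the data, the more often that extra work lands on the write path.

```go
package main

import "fmt"

type change struct{ key, value string }

// Store is a toy append-only store: writes go to a log, and the log is
// periodically folded ("flattened") into a snapshot map.
type Store struct {
	snapshot    map[string]string
	log         []change
	threshold   int
	Compactions int // how many times the log has been flattened
}

func NewStore(threshold int) *Store {
	return &Store{snapshot: make(map[string]string), threshold: threshold}
}

// Put appends a change and flattens the log once it reaches the threshold.
// The flattening runs inline here, so the unlucky write pays for it.
func (s *Store) Put(key, value string) {
	s.log = append(s.log, change{key, value})
	if len(s.log) >= s.threshold {
		s.compact()
	}
}

// compact replays the log into the snapshot and truncates the log.
func (s *Store) compact() {
	for _, c := range s.log {
		s.snapshot[c.key] = c.value
	}
	s.log = s.log[:0]
	s.Compactions++
}

// Get checks the newest log entries first, then falls back to the snapshot.
func (s *Store) Get(key string) (string, bool) {
	for i := len(s.log) - 1; i >= 0; i-- {
		if s.log[i].key == key {
			return s.log[i].value, true
		}
	}
	v, ok := s.snapshot[key]
	return v, ok
}

func main() {
	s := NewStore(1000)
	// Simulate busy projects repeatedly changing placement.
	for i := 0; i < 2500; i++ {
		s.Put(fmt.Sprintf("project-%d", i%50), fmt.Sprintf("node-%d", i%7))
	}
	fmt.Println("compactions:", s.Compactions, "pending log entries:", len(s.log))
}
```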
The Go proxies worked out really well though, and apart from having to add support for proxying through Websockets, the native libraries met the majority of our base networking needs.
The Container Servers
Right from the outset, we settled on a user’s application being sandboxed in a Docker container running on AWS EC2 instances. An orchestration service would then need to coordinate the content on the disk, content changes with the editor, the Docker containers used for installation and running the user’s code, and the returning of all the necessary logs back to the user’s editor.
The challenge here was that some parts of the architecture needed to be fast, with low-latency message exchange between the components, while others needed to handle long-running, blocking events such as starting a user’s application. To get around this, we used a hub-and-spoke messaging model. The hubs were non-blocking event loops that would listen and send on Go channels. The spokes would represent single instances of either a project’s content, with OT support, or its container environment, via the Docker APIs. This architecture has worked well and enabled us in the early days to split the proxies off from the container servers without too much effort, and a messaging approach lends itself to decoupling components as needs arise.
Post-launch we ran into a number of kernel bugs that only emerged as we scaled up. So we put a lot of effort into hardening Docker and making the environment stable and responsive. To get there we went through several OS and Docker version combinations. In the end, we settled on Ubuntu Xenial with Docker, which works well for stability under load.
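A hub-and-spoke arrangement of this kind can be sketched in Go roughly as follows. This is an illustrative sketch rather than HyperDev’s code: the `Hub` and `Message` types, the buffer size, and the `handle` callback are all assumptions. The hub’s loop only routes messages onto per-project channels, while each spoke goroutine does the slow work for its own project, so a long-running task for one project does not stall messages for another.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Message is a hypothetical unit of work addressed to one project.
type Message struct {
	Project string
	Body    string
}

// Hub routes messages to per-project spokes. Its loop only forwards
// messages; the potentially slow handling happens inside each spoke.
type Hub struct {
	inbox  chan Message
	spokes map[string]chan Message
	handle func(Message)
	wg     sync.WaitGroup
}

func NewHub(handle func(Message)) *Hub {
	return &Hub{
		inbox:  make(chan Message),
		spokes: make(map[string]chan Message),
		handle: handle,
	}
}

// Send hands a message to the hub. Close signals that no more will follow.
func (h *Hub) Send(m Message) { h.inbox <- m }
func (h *Hub) Close()         { close(h.inbox) }

// Run is the hub's event loop. It returns once the inbox has been closed
// and every spoke has finished its queued work.
func (h *Hub) Run() {
	for msg := range h.inbox {
		spoke, ok := h.spokes[msg.Project]
		if !ok {
			// Buffered so the hub rarely waits; a full buffer would still
			// block here, so a real system needs a backpressure policy.
			spoke = make(chan Message, 16)
			h.spokes[msg.Project] = spoke
			h.wg.Add(1)
			go h.runSpoke(spoke)
		}
		spoke <- msg
	}
	for _, spoke := range h.spokes {
		close(spoke)
	}
	h.wg.Wait()
}

// runSpoke processes one project's messages in order.
func (h *Hub) runSpoke(in <-chan Message) {
	defer h.wg.Done()
	for msg := range in {
		h.handle(msg)
	}
}

func main() {
	results := make(chan string, 8)
	hub := NewHub(func(m Message) {
		time.Sleep(10 * time.Millisecond) // stand-in for slow work, e.g. starting a container
		results <- m.Project + ": " + m.Body
	})
	done := make(chan struct{})
	go func() {
		hub.Run()
		close(done)
	}()

	hub.Send(Message{Project: "alpha", Body: "edit 1"})
	hub.Send(Message{Project: "beta", Body: "start container"})
	hub.Send(Message{Project: "alpha", Body: "edit 2"})
	hub.Close()
	<-done

	close(results)
	for r := range results {
		fmt.Println(r)
	}
}
```

Messages for a single project are handled in the order they were sent, while different projects proceed independently, so the interleaving of output between projects can vary from run to run.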
Using HyperDev to Power HyperDev
One of the choices we made early on was to force some of our backend services to use HyperDev as their hosting solution so we are always dogfooding. This is important: if we want people to trust and rely on HyperDev for their projects, then we should be happy to do the same for our own too.
We did this for the “About” and “Community” pages on our site without too many problems. The big one for us though was Authentication and Authorization, as it’s the first service used once the frontend is running. So any problems on the backend are customer-facing and immediately impact a user’s experience of the product. This has caused some growing pains compared with using more mature, battle-hardened options. But it has meant we’ve been focused on reliability from the outset, and it proves that you can use HyperDev to create complex projects, rather than just toy apps.
Overall, we’re happy with our stack. We’ve had to learn a number of lessons quickly, as our launch brought more than 3 times the number of users we had anticipated (but that’s a nice problem to have!). However, no early-stage stack is perfect, and we’re continuing to refine and try different options as we scale up, improve the speed and performance of the service, and deliver the rock-solid reliability our users deserve.