March 14th, 2013 by Benjamin Pollack

Kiln Harmony Internals: the Basics

So let’s cut to the chase: you’re here because you saw Tuesday’s announcement of Kiln Harmony, you know enough about Mercurial and Git to know that the two systems aren’t actually isomorphic, and you’re therefore wondering what the catch is. You want to know how we actually pull it off, diving as far under the covers as it takes to really believe it’s possible and that we’ve done it.

We hear you. In fact, it took us months of prototyping to reach a point where we were getting happy with the results, and lots of actually using Kiln Harmony to get all of the actual workflow-related issues right. And until that point, we ourselves weren’t 100% sure if it’d ever really be possible, so we definitely don’t blame you for having the same doubts.

What we want to do here is to provide a series of posts where we dive right down into the algorithms that make this whole thing work. In each post, we’ll take a look at one part of the system, discuss the high-level view, dive into the low-level gotchas, and hopefully convince you that, yes, while this is insanely complicated, it’s also tractable, and there aren’t any catches involved. And, as a bonus, we’ll even be hosting a Q&A with the Kiln Harmony developers next week for you to ask any questions we’re not answering in this series.

In the meantime, since today is our first day, let’s start at the top: we’ll discuss how and when Kiln Harmony engages, and then dive into how Git commit objects become Mercurial changeset entries and vice-versa.

Ah yes, one more thing: posts like this don’t really have much use for images. It’s kind of one of those text-and-source-code affairs. To make up for that, and to prevent your brain turning into those “this is your brain on drugs” fried egg shticks, I’ll try to include lots of pictures of my cats. Here comes one now:

[Image: Hugging Kitties]

If you push me, I’ll push you

As mentioned in our launch post, Kiln Harmony had three iron requirements: it had to be repeatable (you couldn’t get different repos on subsequent translation attempts), lossless (we couldn’t discard data just because it was hard to preserve), and idempotent (a given Git commit or Mercurial changeset had to always generate exactly the same counterpart for a given repository, with no exceptions—even across Kiln installations or with a side-trip through a site like GitHub or Google Code). This suggested to us a pretty straightforward architecture: every single repository would be stored both as Git and Mercurial on-disk, and we would write a daemon that synchronized changes between the two repositories.

The exact nature of this daemon will continue to change as we improve Kiln Harmony. At the moment, the daemon—let’s call it the Harmonizer—is written in Python, interacting with Git repositories via a customized version of Dulwich, and Mercurial repositories via a customized version of…well, since it was already written in Python, Mercurial itself. What customizations, you ask? Well, we had to customize both of these libraries to allow us to deliberately generate corrupt repositories—not something we expect would be readily accepted upstream. (That statement probably sounds both evil and insane to you, but it’s neither. We’ll make use of generating corrupt data in today’s blog post, which will give you a feel for where this comes up.)

Whenever you push to Kiln Harmony, we lock all incoming pushes from the other DVCS (e.g., if you just pushed Git, we will block any incoming Mercurial pushes), and begin translation. When the translation is complete, we run translation in the other direction, in a cycle, until no work happens for one complete iteration. We then notify the website that new data is available, store information about the new changesets (including the Git to Mercurial mapping), and relinquish the write lock.
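To make that cycle concrete, here is a minimal sketch in Python. It is not the Harmonizer's actual code: `harmonize`, `translate`, and `write_lock` are hypothetical names, and a "complete iteration" is assumed to mean one pass in each direction.

```python
import threading
from typing import Callable

# Hypothetical stand-in for the per-repository write lock described above.
write_lock = threading.Lock()


def harmonize(pushed: str, translate: Callable[[str, str], int]) -> int:
    """Translate back and forth until one full cycle does no work.

    `translate(src, dst)` is an assumed callback that converts anything new
    in `src` into `dst` and returns how many commits/changesets it created.
    Returns the number of cycles that were run.
    """
    other = "hg" if pushed == "git" else "git"
    cycles = 0
    with write_lock:  # pushes to the other DVCS are blocked while this is held
        while True:
            cycles += 1
            work = translate(pushed, other) + translate(other, pushed)
            if work == 0:
                # Nothing happened in either direction: we have converged.
                return cycles
```

The loop always ends with one "wasted" cycle that does nothing, which is exactly the signal that both repositories agree.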

So far, no rocket surgery. This is important, but mind-numbingly boring, so here’s another picture of a cat.

[Image: Bulldozer_Small]

I’m a committer

Let’s talk about nearly isomorphic data.

Git commits and Mercurial changesets include nearly the same stuff: they both include timestamps, a committer, a description of what happened, some parents (optional), and what amounts to a pointer to what code is actually included in that version. So we have our first naive algorithm to convert: get the file contents and metadata converted somehow (we’ll cover that in another post, but for now, assume we’ve somehow already got that part working), then make a changeset or commit, using a field-for-field copy of the data in its peer.

Great idea, and it’ll work for very simple Mercurial or Git repositories. But it falls down fast in the real world.

Let’s start by just trying to figure out how to convert the valid data we’re going to come across—because, trust me, there’s going to be a lot of invalid data we’re going to have to work with, too.

Extra, extra, read all about it

Let’s start with something insanely common in the Git world that gets us into trouble: the distinction between authors and committers. In Git, the person who actually wrote the code is called the author. Let’s say that person is Sara. Sara might commit and push the code herself to the official repository, in which case the author and committer will both be her. But more likely, she’ll submit it as a pull request or send her patch to a mailing list, where someone else (let’s call him John) rebases the change on master and then pushes that to the official repository. In that case, Sara stays the author, but John is now the committer.

Problem: Mercurial changesets have only one field, the username, which corresponds most directly with a Git commit’s committer. They don’t have an equivalent of the author. Further, I’ve been glossing over this a bit, but the author and the committer can also both have their own timezones and their own timestamps, so we’ll need to preserve that data somehow, too.

Thankfully, there’s a solution: both Git and Mercurial allow additional data, called extras, to be stored in their commits and changesets. So here’s what we’ll do: whenever we have data only Git can understand, we store it as an extra on the Mercurial side. When we have data only Mercurial understands, we’ll store it as an extra on the Git side. Finally, while it’s not quite applicable here, we’ll decree that explicitly stored extras that we recognize trump any data “officially” in the commit.

How does that apply here? Git committers become Mercurial usernames. Git authors, and their associated data, will become custom extras in the Mercurial changeset.

You can see these right now, if you grab the Harmony version of a Git repository that has these. Let’s take a look:

$ hg clone -U https://mirrors.kilnhg.com/Code/Mirrors/Tools/Dulwich
destination directory: Dulwich
requesting all changes
adding changesets
adding manifests
adding file changes
added 1671 changesets with 3630 changes to 355 files (+1 heads)
$ cd Dulwich
$ hg debugdata -c 3ff0539fff4f
2065f43762d222a70e915735b4d348e579ed45c6
Jelmer Vernooij <jelmer@samba.org>
1243726211 7200 -kiln-git-author:Ronald Blaschke <ron@rblasch.org> -kiln-git-commit-message:Mrn8;b7gLHX>Mg~b1n
dulwich/_objects.c
dulwich/_pack.c

Fix sentinels.

As you can see, the fields Kiln adds are all prefixed with -kiln-, to minimize the chance that any other software tries to use the same keys. We also try to minimize their use, so if the committer and author are the same, or at least share the same timestamp, we won’t store the redundant data. But worst-case, we’ve introduced just a few extras keys: -kiln-git-author, -kiln-git-author-time, and -kiln-git-author-tz. So far, so good. (There’s also an extra key in there, but we’ll come back to that later.)
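Here is a rough Python sketch of that mapping. The key names come from this post; everything else is an assumption made for illustration: the `GitIdentity` type, the function name, the choice to compare field by field, and the string format of the time and timezone values.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GitIdentity:
    """Hypothetical holder for one Git author/committer line."""
    name_email: str  # e.g. "Sara Example <sara@example.com>"
    timestamp: int   # seconds since the epoch
    tz_offset: int   # offset in seconds (representation assumed)


def to_hg_user_and_extras(
    author: GitIdentity, committer: GitIdentity
) -> Tuple[str, Dict[str, str]]:
    """Return the Mercurial username plus any extras needed for the author."""
    extras: Dict[str, str] = {}
    # Only store what differs from the committer, to avoid redundant extras.
    if author.name_email != committer.name_email:
        extras["-kiln-git-author"] = author.name_email
    if author.timestamp != committer.timestamp:
        extras["-kiln-git-author-time"] = str(author.timestamp)
    if author.tz_offset != committer.tz_offset:
        extras["-kiln-git-author-tz"] = str(author.tz_offset)
    # The Git committer becomes the Mercurial username.
    return committer.name_email, extras
```

With the Dulwich changeset shown above, where only the names differ, this would produce a single -kiln-git-author extra and nothing else.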

A rose by any other name would be a spatula

So are we all set for users now? Not quite. All we did was switch which tool gives us problems. Git expects usernames for committers and authors to have a set format, which is (approximately) the regex [^<]* +<(.*)>. This is all well and good, and most Mercurial repositories have usernames in this format as well, but, unlike Git, Mercurial does not require this format. Especially in repositories converted from Subversion, you’ll frequently see usernames like john or sara instead of John Example <john@example.com>.

We’ll fix this in two pieces: first, to make Git happy, we unilaterally declare that every anonymous user who has a commit in Kiln Harmony shares the free email address unknown@kiln.example.com. Second, to keep Mercurial happy on the return, we store the raw username as a Git extras field, kilnhgusername, so we can restore the verbatim name when we go back to Hg. Remember how I said earlier that extras win over “official” data? That comes up here so that Mercurial won’t pick up the munged Git username on the return: if kilnhgusername is present, it trumps the Git username on the commit.

Let’s take a look:

$ git clone https://mirrors.kilnhg.com/Code/Mirrors/Tools/Mercurial.git
Cloning into 'Mercurial'...
remote: Counting objects: 98193, done.
remote: Compressing objects: 100% (22363/22363), done.
remote: Total 98193 (delta 75822), reused 98117 (delta 75746)
Receiving objects: 100% (98193/98193), 24.33 MiB | 2.63 MiB/s, done.
Resolving deltas: 100% (75822/75822), done.
Checking out files: 100% (1015/1015), done.
$ cd Mercurial
$ git show --format=raw 12a7480271a2
commit 12a7480271a246ae990c4558a878b6628bb19550
tree 8bf68397176781c35ad65bccbeb29be8833600c3
parent 16bd8634c95aa7120aa7a7371705538b89e51576
author Stephen Darnell <unknown@kiln.example.com> 1122488324 +0800
committer Stephen Darnell <unknown@kiln.example.com> 1122488324 +0800
kilnhgrawdescription L1bhgVIVCnbZKp6AY*TBZDDR?AZ%%FWgu^GbZKvHAarjabZKp6AZTYGV{dJ3VQyq|3JL
kilnhgusername Stephen Darnell

    Add a --time command line option to time hg commands

diff --git a/mercurial/commands.py b/mercurial/commands.py
...
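The username round-trip can be sketched in a few lines of Python. The regex, the placeholder address, and the kilnhgusername key are taken from this post; the function names are hypothetical, and real code would need to handle more edge cases (a stray < in a bare username, for instance).

```python
import re
from typing import Dict, Tuple

# The approximate Git identity format quoted earlier in this post.
GIT_IDENT = re.compile(r"[^<]* +<(.*)>")
PLACEHOLDER_EMAIL = "unknown@kiln.example.com"


def hg_user_to_git(username: str) -> Tuple[str, Dict[str, str]]:
    """Return a Git-friendly identity plus any extras needed to undo it."""
    if GIT_IDENT.fullmatch(username):
        return username, {}  # already "Name <email>", nothing to preserve
    munged = f"{username} <{PLACEHOLDER_EMAIL}>"
    return munged, {"kilnhgusername": username}


def git_user_to_hg(identity: str, extras: Dict[str, str]) -> str:
    """Extras win over the official field, so the raw name comes back."""
    return extras.get("kilnhgusername", identity)
```

Because the extra trumps the official field, the munged address never leaks back into Mercurial.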

Okay, not so bad. Plus, I think we earned a cat. A good one. Maybe something retro?

[Image: Elaine clawing the oven]

Commits beyond description

Mercurial and Git both allow you to have descriptions to, erm, describe what you’ve done. So we just copy those, right?

Well, no. Git descriptions can have any encoding, whereas Mercurial ones have to be UTF-8. But, thankfully, we can use a similar trick here to what we used with Git authors: we’ll use changeset extras to store the Git encoding (-kiln-git-commit-encoding) and the original description’s byte sequence (-kiln-git-commit-message), and we’ll lossily transcode the Git commit message to UTF-8 so Mercurial users do see a description when they run hg log. Going the other direction, we just mark Mercurial descriptions as being UTF-8 encoded, and we’re done. Right?

Almost. Up to this point, we’ve been focusing on how we translate valid data found in Git and Mercurial repositories. But it turns out that a lot of the actual repositories out there have grossly invalid data. You see, both tools started getting used, in big ways, before their data formats were really 100% stabilized. In some cases, it’s not that the data is so much invalid as that the formats slightly changed. In others, insufficient data validation meant corrupt data could make it to disk.

How’s that apply here? Well, in theory, all Mercurial descriptions are UTF-8. In practice, you will find Mercurial repositories whose descriptions are in completely random encodings (we found lots of Latin-1, a little KOI8, some Shift JIS, and a smattering of other encodings), and, worse, you will find descriptions which are not valid in any known encoding. So now, even though “all” Mercurial descriptions are UTF-8, we also need to store the raw Mercurial description in Git (kilnhgrawdescription), and then our best-effort lossy description as the “official” Git description, so Git users have at least some chance of figuring out what was going on in a given commit. So does that work?

No. Weren’t you listening? Mercurial descriptions are UTF-8, and, nowadays, Mercurial actually enforces that. So out-of-the-box, working with the Mercurial API, even if you take the above steps, you can end up with repositories you can convert to Git, but that you cannot convert back to Mercurial.

As the French say: nom d’un chien.

[Image: Huddled Elaine]

Er, nom d’un chat.

Remember how I said some of our modifications to Mercurial allow us to deliberately store invalid data, and that these modifications are neither nefarious nor insane? This is why: to allow these repositories to round-trip, we need to bypass all the safety mechanisms in Mercurial, and commit a description that is a blob of bytes in an unknown encoding. See? Purely, totally, sanely logical.
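For the Git-to-Mercurial half of this, a minimal Python sketch of "keep the raw bytes, show a best-effort UTF-8 version" might look like the following. The two extras keys are the ones named earlier; the function name, the rule for when to store them, and the use of Python's `base64.b85encode` for the raw bytes are assumptions, not a description of the real Harmonizer.

```python
import base64
from typing import Dict, Tuple


def git_message_to_hg(raw: bytes, encoding: str = "utf-8") -> Tuple[str, Dict[str, str]]:
    """Return (best-effort text, extras) for a Git commit message."""
    try:
        text = raw.decode(encoding)
        lossless = True
    except LookupError:
        # The declared encoding is unknown to Python: fall back to UTF-8.
        text = raw.decode("utf-8", errors="replace")
        lossless = False
    except UnicodeDecodeError:
        # The bytes are invalid in the declared encoding: replace bad bytes.
        text = raw.decode(encoding, errors="replace")
        lossless = False

    extras: Dict[str, str] = {}
    if not lossless or encoding.lower() not in ("utf-8", "utf8"):
        # Keep enough to rebuild the exact Git message later (assumed policy).
        extras["-kiln-git-commit-encoding"] = encoding
        extras["-kiln-git-commit-message"] = base64.b85encode(raw).decode("ascii")
    return text, extras
```

The readable text is what shows up in hg log; the extras are what make the trip back to Git byte-for-byte exact.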

[Image: 1031071313]

Whoops, that was a Khan Academy engineer, not a cat. My bad. It won’t happen again.

So, anyway: now do we have descriptions?

Of course not!

Mercurial descriptions cannot end with a newline. Git descriptions must end with a newline. So we’ll add a newline when going to Git and remove it going to Mercurial. Right? Wrong: because Mercurial didn’t always enforce that rule. So while we do add and remove a single newline 99% of the time, the remaining 1% of the time, we store the raw Mercurial description, newline and all, in the kilnhgrawdescription extra in the Git commit. Oh, and because that will by definition have newlines, which we can’t (sanely) store in a Git extras field, we’ll also base85 encode it. Ah yes, and this will of course be another modification of Mercurial so that you can deliberately store improperly formatted descriptions that are nevertheless valid encodings.
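In Python, the newline dance could be sketched roughly like this. The kilnhgrawdescription key is real; the function names are made up, Python's `base64.b85encode` stands in for whatever base85 variant Kiln actually uses, and the Git message chosen for the odd 1% case is an assumption.

```python
import base64
from typing import Dict, Tuple


def hg_description_to_git(desc: bytes) -> Tuple[bytes, Dict[str, str]]:
    """Return (Git message, Git extras) for a Mercurial description."""
    if desc.endswith(b"\n"):
        # The rare case: an old changeset that kept its trailing newline.
        # Stash the exact bytes, base85-encoded so no newline hits the header.
        raw = base64.b85encode(desc).decode("ascii")
        return desc, {"kilnhgrawdescription": raw}
    # The common case: Git wants the newline that Mercurial forbids.
    return desc + b"\n", {}


def git_message_to_hg_description(message: bytes, extras: Dict[str, str]) -> bytes:
    """Invert hg_description_to_git; a stored raw description always wins."""
    raw = extras.get("kilnhgrawdescription")
    if raw is not None:
        return base64.b85decode(raw)
    return message[:-1] if message.endswith(b"\n") else message
```

Either way, converting to Git and back yields the original Mercurial bytes, which is the whole point.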

So now are we all good with descriptions? Well, descriptions, yes, but commits in general? Not even close. Git timezones aren’t always valid, so we happily welcome our new -kiln-git-commit-tz-raw and -kiln-git-author-tz-raw overlords. Mercurial changesets end up having the same problem, so kilnhgdatetime is born unto the world. And so on and so forth.

When you’re done, you have a disgusting pile of extras on both sides, but you also have repositories that are human-friendly, and round-trip. So it’s worth it in the end.

Wait a minute. Disgusting pile of extras. Do we round-trip extras?

No, no, no, this won’t do at all!

Git extras are really more like a text blob at the end of the official commit area, while Mercurial’s are a sorted key-value store. That wouldn’t matter if we were the only ones using the extras, but there are, of course, other uses: both Git and Mercurial use their extras area to track rebases and to store what Subversion commit a changeset or commit came from, among other things. So we’re going to have to round-trip those, too.

In a maneuver that made me once jokingly refer to this product as Kiln Ouroboros, we of course solve the problem of how to preserve extras by using extras. The implementation here is incredibly uninteresting: we use prefixes to avoid having Git extras collide with Mercurial extras or use base85-encoded JSON dictionaries, depending on which direction we’re going and what we’re doing. And that ends up getting you almost all the way through commits and changesets.
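If you want a feel for the base85-encoded JSON half of that, here is a tiny Python sketch. The function names are hypothetical and the exact serialization Kiln uses is not shown in this post; sorting the keys is an assumption made here so that the same extras always produce the same bytes.

```python
import base64
import json
from typing import Dict


def pack_extras(extras: Dict[str, str]) -> str:
    """Serialize foreign extras into one header-safe ASCII string."""
    payload = json.dumps(extras, sort_keys=True, separators=(",", ":"))
    return base64.b85encode(payload.encode("utf-8")).decode("ascii")


def unpack_extras(blob: str) -> Dict[str, str]:
    """Recover the original extras dictionary from pack_extras output."""
    return json.loads(base64.b85decode(blob).decode("utf-8"))
```

Deterministic output matters here: if the packed blob changed from run to run, the resulting commit hash would too, and repeatability would go out the window.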

But wait, there’s more!

Almost? Well, yeah. Those concepts of branching and merging that practically make DVCSes what they are? We skipped past all of that.

But I’m out of cat pictures, and you’ve got enough to think about for the time being. So let’s start tackling those concepts in our next post. In the meantime, go out, get some coffee, and enjoy knowing that the Kiln team worried about all of this so that you don’t have to. (And, if you haven’t yet, go make a Kiln Harmony account so you can start playing with everything we just talked about yourself.)