The Search Guide: Helping you find what you’re looking for

April 11th, 2014 by Aaron Maenpaa

As you may already know, FogBugz offers an impressive array of search axes to help you slice and dice your data. These axes allow you to build searches to find particular cases or any cases matching pretty much arbitrary criteria. While all of these axes are documented on our help site, we kept finding that while axis search is really valuable to us when we use FogBugz, we kept talking to customers who just didn’t know about it.

We’re trying to change that.

The first thing we’re doing is adding a prompt to the search box:

When you click it, it expands to show an axis guide:

We’re hoping that the prompt stays out of the way during normal use, but helps make the axes more discoverable, presents the axis documentation in a more useful format, and ultimately helps you find the cases you’re looking for, when you’re looking for them.

… but wait, there’s more

Like I said, this is the first thing we’re doing. We’re working on some other stuff we think you’ll like. Here’s a sneak peak (provided you haven’t already reached your daily recommended intake of Comic Sans ;-):

Like the look of it, but don’t have FogBugz? You can sign up for a free FogBugz trial today!



Fog Creek and Heartbleed

April 10th, 2014 by Mendy Berkowitz

Along with the rest of the internet, Fog Creek has been reacting to the Heartbleed vulnerability which was discovered this Monday.

TL;DR: FogBugz and Kiln On Demand, Copilot and the Fog Creek website were not vulnerable. Trello was vulnerable and has been remediated.

Fog Creek handles SSL connections at the load balancers in front of the application servers. The load balancers for FogBugz and Kiln On Demand, Copilot and the fogcreek.com website are using the 0.9.8 branch of OpenSSL. This version does not have the newer heartbeat extension and is therefore not vulnerable to Heartbleed. None of your data on fogbugz.com, kilnhg.com, copilot.com or fogcreek.com has been exposed to the Heartbleed vulnerability.

The load balancers for Trello were using the 1.0.1 branch of OpenSSL and were vulnerable. We have upgraded OpenSSL and replaced the Trello certificates. More details and important recommended steps for all users are available in a post on the Trello blog.

We have also reviewed our vendors (for things like this blog) to ensure that our other SSL certificates were not potentially compromised. Fortunately we have not had to replace any other certificates.

If you have configured web hooks in your Kiln On Demand account, we recommend that you verify that the target servers have not been affected. If you use one of the pre-configured types, or your custom web hook uses HTTPS, the data sent to that external service may be at risk. We do not however, send any of your Kiln login credentials, so they are safe.

We take the security of your data seriously. We are committed to protecting it and to maintaining clear and open communication with you. If you have any questions, comments or concerns please contact us.



The Single Most Sure-Fire Hiring Decision You Will Ever Make

April 7th, 2014 by Elizabeth Hall

Imagine after just three months you had eager, extremely talented, computer-science students returning to school to talk nothing but good things about their experience at your company to their peers. And, by the following May, you would have new hires joining your team who were already familiar with your company’s values, products, and work ethics. How could they not? They passed a three month interview! Hiring an intern as a full-time employee will be the single most sure-fire hiring decision you will ever make.

When Should You Begin Intern Recruiting?

Which Career Fairs Should You Attend?

Where Should You Post Your Internship?

How Much Does Intern Recruiting Cost?

What’s A Competitive Program?

How Many Interns Should You Hire?

What’s This Graph All About?

Canidates per Stage

Find out HERE!



The Kiln Bottleneck

April 3rd, 2014 by Jacob Krall

Early in February, one of our Kiln On Demand customers sent us an email to let us know that Git clones were taking much longer than before. Our stellar support engineers started measuring Git transfer times from their own personal accounts, and confirmed that Kiln clones were being sluggish. We didn’t know what the problem was, and since it wasn’t completely preventing the clones from succeeding, we prioritized further investigation lower than we should have.

On Friday, March 7, another customer reported the same issue and wanted to know if we were throttling their connection. Our existing monitoring showed all services performing nominally, especially after the Kiln SSH server resource utilization fix in Kiln 3.0.115 had been released, so we were still a bit puzzled. I started writing a little PowerShell script I called timeit.ps1 to measure the time it took to run git clone and hg clone against the various distributed services inside Kiln. (Unlike Trello’s codenames, mine are boring and vaguely descriptive.) I ran timeit.ps1 against my local machine, and saw nothing unusual.

Kiln DVCS Hosting Infrastructure

Here are the endpoints that timeit.ps1 attempts to clone from; once with Git and once with Mercurial.

HAProxy, IIS, and Apache on the web side; HAProxy, Tesseract and KilnSSH on the SSH side.

HAProxy terminates HTTPS traffic and sends the request on to an IIS server.

IIS authenticates the git/hg request and forwards it to Apache on the backend server that holds the requested repository.

Apache hosts the process that actually performs the git/hg transfer.

HAProxy forwards all SSH packets, untouched, to a Tesseract server.

Tesseract reads the SSH traffic and forwards it to KilnSSH on the backend server that holds the requested repository.

KilnSSH authenticates and performs the git/hg transfer.

Interesting Problems Happen At Scale

There are very few differences between my local development install of Kiln and the live Kiln On Demand service. One particularly important difference is that Kiln On Demand serves real customers, while my laptop is dedicated to just one user (me). Since I couldn’t reproduce the problem on my local machine, but our support engineers and customers could see it every day, I knew the problem had to be something environmental. Monday morning, I tweaked timeit.ps1 to run against any Kiln instance, and pointed it at my personal Kiln On Demand account. By the end of the day on Monday, I shared this compelling graph with the rest of the team in chat.

The public endpoints are much slower than the endpoints immediately behind them

Bingo. Our load balancer, HAProxy, was doubling the clone speed. Kiln QA engineer Andre confirmed from his home in Vancouver, B.C. by running timeit.ps1, he got the same exact shape on his graph. It had to be the load balancer!

By Tuesday morning, our System Administrators (whom I had otherwise forgotten about) had already formulated a plan to add power to the HAProxy machine. A Puppet script was written and run against the staging environment for testing on Tuesday. The process was repeated on our internal Kiln instance Wednesday morning for dogfooding and internal testing all day. That Wednesday night’s maintenance successfully switched the customer-facing servers to the new configuration, and the results were immediate and drastic:

after

Mercurial KilnSSH clones are somehow slower than its caller, Tesseract. This is a strange artifact of the measurement.

Richard, our support engineer in London, had a monitoring task set up that ran a git clone via public SSH. His graph confirmed that the HAProxy maintenance had an immediate and drastic effect. The variance dropped dramatically, making a smooth, placid lake out of what was once a spiky mountain range:

A graph showing Git clone becoming faster and less variant.

Lessons Learned

The scientific method is insanely effective. Formulate a hypothesis, design an experiment, run the experiment, and analyze the results. Then it is often obvious what you need to do.

Software is invisible. You can only observe its behavior indirectly, by measuring it. That means you can’t see what you don’t measure.

Your customers know when your application has a performance problem. If you can’t explain it, you need to dig in and figure it out.

Next Steps

timeit.ps1 was a one-off test script. It only runs in Windows, and is not very easy to use. We are adding automated measurement of DVCS metrics to our monitoring systems.

Start a free trial of Kiln and enjoy the faster Git and Mercurial clones today!



Remote Developer Happiness

March 31st, 2014 by Ben McCormack

We’ve written before about a few of the perks of working in the Fog Creek New York office, but what if you work remotely? As it turns out, there are some nice perks to working remotely as well, even cake on your birthday. For your home office, you get everything you would normally get to help you get work done, so you don’t have to spend time thinking, “Gosh, it sure would be nice to have an extra monitor mounted vertically to tail production logs.” Just get it.

My Steelcase Series 7 height-adjustable desk was just installed last week, so I made a video to show it off:

Here’s a list of some of the inventory on, around, and underneath my desk:

I spent a fair amount of time figuring out how to hide cords and arrange things so the desk looks clean. Having extra space on the desk just to have extra space is actually kinda nice. I used to clutter it with trinkets, but those are now on a shelf somewhere. (Want to hide even more cords? You might check out something like this).

Do you want to work remotely for Fog Creek? We’re hiring.



Project Asteroid: Gracefully Dropping Support for Dinosaur Browsers in Trello

March 25th, 2014 by Bobby Grace

According to the leading theory, dinosaurs went extinct roughly 65.5 million years ago after an asteroid landed on Earth. A few million years later you are working on a large scale web application and think to yourself, “Hey. Would it be such a stretch to compare the biodiversity of the Triassic period a.k.a. dinosaur times to that of today’s browser ecosystem?” No, I don’t think it would be. There are a lot of browsers out there and they all look and behave a little bit differently. There are some new browsers that are fast and have all kinds of great features and there are some old, slow dinosaur browsers made for a bygone time.

You are getting more than a little frustrated trying to support all these browsers. You’re looking at all the things you could be doing in these new browsers and it’s becoming clear that those big old dinosaurs are weighing you down. Of course, you could just decide to send people using these old browsers to an “Upgrade your dang browser are you kidding me?!” page one day, but that would be a cataclysmic event that would be sure to upset many, many people. What if there was a friendlier, more graceful way to transition these dinosaurs into the modern age?

Browser Usage on Trello.com

usage

This is what browser usage on trello.com looks like for the last two months or so. We built Trello from the ground up to work on just about any device. The only caveat was that we promised to support the browsers of today and tomorrow. Some of the browsers we supported at launch just can’t be considered modern by today’s standards. I’m talking about Internet Explorer 9 specifically. IE9 was released three years ago in March 2011 and was the most current version of IE at the launch of Trello. But three years ago might as well be 100 million years ago in Internet Time. I mean, three years ago? Doge was just someone’s dog three years ago.

Sure IE9 is old, but more importantly it’s hard to support. New Trello features take longer to develop because we have to bend over backwards to support IE9. Things tend to look wildly different in IE9. Open up the new attachment viewer in IE9 and it’s like a cubist painting gone wrong. IE9 doesn’t have all the new HTML and CSS features like the new browsers. To top it all off, it’s real slow. It takes forever to process JavaScript and make pixels show up. It’s like a kid running and screaming and knocking down all the boxes in the cereal aisle.

We’ve been hammering the site into a passable state for IE9, but it’s still a poor experience for the good-looking Trello crowd who we hold in the highest regard. So we would like to not have to support it, but at Trello-scale, 1.4% of visitors is not a number we can ignore. What do we do?

Enter Project Asteroid

Remember when you read How We Make Trello and your whole worldview was flipped upside down by the mere mention that there are multiple versions of the website out there at any given time? Remember how the website is just a face that chats with the Trello API and that the iOS and Android apps are also just faces and you can make your own face, too? Remember the dangerous potato faces? Remember when I said the API didn’t have any bugs and you thought I was being serious?

Well there’s a special face out there for people using Internet Explorer 9. We call it Asteroid. Get it? Like the thing that killed the dinosaurs. This is what Trello looks like in IE9 today.

asteroid

This is an old but completely functional version of the site. It doesn’t receive any updates or new features, just the occasional fix. It’s slightly customized. There’s a banner in the header that points you to the platforms page where you can find download links for all the great, modern browsers out there; browsers that are getting all kinds of cool stuff like a new card back and attachment viewer. But you can use Trello in IE9 all the same, perhaps while you are on hold with your work’s IT person trying to convince them to upgrade your browser.

In the end, everyone lives happily ever after. The dinosaur browsers peacefully roam in a land before time, while the modern browsers harness nuclear fusion and terraform Mars millions of years in the future. The dinosaur browsers will fade away eventually, but there’s no need to rid the Earth of them in an instant.

How do we know what face you should be seeing?

Let’s talk about what makes a web app first. A version of the Trello website is just a collection of files: a JavaScript file, a CSS file, some images, and maybe a few others. These files contain all the instructions necessary to make the site look and function in just the right way in your browser. The problem is that these instructions are not understood in the same way by every browser. The newer browsers get the instructions and say “Yeah cool I’m on it” while the dinosaur browsers look at you all confused then go off to find some leaves to munch on. This is annoying and is the reason we got into this mess in the first place.

What happens when we release a new client version? Let’s say the files collectively known as jellydonut-15 are broken in some way. To push a fix, we create a whole new collection of files, jellydonut-16, and you download the latest collection when you refresh or open a new Trello tab. These collections of files never change. jellydonut-15, jellydonut-16 and all the versions of Trello are all stored in a big ol’ file cabinet in the sky called a Content Delivery Network.

When you type https://trello.com in your browser’s address bar and hit enter, you make a request to the Trello server. The server takes the request and figures out who you are and which collection of files you get. The server says to itself, “Ooooh. A new request. Okay, this looks like Bobby. Let’s look him up in the database. Says here that he is on the alpha channel. The alpha channel is using the latest jellydonut version, which is… lemme thumb through this rolodex… jellydonut-22. Okay, I’ll make sure his browser pulls the jellydonut-22 files from that big ol’ file cabinet in the sky called a Content Delivery Network. Hey, when is lunch? *clutches power unit* ‘Ugh, you know what, I’m not hungry. I had some leaky AA batteries last night.’ Hahaha! A computer using AA batteries? That’s a good one. I’m going to tell that joke at the pub when I’m cycled out of production for the night. Oh look, another request…”

Now that you have your face files, you can start chatting up the API which will dutifully get the boards and cards your face asks for. Every face, new and old, talks to the one, single Trello API. I can personally verify that the API does not have a single bug. It’s important to note that the fact that we are cramming dozens of jelly donuts into a file cabinet and the fact that the hungry server knows which face you get are facts that are totally independent of the way the API operates. We don’t need to change the API for every new face. But let’s say jellydonut-17 wants to chat up the API in some new way. We update the API before jellydonut-17 is released so the API can understand what your face is talking about. Usually this new thing is an addition to the API, so it’s backwards compatible. Old faces simply continue to not ask for things they never knew to ask for.

Back to my point. So, in addition to knowing who you are, the server also knows what browser you are using based on the request. If it sees that you are using Internet Explorer 9, it gives you the custom asteroid files, no matter what channel you are on. These files include very simple instructions that won’t upset or confuse the dinosaur browsers and make them go off and eat people hiding in public restrooms. Then you, the person using IE9, see that there are new great things on the horizon if only you would upgraded your browser. Which you do. Hopefully. Right?

So is it working?

The short answer is yes. The asteroid appears to be having a… deep impact.

We launched asteroid on March 6th, 2014. Let’s compare Internet Explorer usage between January 13th–17th, 2014 and February 10th–14th, 2014, a time before asteroid. It goes up across the board for all the versions of IE. IE11 was up 16%, IE10 was up 3%, and IE9 was up 2%.

jan-feb

Now let’s compare February 10th–14th, 2014 to March 10th–14th, 2014, before and after asteroid. IE11 jumps up 23%, IE10 holds steady around 2%, and IE9 drops 8.5%.

feb-mar

For every week we compared before asteroid, IE9 usage was going up. In general, usage goes up for every browser. Now IE9 usage is decreasing. It’s easy to attribute the decreased usage of IE9 to asteroid, but it’s impossible to tell if people are switching to another browser or what. There was an especially big spike in IE11 usage, but we don’t know if that’s directly related. Usage is constantly fluctuating across all browsers.

But it’s good news at any rate. Great features are being developed faster and people can still use Trello on IE9. We can theoretically support this version forever, but we may pull it from production when IE9 usage becomes negligible. At that point, it should be a pretty soft landing. Another nice thing is that we can reuse this strategy when all the Trello developers are shaking their fists at some other browser that is looking pretty old (IE10).


That’s all. Please yell at me on Twitter at @bobbygrace and follow @trello for the latest. Of course, give Trello a shot. It will keep you and your team more organized. Guaranteed.

And please keep your browser up to date.



Cupcakes in Paradise

March 24th, 2014 by Elizabeth Hall

When Brett moved from New York City to Honolulu, where he continues to work for Fog Creek, we didn’t anticipate a very serious problem: What to do on a remote employee’s birthday?!

Typically, birthdays are celebrated with our caterers delivering a cake to the office. However, seemingly overnight, (more like over the course of a year), we went from all employees working out of our New York office to having 37% remote employees worldwide. So the cake delivered to office thing wasn’t gonna suffice.

Did you know it’s actually really difficult to get a cake delivered in Hawaii? It is. The lovely owner of Sweet Revenge pointed us to the Seamless-of-Honolulu called Room Service in Paradise, which had one dessert cupcake vendor listed. So, Brett enjoyed snickerdoodle, cream cheese-filled red velvet and salted caramel cupcakes on his special day.

Battling the birthday crisis wasn’t the only adjustment we needed to make when incorporating remote employees to our once in-house only company. We jumped headfirst into interviewing remote candidates before realizing some changes needed to be made to our well-oiled interviewing machine.

It wasn’t a smooth transition; There were bumps along the way.

Here’s what we learned…



Production Debugging: A story about “Exception code: 0xe053534f”

March 20th, 2014 by Aaron Maenpaa

Imagine for a moment that you’ve been running an ASP.NET app for years, and on one otherwise sunny Saturday morning, you get a call saying that the app has started crashing randomly. What do you do?

Step 0: Check the event log

It took me way too long to get used to checking the Event Log (Start → Event Viewer). I don’t know if I expected it to be loggier, or what, but there’s good stuff in there. You should check it out.

You might even find an error like this one (Windows Logs → Application) :

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.17651, time stamp: 0x4e21213c
Exception code: 0xe053534f
Fault offset: 0x000000000000cacd
Faulting process id: 0x%9
Faulting application start time: 0x%10
Faulting application path: %11
Faulting module path: %12
Report Id: %13

This report is a little terse but it can be the gateway to figuring out what’s wrong. “Faulting” sounds bad: It means “crashing”. Translating to English, It’s saying “the IIS worker process (w3wp.exe) crashed”. Since this is exactly what you’re looking into, it’s a start. Exception code 0xe053534f is a stack overflow (I wish I had advice for how to map exception codes to what the problem is, but as far as I can tell, Google and hope are your best bet).

Ok, so what now?

Step 1: Googling

Looking around, amongst the “I’m getting 0xe053534f, how do I fix it?” posts on forums you’ll realize that what you want is a crash dump, and you’ll find some advice on how to obtain one, like this, or this.

You probably don’t want to use AdPlus

While AdPlus is indeed super cool, and indeed has a mode that will allow you monitor all of the IIS worker processes (a command which I will not include here because I don’t want to encourage you to use it), in my experience the overhead was way too high: The throughput on the monitored servers dropped to the point where 1. They basically weren’t doing any work, 2. They stopped crashing (Don’t worry… subset of servers during a time of reduced throughput requirements, negligible customer impact).

You probably don’t want to try and set up an Unhandled Exception Module

At least not while you’re under the gun. I never actually got it to work, and it wouldn’t actually help us here since it probably wouldn’t get called for a stack overflow anyway.

Step 2: Check the event log

Did you know that Windows Error Reporting has your back? If you look at the Event Log, the event right after the crash should be from Windows Error Reporting and hopefully it has some more information like:

Fault bucket , type 0
Event Name: APPCRASH
Response: Not available
Cab Id: 0
 
Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.17651
P6: 4e21213c
P7: e053534f
P8: 000000000000cacd
P9:
P10:
 
Attached files:
 
These files may be available here:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue\AppCrash_w3wp.exe_16c01b4e25db8aa7e52fdd921dbac8b8dba70ff_e37855c7
 
Analysis symbol:
Rechecking for solution: 0
Report Id: 3bfe0e87-a15d-11e3-a83d-14feb5d64ad7
Report Status: 4

… okay, so that’s not super informative, but there’s a line in there that should get you excited, C:\ProgramData\Microsoft\Windows\WER\ReportQueue is a folder that might already contain crash dumps… no collection required! You probably want to look there and see if there are dumps that look relevant to resolving your problem.

What if there aren’t any crash dumps?

Check all affected servers, you only need one. I never got around to trying it, but SysInternals ProcDump looks like it might do the job, hopefully without taking the toll that AdPlus did.

DebugDiag also came up repeatedly while I was looking around, maybe that would do the trick. Looks more complicated to get going than ProcDump though.

Step 3: Windbg

Okay, you’ve got a crash dump, open it up in windbg, and look around (there are tons of windbg cheat sheets around like this one, this one, or this one). We’re going to:

  1. Load the CLR extensions (.loadby sos mscorwks).

  2. Load the symbols (.symfix).

  3. Inspect the stack (!clrstack).

  4. Inspect the exception (!printexception).

This dump file has an exception of interest stored in it.
The stored exception information can be accessed via .ecxr.
(1cbc.3d34): Unknown exception - code e053534f (first/second chance not available)
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for ntdll.dll -
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for KERNELBASE.dll -
ntdll!NtWaitForSingleObject+0xa:
00000000`76ed135a c3              ret
0:078> .loadby sos mscorwks
------------------------------------------------------------
sos.dll needs a full memory dump for complete functionality.
You can create one with .dump /ma <filename>
------------------------------------------------------------
0:078> !clrstack
*********************************************************************
* Symbols can not be loaded because symbol path is not initialized. *
*                                                                   *
* The Symbol Path can be set by:                                    *
*   using the _NT_SYMBOL_PATH environment variable.                 *
*   using the -y <symbol_path> argument when starting the debugger. *
*   using .sympath and .sympath+                                    *
*********************************************************************
PDB symbol for mscorwks.dll not loaded
------------------------------------------------------------
sos.dll needs a full memory dump for complete functionality.
You can create one with .dump /ma <filename>
------------------------------------------------------------
c0000005 Exception in C:\Windows\Microsoft.NET\Framework64\v2.0.50727\sos.clrstack debugger extension.
PC: 00000642`ffcbf8e4  VA: 00000000`00000000  R/W: 0  Parameter: 00000000`00000000
0:078> .symfix
0:078> !clrstack
OS Thread Id: 0x3d34 (78)
(!clrstack processes a max of 1000 stack frames)
Child-SP         RetAddr          Call Site
000000000e0e6c70 000007ff01a617e9 HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
000000000e0e6cf0 000007ff01a617e9 HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
000000000e0e6d70 000007ff01a617e9
<many, many, many more lines of that>
0:078> !printexception
Exception object: 0000000270090138
Exception type: System.StackOverflowException
Message: <none>
InnerException: <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 800703e9

Here’s what we’ve learned so far:

  1. We’ve confirmed a stack overflow as the cause of our problem.

  2. We’re seeing the error in HtmlAgilityPack, but we can’t actually see the base of the stack because !clrstack gives up too soon.

Delve Deeper

!clrstack won’t go deep enough to show us the base of the stack, but we can get the base of the native stack via k 1000

0:078> k 1000
Child-SP          RetAddr           Call Site
<the native error handling frames>
00000000`0e0e6c70 000007ff`01a617e9 0x000007ff`01a617bc
00000000`0e0e6cf0 000007ff`01a617e9 0x000007ff`01a617e9
<lots of stack frames>
00000000`0e13ad70 000007ff`01a5d442 0x000007ff`01a617e9
00000000`0e13adf0 000007ff`01a5cf28 0x000007ff`01a5d442
00000000`0e13ae80 000007ff`01a5c39a 0x000007ff`01a5cf28
00000000`0e13aed0 000007ff`01a5bd9c 0x000007ff`01a5c39a
00000000`0e13af20 000007ff`01a5bbaa 0x000007ff`01a5bd9c
00000000`0e13afe0 000007ff`01a5b224 0x000007ff`01a5bbaa
00000000`0e13b020 000007ff`01a5aff3 0x000007ff`01a5b224
00000000`0e13b060 000007ff`01a5af05 0x000007ff`01a5aff3
00000000`0e13b0b0 000007ff`01a6f836 0x000007ff`01a5af05
00000000`0e13b0f0 000007ff`01a582e1 0x000007ff`01a6f836
00000000`0e13b4e0 000007ff`01a50fd0 0x000007ff`01a582e1
00000000`0e13b7f0 000007ff`0221f79c 0x000007ff`01a50fd0
00000000`0e13bdf0 000007ff`019513ef 0x000007ff`0221f79c
00000000`0e13be70 000007ff`0194c972 0x000007ff`019513ef
00000000`0e13bf90 000007ff`01949ff0 0x000007ff`0194c972
00000000`0e13c2f0 000007ff`01341aa2 0x000007ff`01949ff0
00000000`0e13c6a0 000007ff`013409c6 0x000007ff`01341aa2
00000000`0e13c900 000007ff`00ffc48e 0x000007ff`013409c6
00000000`0e13c9d0 000007ff`00f1708c 0x000007ff`00ffc48e
00000000`0e13ccd0 000007ff`008f5ec4 0x000007ff`00f1708c
00000000`0e13cdc0 000007ff`00d23e05 0x000007ff`008f5ec4
00000000`0e13d600 000007ff`00c7e416 0x000007ff`00d23e05
00000000`0e13d6d0 000007ff`00c7ad5e 0x000007ff`00c7e416
00000000`0e13d7d0 000007ff`00c78d85 0x000007ff`00c7ad5e
00000000`0e13d960 000007ff`00c6ccb9 0x000007ff`00c78d85
00000000`0e13d9b0 000007ff`00c69be1 0x000007ff`00c6ccb9
00000000`0e13dad0 000007ff`00c691ab 0x000007ff`00c69be1
00000000`0e13dc50 000007ff`00c684c4 0x000007ff`00c691ab
00000000`0e13dcb0 000007fe`f14ee0ba 0x000007ff`00c684c4
00000000`0e13dcf0 000007fe`f47451c7 mscorwks!UMThunkStubAMD64+0x7a
<more native frames>

We’ve got the base of our stack, we just can’t read it :). Fortunately, the !IP2MD command will turn the last pointer into a description of the method:


0:078> !IP2MD 0x000007ff01a617e9
MethodDesc: 000007ff0237bed8
Method Name: HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
Class: 000007ff01d1da00
MethodTable: 000007ff0237c3d8
mdToken: 0600000b
Module: 000007ff0237abc0
IsJitted: yes
CodeAddr: 000007ff01a61700

Putting together some copy/paste, some Vim, some !IP2MD and some grep you can turn that list of pointers into the base of your stack:

HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
HtmlAgilityPack.HtmlNode.CloseNode(HtmlAgilityPack.HtmlNode)
HtmlAgilityPack.HtmlDocument.CloseCurrentNode()
HtmlAgilityPack.HtmlDocument.PushNodeEnd(Int32, Boolean)
HtmlAgilityPack.HtmlDocument.Parse()
HtmlAgilityPack.HtmlDocument.Load(System.IO.TextReader)
HtmlAgilityPack.HtmlDocument.LoadHtml(System.String)
FogCreek.HtmlParser.HtmlParser.Initialize(System.String)
FogCreek.HtmlParser.HtmlParser.ExtractSafeHtml(System.String, FogCreek.HtmlParser.IHtmlTransform[])
FogCreek.HtmlParser.HtmlParser.ExtractText(System.String)
FogCreek.FogBugz.__Global.GetMessageTextByMsg(Wasabi.Runtime.Mail.MailMessage, Boolean, Boolean, System.Nullable`1<Boolean>)
FogCreek.FogBugz.CFullTextIndexer.IndexBugEvent(FogCreek.Search.Interfaces.IDocument, System.DateTime ByRef, Int32 ByRef, FogCreek.FogBugz.CBugEvent, System.Nullable`1<Boolean>)
FogCreek.FogBugz.CFullTextIndexer.IndexBug(Int32, Int32, Int32 ByRef)
FogCreek.FogBugz.CFullTextIndexer.MemorySafeIndexBug(Int32, Int32, Int32)
FogCreek.FogBugz.CFullTextIndexer.IndexItem(System.String, Int32, Int32, Int32 ByRef)
FogCreek.FogBugz.CFullTextIndexer.UpdateByType(System.String, Int32)
FogCreek.FogBugz.CFullTextIndexer.HeartbeatUpdateIndex(FogCreek.FogBugz.CNotification, System.String, Int32)
FogCreek.FogBugz.__Global.SingleHeartbeatFogBugz(Int32, Int32, Int32, System.DateTime)
FogCreek.FogBugz.__Global.SingleHeartbeatPlatform(Int32, Int32, Int32, System.DateTime)
FogCreek.FogBugz.__Global.DoWork(Int32, FogCreek.FogBugz.CTrialManager)
FogCreek.FogBugz.__Global.StartMultiHeartbeat()
FogCreek.FogBugz.HttpHandler.ProcessRequest(System.Web.HttpContext)
System.Web.HttpApplication+CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
System.Web.HttpApplication.ExecuteStep(IExecutionStep, Boolean ByRef)
System.Web.HttpApplication+PipelineStepManager.ResumeSteps(System.Exception)
System.Web.HttpApplication.BeginProcessRequestNotification(System.Web.HttpContext, System.AsyncCallback)
System.Web.HttpRuntime.ProcessRequestNotificationPrivate(System.Web.Hosting.IIS7WorkerRequest, System.Web.HttpContext)
System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr, IntPtr, IntPtr, Int32)
System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr, IntPtr, IntPtr, Int32)
DomainNeutralILStubClass.IL_STUB(Int64, Int64, Int64, Int32)

Step 4: Fix the Bug

What we’ve learned:

  1. We’ve got a crash caused by recursion in HtmlAgilityPack.

  2. Our code is calling into HtmlAgilityPack while processing email for the search index.

You can fish around in the dump a little more to try and find stuff like parameters that will help you track down the requests causing problems, but that was enough to point me in the right direction.

Not coincidentally, we’ve shared our fork of HtmlAgilityPack on GitHub (and a patch on Codeplex).

Bonus: Even More Debugging

While investigating a separate crash I found use for a different command: !analyze -v, which helpfully extracted the relevant exception and call stack all on it’s own:

0:054> !analyze -v
<lots of stuff>
EXCEPTION_OBJECT: !pe ff5ce5e0
Exception object: 00000000ff5ce5e0
Exception type: System.InvalidOperationException
Message: The ConnectionString property has not been initialized.
InnerException: <none>
StackTrace (generated):
<none>
StackTraceString: <none>
HResult: 80131509
 
MANAGED_OBJECT: !dumpobj ff5d0210
Name: System.String
MethodTable: 000007ff001d0d90
EEClass: 000007ff001ccdf0
Size: 2226(0x8b2) bytes
(C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)
String:    at System.Data.SqlClient.SqlConnection.PermissionDemand()
at System.Data.ProviderBase.DbConnectionClosed.OpenConnection(DbConnection outerConnection, DbConnectionFactory connectionFactory)
at System.Data.SqlClient.SqlConnection.Open()
at FogCreek.Database.Connection.Open() in d:\code\hosted\build\FB\8.9.106H\fogbugz\FogBugzUtils\FogCreek.Search\FogUtil.Database\FogUtil.Database\Connection.cs:line 79
at FogCreek.Database.Command.ExecuteNonQuery() in d:\code\hosted\build\FB\8.9.106H\fogbugz\FogBugzUtils\FogCreek.Search\FogUtil.Database\FogUtil.Database\Command.cs:line 164
at FogCreek.Database.IndexFile.Unlock() in d:\code\hosted\build\FB\8.9.106H\fogbugz\FogBugzUtils\FogCreek.Search\FogUtil.Database\FogUtil.Database\IndexFile.cs:line 149
at FogCreek.Search.Store.Lock.Release() in d:\code\hosted\build\FB\8.9.106H\fogbugz\FogBugzUtils\FogCreek.Search\FogUtil.Search\Store\Lock.cs:line 26
at Lucene.Net.Index.IndexWriter.Finalize() in d:\code\hosted\build\FB\8.9.106H\fogbugz\FogBugzUtils\FogCreek.Search\FogUtil.Search\Lucene.Net\Index\IndexWriter.cs:line 562
<even more stuff>

Oh look, a finalizer causing a problem. I’ve never seen that before</sarcasm> (Seriously though, if anyone has ever told you that you probably shouldn’t use finalizers, this is why: they’re hard to test, they can interfere with the operation of the garbage collector, and they can straight up crash your app if they generate exceptions).

Conclusion

I put this together because while the internet is rife with WindDbg cheat sheets, it doesn’t seem to have quite as many walk-throughs of how the various and sundry commands might apply to debugging ASP.NET crashes, which commands are useful and what kind of output you might expect. Hopefully this helps you. I know it will help remind me of what to do the next time I need to resort to WinDbg.

PS: If debugging production apps makes you feel a tingling sense of accomplishment, you might like to know that we’re hiring.



How We Make Trello

March 13th, 2014 by Bobby Grace

Look around you. Is there something nearby that is firmly bolted to the ground? Okay good. Hold onto it. I’m about to pull back the curtains on everything you thought you knew about how Trello was made. The revelation could result in a rippling shockwave that knocks you off your seat and may have troubling, unpredictable consequences for the time-space continuum. Possibly.

To understand how Trello is made, you have to understand a little about how it works. When you go to trello.com in your browser, you are basically running an application–called a client–that talks to the Trello API. The API is a thing that goes to the Trello database and returns with the boards and cards you asked for. You tell the API, “Hello, I’m Bobby. If you don’t mind I would like to see my boards whenever it’s convenient” and in mere milliseconds the API responds with “{user: bobbygrace, boards: ['44321234432', '44321234433']}”. The client takes that and turns it into the beautiful, human-readable information you see on your screen.

You’ve probably downloaded our iOS, Android, Kindle, and/or Windows 8 apps, and are saying to yourself, “These are very well polished apps which I have rated or will rate favorably in their respective app stores. Are they just clients that talk to the Trello API?” Yes. They are just clients. There are clients everywhere. In fact, the Trello API is public and if you know how to speak Trello API, you can make a client, too.

The Trello API is well-written, has no bugs, and is totally rock solid. Or at least it doesn’t change very often. That means we can put out new clients all the time without having to update the API. In fact, we can have multiple versions of the website out at any given time.

“This is pretty interesting and I’m sure it will be relevant later in this post, but I thought this was about how you make Trello.” Okay, okay. I’ll explain our process first and then we’ll come back to this.

Using Trello for Trello

It should come as no surprise that we use Trello to make Trello. We have a board called “Trello Internal” which includes all the stuff related to the API and website that we’re currently working on. Our lists look something like this: “Doing”, “Waiting for test/review”, “Ready for merge”, then a bunch of lists for recent releases, “Incoming Bugs”.

internal

So, let’s say a developer wants to fix a bug. That process goes like this:

  • They pick a card from the “Incoming Bugs” list, assign themselves, and move it to the “Doing” list.
  • They pound on their keyboards like wild primates until it appears to be fixed. (Maybe this is just me.)
  • They create a new branch on their repository, like bobby/oops-no-crash, and push it to Kiln, a top-of-the-line distributed version control and code review system. (Disclaimer: we also make Kiln.) I’m not going to dig too much into how version control works, but basically a branch is a new version based off of a stable, established version that is in production. Everyone who knows how version control works is shrugging and grimacing right now.
  • They put the name of the branch in the card description, like “fixed in bobby/oops-no-crash”, and move it to the “Waiting for test/review” list.
  • Simultaneously, they get their changes reviewed by another developer in Kiln (which you should buy) and notify our intrepid tester Ali T. who checks out that branch and makes sure it’s really fixed.
  • Once it gets four thumbs up, the card is moved to the ”Ready for merge” list.
  • Doug Patti, our release manager, merges the changes into the official Build repo, makes the version with the change the new stable version, cuts a release, and pushes it into the wild. We often joke about how using the Kinect would make this step more exciting. Doug could just point in a different directions and say “Deploy! Deploy! Deploy!” Shamefully we haven’t made this a priority.
  • The fix is out. Bottles of champagne are popped.

Now another developer can pull from the official Build repository, which includes the fix, and merge it into their working branch. Then their changes can go through the above process and everything is real smooth like a good, old-fashioned saxophone solo. We push out fixes many times a day.

I’m sure everything I have said so far sounds reasonable, but bug fixes are small potatoes, right? What about big, new, dangerous potato features that you might want to test out first? Something like the new card back?

How we make and test big, new, dangerous potato features.

Remember how the Build repository contains the stable, established version of Trello? Well actually, there are many stable, established versions of Trello and you might be running any one of them. What you see is not necessarily what everyone else sees. If the walls of reality just crumbled for you, I apologize. Why would we play you like that?

Well lets say we’ve decided to tackle a bigger problem in Trello. We dutifully researched, designed, and developed a whole new version of Trello and it’s completely code reviewed and tested, a long, carefully calculated process that I’ve squashed into one run-on sentence. This version is probably still a confusing, frustrating mess that is riddled with bugs. Why? Because the design is just an educated guess and it hasn’t been subject to eyes on the street. We don’t really know how it works in the real world (a concept which I have fundamentally shaken for you). We ship it to the team and find what really needs fixing, instead of subjecting everyone to not-quite-there version. So we put it out on the “Alpha” channel.

“THE WHA?”

channel switcher

There are three channels of the Trello website: “Stable”, “Beta”, and “Alpha”. When you go to trello.com, the server figures out which channel you are on and which version you should see. Everyone is on “Stable”, a handful of folks, maybe 20, have access to the “Beta” channel, and Trello team members and Fog Creekers are on “Alpha”. If you have access to more than one channel, you get a coveted channel switcher.

So while fixes are continuously being pushed out to the stable channel, our big potato versions like the new card back or the new boards page go out to alpha channel and are used by the team with real data. When it hits alpha, the gloves come off. Then there is heated debate. And anger. Glory. Joy. Frustration. Hope. And iteration. Lots of iteration. If only you saw some of the junk that hit the alpha channel… shoot.

Oh and if there are API changes associated with the new client, they are released separately and before the new client hits the alpha channel.

We iterate and refine until it’s ready for the beta channel. The beta version has a prominent “Give us feedback” button and disclaimer saying this is an early “explorer” versions of Trello and says “look out for these new things”. Then we get feedback, iterate, refine… When we’re almost happy, we slowly roll it out to a small, randomly-selected percentage of people on the stable channel. That usually starts at 1% goes up to 15% over a few days. Then we get plenty of feedback. We look for the big things that come up frequently, then iterate and iterate some more.

Now things are great and it goes out to 100% of stable. We draft up a blog post with the biggest, coolest new features, announce it, and then boom a bottle of champagne is popped. We are popping bottles of champagne constantly and if you ask me I think it should stop because it’s getting expensive and champagne gives me the worst hangovers.

Design and Research

I would like to elaborate on the design and research phase since I glossed over it earlier. We use a product planning board that has a list of problem-oriented cards. Cards are titled in the vein of “I can‘t find this feature” or “I want to do this thing but can’t”. These problems come from a few places:

  • Common cases seen in support emails sent in via FogBugz, a world-class bug tracker. (Disclaimer: we also make FogBugz and we think you should buy it.)
  • Frustrations of team members.
  • One-on-one sessions with Trello users conducted by Ben, our customer happiness maker. I don’t know what Ben’s title is.

We use all this feedback as research. We also do numerous internal polls and hallway usability tests. We also look at anonymous usage tracking in Google Analytics to see what features people are actually using. In all, we hope to get a sense of how people are currently using Trello, how they might want something to work, what they aren’t using, and where their pain points are.

On the back of each card, we take this info into account and propose a few solutions. When we settle on a card that really needs fixin’, I’ll take the card, make a few pen and paper sketches based off the proposed solutions, and create a few low-fi mockups and workflows in Sketch. (Disclaimer: we do not make Sketch, but you should buy it.) I try and post workflows and screenshots to the associated Trello card throughout the process so I can get continuous feedback from the team. I know the design is going to change a lot so I don’t waste time with high-fidelity mockups. We can quickly iterate in code anyway. Here’s an example of the new card back mockup. It has notes to point out the important new things. This one is more high-fidelity than others because it covers some visual design aspects.

jellydonut-workflow

The next step before developing a prototype is establishing a project codename. This is very important and happens to be lots of fun. The new boards page was codenamed “Borderlands” because it concerned everything outside the main board view. The new card back was codenamed “Jelly Donut” because it was even tastier on the inside (of cards). When we added the ability to have multiple clients, we named it “Hydra” after the mythical multi-headed beast. We’re working on some new search stuff. It’s codenamed “Hayride” which I think is a reference to finding a needle in a haystack? The iOS team has project codenames, too. When they were working on scrolling performance, they codenamed it “Hobbit” after the major motion picture featuring a smooth, lifelike high frame rate. Codenames are important as a point of reference. It’s also what we name our branches. bobby/borderlands is much better than bobby/new-boards-page-starred-boards-other-stuff-too.

Now that it is all codenamed and prototyped up, we ship it off to alpha and start the iteration process. Rollout. Launch. Champagne. Saxophones.


I hope this was informative. Maybe there was something you can take home with you. Our release process and workflow has come a long way, mostly due to the work of Doug and the ability to have multiple clients in production. Maybe Doug will write something more technical about that. Also I should say that the iOS and Android teams have different workflows that are interesting in their own right but not covered here. Ian says the iOS build process is “cmd + B”, but I’m like, come on, you know that’s not what I’m talking about.

That’s all. Please yell at me on Twitter.



FogBugz Visits the Head(er) Shrinker

February 7th, 2014 by Adam Wishneusky

The look and feel of FogBugz On Demand hasn’t changed much in two years. We thought we’d update it, starting with the part that shows up on every page: the header. We re-designed the FogBugz On Demand header to make it more modern, compact and organized.

R.I.P. old header…

old header

Long live new header!

new header

Not everyone has the luxury of a 4K display. When you’re working on a small laptop screen, space is limited. Menu bars, docks and taskbars gobble up vertical space. To make more room for your actual work, we made the new header 15% shorter. We did this by consolidating the two rows into one. The FogBugz and Kiln links from above moved into a hover menu:

new - product dropdown

To fit the menus from the top-right into the new design, we made smarter use of the width of the bar. Our customers told us that they work mostly with cases, so we grouped all of the case-related controls at the left. We used symbols for the right-hand menus and removed the List Cases button. You now get to your filters by hovering over Cases.

The first thing we learned when we rolled the new design out to 10% of accounts was that you click List Cases a LOT. To give you faster access than hovering then clicking inside the new menu, we made Cases clickable. Hover over it to get to a specific filter, or just click on it to show your current case list.

new - cases dropdown

Hover over the four right-hand menus and you’ll see that Admin is now “gear,” Extras is now “toolbox,” Help is now “?” and My Settings is now the avatar menu. Log Out has moved into the avatar menu. If you have a larger screen, you have no doubt noticed that the old search box didn’t do you any favors. Whatever length text you entered, it stayed tiny. The new box knows that sometimes you use a lot of search axes in order to narrow down and speed up your search. While it starts out compact,

new - search box 1

When you click into it and type, it grows so you can see what you’re doing:

new - search box 2

We’d love to know what you think of it so drop us a line. Not using FogBugz On Demand? Sign up for a free trial today!