January 30th, 2012 by Jude Allred (3 min read)
Better BugzScout Error Reporting
Prelude and a pain point.
One of the many ways to spawn cases in FogBugz is via a mechanism we call BugzScout. BugzScout is a simple interface for collecting crash reports from your software in the wild. For example, you might add an exception handler to your product which hooks into BugzScout so that when your customers encounter errors, you receive a case in FogBugz with details about the problem. Unlike the FogBugz XML API, which can also be used to serve this use case, BugzScout doesn’t require that you embed any type of FogBugz security token in your deployed application, and BugzScout will automatically collate all of the bug reports into cases based on titles of the reports.
We use BugzScout extensively throughout our own infrastructure — when FogBugz accounts misbehave, if an unparseable email gets stuck in an account’s mail queue, if an account’s full-text search index gets corrupted, etc., we receive cases in our FogBugz environment detailing the problem and often including key diagnostic information such as stack traces and assorted report-specific meta-data.
BugzScout has been around for a while, long enough to get a little crufty, and we’ve felt some BugzScout pain as we’ve scaled up. Pain can lead to cases:
We generally agreed with these suggestions, and we’ve definitely known the frustration of some small process misbehaving and sending literally thousands of BugzScout reports our way.
The first thing we came up with was what we all thought would be a completely amazing BugzScout-esque product which could solve the BugzScout use case perfectly. But we didn’t build that… its scope was slightly larger than the part-of-a-week we wanted to spend on it. Plus, building a new product wouldn’t have solved our immediate problems fast enough. Our driving force behind the improvements became the following insight:
Sometimes you get a human-readable number of BugzScout reports. In these cases, you absolutely want the full error text from each occurrence, and it’s reasonable to route the event through all of the standard notification hooks. If someone doesn’t want to hear more about this error, they can direct BugzScout to stop reporting via editing the case.
Other times, you get so many reports that no human will read them. The case is spammed with stack traces and error notifications. After some point, additional stack traces are of no use — you’re not going to read through them anyway, they just make the case cumbersome. At this scale, you need BugzScout to do a few things:
- Shut up and let me fix it. You’ve already gotten a bunch of emails about the problem- one additional email per occurrence is of no use to anyone, but can legitimately hinder your efforts to actually deal with the problem.
- Help me know I’ve fixed it. It’s always been possible to determine if more reports are coming in by either watching the occurrence count of the BugzScout case or by monitoring when the case was last modified by BugzScout. Unfortunately, if you wanted to accomplish #1 by setting BugzScout to ‘Stop Reporting’, you’d also stop getting updated timestamps from when new reports came in.
We decided that reconciling #1 and #2 would be a great improvement for BugzScout and would be well within the scope of a few days of development. We felt confident that this type of change would solve our pain point, but before we went any further we wanted to make sure we weren’t operating in a no-fact zone. We reached out to our SaaS platform and gathered some anonymous data on how people were using BugzScout. The data confirmed our guess: Our use cases are valid and a significant amount of people are using BugzScout-based error reporting.
We settled on:
- BugzScout will now automatically place itself into “Stop Reporting” mode after 50 occurrences of a report.
- Last Occurrence is a new case field which is stored alongside all BugzScout cases and is fully searchable, filterable, etc. Last Occurrence is the timestamp of the most recent BugzScout report, regardless of whether that report was recorded as a case event.
- Human-readable amounts of BugzScout reports should be unaffected.
- Human-unreadable amounts BugzScout reports will automatically cut off after 50. This means that the bloody tide of email will be curbed; BugzScout cases are limited to having a workable number of events in them.
- If it comes time to collect more reports, you can do this by editing the case and instructing Scout to ‘Continue Reporting’.
- Last Occurrence is a new and useful filterable value.
We’ve been dogfooding the new BugzScout in our FogBugz environment for the last month and have been quite happy with the improvements. We’ve found that the automatic shutoff of case event generation after 50 reports has allowed us to function much better when dealing with issues at scale.
Here’s an example of how our team likes to monitor the incoming BugzScout reports using the new Last Occurrence column:
This isn’t the only way we triage our BugzScout reports, but this simple filter lets us quickly spot any issues that are happening right now and know if they’ve happened with enough frequency to warrant immediate attention.
These changes went live to FogBugz On Demand two days ago, so we hope that you’ve found them useful. Licensed FogBugz customers will get these changes in our next licensed release.