typing monkey: keturn wrote
on March 15th, 2008 at 01:16 pm

artichoke

Sometimes the software we run generates errors.

I mean, it'd be nice if it didn't, but it does, and so what we hope for is that the error report includes enough information about the conditions that led to the error for us to track it down and fix it. Generating a report for a particular error is something pretty well understood; we generally use some variation on stack traces and core dumps, which works well enough.

The part I don't manage so well is what to do when you're receiving these errors not just from a single customer working with your application, but from the global set of all your users at once. (This problem is most obvious in web applications, but plenty of desktop applications have "report crash to developer" functionality now as well.) The approaches I've seen so far are:

  1. Show an error message to your user, and leave it to them to figure out how to communicate the error to you. I trust I don't need to go into all the ways in which this sucks.
  2. Dump everything into a log file, which is easy to do, but has too little structure to give a high-level view of the current state of things.
  3. Send an email with every error, which works fine in many cases, but makes bad problems worse: in addition to a bug that needs fixing, you now have to deal with a torrential flow of emails (with large chunks of debugging data attached) clogging your developers' inboxes.

The balance I need is to be informed of a new type of error as quickly as possible, but to not be flooded with redundant reports. I need to know if the problem is affecting 80% of our users, or just one in a thousand. I need all the debugging information stored somewhere for inspection if I need it, but not all pushed down to my email/phone/jabber/whatever in case I don't. I want to classify reports by exception type, code path, and perhaps other random details (browser version, IP address, etc).
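
To make the "80% of our users, or just one in a thousand" question concrete, here is a rough Python sketch of the bookkeeping it implies. The report fields (`exception_type`, `code_path`, `user_id`) and the function name `affected_share` are made up for illustration; they are not taken from any existing tool.

```python
from collections import defaultdict


def affected_share(reports, total_users):
    """Map each (exception type, code path) class to the share of users who hit it."""
    users_by_class = defaultdict(set)
    for report in reports:
        key = (report["exception_type"], report["code_path"])
        # A set, so repeated reports from the same user count once.
        users_by_class[key].add(report["user_id"])
    return {key: len(users) / total_users for key, users in users_by_class.items()}


# Hypothetical sample data: three reports of one error from two users,
# plus one report of a different error.
reports = [
    {"exception_type": "KeyError", "code_path": "cart.checkout", "user_id": 1},
    {"exception_type": "KeyError", "code_path": "cart.checkout", "user_id": 1},
    {"exception_type": "KeyError", "code_path": "cart.checkout", "user_id": 2},
    {"exception_type": "IOError", "code_path": "upload.save", "user_id": 3},
]
shares = affected_share(reports, total_users=4)
```

Counting distinct users rather than raw reports is the point: one user hammering reload should not make a rare bug look widespread.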

I know I'm not the only one with these requirements, so I'm sure an application for managing this exists somewhere; I just haven't found it yet. What is it?


From: krotty
Date: 2008-03-15 08:33 pm (UTC)
The Exception Logger plugin for Rails puts exceptions and debug state into the database, with a web interface to browse them. I prefer to err on the side of a torrent of emails, though; it is not that hard to create an email filter these days.
From: keturn
Date: 2008-03-15 08:46 pm (UTC)
Exception Logger looks like it might be something like what I'm looking for. (At least, for those times I'm doing Rails development.) I'll have to check it out; thanks for the tip.

From: keystricken
Date: 2008-03-15 08:47 pm (UTC)
lindseykuper is a great person to talk to about bug software. I've heard her talking incomprehensibly about it many times!

I believe jes5199 was mending something in OpenID with regard to the helper robots website. Did he tell you about it?
From: keturn
Date: 2008-03-15 08:51 pm (UTC)
I haven't heard anything from jes on that subject in recent weeks, no.
From: glyf
Date: 2008-03-19 09:14 am (UTC)
I have this idea for traceback hashing, which one of these days I'm going to implement and put on Failure. The idea is, you encode the traceback in some deterministic way: for example, "ExceptionType\nmodule.function:lineno\nmodule2.function2:lineno", hash that, and only report each hash once.

I don't know how realistically this would stem the torrential flow of email though, or whether it leaves out important information that you need reported. Any intuitions?
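
A minimal sketch of that idea in plain Python, to make it concrete. This is not Failure's API: `traceback_hash`, `should_report`, and the in-memory `_seen` set are hypothetical names, and SHA-1 is just one arbitrary choice of hash.

```python
import hashlib
import traceback


def traceback_hash(exc):
    """Hash the exception type plus one "module.function:lineno" line per frame."""
    lines = [type(exc).__name__]
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        module = frame.f_globals.get("__name__", "?")
        lines.append(f"{module}.{frame.f_code.co_name}:{lineno}")
    encoded = "\n".join(lines).encode("utf-8")
    return hashlib.sha1(encoded).hexdigest()


# Hashes already reported. Kept in memory only, which is an assumption of this
# sketch; a real deployment would need storage shared across processes.
_seen = set()


def should_report(exc):
    """Return True only the first time a given traceback hash is seen."""
    key = traceback_hash(exc)
    if key in _seen:
        return False
    _seen.add(key)
    return True
```

Note what the encoding leaves out: the exception message and all variable values. Two failures at the same place with different data collapse into one hash, and any edit that shifts line numbers produces a new one.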
From: keturn
Date: 2008-03-20 05:44 am (UTC)
Subject: Failure.hash
I think that hash would function well as a classifier for limiting notifications. I'd still want a histogram of occurrences over time, and the values of the variables in the stack for all occurrences stored somewhere for retrieval, in case a sample size of one was insufficient to determine the cause.

(Is there a ticket for this yet?)
From: glyf
Date: 2008-03-20 05:47 am (UTC)
Subject: Re: Failure.hash
It's not just for limiting notifications. If you want a histogram of occurrences of "this exception" over time, the hash could be used as a key for that too.
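
Roughly, and only as a sketch: the hash keys a per-hour counter, and the same key can hold the per-occurrence details asked for earlier in the thread. The class name `OccurrenceLog`, its methods, and the hourly bucket size are all invented here for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


class OccurrenceLog:
    """Per-hash hourly counts, plus the details of every occurrence."""

    def __init__(self):
        self.per_hour = defaultdict(Counter)  # hash -> Counter of hour buckets
        self.details = defaultdict(list)      # hash -> list of detail dicts

    def record(self, error_hash, when, details):
        """Store one occurrence; return True if this hash has not been seen before."""
        is_new = error_hash not in self.per_hour
        bucket = when.replace(minute=0, second=0, microsecond=0)
        self.per_hour[error_hash][bucket] += 1
        self.details[error_hash].append(details)
        return is_new

    def histogram(self, error_hash):
        """Return (hour, count) pairs in time order for one hash."""
        # .get avoids creating an empty entry for an unknown hash.
        return sorted(self.per_hour.get(error_hash, {}).items())


log = OccurrenceLog()
first = log.record("abc123", datetime(2008, 3, 20, 5, 44), {"user": "someone"})
```

The return value of `record` is the "notify only once" signal, while the histogram and stored details stay available for later inspection.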

And no, I don't think there's a ticket for this yet. Maybe buried somewhere in the Divmod tracker there's something that mentions it. Feel free to file a new one; worst case, we delete a duplicate.