Elaboration through sharing

Arseni Mourzenko

Founder and lead developer

December 13, 2014

The risk of working alone is that one may settle on a simplified vision of a problem and become unable to see beyond that simple model. This happens to me a lot; today, I have an excellent illustration of the problem.

Everything started with the question "How should I handle logger failures?" The author of the question was wondering how to deal with exceptions which occur within the exception logger itself.

Easy peasy: I had handled this in dozens of projects before, including a legacy in-house logging and reporting system which collected exceptions from server applications and services and let developers go through the list and deal with each one. When logging itself failed (for example, because the database was down), the exceptions were stacked locally on disk and reported later, once the logger was working again. An exception raised within the logger wrapped the originally reported exception through InnerException. There were tests for the fallback to ensure it worked as expected, so I was confident that the approach was good enough for any business app.

That was what my answer on Stack Exchange was about.
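
The fallback described above can be sketched roughly as follows. This is a minimal illustration in Python, not the original in-house system: `FallbackLogger`, the `send` callable, and the spool file layout are all hypothetical names chosen for the example, and the `inner` field merely mimics the role InnerException played.

```python
import json
from pathlib import Path


class FallbackLogger:
    """Sketch of a logger that spools records on disk when the sink is down."""

    def __init__(self, send, spool_path):
        # `send` is a hypothetical callable that ships one record (a dict) to
        # the central logging system and raises when that system is unavailable.
        self.send = send
        self.spool_path = Path(spool_path)

    def report(self, record):
        try:
            self.flush()       # deliver anything spooled earlier, oldest first
            self.send(record)
        except Exception as failure:
            self._spool(record, failure)

    def _spool(self, record, failure):
        # The logger's own failure wraps the originally reported record,
        # similar in spirit to chaining through InnerException.
        entry = {"logger_error": repr(failure), "inner": record}
        with self.spool_path.open("a", encoding="utf-8") as spool:
            spool.write(json.dumps(entry) + "\n")

    def flush(self):
        if not self.spool_path.exists():
            return
        lines = self.spool_path.read_text(encoding="utf-8").splitlines()
        for index, line in enumerate(lines):
            try:
                self.send(json.loads(line))
            except Exception:
                # Keep only the entries that have not been delivered yet.
                remaining = "\n".join(lines[index:]) + "\n"
                self.spool_path.write_text(remaining, encoding="utf-8")
                raise
        self.spool_path.unlink()
```

Note that this sketch does nothing about a failure of the spool itself (a full disk, for instance), which is exactly the kind of blind spot the rest of this article is about.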

Then Jon Raynor added a different view of the subject. In particular, he mentioned two things I had never thought of:

The distinction between critical and non-critical logging. When logging is critical, the application should simply stop if it cannot log, which is indeed the only acceptable solution. Since I had been working exclusively on applications with non-critical logging, I never thought about the difference.
The importance of log message frequency. In other words, if an application has been reporting on average 5 messages per minute for the last six months, but hasn't reported any message for the last two days, chances are something is wrong with the application or with the logging.
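
Both points can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the `CriticalLoggingFailure` exception, and the `factor` threshold (how many times longer than the usual gap between messages a silence must be before it looks suspicious) do not come from the original discussion.

```python
from datetime import datetime, timedelta


class CriticalLoggingFailure(Exception):
    """Raised when a log entry that must not be lost could not be written."""


def log_record(send, record, critical):
    # `send` is a hypothetical callable that raises when logging fails.
    try:
        send(record)
        return True
    except Exception as failure:
        if critical:
            # Critical logging: give up and let the application stop.
            raise CriticalLoggingFailure("cannot write critical log") from failure
        return False  # non-critical logging: carry on without the entry


def is_suspiciously_silent(timestamps, now, factor=20.0):
    """Return True when the current silence is far longer than the usual gap."""
    if len(timestamps) < 2:
        return False  # not enough history to estimate a normal rate
    ordered = sorted(timestamps)
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return (now - ordered[-1]) > mean_gap * factor
```

With a history of one message every 12 seconds (5 per minute) and the default `factor`, a silence longer than four minutes would be flagged, so two days without a message would stand out immediately. A real monitor would probably want something more robust than a mean gap, for instance to cope with quiet nights and weekends.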

I could have grasped that aspect, since one of the visualizations I often use for logging is a chart showing the number of messages per minute or per hour. But the primary goal of those charts was to react quickly to a sudden peak. This happened twice, when Continuous Deployment pushed to production an app which wasn't tested enough, causing thousands of errors to hit the logging platform within the next few minutes.

Then Aaronaught left a comment, adding two more things I had never really thought about:

Cycle detection,
Exponential backoff.
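
As a rough illustration of these two ideas, here is a minimal Python sketch. It rests on assumptions: "cycle detection" is read here as noticing that a failure of the logger is being reported to that same logger, and all names (`backoff_delays`, `send_with_backoff`, `ReentrancyGuard`) and default values are hypothetical.

```python
import threading
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, attempts=8):
    """Yield waiting times that grow exponentially, up to `cap` seconds."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def send_with_backoff(send, record, delays, sleep=time.sleep):
    # Retry `send` (a hypothetical callable that raises on failure),
    # waiting longer after each failed attempt.
    for delay in delays:
        try:
            return send(record)
        except Exception:
            sleep(delay)
    return send(record)  # final attempt; let any failure propagate


class ReentrancyGuard:
    """Refuses to enter a logging call from within a logging call."""

    def __init__(self):
        self._state = threading.local()

    def __enter__(self):
        if getattr(self._state, "active", False):
            raise RecursionError("logging cycle detected")
        self._state.active = True
        return self

    def __exit__(self, *exc_info):
        self._state.active = False
        return False  # never swallow the exception
```

Wrapping the body of the logging routine in `with guard:` makes a recursive call fail fast instead of looping, while the growing delays keep a struggling logging platform from being hammered with retries.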

Now that's really disturbing, because I believed I had thought enough about handling exceptional behavior within the logger itself, yet I never took those two things into consideration (nor did I know what exponential backoff was before looking it up on Wikipedia).

Now, the next time I implement logging, I'll have at least three things to reconsider and do differently.

This is a good example of why it is so crucial to share one's ideas and solutions with others and see what they think. No matter how confident a person is that his approach is fine, he has probably missed a lot of aspects and details.