Quoted, but misunderstood: What’s Missing from Production System Troubleshooting

Michael Baum quotes my feedback on his survey article, but completely misses that we actually want the same thing. We just approach it from different angles.

To get to the (misunderstood) point:

The notion that IT people need even more data generated by developers kinda misses the point. Troubleshooting production applications is a whole lot different than debugging code in development or staging environments. Production systems involve many technologies and systems that just don’t appear in pre-production environments.

I am not asking for more data. I am asking that the data format currently in use be reviewed to see whether it is actually useful for troubleshooting and monitoring, and that a concerted effort be made to change the format where it proves not to be.

For example, any message produced by a multithreaded service must include a thread or transaction id. Extracting a sequence of events from logs whose entries are simply intermingled is next to impossible. The same problem arises when timestamps are in a format that cannot be correlated with other log types.
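To make that concrete, here is a minimal sketch in Python using the standard `logging` module. The format string, the logger name `demo.service`, the `worker-N` thread names and the `group_by_thread` helper are all my own assumptions for illustration, not anything a particular product emits. It stamps every entry with a UTC ISO-8601 timestamp and the thread name, then shows how trivially the intermingled entries separate back into per-thread sequences.

```python
import io
import logging
import threading
import time
from collections import defaultdict

# Hypothetical format: UTC ISO-8601 timestamp, level, thread name, message.
LOG_FORMAT = "%(asctime)s.%(msecs)03dZ %(levelname)s [%(threadName)s] %(message)s"


def make_logger(stream):
    """Build a logger that writes entries in LOG_FORMAT to the given stream."""
    formatter = logging.Formatter(LOG_FORMAT, datefmt="%Y-%m-%dT%H:%M:%S")
    formatter.converter = time.gmtime  # UTC, so entries line up with other logs
    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)
    logger = logging.getLogger("demo.service")  # assumed name
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def group_by_thread(lines):
    """Split intermingled entries into one ordered message list per thread."""
    grouped = defaultdict(list)
    for line in lines:
        start = line.index("[") + 1
        end = line.index("]", start)
        grouped[line[start:end]].append(line[end + 2:])
    return dict(grouped)


def worker(logger, order_id):
    logger.info("start order %s", order_id)
    logger.info("finish order %s", order_id)


stream = io.StringIO()
logger = make_logger(stream)
threads = [
    threading.Thread(target=worker, args=(logger, n), name=f"worker-{n}")
    for n in (1, 2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

lines = stream.getvalue().splitlines()
by_thread = group_by_thread(lines)
for name in sorted(by_thread):
    print(name, by_thread[name])
```

Without the thread name in each entry, the grouping step has nothing to key on, and that is exactly the situation a support engineer ends up in.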

My point was that developers will not see these kinds of issues until they have to do the troubleshooting themselves. Then they might be more amenable to pleas for better logging formats.

And yes, I did spend 3 years as a technical support engineer for BEA, looking at multi-megabyte (sometimes multi-gigabyte) log files for people whose configuration I had not known 30 minutes earlier. So I believe I have had to deal with the issues Michael has seen at Yahoo and at Splunk too. In fact, I will be delivering a JavaOne presentation about this very issue this year.

Speaking of Splunk, it is a great idea and a step in the right direction (as I wrote 9 months ago). I can see how it would have been useful to me when dealing with large data sets.

Unfortunately, it is only a first step; to replace my advanced troubleshooting environment (*cough* Vim *cough*), I would need to at least be able to highlight several patterns at the same time in different colors (e.g. IP address, time sequence and URL type).
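As a rough illustration of what I mean by multi-pattern highlighting, here is a small Python sketch. It is not how Vim or Splunk does it; the three regular expressions, the ANSI color choices and the sample log line are all made up for the example. It combines the patterns into one regex with named groups and wraps each match in the color assigned to its group.

```python
import re

# Hypothetical rules: (name, pattern, ANSI color code). Adjust to your logs.
RULES = [
    ("ip", r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "\033[31m"),   # red
    ("time", r"\b\d{2}:\d{2}:\d{2}\b", "\033[32m"),       # green
    ("url", r"/[\w./-]+\.(?:jsp|do)\b", "\033[34m"),      # blue
]
RESET = "\033[0m"

# One combined regex, so a single pass colors every pattern at once.
COMBINED = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat, _ in RULES))
COLORS = {name: color for name, _, color in RULES}


def highlight(line):
    """Wrap each match in the color of the named group that matched it."""
    def paint(match):
        return f"{COLORS[match.lastgroup]}{match.group()}{RESET}"
    return COMBINED.sub(paint, line)


sample = "10.0.0.7 12:30:05 GET /shop/cart.jsp 200"  # invented log line
print(highlight(sample))
```

Piping a log through something like this in a terminal gives each pattern its own color, which is the bare minimum I would want before giving up my editor.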

But I will be evaluating Splunk in more detail and will probably mention it in my presentation at JavaOne, especially if it becomes downloadable as a VMware image to try in a Windows environment.

(Update from Feb 14th: we are on the same page now)

BlogicBlogger Over and Out