Is Troubleshooting Becoming a Lost Art?

The logo of Naval Submarine School

I recently helped solve an intermittent network/server problem for a client. The problem had been confounding the firm’s IT staff for quite a while, and although it wasn’t why I was on site, it was getting in the way of my own consulting work. So I stepped in and started troubleshooting in hopes of getting it fixed so I could get back to work.

I didn’t know all of the details of their network, but I was able to narrow the problem down quickly by asking a few simple questions and using the free command-line tools built into Windows. The power of ipconfig, tracert and ping still amazes me. Once I showed them that their DNS server seemed to be at the root of the issues, it took them about 10 minutes to trace the fault to a flaky network port on the switch that fed their main DNS server.
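For readers who would rather script this kind of check than run the commands by hand, here is a minimal sketch in Python of one way to expose an intermittent DNS problem: resolve the same name several times and see whether the results are consistent. It is not what was run at the client site. The function names, the one-second "slow" threshold and the wording of the verdicts are all assumptions made for illustration.

```python
import socket
import time


def time_lookups(hostname, attempts=5, resolver=socket.gethostbyname):
    """Resolve hostname several times; return a list of (ok, seconds) tuples.

    The resolver argument is a hypothetical hook so the function can be
    exercised with a stand-in instead of a live DNS server.
    """
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname)
            ok = True
        except OSError:  # socket.gaierror is a subclass of OSError
            ok = False
        results.append((ok, time.perf_counter() - start))
    return results


def summarize(results, slow_threshold=1.0):
    """Turn raw lookup results into a rough verdict (threshold is an assumption)."""
    failures = sum(1 for ok, _ in results if not ok)
    slow = sum(1 for ok, secs in results if ok and secs > slow_threshold)
    if failures == 0 and slow == 0:
        return "dns looks healthy"
    if failures == len(results):
        return "dns consistently failing"
    return "dns intermittent: suspect the resolver or the path to it"
```

Calling summarize(time_lookups("some-internal-host")) from an affected workstation gives a quick first read. A mix of successes and failures points toward the DNS server or something between it and the client, which is the same conclusion the manual tools led to here.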

Simple and unexpected. After all, how often do network ports go bad while still showing a green indicator light? But the premise was easy enough to test: just switch the ports and see what happens. Sure enough, the problem didn’t recur, and they replaced the switch under warranty. A problem that had frustrated end users and IT staff alike, and wasted many hours of productivity, was quietly resolved.

That a team of IT professionals wasn’t able to quickly identify and resolve this issue on their own speaks volumes about the lack of formal troubleshooting training and practice in today’s IT workplace. Excellent staff members, who in other respects are great assets to their company, may not have much hands-on experience resolving problems. Sure, they know a lot and they are hard workers, but they don’t know how to make the best use of that time or knowledge. More and more I realize how lucky I was to complete my Navy training, because every single day of electronics school we were solving problems.

At first the problems were small and simple, but as we got closer to graduation and the end of our one year of training, they became increasingly complex. They would put us into a room and tell us something was wrong with our system. Now this “system” consisted of 27 racks of gear with 4 to 8 pieces of gear in each rack. We first had to figure out whether there was a problem and what it was, then isolate it, and then repair or replace the components. And the problem might actually be multiple separate problems. Oh yeah, and we were timed. No notes or cram sheets were allowed. We had only the system technical manuals, which consisted of about twenty 4-inch-thick 3-ring binders with full schematics of every part of the system, including logical diagrams.

Sometimes the problems could be solved in 10 minutes, and sometimes no one solved the problem in the allotted time. Sometimes you were doing it solo, and sometimes it was a liberty-dependent item, meaning no one from the class went home until the problem was resolved and our class of 15 guys all had to work together to resolve it. Those problems were invariably the toughest to solve. As class leader it fell to me to direct our efforts efficiently. After all, the sooner we solved the problem, the sooner we’d be at the bowling alley. Hey, what do you want? We were sailors and liked bowling. I still own my own shoes and average about 170 or so if you want to bowl a frame or two.

We quickly learned to take a step back and track everything through the flow charts and logical diagrams. Does the system have power? Does the system boot? Are signals going to the appropriate sub-components? Can we isolate the issue to a particular rack or piece of gear? Is the system actually getting info from the antennas? And so on. During the final week of training the instructors would all gather in the room and heckle you as you went about your troubleshooting steps.
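That habit of walking the questions in a fixed order and stopping at the first "no" translates directly into a script. The following is only a sketch in Python of the idea, not anything from the Navy manuals: the first_failure function and the example check names are hypothetical, and each real check would be whatever test fits your own environment.

```python
def first_failure(checks):
    """Run (name, check) pairs in order and stop at the first one that fails.

    Each check is a zero-argument callable returning True when that layer is
    healthy. Returns the name of the first failing check, or None if all pass.
    """
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical stand-ins for real tests, ordered from the most basic layer up.
example_checks = [
    ("has power", lambda: True),
    ("boots", lambda: True),
    ("signals reach sub-components", lambda: False),  # simulated fault
    ("antenna data arriving", lambda: True),
]
```

Here first_failure(example_checks) returns "signals reach sub-components", and the later checks never run. Since a fault can turn out to be several separate faults, it is worth rerunning the whole list after each fix rather than assuming the first failure was the only one.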

Not very nice, but very effective at teaching you to believe in yourself. It was sort of like Regis Philbin’s “are you sure” or “is that your final answer”, but not nearly as polite. And it was great training for having to recommend a solution to an angry, short-tempered Captain or Executive Officer, or for signing off that the system was A-OK and ready for deployment. That was never done lightly, because any problem could affect the safety of the ship, of you and of your fellow crew. It turns out I had to do just this shortly after graduation.

We didn't spend much time on the surface, but we looked good when we did.

On a Friday afternoon at 4pm, the day before a deployment to the North Atlantic, we were about to go ashore for our last liberty when I had to tell my chief there was a problem with the system. It was pretty scary since I was so junior. I had only been on the submarine one month and was the new guy to boot. This meant no one from my entire division was going to leave the boat until it was fixed. The good news was that we isolated and confirmed the problem, got the replacement component from the Sub Tender, and had it in place and tested by 7pm, with plenty of liberty left before the next day’s deployment. If school hadn’t been so tough, I’m not sure I would have found the problem or had the courage to confront it.

Obviously, in today’s economic environment no one has the luxury of spending a year training their staff, but the basics of good troubleshooting should always be kept in mind and used whenever possible. Quarterly scheduled fire drills are a great way to teach and practice troubleshooting skills while validating that everything is working as expected. They are also a great way to cross-train your staff and prepare for vacation season. Planned fire drills, especially ones performed in your test lab, also allow you to see how your staff performs in a crisis. Not to mention that they make your staff more efficient when real problems do occur.

Too often IT staffs go into fire-fighting mode, running around and putting out individual fires without taking a breath and thinking holistically about the issue or bringing together disparate incidents to form a cohesive picture. This is further compounded by a lack of proper documentation of their network. Lack of time and lack of resources will always be a constant in business and IT, but properly documenting your network and business workflow pays dividends over the long run and is essential to effective troubleshooting and improving operations.

Do you have any troubleshooting stories or tips? How does your organization ensure troubleshooting skills are kept current? -t
