In my 20+ years of doing this networking thing, I have been part of many a network debugging party, often in the capacity of the vendor that “caused” the problem, but also at times as part of a team attempting to create a network that was probably slightly overbuilt for its purpose (remember those multiprotocol InteropNets in the mid-90s?). Even this week, while working with a customer on something that simply appeared “too bizarre” to be real, it struck me that debugging a network is an art. And while I am OK with it being an art form, it must be removed from the black magic realm of art.
I learned long ago that networks require time to settle. Throughout my career of building new networks or significantly overhauling existing ones, it has been very rare for any significant installation to work flawlessly from day one. Like a brand new house, it will squeak a little. That brand new air conditioner needs a bit of tuning. The toilet or tub may leak after a week. A network as architected and designed on a piece of paper and the real-world instantiation of the same will have differences. And differences cause problems.
An actual deployment of a network requires attachment to things that are not pristine. It needs to be attached to existing cabling. It needs to interact with other network equipment, from other vendors. It will have servers, storage and, in a campus setting, wireless access points, PCs, VoIP phones, printers and all sorts of other stuff attached to it. Most of those behave well together; some may not all the time. And it is exactly when you insert the shiny new network into the existing infrastructure that you will find interactions that do not quite work. Or you expose a bug that may have been around for a long time, but that specific interaction, that specific timing of events, makes it show up and cause havoc.
That is when those most skilled in the art of debugging networks earn their keep. When I look back at all the fine engineers I have worked with who were not just good but excellent at debugging networks and finding bugs in code or deployment, they all shared some of the same personality traits:
- they love the hunt. These folks love the puzzle, the path to understanding what is wrong. They carefully think through the steps to take next, the step that brings them closer to their prey. Most of them do not like the fix part much; once the problem is understood, it is time to find the next one.
- they are masters at recognizing anomalies. Some of their favorite phrases may be “that does not look right”, or “that should not be here”. They find things that are not normal by instinct. They notice differences in behavior that many others do not.
- they are seriously tenacious. Once engaged, they do not let go, literally. It will occupy their every minute, at the office, in the lab, at home or during sleep. Their brain swirls with all the possibilities, creating theories against the symptoms and trying to dismiss them one at a time.
- their brain is wired for packet flows. They have a separate sense for how packets flow, how they get transformed, who touches them, where they are supposed to go. There is again that sense of “normal”, a very vivid view of how it is expected to flow.
The best of the best network folks have not been taught how to do this through programs and certifications. The thought process comes naturally, and the details have been self-taught through experience, discovery and pure curiosity.
The problem is that these folks are not easy to find. And when you find them, they are not easy to keep. They are awesome folks, but their skills should be used in much more productive ways. These are the same folks that can create new network solutions, proactively finding solutions for new network capabilities, optimizations and overall simplification.
As a networking industry, we have not progressed much in making life easier for these folks. Most troubleshooting sessions still start and end with port mirrors, packet captures and endless staring at packets going back and forth between systems. For those types of problems, the network should do so much better at providing diagnostics that tell you what is wrong, or at least point you much closer to the solution. As @mbushong mentioned yesterday, we have 20 years' worth of features in our products, but we have done so little to make those core 10 features trivial to configure, debug and diagnose.
Those skilled in the art of debugging must be freed up to provide a much more proactive contribution to the IT organization. Of course there will be problems where their specific debugging expertise is required, but we need to provide more simplicity to ensure that those are the exception. That network settling-in period will not go away, but it must be reduced to the equivalent of a lick of paint on that corner, tightening some bolts, replacing an outlet cover plate. Things that most DIY’ers can do. That way the experts can focus on that much smaller set of really hard problems, and create much better network services on top of much simpler infrastructures at a much faster pace.
[Today’s fun fact: Students at the Ho Chi Minh City University of Economics put together the world’s largest jigsaw puzzle in 2011. It consisted of 551,232 pieces and measured just over 48×76 feet. It took 1600 students 17 hours to put together. Who says students don’t work hard?]