Can you hear me now?
Even with claims by the nation's major service providers of superior "six-9" reliability, the facts remain clear: downtime happens and federal network outage data proves it. Just three months into 2002, every major network supplier has reported intermittent failures linked to inevitable causes.
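To put that claim in perspective, the short Python sketch below works out how much downtime a given number of "nines" allows in a year. It assumes "six-9" means 99.9999 percent availability, uses a 365-day year, and rounds the Sprint outage described below down to an even five hours; the function and variable names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime permitted at an availability of `nines` nines."""
    unavailability = 10 ** -nines  # six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


# Downtime budget for one year at six-nines availability (assumed 99.9999%).
six_nines_budget = allowed_downtime_seconds(6)

# A five-hour outage, expressed in seconds.
outage_seconds = 5 * 60 * 60

# How many years of six-nines budget a single five-hour outage consumes.
years_of_budget = outage_seconds / six_nines_budget

print(f"Six-nines budget: {six_nines_budget:.1f} seconds per year")
print(f"A five-hour outage uses about {years_of_budget:.0f} years of that budget")
```

Under these assumptions the yearly allowance is roughly half a minute, so one five-hour outage uses up several centuries of it.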
Case in point: Sprint was laid out for just over five hours on February 13, 2002, when a public works employee in Yadkinville, NC, cut a fiber cable while placing water lines. Some 273,000 calls were blocked in the mishap. Then, twelve days later in southern California, AT&T endured a similar outage. According to the filed report, "both fuses in the Fuse and Alarm Panel (FAP) failed due to power trouble, rendering the entire device out of service." For five hours and fifteen minutes, toll and 911 calls were on hold.
For most, outages like these prove, in the end, to be nothing more than an inconvenience. But for those trying to run a business, particularly one that depends upon a reliable, "always on" network, the losses can be substantial.
For a quick lesson in network crash and revenue burn, one need only look back to the well-publicized August 1999 case of MCI WorldCom. It began about 10 p.m. on August 5th, when technicians noticed a high level of congestion on the frame relay network. It evolved into a nightmarish house of cards for both MCI and its customers.
MCI had recently upgraded to a more scalable infrastructure, a move that reportedly caused the initial congestion and led to under-performance and complete network instability for over a week. As efforts to fix the problem repeatedly failed, MCI was forced to shut down the whole system for 24 hours.
The Chicago Board of Trade was one of MCI's 3,000 customers rendered helpless by the outage. The failure disabled the electronic system that governs the board's exchange, leading to an estimated loss of some 180,000 trades. At anywhere between $10,000 and $100,000 per trade, the loss of business was significant and tough to calculate.
The same could be said for national truck stop operator TravelCenters of America, another customer for whom the wheels of commerce ground to a halt. In an InternetWeek story published at the time of the meltdown, Bill Bartkus, vice president of information systems for TCA, said he would seek compensation for lost business. "We're not satisfied at all with this," he said. "There has been a serious impact on our business." In the end, the only thing everyone involved has learned to rely upon is imperfection. Even the most reliable of systems will fail; it's only a matter of when.
"...explosion, embargo, acts of God..."
Such a glaring reality should be dimmed by service level agreements (SLAs), but industry analysts have traditionally been critical of how much customer satisfaction they really provide. "Customers should carefully review contract and Service Level Agreement language regarding such events [as outages]," said Bill Harris, senior advisor at QCI, a Louisville, KY-based telecom consulting and management company. Generally, Harris indicated, SLAs are designed to protect the service provider from liability. Representative language, he said, "effectively absolves all responsibility for an outage while requiring the customer to pay for service, even though it is not usable."
In a major failure where compensation is appropriate, a payout (often in the form of a free month of service or a hardware upgrade) fails to make up for the financial damage a business incurs. And because SLAs are too often poor substitutes for lost business, a service provider could emerge from an outage with a badly tarnished reputation.
"That really depends on a lot of things," Harris said. "Was it covered in the press? Did they immediately come forward and address the issue or let it linger? Did they make good even if they didn't need to?" From the customer's perspective, an SLA should always be a case of buyer beware. "If you do not have adequate SLAs and contract language, you are completely at the mercy of the supplier," Harris said.
Once an outage occurs, for whatever reason, the question often becomes one of duration. A study of headlines coming out of the telecommunications industry reveals one reason why failures take a long time to notice and an even longer time to fix: a scaling back of the workforce. Companies, said Harris, "are reducing headcount across the board."
Some of these reductions are coming in the area of network support with service providers using fewer people to monitor an increasing number of network miles and switches. The logical result of less dependence upon human capital is greater need, according to Harris, for "more robust, better installed and better maintained" alarm reporting and control systems.
AT&T's annals provide some insight into Harris's contention. On January 2, 2002, a storm knocked out commercial power in Walterboro, SC, engaging the system's backup batteries. After 18 hours, the emergency battery voltage fell below necessary operating requirements and the system shut down, cutting off 64,000 customers.
The root cause listed in the FCC outage report was an "inadequate/missing power alarm." "Monitoring services were not activated in a database and the carrier did not know what was happening at the switch," Harris said. Had the monitoring system been functioning appropriately, network downtime would have been significantly reduced.
Bob Berry, president and chief executive officer of Fresno, CA-based DPS Telecom, agreed. "Our busiest days are typically Monday mornings when administrators find out the hard way that their network alarms were not adequately monitored. However, the overall expense of network outages is oftentimes underestimated by people." Berry adds, "In addition to the loss of revenue associated with outages, companies are also faced with FCC fines, SLA penalties, customer churn, and even a damaged reputation."
Quick reaction is the key
While some outages cannot be avoided, they can certainly be corrected quickly with the proper monitoring system in place. Today's advanced remote site monitoring equipment can notify administrators across the country of an event that is happening or about to happen to their network. Some systems can even notify multiple people and operations support systems (OSSs), so that a technician can be deployed to the troubled site while an administrator in another region reroutes traffic through a different path. Berry adds, "Today's monitoring systems allow you to build rules for derived alarms and notification escalation. For example, if two non-critical events occur at the same time, it could be considered very serious. And alarm escalation lets upper management know when SLA sensitive clients may be affected." In the telecom industry, an ounce of prevention truly is worth a pound of cure.
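The derived-alarm and escalation rules Berry describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular monitoring product works: the Alarm record, the site and alarm point names, the 15-minute acknowledgment threshold, and the notification targets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    site: str
    point: str      # hypothetical alarm point name
    severity: str   # "minor", "major", or "critical"


def derive_alarms(active, rule_points, derived_point):
    """Raise a critical derived alarm at each site where every point
    named in rule_points is active at the same time."""
    points_by_site = {}
    for alarm in active:
        points_by_site.setdefault(alarm.site, set()).add(alarm.point)
    return [
        Alarm(site, derived_point, "critical")
        for site, points in sorted(points_by_site.items())
        if rule_points <= points  # all rule points present at this site
    ]


def escalation_targets(alarm, minutes_unacknowledged, sla_sites):
    """Decide who is notified, widening the list as an alarm lingers
    or when it touches a site serving SLA-sensitive clients."""
    targets = ["on-call technician"]
    if alarm.severity == "critical" and minutes_unacknowledged >= 15:
        targets.append("regional administrator")  # assumed 15-minute rule
    if alarm.severity == "critical" and alarm.site in sla_sites:
        targets.append("upper management")
    return targets


# Two minor events at the same hypothetical site combine into one critical alarm.
active = [
    Alarm("site-A", "commercial power fail", "minor"),
    Alarm("site-A", "battery voltage low", "minor"),
    Alarm("site-B", "door open", "minor"),
]
rule = {"commercial power fail", "battery voltage low"}
derived = derive_alarms(active, rule, "site power loss imminent")

for alarm in derived:
    print(alarm, escalation_targets(alarm, 20, {"site-A"}))
```

Here neither minor alarm would page anyone beyond the technician on its own, but together they produce a critical alarm that, once unacknowledged long enough at an SLA-sensitive site, reaches the administrator and upper management.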
Listed as an Inc. 500 company in 1997, DPS Telecom has been developing telemetry monitoring and alarm gathering devices for the telecommunications, utility, and cable television markets since 1986. The company specializes in creating custom alarm management solutions.