Hundreds of thousands of Delta Air Lines passengers were forced to endure severe turbulence last week (without ever leaving the ground!) after a brief power failure at the carrier’s Atlanta hub spiraled into a global aviation disaster. Backup systems failed to kick in, causing a computer system meltdown that led to the cancellation of more than 2,100 flights over several days.
Not surprisingly, passengers were inconvenienced and angered by the downtime, and they freely shared their frustrations with the world via social media. But there are valuable lessons to be learned in the aftermath of the incident, especially for IT professionals. To help prevent your own wings from being clipped by a similar catastrophe, consider these key points underscored by Delta’s bumpy flight:
- Troubleshoot before turmoil. Practice exactly how you will work to restore your critical systems in the event of an emergency (and long before unexpected outages force your hand). Have a plan to quickly implement any necessary remedies, as well as teams identified to work on restoration.
- Review your IT environment on a regular basis. Keep your finger on the pulse of your network, systems, software, architecture, and even your staff. Do single points of failure exist for critical systems? Do necessary systems have the required redundancy? Unexpected outcomes can result when personnel change, systems merge, upgrades occur, or elements are added to or removed from network and system architectures. As one IT pro mused following Delta’s debacle: “Kind of amazing they haven't figured out how to make their system redundant, distributed, and/or robust. It makes zero sense that a power outage in Atlanta should have any effect on a flight going from Salt Lake City to Seattle.”
- Don’t put all of your eggs in one basket. Or tie all your critical functions to one system, as the case may be. Last week’s incident highlights how an error in one geographical location can have a drastic ripple effect, cascading into a crisis that disrupted travel across the U.S. and beyond. The computer crash affected ticketing, flight planning, and customer tracking systems, ultimately forcing the airline to declare a “ground stop” that halted takeoffs around the world. Adding insult to injury, many affected travelers reported that they never received messages or updates from Delta on their flight status because communications systems were in disarray.
- Don’t trust outdated equipment. The latest incident, ultimately attributed to failed switchgear, comes just weeks after a similar incident at Southwest Airlines, in which more than 2,000 flights were canceled after a faulty network router sparked an outage. Incidents like these demonstrate that the software, hardware, and power systems behind huge operations like airlines are just as critical as physical infrastructure, and likewise need to be regularly upgraded and maintained.
- Invest in a high-quality backup solution. Research from IDC shows infrastructure failures can cost large enterprises $100,000 per hour, and the failure of critical applications can ring in at up to a million dollars an hour. Yet one of the most reliable and cost-effective measures you can take to prevent downtime is the installation of a premium (and preferably redundant) uninterruptible power system (UPS). Keep in mind that, like other critical IT systems, UPSs should be inspected regularly and upgraded as needed.
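To make the review questions above more concrete, here is a minimal Python sketch of a single-point-of-failure check paired with a rough cost estimate. Everything in it is an illustrative assumption rather than a description of Delta's (or anyone's) real environment: the `System` record, the `INVENTORY` entries, and the site names are hypothetical. The hourly rates are the IDC figures quoted above, and the critical-application rate is an upper bound ("up to"), so the result is a ceiling, not a forecast.

```python
from dataclasses import dataclass

# Hourly figures quoted from IDC in the list above.
INFRASTRUCTURE_COST_PER_HOUR = 100_000
CRITICAL_APP_COST_PER_HOUR = 1_000_000  # "up to": treat as a ceiling


@dataclass(frozen=True)
class System:
    """Hypothetical inventory record for one IT system."""
    name: str
    sites: tuple            # locations that can run the system independently
    critical_app: bool      # True if customers or operations depend on it


def single_points_of_failure(systems):
    """Return the systems that can run from fewer than two distinct sites."""
    return [s for s in systems if len(set(s.sites)) < 2]


def estimated_outage_cost(system, hours):
    """Rough cost of an outage of the given length, using the IDC rates."""
    if system.critical_app:
        rate = CRITICAL_APP_COST_PER_HOUR
    else:
        rate = INFRASTRUCTURE_COST_PER_HOUR
    return rate * hours


# Hypothetical inventory; replace with your own systems and sites.
INVENTORY = [
    System("ticketing", ("site-a",), True),
    System("flight-planning", ("site-a", "site-b"), True),
    System("internal-wiki", ("site-a",), False),
]

if __name__ == "__main__":
    for s in single_points_of_failure(INVENTORY):
        where = ", ".join(s.sites) or "no site"
        cost = estimated_outage_cost(s, hours=6)
        print(f"{s.name}: runs only at {where}; "
              f"a 6-hour outage could cost up to ${cost:,}")
```

Running a check like this whenever the inventory changes (after a merger, an upgrade, or a staffing change) is one way to notice that a critical system has quietly ended up tied to a single location.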
Sometimes accidents happen, but covering all your bases ahead of time can reduce the amount of downtime. Check out Eaton's Blackout Tracker to explore power outages for your state or region. You may be surprised at how vulnerable the U.S. power grid is to outages!