Nearly every IT professional dreads unplanned downtime. Depending on which systems are hit, it can mean angry communications from employees and the C-suite and often a Twitterstorm of customer ire. See the recent Samsung SmartThings dustup for an example of how much trust can be lost in just one day.
Garner pegs the financial costs of downtime at $5,600 per minute, or over $300,000 per hour. And a survey by IHC found enterprises experience an average of five downtime events each year, resulting in losses of $1 million for a midsize company to $60 million or more for a large corporation. In addition, the time spent recovering can leave businesses with an “innovation gap,” an inability to redirect resources from maintenance tasks to strategic projects.
The quest for downtime-minimizing technologies remains hot, especially as demand for high-availability IT has grown. Where “four nines” (99.99%) uptime might once have sufficed, five nines or six nines is now expected.
Enter server virtualization, the powerful technology enabling administrators to partition servers, increase utilization rates, and spread workloads across multiple physical devices. It’s a powerful and increasingly popular technology, but it can be a mixed blessing when it comes to downtime.
Virtualization Minimizes Some Causes, Exacerbates Some Impacts of Downtime
Virtualization is no panacea, but that’s not a call to reconsider industry enthusiasm for it. Doing so would be unproductive anyway. The data center virtualization market, already worth $3.75 billion in 2017, is expected to grow to $8.06 billion by 2022. For good reason. Virtualization has many advantages, some of them downtime-related. For example, it’s easier to employ continuous server mirroring for more seamless backup and recovery.
These benefits are well documented by virtualization technology vendors like VMWare and in the IT literature generally. Less frequently discussed are the compromises enterprises make with virtualization, which often boil down to an “all eggs in one basket” problem.
What used to be discrete workloads running on multiple, separate physical servers can in a virtualized environment be consolidated to a single server. The combination of server and hypervisor then become a single point of failure, which can have an outside impact on operations for many reasons.
First of all, today’s virtualized servers are doing more work. According to a McKinsey & Company report, utilization rates in non-virtualized equipment was mired at 6% to 12%, and Gartner research had similar findings. Virtualization can drive that figure up to 30% or 50% and sometimes higher. Even back-of-the-napkin math shows any server outage has several times the impact of yesteryear, simply because there is more compute happening within any given box.
Diverse customer consequences
Prior to virtualization, co-location customers, among others, demanded dedicated servers to handle their workloads. Although some still do, the cloud has increased comfort with sharing physical resources by using virtual machines (VMs). Now a single server with virtual partitions could be a resource for dozens of clients, vastly expanding the business impact of downtime. Instead of talking to one irate individual demanding a refund, customer service representatives could be getting emails, tweets, and calls from every corner.
This holds true for on-premises equipment as well. The loss of a single server could as easily affect the accounting systems the finance department relies on, the CRM system the sales team needs, and resources various customer-facing applications demand, all at the same time. It’s a recipe for help desk meltdown.
According to CIO Magazine, many virtualization projects “have shifted rather than eliminated complexity in the data center.” In fact, of the 16 annual outages per year their survey respondents reported, 11 were caused by system failure resulting from complexity. And the more complex the environment, the more difficult the troubleshooting process can be, which can lead to longer, more harmful downtime experiences.
Although not a direct result of virtualization, the industry has made yet another swing of the centralization versus decentralization pendulum. After years of powerful PCs loaded with local applications, we have entered an age of mobile, browser-based, and other very thin client solutions. In many cases, the client does little but collect bits of data for processing elsewhere. Not much can happen at the device level if the cloud-based or other computing resources are unavailable. The slightest problem can result in mounting user frustration as apps crash and error messages are returned.
In summary, the data center of 2018 houses servers that are doing more, for more internal and external customers. At the same time complexity is bringing about downtime risk with problems that can be more difficult to solve, which can lead to extended outages. Although effective failover, backup, and recovery processes can help mitigate the combined effects, these tactics alone are not enough.
Additional Solutions for Minimizing Server Downtime
It may sound old school, but data center managers need to stay focused on IT equipment. These failures account for 40% of all reported downtime. Compare that figure with the 25% caused by human error, whether by internal staff or service providers, and the 10% by cyberattacks. To have the greatest positive effect on uptime, hardware should obviously be the first target.
There are several recommendations data center managers should implement, if they haven’t already done so:
- Perform routine maintenance regularly. It should go without saying but often doesn’t. Install recommended patches, check for physical issues like airflow blockages, and heed all alerts and warnings. Maintenance is fundamental but it is no less essential. That means training employees, scheduling tasks, and tracking completion. If maintenance can’t happen on time, all the time, seek outside assistance to get it done so available internal resources can focus on strategic projects and those unavoidable fire drills without leaving systems in jeopardy.
- Monitor your resources. The first you hear of an outage should never be from a customer. Full-time, 24/7 systems monitoring is a must for any enterprise. Fortunately, there are new, AI-driven technologies combining monitoring with advanced predictive maintenance capabilities for immediate fault detection and integrated, quick-turnaround response. Access is less expensive than you might think.
- Upgrade your break/fix plan. A disorganized parts closet or an eBay strategy won’t work. Rapid access to spares is vital in getting systems back online without delay. Especially for mission critical systems, station repair kits on site or work with a vendor who can do so and/or deliver spares within hours.
- Invest in expertise. Parts are only part of the equation. There is significant skill involved in troubleshooting systems in these increasingly complex data center environments. The current IT skills gap may necessitate looking outside the enterprise to complement existing engineering capabilities with those of a third-party provider.
- Test everything. Data centers evolve, but conducting proof-of-principle testing on each workload before any changes are made will cut down on virtualization problems before they happen. By the same token, systems recovery and DR scenarios are unknowns unless they are real-world verified. Try pulling a power cord and see what happens. Does that idea give you pause? It might be time for some enhancements.
There is good news for IT organizations already overwhelmed by demands to maintain more complex environments, execute the digital transformation, and achieve it all with fewer resources and less money, in a tight labor market to boot. Alternatives exist.
Third-party maintenance providers can take on a substantial portion of the equipment-related upkeep, troubleshooting, and support tasks in any data center. With a premium provider on board, it’s possible radically reduce downtime and reach the availability and reliability goals you’d hoped to achieve when you took the virtualization path in the first place.
By Paul Mercina