The lack of transparency can be seen as a root cause of outages and incidents

I recently began a keynote speech at Uptime Institute Symposium 2013 by making a bold statement: as data center operators, we simply don’t share enough of our critical facilities incidents with each other.

Yet Uptime Institute maintains an entire membership organization, the Uptime Institute Network, dedicated to giving owners and operators the ability to share experiences and best practices by facilitating a rich exchange of information between members and the Uptime Institute. Of course, we love to talk about each other’s incidents as though our competitors were hit with a plague while we remain immune from any potential for disaster. Why are we reluctant to share details of our incidents? Why don’t we share more experiences with each other? Why isn’t every major data center provider already a member of this Network?

Our industry remains very secretive about sharing incidents and their details, yet we can all learn so much from each other! I suppose we’re reluctant to share details with each other for fear that the information could be used against us in future sales opportunities. But I’ve learned that customers and prospects alike expect data center incidents. Clients are far less concerned about the fact that we will have incidents than about how we actually manage them.

Let’s acknowledge the fact that we need to be better as an industry. We are an insurance policy for every type of critical IT business. None of us can afford to take our customers off-line… period. And when someone does, it hurts our entire industry.

An incident does not mean a data center outage. These terms are often confused, misinterpreted or exaggerated. An outage represents a complete loss of power or cooling to a customer’s rack or power supply. It means the loss of both cords, if dual fed, or the loss of a single cord, if utilizing a transfer switch either upstream or at the rack level.

Equipment and systems will break, period. Some failures will be minor, like a faulty vibration switch. Others will be major, like an underground cable fault that takes out an entire megawatt of live UPS load (which happened to us). Neither of these types of incidents is an outage at the rack level, yet either of them could result in one if not mitigated properly.

Do not ask whether any particular data center has failures. Ask what they do when they have a failure.

There is a lot of public data available on the causes of data center outages and incidents. I particularly like a data set published by Emerson (see Figure 1) because it highlights the fact that most data center incidents are caused by human and mechanical failures, not by weather or natural disasters. Uptime Institute Network data provide similar results. This means that, as operators, we play a major role through how we manage our facilities.

Figure 1. Data provided by Emerson shows that human and mechanical failures cause the vast majority of unplanned outages in data centers.

I have been involved with every data center incident at RagingWire since its opening in 2001. By the end of this year, our facilities will total almost 100 megawatts (MW) of generating capacity and 50 MW of critical IT UPS power. I must admit that data center incidents are not pleasant experiences. But learning from them makes us better as a company, and sharing the lessons makes us better as an industry.

In 2006, RagingWire experienced one particularly bad incident caused by a defective main breaker that resulted in a complete outage. I distinctly recall sitting in front of the Board of Directors and Executives at 2:00 AM trying to explain what we knew up to that point in time. But we didn’t yet have a root-cause analysis. One of the chief executives looked at me across the table and said, “Jason, we love you and understand the incredible efforts that you and your teams have put forth. We know we are stable at the current time, but we don’t yet have an answer for the root cause of the failure. We have enterprise Fortune 500 companies that are relying on us to give them an answer immediately, as our entire business is at risk.”