Understanding and Dealing with Failures in Cloud-Scale Systems

Sep 15, 2016

Ph.D. candidate Peng (Ryan) Huang is getting ready to complete his degree. Next week he will defend his dissertation on "Understanding and Dealing with Failures in Cloud-Scale Systems." Huang is not worried about a future job: he has already accepted a tenure-track offer to join the Johns Hopkins University Department of Computer Science as an Assistant Professor in Fall 2017, where he will help deepen the department's research strength in computer systems (his own focus). Before heading to Johns Hopkins, however, he will spend a postdoctoral year in the Systems Group at Microsoft Research in Redmond, WA.

Date: Friday, September 23
Time: 11am
Location: Room 2217, CSE Building

Huang's advisor, CSE Prof. YY Zhou, will chair the committee consisting of three other CSE faculty (Ranjit Jhala, George Porter and Stefan Savage) and ECE Prof. Tara Javidi.

Abstract: In cloud-scale systems, faults are a fact of life. Tolerating faults and providing highly available service is arguably the single most important task for cloud builders. Yet, despite considerable efforts in fault tolerance and software engineering for reliability, all cloud-scale services continue to experience costly failures. A natural question to ask is: why do cloud-scale services still fail despite abundant fault tolerance, and how can we further improve? Ryan Huang's dissertation attempts to shed light on this question.

In the first part of this thesis, we study a set of 34 publicly disclosed cloud service outages that we gathered and examine them from the perspective of fault-tolerance mechanisms. We present a novel taxonomy of why these mechanisms can be ineffective, including faults that cannot be handled by replication, insufficient redundancy, and undetected faults. We also explore the root causes of the failures and investigate how system components interacted in failures caused by multiple faults.

We find that, in many cases, while cloud systems are robust in tolerating traditional faults, they are fragile under misconfiguration, which is a major source of service unavailability. To further improve cloud service quality, it is crucial to reduce misconfiguration.

In the second part of this thesis, we propose a framework, ConfValley, to systematically validate configuration and catch errors before they reach production. At the core of ConfValley is a language called CPL, which allows experts to express configuration specifications declaratively. To further reduce the burden on operators of writing configuration specifications, the framework also includes a component that automatically infers specifications.
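To make the idea of declarative configuration validation concrete, here is a minimal Python sketch. It is not CPL (the abstract does not show CPL's syntax), and all keys, rules, and names below are hypothetical; it only illustrates declaring specifications as data, separate from the code that checks them against configuration before deployment.

```python
# Hypothetical illustration of declarative configuration validation.
# This is NOT CPL; it only sketches the general idea that specifications
# are declared up front and checked against configuration data.

# Example configuration data (hypothetical keys and values).
configs = {
    "service.port": "8080",
    "replica.count": "3",
    "cache.ttl_seconds": "600",
}

# Each specification is a (key, predicate, description) triple, declared
# as data rather than buried in imperative validation code.
specs = [
    ("service.port",      lambda v: 1 <= int(v) <= 65535, "port must be in 1-65535"),
    ("replica.count",     lambda v: int(v) >= 3,          "need at least 3 replicas"),
    ("cache.ttl_seconds", lambda v: int(v) > 0,           "TTL must be positive"),
]

def validate(configs, specs):
    """Check every declared specification against the configuration data."""
    errors = []
    for key, predicate, description in specs:
        value = configs.get(key)
        if value is None or not predicate(value):
            errors.append(f"{key}={value!r}: {description}")
    return errors

if __name__ == "__main__":
    for error in validate(configs, specs):
        print("configuration error:", error)
```

In a real pipeline, checks like these would presumably run as part of the configuration rollout process, rejecting a bad change before it ever reaches production machines; an inference component could propose candidate rules by mining patterns from existing, known-good configuration data.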

We evaluate ConfValley at a leading cloud service provider, Microsoft Azure, on its various types of configuration data. We rewrite Azure's existing configuration validation code in CPL with more than 10x fewer lines of code, and the framework automatically infers thousands of CPL specifications with high accuracy. With the translated and automatically generated specifications, we prevented a number of configuration errors from rolling out to production in Microsoft Azure.