Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Technical Track
Fri 28 May 2021 03:05 - 03:25 at Blended Sessions Room 1 - 3.3.1. Monitoring Cloud-Based Services
Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic loss. Locating the root cause service, i.e., the service where the propagation of the anomaly originates, is a crucial step in mitigating the impact of an outage. In current industrial practice, this is generally performed in a bootstrap manner that depends largely on human effort. A candidate service that directly causes the outage is identified first, and the suspected root cause is then traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production clouds typically contain a large number of interdependent services, so such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first correlation-based outage triage approach, which constructs a global view of service correlations. COT mines the correlations among the performance indicators collected from hundreds of services. After learning from historical outages, COT can infer the root cause of emerging ones accordingly. We implement COT and evaluate it on a real-world dataset containing one year of data collected from a production cloud, Cloud A, one of the representative cloud computing platforms around the world. Our experimental results show that COT reaches a triage accuracy of 82.1% to 83.5%, outperforming the state-of-the-art triage approach by 28.0% to 29.7%.
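The abstract does not say which correlation measure or learning model COT uses, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes Pearson correlation between per-service indicator series and a nearest-neighbour match against labelled historical outages. The service names (`storage`, `network`, `compute`), the function names, and all sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_vector(indicators):
    """Flatten the pairwise service correlations into one feature vector.

    `indicators` maps a service name to its indicator samples during the
    outage window. Services are sorted so the pair order is stable.
    """
    names = sorted(indicators)
    return [
        pearson(indicators[a], indicators[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]


def triage(history, new_indicators):
    """Return the root-cause label of the closest historical outage.

    `history` is a list of (correlation_vector, root_cause_service) pairs.
    Closeness is squared Euclidean distance (an assumption of this sketch).
    """
    target = correlation_vector(new_indicators)

    def distance(entry):
        vector, _label = entry
        return sum((u - v) ** 2 for u, v in zip(vector, target))

    return min(history, key=distance)[1]


# Hypothetical labelled outages: indicator series plus the known root cause.
history = [
    (correlation_vector({"storage": [1, 2, 3, 4],
                         "network": [1, 2, 3, 4],
                         "compute": [4, 3, 2, 1]}), "storage"),
    (correlation_vector({"storage": [1, 2, 3, 4],
                         "network": [4, 3, 2, 1],
                         "compute": [1, 2, 3, 4]}), "network"),
]

# Hypothetical emerging outage whose correlation pattern resembles the first one.
new_outage = {"storage": [2, 4, 6, 9],
              "network": [1, 3, 5, 8],
              "compute": [9, 7, 4, 2]}

print(triage(history, new_outage))  # prints "storage"
```

Representing each outage by its correlation pattern rather than by raw indicator values is what lets a new outage be compared with past ones; a production system would need a more robust correlation measure and classifier than this toy nearest-neighbour lookup.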
Thu 27 May (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
15:05 - 16:05 | 3.3.1. Monitoring Cloud-Based Services | Technical Track / SEIP - Software Engineering in Practice | Blended Sessions Room 1 (+12h) | Chair(s): Andrea Zisman (The Open University)
15:05 | 20m | Paper | Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining | Technical Track | Yaohui Wang (Fudan University), Guozheng Li (Peking University), Zijian Wang (Fudan University), Yu Kang (Microsoft Research, Beijing, China), Yangfan Zhou (Fudan University), Hongyu Zhang (The University of Newcastle), Feng Gao (Microsoft Azure), Jeffrey Sun (Microsoft Azure), Li Yang (Microsoft Azure), Pochian Lee (Microsoft Azure), Zhangwei Xu (Microsoft Azure), Pu Zhao (Microsoft Research, Beijing, China), Bo Qiao (Microsoft Research, Beijing, China), Liqun Li (Microsoft Research, Beijing, China), Xu Zhang (Microsoft Research, Beijing, China), Qingwei Lin (Microsoft Research, Beijing, China)
15:25 | 20m | Paper | Neural Knowledge Extraction From Cloud Service Incidents | SEIP - Software Engineering in Practice | Manish Shetty (Microsoft Research, India), Chetan Bansal (Microsoft Research), Sumit Kumar (Microsoft), Nikitha Rao (Microsoft Research), Nachiappan Nagappan (Microsoft Research), Thomas Zimmermann (Microsoft Research)
15:45 | 20m | Paper | FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud | SEIP - Software Engineering in Practice | Jinho Hwang (IBM Research), Larisa Shwartz (IBM), Qing Wang (Institute of Software, Chinese Academy of Sciences), Raghav Batta (IBM), Harshit Kumar (IBM), Michael Nidd (IBM)
Fri 28 May (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
03:05 - 04:05 | 3.3.1. Monitoring Cloud-Based Services | Technical Track / SEIP - Software Engineering in Practice | Blended Sessions Room 1
03:05 | 20m | Paper | Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining | Technical Track | Yaohui Wang (Fudan University), Guozheng Li (Peking University), Zijian Wang (Fudan University), Yu Kang (Microsoft Research, Beijing, China), Yangfan Zhou (Fudan University), Hongyu Zhang (The University of Newcastle), Feng Gao (Microsoft Azure), Jeffrey Sun (Microsoft Azure), Li Yang (Microsoft Azure), Pochian Lee (Microsoft Azure), Zhangwei Xu (Microsoft Azure), Pu Zhao (Microsoft Research, Beijing, China), Bo Qiao (Microsoft Research, Beijing, China), Liqun Li (Microsoft Research, Beijing, China), Xu Zhang (Microsoft Research, Beijing, China), Qingwei Lin (Microsoft Research, Beijing, China)
03:25 | 20m | Paper | Neural Knowledge Extraction From Cloud Service Incidents | SEIP - Software Engineering in Practice | Manish Shetty (Microsoft Research, India), Chetan Bansal (Microsoft Research), Sumit Kumar (Microsoft), Nikitha Rao (Microsoft Research), Nachiappan Nagappan (Microsoft Research), Thomas Zimmermann (Microsoft Research)
03:45 | 20m | Paper | FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud | SEIP - Software Engineering in Practice | Jinho Hwang (IBM Research), Larisa Shwartz (IBM), Qing Wang (Institute of Software, Chinese Academy of Sciences), Raghav Batta (IBM), Harshit Kumar (IBM), Michael Nidd (IBM)