Understanding Exception-Related Bugs in Large-Scale Cloud Systems
The exception mechanism is widely used in cloud systems, mainly because it separates error-handling code from the main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to using the exception mechanism correctly. As a result, mistakes in exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the community needs a better understanding of exception-related bugs (eBugs) in cloud systems, i.e., bugs caused by incorrect use of the exception mechanism.
In this paper, we present a comprehensive study of 210 eBugs from six widely deployed cloud systems: Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, bug impacts, and the relations among them. To the best of our knowledge, this is the first study of eBugs in cloud systems, and the first eBug study that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of eBugs affect system availability or integrity. Fortunately, exposing eBugs through testing is possible: 54% of eBugs are triggered by non-semantic conditions such as network errors, and 40% of eBugs can be triggered by simulating those conditions at simple system states. Interestingly, we find that exception triggering conditions are useful for detecting eBugs. Based on these findings, we build a static analysis tool, called DIET, which reports 31 bugs and bad practices in the latest versions of the studied systems. So far, developers have confirmed that 23 of them are previously unknown bugs or bad practices.
Wed 13 Nov (displayed time zone: Tijuana, Baja California)
10:40 - 12:20
Cloud and Online Services: Journal First Presentations / Research Papers / Demonstrations at Hillcrest
Chair(s): Dan Hao (Peking University)
Understanding Exception-Related Bugs in Large-Scale Cloud Systems
Haicheng Chen (The Ohio State University), Wensheng Dou (Institute of Software, Chinese Academy of Sciences), Yanyan Jiang (Nanjing University), Feng Qin (Ohio State University, USA)
iFeedback: Exploiting User Feedback for Real-time Issue Detection in Large-Scale Online Service Systems
Wujie Zheng (Tencent, Inc.), Haochuan Lu (Fudan University), Yangfan Zhou (Fudan University), Jianming Liang (Tencent), Haibing Zheng (Tencent), Yuetang Deng (Tencent, Inc.)
Software Microbenchmarking in the Cloud. How Bad is it Really? (Journal First Presentations)
Christoph Laaber (University of Zurich), Joel Scheuner (Chalmers | University of Gothenburg), Philipp Leitner (Chalmers University of Technology & University of Gothenburg)
Continuous Incident Triage for Large-Scale Online Service Systems
Junjie Chen (Tianjin University), Xiaoting He (Microsoft), Qingwei Lin (Microsoft Research, China), Hongyu Zhang (The University of Newcastle), Dan Hao (Peking University), Feng Gao (Microsoft), Zhangwei Xu (Microsoft), Yingnong Dang (Microsoft Azure), Dongmei Zhang (Microsoft Research, China)
Kotless: a Serverless Framework for Kotlin
Vladislav Tankov (JetBrains, ITMO University), Yaroslav Golubev (JetBrains Research), Timofey Bryksin (JetBrains Research, Saint-Petersburg State University)
FogWorkflowSim: An Automated Simulation Toolkit for Workflow Performance Evaluation in Fog Computing
Xiao Liu (School of Information Technology, Deakin University), Lingmin Fan (School of Computer Science and Technology, Anhui University), Jia Xu (School of Computer Science and Technology, Anhui University), Xuejun Li (School of Computer Science and Technology, Anhui University), Lina Gong (School of Computer Science and Technology, Anhui University), John Grundy (Monash University), Yun Yang (Swinburne University of Technology)