ICSE 2021
Mon 17 May - Sat 5 June 2021
Wed 26 May 2021 10:00 - 10:25 at Plenary Room - ICSE Keynotes Chair(s): Tao Xie

Reliability-Driven AIOps for Cloud Resilience
Cloud computing platforms have recently become the main platform on which many IT enterprises deploy their applications and services, such as search engines, instant messaging apps, and online shopping. As cloud systems continue to grow in complexity and volume, cloud failures become inevitable, which in turn lead to service interruptions and performance degradation. Whether cloud failures can be properly managed greatly affects company revenue and customer trust. Consequently, resilient cloud operations are of paramount importance to cloud vendors. However, as cloud systems continuously undergo feature upgrades and system evolution, the statistical properties of system monitoring data may change from time to time. Furthermore, there is currently a lack of means to incorporate human expert knowledge into the training of cloud data-analytics models. When diagnosing failures for large-scale systems, such knowledge is essential.

In this talk, we identify several critical challenges commonly seen in industrial cloud systems, and provide a general roadmap from fault prevention and fault removal techniques toward resilient cloud operations. We propose to develop a reliability-driven AIOps (Artificial Intelligence for IT Operations) framework to achieve resilient cloud systems. Our goal is to improve the reliability of cloud systems and services comprehensively with AI-based data analytics, where data are collected from multiple heterogeneous sources such as logs, traces, and KPIs, and properly labeled with cloud domain experts' knowledge. In particular, the framework consists of an end-to-end pipeline of software reliability engineering, including anomaly detection, failure diagnosis, and fault localization. Anomalies are events or observations that deviate significantly from a system's normal behavior.
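To make the notion of an anomaly as a significant deviation from normal behavior concrete, here is a minimal sketch that flags points in a KPI series lying far from the mean of a trailing window. The abstract does not say which detector the framework uses; the z-score rule, the function name `detect_anomalies`, the window size, the threshold, and the sample packet counts are all illustrative assumptions.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the trailing-window mean
    by more than `threshold` sample standard deviations.

    Hypothetical helper for illustration only; not the method from the talk.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up "packet number" KPI with one abrupt drop at index 7.
packets = [100, 102, 98, 101, 99, 100, 103, 20, 101, 100]
print(detect_anomalies(packets))
```

A trailing window keeps the baseline local, which matters when the statistical properties of monitoring data drift over time. One known weakness of this simple rule is that a detected outlier inflates the standard deviation of the following windows, so a robust statistic such as the median would be a reasonable alternative.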
When anomalies become severe and hinder the system from fulfilling a required function, failures occur, which often manifest themselves with human-perceivable symptoms. Failure diagnosis attempts to find the most significant problems directly induced by the failures. To achieve this objective, we explore data-driven approaches to pursue efficient failure diagnosis from multiple perspectives of cloud systems. We investigate which failures are caused by the underlying anomalies, which are generally indicated by a sudden increase or drop in KPIs. For example, the KPI "packet number" monitoring the cloud network may abruptly decrease because of anomalies happening in some network services. This may point to a serious failure in the network. To this end, we design an incident aggregation procedure based on heterogeneous information fusion from incidents, topology, and fine-grained system monitoring data to identify cascading failures in a cloud system. Furthermore, we probe into human experts' activities to enhance the failure diagnosis procedures. Maintainers generally assign different levels of importance to different KPIs in the cloud. To incorporate expert knowledge into the training of automated detection models, we introduce an adaptive failure diagnosis mechanism with a human in the loop, in which we systematically select informative samples for manual labeling and substantially improve the performance of supervised learning algorithms. With this method, we can train a more accurate model from both historical data and human knowledge. More specifically, cloud maintainers can interact with a serving model with minimal effort. When false alarms and misses happen, the model can adaptively learn from them with the help of human interaction. As a result, the model becomes more accurate over time by systematically accumulating human knowledge.
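The idea of selecting informative samples for manual labeling can be sketched with uncertainty sampling, a common active-learning strategy. The abstract does not specify the selection criterion, so this is an assumed stand-in. The function `select_for_labeling`, the sample identifiers, and the scores are hypothetical.

```python
def select_for_labeling(scores, budget):
    """Pick the `budget` samples the model is least sure about.

    `scores` maps a sample id to the model's predicted failure probability.
    Samples closest to 0.5 are considered most informative (uncertainty
    sampling); this criterion is an assumption, not taken from the talk.
    """
    ranked = sorted(scores, key=lambda sample_id: abs(scores[sample_id] - 0.5))
    return ranked[:budget]


# Made-up model outputs for five unlabeled KPI segments.
scores = {"kpi-a": 0.97, "kpi-b": 0.52, "kpi-c": 0.08, "kpi-d": 0.41, "kpi-e": 0.75}
to_label = select_for_labeling(scores, budget=2)
print(to_label)
```

In a human-in-the-loop setting, the maintainer would label only the returned samples, the labels would be added to the training set, and the model would be retrained before the next round, which keeps the labeling effort small.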
Finally, we explore fault localization approaches to cluster microservices in the cloud based on logs and KPIs. We employ a PC algorithm for microservice dependency construction, and formulate a probabilistic matrix factorization algorithm for root cause recommendation. Various analytical models associated with the proposed reliability-driven AIOps framework are constructed, experiments on real cloud data are conducted, and the effectiveness of our proposed software reliability engineering techniques is demonstrated.

Wed 26 May

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

10:00 - 11:15
ICSE Keynotes (Keynotes) at Plenary Room
Chair(s): Tao Xie Peking University
10:00
25m
Keynote
Michael Lyu's Keynote: "Reliability-Driven AIOps for Cloud Resilience"
Keynotes
Michael Lyu The Chinese University of Hong Kong
Media Attached
10:25
25m
Social Event
Meet Michael Lyu
Keynotes

10:50
25m
Live Q&A
Questions and Answers (included in the keynote video)
Keynotes

22:00 - 23:15
ICSE Keynotes (Keynotes) at Plenary Room

The Meet Michael Lyu activity will not happen during the mirroring.

22:50
25m
Live Q&A
Questions and Answers (included in the keynote video)
Keynotes


Information for Participants
Info for event:

This keynote is available on Clowdr