COLLOQUIUM: Beyond High Availability: Taming Complex Failures in Cloud Systems

03/15/2023, 11:00 to 11:50 (America/Los_Angeles)

Over the years, significant progress has been made in enhancing the availability of cloud
systems. However, the increasing prevalence of complex failures is now posing a major
challenge to cloud availability. As cloud systems continue to grow in both scale and
complexity, failures become more difficult to detect and can have disastrous
consequences. Such failures challenge the underlying assumptions of conventional
fault-tolerant designs, requiring the development of innovative system designs to
address them.

In this talk, I will share my experience in addressing three emerging types of failures in
cloud systems. I will start by discussing partial failures of system components and
introduce OmegaGen, a tool that generates customized checkers for detecting and
localizing such failures. Next, I will address the issue of silent failures in system
components and present my study findings, as well as OathKeeper, a tool that uses past
failures to infer and expose such silent issues. I will also briefly discuss our work on
addressing slow failures through RESIN, a solution for detecting, diagnosing, and
mitigating memory leaks in cloud-scale infrastructure, developed in collaboration with
Microsoft Azure. To conclude, I will provide a glimpse into exciting future directions for
building dependable cloud systems.

Bourns A265

Type: Colloquium
Target Audience: Faculty
Admission: Free
Registration Required: No