COLLOQUIUM: Designing Exascale Distributed Systems

02/17/2023, 11:00 to 11:50 (America/Los_Angeles)

Fundamental physical limitations have slowed down hardware scaling, thus ending the “free”
scaling benefits of processing power and storage capacity. At the same time, data is growing at an
unprecedented rate. This data juggernaut is highly disruptive. It morphs benign assumptions into
critical bottlenecks, and forces radical system (re-)designs. My work replaces design decisions of
distributed systems that are disrupted by scale with new, data-driven solutions that are efficient,
scalable, nimble, and robust. As an example, I will describe disk-adaptive redundancy (DARE): a
novel redesign of data reliability in exascale storage clusters driven by insights gleaned from
studying over 5.3 million disks from production environments of Google, NetApp and Backblaze.
I will also describe three new DARE systems that reduce conservative over-protection of data by
up to 20%, amounting to millions of dollars in cost savings along with a significant carbon
footprint reduction, while always meeting desired data reliability targets. Additionally, I will
briefly describe some past and current research efforts to improve the availability and
performance of local and distributed storage systems, including new erasure codes that reduce
observed unavailability events at Google by up to 33%, a novel aging framework that can
systematically age local file systems to look over 20 years old in less than 6 hours, and an efficient
packing and indexing layer in public cloud infrastructures that boosts the throughput of accessing
tiny objects by over 60x while simultaneously reducing the cost of accessing them by over 25,000x.
Finally, I will touch upon the open challenges in designing exascale distributed systems and
highlight promising future directions.

Bourns A125

Type: Colloquium
Target Audience: Students
Admission: Free
Registration Required: No