
Computer Science Department Colloquium

Designing Exascale Distributed Systems

 


Friday, March 03, 2023, 10:30am - 11:30am

 

Speaker: Saurabh Kadekodi

Bio

Saurabh Kadekodi obtained his PhD in the Computer Science Department at Carnegie Mellon University (CMU) in 2020 as part of the Parallel Data Laboratory (PDL) under the guidance of Prof. Gregory Ganger and Prof. Rashmi Vinayak. After graduation, Saurabh joined Google as a Visiting Faculty Researcher, and is currently a Research Scientist in the Storage Analytics team. Saurabh is broadly interested in designing distributed systems, with a special focus on the performance and reliability of storage systems. At Google, Saurabh is working towards implementing his PhD thesis on disk-adaptive redundancy and other exciting research ideas in some of the largest systems in the world.

Location: CoRE 301

Event Type: Computer Science Department Colloquium

Abstract: Fundamental physical limitations have slowed down hardware scaling, thus ending the “free” scaling benefits of processing power and storage capacity. At the same time, data is growing at an unprecedented rate. This data juggernaut is highly disruptive. It morphs benign assumptions into critical bottlenecks, and forces radical system (re-)designs. My work replaces design decisions of distributed systems that are disrupted by scale with new, data-driven solutions that are efficient, scalable, nimble, and robust. As an example, I will describe disk-adaptive redundancy (DARE): a novel redesign of data reliability in exascale storage clusters driven by insights gleaned from studying over 5.3 million disks from production environments of Google, NetApp and Backblaze. I will also describe three new DARE systems that reduce conservative over-protection of data by up to 20% amounting to millions of dollars of cost savings along with a significant carbon footprint reduction, while always meeting desired data reliability targets. Additionally, I will briefly describe some past and current research efforts to improve the availability and performance of local and distributed storage systems including new erasure codes that reduce observed unavailability events at Google by up to 33%, a novel aging framework that can systematically age local file systems to look over 20 years old in less than 6 hours, and an efficient packing and indexing layer in public cloud infrastructures that boosts the throughput of accessing tiny objects by over 60x while simultaneously reducing the cost of accessing them by over 25000x. Finally, I will touch upon the open challenges in designing exascale distributed systems and highlight promising future directions.

Contact: Yipeng Huang

Livestream available via Zoom:
https://rutgers.zoom.us/j/99551035958?pwd=d0Uvb1RncGh1cW5CK3NBNGU1Z2d4QT09