
PhD Defense

Addressing Fault Tolerance for Staging Based Scientific Workflows



Friday, April 24, 2020, 11:00am - 12:30pm


Speaker: Shaohua Duan

Location: Remote via Webex


Committee:

Prof. Manish Parashar (Chair)

Prof. Santosh Nagarakatte

Prof. Sudarsun Kannan

Prof. George Bosilca (External Member, University of Tennessee)

Event Type: PhD Defense

Abstract: Scientific in-situ workflows, i.e., workflows that execute entirely on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, running in-situ scientific workflows on extreme-scale computing systems presents fault tolerance challenges that significantly affect the correctness and performance of workflows. This presentation addresses challenges related to data resilience and fault tolerance for in-situ scientific workflows, and makes the following contributions. First, I present CoREC, a scalable, resilient in-memory data staging runtime that supports data resilience for large-scale in-situ workflows. Then, I present a staging-based error detection approach that leverages idle computation resources in data staging to enable timely detection of, and recovery from, silent data corruption. Finally, I present a loosely coupled checkpoint/restart framework with data logging that provides a scalable and flexible fault tolerance scheme for in-situ workflows while maintaining data consistency and low resiliency cost.


Meeting number: 790 751 469
Password: C3pWc8a8JbF


Join by video system
Dial the meeting's video address and enter your meeting number.

Join by phone
+1-650-429-3300 USA Toll
Access code: 790 751 469