Past Events
PhD Defense: Addressing Fault Tolerance for Staging Based Scientific Workflows
Friday, April 24, 2020, 11:00am - 12:30pm
Speaker: Shaohua Duan
Location: Remote via Webex
Committee:
Prof. Manish Parashar (Chair)
Prof. Santosh Nagarakatte
Prof. Sudarsun Kannan
Prof. George Bosilca (External Member, University of Tennessee)
Event Type: PhD Defense
Abstract: Scientific in-situ workflows, i.e., workflows that execute entirely on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, running in-situ scientific workflows on extreme-scale computing systems presents fault tolerance challenges that significantly affect the correctness and performance of workflows. This presentation addresses challenges related to data resilience and fault tolerance for in-situ scientific workflows and makes the following contributions. First, I present CoREC, a scalable, resilient in-memory data staging runtime that supports data resilience for large-scale in-situ workflows. Then, I present a staging-based error detection approach that leverages idle computation resources in data staging to enable timely detection of and recovery from silent data corruption. Finally, I present a loosely coupled checkpoint/restart framework with data logging that provides a scalable and flexible fault tolerance scheme for in-situ workflows while maintaining data consistency and low resiliency cost.