CS Events

PhD Defense

Addressing Fault Tolerance for Staging Based Scientific Workflows

 

Friday, April 24, 2020, 11:00am - 12:30pm

 

Speaker: Shaohua Duan

Location: Remote via Webex

Committee

Prof. Manish Parashar (Chair)

Prof. Santosh Nagarakatte

Prof. Sudarsun Kannan

Prof. George Bosilca (External Member, University of Tennessee)

Event Type: PhD Defense

Abstract: Scientific in-situ workflows, which execute the entire application workflow on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, running in-situ scientific workflows on extreme-scale computing systems presents fault tolerance challenges that significantly affect the correctness and performance of workflows. This presentation addresses challenges related to data resilience and fault tolerance for in-situ scientific workflows and makes the following contributions. First, I present CoREC, a scalable, resilient in-memory data staging runtime that supports data resilience for large-scale in-situ workflows. Second, I present a staging-based error detection approach that leverages idle computation resources in data staging to enable timely detection of, and recovery from, silent data corruption. Finally, I present a loosely coupled checkpoint/restart framework with data logging that provides a scalable and flexible fault tolerance scheme for in-situ workflows while maintaining data consistency and low resiliency cost.

 

Meeting number: 790 751 469
Password: C3pWc8a8JbF

https://rutgers.webex.com/rutgers/j.php?MTID=me16faa22ced12f818ab1a708d9af3f5c

Join by video system
Dial 790751469@rutgers.webex.com
You can also dial 173.243.2.68 and enter your meeting number.

Join by phone
+1-650-429-3300 USA Toll
Access code: 790 751 469