PhD Defense: Addressing Fault Tolerance for Staging Based Scientific Workflows
Friday, April 24, 2020, 11:00am - 12:30pm
Speaker: Shaohua Duan
Location: Remote via Webex
Committee:
Prof. Manish Parashar (Chair)
Prof. Santosh Nagarakatte
Prof. Sudarsun Kannan
Prof. George Bosilca (External Member, University of Tennessee)
Event Type: PhD Defense
Abstract: Scientific in-situ workflows, i.e., workflows that execute entirely on the HPC system, have emerged as an attractive approach to addressing data-related challenges by moving computation closer to the data, and staging-based frameworks have been used effectively to support in-situ workflows at scale. However, running in-situ scientific workflows on extreme-scale computing systems presents fault tolerance challenges that significantly affect the correctness and performance of workflows. This presentation addresses challenges related to data resilience and fault tolerance for in-situ scientific workflows, and makes the following contributions. First, I present CoREC, a scalable, resilient in-memory data staging runtime that supports data resilience for large-scale in-situ workflows. Then, I present a staging-based error detection approach that leverages idle computation resources in data staging to enable timely detection of and recovery from silent data corruption. Finally, I present a loosely coupled checkpoint/restart framework with data logging that provides a scalable and flexible fault tolerance scheme for in-situ workflows while maintaining data consistency and low resiliency cost.
Meeting number: 790 751 469
Password: C3pWc8a8JbF
https://rutgers.webex.com/rutgers/j.php?MTID=me16faa22ced12f818ab1a708d9af3f5c
Join by video system
Dial 173.243.2.68 and enter your meeting number.
Join by phone
+1-650-429-3300 USA Toll
Access code: 790 751 469