
PhD Defense

Integrating Accelerators

 


Wednesday, December 16, 2020, 01:00pm - 03:00pm

 

Speaker: Jan Vesely

Location: Remote via Webex

Committee

Prof. Abhishek Bhattacharjee (Advisor)

Prof. Thu Nguyen

Prof. Ulrich Kremer

Olivier Giroux (NVIDIA)

Event Type: PhD Defense

Abstract: Accelerators have emerged as a popular way to improve both the performance and energy efficiency of computation. The level at which an accelerator is integrated with the rest of the system varies, from tightly integrated vector floating-point units to standalone GPU and FPGA add-in boards. In this work, we consider loosely coupled accelerators: compute units that run programs in their own context and address spaces. This independence, however, also makes loosely coupled accelerators more challenging to program, and makes it more difficult for applications to benefit from their improved performance and energy efficiency. This thesis focuses on some of the challenges in improving the programmability of accelerators and their integration with the rest of the system. We examine the issues that arise in such heterogeneous systems in three major areas: data access, system services, and acceleration of high-level languages.

First, we examine challenges with data access. To avoid expensive data marshaling overhead, accelerators often support a unified virtual address space (also called unified virtual memory). This feature allows the operating system to synchronize CPU and accelerator address spaces. However, designing such a system requires several trade-offs to accommodate the complexities of maintaining the mirrored layout while matching accelerator-specific data access patterns. This work investigates integrated GPUs as a case study of loosely coupled accelerators and identifies several opportunities for improvement in designing device-side address translation hardware to provide a unified virtual address space.

The second major area we study is access to system services from accelerator programs. While accelerators often work as memory-to-memory devices, there is an increasing amount of evidence in favour of providing them with direct access to the network or permanent storage. This work discusses the suitability of existing operating system interfaces (POSIX) and their semantics for inclusion in GPU programs. We consider the differences between the CPU and GPU execution models and the suitability of CPU system calls from both a semantic and a performance point of view.

Finally, we study the mapping of high-level dynamic languages to accelerators. High-level languages such as Python, with their large selection of support libraries, are increasingly popular with designers of scientific applications. However, Python's ease of use and excellent programmability come at a performance cost. We examine a specific case of cognitive modeling workloads written in Python and propose a path to efficient execution on accelerators. We demonstrate that it is often possible to extract and optimize core computational kernels using standard compiler techniques. Extracting such kernels offers multiple benefits: it improves performance, it eliminates dynamic language features for more efficient mapping to accelerators, and it offers opportunities for exploiting compiler-based analyses to provide direct user feedback.

All demonstrated systems were implemented, and all data were collected, on real systems without any use of system simulators or hardware emulation.

 

https://rutgers.webex.com/rutgers/j.php?MTID=m74e2ef076f77507ee2a03ddabb43b5a6