TaskProf Tutorial @ PPoPP 2018
Debugging and Profiling Task Parallel Programs with TaskProf
When: Sunday, February 25, 2018. 8:30 AM - 12 PM
Where: PPoPP 2018, Europa 6
Task Parallelism is an effective approach for writing performance portable parallel programs. In this model, the programmer expresses the parallelism in the program by specifying tasks and the task runtime dynamically maps these tasks to processors. The runtime balances the workload using work stealing algorithms. There are a number of languages and frameworks that support task parallelism (e.g., Intel Threading Building Blocks (TBB), Cilk, Habanero Java, X10, Microsoft Task Parallel Library, Java Fork/Join tasks, and the Rust language). This tutorial will present tools and techniques for performance and correctness debugging of task parallel programs written using the Intel Thread Building Blocks library.
Task parallel programs can have correctness issues such as data races and atomicity violations, similar to multithreaded programs. A program exhibits a data race when there are multiple accesses to a shared memory location, at least one of them is a write, and there is no ordering between these accesses. In the absence of locks, a data race occurs when these accesses are not ordered by task management (spawn/sync) constructs. In the presence of locks, a data race occurs when two parallel accesses (one of which is a write) are not protected by a common lock. This tutorial demostrates techniques to detect data races per-input for a given task parallel program.
This tutorial will educate participants on identifying parallelism bottlenecks in task parallel programs. A common metric used to quantify the performance of a task parallel program is asymptotic parallelism, which measures the potential speedup when the program is executed on a large number of processors. It is constrained by the longest chain of tasks that must be executed sequentially (also known as the span or the critical work). Hence, asymptotic parallelism is the ratio of the total work and the critical work performed by the program for a given input. A scalable program must have large asymptotic parallelism. A task parallel program can have low asymptotic parallelism due to multiple factors: coarse-grained tasks, limited work performed by the program, and secondary effects of execution such as contention, low locality, and false sharing.
In this tutorial, we will introduce TaskProf, a profiler that measures asymptotic parallelism in task parallel programs for a given input. TaskProf computes an accurate parallelism profile by performing fine-grained attribution of work to various parts of the program using the structure of a task parallel execution. We will highlight the causal profile feature of TaskProf. TaskProf's causal profile allows users to estimate improvements in parallelism when regions of code are optimized, even before concrete optimizations for them are known. Finally, we will have a hands-on exercise where participants use TaskProf to identify parallelism bottlenecks, estimate the improvement from optimizing the bottlenecks, and design concrete optimizations to obtain performance improvements.
8:30 - 8:45 : Introduction to task parallelism and challenges in writing performance portable programs
8:45 - 9:45 : Core algorithms used for parallelism profiling and what-if analyses
9:45 - 10:00: Demo of profiling with TaskProf
10:00 - 10:30 : Break
10:30 - 11:00 : Demo of bottleneck identifcation with TaskProf
11:00 - 12:00 : Hands-on session with TaskProf to find bottlenecks with two TBB applications
Familiarity with C++ and interest in performance analysis of parallel programs.
Artifacts for the Tutorial
- Measurement of inherent parallelism in a task parallel program
- Construction of series-parallel relationships and fine-grained measurements with task parallel programs
- What-If analyses to identify serialization bottlenecks
- Building debugging tools using series-parallel relationships
- Adarsh Yoga,and Santosh Nagarakatte - A Fast Causal Profiler For Task Parallel Programs - FSE 2017.
- Adarsh Yoga, Santosh Nagarakatte, and Aarti Gupta - Parallel Data Race Detection for Task Parallel Programs with Locks - FSE 2016.