Cascaded Execution: Speeding Up Unparallelized Execution on Multiprocessor Systems
Ruth Anderson, Thu D. Nguyen, and John Zahorjan

Appears in Proceedings of the 2nd Merged Symposium IPPS/SPDP 1999, April 1999.

Abstract. Both inherently sequential code and limitations of analysis techniques prevent full parallelization of many applications by parallelizing compilers. Amdahl's Law tells us that as parallelization becomes increasingly effective, any unparallelized loop becomes an increasingly dominant performance bottleneck.
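The Amdahl's Law argument can be made concrete with a small sketch. The function names `amdahl_speedup` and `serial_share` and the example fraction of 0.9 are illustrative assumptions, not taken from the paper.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when `parallel_fraction` of the run time is
    spread perfectly over `processors` and the rest stays sequential."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / processors)


def serial_share(parallel_fraction, processors):
    """Fraction of the parallel run time spent in unparallelized code."""
    serial = 1.0 - parallel_fraction
    return serial / (serial + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical program: 90% of its sequential run time is parallelizable.
    for p in (1, 2, 4, 8, 16):
        print(p, amdahl_speedup(0.9, p), serial_share(0.9, p))
```

In this hypothetical case the unparallelized 10% of the work accounts for a growing share of the run time as processors are added, which is the bottleneck the abstract refers to.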

We present a technique for speeding up the execution of unparallelized loops by cascading their sequential execution across multiple processors: only a single processor executes the loop body at any one time, and each processor executes only a portion of the loop body before passing control to another. Cascaded execution allows otherwise idle processors to optimize their memory state for the eventual execution of their next portion of the loop, resulting in significantly reduced overall loop body execution times.
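The control structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `cascaded_loop`, the example recurrence, the chunk size, and the use of a list copy as a stand-in for memory-state optimization are all hypothetical. Python threads do not run bytecode in parallel, so the sketch shows only the hand-off of control and the preserved sequential order, not any speedup.

```python
import threading


def cascaded_loop(a, b, num_workers=4, chunk=8):
    """Compute x[i] = x[i-1] * a[i] + b[i] (with x[-1] = 0) in strict
    sequential order, one chunk at a time, cascading across workers."""
    n = len(a)
    x = [0] * n
    chunks = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    # turn[k] is set once chunk k is allowed to execute.
    turn = [threading.Event() for _ in range(len(chunks) + 1)]
    turn[0].set()

    def worker(w):
        # Worker w owns chunks w, w + num_workers, w + 2 * num_workers, ...
        for k in range(w, len(chunks), num_workers):
            lo, hi = chunks[k]
            # Helper phase (assumed stand-in): while other workers execute
            # earlier chunks, stage this chunk's read-only operands locally.
            staged = list(zip(a[lo:hi], b[lo:hi]))
            # Wait until control is passed to this chunk.
            turn[k].wait()
            # Execution phase: only one worker is ever in this section.
            prev = x[lo - 1] if lo > 0 else 0
            for i, (ai, bi) in enumerate(staged, start=lo):
                prev = prev * ai + bi
                x[i] = prev
            # Pass control to the worker that owns the next chunk.
            turn[k + 1].set()

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

Because the loop-carried value `prev` is read from `x[lo - 1]` only after the previous chunk has signalled completion, the result matches a plain sequential loop; only the staging step overlaps with other workers' execution.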

We evaluate cascaded execution using two workloads: loop nests from wave5, a SPEC95fp benchmark application, and a synthetic benchmark meant to assess the impact of the increasingly dominant memory access times of future processors. Running on a PC with 4 Pentium Pro processors and an SGI Power Onyx with 8 R10000 processors, we observe overall speedups of 1.35 and 1.7, respectively, for the wave5 loops we examined, and speedups as high as 4.5 for individual loops. Our extrapolated results using the synthetic benchmark show a potential for speedups as large as 16 on future machines.