|
|
In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (editors), Volume 1162 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
Abstract. Much of the recent work on multiprocessor scheduling disciplines has used abstract workload models to explore the fundamental, high-level properties of the various alternatives. As continuing work on these policies increases their level of sophistication, however, it is clear that the choice of appropriate policies must be guided at least in part by the typical behavior of actual parallel applications. Our goal in this paper is to examine a variety of such applications, providing measurements of properties relevant to scheduling policy design. We give measurements for both hand-coded parallel programs (from the SPLASH benchmark suites) and compiler-parallelized programs (from the PERFECT Club suite) running on a KSR-2 shared-memory multiprocessor. The measurements we present are intended primarily to address two aspects of multiprocessor scheduling policy design: Much of the recent work on multiprocessor scheduling disciplines has used abstract workload models to explore the fundamental, high-level properties of the various alternatives. As continuing work on these policies increases their level of sophistication, however, it is clear that the choice of appropriate policies must be guided at least in part by the typical behavior of actual parallel applications. Our goal in this paper is to examine a variety of such applications, providing measurements of properties relevant to scheduling policy design. We give measurements for both hand-coded parallel programs (from the SPLASH benchmark suites) and compiler-parallelized programs (from the PERFECT Club suite) running on a KSR-2 shared-memory multiprocessor. The measurements we present are intended primarily to address two aspects of multiprocessor scheduling policy design:
We address these questions through three sets of measurements:
In Proceedings of the 10th International Parallel Processing Symposium, pages 463-468, April 1996.
PostScript (TR)
Abstract. We address the problem of maximizing application speedup through runtime, self-selection of an appropriate number of processors on which to run. Automatic, runtime selection of processor allocations is important because many parallel applications exhibit peak speedups at allocations that are data or time dependent. We propose the use of a runtime system that: (a) dynamically measures job efficiencies at different allocations, (b) uses these measurements to calculate speedups, and (c) automatically adjusts a job's processor allocation to maximize its speedup. Using a set of 10 applications that includes both hand-coded parallel programs and compiler-parallelized sequential programs, we show that our runtime system can reliably determine dynamic allocations that match the best possible static allocation, and that it has the potential to find dynamic allocations that outperform any static allocation.
In Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (editors), Volume 1162 of Lecture Notes in Computer Science. Springer-Verlag, 1996.
Abstract. We consider the use of runtime measured workload characteristics in parallel processor scheduling. Although many researchers have considered the use of application characteristics in this domain, most of this work has assumed that such information is available a priori. In contrast, we propose and evaluate experimentally dynamic processor allocation policies that rely on determining job characteristics at runtime; in particular, we focus on measuring and using job efficiency and speedup. Our work is intended to be a first step towards the eventual development of production schedulers that use runtime measured workload characteristics in making their decisions. The experimental results we present validate the following observations:
We consider both interactive environments, in which a response time directed scheduler is appropriate, and batch environments, in which maximizing useful instruction throughput is the primary goal. Our experiments are performed using prototype implementations running on a 50-node KSR-2 shared memory multiprocessor.