Bitonic Sort: CUDA

Code

cuda_bitonic.cu

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// The sequence is stored on the host. Each phase copies it into device
// global memory, runs its kernels, and copies the result back, so the
// final result eventually overwrites the original array.

__global__ void bitonicSort(float* a, int j);
__global__ void bitonicBuild(float* a, int j, int k);
void bitonicBuildRunner(float* a, int size);
void bitonicSortRunner(float* a, int size);

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <input file>\n", argv[0]);
        exit(1);
    }
    FILE* f = fopen(argv[1], "r");
    if (f == NULL) {
        fprintf(stderr, "File not found.\n");
        exit(1);
    }

    // Size of input; the sort assumes it is a power of two.
    int n, i;
    fscanf(f, "%d", &n);
    float* array = (float*)malloc(n * sizeof(float));
    for (i = 0; i < n; i++) {
        fscanf(f, "%f", array + i);
    }
    fclose(f);

    // Transfer flow of control to the device.
    bitonicBuildRunner(array, n);
    bitonicSortRunner(array, n);

    for (i = 0; i < n; i++) {
        printf("%f\n", array[i]);
    }
    free(array);
    return 0;
}

/**
 * Build phase: merges runs of doubling length k until the whole array
 * is a single bitonic sequence. The final stage (k == size) is left to
 * bitonicSortRunner.
 */
void bitonicBuildRunner(float* a, int size) {
    float* d_array;
    int mem = sizeof(float) * size;
    cudaMalloc((void**)&d_array, mem);
    cudaMemcpy(d_array, a, mem, cudaMemcpyHostToDevice);

    int threadsPerBlock = size < 256 ? size : 256;
    int numBlocks = size / threadsPerBlock;

    for (int k = 2; k < size; k *= 2) {
        for (int j = k / 2; j > 0; j /= 2) {
            bitonicBuild<<<numBlocks, threadsPerBlock>>>(d_array, j, k);
        }
    }

    // Copy the bitonic sequence back to the host; each runner manages
    // its own device allocation.
    cudaMemcpy(a, d_array, mem, cudaMemcpyDeviceToHost);
    cudaFree(d_array);
}

/**
 * Final merge phase: splits the bitonic sequence with halving strides
 * until every element is in ascending order.
 */
void bitonicSortRunner(float* a, int size) {
    // Copy over memory.
    float* d_array;
    int mem = sizeof(float) * size;
    cudaMalloc((void**)&d_array, mem);
    cudaMemcpy(d_array, a, mem, cudaMemcpyHostToDevice);

    int threadsPerBlock = size < 256 ? size : 256;
    int numBlocks = size / threadsPerBlock;

    for (int j = size / 2; j > 0; j /= 2) {
        bitonicSort<<<numBlocks, threadsPerBlock>>>(d_array, j);
    }

    cudaMemcpy(a, d_array, mem, cudaMemcpyDeviceToHost);
    cudaFree(d_array);
}

/**
 * One round of the bitonic split. Each thread is one comparator: it
 * swaps the pair of elements a stride of j apart if they are out of
 * ascending order. The comparators touch disjoint pairs, so no atomic
 * operations are needed.
 */
__global__ void bitonicSort(float* a, int j) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int partner = index ^ j;
    if (partner > index && a[index] > a[partner]) {
        float tmp = a[index];
        a[index] = a[partner];
        a[partner] = tmp;
    }
}

/**
 * Combines two bitonic sequences together to create a new bitonic
 * sequence.
 * @param a Pointer to the start of the array.
 * @param j The comparator stride within the current stage.
 * @param k The current run length; determines when to switch between
 *          ascending and descending order.
 */
__global__ void bitonicBuild(float* a, int j, int k) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int partner = index ^ j;
    if (partner > index) {
        int asc = ((index & k) == 0); // direction alternates every k elements
        if ((asc && a[index] > a[partner]) ||
            (!asc && a[index] < a[partner])) {
            float tmp = a[index];
            a[index] = a[partner];
            a[partner] = tmp;
        }
    }
}
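The listing assumes the element count is a power of two. One plausible way to build and run it (assuming the CUDA toolkit is installed) is nvcc cuda_bitonic.cu -o bitonic, then ./bitonic input.txt, where the first value in input.txt is the element count and the remaining values are the elements.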

Parallelization

Although a recursive implementation seems tempting, the overhead of the extra function calls would eliminate much of the speedup gained from parallelization, so simple loops are used to iterate through the stages of the algorithm instead (a recursive alternative is sketched below for contrast). The collection that holds the sequence to sort is stored on the host, where it can persist throughout the entire application. When the sequence is actually being operated on, however, the data is transferred to the device, where caching and other optimizations make accesses significantly faster than accesses to host memory.
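The sketch below shows roughly what the rejected recursive driver would look like. It is illustrative only: bitonicSortRange is a hypothetical offset-taking variant of the split kernel, not part of the listing above, and the fragment assumes it is appended to the same .cu file. Each level of the recursion adds a host-side call on top of its kernel launch, which is the overhead the flat loops avoid.

// Hypothetical offset-taking variant of the split kernel above.
__global__ void bitonicSortRange(float* a, int offset, int stride) {
    int i = offset + threadIdx.x + blockIdx.x * blockDim.x;
    if (a[i] > a[i + stride]) {
        float tmp = a[i];
        a[i] = a[i + stride];
        a[i + stride] = tmp;
    }
}

// Sketch of a recursive bitonic merge driver: one round of comparators
// at stride count/2, then recursion on each half.
void bitonicMergeRecursive(float* d_array, int offset, int count) {
    if (count < 2) return;
    int threads = count / 2 < 256 ? count / 2 : 256;
    int blocks = (count / 2) / threads;
    bitonicSortRange<<<blocks, threads>>>(d_array, offset, count / 2);
    bitonicMergeRecursive(d_array, offset, count / 2);
    bitonicMergeRecursive(d_array, offset + count / 2, count / 2);
}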

In the sorting network, each comparator is implemented as an individual thread. For both the bitonic build and split procedures, the sequence is partitioned into blocks, and comparators examine and swap elements that are out of order. Threads within the same block can synchronize and communicate with one another as the process occurs, as the sketch below illustrates. On each step of the algorithm there is one block for every two sub-lists being processed. After the bitonic build, the sequence is copied back to the host and re-copied onto the device, since each phase of the implementation manages its own device allocation.
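As an illustration of that intra-block cooperation, the following sketch merges one bitonic tile per block entirely in shared memory, with __syncthreads() serving as the barrier between comparator rounds. BLOCK_SIZE and bitonicMergeShared are illustrative names that do not appear in the listing above, and each tile is assumed to already be bitonic.

#define BLOCK_SIZE 512  // illustrative tile width; one element per thread

// Merge one bitonic tile per block using shared memory.
__global__ void bitonicMergeShared(float* a) {
    __shared__ float tile[BLOCK_SIZE];
    int tid = threadIdx.x;
    tile[tid] = a[blockIdx.x * BLOCK_SIZE + tid];
    __syncthreads();

    // Strides halve each round; the barrier keeps every comparator in
    // the block in lockstep.
    for (int j = BLOCK_SIZE / 2; j > 0; j /= 2) {
        int partner = tid ^ j;
        if (partner > tid && tile[tid] > tile[partner]) {
            float tmp = tile[tid];
            tile[tid] = tile[partner];
            tile[partner] = tmp;
        }
        __syncthreads();
    }
    a[blockIdx.x * BLOCK_SIZE + tid] = tile[tid];
}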

Communication and Data Sharing

Because the comparisons are almost embarrassingly parallel, very little time is wasted on communication. A small amount of time is spent at barriers: threads within a block wait for one another to finish each round, and a block waits for adjacent sub-lists to complete their work before proceeding to the next iteration. The most significant delay, however, is the latency of transferring data between the host and the device, which takes far longer than memory accesses on the device itself; this cost can be measured directly, as sketched below. As host-to-device interconnects become faster and more efficient, this latency can be reduced.
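A minimal sketch of that measurement using CUDA events, assuming a device buffer d_array and byte count mem set up as in the runner functions above:

#include <stdio.h>
#include <cuda_runtime.h>

// Time one host-to-device copy of mem bytes; returns milliseconds.
float timeHostToDevice(float* d_array, float* a, int mem) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_array, a, mem, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the copy has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}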

Communication versus Computation

Thanks to parallelism, each round of comparisons is performed simultaneously, which greatly reduces the number of sequential steps: the network has a depth of O(log^2 n) even though it performs O(n log^2 n) comparisons in total. Tradeoffs of a CUDA implementation of this sort include the aforementioned transfer latency and the limited memory on the device, which may hinder the algorithm or require extra partitioning of the sequence. Given a sufficiently fast host processor, the algorithm may also run nearly as quickly sequentially as in parallel, because the memory-transfer delays are unavoidable; this can be checked empirically, as in the sketch below.
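One way to check that tradeoff is to time a sequential baseline on the host and compare it against the device sort plus its transfers. A minimal sketch, where compareFloats and timeHostSort are hypothetical helpers not in the listing above:

#include <stdlib.h>
#include <time.h>

// Hypothetical comparator for qsort.
static int compareFloats(const void* x, const void* y) {
    float a = *(const float*)x, b = *(const float*)y;
    return (a > b) - (a < b);
}

// Time the sequential baseline in seconds, for comparison against the
// device sort including its host-device transfers.
double timeHostSort(float* data, int n) {
    clock_t begin = clock();
    qsort(data, n, sizeof(float), compareFloats);
    return (double)(clock() - begin) / CLOCKS_PER_SEC;
}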

External Parameters

External factors that affect the speed of the sort include:

The size of the input sequence.
The latency and bandwidth of transfers between the host and the device.
The amount of memory available on the device, which may force extra partitioning of the sequence.
The speed of the host's processors, which determines how competitive a sequential sort would be.