Bitonic Sort: CUDA
cuda_bitonic.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel declarations. Note that __global__ functions cannot be declared
// __forceinline__, and the signatures must match the definitions below.
__global__ void bitonicSort(float* a, int stride);
__global__ void bitonicBuild(float* a, int stride, int k);
void bitonicBuildRunner(float* a, int size);
void bitonicSortRunner(float* a, int size);
int main(int argc, char **argv) {
    // Read the input sequence on the host.
    float* array;
    int n, i;
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <input file>\n", argv[0]);
        exit(1);
    }
    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        fprintf(stderr, "File not found.\n");
        exit(1);
    }
    // First entry in the file is the size of the input, which is
    // assumed to be a power of two.
    fscanf(f, "%d", &n);
    array = (float*) malloc(n * sizeof(float));
    for (i = 0; i < n; i++) {
        fscanf(f, "%f", array + i);  // floats need %f, not %d
    }
    fclose(f);
    // Transfer flow of control to the device: first build a bitonic
    // sequence out of the input, then sort it.
    bitonicBuildRunner(array, n);
    bitonicSortRunner(array, n);
    free(array);
    return 0;
}
void bitonicSortRunner(float* a, int size) {
    // Copy the (now bitonic) sequence over to device memory.
    float* d_a;
    size_t mem = sizeof(float) * size;
    cudaMalloc((void**)&d_a, mem);
    cudaMemcpy(d_a, a, mem, cudaMemcpyHostToDevice);
    // Execution config: one thread per element, in fixed-size blocks.
    int threadsPerBlock = (size < 512) ? size : 512;
    int numBlocks = size / threadsPerBlock;
    // Halve the comparator stride on every step; each kernel launch is
    // one step of the network and acts as a device-wide barrier.
    int stride;
    for (stride = size / 2; stride > 0; stride /= 2) {
        bitonicSort<<<numBlocks, threadsPerBlock>>>(d_a, stride);
    }
    // Bring the fully sorted result back to the host.
    cudaMemcpy(a, d_a, mem, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}
void bitonicBuildRunner(float* a, int size) {
    float* d_a;
    size_t mem = sizeof(float) * size;
    cudaMalloc((void**)&d_a, mem);
    cudaMemcpy(d_a, a, mem, cudaMemcpyHostToDevice);
    int threadsPerBlock = (size < 512) ? size : 512;
    int numBlocks = size / threadsPerBlock;
    // Sort ever-larger sub-lists in alternating directions until the
    // whole array forms a single bitonic sequence.
    int k, stride;
    for (k = 2; k < size; k *= 2) {
        for (stride = k / 2; stride > 0; stride /= 2) {
            bitonicBuild<<<numBlocks, threadsPerBlock>>>(d_a, stride, k);
        }
    }
    // Stage the intermediate result back through the host;
    // bitonicSortRunner copies it onto the device again.
    cudaMemcpy(a, d_a, mem, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}
/**
 * One step of the final bitonic merge. Each thread acts as a single
 * comparator: it swaps its element with its partner if the pair is
 * out of (ascending) order.
 */
__global__ void bitonicSort(float* a, int stride) {
    // First we need to figure out what index each thread will access.
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int partner = index ^ stride;
    // Only the lower thread of each pair writes, so no atomics are
    // needed (atomicMin/atomicMax are not defined for floats anyway).
    // Grid-wide synchronization happens between kernel launches, so
    // no __syncthreads() is needed here either.
    if (partner > index && a[index] > a[partner]) {
        float tmp = a[index];
        a[index] = a[partner];
        a[partner] = tmp;
    }
}
/**
 * One step of the bitonic build. Repeated over decreasing strides,
 * this combines sorted sub-lists into ever-larger bitonic sequences.
 * @param a      Pointer to the start of the sequence.
 * @param stride Distance between the two elements each comparator examines.
 * @param k      Size of the sub-lists being built; determines when to
 *               switch between ascending and descending order.
 */
__global__ void bitonicBuild(float* a, int stride, int k) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int partner = index ^ stride;
    if (partner > index) {
        // Sub-lists alternate direction so that each adjacent pair of
        // them forms a bitonic sequence. Both elements of a pair share
        // the same value of (index & k), so they agree on direction.
        int asc = ((index & k) == 0);
        if ((asc && a[index] > a[partner]) ||
            (!asc && a[index] < a[partner])) {
            float tmp = a[index];
            a[index] = a[partner];
            a[partner] = tmp;
        }
    }
}
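To compile and run (assuming the CUDA toolkit is installed and an input file whose first entry is the element count, followed by the floats, as main expects):

nvcc -o bitonic cuda_bitonic.cu
./bitonic input.txt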
Although a recursive implementation seems tempting, the added overhead
from the extra function calls would eliminate much of the speedup gained
from parallelization. Instead, simple loops iterate through the steps of
the algorithm. The sequence to sort is stored on the host, where it
persists for the lifetime of the application. While the sequence is
actually being operated on, however, it is copied to the device, where
caching and other optimizations make memory accesses far faster than
reaching back across the bus to the host.
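To make the loop structure concrete, the following host-only sketch (standalone; it does not touch the kernels above) prints the comparator pairs for each step of an 8-element network, with ^ marking an ascending comparator and v a descending one:

#include <stdio.h>

int main(void) {
    int n = 8, k, stride, i;
    for (k = 2; k <= n; k *= 2) {               /* sub-list size */
        for (stride = k / 2; stride > 0; stride /= 2) {
            printf("k=%d stride=%d:", k, stride);
            for (i = 0; i < n; i++) {
                int partner = i ^ stride;       /* pair i with i^stride */
                if (partner > i)
                    printf(" (%d,%d)%c", i, partner,
                           (i & k) ? 'v' : '^');
            }
            printf("\n");
        }
    }
    return 0;
}

The k == n pass at the end corresponds to what bitonicSortRunner performs on the full array.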
In the sorting network, each comparator is implemented as an individual
thread. For both the bitonic build and split procedures, the work is
partitioned across blocks of threads; each thread examines one pair of
elements and swaps them if they are out of order. Threads within the
same block can synchronize and communicate through shared memory, while
successive kernel launches serve as device-wide barriers between steps
of the network. On each step there is one active comparator for every
two elements, so a sequence of size n needs n / 2 of them. Because each
runner above allocates and frees its own device buffer, the sequence is
staged back through the host between the build and sort phases; keeping
it resident on the device for both phases would avoid that round trip.
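A natural refinement along those lines (a sketch that reuses the two kernels above; it is not part of the original listing) keeps the array on the device for both phases:

void bitonicFullSortRunner(float* a, int size) {
    float* d_a;
    size_t mem = sizeof(float) * size;
    cudaMalloc((void**)&d_a, mem);
    cudaMemcpy(d_a, a, mem, cudaMemcpyHostToDevice);
    int threadsPerBlock = (size < 512) ? size : 512;
    int numBlocks = size / threadsPerBlock;
    int k, stride;
    // Build phase: no intermediate copy back to the host.
    for (k = 2; k < size; k *= 2)
        for (stride = k / 2; stride > 0; stride /= 2)
            bitonicBuild<<<numBlocks, threadsPerBlock>>>(d_a, stride, k);
    // Merge phase runs on the same device buffer.
    for (stride = size / 2; stride > 0; stride /= 2)
        bitonicSort<<<numBlocks, threadsPerBlock>>>(d_a, stride);
    cudaMemcpy(a, d_a, mem, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}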
Because the comparisons are almost embarrassingly parallel, very little
time is wasted on communication. A small amount of time is spent at
barriers: threads within a block wait for one another to finish, and
each step of the network waits for the previous step to complete before
proceeding. The most significant delay, however, is the latency of
transferring data between the host and the device, which takes far
longer than memory accesses on the device itself. As host-device
interconnects such as PCIe become faster and more efficient, this
latency shrinks.
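One standard mitigation (not used in the listing above) is page-locked, or pinned, host memory, which the device can transfer to and from faster than pageable memory and which also enables asynchronous copies. A minimal sketch, replacing the malloc in main:

float* h_array = NULL;
// Pinned allocation; everything else in main stays the same.
cudaHostAlloc((void**)&h_array, n * sizeof(float), cudaHostAllocDefault);
// ... read the input into h_array and run the sort as before ...
cudaFreeHost(h_array);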
Thanks to parallelism, all the comparisons within a step happen
simultaneously, which greatly reduces the time spent comparing even
though the network performs more comparisons in total than a good
sequential sort. Tradeoffs in using a CUDA implementation of this sort
include the aforementioned transfer latency and the limited memory on
the device, which may hinder the algorithm or require extra
partitioning of the sequence. Also, given a sufficiently fast host
processor, the algorithm may run nearly as quickly sequentially as in
parallel because of the unavoidable transfer delays.
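To make the tradeoff concrete: for n = 2^20 elements, the network runs log2(n) * (log2(n) + 1) / 2 = 20 * 21 / 2 = 210 steps of n / 2 = 524,288 comparators each, roughly 110 million comparisons in total, whereas a sequential O(n log n) sort performs only about 21 million. The parallel version wins on wall-clock time because those 210 steps each finish in near-constant time, not because it does less work.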
External factors that affect the speed of the sort include:
- Communication latency, as described above.
- Device memory capacity. When shared memory is limited, tiling is
practical only on moderately sized sublists (see the sketch below). When
the sequence to sort is extremely large, data must be kept in global
memory, and it may not be practical to run the sort at all.
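For the smaller strides, each comparator's partner lies within the same thread block, so a tiled variant can finish all remaining steps of a pass in a single launch out of shared memory. The following is a hypothetical sketch, not part of the listing above; it assumes one thread per element, the same direction rule as bitonicBuild, and that the host has already run the per-step kernel for strides of blockDim.x and above:

// Runs strides firstStride, firstStride/2, ..., 1 for the current
// sub-list size k, keeping each block's tile in shared memory.
// firstStride must be min(k / 2, blockDim.x / 2).
__global__ void bitonicBuildTiled(float* a, int k, int firstStride) {
    extern __shared__ float tile[];
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride;
    tile[threadIdx.x] = a[i];
    __syncthreads();
    for (stride = firstStride; stride > 0; stride /= 2) {
        int partner = threadIdx.x ^ stride;
        if (partner > threadIdx.x) {
            // Same alternating-direction rule as bitonicBuild.
            int asc = ((i & k) == 0);
            float x = tile[threadIdx.x], y = tile[partner];
            if ((asc && x > y) || (!asc && x < y)) {
                tile[threadIdx.x] = y;
                tile[partner] = x;
            }
        }
        __syncthreads();
    }
    a[i] = tile[threadIdx.x];
}

// Example launch, passing the tile size as dynamic shared memory:
//   bitonicBuildTiled<<<numBlocks, threadsPerBlock,
//                       threadsPerBlock * sizeof(float)>>>(d_a, k, s);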