Andrin Von Rechenberg Sandbox
Sandbox
Super Scalar Sample Sort
This is a paper I wrote for ETH, I thought I might as well publish it here
Introduction
Abstract
The Sample Sort Algorithm
How it works
The input is an array of
n integers.
We take the first
k numbers in the array and use them as splitters.
Splitters are elements of the input that will divide the input into smaller pieces. It is fairly simple: You take out the splitters and sort them (using quicksort or inline algorithm). Between the splitters you put buckets. If our input are the numbers 1-100 and our splitters are 32,57 and 73 it will look like this:
1 32 57 73 100
\ Bucket / \ Bucket / \ Bucket / \ Bucket /
"""""""" """""""" """""""" """"""""
Once we have the splitters and the buckets, we run through the input sequentially and throw every element in to the according bucket. This is on step.
Do the whole thing recursivly for every bucket (with the elements in the bucket as the input). The recursion stops when there is at most one element in every bucket. On the way, store the buckets and the splitters in sequential order.
In the end we will have many small buckets in a sorted order that only contain at most one element. If we now run over all buckets and splitters sequentially we have the sorted input.
[Image]
The Speed-Ups (Super Scalar Sample Sort)
Let us look a little more into processor architecture. In the following paragraphs the word processor is an Intel Pentium 4 (later 90nm generation) processor.
An instruction is not executed by the processor in one step. In fact an instruction in a processor runs through 31 stages called the execution pipeline (reading the instruction, getting all the data from memory that is needed to execute this instruction, performing the instruction, writing the result of the instruction back, etc). It would not be very wise to wait until every instruction ran through all the execution stages before processing the next one. That's why processors use instruction parallelism, which means that they already start the next instruction, before the previous instruction has completly finished. This is not always trivial or possbile and can result in big delays, which unnecessarily slow down the algorithm.
Conditional Branches
When the processor comes to a Conditional Branch it tries to predict its direction in order to keep the execution pipeline full.
The prediction rule is simple: It always evaluates the condition as true and hence executes the fist branch. Should the processor finally evaluate the expression as false and hence mispredicted the branch it needs to flush the execution pipeline (which costs 31 cycles!) in order to be in a consistent state again.
In many cases a programmer can predict the control flow of an application and can optimize the conditional branches, so that at execution time, the processor doesn't mispredict branches all the time.
With sorting though the problem of predicting a branch becomes impossible. If we have a conditional branch and we have to predict whether an element will be smaller or greater than another element, we have a 50% chance of beeing correct. This means that every second time we compare two elements we loose 31 processor cycles.
element = randomNumber();
splitter = randomNumber();
if ( element < splitter ) {
// The processor will predict this branch. There is a 50% chance of beeing correct
LeftBucket.add(element)
} else {
// If this is the correct branch, we lose 31 processor cycles.
RightBucket.add(element)
}
The k-Splitters and the buckets
In order to get rid of conditional branches, we do a neat trick to place the elements in the correct bucket, without using conditional branches.
k = 1
If we have only one splitter (
k = 1) it is fairly easy. We put the two buckets in an array and are using a boolean expression to find the correct bucket number in the array (assuming that
true==1
and
false==0
):
element = randomNumber();
splitter = randomNumber();
BucketArray[0]=LeftBucket;
BucketArray[1]=RightBucket;
BucketArray[element>splitter].add(element);
any k
If we have any
k (preferably a power of 2 -1) we build a binary search tree out of the k splitters in order to find the correct bucket.[Image] If we decide k to be small enough to fit the
k splitters into the processors registers we do so, but usually we should choose
k = 255 to allow byte sized comparisons. We keep the 255 splitters in a buffered array.
This is the example with
k = 3 splitters, which results in 4 buckets:
element = randomNumber();
s1,s2,s3 = getSplitters();
BucketArray[] = getBuckets();
firstBranch= element > s2;
BucketArray[ firstBranch*2
+ (!fistBranch & element > s1)*1
+ (fistBranch & element > s3)*1
].add(element)
Sanders does not do it exactly like that, he uses special
conditional move instructions that only modern processors support, but the result is the same.
Pre-Allocating Memory
We can get an additional speed-up, by first allocating buckets with the correct bucket sizes in memory. In order to do that we need to know the bucket sizes in advance.
Instead of throwing the elements directly into the buckets, we first run over all elements and tell them in which bucket they will be thrown and at the same time we keep track how big the buckets will get. After we processed all elements we allocate the buckets with the correct size and throw every element into the right bucket (it will remember in which bucket it belongs).
Instead of always allocating each bucket individually we allocate an auxilary array in the beginning with size
n and use it for our buckets. After every iteration step this auxilary array becomes the input for the next iteration.
Oversampling k
To have more equaly sized buckets we should oversample
k by a factor
a (look at
a*k possible splitters) and take the most equally distributed
k ones.
Sorting smaller subproblems
Ofcourse it we would have a massive overhead if the buckets get really small and we still would use SSSS for sorting them.
Sanders suggests the following:
#elements | Algorithm |
1000 - ... | Super Scalar Sample Sort |
100 - 1000 | quicksort |
5 - 100 | insertion sort |
0 - 5 | customized straight-line code |
Things we get for free
No Data Dependencies
What it is
There is instruction
A that comes before an instruction
B. Data Dendency occurs if
B gets stalled because it waits for the result of
A, but
A is has not yet finished:
c = a + b; // Instruction A
d = c + 1; // Instruction B
Why we don't have it
As soon as we have the
k -splitters fixed, all elements can be placed in the correct bucket independently.
The Speed-Up
Since we don't have any data dependency we can interleave the corresponding instructions for several elements. The compiler will automaticly be able to do loop-unrollment and software-pipelining.
Memory Hierarchies / Cache-Oblivious
n = Number of input elements
k = Number of splitters
M = Number of elements that fit into primary memory
B = Number of elements that can be transfered from secondary to primary memory in one step.
Since we can read through all the elements sequentially we have perfect spatial locality. Quicksort needs to write the entire input and output
log(n/m) times before a subproblem fits into
M. But since Super Scalar Sample Sort is like a divide-and-conquer algorithm and splits the problem into subproblems of about size
n/k we only need to read and write the input
log.k.(n/m) times. In the I/O Model with an omniscent cache of size
M and transfers of size
B, we have
k=O(M/B).
By appropratly choosing the size of
k, this algorithm becomes cache oblivious, under the assumption that all k-splitters fit into memory and they are optimally distributed.
Experiments