
bucket sort

########################### 
####    bucket sort    #### 
########################### 
 
quite an elaborate algo that can beat quicksort, but it is only effective when the input is truly uniformly distributed and O(n+k) auxiliary mem space is affordable. otherwise it degrades to O(n^2). 
 
similar to counting sort, bucket sort puts each of the N input elements into one of k buckets. but this time k can be any number in [1,N], chosen according to how much space we can afford. bucket sort thus overcomes the restriction of counting sort, which requires knowing the input value range in advance, as in [0,k). 
 
each of the N input elements is hashed to decide which bucket it gets stored in. just like a hash table, each bucket maintains a linked list (implementation details can be modified if needed.) 
 
and each bucket is insertion-sorted back into the original array. 
 
in the most basic form of bucket sort, we define k=N. 
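 
just to fix the idea before the examples, here is a minimal sketch of those steps in C++ (my own hypothetical sketch, not the linked-list implementation further below): std::vector buckets, std::sort per bucket, and an order-preserving bucket index x*k/(max+1) instead of x % N, assuming non-negative ints, so that concatenating the sorted buckets already yields the sorted array. 
 
#include <vector> 
#include <algorithm> 
 
void bucketSortSketch(std::vector<int>& a) 
{ 
  if (a.size() < 2) return; 
  const size_t k = a.size();                            // most basic form: k = N buckets 
  const long long maxVal = *std::max_element(a.begin(), a.end()); 
  std::vector<std::vector<int>> buckets(k); 
  for (int x : a)                                       // 1. hash each elem into its bucket 
    buckets[(size_t)((long long)x * (long long)k / (maxVal + 1))].push_back(x); 
  size_t idx = 0; 
  for (std::vector<int>& b : buckets) {                 // 2. sort each bucket 
    std::sort(b.begin(), b.end()); 
    for (int x : b) a[idx++] = x;                       // 3. concatenate back into a[] 
  } 
} 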
 
lets go thru some concrete examples. 
 
 
 
imagine hash(x) = x % N 
 
for(int i=0; i<N ;i++) 
  b[hash(n[i])].append(n[i]);    // this will yield below 
 
input data set (N=9)   |      buckets k=N 
-------------------------------------------- 
   3                   |       b[0] = 0 
   0                   |       b[1] = 1 
   5                   |       b[2] = 2 
   2                   |       b[3] = 3 
   6                   |       b[4] = 4 
   8                   |       b[5] = 5 
   7                   |       b[6] = 6 
   1                   |       b[7] = 7 
   4                   |       b[8] = 8 
 
====> perfect. no need to insertionSort(b[i]), we can just copy b[i] back to n[i] 
====> lets see a less ideal example 
 
input data set (N=9)   |      buckets k=N 
-------------------------------------------- 
   3                   |       b[0] = null   // now we begin to see that some buckets stay empty, 
   3                   |       b[1] = 1      // so each bucket better maintain a size var (or we check head ptr != null) to detect them 
   5                   |       b[2] = 2 
   2                   |       b[3] = 3-3-3 
   3                   |       b[4] = 4-4 
   8                   |       b[5] = 5 
   4                   |       b[6] = null 
   1                   |       b[7] = null 
   4                   |       b[8] = 8 
 
 
=====> actually, this works out the same as counting sort: each bucket holds only copies of a single value, so no insertion sort is needed for any b[i] 
=====> not a good example to appreciate the wonder of bucket sort. 
=====> lets try a diff one. 
 
 
hash(x) =  floor(x / 10) 
 
input data set (N=9)   |      buckets k=N 
-------------------------------------------- 
   35                  |       b[0] = null 
   31                  |       b[1] = 13 
   57                  |       b[2] = 29 
   29                  |       b[3] = 35-31-38 
   38                  |       b[4] = 42-46 
   80                  |       b[5] = 57 
   42                  |       b[6] = null 
   13                  |       b[7] = null 
   46                  |       b[8] = 80 
 
 
===> now iterate each b[i] and insertion-sort(b[i]) back to the original array n[] 
===> as you can see, if the input is not uniformly distributed, this can be very inefficient: one bucket may end up holding all N elements, at which point you are just doing insertionSort(n) = O(n^2), plus an added O(n) time and space cost. the O(n) time is for hashing each n[i] into b[i], and the O(n) space is for the buckets b[]. 
 
===> now for the generalization's sake, lets try k = N/3 
 
hash(x) = floor(x / 30) 
// any divisor of roughly (max value / k) works here, since we assume uniformity of the input data set. 
 
input data set (N=9)   |      buckets k=N/3 
-------------------------------------------- 
   35                  |       b[0] = 29-13 
   31                  |       b[1] = 35-31-57-38-42-46 
   57                  |       b[2] = 80 
   29                  | 
   38                  | 
   80                  | 
   42                  | 
   13                  | 
   46                  | 
 
 
 
===> hence, some regard bucket sort as the generalized counting sort. 
 
===> also instead of insertion sort for each b[i], if you take k=2 and recursively apply bucketSort(b[i],k=2), then it essentially becomes a version of quicksort where your pivot selection scheme depends on your hash function implementation. 
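 
===> a rough sketch of that recursive k=2 idea (function name and structure are mine, purely for illustration): each level "hashes" on the midpoint of the current value range, which behaves just like quicksort picking the midpoint of the range as its pivot. 
 
#include <vector> 
 
// values of v are assumed to lie in [lo, hi] 
void twoBucketSort(std::vector<int>& v, int lo, int hi) 
{ 
  if (v.size() < 2 || lo >= hi) return;               // lo == hi means every remaining value is equal 
  const int mid = lo + (hi - lo) / 2;                 // hash(x) = (x <= mid) ? 0 : 1 
  std::vector<int> b0, b1; 
  for (int x : v) (x <= mid ? b0 : b1).push_back(x); 
  twoBucketSort(b0, lo, mid);                         // recurse instead of insertion-sorting the bucket 
  twoBucketSort(b1, mid + 1, hi); 
  v = b0;                                             // concatenate; no merge step needed 
  v.insert(v.end(), b1.begin(), b1.end()); 
} 
 
===> calling twoBucketSort(v, minVal, maxVal) sorts v. with a heavily skewed input, one side keeps receiving almost everything, which is exactly the bad-pivot behavior of quicksort. 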
 
similarly, bucket sort can be seen as a k-way relative of merge sort, except that the buckets are range-partitioned, so the "merge" step is just concatenating the sorted buckets. 
 
 
(ref) 
http://java.dzone.com/articles/algorithm-week-bucket-sort 
http://en.wikipedia.org/wiki/Bucket_sort 
 
 
 
 
################################### 
####    complexity analysis    #### 
################################### 
 
lets assume k = N 
 
our usual best/average/worst case analysis here depends heavily on the conditions/assumptions we make about the input and on how well the hash function performs. 
 

# time 

 
best    : O(n+k) 
average : O(n+k)  // see statistical theory below 
worst   : O(n^2)  // you can argue it is O(n) if uniformity and O(n) auxiliary space are given, but those are exactly the assumptions we make for the best/average cases. some textbooks define bucket sort with uniform input and an order-preserving hash function as given, in which case the worst case time complexity is O(n). it's just a matter of definition. 
 
hashing each of the N elements takes O(n), and then, 
when we have total uniformity and a good hash function that maps input_array[i] into bucket[i] in order, the bucket array read out in order is already the final sorted array. 
 
 
for the average case analysis, let the input data set have size n, define x[i] to be the number of elements that land in bucket b[i], and let p[i] be the probability that a given element lands in bucket i. with a uniform input, p[i] = p = 1/n for all i. 
 
the expected value of x[i] is; 
 
E(x[i]) = n * p 
        = 1 
 
we can confirm this from multi-nomial distribution of probability. // E(x[i]) = np[i],  V(x[i]) = np[i](1-p[i]) 
 
note also that x[0] + x[1] + ... + x[n-1] = n, since every element lands in exactly one bucket. 
so the average bucket occupancy is 
 
(1/n) * sum[i=0,n-1] x[i] = n * (1/n) 
                          = 1 
 
which agrees with E(x[i]) = 1: in each bucket, we should expect 1 elem on average. 
 
 
lets compute the variance also. again, p = 1/n for all p[i] 
 
an intuitive way of computing it is; 
 
V(x[i]) = n*p*(1-p)    // marginally, x[i] is binomial(n,p), whose variance is np(1-p); with p = 1/n this is 1 - 1/n < 1 
 
also, we can deduce the following formula from the generalized theory (here x is a generic discrete random variable taking value x[i] with probability p[i]) 
 
V(x) = sum[i=0,n-1] (x[i] - E(x)) ^2 * p[i]                    // just expand (x-y)^2  = x^2 - 2xy + y^2 
     = sum[i=0,n-1] (x[i]^2 - 2*x[i]*E(x) + E^2(x)) * p[i]     // now we will manipulate this 
     = sum[i=0,n-1] (x[i]^2 * p[i])    // x^2 
     - 2*E(x)*sum[i=0,n-1] x[i]*p[i]   // -2xy 
     + E^2(x)*sum[i=0,n-1] p[i]        // y^2 
     = E(x^2) - 2*E(x)*E(x) + E^2(x)*1 
     = E(x^2) - 2*E^2(x) + E^2(x) 
     = E(x^2) - E^2(x) 
 
 
now we can deduce E(x^2) from the above 
 
E(x[i]^2) = n(n-1)p^2 + np        // derivation below (also on the ref link) 
          = 2 - 1/n               // substituting p = 1/n; note 1/n is in (0,1] 
          < 2 
          = O(1) 
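 
===> the derivation is just the variance identity from above, rearranged, together with the multinomial marginals E(x[i]) = np and V(x[i]) = np(1-p): 
 
E(x[i]^2) = V(x[i]) + E^2(x[i]) 
          = np(1-p) + (np)^2 
          = np - np^2 + n^2p^2 
          = n(n-1)p^2 + np 
          = (n-1)/n + 1          // substituting p = 1/n 
          = 2 - 1/n 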
 
===>   now what does x[i]^2 mean in terms of our time complexity analysis? insertion-sorting a bucket that holds x[i] elements costs O(x[i]^2) in the worst case, so E(x[i]^2) is the expected per-bucket insertion-sort cost. since it's proven O(1), the insertion-sort work summed over all n buckets is expected O(n), and adding the initial hashing pass, the whole sort runs in O(n) expected time. 
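 
===> written out, the expected total running time is: 
 
E(T(n)) = O(n)                           // hashing each n[i] into b[i] 
        + sum[i=0,n-1] O(E(x[i]^2))      // insertion-sorting bucket i costs O(x[i]^2) 
        = O(n) + n * O(2 - 1/n) 
        = O(n) 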
 
 
 

# space 

 
total     : O(2n+k) 
auxiliary : O(n+k) 
 
the original input array takes O(n) space. k is the number of buckets (an array of k head pointers), and the N elements are stored as list nodes inside the buckets. so the auxiliary space is O(n+k), and the total including the input array is O(2n+k). 
 
 
 
 
(ref) 
http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/Sorting/bucketSort.htm 
http://en.wikipedia.org/wiki/Multinomial_distribution 
http://kakuritsu.com/nikou2.html 
http://aoki2.si.gunma-u.ac.jp/lecture/Bunpu/takou.html 
 
 
 
############################################# 
####    implementation example in C++    #### 
############################################# 
 
gonna implement a "n buckets for n input elems" example. 
 
 
--------------------------------------------- // bucket.cpp 
 
#include <iostream> 
using namespace std; 
 
typedef struct node{ 
  node(int input):data(input),next(0){} 
  int data; 
  node* next; 
} node; 
 
int hashFunc(int x, int M)  // renamed from hash() to avoid clashing with std::hash under "using namespace std" 
{ 
  return x % M; 
} 
 
void bucketSort(int* inputArr,int size) 
{ 
  node** bucket = new node*[size];  // creating buckets of size N (heap-allocated; a VLA is not standard C++) 
  for(int i=0; i < size ; i++) // init bucket 
    bucket[i] = 0; 
 
  // putting each of N elem into buckets 
  for(int i=0; i < size ;i++) 
    { 
      node* new_node = new node(inputArr[i]); 
      int hash_val = hashFunc(inputArr[i], size); 
      node* tmp = bucket[hash_val]; 
      if(tmp == 0){ 
        bucket[hash_val] = new_node; 
      } 
      else 
        while( true )  // insertion loop 
          { 
            if(tmp->next == 0) 
              { 
                tmp->next = new_node; 
                break; 
              } else{ 
              tmp = tmp->next; 
            } 
          } 
    } 
 
  int idx = 0; // index counter for inputArr 
  for(int i=0; i < size ;i++) // i for index for bucket 
    { 
 
      if(bucket[i] == 0) 
        continue; 
 
      node* tmp = bucket[i]; 
 
      while(true) // pop each element off the linked list of bucket[i] and insertion-sort it into inputArr[] 
        { 
          inputArr[idx++] = tmp->data; 
 
          node* ptr_del = tmp; 
          tmp = tmp->next; 
          delete ptr_del; 
 
          // insertion sort 
          for(int k=idx-1; k > 0 ;k--) 
            { 
              if(inputArr[k] < inputArr[k-1]) 
                { 
                  int swap_tmp = inputArr[k];      // plain swap (the add/subtract trick risks signed overflow) 
                  inputArr[k] = inputArr[k-1]; 
                  inputArr[k-1] = swap_tmp; 
                } 
              else 
                break; 
            } 
 
          if(tmp == 0) 
            break;     // emptied this bucket. goto next bucket. 
 
        } 
    } 
 
  delete [] bucket;   // free the array of bucket head pointers 
} 
 
 
void printArr(int* arr, int size) 
{ 
  for(int i=0; i < size ;i++) 
    cout << arr[i] << " "; 
  cout << endl; 
} 
 
int main() 
{ 
  int arr[] = {3,5,1,8,5,9,7,2,0,8,5}; 
  int size = sizeof(arr)/sizeof(arr[0]); 
 
  printArr(arr,size);     // 3 5 1 8 5 9 7 2 0 8 5 
  bucketSort(arr,size); 
  printArr(arr,size);     // 0 1 2 3 5 5 5 7 8 8 9 
 
  return 0; 
} 
 
--------------------------------------------------
