Hash

 
############################### 
####    hash table(map)    #### 
############################### 
 
a key-value map (aka associative array) data structure built around a "hash" function, which enables O(1) search/insert/delete in the average case.
 
for a given input data set (= a key-value pair), a hash function h() takes the key and computes a value (called the hash value), which is used as an index into an array container (called the hash table) where the value is stored.
each entry in the hash table is called a bucket. e.g. a hash table of M buckets.
 
e.g. 
---------------------------  // suppose a person's name - his phone number data set 
 
string key = "sugimoto"; 
string value = "090.1234.5678"; 
 
int slot = h(key); 
arr[slot] = value; 
 
-------------------------- 
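
for reference, C++'s standard library already ships a hash table: std::unordered_map. the same phone book example, using the built-in container (a minimal sketch):

---------------------------  // same data set with std::unordered_map

#include <iostream>
#include <string>
#include <unordered_map>
using namespace std;

int main()
{
  unordered_map<string,string> phonebook;    // h(), the bucket array, and collision handling all live inside
  phonebook["sugimoto"] = "090.1234.5678";
  cout << phonebook["sugimoto"] << endl;     // O(1) average-case lookup
  return 0;
}

--------------------------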
 
====> a few important items beg more detailed explanation. 
(1) calculation logic of a hash function   // to avoid collision as much as possible 
(2) collision resolution 
# i.e. what to do when a hash function returns the same value for diff inputs.
 
as above, a hash table is a typical time-space tradeoff example.
 
(ref) 
http://en.wikipedia.org/wiki/Hash_table 
 
 
 
############################### 
####     hash function     #### 
############################### 
 
if you have an array-based hash table of size M, your hash function h() should return an integer in [0,M-1].
so we will need some form of "key mod M" operation.
 
i.e.  h() just does arithmetic operations to transform keys into array indices. 
 
since we cannot have an infinite-size array, h() will inevitably produce the same index for some pairs of diff keys.
 
### 
###  core property: 
### 
 
(0) same input should always produce the same hash value // = consistent, deterministic 
(1) efficient computation speed 
(2) hash values should be uniformly distributed (i.e. every hash value should have equal opportunity)
====> but implementing this requires knowledge of the pattern/nature of the input data.
====> also the knowledge of probability theory for full mathematical analysis. 
====> (2) is sometimes called "simple uniform hashing" assumption. 
 
 
### 
###  core operations: 
### 
 
(1) hash code map (translate input key to a positive int v1) // sometimes split into serialization and diffusion 
(2) compression (computing a bucket index from v1. as in v1 mod M) // where M is the size of the hash table (for modular hashing method) 
 
====> these core operations are merely conceptual steps, so there are a few ways to implement them.
modular(division) hashing method VS multiplication hashing method VS universal hashing method, etc 
 
below, we will assume the division hashing method. 
 

# (1) hashcode map 

 
an input key can be integer, String, float or any other data type. 
but we need to be able to transform it into an array index [0,M-1], i.e. a non-negative integer.
 
sometimes this hashCode() operation is split into two: 
- serialization : transform the key into a stream of bytes 
- diffusion     : transform the stream of bytes into an integer 
 
since any data is essentially a sequence of bits, it should be easy. 
 
(a) integer cast 
 
casting everything to an int, using its binary representation. 
but this can produce too big a number, in which case try the schemes below.
 
 
(b) component sum 
 
partition the input data into chunks of a fixed bit length, and then add up the chunks.
instead of addition, XOR or multiplication also works, as sketched below.
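
a minimal sketch of the component sum (the function name, the 64-bit key type, and the 8-bit chunk size are my choices):

---------------------------

#include <cstdint>

// fold a 64-bit key into an int, one 8-bit chunk at a time
int component_sum(uint64_t key)
{
  int sum = 0;
  for(int i = 0; i < 8; i++)
    {
      sum += key & 255;   // take the lowest byte (XOR or multiplication would work here too)
      key >>= 8;          // move on to the next byte
    }
  return sum;
}

--------------------------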
 
 
(c) polynomial accumulation 
 
lets say the input key is "hello".
if you employ the simple addition scheme over each byte, h() will produce the same result for "hello", "elloh", or any other permutation of those five characters.
 
polynomial accumulation is a solution to such a collision case. 
 
a0*(x^0) + a1*(x^1) + a2*(x^2) + ... + an*(x^n) 
 
// x is the radix/base 
 
a0 = h 
a1 = e 
a2 = l 
a3 = l 
a4 = o 
 
for x, we take some prime (e.g. 13)
 
in ascii 
h = 104 
e = 101 
l = 108 
o = 111 
 
so, the calculation for "hello" will be; 
 
h("hello") = 104*1 + 101*13 + 108*169 + 108*2197 + 111*28561 
 
====> this is known to be effective for strings. e.g. an English dictionary.
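
a sketch of the same calculation in code, using Horner's rule  a0 + x*(a1 + x*(a2 + ...))  (poly_hash is my own name):

---------------------------

#include <iostream>
#include <string>
using namespace std;

long poly_hash(const string& s, long x)
{
  long h = 0;
  for(int i = (int)s.size() - 1; i >= 0; i--)
    h = h * x + s[i];    // Horner's rule; s[0] ends up as the x^0 coefficient, as above
  return h;
}

int main()
{
  cout << poly_hash("hello", 13) << endl;   // 3427216, the same value as the sum above
  cout << poly_hash("elloh", 13) << endl;   // 3233968, a different value, unlike the simple addition scheme
  return 0;
}

--------------------------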
 
 

# (2) compression (v1 mod M) 

 
M should be a prime number, so that the remainders of the division are more dispersed.
proving this rigorously is not easy, but an intuitive explanation is below.
 
imagine mod by M gives the same result for two keys i and j.  // both i and j are positive integers

i % M   = j % M
i - j   = k*M        // where k is some positive integer

====> i and j collide exactly when their difference is a multiple of M.
====> if M shares factors with patterns in the keys (e.g. M even while the keys are all even, or M a power of 2 while the keys share their low bits), such differences occur often, and collisions pile up.
====> a prime M shares no factor with anything smaller than itself, so the remainders stay more dispersed.
 
 
NOTE:  why is a power of 2 bad ?   // wikipedia explains this well
 
i % 2^n = i & (2^n - 1)     // i is a positive integer 
 
e.g. 
 
i % 2 = i & 1   // one least significant bit 
i % 4 = i & 3   // two least significant bits 
i % 8 = i & 7   // three least significant bits 
.. 
 
==> since we should be utilizing all the bits of the input key (for the sake of a uniform distribution of the hash values), a power-of-2 M, which always truncates the upper bits, is not good here.
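
a quick demonstration (my own toy numbers): hash keys that are all multiples of 8 into M = 16 vs M = 13.

---------------------------

#include <iostream>
using namespace std;

int main()
{
  // with M = 16, only buckets 0 and 8 are ever hit, because mod 16 looks at just the low 4 bits.
  // with M = 13, the same keys spread over all the buckets.
  for(int k = 8; k <= 80; k += 8)
    cout << k << " : " << k % 16 << " vs " << k % 13 << endl;
  return 0;
}

--------------------------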
 
 
NOTE: should we avoid a prime that is close to a power of 2 ? 
 
some articles say we should avoid prime numbers close to powers of 2. I have not found an authoritative explanation for it.
in fact, wikipedia and Sedgewick's book seem to even consciously use Mersenne primes, because the mod is easy to compute.
 
(ref) 
http://en.wikipedia.org/wiki/Mersenne_prime 
http://algs4.cs.princeton.edu/34hash/ 
 
the only tricky case is when you are using polynomial accumulation with radix r = 2^n.
if you simply permute the coefficients, the difference between the two results will be k(2^n - 1) for some integer k.
 
X   = a*(r^0) + b*(r^1) + c*(r^2) + d*(r^3) + ... 
Y   = c*(r^0) + a*(r^1) + d*(r^2) + b*(r^3) + ... 
X-Y = (r-1)(chunk)   // proof later 
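
a concrete instance: take r = 256 and M = 255 (= r-1). since 256 ≡ 1 (mod 255), X = a + b*256 ≡ a + b (mod 255) and Y = b + a*256 ≡ b + a (mod 255), so the two-char keys "ab" and "ba" always collide when M = 2^n - 1.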
 
 
### 
###  modular(division) VS multiplication VS universal, CRC, MD5 
### 
 
as a baseline: if keys are known to be uniformly distributed over some interval [0,t), we can simply scale them:

h(k) = floor(k*m / t)    // maps [0,t) onto [0,m-1]
 
 
(1) modular(division) hash function 
 
h(k) = k % M   // M is the size of the bucket array, and should be a prime number 
 
 
(2) multiplication 
 
h(k) = floor( m * frac(k*A))                //  0 <= frac(k*A) < 1
     = floor( m * ((k*A) - floor(k*A)))     // so the output will be  [0,m-1] 
     = floor( m * ((k*A) % 1)) 
 
A is a real number  // Knuth suggests A = (sqrt(5)-1)/2 
0 < A < 1           //                  = approx. 0.6180 
                    // read his "the art of computer programming" book 
 
unlike the division method, M is not that important in the multiplication method. 
suppose the table size is 2^r (so let M = 2^r)  // yes, M being a power of 2 is ok, actually it gives the benefit of speed. 
let w = a word size (= the number of bits processed by CPU at once, thus 32 or 64) 
 
h(k) = ( (k * floor(A * 2^w)) mod 2^w ) >> (w-r)

===> all it's doing is multiplying k by a w-bit fixed-point constant, keeping the low w bits of the product (which a w-bit machine multiplication does for free), and taking the top r bits of those. floor() truncates the fractional part.
===> taking r bits gives you 2^r choices, which is what we need out of h(k)
===> because it only uses a multiplication and a bit shift, this method is faster than the division method.
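
a sketch of the fixed-point version for w = 32 (mult_hash is my own name; 2654435769 = floor(((sqrt(5)-1)/2) * 2^32), i.e. Knuth's A in fixed point):

---------------------------

#include <cstdint>
#include <iostream>
using namespace std;

uint32_t mult_hash(uint32_t k, int r)
{
  // 32-bit multiplication wraps mod 2^32, which is exactly "keep the low w bits",
  // and the shift then keeps the top r bits of those: an index in [0, 2^r - 1]
  return (k * 2654435769u) >> (32 - r);
}

int main()
{
  for(uint32_t k = 1; k <= 5; k++)
    cout << k << " -> " << mult_hash(k, 4) << endl;   // table of size 2^4 = 16
  return 0;
}

--------------------------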
 
 
(3)  universal hashing, CRC, MD5 
 
===> read wikipedia 
http://en.wikipedia.org/wiki/Universal_hashing 
 
 
(ref) 
http://www.cs.vu.nl/~tcs/ds/lecture6.pdf 
https://courses.csail.mit.edu/6.857/2014/files/lecture5.pdf 
https://courses.csail.mit.edu/6.857/2014/files/lecture6.pdf 
page 459-463 of Algorithms(4th edition) by Sedgewick 
http://en.wikipedia.org/wiki/Modulo_operation 
http://homepages.ius.edu/rwisman/C455/html/notes/Chapter11/MulMethod.htm 
http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html 
http://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html 
https://www.cs.auckland.ac.nz/software/AlgAnim/hash_func.html 
http://www.cs.nthu.edu.tw/~wkhon/ds/ds12/lecture/lecture18.pdf 
 
 
#### 
####  performance analysis 
#### 
 
while we leave the formal mathematical analysis of the above hash function logic to the references, we can always measure performance empirically.
 
(1) prepare input keys (both uniformly distributed and skewed ones) and count how often each bucket gets computed.
if the frequency is almost the same for every bucket, then it's a good hash function that gives you fewer collisions.
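
a minimal sketch of such a measurement (the key set and the hash function here are placeholders; swap in your own):

---------------------------

#include <iostream>
using namespace std;

int main()
{
  const int M = 13;
  int freq[M] = {0};

  for(int k = 0; k < 10000; k++)   // stand-in for your real key set; try skewed ones too
    freq[(k * 31) % M]++;          // stand-in for your real hash function

  for(int b = 0; b < M; b++)
    cout << b << ": " << freq[b] << endl;   // roughly 10000/13 in each bucket if well dispersed
  return 0;
}

--------------------------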
 
 
 
 
 
#################################### 
####    collision resolution    #### 
#################################### 
 
When a hash function returns the same index for two diff keys, it's called a collision.
 
There are several methods: the two major ones, plus a hybrid of them;
 
(1) separate chaining 
(2) open addressing (aka closed hashing) 
(3) coalesced hashing(chaining)  ## hybrid of (1)&(2) 
 
 
## 
## (1) separate chaining 
## 
 
for each bucket, we maintain a linked list of the key-value pairs whose keys h() hashes to that index.
 
bucket-LL 
--------------------------------- 
   [0]-[key/value]-[key/value]-... 
   [1]-[joe/1986]-[george/1984] 
   [2]-[]-[] 
   [3]-[] 
   [4]-[]-[]-[] 
   [5]-[]-[] 
   [6]-[] 
    . 
    . 
    . 
 
A badly designed hash function h() keeps producing the same hash value for many diff keys (i.e. many collisions), and then all you have is just a linked list.
in such a case, your computational complexity approaches O(n) for search/delete operations.
 

# analysis of (discrete) probability distribution of k keys in a list when you have m buckets and n input keys 

 
assuming your hash function h() is built to satisfy uniformity over the expected input keys, the probability of a given bucket's list containing k keys can be calculated as below.
 
P(k) = nCk * (1/m)^k * (1 - 1/m)^(n-k) 
 
this is a typical binomial distribution. 
 
k = [0,n]   // a list can hold anywhere from 0 to all n keys
1/m : probability of a given key being hashed to a particular bucket. (this is where the above assumption is critical) 
 
when n is big and (1/m) is small, the Poisson distribution gives a good approximation of the above.
here is the classical poisson frequency function, i.e. probability mass function.
 
p(k) =(approx)=  ( L^k * e^(-L) ) / k! 
 
L = n/m 
 
====> notice n/m is the "average" size of the bucket lists, as we distributed n keys over m lists. 
====> (this means your average search/delete complexity should be O(1 + n/m), and insert is always O(1), which all resolve to O(1) if you choose your m properly )
      (assuming hash() is O(1); we still have the overhead of allocating a new chunk of mem at each insert, so this performs worse than open addressing while LF is below ~0.8)
 
 
y (probability) 


|          OO 
|        OOOOOO 
|      OOOOOOOOOO 
|    OOOOOOOOOOOOOO 
|  OOOOOOOOOOOOOOOOOOO 
|oOOOOOOOOOOOOOOOOOOOOOOOOooooo 
----------------------------------------- x (size of a list = k) 
0          n/m        2(n/m) 
 
 
=====> so just sum up the probability mass over k = [0, 2(n/m)]  (it's a discrete distribution)
                k 
CDF(k) = e^(-L) Σ (L^i)/i!     // see wikipedia 
               i=0 
====> now for any n, you have control over your parameter m (with time-space tradeoff of course). 
====> so L can be part of your hash tuning. to keep L in a certain range, i.e. to keep the average list size in a certain range, you can optionally implement resize(). 
====> resize(2*M) should create a double sized array, and rehash all keys again. 
 
as you can see from the above poisson distribution graph, for any L, you can assume almost none of the bucket lists will have a size above 2*L.
in fact, we can calculate the probability of a list having more than 2*L keys as

P(size > 2*L) = 1 - CDF(2*L)
 
1: 0.0803007785            // if you allocate n buckets for n keys, then you expect almost no separate chaining. (assuming uniformity of input keys and hash()) 
2: 0.0526517429 
3: 0.0335065850 
4: 0.0213608014 
5: 0.0136919514 
6: 0.0088234833 
7: 0.0057125209 
8: 0.0037126601 
9: 0.0024203630 
10: 0.0015815448           // 0.158 % chance of a list having more than 20 keys 
11: 0.0010349571 
12: 0.0006775669 
13: 0.0004431560 
14: 0.0002889555 
15: 0.0001872253 
16: 0.0001199076 
17: 0.0000752103 
18: 0.0000454119 
19: 0.0000254443 
20: 0.0000119736          // again, it depends on time-space tradeoff decision, but expecting too many keys in each bucket can degrade speed while saving space. 
 
 
the above data was calculated using the code below.
 
 
---------------------------------------------- // poisson.cpp 
#include <iostream>                            // wont work for large L: long int factorial overflows. lets use java's bignum instead.
#include <cmath> 
using namespace std; 
 
//  http://www.cplusplus.com/reference/cmath/ 
 
long int factorial(int x)
{
  if(x < 2) 
    return 1; 
  else 
    return x * factorial(x-1);
}
 
int main()
{
 
  long double cp; // cumulative probability 
 
  for(int L = 1; L < 51 ; L++ ) 
    { 
      cp = 0; 
      for(int i = 0; i <= 2*L ; i++ ) 
        { 
          cp = cp + ( pow(L,i) / factorial(i) ); 
        } 
 
      cp = cp * exp(L * -1) * 100; 
 
      cout << L << ": " << (100 - cp) << endl; 
    } 
 
  return 0;
}
----------------------------------------------- 
 
----------------------------------------------- // Poisson.java 
import java.math.*;                             // using BigNum  http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic 
 
public class Poisson { 
 
  public static BigDecimal factorial(int input)
  {
    BigDecimal x = new BigDecimal(Integer.toString(input));
    if(x.longValue() < 2)
      return new BigDecimal("1");
    else
      return x.multiply( factorial(x.intValue() - 1) );
  }
 
  public static void main(String[] args)
  {
    BigDecimal bigL;
    BigDecimal one = new BigDecimal("1.0");
    BigDecimal cp = new BigDecimal("0"); // cumulative probability
    BigDecimal e = new BigDecimal("2.71828");
    cp = one.setScale(10,0); // setScale() dictates how many decimal places, and the rounding method
                             // round() specifies how many digits in total
 
        /*   round const field value 
 
        java.math.BigDecimal 
---------------------------------------------- 
public static final int ROUND_CEILING 2 
public static final int ROUND_DOWN 1 
public static final int ROUND_FLOOR 3 
public static final int ROUND_HALF_DOWN 5 
public static final int ROUND_HALF_EVEN 6 
public static final int ROUND_HALF_UP 4 
public static final int ROUND_UNNECESSARY 7 
public static final int ROUND_UP 0 
--------------------------------------------- 
 
       */ 
 
    for(int L = 1; L < 21 ; L++ )
      {
        cp = cp.multiply(new BigDecimal("0"));  // reset cp to zero
        bigL = new BigDecimal(Integer.toString(L));

        for(int i = 0; i <= 2*L ; i++ )
          {
            cp = cp.add( bigL.pow(i).divide( factorial(i), 300, 0) );
          }

        cp = cp.multiply( one.divide( e.pow(L).setScale(10,0), 300, 0) );
        cp = one.subtract(cp).setScale(10,0);

        System.out.println(L + ": " + cp);
      }
  }
}
---------------------------------------------------------------- 
 
 
(ref) 
http://en.wikipedia.org/wiki/Binomial_distribution 
http://en.wikipedia.org/wiki/Poisson_distribution 
http://en.wikipedia.org/wiki/Probability_distribution 
 
 
## 
## (2) open addressing (aka closed hashing) 
## 
 
N keys and M buckets, where M > N, so that in case of any collision, we just probe through the buckets and put the key-value pair into an open, un-occupied address.
 
how do we decide the "probe sequence" ? 
there are a few well established ways. 
 
(a) linear probing        // if M[h] is already taken, keep trying M[h++] 
(b) quadratic probing     // if M[h] is already taken, keep trying M[h + (k++)^2]  where k starts from 1 
(c) double hashing        // use a different/second hash function 
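
as index computations, a minimal sketch (function names are mine; h is the home index, k the attempt number starting at 0, h2 a second hash value that must be nonzero):

---------------------------

int linear(int h, int k, int M)              { return (h + k) % M; }
int quadratic(int h, int k, int M)           { return (h + k*k) % M; }
int double_hash(int h, int h2, int k, int M) { return (h + k*h2) % M; }

--------------------------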
 
 
here we only consider (a) linear probing:
 
instead of a LL, we implement the table with parallel arrays: keys[] and vals[].
 
when searching for a key, if you reach an empty bucket, then we know it's a miss.
so there must be at least one empty slot, i.e. we must keep LF < 1
 

# load factor (aka LF) = (number of elements in the table / size of the hash table ) 

 
obviously, N may keep increasing, so M needs to be dynamically resized to keep LF < 0.7 
ideally  0.125 < LF < 0.5 
 
the pros and cons compared to the separate chaining method somewhat overlap with the LL vs array discussion.
open addressing pros: no new data space allocation at each insert(), no pointer overhead, thus faster. con: more space inefficiency.
(see the graph in wikipedia):  open addressing performs much better than separate chaining as long as LF < 0.7,
                               but degrades significantly when LF > 0.8.
 
where do we get those numbers? (0.125, 0.5, 0.7, 0.8) 
 
some analysis is in a section below, but ultimately the clustering analysis requires very rigorous maths which we dont cover here.
see the wikipedia graph comparing separate chaining and open addressing performance for diff values of LF.
0.125 is just a lower bound so as not to waste too much space.
as you can see in the graph, efficiency drops badly for open addressing after 0.8.
 

# delete(remove) operation in open addressing (with linear probing) 

 
lets say 
 
hash(foo) = 0 
hash(bar) = 2 
 
now our array would look like this 
 
index     0   1    2   3   4 ... 
      [foo][  ][bar][  ][  ] ... 
 
and, insert more as below 
 
hash(tom) = 0 
hash(ken) = 0 
 
index     0    1    2   3    4 ... 
      [foo][tom][bar][ken][  ] ...   // now we have a cluster foo,tom,bar,ken 
 
now delete(tom), and search(ken) 
 
index     0    1    2   3    4 ... 
      [foo][   ][bar][ken][  ] ... 
 
===> search(ken) would terminate at array[1] being null. 
 
===> thus, when you delete a key from a hash table, we need to re-hash all the rest of the keys in the "cluster". in this example that includes the key "bar", for which the hash function never produced a conflicting index; "bar" was just unlucky enough to be part of the cluster.
 
here is example java code from Sedgewick's book (i modified it a bit)
----------------------------- 
 
public void delete(Key key)
{
   int i = hash(key); 
 
   while(!key.equals(keys[i])) 
   { 
      if(keys[i] == null) return; 
      i = (i + 1) % M; 
   } 
   keys[i] = null; 
   vals[i] = null; 
   N--; 
   i = (i + 1) % M; 
 
   while(keys[i] != null) 
   { 
      Key keyToRedo = keys[i]; 
      Value valToRedo = vals[i]; 
      keys[i] = null; 
      vals[i] = null; 
      N--; 
      put(keyToRedo, valToRedo); 
      i = (i + 1) % M;
   } 
   if(N > 0 && N < M/8) resize(M/2);   // if LF < 0.125, then resize to M/2
}
 
----------------------------- 
 
 
(ref) 
p469 of Sedgewick Algo book shows this very well. 
 
 

# performance analysis: average number of probes for a search miss 

 
lets consider LF = 0.5 
we have N keys in M buckets. //  2N = M 
                             // 1/2 = N/M 
 
complexity-wise, a search miss is the same as an insert (both are about finding an open address), and a search hit costs less than a search miss. so we consider the search miss.

the average number of probes really depends on how "clustered" the keys are.
 
- best case: 
imagine all your keys are at the odd (or even) numbered indices of the array. then half the time, your first probe is already a search miss. and the other half, you probe exactly twice, because your first probe is a collision (an occupied index), but your next probe is a search miss.
so, in terms of the number of probes for each index, you get below. 
 
index: 0 1 2 3 4 5 6 ... 2N-2 2N-1 
       1 2 1 2 1 2 1 ... 1    2     // adding all these up, you get  2N(1+2)/2 = 3N 
 
so the average is dividing the sum by 2N as below 
 
3N/2N = 1.5  // pretty good. definitely O(1) 
 
 
- worst case: 
lets say you have all your N keys in the first N indices, and the second half of your array is all empty.
 
in terms of the number of probes for each index:

index: 0   1 2   3   4   ... 2N-2 2N-1
       N+1 N N-1 N-2 N-3 ... 1    1     // adding all these up, you get  2N + N(N+1)/2 = (N^2 + 5N)/2
 
again, get the average by dividing the sum by 2N 
 
((N^2 + 5N)/2) / 2N  =  (N+5)/4
 
=====> lets generalize. 
 
if the size-M array has N keys in three clusters whose lengths are t, k, h, the total number of probes is as below;
 
  M  plus               // M is because every search miss is one probe 
 (t(t+1)/2)   plus      // t+k+h = N 
 (k(k+1)/2)   plus 
 (h(h+1)/2) 
 
and the average is dividing the sum by M; 
 
M/M  plus 
(t^2 + t)/2M   plus 
(k^2 + k)/2M   plus 
(h^2 + h)/2M 
 
= 1 + (t^2 + k^2 + h^2)/2M + (t+k+h)/2M
= 1 + (sum of the squares of the cluster lengths)/2M + N/2M
 
 
therefore, the hash function h() needs to avoid mapping many keys onto consecutive positions in the probe sequence, i.e. it needs to avoid forming clusters.
 
 
(ref) 
http://en.wikipedia.org/wiki/Hash_table#Open_addressing 
http://en.wikipedia.org/wiki/Quadratic_probing 
p469 Sedgewick Algorithms 
 
 
## 
## (3)  coalesced hashing(chaining) 
## 
 
this is a hybrid of separate chaining and open addressing. 
 
you use open addressing in a size-M array, but instead of linear probing, you keep parallel arrays keys[] vals[] ptr[], where ptr[] navigates you to the next linked slot.
 
this is a good approach combining both methods. one can still argue that while benefiting from the pros of both methods, we inherit the cons also.
but whether it pays off really depends on the particular situation (where the conditions for h(), N, M are determined), for which we should test this method.
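
a minimal toy sketch of the idea (my own layout, insert/search only; in nxt[], -2 marks an empty cell and -1 marks the end of a chain):

---------------------------

#include <iostream>
using namespace std;

const int M = 11;
int keys[M], vals[M], nxt[M];

int home(int k) { return k % M; }    // the hash function: home bucket of key k

void init() { for(int i = 0; i < M; i++) nxt[i] = -2; }

bool insert(int k, int v)
{
  int i = home(k);
  if(nxt[i] == -2) { keys[i] = k; vals[i] = v; nxt[i] = -1; return true; }
  while(nxt[i] >= 0) i = nxt[i];       // walk to the end of the chain passing through home(k)
  int f = M - 1;
  while(f >= 0 && nxt[f] != -2) f--;   // scan for a free cell from the bottom of the table
  if(f < 0) return false;              // table full
  keys[f] = k; vals[f] = v; nxt[f] = -1;
  nxt[i] = f;                          // coalesce: link the new cell into the existing chain
  return true;
}

int search(int k)
{
  int i = home(k);
  if(nxt[i] == -2) return 0;           // 0 as "not found" by design
  while(true)
    {
      if(keys[i] == k) return vals[i];
      if(nxt[i] < 0) return 0;
      i = nxt[i];
    }
}

int main()
{
  init();
  insert(5,50); insert(16,160); insert(27,270);     // 5, 16, 27 all hash to bucket 5
  cout << search(16) << " " << search(27) << endl;  // 160 270
  return 0;
}

--------------------------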
 
(ref) 
http://en.wikipedia.org/wiki/Coalesced_hashing 
 
 
 
########################## 
####    complexity    #### 
########################## 
 
as discussed above, the worst case (of the separate chaining method) can make a hash table just another LL, thus making search/delete O(n).
also in open addressing, an insert may require iterating over the whole bucket array to find the last remaining open address, thus O(n).
but both can be ameliorated significantly via the time-space tradeoff.
 
ops/case | worst  | average 
----------------------------- 
space    |  O(n)  |  O(n) 
search   |  O(n)  |  O(1) 
insert   |  O(n)  |  O(1) 
delete   |  O(n)  |  O(1) 
 
 
 
################################################################### 
####   implementation example in C++   (1) separate chaining   #### 
################################################################### 
 
-------------------------------------------------  // hash.h 
 
#include <iostream> 
using namespace std; 
 
template<typename K, typename V> struct node
{
node() : key(-99999),value(-99999),next(0){} // lets define this to be an empty node 
node(K k, V v): key(k),value(v),next(0){} 
  K key; 
  V value; 
  node* next; 
}; 
 
template<typename K, typename V> class Hash
{
 public: 
 Hash(int n_size):hsize(0),bsize(n_size/10)
    {
      int target = bsize;       // = n_size/10
      bsize = prime[psize-1];   // fall back to the largest prime in the list
      for(int i=0; i < psize ; i++){
        if(prime[i] >= target)  // finding the smallest prime[i] that is equal to or bigger than n_size/10
          {
            bsize = prime[i];
            break;
          }
      }
      bucket = new node<K,V>[bsize];
    }
  void insert(K,V); 
  V search(K); 
  void remove(K); 
  int size(){return this->hsize;} 
 private: 
  node<K,V>* bucket; 
  int hash(K); 
  int hsize;  // number of elem in the hash table 
  int bsize;  // bucket size 
  static const int prime[]; 
  static const int psize; 
}; 
 
// defined outside the class because it does not make sense to carry the same prime number array in each instance of the Hash class
template<typename K, typename V> const int Hash<K,V>::prime[] = {13,31,59,79,167,347,751,997}; // prime array 
template<typename K, typename V> const int Hash<K,V>::psize = sizeof(prime)/sizeof(prime[0]);  // prime num array size 
 
template<typename K, typename V> int Hash<K,V>::hash(K k)
{
  int num = 0;
  for(int i=0; i < (int)sizeof(k) ; i++)
    {
      num = ( (255 & num) + (255 & k) )*(31*(i+1));  // 255 == 0b11111111, mix in the low byte of k
    }
  // nb: only the lowest byte of k gets mixed here. a full component sum would also shift
  // between rounds (k >>= 8), but then every hash value annotated in main() below would
  // change, so we keep the simpler low-byte scheme.
  cout << "hash value for " << k << " is " << num%bsize << endl;
  return num % bsize;
}
 
template<typename K, typename V> void Hash<K,V>::insert(K k, V v)
{
  if(search(k) != 0)  // if already the same key node inserted, just skip. 
    return;           // lets just say this is the design 
 
  int i = hash(k); 
 
  node<K,V>* tmp = &bucket[i]; 
 
  if(tmp->key == -99999 && tmp->value == -99999) 
    { 
      bucket[i].key = k; 
      bucket[i].value = v; 
    } 
  else 
    { 
      node<K,V>* new_node = new node<K,V>(k,v); 
      while(true) 
        { 
          if(tmp->next == 0) 
            { 
              tmp->next = new_node; 
              break; 
            } 
          else 
            tmp = tmp->next; 
        } 
    } 
  hsize++;
}
 
template<typename K, typename V> V Hash<K,V>::search(K k)
{
  int i = hash(k); 
  node<K,V>* tmp = &bucket[i]; 
 
  while(true) 
    if(tmp->key == -99999 && tmp->value == -99999) 
      { 
        if(tmp->next == 0) 
          return 0;   // lets just say this is the design 
        else 
          tmp = tmp->next; 
      } 
    else 
      if(tmp->key == k) // now if K is a custom data type, better overload operator==() 
        return tmp->value; 
      else if(tmp->next == 0) 
        return 0; 
      else 
        tmp = tmp->next;
}
 
template<typename K, typename V> void Hash<K,V>::remove(K k)
{
  int i = hash(k); 
  node<K,V>* tmp = &bucket[i]; 
  node<K,V>* prev = 0; 
 
  while(true) 
    if(tmp->key == -99999 && tmp->value == -99999) 
      { 
        if(tmp->next == 0) 
          return; 
        else 
          tmp = tmp->next; 
      } 
    else 
      if(tmp->key == k)  // now if K is a custom data type, better overload operator==() 
        { 
          if( prev != 0) 
            { 
              prev->next = tmp->next; 
              delete tmp; 
            } 
          else // prev == 0 
            { 
              bucket[i].key = -99999; 
              bucket[i].value = -99999; 
              bucket[i].next = tmp->next; 
            } 
          hsize--; 
          return; 
        } 
      else if(tmp->next == 0) 
        break; 
      else 
        { 
          prev = tmp; 
          tmp = tmp->next; 
        }
}
 
------------------------------------------------ // hash.cpp 
 
#include "hash.h" 
#include <iostream> 
using namespace std; 
 
int main()
{
 
  Hash<int,int> h0(15); 
 
  cout << h0.size() << endl;   // 0 
 
  h0.insert(123,456);          // hash(123) = 12 
  h0.insert(789,999);          // hash(789) = 7 
  h0.insert(789,444);          // this gets ignored 
 
  cout << h0.size() << endl;   // 2 
 
  h0.insert(555,500);          // hash(555) = 3 
  h0.insert(444,400);          // hash(444) = 2 
  h0.insert(333,300);          // hash(333) = 1 
  h0.insert(222,200);          // hash(222) = 0 
  h0.insert(111,100);          // hash(111) = 12 
 
  cout << h0.size() << endl;   // 7 
 
  h0.insert(1,600);            // hash(1) = 2 
  h0.insert(29,700);           // hash(29) = 12 
 
  cout << h0.size() << endl;   // 9 
 
  cout <<  h0.search(111)  << endl;  // 100 
  cout <<  h0.search(123)  << endl;  // 456 
  cout <<  h0.search(222)  << endl;  // 200 
  cout <<  h0.search(135)  << endl;  // 0 
  cout <<  h0.search(789)  << endl;  // 999 
 
  h0.remove(111); 
  h0.remove(123); 
 
  cout <<  h0.search(111)  << endl;  // 0 
  cout <<  h0.search(123)  << endl;  // 0 
  cout <<  h0.search(29)   << endl;  // 700 
 
  h0.remove(29); 
 
  cout <<   h0.search(29)  << endl;  // 0 
 
  cout << h0.size() << endl;    // 6 
 
  return 0;
}
 
--------------------------------------------------- 
 
 
################################################################## 
####   implementation example in C++   (2) open addressing    #### 
################################################################## 
 
I left some parts deliberately vulnerable to certain edge cases for the sake of keeping the code simple, e.g. not blocking insert(-99999,-99999).
the purpose here is to demonstrate open addressing, not to write the most efficient and robust code possible.
 
--------------------------------------------------- // hashoa.h 
 
#include <iostream> 
using namespace std; 
 
template<typename K, typename V> class Hash
{
 public: 
 Hash():bsize(10),hsize(0)      // lets just say we start with the default table size 10 
    { 
      key = new K[bsize]; 
      val = new V[bsize]; 
      for(int i = 0; i < bsize ; i++) 
        { 
          key[i] = -99999; // lets just define this to be empty state 
          val[i] = -99999; // or we can set up another array like flag[bsize] 
        } 
    } 
  int hash(K); 
  void insert(K,V); 
  V search(K); 
  void remove(K); 
  void printArr(); 
  int bucket_size(){return bsize;} 
  int elem_size(){return hsize;} 
 private: 
  K* key; 
  V* val; 
  void resize(int); 
  int bsize;  // size of buckets 
  int hsize;  // num of elems currently in the table 
}; 
 
// hash func, we just reuse from separate chaining implementation 
template<typename K, typename V> int Hash<K,V>::hash(K k)
{
  int num = 0;
  for(int i=0; i < (int)sizeof(k) ; i++)
    {
      num = ( (255 & num) + (255 & k) )*(31*(i+1));  // 255 == 0b11111111, mix in the low byte of k
    }
  // nb: as in the separate chaining version, only the lowest byte of k gets mixed,
  // so the hash values annotated in main() below stay valid.
  cout << "hash value for " << k << " is " << num%bsize << endl;   // for debugging/verification purpose
  return num % bsize;
}
 
template<typename K, typename V> void Hash<K,V>::insert(K k, V v)
{
  if(search(k) != 0) 
    return;            // lets just call this the design 
 
  int h = hash(k); 
 
  while(true) 
    { 
      if(key[h] == -99999 && val[h] == -99999) 
        { 
          key[h] = k; 
          val[h] = v; 
          break; 
        } 
      h = (h+1)%bsize;  // linear probing 
    } 
 
  hsize++; 
 
  if( (hsize*10) / bsize > 7)  // ==  if( hsize/bsize > 0.7 )   ==  if LF > 0.7 
    resize(bsize*2);
}

 
template<typename K, typename V> V Hash<K,V>::search(K k)
{
  int h = hash(k); 
 
  while(true) 
    if(key[h] == k) 
      return val[h]; 
    else if(key[h] == -99999 && val[h] == -99999) 
      return 0;   // lets just call this the design 
    else 
      h = (h+1)%bsize;
}
 
template<typename K, typename V> void Hash<K,V>::remove(K k)
{
  int h = hash(k); 
 
  while(true) 
    if(key[h] == k) 
      { 
        key[h] = -99999; 
        val[h] = -99999; 
        hsize--; 
        while(true) 
          { 
            h = (h+1)%bsize; 
            if(key[h] == -99999 && val[h] == -99999) 
              { 
                if( (hsize*10) / bsize < 2)   // ==  if LF < 0.2 
                  resize(bsize/2); 
                return; 
              } 
            else 
              { 
                K keyToRedo = key[h]; 
                V valToRedo = val[h]; 
                key[h] = -99999; 
                val[h] = -99999; 
                hsize--; 
                insert(keyToRedo,valToRedo); 
              } 
          } 
 
      } 
    else if(key[h] == -99999 && val[h] == -99999) 
      return; 
    else 
      h = (h+1)%bsize;
}
 
// i wonder if there is a more efficient way to implement this
template<typename K, typename V> void Hash<K,V>::resize(int newSize)
{
  K* newKey = new K[newSize]; 
  V* newVal = new V[newSize]; 
  K* tmpKey = new K[bsize]; 
  V* tmpVal = new V[bsize]; 
  int old_bsize = bsize; 
  hsize = 0; 
  bsize = newSize; 
 
  for(int i = 0; i < newSize ; i++) 
    { 
      newKey[i] = -99999; 
      newVal[i] = -99999; 
    } 
 
  for(int i = 0; i < old_bsize ; i++) 
    { 
      tmpKey[i] = key[i]; 
      tmpVal[i] = val[i]; 
    } 
 
  delete [] key; 
  delete [] val; 
  key = newKey; 
  val = newVal; 
 
  for(int i = 0; i < old_bsize ;i++) 
    { 
      if( !(tmpKey[i] == -99999 && tmpVal[i] == -99999) ) 
        insert(tmpKey[i],tmpVal[i]); 
    } 
 
  delete [] tmpKey;
  delete [] tmpVal;
}
 
template<typename K, typename V> void Hash<K,V>::printArr()
{
  for(int i = 0; i < bsize ;i++) 
    cout << i << ": "<< key[i] << " " << val[i] << endl;
}
 
---------------------------------------------------------------- // hashoa.cpp 
 
#include <iostream> 
#include "hashoa.h" 
using namespace std; 
 
int main()
{
  Hash<int,int> h0; 
 
  cout << h0.bucket_size() << endl;   // 10 
  cout << h0.elem_size() << endl;     // 0 
 
  h0.insert(111,100);     // hash(111) = 2 
  h0.insert(222,200);     // hash(222) = 0 
  h0.insert(333,300);     // hash(333) = 8 
  h0.insert(444,400);     // hash(444) = 6 
  h0.insert(555,500);     // hash(555) = 4 
  h0.insert(555,123);     // ignored by design 
 
  cout << h0.bucket_size() << endl;   // 10 
  cout << h0.elem_size() << endl;     // 5    // so LF = 0.5 
  h0.printArr(); 
 
  /*    no collision yet 
 
0: 222 200 
1: -99999 -99999 
2: 111 100 
3: -99999 -99999 
4: 555 500 
5: -99999 -99999 
6: 444 400 
7: -99999 -99999 
8: 333 300 
9: -99999 -99999 
 
   */ 
 
 
  h0.insert(666,600);     // hash(666) = 2 
 
  cout << h0.bucket_size() << endl;   // 10 
  cout << h0.elem_size() << endl;     // 6    // LF = 0.6 
  h0.printArr(); 
 
  /* 
 
0: 222 200 
1: -99999 -99999 
2: 111 100 
3: 666 600               // yes successfully linear probed 
4: 555 500 
5: -99999 -99999 
6: 444 400 
7: -99999 -99999 
8: 333 300 
9: -99999 -99999 
 
   */ 
 
  // lets test remove() 
 
  h0.remove(111); 
  h0.remove(123);  // ignored by design 
 
  cout << h0.bucket_size() << endl;   // 10 
  cout << h0.elem_size() << endl;     // 5 
  h0.printArr(); 
 
  /* 
 
0: 222 200 
1: -99999 -99999 
2: 666 600         // successfully moved to its rightful position 
3: -99999 -99999 
4: 555 500         // this was rehashed also 
5: -99999 -99999 
6: 444 400 
7: -99999 -99999 
8: 333 300 
9: -99999 -99999 
 
   */ 
 
  h0.insert(111,100);     // hash(111) = 2     // LF = 0.6 
  h0.insert(777,700);     // hash(777) = 6     // LF = 0.7 
  h0.insert(888,800);     // hash(888) = 8     // LF = 0.8  resize!! 
 
  cout << h0.bucket_size() << endl;   // 20 
  cout << h0.elem_size() << endl;     // 8    so LF = 0.4 
  h0.printArr(); 
 
  /* 
 
0: 222 200 
1: -99999 -99999 
2: -99999 -99999 
3: -99999 -99999 
4: 555 500 
5: -99999 -99999 
6: -99999 -99999 
7: -99999 -99999 
8: 333 300 
9: 888 800 
10: -99999 -99999 
11: -99999 -99999 
12: 666 600 
13: 111 100 
14: -99999 -99999 
15: -99999 -99999 
16: 444 400 
17: 777 700 
18: -99999 -99999 
19: -99999 -99999 
 
   */ 
 
  // now lets test resize(bsize/2) 
 
  h0.remove(111);  // LF = 7/20 
  h0.remove(222);  // LF = 6/20 
  h0.remove(333);  // LF = 5/20 
  h0.remove(444);  // LF = 4/20 
  h0.remove(555);  // LF = 3/20  resize!! 
 
  cout << h0.bucket_size() << endl;  // 10 
  cout << h0.elem_size() << endl;    // 3 
  h0.printArr(); 
 
  /* 
 
0: -99999 -99999 
1: -99999 -99999 
2: 666 600 
3: -99999 -99999 
4: -99999 -99999 
5: -99999 -99999 
6: 777 700 
7: -99999 -99999 
8: 888 800 
9: -99999 -99999 
 
   */ 
 
  return 0;
}
 
---------------------------------------------------
