### Hash

###############################
####    hash table(map)    ####
###############################

a key-value map (aka associative array) data structure built around a "hash" function, which enables O(1) average-case search/insert/delete.

for a given input (a key-value pair), a hash function h() takes in the key and computes a value (called the hash value) which is used as an index into an array container (called the hash table) to store the value.
each entry in a hash table is called a bucket. e.g. a hash table of M buckets.

e.g.
---------------------------  // suppose a person's name - his phone number data set

string key = "sugimoto";
string value = "090.1234.5678";

int slot = h(key);
arr[slot] = value;

--------------------------

====> a few important items need more detailed explanation.
(1) calculation logic of a hash function   // to avoid collisions as much as possible
(2) collision resolution
# how to resolve the case where a hash function returns the same value for different inputs.

as above, the hash table is a typical time-space tradeoff example.

(ref)
http://en.wikipedia.org/wiki/Hash_table

###############################
####     hash function     ####
###############################

if you have an array-based hash table of size M, your hash function h() should return an integer in [0,M-1].
so we will need some form of "key mod M" operation.

i.e.  h() just does arithmetic operations to transform keys into array indices.

since we cannot have an infinite-size array, h() will inevitably produce the same index for some different keys.

###
###  core property:
###

(0) the same input should always produce the same hash value // = consistent, deterministic
(1) efficient computation speed
(2) hash values should be uniformly distributed (i.e. every hash value should have equal opportunity)
====> but implementing this requires knowledge of the pattern/nature of the input data.
====> also knowledge of probability theory for a full mathematical analysis.
====> (2) is sometimes called the "simple uniform hashing" assumption.

###
###  core operations:
###

(1) hash code map (translate input key to a positive int v1) // sometimes split into serialization and diffusion
(2) compression (computing a bucket index from v1. as in v1 mod M) // where M is the size of the hash table (for modular hashing method)

====> these core operations are merely conceptual steps, so there are a few ways to implement them.
modular(division) hashing method VS multiplication hashing method VS universal hashing method, etc

below, we will assume the division hashing method.

# (1) hashcode map

an input key can be integer, String, float or any other data type.
but we need to be able to transform it to an array index [0,M-1], a positive integer (including zero).

sometimes this hashCode() operation is split into two:
- serialization : transform the key into a stream of bytes
- diffusion     : transform the stream of bytes into an integer

since any data is essentially a sequence of bits, this should be straightforward.

(a) integer cast

casting everything to an int, using its binary representation.
but this can produce too big a number, in which case try the methods below.

(b) component sum

partition the input data by a fixed length of bits, and then add up each partition.

(c) polynomial accumulation

lets say the input key is "hello".
if you employ a simple addition scheme over the bytes, h() will produce the same result for "hello", "elloh", or any other permutation of those five characters.

polynomial accumulation is a solution to such a collision case.

a0*(x^0) + a1*(x^1) + a2*(x^2) + ... + an*(x^n)

a0 = h
a1 = e
a2 = l
a3 = l
a4 = o

for x, we take on some prime (e.g. 13)

in ascii
h = 104
e = 101
l = 108
o = 111

so, the calculation for "hello" will be;

h("hello") = 104*1 + 101*13 + 108*169 + 108*2197 + 111*28561

====> this is known to work well for strings, e.g. an English dictionary.

# (2) compression (v1 mod M)

M should be a prime number, so that the remainders of the division are more dispersed.
proving this rigorously is not easy, but an intuitive explanation is below.

imagine mod by M gives the same result for inputs i and j.  // both i and j are positive integers

i % M = j % M
i     = j + k*M    // where k is some positive integer
(i-j) = k*M        // i.e. a collision happens exactly when the difference (i-j) is a multiple of M
// so we should avoid any M (such as a small number or a power of 2) for which the differences between real-world keys are frequently multiples of M
// meaning the more factors M shares with typical key differences, the more collisions.

NOTE:  why is a power of 2 bad?   // wikipedia explains this well

i % 2^n = i & (2^n - 1)     // i is a positive integer

e.g.

i % 2 = i & 1   // one least significant bit
i % 4 = i & 3   // two least significant bits
i % 8 = i & 7   // three least significant bits
..

==> since we should be utilizing all the bits of the input key (for the sake of uniform distribution of the hash value), a power of 2, which always truncates the upper bits, is not a good choice here.

NOTE: should we avoid a prime that is close to a power of 2 ?

some articles say we should avoid prime numbers close to powers of 2, but I have not found an authoritative explanation for it.
in fact, wikipedia and Sedgewick's book seem to even consciously use a Mersenne prime, because it's easy to compute with.

(ref)
http://en.wikipedia.org/wiki/Mersenne_prime
http://algs4.cs.princeton.edu/34hash/

the only tricky case is when you are using polynomial accumulation with radix r = 2^n.
if you simply permute the coefficients, the difference between the two polynomials is a multiple of (r - 1) = 2^n - 1

X   = a*(r^0) + b*(r^1) + c*(r^2) + d*(r^3) + ...
Y   = c*(r^0) + a*(r^1) + d*(r^2) + b*(r^3) + ...
X-Y = (r-1)*(some integer)   // since r = 1 (mod r-1), both X and Y reduce to the same coefficient sum mod (r-1)

###
###  modular(division) VS multiplication VS universal, CRC, MD5
###

if keys are uniformly distributed over a known range, say [0,t), you can simply scale:

h(k) = floor(k*m / t)    // m is the table size

(1) modular(division) hash function

h(k) = k % M   // M is the size of the bucket array, and should be a prime number

(2) multiplication

h(k) = floor( m * frac(k*A) )               //  0 <= frac(k*A) < 1
     = floor( m * ((k*A) - floor(k*A)) )    // so the output will be  [0,m-1]
     = floor( m * ((k*A) % 1) )

A is a real number  // Knuth suggests A = (sqrt(5)-1)/2
0 < A < 1           //                  = approx. 0.6180
// see his "The Art of Computer Programming"

unlike the division method, M is not that important in the multiplication method.
suppose the table size is 2^r (so let M = 2^r)  // yes, M being a power of 2 is OK here; it actually gives a speed benefit.
let w = the word size (= the number of bits processed by the CPU at once, thus 32 or 64)

h(k) = ( k * floor(A * 2^w) ) >> (w-r)   // multiply k by the integer floor(A * 2^w), keep the low w bits, then take the top r of them

===> all it's doing is multiplying k by some constant and taking r bits. floor() truncates the fractional part.
===> taking the r bits will give you 2^r choices, which is what we need out of h(k)
===> because we only use multiplication and a bit shift, this method is faster than the division method.

(3)  universal hashing, CRC, MD5

http://en.wikipedia.org/wiki/Universal_hashing

(ref)
http://www.cs.vu.nl/~tcs/ds/lecture6.pdf
https://courses.csail.mit.edu/6.857/2014/files/lecture5.pdf
https://courses.csail.mit.edu/6.857/2014/files/lecture6.pdf
page 459-463 of Algorithms(4th edition) by Sedgewick
http://en.wikipedia.org/wiki/Modulo_operation
http://homepages.ius.edu/rwisman/C455/html/notes/Chapter11/MulMethod.htm
http://www.cs.hmc.edu/~geoff/classes/hmc.cs070.200101/homework10/hashfuncs.html
http://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html
https://www.cs.auckland.ac.nz/software/AlgAnim/hash_func.html
http://www.cs.nthu.edu.tw/~wkhon/ds/ds12/lecture/lecture18.pdf

####
####  performance analysis
####

while we leave out the formal mathematical analysis of the above hash function logic, we can always measure performance empirically.

(1) prepare input keys (both uniformly distributed and skewed ones) and count how often each bucket is hit.
if the frequency is almost the same for every bucket, then it's a good hash function that gives you fewer collisions.

####################################
####    collision resolution    ####
####################################

When a hash function returns the same index for two different keys, it's called a collision.

There are several resolution methods; two major ones, plus a hybrid:

(1) separate chaining
(2) open addressing (aka closed hashing)
(3) coalesced hashing(chaining)  ## hybrid of (1)&(2)

##
## (1) separate chaining
##

for each bucket, we maintain a linked list of the key-value pairs whose keys h() hashes to that index.

bucket-LL
---------------------------------
[0]-[key/value]-[key/value]-...
[1]-[joe/1986]-[george/1984]
[2]-[]-[]
[3]-[]
[4]-[]-[]-[]
[5]-[]-[]
[6]-[]
.
.
.

A badly designed hash function h() keeps producing the same hash value for many different keys (i.e. many collisions), and then all you have is just a linked list.
in such a case, your computational complexity approaches O(n) for search/delete operations.

##
#  analysis of (discrete) probability distribution of k keys in a list when you have m buckets and n input keys
##

assuming your hash function h() is built to satisfy uniformity for the expected input keys, the probability of a given bucket list containing k number of keys can be calculated as below.

P(k) = nCk * (1/m)^k * (1 - 1/m)^(n-k)

this is a typical binomial distribution.

k = [0,n]   // a list can contain anywhere from 0 up to all n keys
1/m : probability of a given key being hashed to a particular bucket. (this is where the above assumption is critical)

when n is big and (1/m) is small, the Poisson distribution gives a good approximation of the above probability calculation.
here is the classical Poisson frequency function, i.e. the probability mass function.

p(k) =(approx)=  ( L^k * e^(-L) ) / k!

L = n/m

====> notice n/m is the "average" size of the bucket lists, as we distributed n keys over m lists.
====> (this means your average search/delete complexity should be O(1 + n/m), and insert is O(1), all of which resolve to O(1) if you choose your m properly )
(assuming hash() is O(1); we still have the overhead of allocating a new chunk of memory at each insert, so this performs worse than open addressing while LF is below about 0.8)

y (probability)

|          OO
|        OOOOOO
|      OOOOOOOOOO
|    OOOOOOOOOOOOOO
|  OOOOOOOOOOOOOOOOOOO
|oOOOOOOOOOOOOOOOOOOOOOOOOooooo
----------------------------------------- x (size of a list = k)
0          n/m        2(n/m)

=====> so just sum the cumulative probability over k = [0, 2(n/m)]

CDF(k) = e^(-L) * Σ_{i=0}^{k} (L^i)/i!     // see wikipedia
====> now for any n, you have control over your parameter m (with time-space tradeoff of course).
====> so L can be part of your hash tuning. to keep L in a certain range, i.e. to keep the average list size in a certain range, you can optionally implement resize().
====> resize(2*M) should create a double sized array, and rehash all keys again.

as you can see from the above Poisson distribution graph, for any L, you can assume almost none of the bucket lists will have a size above 2*L.
in fact, we can calculate the probability of a list having more than 2*L keys as

P(2*L) = ( 1- CDF(2*L) )

1: 0.0803007785            // if you allocate n buckets for n keys, then you expect almost no separate chaining. (assuming uniformity of input keys and hash())
2: 0.0526517429
3: 0.0335065850
4: 0.0213608014
5: 0.0136919514
6: 0.0088234833
7: 0.0057125209
8: 0.0037126601
9: 0.0024203630
10: 0.0015815448           // 0.158 % chance of a list having more than 20 keys
11: 0.0010349571
12: 0.0006775669
13: 0.0004431560
14: 0.0002889555
15: 0.0001872253
16: 0.0001199076
17: 0.0000752103
18: 0.0000454119
19: 0.0000254443
20: 0.0000119736          // again, it depends on time-space tradeoff decision, but expecting too many keys in each bucket can degrade speed while saving space.

the above data was calculated using the code below.

---------------------------------------------- // poisson.cpp
#include <iostream>                            // long int factorial overflows for larger L, so this wont work as-is; we use Java's bignum feature below
#include <cmath>
using namespace std;

//  http://www.cplusplus.com/reference/cmath/

long int factorial(int x)
{
    if(x < 2)
        return 1;
    else
        return x * factorial(x-1);
}

int main()
{
    long double cp; // cumulative probability

    for(int L = 1; L < 51 ; L++ )
    {
        cp = 0;
        for(int i = 0; i <= 2*L ; i++ )
        {
            cp = cp + ( pow(L,i) / factorial(i) );
        }

        cp = cp * exp(L * -1) * 100;

        cout << L << ": " << (100 - cp) << endl;
    }

    return 0;
}

-----------------------------------------------

----------------------------------------------- // Poisson.java
import java.math.*;                             // using BigDecimal  http://en.wikipedia.org/wiki/Arbitrary-precision_arithmetic

public class Poisson {

    public static BigDecimal factorial(int input)
    {
        BigDecimal x = new BigDecimal(Integer.toString(input));
        if(x.longValue() < 2)
            return new BigDecimal("1");
        else
            return x.multiply( factorial(x.intValue() - 1) );
    }

    public static void main(String[] args){

        BigDecimal bigL;
        BigDecimal one = new BigDecimal("1.0");
        BigDecimal cp = new BigDecimal("0"); // cumulative probability
        BigDecimal e = new BigDecimal("2.71828");
        cp = one.setScale(10,0); // setScale() dictates how many decimal places, and the rounding method
                                 // round() specifies how many digits in total

        /*   rounding const field values

        java.math.BigDecimal
        ----------------------------------------------
        public static final int ROUND_CEILING     2
        public static final int ROUND_DOWN        1
        public static final int ROUND_FLOOR       3
        public static final int ROUND_HALF_DOWN   5
        public static final int ROUND_HALF_EVEN   6
        public static final int ROUND_HALF_UP     4
        public static final int ROUND_UNNECESSARY 7
        public static final int ROUND_UP          0
        ---------------------------------------------

        */

        for(int L = 1; L < 21 ; L++ )
        {
            cp = cp.multiply(new BigDecimal("0"));  // reset to zero
            bigL = new BigDecimal(Integer.toString(L));

            for(int i = 0; i <= 2*L ; i++ )
            {
                cp = cp.add(  bigL.pow(i).divide( factorial(i), 300, 0)   );
            }

            cp = cp.multiply( one.divide( e.pow(L).setScale(10,0), 300, 0) );
            cp = one.subtract(cp).setScale(10,0);

            System.out.println(L + ": " + cp);
        }
    }
}

----------------------------------------------------------------

(ref)
http://en.wikipedia.org/wiki/Binomial_distribution
http://en.wikipedia.org/wiki/Poisson_distribution
http://en.wikipedia.org/wiki/Probability_distribution

##
## (2) open addressing (aka closed hashing)
##

N keys and M buckets, where M > N, so that in case of a collision, we just probe through the buckets and put that key-value pair into an open, un-occupied address.

how do we decide the "probe sequence" ?
there are a few well established ways.

(a) linear probing        // if M[h] is already taken, keep trying M[h++]
(b) quadratic probing     // if M[h] is already taken, keep trying M[h + (k++)^2]  where k starts from 1
(c) double hashing        // use a different/second hash function

here we only consider (a) linear probing:

instead of a LL, we implement the table with parallel arrays: keys[] and vals[]

when searching for a key, if you reach an empty bucket, then we know it's a miss.
so there must be at least one empty slot, i.e. we must keep LF < 1

# load factor (aka LF) = (number of elements in the table / size of the hash table )

obviously, N may keep increasing, so M needs to be dynamically resized to keep LF < 0.7
ideally  0.125 < LF < 0.5

the pros and cons compared to the separate chaining method somewhat overlap with the LL vs array discussion.
open addressing pros: no new data space allocation at each insert(), no pointer overhead, thus faster; the con is more space inefficiency.
(see the graph in wikipedia):  open addressing performs much better than separate chaining as long as LF < 0.7,
but degrades significantly when LF > 0.8.

where do we get those numbers? (0.125, 0.5, 0.7, 0.8)

some analysis is in a section below, but ultimately the clustering analysis requires very rigorous maths, which we don't cover here.
see the wikipedia graph comparing the performance of separate chaining and open addressing for different values of LF.
0.125 is just a lower bound so as to not waste too much space.
as you can see in the graph, efficiency drops badly for open addressing after 0.8.

# delete(remove) operation in open addressing (with linear probing)

lets say

hash(foo) = 0
hash(bar) = 2

now our array would look like this

index:   0    1    2    3    4  ...
       [foo][   ][bar][   ][   ] ...

and, insert more as below

hash(tom) = 0
hash(ken) = 0

index:   0    1    2    3    4  ...
       [foo][tom][bar][ken][   ] ...   // now we have a cluster: foo,tom,bar,ken

now delete(tom), and search(ken)

index:   0    1    2    3    4  ...
       [foo][   ][bar][ken][   ] ...

===> search(ken) would terminate at array[1] being null.

===> thus, when you delete a key from a hash table, we need to re-hash all the rest of the keys in the "cluster", which, in this example, includes the key "bar": the hash function didn't produce any conflicting index for "bar", it was just unlucky enough to be part of the cluster.

here is an example java code from Sedgewick book (i modified a bit)
-----------------------------

public void delete(Key key)
{
    int i = hash(key);

    while(!key.equals(keys[i]))
    {
        if(keys[i] == null) return;
        i = (i + 1) % M;
    }
    keys[i] = null;
    vals[i] = null;
    N--;
    i = (i + 1) % M;

    while(keys[i] != null)   // rehash every key in the rest of the cluster
    {
        Key keyToRedo = keys[i];
        Value valToRedo = vals[i];
        keys[i] = null;
        vals[i] = null;
        N--;
        put(keyToRedo, valToRedo);
        i = (i + 1) % M;
    }
    if(N > 0 && N < M/8) resize(M/2);   // if LF < 0.125, then resize to M/2
}

-----------------------------

(ref)
p469 of Sedgewick Algo book shows this very well.

# performance analysis: average number of probes for a search miss

lets consider LF = 0.5
we have N keys in M buckets. //  2N = M
// 1/2 = N/M

complexity-wise, a search miss is the same as an insert (both are about finding an open address), and a search hit costs less than a search miss; so we consider the search miss.

the average number of probes really depends on how "clustered" the keys are.

- best case:
imagine all your keys are at the odd (or even) numbered indices of the array. so, half the time, your first probe is already a search miss. and the other half, you probe only twice, because your first probe is a collision (odd numbered index), but your next probe is a search miss.
so, in terms of the number of probes for each starting index, you get below.

index:  0 1 2 3 4 5 6 ... 2N-2 2N-1
probes: 1 2 1 2 1 2 1 ...  1    2     // adding all these up, you get  2N(1+2)/2 = 3N

so the average is the sum divided by 2N:

3N/2N = 1.5  // pretty good. definitely O(1)

- worst case:
let's say you have all your N keys in the first N indices, and the second half of your array is all empty.

in terms of the number of probes for each starting index:

index:   0   1  2   3   4  ... 2N-2 2N-1
probes: N+1  N N-1 N-2 N-3 ...  1    1     // adding all these up, you get  2N + N(N+1)/2 = (N^2 + 5N)/2

again, get the average by dividing the sum by 2N:

((N^2 + 5N)/2) / 2N = (N+5)/4

=====> lets generalize.

if the size-M array has N keys in three clusters whose lengths are t, k, h, the total number of probes for all search misses is as below;

M  plus                // M, because every search miss takes at least one probe
(t(t+1)/2)   plus      // t+k+h = N
(k(k+1)/2)   plus
(h(h+1)/2)

and the average is the sum divided by M;

M/M  plus
(t^2 + t)/2M   plus
(k^2 + k)/2M   plus
(h^2 + h)/2M

= 1 + (t^2 + k^2 + h^2)/2M + (t+k+h)/2M
= 1 + (sum of the squares of the cluster lengths)/2M + N/2M

therefore, the hash function h() needs to avoid producing hash values that form long consecutive runs in the probe sequence.

(ref)
p469 Sedgewick Algorithms

# (c) double hashing

prepare two hash functions. suppose h1() returns the same index for a certain group of input keys.
then what you need is a second hash function h2() that produces distinct probe steps for those keys.

##
## (3)  coalesced hashing(chaining)
##

this is a hybrid of separate chaining and open addressing.

you use open addressing in a size-M array, but instead of linear probing, you have parallel arrays keys[], vals[], ptr[], where ptr[] navigates you to the next address in the chain.

this is a good approach combining both methods. one can still argue that while benefiting from the pros of both methods, we inherit the cons also.
but whether to use/test this method really depends on the particular situation (where the conditions for h(), N, M are determined).

(ref)
http://en.wikipedia.org/wiki/Coalesced_hashing

##########################
####    complexity    ####
##########################

as discussed above, the worst case (of the separate chaining method) can make the hash table just another LL, thus making search/delete O(n).
also, in open addressing, an insert may require iterating over the whole bucket array to find the last open address, thus O(n).
but both can be ameliorated significantly via the time-space tradeoff.

ops/case | worst  | average
-----------------------------
space    |  O(n)  |  O(n)
search   |  O(n)  |  O(1)
insert   |  O(n)  |  O(1)
delete   |  O(n)  |  O(1)

###################################################################
####   implementation example in C++   (1) separate chaining   ####
###################################################################

-------------------------------------------------  // hash.h

#include <iostream>
using namespace std;

template<typename K, typename V> struct node
{
    node() : key(-99999),value(-99999),next(0){} // lets define this to be an empty node
    node(K k, V v): key(k),value(v),next(0){}
    K key;
    V value;
    node* next;
};

template<typename K, typename V> class Hash
{
public:
Hash(int n_size):hsize(0),bsize(n_size/10)
{
bsize = prime[0];  // set it the smallest prime by default, then increment as below
for(int i=0; i < psize ; i++){
if(prime[i] >= bsize)   // finding the smallest prime[i] that is equal to or bigger than n_size/10
{
bsize = prime[i];
break;
}
}
bucket = new node<K,V>[bsize];
}
void insert(K,V);
V search(K);
void remove(K);
int size(){return this->hsize;}
private:
node<K,V>* bucket;
int hash(K);
int hsize;  // number of elem in the hash table
int bsize;  // bucket size
static const int prime[];
static const int psize;
};

// defined outside the class, since it does not make sense to carry the same prime number array in each instance of the Hash class
template<typename K, typename V> const int Hash<K,V>::prime[] = {13,31,59,79,167,347,751,997}; // prime array
template<typename K, typename V> const int Hash<K,V>::psize = sizeof(prime)/sizeof(prime[0]);  // prime num array size

template<typename K, typename V> int Hash<K,V>::hash(K k)
{
    int num = 0;
    for(int i=0; i < sizeof(k) ; i++)
    {
        num = ( (255 & num) + (255 & k) )*(31*(i+1));  // d255 == b11111111
        k >> 8;  // NOTE: a no-op as written (presumably meant k = k >> 8); the sample hash values below assume the code as-is
    }
    cout << "hash value for " << k << " is " << num%bsize << endl;
    return num % bsize;
}

template<typename K, typename V> void Hash<K,V>::insert(K k, V v)
{
    if(search(k) != 0)  // if the same key is already inserted, just skip.
        return;         // lets just say this is the design

    int i = hash(k);

    node<K,V>* tmp = &bucket[i];

    if(tmp->key == -99999 && tmp->value == -99999)
    {
        bucket[i].key = k;
        bucket[i].value = v;
    }
    else
    {
        node<K,V>* new_node = new node<K,V>(k,v);
        while(true)
        {
            if(tmp->next == 0)
            {
                tmp->next = new_node;
                break;
            }
            else
                tmp = tmp->next;
        }
    }
    hsize++;
}

template<typename K, typename V> V Hash<K,V>::search(K k)
{
    int i = hash(k);
    node<K,V>* tmp = &bucket[i];

    while(true)
    {
        if(tmp->key == -99999 && tmp->value == -99999)
        {
            if(tmp->next == 0)
                return 0;   // lets just say this is the design
            else
                tmp = tmp->next;
        }
        else
        {
            if(tmp->key == k) // if K is a custom data type, better overload operator==()
                return tmp->value;
            else if(tmp->next == 0)
                return 0;
            else
                tmp = tmp->next;
        }
    }
}

template<typename K, typename V> void Hash<K,V>::remove(K k)
{
    int i = hash(k);
    node<K,V>* tmp = &bucket[i];
    node<K,V>* prev = 0;

    while(true)
    {
        if(tmp->key == -99999 && tmp->value == -99999)
        {
            if(tmp->next == 0)
                return;
            else
                tmp = tmp->next;
        }
        else
        {
            if(tmp->key == k)  // if K is a custom data type, better overload operator==()
            {
                if( prev != 0)
                {
                    prev->next = tmp->next;
                    delete tmp;
                }
                else // prev == 0, i.e. the key sits in the bucket array itself
                {
                    bucket[i].key = -99999;
                    bucket[i].value = -99999;
                    bucket[i].next = tmp->next;
                }
                hsize--;
                return;
            }
            else if(tmp->next == 0)
                break;
            else
            {
                prev = tmp;
                tmp = tmp->next;
            }
        }
    }
}

------------------------------------------------ // hash.cpp

#include "hash.h"
#include <iostream>
using namespace std;

int main()
{

Hash<int,int> h0(15);

cout << h0.size() << endl;   // 0

h0.insert(123,456);          // hash(123) = 12
h0.insert(789,999);          // hash(789) = 7
h0.insert(789,444);          // this gets ignored

cout << h0.size() << endl;   // 2

h0.insert(555,500);          // hash(555) = 3
h0.insert(444,400);          // hash(444) = 2
h0.insert(333,300);          // hash(333) = 1
h0.insert(222,200);          // hash(222) = 0
h0.insert(111,100);          // hash(111) = 12

cout << h0.size() << endl;   // 7

h0.insert(1,600);            // hash(1) = 2
h0.insert(29,700);           // hash(29) = 12

cout << h0.size() << endl;   // 9

cout <<  h0.search(111)  << endl;  // 100
cout <<  h0.search(123)  << endl;  // 456
cout <<  h0.search(222)  << endl;  // 200
cout <<  h0.search(135)  << endl;  // 0
cout <<  h0.search(789)  << endl;  // 999

h0.remove(111);
h0.remove(123);

cout <<  h0.search(111)  << endl;  // 0
cout <<  h0.search(123)  << endl;  // 0
cout <<  h0.search(29)   << endl;  // 700

h0.remove(29);

cout <<   h0.search(29)  << endl;  // 0

cout << h0.size() << endl;    // 6

return 0;
}

---------------------------------------------------

##################################################################
####   implementation example in C++   (2) open addressing    ####
##################################################################

I left some parts deliberately vulnerable to certain edge cases for the sake of simplicity, e.g. not blocking insert(-99999,-99999).
the purpose here is to demonstrate open addressing, not to write the most efficient and robust code possible.

--------------------------------------------------- // hashoa.h

#include <iostream>
using namespace std;

template<typename K, typename V> class Hash
{
public:
Hash():bsize(10),hsize(0)      // lets just say we start with the default table size 10
{
key = new K[bsize];
val = new V[bsize];
for(int i = 0; i < bsize ; i++)
{
key[i] = -99999; // lets just define this to be empty state
val[i] = -99999; // or we can set up another array like flag[bsize]
}
}
int hash(K);
void insert(K,V);
V search(K);
void remove(K);
void printArr();
int bucket_size(){return bsize;}
int elem_size(){return hsize;}
private:
K* key;
V* val;
void resize(int);
int bsize;  // size of buckets
int hsize;  // num of elems currently in the table
};

// hash func, we just reuse from separate chaining implementation
template<typename K, typename V> int Hash<K,V>::hash(K k)
{
    int num = 0;
    for(int i=0; i < sizeof(k) ; i++)
    {
        num = ( (255 & num) + (255 & k) )*(31*(i+1));  // d255 == b11111111
        k >> 8;  // NOTE: a no-op as written (presumably meant k = k >> 8); the sample hash values below assume the code as-is
    }
    cout << "hash value for " << k << " is " << num%bsize << endl;   // for debugging/verification purposes
    return num % bsize;
}

template<typename K, typename V> void Hash<K,V>::insert(K k, V v)
{
    if(search(k) != 0)
        return;            // lets just call this the design

    int h = hash(k);

    while(true)
    {
        if(key[h] == -99999 && val[h] == -99999)
        {
            key[h] = k;
            val[h] = v;
            break;
        }
        h = (h+1)%bsize;  // linear probing
    }

    hsize++;

    if( (hsize*10) / bsize > 7)  // ==  if( hsize/bsize > 0.7 )  ==  if LF > 0.7
        resize(bsize*2);
}

template<typename K, typename V> V Hash<K,V>::search(K k)
{
    int h = hash(k);

    while(true)
    {
        if(key[h] == k)
            return val[h];
        else if(key[h] == -99999 && val[h] == -99999)
            return 0;   // lets just call this the design
        else
            h = (h+1)%bsize;
    }
}

template<typename K, typename V> void Hash<K,V>::remove(K k)
{
    int h = hash(k);

    while(true)
    {
        if(key[h] == k)
        {
            key[h] = -99999;
            val[h] = -99999;
            hsize--;
            while(true)   // rehash the rest of the cluster
            {
                h = (h+1)%bsize;
                if(key[h] == -99999 && val[h] == -99999)
                {
                    if( (hsize*10) / bsize < 2)   // ==  if LF < 0.2
                        resize(bsize/2);
                    return;
                }
                else
                {
                    K keyToRedo = key[h];
                    V valToRedo = val[h];
                    key[h] = -99999;
                    val[h] = -99999;
                    hsize--;
                    insert(keyToRedo,valToRedo);
                }
            }
        }
        else if(key[h] == -99999 && val[h] == -99999)
            return;
        else
            h = (h+1)%bsize;
    }
}

// there may be a more efficient way to implement this
template<typename K, typename V> void Hash<K,V>::resize(int newSize)
{
    K* newKey = new K[newSize];
    V* newVal = new V[newSize];
    K* tmpKey = new K[bsize];
    V* tmpVal = new V[bsize];
    int old_bsize = bsize;
    hsize = 0;
    bsize = newSize;

    for(int i = 0; i < newSize ; i++)
    {
        newKey[i] = -99999;
        newVal[i] = -99999;
    }

    for(int i = 0; i < old_bsize ; i++)
    {
        tmpKey[i] = key[i];
        tmpVal[i] = val[i];
    }

    delete [] key;
    delete [] val;
    key = newKey;
    val = newVal;

    for(int i = 0; i < old_bsize ;i++)
    {
        if( !(tmpKey[i] == -99999 && tmpVal[i] == -99999) )
            insert(tmpKey[i],tmpVal[i]);
    }

    delete [] tmpKey;
    delete [] tmpVal;
}

template<typename K, typename V> void Hash<K,V>::printArr()
{
    for(int i = 0; i < bsize ;i++)
        cout << i << ": "<< key[i] << " " << val[i] << endl;
}

---------------------------------------------------------------- // hashoa.cpp

#include <iostream>
#include "hashoa.h"
using namespace std;

int main()
{

Hash<int,int> h0;

cout << h0.bucket_size() << endl;   // 10
cout << h0.elem_size() << endl;     // 0

h0.insert(111,100);     // hash(111) = 2
h0.insert(222,200);     // hash(222) = 0
h0.insert(333,300);     // hash(333) = 8
h0.insert(444,400);     // hash(444) = 6
h0.insert(555,500);     // hash(555) = 4
h0.insert(555,123);     // ignored by design

cout << h0.bucket_size() << endl;   // 10
cout << h0.elem_size() << endl;     // 5    // so LF = 0.5
h0.printArr();

/*    no collision yet

0: 222 200
1: -99999 -99999
2: 111 100
3: -99999 -99999
4: 555 500
5: -99999 -99999
6: 444 400
7: -99999 -99999
8: 333 300
9: -99999 -99999

*/

h0.insert(666,600);     // hash(666) = 2

cout << h0.bucket_size() << endl;   // 10
cout << h0.elem_size() << endl;     // 6    // LF = 0.6
h0.printArr();

/*

0: 222 200
1: -99999 -99999
2: 111 100
3: 666 600               // yes successfully linear probed
4: 555 500
5: -99999 -99999
6: 444 400
7: -99999 -99999
8: 333 300
9: -99999 -99999

*/

// lets test remove()

h0.remove(111);
h0.remove(123);  // ignored by design

cout << h0.bucket_size() << endl;   // 10
cout << h0.elem_size() << endl;     // 5
h0.printArr();

/*

0: 222 200
1: -99999 -99999
2: 666 600         // successfully moved to its rightful position
3: -99999 -99999
4: 555 500         // this was rehashed also
5: -99999 -99999
6: 444 400
7: -99999 -99999
8: 333 300
9: -99999 -99999

*/

h0.insert(111,100);     // hash(111) = 2     // LF = 0.6
h0.insert(777,700);     // hash(777) = 6     // LF = 0.7
h0.insert(888,800);     // hash(888) = 8     // LF = 0.8  resize!!

cout << h0.bucket_size() << endl;   // 20
cout << h0.elem_size() << endl;     // 8    so LF = 0.4
h0.printArr();

/*

0: 222 200
1: -99999 -99999
2: -99999 -99999
3: -99999 -99999
4: 555 500
5: -99999 -99999
6: -99999 -99999
7: -99999 -99999
8: 333 300
9: 888 800
10: -99999 -99999
11: -99999 -99999
12: 666 600
13: 111 100
14: -99999 -99999
15: -99999 -99999
16: 444 400
17: 777 700
18: -99999 -99999
19: -99999 -99999

*/

// now lets test resize(bsize/2)

h0.remove(111);  // LF = 7/20
h0.remove(222);  // LF = 6/20
h0.remove(333);  // LF = 5/20
h0.remove(444);  // LF = 4/20
h0.remove(555);  // LF = 3/20  resize!!

cout << h0.bucket_size() << endl;  // 10
cout << h0.elem_size() << endl;    // 3
h0.printArr();

/*

0: -99999 -99999
1: -99999 -99999
2: 666 600
3: -99999 -99999
4: -99999 -99999
5: -99999 -99999
6: 777 700
7: -99999 -99999
8: 888 800
9: -99999 -99999

*/

return 0;
}

---------------------------------------------------

########################
###    clustering    ###
########################

(1) primary clustering
- this is when a hash function returns indices that are close to each other for certain keys (hence "clustering"), so when a collision occurs, and if you are using linear probing, all your neighboring indices are taken, making your probe inefficient.
- thus quadratic probing is effective in combating primary clustering

(2) secondary clustering
- this is when multiple different keys hash to the same index and therefore follow the exact same probe sequence.
- when this occurs a lot (secondary clustering), one way to combat it is double hashing.

2014-06-26 15:19:52 | Category : data_structure