
Hash Table - Hash Map

  • aka HashMaps, Maps, Dictionaries etc. Stores key-value pairs.
  • use the key (often a string) to look up the corresponding value.

Strategies to handle collisions

  • Collisions are unavoidable and become more likely the higher the load factor gets (load factor = count / capacity, i.e. how full the table is).
  • Different strategies exist to handle those collisions:

Separate Chaining

Combines linked lists with the hash table: each bucket holds a chain of entries.

  • Can be extended to use dynamic arrays or self-balancing trees instead of a simple linked list.

All elements that hash to the same slot index get inserted into a linked list (ex. via a .Next attribute on each entry).

  • then we can just traverse that linked list to find our key. If we reach the end before finding it, we know it's not yet in the table. (A full C# implementation follows at the end of this page.)

Advantages

  • Simple to implement
  • Less sensitive to the hash function or the load factor (since the linked lists absorb collisions).
  • the hash table never needs to be re-allocated, since we can always add more elements to a chain.

Disadvantages

  • bad cache performance: chain entries are not consecutive in memory (open addressing, by contrast, provides better cache performance)
  • long chains degrade lookups towards O(n) in the worst case
  • wastes memory: extra space for the .Next references, and some buckets may never fill up

Performance
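
Rough complexity figures (standard for hash tables, assuming a reasonably uniform hash function):

  • Average case: O(1) for lookup, insert and delete, as long as the load factor stays bounded.
  • Worst case: O(n), when (almost) all keys collide into the same bucket or probe sequence.
  • Space: O(n) for n stored entries, plus the still-empty buckets.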

Implementation in C - Open Addressing

For the full implementation (with context) check out https://github.com/vincepr/c_compiler/blob/main/src/table.h

  • that version also uses specialized lookups with string interning to get even more lookup speed.

table.h

#ifndef clox_table_h
#define clox_table_h

/*
Hash Map implementation
- Thought: we allow variable names up to eight characters long (or just hash on their first 8 chars)
- Each char is one of 37 unique values ('a'-'z', '0'-'9', '_'). We can squash all that info into a 64-bit int
- (if we mapped that range to one contiguous block we would need 295148 PetaByte), so we take the value modulo the size of the array.
  This folds the larger numeric range onto itself until it fits into a smaller range of array elements
- because of the above we have to handle hash collisions though
- to minimize the chance of collisions, we dynamically track the load factor (entry_number/bucket_number).
  If the load factor gets too big we resize and grow the array
- to avoid the 8-char limit we use a deterministic/uniform/fast hash function
- clox implements the FNV-1a hash function: http://www.isthe.com/chongo/tech/comp/fnv/
*/
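
// For reference, a sketch of the FNV-1a hash as clox implements it
// (the real function lives in object.c, not in this header):
//
//    static uint32_t hashString(const char* key, int length) {
//        uint32_t hash = 2166136261u;    // FNV-1a 32-bit offset basis
//        for (int i = 0; i < length; i++) {
//            hash ^= (uint8_t)key[i];    // xor in the next byte of the key
//            hash *= 16777619;           // multiply by the 32-bit FNV prime
//        }
//        return hash;
//    }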

/*
Techniques for resolving collisions - 2 basic techniques:

SEPARATE CHAINING (not used here):
- each bucket contains multiple entries, for example as a linked list of entries.
  To look up an entry you just walk the linked list until you find the fitting entry
- worst case: all data collides into the same bucket -> the table degenerates into one long linked list
- conceptually simple and the operations are simple to implement. BUT not a great fit for modern CPUs,
  because of the pointer overhead and because the linked lists are scattered around memory

OPEN ADDRESSING (aka CLOSED HASHING):
- all entries live directly in the bucket array
- if two entries collide on the same bucket, we find a different empty bucket to use instead
- the process of finding a free/available bucket is called PROBING
- the order in which buckets get examined is the PROBE SEQUENCE. We use a simple variant:
- LINEAR PROBING: if the bucket is full we just try the next one, then the next ...
- Good: since it tends to cluster data it is cache friendly
*/

/*
For notes on the tombstone strategy used, check the comments on tableDelete()
*/

// A single key-value pair our map holds (ex. key: varname1, value: 10.5)
typedef struct {
    ObjString* key;
    Value value;
} Entry;

// The struct of our HashMap
typedef struct {
    int count;      // number of key-value pairs currently stored
    int capacity;   // number of buckets (so we can easily get the load factor: count/capacity)
    Entry* entries; // the array of buckets we hash into
} Table;

void initTable(Table* table);
void freeTable(Table* table);
bool tableGet(Table* table, ObjString* key, Value* value);
bool tableSet(Table* table, ObjString* key, Value value);
bool tableDelete(Table* table, ObjString* key);
void tableAddAll(Table* from, Table* to);
ObjString* tableFindString(Table* table, const char* chars, int length, uint32_t hash);
void tableRemoveWhite(Table* table);
void markTable(Table* table);

/*CUSTOM:*/
bool tableFindValue(Table* table, const char* chars, int length, uint32_t hash, Value* value);


#endif

table.c

// if our load factor (= entry_number/bucket_number) reaches this threshold we grow the HashMap (reallocate it at twice the size)
#define TABLE_MAX_LOAD 0.75
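
// The allocation macros used below (ALLOCATE, FREE_ARRAY, GROW_CAPACITY) come from
// clox's memory.h - sketched here for context, assuming the book's definitions:
//
//    #define ALLOCATE(type, count) \
//        (type*)reallocate(NULL, 0, sizeof(type) * (count))
//    #define FREE_ARRAY(type, pointer, oldCount) \
//        reallocate(pointer, sizeof(type) * (oldCount), 0)
//    #define GROW_CAPACITY(capacity) \
//        ((capacity) < 8 ? 8 : (capacity) * 2)
//
// Note: GROW_CAPACITY only ever produces powers of two, which is why findEntry
// below can use 'hash & (capacity - 1)' instead of the slower 'hash % capacity'.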

// constructor for the HashMap
void initTable(Table* table) {
    table->count = 0;
    table->capacity = 0;
    table->entries = NULL;
}

// the backing storage basically behaves like a dynamic array (with some extra rules for inserting, deleting and searching values)
void freeTable(Table* table) {
    FREE_ARRAY(Entry, table->entries, table->capacity);
    initTable(table);
}

// HashMap-Functionality - look up the entry for a key in the map
// - the for-loop keeps going bucket by bucket until it finds the key (aka probing)
// - we exit the loop when we find the key (success) or hit an empty NULL bucket (not found)
// - with a completely full HashMap the loop WOULD be infinite. BUT since we always grow at 75% load factor this can't happen
// Tombstone strategy:
// - while probing we keep going when hitting tombstones {key: NULL, value: true}
static Entry* findEntry(Entry* entries, int capacity, ObjString* key) {
    // capacity is always a power of two, so this bitmask folds the huge hash range
    // onto our small array (a fast equivalent of 'hash % capacity'):
    uint32_t index = key->hash & (capacity - 1);
    Entry* tombstone = NULL; // the first tombstone we hit while probing

    for (;;) {
        Entry* entry = &entries[index];
        if (entry->key == NULL) {
            if (IS_NIL(entry->value)) { //<- empty entry
                // prefer returning a passed tombstone, so its bucket gets reused:
                return tombstone != NULL ? tombstone : entry;
            } else { //<- we found a tombstone
                if (tombstone == NULL) tombstone = entry;
            }
        } else if (entry->key == key) {
            return entry; //<- we found the key (pointer comparison works because clox interns all strings)
        }

        index = (index + 1) & (capacity - 1); //<- the bitmask wraps around when we reach the end of the capacity
    }
}

// HashMap-Functionality - returns true if it finds the entry, otherwise false.
// - on success the value output parameter points to the found value
bool tableGet(Table* table, ObjString* key, Value* value) {
    if (table->count == 0) return false;
    Entry* entry = findEntry(table->entries, table->capacity, key);
    if (entry->key == NULL) return false;
    *value = entry->value; // write the found value to the output parameter
    return true;
}

// grows our HashTable:
// - we can't simply copy the old memory over, because which bucket an entry lands in
//   depends on the capacity (collisions may resolve differently in the bigger array)
// - so we build an empty new array, then re-insert the old table entry by entry
// - we don't copy tombstones -> so we have to recalculate the entry count
static void adjustCapacity(Table* table, int capacity) {
    // allocate an array of (empty) buckets:
    Entry* entries = ALLOCATE(Entry, capacity);
    for (int i = 0; i < capacity; i++) {
        entries[i].key = NULL;
        entries[i].value = NIL_VAL;
    }
    table->count = 0;

    // walk through the old array front to back:
    for (int i = 0; i < table->capacity; i++) {
        Entry* entry = &table->entries[i];
        if (entry->key == NULL) continue; // skips empty buckets AND tombstones
        // insert each entry we find into the new array:
        Entry* dest = findEntry(entries, capacity, entry->key);
        dest->key = entry->key;
        dest->value = entry->value;
        table->count++;
    }
    FREE_ARRAY(Entry, table->entries, table->capacity); // the old array can be freed
    // point the table at the new array and update its capacity:
    table->entries = entries;
    table->capacity = capacity;
}

// HashMap-Functionality - add the given key-value pair to our table:
bool tableSet(Table* table, ObjString* key, Value value) {
    // if our load factor would get too big (too many entries in the map) we grow the map first:
    if (table->count + 1 > table->capacity * TABLE_MAX_LOAD) {
        int capacity = GROW_CAPACITY(table->capacity);
        adjustCapacity(table, capacity);
    }
    // figure out which bucket we write to:
    Entry* entry = findEntry(table->entries, table->capacity, key);
    // then write to that bucket:
    bool isNewKey = entry->key == NULL;
    // if the key is already present we just overwrite its value (update) -> we don't increment the count.
    // we also only increment when NOT reusing a tombstone (the nil-check), since tombstones stay counted:
    if (isNewKey && IS_NIL(entry->value)) table->count++;
    entry->key = key;
    entry->value = value;
    return isNewKey;
}

// HashMap-Functionality - delete a key-value pair from the map
// - the problem: we can't just clear the entry directly, because another entry might have
//   collided with it when it was inserted (its probe sequence would now break off early)
// - the solution is tombstones: instead of clearing the entry on deletion, we replace it with a special entry called a tombstone
// - during probing we don't treat tombstones as empty but keep going (we treat them like full buckets)
bool tableDelete(Table* table, ObjString* key) {
    if (table->count == 0) return false;
    Entry* entry = findEntry(table->entries, table->capacity, key);
    if (entry->key == NULL) return false;

    // place the tombstone. We use {key: NULL, value: true} to represent it.
    // - this is arbitrarily chosen. Any unique combination (not otherwise in use) would work.
    entry->key = NULL;
    entry->value = BOOL_VAL(true);
    return true;
}
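
// Worked example of why the tombstone is needed (assuming keys "a", "b", "c" all hash to bucket 2):
//   insert "a" -> bucket 2; insert "b" -> collides, probes on to bucket 3; insert "c" -> bucket 4
//   if delete("b") simply emptied bucket 3, a later lookup of "c" would probe bucket 2 ("a"),
//   then bucket 3 (empty) and stop: "c" looks missing even though it sits in bucket 4.
//   With a tombstone in bucket 3 the probe sequence keeps going and still finds "c".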

// HashMap-Functionality - copies all entries from one HashTable to another - ex. used for inheritance (of class methods)
void tableAddAll(Table* from, Table* to) {
    for (int i = 0; i < from->capacity; i++) {
        Entry* entry = &from->entries[i];
        if (entry->key != NULL) {
            tableSet(to, entry->key, entry->value);
        }
    }
}
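
A minimal usage sketch of this table. Note that copyString, NUMBER_VAL and AS_NUMBER come from the surrounding clox interpreter, so this is a sketch in that context, not standalone code:

void example() {
    Table table;
    initTable(&table);

    ObjString* key = copyString("answer", 6);  // gets interned by the VM
    tableSet(&table, key, NUMBER_VAL(42));     // returns true: key was new

    Value value;
    if (tableGet(&table, key, &value)) {
        printf("answer = %g\n", AS_NUMBER(value)); // answer = 42
    }

    tableDelete(&table, key); // leaves a tombstone behind
    freeTable(&table);
}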

Separate Chaining Hash Table in C#


using System;
using System.Collections;
using System.Collections.Generic;

public static class Example
{
    public static void Run()
    {
        Console.WriteLine("--- (Simple) HashTable Example: ---");
        var table = new ChainingHashTable<int, int>(10) // starting capacity = 10
        {
            {5, 2},
            {23, 42},
        };

        for (int i = 10; i < 20; i++)
            table[i] = i*i*i;

        foreach (var pair in table)
            Console.WriteLine($"{pair}");

        Console.WriteLine("\n--- --- ---");
        // every time the load factor gets too high, the table reallocates at double the size:
        Console.WriteLine($"Count: {table.Count} Capacity: {table.Capacity}");
        Console.WriteLine($"{table.Contains(10)} {table.GetValue(10)}");
        table.Remove(10);
        Console.WriteLine($"{table.Contains(10)} {table.GetValue(10)}");

        Console.WriteLine("\n--- --- ---");
        var table2 = new ChainingHashTable<string, int>(("Bob", 24), ("James", 34), ("Gandalf", 5000));
        table2["Gandalf"] = 55555;
        table2.Remove("Bob");
        foreach (var (key, value) in table2)
            Console.WriteLine($"<{key} {value}>");
        if (table2.Contains("James"))
            Console.WriteLine($"James is {table2["James"]}");
    }
}

/// <summary>
/// Implementation of a HashMap using SEPARATE CHAINING with generic key-value pairs.
/// </summary>
/// <typeparam name="K">key</typeparam>
/// <typeparam name="V">value</typeparam>
public sealed class ChainingHashTable<K, V> : IEnumerable<KeyValuePair<K, V>> where K : notnull
{
    /// <summary>
    /// Represents a single key-value pair in a bucket's linked list
    /// </summary>
    sealed class Entry
    {
        public K Key { get; init; }
        public V Value { get; set; }
        public int Hashcode { get; set; }
        public Entry? Next { get; set; }

        public Entry(K key, V value, int hashcode, Entry? next = null)
        {
            this.Key = key;
            this.Value = value;
            this.Hashcode = hashcode;
            this.Next = next;
        }
    }

    /* Attributes */
    private const int START_CAPACITY = 8;

    private Entry?[] entries;

    public int Count { get; private set; }

    /// <summary>
    /// version counter, so enumerators can throw if the table gets modified mid-enumeration.
    /// </summary>
    private int version = 0;

    // Number of buckets
    public int Capacity => entries.Length;

    public ChainingHashTable(int capacity = START_CAPACITY)
    {
        capacity = (capacity > START_CAPACITY) ? capacity : START_CAPACITY;
        entries = new Entry[capacity];
    }

    public ChainingHashTable(params (K, V)[] pairs)
    {
        int capacity = (pairs.Length * 2 > START_CAPACITY) ? pairs.Length * 2 : START_CAPACITY;
        entries = new Entry[capacity];
        foreach (var pair in pairs)
            Add(pair.Item1, pair.Item2);
    }

    /* functionality METHODS */
    public void Add(K key, V value)
    {
        version++;
        int hashcode = key.GetHashCode();
        // '& int.MaxValue' clears the sign bit, so the modulo result is never negative:
        int targetBucket = (hashcode & int.MaxValue) % entries.Length;
        Entry? targetEntry = entries[targetBucket];

        // empty bucket -> add directly as the first element
        if (targetEntry is null)
        {
            entries[targetBucket] = new Entry(key, value, hashcode, null);
            Count++;
            ReallocIfNeed();
            return;
        }
        // otherwise traverse the linked list:
        while (targetEntry is not null)
        {
            // overwrite existing value
            if (key.Equals(targetEntry.Key))
            {
                targetEntry.Value = value;
                return;
            }

            // add at the end of the linked list
            if (targetEntry.Next is null)
            {
                targetEntry.Next = new Entry(key, value, hashcode, null);
                Count++;
                ReallocIfNeed();
                return;
            }
            targetEntry = targetEntry.Next;
        }
    }

    public bool Contains(K key) => FindEntry(key) is not null;

    public void Remove(K key)
    {
        version++;
        int hashcode = key.GetHashCode();
        int targetBucket = (hashcode & int.MaxValue) % entries.Length;
        Entry? current = entries[targetBucket];
        if (current is null) return;

        // the head of the chain matches -> point the bucket at the rest of the chain
        if (current.Hashcode == hashcode && key.Equals(current.Key))
        {
            entries[targetBucket] = current.Next;
            Count--;
            return;
        }

        // otherwise walk the chain and unlink the matching entry
        Entry previous = current;
        current = current.Next;
        while (current is not null)
        {
            if (current.Hashcode == hashcode && key.Equals(current.Key))
            {
                previous.Next = current.Next;
                Count--;
                return;
            }
            previous = current;
            current = current.Next;
        }
    }

    /// <summary>
    /// returns the value stored at key, or default(V) if not found.
    /// </summary>
    public V? GetValue(K key)
    {
        var entry = FindEntry(key);
        if (entry is null) return default(V);
        return entry.Value;
    }

    /// <summary>
    /// if the load factor reaches a certain threshold we reallocate with 2 times the capacity (this is slow/expensive)
    /// </summary>
    private void ReallocIfNeed()
    {
        const double LOADFACTOR = 0.75;
        const int GROWFACTOR = 2;
        if (Count < Capacity * LOADFACTOR) return;

        var newCapacity = Capacity * GROWFACTOR;
        var oldEntries = this.entries;
        this.entries = new Entry[newCapacity];
        this.Count = 0; // the re-inserts below increment Count again (otherwise it would double)
        foreach (var e in oldEntries)
        {
            // re-insert every entry of every chain into the new bucket array
            var current = e;
            while (current is not null)
            {
                this.Add(current.Key, current.Value);
                current = current.Next;
            }
        }
    }

    private Entry? FindEntry(K key)
    {
        int hashcode = key.GetHashCode();
        int targetBucket = (hashcode & int.MaxValue) % entries.Length;
        // walk the chain of the target bucket:
        Entry? target = entries[targetBucket];
        while (target is not null)
        {
            if (target.Hashcode == hashcode && key.Equals(target.Key)) return target;
            target = target.Next;
        }
        return null;
    }

    // Indexer for our HashMap. Ex: 'var xyu = ourMap["someKey"]'
    public V? this[K key]
    {
        get => GetValue(key);
        set { Add(key, value!); }
    }

    IEnumerator IEnumerable.GetEnumerator() => this.GetEnumerator();

    public IEnumerator<KeyValuePair<K, V>> GetEnumerator()
    {
        var version = this.version;
        foreach (var entry in entries)
        {
            // walk each bucket's chain and yield its pairs:
            var current = entry;
            while (current is not null)
            {
                if (this.version != version) throw new InvalidOperationException("Collection modified.");
                yield return new KeyValuePair<K, V>(current.Key, current.Value);
                current = current.Next;
            }
        }
    }
}