# Adventures in Data Structures Vol 13: Hash Tables (more like Ca$h Tables)

# Introduction to Hashes

In today’s post, I wanted to go over the basics of hash tables and how we implement them in C. They’re a super useful data structure that appear a lot, so this is really one the building blocks for advanced applications.

So far, the most well rounded data structure we’ve explored for adding, removing, and checking data is the AVL tree which has $O(log n)$ performance for all those operations. However, is there a data structure that can perform even better? If we think back to our first lecture, arrays have some really nice properties, one of them being random access. What if instead of the positions of values inside the array being based on consecutive integers, it was instead based on the value itself?

That is the basic idea of hash tables.
It is essentially based on an array except the position where a value is stored is dictated on a **hash function**.
The **hash function** takes input (like a key or value) and returns an integer index which we can then use as the location to store the value.

**Hash functions** should satisfy the following properties:

- Minimal collisions (not a lot of inputs should map to the same index)
- Uniformly distributed hash output
- Repeatable (the same input again will produce the same hash output)
- Fast! (constant time to compute)

Creating a hash function is really an art, but common hash functions can be based around ideas like mapping chars to numbers, using aggregate functions, shifting numbers by certain values, casting types to other types, etc.

What are the benefits of using hash functions? Assuming that computing a hash function is $O(1)$, then all map/bag operations theoretically have constant execution time!

Think about a common example, such as the `contains`

function.
Normally in an unordered dynamic array or linked list bag, this would be an $O(n)$ operation.
However, for a hash table, it is $O(1)$ since all we have to do is hash the value we are searching for and then check that position to see if it’s equal!

There are two different methods for implementing hash tables that I wanted to go over: open address hashing and chaining. Each represent a different way to dealing with inevitable collisions.

# Open Address Hashing

Open address hashing handles collisions by *probing* forward in the until we find an open spot and placing the new value there.
The initial hash output is just the initial index to try.
If we reach the end, we loop back to the beginning of the array and continue searching.

The `contains`

operation similarly starts looking at the initial hashing index and proves forward until either the value is found **or** an empty location is found.

The `removal`

operation is a little tricky, because removing a value can mess up the contains operation if we’re not careful.
We either disallow removals or use a *tombstone special value* which means that the contains operation can skip over it and the add operation can paste over it.

For open address hashing, the execution time is:

- Worst case: $O(n)$
- Best case: $O(1)$
- Average case: $O\left(\frac{1}{1-\lambda}\right)$ where $\lambda$ is the load factor = (# of elements/size of table)

If the load factor gets too big, this means we need to make a new array with double capacity.

The implementation of open address hashing in C is provided below:

The main operations we have to implement are `add`

and `contains`

since we’re avoiding the `remove`

operation.

The `idx = -1;`

line breaks the loop.
We could have also just called `return`

.

The `contains`

function is provided below:

# Chaining Hashing

Also called buckets, this strategy basically means maintaining a linked list at each table entry. The load factor $\lambda = \text{average # of elements in each bucket}$

For chaining, the execution time is:

- Worst case: $O(n)$
- Best case: $O(1)$
- Average case: $O(\lambda)$ where $\lambda$ is the load factor

Each operation on the hash table is divided into two steps. First, the element is hashed and the remainder taken after dividing by the table size which returns the table index. Second, the corresponding linked list is examined.

Let’s implement the `add`

function first!

Notice that the `add`

operation switches the head of the linked list and the new link so that the new link is at the head of the linked list!

Let’s do the `contains`

function next.

Finally, let’s do the `remove`

function!

The `remove`

function has a special case when the `prev`

node is `NULL`

.
In that case, the value we want to remove is at the first node in the list and the linked list head is the set to `current->next`

.
If that’s not the case, then `prev->next`

is set to `current->next`

so that it skips `current`

.

# Sorting with Hash Functions

Hash-like sorting can be used in certain contexts!

One useful example of this is the *counting sort* in which we have a bunch of integers from a limited range (say, 0 through 20).
The general idea is that:

- Create buckets for each unique possible value
- Iterate through all the elements we want to sort, incrementing the appropritae buckets as we go
- Our output is a printed enumerated list of all the buckets in order, which will automatically be sorted

The C code for something like this would look like:

In the first for loop, we iterate through all of the buckets in the hash table. In the second loop, we iterate through all the values in the bucket. Therefore, the complexity depends both on the number of buckets and the number of elements to be sorted. The complexity of the sort is $O(n + \text{max})$, which is equivalent to $O(max(n, max))$.