Introduction to Hash Tables

A very simple introduction to hash tables and hash functions.

Jul 09, 2023

Introduction

A hash table, also known as a hash map, is a data structure that allows efficient storage and retrieval of key-value pairs. It is based on the concept of a hash function, which takes an input (usually a key) and computes a unique numerical value called a hash code. This hash code is used as an index to store the corresponding value.

Hash tables offer fast lookup and insertion times, making them ideal for situations where quick access to data is crucial. They provide a way to store and retrieve information in constant time, regardless of the size of the data set.

Here's a simple overview of how a hash table works:

The hash function takes a key as input and computes its hash code.
The hash code is used to determine the index or position in the hash table where the value will be stored.
The value is stored at the computed index.
When retrieving a value, the hash function is applied to the key again, and the resulting hash code is used to locate the value in the hash table.

Some real-world applications of hash tables include:

Dictionaries and Language Processing: Hash tables are commonly used to implement dictionaries and language processing tools. Each word can be stored as a key, and its corresponding definition or translation can be stored as the value. This allows for efficient word lookup and retrieval.
Caching: Hash tables are used in caching systems to store frequently accessed data. For example, web browsers often use hash tables to cache recently visited web pages. The URL serves as the key, and the corresponding webpage content is stored as the value. This helps to speed up subsequent requests for the same web page.
Database Indexing: Hash tables are used in database systems to index and retrieve data efficiently. Instead of searching through the entire database for a specific record, a hash function can be applied to the key attributes of the record, and the resulting hash code can be used to locate the record directly.
Symbol Tables in Compilers: Hash tables are used in compilers to store identifiers (such as variables and functions) and their associated attributes. This allows for efficient lookup and management of symbols during the compilation process.

Of course - the above just includes a few examples. The use of hash tables within the software landscape can be quite broad, and the incredibly fast read and write performance isn’t free! Hash tables require additional memory to store the underlying data. In scenarios with a large number of key-value pairs or with limited memory, the overhead of using hash tables can be quite high. As with anything, there is no free lunch!

Real World Analogy and Example

Let's assume you want to come up with a system for organizing books in a library. The books are relatively random, and there’s no systematic way of organizing them by the topic, or author. Let’s imagine that it’s a library for really obscure books with extremely obscure topics!

You also can’t open the book; you only know the title! How do you come up with an efficient organizational scheme to easily find the books knowing just this piece of information?

Well, one option is to have separate sections for each letter of the alphabet. So, for books that start with letter A, you store them near one region. For books with letter B, you use another region. There are a couple of problems with this though: for one, you have no idea how the books being deposited are going to be distributed! We could have a skewed distribution, where a lot more books start with the letter A than with the letter Z, so we could run out of space in our A section, and we need to find a way to make up for this lack of space by re-distributing some of our books!

Problem number two: we also need a more specific system. Our requirements are that we need to find a book quickly. Given a title, the head office tells us, we need to be able to locate where the book is stored within 10-15 seconds of knowing the title of the book!

Hmm, you say, as you try to think of another plan. How about we simply keep a list of where each book is stored? In other words, whenever we have a new book, we assign it to a random location, and simply write down both the title and location on a notepad which we can later use to locate it.

Wait though, this won’t work either! If we don’t keep our list organized, we won’t be able to find the book quickly! In other words – unless we know how to really organize our list, it’ll take us a very long time to search for the written down title. Remember that organizing our list alphabetically is impossible, since we have no idea what the starting letter distribution will be! Even if we use a heuristic estimate, we could deal with a situation where we’ll need to re-assign some letters to different notepad locations, and finding our books might take much longer than 15 seconds!

You sit down and think about this problem for a little longer. You write down your two main requirements, just to remind yourself about the key objectives and the problem at hand:

You need a system which with a given a book title will allow you to look up the exact or close to exact storage location of where it’s stored. Each storage spot will definitely need to be ‘indexed’ such that we need to be able to refer to it and locate the book quickly.
The written down book title and location reference needs to be easily searchable, such that you should be able to easily find the storage location you’re looking for given only the title of the book.

Huh you say – it seems like this problem would be almost impossible to solve using pen and paper, but thank goodness that you’re a programmer and you have a computer which you can use! Using your computer, you devise an ingenious plan: you write a computer program which takes a title of a book, runs it through a ‘function’ which transforms the title into a random shelf number and slot on that shelf, and you use this location for storing the book!

Also, you make sure that each string or title fed into your function always returns the same set of shelf and slot pairs, regardless of the time when it is run. In other words, it has to be a deterministic program / function! That way, when someone asks you to find the location of your book, you simply type out the book name, and your program will return the same shelf number and slot number that you used to assign a storage space for the book. Using this information, you should be able to easily find the book in question, and problem solved!

Wait though, you say, is our problem solved? For one, how do we come up with a function to generate our shelf and slot combination? Number two, how do we make sure that each book gets assigned to unique shelf and slot?

Our magical function may not be as simple to design as we originally thought!

Well, this is where hashing comes to the rescue! The algorithm used to generate the book location is called a hash algorithm, and it works by taking the data fed into it (the title of the book in this case) and calculates a number from it! Using it’s own internal system, a hash function comes up with a relatively unique number for each unique combination of letters / book titles. It scrambles the information in such a way so that each shelf and slot assignment has a very low chance of being assigned to two books with different titles, so we have a low chance of mapping different books to the same location!

Also, using modular arithmetic, the algorithms ensures that we only generate numbers that are within a valid shelf / slot range! What does this mean? Well, let’s assume that we have 30 different shelves within our library. Although the algorithm might technically generate a number within a relatively large range (let’s up to one billion), we can still adjust our number so that it falls within a limited range (1-30 in our case above)!

How do we do this? Well, we simply take the generated number and we use modular division to get the shelf we want to use.

What is modular division? A simple example with a clock could help you understand the modulo.

Let’s say that we have a 12-hour clock, in which the day is divided into two 12 hour periods. Let’s say that we ask for the time, but instead of getting a number from 1-12, we have 15:00 instead! No problem, you say, since you know that this is an encoding for 3 PM!

This is exactly what modular arithmetic does!

15 / 12 = 1 with a remainder of 3 or

15 mod 12 = 3

Let’s show another example, just to make sure that we’re on the same page:

31 / 3 = 10 with a remainder of 1 or

31 mod 3 = 1

Going back to our original example, let’s say that we have 30 shelves. To generate a valid shelf, all we have to do is use modulo 30 in order to get a valid range for library system!

Let’s say that our algorithm generates 1557. Using the system above, we would get:

1557 / 30 = 51 with a reminder of 27 or

1557 mod 30 = 27

Since our algorithm will generate a number from 0,1,2 … 29, we index each shelf starting from 0 to 29 and we use the above schema for assigning a book to a shelf!

The same principle applies to needing to generate a number between any range: modular arithmetic allows us to limit the range to whatever we want it to be! Although our randomization procedure may use a large and random number in assigning shelves and slots, we simply use the modular procedure for limiting this range, and we now have a hashing function which produces that output which we’re looking for!

Wait though, you say! How do we know that our hashing function won’t ever produce the same shelf and slot number for two different books? Surely, the hashing function can’t possibly ensure that our generated pair of numbers are always unique?

That’s true! Our function doesn’t actually guarantee that we’ll always get a unique value (or values)! Such a guarantee would actually be impossible, so how do we deal with these pesky ‘collisions’?

Well, one way to deal with this is to simply generate a number, and if we find that the book maps to a shelf and slot which is already being used, we simply say: no biggie, we’ll just place the book next to the book already placed within the slot! That way, even if our hashing function doesn’t work perfectly, we’ll still be able to locate our book quickly! We simply scan across the slot to look for our title, and surely, if our collision rate is low, we will always find our book near the location that it’s mapped to!

The above scheme is called linear probing, and it’s not the only collision resolution scheme that can be used by a hashing algorithm, but it’s sufficient to take care of our case. If we wanted, we could also use a second strategy called double hashing, whereby in instances where a collision occurs, we simply generate a new hash value for our book title. We don’t need to be this exact though, so we decide to stick to our simple linear probing approach.

You’re happy with the above scheme and decide to implement it. Of course, our problem doesn’t end there. At some point, you might want to put more books into the library than the library allows. In other words, you might need to build a bigger library. Since the exact spot in the library was calculated using the shelf and slot numbers using the current location and capacity, you know that if you have to resize the library, you will need to find new spots for all the books!

Why? Well, the calculation done to find their spots will change! In other words: our hashing system will be obsolete, since we will need to use a different modulo arithmetic in coming up with a valid shelf / slot combination given a new library! This new library will, after all, most likely have a different limit on the amount of slots or shelves allocated to it’s storage space.

This doesn’t worry you though. You say to yourself: we’ll worry about this when we get to that point. For now, our system allows us to elegantly assign books to relatively unique storage locations, and we can find the given locations easily by doing some simple data entry / lookup!

As a reminder of how the system works, we now outline the main components once more.

In essence, whenever we need to add a book to the library:

We plug in the book title into our program.
The program uses the book title letters and generates a ‘hash’ value which maps the letters in the title to a unique integer. For the sake of simplicity, we’ll let our program assign a hash value based on the letter and the position the letter resides in within our string. Our hash formula will be:
Hash value = Σ (Letter ASCII code) * (Letter Position)
As an example, if our book title is ‘LIBRARY’, our function would output:
In the above instance, the value our book title would map to is 2188.
The program uses modular arithmetic to come up with a unique shelf / slot combo for the title, and outputs this pair for you to use. For example, if there are 30 shelves, with 500 slots available in each shelf, we could just use modular division to generate the combo. For any generated hash number h, we use h mod 30 to generate a shelf and h mod 500 to generate a slot number, so using our example above, and assuming that we have 30 shelves with 500 slots in each shelf, we can come up with:
Shelf Number = 2188 mod 30 = 28
Slot Number = 2188 mod 500 = 188
You use the above location (shelf 28 and slot 188) to store the book ‘LIBRARY’.
If a book already resides within the same shelf / slot, you simply add your new book to the right of the existing ones, and you’re done.

Now, if a person comes into the library, asking where he/she can find the book, you can use a similar procedure to locate the book:

Once again, simply type in the exact title into our program.
The program will use the exact same hashing technique we described earlier to find a unique hash value, which will be used to map the book title to a shelf / slot combination.
You locate the shelf and slot, and find the book.
If there’s more than one book in the particular slot, you simply scan to the right of the existing book until you find the title which you’re looking for.

The above allows you to get the almost exact book location within a few keystrokes, and the system works perfectly. The board is happy and promotes you to master librarian, and it’s all because you’ve mastered hashing!

Hash Functions

You can think of a hash table as being a structure which attempts to index a segment of memory according to a distributed function which maps its information content to a distributed set of buckets / memory regions. Doing so allows the algorithm to efficiently allocate storage locations as well as find the location using the information we’re storing itself.

In order to perform the above operations efficiently though, the function which we use to map the content we want to index and store must be:

Efficient to compute: our algorithm must not take a long time in finding the memory region we are mapping our content to.
Deterministic: equal keys must always produce the same hash value. Not doing so would result in incorrect and inconsistent results!
Programmed such that the generated keys / memory segments are uniformly distributed across our memory region. Doing so ensures that we have efficient indexing and avoid hash collisions!

How do we design a good function?

In general, the most important property when it comes to hash functions is performance. In other words, we want a function which generates a hash which is spread out over the problem space. Doing this depends on the data being fed into the function, so designing a good one which works universally across all data sets is a hard problem to tackle.

Why? Well, data comes with different information content, and may have different information types (numeric versus textual, as an example). As an example, when hashing numbers between 10 and 105 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash. Similarly, when hashing strings, it’s important to consider all of our string characters, except in instances where we know that the characters between strings won’t vary. As an example, if the first 3 characters of a string always start with a country abbreviation, our function would ideally ignore these 3 characters and simply use the ones which tend to vary across our strings.

Although it may seem impossible to design a hash function which works across all data sets, designing a perfect function for one class of data is possible, and it is called “perfect hashing.” Despite the existence of these perfect hash functions though, they are incredibly difficult to discover and find, except in instances where we may have a small set of keys, so although these perfect algorithms exist, their use tends to be impractical, and most designers go with functions which generally tend to try to map the input data across the output space uniformly and evenly, as well as have generally good performance characteristics.

If you’re looking for good advice and information on hash functions, you can find a lot more info using the links below:

https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed/145633#145633

http://www.azillionmonkeys.com/qed/hash.html

https://eternallyconfuzzled.com/hashing-c-introduction-to-hashing

Photon-Lines Substack

Discussion about this post