Understanding Hash Codes and Hash Functions

Understanding Hash Codes and Hash Functions

Hash Codes As Checksums

Imagine that a company has a list of payments that need to be made to suppliers. The payment list is a spreadsheet, the first column of which contains account numbers to be paid and the second column contains the amount of each payment. The spreadsheet of payments needs to be sent to a bank. It is very important the bank receives the correct list of payments. If the bank pays the wrong amount the company might go bankrupt or the suppliers may decide to no longer supply the company.

Account Payment Amount
456178 £5,900
789123 £26,459
456234 £17,324
345623 £3,026

There are several potential problems that could cause the bank to receive the wrong list of payments. The bank may have downloaded only half of the spreadsheet due to a communications failure and not noticed. Perhaps an old version of the spreadsheet was sent by mistake. Perhaps even some thief has added an extra line to the spreadsheet in order to redirect payment to themselves. How could we confirm that the correct list of payments has been received?

One method is for the company to phone the bank and read out every payment line by line and have someone at the bank confirm that it reads the same on their copy. This is very tedious as there may be thousands of lines on the spreadsheet and is also prone to mistakes. Another method, that would be much quicker, is we could total the payments column and have the bank do the same on their version of the spreadsheet. If the sum total of all payments are the same on both copies of the spreadsheet then it is likely that it has been correctly received. When a column total is used to verify the integrity of the information in this way, it is called a checksum.

Payment Amount
£5,900
£26,459
£17,324
£3,026
Checksum: 52709

The payments total doesn’t however confirm that the bank account numbers have been received correctly. What we could do is we could also total the bank account numbers. The sum of account numbers isn’t a very meaningful number but that doesn’t matter. We can compare it on the two spreadsheets just the same as we can compare the total of the amounts payable. It must be identical on both copies if the spreadsheet was correctly received by the bank.

Account
456178
789123
456234
345623
Checksum: 2047158

Whilst column totals are a good start, they don’t catch all problems that may occur. What if two of the figures in the payments column have become swapped around? Columns of figures still total to the same when the figures are in any order. 1 + 2 + 3 is the same as 3 + 2 + 1. Different columns of figures can also total to the same amount. 3 + 3 is the same as 4 + 2. Even with column totals, the wrong supplier could still be receiving the wrong payment.

Hash functions are an improved kind of checksum that can deal with these problems. Hash functions work in exactly the same way as summing a column of figures. You provide the column of figures to a hash function and it returns a number. If the figures are the same then the hash function will produce the exact same number for each column. If the two columns are even slightly different then the hash function will produce different numbers. The number returned by a hash function is known as a hash code or a digest.

If the columns contain exactly the same numbers but in a different order the hash function will be able to detect this and will produce different hash codes.

Hash functions are not only restricted to dealing with columns of numbers because any other kind of data can of course be converted to numbers. Text can be represented as numbers and so can images, video, audio or anything else and so hash codes can be computed from these media too.

Hash functions are not flawless but they are much more robust than summing a column of values. Even so, hash functions can still occasionally produce an identical hash code for different sets of numbers. When this happens it is called a collision.

There are many types of hash functions with names such as “md5” and “sha256”. The main difference between hash functions is the degree to which they can detect errors, that is the probability with which they will cause a collision or produce the same hash code for different columns of numbers.

Hash functions always produce a hash code of a fixed length. The length of a hash code has a correspondence with the likelihood of a collision. Generally longer hash codes have less chance of a collision although the quality of the mathematics used in the hash function also has an impact.

Hash codes look like the examples below. They are usually expressed as hexadecimal:

MD5: 5d41402abc4b2a76b9719d911017c592
SHA256: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

There is sometimes a misconception that longer hash codes must always be used and that shorter and less complex hash codes are never appropriate. This misconception comes from the realm of computer security. Generally when an application of a hash code involves security and the consequences of a collision may involve a high risk then longer and more complex hash codes should be used. For example in the banking application it is very important that the correct list of payments are always made and there is a high risk that the spreadsheet file may be interfered with, therefore a longer hash code may be appropriate. Imagine though a music streaming application where a hash code is used to confirm that a music file has been downloaded correctly. This is a very low stakes application and the only consequence is that the music file will fail to play correctly if the hash code has a collision.

Longer hash codes take more computational power to produce and also require more memory to store. A music streaming application may be attempting to use the cheapest hardware that is possible. A shorter and less complex hash code may be perfectly adequate and completely appropriate for this application.

Hash Codes for Detecting Changes

As well as detecting if a file was received correctly, we can also use hash codes to determine if a file has changed. This might be useful e.g. in backup software. We can store the hash code of every file we have already backed up. To determine if any of the files have changed we compute a hash code of the current version of each file and compare it with the hash code of the last backup. If there is a difference, then the file content must have changed. We then send a new copy of the file to the backup system, otherwise we need not bother as we already have the current version backed up.

Hash Codes for Obfuscation

Hash functions are one-way functions. This means that a column of numbers may be converted into a hash code but it is very difficult to work backwards from the hash code and convert it back into the original column of numbers. It is important to note that while it is difficult to do this, it is not completely impossible.

The one-way property of hash codes is often used in computer security for obfuscation. Many services on the internet require the user to supply a password. This password needs to be stored in a file on a server such that it can be compared with the password supplied by the user when they attempt to login. The password file might look something like this table with each password listed against a user name:

User Name Password
Alice 123456
Bob FluffyBunnies
Charlie DarthVader

The problem with this is that the file of passwords represents an attractive target for thieves and may be stolen. This is unfortunately not just a theoretical risk, theft of password files is a common happenstance.

A regularly used solution is to lower the attractiveness of the passwords file to thieves by running all the passwords through a hash function and storing only the hashed version of the password. The original version of the password is never retained. When the user attempts to log in, the password they type into the login form is run through the same hash function and then this hash code is compared with the one that is stored in the file. If the hash code is the same as the one in the passwords file then the user is granted access.

User Name Password Hash Code
Alice e10adc3949ba59abbe56e057f20f883e
Bob bd6550ca8918b34fd888f8e175f207fe
Charlie 7de394e70d0823fb6366931df1c16ffd

Note the passwords are obfuscated by the hash code, they are not encrypted. Encryption is a different process but is sometimes confused with hashing.

The one-way property of hash codes means that if someone steals the file then it is more difficult to convert the hash codes back into the original passwords. You may have seen security advice to use a complex password and hash codes are the reason why. One method of attempting to reverse a hash code is simply to try every combination of numbers and letters and see if the hash code produced from that combination is the same as the hash of the password. The longer and more complex the password that is used, the more time consuming it is to try every possible combination.

It is important to note that the correct application of hash codes to password storage requires a great deal of care to securely implement and advice from a security professional should be sought as to the current best practice.

Hash Codes For Database Indexes

Another property of many hash functions is that a very slight change to the input to the function will produce a completely different hash code. Two columns of numbers that are only very slightly different will have very different hash codes. This property makes hash functions very useful for databases.

When we store information in an array it is referenced by a number called an array index. When we want to look up an item in an array we need to know the index number for the item we are interested in. Say we want to store a contact book though, that is a list of names and addresses, how do we know at which index number each name is stored? If we want to look up a particular name we would need to start at the beginning of the array and search through it from start to end until we found at which index the name occurred. If the array is very large this could take a long time.

One thing we could do is assign a number to the first letter of each name in the contact book. So we can give names beginning with A the number 0, and names beginning with B number 1 names beginning with C number 2 and so on. If we had an array that was 26 elements long then each name would go into the array element with the corresponding number. So if we got a name beginning with A then it would go into index 0 in the array. Now we know exactly which index contains which name. We just look at the first letter of the name and determine which number corresponds to it, then we know the name is located at that particular array index.

A problem with this approach is that names are not evenly distributed around the alphabet. You may have few friends whose name begins with X therefore the index in the contact book array representing X would likely never be filled. However, you may have several friends whose name begins with T therefore this particular slot in the array might be oversubscribed.

Hash functions can convert a name into a number. Names that are very similar, and in fact start with the same letter, will produce very different hash codes. A hash function can convert an uneven distribution of names around the alphabet into an even distribution of hash codes. This solves the problem of all the names in the contact book being bunched up and competing for the same array indices.

A hash code is a relatively long number but we only have 26 slots in the array. We need a way to convert the long hash code number into a number between 0 and 25 for each of the 26 array indices that we have available. This can be solved using modular arithmetic.

The modulo operator gives the remainder of a division. For example 10 modulo 7 gives 3. This is because 7 divides into 10 once but with 3 left over. It doesn’t matter what number we modulo by 7, we will never get a number back that is greater than 6 as the remainder of a division cannot be greater than the number we are dividing by. Hence, if we modulo a hash code by 26, it doesn’t matter how big the hash code number is, we will always get a number returned between 0 and 25. This is exactly the behaviour we are looking for.

Modular arithmetic is widely available in most programming languages. In Python the % symbol is used to represent modulo:

10 % 7
3

The hash code of the contact name modulo the maximum size of the array gives us a much better distribution.

Name Hash Code of Name Hash Code mod 26
Tim dc2054afd537ddc98afd9347136494ac 24
Tom d9ffaca46d5990ec39501bcdf22ee7a1 11
Ted 76ba144beb8a14b6cf542225ef885a7c 10

This arrangement is called a hash table. Other names for structures normally implemented using hash tables include a dictionary, associative array or content addressable memory. While other algorithms might be used, hash tables are common.

Even though the distribution around the array is now improved, we will still ultimately reach a situation where two names hash to the same index in the array. To resolve this, a simple solution is to choose the next element in the array if the one the hash code refers to is already occupied. If that element is full then choose the next element and so on. When we look up a value in the hash table we need to check that the value the element that is referred to is the correct one. If it is not the correct item then you move on to the next item and then the next item until you either find the correct item or find an empty slot. If you find an empty slot then you know that the item you are searching for is not in the hash table.

Hash tables are widely available as standard in most programming languages and do not usually need to be manually implemented. Python dictionaries are actually hash tables under the hood. The following example shows a mapping between name and place of birth using a Python dictionary:

# Hash table mapping name to city of birth
my_hash_table={ "Alice": "London", "Bob": "New York", "Charlie": "Manchester" }

# Print which city Alice is from.
print( my_hash_table["Alice"] )

Many database products also offer hash tables as a kind of index.

Example of a Hash Function in Python

The following Python is an example implementation of a hash function called DJB2. This isn’t the best hash function but is shown here because the code is reasonably simple. It accepts a string and then returns a 32-bit hash code representation of the string. The way this works mathematically is beyond the scope of the article but the code scrambles the input string such that the output hash code doesn’t have any obvious correspondence with the input string. This therefore produces a better distribution, which is one of the properties we are looking for in a hash code:

An example is shown for the string “hello”.

def djb2_hash( text_to_hash ):
    # Start with a prime number. The number 5381 is often used.
    hash_value = 5381

    # Run through each character in the input string
    for c in text_to_hash:
    	#Bit shift the hash 5 places to the left, then add itself, then add in the character value
        hash_value = ((hash_value << 5) + hash_value) + ord(c)

    # Return the hash as a fixed length 32-bit number
    return hash_value & 0xFFFFFFFF
    
    
if __name__ == "__main__":
	print( djb2_hash( "hello" ) )

Normally in Python we do not need to write our own hash functions because it has many good ones already built in. Here we produce a hash code for the string “hello” using the sha256 hash function which is available in the built-in “hashlib” module:

import hashlib

sha256_hash = hashlib.sha256()
sha256_hash.update(b'hello')
print( sha256_hash.hexdigest() )