Binary Trees

Binary Trees

A binary tree data structure maintains items in a known ordering and is used to enable faster searching of data.

An item is inserted into a binary tree. Let’s put in the number 12:

When a second item arrives, let’s say the number 7, it is compared with the first item. If it is less than the existing item it is attached to the left. If it is greater than the existing item then it is attached to the right. So 7 is less than 12 and hence it is attached to the left. This is known as a branch, as in the branch of a tree.

Let’s add another item: 15. This is attached to the right of 12 as it is bigger.

Now we receive a 4th item, let’s say the number 9. We first ask is it greater or less than 12? It is less so we go left. Now we encounter the number 7 already here. Is 9 greater or less than 7? It is greater and so we attach it to the right of 7.

Now let’s add 19. This is greater than 12 so we go right. It’s greater than 15 so we got right again and attach it here.

Now we’ll add 5:

And 10:

And some more numbers:

The advantage of a tree is that it is generally (although not always) faster to search than a list. Consider if we presented the numbers as a jumbled list and we wanted to find 25. First we start at the beginning and ask if 25 is equal to 19. It is not and so we move to the next item in the list. Is 25 equal to 6? Nope, and so on. In total we have to move through 12 positions in the list and perform 12 comparisons before the number 25 is found at the end of the list. We might get lucky and are searching for 19 which is right at the start and perform only one comparison, however on average the item we are looking for is going to be in the middle of the list and so we’ll need to jump through 6 numbers and do 6 comparisons.

[ 19,6,5,15,10,9,12,14,7,16,3,25 ]

If we try and find the same items in the tree, to find 25 we ask if it is greater than 12. If so, we step to the right as we know this is the order in which we originally produced the tree. Now we find 25 is greater than 15, right again. Now is it greater than 19? Right once more. Now we are at 25. However, we only needed to move through 4 numbers to find it instead of 12. In fact this is the greatest distance we must ever move through the tree. We move this distance when we want to find the items 3, 6, 10, 16 and 25. You might note all of these items are at the bottom level of the tree. To use the tree analogy, these are the leaves (we might also say that 12 is the root of the tree). If we want to find other numbers in the tree then we need to move through even fewer positions.

The disparity between the speed at which a tree and a list can be searched grows with the length of the list of numbers. If we had a list of a million numbers then we may need to perform a half million comparisons on average to find a number. However, using a binary tree we may only need to pass through as few as 20 positions (depending on how it is constructed). So the binary tree would usually be massively faster to search.

The binary tree we have shown above is reasonably balanced, that is the items are spread around the tree branches fairly evenly. The quality of the balance to the tree depends on the order in which the items arrive when the tree is constructed. If we pass in the same list of numbers, but in sorted order we get this:

Not very tree-like. This tree is completely unbalanced. It happened because every time a number arrived, it was bigger than the previous number so it went to the right and never once went to the left. This is technically still a binary tree, but you might note it is effectively also a list. If we want to search for the number 25, we have to scan down the tree for 12 positions, exactly the same as a list.

So we have to be careful about the expected distribution of the data when selecting a binary tree as our chosen algorithm. If the numbers are arriving fairly randomly, we’ll probably get a decently balanced binary tree, if they are sorted we may not.

There is a trick for searching a list with about the same speed as a binary tree but only if the numbers in the list are in strictly sorted order, this is the binary search algorithm. So if we expected the numbers were already in sorted order (as above), we may have been better using that method.

The maximum number of nodes we need to pass through in a binary tree to search it can be (roughly) calculated from the number of items in the tree if we make the assumption that the tree is fairly balanced. In this case there are 12 items in the tree. The calculation is to take the log2 (base 2 logarithm) of the number of items, in this case log2(12) = 3.58. We then round up to a whole number and get 4, which is as we showed above, the maximum number of items we need to pass through to find a number when the tree was fairly balanced.

The search performance of a binary tree is often described as O(log n) which is a reference to the mathematical relationship between the number of items in the tree (“n”) and the number comparisons that need to be done to to access an item i.e. where we used the log2 function above. The “O” is short-hand for “order of”. This is meaning the description is a rough approximation of how well the tree will perform but not an exact answer. This is a common method of describing the performance of an algorithm and is known as Big O notation.

Note this calculation doesn’t work when the tree is very unbalanced. In the worst case, we got a tree with the same performance as list.

There are improvements to the binary tree algorithm to handle unbalanced trees. These are known as “self-balancing” trees.