Database Concepts

One method of storing data is in files on disk organised into folders. For many purposes this works well but there are some limitations with this technique.

When files become very large they take a long time to search. The larger a file becomes, the more time is required to search the file. Sometimes the length of time required can be unacceptable. For example, when performing a search of the web, the user expects a response very quickly. If a search takes even a second, many users will often think a search engine to be slow. Searching through large files can take considerably longer than a second. Searching an amount of data as large as that of the entire web would never be feasible by storing all of the data in regular files as they just take too long to search through.

When two people modify a file at the same time, the file can get out of synchronisation. You may have seen this problem when someone uses a spreadsheet file to maintain lists that many people need to edit at the same time. If two people try to edit the spreadsheet simultaneously, edits made by one person can become overwritten by another person. Sometimes the file may even end up being corrupted (become unreadable) due to multiple people trying to save over each other's versions of the file at the same time.

Often a certain degree of organisation needs to be maintained within a file for it to meet the intended purpose. For example someone may make a file of books for sale that contains a list with the headings: "Title", "Author" and "Price" and invites others to add their books to the file. However, they don't want contributors to be able to type anything anywhere in the file, they want people to fill-out the list only under the column headings provided. They don't want to allow someone to suddenly be able to add a new "Genre" column to the list or to decide to put the price in the author column. They want the file to be organised only in the way they have originally set out.

Databases help solve these problems. Features offered by a database often include:

The ability to search large amounts of data quickly.
Coordination of the modification of the same data by many people simultaneously such that it is done without corrupting the file.
Rigorous enforcement of a specific organisation of the data.

There are many forms of database that work in different ways and operate with different kinds of data, but a popular kind is the relational database. This style of database mainly stores information as tables. The tables in a relational database are just the same as tables you may make in other software such as a word processor or spreadsheet. Tables are grids with column headings and rows with information filled out under those specific headings. A relational database comprises a collection of one or more tables of information. Each row of information in a table is known as a record. Each cell in the table (the point of intersection of a column and a row) is a field.

When we set out the column headings for a database table, this is called the table schema. The schema also specifies what kind of data must be present under each column, such as text, a number or a date. If data is entered that does not match the type specified in the schema, the database will raise an error and disallow it.

A special feature of relational databases in particular, that is not always present in other kinds of database, is that tables of information can be merged together in various ways. This is useful for a couple of reasons. Firstly, we can avoid repeating information and secondly it helps avoid mistakes in the data (improves the integrity). Let's look at an example of how this works. Say we are trying to make a table for storing customer orders.

Date	Customer Name	Part Ordered	Price
1 June	Bob Smith	Servo-Feedback Actuator	24.99
2 June	Bob Smith	Gyro-Balancer	89.99
2 June	Bob Smith	Vortex Amplifier	132.24
3 June	Alice Shopper	Servo-Feedback Actuator	24.99
4 June	Dr. Brown	Flux Capacitor	156.24

The marketing department wants to know who buys the most items so they can send them a special offer. Looking in the table it seems to be "Bob Smith", who appears to have bought three items. However, it turns out there is a mistake. Bob Smith is a common name and they are in fact two separate customers called this. One Bob Smith has only bought one item and another Bob Smith has bought two items.

We can fix this mistake by giving each customer an identification number unique to each person:

Date	Customer ID	Customer Name	Part Ordered	Price
1 June	1	Bob Smith	Servo-Feedback Actuator	24.99
2 June	2	Bob Smith	Gyro-Balancer	89.99
2 June	2	Bob Smith	Vortex Amplifier	132.24
3 June	3	Alice Shopper	Servo-Feedback Actuator	24.99
4 June	4	Dr. Brown	Flux Capacitor	156.24

To count the number of items bought by any individual customer, we now use the customer ID number to identify unique customers instead of the name. Now we get the right count and identify the Bob Smith with ID number 2 as the top customer.

The marketing department also wants to know where each customer lives, so we add a column with their city:

Date	Customer ID	Customer Name	City	Part Ordered	Price
1 June	1	Bob Smith	London	Servo-Feedback Actuator	24.99
2 June	2	Bob Smith	Manchester	Gyro-Balancer	89.99
2 June	2	Bob Smith	Manchester	Vortex Amplifier	132.24
3 June	3	Alice Shopper	Liverpool	Servo-Feedback Actuator	24.99
4 June	4	Dr. Brown	Hill Valley	Flux Capacitor	156.24

We note, however, we've had to list out Bob Smith's city twice, because Bob bought two items. If Bob buys a hundred items, we'll need to list out Bob's city a hundred times. This is somewhat of a waste of space. Another inconvenience is if Bob moves house, we'll have to update the 100 rows with his city. We might make a mistake and miss one leading to an inconsistency where Bob ends up with two different cities. It would be better if we could only list Bob's city once.

We can solve this problem by splitting the data into two separate tables:

Orders Table

Date	Customer ID	Part Ordered	Price
1 June	1	Servo-Feedback Actuator	24.99
2 June	2	Gyro-Balancer	89.99
2 June	2	Vortex Amplifier	132.24
3 June	3	Servo-Feedback Actuator	24.99
4 June	4	Flux Capacitor	156.24

Customer Information Table

Customer ID	Customer Name	City
1	Bob Smith	London
2	Bob Smith	Manchester
3	Alice Shopper	Liverpool
4	Dr. Brown	Hill Valley

The customer information is now only listed once for each customer as opposed to being repeated on every row for every customer order. The information in the orders table is linked to the customer information table by way of the unique customer ID. If we want to find out who a particular order was for, we get the relevant customer ID and look it up in the customer information table.

Splitting out the information like this such that it is only listed once, instead of multiple times is called database normalisation.

The special customer ID that is unique to every customer in the customer information table is called a primary key. However, when the customer ID number is referenced in the orders table, it becomes known as a foreign key. It's the same number in both cases with the same meaning (a unique identifier for a customer), but whether it is described as a primary key or a foreign key depends on the table in which it is being used. As a rule of thumb, a primary key is always unique for every record of the table. There are never two rows in the customer information table that have the same customer ID. A foreign key is not necessarily unique for each record. In the orders table many records will legitimately reference the same customer in the situation where a customer has placed multiple orders.

Primary keys do not have to be a code number, you can use any kind of unique identifier. Names of things could be used as primary keys, but it often turns out in the real world that two things have the same name. The property of a good primary key is that it uniquely identifies a thing, therefore a name isn't always a great choice and it is replaced with a number. In many databases it is possible to use the combination of two columns as a primary key, for example name and postcode combined can be used as a primary key if this combination is thought to be unique. This is a compound key.

As well as primary keys, we may also have secondary keys. These are additional unique pieces of information that could have potentially served as a primary key. Say a customer calls up and asks for the cost of their most recent order. In order to get this we need the customer's primary key (customer ID) to lookup their most recent order in the orders table. However, unfortunately for our system, customers can rarely remember their customer ID number. Instead we use an alternate piece of information that they can remember, perhaps an email address or phone number. Email addresses and phone numbers are relatively unique to each customer, this is better choice of key than a name where two customers may have the same name. We can store this extra information in the customer information table and use that to lookup the customer ID number. Once we know the primary key we can successfully retrieve the order information. The email address and telephone number are secondary keys we can use to locate the primary key. We are not using secondary keys to link the tables together, we are using the primary key of customer ID to do that, but secondary keys are useful unique pieces of information for looking up customers that are equivalent to the primary key.

Using the primary key (customer ID), you can see that it is possible to produce the table as we originally had it with the names and cities of all the customers included as columns. To achieve this we take every customer ID from the orders table and replace it with the relevant information from the customer information table. This process is called performing a join. Relational database software includes features to perform joins as standard.

We can see that there is a further opportunity to normalise the orders table. Each part can be purchased by more than one customer and in fact each customer could buy the same part more than once. This leads to repeated mentions of parts in the table. The price of the part may also change, possibly leading to an update being required in the table. We could remove the part information from the orders table and replace it with a primary key (unique ID number) referencing each individual part.

So we end up with something like this:

Orders Table

Date	Customer ID	Part ID
1 June	1	1
2 June	2	2
2 June	2	3
3 June	3	1
4 June	4	4

Customer Information Table

Customer ID	Customer Name	City
1	Bob Smith	London
2	Bob Smith	Manchester
3	Alice Shopper	Liverpool
4	Dr. Brown	Hill Valley

Parts Table

Part ID	Part Name	Price
1	Servo-Feedback Actuator	24.99
2	Gyro-Balancer	89.99
3	Vortex Amplifier	132.24
4	Flux Capacitor	156.24

Lastly, it might be a good idea to give each distinct order record in the orders table a primary key because there could be two orders on the same date of the same part from the same customer. If we have a primary key, we can differentiate the orders when this situation happens. We'll add an order number column.

Orders Table

Order Number	Date	Customer ID	Part ID
1	1 June	1	1
2	2 June	2	2
3	2 June	2	3
4	3 June	3	1
5	4 June	4	4

Now we have the following benefits:

We've avoided the problem of confusing customers (and part numbers) that have the same name.
We have avoided repeating information and in the process potentially saved space.
We can update the price of products by making only one change to the parts table rather than many changes in the orders table for every time the part is mentioned. Similarly we can update customer cities by making only one change to the customer information table.
We have avoided the potential for inconsistency in the data e.g. by avoiding the possibility of a customer name or city being spelt in multiple different ways when it is repeatedly typed.

Normal Form

The extent to which a table is normalised, that is how far we got in the above process, is often described using the term normal form. There is 1st normal form (1NF), 2nd normal form (2NF), 3rd normal form (3NF) and even higher numbers may be found. The numbers refer to levels describing how well normalised a table is. In 1st normal form, the table is a little normalised, in 2nd normal form is bit more normalised and so on. There are commonly used definitions of what is meant at each level of normalisation.

In 1st normal form (1NF):

We have got as far as primary keys for clear and distinct identification of each record.
We don't have any records that are exact duplicates of each other (two records where every column has the same value).
Fields are atomic. This means we have not tried to mash multiple items of information together in one field. That is, an entry in the city column for a customer contains only one city and we haven't tried to glue together two cities for a single customer in the same field. The price column for the parts contains only one price for each part and so on. It might be argued we didn't make the "name" column in the customer information table into true 1st normal form because we could split it further into first and last name and we've mashed the two together. However, we'll leave that argument to the database pedants. It probably really depends on whether splitting out the names would be useful. You might e.g. need a lookup a customer by last name only, in which case having the last name in a separate column would be useful.
We haven't got any columns storing the same thing. So we haven't e.g. got a "city 1", "city 2", "city 3" column etc. We only have one city column.

In 2nd normal form (2NF):

It meets all the criteria for 1st normal form
Each table only contains information about one kind of thing. So the customer information table only has information about customers, it does not have any information about parts. The parts table doesn't contain any customer names or cities. The orders table doesn't contain customer names or part prices. You may see a more technical way of explaining this concept in database speak: "a relation must never contain partial dependencies", however what this really means is each table is about one kind of thing.

A test of whether you successfully got the tables into 2nd normal form is whether you removed duplicated information from each table. For example, when we started out with the orders table we had customer names and cities repeated for every order. When we see this, it's a good clue it's not yet in 2nd normal form. We needed to move the customers and parts into separate tables and link them to the orders with a common ID.

Third normal form (3NF) is about whether the columns in a table all directly or indirectly relate to the primary key. Let's add another column to the parts table which adds the name of the part manufacturer:

Part ID	Part Name	Price	Manufacturer Name
1	Servo-Feedback Actuator	24.99	Dave's Components
2	Gyro-Balancer	89.99	Alice's Actuators
3	Vortex Amplifier	132.24	Dave's Components
4	Flux Capacitor	156.24	Acme Products

Dave's Components could be a common name for a company. Is there one Dave's components in the table or two separate companies with the same name? Perhaps we should have a key to distinguish companies.

Part ID	Part Name	Price	Manufacturer ID	Manufacturer Name
1	Servo-Feedback Actuator	24.99	1	Dave's Components
2	Gyro-Balancer	89.99	2	Alice's Actuators
3	Vortex Amplifier	132.24	1	Dave's Components
4	Flux Capacitor	156.24	3	Acme Products

The part name and price are definitely associated with the primary key of part ID. The part manufacturer is associated with the primary key of part ID, but really we'd call manufacturer ID its primary key. We're starting to get the sense there should be another table.

Parts Table

Part ID	Part Name	Price	Manufacturer ID
1	Servo-Feedback Actuator	24.99	1
2	Gyro-Balancer	89.99	2
3	Vortex Amplifier	132.24	1
4	Flux Capacitor	156.24	3

Manufacturer Table

Manufacturer ID	Manufacturer Name
1	Dave's Components
2	Alice's Actuators
3	Acme Products

The relationship is really manufacturer name is associated with manufacturer ID and manufacturer ID is associated with the part ID. This is a transitive relationship. This means one thing is associated with another thing that is in turn associated with another thing. In 3rd normal form we seek to eliminate any transitive relationships and make all relationships direct by making more tables.

We'd probably have spotted that one when doing 2nd normal form. However some transitive relationships can be more subtle. Say we want to store the expected shipping time for a customer so we can inform them how many days it will take for them to receive their order. Maybe we decide to put that in the customer information table:

Customer ID	Customer Name	City	Shipping Days
1	Bob Smith	London	1
2	Bob Smith	Manchester	2
3	Alice Shopper	Liverpool	2
4	Dr. Brown	Hill Valley	7

The number of days it takes to ship to a customer depends on how far they are from the warehouse where the products are dispatched from. Shipping days is not really directly associated with the customer ID, it is really associated with the customer city. There is a transitive relationship between the shipping days and the primary key of the table because they are only indirectly associated by first jumping via the customer city. To satisfy 3rd normal form, we should eliminate that transitive relationship with a new table:

City	Shipping Days
London	1
Manchester	2
Liverpool	2
Hill Valley	7

We'll just casually ignore the fact that city isn't a great primary key because there are dozens of cities with the same name but perhaps we could use postal code instead.

In education it can be a good idea to convert any databases you design to 3NF to demonstrate your knowledge, unless told otherwise. In a professional context there can be good reasons why databases are not in 3NF. For example, conversion of data to 3NF can be very slow and the developer wants the database to have higher performance than would be otherwise be possible.

Database Integrity

A mistake that is possible with relational databases is we might delete a customer record from the customer information table, but yet the ID number remains in the orders table. This means that when we do a join and go to lookup the customer information we find nothing. Now we have a problem. Relational database software often provides features that make this situation impossible. We can set the database in such a way that it will prevent deletion of primary keys from one table if mentions of those keys still exist in another table. This is a referential integrity check. With this in place, you cannot delete a customer unless they have no orders. Neither can you delete a part unless no one has ever ordered it. Similarly, in a relational database it is often possible to enforce a rule that no customer ID may be used in the orders table unless it already exists in the customer information table. This means orders cannot be placed unless there is an existing customer to attribute them to. These kinds of rules may be referred to as constraints.

Image Credit: RRZEicons