Assembly Language and Machine Code
Computers do not actually execute the languages programmers usually prefer to write code in, like Python, C++ and Java. These are known as high level languages and are designed to be easier for people to work with. Computers actually run machine code, which is a language composed of much simpler operations that are easier to translate into electronic circuits. It is very difficult to design a CPU to work directly with a high level language and so machine code is used instead.
Each type of CPU uses its own form of machine code and the list of instructions available is called the instruction set. Differences in instruction sets are one of several reasons why software intended for one type of computer cannot run on another.
If a programmer wants to add 5 to a number held in a variable, they may prefer to write a statement like this in a high level language:
B = A + 5
In machine code this might be broken down into a simpler sequence of instructions as follows:
- Retrieve A from memory and store a copy internally within the CPU in a register such that it can be worked on. A register is a temporary working area that holds a single number within the CPU.
- Add 5 to the internally stored copy of the number within the register
- Store the result from the register back to main memory at the location where we would like to hold the B variable
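The three steps above can be modelled in Python as a rough sketch. The `memory` dictionary, the `register` variable, the two addresses and the starting value of 10 for A are all assumptions made for illustration; a real CPU works with hardware storage rather than Python objects.

```python
# Main memory modelled as a dictionary of address -> value (illustrative only).
ADDR_A = 0xFF01  # assumed location of variable A
ADDR_B = 0xFF02  # assumed location of variable B
memory = {ADDR_A: 10, ADDR_B: 0}  # A starts at 10 purely as an example

# Step 1: retrieve A from memory into a register inside the CPU.
register = memory[ADDR_A]

# Step 2: add 5 to the copy held in the register.
register = register + 5

# Step 3: store the register contents back to memory where B lives.
memory[ADDR_B] = register

print(memory[ADDR_B])  # prints 15
```

Note that A itself is unchanged in memory: only the copy in the register was modified.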
To use machine code, the programmer first needs to break down the steps they want to perform into the simpler sequence of instructions. Then they have to determine where in memory they might like to store the variables A and B. Machine code instructions are represented as numbers called opcodes. The programmer then needs to look up which opcode number represents each instruction.
Let’s say there is an instruction to load a number from a memory location into a register and it is represented by the opcode AD in hexadecimal. Assume we have previously placed the A variable at memory location FF01 and want to retrieve it into the internal CPU register. The machine code to do this might be:
ADFF01
The data we intend to work with after the opcode, in this case the FF01 memory address, is called the operand.
Now we want to add 5. The instruction to add a number to the register might have opcode 69, followed by the operand 5. The number 5 is the same in hexadecimal and is written as 05 so that it fills a whole byte. So far we have:
ADFF016905
Now we need to move the addition result from the register back into memory. We might need opcode 8D to do this and we might decide to put the B variable at memory location FF02. So the final machine code program is:
ADFF0169058DFF02
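To see how a CPU might work through these bytes, here is a minimal Python sketch of a loop that reads each opcode and its operand in turn. It is not a model of any real CPU. The `run` function, the dictionary used for memory, the 8-bit register and the choice to read each address high byte first (as it is written above) are all assumptions for illustration; real CPUs may store address bytes in a different order.

```python
def run(program_hex, memory):
    """Execute a program written in the article's example machine code."""
    code = bytes.fromhex(program_hex)
    register = 0
    pc = 0  # position of the next byte to read (a hypothetical program counter)
    while pc < len(code):
        opcode = code[pc]
        if opcode == 0xAD:    # load: operand is a 2-byte memory address
            address = (code[pc + 1] << 8) | code[pc + 2]
            register = memory.get(address, 0)
            pc += 3
        elif opcode == 0x69:  # add: operand is a 1-byte number
            register = (register + code[pc + 1]) & 0xFF  # assumed 8-bit register
            pc += 2
        elif opcode == 0x8D:  # store: operand is a 2-byte memory address
            address = (code[pc + 1] << 8) | code[pc + 2]
            memory[address] = register
            pc += 3
        else:
            raise ValueError(f"unknown opcode {opcode:02X}")
    return memory

# Place the value 10 in variable A at FF01, then run the program.
result = run("ADFF0169058DFF02", {0xFF01: 10})
print(result[0xFF02])  # prints 15
```

The opcode tells the loop how many operand bytes follow, which is how it knows where the next instruction begins in an unbroken run of digits.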
It is possible to program a computer entirely in machine code and never use a high level language. Very early computers intended for electronics hobbyists were routinely programmed in this way. Below is a Science of Cambridge (Sinclair) MK14 from 1977. It is programmed by typing machine code on the calculator style hexadecimal keypad. The output appears on a tiny LED display which is just behind the keypad. The display can show only hexadecimal digits.
Computers that could be programmed in high level languages were available in the 1970s, but they were expensive. The hobbyist on a budget might have made do with something like this.
Coding in machine code is extremely tedious and leads to a very slow pace of software development. There are a number of problems for humans when working in this language:
- The programmer needs to remember all of the opcodes. In modern CPUs there may be hundreds or thousands.
- The programmer needs to think about where they will place each variable in memory and remember where they put everything. A program might have hundreds of variables.
- The programmer needs to break down what they want to do into a simpler sequence of instructions. This is all very well with a simple statement like B = A + 5, but what if the maths we wanted to do was more complex? It can take a lot of thought to figure out how to do this.
- The resulting program is a very long list of hexadecimal digits and is exceptionally difficult to read and debug. Imagine a program many thousands of statements long and trying to understand it as a long list of hexadecimal digits. This is not very easy.
It’s no wonder programmers like high level languages instead.
To solve the disconnect between the programmer preferring high level languages and the computer preferring machine code, the high level language is translated into machine code by software. The software that does this is either a compiler or an interpreter. A compiler translates the whole program into machine code before it is run. An interpreter does the work as it goes along, handling each part of the program while it is running.
There is a notation to make machine code somewhat easier to read, which is called assembly language. It is easier to understand than machine code but not as easy to deal with as a high level language. Assembly language assigns each opcode number a short sequence of letters. This is called a mnemonic. We might say the instruction to load a value from memory into a register is called LDD (opcode AD), the add instruction is called ADD (opcode 69) and the instruction to store the register contents back into memory is STO (opcode 8D). Each CPU manufacturer normally assigns a set of common mnemonics for their CPU instructions in their documentation. These particular mnemonics are fictional but not unlike real ones.
So our machine code program in assembly might be:
LDD FF01
ADD 5
STO FF02
Assembly language may also allow other programming conveniences such as the definition of variables. We may be able to define “A” as FF01 and “B” as FF02 and then the programmer just needs to type A or B respectively instead of being required to remember the exact memory locations for where each of the variables are stored.
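The idea of naming memory locations can be sketched in Python as a simple substitution step. The `SYMBOLS` table and the `resolve_symbols` function are hypothetical names invented for this example, and the one-instruction-per-line format is an assumption based on the program shown above.

```python
# Hypothetical symbol table: variable name -> memory address (as hex text).
SYMBOLS = {"A": "FF01", "B": "FF02"}

def resolve_symbols(source, symbols):
    """Replace each named operand with the address it stands for."""
    resolved = []
    for line in source.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic, operand = parts
        # Use the address if the operand is a known name, else keep it as written.
        resolved.append(f"{mnemonic} {symbols.get(operand, operand)}")
    return "\n".join(resolved)

program = """
LDD A
ADD 5
STO B
"""
print(resolve_symbols(program, SYMBOLS))
```

With these definitions the output matches the three-line program shown earlier. One weakness of this sketch is that A and B are also valid hexadecimal digits, so a name could be confused with a number; real assemblers typically use distinct syntax to tell the two apart.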
Assembly is a whole lot easier to read and understand than a sequence of machine code opcodes. Programs that convert assembly language to machine code are known as assemblers. Because assembly is so much easier to understand, programmers who need to work at the machine code level almost always write assembly instead.
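An assembler for the three fictional mnemonics used in this article can be sketched in a few lines of Python. The `OPCODES` table and the `assemble` function are illustrative names, and the operand sizes (two bytes for an address, one byte for a number) are assumptions inferred from the example program; a real assembler handles far more instructions and syntax.

```python
# Fictional mnemonics from the article: name -> (opcode, operand size in bytes).
OPCODES = {
    "LDD": (0xAD, 2),  # load from a 2-byte memory address
    "ADD": (0x69, 1),  # add a 1-byte number
    "STO": (0x8D, 2),  # store to a 2-byte memory address
}

def assemble(source):
    """Translate assembly text into a string of hexadecimal machine code."""
    output = []
    for line in source.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        mnemonic, operand = parts
        opcode, size = OPCODES[mnemonic]
        value = int(operand, 16)  # operands are written in hexadecimal
        width = size * 2          # two hex digits per byte
        output.append(f"{opcode:02X}")
        output.append(f"{value:0{width}X}")
    return "".join(output)

program = """
LDD FF01
ADD 5
STO FF02
"""
print(assemble(program))  # prints ADFF0169058DFF02
```

The assembler is little more than a table lookup, which is why assembly language maps so directly onto machine code.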
In the past, assembly language was written very regularly by programmers because compilers were not particularly adept at converting high level languages into machine code and would not always produce the fastest result. Performing the optimal translation between a high level language and machine code is not an easy task and requires sophisticated software. A skilled human programmer could more often than not beat the compiler and write assembly code that worked more efficiently. Additionally, compilers often used to cost a lot of money, and budget constrained hobby programmers may not have had access to a compiler at all.
These days, however, compilers can be found for free and the technology is very high quality and well optimised. The free compilers are amongst the best available. It is now a rare case that a programmer can beat the performance of machine code written by a compiler. It is therefore less common that programmers write assembly language in practice, although it is still sometimes done for short sections of code when the absolute highest performance is required.
Assembly language can be mixed with high level languages when needed to improve performance, as opposed to writing an entire program in assembly. One common area where assembly is still used is when utilising obscure specialist CPU instructions that fit niche cases. Sometimes compilers may avoid using specialist instructions and only write code using more common instructions. The programmer may know that the specialist instructions will be considerably faster. A skilled programmer may ask the compiler to display the assembly language equivalent of the machine code it has generated. On reading it they may discover that the specialist high performance instructions have not been used and write a short section of assembly themselves to correct this.
Assembly language tools are still produced and maintained for all popular CPUs. As well as for occasional use by programmers looking to write optimal code, a much more common use is by compilers themselves. Sometimes compilers do not write machine code directly but write assembly language instead. They then benefit from the work that has already gone into the translation of assembly into machine code.
Credit: MK14 photo by Andrew Steers