Backus–Naur Form (BNF)

Backus–Naur Form (BNF)

Backus-Naur Form (BNF) is a language for describing the rules of programming languages. A language that describes another language is sometimes called a meta-language. A BNF description formally defines the exact syntax that will be accepted by a compiler or interpreter as valid.

When designing a new programming language, developers will often start with a description of the syntax of the new language in a form such as BNF. There are many alternative meta-languages (such as YACC grammar), but BNF was an early example of this kind of description. Programming tools can read these meta-language descriptions and assist with constructing the new language.

To present an exampe of BNF, we’ll formally define the format of variable names for the popular Python programming language. Variable names in Python look like this:

Hello_Variable5 = 42
print(Hello_Variable5)

A variable name in Python can be composed of upper case letters, lower case letters or digits. It can also contain underscores.

The pedant may point out that Python accepts variable names in all sorts of non-English characters but we’ll ignore that to avoid over-complicating things:

日本語="Japanese"
print(日本語)

We can describe these rules in Backus-Naur Form. Firstly we list out all of the symbols that can be contained within a variable name:

<allowed_chars> ::= <letter> | <digit> | "_"
<letter> ::= [a-z] | [A-Z]
<digit> ::= [0-9]

The first line says a Python variable is comprised of either a letter or a digit or an underscore. BNF uses the vertical bar | character to mean “or”.

The ::= means that we are defining a rule. The name of the rule is to the left of the symbol and the rule is to the right.

When you put something in quotes, it’s a literal string. By putting an underscore in quotes, we are saying exactly the character underscore and nothing else is allowed. This is called a terminal expression in BNF.

Things in angle brackets, like <letter> mean that this will be replaced by something else. This is called a nonterminal expression in BNF. “Letter” is an arbitrary name we have defined as a placeholder.

We have to define more precisely what we mean by a “letter”. Do we mean upper case or lower case. Is it only certain letters that are allowed? The next line does this for us. We define a new rule called “letter”. This rule will replace the nonterminal expression <letter> from the first line. We could list out every allowed letter using the vertical bar to mean “or”, however this is tedious. We’d have to say "a" | "b" | "c" | ... and so on. Instead we specify ranges of allowed characters using square brackets. So by writing [a-z] we are saying the range of lower case letters a through z is allowed and then we write [A-Z] to mean upper case A through Z are permitted. Then we write a separate rule for digits. We could possibly include the digits in the same rule as letters, but there’s a reason why we separate this out into a separate rule which will become apparent below.

This is not enough to match a full Python variable though. This only matches one single character. At the moment we can only have a variable name that is either a single letter, a single digit or a single underscore. We need to add another rule to match multiple characters:

<valid_python_variable> ::= <allowed_chars>*
<allowed_chars> ::= <letter> | <digit> | "_"
<letter> ::= [a-z] | [A-Z]
<digit> ::= [0-9]

The rule <valid_python_variable> is the <allowed_chars> rule followed by an asterisk. The asterisk means repeat the preceeding nonterminal expression zero or more times. So we can have zero or more allowed characters, that is a string of characters. So is this a valid Python variable? Nope! There’s a bug! Well, actually two bugs. The problems are:

  • We can apparently have a variable which is zero characters in length as the asterisk means zero or more repeats of <allowed_chars>.
  • One thing we didn’t consider is a Python variable cannot actually start with a number. In Python starting a variable name with an underscore is fine, but numbers are not permitted as the first character of the name. Here’s the syntax error we get in Python if we try and make a variable starting with a number:
5Hello_Variable = 42
File "<stdin>", line 1
5Hello_Variable = 42
^
SyntaxError: invalid decimal literal

We can fix these problems like this:

<valid_python_variable> ::= (<letter> | "_" ) <allowed_chars>*
<allowed_chars> ::= <letter> | <digit> | "_"
<letter> ::= [a-z] | [A-Z]
<digit> ::= [0-9]

Now we’ve said a valid Python variable is a letter or an underscore followed by zero or more of the characters specified in the <allowed_chars> rule.

BNF and Extended BNF

There are a number of variants of BNF in use. Technically the above is called “Extended” BNF or EBNF notation. Regular BNF does not allow the asterisk character to mean zero or more repeats. We can rewrite this in pure BNF as below, but it is somewhat more complicated:

<valid_python_variable> ::= (<letter> | "_") <zero_or_more_allowed_chars>
<zero_or_more_allowed_chars> ::= (ε | <allowed_chars> <zero_or_more_allowed_chars>)
<allowed_chars> ::= <letter> | <digit> | "_"
<letter> ::= [a-z] | [A-Z]
<digit> ::= [0-9]

We have an extra rule for <zero_or_more_allowed_chars>, which as the name implies, sets up a repeat of zero or more allowed characters. It starts with an epsilon (ε) which in BNF means “match nothing”. Then we say an allowed character followed by the rule itself. So in total this means the rule matches either nothing or an allowed character followed by a repeat of the rule. The rule recursively repeats, so it means: nothing or an allowed character followed by nothing or an allowed character followed by nothing or an allowed character and so on to infinity … Yeah that’s not much fun to figure out. Lets stick with the EBNF asterisk notation which is much simpler!

Image credit: Martin Vorel