Learn how to define and check the syntax of programming languages using Backus-Naur Form (BNF), and build derivation trees to trace syntax validity.
What is BNF used for?
Backus-Naur Form (BNF) is a formal notation used to describe the syntax of programming languages and other structured data formats. It allows you to define the grammar of a language by specifying how valid statements (or strings) are constructed. Once you define a grammar using BNF, you can then parse strings — that is, check whether they follow the correct structure — and build derivation trees to trace how valid strings are formed.
BNF plays a key role in compilers, interpreters, and language parsers, helping to ensure that programs conform to the syntactic rules of a language before they are executed. This topic explores how to write BNF grammars, use them to parse strings, and build trees that trace syntactic correctness step-by-step.
Defining syntax using BNF
Components of a BNF grammar
BNF grammars are built from three types of components:
Non-terminal symbols: These are placeholders that represent patterns or structures in the language. They are usually written between angle brackets, such as <expr>, <digit>, or <identifier>.
Practice Questions
FAQ
Regular expressions are limited to describing regular languages, which have a flat, linear structure and can be recognised by finite state machines (FSMs). They are ideal for patterns that involve repetition or alternation but cannot handle recursion or nesting. This is because FSMs do not have memory beyond their current state — they cannot “remember” how many levels deep a structure has gone. In contrast, BNF is capable of defining context-free languages, which include recursive patterns such as balanced parentheses, nested HTML tags, or strings with matched numbers of symbols like aⁿbⁿ. These require a form of memory to ensure that the number of opening symbols matches the number of closing ones, regardless of how many levels deep the nesting goes. BNF achieves this by allowing non-terminals to be defined in terms of themselves (recursion), enabling grammars to represent complex hierarchical structures that regular expressions fundamentally cannot.
In BNF parsing, direct derivation refers to a single-step replacement of a non-terminal with one of its production rule options. This is the most immediate application of a rule. For example, if you have <expr> ::= <term> + <expr> | <term>, replacing <expr> with <term> + <expr> is a direct derivation. On the other hand, indirect derivation involves multiple steps where several direct derivations are chained together. It’s the full process of starting from the initial symbol and applying direct derivations repeatedly until a string made entirely of terminal symbols is formed. Indirect derivation shows the complete path from the start symbol to the final string, and it is what’s used to confirm whether a string is valid. The distinction is important for constructing derivation trees, tracing syntax, and understanding how a grammar expands in stages to produce complex valid strings.
BNF by itself does not include direct notations for optional elements (?), repetition (*), or one-or-more (+) as found in extended forms like EBNF (Extended Backus-Naur Form). However, these features can still be simulated in standard BNF using recursive and ε (empty string) rules. To model an optional element, you would define a production with two alternatives — one including the element and one with ε. For example: <opt_digit> ::= <digit> | ε makes the digit optional. To model repetition, recursion is used: <digits> ::= <digit><digits> | <digit>. This allows one or more digits, as each <digits> can either terminate or continue the chain. Though more verbose than EBNF, BNF is still fully capable of defining optional and repeated constructs using clever rule definitions. This illustrates the flexibility of BNF despite its minimalistic syntax.
Left recursion occurs in a BNF rule when the non-terminal being defined appears as the first symbol on the right-hand side of its own production. For example: <expr> ::= <expr> + <term> | <term>. This is a left-recursive grammar because <expr> appears first in its own definition. Left recursion is perfectly valid in BNF and is often used in mathematical expression grammars to model left-associative operations, such as subtraction or addition, where expressions are grouped from the left. However, it poses challenges for certain types of parsers, particularly recursive descent parsers, which may go into an infinite loop when attempting to expand left-recursive rules. In such cases, the grammar needs to be refactored to eliminate left recursion, often by introducing new non-terminals and restructuring the rule. Understanding and identifying left recursion is therefore essential for designing grammars that are both correct and efficiently parsable.
Derivation ordering refers to the strategy used when applying production rules during parsing. In a leftmost derivation, the leftmost non-terminal in the string is always the one that gets replaced at each step. Conversely, in a rightmost derivation, the rightmost non-terminal is expanded first. Both approaches result in the same final string, but the sequence of intermediate steps differs. This matters when constructing derivation trees, as it influences the structure of the tree and how the string is built up. For example, a leftmost derivation mirrors top-down parsing methods, while a rightmost derivation aligns with bottom-up parsing techniques. Examining both types can reveal whether a grammar is ambiguous (if both lead to different trees for the same string). It’s important in syntax analysis to understand derivation orderings because parser behaviour, ambiguity detection, and derivation tree shape all depend on the chosen derivation strategy.
