Finite State Machines (FSMs) and regular expressions are mathematically equivalent tools used to define regular languages and recognise specific patterns within strings.
What are FSMs?
A Finite State Machine (FSM) is a computational model used to simulate how systems change state in response to inputs. It consists of a finite number of states, with transitions between them that are triggered by input symbols. These machines process an input string one symbol at a time and change states accordingly.
FSMs are essential in computing for modelling and designing systems such as lexical analysers, control circuits, and protocol parsers. They are also used to validate whether a string belongs to a particular language.
An FSM generally includes:
A start state: where the machine begins processing.
A set of accepting (or final) states: if the machine ends in one of these after reading the full input, the string is accepted.
A transition function: defines how the FSM moves from one state to another based on input.
FSMs used for pattern recognition (acceptors) do not produce outputs. These are the type considered in this topic.
What are regular expressions?
Practice Questions
FAQ
Regular languages are defined by finite state machines and regular expressions, which both lack memory beyond their finite set of states. This means they cannot track nested structures, count arbitrary numbers of symbols, or match dependencies between different parts of a string unless those dependencies can be encoded using a finite number of states. For instance, they cannot recognise languages like {0ⁿ1ⁿ} where the number of 0s must exactly match the number of 1s, because such matching requires potentially unbounded memory, which FSMs do not have. Regular expressions are similarly restricted because they can only describe patterns formed using concatenation, alternation, and repetition, not recursive or nested constructs. While FSMs and regular expressions are powerful within the domain of regular languages, their inability to handle context-sensitive or context-free structures limits their use for more complex language recognition, such as in parsing programming languages or mathematical expressions with balanced parentheses.
Deterministic FSMs (DFAs) have exactly one transition for each input symbol from a given state, meaning there is only one path through the machine for a particular input string. Non-deterministic FSMs (NFAs), however, can have multiple transitions for the same input from the same state or even transitions without input (epsilon transitions). Regular expressions more naturally correspond to NFAs, as they often describe patterns that have multiple valid interpretations or paths. For example, the expression (a|b)c implies a choice between a and b, which is easily modelled in an NFA but would require more complex structure in a DFA. Despite these differences, both DFAs and NFAs accept the same set of regular languages, and any regular expression can be translated into either. However, converting an NFA to a DFA may result in an exponential increase in the number of states. In practice, NFAs are often used during design, while DFAs are preferred for implementation due to their simplicity in execution.
FSM minimisation involves reducing the number of states in a finite state machine without changing the language it recognises. This process groups equivalent states—states that behave identically for all possible future inputs—and merges them. The resulting minimal FSM is unique (up to renaming of states) for any given regular language. Importantly, this minimisation does not affect its equivalence to the corresponding regular expression because both the original and the minimised FSM accept the same set of strings. From a regular expression perspective, minimisation may lead to simpler, more elegant expressions when translating back from an FSM. However, the process of converting a regular expression to an FSM and then minimising it is often used to improve computational efficiency in applications such as lexical analysis, where fewer states mean faster processing. Thus, while minimisation doesn’t change what the FSM accepts, it optimises how efficiently the machine performs and how cleanly it can be related back to a regular expression.
FSMs and regular expressions, in their basic forms, are designed to accept or reject strings based on strict matching rules and cannot handle errors or approximate matches. They follow precise transitions or pattern descriptions without flexibility. However, there are extended models and adaptations that allow for limited error-tolerant matching. For FSMs, one such extension is the use of edit distance or Levenshtein automata, which can accept strings within a certain number of insertions, deletions, or substitutions from a given pattern. Similarly, some advanced regular expression engines used in programming support approximate matching using fuzzy matching techniques or modifiers. These are not part of the theoretical foundation of regular languages but are practical enhancements used in tools like grep, Perl, or Python. For example, fuzzy search might match "colour" and "colur" despite the missing letter. In the context of theoretical computer science, however, FSMs and regular expressions assume exactness and cannot recognise or allow errors without modification.
The alphabet of a language, which is the set of input symbols it uses, fundamentally determines the structure and behaviour of both FSMs and regular expressions. A larger alphabet may require more transitions and potentially more states in an FSM to account for every possible symbol at each point in the input. For regular expressions, a larger alphabet means more possible literals and combinations in the pattern description. For instance, with an alphabet {a, b}, the FSM for recognising strings ending in ab is relatively straightforward. But if the alphabet is {a, b, c, d}, the FSM must explicitly manage all transitions on c and d even when they are irrelevant to the desired pattern, often adding complexity. The same applies to the regular expression, which may need to include additional constructs like (a|b|c|d)*ab to ensure proper matching. In short, the alphabet influences the design complexity and length of both FSMs and their equivalent regular expressions, though it does not alter their theoretical equivalence.
