Set theory provides the essential language and concepts for understanding regular expressions, offering a clear framework for describing and manipulating groups of symbols.
What is a set?
A set is a fundamental concept in mathematics and computer science: a well-defined collection of distinct objects, known as elements or members. In computing, these elements can be numbers, letters, characters, strings, or more abstract items such as machine states or transitions.
A set is well-defined when there is no ambiguity about whether an object belongs to the set.
Elements in a set are distinct, meaning duplicates are not allowed.
The order in which elements are listed in a set does not matter.
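The three properties above can be observed directly with Python's built-in set type. This is only an illustrative sketch; the variable names are chosen for this example.

```python
# Duplicates collapse: a set keeps each element only once.
vowels = {"a", "e", "i", "o", "u", "a", "e"}
vowel_count = len(vowels)

# Order is irrelevant: these two literals denote the same set.
same_set = {"a", "e", "i", "o", "u"} == {"u", "o", "i", "e", "a"}

# Well-defined: membership always has a definite yes/no answer.
has_e = "e" in vowels
has_y = "y" in vowels

print(vowel_count, same_set, has_e, has_y)
```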
Everyday examples of sets
The set of vowels in the English alphabet: {a, e, i, o, u}
The set of even digits: {0, 2, 4, 6, 8}
The set of days in a week: {Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday}
FAQ
Yes, a set can contain another set as an element, making it a nested set or a set of sets. This is valid because a set treats each of its elements as an individual, distinct object, whether that object is a number, a character, or another set. For example, {{0,1}, {1,2}} is a set whose two elements are themselves sets. In computer science, especially in formal language theory and automata, this is useful for representing more complex structures. For instance, in finite state machines, the set of all possible state transitions can be modelled as a set of ordered pairs, and ordered pairs can themselves be encoded as sets. Additionally, in parsing and language analysis, a language is defined as a set of strings. A string itself is a sequence rather than a set, because the order and repetition of its characters matter. When grouping languages together (e.g. as a collection of subsets of Σ*), you are effectively dealing with a set of sets. This concept supports abstraction and modularity when building more sophisticated computational models.
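As a rough illustration of nested sets, the sketch below uses Python's frozenset for the inner sets, since Python's ordinary mutable sets cannot be elements of another set. The transition tuples describe a hypothetical machine invented for this example.

```python
# Inner sets are written as frozenset (an immutable set) so that
# they can serve as elements of an outer set.
inner_a = frozenset({0, 1})
inner_b = frozenset({1, 2})
nested = {inner_a, inner_b}          # corresponds to {{0,1}, {1,2}}

element_count = len(nested)          # two elements, each itself a set

# Order inside an inner set does not matter for membership.
contains_inner = frozenset({1, 0}) in nested

# 1 appears inside the elements, but is not itself an element of `nested`.
contains_number = 1 in nested

# Transitions of a hypothetical machine as a set of
# (state, symbol, next_state) tuples.
transitions = {("q0", "a", "q1"), ("q1", "b", "q0")}

print(element_count, contains_inner, contains_number)
```

The distinction between being an element of a set and being an element of one of its members is the point to notice here.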
The power set of a set is the set of all possible subsets of that set, including the empty set and the set itself. If a set A has n elements, then its power set contains 2^n subsets. In computing, power sets become highly significant in the context of non-deterministic finite automata (NFA) and their conversion to deterministic finite automata (DFA). When transforming an NFA to a DFA, each DFA state represents a subset of NFA states, which is an element of the power set of the NFA's state set. This transformation ensures that for every state and input symbol there is exactly one transition, making the DFA deterministic. Drawing DFA states from the power set means every combination of NFA states that can arise is accounted for, so the DFA recognises exactly the same language as the NFA. In practice, only the subsets reachable from the start state need to be built. Power sets are therefore not just mathematical curiosities: they play a practical role in modelling and understanding how machines can process regular languages.
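The following is a minimal sketch of both ideas: computing a power set and running a simplified subset construction. The NFA shown is a hypothetical example (binary strings ending in 01), the function names are made up for illustration, and epsilon transitions are deliberately left out.

```python
from itertools import chain, combinations

def power_set(elements):
    """Return every subset of `elements` as a list of frozensets."""
    items = list(elements)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]

def subset_construction(nfa, start, alphabet):
    """Build DFA transitions whose states are frozensets of NFA states.

    `nfa` maps (state, symbol) to a set of next states. Epsilon moves
    are not handled in this sketch.
    """
    start_state = frozenset({start})
    dfa = {}
    pending = [start_state]
    seen = {start_state}
    while pending:
        current = pending.pop()
        for symbol in alphabet:
            # Union of everywhere the NFA could go from any state in `current`.
            target = frozenset(
                nxt
                for state in current
                for nxt in nfa.get((state, symbol), set())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return dfa, seen

# Hypothetical NFA over {0, 1} for strings that end in "01".
nfa = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
all_subsets = power_set({"q0", "q1", "q2"})
dfa, dfa_states = subset_construction(nfa, "q0", "01")

print(len(all_subsets), len(dfa_states))
```

Every DFA state produced is a member of the power set, but only the subsets reachable from the start state are actually constructed, which is why the DFA is often much smaller than the 2^n upper bound.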
Membership testing in set theory involves checking whether a specific element belongs to a set, written as x ∈ A. This directly parallels the task of determining whether a particular string belongs to a regular language, which is itself a set of strings. In practice, when a finite state machine processes an input, it essentially performs membership testing—traversing transitions to determine whether the input ends in an accepting state, confirming that the string belongs to the language. Similarly, a regular expression tests whether an input string matches a pattern, which is equivalent to asking: “Is this string a member of the language described by the regular expression?” Thus, in both FSMs and regular expressions, set membership testing underlies the logic of pattern recognition. It ensures that only valid strings—those satisfying the rules of the language—are accepted. This foundational concept helps bridge abstract set operations with concrete applications in computing and syntax validation.
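To make the parallel concrete, here is a small sketch that answers the same membership question in two ways: by simulating a DFA and by matching a regular expression. The language used (binary strings with an even number of 1s), the state names, and the function names are assumptions chosen for this example.

```python
import re

# Hypothetical DFA: accepts binary strings containing an even number of 1s.
TRANSITIONS = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
START = "even"
ACCEPTING = {"even"}

def dfa_accepts(string):
    """Follow one transition per symbol, then check the final state."""
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False      # symbol outside the alphabet
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING

# A regular expression describing the same language.
EVEN_ONES = re.compile(r"(0*10*1)*0*")

def regex_accepts(string):
    """fullmatch asks whether the whole string is in the language."""
    return EVEN_ONES.fullmatch(string) is not None

for candidate in ["", "0110", "101", "1", "10"]:
    print(repr(candidate), dfa_accepts(candidate), regex_accepts(candidate))
```

Both functions return a yes/no answer to the question "is x ∈ L?", which is the set-membership view of pattern matching.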
The empty set (∅) and the set containing the empty string ({ε}) are fundamentally different in regular language theory, though they might seem similar at first glance. The empty set represents a language that contains no strings at all: no strings of any length, not even the empty string. On the other hand, {ε} is a language that contains exactly one string, the empty string, which has zero characters. This distinction is important when defining acceptance criteria in finite state machines or regular expressions. A finite state machine that recognises ∅ has no reachable accepting state, so no input leads to acceptance. One that recognises {ε} has a start state that is itself accepting, so the empty input is accepted without consuming any symbols, while every non-empty input leads to a non-accepting state. This difference affects how regular expressions are written and interpreted. For example, the pattern ^$ matches ε. Formal regular expression notation includes ∅ as a basic expression, but practical regex syntax has no dedicated symbol for it; a pattern that can never match, such as the empty negative lookahead (?!) in engines that support lookahead, plays that role.
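The contrast can be sketched with two tiny machines and two patterns. The helper `make_machine` and the state names are hypothetical, and the never-matching pattern relies on Python's `re` module supporting an empty negative lookahead.

```python
import re

def make_machine(accepting):
    """Return an acceptance test for a two-state sketch machine.

    The machine starts in "start" and moves to the trap state "dead"
    as soon as it reads any symbol.
    """
    def accepts(string):
        state = "start"
        for _ in string:
            state = "dead"
        return state in accepting
    return accepts

recognises_nothing = make_machine(accepting=set())        # language ∅
recognises_epsilon = make_machine(accepting={"start"})    # language {ε}

# An empty pattern fully matches only the empty string.
EPSILON_ONLY = re.compile(r"")
# An empty negative lookahead fails at every position, so nothing matches.
NEVER_MATCHES = re.compile(r"(?!)")

for candidate in ["", "a"]:
    print(
        repr(candidate),
        recognises_nothing(candidate),
        recognises_epsilon(candidate),
        EPSILON_ONLY.fullmatch(candidate) is not None,
        NEVER_MATCHES.fullmatch(candidate) is not None,
    )
```

The only difference between the two machines is whether the start state is accepting, which is exactly the difference between ∅ and {ε}.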
Set difference, written A − B, identifies all elements in set A that are not in set B. In the context of regular languages, A and B can represent two different languages or sets of strings. Using set difference allows you to exclude specific patterns or strings from a broader accepted set. For instance, if set A is the language of all binary strings (Σ* = {0,1}*), and set B is the set of all strings containing the substring 11, then A − B would represent all binary strings that do not contain 11. This operation is useful when filtering inputs, for example accepting all usernames that do not include special characters, or recognising strings that meet general criteria but not specific exceptions. Regular languages are closed under set difference, so A − B is always regular when A and B are, but the basic regular expression operators (union, concatenation and Kleene star) include no difference operator. In practice the effect is achieved with techniques such as negative lookahead in engines that support it, or by manipulating the corresponding FSMs (for example, complementing the machine for B and intersecting it with the machine for A).
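The binary-string example can be checked on a finite slice of Σ*. The sketch below compares Python's set difference operator with two patterns: one using negative lookahead and one written with only the basic operators. The length bound of 4 and all names are assumptions made for illustration.

```python
import re
from itertools import product

def strings_up_to(alphabet, max_length):
    """All strings over `alphabet` of length 0..max_length (a finite slice of Σ*)."""
    return {
        "".join(symbols)
        for length in range(max_length + 1)
        for symbols in product(alphabet, repeat=length)
    }

A = strings_up_to("01", 4)             # finite slice of {0,1}*
B = {s for s in A if "11" in s}        # strings containing the substring 11
difference = A - B                     # set difference A − B

# Negative lookahead: reject any string in which "11" appears anywhere.
WITH_LOOKAHEAD = re.compile(r"(?!.*11)[01]*")
by_lookahead = {s for s in A if WITH_LOOKAHEAD.fullmatch(s)}

# The same language using only union, concatenation, star and optionality.
BASIC_ONLY = re.compile(r"(0|10)*1?")
by_basic = {s for s in A if BASIC_ONLY.fullmatch(s)}

print(len(A), len(B), len(difference))
print(difference == by_lookahead == by_basic)
```

That a lookahead-free pattern exists for this language reflects the closure of regular languages under difference, although finding such a pattern by hand is not always this easy.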
