Collisions occur when two keys in a hash table are assigned the same location. This topic explores why collisions happen and how they can be effectively resolved.
What is a collision?
A collision in a hash table occurs when two distinct keys generate the same index using a hash function. Since each index in a hash table is typically designed to hold just one entry, a collision creates a conflict where multiple key-value pairs compete for the same slot.
Collisions are one of the most fundamental challenges in designing and maintaining hash tables. If not resolved effectively, they can degrade the performance of operations like lookup, insertion, and deletion, turning what should be constant-time operations into linear-time ones in the worst case.
Why do collisions occur?
The reason collisions occur lies in the nature of hash functions and finite storage. A hash function is responsible for taking a key, such as a string or number, and computing an index within a fixed-size array, the hash table. However, the number of possible unique keys is generally much larger than the number of available slots in the table.
Practice Questions
FAQ
Secondary clustering is a phenomenon that occurs in open addressing methods like quadratic probing when different keys hash to the same initial index and follow the exact same sequence of probe steps. This means that even though the collisions don’t occur consecutively in memory, the keys are still forced to check the same subsequent slots, leading to reduced performance due to repeated probing paths. This happens because the sequence generated by quadratic probing is determined solely by the initial hash index and the probe number, which remains consistent for any key that produces the same hash. In contrast, primary clustering happens in methods like linear probing when long sequences of filled slots are formed. Keys that hash to any index within or adjacent to this sequence get absorbed into the cluster, making it grow and making insertions and lookups slower. Primary clustering creates large blocks of contiguous filled slots, while secondary clustering affects keys with the same hash index regardless of surrounding slots.
Choosing a prime number for the table size in a hash table helps improve the distribution of hash values, especially in open addressing schemes like quadratic and linear probing. When using the division method (key mod table size) to generate indices, non-prime table sizes can lead to patterns or clustering due to common factors shared between the key values and the table size. This can result in some slots being used more frequently than others, reducing efficiency and increasing collisions. A prime table size minimises this issue by ensuring that keys are more evenly distributed across the table, regardless of their numerical patterns. Additionally, in quadratic probing, using a prime number ensures that the probing sequence can potentially cover all slots in the table, preventing situations where only part of the table is accessible. This improves the chance of finding an empty slot during insertion and reduces the need for frequent resizing or rehashing.
In chaining, deletion is straightforward because each index contains a list (typically a linked list) of entries. To delete a key-value pair, the algorithm simply searches the list at the relevant index, finds the matching key, and removes it from the list. This operation does not affect other entries in the table or disturb the structure of the table. There is no need for placeholder values or additional checks. In contrast, open addressing requires careful handling of deletion. If an entry is removed from a slot, subsequent keys that were inserted due to collisions might no longer be found during lookups, breaking the probing sequence. To fix this, deleted slots are often marked with a special flag (e.g. “deleted”) instead of being left empty. This ensures that lookups and insertions still follow the correct probe sequence. However, over time, these flagged slots can accumulate and degrade performance, often requiring a cleanup phase to rehash the table and remove the flags.
Yes, hash tables can be used with non-numeric keys, such as strings, objects, or other complex data types. In these cases, the key is first converted into a numeric value using a hash function that interprets the non-numeric data in a consistent, deterministic way. For example, in the case of strings, each character can be converted to its ASCII or Unicode value, and then combined through addition, multiplication, or other operations to produce a final hash code. Once this numeric hash code is generated, it is typically reduced using the modulo operator with respect to the table size to find the index: index = hash_code mod table_size. Collisions are handled in the same way as with numeric keys — through chaining or open addressing methods. The key point is that the original key must be retained for equality comparison during lookup, since two different keys might produce the same hash. Thus, even if a collision occurs, the correct value can still be located by comparing stored keys.
Several factors influence whether chaining or open addressing is the better choice for a specific application. Memory usage is a key consideration: open addressing stores all entries within the array itself, making it more memory-efficient, especially in environments with limited memory, such as embedded systems. Performance under load is another major factor. Chaining handles high load factors better because each slot can store multiple entries, and performance only deteriorates significantly when chains grow very long. In contrast, open addressing degrades more quickly as the table fills up, since probing times increase. Ease of implementation can also affect the decision. Chaining often relies on external data structures like linked lists, which might be more complex to manage in low-level programming environments. Additionally, the type of workload matters: applications with many deletions tend to favour chaining because deletion is simpler and cleaner. Finally, concurrency and thread safety can play a role — chaining offers more straightforward synchronisation when multiple threads access different chains independently.
