Hash tables

Overview

TBD

Direct address tables

TBD

Hash tables

TBD

Hash functions

TBD

Collision resolution by chaining

TBD

Hash tables in various programming languages

TBD

Open addressing

Idea

When a collision occurs on inserting an element, we attempt to put the element in the next cell (at the next position in the table), or, more generally, in another pre-determined cell.

In the most general case, when a key k is to be inserted, the following sequence of indexes is tried until the first empty cell is found: g(k, 0), g(k, 1), g(k, 2), etc.

On lookup, the same sequence is tried, until either:

the searched element is found;
an empty cell is found, in which case the searched element does not exist in the hash table.
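
The insertion and lookup procedures described above can be sketched as follows. This is only a minimal illustration, not a complete implementation: the names insert, lookup, EMPTY and linear are our own, the table stores bare integer keys, and the probe function g is passed in as a parameter (here a simple "next cell" rule is assumed). Indexes are reduced modulo the table size so that they always fall inside the table.

```python
EMPTY = object()  # marker for a cell that has never held a key


def insert(table, key, g):
    """Store key in the first empty cell of the sequence g(key, 0), g(key, 1), ..."""
    m = len(table)
    for i in range(m):
        idx = g(key, i) % m
        if table[idx] is EMPTY or table[idx] == key:
            table[idx] = key
            return idx
    raise RuntimeError("no free cell found along the probe sequence")


def lookup(table, key, g):
    """Return the index holding key, or None if an empty cell is reached first."""
    m = len(table)
    for i in range(m):
        idx = g(key, i) % m
        if table[idx] is EMPTY:
            return None  # the key would have been placed here, so it is absent
        if table[idx] == key:
            return idx
    return None


def linear(key, i):
    """Hypothetical probe function for integer keys: the key itself, then the next cells."""
    return key + i


table = [EMPTY] * 7
for k in (3, 10, 17):  # all three keys start probing at cell 3
    insert(table, k, linear)
```

With a table of 7 cells, the keys 3, 10 and 17 all start at cell 3 and end up in cells 3, 4 and 5; a lookup for 24 probes those three cells, reaches the empty cell 6 and reports that the key is absent.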

Choices for the cell sequence

Linear probing: g(k, i) = h(k) + i, with the index taken modulo the table size. In other words, the first probed index is given by the hash function and, if that cell is occupied, the next cell is probed, then the next, and so on, wrapping around at the end of the table. Advantage: better locality. Disadvantage: bad behavior if a large number of collisions occur on the same hash value or on consecutive hash values, because the occupied cells then form long runs that every such key must walk through.
Quadratic probing: g(k, i) = h(k) + ai² + bi, again modulo the table size: the sequence of probed indexes is the hash function value plus a second-degree polynomial in i. Advantage: better behavior if lots of collisions occur on consecutive hash values. Disadvantage: it is very hard to make the probing sequences cover the full table.
Double hashing: g(k, i) = h(k) + i*h'(k), modulo the table size, where h' is a second hash function. This way, two keys that are hashed to the same initial cell can still have distinct probing sequences afterwards. This can greatly improve behavior on collisions.
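
To compare the three choices, the sketch below builds each probe function and lists the first probed cells for a key. Everything concrete here is an assumption made for illustration: a table of m = 11 cells, integer keys, h(k) = k mod 11, h'(k) = 1 + (k mod 10), and the coefficients a = 1, b = 0 for quadratic probing.

```python
def linear_probe(h, m):
    """g(k, i) = (h(k) + i) mod m"""
    return lambda k, i: (h(k) + i) % m


def quadratic_probe(h, m, a=1, b=0):
    """g(k, i) = (h(k) + a*i^2 + b*i) mod m; a and b are illustrative defaults"""
    return lambda k, i: (h(k) + a * i * i + b * i) % m


def double_hash_probe(h, h2, m):
    """g(k, i) = (h(k) + i*h2(k)) mod m"""
    return lambda k, i: (h(k) + i * h2(k)) % m


def first_probes(g, k, count):
    """List the first `count` cells probed for key k."""
    return [g(k, i) for i in range(count)]


M = 11                   # assumed table size (a prime)
h = lambda k: k % M      # assumed primary hash function
h2 = lambda k: 1 + k % 10  # assumed second hash function, never 0

lin = linear_probe(h, M)
quad = quadratic_probe(h, M)
dbl = double_hash_probe(h, h2, M)

# Keys 5 and 16 collide on the initial cell 5.
print(first_probes(lin, 5, 4), first_probes(lin, 16, 4))
print(first_probes(quad, 5, 4), first_probes(quad, 16, 4))
print(first_probes(dbl, 5, 4), first_probes(dbl, 16, 4))
# Number of distinct cells each method can reach for key 5.
print(len({quad(5, i) for i in range(M)}), len({dbl(5, i) for i in range(M)}))
```

Under these assumptions, keys 5 and 16 follow exactly the same sequence with linear and quadratic probing, but diverge after the first cell with double hashing. The last line also illustrates the coverage problem of quadratic probing: with these coefficients only 6 of the 11 cells are ever reached, while double hashing reaches all 11, because h'(k) is non-zero and the table size is prime.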

Performance

Assuming that the hash function distributes the input keys uniformly over the cells, the probability that cell g(k, i) is occupied is the same for all keys and all positions in the probing sequence, and it is equal to the load factor α (the number of stored keys divided by the number of cells).

Now, for a fixed key to be inserted, the probability that at least i cells are probed is α^(i-1) (it is equal to the probability that the first i-1 cells are occupied). The average of a positive integer count is the sum, over all i ≥ 1, of the probability that the count is at least i, so the average number of cells to be probed is α⁰ + α¹ + α² + ... = 1/(1-α).
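
The formula can be checked numerically. The sketch below is a simulation of the simplified model used in the analysis, not of a real hash table: each probed cell is assumed to be occupied independently with probability α. The function names, the number of trials and the seed are our own choices.

```python
import random


def expected_probes(alpha):
    """Closed form 1/(1 - alpha) for the average number of probed cells."""
    return 1.0 / (1.0 - alpha)


def simulated_probes(alpha, trials=20000, seed=0):
    """Average probe count when each probed cell is occupied with probability alpha."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        probes = 1                    # the first cell is always probed
        while rng.random() < alpha:   # cell occupied: move on to the next one
            probes += 1
        total += probes
    return total / trials


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(alpha, expected_probes(alpha), simulated_probes(alpha))
```

The closed form gives 2 probes on average for α = 0.5 and 10 probes for α = 0.9, which shows how quickly the cost grows as the table fills up; the simulated averages should land close to these values.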

Issues on removing keys

Suppose a key k was inserted not in the first probed cell, which was occupied, but in the second cell. Then, suppose the key occupying that first cell is deleted. If we just set the cell as empty, then a subsequent attempt to search for the key k will fail, because the first probed cell is now empty, and the second cell is never probed.

So, when a key is removed from the hash table, the cell must have a special marking, showing that a key existed there but was removed. Such a cell can be used for inserting a new element in its place, but must be treated as an occupied cell during the search for a key.
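
A minimal sketch of this marking scheme is given below, assuming linear probing and Python's built-in hash function. The class name OpenAddressingSet and the markers EMPTY and DELETED are hypothetical names chosen for the illustration; resizing and load-factor control are left out.

```python
EMPTY = object()    # cell that never held a key
DELETED = object()  # cell whose key was removed (the special marking)


class OpenAddressingSet:
    def __init__(self, m):
        self.cells = [EMPTY] * m

    def _probe(self, key):
        """Yield the cells of the linear probing sequence for key."""
        m = len(self.cells)
        start = hash(key) % m
        for i in range(m):
            yield (start + i) % m

    def lookup(self, key):
        for idx in self._probe(key):
            cell = self.cells[idx]
            if cell is EMPTY:
                return None          # a truly empty cell ends the search
            if cell is not DELETED and cell == key:
                return idx           # DELETED cells are skipped, not stopped at
        return None

    def insert(self, key):
        first_free = None
        for idx in self._probe(key):
            cell = self.cells[idx]
            if cell is EMPTY:
                if first_free is None:
                    first_free = idx
                break
            if cell is DELETED:
                if first_free is None:
                    first_free = idx  # remember it, but keep looking for key
            elif cell == key:
                return idx            # key already present
        if first_free is None:
            raise RuntimeError("table is full")
        self.cells[first_free] = key
        return first_free

    def remove(self, key):
        idx = self.lookup(key)
        if idx is None:
            return False
        self.cells[idx] = DELETED     # mark, do not reset to EMPTY
        return True
```

For example, in a table of 7 cells, inserting 3 and then 10 places them in cells 3 and 4. After removing 3, a lookup for 10 still succeeds because cell 3 is marked as deleted rather than empty, and a later insertion of 17 reuses cell 3.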

Radu-Lucian LUPŞA
2016-05-10