Features of Hash

What are the requirements for an excellent hash algorithm?

a) The original data cannot be deduced from the hash value. This can be clearly seen from the MD5 example above: the mapped value has no apparent relationship to the original data.

b) A small change in the input data results in a completely different hash value, while the same data always produces the same value. In the example above, only a small part of the text was changed, yet the resulting hash value changed almost entirely.

c) The hash algorithm should be efficient to execute, so that the hash value can be calculated quickly even for long texts.

d) The collision probability of the hash algorithm should be small.
A hash maps values from the input space into the hash space, and the hash space is much smaller than the input space. According to the drawer principle, there must therefore be cases where different inputs are mapped to the same output. A good hash algorithm needs to keep the probability of such conflicts as small as possible.

There are ten apples on the table. Put these ten apples into nine drawers. No matter how you place them, at least one drawer will contain two or more apples. This phenomenon is what we call the "drawer principle". Stated generally: if each drawer represents a set and each apple represents an element, then when n+1 elements are placed into n sets, at least one set must contain at least two elements. The drawer principle is also referred to as the pigeonhole principle. It is an important principle in combinatorics.
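Properties b) and d) can be checked with a short sketch. It uses Python's standard hashlib module; the input strings, the helper names md5_hex and tiny_hash, and the choice of nine drawers are illustrative assumptions, not part of the original example.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Return the MD5 digest of text as 32 hexadecimal characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Property b): the same input gives the same hash,
# while a one-character change gives a different one.
first = md5_hex("hello world")
second = md5_hex("hello worle")
differing_digits = sum(x != y for x, y in zip(first, second))
print(first)
print(second)
print("hex digits that differ:", differing_digits, "of", len(first))

# Property d) and the drawer principle: a hypothetical hash with only
# nine possible outputs must send two of ten inputs to the same drawer.
def tiny_hash(text: str, drawers: int = 9) -> int:
    return int(md5_hex(text), 16) % drawers

apples = [f"apple-{i}" for i in range(10)]
drawer_contents = {}
for apple in apples:
    drawer_contents.setdefault(tiny_hash(apple), []).append(apple)

fullest = max(len(items) for items in drawer_contents.values())
print("largest drawer holds", fullest, "apples")
```

Whatever the inputs are, the fullest drawer always holds at least two apples, because ten items cannot occupy nine drawers one apiece.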

Solutions to Hash Collisions

As mentioned earlier, a hash algorithm is bound to produce conflicts, so what should we do when a hash conflict occurs? The two most commonly used approaches are the chain address method and the open address method.

Chain address method

The chain address method uses an array of linked lists to store the data. When a hash conflict occurs, the new element is appended to the end of the linked list at that position.


Schematic diagram of the chain address method

The chain address method works as follows:
When adding an element, first calculate the hash value of the element's key to determine its position in the array. If that position is empty, the element is stored there directly. When a conflict occurs, the element is appended to the elements already at that position, forming a linked list. All elements on the same linked list therefore map to the same array position. Java's HashMap uses this method to deal with conflicts. In JDK 1.8, when a linked list grows beyond 8 elements, it is converted to a red-black tree for optimization (provided the array has at least 64 slots; otherwise the array is resized instead).
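The insertion steps above can be sketched as follows. This is a minimal illustration, not how Java's HashMap is implemented: the class name ChainedHashMap is hypothetical, a Python list stands in for each linked list, and there is no resizing or tree conversion.

```python
class ChainedHashMap:
    """Minimal chain-address hash table (illustrative sketch only)."""

    def __init__(self, capacity=8):
        # One chain per array position; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(capacity)]

    def _index(self, key):
        # Map the key's hash value to a position in the array.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:
                chain[i] = (key, value)  # same key: overwrite the old value
                return
        chain.append((key, value))       # empty slot or conflict: append to the chain

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default


table = ChainedHashMap(capacity=8)
table.put(1, "a")
table.put(9, "b")   # 9 % 8 == 1, so this key conflicts with key 1
print(table.buckets[1])
```

Both keys end up on the chain at position 1, and a lookup walks that chain comparing keys until it finds a match.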

Open address method

The open address method stores N key-value pairs in an array of size M, where M > N, and relies on the empty slots in the array to resolve collisions. All methods based on this policy are collectively referred to as "open address" hash tables. Linear probing (the linear detection method) is a commonly used implementation of an "open address" hash table. Its core idea is that when a conflict occurs, the following slots in the table are checked one by one until an empty slot is found or the entire table has been searched. Simply put: once a conflict occurs, look for the next empty hash table address. As long as the hash table is large enough, an empty address can always be found.

The mathematical description of linear probing is h(k, i) = (h(k, 0) + i) mod m, where i indicates which round of probing is currently being performed. When i = 1, the slot immediately after h(k, 0) is probed; when i = 2, the slot after that; and so on. The method simply moves down the table one slot at a time. mod m means that after reaching the bottom of the table, probing wraps around to the top and continues from the beginning.
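A minimal sketch of the probe sequence h(k, i) = (h(k, 0) + i) mod m is shown below. The class name LinearProbingMap is hypothetical, h(k, 0) is assumed to be hash(key) % m, and deletion is left out because it needs extra handling (such as tombstones) in an open-address table.

```python
class LinearProbingMap:
    """Minimal open-address hash table using linear probing (sketch only)."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [None] * m  # each slot is None or a (key, value) pair

    def _probe(self, key):
        start = hash(key) % self.m       # h(k, 0)
        for i in range(self.m):          # i = 0, 1, ..., m - 1
            yield (start + i) % self.m   # h(k, i), wrapping around at the end

    def put(self, key, value):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None or entry[0] == key:
                self.slots[idx] = (key, value)
                return idx               # the slot that was finally used
        raise RuntimeError("hash table is full")

    def get(self, key, default=None):
        for idx in self._probe(key):
            entry = self.slots[idx]
            if entry is None:            # an empty slot ends the search
                return default
            if entry[0] == key:
                return entry[1]
        return default


probing_table = LinearProbingMap(m=8)
slot_7 = probing_table.put(7, "x")    # 7 % 8 == 7, slot is free
slot_15 = probing_table.put(15, "y")  # 15 % 8 == 7 is taken, so probing wraps to slot 0
print(slot_7, slot_15)
```

The second insertion shows the effect of mod m: the probe runs off the bottom of the table and continues from the top.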

For open addressing, in addition to linear probing, there are two other classical probing methods: quadratic probing and double hashing. But no matter which probing method is adopted, when few free positions remain in the hash table, the probability of hash collisions increases greatly. To keep the hash table efficient, we generally try to ensure that a certain proportion of its slots stays free. We use the load factor to measure how full the table is.

The load factor of the hash table = the number of elements filled in the table / the length of the hash table. The larger the load factor, the more conflicts and the worse the performance.
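The effect of the load factor can be illustrated with a small simulation. This is a rough sketch under stated assumptions: the function names, the table length of 1024, the random keys, and the fixed seed are all illustrative, and duplicate keys are treated as separate entries for simplicity.

```python
import random

def load_factor(num_elements, table_length):
    """Load factor = number of elements in the table / length of the table."""
    return num_elements / table_length

def average_probes(table_length, target_load, seed=0):
    """Fill a linear-probing table up to target_load (0 < target_load <= 1)
    and return the average number of slots examined per insertion."""
    rng = random.Random(seed)
    slots = [None] * table_length
    count = int(table_length * target_load)
    total_probes = 0
    for _ in range(count):
        key = rng.randrange(10**9)
        idx = key % table_length
        probes = 1
        while slots[idx] is not None:     # conflict: try the next slot
            idx = (idx + 1) % table_length
            probes += 1
        slots[idx] = key
        total_probes += probes
    return total_probes / count

print(load_factor(5, 8))
for load in (0.3, 0.6, 0.9):
    print(load, round(average_probes(1024, load), 2))
```

The average number of probes per insertion rises as the table fills, which is why practical hash tables grow once the load factor passes a chosen threshold.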

Demo examples of the two schemes

Suppose the hash table length is 8, the hash function is H(K) = K mod 7, and the given key sequence is {32, 14, 23, 2, 20}.
When the chain address method is used, the corresponding data structure is shown in the following figure:
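The layout for this key sequence can also be recomputed with a short sketch. The function names are hypothetical, and the linear probing variant assumes the probe formula h(k, i) = (H(k) + i) mod 8 with the table length as m.

```python
TABLE_LENGTH = 8
KEYS = [32, 14, 23, 2, 20]

def H(k):
    return k % 7

def build_chained(keys):
    """Chain address method: each position holds a list of conflicting keys."""
    table = [[] for _ in range(TABLE_LENGTH)]
    for k in keys:
        table[H(k)].append(k)
    return table

def build_linear_probing(keys):
    """Open address method: on a conflict, move to the next free slot."""
    table = [None] * TABLE_LENGTH
    for k in keys:
        for i in range(TABLE_LENGTH):
            idx = (H(k) + i) % TABLE_LENGTH
            if table[idx] is None:
                table[idx] = k
                break
        else:
            raise RuntimeError("hash table is full")
    return table

chained = build_chained(KEYS)
probed = build_linear_probing(KEYS)
print(chained)  # [[14], [], [23, 2], [], [32], [], [20], []]
print(probed)   # [14, None, 23, 2, 32, None, 20, None]
```

Keys 23 and 2 both hash to position 2. With chaining they share one list; with linear probing, key 2 moves on to position 3.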
