Maths encyclopedia and lessons  
Search

Mathematics Encyclopedia and Lessons

 
     
 

Lessons

Popular
Subjects

algebra
arithmetic
calculus
equations
geometry
differential equations
trigonometry
number theory
probability theory
more
 

References

applied mathematics
mathematical games
mathematicians
more
 
 

Hash function

A hash function is a function that converts an input from a (typically) large domain into an output in a (typically) smaller range (the hash value, often a subset of the integers). Hash functions vary in the domain of their inputs and the range of their outputs and in how patterns and similarities of input data affect output data. Hash functions are used in hash tables, cryptography and data processing. A good hash function is one that experiences few hash collisions in the expected domain of values it will have to deal with; that is, it would be possible to uniquely identify most of these values using this hash.

Formally speaking, a hash function is defined by its domain (typically variable-length strings of bytes), by its range (typically fixed length bit-sequences) and by the defining function (call this function H). The desired characteristic of a hash function in general is that H(x) = H(y) probably implies that x = y. Unfortunately, the use of probably here is precise though indeterminate. If the set of all possible values of H(x) is much smaller than the set of all possible values of x, then this latter condition cannot be true if all sequences are equally likely. Thus, the second condition is often revised in different ways to make it plausible, or workable. For instance, cryptographic hash functions use the second condition in the form that given x, it is computationally very difficult to find y such that H(x) = H(y). Additionally, it is desirable that if we are given x and H(x + s) where + can be bit changing or concatenation, then we can't find s short of exhaustive enumeration. In practice, for most applications other than error correction, cryptographic hashes serve very nicely. The MD5 and SHA-1 algorithms are two popular but compromised algorithms for generating cryptographic hash functions; the SHA-2 algorithm has no known compromises.

Contents

Hash tables

Hash tables, a major application for hash functions, speed up the lookup of a record of information, given a key (note: this use of the word key is separate from its meaning as a code for encrypting data). For example, an alphabetic string might be used to look up an employee record in a database system.

Hashing can provide almost direct access to records, which means that, on average, a lookup can require just one or two probes into the memory or file containing the records. (At the other extreme is the worst case, in which lookup proceeds by comparing the key with each record in turn -- this linear search can require examining all the records.) Naturally, a good hash function that will yield few hash collisions is preferred to speed up the lookup.

Assuming the key to be a string of bytes, the hash function should be an index into the records that has random distribution over expected strings. Otherwise, there will be more key "collisions" and hence degradation of lookup time. If, for example, a key is alphabetic, then each byte can have only 26 of its possible 256 values. So simple functions will not probe all record indices evenly.

See the external link "Hash Functions for Hash Table Lookup" below for a comparison of speed and quality of various hash functions, including Pearson Hashing and some fast, high quality hash functions by Jenkins himself.

Error correction

For error correction, a distribution of likely perturbations is assumed at least approximately. Perturbations to a string are then classified into large (improbable) and small (probable) errors. The second criterion is then restated so that if we are given H(x) and x+s, then we can compute x efficiently if s is small. Such hash functions are known as error correction codes. Important sub-class of these correction codes are cyclic redundancy checks and Reed-Solomon codes.

Audio identification

For audio identification such as finding out whether an MP3 file matches one of a list of known items, one could use a conventional hash function such as MD5, but this would be very sensitive to highly likely perturbations such as time-shifting, CD read errors, different compression algorithms or implementations or changes in volume. Using something like MD5 is useful as a first pass to find exactly identical files, but another more advanced algorithm is required to find all identical items. Contrary to an assumption often voiced by people who don't follow the industry carefully, hashing algorithms do exist that are robust to these minor differences. Most of the algorithms available are not extremely robust, but some are so robust that they can identify music played on loud-speakers in a noisy room.

Rabin-Karp string search algorithm

Rabin-Karp string search algorithm is a relatively fast string searching algorithm that works in O(n) time on average. It is based on the use of hashing to compare strings.

Origins of the term

The term "hash" apparently comes by way of analogy with its standard meaning in the physical world, to "chop and mix." Knuth notes that Hans Peter Luhn of IBM appears to have been the first to use the concept, in a memo dated January 1953; the term hash came into use some ten years later.

In the SHA-1 algorithm, for example, the domain is "flattened" and "chopped" into "words" which are then "mixed" with one another using carefully chosen mathematical functions. The range ("hash value") is made to be a definite size, 160 bits (which may be either smaller or larger than the domain), through the use of modular addition.

See also

References

External links

01-04-2007 01:18:14
The contents of this article are licensed from Wikipedia.org
under the GNU Free Documentation License. How to see transparent copy