You will often hear people state that a hash code is a unique value for a given input. In reality, although it is difficult to accomplish, it is technically feasible to find two different inputs that hash to the same value. The true determining factors in the effectiveness of a hash algorithm are the length of the generated hash code and the complexity of the data being hashed.

Let's first talk about the hash algorithms themselves. Hashing algorithms generate a fixed-length hash code regardless of the length of the input. For example, the MD5 hash function always generates 128-bit (16-byte) hash codes, typically displayed as 32 hexadecimal characters; the SHA1 hash function generates 160-bit (20-byte) hash codes; SHA256 generates 256-bit (32-byte) hash codes; and so on. Therefore, since there is a limited range of possible values for a given hash code and an unlimited range of values to hash, it stands to reason that the length of the hash code generated by a given hash algorithm is directly related to the difficulty of finding two inputs that generate the same hash code.
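To see this concretely, here is a minimal sketch using Python's standard hashlib module (the input strings are my own, purely for illustration). It hashes a tiny input and a much larger one and prints the digest sizes; the digest length depends only on the algorithm chosen, never on the input.

```python
import hashlib

# Digest length depends only on the algorithm, never on the input size.
for data in (b"a", b"a much longer input " * 1000):
    for name in ("md5", "sha1", "sha256"):
        h = hashlib.new(name, data)
        print(f"{name}: input={len(data):>6} bytes -> digest={h.digest_size} bytes")
# md5 digests are always 16 bytes, sha1 always 20 bytes, sha256 always 32 bytes.
```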

This is easy to prove. If n is the number of possible hash codes, we only need n + 1 distinct input values to guarantee that at least two of them share a hash code (the pigeonhole principle). Granted, for most hash functions n is rather large, so we can safely assume that for any meaningful input value it would be very difficult to find another meaningful input that produces the same hash.
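To make the pigeonhole argument tangible without waiting years for a real collision, the following sketch (my own illustration, not a practical attack) deliberately truncates SHA-256 to its first two bytes so that n is only 65,536. With n that small, hashing a stream of distinct inputs turns up a repeat almost immediately.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    # Keep only the first 2 bytes, so there are just 2**16 possible "hash codes".
    return hashlib.sha256(data).digest()[:2]

seen = {}
for i in count():
    data = str(i).encode()
    code = tiny_hash(data)
    if code in seen:
        print(f"collision: {seen[code]!r} and {data!r} both hash to {code.hex()}")
        break
    seen[code] = data
```

In practice this finds a collision after only a few hundred inputs, far fewer than the worst-case n + 1, which is the birthday effect at work; with a full-length hash code, n is so enormous that the same search would be hopeless.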

That brings me to the second point: the input value itself. If the input is anything more structured than random data, then even if you were to find two inputs that generate the same hash value, it is all but certain that the two inputs would have no semantic relationship to one another. This is especially evident when you are hashing text, because there are many more ways to produce gibberish than there are ways to produce meaningful words. In other words, it is extremely unlikely that random characters will form words, even less likely that those words will form sentences, and vanishingly unlikely that those sentences will form a paragraph or a document that still makes sense.

As an example, let's say you're generating a hash code for a top-secret document that details the route of nuclear warheads for disposal. The chances of someone randomly generating words until they matched the hash code of the original document, and of the result being a document that makes any sense, are so small that you could accurately call it impossible for all practical purposes.

Therefore, the fact that a hash code doesn't uniquely match a single input value really only comes into play when you are dealing with random (unstructured) data. An example of this would be spyware-removal applications that decide whether a file is spyware by hashing each file in a specified folder or drive and comparing the hash value against a list of known spyware-file hash codes. In this case, relying solely on the hash code would be a mistake: the files being hashed vary in length, and many have no semantic meaning to the application, which never interprets the data; it only compares hash code values. As a result, it would be highly recommended that these applications either match on file name before hashing the file's contents (which dramatically reduces the possibility of false positive results) or use a hash algorithm with a very long output value, such as SHA256.
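A rough sketch of such a scanner might look like the following; the folder path, the file name, and the signature digest are invented for illustration. The point is simply that a file is flagged only when both its name and the SHA256 digest of its contents match a known entry, rather than relying on the hash code alone.

```python
import hashlib
from pathlib import Path

# Hypothetical signature list: known spyware keyed by file name,
# with the expected SHA-256 digest of the file's contents.
KNOWN_SPYWARE = {
    "tracker.dll": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def sha256_of(path: Path) -> str:
    # Hash the file in chunks so large files don't have to fit in memory.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def scan(folder: Path) -> list[Path]:
    flagged = []
    for path in folder.rglob("*"):
        # Match on file name first, then confirm with the content hash.
        if path.is_file() and path.name in KNOWN_SPYWARE:
            if sha256_of(path) == KNOWN_SPYWARE[path.name]:
                flagged.append(path)
    return flagged
```

Matching on the name first means the expensive hashing step runs only on candidate files, and a stray file that happens to collide with a signature but carries a different name is never misreported.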