There are a variety of message digest algorithms out there, MD5 being a particularly popular one. These generate a "message digest" (essentially, a hash) so you can detect whether somebody has tampered with a file, the theory being that it's hard to tamper with a file without changing its hash.

But make sure you record the file size as well as the digest.

Not that collisions are necessarily easy to create by mistake. (I've heard a rumor that the deployment team has seen an MD5 collision, but it's just a rumor. I have no evidence. Heck, maybe what really happened was that somebody on the deployment got their MR2 into a car accident...)

Anyway, the possibility of a "reset attack" makes collisions trivial to create.

Hash generators typically operate on a stream. The hash engine maintains some state. The file to be hashed is broken up into chunks, and each chunk is combined with the engine's state variables in some complex way. When you have passed all the data through the engine, you push a button on the engine and out pops the hash value (which is typically a copy of the state variables, or possibly a subset of them).

Now suppose somebody came up with a way of "resetting" the engine; that is, returning it to the initial state. Here's how they can make any document match your digest:

First, create an alternate message and send it through the hash engine.

Next, generate the bytes necessary to "reset" the engine.

Finally, append the original message.

In other words, the fake file looks like this:

[alternate message][garbage][original message]

where "garbage" is the reset.

This fake file has the same hash as the original message, since the "garbage" resets the hash engine to the initial state, at which point the replay of the original message regenerates the hash.

Result: A file with the same hash as the original, but with different content!

In a proper attack, of course, the "alternate message" would be crafted so the garbage and original mesage would be ignored. You might end it with a marker that means "Ignore everything after this point." (For HTML, you can just say <NOFRAMES> and everything after that point will be largely ignored by all modern browsers.) Many other file types encode the expected file length in the header, in which case you can append whatever garbage you want without having any effect.

But if you also store the file size with the hash, then the reset attack fails, because a reset attack always generates a file bigger than the original. To create a collision, they would have to create a shorter alternate message than the original, and then fiddle with the extra bytes to get the desired target hash to come out. This is significantly harder than just resetting.

(I'm not aware of anybody who has successfully been able to reset MD5, mind you. This is a protective measure: If somebody figures out how to reset MD5, a small bit of work on your side will prevent you from falling victim.)