Larry Osterman's WebLog

Confessions of an Old Fogey
Blog - Title

WHat's wrong with this code, part 9 (File Copy) - the answers

WHat's wrong with this code, part 9 (File Copy) - the answers

  • Comments 11
Wow, where to begin on this one.

Lets start with the easy one, the simple coding error.

Here it is:

    // Account for data read.
    dataBlock->dataLengthWritten->QuadPart += dataBlock->readLength;

The problem with this is that 64bit math isn't atomic on 32bit processors.  Which means that if you tried to copy files over 4G in size, then the addition would potentially mis-order the writes to the two halves of the dataLengthWritten field, causing measurement issues.

To fix that, you need to use the InterlockedExchangeAdd64() function which will perform the operation atomically.

Now for the complicated coding error.  That one shows up here:

   if (!SetFilePointerEx(dataBlock->sourceHandle, dataBlock->fileDataOffset, NULL, FILE_BEGIN))
        printf("Unable to set current file position in source to %i64d: %d", dataBlock->fileDataOffset, GetLastError());
    DWORD readLength;
    if (!ReadFile(dataBlock->sourceHandle, dataBuffer, dataBlock->readLength, &readLength, NULL))
        printf("Unable to read data from file: %d\n", GetLastError());

The problem here is that the file wasn't opened for overlapped I/O.  If you don't specify FILE_FLAG_OVERLAPPED on the call to CreateFile, then the OS will synchronize access to the file.  It does that by acquiring a lock across each file operation.  But the key thing is that this lock is only held across individual operations.  So if there were two worker threads running, their calls to SetFilePointerEx call would reset each other.  In other words, you'd have:

Thread 1 Thread 2
Set file pointer to 0  
  Set file pointer to 20
Read from file  
  Read from file

In general, if you don't specify FILE_FLAG_OVERLAPPED, then the OS makes no guarantees about what happens with file I/O if you attempt to do the I/O on more than one thread. 

So the first thread would think its reading at 0, but would in fact be reading at 20.  To fix this, the file would have to be opened with the FILE_FLAG_OVERLAPPED flag.  The copy routine would have to remove the calls to SetFilePointerEx and instead replace the ReadFIle with:

    DWORD readLength;
    OVERLAPPED readOverlapped;
    readOverlapped.hEvent = CreateEvent(NULL, FALSE, TRUE, NULL);
    readOverlapped.Offset = dataBlock->fileDataOffset.LowPart;
    readOverlapped.OffsetHigh = dataBlock->fileDataOffset.HighPart;
    if (!ReadFile(dataBlock->sourceHandle, dataBuffer, dataBlock->readLength, NULL, &readOverlapped))
        DWORD error = GetLastError();
        if (error == ERROR_IO_PENDING)
            if (!GetOverlappedResult(dataBlock->sourceHandle, &readOverlapped, &readLength, TRUE))
                printf("Unable to read data from file: %d\n", GetLastError());
            printf("Unable to read data from file: %d\n", GetLastError());

And this leads neatly to the design errors.  As I mentioned, there are at least three of them.

The first is fairly large - the reality is that this code doesn't actually overlap reads and writes - instead it replaces them with synchronous operations.  Which doesn't really generate a lot of parallelism - the code is still spending most of its time blocked waiting on I/O to complete.  Instead, the code should be reworked based on a finite state machine (or recoded to use ReadFIleEx with a callback function) to force the code to proceed in a more orderly fashion.

The second design error is that the code will effectively read (and writes) randomly through the file, while QueueUserWorkItem queues requests in order, if there are more than one worker thread running, the order in which operations run is effectively random.  This means that the cache manager will thrash mightily trying to pre-read data.  This can be fixed by specifying FILE_FLAG_RANDOM_ACCESS on the CreateFile.  But the code STILL has problems - it would be nice if the OS could read ahead of the writes and doing random I/O doesn't help.  In reality, using unbuffered I/O (and limiting the I/O to the native cluser size on disk) is probably a better choice - in other words, accepting to cache the file data internally.  But the I/O ordering issue that leads to the third design failure.

The third error is that by extending the file by doing asynchronous writes if you ever write beyond the current "valid data length" of the file, the filesystem starts a lazywriter thread backfilling the file data with zeros.  If you later come in and write that data (which will happen with this design) you end up writing twice as much data. And that leads to a fourth design error.

The fourth design failure is that the code isn't recognizing the physical limitations of hard disks - hard disks typically have only one spindle and only one read arm (with multiple heads).  They can only do one thing at a time, which means that if your source and destination are on the same drive, overlapping the reads and writes in this manner is going to thrash the hard drives heads - which will hurt your performance.  You'd be better off reading as much of the data from the source at once and then writing it out to the destination.


Oh, and as to why my testing didn't show any problems?  That's because QueueUserWorkItem is quite conservative about spinning up new threads - in all my testing, I never saw more than one thread running - so my code didn't actually speed up anything because it was never run in parallel.  And that's a fifth design error (although if you'd made the entire process fully asynchronous, it wouldn't need ANY worker threads).


B.Y. nailed almost all the issues in the first comment.

The first issue (synchronization of the byte count was also caught by Skywing).

Stuart Ballard caught the fact that the random writes had the potential of causing fragmentation issues.

Harald totally nailed the perf issue covered by random I/O to the disks (the 4th issue above).

Later on, B.Y. nailed the concept of using a read queue and a write queue for the I/O.

Not bugs:

Harald pointed out that alternate data streams aren't copied, that's by design (see the original article).

Capt. Jean-Luc Pikachu asked about the operator new not throwing.  By default (unless you #include <new>) the C runtime operator new doesn't throw, instead it returns null.

Mea Culpas:

B.Y. also caught:

  • A missed error check on the QUeueUserWorkItem() call
  • A potential exception if an allocation error occurs inside the loop queuing the items

Mike Dimmick also pointed out issues with the process heap critical section (used under the covers by operator new).  This is a very real scalability issue, and since this particular problem is all about scalability it is relevant.  He also picked up on some of the worker thread queue issues that made up the fifth error above.


The reason I started writing this "what's wrong with this code" post was to come up with an example that shows the issue of doing I/O to a synchronous file handle on more than one thread, it was only after writing the code that I started realizing what a truly extraordinary bug farm that could be found from just the simple task of copying a file as quickly as possible.  When I started writing the answer, I thought I'd identified only two design issues, by the time I'd finished I had five of them.

The simple answer is: Use CopyFile and let the OS do the heavy lifting for you :)

  • I do not know if you are coding a new algorithm for Long.... but here's my take and then a question.

    If the case is that the source file size is, say, less than half of the available physical memory, then use apply logic:

    IF source & dest can be determined to be on the same physical drive, see if the file is fragmented, if it is, then copy as normally and let NCQ handle the head position ordering. If file is in say less than 10 fragments, have the file read into the file cache to the point that half of the avail physical memory is used. Now start writing the file, stopping reading until writing is done. Handle as transaction.

    The question I have is, I have not yet found a copy program that uses memory to speed up the copy, instead the programs seem to prefer reading small blocks and have the head seek back and forth the drive. Even if there is gigabyte of physical memory free. Now is this just bad design or can you give some good insight to why the available memory is not used to full/even half extent?
  • I'm certainly not messing with CopyFile, that's the base team's job, I'm in multimedia :)

    Those might be reasonable assumptions, I'd have to think about it a bit...
  • Arg, I misread <a href="">the FAQ</a>. Also, I just recently started reading, so I wasn't aware of your view on exceptions. ^^;;
  • One small point about what you said about 'operator new' - it's not just #include <new> that causes use of a throwing new - any C++ library header will causes this (so long as it brings int eh C++ runtime libs). It's a small point (I thought you were more wrong, but I read the VS.NET 2003 docs and learnt what the actual situation is - which is nice, learning something new everyday is always good), but I'd guess you'll find there are a lot of people who include some part of the C++ runtime and might still be checking that new functions correctly.
  • I'd really like to know more about memory allocation vs. scalability, if you ever run out of topics :-) My first encounter with the problem was a while ago: I noticed that the behaviour of the application I was working on degraded significantly during subsequent repitions, and identified the cause as being due to excessive memory allocation and fragmentation.

    I managed to tweak things by simply tidying up the number of times data was copied and being more careful about lifetime, but I didn't really get any satisfying advice from the internet: there are a lot of products being sold, and a lot of college-level articles about memory allocators, but nothing that gave sound advice for scalable memory management (except, of course, articles describing how GC mitigates these problems)
  • I accidentally glanced at the comments and noticed the comment about 64 bit operations not being atomic. I guess I've been around too long and don't see any reason to assume any such operation is atomic. If the CLI doesn't make this promise then you can't assume it for any operation.

    With HT processors we're going to see a lot of subtle bugs appear and they'll all be blamed on mysterious viruses or maybe ghosts left in the CPU from processes that died horrible deaths.

    Seriously the current processors are the result of years of single threaded programming. Maybe we should make atomicity a well-defined part of the architecture instead of just assuming it because the program seems to work.
  • Wow, Bob - long time no hear.

    Your point is actually a huge issue. On our currently supported 32bit processors, 32bit aligned memory operations are guaranteed (by the processor) to be atomic. However, in general, you're totally right - on 64bit processors this is explictly NOT true - on those machines the rules on write ordering have been significantly relaxed, which can cause all sorts of issues.

    I used a great deal of care writing the example to never have shared access to any variables - there is only one thread operating on any data structure (except for the one case I left in as a bug) at any time.

    Your comment about the ordering issue is why I suggested using the InterlockedExchangeAdd API - it asserts a memory barrier after the add to ensure that all (logical and physical) processors have a consistant view of the data.

    Btw, the biggest likely item to fail in the presence of misordered writes is the Double Checked Lock Pattern - here's a great discussion of why it fails on modern processors:

    The problem is that there is a lot of existant code that uses it that will start breaking in really subtle ways.
  • "The problem with this is that 64bit math isn't atomic on 32bit processors."

    Not even 32-bit math is atomic on 32-bit processors. Given

    int i = 0;

    // Thread A
    i += 2;

    // Thread B
    i += 3;

    the final result might be i=2, or it could be i=3 or it could be i=5. if you want atomic arithmetic you must use an Interlocked function.
  • I especially like your mention of using unbuffered I/O. It can help on low memory computers (256 is low memory? <grin>).

    When using WinRAR on a 4GB archive, my computer becomes unusable for the duration. No UI application is active enough to stay in memory. But altering the source, which they provide, to use sequential__scan on the input and unbuffered I/O on the output file prevents the problem.

Page 1 of 1 (11 items)