Ron Jacobs

Windows Workflow Foundation

Architecture Design Question of the Week

Here is the scenario...

Every day you get a text file with a number of financial records in a fixed length format. Your job is to identify any new or changed records from the previous day and be able to produce an output file in the same format containing only the new or changed records. You can make no assumptions about the ordering of records in the file.

Here is a sample of the data

1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.00 |USD5 |

1310|512|110000011|06/10/2002|WB| |12/31/9999|100.00 |USD5 |

1310|512|110000111|06/10/2002|WB| |12/31/9999|100.00 |USD5 |

 

The data files can get quite large (3GB)
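
For reference, each record appears to be a single pipe-delimited line with trailing padding; a minimal sketch of splitting one record in Python (the field meanings are not given here, so this only shows the shape of the data):

    # Split one pipe-delimited record into trimmed fields.
    # Field meanings are not specified; this just shows the shape of the data.
    line = "1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.00 |USD5 |"
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    print(fields)
    # ['1310', '512', '086048610', '01/01/1996', 'WB', '', '12/31/9999', '1290.00', 'USD5', '']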

Question for you: How would you architect the solution for this to achieve optimal performance?

  • In practicing architecture...

    My first question would be: What is the business problem that needs to be solved here?

    Then: Why do I need to spot the new or changed records?

    Another: What task is accomplished with these financial records in the data file?
  • Good job Marcod - you win the first design prize.  Good architects are always looking at the larger context because it shapes the solution in important ways.  Where this data comes from and where it is going are two questions that I don't know the answer to (yet) but hope to have something soon.  For example it might be advantageous to load this data into a SQL database and allow it to do the heavy lifting of searching and sorting.  On the other hand it might not be worth the extra time and effort.  Hard to say without knowing more.  Stay tuned.
  • Some questions first:

    1. What are the criteria for a change in a record? That is, is there a key in each record so that the rest of the record can be compared?

    2. What kind of a machine would the solution be run on, in terms of its configuration?

    3. I would also like to know whether it is going to be run interactively or would be more of a batch-processing kind of app.

    The Solution:

    I have a solution which might not necessarily be the best, but it would hardly take an hour to develop. Testing it on real data can give some valuable insight into whether an XML parser is workable (if not, I can use SQL Server). I would also gather some input from the users.

    Build an XML file from the old data file. Each line of the text file is added as data inside a node and the hash of the text is calculated and added as an attribute.

    Once this is done, read the new file line by line. Calculate the hash of the text and see if there is a match. If there is a match, then do a string comparison (to be absolutely sure). Whenever a match does not occur, you know the record is either new or has changed.

    If, as per my question 1, there is a key in each record, then the solution can use two XML files!
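
    A minimal sketch of the hash-then-verify idea above, using a plain in-memory dictionary instead of an XML file (an assumption to keep the example short; note it still holds all of the old file's records in memory, which matters at 3GB):

    import hashlib

    def new_or_changed(old_path, new_path, out_path):
        # Map hash of each old record -> list of the actual records with that hash,
        # so a hash hit can be double-checked with a real string comparison.
        old = {}
        with open(old_path) as f:
            for line in f:
                rec = line.rstrip("\n")
                old.setdefault(hashlib.md5(rec.encode()).digest(), []).append(rec)

        with open(new_path) as f, open(out_path, "w") as out:
            for line in f:
                rec = line.rstrip("\n")
                h = hashlib.md5(rec.encode()).digest()
                # No hash match, or a hash match with no exact string match: keep it.
                if rec not in old.get(h, []):
                    out.write(rec + "\n")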
  • Interesting idea of comparing a hash - do you think that calculating and comparing the hash would be faster than just doing the string compare?

    Based on the previous code, the criterion for a change in a record is any difference in the strings between the old file and the new file.

    There is a key for each record in the text file.  The app does run in batch mode currently; I've been told it takes anywhere from 15-60 minutes to run, depending upon the hardware and load.
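
    One rough way to look at the hash-versus-string-compare question is to time both operations on a single record; a sketch using Python's timeit (the records below are made up, and a one-to-one timing ignores the real benefit of hashing, which is avoiding comparing each new record against every old one):

    import hashlib, timeit

    a = "1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.00 |USD5 |"
    b = "1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.01 |USD5 |"

    # Comparing two records directly versus hashing one record.
    t_compare = timeit.timeit(lambda: a == b, number=1000000)
    t_hash = timeit.timeit(lambda: hashlib.md5(a.encode()).digest(), number=1000000)

    print("string compare: %.3fs   md5 of one record: %.3fs" % (t_compare, t_hash))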
  • So, is there a memory constraint?  Would I be able to consume 6GB of RAM to process these files?

    Also, are we able to change the application that produces these files to add a timestamp to each record so we don't even have to look at the previous day's file?
  • I like the idea of timestamps.  I don't know if we can alter the thing that generates the files, though; it would most certainly be on my list of asks. The current system has been run on 2 servers: one has 3GB of RAM and one has 1GB of RAM.  Though considering the cost of hardware, I would not hesitate to recommend a hardware upgrade for the solution.
  • OK, I had to take a stab at this.

    In order to easily identify lines that have been added/deleted/changed between two files (yesterday's and today's), it would help if they were sorted in some way. However, as I can't assume that they are sorted in any particular way, I'm going to make what I think is a reasonable assumption that it also doesn't matter how the results are sorted, just so long as all the lines that are added or changed are listed.

    So, I'd sort each day's source file. This is probably time consuming, so I'll just do it once for each day's file when I get it. If I just keep the sorted file and throw away the original, I can compare it against yesterday's sorted file, and keep it to compare tomorrow's against in turn. (I'm assuming I have enough disk space to store at least 2-3 days of files.)

    As it doesn't matter what order the lines of the files are in, just so long as it's stable, I can just sort the lines of the file in plain ASCII order. Which is easy.

    Then, given two sorted files, finding differences between them is also easy. Read a line from each, and see if they're the same. If they are, go to the next line in each file; if the line in the first file sorts before the line in the second file, a line was removed, go to the next line in the first file; if the line in the first file sorts after the line in the second, a line was added, go to the next line in the second file. Repeat.

    (Note that a change will be marked as a removal and an addition, so to find lines that were changed or added, we just need to look at lines that have been added. A Python sketch of this merge appears at the end of this comment.)


    Hey, actually this is looking pretty easy.

    Taking your sample data and copying it onto itself a few times gives me a file 3472883712 bytes (3.3GB) in size, which I'll use as a sample. Now, this is definitely more than the amount of RAM I have, and is also more than the amount of memory a process can allocate at once, so there's no way this file will all fit in memory. Which will make things interesting, but only a little bit.

    So, here I go:

    ~/code/sorttest$ ls -l
    total 3394804
    -rw-r--r-- 1 user users 3472883712 2006-09-26 00:04 data.txt

    ~/code/sorttest$ time cat data.txt > /dev/null

    real    1m23.400s
    user    0m0.152s
    sys     0m10.397s

    ~/code/sorttest$ time (LC_ALL=C sort data.txt > data2.txt)

    real    15m7.823s
    user    4m46.878s
    sys     1m33.282s

    ~/code/sorttest$ time (LC_ALL=C sort -c data2.txt && echo sorted)
    sorted

    real    1m29.067s
    user    0m21.405s
    sys     0m10.377s

    ~/code/sorttest$ cat data2.txt | tail -n +2 | head -n -1 > data3.txt
    ~/code/sorttest$ echo zzzzzzz >> data3.txt
    ~/code/sorttest$ echo zzzzzzzzzzz >> data3.txt
    ~/code/sorttest$ time (LC_ALL=C comm -1 -3 data2.txt data3.txt | sed -e "s/^\\t//")
    zzzzzzz
    zzzzzzzzzzz

    real    6m23.062s
    user    0m42.159s
    sys     0m19.265s

    ~/code/sorttest$


    The first command just shows that I have the file and how big it is.

    The second establishes a baseline for how long it takes to slurp the whole file off the disk. "real" is the amount of actual time it takes - 1m23s. "user" is the amount of CPU time spent in userspace, and "sys" is the amount of CPU time spent in the kernel. As "user" + "sys" is a lot less than "real", that's a good indication that I was waiting on I/O for much of the time. My HD light agrees. :)

    The ASCII sort is next. This is the real meat of the approach. Fortunately, GNU "sort"(1) appears to handle sorting files larger than can fit in memory OK. (Command 4 supports this, where a check doesn't take much more time than "cat"(1) as it only needs to compare each line against the next.) At a guess it uses mergesort, which is useful this way and still only has O(n log n) complexity. Still, only 15 mins on a desktop system for sorting a 3.4GB plain text file sounds pretty good to me.

    Commands 5, 6 and 7 are used to create another day's run. It's pretty simple for this demo, but should suffice. The runs are the same except the second day's file (data3.txt) is missing the first and last line, and has two other lines (badly formed, but guaranteed to be in the correct position) appended.

    Command 8 just uses "comm"(1) to show only the lines that are unique to the second file on its command line (data3.txt). (The invocation of "sed"(1) just removes the leading tab that "comm"(1) uses to put lines that are unique to file 2 in column 2 of the output)

    As expected, the only lines that have been added (which will include the final version of lines that have changed, as required) are the "zzzzz" lines. This last command hit the disk particularly hard, presumably because it had to keep seeking back and forward between the two 3.4GB files - a real performance killer.


    Put this in a script, automate it to run once per day and send output and errors to a human to check (via email?) and you've got the bulk of a solution.


    Anyone have any idea how well Monad would handle this? I guess it'd be pretty similar.
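
    A minimal sketch of the same merge step in Python, in case the GNU tools aren't available on the target servers (it assumes both inputs have already been sorted, as with the LC_ALL=C sort above):

    def added_lines(old_sorted, new_sorted, out_path):
        # Writes lines present in new_sorted but not in old_sorted, i.e. new
        # records plus the new version of changed records (the same output as
        # comm -1 -3). Both inputs must already be byte-sorted.
        with open(old_sorted) as old, open(new_sorted) as new, open(out_path, "w") as out:
            o, n = old.readline(), new.readline()
            while n:
                if not o or n < o:      # only in the new file: keep it
                    out.write(n)
                    n = new.readline()
                elif n == o:            # unchanged: skip both
                    o, n = old.readline(), new.readline()
                else:                   # only in the old file: a removal, skip it
                    o = old.readline()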
  • Oooh - more interesting approaches and comments since I started on mine. From reading them, it occurs to me that it might be worth me showing my system specs so the timings mean something:

    Athlon 1600
    512MB RAM (+10GB swap)
    Single 80GB 7200RPM HD

    Also, I forgot to point out that only commands 3 and 8 are needed in the final script.
  • A simple way to approach this would be to sort both files into ascending order, and then process them looking for adds, changes and deletions. Even better, if you can get the source system to sort into the correct order, then all you have to do is the processing.

    Another approach would be to calculate a hash of each record, and compare those.

    -S
  • I don't know if the sorted files mechanism would outperform a hash map -- it might (I actually don't know).  So here is the hash map approach...

    Assuming you can't change the file formats, the idea is to produce a new file each day that filters out rows that did not change.  Hashing reduces the comparison work from roughly n*(n-1) pairwise comparisons down to about n hash lookups, which is much better.  I think this is better than trying to sort the rows.

    The record keys don't help much, because you are looking for changes in the overall record.  The other thing you have on your side is that any hash map you build for the new file is also the hash map for tomorrow's old file (if you follow my meaning).

    So, on day 1, you scan the old file and build a hash map where each hash is built on the entire record content.  Each hash entry should have a pointer to the source record.  Note that more than one record can hash to the same value, so the hash map needs to deal with that.

    You then scan the new file, converting each whole row into a hash value, and then look to see whether a matching hash exists in the old file's hash map.  If there is no match, that record is a keeper and you move on.  If there is a hit, you should then compare the new record to the old record(s) to double-check that they actually match.  If so, you discard that new record.

    Finally, preserve the hash map built by scanning the new file for use on day 2 to avoid that first step.

    The question is whether the hash map for the old file can itself fit completely into memory.  Assuming 53 million hash values (for a 3GB file containing 60-byte fixed-length records), that's about 215MB for a 32-bit numeric hash or about 430MB for a 64-bit numeric hash.  Using a 32-bit hash means more records will wind up with matching hashes, which then need to be checked against the source data.  Also remember that when you generate a numeric hash, you need to process the record content into a big number and then take the modulo of that number against the largest prime number that fits the target hash size.  That helps distribute hash values evenly from 0 to 2^32-1.
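
    A sketch of that approach, assuming it's acceptable to keep only a 32-bit hash and a file offset per old record in memory and to seek back into the old file to resolve hash hits:

    import zlib

    def build_offset_index(old_path):
        # Map 32-bit hash of each old record -> byte offsets of the records with
        # that hash, so the records themselves never need to stay in memory.
        index = {}
        offset = 0
        with open(old_path, "rb") as f:
            for line in f:
                index.setdefault(zlib.crc32(line.rstrip(b"\r\n")), []).append(offset)
                offset += len(line)
        return index

    def new_or_changed(old_path, new_path, out_path):
        index = build_offset_index(old_path)
        with open(old_path, "rb") as old, open(new_path, "rb") as new, \
             open(out_path, "wb") as out:
            for line in new:
                rec = line.rstrip(b"\r\n")
                keep = True
                for off in index.get(zlib.crc32(rec), []):
                    old.seek(off)
                    if old.readline().rstrip(b"\r\n") == rec:   # verify the hash hit
                        keep = False
                        break
                if keep:
                    out.write(line)

    The index built for the new file in the same way could be kept for the next day, as described above.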

  • If the records don't have a key, then what is the criterion for comparing two records?

    Say record x changes on day 2; how do we determine that x has changed, given that we cannot assume record x appears at the same line number in the two files?
  • The way I would do it would be:

    I would create two tables with the same columns (the columns would be the same as in the text file). Let's say Table1 and Table2.

    You could use either Access or SQL Server (Firebird, PostgreSQL, SQLite, etc.).

    I would create an ODBC data source for the text file (for SQL Server), or a simple procedure with TransferText (for Access).

    Every day I would delete the data from Table2 and insert it from the text file, via the ODBC connection for SQL Server or via the procedure in Access.

    The data from the previous day would already be in Table1.

    Then I would apply a query like SELECT Table1.Field1, Table1.Field2 ... Table2.Field1, Table2.Field2 FROM Table1 RIGHT OUTER JOIN Table2 ON Table1.Field1 = Table2.Field1 AND Table1.Field2 = Table2.Field2, etc.

    This would give you all the records from the second table (today), and for the records that are not present in Table1 (yesterday) the Table1 columns would all be NULL.

    So you can take the new records.

    The output file could be generated through another ODBC link for SQL Server or through a procedure from Access.

    After this you could replace the data in Table1 with the data in Table2, ready for tomorrow.
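
    A rough sketch of the same idea using SQLite from Python rather than Access or SQL Server over ODBC (table names are made up, the whole record is compared as a single column for brevity, and a LEFT JOIN from today's table stands in for the RIGHT OUTER JOIN described above, since older SQLite versions don't support RIGHT JOIN):

    import sqlite3

    def new_or_changed(old_lines, new_lines):
        # Load yesterday's and today's records into two tables, then anti-join:
        # today's rows with no exact match in yesterday's table are new or changed.
        con = sqlite3.connect(":memory:")   # use a file-backed database for 3GB of data
        con.execute("CREATE TABLE yesterday (rec TEXT)")
        con.execute("CREATE TABLE today (rec TEXT)")
        con.executemany("INSERT INTO yesterday VALUES (?)", ((r,) for r in old_lines))
        con.executemany("INSERT INTO today VALUES (?)", ((r,) for r in new_lines))
        con.execute("CREATE INDEX ix_yesterday ON yesterday (rec)")
        rows = con.execute(
            "SELECT t.rec FROM today t "
            "LEFT JOIN yesterday y ON y.rec = t.rec "
            "WHERE y.rec IS NULL")
        return [rec for (rec,) in rows]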

  • My Solution:

    Some things to consider.

    - The fastest time can be achieved if the number of reads from the files is reduced.

    - Important thing to note: at least one of the files can be read sequentially and only once. The second might need rereading of previously read records from disk, depending on the amount of available memory.

    - We need to split our files into pages. Pages are of fixed size and can be read whole; they can also be read directly by simply seeking to the calculated page position in the file. Every page contains a fixed number of records.

    - We need to determine the maximum number of pages we can store in memory.

    - The idea is to read the first file sequentially in small chunks, discarding previously read data, and also read the second one sequentially, but index every record found and keep as much of it as possible in memory.

    - We also need a list containing pages with records read from the second file, used as an MRU (most recently used) list.

    Algorithm:

     - Create an indexing structure containing each record's key and the page where the record can be found. The index can be created with the appropriate size from the beginning, so no dynamic resizing will be required.

     - Start reading the first file a small chunk at a time. Now we need to try to find a matching record from the second file for every record in our current chunk read from the first file.

       IF record key not in INDEX THEN

           - Read the second file a page at a time until the record is found. The difference is that previous pages are not discarded. Every new page read is pushed onto the MRU page list. When a page is read, every record in it is indexed: its key and page are inserted into the index. If the maximum size of the list is reached, the last page is discarded.

       ELSE

          IF page already in memory

                - Compare the records and bring the page to the beginning of the MRU list. If there are no more uncompared records in the current page, discard it.

          ELSE Read the page from file 2 and discard the last page from the MRU list if it exceeds the maximum size (see the page-cache sketch at the end of this comment).

    Scenarios:

    1. Worst case: every record from every page in the first file has its equivalent in a different page of the second one, so basically at worst you won't get page hits until all the remaining pages can be stored in memory.

    2. Best case: all equivalent records are on the same pages, which means each file is read only once and the records compared.

    Boy I sure hope my English is good enough for somebody to understand what I was trying to say. :)  
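
    A small sketch of just the page-cache piece of this (the MRU list of fixed-size pages), assuming fixed-length records so that any page can be located by seeking; the record length and page size here are made-up values:

    from collections import OrderedDict

    RECORD_LEN = 64                     # assumed fixed record length, newline included
    RECORDS_PER_PAGE = 4096
    PAGE_SIZE = RECORD_LEN * RECORDS_PER_PAGE

    class PageCache:
        """Keeps the most recently used pages of a file in memory."""

        def __init__(self, f, max_pages):
            self.f = f                  # file object opened in binary mode ("rb")
            self.max_pages = max_pages
            self.pages = OrderedDict()  # page number -> page bytes, oldest first

        def page(self, page_no):
            if page_no in self.pages:
                self.pages.move_to_end(page_no)               # now most recently used
            else:
                self.f.seek(page_no * PAGE_SIZE)              # pages are addressable
                self.pages[page_no] = self.f.read(PAGE_SIZE)  # because records are fixed length
                if len(self.pages) > self.max_pages:
                    self.pages.popitem(last=False)            # evict least recently used
            return self.pages[page_no]

        def record(self, record_no):
            data = self.page(record_no // RECORDS_PER_PAGE)
            i = (record_no % RECORDS_PER_PAGE) * RECORD_LEN
            return data[i:i + RECORD_LEN]

    The index described in the algorithm above would then map each record key to its record number, and lookups would go through record().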
