Here is the scenario...
Every day you get a text file with a number of financial records in a fixed length format. Your job is to identify any new or changed records from the previous day and be able to produce an output file in the same format containing only the new or changed records. You can make no assumptions about the ordering of records in the file.
Here is a sample of the data
1310|512|086048610|01/01/1996|WB| |12/31/9999|1290.00 |USD5 |
1310|512|110000011|06/10/2002|WB| |12/31/9999|100.00 |USD5 |
1310|512|110000111|06/10/2002|WB| |12/31/9999|100.00 |USD5 |
The data files can get quite large (3GB)
Question for you: How would you architect the solution for this to achieve optimal performance?
The way I would do it owuld be:
I would create two tables with the same columns
(the columns would be the same as in the text fle). Lets Say Table1 and Table2
You could use either Access or SqlServer(Firebird, PostgreSql,Sqlite etc)
I would create an odbc datasource for the text file (for SqlServer) Or a simple procedure with TransferText (for Access).
Lets sa
Every day I would delete the data from table 2, and insert it from the text file via the odbc connection for SqlServer or via the procedure in Access.
The data from the previuos day would allready be in Table1
Then I would apply aquery like Select Table1.Field1,Table1.Field2 ... Table2.Field1, Table2.Field2 from Table1 Right Outer Join Table2 on Table1.Field1=Table2.Field1,Table1.Field2=Table2.Field2 etc
These would give you all the records from the second table (today), and for the records that are not present in the table1 table (yesterday) you would have all nulls.
So You can take the new records.
The output file could be generated through another odbc link for SQLServer or through a procedure from Access.
After this you could replace the data in table1 with the data in Table2 for tomorrow
My Solution:
Some things to consider.
- The fastest time can be achieved if the number of reads from files are reduced.
- Important thing to note: at least one of the files can be read sequentially and only once. The second might need rereading previously read records from disk depending on the amount of available memory.
- we need to split our files into pages. Pages are of fixed size and can be read whole, also they can be read directly by simply seeking calculated page position in the file. Every page contains a fixed number of records.
- we need to determine the maximum number of pages we can store in memory.
- the idea is to read the firs file sequentially in small chunks discarding previous read data and also read the second one sequentially but also index every record found and keep as much as possible in memory.
- We also need a list containing pages with records read from the second file, used as a MRU (most recent used) list
Algorithm:
- Create and indexing structure contains records key and page where the record can be found. The index can be created with the appropriate size from the beginning and no dynamic resizing will be required.
- Start reading the first file a small chunk at a time. Now we need to try to find a matching record from the second file for every record in our current chunk read from the first file.
IF record number not in INDEX then
- Read the second file a page at a time until the record is found. The difference is that previous pages are not discarded. Every new page read is pushed into the MRU page list. When the page is read every record is indexed: it's key and page are inserted into index. If the maximum size of the list is reached the last page is discarded.
ELSE
IF Page already in memory
- Compare records and bring page to the beginning of the MRU list. If no more records not compared in current page discard it
ELSE Read page from file2 and discard last page from MRU list if exceeding max size.
Scenarios:
1 Worst case:
- every record from every page in the fist file has the equivalent one, in a different page on the second one, so basically at worst you won't have page hits until all remaining pages cam be stored in memory.
- best case scenario all equivalent records are on the same pages which means every file is read only once and the records compared.
Boy I sure hope my English is good enough for somebody to understand what I was trying to say. :)