In my last post, I showed you how to create a spreadsheet based on data from a database using the Open XML SDK. Today I am going to show you how to do the reverse, specifically, how to read data from a spreadsheet and insert it into some data source, like a database.
Again, if you have any specific scenarios or solutions you would like me to address in future posts please let me know.
Today's scenario theme is document interrogation, which is all about reading data from a file.
Imagine a scenario where I'm a developer working for a fictional company called Contoso. At my company, the sales team uses Excel to create sales orders for customers. These sales orders are all based on an Excel template that has designated regions that keep track of data such as customer id, invoice number, items being purchased, total dues, etc. The sales team typically creates hundreds of sales orders a day. My company has asked me to create a solution that is able to bulk export data from these sales orders into a database. The company wants this solution to be run every night on the server, so automating Excel is not an option.
If you want to skip ahead and just download the code + Excel template just click here.
The scenario I listed above talks about reading sales orders that are based off of a workbook template. In this sales order template I will have designated regions where I expect people to fill in information. To simplify my coding and to ensure that my code is more robust, I am going to take advantage of defined name regions as specified in SpreadsheetML. Defined name regions allow me to "tag" one or more cells within my spreadsheet as having semantic meaning. For example, I can specify a defined name region for a cell that will have a value for customer id. The advantage of using defined names is that I won't have to rely on looking up hardcoded columns and rows. For example, if a cell that contains my customer id changes from being B10 to B11 because a user added a row to the spreadsheet, I am still able to find that specific cell due to the defined name tag.
In any case, I have created a simple template that looks like the following:
Before I can use the SDK, I need to add the appropriate namespaces to our C# file. In this case, I need to add my Packaging and Spreadsheet API components as follows:
Since my sales order file contains defined name regions, the first thing I will need to do is find all the defined names specified in my file. In SpreadsheetML, workbook-level defined names are stored in the main workbook part as follows:
The defined name is stored as a string that I will need to parse. What I am going to do in my solution is read each of these defined names and then parse it out into the following components:
I am going to then use this information to then find the value or sets of values that make up the defined name region. Note that my code assumes that no one changes the defined names or adds more defined names with the string "ItemsRange". I can accomplish this task with the following code:
Where I define the BuildDefineNamesTable method as follows (note that I created a class to store all my information called DefineNamesVal):
Note that my code above assumes that none of my defined names are relative defined names. In other words, my defined name ranges contain "$" between the column and row. Okay, at this point I have all the necessary information for me to look up the values for my defined name regions.
Based on my Excel template, I have two types of data:
In either case, the value for my defined name is contained in the worksheet specified by the defined name, which I already found. Here is a little method I wrote that returns back the worksheet part based on the defined name data:
Once I have the worksheet part and the cell reference (or region of cells) as specified in the defined name region I can get the cell values. I have two types of values stored in this sales order: numbers and strings. In SpreadsheetML, cell values that are strings can be represented as inline strings within the cell or stored within the shared string table that can be found in the sharedstrings part.
The first step in retrieving a specific cell value based on column and row index is to first find that cell. The second step is then to look up the value for that cell. The following methods will retrieve the cell value based on the cell reference:
If the cell is based on a shared string then you need to look up the appropriate value from the shared string table, otherwise you only have to look at the cell value. The following code accomplishes this task:
Now that I have all the basics in place I can put everything together. I need to go through all the defined names and find the appropriate values from the spreadsheet. For the sake of simplicity, instead of writing all found values into a database I am going to instead print out the data to my console. Yes, I got a bit lazy for creating a database to store this data.
As mentioned before, I have two types of data based on my defined names: single cell values and range values based on the sales order. I am going to handle my first type of data, the single cell values, with the following code:
Notice in the code above that I am converting a SpreadsheetML stored date as a human readable date. Since I am just going to print out my found values I am going to keep track of my inventory values as a separate list. I will retrieve my inventory values with the following code:
In the code above, I am reading my entire inventory sales order until I reach an empty row, which means the list is complete.
Imagine I take a sales order with the following information:
Running my code I will end up with the following output (remember I decided to print my information out rather than store the values in a database:
In my next few posts I am going to walk through other solutions to some key scenarios. Suggestions are always welcome.
Zeyad Rajabi