Imagine your sales business is ‘booming’ in cities X, Y, and Z, and you are looking to expand. Given that demographics provide regional indicators of sales potential, how do you find other cities with similar demographics? How do you sift though large sets of data to identify new cities and new expansion opportunities?
In this application example, we use “Cloud Numerics” to analyze demographics data such as average household size and median age between different postal (ZIP) codes in the United States.
This blog post demonstrates the following process for analyzing demographics data:
Because of the memory usage, do not attempt to run this application on your local development machine. Additionally, you will need to run it on at least two compute nodes on your Windows Azure cluster.
For this demonstration, we use the dataset “2010 Key US Demographics by ZIP Code, Place and County (Trial)”
To subscribe to the dataset:
To set up your Cloud Numerics Visual Studio project:
We now need to add project dependency from Demographics to DemographicsReader.
Finally, go to the DemographicsReader project, and add references to the following managed assemblies. In the Add References window, click Browse and find the assemblies, typically found at C:\Program Files\Microsoft Numerics\v0.1\Bin
To get data into the application, we need to add a service reference to the dataset.
3. Add the following “using” statements to the DemographicsReader project’s Class1.cs file, and declare a namespace “DemographicsReader” that will contain the classes for accessing the demographics data.
4. Add a class named “DemographicsReference” to this namespace with the following private members and a constructor.
5. Replace "myemail@live.com” with your Live ID and “myAzureMarketplaceKey” with the key from Step 1.
Next, let’s build the LINQ query operation for reading in the data. A single transaction returns a maximum of 100 rows. However, we want to read in much more data. Therefore, we create a list where we append and consolidate the individual query results. We structure the query so that we can start from a given row, read in a block of rows, move forward, read in another block, and so forth.
The query, by default, returns objects of type demog1. This is a class that holds the demographics data as fields. However, we want to be able to select an arbitrary subset of columns. We accomplished this by supplying Func<demog1, T> selector function into the query as an argument.
Add the following method to the DemographicsReference<T> class
Now, we have the basic framework in place to read in data. Next, we implement the Cloud Numerics IParallelReader interface to read in the data to a distributed array in parallel fashion.
Add the following class “DemographicsData” to the DemographicsReader namespace.
The class constructor expects the number of dataset rows (“samples”) as input. The method “ComputeAssignment” divides the total number of rows into groups (referred to here as “blocks”) and attempts to distribute these blocks evenly between ReadWorkers. The “ReadWorker” method is executed in parallel fashion: there is one parallel call in each rank. The method expects as input a start index and a specified number of blocks to read. For example, the first worker might read rows 0-999, the second one rows 1000-1999 and so on. Then, each ReadWorker converts the demographics indicators from the query results into array columns. The arrays from each ReadWorker are then concatenated to form a large distributed array.
See the blog post titled “Cloud Numerics” Example: Using the IParallelReader Interface for more details on how to use the IParallelReader interface.
Because the resulting distributed array is all numeric, we need an auxiliary list to map array rows to geographies. We accomplish this by adding a serial reader named “GetRowMapping.” We also add a struct “ZipCodeAndLocation” that holds the ZIP Code, city name, and state as well as the row index. The GetRowMapping then returns a list of such structs.
1. First, add the following struct to the top-level of the DemographicsReader namespace
2. Next, add a method GetRowMapping to the DemographicsData class from the previous step.
Here we use a selector that only picks out the GeographyID (the ZIP code), GeographyName (the city name), and StateAbbreviation from the full query. Then, because the results are in the same sorted order as those from the earlier demographics query, we can use a simple for-loop to add an index.
Next, we’ll move to the Demographics project. We add a static method that expects a distributed array as input, normalizes the rows and columns, and then computes a correlation matrix. The columns have to be normalized because otherwise they would be on vastly different scales: for comparing per capita income and median age, for example. We do this by subtracting the mean of each column, and dividing by the standard deviation. Note how the KroneckerProduct method is used to tile vectors into an array that matches the data array size. The rows are then normalized to obtain the correlation coefficients between –1 and 1.
To get the correlation matrix between all ZIP codes, we need to compute correlation coefficients between each pair of rows. An efficient way to perform this computation is to multiply the data with its own transpose. This operation is at the heart of the application. Because it produces a very large matrix (30000-by-30000), performing the computation on a distributed cluster yields benefits.
Delete the default template code from Demographics project source code, and add the following:
In addition to the Correlate method, this code includes the necessary “using” statements for Cloud Numerics namespaces.
We store results to Windows Azure blob storage as a blob resembling a .csv formatted file. To save time and memory, we only write a small subset of the 30000-by-30000 correlation matrix to a blob. Add the following method to the Demographics Program class. Also, change “myAccountKey” and “myAccountName” to the key and name of your own storage account.This method takes a list of ZipCodeAndLocation objects we’re interested in, finds the corresponding rows and columns from the correlation matrix, and writes the results to a string. Then it creates a container “demographicsresult” and a blob, “demographicsresult.csv”, makes the blob publicly readable, and uploads the string to http://<myAccountName>.blob.core.net/demographicsresult/demographicsresult.csv .
Finally, we add a main method that instantiates the reader, reads in data, computes correlations, and selects the interesting subset of cities from the row mapping. For example, let’s select Cambridge, MA, Redmond, WA, and Palo Alto, CA. Add the following main method to the Demographics Program class.
Now, the application is ready to be deployed. Set AppConfigure as StartUp project, build and deploy using the Deployment Utility. The application will run for a few minutes. The results can be viewed, for example by opening http://<myAccountName>.blob.core.net/demographicsresult/demographicsresult.csv as a spreadsheet
Here the correlation coefficients above 0.9 between different ZIP codes are highlighted.
For your convenience, this section of the blog post contains the entire source code for you to copy. Remember to change the “myAccountKey” and “myAccountName” strings to your storage account key and name, and change “myemail@live.com” and “myAzureMarketplaceKey” to your Live ID and key from Step 1.