You may recall our earlier blog post about Latent Semantic Indexing and Analysis, where we analyzed a few hundred documents. With the new features of “Cloud Numerics” Wave 2 we can repeat the analysis, not just for a few hundred but for many thousands of documents. The key new features we will use are distributed sparse matrices and singular value decomposition for dimensionality reduction of sparse data. We’ll then use DataParallel.Apply to compute the cosine similarities between documents.
The dataset we use is derived from SEC 10-K filings of companies for the year 2010. There are 8944 documents in total; the number of terms is over 100,000. We have computed the TF-IDF term-document matrix and stored it as triplets (term number, document number, TF-IDF weight). This matrix was computed using Hadoop and Mahout from pre-parsed filings, and then saved into a regular CSV file. We will use the “Cloud Numerics” CSV reader to read in this data. For those interested, “Cloud Numerics” can also read the triplets directly from HDFS (Hadoop Distributed File System), but that is a topic for another blog post.
To run the example, we recommend you deploy a Windows Azure cluster with two compute nodes, which you can do easily with the “Cloud Numerics” Deployment Utility. Alternatively, if you have a local computer with enough memory (more than 8 GB free), you can download the files (see next step) and attempt the example on that local computer.
We’ll be creating the application, method by method, in the steps below.
1. Open Visual Studio, create a new Microsoft “Cloud Numerics” Application, and call it SparseLSI.
2. In the Solution Explorer frame, find the MSCloudNumericsApp project and add a reference to the Microsoft.WindowsAzure.StorageClient assembly. We’ll need this assembly to read and write data to and from Azure storage.
3. Next, copy and paste the following piece of code to the Program.cs file as the skeleton of the application.
The input data for this application is stored in the cloudnumericslab storage account:
The methods we will use to read in data to distributed arrays do not support public access to blobs, so you must copy the data to your own Windows Azure storage account.
To do this, let’s add the following methods to the “Cloud Numerics” application:
This routine instantiates a CloudBlobClient instance, given an account name and key:
This method enables your application to download and upload blobs between storage accounts:
The third method copies two blobs: the CSV file that holds the term-document matrix, and a list of document names that maps term-document matrix columns to specific companies.
Create a method that reads data in “triplet” format (row, column, value), slices it into three vectors, and then assembles it into a sparse matrix. As an input argument, our reader will take a class that implements the IParallelReader interface, because this gives us the flexibility to use the same method for data stored locally or on Azure (this point becomes clearer in Step 9). The triplet data is shaped n-by-3, where n is the number of nonzero entries in the term-document matrix, that is, the count of unique words in each document summed over the whole corpus. The slicing operation yields the first, second, and third columns as separate vectors, where the first two correspond to indices of terms and documents, respectively. To convert them into row and column indices, we cast the first two into long integers and then call the distributed sparse matrix constructor. Add the following method to the skeleton program from the previous step.
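To make the triplet-to-sparse-matrix idea concrete, here is a minimal sketch in plain Python. It is not the “Cloud Numerics” reader or its sparse matrix constructor. The function names `read_triplets` and `build_sparse`, the zero-based indices, and the three sample rows are all assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_triplets(stream):
    """Parse (term, document, weight) CSV rows into three parallel lists."""
    rows, cols, vals = [], [], []
    for record in csv.reader(stream):
        if not record:
            continue  # skip blank lines
        term, doc, weight = record
        # Indices may arrive as floating-point text, so cast them to integers.
        rows.append(int(float(term)))
        cols.append(int(float(doc)))
        vals.append(float(weight))
    return rows, cols, vals


def build_sparse(rows, cols, vals):
    """Assemble a dict-of-dicts sparse matrix: matrix[term][doc] = weight."""
    matrix = defaultdict(dict)
    for r, c, v in zip(rows, cols, vals):
        matrix[r][c] = v
    # Assumes zero-based indices, so the shape is the largest index plus one.
    shape = (max(rows) + 1, max(cols) + 1)
    return matrix, shape


# Hypothetical sample data: three nonzero entries of a 3-term, 2-document matrix.
sample = io.StringIO("0,0,0.5\n2,0,1.25\n1,1,0.75\n")
rows, cols, vals = read_triplets(sample)
matrix, shape = build_sparse(rows, cols, vals)
```

Only the nonzero entries are stored, which is what makes a matrix with over 100,000 rows and thousands of columns manageable in memory.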
From the sparse matrix alone we cannot tell how columns map to filings by specific companies. To provide this mapping, we read in an indexed list of documents. This list was originally produced during the pre-processing of documents.
The following method reads the data file from Windows Azure to a string:
This method converts the string into a list of strings, one for each document:
The dimensionality of the term-document matrix is huge. In a sense, each document can be considered as a vector in t-dimensional space, where t is the number of words (over 100,000) in the corpus. Also, the term-document matrix is very sparse: each document uses only a small subset of all available words. We can therefore reduce the dimensionality of the problem by applying singular value decomposition.
This SVD operation has three major effects that facilitate subsequent analysis by:
The number of SVD components determines the reduced dimensionality. In this example we’ll specify a value of 100; you can experiment with different values. The output is a V*S matrix in the reduced-dimensional subspace, which we will use for further analysis.
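In matrix terms, the reduction can be sketched as follows. The symbols A (the term-document matrix), t (number of terms), d (number of documents), and k (number of SVD components) are notation introduced here for illustration; they are not names from the application code.

$$
A \approx U_k S_k V_k^{\mathsf T}, \qquad A \in \mathbb{R}^{t \times d},\quad U_k \in \mathbb{R}^{t \times k},\quad S_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{d \times k}
$$

$$
A^{\mathsf T} A \approx V_k S_k^{2} V_k^{\mathsf T} = (V_k S_k)(V_k S_k)^{\mathsf T}
$$

The second relation holds because the columns of U_k are orthonormal. It says that inner products between rows of V_k S_k approximate inner products between the original document columns of A. With t over 100,000, d = 8944, and k = 100, the V*S matrix has 8944 rows and only 100 columns, so document comparisons become far cheaper.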
At this stage, each document is represented as a row vector in the V*S matrix. Now, we’d like to find which documents are most similar to a specific document of interest.
To compare the documents we’ll use a cosine similarity measure. Let’s first add a class with a method that computes cosine similarity between a pair of documents. Note that the first vector is supplied as an argument to the constructor. The reason for this design is that we can then pass the cosine similarity method as an argument to DataParallel.Apply(), to compute the similarity between a single document of interest and all other documents in parallel. We also mark the class as serializable so it can be passed to DataParallel.Apply().
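The design described above, where the reference vector is bound in the constructor so that the comparison method takes a single argument, can be sketched in plain Python. This is an illustration of the pattern only, not the application's C# class; the Python names and the sample vectors are assumptions.

```python
import math


class SimilarityComparer:
    """Holds one reference vector and compares other vectors against it."""

    def __init__(self, reference):
        self.reference = list(reference)
        # Precompute the reference norm once; it is reused for every comparison.
        self.reference_norm = math.sqrt(sum(x * x for x in self.reference))

    def compare_similarity(self, other):
        """Return the cosine of the angle between the reference and `other`."""
        dot = sum(a * b for a, b in zip(self.reference, other))
        other_norm = math.sqrt(sum(b * b for b in other))
        if self.reference_norm == 0.0 or other_norm == 0.0:
            return 0.0  # convention chosen here for zero vectors
        return dot / (self.reference_norm * other_norm)


# Hypothetical vectors, chosen only to exercise the method.
comparer = SimilarityComparer([1.0, 2.0, 2.0])
same = comparer.compare_similarity([1.0, 2.0, 2.0])
orthogonal = comparer.compare_similarity([2.0, -1.0, 0.0])
```

Because `compare_similarity` takes only the second vector, it can be handed to any map-style routine that applies a one-argument function to each row.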
Then, we’ll add a method that first looks up the row index of the document of interest and gets the slice (row vector) corresponding to that document. It then instantiates SimilarityComparer using that vector, and calls DataParallel.Apply with SimilarityComparer.CompareSimilarity as the method to execute in parallel against rows of the V*S matrix. We sort the results by their similarity comparison score, select the top 10, and write the result to a string.
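The lookup, compare, sort, and select sequence can be sketched as below. This is a serial Python stand-in: the list comprehension plays the role that DataParallel.Apply plays in the application. The function names, document names, and vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def most_similar(names, vectors, target, top_n=10):
    """Rank all documents by similarity to the document named `target`."""
    index = names.index(target)      # row index of the document of interest
    reference = vectors[index]       # its row vector in the reduced space
    # Serial stand-in for applying the comparison to every row in parallel.
    scores = [cosine(reference, row) for row in vectors]
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def format_results(ranked):
    """Write one 'name<TAB>score' line per document."""
    return "\n".join(f"{name}\t{score:.4f}" for name, score in ranked)


# Hypothetical document names and two-dimensional reduced vectors.
names = ["DOC_A", "DOC_B", "DOC_C", "DOC_D"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
top = most_similar(names, vectors, "DOC_B", top_n=3)
report = format_results(top)
```

The document of interest always ranks first with a score of 1, which serves as a quick sanity check on the ranking.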
We write the results to a Windows Azure blob as a text file. The blob is made public, so you can view the results at the URI http://<myaccount>.blob.core.windows.net/lsioutput/lisresult.
Finally, we put together the pieces in the Main method of the application: we specify the input locations, load in the data, reduce the term-document matrix, select Microsoft Corp. as the company of interest, and find similar documents. We also added a Boolean “useAzure” so that you can switch to local input and output when you want to run the application on your workstation instead of on Windows Azure.
Note that the strings accountName and accountKey need to be changed to hold the name and key of your own Windows Azure storage account. If you intend to run on a local workstation, ensure that localPath points to the folder that holds the input files.
Once all pieces are together, you should be ready to build and deploy the application. In Visual Studio’s Solution Explorer, set the AppConfigure project as the StartUp Project, build and run the application, and use the “Cloud Numerics” Deployment Utility to deploy to Windows Azure. After a successful run, the resulting cosine similarities should look like the following:
The algorithm has produced a list of technology companies, as one would expect. As a sanity check, note that MICROSOFT_CORP_2010 has a similarity score of 1 with itself.
This concludes the example; the full example code is available as a Visual Studio 2010 solution at the Microsoft Codename “Cloud Numerics” download site.