This post demonstrates how to use the Microsoft.Numerics C# API to perform statistical operations on data in Windows Azure blob storage. We go through the steps of loading data using the IParallelReader interface, performing distributed statistics operations, and saving the results to blob storage. As we walk through the steps, we highlight code samples from the application.
Before you run the sample “Cloud Numerics” statistics application, complete the instructions in the “Cloud Numerics” Getting Started wiki post to:
You can download the sample application from the Microsoft Connect site (connect.microsoft.com). If you have not already registered for the lab, you can do that here. Registering for the lab provides you access to the “Cloud Numerics” lab materials (installation package, reference documentation, and sample applications).
For your convenience, we have staged sample datasets of pseudorandom numbers in Windows Azure Blob Storage. You can access the small and medium datasets at their respective links:
These datasets are intended merely as examples to get you started. Also, feel free to customize the sample application code to suit your own datasets.
Set AppConfigure as the StartUp project: in Solution Explorer in the Visual Studio IDE, right-click the AppConfigure subproject and select Set as StartUp Project.
To build the application you must have a Windows Azure storage account for storing the output. Replace the string values “myAccountKey” and “myAccountName” with your own account key and name.
    static string outputAccountKey = "myAccountKey";
    static string outputAccountName = "myAccountName";
The application creates a public blob for the output under this storage account. See Step 4 for details.
Let us take a look at the code in the AzureArrayReader.cs file.
The input array in this example is in Azure blob storage, where each blob contains a subset of the columns of the full array. By using the Microsoft.Numerics.Distributed.IO.IParallelReader interface, we can read the blobs in a distributed fashion and concatenate the slabs of columns into a single large distributed array.
First, we implement the ComputeAssignment method, which assigns blobs to the MPI ranks of our distributed computation.
    public object[] ComputeAssignment(int nranks)
    {
        Object[] blobs = new Object[nranks];
        var blobClient = new CloudBlobClient(accountName);
        var matrixContainer = blobClient.GetContainerReference(containerName);
        var blobCount = matrixContainer.ListBlobs().Count();
        int maxBlobsPerRank = (int)Math.Ceiling((double)blobCount / (double)nranks);
        int currentBlob = 0;
        for (int i = 0; i < nranks; i++)
        {
            int step = Math.Max(0, Math.Min(maxBlobsPerRank, blobCount - currentBlob));
            blobs[i] = new int[] { currentBlob, step };
            currentBlob = currentBlob + step;
        }
        return (object[])blobs;
    }
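To see how this assignment behaves, the following minimal sketch repeats the same arithmetic without any storage calls. The class name BlobAssignmentSketch, the method name Assign, and the blob and rank counts are hypothetical and chosen only for illustration; they are not part of the sample application.

```csharp
using System;

public static class BlobAssignmentSketch
{
    // Returns { firstBlobIndex, blobCount } for each rank, using the same
    // ceiling-based block partitioning as ComputeAssignment above.
    public static int[][] Assign(int blobCount, int nranks)
    {
        var assignments = new int[nranks][];
        int maxBlobsPerRank = (int)Math.Ceiling((double)blobCount / nranks);
        int currentBlob = 0;
        for (int i = 0; i < nranks; i++)
        {
            // Ranks beyond the last blob receive a count of zero.
            int step = Math.Max(0, Math.Min(maxBlobsPerRank, blobCount - currentBlob));
            assignments[i] = new[] { currentBlob, step };
            currentBlob += step;
        }
        return assignments;
    }

    public static void Main()
    {
        // Hypothetical sizes: 10 blobs shared among 4 ranks.
        var assignments = Assign(10, 4);
        for (int rank = 0; rank < assignments.Length; rank++)
        {
            Console.WriteLine("rank {0}: start={1}, count={2}",
                rank, assignments[rank][0], assignments[rank][1]);
        }
    }
}
```

With 10 blobs and 4 ranks, the first three ranks each receive three blobs and the last rank receives one. With fewer blobs than ranks (for example, 2 blobs and 4 ranks), the trailing ranks receive a count of zero, which is the case the reader has to handle by returning an empty array.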
Next, we implement the property DistributedDimension, which in this case is initialized to 1 so that slabs will be concatenated along the column dimension.
    public int DistributedDimension
    {
        get { return 1; }
        set { }
    }
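The ReadWorker method runs on each rank. It reads the blobs assigned to that rank, checks that every slab has the same number of rows, and copies the slab data into a local array: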
    public msnl.NumericDenseArray<double> ReadWorker(Object assignment)
    {
        var blobClient = new CloudBlobClient(accountName);
        var matrixContainer = blobClient.GetContainerReference(containerName);
        int[] blobs = (int[])assignment;
        long i, j, k;
        msnl.NumericDenseArray<double> outArray;
        var firstBlob = matrixContainer.GetBlockBlobReference("slab0");
        firstBlob.FetchAttributes();
        long rows = Convert.ToInt64(firstBlob.Metadata["dimension0"]);
        long[] columnsPerSlab = new long[blobs[1]];
        if (blobs[1] > 0)
        {
            // Get blob metadata, validate that each piece has equal number of rows
            for (i = 0; i < blobs[1]; i++)
            {
                var matrixBlob = matrixContainer.GetBlockBlobReference("slab" + (blobs[0] + i).ToString());
                matrixBlob.FetchAttributes();
                if (Convert.ToInt64(matrixBlob.Metadata["dimension0"]) != rows)
                {
                    throw new System.IO.InvalidDataException("Invalid slab shape");
                }
                columnsPerSlab[i] = Convert.ToInt64(matrixBlob.Metadata["dimension1"]);
            }
            // Construct output array
            outArray = msnl.NumericDenseArrayFactory.Create<double>(new long[] { rows, columnsPerSlab.Sum() });
            // Read data
            long columnCounter = 0;
            for (i = 0; i < blobs[1]; i++)
            {
                var matrixBlob = matrixContainer.GetBlobReference("slab" + (blobs[0] + i).ToString());
                var blobData = matrixBlob.DownloadByteArray();
                for (j = 0; j < columnsPerSlab[i]; j++)
                {
                    for (k = 0; k < rows; k++)
                    {
                        outArray[k, columnCounter] = BitConverter.ToDouble(blobData, (int)(j * rows + k) * 8);
                    }
                    columnCounter = columnCounter + 1;
                }
            }
        }
        else
        {
            // If a rank was assigned zero blobs, return empty array
            outArray = msnl.NumericDenseArrayFactory.Create<double>(new long[] { rows, 0 });
        }
        return outArray;
    }
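The inner loops of ReadWorker treat each blob as a flat buffer of 8-byte doubles stored column by column, so element (k, j) sits at byte offset (j * rows + k) * 8. The sketch below isolates that indexing using plain .NET arrays in place of NumericDenseArray. The class SlabDecodingSketch and its Encode helper are hypothetical; Encode exists only so the example has a buffer to decode, and the sketch assumes the bytes are in the machine's native byte order, as BitConverter does.

```csharp
using System;

public static class SlabDecodingSketch
{
    // Decodes a buffer holding rows * columns doubles in column-major order.
    public static double[,] Decode(byte[] blobData, int rows, int columns)
    {
        var result = new double[rows, columns];
        for (int j = 0; j < columns; j++)
        {
            for (int k = 0; k < rows; k++)
            {
                // Same offset arithmetic as ReadWorker: 8 bytes per double.
                result[k, j] = BitConverter.ToDouble(blobData, (j * rows + k) * sizeof(double));
            }
        }
        return result;
    }

    // Hypothetical inverse of Decode, used only to build demo input.
    public static byte[] Encode(double[,] values)
    {
        int rows = values.GetLength(0);
        int columns = values.GetLength(1);
        var buffer = new byte[rows * columns * sizeof(double)];
        for (int j = 0; j < columns; j++)
        {
            for (int k = 0; k < rows; k++)
            {
                BitConverter.GetBytes(values[k, j]).CopyTo(buffer, (j * rows + k) * sizeof(double));
            }
        }
        return buffer;
    }

    public static void Main()
    {
        var original = new double[,] { { 1, 2, 3 }, { 4, 5, 6 } };
        var decoded = Decode(Encode(original), 2, 3);
        Console.WriteLine(decoded[1, 2]); // last row, last column of the original
    }
}
```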
When an instance of the reader is invoked by the Microsoft.Numerics.Distributed.IO.Loader.LoadData method, ReadWorker is executed in parallel on each rank, and LoadData automatically concatenates the local pieces produced by the ReadWorker calls.
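As a rough, single-process picture of that concatenation step, the sketch below joins several local pieces along the column dimension (dimension 1). It is only an illustration with plain .NET arrays, not how LoadData is implemented; the class ColumnConcatSketch and the sample pieces are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnConcatSketch
{
    // Concatenates pieces side by side; all pieces must have the same row count.
    public static double[,] ConcatenateColumns(IList<double[,]> pieces)
    {
        if (pieces.Count == 0)
            throw new ArgumentException("At least one piece is required.");
        int rows = pieces[0].GetLength(0);
        if (pieces.Any(p => p.GetLength(0) != rows))
            throw new ArgumentException("All pieces must have the same number of rows.");

        int totalColumns = pieces.Sum(p => p.GetLength(1));
        var result = new double[rows, totalColumns];
        int offset = 0;
        foreach (var piece in pieces)
        {
            int columns = piece.GetLength(1);
            for (int r = 0; r < rows; r++)
                for (int c = 0; c < columns; c++)
                    result[r, offset + c] = piece[r, c];
            offset += columns;
        }
        return result;
    }

    public static void Main()
    {
        var pieces = new List<double[,]>
        {
            new double[,] { { 1, 2 }, { 5, 6 } },
            new double[2, 0],                      // a rank that was assigned no blobs
            new double[,] { { 3, 4 }, { 7, 8 } }
        };
        var full = ConcatenateColumns(pieces);
        Console.WriteLine("{0} x {1}", full.GetLength(0), full.GetLength(1));
    }
}
```

The empty piece in the middle mirrors the array with zero columns that ReadWorker returns for a rank with no blobs; it contributes nothing to the result but keeps the row count consistent.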
The source code in the Statistics.cs file implements the statistics operations performed on distributed data.
The sample data is stored at:
static string inputAccountName = @"http://cloudnumericslab.blob.core.windows.net";
This is a storage account for our data. It contains the samples of random numbers in publicly readable containers named “smalldata” and “mediumdata.”
At the beginning of the main entry point of the application, we initialize the Microsoft.Numerics distributed runtime. This allows us to execute distributed operations by calling Microsoft.Numerics library methods.
Microsoft.Numerics.NumericsRuntime.Initialize();
Next, we instantiate the array reader described earlier, and read data from blob storage.
    var dataReader = new AzureArrayReader.AzureArrayReader(inputAccountName, arraySize);
    var x = msnd.IO.Loader.LoadData<double>(dataReader);
The output x is a columnwise distributed array loaded with the sample data. We then compute the statistics of the data: min, max, mean, median and percentiles, and write the results to an output string.
    // Compute summary statistics: max, min, mean, median
    output.AppendLine("Summary statistics\n");
    var xMin = ArrayMath.Min(x);
    output.AppendLine("Minimum, " + xMin);
    var xMax = ArrayMath.Max(x);
    output.AppendLine("Maximum, " + xMax);
    var xMean = Descriptive.Mean(x);
    output.AppendLine("Mean, " + xMean);
    var xMedian = Descriptive.Median(x);
    output.AppendLine("Median, " + xMedian);
    // Compute 10% quantiles
    var tenPercentQuantiles = Descriptive.QuantilesExclusive(x, 10, 0).ToLocalArray();
Because x is a distributed array, the call resolves to the overload of QuantilesExclusive that distributes processing over the nodes of the Azure cluster. Note that the result of the quantiles operation is itself a distributed array, so we copy it to a local array before writing the result to the output string.
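For readers who want to sanity-check the summary statistics on a small sample, here is a local, single-process sketch using only LINQ. It is not the Microsoft.Numerics implementation and does not attempt to reproduce QuantilesExclusive; the class LocalSummarySketch and the sample values are hypothetical.

```csharp
using System;
using System.Linq;

public static class LocalSummarySketch
{
    // Median of a non-empty sample: the middle value, or the average of
    // the two middle values when the count is even.
    public static double Median(double[] values)
    {
        if (values.Length == 0)
            throw new ArgumentException("Sample must not be empty.");
        var sorted = values.OrderBy(v => v).ToArray();
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void Main()
    {
        var sample = new double[] { 4, 1, 3, 2 };
        Console.WriteLine("Minimum, " + sample.Min());
        Console.WriteLine("Maximum, " + sample.Max());
        Console.WriteLine("Mean, " + sample.Average());
        Console.WriteLine("Median, " + Median(sample));
    }
}
```

Running the same checks on a small subset of your own data can help confirm that the reader delivered the values you expect before you scale up to the full distributed computation.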
The application, by default, writes the result to the file system of the virtual cluster. This storage is not permanent; the file will be removed when you delete the cluster. To preserve the results, the application creates a public blob in the Azure storage account you supplied at the beginning of the application.
    // Write output to blob storage
    var storageAccountCredential = new StorageCredentialsAccountAndKey(outputAccountName, outputAccountKey);
    var storageAccount = new CloudStorageAccount(storageAccountCredential, true);
    var blobClient = storageAccount.CreateCloudBlobClient();
    var resultContainer = blobClient.GetContainerReference(outputContainerName);
    resultContainer.CreateIfNotExist();
    var resultBlob = resultContainer.GetBlobReference(outputBlobName);
    // Make result blob publicly readable,
    // so it can be accessed using URI
    // https://<accountName>.blob.core.windows.net/statisticsresult/statisticsresult
    var resultPermissions = new BlobContainerPermissions();
    resultPermissions.PublicAccess = BlobContainerPublicAccessType.Blob;
    resultContainer.SetPermissions(resultPermissions);
    resultBlob.UploadText(output.ToString());
You can then view and download the results by using a web browser to open the blob. For example, the syntax for the URI would be:
https://<accountName>.blob.core.windows.net/statisticsresult/statisticsresult
where <accountName> is the name of the Windows Azure storage account you supplied for the output (outputAccountName).