You sit at the airport only to watch your departure time get delayed. You wait. Your flight gets delayed again, and you wonder “what’s happening?” Can you predict how long it will take to arrive at your destination? Are there many short delays in front of you or just a few long delays? This example demonstrates how you can use “Cloud Numerics” to sift through a large enough cross section of air traffic data to answer these questions. We will use on-time performance data from the U.S. Department of Transportation to analyze the distribution of delays. The data is available at:
This data set holds data for every scheduled flight in the U.S. from 1987 to 2011 and is, as one would expect, huge! For your convenience, we have uploaded a sample of 32 months (one file per month, with about 500,000 flights in each) to Windows Azure Blob Storage at container URI: http://cloudnumericslab.blob.core.windows.net/flightdata (note that you cannot access this URI directly in a browser; you must use the Storage Client C# APIs, as in Step 2, or a REST API query).
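As a rough illustration of the REST route mentioned above, the following Python sketch builds the query URL for the Blob service's List Blobs operation on the container. It only constructs the URL and does not send a request. The helper name `list_blobs_url` is hypothetical, and whether the container is still available for anonymous listing is an assumption that is not confirmed here.

```python
from urllib.parse import urlencode

# Container URI quoted in the article.
CONTAINER_URI = "http://cloudnumericslab.blob.core.windows.net/flightdata"


def list_blobs_url(container_uri, max_results=None):
    """Build a List Blobs REST query URL for a blob container.

    Hypothetical helper for illustration; it performs no network access.
    """
    # 'restype=container' and 'comp=list' select the List Blobs operation.
    params = {"restype": "container", "comp": "list"}
    if max_results is not None:
        # Optional cap on the number of blobs returned per response.
        params["maxresults"] = str(max_results)
    return f"{container_uri}?{urlencode(params)}"


print(list_blobs_url(CONTAINER_URI))
```

A GET request to the resulting URL would return an XML listing of blob names, provided the container permits anonymous listing.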
You should use two to four compute nodes when deploying the application to Windows Azure. One node might not have enough memory, and for larger deployments there are not enough files in the sample data set to assign to all distributed I/O processes. You should not attempt to run the application on a local system because of data transfer and memory requirements.
First, let’s create a Cloud Numerics application using the Cloud Numerics Visual Studio project template.
1. Start Microsoft Visual Studio and create a solution from the Cloud Numerics project template. Let’s call the solution “OnTimeStats.”
2. Create a new application, and delete the sample code from within the MSCloudNumericsApp subproject’s Program.cs file. Replace it with the following skeleton of an application.
3. Add a .NET reference to Microsoft.WindowsAzure.StorageClient. This assembly is used for reading in data from blobs, and writing results.
We’ll use the IParallelReader interface to read in the data from blob storage in the following manner:
For Step 2, add the following code to the FlightInforReader class in your skeleton application.
After reading in the delays, we compute the mean, center the data on the mean (so that worse-than-average delays are positive and better-than-average ones negative), and then compute the sample standard deviation. Then, to analyze the distribution of the data, we count how many flights, on average, are more than k standard deviations away from the mean. We also keep track of how many flights are above or below the mean, so as to detect any skew in the data.
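The steps just described can be sketched on a single machine in plain Python. This is only an illustration of the arithmetic, not the Cloud Numerics code used in the application. The function name `delay_statistics` and the small `sample` list are made up for the example.

```python
import math


def delay_statistics(delays, max_k=5):
    """Summarize a list of delays in minutes (illustrative helper)."""
    n = len(delays)
    mean = sum(delays) / n
    # Center on the mean: worse-than-average delays become positive.
    centered = [d - mean for d in delays]
    # Sample standard deviation (divides by n - 1).
    std = math.sqrt(sum(c * c for c in centered) / (n - 1))
    # Fractions of flights above and below the mean, to detect skew.
    above = sum(1 for c in centered if c > 0) / n
    below = sum(1 for c in centered if c < 0) / n
    # For each k: (fraction more than k std above, fraction more than k std below).
    tails = {}
    for k in range(1, max_k + 1):
        tails[k] = (
            sum(1 for c in centered if c > k * std) / n,
            sum(1 for c in centered if c < -k * std) / n,
        )
    return mean, std, above, below, tails


# Made-up delays in minutes, not taken from the flight data set.
sample = [-10, -5, -5, 0, 0, 0, 5, 10, 20, 85]
mean, std, above, below, tails = delay_statistics(sample, max_k=2)
print(mean, round(std, 2), above, below, tails)
```

In this toy sample, one long delay pulls the mean up, so most flights end up below the mean. That asymmetry is the kind of skew the above and below counts are meant to reveal.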
For example, if the data were normally distributed, one would expect it to be symmetric around the mean and to obey the three-sigma rule. Equivalently, about 16% of flights would be delayed by more than 1 standard deviation beyond the mean, 2.3% by more than 2 standard deviations, and 0.13% by more than 3.
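These one-sided percentages can be checked with the complementary error function from Python's standard library. The sketch below is independent of the application code, and `normal_upper_tail` is a name chosen for the example.

```python
import math


def normal_upper_tail(k):
    """Return P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2))


for k in (1, 2, 3):
    share = 100 * normal_upper_tail(k)
    print(f"more than {k} standard deviation(s) above the mean: {share:.2f}%")
```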
To implement the algorithm, we add the following code as the Main method of the application.
We add the WriteOutput method that writes results to binary large object (blob) storage as .csv-formatted text. The WriteOutput method will create a container “flightdataresult” and a blob “flightdataresult.csv.” You can then view this blob by opening your file in Excel. For example: http://<myAccountName>.blob.core.windows.net/flightdataresult/flightdataresult.csv
where <myAccountName> is the name of your Windows Azure account.
Let’s fill in the last missing piece of the application, the WriteOutput method, with the following code.
Note that you’ll have to change "myAccountKey" and "myAccountName" to reflect your own storage account key and name.
Now, you are ready to run the application. Set AppConfigure as the startup project, build the application, and use the Cloud Numerics Deployment Utility to deploy and run the application.
Let’s take a look at the results. We can immediately see they’re not normally distributed at all. First, there’s skew: about 70% of the flight delays are better than the average of 5 minutes. Second, the number of delays tails off much more gradually than a normal distribution would as one moves away from the mean towards longer delays. A step of one standard deviation (about 35 minutes) roughly halves the number of delays, as we can see in the sequence 8.5%, 4.0%, 2.1%, 1.1%, 0.6%. These findings suggest that the tail could be modeled by an exponential distribution.
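The halving pattern can be checked directly from the quoted percentages. The short Python sketch below assumes the five values correspond to successive one-standard-deviation steps, as the text describes, and compares their step-to-step ratios with those of a normal distribution.

```python
import math

# Tail percentages quoted in the results, one per standard-deviation step.
tail_percent = [8.5, 4.0, 2.1, 1.1, 0.6]

# Ratio between each step and the previous one. A roughly constant
# ratio is the signature of an exponential tail.
ratios = [b / a for a, b in zip(tail_percent, tail_percent[1:])]


def normal_upper_tail(k):
    """Return P(Z > k) for a standard normal variable Z."""
    return 0.5 * math.erfc(k / math.sqrt(2))


# The same step ratios for a normal distribution, for comparison.
normal_ratios = [normal_upper_tail(k + 1) / normal_upper_tail(k) for k in (1, 2, 3, 4)]

print([round(r, 2) for r in ratios])
print([round(r, 3) for r in normal_ratios])
```

The observed ratios stay close to one half, while the normal ratios shrink quickly with each step.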
This result is both good news and bad news for you as a passenger. There is a good 70% chance you’ll arrive no more than five minutes late. However, the exponential nature of the tail means —based on conditional probability— that if you have already had to wait for 35 minutes there’s about a 50-50 chance you will have to wait for another 35 minutes.
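The 50-50 figure follows from the memoryless property of the exponential distribution. As a sketch, assume the tail of the delay T is modeled as exponential with rate λ, and that one step of σ ≈ 35 minutes halves the tail, as observed above. The symbols T, λ, s, and σ are introduced here for illustration only.

```latex
\begin{align*}
P(T > t) &= e^{-\lambda t}, \qquad
e^{-\lambda \sigma} = \tfrac{1}{2}
\;\Longrightarrow\; \lambda = \frac{\ln 2}{\sigma}, \\
P(T > s + \sigma \mid T > s)
  &= \frac{P(T > s + \sigma)}{P(T > s)}
   = \frac{e^{-\lambda (s + \sigma)}}{e^{-\lambda s}}
   = e^{-\lambda \sigma}
   = \tfrac{1}{2}.
\end{align*}
```

Under this model the result does not depend on s: however long you have already waited, the chance of waiting at least another 35 minutes stays at about one half.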