Random Disconnected Diatribes of a p&p Documentation Engineer
Some years ago I was forcefully introduced to the concept of statistical quality control, where the overall quality of a batch of items could be determined from an examination of a small sample. This came to mind as I've been immersed in watching demos of the new "Big Data" techniques for analyzing data.
My rude introduction to the topic came about many years ago in a different industry from IT. I was summoned to appear at a large manufacturer in York, England, to look at a delivery my company had made of glass divider panels for railway carriages. The goods inward store manager bluntly informed me that they were rejecting a delivery of 500 panels because they did not meet quality control standards.
"OK", I said, "show me some of the faults". However, it turned out that I was taking a simplistic view of the quality control process. I assumed that they unpacked them all, examined each one as they were about to fit it, and rejected any that were damaged or out of specification. What they actually did was look up the total quantity on a chart, which told them how many to test. In my case it was 32. So they chose 32 at random and examined these. If more than one fails the test, they reject the whole batch because, statistically, there will be more than the acceptable number of defective items in the batch.
This struck me as odd because I knew that most of the batch would be perfect, some would be perhaps a little less than perfect (glass does, of course, get scratched easily), and only a few would be too bad to use. Our usual approach would be to simply replace any they found to be faulty as and when they came to fit them.
However, as the quality control manager patiently explained, this approach might work when you are installing windows in a house but isn't practical in most cases in manufacturing. If you get a delivery of 100,000 nuts and bolts, you can't examine them all - you just need to know that the number of faulty ones is below your preset acceptance level (perhaps 1%), and you simply throw away any faulty ones because it's not worth the hassle of getting replacements.
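For anyone who wants to see the arithmetic behind that chart, here is a rough sketch in Python. It uses the figures from my story (500 panels, 32 sampled, rejected if more than one fails) and assumes the sample is drawn at random without replacement, so the hypergeometric distribution applies. The function name and the defect counts I try are my own inventions for illustration, not anything the manufacturer showed me.

```python
from math import comb

def acceptance_probability(batch_size, defectives, sample_size, max_failures):
    """Chance that a random sample holds at most max_failures defective items."""
    total = comb(batch_size, sample_size)  # every possible sample of this size
    prob = 0.0
    for k in range(max_failures + 1):
        # k items drawn from the defective ones, the rest from the good ones
        ways = comb(defectives, k) * comb(batch_size - defectives, sample_size - k)
        prob += ways / total
    return prob

# Story figures: 500 panels, 32 tested, batch passes with at most one failure.
# The defect counts below are assumed values, chosen only to show the trend.
for bad in (5, 25, 50):
    p = acceptance_probability(500, bad, 32, 1)
    print(f"{bad} faulty panels in the batch: accepted {p:.1%} of the time")
```

The point is that the test never proves a batch is good or bad. It only makes a bad batch increasingly unlikely to slip through as the number of faulty items rises.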
Of course, you won't find that exactly one in every 100 is faulty and the other 99 are perfect. You might find a whole box of faulty ones in the batch, or that half of the batch is faulty and by chance you just happened to have tested the good ones. It's all down to averages and random selection of the samples. What worried me as I watched the demos of data analysis with Hadoop-based methods was how easily you could mistakenly rely on numbers that are, statistically, really only averages or trends.
For example, one demo used the AdventureWorks sample data to calculate the number of bicycles sold in each zip code area and then mapped this to a dataset obtained from Windows DataMarket containing the average ages of people in each zip code area. The result was that in one specific area people aged 50 to 60 were most likely to buy a bicycle. So the next advertising campaign for AdventureWorks in that area should be aimed at the older generation of consumers.
I did some back-of-an-envelope calculations for our street and I reckon that the average age is somewhere around the 45 to 55 mark. Yet the only people I see riding a bicycle are the couple across the road who are in their 30s, a lady probably in her late 20s who lives at the other end of the street, and lots of young children. I rather doubt that an advert showing two gray-haired pensioners enjoying the freedom of the outdoors by cycling through beautiful countryside on their new pannier-equipped sit-up-and-beg bicycles would actually increase sales around here. Though perhaps one showing grandparents giving their grandkids flashy new racing bikes for Christmas would work?
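To put some numbers on my envelope, here is a small Python sketch. The residents and their ages are entirely made up (I haven't been knocking on doors with a clipboard), and are chosen only so that the street average lands in the 45 to 55 range I guessed at. It shows how the average age of an area can say very little about the people in it who actually ride bicycles.

```python
from statistics import mean

# Hypothetical street: (age, rides_a_bicycle). Invented data, for illustration only.
residents = [
    (74, False), (70, False), (68, False), (66, False),
    (63, False), (60, False), (58, False),
    (35, True), (33, True), (28, True), (9, True), (7, True),
]

# What an area-level dataset would report: one average for the whole street.
street_average = mean(age for age, _ in residents)

# What you only see by looking at individuals: the average age of the cyclists.
cyclist_average = mean(age for age, rides in residents if rides)

print(f"Average age on the street: {street_average:.1f}")
print(f"Average age of the cyclists: {cyclist_average:.1f}")
```

Joining sales per area to average age per area only ever gives you the first number. The second needs data about the individual buyers.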
Maybe "Big Data", Hadoop, and HDInsight do give us new ways to analyze the vast tracts of data that we're all collecting these days. But what's worrying is that, without applying some deep knowledge of statistical analysis techniques, will we actually get valid answers?
It's amazing how often you get the feeling with computers that someone has virtually trampled on your toes, or unceremoniously shoved you out of the way. The latest Patch Wednesday updates (here in England, patch Tuesday usually catches up with us on Wednesday) seemed to coincide with a driver update for NVidia cards to resolve a vulnerability, and since then I've been trying to clean up some picture files that NVidia feel they are free to dump into the My Pictures folder.
Of course, at first you have to wonder how just installing a video card can open up your computer to remote attacks that can take over the whole machine. Not that I had any idea up until last week that I actually had NVidia GeForce cards in my two Media Center boxes. Or, until I looked at the preference settings in the video driver console, that the driver checked for updates every day - and obviously not very successfully if it needs a Patch Wednesday to kick off the update process.
But what galled me was that, after the update, I had a new folder in the public pictures folder full of weird 3D sample files that the video card doesn't actually recognize as I didn't install the 3D driver. Because we use Media Center as our main TV and entertainment system, I like to manage the pictures folder so that we can browse our collection of digital photos, and they are also displayed by the screensaver. I don't really want some unusable files dropped in there, especially by an update that doesn't bother to ask for my permission.
And what's worse is that, on one of the two machines, I can't delete them. The owner, and the only account with permissions to delete them, is the built-in SYSTEM account - which is presumably used by the driver update program. I managed to add my domain admin account to the list of permissions and even take ownership, but I still can't delete them. I have no idea why. All I managed to do was set the hidden flag on the folder so that they don't show up in Media Center. Yet on the other machine I was able to delete the 3D sample files using my domain admin account.
Of course, it could just be that the flat file system of my hard drive doesn't recognize the extra dimension...
Some friends have just adopted a rather cute ginger cat and decided to name it Juno, perhaps after the Queen of the Roman Gods. Though it regularly leads to the interesting conversation: "What's your cat's name?" - "Juno" - "No I don't, that's why I'm asking"...
Meanwhile, here at p&p we're just starting on a project named after one of the new religions of information technology: Big Data. It seems like a confusing name for a technology if you ask me (though you probably didn't). Does it just consist of numbers higher than a billion, or words like "floccinaucinihilipilification" and "pseudopseudohypoparathyroidism"?
Or maybe what they really mean is Voluminous Data, where there's a lot of it. Too much, in fact, for an ordinary database to be able to handle and query in a respectable time. Though most of the examples I've seen so far revolve around analyzing web server log files. It's hard to see why you'd want to invent a whole new technology just for that.
Of course, what's at the root of all this excitement is the map/reduce pattern for querying large volumes of distributed data, though the technology now encompasses everything from the Hadoop Distributed File System (HDFS) to connectors for Excel and other products to allow analysis of the data. And, of course, the furry elephant named Hadoop that sits in the middle remembering everything.
Thankfully Microsoft has adopted a new name for its collection of technologies previously encompassed by Big Data. Now it's HDInsight, where I assume the "HD" means "highly distributed". There's a preview in Windows Azure and a local server-based version you can play with.
What's interesting is that when I first started playing with real computers (an IBM 360) all data was text files with fixed width columns that the code had to open and iterate through, parsing out the values. The company where I worked used to have four distinctly separate divisions, each with its own data formats, but these had now been melded into one company-wide sales division. To be able to assemble sales data we had a custom program written in RPG 2 that opened a couple of dozen files, read through them extracting data, and assembled the summaries we needed - we'd built something vaguely resembling the map/reduce pattern. Though we could only run it at night because it prevented most other things from working by locking all the files and soaking up all of the processing resources.
Thankfully relational databases and Structured Query Language (SQL) put paid to all that palaver. Now we had a proper system that could store vast amounts of data and run fast queries to extract exactly what we needed. In fact we could even do it from a PC. And yet here we are, with our highly distributed data and file systems, going back to the world of reading multiple files and aggregating the results by writing bits of custom code to generate map and reduce algorithms.
But I guess when you appreciate the reasons behind it, and start to grasp the concepts of the vast amounts of data involved, our new (old) approach starts to make sense. By taking the processing to the data, rather than moving the data around, you get distributed parallel processing across multiple nodes, and faster responses. And when you discover just how vast some of the data is, you realize that our modern relational and SQL-based approach just doesn't cut it.
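For what it's worth, the idea of taking the processing to the data can be sketched in a few lines of Python. This is a toy that runs in one process, not Hadoop: the "nodes" are just lists of made-up sales lines, the function names are my own, and I've folded the per-node grouping into the map step for brevity, where a real job would emit key/value pairs and let the framework shuffle them.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each "node" holds its own chunk of region,product,quantity lines.
nodes = [
    ["north,bolts,120", "north,nuts,80", "south,bolts,40"],
    ["south,nuts,60", "north,bolts,30"],
    ["east,bolts,10", "south,bolts,25"],
]

def map_chunk(lines):
    """Runs where the data lives: raw lines in, per-product partial totals out."""
    partial = Counter()
    for line in lines:
        _region, product, quantity = line.split(",")
        partial[product] += int(quantity)
    return partial

def reduce_partials(left, right):
    """Merge two partial results; only these small summaries need to travel."""
    return left + right

# In a real cluster each map_chunk call would run in parallel on its own node.
partials = [map_chunk(chunk) for chunk in nodes]
totals = reduce(reduce_partials, partials, Counter())
print(dict(totals))
```

Each node hands back a summary that is much smaller than its raw data, which is where the saving comes from. It's also, give or take the locked files, roughly what our old RPG 2 program was doing.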
Though there are some interesting questions that nobody I've spoken to so far has answered satisfactorily. What happens when you need more than just a simple aggregate result? It seems likely that the map function needs to produce a result set that is considerably smaller than the data it's working on, and if there is little correspondence between the data in each node the reduce function won't be able to do much reducing.
Maybe I just don't get it yet. And maybe that's why being just a "database programmer" is no longer good enough. Now, it seems, you need to be a "data scientist". According to DataScientists.net, you need to know not only about Database Theory, but about Agile Manifesto and Spiral Dynamics as well. You're going to spend the rest of your life organizing, packaging, and delivering data rather than writing programs that simply run SQL queries.
But it does seem that data scientists get paid a lot more, so maybe this Big Data thing really is a good idea after all...
So it's New Year resolution time again, and it's pretty clear after many previous unsuccessful iterations that the usual crop of more exercise, better diet, and giving up smoking is a waste of time. Therefore, after several months of playing host to an assortment of builders and tradesmen, this year's resolution is more DIY.
What's annoying is that most tradesmen seem to be in a mad rush to get to the next job, and so don't have time for those little finishing touches (which, as my wife says I'm a perfectionist, are so important). Some days it really did feel like I might as well have done the job myself. For example, over the last several weeks I've been:
Meanwhile the people who delivered the rubbish skip and promised to come back for it the next morning left it here for a week so my front lawn now has a square hole that, after all the rain we've had, resembles a small swimming pool.
Realistically, though, many of the jobs they tackled were beyond my level of competence or patience. I can see that my attempts to plaster a ceiling or tile a floor would probably be a disaster, and completely rewiring a kitchen is likely to require some level of theoretical knowledge of the regulations that I don't have.
But, hopefully, it will be another fifteen years before we need to do anything else to the house. And I'll be retired long before then, so I'll have plenty of spare time. However, my wife says we're never going to go through this again - we're going to move house instead. Though I'm not sure that would be any less stressful...