Additional profile information on Alfred Thompson at Google+
Big data! I don’t know how many times lately I have read or heard that computer science students need to work with big data. But what is big data and where do you get it? If you have ever tried to build fake data you know it can be hard. This is especially true if you want the data to be “real” by some definition of real. Fortunately there is a huge amount of data on the Internet. The US Government has some great collections of data that are available in many formats, often including Excel, comma-delimited text files, HTML and others. Below are a few of my favorite data sources.
The US Census Bureau has several data sets, including one about popular surnames from the 2000 Census that you can download and use.
For a list of Popular Baby Names More than the Top 1000 you can visit the Social Security website. There is other data there as well.
The Bureau of Labor Statistics has a lot of data, including this helpful page of Databases, Tables & Calculators by Subject.
The National Center for Education Statistics (NCES) has data related to education. They even have some tools for building your own custom data sets which can be downloaded in several formats.
Want some large text files for analysis and projects? Take a look at the large collection of free books at Project Gutenberg. There are books there in many languages, by the way!
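Once you have downloaded one of these comma-delimited files, getting it into a program is only a few lines of code. Here is a minimal sketch in Python using the standard library’s csv module. The rows below are made-up sample data for illustration, not real census figures:

```python
import csv
from io import StringIO

# Made-up sample in the same shape as a downloaded name/count file.
# In a real project you would open the downloaded file instead:
#     with open("surnames.csv", newline="") as f:
sample = "name,count\nSMITH,100\nJOHNSON,90\nWILLIAMS,80\n"

# DictReader uses the header row as keys for each record.
rows = list(csv.DictReader(StringIO(sample)))
total = sum(int(row["count"]) for row in rows)
print(rows[0]["name"], total)  # prints: SMITH 270
```

From there students can sort, filter, chart or do whatever the assignment calls for.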
There are a couple of other links in the comments as I update this over the weekend. I really hope more of you will add your favorite online data sets. Thanks for the comments!
My mind tends to wander a bit when I drive long distances. Sometimes it goes in very weird directions. For some reason I started thinking about encoding characters in binary. Yeah, pretty weird. Anyway I was reminiscing about RADIX-50 (aka RAD-50), which was a way, back in the days of expensive memory, of encoding file names in fewer binary words. Specifically, if you had 8 character file names and 3 character file extensions (which is what we had back in the day) and used one byte for each character, that meant you had to use eleven bytes, or six 16-bit words. Using RAD-50 you could encode the same file name in four words by packing three characters into each word. Wow! A saving of a third of the space. Of course you had to limit yourself to a small character subset – in this case about 40 characters. So I started asking myself: how many bits does one need to encode a given number of characters? From there it was a simple jump (ok, it is in my weird mind) to “how many characters are there in an alphabet?”
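For the curious, the packing is simple arithmetic: each character becomes an index into the 40-character set, and three indices fit in one 16-bit word because 40 × 40 × 40 = 64,000 is less than 65,536. Here is a rough sketch in Python. The exact character table varied slightly between DEC systems, so treat the one here as an approximation:

```python
# Approximate RAD-50 character set: space, A-Z, $, ., %, 0-9 (40 total).
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"

def rad50_encode(text):
    """Pack text, three characters per 16-bit word, into word values.
    Raises ValueError if a character is outside the 40-character set."""
    text = text.upper().ljust(-(-len(text) // 3) * 3)  # pad to multiple of 3
    words = []
    for i in range(0, len(text), 3):
        a, b, c = (RAD50_CHARS.index(ch) for ch in text[i:i + 3])
        words.append(a * 40 * 40 + b * 40 + c)  # max 39*1600+39*40+39 = 63999
    return words

def rad50_decode(words):
    """Unpack a list of word values back into text (space padded)."""
    chars = []
    for w in words:
        chars.append(RAD50_CHARS[w // 1600])       # 1600 = 40 * 40
        chars.append(RAD50_CHARS[(w // 40) % 40])
        chars.append(RAD50_CHARS[w % 40])
    return "".join(chars)
```

An eleven character name like “FILENAMEEXT” comes out as four words instead of six, which is exactly the saving described above.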
If you are into subtle clues you may have noticed that I said “an alphabet” rather than “the alphabet.” Why? Well it turns out, and this does not seem to be widely known by American students, there are more alphabets than just the English alphabet. Some of them have more (and some fewer) than 26 characters. So before you know how many bits you need you have to know which alphabet you are dealing with. Now some people will say “what does it matter? I am writing my application for English speakers.” In today’s world that is not going to cut it. Too much of the market for technology and software is international.
This is why, where the computer industry used to standardize on ASCII and EBCDIC (two standards developed in the English-speaking world and very English centric), we now use a lot more Unicode, which supports some 109,000 characters and 93 different scripts or alphabets. It also requires more bits per character. Good thing memory is cheap these days.
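To put some numbers on the earlier question: a fixed-width encoding of n distinct characters needs the ceiling of log2(n) bits. A quick sketch:

```python
import math

def bits_needed(symbol_count):
    """Minimum bits for a fixed-width code covering symbol_count symbols."""
    return max(1, math.ceil(math.log2(symbol_count)))

print(bits_needed(26))        # English letters alone: 5 bits
print(bits_needed(40))        # the RAD-50 character set: 6 bits
print(bits_needed(0x110000))  # the full Unicode code space: 21 bits
```

Five bits cover the English letters, six cover the RAD-50 set, and the full Unicode code space (17 planes of 65,536 code points) needs 21 bits per character.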
Of course representing the characters is one thing; using them to sort data is another. Did you know that some scripts do not have a specific order and are not used for sorting? Me neither, but I ran into that while researching links for this post. And in some languages letters with special marks over them are new characters, and in some they are not. And does it make a difference if the letters are uppercase or lowercase? Ouch. Some of the commonest things in the world are more complicated than we realize.
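Here is a small Python sketch of why this is tricky. A plain code-point sort and a naive “strip the accents” key give different answers, and neither one is right for every language, which is exactly why real collation libraries exist:

```python
import unicodedata

def naive_collation_key(s):
    """Naive accent- and case-insensitive sort key: decompose and drop
    combining marks. Real collation (ICU, or CompareInfo in .NET) is far
    more nuanced; in some languages an accented letter is a distinct
    letter and must NOT be folded away like this."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

words = ["Zebra", "Äpfel", "apple"]
print(sorted(words))                           # code-point order: Äpfel sorts last
print(sorted(words, key=naive_collation_key))  # Äpfel now sorts near "apple"
```

Neither result is “the” correct order; that depends on whose rules you are sorting by.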
Fortunately for most of us there are library routines to handle this stuff for us. Sort routines in the .NET framework for example have options for dealing with multiple languages, different or special sorting orders and of course different character sets. Jumping back to the beginning: when I was working for Digital Equipment Corporation, which invented and used the RADIX-50 system, the various programming tools had functions to handle the character encoding and decoding. When I left there I had to write my own functions to do the decoding so that I could read magnetic tapes that had my personal data on them. That was educational.
So what is the point? Well the point is that these are things we don’t often think about but people who want to get involved at the systems level or even just properly understand the issues around making international products do have to think about them. And by the way if you read the Wikipedia article on Unicode you’ll find that for all the good intentions and smart people working on the problem there is still controversy.
One last thing: if you are ever on an interview and someone asks “how many bits are needed to encode an alphabet” be sure to ask them which one. It’s a trick question.
Paul Vick posted these Seven Rules for Beginning Programmers earlier this week and I have been thinking about them a lot. They make sense to me. As a professional developer you have to understand that these are the rules for beginners. The last item concludes with “You may go beyond these rules after you have thoroughly understood and mastered them.” It is important to remember that there are few absolutes but at the same time beginners need to exercise some caution while they are mastering the basics. Though honestly, even professionals should “Never use language features whose meaning [they] are not sure of.” Of course there should be fewer of those for experienced professionals than for beginners. Here is Paul’s list.
I expect some to take exception with something in rule 7 - “getting rid of the bad programming language habits you picked up at the university” After all, isn’t the purpose of a university education to learn “the right way” to do things? Sort of. But the truth is that the types of projects that most students complete as part of their education don’t always lead to best practices. The problem is the size of the projects and also, in many cases, the artificiality of the projects. They have to be both small and somewhat artificial to fit the concepts that are being taught. And in some schools software engineering is a dirty word (term) in computer science programs. That’s not all bad as long as people who become software developers understand that they have things to learn about design and about creating large scale projects.
Copy and paste is another interesting case and I am glad to see it on the list. There is a huge temptation to copy and paste. Sometimes the idea is to use the new code as is, in which case one should really think about encapsulating the code in a method, function or subroutine. Or the idea is to copy/paste and then make a little modification. The problem with that is that one invariably misses something in the process. And of course you also have the possibility of copying something that is wrong, buggy or just plain not as close a fit as you thought. Copy/paste means trouble later in far too many cases for beginners to use it as often as they seem to like to.
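The encapsulation point is worth a tiny, purely hypothetical example. Instead of pasting the same check into several places (and later fixing a bug in only some of the copies), wrap it in one function and call it everywhere. The names here are made up for illustration:

```python
def is_valid_score(value):
    """Shared validation used wherever a score is read in.
    One definition means a bug fix here lands at every call site."""
    return 0 <= value <= 100

# Every place that used to have a pasted copy of the check now calls it.
scores = [87, 101, 42]
valid = [s for s in scores if is_valid_score(s)]
print(valid)  # prints: [87, 42]
```

That is the habit the rule is pushing beginners toward: one source of truth instead of many drifting copies.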
So what do you think of these rules? I see that Garth is planning on posting them in his lab. Would you consider doing the same thing?