Computer Science Teacher
Computer Science Teacher - Thoughts and Information from Alfred Thompson

May, 2011

  • Computer Science Teacher - Thoughts and Information from Alfred Thompson

    Data! I need more data!


    Big data! I don’t know how many times lately I have read or heard that computer science students need to work with big data. But what is big data and where do you get it? If you have ever tried to build fake data you know it can be hard. This is especially true if you want the data to be “real” by some definition of real. Fortunately there is a huge amount of data on the Internet. The US Government has some great collections of data that are available in many formats, often including Excel, comma-delimited text files, HTML and others. Below are a few of my favorite data sources.

    The US Census Bureau has several data sets, including one about popular surnames from the 2000 Census, that you can download and use.

    For a list of popular baby names (more than the top 1000) you can visit the Social Security website. There is other data there as well.

    The Bureau of Labor Statistics has a lot of data, including this helpful page of Databases, Tables & Calculators by Subject.

    The National Center for Education Statistics (NCES) has data related to education. They even have some tools for building your own custom data sets which can be downloaded in several formats.

    Want some large text files for analysis and projects? Take a look at the large collection of free books at Project Gutenberg. There are books there in many languages, by the way!
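
    As one idea for a first project with a large text file, here is a minimal Python sketch that counts the most common words in a piece of text. The function name top_words and the file name in the comment are made up for illustration; the short sample string stands in for a downloaded book.

```python
import re
from collections import Counter


def top_words(text, n=5):
    """Return the n most common words in text, ignoring case."""
    # Runs of letters in any script (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(n)


# Stand-in text; with a real book you would read the file instead, e.g.
#   text = open("book.txt", encoding="utf-8").read()   # hypothetical file name
sample = "It was the best of times, it was the worst of times"
print(top_words(sample, 3))
```

    The same function works on any of the plain-text books, which makes it easy to compare word counts across authors or languages.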

    There are a couple of other links in the comments as I update this over the weekend. I really hope more of you will add your favorite online data sets. Thanks for the comments!

  • Computer Science Teacher - Thoughts and Information from Alfred Thompson

    How Many Letters In An Alphabet


    My mind tends to wander a bit when I drive long distances. Sometimes it goes in very weird directions. For some reason I started thinking about encoding characters in binary. Yeah, pretty weird. Anyway, I was reminiscing about RADIX-50 (aka RAD-50), which was a way, back in the days of expensive memory, of encoding file names in fewer binary words. Specifically, if you had 8 character file names and 3 character file extensions (which is what we had back in the day) you could use one byte for each character, which meant you had to use eleven bytes or six two-byte words. Using RAD-50 you could encode the same file name in four words by packing three characters in each word. Wow! Four words instead of six is a one-third saving in space. Of course you had to limit yourself to a small character subset – in this case 40 characters (50 in octal, hence the name). So I started asking myself how many bits does one need to encode a given number of characters? From there it was a simple jump (ok, it is in my weird mind) to “how many characters are there in an alphabet?”
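
    To make the packing concrete, here is a rough Python sketch of the idea: treat each character as a digit in base 40 and store three digits per 16-bit word. The character table below follows the PDP-11 style layout as an assumption (one position varied between systems), and the function names are invented for this example, not taken from any DEC tool.

```python
# Assumed 40-character set in the PDP-11 style; position 29 ("%" here)
# varied between systems, so treat this table as illustrative.
RAD50_CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ$.%0123456789"


def rad50_encode(name):
    """Pack a string into 16-bit words, three characters per word."""
    name = name.upper()
    name += " " * (-len(name) % 3)  # pad with spaces to a multiple of three
    words = []
    for i in range(0, len(name), 3):
        # str.index raises ValueError for characters outside the set.
        a, b, c = (RAD50_CHARS.index(ch) for ch in name[i:i + 3])
        # Three base-40 digits; the largest value is 63999, which fits in 16 bits.
        words.append(a * 1600 + b * 40 + c)
    return words


def rad50_decode(words):
    """Unpack 16-bit words back into characters (padding spaces included)."""
    chars = []
    for w in words:
        chars += [RAD50_CHARS[w // 1600], RAD50_CHARS[w // 40 % 40], RAD50_CHARS[w % 40]]
    return "".join(chars)


packed = rad50_encode("FILENAMEEXT")  # 8 character name plus 3 character extension
print(len(packed), "words")
print(rad50_decode(packed).rstrip())
```

    The trick works because 40 × 40 × 40 = 64,000, which is just under the 65,536 values a 16-bit word can hold.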

    If you are into subtle clues you may have noticed that I said “an alphabet” rather than “the alphabet.” Why? Well it turns out, and this does not seem to be widely known by American students, there are more alphabets than just the English alphabet. Some of them have more than 26 characters and some have fewer. So before you know how many bits you need you have to know which alphabet you are dealing with. Now some people will say “what does it matter? I am writing my application for English speakers.” In today’s world that is not going to cut it. Too much of the market for technology and software is international.
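
    The arithmetic behind the question is short enough to sketch in Python. For a fixed-width code, n symbols need the smallest number of bits b such that 2^b is at least n. The function name bits_needed is made up for this example, and the counts cover letters only (no digits, punctuation or upper/lower case pairs).

```python
def bits_needed(symbol_count):
    """Smallest number of bits that gives each symbol its own fixed-width code."""
    # (n - 1).bit_length() is the ceiling of log2(n) for n >= 2.
    return max(1, (symbol_count - 1).bit_length())


alphabets = [
    ("Greek letters", 24),
    ("English letters", 26),
    ("Russian letters", 33),
    ("RAD-50 character set", 40),
    ("7-bit ASCII", 128),
]
for name, size in alphabets:
    print(f"{name}: {size} symbols, {bits_needed(size)} bits")
```

    Notice that going from 26 letters to 33 crosses the 32-symbol boundary, so the Russian alphabet needs one more bit per letter than the English one.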

    This is why, where the computer industry used to standardize on ASCII and EBCDIC (two standards developed for the English language and very English-centric), we now use a lot more Unicode, which supports some 109,000 characters and 93 different scripts or alphabets. It also requires more bits per character. Good thing memory is cheap these days.
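
    How many more bits depends on the encoding used to store the characters. This small Python sketch prints the number of bytes a few sample characters take in three common Unicode encodings; the helper name encoded_sizes and the choice of sample characters are just for illustration.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in three Unicode encodings."""
    # The "-le" variants are used so that no byte order mark is added.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Latin A, e with acute accent, Cyrillic Zhe, a Chinese character, an emoji.
samples = ["A", "\u00e9", "\u0416", "\u4e2d", "\U0001F600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

    Plain English text stays at one byte per character in UTF-8, while characters from other scripts take two, three or four bytes.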

    Of course representing the characters is one thing; using them to sort data is another. Did you know that some scripts do not have a specific order and are not used for sorting? Me neither, but I ran into that while researching links for this post. And in some languages letters with special marks over them are new characters and in some they are not. And does it make a difference if the letters are uppercase or lowercase? Ouch. Some of the commonest things in the world are more complicated than we realize.

    Fortunately for most of us there are library routines to handle this stuff for us. Sort routines in the .NET framework, for example, have options for dealing with multiple languages, different or special sorting orders and of course different character sets. Jumping back to the beginning: when I was working for Digital Equipment Corporation, which invented and used the Radix-50 system, the various programming tools had functions to handle the character encoding and decoding. When I left there I had to write my own functions to do the decoding so that I could read magnetic tapes that had my personal data on them. That was educational.
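
    To see how much the sorting rules matter, here is a small Python sketch (Python rather than .NET, and only standard library calls) that sorts the same four words three ways. Stripping the marks with the hypothetical helper strip_marks is a crude stand-in for real language-aware sorting, which follows rules that differ from language to language.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks so that an accented letter compares like its base letter."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


words = ["zebra", "apple", "Banana", "\u00e9clair"]

by_code_point = sorted(words)  # raw character codes: uppercase before lowercase
ignoring_case = sorted(words, key=str.casefold)
ignoring_case_and_marks = sorted(words, key=lambda w: strip_marks(w).casefold())

for label, result in [
    ("code point order", by_code_point),
    ("ignoring case", ignoring_case),
    ("ignoring case and marks", ignoring_case_and_marks),
]:
    print(label, result)
```

    Each of the three orders is "correct" for somebody, which is exactly why the library routines offer so many options.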

    So what is the point? Well, the point is that these are things we don’t often think about, but people who want to get involved at the systems level, or even just properly understand the issues around making international products, do have to think about them. And by the way, if you read the Wikipedia article on Unicode you’ll find that for all the good intentions and smart people working on the problem there is still controversy.

    One last thing: if you are ever on an interview and someone asks “how many bits are needed to encode an alphabet,” be sure to ask them which one. It’s a trick question.

  • Computer Science Teacher - Thoughts and Information from Alfred Thompson

    Seven Rules for Beginning Programmers


    Paul Vick posted these Seven Rules for Beginning Programmers earlier this week and I have been thinking about them a lot. They make sense to me. As a professional developer you have to understand that these are the rules for beginners. The last item concludes with “You may go beyond these rules after you have thoroughly understood and mastered them.” It is important to remember that there are few absolutes, but at the same time beginners need to exercise some caution while they are mastering the basics. Though honestly, even professionals should “Never use language features whose meaning [they] are not sure of.” Of course there should be fewer of those for experienced professionals than for beginners. :-) Here is Paul’s list.

    1. Do not write long procedures. A procedure should not have more than ten or twelve lines.
    2. Each procedure should have a clear purpose. It should not overlap in purpose with the procedures that went before or come after. A good program is a series of clear, non-overlapping procedures.
    3. Do not use fancy language features. If you’re using something more than variable declarations, procedure calls, control flow statements and arithmetic operators, there is something wrong. The use of simple language features compels you to think about what you are writing. Even difficult algorithms can be broken down into simple language features.
    4. Never use language features whose meaning you are not sure of. If you break this rule you should look for other work.
    5. The beginner should avoid using copy and paste, except when copying code from one program they have written to a new one they are writing. Use as few files as possible.
    6. Avoid the abstract. Always go for the concrete. [Ed. note: This one applies unchanged.]
    7. Every day, for six months at least, practice programming in this way. Short statements; short, clear, concrete procedures. It may be awkward, but it’s training you in the use of a programming language. It may even be getting rid of the bad programming language habits you picked up at the university. You may go beyond these rules after you have thoroughly understood and mastered them.

    I expect some to take exception with something in rule 7 - “getting rid of the bad programming language habits you picked up at the university.” After all, isn’t the purpose of a university education to learn “the right way” to do things? Sort of. But the truth is that the types of projects that most students complete as part of their education don’t always lead to best practices. The problem is the size of the projects and also, in many cases, the artificiality of the projects. They have to be both small and somewhat artificial to fit in the concepts that are being taught. And in some schools software engineering is a dirty word (term) in computer science programs. That’s not all bad as long as people who become software developers understand that they have things to learn about design and about creating large scale projects.

    Copy and paste is another interesting case and I am glad to see it on the list. There is a huge temptation to copy and paste. Sometimes the idea is to use the new code as is, in which case one should really think about encapsulating the code in a method, function or subroutine. Or the idea is to copy/paste and then make a little modification. The problem with that is that one invariably misses something in the process. And of course you also have the possibility of copying something that is wrong, buggy or just plain not as close a fit as you thought. Copy/paste means trouble later in far too many cases for beginners to use it as often as they seem to like to do.
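
    As a tiny illustration of the “encapsulate it instead” advice, here is a Python sketch with made-up names and made-up scores. Rather than pasting the same arithmetic twice and editing the variable names by hand, the calculation lives in one short function.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


quiz_scores = [80, 90, 100]
lab_scores = [70, 85]

# Both call sites reuse the one function, so there is no pasted copy
# in which a variable name could be missed during editing.
quiz_average = average(quiz_scores)
lab_average = average(lab_scores)
print(quiz_average, lab_average)
```

    If the calculation ever needs to change, it changes in one place. It also happens to fit comfortably inside the ten or twelve line limit from rule 1.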

    So what do you think of these rules? I see that Garth is planning on posting them in his lab. Would you consider doing the same thing?
