Genetic Programming is a fascinating field of study. Essentially, this is the study of software that writes software, selecting the software it has written that exhibits the highest degree of fitness, and allowing this software to continue to evolve over time. In essence, what Genetic Programming is trying to do is find some route to a higher degree of fitness that is more efficient than either random selection or intelligent design.

To understand the magnitude of the problem, consider how quickly the size of the problem can grow. Say, for example, that you are using IA-32 assembly language on a processor much like my humble little Pentium M in my Tablet PC. How many discrete instructions does the IA-32 instruction set provide? Hundreds. Let's simplify things and say that there are 375 available instructions. We'll ignore the register architecture for the moment, which adds additional complexity. There are, consequently, 54,993,666,708,469,390,869,140,625 different ways to construct a program that is only 10 instructions long. Most of these don't work. Of those combinations that do work, most of them don't do anything useful. The very small number of combinations which do perform some useful activity don't do very much of it.

Most of us are all too familiar with just how easy it is to come up with a combination of instructions that just doesn't work.

Now consider the fact that your program may not necessarily be 10 instructions long. You could potentially solve the exact same problem in 10 instructions, 5 instructions, or 38 instructions. In fact, any solution greater than 10 instructions could potentially be a successful solution that happens to use a lot of NOP instructions (unless, of course, the context of the EIP register happens to be important to determining the correctness of the solution). All put together, there are a LOT of ways to create a program to solve a problem, and the overwhelming majority of them are wrong.

Genetic Programming is an attempt to find a quicker way to the solution from this enormous number of potential solutions. It is not generally practical to try every possible solution to the problem to discover the best one. By starting off with some solutions, selecting the best, and then randomly mutating and recombining, Genetic Programming explores the problem domain much more quickly. Of course, because it covers the set of potential solutions randomly, it is possible that it could still miss the best solution, but any alternative that does not systematically test every possible combination of instructions is subject to the same limitation. The assumption, of course, is that solutions that are a smaller number of mutations away from the "best" solution will show a higher degree of fitness on the test used by your Genetic Programming algorithm.

I don't want to delve too far into Genetic Programming today. What is important to note about genetic programming is the unit of selection: genetic program modifies the source code of the most successful organisms when it is performing mutation or crossover operations.

When I was doing my initial exploration of the terminology in this post, I considered binary code to be analogous to DNA. This is still a perfectly suitable analogy in some regards. (That's the problem with analogies - they are never precise.) Binary code is still the device that drives the expression of phenotype. It is still sufficient to duplicate a piece of software. But, in one very important regard, I don't believe it is always the best analogy, unless you happen to be using binary code as the instruction set.

An important concept is the unit of selection. If you have written an application in C#, and are maintaining that application, those constructs which you happen to save in that C# source code are those that survive to the next generation. This is, indeed, the unit of selection. The binary source code is, of course, a hugely important aspect of the ontogeny of the organism, but I wanted to redefine the analogy I had used previously to be more suitable to some of the concepts that I believe are interesting to explore. As the unit of selection, the source code (in whichever language) is the most suitable analogy to DNA for my purposes.