Jomo Fisher--There are two kinds of programming problems: those that can run in parallel and those that can’t. There’s no special sauce that can turn an inherently serial problem into an efficient parallel algorithm. Today’s top-end desktop has eight cores spread across two processors. All of the cores are identical, and each is as fast and feature-rich as its manufacturer can make it.
For the algorithms that can’t run in parallel, having eight cores is like trying to get your Christmas shopping done faster by strapping eight Ferraris together and driving them from store to store. Eventually you’ll give up and just use the one with the most gas in the tank.
On the other hand, for problems that can be run in parallel, it’s better to have many processors than to have a small number of very fast processors. Think of sending out a hundred robot minions on Vespas to Christmas shop for you.
Now imagine a world in which you have one very fast processor and a bunch of smaller slave processors. (Author takes out napkin and crayon and starts madly scribbling.) Let’s say the smaller ones are just good enough to host something like the .NET runtime or a C++ runtime and to execute a few concurrent UI-less processes. They communicate with each other over relatively slow TCP/IP, they have a modest amount of non-shared read-write RAM (let’s say 256 MB), and they have a modest amount of shared read-only RAM so that each doesn’t need its own copy of the runtime. There would need to be an onboard TCP/IP switch to keep the communication relatively snappy.
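To make the shape of this concrete, here’s a toy sketch of my own (not a design from the post): a fast "master" process farms independent work items out to small workers over TCP. Workers are plain threads on localhost standing in for the Vespa cores, and squaring a number stands in for real work.

```python
# Toy illustration: master/worker over TCP. Each "minion" accepts one
# connection, squares the number it receives, and sends the result back.
import socket
import threading

def worker(server_sock):
    """A minion: accept one connection, square the number received."""
    conn, _ = server_sock.accept()
    with conn:
        n = int(conn.recv(1024).decode())
        conn.sendall(str(n * n).encode())

def start_workers(count):
    """Start `count` listening workers; return their port numbers."""
    ports = []
    for _ in range(count):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("127.0.0.1", 0))  # let the OS pick a free port
        srv.listen(1)
        ports.append(srv.getsockname()[1])
        threading.Thread(target=worker, args=(srv,), daemon=True).start()
    return ports

def master(jobs):
    """The fast core: hand one job to each worker, gather the results."""
    results = []
    for port, n in zip(start_workers(len(jobs)), jobs):
        with socket.create_connection(("127.0.0.1", port)) as conn:
            conn.sendall(str(n).encode())
            results.append(int(conn.recv(1024).decode()))
    return results

print(master([1, 2, 3, 4]))  # → [1, 4, 9, 16]
```

A real master would overlap the sends and receives instead of visiting workers one at a time; this version keeps the plumbing visible.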
Let’s crunch some numbers. Today’s Core 2 Quad has 582 million transistors. Let’s say our scaled-down Vespa CPU is 21 million transistors; that’s half the size of a P4 but maybe still too large for what I have in mind. For simplicity, let’s say our onboard dedicated TCP/IP switch is 10 million transistors. I know that’s overkill, but let’s use it for a first-order approximation.
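Where does the 1+48 below come from? Here’s a back-of-the-napkin check, under my reading of the figures: the transistor budget is two Core 2 Quad dies (today’s eight-core top end), and the one fast core costs a quarter of a single quad. Those splits are assumptions, not numbers from the post.

```python
# Back-of-the-napkin check of the "1 + 48" figure (all counts in millions
# of transistors; the budget split is my own assumption).
budget = 2 * 582       # two Core 2 Quad dies = today's eight-core top end
fast_core = 582 // 4   # one full-speed Core 2 class core
switch = 10            # the dedicated onboard switch
vespa = 21             # one scaled-down slave core

slaves = (budget - fast_core - switch) // vespa
print(slaves)  # → 48
```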
So, for the same transistor budget as today’s top end, we could have 1+48 cores: a machine that runs single-threaded code as well as today’s systems and, for parallel problems, is significantly better. Now that’s a machine I’d like to find under my Christmas tree this year.
Update August 23, 2007--I just noticed this. It looks like some smart folks were already busy working on such a thing. Now if we could just get the .NET Framework running on it…
This posting is provided "AS IS" with no warranties, and confers no rights.
I like the one-good-core-plus-lots-of-little-ones concept very much. I don't get why you'd use a TCP/IP stack to communicate between the processors, though...
I think maybe TCP/IP is too literal. What I'm thinking of is a flexible communication structure whose complexity scales linearly with the number of processors.
I think you missed the point.
Currently my computer has 48 processes running. If I had one CPU per process, each application could run at full speed instead of constantly fighting over the same CPU.
I would get this gain without any change whatsoever to my existing applications.
I think that multi-core is a short term stepping stone for all vendors.
All of the vendors have many-core solutions on their roadmaps, and while your observation that not all tasks can be run in parallel is correct, the availability of other cores lets your serial application run in a much less resource-constrained environment.
Many-core is coming; multi-core is our opportunity to make sure we are appropriately equipped to take advantage of parallelism where appropriate.
Sounds similar to the Cell CPU.
And no, the TILE64 is not similar to what you are talking about. It has 64 identical processors which have reconfigurable interconnects.
If you read a bit about TILE64, you would find that you can compile C code to it. So any .NET implementation written in C would work.
I noticed the TILE64 has a variation that runs on a card in the PCI slot. I was thinking the main CPU would be the one in my PC and the slave processors would be the ones on the card.
Yeah, you're right about Cell, but when I wrote this I was thinking (though not accurately conveying) that mainstream PC processors should head this way.
A slightly different way to put your proposition (I don't buy that desktop SMP is dead) is that you'd rather have, say, one Wintel-compatible, as-fast-as-possible CPU plus 32 simpleton massively parallel CPUs than 8 identical Wintel CPUs.
I pulled the 1+32 vs. 8 out of thin air, but there is some tradeoff point that would give you more simple CPUs.
The problem with this approach is finding a killer app for the simpleton cpus and creating all the dev tools to go with it since it is an incompatible architecture. There is also the question of whether so many more cpus makes the problem of accessing memory over the system bus so much worse that you still lose versus fewer more powerful cpus.
I tend to think the world will have enough trouble making massively parallel software even if most of the CPUs are the same. However, I can easily envision special-purpose apps. For example, imagine pairing the latest graphics GPU (basically a microprocessor) with a 64-CPU TILE64 on the same board, with very high-speed memory and dedicated software, to do graphics (pretty well-understood parallel software) at unheard-of speeds. If it worked, by the next year they'd have a 1+64 all-on-one-chip equivalent, and you could buy one at the store for $69.95.
PS: Can't we put to rest once and for all the idea that Windows running 64 threads could keep 64 cores busy? Take a look at the column that shows CPU utilization: nearly all of those threads are asleep, using 0% CPU, waiting for something to happen. Windows will do well just to keep 4 cores busy under the worst cases on a desktop.
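The point about sleeping threads is easy to demonstrate. This little Python sketch (my own, not from the comment) parks 64 threads on an event and then measures how much CPU time the process actually burns in a second of wall-clock time: essentially none.

```python
# 64 threads blocked on an event consume ~0 CPU while they wait.
import threading
import time

gate = threading.Event()
threads = [threading.Thread(target=gate.wait) for _ in range(64)]
for t in threads:
    t.start()

start_cpu = time.process_time()  # CPU seconds used by this process
time.sleep(1.0)                  # 64 threads "running" for one second...
cpu_used = time.process_time() - start_cpu

gate.set()                       # wake everyone up and clean up
for t in threads:
    t.join()

print(f"CPU seconds consumed by 64 waiting threads: {cpu_used:.3f}")
```

Task Manager's CPU column is showing the same thing: thread count and core demand are very different numbers.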
Re: your PS. I couldn't agree more. I'm not sure why this misunderstanding is popping up everywhere right now.
Your comment makes me think the thing we're missing is a true killer app for general-purpose parallelism. Vision? The possibilities here boggle my mind, but the software isn't here yet.
Bob and Jomo,
Asymmetric multiprocessing (aSMP) is, IMHO, a big part of why porting titles to the PS3 (which uses the Cell: one main CPU plus 7-8 special-purpose processors) hasn't happened as quickly as hoped.
Major Nelson has a four-part article series about why aSMP isn't appropriate for games, but the analysis might apply equally well to application software. The bulk of the CPU's time is spent doing operations (fetch, load, store) that don't benefit from the specialized processors. Put another way, most programs don't spend enough time doing things that specialized processors do well to justify the tradeoff (speed vs. flexibility).
Hmmm, your post has nearly perfectly described the products from Azul Systems. They have custom silicon for running threaded JVMs, so-called "compute appliances". It's actually rather neat. I think you're also trivializing the problems (and advantages?) of shared memory.
This is why software engineers write software, and electronics engineers make hardware. I feel like my eyes have been assaulted :P!
I see where you're coming from, but the fact of the matter is that the bottleneck in most systems is caching and memory access, not the number of cores or their speed. This is why the current quad core is actually two dual cores back to back: a 4-cores/1-cache system is extremely hard to get right.
Secondly, you would never, ever implement communications across the CPUs through TCP/IP. You would use a multi-access register or a dedicated on-chip bus to communicate, as TCP/IP would be too expensive in both silicon and time.
That said, the idea of several less powerful cores to do work with is very cool. For instance, the .NET thread pool could be implemented at the hardware level, so that when a thread is taken from the pool, it's literally running on a core (or a timeslice on a core).
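Today that pooling happens entirely in software. Here's the software analogue in miniature (Python's `concurrent.futures` standing in for the .NET ThreadPool, with squaring as a placeholder work item): work items are multiplexed onto a handful of pooled threads, where the hardware version would drop each item directly onto a small core.

```python
# Software thread pool: queued work items multiplexed onto a few threads.
from concurrent.futures import ThreadPoolExecutor

def work(n):
    return n * n  # placeholder for a real queued work item

# Imagine each of these 4 workers were instead one of the small cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, range(8)))

print(results)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```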
I would like to see a core that could run x86 instructions, with an instruction decoder that had a second microcoded instruction set for performing IL instructions and thus removing JIT overhead.
I realise that you can't do everything through microcode (i.e., precalculated jump addresses, etc.), but any reduction in JIT time would significantly decrease startup time. Also, the underlying calls could be implemented more effectively on a non-x86 core optimised for .NET-style instructions.
Food for thought.