Burton Smith joined Microsoft last year as a technical fellow. Before joining Microsoft, he was chief scientist and a member of the board of directors at Cray Inc. Smith focuses on working with existing groups within Microsoft to help expand the company’s efforts in the areas of parallel and high-performance computing. He reports directly to Craig Mundie, chief technical officer and senior vice president for Advanced Strategies and Policy.
Last June, at the International Supercomputing Conference in Dresden, he delivered a keynote titled "Reinventing Computing". Below is a summary of some of his key messages:
The only way to get more performance out of a single-threaded processor is to increase its clock speed, and the only way to do that is through increased power consumption and all the costs associated with it. Multicore chips offer a different, but inherently parallel, route to boosting performance, and performance has always been the chief characteristic of HPC systems. So the lessons learned in building HPC systems can now start to be applied in general-purpose computing.
The fundamental problem, he suggested, is that uniprocessor performance is levelling off: instruction-level parallelism, power consumption and cache limitations are all "walls" that are now being hit. Having multicore processors does not change this if the underlying architecture has not changed, and it means the chips become difficult to program.
The Instruction Level Wall arises from the limits of the uniprocessor instruction architecture, which are now being reached. Several issues restrict the level of concurrency possible within a single instruction stream, such as control-dependent computation and data-dependent memory addressing, and together they limit such architectures to a few instructions per clock cycle.
The Power Wall is now coming into play more significantly. As an example, he noted that it is possible to scale up the hardware by a factor of sigma, but the power then scales by sigma as well. Scaling the clock frequency by sigma is worse, because the dynamic power then scales by sigma cubed.
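The sigma-cubed figure can be made concrete with a small calculation. The sketch below is not taken from the keynote; it assumes the textbook model of dynamic power, P = C * V^2 * f, together with the common approximation that supply voltage has to rise in proportion to clock frequency. The function names are made up for illustration.

```python
def power_with_more_hardware(sigma: float) -> float:
    """Relative power when the amount of hardware (for example, the number
    of cores) grows by sigma at unchanged clock frequency and voltage."""
    return sigma


def power_with_faster_clock(sigma: float) -> float:
    """Relative dynamic power when the clock frequency grows by sigma.

    Assumption: P = C * V^2 * f, with supply voltage V scaled in
    proportion to frequency f and capacitance C held constant.
    """
    voltage = sigma      # assumed to track frequency
    frequency = sigma
    return voltage ** 2 * frequency


if __name__ == "__main__":
    for sigma in (1.0, 1.5, 2.0, 4.0):
        print(
            f"sigma={sigma}: more hardware -> {power_with_more_hardware(sigma)}x power, "
            f"faster clock -> {power_with_faster_clock(sigma)}x power"
        )
```

Under these assumptions, doubling the hardware doubles the power, while doubling the clock costs eight times the dynamic power, which is why adding cores is the cheaper route to more throughput.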
The Memory Wall concerns how much a cache has to grow in order to cut the cache miss rate in half. The required growth in cache capacity depends on the kind of computation whose data is being fetched and stored: the more complex the access pattern, the more the cache needs to grow. For example, for dense matrix-matrix multiplication the cache needs to be four times bigger. For a fast Fourier transform, the cache size has to be squared to halve the miss rate. So the issue is not only increasing cache size, but also increasing the bandwidth and reducing the latency of the channel serving the cache.
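The two cache figures follow from how the miss rate of each algorithm falls as the cache grows. The sketch below is an illustration rather than anything shown in the keynote. It assumes the standard asymptotic models, with constant factors dropped: for a cache of Z words, the miss rate of a blocked dense matrix multiply falls as 1/sqrt(Z), and that of an FFT falls as 1/log(Z). The function names are hypothetical.

```python
import math


def matmul_miss_rate(cache_words: int) -> float:
    """Relative miss rate of blocked dense matrix multiply.
    Assumed model: proportional to 1 / sqrt(Z), constants omitted."""
    return 1.0 / math.sqrt(cache_words)


def fft_miss_rate(cache_words: int) -> float:
    """Relative miss rate of an FFT.
    Assumed model: proportional to 1 / log2(Z), constants omitted."""
    return 1.0 / math.log2(cache_words)


def cache_to_halve_matmul_misses(cache_words: int) -> int:
    """1/sqrt(4Z) is half of 1/sqrt(Z), so the cache must quadruple."""
    return 4 * cache_words


def cache_to_halve_fft_misses(cache_words: int) -> int:
    """log2(Z^2) is twice log2(Z), so the cache size must be squared."""
    return cache_words ** 2


if __name__ == "__main__":
    z = 1024  # hypothetical starting cache size, in words
    print("matmul:", matmul_miss_rate(z), "->",
          matmul_miss_rate(cache_to_halve_matmul_misses(z)))
    print("fft:   ", fft_miss_rate(z), "->",
          fft_miss_rate(cache_to_halve_fft_misses(z)))
```

In this model, a 1024-word cache would need to grow to 4096 words to halve the matrix-multiply miss rate, but to more than a million words to do the same for the FFT, which shows why capacity alone cannot solve the problem.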