February 24, 2010 -- In configuring next-generation large-scale parallel processing arrays, some teams are relying on "heterogeneous processing" -- basically a fifty-cent phrase for a microprocessor with one or more on-board co-processors for high-speed on-node processing, most typically a GPU, FPGA, Cell, and/or DSP. While the debate continues about the right ratio of microprocessors to co-processors, most teams agree that the basic plumbing of memory management can be the real bottleneck. Today the only practical solution is to have the microprocessor and co-processors share memory on the node, and to interconnect many nodes with GigE, InfiniBand, or a custom interconnect, configuring the nodes in a distributed-memory layout.
Enter the unintended consequence of scaling. Amdahl's law says that as you add more processors, you get bogged down by more overhead. Basically, the Nth bricklayer you add to build a brick wall begins to slow things down, because all the bricklayers are reaching for bricks from the same pile and getting in each other's way. Add another N bricklayers and it only gets worse. So the idea is to complement the original processor (the first bricklayer) with a co-processor that makes that bricklayer more efficient (faster), independent of any other bricklayer. Imagine a machine that hands the bricklayer a pre-cemented brick, so all they need to do is place it. Or, there is always the old analogy:
"I know how to make 4 horses pull a cart - I don't know how to make 1024 chickens do it." -- Enrico Clementi
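The diminishing returns the bricklayer analogy describes fall out of Amdahl's formula directly. A minimal sketch (the 95% parallel fraction is an assumed, illustrative number, not from the article):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: overall speedup when only parallel_fraction
    of the work can be spread across n_processors; the serial
    remainder runs at its original speed regardless of n."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# With 95% of the work parallelizable, speedup plateaus well
# below the processor count -- the serial 5% dominates:
for n in (4, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 1))
```

No matter how many processors are added, the speedup here can never exceed 1/0.05 = 20x, which is exactly why making each "bricklayer" individually faster is attractive.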
Using co-processors dodges Amdahl's law by using more powerful nodes, so fewer of them are needed to reach the same level of performance. While this approach is successful, it puts more burden on the programmer, who must devise a heterogeneous programming model and successfully implement it both on a given node and across multiple nodes. How does the programmer deploy the algorithm in this new environment? Can it be emulated in one simulation? How does the programmer debug a multi-node program, with every node using co-processors?
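The "fewer, more powerful nodes" trade-off can be made concrete with a toy model. All numbers here are assumptions for illustration (an 8x per-node acceleration factor, a 95% parallel fraction); the sketch only shows the shape of the argument, not measured results:

```python
def amdahl_speedup(parallel_fraction, n):
    """Amdahl's law speedup for n equal nodes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)

p = 0.95       # assumed parallel fraction of the workload
accel = 8      # assumed speedup of one co-processor-equipped node

# 128 ordinary nodes versus 16 accelerated nodes (same raw
# aggregate compute: 16 * 8 = 128 node-equivalents).
plain = amdahl_speedup(p, 128)
accelerated = accel * amdahl_speedup(p, 16)
print(round(plain, 1), round(accelerated, 1))
```

Because the 16-node machine pays the serial/contention penalty over far fewer nodes, its effective speedup comfortably exceeds that of the 128-node machine with the same nominal horsepower.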
This article will discuss these basics within the tool flow, and then focus primarily on memory-mapping issues at the low end with FPGA-enabled co-processing and at the high end with thousand-processor arrays.
By Dave Strenski and Brian Durwood. (Durwood is the co-founder of Impulse Accelerated Technologies, Inc. and Strenski is an application analyst at Cray, Inc.)
This brief introduction has been excerpted from the original copyrighted article.
View the entire article on the EE Times Programmable Logic Designline website.