Teχlog: GPU Computing and Fusion

During the recent Analyst Day, AMD talked about something I haven't mentioned yet, partly because it sort of slipped my mind, and partly because it really deserves its own post. Most news reports haven't made a big deal out of it, in part because it's not exactly unexpected, but I think that's a mistake, and perhaps many people don't realize how significant it is.

I'm talking about Fusion, and it's impact on GPU computing (or GPGPU if you prefer). To understand why it's important, you must first look at the current state of GPGPU and its limitations. And when you ask people who use it, they usually tell you that it's really great if you need to do something that boils down to dense linear algebra on large arrays, with speedups up to 10× or so. But they also tell you that they have workloads that might benefit from using GPUs, but not enough to compensate for the overhead of constantly having to copy data to the GPU's memory and back. If you have big chunks of data to send to the GPU in order for it to do a lot of calculations and then send them back, that's fine, but if you only have small chunks of data, or relatively few calculations to make on these chunks before you have to send them back to the CPU for serial work, then your GPU starts being much less helpful.

And it just so happens that many problems have a rather simple, naive solution that relies on simple, static structures such as matrices; and a more sophisticated one that may rely on smarter data parsing, therefore more complex (dynamic) structures, fewer calculations per work unit, so to speak, and that solution is usually much faster. Unfortunately, while the latter may be a great improvement on a CPU, it can be much less of an improvement when you're using a GPU, because you have a hard time coalescing your memory transfers, and the calculations are just too small, so the overhead ends up negating the benefit of using a GPU. Naturally it doesn't have to be this extreme, and using a GPU can remain beneficial even with such algorithms, just a little less so.

So AMD intends to alleviate these issues through incremental improvements to their architecture. First, in 2011, comes the APU:

Now this slide is interesting in the context of graphics, and it does indicate improved latency for CPU—GPU communication, but no one uses IGPs for GPU computing, so let's take a look at this one too:

As you can see, even compared to full-width PCI-Express, APUs (in this case it should be Llano) provide plenty of internal bandwidth. As Jon Stokes pointed out, this is still a far cry from Sandy-Bridge, with its very fast ring bus, but I suspect it will suffice for the time being. That said, Llano isn't being targeted at server markets, and while AMD believes that developers will leverage its GPGPU capabilities for the consumer market, the company apparently doesn't hope for much on the High-Performance Computing front; at least not yet.

But there's a lot more coming:

This diagram may not seem all that exciting, but I have reason to believe that AMD isn't kidding about all those 'substantial's. It mentions substantial improvements to GPU—Memory-controller bandwidth, and I believe we might actually see a shared L3 here at some point. It mentions the same "substantial improvements" to Memory-Controller—Main-memory bandwidth, and AMD actually said a few words about memory die stacking, so we're talking something really big, there. The same goes for discrete GPU bandwidth.

Perhaps more importantly, APUs will move to a unified virtual address space for the CPU, the GPU inside the APU, and the discrete GPU if there is one. They will all have coherent memory and the GPU will support context switching as well as virtual memory support via IOMMU. All that will greatly reduce the overhead I discussed above, and will go a long way towards making GPU computing a reality for a large variety of workloads. Before this happens, I think it will remain a bit of a niche, but it's eventually poised for great expansion. However at that stage, it will be considered heterogeneous computing more than GPGPU.

AMD isn't alone in this game, though. Intel will be there as well with its CPU cores, and with Knights Corner, the 22nm evolution of the elusive Larrabee architecture. As far as I'm aware they will be on separate dies, at least initially, but this is slightly less of an issue because Knights is based on general-purpose (albeit rather simple) CPU cores, supplemented by vector hardware. Intel's approach is different from up close, but basically it's the same idea: a few big, complex, hot, high-clocked CPU cores for serial work, and many smaller, simpler, slower and more power-efficient parallel cores for… well, parallel work. As you might have gathered, NVIDIA is the odd man out, here. They've done a lot to develop GPGPU, but ironically, they might end up left out in the cold because they lack appropriate CPU technology. It's possible that they will try to build HPC-oriented System-On-Chips (SOCs) based on the fastest available ARM cores (so in the near future, Cortex-A15) but that probably won't quite cut it. Hard times are ahead for NVIDIA, no doubt.

That said, those hardware improvements won't magically make everything right, and problems will remain. Notably, AMD is having a hard time figuring out just how much GPU hardware should go into computing-oriented APUs for servers, and Intel is probably having similar issues. Hitting the right balance isn't easy, and a few workloads will remain difficult to exploit for heterogeneous systems no matter what.

But the revolution is coming. ;-)

Teχlog

Monday, November 15, 2010

GPU Computing and Fusion

No comments:

Post a Comment