Tuesday, November 30, 2010

GeForce GTX 570: specs and release date

There had been whispers going around for a few weeks about the GeForce GTX 570, but now, thanks to the guys from Sweclockers, we have the specifications, and a release date: December 7.

This new card comes with 480 shaders, like the GTX 480, 1280MB of RAM, like the GTX 470, and clocks of 732/1464/3800MHz for the base, shaders, and RAM, respectively… And those clocks are higher than the 480's. Confused yet? So am I, so let's crunch a few numbers, shall we?

Now that's better. The rightmost column indicates how much better than the GTX 480 the 570 is, as a (sometimes negative) percentage. If it's green, then it's better, if it's red, it's worse—so far so good, right? If it's yellow, it's either neutral or not directly important. For instance, memory bus width in itself doesn't matter, but it contributes to memory bandwidth, which does. Therefore, memory bus width is yellow, but memory bandwidth could be either green or red (it happens to be red in this case).
Also note that this table doesn't show an important detail: in some cases, namely when processing RGB9E5 or FP16 textures, the GTX 580 and 570's TMUs are twice as fast as their predecessors'. The effect of this obviously depends on whether those formats are used in a particular game, and to what extent. In practice, you could see a performance gain anywhere between 0 and 15%, maybe more in a few pathological cases.

So, compared to the GTX 480, the 570 has slightly higher shader and texturing throughput, especially considering the improved TMUs, and that should help a bit. It also has slightly higher triangle throughput, but the GTX 480 was far from bottlenecked in this area, so it shouldn't have any measurable effect. Likewise, the GTX 570 has significantly less memory, but I don't expect that to be a problem in 1920×1200; in 2560×1600 with anti-aliasing, however, it could be.
The main problems are memory bandwidth and, to a somewhat lesser extent, fillrate. Those two go down by 13~14%, and I suspect it will have a significant effect.
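
For readers who want to check the 13~14% figure, here is a minimal sketch of the arithmetic. The bus widths, ROP counts and the GTX 480's clocks are assumptions taken from commonly published specifications rather than from this post, and the fillrate model (one pixel per ROP per clock) is a simplification.

```python
# Rough GTX 570 vs GTX 480 comparison for memory bandwidth and pixel fillrate.
# ASSUMPTION: bus widths, ROP counts and GTX 480 clocks are the commonly
# published figures, not numbers given in the post itself.

def bandwidth_gbs(bus_width_bits, effective_mhz):
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * effective_mhz * 1e6 / 1e9

def fillrate_mpix(rops, core_mhz):
    """Peak pixel fillrate in Mpixels/s, assuming one pixel per ROP per clock."""
    return rops * core_mhz

def percent_change(new, old):
    """Relative change from old to new, as a percentage."""
    return 100 * (new - old) / old

gtx480 = {"bus": 384, "mem_mhz": 3696, "rops": 48, "core_mhz": 700}
gtx570 = {"bus": 320, "mem_mhz": 3800, "rops": 40, "core_mhz": 732}

bw_change = percent_change(
    bandwidth_gbs(gtx570["bus"], gtx570["mem_mhz"]),
    bandwidth_gbs(gtx480["bus"], gtx480["mem_mhz"]),
)
fill_change = percent_change(
    fillrate_mpix(gtx570["rops"], gtx570["core_mhz"]),
    fillrate_mpix(gtx480["rops"], gtx480["core_mhz"]),
)

print(f"Memory bandwidth: {bw_change:+.1f}%")
print(f"Pixel fillrate:   {fill_change:+.1f}%")
```

Under those assumptions, the higher memory clock only partly offsets the narrower bus, and the higher core clock only partly offsets the smaller number of ROPs.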

All in all, it's hard to say how the GTX 570 will compare to the 480. I think we'll see it being slightly faster in some games, slower in others. Perhaps something like 3~5% slower on average, but I don't expect the gap to be larger than this.

Finally, Sweclockers' information mentions a TDP of 225W, which is 25W less than the GTX 480's official TDP, and 75W less than its actual maximum power draw. Then again, the GTX 580 has a TDP of 244W, but with its limiter off, it has been measured well upwards of 300W, so who knows?

In any case, the GTX 570 looks like a good replacement for the 480: though it might be a bit slower in some games, on average it should perform similarly, but with lower power consumption, and hopefully much lower noise levels. The big question is how expensive it will be, and of course, how it will compare to AMD's Cayman, which is due just a few days after the 570's launch.

PS: I'm happy to announce that Teχlog has just reached 1000 pageviews: a modest milestone, but the first ones always are… :-)

Furthermore, I think that AMD and NVIDIA's renaming practices are dishonest and harmful to consumers, and that they need to stop.

Monday, November 29, 2010

AMD pulls an NVIDIA

A few years ago, renaming products was commonplace in the graphics world. Then, AMD sort of stopped doing it, and NVIDIA started doing it a lot more. The latter therefore gained a reputation for being something of a serial-renamer.

But last year, AMD surprised everyone by introducing new HD 5000 products that were in fact renamed HD 4000s, such as the Mobility Radeon HD 5145, 5165, or the oddly-named HD 530v/540v/550v/560v. AMD argued that OEMs demanded new names for existing 55nm DX10.1 designs. People complained for a day or two, but then forgot about it. After all, those were only low-end mobile products, and the commercial designations indicated fairly clearly that they were inferior to proper DX11 designs such as the HD 5600, for instance. Nevertheless, that was regrettable.

Then, a bit over a month ago, AMD introduced the Radeon HD 6800s, which were slower than the HD 5800s. While that wasn't strictly speaking a renaming, it was still misleading and an unpleasant surprise.

And today, AMD has just released "new" products, namely the HD 6500M and HD 6300M. Now you might think that those are mobile derivatives of AMD's latest Northern Islands architecture, but they're not. The specifications for these two additions to AMD's lineup state that they feature the "UVD 2 dedicated video playback accelerator", which is a component of Evergreen, otherwise known as the HD 5000 series. Those parts are in fact renamed Evergreen products. More specifically, the HD 6500M bears a striking resemblance to the Mobility HD 5770, and the HD 6300M reminds me a lot of the Mobility HD 5470. Let me take this opportunity to say that AMD's website is a pain to navigate.

Also note that the HD 6300M and 6500M have pretty loose specifications as far as clocks are concerned, or even memory type. In practice, the 6500M present in a laptop could be clocked at 500MHz with 900MHz DDR3, or at 650MHz with 900MHz GDDR5, with the same name!

This sort of thing creates a very confusing situation for consumers. When you can't trust the name of a SKU to reflect its generation, it's bad. When you can't even trust the name of a SKU to refer to one product with precise specifications, it's worse. The thing is, both AMD and NVIDIA do this because it works: it helps them sell more graphics cards. The press usually makes a couple of snide comments, but quickly moves on. Clearly, that's not enough to deter such behavior.

This is why I've decided to go on a little crusade of my own, in the hope that it will get AMD and NVIDIA to stop doing this. Obviously, there's no way I can succeed on my own, so I urge every member of the tech press to do the same: from now on, every single post about NVIDIA or AMD will be concluded with the following sentence, linking to this post.

Furthermore, I think that AMD and NVIDIA's renaming practices are dishonest and harmful to consumers, and that they need to stop. 

Hey, it worked for Cato the Elder.

UPDATE: Dave Baumann chimed in here, and made the following comment: These support hardware accellerated MVC (Blu-Ray 3D) playback where Mobility Radeon HD 5000 didn't. And across the board HDMI 1.4a support.

I appreciate that, but I still don't think that the HD 6000 name is justified.

UPDATE 2: More information from Dave here: UVD2 has to be driven in a different way in order to get MVC decode and this requires a VBIOS update (or an SBIOS update in the cases of most notebooks) and additionally requires qualification by us and the vendor. HDMI 1.4a can be achieved by a driver upate (as it was on desktop Radeon HD 5000) but some notebook vendors still re-qual the software updates.

This pretty much confirms that we're dealing with the same chip. 

Saturday, November 27, 2010

Hans de Vries dissects Bulldozer

For every major CPU release, Hans de Vries from Chip Architect takes a look at the die shot with his magic magnifying glass and tries to determine just which part does what. And as expected, he's done it with Bulldozer too. This time though, it was a little bit trickier than usual because AMD went through extra trouble and photoshopped the die shot, scaling parts up and down, blurring stuff, cutting and pasting components… The point of this was to make it difficult for Intel to draw any conclusive information from the picture.

But that wasn't enough to discourage Hans, and here's what he's been able to produce:

Sandy Bridge prices

"Hello, I'm Sandy Bridge"
If you were wondering how much Sandy Bridge processors would cost, Expreview has the answer for you.

They have a nice little table, so take a look at it if you want all the details. With prices ranging from $64 for the Pentium G620 (2 cores, 2.6GHz, 3MB of L3, no HyperThreading, no CPU Turbo) to $317 for the Core i7-2600K (4 cores, 8 threads, 3.4GHz with Turbo up to 3.8GHz), there's something for everyone.

However it's unfortunate that if you want the full chip with 4 cores, HT and Turbo, there's nothing below $294 (Core i7-2600). This is a clear sign of a lack of competition in this space. Hopefully, things will improve in Q2'11 with Bulldozer, but until then…

Friday, November 26, 2010

NVIDIA's Endless City demo on Radeons

Remember NVIDIA's Endless City demo? Here is how NVIDIA describes it: 

Take a cruise through the most complex city ever rendered in real-time. NVIDIA’s Endless City harnesses the horsepower of our incredible tessellation engine to procedurally generate urban detail never before possible in an interactive world. Sit back, relax and enjoy the view. 
More here > Demos > Endless City.

When it was released, this demo wouldn't run on Radeons because it required CUDA, but it wasn't clear what CUDA was used for exactly. As it turns out, it's not used for anything at all, at least according to Scali's latest weblog post. He found out that you can disable CUDA as well as a vendor check, and the demo runs just fine on any DX11 card, though apparently not very fast. He's even uploaded a patch to make it easy for everyone.

New Sandy Bridge benchmarks!

About three months ago, Anandtech published a performance preview for Sandy Bridge, well ahead of its launch. And now, Inpai has just joined the party with a preview of their own. It's in Chinese, but the charts speak for themselves.


UPDATE: now it's also available in English, here

Thursday, November 25, 2010

NVIDIA Echelon

Xbitlabs has a new piece about NVIDIA's Echelon, a research project investigating heterogeneous computing in future ExaFLOPS (10^18 FLOPS) systems.

NVIDIA isn't willing to share much more than a bunch of pretty slides with big numbers at this stage, but it's worth a look.

Sandy Bridge and low-end SKUs

Sandy Bridge, in its quad-core version.

ComputerBase.de has managed to get a listing for future low-end Intel processors based on Sandy Bridge.

They have a nice little chart so I won't detail every single SKU, but I'll say this: first, Intel's naming scheme is still confusing as hell; second, even the lowly Pentium G620 has Turbo enabled for the graphics part (though not the CPU). This is both slightly unexpected and quite welcome, since that feature is key to Sandy Bridge's power efficiency. It's nice to see that even the bottom end isn't completely crippled. Though the CPU cores lack HyperThreading and Turbo, they still have a decent amount of L3 cache (3MB) and run at a respectable 2.6GHz, so I expect very respectable performance from this part.

Fudzilla also has some information about Sandy Bridge-based Celerons, but they don't seem to have the full specifications. Then again, it's supposed to be released in Q3'11, so those might not be set yet.

Wednesday, November 24, 2010

Good news about Llano

Charlie Demerjian has just published a new article about Llano over at SemiAccurate. The short story is that even though it initially ran into some pretty bad trouble, it's now doing a lot better and might actually be released sooner than AMD has let on so far.

And the company is in a difficult competitive situation at the moment in the mobile market, so that's very good news for them.

Monday, November 22, 2010

Cayman specifications leaked

It seems the Polish website FrazPC had a little mishap this morning. They mistakenly uploaded a lot of slides about Cayman, AMD's upcoming high-end GPU, apparently from a presentation recently given by AMD. I think the deadline was supposed to be today, so either FrazPC got the time wrong or the NDA changed. Either way, I—and others—had just enough time to save the relevant slides, so there you go:


So first of all, we can finally confirm that Cayman is based on a VLIW4 architecture. I've talked about it here, so I won't dwell on it much, let's just say that a VLIW4 unit should be almost as fast as a VLIW5 one, but smaller. We "know" from the recent Antilles leak that Cayman has 30 SIMDs, and here we can see that there is still one quad-TMU per SIMD, so that's 120 TMUs total, a 50% increase over Cypress (HD 5870)! Cayman looks like a real texturing beast.

Cayman also features what you might call a distributed geometry engine, similar to Fermi's, but more limited. Still, it can process two primitives per clock, and at 900MHz or more, that's over 1800 Mtri/s. That should be amply sufficient in even the most demanding games, but I can't help feeling that there could still be further improvements down the road. The bit about "off-chip buffer support for high-tessellation levels" sort of raises a red flag: it appears that Cayman can't handle high tessellation very well relying solely on on-die resources. Surely, being able to use an off-chip buffer is better than just choking on excessive information as Cypress seems to do in such cases, but it's not exactly ideal either. As usual, it's a trade-off, of course.

AMD promises tessellation performance at least 50% higher than that of Cypress, and 100+% higher with high tessellation factors. That's not nearly as good as Fermi, but I don't expect any game to reflect that.

As expected, the new VLIW4 units combine simple SPUs to handle transcendentals, occupying 3 slots, as I mentioned in my previous post. Again, thanks go to Gipsel for providing that information. AMD claims similar performance with a ~10% area reduction compared to Cypress, which is exactly what I had predicted. I wish I could say that it was anything more than a lucky guess, but I really can't.
Also note that when the slide says "2 64-bit MUL or ADD", that's a mistake, it really means "either two 64-bit ADDs or one 64-bit MUL". Still, with up to one 64-bit MAD or FMA per clock, Cayman achieves a DP rate of ¼, which isn't bad at all. Obviously, removing the non DP-capable T unit has helped. GPGPU folks should be happy about that, especially since there's more:

UPDATE: I hadn't even noticed, but the slide also says two 32-bit ADDs per cycle. That's obviously a mistake too, each VLIW unit is capable of four 32-bit ADDs per cycle. Once again, Gipsel was vigilant. This just goes to show that you shouldn't drink and make slides. ;-)

There are a few welcome improvements here. Exactly how the L2 cache will be used is unclear, however.

This table doesn't tell us much, but I took the liberty of adding some information, based on the recent leak about Antilles, and an educated guess for the memory bandwidth.

The next two slides introduce a new high-quality anti-aliasing mode…

…which Cayman should be able to handle just fine, thanks to seriously beefed-up ROPs:

But wait, there's more! AMD is introducing a new type of power management that they call "Power Containment":

Exactly how this works is far from clear, but AMD has apparently substantially increased granularity for power management, with regard to both time and functional blocks. This feature is claimed to be user-controllable through AMD's Overdrive utility, but I doubt it really affords all that much control. At least, you should be able to disable it, which could be useful for overclocking. Beyond that, I doubt there's much that Overdrive lets you modify.

Well, I wasn't there at the presentation, and I'm not sure what this slide is about exactly, but it seems to highlight the fact that power containment allows the GPU to always remain exactly within TDP, without having to scale clocks down any lower than necessary, and apparently helps with idle power too. I'd suggest waiting for comments from someone who was actually there, though.

All in all, when compared to Cypress, Cayman provides:
  • 20% more SPUs, which are more efficient,
  • 50% more (slightly less capable) VLIW units,
  • 50% more TMUs,
  • 100% higher geometry throughput per clock,
  • significantly improved ROPs,
  • 10% higher bandwidth, maybe more,
  • higher clocks, most likely,
  • a few bits here and there…
Judging from all this, I'm going on record saying it should be faster than the GTX 580 when it is released some time next month. In the meantime, we'll just have to wait. 
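
The percentages in the list above can be reproduced from raw unit counts. This is only a sketch: the Cayman numbers come from the leaked slides discussed here, while the Cypress numbers and the figure of 16 VLIW units per SIMD are assumptions based on published HD 5870 specifications.

```python
# Unit counts for Cypress (HD 5870) and Cayman.
# ASSUMPTION: 16 VLIW units per SIMD on both chips, and the Cypress figures
# (20 SIMDs, VLIW5, 1 primitive per clock) are taken from public specs.
cypress = {"simds": 20, "vliw_width": 5, "prims_per_clock": 1}
cayman = {"simds": 30, "vliw_width": 4, "prims_per_clock": 2}

VLIW_UNITS_PER_SIMD = 16
TMUS_PER_SIMD = 4  # one quad-TMU per SIMD

def derive(gpu):
    """Derive VLIW unit, SPU and TMU totals from per-SIMD figures."""
    vliw_units = gpu["simds"] * VLIW_UNITS_PER_SIMD
    return {
        "vliw_units": vliw_units,
        "spus": vliw_units * gpu["vliw_width"],
        "tmus": gpu["simds"] * TMUS_PER_SIMD,
        "prims_per_clock": gpu["prims_per_clock"],
    }

def percent_more(new, old):
    """How much larger new is than old, as a percentage."""
    return 100 * (new - old) / old

old, new = derive(cypress), derive(cayman)
gains = {key: percent_more(new[key], old[key]) for key in old}

for key, value in gains.items():
    print(f"{key}: {old[key]} -> {new[key]} ({value:+.0f}%)")
```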

ISSCC 2011 and new information

I'll keep it short this time. Dresdenboy has just published a new blog post on Citavia about the International Solid-State Circuits Conference 2011 (ISSCC) and you should read it because it contains new, juicy info. Here's a teaser:

4.5 Design Solutions for the Bulldozer 32nm SOI 2-Core Processor Module in an 8-Core CPU
T. Fischer, S. Arekapudi, E. Busta, C. Dietz, M. Golden, S. Hilker, A. Horiuchi, K. A. Hurd, D. Johnson, H. McIntyre, S. Naffziger, J. Vinh, J. White, K. Wilcox, AMD
The Bulldozer 2-core CPU module contains 213M transistors in an 11-metal layer 32nm high-k metal-gate SOI CMOS process and is designed to operate from 0.8 to 1.3V. This micro-architecture improves performance and frequency while reducing area and power over a previous AMD x86-64 CPU in the same process. The design reduces the number of gates/cycle relative to prior designs, achieving 3.5GHz+ operation in an area (including 2MB L2 cache) of 30.9mm².

And here are some figures about Sandy-Bridge, Westmere and Llano, for reference:

Author: Hans de Vries

As you can see, a Bulldozer module (2 cores) with 2MB of L2 cache is actually a bit smaller than 2 Llano cores with the same amount of cache! That's quite promising.

Sunday, November 21, 2010

Intel releases SDK for OpenCL

When Intel started talking about Sandy-Bridge, their upcoming CPU/GPU—or APU, if you will—architecture a while ago, they mentioned that it would be compatible with OpenCL, an open framework for parallel programming on a broad range of architectures, aimed at taking advantage of heterogeneous systems with traditional CPU cores and more parallel ones, for instance GPUs. OpenCL is managed by the Khronos Group, and backed by AMD, Apple, NVIDIA, and now Intel.

Indeed, considering their recent announcement regarding OpenCL and Sandy-Bridge, it should come as no surprise that they have just released their own SDK for OpenCL, albeit in an Alpha version. With Intel, AMD, Apple and NVIDIA actively supporting it, OpenCL now has potential to become the standard for parallel computing. Granted, NVIDIA would probably like you to use CUDA instead, but they will support any initiative that takes advantage of their GPUs for compute purposes.

The obvious advantage of OpenCL is that it's compatible with most widely-used architectures. That's not to say that you can just write your code once and have it run blissfully fast on all parallel processors, though. Unfortunately, some amount of tuning will always be necessary to extract performance from specific architectures, but at least, with OpenCL, you can do so using one language, sharing some code, and using one set of tools. As such, it's a huge improvement over having to use CUDA for NVIDIA, Brook+/CAL/CTM for ATI/AMD, and traditional programming languages for CPUs.

AMD Antilles specifications leaked

There's a lot going on with AMD these days. An apparently genuine slide with some specifications for Antilles, or the AMD Radeon HD 6990, an upcoming dual-GPU card has just surfaced here.

This tells us that Cayman, the GPU on which Antilles is based, has at least (and probably exactly) 1920 SPUs. It also tells us that the GPUs in Antilles are clocked at 775MHz, since it can output 3100 million triangles per second, and Cayman is rumored to be able to produce 2 triangles per clock. It's the only way the 3100 Mtri/s figure makes sense anyway, so this rumor must be true.
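
The clock inference is a one-line division, sketched below. The function name is purely illustrative; the 3100 Mtri/s figure and the two-GPU layout come from the slide, and the triangles-per-clock value is the rumor being tested.

```python
def implied_clock_mhz(mtri_per_s, gpus, tris_per_clock):
    """Clock (MHz) each GPU needs to reach the card's total triangle rate."""
    return mtri_per_s / (gpus * tris_per_clock)

# Antilles: two GPUs, 3100 Mtri/s in total.
clock_if_two_per_clock = implied_clock_mhz(3100, gpus=2, tris_per_clock=2)
clock_if_one_per_clock = implied_clock_mhz(3100, gpus=2, tris_per_clock=1)

print(f"2 triangles/clock -> {clock_if_two_per_clock:.0f} MHz")
print(f"1 triangle/clock  -> {clock_if_one_per_clock:.0f} MHz")
```

At one triangle per clock, each GPU would have to run at 1550MHz, which is implausible for a chip of this class; two per clock gives 775MHz.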

Managing to put a pair of such large GPUs at 775MHz in a 300W card is quite impressive, and from that, we can infer that the single-GPU Radeon HD 6970 should be clocked at 900MHz or above. I would estimate its maximum power draw at about 225W, perhaps a tad more since it features a 6-pin and an 8-pin power connector, providing the card with up to 300W.

Thursday, November 18, 2010

Bulldozer and Llano roadmaps leaked

The guys from ATI-Forum.de managed to get their hands on roadmaps for the client version of Bulldozer, otherwise known as Zambezi, and for Llano (desktop only).

This is very much in line with what AMD revealed during Analyst Day: Llano will be out in mid-2011, available in 2-, 3- and 4-core versions. Apparently, only two different power envelopes will coexist, at least initially: 65W and 100W. I'm sure 45W products will follow in Q4. Obviously, mobile SKUs will have much lower TDPs.

We already knew that Bulldozer would make its way into desktops in Q2, but now it even seems to be around May, which is unexpected, but very good news. It will be offered with 8 cores first, and in 125W as well as 95W versions. Then, 6- and 4-core versions will follow, all within 95W. I wonder whether there will be 65W versions later.

It appears that AMD might be competitive in high-end desktops in the first half of 2011 after all, which is great because it hasn't been the case since 2006.

Wednesday, November 17, 2010

AMD joins MeeGo

Earlier this year, Intel and Nokia announced a new joint project called MeeGo. Behind this slightly funny-sounding name was an open-source OS based on Linux and aimed at mobile platforms, such as tablets, netbooks and high-end smartphones. Actually, Intel and Nokia merged their Moblin and Maemo projects (respectively) to create MeeGo. This came as a bit of a surprise, because Nokia was using ARM chips at the time (and actually still is, as far as I'm aware), not x86. Nokia probably intends to use Atom in its products at some point, and that was the motivation behind MeeGo. As for Intel, the aim was to develop a new mobile OS highly optimized for Atom, in order to have something that actually runs well on such a slow processor.

And in a surprising turn of events, AMD has just joined the MeeGo project. This is surprising because MeeGo was perceived as an Intel thing, but if you think about it, it makes a lot of sense. AMD is about to release a new APU (Zacate/Ontario, or Zacario as I like to call it) and would greatly benefit from a mobile OS highly tuned for it, even though it seems to run Windows 7 fairly smoothly. Plus, AMD has been pretty vocal lately about embracing open standards, and in that sense, their joining MeeGo is a big deal, because it effectively makes it a PC standard: both major x86 players are now behind it. Since Nokia intends to use it on its smartphones, ARM processors are supported as well, and of course optimized for. So you could even say it's now a mobile computing standard.

MeeGo still has a long way to go before it can get serious market share, but with support from both Intel and AMD, it could turn out to be the Android of netbooks… except Google is working on just that, and calling it Chrome OS. It will be interesting to see how things turn out, but it's safe to say that mobile users will "soon" have a couple more interesting alternatives.

Tuesday, November 16, 2010

Zacate benchmarked!

A few select journalists have been given access to a Brazos platform equipped with a Zacate APU, the 18W version of the chip otherwise known as Ontario. There are a few others, but here is a preview from Anandtech, here is one from PC Perspective with power comparisons, and here is one from The Tech Report. In that last one, I really like the subjective testing part:

Scientific benchmarks or not, we like to install different games on our laptops and manually tweak the options to see how well they run. A little subjective testing never hurt anybody, right? [page 4].

Zacate turns out to be consistently faster than Atom, especially in single-threaded workloads, which makes the system much snappier, according to Anand. It can come pretty close to CULV platforms in CPU-bound applications, and often does (much) better in games. What the previews don't insist on is just how small this piece of silicon is. Just look at this:

Author: Hans de Vries

Basically, AMD has made a 75mm² (official figure) die that is consistently faster than Atom (87mm²) in CPU-bound workloads, and vastly superior in games or video. It can even compete with much bigger CULV processors, and should still do well against (possible) CULV variants of Sandy-Bridge. I expect those to feature 2 cores and 6 so-called EUs (GPU units) ending up with lower GPU performance than Zacate, and a ~150mm² die. The smaller the die, the lower the manufacturing cost, so that's very important, because it means AMD will be able to sell Zacate at a very attractive price while keeping healthy margins.
Obviously, CPU performance should be a clear win for Intel with Sandy-Bridge. It's difficult to predict how the market will react to this situation. I suppose that consumers will care a lot about video and graphics, even if it just means very casual gaming and Youtube stuff, but people looking for a business laptop might favor Sandy-Bridge. For those, having Excel running smoothly is probably more important than being able to play Call of Duty 36 with smooth framerates.
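
To get a feel for why die size matters so much for cost, here is a rough sketch using a widely used dies-per-wafer approximation. The 300mm wafer diameter is an assumption, and the model ignores yield, scribe lines and the fact that these chips are built on different processes, so treat the output as an order-of-magnitude illustration only.

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    """Approximate gross dies per wafer.

    Wafer area divided by die area, minus a correction term for the
    partial dies lost along the wafer's edge.
    ASSUMPTION: 300mm wafers; defects and yield are ignored.
    """
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(gross - edge_loss)

# Die areas quoted in the post (the ~150mm² figure is the post's own guess).
candidates = {
    "Zacate": 75,
    "Atom": 87,
    "Hypothetical CULV Sandy Bridge": 150,
}

for name, area in candidates.items():
    print(f"{name} ({area} mm²): ~{dies_per_wafer(area)} dies per wafer")
```

Because edge losses grow with die size, the number of candidate dies falls faster than the area ratio alone would suggest.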

Also note that those benchmarks were run on a development platform that is anything but final, so while performance is very close to what you'll get in actual products, power draw isn't, and that is a big unknown. Well, load power should be fairly accurate, but idle power most likely isn't, and that will be crucial for battery life. That said, AMD seems pretty bullish on that front, so that's a good sign. 

Next up is Ontario (9W) but AMD isn't disclosing any performance numbers yet. However, its specs are known, and so are Zacate's benchmark results now, so you can take out your calculator and make pretty educated estimates. As I said before, there should be some pretty cool stuff in the mobile market in January.

Monday, November 15, 2010

Better yields for the GTX 580

According to a short post by Charlie Demerjian over at SemiAccurate, the GeForce GTX 580 is enjoying much better yields than the GTX 480 ever did (reading between the lines). The latter, if you recall, was plagued by dreadful yields, and therefore very high manufacturing costs. The fact that the 580 is much better in this respect is very good news, because it means NVIDIA and AMD will be able to engage in a price war, should the competitive situation call for it. And I love a good price war.

Besides, when Charlie says something positive about NVIDIA, you can bet that it's true.

He also mentions a GTX 560, which should make the $200 space very, very interesting. It's going to be a good winter if you're looking for a new high-end graphics card.

GPU Computing and Fusion

During the recent Analyst Day, AMD talked about something I haven't mentioned yet, partly because it sort of slipped my mind, and partly because it really deserves its own post. Most news reports haven't made a big deal out of it, in part because it's not exactly unexpected, but I think that's a mistake, and perhaps many people don't realize how significant it is.

I'm talking about Fusion, and its impact on GPU computing (or GPGPU if you prefer). To understand why it's important, you must first look at the current state of GPGPU and its limitations. And when you ask people who use it, they usually tell you that it's really great if you need to do something that boils down to dense linear algebra on large arrays, with speedups up to 10× or so. But they also tell you that they have workloads that might benefit from using GPUs, but not enough to compensate for the overhead of constantly having to copy data to the GPU's memory and back. If you have big chunks of data to send to the GPU in order for it to do a lot of calculations and then send them back, that's fine, but if you only have small chunks of data, or relatively few calculations to make on these chunks before you have to send them back to the CPU for serial work, then your GPU starts being much less helpful.
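
The trade-off described above can be captured in a toy cost model. Every number in it is an assumption chosen for illustration (a 50 GFLOPS CPU, a GPU ten times faster, 8 GB/s over PCI-Express, 20µs of fixed overhead per offload), and it pessimistically counts every byte as travelling in both directions.

```python
# Toy model: is it worth shipping a chunk of work to a discrete GPU?
# ASSUMPTION: all throughput and latency figures below are illustrative,
# not measurements of any particular hardware.
CPU_FLOPS = 50e9          # sustained CPU throughput, FLOP/s
GPU_FLOPS = 500e9         # sustained GPU throughput, FLOP/s (10x the CPU)
PCIE_BYTES_PER_S = 8e9    # host <-> GPU copy bandwidth
OFFLOAD_OVERHEAD_S = 20e-6  # fixed cost per offload (launch, synchronization)

def cpu_time(flops):
    """Time to do the work on the CPU, with no copies needed."""
    return flops / CPU_FLOPS

def gpu_time(data_bytes, flops):
    """Time on the GPU: fixed overhead, copy there and back, then compute."""
    transfer = 2 * data_bytes / PCIE_BYTES_PER_S
    return OFFLOAD_OVERHEAD_S + transfer + flops / GPU_FLOPS

def offload_speedup(data_bytes, flops):
    """Values above 1 mean the GPU wins; below 1, the CPU is faster."""
    return cpu_time(flops) / gpu_time(data_bytes, flops)

# Large dense problem: 4096x4096 single-precision matrix multiply.
n = 4096
big = offload_speedup(data_bytes=3 * n * n * 4, flops=2 * n ** 3)

# Small chunk: 64KB of data with only ten operations per byte.
small = offload_speedup(data_bytes=64 * 1024, flops=10 * 64 * 1024)

print(f"Large matrix multiply: {big:.1f}x")
print(f"Small chunk:           {small:.2f}x")
```

With these assumptions the big matrix multiply keeps most of the GPU's raw advantage, while the small chunk ends up slower than simply staying on the CPU, because the fixed overhead and the copies dominate.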

And it just so happens that many problems have a rather simple, naive solution that relies on simple, static structures such as matrices; and a more sophisticated one that may rely on smarter data parsing, therefore more complex (dynamic) structures, fewer calculations per work unit, so to speak, and that solution is usually much faster. Unfortunately, while the latter may be a great improvement on a CPU, it can be much less of an improvement when you're using a GPU, because you have a hard time coalescing your memory transfers, and the calculations are just too small, so the overhead ends up negating the benefit of using a GPU. Naturally it doesn't have to be this extreme, and using a GPU can remain beneficial even with such algorithms, just a little less so.

So AMD intends to alleviate these issues through incremental improvements to their architecture. First, in 2011, comes the APU:

Now this slide is interesting in the context of graphics, and it does indicate improved latency for CPU—GPU communication, but no one uses IGPs for GPU computing, so let's take a look at this one too:

As you can see, even compared to full-width PCI-Express, APUs (in this case it should be Llano) provide plenty of internal bandwidth. As Jon Stokes pointed out, this is still a far cry from Sandy-Bridge, with its very fast ring bus, but I suspect it will suffice for the time being. That said, Llano isn't being targeted at server markets, and while AMD believes that developers will leverage its GPGPU capabilities for the consumer market, the company apparently doesn't hope for much on the High-Performance Computing front; at least not yet.

But there's a lot more coming:

This diagram may not seem all that exciting, but I have reason to believe that AMD isn't kidding about all those 'substantial's. It mentions substantial improvements to GPU—Memory-controller bandwidth, and I believe we might actually see a shared L3 here at some point. It mentions the same "substantial improvements" to Memory-Controller—Main-memory bandwidth, and AMD actually said a few words about memory die stacking, so we're talking something really big, there. The same goes for discrete GPU bandwidth.

Perhaps more importantly, APUs will move to a unified virtual address space for the CPU, the GPU inside the APU, and the discrete GPU if there is one. They will all have coherent memory and the GPU will support context switching as well as virtual memory support via IOMMU. All that will greatly reduce the overhead I discussed above, and will go a long way towards making GPU computing a reality for a large variety of workloads. Before this happens, I think it will remain a bit of a niche, but it's eventually poised for great expansion. However at that stage, it will be considered heterogeneous computing more than GPGPU.

AMD isn't alone in this game, though. Intel will be there as well with its CPU cores, and with Knights Corner, the 22nm evolution of the elusive Larrabee architecture. As far as I'm aware they will be on separate dies, at least initially, but this is slightly less of an issue because Knights is based on general-purpose (albeit rather simple) CPU cores, supplemented by vector hardware. Intel's approach is different from up close, but basically it's the same idea: a few big, complex, hot, high-clocked CPU cores for serial work, and many smaller, simpler, slower and more power-efficient parallel cores for… well, parallel work. As you might have gathered, NVIDIA is the odd man out, here. They've done a lot to develop GPGPU, but ironically, they might end up left out in the cold because they lack appropriate CPU technology. It's possible that they will try to build HPC-oriented System-On-Chips (SOCs) based on the fastest available ARM cores (so in the near future, Cortex-A15) but that probably won't quite cut it. Hard times are ahead for NVIDIA, no doubt.

That said, those hardware improvements won't magically make everything right, and problems will remain. Notably, AMD is having a hard time figuring out just how much GPU hardware should go into computing-oriented APUs for servers, and Intel is probably having similar issues. Hitting the right balance isn't easy, and a few workloads will remain difficult to exploit for heterogeneous systems no matter what.

But the revolution is coming. ;-)

Today is Crazy Graphics Cards Day

When the GeForce GTX 580 was released a few days ago, reviewers were surprised to find out that it featured a so-called "limiter" which throttles the card aggressively when either Furmark or OCCT is launched. Some of them dug up old versions of the aforementioned software, and proceeded to measure the beast's power draw. They usually recorded something like 300 to 310W.

But now, GPU-Z developer W1zzard from TechPowerUp has just added a new feature to GPU-Z, which permits disabling the limiter. He's even measured the GTX 580's power draw under Furmark, and as it turns out, the thing pulls 350W!

Now you might be wondering why he got such a high result. My theory is that the latest version of Furmark puts an even heavier load on GF110 than older ones. This may well be why only recent versions of Furmark and OCCT are detected by the limiter. When said old versions were released, GF110 wasn't around to "optimize" for, so it may not be fully utilized.

Then again, in actual games (or 3DMark) the GTX 580 tends to draw a bit less power than the 480. Why? One possible reason is that games aren't quite that demanding, and under such circumstances, the GTX 580's cooler is able to cope very well with the card's heat output.
And as we know, power increases with temperature. We even know (thanks to Dave Baumann) that around its standard operating temperature, Cypress (HD 5800) draws about one additional watt per additional degree Celsius.
If we assume the additional heat-related power draw to be proportional to TDP, then GF100/110 draws about 1.6W more per additional °C around its standard operating temperature. Since the GTX 580 typically operates 15 to 20°C lower than the 480, we can expect it to draw approximately 24 to 32W less, all other things being equal. Of course, all other things are not equal: the 580 has more enabled units, higher clocks, and is based on a newer revision.

UPDATE: Psycho from the Beyond3D forum has just reminded me of this test by Kyle Bennett at HardOCP. He measured total system power draw in Furmark while keeping an eye on temperatures. The PSU was roughly 87% efficient. Shortly after launching Furmark, Kyle says the GPU is at 75°C while the system draws 449W. At the end of the run, the GPU is at 95°C and the system draws 481W. So that's 32 additional watts at the wall, but probably around 28W once PSU efficiency is taken into account. In other words, about 1.4W per additional °C. So it looks like I wasn't too far off.
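For those who want to redo the arithmetic, here is a minimal Python sketch of the estimate above. It assumes that the whole increase at the wall comes from the GPU heating up, and that the 87% efficiency applies uniformly; the function name is made up for illustration, and the 1.6W/°C figure is simply my earlier guess.

```python
def extra_watts_per_degree(wall_start_w, wall_end_w,
                           temp_start_c, temp_end_c, psu_efficiency):
    """Power increase after the PSU, per degree Celsius of GPU temperature rise.

    Assumption: the entire increase measured at the wall is caused by the GPU
    getting hotter, which is a simplification.
    """
    wall_delta_w = wall_end_w - wall_start_w      # extra watts at the wall
    dc_delta_w = wall_delta_w * psu_efficiency    # extra watts after PSU losses
    return dc_delta_w / (temp_end_c - temp_start_c)


# Figures quoted from the HardOCP run: 449W at 75°C, 481W at 95°C, ~87% PSU.
measured_slope = extra_watts_per_degree(449, 481, 75, 95, 0.87)
guessed_slope = 1.6  # my earlier estimate, in W per °C

print(f"measured: {measured_slope:.2f} W per °C")

# Expected savings for a card running 15 to 20°C cooler, under each slope.
for delta_t in (15, 20):
    print(f"{delta_t}°C cooler: "
          f"{guessed_slope * delta_t:.0f}W (guess) vs "
          f"{measured_slope * delta_t:.0f}W (measured)")
```

With the measured slope, the 15 to 20°C gap is worth a bit less than my original 24 to 32W guess, but it stays in the same ballpark.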

The 580's lower temperatures are probably an important factor in keeping power down in games, but that doesn't really help in Furmark, because with such a heavy load, its cooler has trouble keeping things… well, cool.

Still, interesting as this may be (and if you're reading this, then you haven't fallen asleep yet) in the end it doesn't really matter: I don't know about you, but I usually play games, not Furmark.

But the GTX 580 isn't the only crazy graphics card under the spotlight today. PCWorld.fr has just reviewed a Crossfire of ASUS ARESes. I certainly wouldn't advise anyone to buy that, especially now, but it's still fun to read, in a crazy, over-the-top kind of way. If Michael Bay were to publish a graphics cards review, that's probably how he'd do it. Except at the end, the testing rig would be destroyed in a huge explosion and the reviewer would narrowly escape on the back of a giant Transformer. But other than that, it would be the same. Anyway, here is the review [French].

On a related note, there are whispers about an upcoming GTX 570. I'd expect 480 SPs, clocks around 675/1350MHz, and maybe 3.6Gbps memory for roughly 240W; basically a cheaper, cooler GTX 480. I'm not sure that will be enough against Cayman, though.

Sunday, November 14, 2010

RWT: Exploring the Intel and Achronix Deal

David Kanter has recently published a piece over at Real World Technologies about the recent deal between Intel and Achronix, a company that designs FPGAs, shedding some light on this puzzling event.

Last Monday, Intel announced that they had voluntarily entered into a manufacturing partnership with a third party, one of the first times in the company’s history. This broke an implicit assumption held by many observers, and prompted quite a few questions, which we will endeavor to answer. 

Saturday, November 13, 2010

Hello Cortex-A15

It seems that a pretty interesting document about ARM's upcoming Cortex-A15 core was just leaked. Thanks to Exophase for catching that one.

The document gives a significant amount of architectural detail, and it's definitely worth taking a look at.

Laura Izibor — Can't Be Love

I said Teχlog wasn't just about technology, didn't I? Well, here's a nice song: Laura Izibor — Can't Be Love.

GeForce GTX 580, Radeon HD 6970, Delays and Rumors

Unless you've been sleeping under a rock the last few days, you must have heard about the GTX 580. This new graphics card from NVIDIA is based on a "new" chip, called GF110. It's basically a fixed GF100, fully enabled, with higher clocks and slightly lower power draw—definitely a significant step in the right direction.

If you haven't already, I suggest you take a look at some of the following reviews: The Tech Report, AnandTech and Hardware.fr. That last one is in French, but Damien Triolet is one of the best reviewers out there, so it's definitely worth a look, even if it means running it through Google Translate. Plus, the charts speak for themselves anyway.

The consensus is that the card is "what Fermi should have been from the beginning" and that's definitely the feeling I got, except for the power draw, which is still freakishly high. But at least, now it's not disproportionate with regard to performance. In other words, the GTX 580 is more or less on the HD 5970's level in performance/watt, which isn't bad at all.

That said, don't run out to the store just yet, because NVIDIA's latest and greatest has two things going against it:
  • at $500 or more, it's very expensive;
  • AMD's Radeon HD 6970 is just a month away.
AMD is usually quite aggressive with pricing when introducing new graphics cards, so basically, three things can happen: the HD 6970 can be slower and much cheaper than the GTX 580, about equal and a bit cheaper, or faster and roughly the same price, perhaps a tad more expensive. Whichever of these options turns out to be true, you can bet that NVIDIA will cut the GTX 580's price, so even if you're dead set on the green team, you'd be well advised to wait a bit.

The HD 6970 will be based on a new GPU called Cayman, which seems to feature completely new Stream Processing Units, as AMD likes to call them. Though rumors had been floating around for a while, most of what we know, we know from Gipsel's posts here. Follow that thread for all the gory details and ISA code galore.

If Gipsel's information does pertain to Cayman—and that is almost certain—then AMD has decided to move from a VLIW5 architecture to a VLIW4 one. Now you might be thinking "wait, 4<5, so how is that better?". First, let's take a look at the current VLIW5 setup, featured on Cypress (HD 5800 series):

As you can see, each VLIW5 unit features 4 small, simple, or "thin" Stream Cores, also called Stream Processing Units (SPUs), and a complex "fat" one, plus a branch unit. This fat unit handles transcendental instructions, such as exp, log, trigonometry, etc. The problem is that transcendentals are not all that common, and this "fat" unit is… well, fat. It takes a lot of space, and often just sits there doing nothing, except leaking power.

So for Cayman, AMD has apparently removed this unit, and improved the thin ones so that they can handle transcendentals together. In practice, when a complex instruction is encountered, 3 out of the 4 SPUs combine to handle it. Obviously, that means reduced throughput for transcendental-heavy code, but that's OK because first, transcendentals are relatively rare, and second, the resulting VLIW4 unit is significantly smaller than the original VLIW5 one. If it's about 10% smaller (completely arbitrary figure) then you can put about 11% more of them in the same die area (assuming decoupled TMUs). Since your new VLIW4 unit is going to be almost as fast as the VLIW5 one in most cases, the net result is improved performance, and probably improved performance/watt as well.
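To make that trade-off concrete, here is a small Python sketch of the arithmetic. Both inputs are arbitrary: the 10% area saving is the same made-up figure as above, and the 95% per-unit speed is a purely hypothetical stand-in for "almost as fast". The function names are mine, not AMD's.

```python
def units_in_same_area(area_saving):
    """How many more units fit in a fixed die area if each unit shrinks.

    area_saving=0.10 means each new unit takes 90% of the old unit's area.
    """
    return 1.0 / (1.0 - area_saving)


def relative_throughput(area_saving, per_unit_speed):
    """Throughput of the new design relative to the old one, at equal area.

    per_unit_speed is the average speed of one new unit relative to one old
    unit (1.0 = just as fast). It is a hypothetical input, not measured data.
    """
    return units_in_same_area(area_saving) * per_unit_speed


# 10% smaller units -> about 11% more of them in the same area.
print(f"{units_in_same_area(0.10):.3f}")

# If each VLIW4 unit averaged 95% of a VLIW5 unit's speed (assumed figure),
# the chip as a whole would still come out ahead.
print(f"{relative_throughput(0.10, 0.95):.3f}")
```

The break-even point is easy to read off: with a 10% area saving, the new unit only needs to average more than 90% of the old unit's speed for the swap to pay off.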

We'll find out exactly how much of an improvement this is when Cayman is released, around December 13. If you've been paying attention to rumors, then you've probably heard that it was originally scheduled for November, but AMD has decided to delay it. There has been some speculation as to why, and different theories have been put forth, which I will now discuss.

Some have speculated that AMD was surprised by the GTX 580's performance, and decided to raise clock speeds on Cayman in order to be more competitive. I find that unlikely, because the GTX 580 is "only" about 15% faster than the GTX 480 on average, which is really the least you could expect out of a "new" generation, so it's hard to believe that AMD would be surprised by that. Moreover, Dave Baumann, product manager for Radeon graphics cards, has pretty much dismissed this idea.

Others have said that poor yields (in the single digits!) could be at fault. But this is AMD's third, or arguably even fourth 40nm architecture, and by now, they should really have a good handle on TSMC's admittedly questionable 40nm process. Of course, when you're putting billions of nanoscale transistors on a few mm² of silicon, things can always go wrong, but I really doubt it in this case. Plus, this is not the sort of thing you notice at the last minute.

The most believable theory is, in my opinion, the one about shortages for a new so-called driver-MOSFET made by Texas Instruments and used on the HD 6800 and 6900 series. More here. Indeed, the recent shortages and price hikes for the HD 6800 series seem to support this.

But whatever the reason may be, Cayman should hit the market in mid-December and have a very positive impact, especially for people looking for high-end graphics. Stay tuned.

Friday, November 12, 2010

Analyst Day

It's been a pretty busy week in the hardware world. The main event was probably AMD's Analyst Day. For those of you not familiar with it, it's a more-or-less annual event during which AMD talks to financial analysts about what they've been doing, how they've been doing, what they will be doing, and how they will be doing it.

They've discussed many things during this roughly 7-hour long event, but mostly, it was about Fusion and APUs, or Accelerated Processing Units, which is what AMD calls its upcoming CPU/GPU hybrids. Fusion refers to APUs, but also to the software side of things, to the "experience", etc. Yes, it's mostly a marketing term.

First of all, they were pretty proud to announce that as of this week, Ontario/Zacate is shipping. It will be released during CES, in January 2011. This is Ontario/Zacate:

At just 75mm², that's a tiny, tiny chip aimed at netbooks, thin & light notebooks, and small form factor or all-in-one desktops. It packs two out-of-order CPU cores based on the all-new Bobcat architecture, as well as a DX11 GPU with 80 Stream Processing Units, very similar to the one featured on the Radeon HD 5450. It is being manufactured by TSMC on a 40nm bulk process. Given its size, it is very very cheap to make. Here are the specs for Ontario/Zacate, which will henceforth be called Zacario, because I'm getting tired of writing Ontario/Zacate.

Note that at 1.2GHz and 9W, Ontario will only feature one core. You'll have to make do with 1GHz if you want two cores. More details here. I'm certain there will be a lot of different netbook models that cater to all possible needs. Indeed, AMD was quite bullish about Zacario, claiming a lot of design wins (over 100) and very enthusiastic response from OEMs altogether. And I can easily believe that. With two full OoO cores and a very capable GPU, all for just 9W, Ontario is going to make Atom look like a terrible option for anything bigger than a tablet. At higher clocks and 18W, Zacate will be quite appealing for OEMs looking to build very affordable thin and light notebooks, of the 11 to 13 or even 14" ilk. I believe Zacario is poised to become what fancy people call a disruptive innovation. Expect a lot of cool stuff in the mobile market come next year. And since you've behaved, here's a diagram of Bobcat:

But Zacario isn't the only upcoming APU: AMD is also working on Llano, which features 4 cores and a decent DX11 GPU, likely similar to Redwood, which equips the HD 5670. It will be manufactured in 32nm SOI with High-K + Metal Gate by GlobalFoundries, AMD's main foundry partner, and is expected to hit the market in H1'2011 (which is probably PR-speak for June). Here is a composite image of Llano, courtesy of Hans de Vries:

With full power-gating for each core, and a brand new manufacturing process, Llano should offer very good idle power, which is crucial for notebooks, its main target. It will also be aimed at desktops and should replace Athlon and Phenom up to X4 in this space. Its CPU cores are based on the current Stars architecture, which powers Phenom, but they are improved in several ways, detailed here. All in all this should be a well-rounded product with very decent CPU performance, good power characteristics, and graphics performance rivaling that of current mid-range cards (you know, the ≈$80 models). It's still too early to tell much beyond that, though.

It's not all about APUs, though: AMD is also readying a completely new CPU architecture, called Bulldozer. If by now you still haven't heard about it, read this. If you're too lazy, let's just say that it's a peculiar design that replicates some hardware within cores (now called modules) and shares some other components in order to maximize area- and power-efficiency. Each module can therefore handle two threads with dedicated integer units, a shared front-end, and a flexible, shared FPU… called "Flex FP". But honestly that just sounds way too "marketingy". Here's a basic diagram, with an SMT core next to it for comparison:

There's been a lot of speculation about Bulldozer, and if you want to know more I really urge you to read David Kanter's piece. As for this Analyst Day, AMD did reveal two very interesting things:
  • Bulldozer will be released first on desktops, as Zambezi, in Q2'2011 (probably June)
  • It will then make its way into servers, in Q3. 
AMD also chose to seize this opportunity to reveal bits about their roadmaps, so here they are:

Here you see Zacario and Llano, which we've already discussed. Notice that Llano will also be offered as a dual-core variant, but whether that's a quad-core die with disabled cores or a distinct ASIC remains unknown at this point.
Trinity will replace Llano, and it is pretty much what you would expect: 4 Bulldozer cores and a new GPU, based on Northern Islands, the GPU family that AMD is currently introducing to market, starting with the already released HD 6800 series. It is still a 32nm part.
Krishna, however, is a little more interesting: it's a 28nm, HK+MG part, meaning that instead of lagging a half-node and a metal gate behind its mainstream counterpart as Zacario does, Krishna will actually be a half-node ahead. As a result, it will be offered with up to 4 cores. Yes, that means you will almost certainly see 18W quad-cores with an integrated next-generation DX11 GPU in 2012. In ≤$499 notebooks. Pretty cool, huh?

This one is pretty self-explanatory, and yet slightly misleading: Komodo does not feature integrated graphics, just like Zambezi, but it is meant to pair up with discrete DX11 graphics. It will be based on Bulldozer, but "improved". How? That's a very good question; AMD didn't say.

Bulldozer was probably designed with servers as a primary target, and this is where it is most likely to shine. In 2012, AMD will add more cores, on top of improvements to the architecture, and probably some other stuff, but that's about as much as they're willing to say for now. Note that Interlagos is a drop-in replacement for the Opteron 6100 series, and the same applies to Valencia and the 4100 series. And just like the 6100 series, Interlagos is actually an MCM: two dies on one package, and therefore one socket.

The most surprising part of this event was probably this process roadmap:

As you can see, AMD intends to transition from full nodes to half-nodes, starting with 28nm. Is that for all products, or just APUs? Or is it just for low-end APUs similar to Zacario? Are those processes still SOI? AMD was reluctant to give more details, and understandably so, but this has left me a bit puzzled. Perhaps they believe this strategy can help them stay closer to Intel as the blue behemoth relentlessly moves forward with a new process generation every 24 months or so.

And of course, Analyst Day wouldn't be Analyst Day without a lot of stuff to make analysts feel all funny and tingly in places I dare not mention, right? So here are a few slides about the money side of things: