GeForce GTX 460 - GF104 Architecture Rumors Dissected

GeForce GTX 460 and GTX 465 "revision 2" rumors investigated. Leaked pictures show a smaller PCB and a better cooler. The GPU is not square, and we will see how that may be possible using the first GF100 chip as a basis.

Please see: GF104 Updates post for more accurate specs.

A Recap, Then The Details

Nvidia has been having lots of problems building GF100 chips, which has most recently meant cutting large numbers of TPCs out of chips and focusing on the GeForce GTX 465 (please see: Tesla dropped to 448 shaders, GeForce GTX 465 specs and Nvidia's GeForce GTX 480M) while dropping the GTX 470.
The GeForce GTX 465 has 5 of its 16 TPCs disabled, making it a very expensive harvest chip for Nvidia. The GeForce GTX 480M for laptops also has 5 TPCs disabled, which brings the total CUDA cores ("shaders") from 512 all the way down to 352.

Recently, rumors started surfacing that Nvidia's GF104 chip will feature 336 shaders. That's a very odd number: it is not a multiple of the 32 CUDA cores in a GF100 TPC, but it is a multiple of 24 (14 x 24), which would mean 8 cores missing from each TPC. If you look at Fermi's TPC, there is no apparent way that Nvidia can disable 8 cores without re-engineering the chip:
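The core counts quoted so far can be checked with a little arithmetic. This is only a minimal sketch, assuming (as this article does) that GF100 has 16 TPCs of 32 CUDA cores each; the function and constant names are made up for illustration:

```python
# Assumption (from the figures quoted in this article):
# GF100 has 16 TPCs of 32 CUDA cores each, 512 in total.
GF100_TPCS = 16
CORES_PER_TPC = 32

def cores_after_harvest(disabled_tpcs):
    """CUDA cores left after disabling whole TPCs on a GF100 die."""
    return (GF100_TPCS - disabled_tpcs) * CORES_PER_TPC

# Every core count reachable by switching off whole 32-core TPCs.
reachable = {cores_after_harvest(n) for n in range(GF100_TPCS + 1)}

print(cores_after_harvest(5))   # GTX 465 / GTX 480M case: 352
print(336 in reachable)         # False: 336 is not a multiple of 32
print(divmod(336, 24))          # (14, 0): exactly 14 groups of 24 cores
```

So 336 cannot come from simply harvesting a GF100-style chip, but it does fit 14 units of 24 cores.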

There are two warp schedulers, tightly coupled with two blocks of 16 cores. There's not much to do about that. The only possible path I see right now is if Nvidia changed the chip like this:

  1. Cores arranged in groups of 8, like in GT200 and G80/G92, running half-warps instead of full warps each "slow" clock cycle.
  2. Shrinking the register file by the same 25% as the number of cores per TPC (32 down to 24), as register requirements would be relaxed.
  3. Halving the load/store units so they match the number of cores in a group - I'm not sure of this change though; it's possible, not certain.
  4. Adding one warp scheduler and dispatch unit, which adds some area but is needed to keep the redesign simple and functional (see below for details).
With this in mind, some die space can be saved and the L2 interconnect can also be made smaller, since I expect the L2 to completely go away in this chip. See the following picture for a comparison of what changes:

In actual silicon, the hardware will be arranged differently, so don't take this space efficiency at face value.
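The half-warp point in item 1 can be made concrete. This is a rough sketch, assuming a warp of 32 threads and cores running at twice the "slow" scheduler clock (the 675MHz/1350MHz pairing in the leaked specs further down); the function names are hypothetical:

```python
WARP_SIZE = 32       # threads per warp on Nvidia hardware
FAST_PER_SLOW = 2    # assumption: cores run at 2x the "slow" clock

def threads_per_slow_clock(cores_in_group):
    """Threads one group of cores can process in a single slow clock."""
    return cores_in_group * FAST_PER_SLOW

def slow_clocks_per_warp(cores_in_group):
    """Slow clocks one group needs to push a full warp through."""
    return WARP_SIZE / threads_per_slow_clock(cores_in_group)

print(threads_per_slow_clock(16))  # 32: a full warp per slow clock (GF100 style)
print(threads_per_slow_clock(8))   # 16: a half-warp per slow clock (G80 style)
print(slow_clocks_per_warp(8))     # 2.0
```

Under these assumptions, three groups of 8 cores give the 24 cores per TPC that the 336 figure hints at.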

As for the scheduler, see Nvidia's Fermi paper:
"Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream. Using this elegant model of dual-issue, Fermi achieves near peak hardware performance."
Checking dependencies is not easy to do from a hardware area cost perspective, so without it I expect the schedulers to be fairly small. With that in mind, and because they do contribute to better performance efficiency from the cores, I strongly believe that GF104 might now feature three warp schedulers, greatly simplifying the rest of the TPC design.


How small can the core be made? Here's a hint:

While I may have cut a bit too much on the L2 cache portion of the chip, the die should look something like this, especially given leaked shots of a rectangular core. The TPCs are actually easier to shrink: given my estimate of the new design, and since the TPCs in GF100 are laid out horizontally, you just have to make them slightly shorter, since they are missing some cores, and you don't need to redesign the core much. You can leave it at 16 TPCs of 24 shaders each, for a grand total of 384 "CUDA cores".
In pixels, this core measures 595x444. The GF100 core, below, measures 593x600. Some rough math puts GF100 at around 35% larger, which makes this core about 26% smaller. I expect the real die to end up around 25-30% smaller, since I really did do a lot of cutting on the L2 cache and some other features may be added or upgraded.
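The pixel arithmetic can be checked directly. This is a small sketch that treats the mock-up dimensions as a stand-in for die area (an assumption; real silicon will be laid out differently). It prints both the shrink relative to GF100 and how much larger GF100 is, two figures that are easy to mix up:

```python
# Pixel dimensions of the two die mock-ups quoted above (width, height).
GF104_MOCKUP = (595, 444)
GF100_MOCKUP = (593, 600)

def area(size):
    """Area in square pixels of a (width, height) pair."""
    width, height = size
    return width * height

ratio = area(GF104_MOCKUP) / area(GF100_MOCKUP)

print(f"mock-up is {1 - ratio:.1%} smaller than GF100")         # 25.8%
print(f"GF100 is {1 / ratio - 1:.1%} larger than the mock-up")  # 34.7%
```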
It was still made into a rectangular die without too much thinking and rearranging of the core, which was the point of this whole exercise.

The GF100 core is big. While it may still be built for months to come, I don't expect it to be produced in mass volume, not even by the "mass" definition of Tesla manufacturing requirements. This is likely the reason why Nvidia announced big design wins for next-generation supercomputers and then we heard nothing more of the sort for months. Without shipping these chips in quantity, Nvidia has no customers in the HPC area, especially if reliability is also a concern.

Other Specifications

After looking at how the GF104 may have come into being, I leave you with the leaked specifications.

GeForce GTX 460 768MB
  • 336 CUDA cores
  • 1.8GHz GDDR5 @ 192 bit bus
  • 675MHz "slow" clock, 1350MHz "fast" (cores) clock
  • 37.8 GTexels/s (56 TMUs)
GeForce GTX 460 1GB
  • 336 CUDA cores
  • 1.8GHz GDDR5 @ 256 bit bus
  • 675MHz "slow" clock, 1350MHz "fast" (cores) clock
  • 37.8 GTexels/s (56 TMUs)
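The texture fill rate in these specs follows from the TMU count and the "slow" clock. This is a quick sketch, assuming each TMU delivers one texel per "slow" clock (my assumption, not something stated in the leak); the function name is just for illustration:

```python
def texture_fill_rate_gtexels(tmus, core_clock_mhz):
    """Peak texture fill rate in GTexels/s, at one texel per TMU per clock."""
    return tmus * core_clock_mhz / 1000

print(texture_fill_rate_gtexels(56, 675))   # 37.8
```

That reproduces the 37.8 GTexels/s in the list, which is why 56 reads as a TMU count.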
Update: an earlier version of this paragraph read the 37.8 figure as a pixel fill rate and concluded that the chip would have 56 ROPs, up from 48 in the GeForce GTX 480, possibly out of 64 with blocks of 8 disabled when they malfunction. As pointed out in the comments, that was a mistake: 37.8 GTexels/s is the texture fill rate, and 56 is the number of TMUs, not ROPs.

The cards have the same number of TMUs as the GF100 chips currently available. While the 1GB GTX 460 may feature 32 ROPs, the 768MB GTX 460 should feature only 24 ROPs, due to the cut-down memory bus: so far, the ROP count has always been correlated with the number of memory controllers, with each 64-bit memory controller attached to 8 ROPs.
Nvidia might be releasing an updated GeForce GTX 465 with 360 or 384 cores that will also have 64 Texture Filtering/Address units and the full 256-bit bus. Given the problems with yields, that may take some time though.
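The ROP counts follow directly from the bus width. This is a minimal sketch, assuming the rule stated above (one 64-bit memory controller per 8 ROPs) also holds for GF104; the names are hypothetical:

```python
MEMORY_CONTROLLER_BITS = 64   # width of one memory controller
ROPS_PER_CONTROLLER = 8       # ROPs attached to each controller

def rops_for_bus(bus_width_bits):
    """ROP count implied by a memory bus built from 64-bit controllers."""
    controllers = bus_width_bits // MEMORY_CONTROLLER_BITS
    return controllers * ROPS_PER_CONTROLLER

print(rops_for_bus(256))   # 32 (GTX 460 1GB)
print(rops_for_bus(192))   # 24 (GTX 460 768MB)
```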

I will be updating this article with theoretical performance figures once the chip's design has been confirmed by Nvidia.

Please see: GF104 Updates post for more accurate specs.


Anonymous said...

sorry, but you made some crucial mistakes there.
1) clearly states 37.8 billion/sec TEXTURE fillrate. Pixel fillrate is not even mentioned.

2) consequently, 56 is the number of TMUs, not ROPs. GF104 has 32 ROPs for the full chip. The ROPs are directly linked to the memory interface, so the 1GB version will have 32 ROPs and the 768MB version will have 24 ROPs.

Tiago Marques said...

Thank you, it has been updated accordingly.

Best regards

Anonymous said...

The 1GB version could still have "24" ROPs, just like the 5830 had 16 ROPs while still claiming 256-bit bandwidth.

Don't be surprised if Nvidia employs this kind of scheme that is almost like a lie. It's just like when ATI changed their generation number with RV670 cards (3870) from R600 (2900XT) and then Nvidia was so appalled by this themselves that they went ahead and changed the 8800GTS-512 to 9800GTX with slightly improved clocks.

The 5830 might as well be using 256-bit bandwidth--exactly the same speed as that of the 5850--but the fillrate tests, which were always directly correlated with the memory bandwidth, showed the 5830 to be seriously hampered, far worse than the 5850 or even the 5770. The halving of the ROPs made the card behave like a 128-bit child in several scenarios.

Tiago Marques said...

I don't know if they can control ROPs with that granularity. The 5830 has half the ROPs disabled, and I don't find that likely to happen here. I also don't think they can be competitive at the $250 or so that's rumored for the 1GB card if they disable more ROPs. The GF100 already suffered a bit from the lack of ROPs compared to the 5870, due to the higher clocks of the Radeon.

Anonymous said...

GTX 460/GF104's SM


Tiago Marques said...

Thanks for the heads up! Leave me your name/nickname and I'll leave you a proper thanks in the post I just made about this.

Seems they decided to stick another 16 cores per SM and cut down on the SMs. It's a shame that they had to cut down on the polymorph engines, but tessellation is not big right now.

At this price, and keeping the L2, I might as well get one to get some CUDA Fermi ports done.

Curious about die sizes though; it might be hard to cut 30% with the L2 there, and the cards are priced very decently even upon release.
