[Hardware] Nvidia Ada Lovelace and GeForce RTX 40-Series: Everything We Know



Nvidia's Ada architecture and GeForce RTX 40-series graphics cards first started shipping on October 12, 2022, beginning with the GeForce RTX 4090. The GeForce RTX 4080 followed one month later, on November 16, 2022, and the RTX 4070 Ti (formerly RTX 4080 12GB) launched on January 5, 2023. That's two years after the Nvidia Ampere architecture and basically right on schedule given the slowing down (or, if you prefer, death) of Moore's 'Law,' and it's good news, as the best graphics cards are in need of some new competition.

With the Nvidia hack early in 2022, we had a good amount of information on what to expect. Cards are now shipping and Nvidia has confirmed specs on the first RTX 40-series cards. We've collected everything into this central hub detailing everything we know and expect from Nvidia's Ada architecture and the RTX 40-series family. Rumors are still swirling, mostly concerning future Ada Lovelace cards like the rumored Titan RTX Ada / RTX 4090 Ti and the lower-tier models like the RTX 4070 (non-Ti), RTX 4060, and RTX 4050; those lower-spec GPUs are already shipping in Nvidia's RTX 40-series mobile solutions. But model numbers notwithstanding, we now have a good idea of what to expect from the Ada Lovelace architecture. With the Ada whitepaper now available alongside the GPUs, we've updated the information here to cover exactly what the new generation of GPUs delivers.

The first three desktop RTX 40-series cards have launched. If Nvidia follows a similar release schedule as in the past, we can expect the rest of the RTX 40-series to trickle out over the next year: the RTX 4070 will likely ship in April, followed by the RTX 4060 and 4050 in the coming months. Let's start with a high-level overview of the specs and rumored specs for the Ada series of GPUs.

The 4090, 4080, and 4070 Ti cards are now official, and their specs are fully accurate. The 4090 Ti and/or Titan, 4070, 4060, and 4050 entries require some generous helpings of salt, as they're more speculation than anything concrete. Nvidia hasn't officially revealed even the existence of these cards, and it won't until they're close to release; the laptop variants are likely to be quite different from the desktop models.

There are also likely to be intermediate cards that aren't in that table. For the RTX 30-series, as an example, Nvidia has ten major models with varying specs, ranging from the 3090 Ti down to the 3050. Other than the 4070 Ti, no other 40-series Ti cards have been revealed yet, but it's a safe bet that they'll arrive at some point. Certainly, there's plenty of room at the top for a future RTX 4090 Ti. Note that the maximum L2 cache is cut down on the 4090 (six blocks of 12MB instead of six blocks of 16MB), ROPs are trimmed a bit, and Nvidia could certainly push higher on clocks and power... and price. [Sigh.] While credible rumors of a 4-slot Founders Edition card have been circulating, nothing is official at present.

We do know that Nvidia is hitting clock speeds of 2.5–2.6 GHz on the 4090 and 4080, and we expect similar clocks on the other GPUs in the RTX 40-series. Nvidia has also successfully overclocked the RTX 4090 to 3.0GHz and beyond. We've put in tentative clock speed estimates of 2.6 GHz on the unannounced GPUs for now.

The three released models also use three different GPUs, which is a big change from previous launches. The RTX 4090 uses a significantly trimmed-down AD102 implementation (89% of the cores, 75% of the cache). Meanwhile, the RTX 4080 uses an "almost complete" AD103 chip (95% of the cores and all the cache), and the RTX 4070 Ti uses a fully enabled AD104 chip. Again, we can expect either harvested or more fully enabled variants of each GPU at some point.
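Those "89% of the cores, 75% of the cache" figures fall directly out of the published AD102 configuration. Here's a quick sketch of the arithmetic; the SM and cache counts are Nvidia's published numbers, and the percentages are simple ratios:

```python
# RTX 4090 vs. a fully enabled AD102 die (Nvidia's published configuration).
FULL_AD102_SMS = 144    # full AD102: 144 SMs
RTX_4090_SMS = 128      # enabled on the RTX 4090
FULL_AD102_L2_MB = 96   # six blocks of 16MB
RTX_4090_L2_MB = 72     # six blocks of 12MB

core_fraction = RTX_4090_SMS / FULL_AD102_SMS       # ~0.889, "89% of the cores"
cache_fraction = RTX_4090_L2_MB / FULL_AD102_L2_MB  # 0.75, "75% of the cache"

print(f"cores enabled: {core_fraction:.1%}")   # 88.9%
print(f"cache enabled: {cache_fraction:.1%}")  # 75.0%
```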


Nvidia will most likely use TSMC's 4N process, "4nm Nvidia," on all of the Ada GPUs, and it's definitely used for the launched cards. We know for certain that AD102, AD103, and AD104, along with Hopper H100, all use TSMC's 4N node, a tweaked variation of TSMC's N5 node that's been widely used in other chips and which is also used for AMD's Zen 4 and RDNA 3. We don't think Samsung will have a compelling alternative that wouldn't require a serious redesign of the core architecture, so the whole family will likely be on the same node.

Nvidia went big with the AD102 GPU, and it's closer in size and transistor count to the H100 than GA102 was to GA100. Frankly, it's a monster, with performance and price to match. It packs in far more SMs and associated cores than any Ampere GPU, it has much higher GPU clocks, and it also contains a number of architectural enhancements to further boost performance.

Nvidia claimed the RTX 4090 is 2x–4x faster than the outgoing RTX 3090 Ti, though caveats apply to those benchmarks. Our own testing puts performance at more like 60% faster in aggregate than the previous-generation RTX 3090 Ti, at 4K and maxed-out settings, without DLSS 2 or DLSS 3. But as we noted in our reviews, while DLSS 3 Frame Generation can boost frame rates, it's not the same as "real" frames and it adds latency, so it feels more like a 10–20 percent improvement over the baseline performance. It's also worth noting that if you're currently running a more modest processor rather than one of the absolute best CPUs for gaming, you could very well end up CPU limited even at 1440p ultra. A larger system upgrade will likely be necessary to get the most out of the fastest Ada GPUs.


With the high-level overview out of the way, let's get into the specifics. The most noticeable change with Ada GPUs is the number of SMs compared to the previous Ampere generation. At the top, AD102 packs 71% more SMs than GA102. Even if nothing else were to significantly change in the architecture, we would expect that to deliver a huge increase in performance.

That applies not just to graphics but to other elements as well. Most of the calculations haven't changed from Ampere, though the Tensor cores now support FP8 (with sparsity) to potentially double the FP16 performance. Each 4th-generation Tensor core can perform 256 FP16 calculations per clock, double that with sparsity, and double that again with FP8 and sparsity. The RTX 4090 has theoretical deep learning/AI compute of up to 661 teraflops in FP16 and 1,321 teraflops of FP8, and a fully enabled AD102 chip could hit 1.4 petaflops at similar clocks. The full GA102 in the RTX 3090 Ti by comparison tops out at around 321 TFLOPS FP16 (again, using Nvidia's sparsity feature). That means the RTX 4090 delivers a theoretical 107% increase, based on core counts and clock speeds.

The same theoretical boost applies to the shader and ray tracing hardware, except those are also changing. The GPU shader cores gain a new Shader Execution Reordering (SER) feature that Nvidia claims improves general performance by 25% and can improve ray tracing operations by up to 200%. Unfortunately, SER support requires developers to use proprietary Nvidia extensions, so existing games won't necessarily benefit.

The RT cores meanwhile have doubled the ray/triangle intersection throughput per core, plus they have a couple of new tricks available. The Opacity Micro-Map (OMM) Engine enables significantly faster ray tracing for transparent surfaces like foliage, particles, and fences. The Displaced Micro-Mesh (DMM) Engine, on the other hand, optimizes the generation of the Bounding Volume Hierarchy (BVH) structure; Nvidia claims it can create the BVH up to 10x faster while using 20x less (5%) memory for BVH storage. Again, these require developers to make use of the new features, so existing ray tracing games won't benefit without a patch. Together, these architectural enhancements should enable Ada Lovelace GPUs to deliver a massive generational leap in performance, but it will be up to developers to enable most of them, so uptake might be rather diminished.
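The 661/1,321 TFLOPS figures can be reproduced from the per-core rates quoted above. This back-of-the-envelope check assumes the RTX 4090's 128 SMs with four Tensor cores each and its roughly 2.52 GHz boost clock, both from Nvidia's published specs:

```python
# Tensor throughput: 256 FP16 ops/clock per 4th-gen Tensor core,
# doubled with sparsity, doubled again with FP8 (per the text above).
SMS = 128                   # RTX 4090 SM count
TENSOR_CORES = SMS * 4      # 4 Tensor cores per SM -> 512 total
BOOST_CLOCK_GHZ = 2.52      # published boost clock

fp16_dense = TENSOR_CORES * 256 * BOOST_CLOCK_GHZ / 1000  # TFLOPS, dense
fp16_sparse = fp16_dense * 2                              # ~661 TFLOPS
fp8_sparse = fp16_sparse * 2                              # ~1321 TFLOPS

print(f"FP16 with sparsity: {fp16_sparse:.0f} TFLOPS")  # 661
print(f"FP8 with sparsity:  {fp8_sparse:.0f} TFLOPS")   # 1321
```

The same arithmetic against the RTX 3090 Ti's ~321 TFLOPS yields the quoted 107% theoretical increase.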

Ada's ROP counts are going up quite a bit in some cases, particularly on the top model (for now), the RTX 4090. As with Ampere, Nvidia ties the ROPs to the GPCs, the Graphics Processing Clusters, though some of these can still be disabled. The AD102 has up to 144 SMs arranged in 12 GPCs of 12 SMs each. That yields 192 ROPs as the maximum, though the final number on the RTX 4090 is 11 GPCs and 176 ROPs. The RTX 4080 has seven GPCs, just like GA102, though in an odd change of pace it appears one of the GPC clusters only has 8 SMs while the other six have up to 12 SMs. Regardless, all seven are enabled on the RTX 4080, and it has 112 ROPs. AD104 in the RTX 4070 Ti uses five GPCs of 12 SMs, with 80 ROPs.

For the time being, the remaining three cards should be taken as a best guess. We don't know for certain which GPUs will be used, and there may be other models (e.g., an RTX 4060 Ti) interspersed between cards. We'll fill in the blanks as more information becomes available in the coming months, once the other Ada GPUs are closer to launching.
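The ROP counts above follow directly from the GPC counts: Ada, like Ampere, carries 16 ROPs per GPC, so the per-card totals are a simple multiplication:

```python
# 16 ROPs per GPC on Ada (as on Ampere); GPC counts are Nvidia's specs.
ROPS_PER_GPC = 16

cards = {
    "RTX 4090 (11 of 12 AD102 GPCs)": 11,
    "RTX 4080 (7 AD103 GPCs)": 7,
    "RTX 4070 Ti (5 AD104 GPCs)": 5,
}
for name, gpcs in cards.items():
    print(f"{name}: {gpcs * ROPS_PER_GPC} ROPs")  # 176, 112, 80
```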


Last year, Micron announced it has roadmaps for GDDR6X memory running at speeds of up to 24Gbps. The latest RTX 3090 Ti only uses 21Gbps memory, and Nvidia is currently the only company using GDDR6X for anything. That immediately raises the question of what will use 24Gbps GDDR6X, and the only reasonable answer seems to be Nvidia Ada. The lower-tier GPUs are more likely to stick with standard GDDR6, which tops out at 20Gbps and is used in AMD's RX 7900 XTX/XT cards, rather than GDDR6X.

This presents a bit of a problem, as GPUs generally need compute and bandwidth to scale proportionally to realize the promised performance. The RTX 3090 Ti, for example, has 12% more compute than the 3090, and its higher-clocked memory provides 8% more bandwidth. Based on the compute details shown above, there's a huge disconnect brewing. The RTX 4090 has around twice as much compute as the RTX 3090 Ti, but it offers the same 1008 GB/s of bandwidth. 24Gbps for an eventual RTX 4090 Ti, anyone?

There's far more room for bandwidth to grow on the lower-tier GPUs, assuming GDDR6X power consumption can be kept in check. The current RTX 3050 through RTX 3070 all use standard GDDR6 memory clocked at 14–15Gbps. We already know GDDR6 running at 20Gbps is available, so a hypothetical RTX 4050 with 18Gbps GDDR6 ought to easily keep up with the increase in GPU computational power. If Nvidia still needs more bandwidth, it could tap GDDR6X for the lower-tier GPUs as well.

The catch is that Nvidia doesn't need massive increases in pure memory bandwidth, because it has reworked the architecture instead, similar to what AMD did with RDNA 2 compared to the original RDNA architecture. Namely, it packs in a lot more cache to relieve the demands on the memory subsystem.
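The bandwidth figures quoted here all come from one standard formula: per-pin data rate times bus width, divided by eight to convert bits to bytes. A minimal sketch, using the published 21Gbps/384-bit configuration and a hypothetical 24Gbps upgrade:

```python
def bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Memory bandwidth in GB/s: Gbps per pin * bus width / 8 bits per byte."""
    return data_rate_gbps * bus_width_bits / 8

# RTX 4090 and RTX 3090 Ti: 21Gbps GDDR6X on a 384-bit bus.
print(bandwidth_gb_s(21, 384))  # 1008.0 GB/s

# Hypothetical 24Gbps GDDR6X on the same bus (e.g., an eventual 4090 Ti).
print(bandwidth_gb_s(24, 384))  # 1152.0 GB/s
```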

One great way to reduce the need for more raw memory bandwidth has been known and used for decades: put more cache on the chip, and you get more cache hits, and every cache hit means the GPU doesn't need to pull data from the GDDR6/GDDR6X memory. A large cache can be particularly helpful for gaming performance. AMD's Infinity Cache allowed the RDNA 2 chips to basically do more with less raw bandwidth, and the large Ada L2 cache shows Nvidia has taken a similar approach.

AMD uses a massive L3 cache of up to 128MB on the Navi 21 GPU, with 96MB on Navi 22, 32MB on Navi 23, and just 16MB on Navi 24. Surprisingly, even the smaller 16MB cache does wonders for the memory subsystem. We didn't think the Radeon RX 6500 XT was a great card overall, but it basically keeps up with cards that have almost twice the memory bandwidth.

The Ada architecture pairs up to 8MB of L2 cache with each 32-bit memory controller, or 16MB per 64-bit controller. That means cards with a 128-bit memory interface get 32MB of total L2 cache, while the 384-bit interface on AD102 carries up to 96MB of L2 cache. However, parts of the L2 cache can be disabled, so the RTX 4090 only has 72MB of L2 cache (six blocks of 12MB instead of 16MB). While that's less than AMD's RDNA 2 Infinity Cache in many cases, AMD also dropped to 96MB of total L3 cache for its top RX 7900 XTX. We also don't know latencies or other aspects of the design yet. L2 cache tends to have lower latencies than L3 cache, so a slightly smaller L2 could definitely keep up with a larger but slower L3 cache, and as we saw with RDNA 2 GPUs, even a 16MB or 32MB Infinity Cache helped a lot.

Take AMD's RX 6700 XT as an example. It has about 35% more compute than the previous-generation RX 5700 XT, and performance in our GPU benchmarks hierarchy is about 32% higher at 1440p ultra, so performance overall scaled pretty much in line with compute. Except the 6700 XT has a 192-bit interface and only 384 GB/s of bandwidth, 14% lower than the RX 5700 XT's 448 GB/s. That means the big Infinity Cache gave AMD at least a 50% boost to effective bandwidth. In general, it looks like Nvidia gets similar results with Ada, and even without wider memory interfaces the Ada GPUs should still have plenty of effective bandwidth. It's also worth mentioning that Nvidia's memory compression techniques in past architectures have proven capable, so slightly smaller caches compared to AMD may not matter at all.
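That "at least 50%" estimate can be reconstructed with simple arithmetic. Assuming performance would need effective bandwidth to scale with the ~35% compute uplift, the Infinity Cache has to make the 6700 XT's 384 GB/s behave like considerably more:

```python
# RX 6700 XT vs. RX 5700 XT (published specs from the text above).
RX_5700_XT_BW = 448    # GB/s, 256-bit GDDR6
RX_6700_XT_BW = 384    # GB/s, 192-bit GDDR6
COMPUTE_UPLIFT = 1.35  # ~35% more compute on the 6700 XT

# Effective bandwidth needed if bandwidth scaled with compute...
needed_effective_bw = RX_5700_XT_BW * COMPUTE_UPLIFT  # ~605 GB/s

# ...versus what the narrower bus actually provides: roughly a 57% gap,
# hence the article's conservative "at least 50%" figure.
boost = needed_effective_bw / RX_6700_XT_BW - 1
print(f"Implied effective-bandwidth boost: {boost:.1%}")
```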

https://www.tomshardware.com/features/nvidia-ada-lovelace-and-geforce-rtx-40-series-everything-we-know
