WilkerCSBD Posted March 29, 2021

With the RTX 3000 series on the market and few units available in stores due to mining, thinking about NVIDIA's next generation may seem hasty, especially when it is not expected until the end of 2022, but what we know is enough to give us an idea of what we could find with NVIDIA Lovelace.

The NVIDIA Lovelace architecture will not be released until 2022 and is expected to use TSMC's 5 nm node, although we cannot confirm this last detail yet: TSMC's 7 nm node was also rumored for the current RTX 3000 series, and in the end NVIDIA used Samsung's 8 nm node. In any case, for the moment the architecture remains a great unknown, and the little we do know raises certain doubts that have not been taken into account, especially regarding the configuration of some parts and the bandwidth.

What we know about the NVIDIA Lovelace architecture

[Image: Kopite7kimi's rumored Lovelace specifications]

Of the little we know about the future NVIDIA Lovelace, what is surprising is its configuration, which we know thanks to the NVIDIA insider Kopite7kimi, who we should remember is one of the most reliable sources, having revealed the specifications of the RTX 3000 series, based on the GeForce Ampere architecture, a year in advance.

His information on GeForce Lovelace? The top-of-the-range chip, AD102, will have a 12 GPC configuration instead of the 7 GPCs of the GA102.

GPU | CUDA cores
GM200 | 2,816
GP102 | 3,584
TU102 | 4,608
GA102 | 10,752
AD102 (?) | 18,432

What Kopite refers to is the number of GPCs and the number of TPCs per GPC. In GeForce Ampere there are 2 SMs or Compute Units per TPC and 128 CUDA cores per SM, a layout that Lovelace should inherit. Keeping Ampere's 6 TPCs per GPC, a 12 GPC chip works out to 12 × 6 × 2 × 128 = 18,432 CUDA cores, or FP32 ALUs, which would be the biggest leap in that regard in the last several generations of NVIDIA GPUs.

What's wrong with these specs? Above all the bandwidth, which is tied to energy consumption: increasing the number of GPCs means increasing the number of certain other units, which will have to be cut for this configuration to be feasible.

An assumption with feet of clay

[Image: render of an NVIDIA GPU with its VRAM]

The first thing we have to take into account is that the configuration of a GPU scales with the VRAM assigned to it, and we have no indication that NVIDIA is going to use any type of special memory capable of feeding a total of 12 Lovelace GPCs. It is true that in terms of die size they cannot go much beyond the GA102, which is almost touching the size limit accepted by Samsung's 8 nm node, and that the 5 nm node could double the transistor count, but the question remains: does the memory exist to feed such a quantity of GPCs?

We are talking about almost doubling the bandwidth, which cannot be done with GDDR6X. It is possible that NVIDIA will make use of the FG-DRAM it has been developing, but that type of memory seems aimed more at the HPC market than at the home market, if NVIDIA ever ships it at all. So NVIDIA will have to settle for GDDR6X, which can become a huge bottleneck.

Keeping an organization like that of GeForce Ampere means that the load on the VRAM will increase considerably, not only from the SMs but also from the fixed-function units, especially the texture units and the ROPs, which are the fixed-function units that read and write the most data to VRAM, and whose numbers would grow considerably in the new configuration.
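To see the scale of the jump, here is a minimal back-of-the-envelope sketch in Python. It simply assumes the Ampere-style hierarchy (6 TPCs per GPC, 2 SMs per TPC, 128 FP32 ALUs per SM) carries over unchanged to Lovelace, which is exactly the assumption the rumor rests on; none of these figures are confirmed specifications.

    # Back-of-the-envelope check of the rumored AD102 configuration,
    # assuming the Ampere-style hierarchy carries over to Lovelace.
    def fp32_alus(gpcs: int, tpcs_per_gpc: int = 6, sms_per_tpc: int = 2,
                  alus_per_sm: int = 128) -> int:
        """FP32 ALU (CUDA core) count for a given GPC configuration."""
        return gpcs * tpcs_per_gpc * sms_per_tpc * alus_per_sm

    print(fp32_alus(7))    # GA102:  10752 CUDA cores (7 GPCs, shipping part)
    print(fp32_alus(12))   # AD102?: 18432 CUDA cores (12 GPCs, rumored)
    print(round(fp32_alus(12) / fp32_alus(7), 2))  # ~1.71x more ALUs to feed

That roughly 1.7x jump in ALUs is what makes the memory question raised above so pressing: a 384-bit GDDR6X bus cannot grow its bandwidth by anything like the same factor.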
Possible changes to ROPs and texture units in Lovelace

[Image: possible Lovelace SM layout]

If we count the number of ROPs in the current GA102, we will see that there is a total of 112 ROPs across the 7 GPCs of that GPU, that is, 16 ROPs per GPC; a 12 GPC configuration at 16 ROPs per GPC would raise the count to 192 ROPs. Since the ROPs are the units that write the finished pixels to VRAM, the necessary write bandwidth would nearly double. The solution? Reduce the number of ROPs per GPC from 16 to 8, for a total of 96 ROPs, which is more in line with a 384-bit GDDR6X bus.

The other point is the texture units. In Maxwell the ratio was 16 FP32 ALUs per texture unit; since the texture units come in groups of four per SM, that makes a total of 64 FP32 ALUs inside the SMs of those architectures. That ratio was broken by Ampere's configuration, since at certain times an SM can run 128 FP32 ALUs against the same four texture units.

What do we believe? We think the change will be from a configuration of 4 sub-cores per SM to 8 sub-cores per SM. In this way the load that the texture units place on the VRAM will be much lower, since the same number of ALUs would be spread over half as many SMs, and therefore half as many texture units, but the ratio will rise again, to 64 FP32 ALUs or CUDA cores per texture unit.
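As a quick sanity check on those numbers, the sketch below reproduces the ROP totals and the ALU-to-texture-unit ratios. The per-GPC ROP counts and the 8-sub-core SM are this article's speculation, not confirmed specifications, and the per-SM figures follow the article rather than official documentation.

    # ROP totals for the configurations discussed above.
    def total_rops(gpcs: int, rops_per_gpc: int) -> int:
        return gpcs * rops_per_gpc

    print(total_rops(7, 16))   # GA102: 112 ROPs (shipping part)
    print(total_rops(12, 16))  # naive AD102: 192 ROPs, too many for 384-bit GDDR6X
    print(total_rops(12, 8))   # halved per GPC: 96 ROPs, closer to GA102's budget

    # FP32-ALU : texture-unit ratios per SM, using the article's figures.
    maxwell_ratio = 64 / 4          # 16 ALUs per TMU (4 TMUs, 64 ALUs per SM)
    ampere_ratio = 128 / 4          # up to 32 ALUs per TMU at full FP32 rate
    lovelace_ratio = (8 * 32) / 4   # 64 ALUs per TMU with a hypothetical
                                    # 8-sub-core SM (32 ALUs per sub-core)
    print(maxwell_ratio, ampere_ratio, lovelace_ratio)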