NVIDIA's New Ampere GPU is a Game Changer for Artificial Intelligence

Today, NVIDIA announced its new Ampere architecture, alongside the new A100 GPU built on it. It's a significant improvement over Turing, itself an AI-focused architecture that powers high-end ray tracing in the consumer graphics space and machine learning in data centers.

If you want the full rundown of all the highly technical details, you can read NVIDIA's in-depth presentation of the architecture. Here, we'll break down the most important bits.

The new die is absolutely massive

Right out of the gate, NVIDIA is going all out with this new chip. The previous-generation Tesla V100 die measured 815 mm² on TSMC's already mature 12 nm process node and held 21.1 billion transistors. That was already quite big, but the A100 weighs in at 826 mm² on TSMC's much denser 7 nm process, with 54.2 billion transistors. An impressive feat for such a new node.
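
To put "much denser" in numbers, a quick back-of-the-envelope calculation from the figures above: 21.1 billion transistors over 815 mm² works out to about 25.9 million transistors per mm² for the V100, while 54.2 billion over 826 mm² is about 65.6 million per mm² for the A100 — call it a 2.5x density jump on an almost identical die area.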

This new GPU offers 19.5 teraflops of FP32 performance, 6,912 CUDA cores, 40 GB of memory, and 1.6 TB/s of memory bandwidth. In one fairly specific workload (sparse INT8), the A100 actually cracks 1 peta-operations per second of raw compute. Sure, that's INT8, but it's still a seriously powerful card.
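
As a sanity check on that FP32 figure (assuming the A100's published boost clock of about 1.41 GHz, which isn't quoted above): each CUDA core can retire one fused multiply-add, or two floating-point operations, per cycle, so 6,912 cores × 2 FLOPs × 1.41 GHz ≈ 19.5 TFLOPS.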

Then, just as with the V100, NVIDIA took eight of these GPUs and built a mini-supercomputer that it's selling for $200,000. You'll likely see these arriving soon at cloud providers like AWS and Google Cloud Platform.

However, unlike the V100, this isn't just one massive GPU: with the new Multi-Instance GPU feature, each A100 can be partitioned into up to seven separate GPU instances that can be virtualized and rented out on their own for different tasks, each with its own dedicated slice of memory and bandwidth.

As for what all those transistors are doing, the new chip performs much faster than the V100. For AI training and inference, the A100 offers a 6x speedup for FP32, 3x for FP16, and 7x for inference when using all of these GPUs together.

Note that the V100 shown in NVIDIA's second benchmark chart is the 8-GPU V100 server, not a single V100.

NVIDIA also promises speedups of up to 2x in many HPC workloads.

As for raw TFLOPS numbers, the A100's FP64 double-precision performance is 20 TFLOPS, versus 8 TFLOPS for the V100. Overall, these speedups are a true generational improvement over the Volta-based V100, and they're great news for the AI and machine learning space.

TensorFloat-32: a new number format optimized for Tensor Cores

With Ampere, NVIDIA is introducing a new number format designed to replace FP32 in certain workloads. Essentially, FP32 uses one sign bit, 8 bits for the exponent (which sets the number's range, i.e. how large or small it can be), and 23 bits for the mantissa (which sets its precision).
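
To make that bit layout concrete, here's a minimal Python sketch (our illustration, not anything from NVIDIA) that unpacks a float into those fields:

    import struct

    def fp32_fields(x: float):
        """Split a float's IEEE 754 single-precision encoding into its fields."""
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        sign = bits >> 31                # 1 bit
        exponent = (bits >> 23) & 0xFF   # 8 bits of range
        mantissa = bits & 0x7FFFFF       # 23 bits of precision
        return sign, exponent, mantissa

    print(fp32_fields(3.14159))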

NVIDIA claims those 23 bits of precision aren't really necessary for many AI workloads, and that you can get similar results and much better performance with just 10 of them. The new format is called TensorFloat-32, and the A100's Tensor Cores are optimized to handle it. This, on top of the die shrink and the increased core count, is how they get the massive 6x speedup in AI training.
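
Since TF32 keeps FP32's 8-bit exponent and simply drops the bottom 13 bits of the mantissa, you can emulate its reduced precision in a few lines. This is a sketch of the format for illustration only (the hardware rounds rather than truncates, and emulate_tf32 is our own name):

    import struct

    def emulate_tf32(x: float) -> float:
        """Truncate an FP32 value to TF32 precision: keep the sign bit,
        the 8-bit exponent, and only the top 10 bits of the mantissa."""
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        bits &= ~((1 << 13) - 1)  # zero the 13 low mantissa bits
        return struct.unpack(">f", struct.pack(">I", bits))[0]

    print(emulate_tf32(3.14159265))  # slightly coarser than the FP32 value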

They claim that "users don't have to make any code changes, because TF32 only runs inside the A100 GPU. TF32 operates on FP32 inputs and produces results in FP32. Non-tensor operations continue to use FP32." This means it should be a drop-in replacement for workloads that don't need the extra precision.
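
In practice, frameworks expose this as a library-level switch rather than a code change. PyTorch, for example, later added explicit TF32 toggles; a minimal sketch:

    import torch

    # Let Ampere Tensor Cores run FP32 matmuls and cuDNN convolutions in TF32.
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

    # Model code is unchanged: inputs and outputs stay FP32.
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b  # executed in TF32 on A100-class hardware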

Comparing FP32 performance on the V100 to TF32 performance on the A100 shows where those massive speedups come from: TF32 is up to ten times faster. Of course, that's partly because other Ampere improvements make the card roughly twice as fast in general, so it's not an apples-to-apples comparison.

They also introduced a new concept called fine-grained structured sparsity, which boosts the compute performance of deep neural networks. Basically, some weights matter less than others, so the matrix math can be compressed to improve throughput. While throwing away data doesn't sound like a great idea, NVIDIA argues it doesn't hurt the accuracy of the trained network at inference time; it simply speeds things up.
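
The structure in question is a 2:4 pattern: in every group of four weights, the two smallest are zeroed, letting the hardware skip half the math. Here's a minimal NumPy sketch of that pruning pattern (an illustration, not NVIDIA's actual tooling; prune_2_4 is our own name):

    import numpy as np

    def prune_2_4(weights: np.ndarray) -> np.ndarray:
        """Zero the two smallest-magnitude values in every group of four."""
        w = weights.reshape(-1, 4).copy()
        drop = np.argsort(np.abs(w), axis=1)[:, :2]  # two smallest per group
        np.put_along_axis(w, drop, 0.0, axis=1)
        return w.reshape(weights.shape)

    w = np.random.randn(2, 8).astype(np.float32)
    print(prune_2_4(w))  # exactly half the entries in each group of 4 are zero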

For sparse INT8 calculations, the peak performance of a single A100 is 1,250 TOPS, an incredibly high number. Sure, you'll be hard-pressed to find a real workload running nothing but INT8, but a speedup is a speedup.
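
For context on where that number comes from (using NVIDIA's published specs rather than figures from this article): the A100's dense INT8 rate is 624 TOPS, and the structured-sparsity feature doubles effective throughput, giving 624 × 2 = 1,248, or roughly 1,250 TOPS.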
