Getting Multiple GPUs with an Interconnect
For the following experiments, I am using a host with 4 A100 GPUs attached to it. The GPUs are connected in pairs through NVLink, for a total of 2 NVLink connections. You can create a similar setup in AWS by provisioning a p4d class machine. Those come with 8 A100 GPUs, so the numbers might differ from what you see here, but the analysis remains the same.
As always, you might have to request quota before you can provision these GPUs, which means submitting a ticket. Also, A100s are expensive, so pay attention to the price per hour before provisioning them to avoid sticker shock 😆
What is NVLink?
Nvidia describes NVLink on their website as follows -
"NVLink is a 1.8TB/s bidirectional, direct GPU-to-GPU interconnect that scales multi-GPU input and output (IO) within a server. The NVIDIA NVLink Switch chips connect multiple NVLinks to provide all-to-all GPU communication at full NVLink speed within a single rack and between racks."
To unpack it: NVLink is a separate piece of hardware that connects two GPUs directly, bypassing the PCI Express slot and the host entirely and thereby avoiding the latency added there. The 1.8TB/s headline figure refers to the latest generation of NVLink; the A100s used here have third-generation NVLink, which tops out at roughly 600GB/s of bidirectional bandwidth per GPU. As you will see in the later sections, this allows two GPUs to talk to each other almost as if they were one GPU, which is great from a data transfer standpoint.
We can get the topology information of our host by running nvidia-smi topo with the --matrix flag -
nvidia-smi topo --matrix
An example output is the following –
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    SYS     SYS     0-23            0               N/A
GPU1    NV12     X      SYS     SYS     24-47           1               N/A
GPU2    SYS     SYS      X      NV12    48-71           2               N/A
GPU3    SYS     SYS     NV12     X      72-95           3               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
Looking at the matrix, we see that GPU0 and GPU1 are connected to each other through NVLink (NV12), and similarly GPU2 and GPU3 are connected through NVLink. If any non-NVLink-ed GPUs want to talk to each other, they have to do so over PCIe. I will write a future post going deeper into this as I understand it better. For now, it is sufficient to understand that data transfer between two GPUs that are not connected through NVLink will be much slower, and we can see exactly that in the next section.
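Before moving on to the benchmark, we can verify the same connectivity programmatically. The short CUDA program below is my own sketch, not part of the official samples; it queries cudaDeviceCanAccessPeer for every pair of GPUs and should reproduce the NVLink pairing we just read off the topology matrix (compile it with something like nvcc p2p_check.cu -o p2p_check, the file name being my own choice).

// p2p_check.cu - print which GPU pairs can access each other's memory directly.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    printf("Found %d GPUs\n", n);

    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            // canAccess becomes 1 if device i can directly access memory on device j.
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU%d -> GPU%d : %s\n", i, j,
                   canAccess ? "peer access possible" : "no peer access");
        }
    }
    return 0;
}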
Running the P2P Bandwidth Test
The CUDA samples ship with an executable called p2pBandwidthLatencyTest (it lives in NVIDIA's cuda-samples repository) that measures the bandwidth and latency of copies between every pair of GPUs, both with peer-to-peer (P2P) access disabled and with it enabled. Running it on this host exercises exactly the topology we mapped out above: the NVLink-ed pairs can talk to each other directly, while every other pair has to go through the host over PCIe.
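Under the hood, the bandwidth part of the test essentially boils down to enabling peer access and timing device-to-device copies. The stripped-down sketch below is my own illustration of that idea, not the sample's actual code; it assumes devices 0 and 1 are the NVLink-ed pair from the topology above and uses a 256 MiB payload.

// p2p_bw_sketch.cu - time peer-to-peer copies from GPU0 to GPU1 and report GB/s.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB payload
    const int repeat = 20;

    float *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaMalloc((void**)&src, bytes);
    cudaDeviceEnablePeerAccess(1, 0);    // let GPU0 reach GPU1 directly

    cudaSetDevice(1);
    cudaMalloc((void**)&dst, bytes);
    cudaDeviceEnablePeerAccess(0, 0);    // and the other way around

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < repeat; ++i) {
        // With peer access enabled on an NVLink-ed pair, this copy is not
        // staged through host memory; it goes GPU-to-GPU directly.
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = (double)bytes * repeat / (ms / 1000.0) / 1e9;
    printf("GPU0 -> GPU1: %.1f GB/s\n", gbps);
    return 0;
}

The real sample repeats this kind of measurement for every device pair and both directions, and also measures latency with small transfers.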
Here is the output we got from p2pBandwidthLatencyTest on this host -
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:1
Device: 1, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:2
Device: 2, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:3
Device: 3, NVIDIA A100 80GB PCIe, pciBusID: 0, pciDeviceID: 0, pciDomainID:4
Device=0 CAN Access Peer Device=1
Device=0 CANNOT Access Peer Device=2
Device=0 CANNOT Access Peer Device=3
Device=1 CAN Access Peer Device=0
Device=1 CANNOT Access Peer Device=2
Device=1 CANNOT Access Peer Device=3
Device=2 CANNOT Access Peer Device=0
Device=2 CANNOT Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=3 CANNOT Access Peer Device=0
Device=3 CANNOT Access Peer Device=1
Device=3 CAN Access Peer Device=2
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1 2 3
0 1 1 0 0
1 1 1 0 0
2 0 0 1 1
3 0 0 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1508.20 21.51 21.63 21.80
1 21.40 1509.66 21.68 21.59
2 21.48 21.55 1509.66 21.60
3 21.49 21.66 21.60 1511.12
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3
0 1502.40 275.07 21.62 21.60
1 275.51 1521.42 21.69 21.50
2 21.46 21.72 1508.20 275.34
3 21.51 21.69 274.11 1521.42
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1525.88 26.41 29.57 29.35
1 26.00 1531.86 29.68 29.49
2 27.41 27.69 1528.12 28.93
3 26.85 27.22 28.99 1529.61
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3
0 1531.11 516.19 29.58 29.35
1 515.51 1527.37 29.79 29.42
2 27.36 27.80 1528.86 517.45
3 27.07 27.19 515.35 1523.65
P2P=Disabled Latency Matrix (us)
GPU 0 1 2 3
0 2.61 21.89 20.48 17.81
1 11.68 2.72 14.22 20.16
2 18.63 15.70 2.48 13.97
3 18.73 18.53 18.04 2.49
CPU 0 1 2 3
0 2.65 7.81 6.98 6.28
1 7.10 2.55 6.22 6.19
2 7.20 6.47 2.29 5.74
3 6.64 6.49 5.61 2.30
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1 2 3
0 2.62 2.25 19.85 16.96
1 2.22 2.71 14.12 20.38
2 18.87 16.03 2.47 2.32
3 18.78 18.08 2.32 2.48
CPU 0 1 2 3
0 2.34 1.92 6.34 6.43
1 1.96 2.32 6.34 6.56
2 6.79 7.20 1.99 1.63
3 6.67 6.55 1.68 2.07
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
This output shows things much more clearly. When P2P is disabled, no NVLink is used, so every transfer between GPUs goes over the PCI Express slot and is slow. The magic happens in the cases where P2P is enabled.
We observe the following things -
- If two GPUs are connected through NVLink, their unidirectional P2P bandwidth (~275 GB/s) is more than 10x what we get between GPUs that are not connected through NVLink (~21 GB/s). For bidirectional transfers the NVLink number roughly doubles to ~515 GB/s, while the non-NVLink pairs only reach ~27-30 GB/s.
- The latency for a GPU to send data to itself is about 2.6 us. Sending the same data to a GPU that is not NVLink-ed takes 14-20 us. With NVLink it drops to roughly 2.3 us, which is almost as fast as the GPU talking to itself. In other words, NVLink comes close to bonding the two GPUs together as far as data movement is concerned (the sketch after this list shows one way to exploit that in code).
- Lastly, the CPU-side latencies also jump for transfers between GPUs that are not NVLink-ed. This is expected: the data has to travel over the PCIe slot, and the CPU has to step in for synchronization and housekeeping.
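One observation worth making concrete: once peer access is enabled, it is not just copies that get faster; a kernel running on one GPU can directly dereference memory that physically lives on its NVLink partner. The sketch below is again my own illustration, assuming GPU0 and GPU1 are the NVLink-ed pair from the topology above.

// peer_kernel.cu - a kernel launched on GPU0 updates a buffer that resides on GPU1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;

    // Allocate the buffer on GPU1.
    float* buf = nullptr;
    cudaSetDevice(1);
    cudaMalloc((void**)&buf, n * sizeof(float));
    cudaMemset(buf, 0, n * sizeof(float));

    // Allow GPU0 to access GPU1's memory directly (over NVLink for this pair).
    cudaSetDevice(0);
    cudaError_t err = cudaDeviceEnablePeerAccess(1, 0);
    if (err != cudaSuccess) {
        printf("Peer access not available: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // The kernel runs on GPU0 but reads and writes GPU1's buffer in place.
    scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);
    cudaDeviceSynchronize();
    printf("Kernel on GPU0 touched memory resident on GPU1\n");
    return 0;
}

Without the cudaDeviceEnablePeerAccess call, GPU0 would have no mapping for that pointer and the access would fail at runtime, which is why the return code is checked above.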
Conclusion
To summarize: NVLink gives directly connected GPUs an order of magnitude more bandwidth and near-local latency compared to GPU pairs that have to talk over PCIe, and a quick pass with nvidia-smi topo --matrix plus the p2pBandwidthLatencyTest sample is enough to map out and verify the topology of a host. This matters as soon as you place multi-GPU workloads, because you want the GPUs that communicate the most to be NVLink partners. In real clusters that placement is usually handled by a workload orchestrator like SLURM, which is more or less similar to Kubernetes but more popular in the HPC world.