USM: A first look
When porting HPC codes to GPU-based architectures, a major concern is whether the cost of data movement between the host and device will outweigh the benefit of accelerating the computation on the GPU. As part of the ARCHER2 eCSE project “Porting x3d2 to AMD GPUs” I have been exploring the optimisations OpenMP offers for controlling data movement and comparing them with the Unified Shared Memory (USM) programming model. USM is available on recent hardware from the major GPU vendors1, 2, 3 and simplifies accelerator programming by presenting a single memory space that is accessible to both the host and device; with appropriate hardware support this can reduce the effort required to port or develop codes for accelerators. An obvious question is “What is the cost of using USM versus controlling the data movement manually?” In this post I apply some basic optimisations focused on data motion to a simple program using OpenMP’s target offload directives and compare these against using USM.
The following experiments were performed on one of EPCC’s GH200 Grace-Hopper nodes using nvfortran 25.1; the code and results are available at https://github.com/pbartholomew08/omptgt_setget (commit 93b1b57).
Throughout the experiments, dynamically allocated arrays of 10⁹ 32-bit floats (4 GB each) are used.
Unless stated otherwise, programs are compiled using FFLAGS=-g -O3 -mp=gpu -Minfo=mp
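The listings below show only the main body of each program. For reference, a minimal sketch of the setup they assume is shown below; the array names a and b and the size n match the listings, but the program name and exact declarations are illustrative rather than copied from the repository.

program omptgt_sketch
  implicit none

  ! Illustrative setup only: two dynamically allocated arrays of 10^9 32-bit reals (4 GB each).
  integer, parameter :: n = 10**9
  real(kind=4), allocatable :: a(:), b(:)  ! kind=4 assumed to be a 32-bit real
  integer :: i

  allocate(a(n), b(n))

  ! ... main body as shown in Listings 1, 4 and 6 ...

  deallocate(a, b)
end program omptgt_sketch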
1. The reference program
The reference program started as a simplified version of a test case used in developing x3d2 4, in which an array a is initialised on the host, its values are copied to an array b on the device where they are modified by a simple kernel, and finally the modified values are copied back to a for validation.
The main body of the program is shown in Listing 1. As can be seen, there is no attempt to control data transfers between the host and device: each operation simply launches a parallel loop.
! Initialise
a(:) = 1.0

! Set
!$omp target teams distribute parallel do
do i = 1, n
  b(i) = a(i)
end do
!$omp end target teams distribute parallel do

! Kernel
!$omp target teams distribute parallel do
do i = 1, n
  b(i) = 2 * b(i)
end do
!$omp end target teams distribute parallel do

! Get
!$omp target teams distribute parallel do
do i = 1, n
  a(i) = b(i)
end do
!$omp end target teams distribute parallel do

if (any(a /= 2.0)) then
  error stop
else
  print *, "PASS"
end if
Running this code under time and with the NVCOMPILER_ACC_NOTIFY environment variable set, to measure the execution time and trace data transfers between the host and device, we can see in Listing 2 that before each parallel loop the arrays it accesses are copied to the device, and after the loop has executed they are copied back to the host. This excessive data transfer likely explains the 93.92 s runtime.
$ NVCOMPILER_ACC_NOTIFY=3 OMP_TARGET_OFFLOAD=MANDATORY time ./main1
upload CUDA data file=.../main.f90 function=main line=16 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main.f90 function=main line=16 device=0 threadid=1 variable=a(:) bytes=4000000000
launch CUDA kernel file=.../main.f90 function=main line=16 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L16_2_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main.f90 function=main line=20 device=0 threadid=1 variable=a(:) bytes=4000000000
download CUDA data file=.../main.f90 function=main line=20 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main.f90 function=main line=23 device=0 threadid=1 variable=b(:) bytes=4000000000
launch CUDA kernel file=.../main.f90 function=main line=23 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L23_4_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main.f90 function=main line=27 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main.f90 function=main line=30 device=0 threadid=1 variable=a(:) bytes=4000000000
upload CUDA data file=.../main.f90 function=main line=30 device=0 threadid=1 variable=b(:) bytes=4000000000
launch CUDA kernel file=.../main.f90 function=main line=30 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L30_6_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main.f90 function=main line=34 device=0 threadid=1 variable=b(:) bytes=4000000000
download CUDA data file=.../main.f90 function=main line=34 device=0 threadid=1 variable=a(:) bytes=4000000000
PASS
93.92user 4.44system 1:39.31elapsed 99%CPU (0avgtext+0avgdata 7925760maxresident)k
0inputs+0outputs (0major+162431minor)pagefaults 0swaps
2. Unified Shared Memory
The initial goal of this work was to understand how device memory can be controlled to optimise performance using OpenMP; it was out of curiosity that, after implementing the initial program, I turned on Unified Shared Memory by adding -gpu=mem:unified to the FFLAGS used to compile main1. Outside of this test, this compiler flag is not used.
As comparing the time reported in Listing 3 against Listing 2 shows, enabling USM, without any code modification, resulted in a more than 10× speedup! This raises the question of how this compares to manual memory management: is this good performance, or can we do better?
Interestingly, the trace does not show any data transfers; however, as far as I’m aware the USM mechanism still results in data being migrated to the processing unit that is currently operating on that memory.
$ NVCOMPILER_ACC_NOTIFY=3 OMP_TARGET_OFFLOAD=MANDATORY time ./main1.usm
launch CUDA kernel file=.../main1.f90 function=main line=16 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L16_2_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
launch CUDA kernel file=.../main1.f90 function=main line=23 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L23_4_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
launch CUDA kernel file=.../main1.f90 function=main line=30 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L30_6_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
PASS
1.65user 3.34system 0:06.01elapsed 83%CPU (0avgtext+0avgdata 4027392maxresident)k
0inputs+0outputs (1major+14943minor)pagefaults 0swaps
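As an aside, OpenMP (5.0 and later) also provides a way to request this behaviour from the source rather than via a compiler flag: the requires unified_shared_memory directive. It was not used in these experiments, and its support and behaviour vary between compilers, but a minimal sketch of its use might look like the following.

program usm_requires_sketch
  implicit none
  ! Sketch only: request unified shared memory via the OpenMP requires directive
  ! instead of the -gpu=mem:unified compiler flag used above.
  !$omp requires unified_shared_memory
  integer, parameter :: n = 1000
  real, allocatable :: a(:)
  integer :: i

  allocate(a(n))
  a(:) = 1.0

  ! With USM the host-allocated array can be accessed directly from the target region.
  !$omp target teams distribute parallel do
  do i = 1, n
    a(i) = 2 * a(i)
  end do
  !$omp end target teams distribute parallel do

  if (any(a /= 2.0)) then
    error stop
  else
    print *, "PASS"
  end if
end program usm_requires_sketch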
3. Controlling data motion
As Listing 2 makes clear, without USM we are moving data unnecessarily: for example, in the Set kernel the uninitialised contents of b are uploaded to the device and the unmodified contents of a are copied back to the host.
The direction of data motion can be specified by adding map clauses to the OpenMP directives to reduce data transfers. Considering the Set kernel, we use map(to:a) map(from:b) to eliminate copying b to the device and a back from the device.
The program body with these optimisations applied is shown in Listing 4. Note that b must still be copied to and from the device for the main kernel so that its initialised values are available to the operation and its modified values are returned for use in the subsequent Get kernel.
Listing 5 shows the trace and timing from running main2: as expected, the reported data transfers are reduced and the runtime drops correspondingly (≈50% improvement).
! Initialise
a(:) = 1.0

! Set
!$omp target teams distribute parallel do map(to:a) map(from:b)
do i = 1, n
  b(i) = a(i)
end do
!$omp end target teams distribute parallel do

! Kernel
!$omp target teams distribute parallel do map(tofrom:b)
do i = 1, n
  b(i) = 2 * b(i)
end do
!$omp end target teams distribute parallel do

! Get
!$omp target teams distribute parallel do map(to:b) map(from:a)
do i = 1, n
  a(i) = b(i)
end do
!$omp end target teams distribute parallel do

if (any(a /= 2.0)) then
  error stop
else
  print *, "PASS"
end if
$ NVCOMPILER_ACC_NOTIFY=3 OMP_TARGET_OFFLOAD=MANDATORY time ./main2
upload CUDA data file=.../main2.f90 function=main line=16 device=0 threadid=1 variable=a$sd1(:) bytes=128
upload CUDA data file=.../main2.f90 function=main line=16 device=0 threadid=1 variable=b$sd2(:) bytes=128
upload CUDA data file=.../main2.f90 function=main line=16 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main2.f90 function=main line=16 device=0 threadid=1 variable=a(:) bytes=4000000000
upload CUDA data file=.../main2.f90 function=main line=16 device=0 threadid=1 variable=descriptor bytes=128
launch CUDA kernel file=.../main2.f90 function=main line=16 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L16_2_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main2.f90 function=main line=20 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main2.f90 function=main line=23 device=0 threadid=1 variable=b$sd2(:) bytes=128
upload CUDA data file=.../main2.f90 function=main line=23 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main2.f90 function=main line=23 device=0 threadid=1 variable=b(:) bytes=4000000000
launch CUDA kernel file=.../main2.f90 function=main line=23 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L23_4_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main2.f90 function=main line=27 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main2.f90 function=main line=30 device=0 threadid=1 variable=a$sd1(:) bytes=128
upload CUDA data file=.../main2.f90 function=main line=30 device=0 threadid=1 variable=b$sd2(:) bytes=128
upload CUDA data file=.../main2.f90 function=main line=30 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main2.f90 function=main line=30 device=0 threadid=1 variable=b(:) bytes=4000000000
upload CUDA data file=.../main2.f90 function=main line=30 device=0 threadid=1 variable=descriptor bytes=128
launch CUDA kernel file=.../main2.f90 function=main line=30 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L30_6_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main2.f90 function=main line=34 device=0 threadid=1 variable=a(:) bytes=4000000000
PASS
44.21user 4.76system 0:49.96elapsed 98%CPU (0avgtext+0avgdata 7925760maxresident)k
0inputs+0outputs (0major+200171minor)pagefaults 0swaps
4. Device-resident data
Although we have achieved a reasonable speedup by controlling data motion, we can still do better.
In reality the array b is never required on the host: its values are initialised, modified and read on the device, so the associated data transfers shown in Listing 5 are unnecessary overhead.
Rather than map’ing b between the host and device, it can be kept resident in device memory by using the target enter data and target exit data directives to allocate b on the device and delete it once it is no longer needed.
This optimisation is shown in Listing 6. Note that all map clauses for b have been removed and the offloaded loops now execute between the target enter data directive that allocates b on the device and the target exit data directive that deletes it.
The reduction in data transfers is confirmed by the trace in Listing 7, and the total elapsed time is now over 10× lower than that of the original program.
Without making more drastic changes to the program (for example, we don’t really need to copy a into b, operate on b and then copy the modified result back to a; a sketch of such a variant is shown after Listing 7), this is probably a reasonable limit of the optimisation that is possible5.
! Initialise
a(:) = 1.0

!$omp target enter data map(alloc:b)

! Set
!$omp target teams distribute parallel do map(to:a)
do i = 1, n
  b(i) = a(i)
end do
!$omp end target teams distribute parallel do

! Kernel
!$omp target teams distribute parallel do
do i = 1, n
  b(i) = 2 * b(i)
end do
!$omp end target teams distribute parallel do

! Get
!$omp target teams distribute parallel do map(from:a)
do i = 1, n
  a(i) = b(i)
end do
!$omp end target teams distribute parallel do

if (any(a /= 2.0)) then
  error stop
else
  print *, "PASS"
end if

!$omp target exit data map(delete:b)
$ NVCOMPILER_ACC_NOTIFY=3 OMP_TARGET_OFFLOAD=MANDATORY time ./main3
upload CUDA data file=.../main3.f90 function=main line=17 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main3.f90 function=main line=17 device=0 threadid=1 variable=a$sd1(:) bytes=128
upload CUDA data file=.../main3.f90 function=main line=17 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main3.f90 function=main line=17 device=0 threadid=1 variable=a(:) bytes=4000000000
launch CUDA kernel file=.../main3.f90 function=main line=17 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L17_2_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
upload CUDA data file=.../main3.f90 function=main line=24 device=0 threadid=1 variable=descriptor bytes=128
launch CUDA kernel file=.../main3.f90 function=main line=24 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L24_4_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
upload CUDA data file=.../main3.f90 function=main line=31 device=0 threadid=1 variable=descriptor bytes=128
upload CUDA data file=.../main3.f90 function=main line=31 device=0 threadid=1 variable=a$sd1(:) bytes=128
upload CUDA data file=.../main3.f90 function=main line=31 device=0 threadid=1 variable=descriptor bytes=128
launch CUDA kernel file=.../main3.f90 function=main line=31 device=0 host-threadid=0 num_teams=0 thread_limit=0 kernelname=nvkernel_MAIN__F1L31_6_ grid=<<<7812500,1,1>>> block=<<<128,1,1>>> shmem=0b
download CUDA data file=.../main3.f90 function=main line=35 device=0 threadid=1 variable=a(:) bytes=4000000000
PASS
0.52user 3.86system 0:05.37elapsed 81%CPU (0avgtext+0avgdata 4027392maxresident)k
0inputs+0outputs (0major+62607minor)pagefaults 0swaps
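For completeness, the more drastic simplification mentioned above, operating on a directly on the device and dropping b entirely, might look something like the sketch below; it was not implemented or measured as part of these experiments.

! Sketch only: skip the intermediate array b and double a in place on the device.
! Initialise
a(:) = 1.0

! Set, Kernel and Get collapsed into a single offloaded loop
!$omp target teams distribute parallel do map(tofrom:a)
do i = 1, n
  a(i) = 2 * a(i)
end do
!$omp end target teams distribute parallel do

if (any(a /= 2.0)) then
  error stop
else
  print *, "PASS"
end if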
5. Conclusion
Comparing the timings reported for the reference program with USM enabled and the hand-optimised version without USM (see Listings 3 and 7) shows that using USM has very little performance impact here; a more careful study with repeated measurements may even show the difference between the two approaches to be negligible. Further testing is needed, including on other vendors’ hardware with the associated compilers, and it must be noted that this example is extremely simple and contrived. Nevertheless, using USM for the initial port or development of a program for accelerators seems like a reasonable approach when suitable hardware is available.
Footnotes:
Intel Unified Shared Memory documentation: https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/host-device-memory-buffer-and-usm.html
AMD Unified Shared Memory documentation: https://rocm.docs.amd.com/projects/llvm-project/en/latest/conceptual/openmp.html#unified-shared-memory
Nvidia Unified Shared Memory documentation: https://docs.nvidia.com/hpc-sdk/archive/24.3/compilers/hpc-compilers-user-guide/index.html#openmp-unified-mem
Further improvements might be gained by tuning the kernel launch parameters such as grid and thread block dimensions.