CUDA.jl 3.5-3.8


Tim Besard

CUDA.jl versions 3.5 to 3.8 have brought several new features that improve performance and productivity. This blog post highlights a few of them: direct copies between devices, better performance thanks to preserved array index types and a reconfigured memory pool, and a much-improved interface to the compute-sanitizer utility.

Copies between devices

Typically, when sending data between devices you need to stage it through the CPU. CUDA.jl now does this automatically, making it possible to copy directly between CuArrays on different devices:

julia> device!(0);

julia> a = CUDA.rand(2,2)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

julia> device!(1);

julia> b = CUDA.zeros(2,2);

julia> copyto!(b, a)
2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.440147  0.986939
 0.622901  0.698119

When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can significantly improve bandwidth and latency. You can check whether this mode of communication is possible:

julia> src = CuDevice(0)
CuDevice(0): NVIDIA A100-PCIE-40GB

julia> dst = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> can_access_peer(src, dst)
false

In this case, peer-to-peer communication is not possible because the devices have different compute capability major revision numbers. Between compatible devices, the function returns true:

julia> src = CuDevice(1)
CuDevice(1): Tesla V100-PCIE-32GB

julia> dst = CuDevice(2)
CuDevice(2): Tesla V100-PCIE-16GB

julia> can_access_peer(src, dst)
true
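
To get a feel for the benefit, you could time such a cross-device copy. The following is just a minimal sketch (CUDA.@elapsed measures execution time using CUDA events; actual numbers depend on your hardware and interconnect):

julia> device!(1); a = CUDA.rand(Float32, 1024, 1024, 64);

julia> device!(2); b = CUDA.zeros(Float32, 1024, 1024, 64);

julia> CUDA.@elapsed copyto!(b, a)  # lower when peer-to-peer is available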

Thanks to @kshyatt for help with this change!

Helper function to use compute-sanitizer

The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the compute sanitizer. To make this tool easier to use, CUDA.jl now ships the binary as part of its artifacts and provides a helper function to restart your Julia session under compute-sanitizer. Let's demonstrate by triggering a memory error to show what the sanitizer can detect:

julia> using CUDA

julia> CUDA.run_compute_sanitizer()
Re-starting your active Julia session...

========= COMPUTE-SANITIZER
julia> using CUDA

julia> unsafe_wrap(CuArray, pointer(CuArray([1])), 2) .= 1
========= Invalid __global__ write of size 8 bytes
=========     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892(CuKernelContext, CuDeviceArray<Int64, (int)1, (int)1>, Broadcasted<void, Tuple<OneTo<Int64>>, _identity, Broadcasted<Int64>>, Int64)
=========     by thread (1,0,0) in block (0,0,0)
=========     Address 0xa64000008 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes

Other tools are available too, e.g. racecheck for detecting data races and synccheck for finding synchronization issues. These can be selected using the tool keyword argument to run_compute_sanitizer.
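
For example, to run the race detector instead (here assuming the tool name is passed as a string):

julia> CUDA.run_compute_sanitizer(tool="racecheck")
Re-starting your active Julia session...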

Updated binary dependencies

As with every release, CUDA.jl now supports newer versions of NVIDIA's tools and libraries:

The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like compute-sanitizer or cuda-gdb, and you need debug info (the equivalent of nvcc -G), ensure CUDA.jl can use the latest version of the CUDA toolkit.
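
For full debug info you need to raise Julia's debug level, as device code follows the -g flag (with -g2 roughly corresponding to nvcc -G). A minimal sketch:

$ julia -g2

julia> using CUDA  # kernels compiled in this session will now carry full debug info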

To make it easier to use the latest supported toolkit, CUDA.jl now implements CUDA's so-called forward compatibility mode: when your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling the use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware; refer to the NVIDIA documentation for more details.
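
You can verify which driver and toolkit versions CUDA.jl ended up selecting by inspecting the output of CUDA.versioninfo():

julia> CUDA.versioninfo()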

Preserving array indices

Julia's integers are typically 64 bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are typically only 32 bits wide. CUDA.jl's device array type now carefully preserves the type of indices so that 32-bit indices aren't unnecessarily promoted to 64 bits. With some careful kernel programming (note the use of 0x1 instead of 1 below), this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:

julia> function memset(arr, val)
           i = (blockIdx().x-0x1) * blockDim().x + threadIdx().x
           @inbounds arr[i] = val
           return
       end

julia> CUDA.code_ptx(memset, Tuple{CuDeviceArray{Float32,1,AS.Global},Float32})
.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<5>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mov.u32         %r3, %tid.x;
        mad.lo.s32      %r4, %r2, %r1, %r3;
        ld.u64          %rd2, [%rd1];
        mul.wide.s32    %rd3, %r4, 4;
        add.s64         %rd4, %rd2, %rd3;
        st.global.f32   [%rd4], %f1;
        ret;
}

On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:

.func julia_memset(.param .b64 arr, .param .b32 val) {
        .reg .f32       %f<2>;
        .reg .b32       %r<5>;
        .reg .b64       %rd<8>;

        ld.param.u64    %rd1, [arr];
        ld.param.f32    %f1, [val];
        mov.u32         %r1, %ctaid.x;
        mov.u32         %r2, %ntid.x;
        mul.wide.u32    %rd2, %r2, %r1;
        mov.u32         %r3, %tid.x;
        add.s32         %r4, %r3, 1;
        cvt.u64.u32     %rd3, %r4;
        ld.u64          %rd4, [%rd1];
        add.s64         %rd5, %rd2, %rd3;
        shl.b64         %rd6, %rd5, 2;
        add.s64         %rd7, %rd4, %rd6;
        st.global.f32   [%rd7+-4], %f1;
        ret;
}
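
Note that PTX register counts are virtual; to verify the effect on actual register usage, you can compile the kernel without launching it and query the compiled object. A minimal sketch:

julia> arr = CUDA.zeros(Float32, 1024);

julia> kernel = @cuda launch=false memset(arr, 1f0);

julia> CUDA.registers(kernel)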

More aggressive memory management

Starting with CUDA.jl 3.8, the memory pool used to allocate CuArrays is configured differently: the pool is now allowed to use all available GPU memory, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and it makes synchronization much cheaper.

This behavior can be observed by calling the memory_status() function:

julia> CUDA.memory_status()
Effective GPU memory usage: 13.57% (2.001 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

julia> a = CuArray{Float32}(undef, (1024, 1024, 1024));

julia> Base.format_bytes(sizeof(a))
"4.000 GiB"

julia> a = nothing

julia> GC.gc()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

So far nothing new. On previous versions of CUDA.jl, however, any subsequent synchronization of the GPU (e.g., by copying memory to the CPU) would have resulted in the release of this reserved memory. This is not the case anymore:

julia> synchronize()

julia> CUDA.memory_status()
Effective GPU memory usage: 40.59% (5.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (4.000 GiB reserved)

If you still want to release this memory, you can call the reclaim() function:

julia> CUDA.reclaim()

julia> CUDA.memory_status()
Effective GPU memory usage: 13.48% (1.988 GiB/14.751 GiB)
Memory pool usage: 0 bytes (0 bytes reserved)

In interactive Julia sessions, this function is called periodically so that the GPU's memory isn't held on to unnecessarily. Otherwise, it shouldn't be necessary to call this function, as memory is freed automatically when it is needed.
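
To see how the pool behaves in your own code, CUDA.@time reports GPU allocation statistics alongside the usual timings:

julia> CUDA.@time CuArray{Float32}(undef, 1024, 1024);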

Minor changes and improvements