CUDA.jl 5.1: Unified memory and cooperative groups

Nov 7, 2023
Tim Besard

CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.

Unified memory

Unified memory is a feature of CUDA that allows the programmer to access memory from both the CPU and GPU, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.

CUDA.jl did already support unified memory, but only for the most basic use cases. With CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that memory from the CPU:

julia> gpu = cu([1., 2.]; unified=true)
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 1.0
 2.0

julia> # accessing GPU memory from the CPU
       gpu[1] = 3;

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 3.0
 2.0

Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is safe and efficient to perform scalar iteration on CuArrays backed by unified memory. This greatly simplifies porting applications to the GPU, as it no longer is a problem when code uses AbstractArray fallbacks from Base that process element by element.

In addition, CUDA.jl 5.1 also makes it easier to convert CuArrays to Array objects. This is important when wanting to use high-performance CPU libraries like BLAS or LAPACK which do not support CuArrays:

julia> cpu = unsafe_wrap(Array, gpu)
2-element Vector{Float32}:
 3.0
 2.0

julia> LinearAlgebra.BLAS.scal!(2f0, cpu);

julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
 6.0
 4.0

The reverse is also possible: CPU-based Arrays can now trivially be converted to CuArray objects for use on the GPU, without the need to explicitly allocate unified memory. This further simplifies memory management, as it makes it possible to use the GPU inside of an existing application without having to copy data into a CuArray:

julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1
 2

julia> CUDA.@sync gpu .+= 1;

julia> cpu
2-element Vector{Int64}:
 2
 3

Note that the above methods are prefixed unsafe because of how they require careful management of object lifetimes: When creating an Array from a CuArray, the CuArray must be kept alive for as long as the Array is used, and vice-versa when creating a CuArray from an Array. Explicit synchronization (i.e. waiting for the GPU to finish computing) is also required, as CUDA.jl cannot synchronize automatically when accessing GPU memory through a CPU pointer.

For now, CUDA.jl still defaults to device memory for unspecified allocations. This can be changed using the default_memory preference of the CUDA.jl module, which can be set to either "device", "unified" or "host". When these changes have been sufficiently tested, and the remaining rough edges have been smoothed out, we may consider switching the default allocator.

Cooperative groups

Another major improvement in CUDA.jl 5.1 are the greatly expanded wrappers for the CUDA cooperative groups API. Cooperative groups are a low-level feature of CUDA that make it possible to write kernels that are more flexible than the traditional approach of differentiating computations based on thread and block indices. Instead, cooperative groups allow the programmer to use objects representing groups of threads, pass those around, and differentiate computations based on queries on those objects.

For example, let's port the example from the introductory NVIDIA blogpost post, which provides a function to compute the sum of an array in parallel:

function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each thread adds its partial sum[i] to sum[lane+i]
    i = CG.num_threads(group) ÷ 2
    while i > 0
        temp[lane] = val
        CG.sync(group)
        if lane <= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val  # note: only thread 1 will return full sum
end

When the threads of a group call this function, they cooperatively compute the sum of the values passed by each thread in the group. For example, let's write a kernel that calls this function using a group representing the current thread block:

function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)

    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= length(input)
        sum += input[i]
        i += stride
    end

    return sum
end

n = 1<<24
threads = 256
blocks = cld(n, threads)

data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)

This style of programming makes it possible to write kernels that are safer and more modular than traditional kernels. Some CUDA features also require the use of cooperative groups, for example, asynchronous memory copies between global and shared memory are done using the CG.memcpy_async function.

With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support has been added for implicit groups (with the exception of cluster groups and the deprecated multi-grid groups), all relevant queries on these groups, as well as the many important collective functions, such as shuffle, vote, and memcpy_async. Support for explicit groups is still missing, as are collectives like reduce and invoke. For more information, refer to the CUDA.jl documentation.

Other updates

Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements:

Support for CUDA 12.3
Performance improvements related to memory copies, which regressed in CUDA 5.0
Improvements to the native profiler (CUDA.@profiler), now also showing local memory usage, supporting more NVTX metadata, and with better support for Pluto.jl and Jupyter
Many CUSOLVER and CUSPARSE improvements by @amontoison