CUDA.jl 5.1: Unified memory and cooperative groups
Tim Besard
CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.
Unified memory
Unified memory is a feature of CUDA that allows the programmer to access memory from both the CPU and GPU, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.
CUDA.jl did already support unified memory, but only for the most basic use cases. With CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that memory from the CPU:
julia> gpu = cu([1., 2.]; unified=true)
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
1.0
2.0
julia> # accessing GPU memory from the CPU
gpu[1] = 3;
julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
3.0
2.0
Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is safe and efficient to perform scalar iteration on CuArray
s backed by unified memory. This greatly simplifies porting applications to the GPU, as it no longer is a problem when code uses AbstractArray
fallbacks from Base that process element by element.
In addition, CUDA.jl 5.1 also makes it easier to convert CuArray
s to Array
objects. This is important when wanting to use high-performance CPU libraries like BLAS or LAPACK which do not support CuArray
s:
julia> cpu = unsafe_wrap(Array, gpu)
2-element Vector{Float32}:
3.0
2.0
julia> LinearAlgebra.BLAS.scal!(2f0, cpu);
julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
6.0
4.0
The reverse is also possible: CPU-based Array
s can now trivially be converted to CuArray
objects for use on the GPU, without the need to explicitly allocate unified memory. This further simplifies memory management, as it makes it possible to use the GPU inside of an existing application without having to copy data into a CuArray
:
julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
1
2
julia> CUDA.@sync gpu .+= 1;
julia> cpu
2-element Vector{Int64}:
2
3
Note that the above methods are prefixed unsafe
because of how they require careful management of object lifetimes: When creating an Array
from a CuArray
, the CuArray
must be kept alive for as long as the Array
is used, and vice-versa when creating a CuArray
from an Array
. Explicit synchronization (i.e. waiting for the GPU to finish computing) is also required, as CUDA.jl cannot synchronize automatically when accessing GPU memory through a CPU pointer.
For now, CUDA.jl still defaults to device memory for unspecified allocations. This can be changed using the default_memory
preference of the CUDA.jl module, which can be set to either "device"
, "unified"
or "host"
. When these changes have been sufficiently tested, and the remaining rough edges have been smoothed out, we may consider switching the default allocator.
Cooperative groups
Another major improvement in CUDA.jl 5.1 are the greatly expanded wrappers for the CUDA cooperative groups API. Cooperative groups are a low-level feature of CUDA that make it possible to write kernels that are more flexible than the traditional approach of differentiating computations based on thread and block indices. Instead, cooperative groups allow the programmer to use objects representing groups of threads, pass those around, and differentiate computations based on queries on those objects.
For example, let's port the example from the introductory NVIDIA blogpost post, which provides a function to compute the sum of an array in parallel:
function reduce_sum(group, temp, val)
lane = CG.thread_rank(group)
# Each iteration halves the number of active threads
# Each thread adds its partial sum[i] to sum[lane+i]
i = CG.num_threads(group) ÷ 2
while i > 0
temp[lane] = val
CG.sync(group)
if lane <= i
val += temp[lane + i]
end
CG.sync(group)
i ÷= 2
end
return val # note: only thread 1 will return full sum
end
When the threads of a group call this function, they cooperatively compute the sum of the values passed by each thread in the group. For example, let's write a kernel that calls this function using a group representing the current thread block:
function sum_kernel_block(sum::AbstractArray{T},
input::AbstractArray{T}) where T
# have each thread compute a partial sum
my_sum = thread_sum(input)
# perform a cooperative summation
temp = CuStaticSharedArray(T, 256)
g = CG.this_thread_block()
block_sum = reduce_sum(g, temp, my_sum)
# combine the block sums
if CG.thread_rank(g) == 1
CUDA.@atomic sum[] += block_sum
end
return
end
function thread_sum(input::AbstractArray{T}) where T
sum = zero(T)
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
stride = blockDim().x * gridDim().x
while i <= length(input)
sum += input[i]
i += stride
end
return sum
end
n = 1<<24
threads = 256
blocks = cld(n, threads)
data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)
This style of programming makes it possible to write kernels that are safer and more modular than traditional kernels. Some CUDA features also require the use of cooperative groups, for example, asynchronous memory copies between global and shared memory are done using the CG.memcpy_async
function.
With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support has been added for implicit groups (with the exception of cluster groups and the deprecated multi-grid groups), all relevant queries on these groups, as well as the many important collective functions, such as shuffle
, vote
, and memcpy_async
. Support for explicit groups is still missing, as are collectives like reduce
and invoke
. For more information, refer to the CUDA.jl documentation.
Other updates
Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements:
Support for CUDA 12.3
Performance improvements related to memory copies, which regressed in CUDA 5.0
Improvements to the native profiler (
CUDA.@profiler
), now also showing local memory usage, supporting more NVTX metadata, and with better support for Pluto.jl and JupyterMany CUSOLVER and CUSPARSE improvements by @amontoison