CUDA.jl 5.6 and 5.7: Allocator cache, and asynchronous CUBLAS wrappers
Tim Besard
CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved CuRef type, which enables fully asynchronous CUBLAS calls.
Reworking CuRef for asynchronous CUBLAS
The CuRef type is similar to Julia's Ref: a boxed value, often used with C APIs. In CUDA.jl v5.7, we've made several changes to this type. First of all, we've aligned its API much more closely with the Ref type from Base, e.g., by adding getindex and setindex! methods, which should make it more familiar to users:
julia> box = CuRef(1)
CuRefValue{Int64}(1)
julia> box[]
1
julia> box[] = 2
2
julia> box
CuRefValue{Int64}(2)
We also optimized and improved the CuRef implementation. As part of that work, we revisited the eager synchronization that CUDA.jl performs when copying from unpinned memory. That synchronization exists to make it possible for Julia code to keep executing while waiting for the memory copy to start, but it turns out that certain (small) copies, such as those performed by CuRef, can be performed without having to wait for the copy to start. By removing the eager synchronization from those copies, CuRef objects can now be constructed fully asynchronously, i.e., without having to wait for the GPU to be ready.
Building on these changes, @kshyatt has switched our CUBLAS wrappers over to using GPU-based CuRef boxes for scalar inputs instead of host-based Ref boxes. Although this increases the complexity of invoking CUBLAS APIs (allocating a CuRef box requires CUDA API calls, whereas a Ref box is much cheaper to allocate), it makes these APIs behave asynchronously; previously, every CUBLAS call taking scalar inputs resulted in a so-called "bubble" while waiting for the GPU to finish executing.
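For a concrete feel of what this asynchrony means, here is a minimal sketch using CUBLAS.scal!, one of the wrappers that takes a scalar input; the array size and the 2f0 scaling factor are just illustrative:

using CUDA

x = CUDA.rand(Float32, 1024)

# The scalar 2f0 is forwarded to CUBLAS through a GPU-resident CuRef box,
# so this call queues work asynchronously instead of creating a "bubble"
# that waits for the GPU.
CUDA.CUBLAS.scal!(length(x), 2f0, x)

# Synchronization only happens when the result is actually needed, e.g.
# when copying the data back to the host.
Array(x)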
A Julia-level allocator cache
To help with the common issue of running out of GPU memory, or to reduce the cost of CUDA.jl hitting the GC too often, @pxl-th has added a reusable caching allocator to GPUArrays.jl, which CUDA.jl now supports and integrates with.
The idea is simple: GPU allocations made in a GPUArrays.@cached block are recorded in a cache, and when the block is exited the allocations are made available for reuse. Only when the cache goes out of scope, or when you call unsafe_free! on it, will the allocations be fully freed. This is useful when you have a repetitive workload that performs the same allocations over and over again, such as a machine learning training loop:
using CUDA, GPUArrays

cache = GPUArrays.AllocCache()
for epoch in 1:1000
    GPUArrays.@cached cache begin
        # dummy workload
        sin.(CUDA.rand(Float32, 1024^3))
    end
end
# wait for `cache` to be collected, or optionally eagerly free the memory
GPUArrays.unsafe_free!(cache)
Even though CUDA already has a caching allocator, the Julia-level caching mechanism may still improve performance by lowering pressure on the GC and reducing fragmentation of the underlying allocator. For example, the above snippet only performs two memory allocations that require 8 GiB, instead of 2000 allocations totalling 8 TiB (!) of GPU memory.
The cherry on top is that the caching interface is generic: it is implemented in GPUArrays.jl, and therefore available to all GPU back-ends that are compatible with GPUArrays.jl v11.2.
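To show what that genericity looks like in practice, here is a hedged sketch assuming a back-end such as AMDGPU.jl is on a compatible GPUArrays.jl version; the AMDGPU.rand call is our own illustration, not something from these release notes:

using AMDGPU, GPUArrays

cache = GPUArrays.AllocCache()
GPUArrays.@cached cache begin
    # same idea as above, but allocating ROCArrays instead of CuArrays
    sin.(AMDGPU.rand(Float32, 1024^2))
end
GPUArrays.unsafe_free!(cache)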
Minor changes
Device-to-host copies now eagerly synchronize to improve concurrent execution.
On multi-GPU systems, unified memory is not automatically prefetched anymore when launching kernels, making it possible to process a single array on multiple devices.
A change to CuDeviceArray should allow eliding additional bounds checks in code that already performs a manual bounds check (such as KernelAbstractions.jl code).
CUDA toolkit 12.8 is now supported, as well as Jetson Orin devices.
It is now possible to pass symbols to kernels; see the sketch after this list.
CUBLAS: Support for Givens rotation methods.
CUSPARSE: Support for using CuSparseMatrixBSR with generic mm!.
Windows support for NVTX has been fixed.
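As an illustration of the symbol-passing change, here is a hedged sketch; the scale_kernel name, the :double flag, and in particular the assumption that symbols can be compared with === inside device code are ours, not code from the release notes:

using CUDA

function scale_kernel(op::Symbol, x)
    i = threadIdx().x
    # assumes symbol comparison with === works in device code
    if op === :double
        @inbounds x[i] *= 2
    else
        @inbounds x[i] += 1
    end
    return
end

x = CUDA.ones(Float32, 32)
@cuda threads=32 scale_kernel(:double, x)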