CUDA.jl 5.6 and 5.7: Allocator cache, and asynchronous CUBLAS wrappers
Tim Besard
CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved CuRef type, which enables fully asynchronous CUBLAS calls.
Reworking CuRef for asynchronous CUBLAS
The CuRef type is similar to Julia's Ref: a boxed value, often used with C APIs. In CUDA.jl v5.7, we've made several changes to this type. First of all, we've aligned its API much more closely with the Ref type from Base, e.g., by adding getindex and setindex! methods, which should make it more familiar to users:
julia> box = CuRef(1)
CuRefValue{Int64}(1)
julia> box[]
1
julia> box[] = 2
2
julia> box
CuRefValue{Int64}(2)
We also optimized and improved the CuRef implementation. As part of that work, we revisited the eager synchronization that CUDA.jl performs when copying from unpinned memory. That synchronization exists to make it possible for Julia code to keep executing while waiting for the memory copy to start, but it turns out that certain (small) copies, such as those performed by CuRef, can be performed without having to wait for the copy to start. By removing the eager synchronization from those copies, CuRef objects can now be constructed fully asynchronously, i.e., without having to wait for the GPU to be ready.
Building on these changes, @kshyatt has switched our CUBLAS wrappers over to using GPU-based CuRef boxes for scalar inputs instead of host-based Ref boxes. Although this increases the complexity of invoking CUBLAS APIs (allocating a CuRef box requires CUDA API calls, whereas a Ref box is much cheaper to allocate), it makes these APIs behave asynchronously; previously, every CUBLAS call taking scalar inputs resulted in a so-called "bubble" while waiting for the GPU to finish executing.
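For a concrete feel of what this asynchrony means, here is a minimal sketch using CUBLAS.scal!, one of the wrappers that takes a scalar input; the array size and the 2f0 scaling factor are just illustrative:

using CUDA

x = CUDA.rand(Float32, 1024)

# The scalar 2f0 is forwarded to CUBLAS through a GPU-resident CuRef box,
# so this call queues work asynchronously instead of creating a "bubble"
# that waits for the GPU.
CUDA.CUBLAS.scal!(length(x), 2f0, x)

# Synchronization only happens when the result is actually needed, e.g.
# when copying the data back to the host.
Array(x)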
A Julia-level allocator cache
To help with the common issue of running out of GPU memory, or to reduce the cost of CUDA.jl hitting the GC too often, @pxl-th has added a reusable caching allocator to GPUArrays.jl, which CUDA.jl now supports and integrates with.
The idea is simple: GPU allocations made in a GPUArrays.@cached block are recorded in a cache, and when the block is exited the allocations are made available for reuse. Only when the cache goes out of scope, or when you call unsafe_free! on it, will the allocations be fully freed. This is useful when you have a repetitive workload that performs the same allocations over and over again, such as a machine learning training loop:
using CUDA, GPUArrays

cache = GPUArrays.AllocCache()
for epoch in 1:1000
    GPUArrays.@cached cache begin
        # dummy workload
        sin.(CUDA.rand(Float32, 1024^3))
    end
end
# wait for `cache` to be collected, or optionally eagerly free the memory
GPUArrays.unsafe_free!(cache)
Even though CUDA already has a caching allocator, the Julia-level caching mechanism may still improve performance by lowering pressure on the GC and reducing fragmentation of the underlying allocator. For example, the above snippet only performs two memory allocations that require 8 GiB, instead of 2000 allocations totalling 8 TiB (!) of GPU memory.
The cherry on top is that the caching interface is generic: it is implemented in GPUArrays.jl, and therefore available to all GPU back-ends that are compatible with GPUArrays.jl v11.2.
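To show what that genericity looks like in practice, here is a hedged sketch assuming a back-end such as AMDGPU.jl is on a compatible GPUArrays.jl version; the AMDGPU.rand call is our own illustration, not something from these release notes:

using AMDGPU, GPUArrays

cache = GPUArrays.AllocCache()
GPUArrays.@cached cache begin
    # same idea as above, but allocating ROCArrays instead of CuArrays
    sin.(AMDGPU.rand(Float32, 1024^2))
end
GPUArrays.unsafe_free!(cache)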
Minor changes
Device-to-host copies now eagerly synchronize to improve concurrent execution.
On multi-GPU systems, unified memory is not automatically prefetched anymore when launching kernels, making it possible to process a single array on multiple devices.
A change to CuDeviceArray should allow eliding additional bounds checks in code that already performs a manual bounds check (such as KernelAbstractions.jl code).
CUDA toolkit 12.8 is now supported, as well as Jetson Orin devices.
It is now possible to pass symbols to kernels; see the sketch after this list.
CUBLAS: Support for Givens rotation methods.
CUSPARSE: Support for using CuSparseMatrixBSR with generic mm!.
Windows support for NVTX has been fixed.
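As an illustration of the symbol-passing change, here is a hedged sketch; the scale_kernel name, the :double flag, and in particular the assumption that symbols can be compared with === inside device code are ours, not code from the release notes:

using CUDA

function scale_kernel(op::Symbol, x)
    i = threadIdx().x
    # assumes symbol comparison with === works in device code
    if op === :double
        @inbounds x[i] *= 2
    else
        @inbounds x[i] += 1
    end
    return
end

x = CUDA.ones(Float32, 32)
@cuda threads=32 scale_kernel(:double, x)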