CUDA.jl 3.4

Tim Besard

The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.

Streamlined atomic operations

In preparation of integrating with the new standard @atomic macro introduced in Julia 1.7, we have streamlined the capabilities of atomic operations in CUDA.jl. The API is now split into two levels: low-level atomic_ methods for atomic functionality that's directly supported by the hardware, and a high-level @atomic macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fall-back implementation makes it possible to use more complex operations that do not map onto a single atomic operation:

julia> a = CuArray([1]);

julia> function kernel(a)
         CUDA.@atomic a[] <<= 1

julia> @cuda threads=16 kernel(a)

julia> a
1-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:

julia> 1<<16

The only requirement is that the types being used are supported by CUDA.atomic_cas!. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.

Note that on Julia 1.7 and higher, CUDA.jl does not export the @atomic macro anymore to avoid conflicts with the version in Base. That means it is recommended to always fully specify uses of the macro, i.e., use CUDA.@atomic as in the example above.

Arrays with unified memory

You may have noticed that the CuArray type in the example above included an additional parameter, Mem.DeviceBuffer. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it's now possible to allocate arrays backed by unified buffers that can be used on multiple devices:

julia> a = cu([0]; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:

julia> device!(1)

julia> a .+= 1
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:

Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.

New default random number generator

CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 performance and quality of this generator was improved up to the point it could be used by applications. A couple of features were still missing though, such as generating normally-distributed random numbers, or support for complex numbers. These features have been added in CUDA.jl 3.3, and the generator is now used as the default fallback when CURAND does not support the requested element types.

Both the performance and quality of this generator is much better than the previous, GPUArrays.jl-based one:

julia> using BenchmarkTools
julia> cuda_rng = CUDA.RNG();
julia> gpuarrays_rng = GPUArrays.default_rng(CuArray);
julia> a = CUDA.zeros(1024,1024);

julia> @benchmark CUDA.@sync rand!($cuda_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.040 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.04%
 Time  (median):     18.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   20.604 μs ± 34.734 μs  ┊ GC (mean ± σ):  1.17% ±  0.99%

  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs <

julia> @benchmark CUDA.@sync rand!($gpuarrays_rng, $a)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  72.489 μs …  2.790 ms  ┊ GC (min … max): 0.00% … 98.44%
 Time  (median):     74.479 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   81.211 μs ± 61.598 μs  ┊ GC (mean ± σ):  0.67% ±  1.40%

  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log(frequency) by time       443 μs <
julia> using RNGTest
julia> test_cuda_rng = RNGTest.wrap(cuda_rng, UInt32);
julia> test_gpuarrays_rng = RNGTest.wrap(gpuarrays_rng, UInt32);

julia> RNGTest.smallcrushTestU01(test_cuda_rng)
 All tests were passed

julia> RNGTest.smallcrushTestU01(test_gpuarrays_rng)
 The following tests gave p-values outside [0.001, 0.9990]:

       Test                          p-value
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 (eps  means a value < 1.0e-300):

Removal of old memory pools

With the new stream-ordered allocator, caching memory allocations at the CUDA library level, much of the need for memory pools to cache memory allocations has disappeared. To simplify the allocation code, we have removed support for those Julia-managed memory pools (i.e., binned, split and simple). You can now only use the cuda memory pool, or use no pool at all by setting the JULIA_CUDA_MEMORY_POOL environment variable to none.

Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.

Also note that the new stream-ordered allocator has turned out incompatible with legacy cuIpc APIs as used by OpenMPI. If that applies to you, consider disabling the memory pool or reverting to CUDA.jl 3.3 if your application's allocation pattern benefits from a memory pool.

Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.

Device capability-dependent kernel code

Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that can be used to branch on.

Currently, the device capability and PTX ISA version are embedded and made available using respectively the compute_capability and ptx_isa_version functions. A simplified version number type, constructable using the sv"..." string macro, can be used to test against these properties. For example:

julia> function kernel(a)
           a[] = compute_capability() >= sv"6.0" ? 1 : 2
kernel (generic function with 1 method)

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_1({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 1.000000e+00, float addrspace(1)* %2, align 4
  ret void

julia> capability(device!(1))

julia> CUDA.code_llvm(kernel, Tuple{CuDeviceVector{Float32, AS.Global}})
define void @julia_kernel_2({ i8 addrspace(1)*, i64, [1 x i64] }* %0) {
  %1 = bitcast { i8 addrspace(1)*, i64, [1 x i64] }* %0 to float addrspace(1)**
  %2 = load float addrspace(1)*, float addrspace(1)** %1, align 8
  store float 2.000000e+00, float addrspace(1)* %2, align 4
  ret void

The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.

Other changes