CUDA.jl 1.3 - Multi-device programming


Tim Besard

Today we’re releasing CUDA.jl 1.3, with several new features. The most prominent change is support for multiple GPUs within a single process.

Multi-GPU programming

With CUDA.jl 1.3, you can finally use multiple CUDA GPUs within a single process. You can switch devices with device!, query the current device with device(), and reset it with device_reset!():

julia> collect(devices())
9-element Array{CuDevice,1}:
 CuDevice(0): Tesla V100-PCIE-32GB
 CuDevice(1): Tesla V100-PCIE-32GB
 CuDevice(2): Tesla V100-PCIE-32GB
 CuDevice(3): Tesla V100-PCIE-32GB
 CuDevice(4): Tesla V100-PCIE-16GB
 CuDevice(5): Tesla P100-PCIE-16GB
 CuDevice(6): Tesla P100-PCIE-16GB
 CuDevice(7): GeForce GTX 1080 Ti
 CuDevice(8): GeForce GTX 1080 Ti

julia> device!(5)

julia> device()
CuDevice(5): Tesla P100-PCIE-16GB

Let’s define a kernel to show this really works:

julia> function kernel()
           dev = Ref{Cint}()
           CUDA.cudaGetDevice(dev)
           @cuprintln("Running on device $(dev[])")
           return
       end

julia> @cuda kernel()
Running on device 5

julia> device!(0)

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

julia> @cuda kernel()
Running on device 0

Memory allocations, like CuArrays, are implicitly bound to the device they were allocated on. That means you should take care to only use an array when the owning device is active, or you will run into errors:

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

julia> a = CUDA.rand(1)
1-element CuArray{Float32,1}:
 0.6322775

julia> device!(1)

julia> a
ERROR: CUDA error: an illegal memory access was encountered

Future improvements might make the array type device-aware.
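Until then, one way to avoid the error above is to switch back to an array's owning device before touching it, and to restore the previous device when you are done with another one. The following is a minimal sketch that assumes a machine with at least two GPUs. The helper with_device is hypothetical and not part of CUDA.jl. It activates a device, runs a function, and then restores whichever device was active before, even if the function throws.

```julia
using CUDA

# Hypothetical helper (not part of CUDA.jl): run `f` with `dev` active,
# then restore the previously active device, even if `f` throws.
function with_device(f, dev)
    old = device()
    device!(dev)
    try
        return f()
    finally
        device!(old)
    end
end

device!(0)
a = CUDA.rand(1)            # bound to device 0

# Do some work on device 1 without losing track of device 0.
b_host = with_device(1) do
    b = CUDA.rand(1)        # bound to device 1
    Array(b)                # copy to the CPU while device 1 is still active
end

a_host = Array(a)           # safe: device 0 is active again
```

Copying results to the CPU with Array before leaving a device means the host-side values can be used freely afterwards, regardless of which device is active.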

Multitasking and multithreading

Dovetailing with multi-GPU support is the ability to use these GPUs from separate Julia tasks and threads:

julia> device!(0)

julia> @sync begin
         @async begin
           device!(1)
           println("Working with $(device()) on $(current_task())")
           yield()
           println("Back to device $(device()) on $(current_task())")
         end
         @async begin
           device!(2)
           println("Working with $(device()) on $(current_task())")
         end
       end
Working with CuDevice(1) on Task @0x00007fc9e6a48010
Working with CuDevice(2) on Task @0x00007fc9e6a484f0
Back to device CuDevice(1) on Task @0x00007fc9e6a48010

julia> device()
CuDevice(0): Tesla V100-PCIE-32GB

Each task has its own local GPU state: the device it is bound to, handles to libraries like CUBLAS or CUDNN (so each task can configure those libraries independently), and so on. That is why the REPL task is still using device 0 after the two tasks switched to devices 1 and 2.
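The transcript above only uses tasks on a single thread. A sketch of the multithreaded variant might look like the following. It assumes Julia was started with several threads (for example by setting JULIA_NUM_THREADS) and spawns one task per GPU with Threads.@spawn. Each task binds itself to its own device with device!, so the switch does not leak into other tasks. The array size and the use of sum are arbitrary choices for illustration.

```julia
using CUDA

devs = collect(devices())
results = Vector{Float32}(undef, length(devs))

# One task per GPU; with multiple threads these tasks can run concurrently.
@sync for (i, dev) in enumerate(devs)
    Threads.@spawn begin
        device!(dev)                   # only affects this task's GPU state
        x = CUDA.rand(Float32, 1024)   # allocated on `dev`
        results[i] = sum(x)            # reduced on `dev`, scalar copied back to the host
    end
end
```

Each task writes to a distinct index of results, so no locking is needed around the shared vector.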

Minor features

CUDA.jl 1.3 also features some minor changes:

Known issues

Several operations on sparse arrays have been broken since CUDA.jl 1.2, due to the deprecations that were part of CUDA 11. The next version of CUDA.jl will drop support for CUDA 10.0 or older, which will make it possible to use new cuSPARSE APIs and add back missing functionality.