CUDAnative.jl 3.0 and CuArrays.jl 2.0


Tim Besard

This post is located at /cudanative_3.0-cuarrays_2.0/

This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.

API changes

Changes to certain APIs require these releases to be breaking; however, most users should not be affected, and chances are you can just bump your Compat entries without any additional changes. Flux.jl users will have to wait a little longer though, as the package uses non-public APIs that have changed and requires an update.

Artifacts

CUDA and its dependencies will now be automatically installed using artifacts generated by BinaryBuilder.jl. This greatly improves usability, and only requires a functioning NVIDIA driver:

julia> ENV["JULIA_DEBUG"] = "CUDAnative"

julia> using CUDAnative

julia> CUDAnative.version()
┌ Debug: Trying to use artifacts...
└ @ CUDAnative CUDAnative/src/bindeps.jl:52
┌ Debug: Using CUDA 10.2.89 from an artifact at /depot/artifacts/...
└ @ CUDAnative CUDAnative/src/bindeps.jl:108
v"10.2.89"

Use of a local installation is still possible by setting the environment variable JULIA_CUDA_USE_BINARYBUILDER to false. For more details, refer to the documentation.
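As a sketch, opting out of artifacts could look as follows (the environment variable needs to be set before CUDAnative is loaded, e.g. in your shell or `startup.jl`):

```julia
# Force use of a locally-installed CUDA toolkit instead of artifacts.
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CUDAnative   # will now discover a local CUDA installation
```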

Relevant PRs: CUDAnative.jl#492 and CuArrays.jl#490

Method redefinitions

CUDAnative 3.0 now fully supports method redefinitions, commonly referred to as Julia issue #265, and makes it possible to use interactive programming tools like Revise.jl:

julia> child() = 0
julia> parent() = (@cuprintln(child()); return)
julia> @cuda parent()
0

julia> parent() = (@cuprintln(child() + 1); return)
julia> @cuda parent()
1


julia> child() = 1
julia> @cuda parent()
2

Relevant PR: CUDAnative.jl#581

Experimental: Multitasking and multithreading

With CUDAnative 3.0 and CuArrays 2.0 you can now use Julia tasks and threads to organize your code. In combination with CUDA streams, this makes it possible to execute kernels and other GPU operations in parallel:

function my_expensive_kernel()
    return
end

@sync begin
    @async @cuda stream=CuStream() my_expensive_kernel()
    @async @cuda stream=CuStream() my_expensive_kernel()
end

Every task, whether it runs on a separate thread or not, can work with a different device, as well as independently work with CUDA libraries like CUBLAS and CUFFT.
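For example, per-task device selection might be sketched as follows (assuming a system with two GPUs; `device!` is CUDAnative's function for selecting the active device):

```julia
using CUDAnative, CuArrays

@sync begin
    @async begin
        device!(0)                 # this task works with GPU 0
        a = CuArrays.rand(1024)
        sum(a)                     # executes on device 0
    end
    @async begin
        device!(1)                 # this task works with GPU 1
        b = CuArrays.rand(1024)
        sum(b)                     # executes on device 1
    end
end
```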

Note that this support is experimental and still lacks certain features to be fully effective. For one, the CuArrays memory allocator is not device-aware, and it is currently not possible to configure the CUDA stream for operations like map or broadcast.

Relevant PRs: CUDAnative.jl#609 and CuArrays.jl#645

Minor changes

GPU kernels are now name-mangled like C++, which offers better integration with NVIDIA tools (CUDAnative.jl#559).

A better N-dimensional mapreducedim! kernel, properly integrating with all Base interfaces (CuArrays.jl#602 and GPUArrays#246).

A CuIterator type for batching arrays to the GPU (by @jrevels, CuArrays.jl#467).
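A hedged sketch of how CuIterator might be used in a training-style loop (the batch sizes here are arbitrary): each CPU batch is uploaded on demand, and its GPU memory can be freed eagerly once the iteration moves on.

```julia
using CuArrays

# Some batches of data living in CPU memory.
batches = [rand(Float32, 784, 128) for _ in 1:10]

for batch in CuIterator(batches)
    # `batch` is a CuArray here; process it on the GPU.
    sum(batch)
end
```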

Integration with Base's 5-arg mul! (by @haampie, CuArrays.jl#641 and GPUArrays#253).
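The 5-arg `mul!(C, A, B, α, β)` computes `C .= α*A*B + β*C` in a single call, which maps naturally onto a CUBLAS GEMM. A minimal sketch:

```julia
using LinearAlgebra, CuArrays

A = CuArrays.rand(4, 4)
B = CuArrays.rand(4, 4)
C = CuArrays.rand(4, 4)

# Fused update: C .= 2*A*B + 3*C, dispatched to CUBLAS.
mul!(C, A, B, 2, 3)
```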

Integration with Cthulhu.jl for interactive inspection of generated code (CUDAnative.jl#597).

Known issues

With a release as big as this one there are bound to be some bugs, e.g., with the installation of artifacts on exotic systems, or due to the many changes to make the libraries thread-safe. If you need absolute stability, please wait for a point release.

There are also some known issues. CUDAnative is currently not compatible with Julia 1.5 due to Base compiler changes (julia#34993), the new mapreducedim! kernel appears to be slower in some cases (CuArrays.jl#611), and there are some remaining thread-safety issues when using the non-default memory pool (CuArrays.jl#647).