CUDAnative.jl 3.0 and CuArrays.jl 2.0
This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.
Changes to certain APIs require these releases to be breaking. However, most users should not be affected, and chances are you can just bump your Compat entries without any additional changes. Flux.jl users will have to wait a little longer though, as the package uses non-public APIs that have changed and requires an update.
CUDA and its dependencies will now be automatically installed using artifacts generated by BinaryBuilder.jl. This greatly improves usability, and only requires a functioning NVIDIA driver:
    $ JULIA_DEBUG=CUDAnative julia
    julia> using CUDAnative
    julia> CUDAnative.version()
    ┌ Debug: Trying to use artifacts...
    └ @ CUDAnative CUDAnative/src/bindeps.jl:52
    ┌ Debug: Using CUDA 10.2.89 from an artifact at /depot/artifacts/...
    └ @ CUDAnative CUDAnative/src/bindeps.jl:108
    v"10.2.89"
Use of a local installation is still possible by setting the environment variable
JULIA_CUDA_USE_BINARYBUILDER to false. For more details, refer to the
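As a minimal sketch of opting out of artifacts from within Julia: the variable name comes from the text above, while setting it before the package is loaded is an assumption made here to be on the safe side, since the session shown earlier suggests CUDA is only discovered on first use.

```julia
# Opt out of artifact-based installation; set this before loading the package
# (assumption: CUDA discovery only happens once, on first use).
ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

using CUDAnative

# Should now report the version of the locally installed CUDA toolkit.
CUDAnative.version()
```

Exporting the variable in the shell before starting Julia has the same effect and avoids any ordering concerns.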
CUDAnative 3.0 now fully supports method redefinitions, commonly referred to as Julia issue #265, and makes it possible to use interactive programming tools like Revise.jl:
    julia> child() = 0
    julia> parent() = (@cuprintln(child()); return)
    julia> @cuda parent()
    0

    julia> parent() = (@cuprintln(child() + 1); return)
    julia> @cuda parent()
    1

    julia> child() = 1
    julia> @cuda parent()
    2
Relevant PRs: CUDAnative.jl#581
Experimental: Multitasking and multithreading
With CUDAnative 3.0 and CuArrays 2.0 you can now use Julia tasks and threads to organize your code. In combination with CUDA streams, this makes it possible to execute kernels and other GPU operations in parallel:
    @sync begin
        function my_expensive_kernel()
            return
        end
        @async @cuda stream=CuStream() my_expensive_kernel()
        @async @cuda stream=CuStream() my_expensive_kernel()
    end
Every task, whether it runs on a separate thread or not, can work with a different device, as well as independently work with CUDA libraries like CUBLAS and CUFFT.
Note that this support is experimental, and lacks certain features to be fully effective. For one, the CuArrays memory allocator is not device-aware, and it is currently not possible to configure the CUDA stream for operations like map or broadcast.
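To illustrate the per-task device selection described above, here is a rough sketch that launches one task per visible GPU. The kernel `noop_kernel` is a hypothetical placeholder, and the sketch assumes that `CUDAdrv.devices()`, `CUDAnative.device!` and `CUDAdrv.synchronize()` behave as in the package versions discussed here. It deliberately avoids allocating CuArrays, given the allocator limitation mentioned above.

```julia
using CUDAdrv, CUDAnative

# Hypothetical placeholder kernel standing in for real work.
function noop_kernel()
    return
end

# One task per device; each task selects its own device before launching.
@sync for dev in CUDAdrv.devices()
    @async begin
        CUDAnative.device!(dev)   # device selection applies to this task
        @cuda noop_kernel()       # launched on the device selected above
        CUDAdrv.synchronize()     # wait for the work on this task's device
    end
end
```

Starting Julia with multiple threads and replacing `@async` with `Threads.@spawn` would be the multithreaded variant of the same pattern, subject to the same experimental caveats.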
GPU kernels are now name-mangled like C++, which offers better integration with NVIDIA tools (CUDAnative.jl#559).
CuIterator type for batching arrays to the GPU (by @jrevels,
Integration with Cthulhu.jl for interactive inspection of generated code (CUDAnative.jl#597).
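As a small sketch of how the CuIterator type listed above might be used: the batches below are made-up data, and the sketch assumes each batch is a tuple of CPU arrays that the iterator uploads to the GPU one batch at a time, releasing the previous batch as it advances.

```julia
using CuArrays

# Three hypothetical batches, each a tuple of (inputs, labels) held on the CPU.
batches = [(rand(Float32, 4, 8), rand(Float32, 8)) for _ in 1:3]

n_batches = 0
for (x, y) in CuArrays.CuIterator(batches)
    # x and y are expected to be GPU arrays here; do not keep references to
    # them across iterations, since the previous batch may be freed.
    @assert x isa CuArray && y isa CuArray
    println(sum(x) + sum(y))
    global n_batches += 1
end
```

Because only one batch is resident on the GPU at a time, this pattern is mainly useful when the full dataset does not fit in device memory.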
With a release as big as this one, there are bound to be some bugs, e.g., with the installation of artifacts on exotic systems, or due to the many changes to make the libraries thread-safe. If you need absolute stability, please wait for a point release.
There are also some known issues. CUDAnative is currently not compatible with Julia 1.5 due
to Base compiler changes (julia#34993),
the mapreducedim! kernel appears to be slower in some cases
(CuArrays.jl#611), and there are some
remaining thread-safety issues when using the non-default memory pool