CUDA.jl 5.0: Integrated profiler and task synchronization changes

Sep 19, 2023
Tim Besard

CUDA.jl 5.0 is an major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.

Integrated profiler

The most exciting new feature in CUDA.jl 5.0 is the new integrated profiler, which is similar to the @profile macro from the Julia standard library. The profiler can be used by simply prefixing any code that uses the CUDA libraries with CUDA.@profile:

julia> CUDA.@profile CUDA.rand(1).+1
Profiler ran for 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs (85.97% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   76.47% │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
│    5.42% │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
│    2.93% │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
│    0.36% │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs (0.80% of the trace)
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time (%) │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│    0.44% │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│    0.36% │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
                                                                                  1 column omitted
1-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.7242923

The output shown above is a summary of what happened during the execution of the code. It is split into two sections: host-side activity, i.e., API calls to the CUDA libraries, and the resulting device-side activity. As part of each section, the output shows the time spent and the ratio to the total execution time. These ratios are important, and a good tool to quickly assess the performance of your code. For example, in the above output, we see that most of the time is spent on the host calling the CUDA libraries, and only very little time is actually spent computing things on the GPU. This indicates that the GPU is severely underutilized, which can be solved by increasing the problem size.

Instead of a summary, it is also possible to view a chronological trace by passing the trace=true keyword argument:

julia> CUDA.@profile trace=true CUDA.rand(1).+1;
Profiler ran for 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs (86.40% of the trace)
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │     Start │      Time │                    Name │ Details                │
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
│  5 │   6.44 µs │   9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│  7 │  19.31 µs │ 715.26 ns │        cudaGetLastError │ -                      │
│  8 │  22.41 µs │ 204.09 µs │        cudaLaunchKernel │ -                      │
│  9 │ 227.21 µs │    0.0 ns │        cudaGetLastError │ -                      │
│ 14 │  232.7 µs │   3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │   7.39 µs │          cuLaunchKernel │ -                      │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs (0.91% of the trace)
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                      ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
                                                                                  1 column omitted

Here, we can see a list of events that the profiler captured. Each event has a unique ID, which can be used to corelate host-side and device-side events. For example, we can see that event 8 on the host is a call to cudaLaunchKernel, which corresponds to to the execution of a CURAND kernel on the device.

The integrated profiler is a great tool to quickly assess the performance of your GPU application, identify bottlenecks, and find opportunities for optimization. For complex applications, however, it is still recommended to use NVIDIA's NSight Systems or Compute profilers, which provide a more detailed, graphical view of what is happening on the GPU.

Synchronization on worker threads

Another noteworthy change affects how tasks are synchronized. To enable concurrent execution, i.e., to make it possible for other Julia tasks to execute while waiting for the GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a significant source of latency, at least 25us per invocation but sometimes much longer, and have also been slated for deprecation and eventual removal from the CUDA toolkit.

Instead, on Julia 1.9 and later, CUDA.jl now uses worker threads to wait for GPU operations to finish. This mechanism is significantly faster, taking around 5us per invocation, but more importantly offers a much more reliable and predictable latency. You can observe this mechanism using the integrated profiler:

julia> a = CUDA.rand(1024, 1024, 1024)
julia> CUDA.@profile trace=true CUDA.@sync a .+ a
Profiler ran for 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms (95.64% of the trace)
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│  ID │     Start │      Time │ Thread │                    Name │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
│   9 │  36.72 µs │ 199.56 µs │      1 │          cuLaunchKernel │
│ 525 │ 510.69 µs │  11.75 ms │      2 │     cuStreamSynchronize │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘

For some users, this may still be too slow, so we have added two mechanisms that disable nonblocking synchronization and simply block the calling thread until the GPU operation finishes. The first is a global setting, which can be enabled by setting the nonblocking_synchronization preference to false, which can be done using Preferences.jl. The second is a fine-grained flag to pass to synchronization functions: synchronize(x; blocking=true), CUDA.@sync blocking=true ..., etc. Both these mechanisms should not be used widely, and are only intended for use in latency-critical code, e.g., when benchmarking or profiling.

Local toolkit discovery

One of the breaking changes involves how local toolkits are discovered, when opting out of the use of artifacts. Previously, this could be enabled by calling CUDA.set_runtime_version!("local"), which generated a version = "local" preference. We are now changing this into two separate preferences, version and local, where the version preference overrides the version of the CUDA toolkit, and the local preference independently indicates whether to use a local CUDA toolkit or not.

Concretely, this means that you will now need to call CUDA.set_runtime_version!(local_toolkit=true) to enable the use of a local toolkit. The toolkit version will be auto-detected, but can be overridden by also passing a version: CUDA.set_runtime_version!(version; local_toolkit=true). This may be necessary when CUDA is not available during precompilation, e.g., on the log-in node of a cluster, or when building a container image.

Raised minimum requirements

Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit version is now 11.4, but this cannot be enforced by the package manager. As a result, if you need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or below. The README will maintain a table of supported CUDA toolkit versions.

Most users will not be affected by this change: If you use the artifact-provided CUDA toolkit, you will automatically get the latest version supported by your CUDA driver.

Other changes

Support for CUDA 12.2;
Memory limits are now enforced by CUDA, resulting in better performance;
Support for Julia 1.10 (with help from @dkarrasch);
Support for batched gemm, gemv and svd (by @lpawela and @nikopj.