<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:media="http://search.yahoo.com/mrss/"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:georss="http://www.georss.org/georss">

  <channel>
    <title><![CDATA[JuliaGPU]]></title>
    <link>https://juliagpu.org</link>
    <description><![CDATA[High-performance GPU programming in a high-level language.]]></description>
    <generator>Franklin.jl -- https://github.com/tlienart/Franklin.jl</generator>
    <atom:link
      href="https://juliagpu.org/post/index.xml"
      rel="self"
      type="application/rss+xml" />


<item>
  <title><![CDATA[cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support]]></title>
  <link>https://juliagpu.org/post/2026-04-08-cutile_0.2/index.html</link>
  <guid>https://juliagpu.org/2026-04-08-cutile_0.2/</guid>
  <description><![CDATA[cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA&#39;s tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will be presenting about it in a joint webinar with NVIDIA on May 12.]]></description>  
  
  <content:encoded><![CDATA[
<p>cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA&#39;s tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will be presenting about it in a joint webinar with NVIDIA on May 12.</p>
<p>The release is showcased by two new examples that exercise many of the features described below: a fused <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/examples/moe.jl">Mixture of Experts</a> kernel with token routing via <code>gather</code>/<code>scatter</code>, and a <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/examples/fmha.jl">Flash Multi-Head Attention</a> implementation with online <code>softmax</code> and causal masking.</p>
<h2 id="breaking_changes">Breaking changes</h2>
<ul>
<li><p><strong><code>ct.where</code> removed</strong>: use <code>ifelse.&#40;cond, x, y&#41;</code> &#40;standard Julia broadcast&#41;;</p>
</li>
<li><p><strong>FP mode kwargs removed</strong>: per-operation <code>rounding_mode</code> and <code>flush_to_zero</code> kwargs on reductions/scans replaced with <code>ct.@fpmode</code> blocks &#40;see below&#41;;</p>
</li>
<li><p><strong>Matmul batch dimensions</strong>: <code>muladd</code> now uses trailing batch dims <code>&#40;M, K, B...&#41;</code> matching Julia convention, instead of leading <code>&#40;B, M, K&#41;</code>.</p>
</li>
</ul>
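<p>To make the new batch-dimension convention concrete, here is a plain-Julia CPU sketch of a multiply-accumulate over a trailing batch dimension. Note that <code>batched_muladd</code> is a hypothetical helper for illustration only, not cuTile.jl API:</p>

```julia
# CPU sketch of the trailing-batch-dim convention: batches live in the trailing
# dimensions, so each (M, K) slice is an ordinary matrix.
# `batched_muladd` is a hypothetical illustration, not cuTile.jl API.
function batched_muladd(A, B, acc)
    out = copy(acc)
    for b in axes(A, 3)  # iterate the trailing batch dimension
        out[:, :, b] .+= A[:, :, b] * B[:, :, b]
    end
    return out
end

A = rand(Float32, 4, 3, 2)    # (M, K, B)
B = rand(Float32, 3, 5, 2)    # (K, N, B)
acc = zeros(Float32, 4, 5, 2) # (M, N, B)
out = batched_muladd(A, B, acc)
```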
<h2 id="native_for_loops">Native <code>for</code> loops</h2>
<p>Previously, cuTile.jl required a <code>while</code>-loop workaround for iteration. Starting with v0.2, standard Julia <code>for</code> loops work directly:</p>
<pre><code class="language-julia">for k in Int32&#40;1&#41;:K_tiles
    a &#61; ct.load&#40;A; index&#61;&#40;pid_m, k&#41;, shape&#61;&#40;TILE_M, TILE_K&#41;&#41;
    b &#61; ct.load&#40;B; index&#61;&#40;k, pid_n&#41;, shape&#61;&#40;TILE_K, TILE_N&#41;&#41;
    acc &#61; muladd&#40;a, b, acc&#41;
end</code></pre>
<p>The compiler recognizes Julia&#39;s iterator protocol and lowers <code>for i in
start:stop</code> and <code>for i in start:step:stop</code> to Tile IR <code>ForOp</code> directly.</p>
<h2 id="floating-point_mode_ctfpmode">Floating-point mode: <code>ct.@fpmode</code></h2>
<p>A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:</p>
<pre><code class="language-julia">ct.@fpmode rounding_mode&#61;ct.Rounding.Approx flush_to_zero&#61;true begin
    s &#61; sum&#40;tile; dims&#61;1&#41;
    x &#61; exp2.&#40;tile&#41;
end</code></pre>
<p>Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.</p>
<h2 id="keyword_arguments_for_operations">Keyword arguments for operations</h2>
<p>Most cuTile operations now use keyword arguments, aligning with cuTile Python&#39;s API and making call sites more readable:</p>
<ul>
<li><p><strong><code>load</code>/<code>store</code></strong>: <code>index</code>, <code>shape</code>, and <code>tile</code> are now kwargs &#40;<code>ct.load&#40;arr; index&#61;pid, shape&#61;&#40;M, N&#41;&#41;</code>&#41;. Positional syntax still works.</p>
</li>
<li><p><strong><code>arange</code></strong>: now works without a type, defaulting to <code>Int32</code>. Use <code>dtype</code> for other types &#40;e.g. <code>ct.arange&#40;16; dtype&#61;Int64&#41;</code>&#41;.</p>
</li>
<li><p><strong><code>gather</code>/<code>scatter</code></strong>: new <code>mask</code>, <code>padding_value</code>, and <code>check_bounds</code> kwargs. User masks are <code>AND</code>&#39;d with automatic bounds masks; <code>check_bounds&#61;false</code> skips bounds comparisons when indices are known safe.</p>
</li>
<li><p><strong>Atomics</strong>: all atomic operations accept <code>check_bounds</code> to optionally skip the bounds mask.</p>
</li>
<li><p><strong><code>allow_tma</code></strong>: default changed from <code>true</code> to <code>nothing</code> &#40;compiler decides&#41;.</p>
</li>
</ul>
<h2 id="experimental_host_abstractions">Experimental host abstractions</h2>
<p>cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the <code>ct.Tiled</code> wrapper type, which represents a tiled view of an array.</p>
<p><strong>Broadcasting</strong> fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:</p>
<pre><code class="language-julia">ct.Tiled&#40;C&#41; .&#61; ct.Tiled&#40;A&#41; .&#43; ct.Tiled&#40;B&#41;

# Or via the convenience macro &#40;wraps all arrays automatically&#41;:
ct.@. C &#61; A &#43; sin&#40;B&#41;</code></pre>
<p><strong><code>mapreduce</code></strong> on <code>Tiled</code> arrays generates a tiled reduction kernel:</p>
<pre><code class="language-julia">mapreduce&#40;identity, &#43;, ct.Tiled&#40;A&#41;; dims&#61;1&#41;</code></pre>
<p>These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default <code>CuArray</code> operations in CUDA.jl.</p>
<h2 id="debugging_with_printprintln">Debugging with <code>print</code>/<code>println</code></h2>
<p>You can now use standard Julia <code>print</code> and <code>println</code> inside kernels:</p>
<pre><code class="language-julia">function debug_kernel&#40;A, tile_size::Int&#41;
    pid &#61; ct.bid&#40;1&#41;
    tile &#61; ct.load&#40;A; index&#61;pid, shape&#61;&#40;tile_size,&#41;&#41;
    println&#40;&quot;Block &quot;, pid, &quot;: sum&#61;&quot;, sum&#40;tile; dims&#61;1&#41;&#41;
    return
end</code></pre>
<p>String constants, scalars, and tiles can be mixed freely. String interpolation &#40;<code>&quot;x&#61;&#36;x&quot;</code>&#41; is also supported.</p>
<h2 id="minor_changes">Minor changes</h2>
<ul>
<li><p><strong>Atomics</strong>: <code>atomic_max</code>, <code>atomic_min</code>, <code>atomic_or</code>, <code>atomic_and</code>, <code>atomic_xor</code> join the existing <code>atomic_cas</code>, <code>atomic_xchg</code>, and <code>atomic_add</code>;</p>
</li>
<li><p><strong><code>fill</code>/<code>zeros</code>/<code>ones</code> overlays</strong>: standard Base constructors now work inside kernels &#40;e.g. <code>zeros&#40;Float32, 64, 64&#41;</code>&#41;, thanks to <a href="https://github.com/AntonOresten">@AntonOresten</a>;</p>
</li>
<li><p><strong><code>isnan</code></strong>: works via a single unordered float self-comparison, avoiding Julia&#39;s bit-manipulation fallback;</p>
</li>
<li><p><strong>Debug info</strong>: source file and line information is now embedded in Tile IR bytecode;</p>
</li>
<li><p><strong>Julia 1.13</strong>: support for the upcoming Julia release.</p>
</li>
</ul>
<h2 id="performance_improvements">Performance improvements</h2>
<p>A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.</p>
<p>The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:</p>
<pre><code class="language-julia"># FMA fusion: mulf &#43; addf → fma
@rewrite addf&#40;mulf&#40;~a, ~b&#41;, ~c&#41; &#61;&gt; fma&#40;~a, ~b, ~c&#41;

# pow&#40;x, 2&#41; → x * x
@rewrite pow&#40;~x, broadcast&#40;constant&#40;2.0&#41;&#41;&#41; &#61;&gt; mulf&#40;~x, ~x&#41;</code></pre>
<p>Built on this, the pipeline includes:</p>
<ul>
<li><p><strong>Algebraic simplification</strong> cancels matching arithmetic pairs like <code>&#40;x &#43; 1&#41; -
  1 → x</code>, even when reshapes or broadcasts sit in between. This eliminates the overhead of Julia&#39;s 1-based indexing normalization;</p>
</li>
<li><p><strong>Comparison strength reduction</strong> canonicalizes patterns like <code>&#40;x &#43; 1&#41; &lt;&#61; y</code> into <code>x &lt; y</code>, collapsing the two-instruction <code>arange</code>-plus-compare that results from Julia&#39;s 1-based <code>arange</code> mask idiom down to a single comparison. This alone reduced <code>layernorm</code>&#39;s SASS from 10036 to 3253 instructions;</p>
</li>
<li><p><strong>Pow2 strength reduction</strong> replaces <code>pow&#40;x, 2&#41;</code> with <code>x * x</code>, eliminating the expensive <code>pow</code> transcendental in <code>layernorm</code>&#39;s variance computation;</p>
</li>
<li><p><strong>LICM</strong> hoists loop-invariant operations out of loop bodies;</p>
</li>
<li><p><strong>Constant folding and propagation</strong> evaluates compile-time-known arithmetic and tracks constants through the IR for further optimizations;</p>
</li>
<li><p><strong>Alias-aware token ordering</strong> uses alias analysis to identify independent memory operations on different arrays, avoiding unnecessary serialization that previously blocked instruction-level parallelism;</p>
</li>
</ul>
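<p>As a sanity check of the comparison canonicalization, the rewritten form agrees with the original for integer inputs &#40;away from overflow&#41;, which can be verified in plain Julia:</p>

```julia
# (x + 1) <= y must be equivalent to x < y for integers away from typemax.
original(x, y) = (x + 1) <= y
reduced(x, y)  = x < y

agree = all(original(x, y) == reduced(x, y) for x in -100:100, y in -100:100)
```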
<p>Thanks to these improvements, all examples from cuTile Python that have been ported to Julia using cuTile.jl perform within 10&#37; of their Python counterparts, and some are even faster. For up-to-date performance comparisons, see the <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/README.md">cuTile.jl README</a>.</p>
<h2 id="upcoming_webinar">Upcoming webinar</h2>
<p>On <strong>May 12, 2026 at 1 PM ET</strong>, Tim Besard &#40;JuliaHub&#41; and Andy Terrel &#40;NVIDIA&#41; will present cuTile.jl in a joint webinar. We will cover the package&#39;s design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what&#39;s next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the <a href="https://juliahub.com/events/cutile.jl-for-high-performance-computing-in-julia">JuliaHub event</a>.</p>
]]></content:encoded>
    
  <pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 1.6: Initial MPSGraph Support]]></title>
  <link>https://juliagpu.org/post/2025-05-30-metal_1.6/index.html</link>
  <guid>https://juliagpu.org/2025-05-30-metal_1.6/</guid>
  <description><![CDATA[Metal.jl adds initial support for MPSGraph, wrapping its matrix multiplication functions and resolving some matrix multiplication issues present in the previous implementation.]]></description>  
  
  <content:encoded><![CDATA[
<p>Metal.jl adds initial support for MPSGraph, wrapping its matrix multiplication functions and resolving some matrix multiplication issues present in the previous implementation.</p>
<h2 id="initial_mpsgraph_support">Initial MPSGraph support</h2>
<p><a href="https://github.com/JuliaGPU/Metal.jl/pull/526">PR #526</a> enabled the automatic generation of wrappers for all <code>enum</code>s, <code>struct</code>s, and Objective-C objects for the frameworks that Metal.jl relies upon. This made adding support for <a href="https://developer.apple.com/documentation/metalperformanceshadersgraph?language&#61;objc">MPSGraph</a>, Apple&#39;s MLIR-based GPU compiler interface, realistic.</p>
<p>To try out the new framework, the constructor and method wrappers necessary for matrix multiplication were added and hooked into the LinearAlgebra interface, working around the <a href="https://github.com/JuliaGPU/Metal.jl/pull/381">NaN issue</a> that could show up on M1/M2 devices.</p>
<p>Let&#39;s go through a simple example doing pairwise multiplication followed by pairwise addition using MPSGraph directly:</p>
<pre><code class="language-julia">using Metal, Random
using ObjectiveC: Foundation.NSDictionary
using Metal: encode&#33;
using .MPS: MPSCommandBuffer
using .MPSGraphs: MPSGraph, placeholderTensor, MPSGraphTensorData, MPSGraphTensor,
                  multiplicationWithPrimaryTensor, additionWithPrimaryTensor

T &#61; Float32
a &#61; Metal.rand&#40;10&#41;;
b &#61; Metal.rand&#40;10&#41;;
c &#61; Metal.rand&#40;10&#41;;

# To compare with the MPSGraph equivalent
res &#61; &#40;a .* b&#41; .&#43; c;

graph &#61; MPSGraph&#40;&#41; # Initialize the graph

# Create placeholder tensors to be used to compile our graph
placeA &#61; placeholderTensor&#40;graph, size&#40;a&#41;, T&#41;
placeB &#61; placeholderTensor&#40;graph, size&#40;b&#41;, T&#41;
placeC &#61; placeholderTensor&#40;graph, size&#40;c&#41;, T&#41;

# Link the placeholder tensors to the data via a Dict
feeds &#61; Dict&#123;MPSGraphTensor, MPSGraphTensorData&#125;&#40;
    placeA &#61;&gt; MPSGraphTensorData&#40;a&#41;,
    placeB &#61;&gt; MPSGraphTensorData&#40;b&#41;,
    placeC &#61;&gt; MPSGraphTensorData&#40;c&#41;
&#41;

# Add multiplication to the graph
pwisemul &#61; MPSGraphs.multiplicationWithPrimaryTensor&#40;graph, placeA, placeB&#41;

# Add addition to the graph
pwiseadd &#61; MPSGraphs.additionWithPrimaryTensor&#40;graph, pwisemul, placeC&#41;

# Our output tensor will be our c MtlArray
resultdict &#61; Dict&#123;MPSGraphTensor, MPSGraphTensorData&#125;&#40;
    pwiseadd &#61;&gt; feeds&#91;placeC&#93;
&#41;

# Encode and run the graph
cmdbuf &#61; MPS.MPSCommandBuffer&#40;Metal.global_queue&#40;device&#40;&#41;&#41;&#41;
MPS.encode&#33;&#40;cmdbuf, graph, NSDictionary&#40;feeds&#41;, NSDictionary&#40;resultdict&#41;&#41;
Metal.commit&#33;&#40;cmdbuf&#41;
Metal.wait_completed&#40;cmdbuf&#41;

# The MPSGraph result is equal to the typical way of doing things.
@assert isapprox&#40;res, c&#41;</code></pre>
<p>Clearly, for simple operations like the above example, this is a lot of extra boilerplate without much benefit. For more complex operations, however, MPSGraph will optimize the graph and its operations before running, reducing expensive kernel launches and removing unnecessary operations.</p>
<p>Another exciting aspect of this new framework wrapper is that it is now easier to add functionality that has been long-requested. One can find MPSGraph functionality not yet in Metal.jl and write wrappers using the existing wrappers as a starting point. If anyone is interested in helping out, feel free to open a pull request or an issue on the <a href="https://github.com/JuliaGPU/Metal.jl">Metal.jl repository</a>, and we will do our best to help you get your code merged.</p>
<h2 id="minor_changes">Minor Changes</h2>
<p>Metal.jl 1.6 also includes several other useful updates:</p>
<ul>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/559">Fixes</a> with using irrationals in kernels.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/529">Many</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/531">improvements</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/533">and</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/544">fixes</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/582">to</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/561">intrinsics</a>.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/557">Support</a> for <code>pow</code> with an <code>Integer</code> exponent .</p>
</li>
</ul>
<p>As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the <a href="https://github.com/JuliaGPU/Metal.jl/releases/tag/v1.6.0">changelog</a> for a full list of changes.</p>
]]></content:encoded>
    
  <pubDate>Fri, 30 May 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Christian Guinard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.8: CuSparseVector broadcasting, CUDA 12.9, and more]]></title>
  <link>https://juliagpu.org/post/2025-05-14-cuda_5.8/index.html</link>
  <guid>https://juliagpu.org/2025-05-14-cuda_5.8/</guid>
  <description><![CDATA[CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting   support for &lt;code&gt;CuSparseVector&lt;/code&gt;. The release also includes support for CUDA 12.9,   and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting support for <code>CuSparseVector</code>. The release also includes support for CUDA 12.9, and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.</p>
<h2 id="broadcasting_for_cusparsevector">Broadcasting for <code>CuSparseVector</code></h2>
<p>A significant enhancement in CUDA.jl v5.8 is the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2733">support for broadcasting <code>CuSparseVector</code></a>. Thanks to <a href="https://github.com/kshyatt">@kshyatt</a>, it is now possible to use sparse GPU vectors in broadcast expressions just like it was already possible with sparse matrices:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA, .CUSPARSE, SparseArrays

julia&gt; x &#61; cu&#40;sprand&#40;Float32, 10, 0.3&#41;&#41;
10-element CuSparseVector&#123;Float32, Int32&#125; with 4 stored entries:
  &#91;2&#93;  &#61;  0.459139
  &#91;3&#93;  &#61;  0.964073
  &#91;8&#93;  &#61;  0.904363
  &#91;9&#93;  &#61;  0.721723

julia&gt; # a zero-preserving elementwise operation
       x .* 2
10-element CuSparseVector&#123;Float32, Int32&#125; with 4 stored entries:
  &#91;2&#93;  &#61;  0.918278
  &#91;3&#93;  &#61;  1.928146
  &#91;8&#93;  &#61;  1.808726
  &#91;9&#93;  &#61;  1.443446

julia&gt; # a non-zero-preserving elementwise operation
       x .&#43; 1
10-element CuArray&#123;Float32, 1, CUDA.DeviceMemory&#125;:
 1.0
 1.4591388
 1.9640732
 1.0
 1.0
 1.0
 1.0
 1.9043632
 1.7217231
 1.0

julia&gt; # combining multiple sparse inputs
       x .&#43; cu&#40;sprand&#40;Float32, 10, 0.3&#41;&#41;
10-element CuSparseVector&#123;Float32, Int32&#125; with 6 stored entries:
  &#91;1&#93;  &#61;  0.906
  &#91;2&#93;  &#61;  0.583197
  &#91;3&#93;  &#61;  0.964073
  &#91;4&#93;  &#61;  0.259103
  &#91;8&#93;  &#61;  0.904363
  &#91;9&#93;  &#61;  0.935917</code></pre>
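<p>The zero-preserving behaviour mirrors the semantics of CPU SparseArrays, which can be checked without a GPU &#40;this is the CPU analogue only, not the CUSPARSE code path&#41;:</p>

```julia
using SparseArrays

x = sprand(Float32, 10, 0.3)

# Zero-preserving: scaling by a nonzero scalar keeps the result sparse and
# leaves the sparsity structure intact.
y = x .* 2
```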
<h2 id="minor_changes">Minor Changes</h2>
<p>CUDA.jl 5.8 also includes several other useful updates:</p>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2772">Added support</a> for CUDA 12.9;</p>
</li>
<li><p>Subpackages <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2776">have been updated</a> to CUDNN 9.10, cuTensor 2.2, and cuQuantum 25.03;</p>
</li>
<li><p><code>CUSPARSE.gemm&#33;</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2769">now supports</a> additional algorithm choices to limit memory usage;</p>
</li>
<li><p>Symbols <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2624">can now be passed</a> to CUDA kernels and stored in <code>CuArray</code>s;</p>
</li>
<li><p><code>CuTensor</code> multiplication <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2775">now preserves</a> the memory type of the input tensors;</p>
</li>
<li><p>Sparse CSR matrices <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2720">are now interfaced with</a> the SparseMatricesCSR.jl package.</p>
</li>
</ul>
<p>As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the <a href="https://github.com/JuliaGPU/CUDA.jl/releases/tag/v5.8.0">changelog</a> for a full list of changes.</p>
]]></content:encoded>
    
  <pubDate>Wed, 14 May 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.6 and 5.7: Allocator cache, and asynchronous CUBLAS wrappers]]></title>
  <link>https://juliagpu.org/post/2025-03-11-cuda_5.6_5.7/index.html</link>
  <guid>https://juliagpu.org/2025-03-11-cuda_5.6_5.7/</guid>
  <description><![CDATA[CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved &lt;code&gt;CuRef&lt;/code&gt; type, which enables fully asynchronous CUBLAS calls.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved <code>CuRef</code> type, which enables fully asynchronous CUBLAS calls.</p>
<h2 id="reworking_curef_for_asynchronous_cublas">Reworking <code>CuRef</code> for asynchronous CUBLAS</h2>
<p>The <code>CuRef</code> type is similar to Julia&#39;s <code>Ref</code>, a boxed value, often used with C APIs. In CUDA.jl v5.7, we&#39;ve made several changes to this type. First of all, we&#39;ve aligned its API much more closely with the <code>Ref</code> type from Base, e.g., adding <code>getindex</code> and <code>setindex&#33;</code> methods, which should make it more familiar to users:</p>
<pre><code class="language-julia-repl">julia&gt; box &#61; CuRef&#40;1&#41;
CuRefValue&#123;Int64&#125;&#40;1&#41;

julia&gt; box&#91;&#93;
1

julia&gt; box&#91;&#93; &#61; 2
2

julia&gt; box
CuRefValue&#123;Int64&#125;&#40;2&#41;</code></pre>
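<p>For readers without a GPU: <code>CuRef</code> now mirrors the <code>Base.Ref</code> API shown below, which behaves identically on the CPU, minus the GPU-backed storage:</p>

```julia
# The Base.Ref API that CuRef now follows: a boxed value with
# getindex/setindex! access.
box = Ref(1)
v1 = box[]   # read the boxed value
box[] = 2    # update it in place
v2 = box[]
```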
<p>We also <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2645">optimized and improved</a> the <code>CuRef</code> implementation. As part of that work, we <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2625">removed the eager synchronization when copying from unpinned memory</a>. This was done to make it possible for Julia code to execute when waiting for the memory copy to start. However, it turns out that certain &#40;small&#41; copies, such as those performed by <code>CuRef</code>, can be performed without having to wait for the copy to start. By removing eager synchronization from those copies, <code>CuRef</code> objects can now be constructed fully asynchronously, i.e., without having to wait for the GPU to be ready.</p>
<p>Building on these changes, <a href="https://github.com/kshyatt">@kshyatt</a> has <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2616">switched our CUBLAS wrappers</a> over to using GPU-based <code>CuRef</code> boxes for scalar inputs instead of host-based <code>Ref</code> boxes. Although this increases the complexity of invoking CUBLAS APIs – the allocation of <code>CuRef</code> boxes requires CUDA API calls whereas a <code>Ref</code> box is much cheaper to allocate – this results in the API behaving asynchronously, whereas before every CUBLAS API taking scalar inputs would have resulted in a so-called &quot;bubble&quot; waiting for the GPU to finish executing.</p>
<h2 id="a_julia-level_allocator_cache">A Julia-level allocator cache</h2>
<p>To help with the common issue of running out of GPU memory, or to reduce the cost of CUDA.jl hitting the GC too often, <a href="https://github.com/pxl-th">@pxl-th</a> <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/576">has added a reusable caching allocator</a> to GPUArrays.jl, which CUDA.jl now supports and integrates with.</p>
<p>The idea is simple: GPU allocations made in a <code>GPUArrays.@cached</code> block are recorded in a <code>cache</code>, and when the block is exited the allocations are made available for reuse. Only when the cache goes out of scope, or when you call <code>unsafe_free&#33;</code> on it, the allocations will be fully freed. This is useful when you have a repetitive workload that performs the same allocations over and over again, such as in a machine learning training loop:</p>
<pre><code class="language-julia">cache &#61; GPUArrays.AllocCache&#40;&#41;
for epoch in 1:1000
    GPUArrays.@cached cache begin
        # dummy workload
        sin.&#40;CUDA.rand&#40;Float32, 1024^3&#41;&#41;
    end
end

# wait for &#96;cache&#96; to be collected, or optionally eagerly free the memory
GPUArrays.unsafe_free&#33;&#40;cache&#41;</code></pre>
<p>Even though CUDA already has a caching allocator, the Julia-level caching mechanism may still improve performance by lowering pressure on the GC and reducing fragmentation of the underlying allocator. For example, the above snippet only performs two memory allocations that require 8 GiB, instead of 2000 allocations totalling 8 TiB &#40;&#33;&#41; of GPU memory.</p>
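<p>The arithmetic behind those numbers is easy to reproduce: each epoch allocates an input and an output array of <code>1024^3</code> <code>Float32</code> values, i.e. 4 GiB apiece:</p>

```julia
# Back-of-envelope check: 2 reusable allocations with the cache versus
# 2 fresh allocations per epoch for 1000 epochs without it.
bytes_per_array = 1024^3 * sizeof(Float32)  # 4 GiB per array
cached   = 2 * bytes_per_array              # 8 GiB total, reused every epoch
uncached = 1000 * 2 * bytes_per_array       # 8000 GiB, roughly 8 TiB
```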
<p>The cherry on top is that the caching interface is generic, implemented in GPUArrays.jl, and available to all GPU back-ends that are compatible with GPUArrays.jl v11.2.</p>
<h2 id="minor_changes">Minor changes</h2>
<ul>
<li><p>Device-to-host copies <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2648">now eagerly synchronize</a> to improve concurrent execution.</p>
</li>
<li><p>On multi-GPU systems, unified memory <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2626">is not automatically prefetched anymore</a> when launching kernels, making it possible to process a single array on multiple devices.</p>
</li>
<li><p>A change to <code>CuDeviceArray</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2621">should allow eliding additional bounds checks</a> in code that already performs a manual bounds check &#40;such as KernelAbstractions.jl code&#41;.</p>
</li>
<li><p>CUDA toolkit 12.8 <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2634">is now supported</a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2620">as well as Jetson Orin</a> devices.</p>
</li>
<li><p>It is now possible to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2624">pass symbols to kernels</a>.</p>
</li>
<li><p>CUBLAS: <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2642">Support for Givens rotation methods</a>.</p>
</li>
<li><p>CUSPARSE: Support for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2639">using CuSparseMatrixBSR with generic <code>mm&#33;</code></a>.</p>
</li>
<li><p>Windows support for NVTX <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2665">has been fixed</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 11 Mar 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[OpenCL.jl 0.10: Now with native Julia kernels]]></title>
  <link>https://juliagpu.org/post/2025-01-13-opencl_0.10/index.html</link>
  <guid>https://juliagpu.org/2025-01-13-opencl_0.10/</guid>
  <description><![CDATA[Version 0.10 of OpenCL.jl is a significant release that adds support for native Julia kernels. This necessitated a major overhaul of the package&#39;s internals, bringing the package in line with modern Julia GPU programming practices.]]></description>  
  
  <content:encoded><![CDATA[
<p>Version 0.10 of OpenCL.jl is a significant release that adds support for native Julia kernels. This necessitated a major overhaul of the package&#39;s internals, bringing the package in line with modern Julia GPU programming practices.</p>
<h2 id="native_julia_kernels">Native Julia kernels</h2>
<p>The highlight of this release is the addition of <strong>a compiler that makes it possible to write OpenCL kernels in Julia</strong> instead of having to use OpenCL C and accompanying string-based APIs. Let&#39;s illustrate using the typical <code>vadd</code> vector-addition example, which starts by generating some data and uploading it to the GPU:</p>
<pre><code class="language-julia">using OpenCL

dims &#61; &#40;2,&#41;
a &#61; round.&#40;rand&#40;Float32, dims&#41; * 100&#41;
b &#61; round.&#40;rand&#40;Float32, dims&#41; * 100&#41;
c &#61; similar&#40;a&#41;

d_a &#61; CLArray&#40;a&#41;
d_b &#61; CLArray&#40;b&#41;
d_c &#61; CLArray&#40;c&#41;</code></pre>
<p>The typical way to write a kernel is to use a string with OpenCL C code, which is then compiled and executed on the GPU. This is done as follows:</p>
<pre><code class="language-julia">const source &#61; &quot;&quot;&quot;
   __kernel void vadd&#40;__global const float *a,
                      __global const float *b,
                      __global float *c&#41; &#123;
      int i &#61; get_global_id&#40;0&#41;;
      c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;;
    &#125;&quot;&quot;&quot;

prog &#61; cl.Program&#40;; source&#41; |&gt; cl.build&#33;
kern &#61; cl.Kernel&#40;prog, &quot;vadd&quot;&#41;

len &#61; prod&#40;dims&#41;
clcall&#40;kern, Tuple&#123;Ptr&#123;Float32&#125;, Ptr&#123;Float32&#125;, Ptr&#123;Float32&#125;&#125;,
       d_a, d_b, d_c; global_size&#61;&#40;len,&#41;&#41;</code></pre>
<p>With the new GPUCompiler.jl-based compiler, you can now write the kernel in Julia just like with our other back-ends:</p>
<pre><code class="language-julia">function vadd&#40;a, b, c&#41;
    i &#61; get_global_id&#40;&#41;
    @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
    return
end

len &#61; prod&#40;dims&#41;
@opencl global_size&#61;len vadd&#40;d_a, d_b, d_c&#41;</code></pre>
<p>This is of course a much more natural way to write kernels, and it also allows OpenCL.jl to be plugged into the rest of the JuliaGPU ecosystem. Concretely, OpenCL.jl now implements the GPUArrays.jl interface, enabling lots of vendor-neutral functionality, and also provides a KernelAbstractions.jl back-end for use with the many libraries that build on top of KernelAbstractions.jl.</p>
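<p>To sketch the KernelAbstractions.jl route, here is the same <code>vadd</code> kernel written once against KernelAbstractions.jl &#40;assuming that package is installed&#41;. It is shown on the CPU back-end, but the same kernel runs unchanged on a GPU back-end such as the one OpenCL.jl provides:</p>

```julia
using KernelAbstractions

# Portable vadd kernel: the same code runs on any KernelAbstractions back-end.
@kernel function vadd_ka!(c, @Const(a), @Const(b))
    i = @index(Global)
    @inbounds c[i] = a[i] + b[i]
end

a = rand(Float32, 16)
b = rand(Float32, 16)
c = similar(a)

backend = CPU()  # swap for a GPU back-end to run on-device
vadd_ka!(backend)(c, a, b; ndrange=length(c))
KernelAbstractions.synchronize(backend)
```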
<p>There is no free lunch, though, and <strong>the native compiler functionality currently relies on your OpenCL driver supporting SPIR-V</strong>. This is sadly not a common feature: neither NVIDIA&#39;s nor AMD&#39;s OpenCL drivers support it, only Intel&#39;s do. But if you are stuck with a driver that does not support SPIR-V, there is still hope: SPIR-V can be translated back to OpenCL C using the experimental <a href="https://github.com/kpet/spirv2clc"><code>spirv2clc</code></a> translator. If you are interested, check out <a href="https://github.com/JuliaGPU/OpenCL.jl/issues/234">this issue</a> and feel free to reach out.</p>
<h2 id="breaking_api_changes">Breaking API changes</h2>
<p>Existing users of OpenCL.jl will of course have noticed that even the string-based example above uses a different API than before. In order to support the new compiler, and bring OpenCL.jl in line with modern Julia programming practices, we have <strong>significantly overhauled the package&#39;s internals as well as some external APIs</strong>.</p>
<p>The most significant high-level changes include:</p>
<ul>
<li><p>Memory management is now done using <code>CLArray</code>, backed by Shared Virtual Memory &#40;SVM&#41;, instead of opaque buffers. Raw buffers are still supported, but not compatible with native kernel execution &#40;because they can not be converted to a pointer&#41;.</p>
</li>
<li><p>Kernels are called using the new <code>clcall</code> function, which performs automatic conversion of objects much like how <code>ccall</code> works.</p>
</li>
</ul>
<p>At the lower-level &#40;of the <code>cl</code> submodule&#41;, the changes are more extensive:</p>
<ul>
<li><p>Context, device and queue arguments have been removed from most APIs, and are now stored in task-local storage. These values can be queried &#40;<code>cl.platform&#40;&#41;</code>, <code>cl.device&#40;&#41;</code>, etc&#41; and set &#40;<code>cl.platform&#33;&#40;platform&#41;</code>, <code>cl.device&#33;&#40;device&#41;</code>, etc&#41; as needed.</p>
</li>
<li><p>As part of the above change, questionable APIs like <code>cl.create_some_context&#40;&#41;</code> and <code>cl.devices&#40;&#41;</code> have been removed;</p>
</li>
<li><p>The <code>Buffer</code> API has been completely reworked. It now only provides low-level functionality, such as <code>unsafe_copyto&#33;</code> or <code>unsafe_map&#33;</code>, while high-level functionality like <code>copy&#33;</code> is implemented for the CLArray type;</p>
</li>
<li><p>The <code>cl.info</code> method, and the <code>getindex</code> overloading to access properties of OpenCL objects, have been replaced by <code>getproperty</code> overloading on the objects themselves &#40;e.g., <code>cl.info&#40;dev, :name&#41;</code> and <code>dev&#91;:name&#93;</code> are now simply <code>dev.name</code>&#41;;</p>
</li>
<li><p>The blocking <code>cl.launch</code> has been replaced by a nonblocking <code>cl.call</code>, while also removing the <code>getindex</code>-overloading shorthand. However, it&#39;s recommended to use the newly-added <code>cl.clcall</code> function, which takes an additional tuple type argument and performs automatic conversions of arguments to those types. This makes it possible to pass a <code>CLArray</code> to an OpenCL C function expecting Buffer-backed pointers, for example.</p>
</li>
<li><p>Argument conversion has been removed; the user should make sure Julia arguments passed to kernels match the OpenCL argument types &#40;e.g., no empty types, and a 4-element tuple for a 3-element <code>float3</code> argument&#41;.</p>
</li>
<li><p>The <code>to_host</code> function has been replaced by simply calling <code>Array</code> on the <code>CLArray</code>.</p>
</li>
<li><p>Queue and execution capabilities of a device are now to be queried using dedicated functions, <code>cl.queue_properties</code> and <code>cl.exec_capabilities</code>.</p>
</li>
</ul>
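<p>Taken together, a short sketch of the reworked low-level API &#40;the property accesses and capability queries follow the changes listed above; the <code>cl.platforms&#40;&#41;</code> helper used to enumerate platforms is an assumption&#41;:</p>
<pre><code class="language-julia">using OpenCL

# Task-local state: query the active platform/device, or switch them as needed
cl.platform&#33;&#40;first&#40;cl.platforms&#40;&#41;&#41;&#41;  # assumption: cl.platforms&#40;&#41; lists available platforms
dev &#61; cl.device&#40;&#41;

# Object properties are now plain getproperty accesses
@show dev.name

# Queue and execution capabilities via dedicated functions
@show cl.queue_properties&#40;dev&#41;
@show cl.exec_capabilities&#40;dev&#41;</code></pre>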
<p>Working towards the first stable version of this package, we anticipate having to make even more breaking changes. However, we want to get the current changes out there to get feedback from the community. If some of the removed functionality is crucial to your workflow, feel free to reach out and we can discuss how to best support it in the future.</p>
<h2 id="jll-based_opencl_drivers">JLL-based OpenCL drivers</h2>
<p>Another significant change is the <strong>integration with OpenCL drivers built and provided using Julia&#39;s BinaryBuilder infrastructure</strong>. Over time, this should simplify the installation of OpenCL drivers by avoiding the need to install global drivers. For now, the only driver provided as a JLL is a CPU driver based on the <a href="https://portablecl.org/">Portable Computing Language &#40;PoCL&#41; library</a>. This driver can be used by simply installing and loading <code>pocl_jll</code> before you start using OpenCL.jl:</p>
<pre><code class="language-julia-repl">julia&gt; using OpenCL, pocl_jlljulia&gt; OpenCL.versioninfo&#40;&#41;
OpenCL.jl version 0.10.0Toolchain:
 - Julia v1.11.2
 - OpenCL_jll v2024.5.8&#43;1Available platforms: 1
 - Portable Computing Language
   OpenCL 3.0, PoCL 6.0  Apple, Release, RELOC, SPIR-V, LLVM 16.0.6jl, SLEEF, DISTRO, POCL_DEBUG
   · cpu &#40;fp16, fp64, il&#41;</code></pre>
<p>Notice the <code>il</code> capability reported by <code>OpenCL.versioninfo&#40;&#41;</code>, indicating that PoCL supports SPIR-V and can thus be used with the new native Julia kernel compiler. In fact, this is one of the goals of reworking OpenCL.jl: to provide a CPU fallback implementation for use with Julia GPU libraries.</p>
<h2 id="work_towards_opencljl_10">Work towards OpenCL.jl 1.0</h2>
<p>This release is a significant step towards a stable 1.0 release of OpenCL.jl, bringing the package in line with our other Julia GPU back-ends. Our focus is on improving OpenCL.jl in order to support a CPU fallback back-end for KernelAbstractions.jl based on PoCL. If you are a user of OpenCL.jl, or are interested in using the package in the future, please test out this release with your application and/or driver, and provide feedback on the changes we&#39;ve made. Pull requests are greatly appreciated, and we are happy to help you get started with contributing to the package.</p>
]]></content:encoded>
    
  <pubDate>Mon, 13 Jan 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[GPUArrays v11: Port to KernelAbstractions.jl]]></title>
  <link>https://juliagpu.org/post/2025-01-07-gpuarrays-11/index.html</link>
  <guid>https://juliagpu.org/2025-01-07-gpuarrays-11/</guid>
  <description><![CDATA[The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.]]></description>  
  
  <content:encoded><![CDATA[
<p>The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.</p>
<h2 id="vendor-neutral_kernel_dsl">Vendor-neutral kernel DSL</h2>
<p>Back in the day, we created GPUArrays.jl to avoid having to write separate kernels for each GPU back-end, by relying on a very simple vendor-neutral domain-specific language &#40;DSL&#41; that could be translated very easily to the back-end&#39;s native kernel language. As a simple example, the following kernel was used to compute the adjoint of a vector:</p>
<pre><code class="language-julia">function LinearAlgebra.adjoint&#33;&#40;B::AbstractGPUMatrix, A::AbstractGPUVector&#41;
    gpu_call&#40;B, A&#41; do ctx, B, A
        idx &#61; @linearidx A
        @inbounds B&#91;1, idx&#93; &#61; adjoint&#40;A&#91;idx&#93;&#41;
        return
    end
    return B
end</code></pre>
<p>This DSL was designed almost a decade ago, by <a href="https://github.com/SimonDanisch">Simon Danisch</a>, and has served us well&#33; Since then, KernelAbstractions.jl has been developed by <a href="https://github.com/vchuravy/">Valentin Churavy</a>, providing a more principled and powerful DSL. With many application developers switching to KernelAbstractions.jl, it was time to port GPUArrays.jl to this new DSL as well.</p>
<p>Thanks to the tireless work by <a href="https://github.com/leios">James Schloss</a>, <strong>GPUArrays.jl v11 now uses KernelAbstractions.jl for all vendor-neutral kernels</strong>. The aforementioned <code>adjoint&#33;</code> kernel now looks like this:</p>
<pre><code class="language-julia">function LinearAlgebra.adjoint&#33;&#40;B::AbstractGPUMatrix, A::AbstractGPUVector&#41;
    @kernel function adjoint_kernel&#33;&#40;B, A&#41;
        idx &#61; @index&#40;Global, Linear&#41;
        @inbounds B&#91;1, idx&#93; &#61; adjoint&#40;A&#91;idx&#93;&#41;
    end
    adjoint_kernel&#33;&#40;get_backend&#40;A&#41;&#41;&#40;B, A; ndrange&#61;size&#40;A&#41;&#41;
    return B
end</code></pre>
<p>As shown above, the KernelAbstractions.jl DSL is very similar to the old DSL, but it provides more flexibility and power &#40;e.g., support for atomics through Atomix.jl&#41;. In addition, many more users are familiar with KernelAbstractions.jl, making it easier for them to contribute to GPUArrays.jl. A good first step here would be to port some of the vendor-specific kernels from CUDA.jl to GPUArrays.jl, making them available to all GPU back-ends. If you are interested in contributing, please reach out&#33;</p>
<p>That said, the change is not without its challenges. The added flexibility offered by KernelAbstractions.jl with respect to indexing currently results in <strong>certain kernels being slower than before</strong>, specifically when there is not much computational complexity to amortise the cost of indexing &#40;e.g., when doing very simple broadcasts&#41;. <a href="https://github.com/JuliaGPU/GPUArrays.jl/issues/565">We are working on improving this</a>, but it will take some time. So as not to hold back the rest of the JuliaGPU ecosystem, we are releasing despite these performance issues. It&#39;s recommended to carefully benchmark your application after upgrading to v11, and to report any performance regressions you encounter.</p>
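<p>As a starting point for such benchmarking, a simple broadcast &#40;the kind of low-arithmetic-intensity kernel most likely to regress&#41; can be timed with BenchmarkTools.jl; CUDA.jl is used here purely as an example back-end:</p>
<pre><code class="language-julia">using CUDA, BenchmarkTools

a &#61; CUDA.rand&#40;Float32, 1024, 1024&#41;

# Synchronize so we time the kernel itself, not just its launch
@btime CUDA.@sync &#36;a .&#43; 1f0;</code></pre>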
<h2 id="back-end_package_versions">Back-end package versions</h2>
<p>As GPUArrays.jl is not a direct dependency of most applications, the update will be pulled in by the following back-end package versions &#40;some of which may not be released yet&#41;:</p>
<ul>
<li><p>CUDA.jl v5.6</p>
</li>
<li><p>Metal.jl v1.5</p>
</li>
<li><p>oneAPI.jl v2.0</p>
</li>
<li><p>AMDGPU.jl v1.1</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 07 Jan 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 1.4: Improved random numbers]]></title>
  <link>https://juliagpu.org/post/2024-10-07-metal-1.4/index.html</link>
  <guid>https://juliagpu.org/2024-10-07-metal-1.4/</guid>
  <description><![CDATA[Metal.jl 1.4 adds higher-quality random number generators from the Metal Performance Shaders library. Some limitations apply, with a fallback to the current implementation in those situations.]]></description>  
  
<content:encoded><![CDATA[<p>Metal.jl 1.4 adds higher-quality random number generators from the Metal Performance Shaders library. Some limitations apply, with a fallback to the current implementation in those situations.</p>
<h2 id="metalrand_and_friends"><code>Metal.rand</code> and friends</h2>
<p>Using functionality provided by the Metal Performance Shaders &#40;MPS&#41; library, Metal.jl now comes with much improved GPU random number generators. Uniform distributions using <code>Metal.rand</code> &#40;and its in-place variant <code>Metal.rand&#33;</code>&#41; are available for all Metal-supported integer types and <code>Float32</code>. However, due to <a href="https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language&#61;objc">Metal API limitations</a>, 8-bit and 16-bit integers may fall back to the lower-quality GPUArrays.jl random number generator if the array&#39;s size in bytes is not a multiple of 4. Normally distributed <code>Float32</code> values can be generated with <code>Metal.randn</code> and <code>Metal.randn&#33;</code>, while <code>Float16</code> is not supported by the MPS library and will always fall back to the GPUArrays implementation.</p>
<p>The easiest way to use these is to call the Metal convenience functions <code>Metal.rand&#91;n&#93;&#91;&#33;&#93;</code> as you would the usual functions from the Random.jl standard library:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; Metal.rand&#40;Float32, 2&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 0.95755994
 0.7110207julia&gt; Metal.randn&#33;&#40;a&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 1.7230463
 0.55636907</code></pre>
<p>However, the Random.jl methods can also be used by providing the appropriate <code>RNG</code> either from <code>MPS.default_rng&#40;&#41;</code> or <code>MPS.RNG&#40;&#41;</code> to the standard <code>Random.rand&#91;n&#93;&#91;&#33;&#93;</code> functions:</p>
<pre><code class="language-julia-repl">julia&gt; using Randomjulia&gt; rng &#61; MPS.RNG&#40;&#41;;julia&gt; Random.rand&#40;rng, 2&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 0.8941469
 0.67628527</code></pre>
<p>Seeding is done by calling <code>Metal.seed&#33;</code> for the global RNG, or <code>Random.seed&#33;</code> when working with an explicit <code>RNG</code> object.</p>
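<p>Putting the two seeding styles together &#40;a sketch; requires a Metal-capable GPU, and the seed value is arbitrary&#41;:</p>
<pre><code class="language-julia">using Metal, Random

# Seed the global RNG used by Metal.rand and friends
Metal.seed&#33;&#40;42&#41;
x &#61; Metal.rand&#40;Float32, 4&#41;

# Or seed an explicit RNG object
rng &#61; MPS.RNG&#40;&#41;
Random.seed&#33;&#40;rng, 42&#41;
y &#61; Random.rand&#40;rng, Float32, 4&#41;</code></pre>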
<h2 id="other_improvements_since_the_last_blog_post">Other improvements since the last blog post</h2>
<ul>
<li><p>Since v0.5: <code>MtlArray</code> storage mode has been parameterized, allowing one to create a shared storage <code>MtlArray</code> by calling <code>MtlArray&#123;eltype, ndims, Metal.SharedStorage&#125;&#40;...&#41;</code>.</p>
</li>
<li><p>Since v0.3: MPS-accelerated decompositions were added.</p>
</li>
<li><p>Various performance improvements.</p>
</li>
<li><p><em>Many</em> bug fixes.</p>
</li>
</ul>
<h2 id="future_work">Future work</h2>
<p>Although Metal.jl is now in v1, there is still work to be done to make it as fast and feature-complete as possible. In particular:</p>
<ul>
<li><p>Metal.jl is now using native ObjectiveC FFI for wrapping Metal APIs. However, these wrappers have to be written manually for every piece of Objective-C code. <em>We are looking for help with improving Clang.jl and ObjectiveC.jl</em> to <a href="https://github.com/JuliaInterop/ObjectiveC.jl/issues/41">enable the automatic generation of these wrappers</a>;</p>
</li>
<li><p>The MPS wrappers are incomplete, automatic wrapper generation would greatly help with full MPS support;</p>
</li>
<li><p>To implement a full-featured KernelAbstractions.jl back-end, Metal atomic operations need to <a href="https://github.com/JuliaGPU/Metal.jl/issues/218">be hooked up to Atomix</a>;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/issues/298">Full support for BFloat16 values</a>, which has been supported since Metal 3.1 &#40;macOS 14&#41;, is not yet available in Metal.jl. There is, however, a <a href="https://github.com/JuliaGPU/Metal.jl/pull/446">draft PR</a> in the works. Check it out if you&#39;re interested in helping out;</p>
</li>
<li><p>Some functionality present in CUDA.jl <a href="https://github.com/JuliaGPU/Metal.jl/issues/443">could be ported to Metal.jl to improve usability</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Christian Guinard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.5: Maintenance release]]></title>
  <link>https://juliagpu.org/post/2024-09-18-cuda_5.5/index.html</link>
  <guid>https://juliagpu.org/2024-09-18-cuda_5.5/</guid>
  <description><![CDATA[CUDA.jl 5.5 is a minor release that comes with a couple of small improvements and new features.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.5 is a minor release that comes with a couple of small improvements and new features.</p>
<p>The only important change is that the minimum required Julia version has been bumped to 1.10, in anticipation of it becoming the next LTS release.</p>
<h2 id="new_features">New features</h2>
<ul>
<li><p>Support for the upcoming Julia 1.11 release has been added, as well as for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2461">CUDA 12.6 &#40;Update 1&#41;</a>.</p>
</li>
<li><p>Launch overhead has been reduced by <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2472">avoiding double argument conversions</a>. Note that this does not apply to kernels that are obtained using <code>@cuda launch&#61;false</code>.</p>
</li>
<li><p>CUSOLVER&#39;s dense wrappers have been improved by <a href="https://github.com/bjarthur">Ben Arthur</a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2465">now caching workspace buffers</a>. This should greatly reduce the number of allocations needed for repeated calls.</p>
</li>
<li><p><a href="https://github.com/amontoison">Alexis Montoison</a> has improved the CUSPARSE wrappers, adding <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2489">conversions between sparse vectors and sparse matrices</a> that enable <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2488">a version of <code>gemv</code></a> which preserves sparsity of the inputs.</p>
</li>
<li><p>CUDA.jl&#39;s CUFFT wrappers <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2430">now support <code>Float16</code></a>, thanks to <a href="https://github.com/eschnett">Erik Schnetter</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.4: Memory management mayhem]]></title>
  <link>https://juliagpu.org/post/2024-05-28-cuda_5.4/index.html</link>
  <guid>https://juliagpu.org/2024-05-28-cuda_5.4/</guid>
  <description><![CDATA[CUDA.jl 5.4 comes with many memory-management related changes that should improve performance of memory-heavy applications, and make it easier to work with heterogeneous set-ups involving multiple GPUs or using both the CPU and GPU.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.4 comes with many memory-management related changes that should improve performance of memory-heavy applications, and make it easier to work with heterogeneous set-ups involving multiple GPUs or using both the CPU and GPU.</p>
<p>Before anything else, let&#39;s get the breaking changes out of the way. CUDA.jl v5.4 only bumps the minor version, so it should be compatible with existing codebases. However, there are a couple of API changes that applications should be updated for, even though they are covered by appropriate deprecation warnings:</p>
<ul>
<li><p>The <code>CUDA.Mem</code> submodule has been removed. All identifiers have been moved to the parent <code>CUDA</code> module, with a couple being renamed in the process:</p>
<ul>
<li><p><code>Mem.Device</code> and <code>Mem.DeviceBuffer</code> have been renamed to <code>CUDA.DeviceMemory</code> &#40;the same applies to <code>Mem.Host</code> and <code>Mem.Unified</code>&#41;;</p>
</li>
<li><p>enums from the <code>Mem</code> submodule have gained a <code>MEM</code> suffix, e.g., <code>Mem.ATTACH_GLOBAL</code> has been renamed to <code>CUDA.MEM_ATTACH_GLOBAL</code>;</p>
</li>
<li><p><code>Mem.set&#33;</code> has been renamed to <code>CUDA.memset</code>;</p>
</li>
<li><p><code>Mem.info&#40;&#41;</code> has been renamed to <code>CUDA.memory_info&#40;&#41;</code>;</p>
</li>
</ul>
</li>
<li><p><code>CUDA.memory_status&#40;&#41;</code> has been renamed to <code>CUDA.pool_status&#40;&#41;</code>;</p>
</li>
<li><p><code>CUDA.available_memory&#40;&#41;</code> has been renamed to <code>CUDA.free_memory&#40;&#41;</code>.</p>
</li>
</ul>
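<p>In code, the renames map roughly as follows &#40;a sketch based on the list above; the exact array type parameter shown in the final assertion is an assumption&#41;:</p>
<pre><code class="language-julia">using CUDA

# CUDA.available_memory&#40;&#41;  →  CUDA.free_memory&#40;&#41;
@show CUDA.free_memory&#40;&#41;

# Mem.info&#40;&#41;  →  CUDA.memory_info&#40;&#41;
@show CUDA.memory_info&#40;&#41;

# CUDA.memory_status&#40;&#41;  →  CUDA.pool_status&#40;&#41;
CUDA.pool_status&#40;&#41;

# Mem.DeviceBuffer  →  CUDA.DeviceMemory, e.g. as it appears in array types
a &#61; CuArray&#123;Float32&#125;&#40;undef, 4&#41;
@assert a isa CuArray&#123;Float32, 1, CUDA.DeviceMemory&#125;</code></pre>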
<p>The meat of this release is in the memory management improvements detailed below. These changes can have a significant impact on the performance of your application, so it&#39;s recommended to thoroughly test your application after upgrading&#33;</p>
<h2 id="eager_garbage_collection">Eager garbage collection</h2>
<p>Julia is a garbage collected language, which means that &#40;GPU&#41; allocations can fail because garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled this at the allocation site, detecting out-of-memory errors and triggering the GC. This was not ideal, as it could lead to significant pauses and a bloated memory usage.</p>
<p>To improve this, <strong>CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that information to trigger the GC early at appropriate times</strong>, e.g., when waiting for a kernel to finish. This should lead to more predictable performance, both by distributing the cost of garbage collection over time and by potentially masking it behind other operations.</p>
<p>For example, the following toy model implemented with Flux.jl allocates a ton of memory:</p>
<pre><code class="language-julia">using CUDA, Flux
using MLUtils: DataLoader

n_obs &#61; 300_000
n_feature &#61; 1000
X &#61; rand&#40;n_feature, n_obs&#41;
y &#61; rand&#40;1, n_obs&#41;
train_data &#61; DataLoader&#40;&#40;X, y&#41; |&gt; gpu; batchsize &#61; 2048, shuffle&#61;false&#41;

model &#61; Dense&#40;n_feature, 1&#41; |&gt; gpu
loss&#40;m, _x, _y&#41; &#61; Flux.Losses.mse&#40;m&#40;_x&#41;, _y&#41;
opt_state &#61; Flux.setup&#40;Flux.Adam&#40;&#41;, model&#41;
for epoch in 1:100
  Flux.train&#33;&#40;loss, model, train_data, opt_state&#41;
end</code></pre>
<p>Without eager garbage collection, this leads to expensive pauses while freeing a large amount of memory at every epoch. We can simulate this by artificially limiting the memory available to the GPU, while also disabling the new eager garbage collection feature by setting the <code>JULIA_CUDA_GC_EARLY</code> environment variable to <code>false</code> &#40;this is a temporary knob that will be removed in the future, but may be useful now for evaluating the new feature&#41;:</p>
<pre><code class="language-text">❯ JULIA_CUDA_GC_EARLY&#61;false JULIA_CUDA_HARD_MEMORY_LIMIT&#61;4GiB \
  julia --project train.jl
...
&#91; Info: Epoch 90 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 91 train time 0.031s
&#91; Info: Epoch 92 train time 0.027s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 93 train time 0.03s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 94 train time 0.031s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 95 train time 0.03s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 96 train time 0.031s
&#91; Info: Epoch 97 train time 0.027s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 98 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 99 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 100 train time 0.031s
&#91; Info: Total time 4.307s</code></pre>
<p>With eager garbage collection enabled, more frequent but less costly pauses result in significantly improved performance:</p>
<pre><code class="language-text">❯ JULIA_CUDA_GC_EARLY&#61;true JULIA_CUDA_HARD_MEMORY_LIMIT&#61;4GiB \
  julia --project train.jl
...
&#91; Info: Epoch 90 train time 0.031s
maybe_collect: collected 1.8 GiB
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 91 train time 0.033s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 92 train time 0.031s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 93 train time 0.031s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 94 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 95 train time 0.03s
maybe_collect: collected 1.8 GiB
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 96 train time 0.033s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 97 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 98 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 99 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 100 train time 0.03s
&#91; Info: Total time 3.76s</code></pre>
<p>Eager garbage collection is driven by a heuristic that considers the current memory pressure, how much memory was freed during previous collections, and how much time that took. It is possible that the current implementation is not optimal, so if you encounter performance issues, please file an issue.</p>
<h2 id="tracked_memory_allocations">Tracked memory allocations</h2>
<p>When working with multiple GPUs, it is important to differentiate between the device that memory was allocated on, and the device used to execute code. Practically, this meant that users of CUDA.jl had to manually remember that allocating and using <code>CuArray</code> objects &#40;typically&#41; needed to happen with the same device active. The same is true for streams, which are used to order operations executing on a single GPU.</p>
<p>To improve this, <strong>CUDA.jl now keeps track of the device that owns the memory, and the stream last used to access it, enabling the package to &quot;do the right thing&quot; when using that memory</strong> in kernels or with library functionality. This does <strong>not</strong> mean that CUDA.jl will automatically switch the active device: We want to keep the user in control of that, as it often makes sense to access memory from another device, if your system supports it.</p>
<p>Let&#39;s break down what the implications are of this change.</p>
<p><strong>1. Using multiple GPUs</strong></p>
<p>If you have multiple GPUs, it may be possible that direct P2P access between devices is possible &#40;e.g., using NVLink, or just over PCIe&#41;. In this case, CUDA.jl will now automatically configure the system to allow such access, making it possible to seamlessly use memory allocated on one device in kernels executing on a different device:</p>
<pre><code class="language-julia">julia&gt; # Allocate memory on device 0
       device&#33;&#40;0&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-16GB
julia&gt; a &#61; CuArray&#40;&#91;1&#93;&#41;;julia&gt; # Use on device 1
       device&#33;&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100S-PCIE-32GB
julia&gt; a .&#43; 1;</code></pre>
<p>If P2P access between devices is not possible, CUDA.jl will now raise an error instead of throwing an illegal memory access error as it did before:</p>
<pre><code class="language-julia">julia&gt; # Use on incompatible device 2
       device&#33;&#40;2&#41;
CuDevice&#40;2&#41;: NVIDIA GeForce GTX 1080 Ti
julia&gt; a .&#43; 1
ERROR: cannot take the GPU address of inaccessible device memory.You are trying to use memory from GPU 0 on GPU 2.
P2P access between these devices is not possible;
either switch to GPU 0 by calling &#96;CUDA.device&#33;&#40;0&#41;&#96;,
or copy the data to an array allocated on device 2.</code></pre>
<p>As the error message suggests, you can always copy memory between devices using the <code>copyto&#33;</code> function. In this case, CUDA.jl will fall back to staging the copy on the host when P2P access is not possible.</p>
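<p>For instance, a cross-device copy might look like this &#40;a sketch assuming the device numbering from the snippets above&#41;:</p>
<pre><code class="language-julia">using CUDA

device&#33;&#40;0&#41;
a &#61; CuArray&#40;&#91;1, 2, 3&#93;&#41;          # lives on GPU 0

device&#33;&#40;2&#41;
b &#61; CuArray&#123;Int&#125;&#40;undef, 3&#41;      # lives on GPU 2

# Works regardless of P2P support: CUDA.jl stages the copy
# through host memory when direct access is not possible.
copyto&#33;&#40;b, a&#41;</code></pre>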
<p><strong>2. Using multiple streams</strong></p>
<p>Streams are used to order operations executing on a single GPU. In CUDA.jl, every Julia task has its own stream, making it very easy to group independent operations together, and make it possible for the GPU to potentially overlap execution of these operations.</p>
<p>Before CUDA.jl v5.4, users had to be careful about synchronizing data used in multiple tasks. It was recommended, for example, to end every data-producing task with an explicit call to <code>synchronize&#40;&#41;</code>, or alternatively make sure to <code>device_synchronize&#40;&#41;</code> at the start of a data-consuming task. Now that CUDA.jl keeps track of the stream used to last access memory, it can automatically synchronize streams when needed:</p>
<pre><code class="language-julia"># Allocate some data
a &#61; CUDA.zeros&#40;4096, 4096&#41;
b &#61; CUDA.zeros&#40;4096, 4096&#41;
#synchronize&#40;&#41;  # No longer needed

# Perform work on a task
t &#61; @async begin
  a * b
  #synchronize&#40;&#41;  # No longer needed
end

# Fetch the results
c &#61; fetch&#40;t&#41;</code></pre>
<p><strong>3. Using capturing APIs</strong></p>
<p>All of the above is implemented by piggybacking on the function that converts memory objects to pointers, on the assumption that this will be the final operation before the memory is used. This is generally true, with one important exception: APIs that capture memory. For example, when recording an operation using the CUDA graph APIs, a memory address may be captured and used later without CUDA.jl being aware of it.</p>
<p>CUDA.jl accounts for this by detecting conversions during stream capture; however, some APIs may not be covered yet. If you encounter issues with capturing APIs, let us know, and keep using additional synchronization calls to ensure correctness.</p>
<h2 id="unified_memory_iteration">Unified memory iteration</h2>
<p>Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and the GPU. We have now greatly <strong>improved the performance of using unified memory with CPU code that iterates over elements</strong> of a <code>CuArray</code>. Although this is typically unwanted, triggering the dreaded &quot;scalar indexing&quot; error when accessing device memory in such a way, it can be useful when incrementally porting code to the GPU.</p>
<p>Concretely, accessing elements of a unified <code>CuArray</code> on the CPU is much faster now:</p>
<pre><code class="language-julia-repl">julia&gt; # Reference
       a &#61; &#91;1&#93;;
julia&gt; @btime &#36;a&#91;&#93;;
  1.959 ns &#40;0 allocations: 0 bytes&#41;julia&gt; b &#61; cu&#40;a; unified&#61;true&#41;;julia&gt; # Before
       @btime &#36;b&#91;&#93;
  2.617 μs &#40;0 allocations: 0 bytes&#41;;julia&gt; # After
       @btime &#36;b&#91;&#93;;
  4.140 ns &#40;0 allocations: 0 bytes&#41;</code></pre>
<p>Notice the different unit&#33; This has a massive impact on real-life performance, for example, as demonstrated by calling <code>foldl</code> which does not have a GPU-optimized implementation:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;; unified&#61;true&#41;;julia&gt; # Before
       @b foldl&#40;&#43;, a&#41;
4.210 s &#40;9 allocs: 208 bytes, without a warmup&#41;julia&gt; # After
       @b foldl&#40;&#43;, a&#41;
3.107 ms &#40;9 allocs: 208 bytes&#41;</code></pre>
<p>For completeness, doing this with regular device memory triggers a scalar indexing error:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;&#41;;julia&gt; foldl&#40;&#43;, a&#41;
ERROR: Scalar indexing is disallowed.</code></pre>
<p>These changes should make it easier to port applications to the GPU by incrementally moving parts of the codebase to the GPU without having to worry about the performance of accessing memory from the CPU. The only requirement is to use unified memory, e.g., by calling <code>cu</code> with <code>unified&#61;true</code>, or setting the CUDA.jl preference <code>default_memory</code> to use unified memory by default. However, as unified memory comes with a slight cost, and results in synchronous allocation behavior, it is still recommended to switch back to regular device memory when your application has been fully ported to the GPU.</p>
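<p>Opting in can be done per-array or globally; a sketch &#40;the Preferences.jl invocation for the <code>default_memory</code> preference is an assumption&#41;:</p>
<pre><code class="language-julia">using CUDA

# Per-array: allocate unified memory explicitly
a &#61; cu&#40;rand&#40;Float32, 1024&#41;; unified&#61;true&#41;
sum&#40;a&#41;   # GPU code works as usual
a&#91;1&#93;     # fast CPU access, no scalar-indexing error

# Globally &#40;assumption: set via Preferences.jl, then restart Julia&#41;:
# using Preferences
# set_preferences&#33;&#40;CUDA, &quot;default_memory&quot; &#61;&gt; &quot;unified&quot;&#41;</code></pre>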
<h2 id="other_changes">Other changes</h2>
<p>To keep this post from becoming even longer, a quick rundown of other changes:</p>
<ul>
<li><p><a href="https://github.com/wsmoses">@wsmoses</a> introduced initial support for automatic differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have to differentiate through host and device code separately, and manually set up rules for crossing the host/device boundary. Now, you can differentiate through entire applications with ease;</p>
</li>
<li><p><code>CUDA.@profile</code> now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2339">automatically detects external profilers</a>, so it should not be required to specify <code>external&#61;true</code> anymore when running under NSight;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2342">Exception output has been improved</a>, only reporting a single error message instead of generating output on each thread, and better forwarding the exception type;</p>
</li>
<li><p>Cached handles from libraries <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2352">will now be freed</a> when under memory pressure;</p>
</li>
<li><p>Tegra devices <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2374">are now supported</a> by our artifacts, obviating the need for a local toolkit;</p>
</li>
<li><p>Support for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2392">CUDA 12.5</a> has been added, as well as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2390">initial support for Julia 1.12</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 28 May 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl 1.5: Ponte Vecchio support and oneMKL improvements]]></title>
  <link>https://juliagpu.org/post/2024-05-24-oneapi_1.5/index.html</link>
  <guid>https://juliagpu.org/2024-05-24-oneapi_1.5/</guid>
<description><![CDATA[oneAPI.jl v1.5 is a significant release that brings many new features, from extended hardware support to greatly improved wrappers of the oneMKL math library.]]></description>  
  
  <content:encoded><![CDATA[
<p>oneAPI.jl v1.5 is a significant release that brings many new features, from extended hardware support to greatly improved wrappers of the oneMKL math library.</p>
<h2 id="intel_ponte_vecchio">Intel Ponte Vecchio</h2>
<p>In oneAPI.jl v1.5 we introduce support for the Intel Ponte Vecchio &#40;PVC&#41; architecture, which powers the Xe HPC GPUs as found in the Aurora supercomputer:</p>
<pre><code class="language-julia-repl">julia&gt; oneAPI.versioninfo&#40;&#41;
Binary dependencies:
- NEO: 24.13.29138&#43;0
- libigc: 1.0.16510&#43;0
- gmmlib: 22.3.18&#43;0
- SPIRV_LLVM_Translator_unified: 0.4.0&#43;0
- SPIRV_Tools: 2023.2.0&#43;0Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.71 driver:
- 00000000-0000-0000-17d2-6b1e010371d2 &#40;v1.3.29138, API v1.3.0&#41;16 devices:
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550</code></pre>
<p>Apart from <a href="https://github.com/JuliaGPU/oneAPI.jl/issues/428">a handful of MKL-related issues</a>, oneAPI.jl is fully functional on PVC, and passes all tests.</p>
<h2 id="onemkl_wrappers">oneMKL wrappers</h2>
<p>Thanks to the work of <a href="https://github.com/amontoison">@amontoison</a>, oneAPI.jl now provides greatly improved wrappers of the oneMKL library. This includes support for:</p>
<ul>
<li><p>LAPACK: <code>geqrf</code>&#40;<code>_batched</code>&#41;, <code>orgqr</code>&#40;<code>_batched</code>&#41;, <code>ormqr</code>, <code>potrf</code>&#40;<code>_batched</code>&#41;, <code>potrs</code>&#40;<code>_batched</code>&#41;, <code>getrf</code>&#40;<code>_batched</code>&#41;, <code>getri</code>&#40;<code>_batched</code>&#41;, <code>gebrd</code>, <code>gesvd</code>, <code>syevd</code>, <code>heevd</code>, <code>sygvd</code>, <code>hegvd</code></p>
</li>
<li><p>Sparse arrays: <code>sparse_gemm</code>, <code>sparse_gemv</code>, <code>sparse_symv</code>, <code>sparse_trmv</code>, <code>sparse_trsv</code>, <code>sparse_optimize_gemv</code>, <code>sparse_optimize_trsv</code></p>
</li>
</ul>
<p>Where possible, these functions are integrated with standard library interfaces, e.g., making it possible to simply call <code>eigen</code>, or to multiply two <code>oneSparseMatrixCSR</code>s.</p>
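<p>As an illustration, a symmetric eigendecomposition can go through the standard <code>eigen</code> interface and dispatch to oneMKL&#39;s <code>syevd</code> under the hood. This is a sketch: the exact set of element types and wrappers covered may differ depending on your oneAPI.jl version and driver:</p>
<pre><code class="language-julia">using oneAPI, LinearAlgebra

# build a small symmetric matrix on the device
A = oneArray(rand(Float32, 4, 4))
A = A + A'

# the symmetric wrapper lets `eigen` dispatch to oneMKL's syevd
vals, vecs = eigen(Hermitian(A))</code></pre>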
<h2 id="minor_changes">Minor changes</h2>
<p>There have of course been many other changes and improvements in oneAPI.jl v1.5. For a full list, please refer to the <a href="https://github.com/JuliaGPU/oneAPI.jl/releases/tag/v1.5.0">release notes</a>, but some highlights include:</p>
<ul>
<li><p>a new launch configuration heuristic that should generally improve performance;</p>
</li>
<li><p>broadcast now preserves the buffer type &#40;host, device, or shared&#41;;</p>
</li>
<li><p>support for very large arrays that exceed the default device memory limit;</p>
</li>
<li><p>several toolchain bumps, with v1.5 using oneAPI 2024.1.0 with driver 24.13.29138.7;</p>
</li>
<li><p>minimal support for native Windows &#40;next to WSL, which is fully supported&#41;.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 24 May 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.2 and 5.3: Maintenance releases]]></title>
  <link>https://juliagpu.org/post/2024-04-26-cuda_5.2_5.3/index.html</link>
  <guid>https://juliagpu.org/2024-04-26-cuda_5.2_5.3/</guid>
  <description><![CDATA[CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.</p>
<h2 id="profiler_improvements">Profiler improvements</h2>
<p>CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia GPU applications without having to use Nsight Systems or other external tools. The tool has seen continued development, mostly improving its robustness, but CUDA.jl now also provides a <code>@bprofile</code> equivalent that runs your application multiple times and reports on the time distribution of individual events:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@bprofile CuArray&#40;&#91;1&#93;&#41; .&#43; 1
Profiler ran for 1.0 s, capturing 1427349 events.

Host-side activity: calling CUDA APIs took 792.95 ms &#40;79.29&#37; of the trace&#41;
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │  Calls │ Time distribution                     │ Name                    │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│   19.27&#37; │  192.67 ms │ 109796 │   1.75 µs ± 10.19  &#40;  0.95 ‥ 1279.83&#41; │ cuMemAllocFromPoolAsync │
│   17.08&#37; │   170.8 ms │  54898 │   3.11 µs ± 0.27   &#40;  2.15 ‥ 23.84&#41;   │ cuLaunchKernel          │
│   16.77&#37; │  167.67 ms │  54898 │   3.05 µs ± 0.24   &#40;  0.48 ‥ 16.69&#41;   │ cuCtxSynchronize        │
│   14.11&#37; │  141.12 ms │  54898 │   2.57 µs ± 0.79   &#40;  1.67 ‥ 70.57&#41;   │ cuMemcpyHtoDAsync       │
│    1.70&#37; │   17.04 ms │  54898 │ 310.36 ns ± 132.89 &#40;238.42 ‥ 5483.63&#41; │ cuStreamSynchronize     │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 87.38 ms &#40;8.74&#37; of the trace&#41;
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                     │ Name               │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│    6.66&#37; │   66.61 ms │ 54898 │   1.21 µs ± 0.16   &#40;  0.95 ‥ 1.67&#41;    │ kernel             │
│    2.08&#37; │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 &#40;238.42 ‥ 1192.09&#41; │ &#91;copy to device&#93;   │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   98.99&#37; │  989.94 ms │ 54898 │  18.03 µs ± 49.88  &#40; 15.26 ‥ 10731.22&#41; │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘</code></pre>
<p>By default, <code>CUDA.@bprofile</code> runs the application for 1 second, but this can be adjusted using the <code>time</code> keyword argument.</p>
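<p>For example, to gather events over a longer window &#40;illustrative invocation; the broadcast being benchmarked is arbitrary&#41;:</p>
<pre><code class="language-julia">using CUDA

x = CuArray([1])

# profile repeated executions for 5 seconds instead of the default 1 second
CUDA.@bprofile time=5 x .+ 1</code></pre>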
<p>Display of the time distribution isn&#39;t limited to <code>CUDA.@bprofile</code>, and will also be used by <code>CUDA.@profile</code> when any operation is called more than once. For example, with the broadcasting example from above we allocate both the input <code>CuArray</code> and the broadcast result, which results in two calls to the allocator:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CuArray&#40;&#91;1&#93;&#41; .&#43; 1Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   99.92&#37; │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
│    0.02&#37; │   21.22 µs │     2 │  10.61 µs ± 6.57   &#40;  5.96 ‥ 15.26&#41; │ cuMemAllocFromPoolAsync │
│    0.02&#37; │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
│    0.00&#37; │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘</code></pre>
<p>It is also no longer required to specify <code>external&#61;true</code> when using <code>CUDA.@profile</code> in combination with a tool like Nsight Systems, as CUDA.jl will automatically detect the presence of an external profiler:</p>
<pre><code class="language-julia-repl">shell&gt; nsys launch julia# warm-up
julia&gt; CuArray&#40;&#91;1&#93;&#41;.&#43;1
1-element CuArray&#123;Int64, 1, CUDA.Mem.DeviceBuffer&#125;:
 2julia&gt; CUDA.@profile CuArray&#40;&#91;1&#93;&#41;.&#43;1
&#91; Info: This Julia session is already being profiled; defaulting to the external profiler.
Capture range started in the application.
Capture range ended in the application.
Generating &#39;/tmp/nsys-report-c42f.qdstrm&#39;
&#91;1/1&#93; &#91;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;100&#37;&#93; report1.nsys-rep</code></pre>
<p>In case detection fails, the <code>external</code> keyword argument remains available &#40;but please do file an issue&#41;.</p>
<h2 id="kernel_launch_debugging">Kernel launch debugging</h2>
<p>A common issue with CUDA programming is that kernel launches may fail when exhausting certain resources, such as shared memory or registers. This typically results in a cryptic error message, but CUDA.jl will now try to diagnose launch failures and provide a more helpful error message, as suggested by <a href="https://github.com/simonbyrne">@simonbyrne</a>:</p>
<p>For example, when using more parameter memory than allowed by the architecture:</p>
<pre><code class="language-julia-repl">julia&gt; kernel&#40;x&#41; &#61; nothing
julia&gt; @cuda kernel&#40;ntuple&#40;_-&gt;UInt64&#40;1&#41;, 2^13&#41;&#41;
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.</code></pre>
<p>Or when using an invalid launch configuration, violating a device limit:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;2000 identity&#40;nothing&#41;
ERROR: Number of threads in x-dimension exceeds device limit &#40;2000 &gt; 1024&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
<p>We also diagnose launch failures that involve kernel-specific limits, such as exceeding the number of threads that are allowed in a block &#40;e.g., because of register use&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;1024 heavy_kernel&#40;&#41;
ERROR: Number of threads per block exceeds kernel limit &#40;1024 &gt; 512&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
<h2 id="sorting_improvements">Sorting improvements</h2>
<p>Thanks to <a href="https://github.com/xaellison">@xaellison</a>, our bitonic sorting implementation now supports sorting specific dimensions, making it possible to implement <code>sortperm</code> for multi-dimensional arrays:</p>
<pre><code class="language-julia-repl">julia&gt; A &#61; cu&#40;&#91;8 7; 5 6&#93;&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 8  7
 5  6

julia&gt; sortperm&#40;A, dims &#61; 1&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 2  4
 1  3

julia&gt; sortperm&#40;A, dims &#61; 2&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 3  1
 2  4</code></pre>
<p>The bitonic kernel is now used for all sorting operations, replacing the often slower quicksort implementation:</p>
<pre><code class="language-julia-repl"># before &#40;quicksort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
  2.760 ms &#40;30 allocations: 1.02 KiB&#41;

# after &#40;bitonic sort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
  246.386 μs &#40;567 allocations: 13.66 KiB&#41;

# reference CPU time
julia&gt; @btime sort&#40;&#36;&#40;rand&#40;Float32, 1024, 1024&#41;&#41;; dims&#61;1&#41;
  4.795 ms &#40;1030 allocations: 5.07 MiB&#41;</code></pre>
<h2 id="unified_memory_fixes">Unified memory fixes</h2>
<p>CUDA.jl 5.1 greatly improved support for unified memory, and this has continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting <code>CuArray</code>s we now correctly preserve the memory type of the input arrays. This means that if you broadcast a <code>CuArray</code> that is allocated as unified memory, the result will also be allocated as unified memory. In case of a conflict, e.g. broadcasting a unified <code>CuArray</code> with one backed by device memory, we will prefer unified memory:</p>
<pre><code class="language-julia-repl">julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; 1
1-element CuArray&#123;Int64, 1, Mem.HostBuffer&#125;:
 2

julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; cu&#40;&#91;2&#93;; device&#61;true&#41;
1-element CuArray&#123;Int64, 1, Mem.UnifiedBuffer&#125;:
 3</code></pre>
<h2 id="software_updates">Software updates</h2>
<p>Finally, we also did routine updates of the software stack, supporting the latest and greatest from NVIDIA. This includes support for <strong>CUDA 12.4</strong> &#40;Update 1&#41;, <strong>cuDNN 9</strong>, and <strong>cuTENSOR 2.0</strong>. This latest release of cuTENSOR is noteworthy as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to follow this change. For more details, refer to the <a href="https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html">cuTENSOR 2 migration guide</a> by NVIDIA.</p>
<p>Of course, cuTENSOR.jl also provides a high-level Julia API which has been mostly unaffected by these changes:</p>
<pre><code class="language-julia">using CUDA
A &#61; CUDA.rand&#40;7, 8, 3, 2&#41;
B &#61; CUDA.rand&#40;3, 2, 2, 8&#41;
C &#61; CUDA.rand&#40;3, 3, 7, 2&#41;

using cuTENSOR
tA &#61; CuTensor&#40;A, &#91;&#39;a&#39;, &#39;f&#39;, &#39;b&#39;, &#39;e&#39;&#93;&#41;
tB &#61; CuTensor&#40;B, &#91;&#39;c&#39;, &#39;e&#39;, &#39;d&#39;, &#39;f&#39;&#93;&#41;
tC &#61; CuTensor&#40;C, &#91;&#39;b&#39;, &#39;c&#39;, &#39;a&#39;, &#39;d&#39;&#93;&#41;

using LinearAlgebra
mul&#33;&#40;tC, tA, tB&#41;</code></pre>
<p>This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and have to adapt to the new API, now is a good time to consider improving the high-level interface instead&#33;</p>
<h2 id="future_releases">Future releases</h2>
<p>The next release of CUDA.jl is gearing up to be a much larger release, with significant changes to both the API and internals of the package. Although the intent is to keep these changes non-breaking, it is always possible that some code will be affected in unexpected ways, so we encourage users to test the upcoming release by simply running <code>&#93; add CUDA#master</code> and report any issues.</p>
]]></content:encoded>
    
  <pubDate>Fri, 26 Apr 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.1: Unified memory and cooperative groups]]></title>
  <link>https://juliagpu.org/post/2023-11-07-cuda_5.1/index.html</link>
  <guid>https://juliagpu.org/2023-11-07-cuda_5.1/</guid>
  <description><![CDATA[CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h2 id="unified_memory">Unified memory</h2>
<p>Unified memory is a feature of CUDA that allows the programmer to <strong>access memory from both the CPU and GPU</strong>, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.</p>
<p>CUDA.jl already supported unified memory, but only for the most basic use cases. With CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that memory from the CPU:</p>
<pre><code class="language-julia-repl">julia&gt; gpu &#61; cu&#40;&#91;1., 2.&#93;; unified&#61;true&#41;
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1.0
 2.0

julia&gt; # accessing GPU memory from the CPU
       gpu&#91;1&#93; &#61; 3;

julia&gt; gpu
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 3.0
 2.0</code></pre>
<p>Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is <strong>safe and efficient to perform scalar iteration on <code>CuArray</code>s backed by unified memory</strong>. This greatly simplifies porting applications to the GPU, as it no longer is a problem when code uses <code>AbstractArray</code> fallbacks from Base that process element by element.</p>
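<p>For example, a plain element-by-element loop from the CPU, which would previously have raised a scalar-indexing error, now just works on a unified array &#40;a minimal sketch&#41;:</p>
<pre><code class="language-julia">using CUDA

gpu = cu([1.0, 2.0, 3.0]; unified=true)

# scalar iteration from the CPU is safe on unified memory
total = 0.0
for i in eachindex(gpu)
    total += gpu[i]
end</code></pre>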
<p>In addition, CUDA.jl 5.1 also makes it <strong>easier to convert <code>CuArray</code>s to <code>Array</code> objects</strong>. This is important when wanting to use high-performance CPU libraries like BLAS or LAPACK which do not support <code>CuArray</code>s:</p>
<pre><code class="language-julia-repl">julia&gt; cpu &#61; unsafe_wrap&#40;Array, gpu&#41;
2-element Vector&#123;Float32&#125;:
 3.0
 2.0

julia&gt; LinearAlgebra.BLAS.scal&#33;&#40;2f0, cpu&#41;;

julia&gt; gpu
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 6.0
 4.0</code></pre>
<p>The reverse is also possible: CPU-based <code>Array</code>s can now trivially be converted to <code>CuArray</code> objects for use on the GPU, <strong>without the need to explicitly allocate unified memory</strong>. This further simplifies memory management, as it makes it possible to use the GPU inside of an existing application without having to copy data into a <code>CuArray</code>:</p>
<pre><code class="language-julia-repl">julia&gt; gpu &#61; unsafe_wrap&#40;CuArray, cpu&#41;
2-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1
 2

julia&gt; CUDA.@sync gpu .&#43;&#61; 1;

julia&gt; cpu
2-element Vector&#123;Int64&#125;:
 2
 3</code></pre>
<p>Note that the above methods are prefixed <code>unsafe</code> because of how they require <strong>careful management of object lifetimes</strong>: When creating an <code>Array</code> from a <code>CuArray</code>, the <code>CuArray</code> must be kept alive for as long as the <code>Array</code> is used, and vice-versa when creating a <code>CuArray</code> from an <code>Array</code>. Explicit synchronization &#40;i.e. waiting for the GPU to finish computing&#41; is also required, as CUDA.jl cannot synchronize automatically when accessing GPU memory through a CPU pointer.</p>
<p>For now, CUDA.jl still defaults to device memory for unspecified allocations. This can be changed using the <code>default_memory</code> <a href="https://github.com/JuliaPackaging/Preferences.jl">preference</a> of the CUDA.jl module, which can be set to either <code>&quot;device&quot;</code>, <code>&quot;unified&quot;</code> or <code>&quot;host&quot;</code>. When these changes have been sufficiently tested, and the remaining rough edges have been smoothed out, we may consider switching the default allocator.</p>
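<p>Changing this preference can be done with Preferences.jl; for example &#40;a sketch, taking effect after restarting Julia&#41;:</p>
<pre><code class="language-julia">using CUDA, Preferences

# make unified memory the default for new allocations
set_preferences!(CUDA, "default_memory" =&gt; "unified")</code></pre>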
<h2 id="cooperative_groups">Cooperative groups</h2>
<p>Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA cooperative groups API. Cooperative groups are a low-level feature of CUDA that make it possible to <strong>write kernels that are more flexible than the traditional approach</strong> of differentiating computations based on thread and block indices. Instead, cooperative groups allow the programmer to use objects representing groups of threads, pass those around, and differentiate computations based on queries on those objects.</p>
<p>For example, let&#39;s port the example from the <a href="https://developer.nvidia.com/blog/cooperative-groups/">introductory NVIDIA blog post</a>, which provides a function to compute the sum of an array in parallel:</p>
<pre><code class="language-julia">function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each thread adds its partial sum[i] to sum[lane+i]
    i = CG.num_threads(group) ÷ 2
    while i &gt; 0
        temp[lane] = val
        CG.sync(group)
        if lane &lt;= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val  # note: only thread 1 will return full sum
end</code></pre>
<p>When the threads of a group call this function, they cooperatively compute the sum of the values passed by each thread in the group. For example, let&#39;s write a kernel that calls this function using a group representing the current thread block:</p>
<pre><code class="language-julia">function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)

    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i &lt;= length(input)
        sum += input[i]
        i += stride
    end

    return sum
end

n = 1&lt;&lt;24
threads = 256
blocks = cld(n, threads)

data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)</code></pre>
<p>This style of programming makes it possible to write kernels that are safer and more modular than traditional kernels. Some CUDA features also require the use of cooperative groups, for example, asynchronous memory copies between global and shared memory are done using the <code>CG.memcpy_async</code> function.</p>
<p>With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support has been added for implicit groups &#40;with the exception of cluster groups and the deprecated multi-grid groups&#41;, all relevant queries on these groups, as well as the many important collective functions, such as <code>shuffle</code>, <code>vote</code>, and <code>memcpy_async</code>. Support for explicit groups is still missing, as are collectives like <code>reduce</code> and <code>invoke</code>. For more information, refer to <a href="https://cuda.juliagpu.org/dev/development/kernel/#Cooperative-groups">the CUDA.jl documentation</a>.</p>
<h2 id="other_updates">Other updates</h2>
<p>Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements:</p>
<ul>
<li><p>Support for CUDA 12.3</p>
</li>
<li><p>Performance improvements related to memory copies, which regressed in CUDA 5.0</p>
</li>
<li><p>Improvements to the native profiler &#40;<code>CUDA.@profile</code>&#41;, now also showing local memory usage, supporting more NVTX metadata, and with better support for Pluto.jl and Jupyter</p>
</li>
<li><p>Many CUSOLVER and CUSPARSE improvements by <a href="https://github.com/amontoison">@amontoison</a></p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 07 Nov 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.0: Integrated profiler and task synchronization changes]]></title>
  <link>https://juliagpu.org/post/2023-09-19-cuda_5.0/index.html</link>
  <guid>https://juliagpu.org/2023-09-19-cuda_5.0/</guid>
  <description><![CDATA[CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.</p>
<h2 id="integrated_profiler">Integrated profiler</h2>
<p>The most exciting new feature in CUDA.jl 5.0 is <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2024">the new integrated profiler</a>, which is similar to the <code>@profile</code> macro from the Julia standard library. The profiler can be used by simply prefixing any code that uses the CUDA libraries with <code>CUDA.@profile</code>:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CUDA.rand&#40;1&#41;.&#43;1
Profiler ran for 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs &#40;85.97&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   76.47&#37; │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
│    5.42&#37; │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
│    2.93&#37; │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
│    0.36&#37; │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs &#40;0.80&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│    0.44&#37; │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│    0.36&#37; │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
                                                                                  1 column omitted
1-element CuArray&#123;Float32, 1, CUDA.Mem.DeviceBuffer&#125;:
 1.7242923</code></pre>
<p>The output shown above is a summary of what happened during the execution of the code. It is split into two sections: <strong>host-side activity</strong>, i.e., API calls to the CUDA libraries, and the resulting <strong>device-side activity</strong>. As part of each section, the output shows the time spent and the ratio to the total execution time. These ratios are important, and a good tool to quickly assess the performance of your code. For example, in the above output, we see that most of the time is spent on the host calling the CUDA libraries, and only very little time is actually spent computing things on the GPU. This indicates that the GPU is severely underutilized, which can be solved by increasing the problem size.</p>
<p>Instead of a summary, it is also possible to view a <strong>chronological trace</strong> by passing the <code>trace&#61;true</code> keyword argument:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile trace&#61;true CUDA.rand&#40;1&#41;.&#43;1;
Profiler ran for 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs &#40;86.40&#37; of the trace&#41;
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │     Start │      Time │                    Name │ Details                │
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
│  5 │   6.44 µs │   9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│  7 │  19.31 µs │ 715.26 ns │        cudaGetLastError │ -                      │
│  8 │  22.41 µs │ 204.09 µs │        cudaLaunchKernel │ -                      │
│  9 │ 227.21 µs │    0.0 ns │        cudaGetLastError │ -                      │
│ 14 │  232.7 µs │   3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │   7.39 µs │          cuLaunchKernel │ -                      │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs &#40;0.91&#37; of the trace&#41;
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                      ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
                                                                                  1 column omitted</code></pre>
<p>Here, we can see a list of events that the profiler captured. Each event has a unique ID, which can be used to correlate host-side and device-side events. For example, we can see that event 8 on the host is a call to <code>cudaLaunchKernel</code>, which corresponds to the execution of a CURAND kernel on the device.</p>
<p>The integrated profiler is a great tool to quickly assess the performance of your GPU application, identify bottlenecks, and find opportunities for optimization. For complex applications, however, it is still recommended to use NVIDIA&#39;s Nsight Systems or Nsight Compute profilers, which provide a more detailed, graphical view of what is happening on the GPU.</p>
<h2 id="synchronization_on_worker_threads">Synchronization on worker threads</h2>
<p>Another noteworthy change affects how tasks are synchronized. To enable concurrent execution, i.e., to make it possible for other Julia tasks to execute while waiting for the GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a significant source of latency, at least 25 µs per invocation but sometimes <em>much</em> longer, and have also been slated for deprecation and eventual removal from the CUDA toolkit.</p>
<p>Instead, on Julia 1.9 and later, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2025">now uses</a> worker threads to wait for GPU operations to finish. This mechanism is significantly faster, taking around 5 µs per invocation, but more importantly offers a much more reliable and predictable latency. You can observe this mechanism using the integrated profiler:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CUDA.rand&#40;1024, 1024, 1024&#41;
julia&gt; CUDA.@profile trace&#61;true CUDA.@sync a .&#43; a
Profiler ran for 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms &#40;95.64&#37; of the trace&#41;
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│  ID │     Start │      Time │ Thread │                    Name │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
│   9 │  36.72 µs │ 199.56 µs │      1 │          cuLaunchKernel │
│ 525 │ 510.69 µs │  11.75 ms │      2 │     cuStreamSynchronize │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘</code></pre>
<p>For some users, this may still be too slow, so we have added two mechanisms that disable nonblocking synchronization and simply block the calling thread until the GPU operation finishes. The first is a global setting: the <code>nonblocking_synchronization</code> preference, which can be set to <code>false</code> using Preferences.jl. <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2060">The second</a> is a fine-grained flag to pass to synchronization functions: <code>synchronize&#40;x; blocking&#61;true&#41;</code>, <code>CUDA.@sync blocking&#61;true ...</code>, etc. Neither mechanism should be used widely; they are only intended for latency-critical code, e.g., when benchmarking or profiling.</p>
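<p>As a sketch of the global mechanism &#40;assuming the preference belongs to the CUDA.jl package; a restart is required for it to take effect&#41;:</p>
<pre><code class="language-julia">using CUDA, Preferences

# Persist the preference in the active environment&#39;s LocalPreferences.toml;
# Julia needs to be restarted for it to take effect.
set_preferences&#33;&#40;CUDA, &quot;nonblocking_synchronization&quot; &#61;&gt; false&#41;</code></pre>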
<h2 id="local_toolkit_discovery">Local toolkit discovery</h2>
<p>One of the breaking changes involves <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2058">how local toolkits are discovered</a> when opting out of the use of artifacts. Previously, this could be enabled by calling <code>CUDA.set_runtime_version&#33;&#40;&quot;local&quot;&#41;</code>, which generated a <code>version &#61; &quot;local&quot;</code> preference. We have now split this into two separate preferences, <code>version</code> and <code>local</code>: the <code>version</code> preference overrides the version of the CUDA toolkit, while the <code>local</code> preference independently indicates whether to use a local CUDA toolkit.</p>
<p>Concretely, this means that you will now need to call <code>CUDA.set_runtime_version&#33;&#40;local_toolkit&#61;true&#41;</code> to enable the use of a local toolkit. The toolkit version will be auto-detected, but can be overridden by also passing a version: <code>CUDA.set_runtime_version&#33;&#40;version; local_toolkit&#61;true&#41;</code>. This may be necessary when CUDA is not available during precompilation, e.g., on the log-in node of a cluster, or when building a container image.</p>
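<p>For example, switching an environment over to a local toolkit might look as follows &#40;a sketch; the auto-detected version will differ per system, and the explicit version shown is only illustrative&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA

julia&gt; CUDA.set_runtime_version&#33;&#40;local_toolkit&#61;true&#41;

julia&gt; CUDA.set_runtime_version&#33;&#40;v&quot;12.2&quot;; local_toolkit&#61;true&#41;</code></pre>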
<h2 id="raised_minimum_requirements">Raised minimum requirements</h2>
<p>Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit version is now 11.4, but this cannot be enforced by the package manager. As a result, if you need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or below. <a href="https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md">The README</a> will maintain a table of supported CUDA toolkit versions.</p>
<p>Most users will not be affected by this change: If you use the artifact-provided CUDA toolkit, you will automatically get the latest version supported by your CUDA driver.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2034">Support for CUDA 12.2</a>;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2040">Memory limits</a> are now enforced by CUDA, resulting in better performance;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1946">Support for Julia 1.10</a> &#40;with help from <a href="https://github.com/dkarrasch">@dkarrasch</a>&#41;;</p>
</li>
<li><p>Support for batched <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1975"><code>gemm</code></a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1981"><code>gemv</code></a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2063"><code>svd</code></a> &#40;by <a href="https://github.com/lpawela">@lpawela</a> and <a href="https://github.com/nikopj">@nikopj</a>&#41;.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 19 Sep 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Profiling oneAPI.jl applications with VTune]]></title>
  <link>https://juliagpu.org/post/2023-07-19-oneapi_profiling/index.html</link>
  <guid>https://juliagpu.org/2023-07-19-oneapi_profiling/</guid>
  <description><![CDATA[Profiling GPU applications is hard, so this post shows how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia with oneAPI.jl.]]></description>  
  
  <content:encoded><![CDATA[
<p>Profiling GPU applications is hard, so this post shows how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia with oneAPI.jl.</p>
<p>Because of the asynchronous nature of GPU execution, profiling GPU applications with Julia&#39;s tried and tested tools like <code>@profile</code> or even <code>@time</code> can be misleading: They will only show the time spent on the CPU, and will likely report that your application is spending most of its time waiting for the GPU.</p>
<p>To get a better understanding of what is happening on the GPU, we need specialized tools. In this post, we&#39;ll show how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia using oneAPI.jl.</p>
<h2 id="set-up">Set-up</h2>
<p>Start by downloading and installing the <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html">Intel VTune Profiler</a>. This does not require administrative permissions, and will install in your home folder under the <code>intel</code> directory. On Linux, binaries will appear in <code>~/intel/oneapi/vtune/latest/bin64</code>. There are three that are particularly important:</p>
<ul>
<li><p><code>vtune</code>: a command-line tool to profile applications;</p>
</li>
<li><p><code>vtune-gui</code>: a graphical user interface to profile applications, or to visualize the results of a command-line profiling session;</p>
</li>
<li><p><code>vtune-backend</code>: a daemon that creates a web interface for VTune, which you can use to profile applications both locally and remotely.</p>
</li>
</ul>
<h2 id="hello_vtune">Hello VTune&#33;</h2>
<p>Let&#39;s start with a simple example: A Julia program that computes the sum of two arrays &#40;i.e., the <a href="https://github.com/JuliaGPU/oneAPI.jl/blob/master/examples/vadd.jl"><code>vadd</code> example</a> from the oneAPI repository&#41;:</p>
<pre><code class="language-julia">using oneAPI

function kernel&#40;a, b, c&#41;
    i &#61; get_global_id&#40;&#41;
    @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
    return
end

function vadd&#40;a, b&#41;
    d_a &#61; oneArray&#40;a&#41;
    d_b &#61; oneArray&#40;b&#41;
    d_c &#61; similar&#40;d_a&#41;

    @oneapi items&#61;size&#40;d_c&#41; kernel&#40;d_a, d_b, d_c&#41;
    Array&#40;d_c&#41;
end

function main&#40;N&#61;256&#41;
    a &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    b &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    c &#61; vadd&#40;a, b&#41;
end
main&#40;&#41;</code></pre>
<p>We&#39;ve tweaked this example to make it more suited for profiling: We&#39;ve enclosed the main application in a function so that it gets compiled, and we&#39;ve increased the array sizes to make the GPU work harder.</p>
<p>There are several ways to profile this application. We&#39;ll start by demonstrating the command-line interface:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-offload julia vadd.jlvtune: Collection started.
vtune: Collection stopped.vtune: Using result path &#96;/home/tim/Julia/pkg/oneAPI/r000gh&#39;
    GPU Time: 0.002s
EU Array Stalled/Idle: 100.0&#37; of Elapsed time with GPU busy
 | The percentage of time when the EUs were stalled or idle is high, which has a
 | negative impact on compute-bound applications.
FPU Utilization: 0.0&#37; of Elapsed time with GPU busy
...</code></pre>
<p>This will run the application, and collect a number of GPU-related metrics. A summary is shown in the terminal, and a more detailed report will be written to a directory in the current working directory. You can open that report with the graphical user interface, possibly even on a different machine:</p>
<pre><code class="language-julia">&#36; vtune-gui r000gh</code></pre>
<h2 id="instrumenting_the_application">Instrumenting the application</h2>
<p>The trace we just collected includes the time spent compiling our application, making it difficult to analyze what is happening. To refine the trace, we can instrument our application with Intel&#39;s Instrumentation and Tracing Technology &#40;ITT&#41; APIs:</p>
<ul>
<li><p>only start the profiler when we&#39;re running code of interest;</p>
</li>
<li><p>add markers to the trace to indicate what is happening.</p>
</li>
</ul>
<p>We can interface with the ITT APIs using the <a href="https://github.com/JuliaPerf/IntelITT.jl">IntelITT.jl</a> package. Let&#39;s update our example:</p>
<pre><code class="language-julia">using oneAPI, IntelITT

# same as before

function main&#40;N&#61;256&#41;
    a &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    b &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    c &#61; IntelITT.@task &quot;vadd&quot; oneAPI.@sync vadd&#40;a, b&#41;
end

# warm-up
main&#40;&#41;

# actual profile
IntelITT.@collect main&#40;&#41;</code></pre>
<p>Here, the <code>IntelITT.@collect</code> macro will start and stop the collection, so we should launch VTune with the <code>-start-paused</code> option:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-offload -start-paused julia vadd.jl</code></pre>
<p>In the GUI, we can now clearly see a nicely packed stream of API calls, grouped under the <code>vadd</code> task we added. Note that because API calls are asynchronous, i.e., they return immediately, before the GPU has executed them, we grouped them under a <code>oneAPI.@sync</code> call so that the task captures not only the time spent on the CPU, but also the time spent on the GPU. This may not be what you want for your application.</p>
<p><img src="vtune_timeline.png" alt="VTune timeline" /></p>
<h2 id="kernel_details">Kernel details</h2>
<p>The timeline view is great for getting an application-level overview of what is happening, but once you&#39;ve isolated a kernel that doesn&#39;t perform as expected, you may want to switch from the GPU Offload to the GPU Compute Hotspots analysis. Here, you get a more detailed view of what&#39;s happening during execution on the GPU, including the memory bandwidth and execution properties:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-hotspots -start-paused julia vadd.jl</code></pre>
<p><img src="vtune_gpu_hotspots.png" alt="VTune timeline" /></p>
<p>Many of these analyses can be configured to collect more or less data, at the cost of correspondingly more or less overhead.</p>
<h2 id="working_remotely">Working remotely</h2>
<p>In many cases, your local system will not have a GPU, and you will want to profile an application running on a remote system. As shown above, you can use the <code>vtune</code> CLI to create a trace and open it locally using <code>vtune-gui</code>; however, there is an easier way: the <code>vtune-backend</code> daemon.</p>
<p>Start by launching the VTune back-end on the remote system:</p>
<pre><code class="language-julia">&#36; vtune-backend --enable-server-profiling --web-port 8443 --log-to-console</code></pre>
<p>If your remote system is directly reachable, you can add <code>--allow-remote-access --base-url &quot;https://remoteServer:8443&quot;</code>. Most people, however, will need to set up an SSH tunnel:</p>
<pre><code class="language-julia">&#36; ssh -L 8443:localhost:8443 remoteServer</code></pre>
<p>You can now access the VTune GUI at <code>https://localhost:8443/</code>. Note that the first time you connect, you will need to do so using the one-time URL that is shown in the terminal where you launched the <code>vtune-backend</code> daemon.</p>
<p>The web interface that <code>vtune-backend</code> provides is identical to the GUI from <code>vtune-gui</code>: Start by creating a new project, and configuring an analysis: Select the local VTune profile server, enter the path to the Julia executable along with arguments and a working directory, and select the GPU Offload analysis type:</p>
<p><img src="vtune_webui.png" alt="VTune WebUI" /></p>
<p>To start the analysis, click the big blue play button. If you use <code>IntelITT.@collect</code> to restrict the trace to the code of interest, use the second button with the pause symbol.</p>
<h2 id="give_it_a_try">Give it a try&#33;</h2>
<p>Hopefully, this guide has shed some light on how to accurately profile oneAPI.jl applications using Intel&#39;s VTune Profiler. It turns out that one package could significantly benefit from some rigorous profiling: oneAPI.jl&#33; Until now, development has focused on correctness and usability, leaving considerable room for performance enhancements.</p>
<p>If you have access to an Intel GPU and want to gain experience profiling GPU applications with VTune, we encourage you to get involved&#33; A good starting point would be analyzing some of oneAPI.jl&#39;s array operations like <code>mapreduce</code> or <code>broadcast</code> to identify potential bottlenecks. For more information or any queries, feel free to open an issue on GitHub, or join the discussion on Slack or Discourse. Your help could make a significant difference&#33;</p>
]]></content:encoded>
    
  <pubDate>Wed, 19 Jul 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 0.2: Metal Performance Shaders]]></title>
  <link>https://juliagpu.org/post/2023-03-03-metal_0.2/index.html</link>
  <guid>https://juliagpu.org/2023-03-03-metal_0.2/</guid>
  <description><![CDATA[Metal.jl 0.2 marks a significant milestone in the development of the Metal.jl package. The release comes with initial support for the Metal Performance Shaders &#40;MPS&#41; framework for accelerating common operations like matrix multiplications, as well as various improvements for writing Metal kernels in Julia.]]></description>  
  
  <content:encoded><![CDATA[
<p>Metal.jl 0.2 marks a significant milestone in the development of the Metal.jl package. The release comes with initial support for the Metal Performance Shaders &#40;MPS&#41; framework for accelerating common operations like matrix multiplications, as well as various improvements for writing Metal kernels in Julia.</p>
<h2 id="metal_performance_shaders">Metal Performance Shaders</h2>
<p>Quoting the <a href="https://developer.apple.com/documentation/metalperformanceshaders">Apple documentation</a>: &quot;The Metal Performance Shaders &#40;MPS&#41; framework contains a collection of highly optimized compute and graphics shaders for use in Metal applications.&quot; With Metal.jl 0.2, we have added initial support for this framework, and used it to accelerate the matrix multiplication operation:</p>
<pre><code class="language-julia-repl">julia&gt; using Metal, LinearAlgebra, BenchmarkTools
julia&gt; n &#61; p &#61; m &#61; 2048
julia&gt; flops &#61; n*m*&#40;2p-1&#41;
17175674880julia&gt; a &#61; MtlArray&#40;rand&#40;Float32, n, p&#41;&#41;;
julia&gt; b &#61; MtlArray&#40;rand&#40;Float32, p, m&#41;&#41;;
julia&gt; c &#61; MtlArray&#40;zeros&#40;Float32, n, m&#41;&#41;;julia&gt; using LinearAlgebra
julia&gt; bench &#61; @benchmark Metal.@sync mul&#33;&#40;c, a, b&#41;
BenchmarkTools.Trial: 518 samples with 1 evaluation.
 Range &#40;min … max&#41;:  9.366 ms …  13.354 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 0.00&#37;
 Time  &#40;median&#41;:     9.629 ms               ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   9.646 ms ± 192.169 μs  ┊ GC &#40;mean ± σ&#41;:  0.00&#37; ± 0.00&#37;               ▃▂▅▅▆▆▆▇█▇▇▆▅▄▄▁▁ ▁
  ▄▁▄▄▄▄▆▆▆▄▄▁▇█████████████████▄█▄▁▆▁▄▁▆▁▇▁▄▄▁▁▄▄▇▁▄▆▄▁▁▁▁▁▄ █
  9.37 ms      Histogram: log&#40;frequency&#41; by time      10.1 ms &lt; Memory estimate: 352 bytes, allocs estimate: 12.julia&gt; flops / &#40;minimum&#40;bench.times&#41;/1e9&#41;
1.83e12</code></pre>
<p>The benchmark above shows that, on an 8-core M1 Pro, matrix multiplication now reaches 1.8 TFLOPS &#40;out of the 2.6 TFLOPS of theoretical peak performance&#41;. The accelerated matrix multiplication is available for a variety of input types, including mixed-mode operations, and as shown above is integrated with the LinearAlgebra.jl <code>mul&#33;</code> interface.</p>
<p>Of course, the MPS framework offers more than just matrix multiplication, and we expect to support more of it in the future. If you have a specific operation you would like to use from Julia, please let us know by opening an issue on the Metal.jl repository.</p>
<h2 id="gpu_profiling_support">GPU profiling support</h2>
<p>To support the development of Metal kernels, <a href="https://github.com/max-Hawkins">Max Hawkins</a> has added support for GPU profiling. Similar to how this works in CUDA.jl, you can run code under the <code>Metal.@profile</code> macro to record its execution. However, this does first require setting the <code>METAL_CAPTURE_ENABLED</code> environment variable <em>before</em> importing Metal.jl:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;METAL_CAPTURE_ENABLED&quot;&#93; &#61; 1julia&gt; using Metaljulia&gt; a &#61; mtl&#40;rand&#40;1024, 1024&#41;&#41;
julia&gt; Metal.@profile sum&#40;a&#41;
&#91; Info: GPU frame capture saved to jl_metal.gputrace/</code></pre>
<p>The resulting capture can be opened with Xcode, presenting a timeline that&#39;s similar to other profilers:</p>
<figure>
  <img src="https://juliagpu.org/post/2023-03-03-metal_0.2/xcode.png" alt="XCode viewing a Metal.jl capture trace">
</figure><h2 id="other_improvements">Other improvements</h2>
<ul>
<li><p>Julia 1.9 is supported, but requires an up-to-date macOS version &#40;issues have been encountered on macOS 12.4&#41;;</p>
</li>
<li><p>An <code>mtl</code> function has been added for converting Julia arrays to Metal arrays, similar to the <code>cu</code> function in CUDA.jl;</p>
</li>
<li><p>Multiple GPUs are supported, and the <code>device&#33;</code> function can be used to select one;</p>
</li>
<li><p>Coverage for SIMD Group functions has been improved, so it is now possible to use <code>simdgroup_load</code>, <code>simdgroup_store</code>, <code>simdgroup_multiply</code>, and <code>simdgroup_multiply_accumulate</code> in kernel functions.</p>
</li>
</ul>
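<p>As a quick illustration of the new <code>mtl</code> conversion function &#40;a sketch, assuming a Metal-capable system&#41;:</p>
<pre><code class="language-julia">using Metal

# Upload a Julia array to the GPU &#40;similar to &#96;cu&#96; in CUDA.jl&#41;:
a &#61; mtl&#40;rand&#40;Float32, 1024&#41;&#41;

# Array operations now execute on the GPU:
b &#61; a .* 2f0

# Copy the result back to the CPU:
Array&#40;b&#41;</code></pre>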
<h2 id="future_work">Future work</h2>
<p>Although Metal.jl is now usable for a variety of applications, there is still work to be done before it can be considered production-ready. In particular:</p>
<ul>
<li><p>there are known performance issues with <code>mapreduce</code>, and other operations that rely on <code>CartesianIndices</code>;</p>
</li>
<li><p>the <code>libcmt</code> wrapper library for interfacing with the Metal APIs is cumbersome to use and improve, and we are looking into native ObjectiveC FFI instead;</p>
</li>
<li><p>the MPS wrappers are incomplete, and, like the Metal APIs, require a replacement for <code>libcmt</code> before they can be improved;</p>
</li>
<li><p>support for atomic operations is missing, which is required to implement a full-featured KernelAbstractions.jl back-end.</p>
</li>
</ul>
<p>Once &#40;most of&#41; these issues are addressed, we should be able to release Metal.jl 1.0.</p>
]]></content:encoded>
    
  <pubDate>Fri, 03 Mar 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9]]></title>
  <link>https://juliagpu.org/post/2023-02-08-oneapi_1.0/index.html</link>
  <guid>https://juliagpu.org/2023-02-08-oneapi_1.0/</guid>
  <description><![CDATA[The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library &#40;oneMKL&#41; to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.]]></description>  
  
  <content:encoded><![CDATA[
<p>The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library &#40;oneMKL&#41; to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.</p>
<h2 id="onemkl_integration">oneMKL integration</h2>
<p>oneAPI.jl now uses the Intel oneAPI Math Kernel Library &#40;oneMKL&#41;, automatically downloaded as part of <code>oneAPI_Support_jll.jl</code>, to accelerate a great number of BLAS and LAPACK operations on Intel GPUs. Similar to how it is implemented in our other GPU back-ends, these wrappers are available at different levels of abstraction.</p>
<p>At the lowest level, we use a C library that wraps the oneMKL C&#43;&#43; APIs. For example, the <code>oneapi::mkl::blas::column_major::gemm</code> function for matrix-matrix multiplication is wrapped by the C functions <code>onemklSgemm</code>, <code>onemklDgemm</code>, etc. These wrappers are used to implement low-level methods like <code>oneMKL.gemm&#33;</code>:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPIjulia&gt; A &#61; oneArray&#40;rand&#40;Float32, 2, 3&#41;&#41;;
2×3 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119
julia&gt; B &#61; oneArray&#40;rand&#40;Float32, 3, 4&#41;&#41;
3×4 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686
julia&gt; C &#61; similar&#40;B, &#40;2, 4&#41;&#41;;julia&gt; oneMKL.gemm&#33;&#40;&#39;N&#39;, &#39;N&#39;, true, A, B, true, C&#41;
2×4 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607julia&gt; Array&#40;C&#41; ≈ Array&#40;A&#41; * Array&#40;B&#41;
true</code></pre>
<p>Of course, these low-level functions aren&#39;t very user-friendly, so we also integrate with Julia&#39;s standard libraries where possible:</p>
<pre><code class="language-julia-repl">julia&gt; A &#61; oneArray&#40;rand&#40;Float32, 2, 3&#41;&#41;;
julia&gt; B &#61; oneArray&#40;rand&#40;Float32, 3, 4&#41;&#41;;julia&gt; using LinearAlgebra
julia&gt; C &#61; A * B;julia&gt; Array&#40;C&#41; ≈ Array&#40;A&#41; * Array&#40;B&#41;
true</code></pre>
<p>The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia’s standard linear algebra libraries. If you run into a missing function, please file a request to add it, or take a look at the source and contribute to oneAPI.jl&#33; The current state of the wrappers should make it easy to extend their functionality, as well as form a good basis for integrating with other libraries like oneDNN.</p>
<h2 id="intel_arc_support">Intel Arc support</h2>
<p>The new Arc series of discrete Intel GPUs are now fully supported by oneAPI.jl. These GPUs offer a significant performance improvement over their integrated predecessors:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPI
julia&gt; oneAPI.versioninfo&#40;&#41;
1 device:
- Intel&#40;R&#41; Arc&#40;TM&#41; A770 Graphics &#91;0x56a0&#93;julia&gt; T &#61; Float32;
julia&gt; n &#61; p &#61; m &#61; 2048;
julia&gt; a &#61; oneArray&#40;rand&#40;T, n, p&#41;&#41;;
julia&gt; b &#61; oneArray&#40;rand&#40;T, p, m&#41;&#41;;
julia&gt; c &#61; oneArray&#40;zeros&#40;T, n, m&#41;&#41;;julia&gt; using BenchmarkTools, LinearAlgebra
julia&gt; bench &#61; @benchmark oneAPI.@sync mul&#33;&#40;c, a, b&#41;
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range &#40;min … max&#41;:  3.233 ms …  3.791 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 0.00&#37;
 Time  &#40;median&#41;:     3.298 ms              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   3.308 ms ± 48.426 μs  ┊ GC &#40;mean ± σ&#41;:  0.00&#37; ± 0.00&#37;        ▁▃▄▇█▅▄▃▂   ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms &lt; Memory estimate: 272 bytes, allocs estimate: 11.julia&gt; flops &#61; n*m*&#40;2p-1&#41;
17175674880julia&gt; flops / &#40;minimum&#40;bench.times&#41;/1e9&#41;
5.3131281169900205e12</code></pre>
<p>For example, here we&#39;re getting over 5 TFLOPS of Float32 performance, which is over 10x faster than the Intel Xe Graphics G7 we had previously been using for oneAPI.jl development. At the same time, the A770 used above should be able to deliver close to 20 TFLOPS, so there&#39;s still room for improvement in our software stack.</p>
<p>To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing, that kernel is still in beta, so refer to your distribution&#39;s documentation for how to install it. For example, on Arch Linux you can use the <a href="https://aur.archlinux.org/packages/linux-mainline"><code>linux-mainline</code> package from the AUR</a>, Ubuntu has the <a href="https://wiki.ubuntu.com/Kernel/MainlineBuilds"><code>kernel-ppa</code> archive</a>, Fedora provides the <a href="https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories"><code>stable-rc</code> repository</a>, etc.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p>Support for Julia 1.9 has been added.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 4.0]]></title>
  <link>https://juliagpu.org/post/2023-02-01-cuda_4.0/index.html</link>
  <guid>https://juliagpu.org/2023-02-01-cuda_4.0/</guid>
  <description><![CDATA[CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime, and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime, and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.</p>
<h2 id="jlls_for_cuda_artifacts">JLLs for CUDA artifacts</h2>
<p>While CUDA.jl has been using binary artifacts for a while, it was manually managing installation and selection of them, i.e., not by using standardised JLL packages. This complicated use of the artifacts by other packages, and made it difficult to build other binary packages against the CUDA runtime.</p>
<p>With CUDA.jl 4.0, we now use JLLs to load the CUDA driver and runtime. Specifically, there are two JLLs in play: <code>CUDA_Driver_jll</code> and <code>CUDA_Runtime_jll</code>. The former is responsible for loading the CUDA driver library &#40;possibly upgrading it using a forward-compatible version&#41;, and determining the CUDA version that your set-up supports:</p>
<pre><code class="language-julia-repl">❯ JULIA_DEBUG&#61;CUDA_Driver_jll julia
julia&gt; using CUDA_Driver_jll
┌ System CUDA driver found at libcuda.so.1, detected as version 12.0.0
└ @ CUDA_Driver_jll
┌ System CUDA driver is recent enough; not using forward-compatible driver
└ @ CUDA_Driver_jll</code></pre>
<p>With the driver identified and loaded, <code>CUDA_Runtime_jll</code> can select a compatible toolkit. By default, it uses the latest supported toolkit that is compatible with the driver:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA_Runtime_jlljulia&gt; CUDA_Runtime_jll.cuda_toolkits
10-element Vector&#123;VersionNumber&#125;:
 v&quot;10.2.0&quot;
 v&quot;11.0.0&quot;
 v&quot;11.1.0&quot;
 v&quot;11.2.0&quot;
 v&quot;11.3.0&quot;
 v&quot;11.4.0&quot;
 v&quot;11.5.0&quot;
 v&quot;11.6.0&quot;
 v&quot;11.7.0&quot;
 v&quot;11.8.0&quot;

julia&gt; CUDA_Runtime_jll.host_platform
Linux x86_64 &#123;cuda&#61;11.8&#125;</code></pre>
<p>As you can see, the selected CUDA runtime is encoded in the host platform. This makes it possible for Julia to automatically select compatible versions of other binary packages. For example, if we install and load <code>SuiteSparse_GPU_jll</code>, which right now <a href="https://github.com/JuliaPackaging/Yggdrasil/blob/2f5a64d9f61d0f1b619367b03b5cecae979ed6d1/S/SuiteSparse/SuiteSparse_GPU/build_tarballs.jl#L104-L126">provides builds</a> for CUDA 10.2, 11.0 and 12.0, the artifact resolution code knows to load the build for CUDA 11.0 which is compatible with the selected CUDA 11.8 runtime:</p>
<pre><code class="language-julia">julia&gt; using SuiteSparse_GPU_jlljulia&gt; SuiteSparse_GPU_jll.best_wrapper
&quot;~/.julia/packages/SuiteSparse_GPU_jll/.../x86_64-linux-gnu-cuda&#43;11.0.jl&quot;</code></pre>
<p>The change to JLLs requires a breaking change: the <code>JULIA_CUDA_VERSION</code> and <code>JULIA_CUDA_USE_BINARYBUILDER</code> environment variables have been removed, and are replaced by preferences that are set in the current environment. For convenience, you can set these preferences by calling <code>CUDA.set_runtime_version&#33;</code>:</p>
<pre><code class="language-julia-repl">❯ julia --project
julia&gt; using CUDA
julia&gt; CUDA.runtime_version&#40;&#41;
v&quot;11.8.0&quot;

julia&gt; CUDA.set_runtime_version&#33;&#40;v&quot;11.7&quot;&#41;
┌ Set CUDA Runtime version preference to 11.7,
└ please re-start Julia for this to take effect.

❯ julia --project
julia&gt; using CUDA
julia&gt; CUDA.runtime_version&#40;&#41;
v&quot;11.7.0&quot;

julia&gt; using CUDA_Runtime_jll
julia&gt; CUDA_Runtime_jll.host_platform
Linux x86_64 &#123;cuda&#61;11.7&#125;</code></pre>
<p>The changed preference is reflected in the host platform, which means that you can use this mechanism to load different builds of other binary packages. For example, if you rely on a package or JLL that does not yet have a build for CUDA 12, you could set the preference to <code>v&quot;11.x&quot;</code> to load an available build.</p>
<p>For discovering a local runtime, you can set the version to <code>&quot;local&quot;</code>, which replaces the use of <code>CUDA_Runtime_jll</code> with <code>CUDA_Runtime_Discovery.jl</code>, an API-compatible package that implements a local runtime discovery mechanism:</p>
<pre><code class="language-julia-repl">❯ julia --project
julia&gt; CUDA.set_runtime_version&#33;&#40;&quot;local&quot;&#41;
┌ Set CUDA Runtime version preference to local,
└ please re-start Julia for this to take effect.

❯ JULIA_DEBUG&#61;CUDA_Runtime_Discovery julia --project
julia&gt; using CUDA
┌ Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA_Runtime_Discovery
┌ Looking for binary ptxas in /opt/cuda
│   all_locations &#61;
│    2-element Vector&#123;String&#125;:
│     &quot;/opt/cuda&quot;
│     &quot;/opt/cuda/bin&quot;
└ @ CUDA_Runtime_Discovery
┌ Debug: Found ptxas at /opt/cuda/bin/ptxas
└ @ CUDA_Runtime_Discovery
...</code></pre>
<h2 id="memory_limits">Memory limits</h2>
<p>By popular demand, support for memory limits has been reinstated. This functionality had been removed after the switch to CUDA memory pools, as the memory pool allocator does not yet support memory limits. Awaiting improvements by NVIDIA, we have added functionality to impose memory limits from the Julia side, in the form of two environment variables:</p>
<ul>
<li><p><code>JULIA_CUDA_SOFT_MEMORY_LIMIT</code>: This is an advisory limit, used to configure the memory pool, which will result in the pool being shrunk down to the requested limit at every synchronization point. That means that the pool may temporarily grow beyond the limit. This limit is unavailable when disabling memory pools &#40;with <code>JULIA_CUDA_MEMORY_POOL&#61;none</code>&#41;.</p>
</li>
<li><p><code>JULIA_CUDA_HARD_MEMORY_LIMIT</code>: This is a hard limit, checked before every allocation. Doing so is relatively expensive, so it is recommended to use the soft limit instead.</p>
</li>
</ul>
<p>The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: <code>100M</code>, <code>50&#37;</code>, <code>1.5GiB</code>, <code>10000</code>.</p>
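<p>For example, a hard limit could be imposed as follows &#40;a sketch: the variable needs to be set before CUDA.jl is initialized, e.g. in the shell or at the very top of a script&#41;:</p>
<pre><code class="language-julia"># cap GPU allocations at 2 GiB; must happen before CUDA.jl initializes
ENV&#91;&quot;JULIA_CUDA_HARD_MEMORY_LIMIT&quot;&#93; &#61; &quot;2GiB&quot;

using CUDA

# a request exceeding the limit should now throw an out-of-memory
# error, even if the device itself has more free memory available
CuArray&#123;Float32&#125;&#40;undef, &#40;1024, 1024, 1024&#41;&#41;  # a 4 GiB request</code></pre>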
<h2 id="cusparse_improvements">CUSPARSE improvements</h2>
<p>Thanks to the work of <a href="https://github.com/amontoison">@amontoison</a>, the CUSPARSE interface has undergone many improvements:</p>
<ul>
<li><p>Better support of the <code>CuSparseMatrixCOO</code> format with, in particular, the addition of <code>CuSparseMatrixCOO * CuVector</code> and <code>CuSparseMatrixCOO * CuMatrix</code> products;</p>
</li>
<li><p>Routines specialized for <code>-</code>, <code>&#43;</code>, <code>*</code> operations between sparse matrices &#40;<code>CuSparseMatrixCOO</code>, <code>CuSparseMatrixCSC</code> and <code>CuSparseMatrixCSR</code>&#41; have been interfaced;</p>
</li>
<li><p>New generic routines for backward and forward sweeps with sparse triangular matrices are now used by <code>\</code>;</p>
</li>
<li><p><code>CuMatrix * CuSparseVector</code> and <code>CuMatrix * CuSparseMatrix</code> products have been added;</p>
</li>
<li><p>Conversions between sparse and dense matrices have been updated to use more recent, optimized routines;</p>
</li>
<li><p>High-level Julia functions have been added for the new set of sparse BLAS 1 routines, such as dot products between <code>CuSparseVector</code>s;</p>
</li>
<li><p>Missing dispatches for the <code>mul&#33;</code> and <code>ldiv&#33;</code> functions have been added;</p>
</li>
<li><p>Almost all new CUSPARSE routines added by the CUDA 11.x toolkits have been interfaced.</p>
</li>
</ul>
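<p>As a small illustration of these improvements, the following sketch multiplies a COO matrix with a dense vector, a product that now dispatches to a specialized CUSPARSE routine &#40;assuming a CUDA-capable system; the constructor call converting a CPU-side sparse matrix to the GPU is as we understand the current API&#41;:</p>
<pre><code class="language-julia">using CUDA, SparseArrays
using CUDA.CUSPARSE

A &#61; CuSparseMatrixCOO&#40;sprand&#40;Float32, 100, 100, 0.1&#41;&#41;  # sparse GPU matrix
x &#61; CUDA.rand&#40;Float32, 100&#41;                             # dense GPU vector

y &#61; A * x  # CuSparseMatrixCOO * CuVector product</code></pre>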
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p>Removal of the CUDNN, CUTENSOR, CUTENSORNET and CUSTATEVEC submodules: These have been moved into their own packages, respectively cuDNN.jl, cuTENSOR.jl, cuTensorNet.jl and cuStateVec.jl &#40;note the change in capitalization, now following NVIDIA&#39;s naming scheme&#41;;</p>
</li>
<li><p>Removal of the NVTX submodule: NVTX.jl should be used instead, which is a more complete implementation of the NVTX API;</p>
</li>
<li><p>Support for CUDA 11.8 &#40;support for CUDA 12.0 is being worked on&#41;;</p>
</li>
<li><p>Support for Julia 1.9.</p>
</li>
</ul>
<h2 id="backport_releases">Backport releases</h2>
<p>Because CUDA.jl 4.0 is a breaking release, two additional releases have been made that backport bugfixes and select features:</p>
<ul>
<li><p>CUDA.jl 3.12.1 and 3.12.2: backports of bugfixes since 3.12</p>
</li>
<li><p>CUDA.jl 3.13.0: additionally adding the memory limit functionality</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 01 Feb 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Technical preview: Programming Apple M1 GPUs in Julia with Metal.jl]]></title>
  <link>https://juliagpu.org/post/2022-06-24-metal/index.html</link>
  <guid>https://juliagpu.org/2022-06-24-metal/</guid>
  <description><![CDATA[Julia has gained a new GPU back-end: Metal.jl, for working with Apple&#39;s M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who&#39;s already programmed GPUs in Julia. In the following post I&#39;ll demonstrate some of that functionality and explain how it works.]]></description>
  
  <content:encoded><![CDATA[
<p>Julia has gained a new GPU back-end: Metal.jl, for working with Apple&#39;s M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who&#39;s already programmed GPUs in Julia. In the following post I&#39;ll demonstrate some of that functionality and explain how it works.</p>
<p>But first, note that <strong><a href="https://github.com/JuliaGPU/Metal.jl">Metal.jl</a> is under heavy development</strong>: The package is considered experimental for now, as we&#39;re still working on squashing bugs and adding essential functionality. We also haven&#39;t optimized for performance yet. If you&#39;re interested in using Metal.jl, please consider contributing to its development&#33; Most of the package is written in Julia, and checking out the source code is a single <code>Pkg.develop</code> away :-&#41;</p>
<h2 id="quick_start">Quick start</h2>
<p>Start by getting a hold of the upcoming <a href="https://julialang.org/downloads/#upcoming_release">Julia 1.8</a>, launch it, and enter the package manager by pressing <code>&#93;</code>:</p>
<pre><code class="language-text">julia&gt; &#93;

pkg&gt; add Metal
  Installed Metal</code></pre>
<p>Installation is as easy as that, and we&#39;ll automatically download the necessary binary artifacts &#40;a C wrapper for the Metal APIs, and an LLVM back-end&#41;. Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the <code>versioninfo&#40;&#41;</code> method for some details on the toolchain:</p>
<pre><code class="language-text">julia&gt; using Metal

julia&gt; Metal.versioninfo&#40;&#41;
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro &#40;64.000 KiB allocated&#41;</code></pre>
<p>And there we go&#33; You&#39;ll note here that I&#39;m using the upcoming macOS 13 &#40;Ventura&#41;; this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren&#39;t technical limitations. In fact, Metal.jl <em>does</em> work on e.g. macOS Monterey with an Intel GPU, but it&#39;s an untested combination that may suffer from bugs.</p>
<h2 id="array_programming">Array programming</h2>
<p>Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the <code>MtlArray</code> type that can be used to manage memory and perform GPU computations:</p>
<pre><code class="language-julia"># allocate &#43; initialize
julia&gt; a &#61; MtlArray&#40;rand&#40;Float32, 2, 2&#41;&#41;
2×2 MtlArray&#123;Float32, 2&#125;:
 0.158752  0.836366
 0.535798  0.153554

# perform some GPU-accelerated operations
julia&gt; b &#61; a * a
2×2 MtlArray&#123;Float32, 2&#125;:
 0.473325  0.261202
 0.167333  0.471702

# back to the CPU
julia&gt; Array&#40;b&#41;
2×2 Matrix&#123;Float32&#125;:
 0.473325  0.261202
 0.167333  0.471702</code></pre>
<p>Beyond these simple operations, Julia&#39;s higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:</p>
<pre><code class="language-julia">julia&gt; mapreduce&#40;sin, &#43;, a; dims&#61;1&#41;
1×2 MtlArray&#123;Float32, 2&#125;:
 1.15276  0.584146

julia&gt; cos.&#40;a .&#43; 2&#41; .* 3
2×2 MtlArray&#123;Float32, 2&#125;:
 -2.0472   -1.25332
 -2.96594  -2.60351</code></pre>
<p>Much of this functionality comes from the <a href="https://github.com/JuliaGPU/GPUArrays.jl/">GPUArrays.jl</a> package, which provides vendor-neutral implementations of common array operations. As a result, <code>MtlArray</code> is already pretty capable, and should be usable with realistic array-based applications.</p>
<h2 id="kernel_programming">Kernel programming</h2>
<p>Metal.jl&#39;s array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:</p>
<pre><code class="language-julia"># a simple kernel that sets elements of an array to a value
function memset_kernel&#40;array, value&#41;
  i &#61; thread_position_in_grid_1d&#40;&#41;
  if i &lt;&#61; length&#40;array&#41;
    @inbounds array&#91;i&#93; &#61; value
  end
  return
end

a &#61; MtlArray&#123;Float32&#125;&#40;undef, 512&#41;
@metal threads&#61;512 grid&#61;2 memset_kernel&#40;a, 42&#41;

# verify
@assert all&#40;isequal&#40;42&#41;, Array&#40;a&#41;&#41;</code></pre>
<p>As can be seen here, we&#39;ve opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that&#39;s similar to Julia&#39;s existing back-ends. Some key differences:</p>
<ul>
<li><p>we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, ...;</p>
</li>
<li><p>all types of arguments &#40;buffers, indirect buffers, value-typed inputs&#41; are transparently converted to a GPU-compatible structure<sup id="fnref:1">[1]</sup>;</p>
</li>
<li><p>global &#40;task-bound&#41; state is used to keep track of the active device and a queue;</p>
</li>
<li><p>compute pipeline set-up and command encoding is hidden behind a single macro.</p>
</li>
</ul>
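<p>These intrinsics generalize to multiple dimensions. As a hedged sketch &#40;<code>thread_position_in_grid_2d</code> and the tuple-valued <code>threads</code> argument are assumed by analogy with the 1D example above&#41;, an element-wise addition over a 2D grid could look like:</p>
<pre><code class="language-julia"># element-wise addition using a 2D grid of threads
function add_kernel&#40;c, a, b&#41;
  i, j &#61; thread_position_in_grid_2d&#40;&#41;
  if i &lt;&#61; size&#40;c, 1&#41; &amp;&amp; j &lt;&#61; size&#40;c, 2&#41;
    @inbounds c&#91;i, j&#93; &#61; a&#91;i, j&#93; &#43; b&#91;i, j&#93;
  end
  return
end

a &#61; MtlArray&#40;rand&#40;Float32, 8, 8&#41;&#41;
b &#61; MtlArray&#40;rand&#40;Float32, 8, 8&#41;&#41;
c &#61; similar&#40;a&#41;
@metal threads&#61;&#40;8, 8&#41; add_kernel&#40;c, a, b&#41;</code></pre>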
<p>Behind the scenes, we compile Julia to LLVM IR and use a <a href="https://github.com/JuliaGPU/llvm-metal">tiny LLVM back-end</a> &#40;based on <a href="https://github.com/a2flo">@a2flo</a>&#39;s <a href="https://github.com/a2flo/floor">libfloor</a>&#41; that &#40;re&#41;writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using <code>@device_code_metal</code>:</p>
<pre><code class="language-julia">julia&gt; @device_code_metal @metal threads&#61;512 grid&#61;2 memset_kernel&#40;a, 42&#41;</code></pre>
<pre><code class="language-text">&#91;header&#93;
program_count: 1
...

&#91;program&#93;
name: julia_memset_kernel
type: kernel
...</code></pre>
<pre><code class="language-llvm">target datalayout &#61; &quot;...&quot;
target triple &#61; &quot;air64-apple-macosx13.0.0&quot;

; the &#40;rewritten&#41; kernel function:
;  - &#37;value argument passed by reference
;  - &#37;thread_position_in_grid argument added
;  - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel&#40;
    &#123; i8 addrspace&#40;1&#41;*, &#91;1 x i64&#93; &#125; addrspace&#40;1&#41;* &#37;array,
    i64 addrspace&#40;1&#41;* &#37;value,
    i32 &#37;thread_position_in_grid&#41; &#123;
  ...
  &#37;9 &#61; tail call float @air.convert.f.f32.s.i64&#40;i64 &#37;7&#41;
  ...
  ret void
&#125;

; minimal required argument metadata
&#33;air.kernel &#61; &#33;&#123;&#33;10&#125;
&#33;10 &#61; &#33;&#123;void &#40;&#123; i8 addrspace&#40;1&#41;*, &#91;1 x i64&#93; &#125; addrspace&#40;1&#41;*,
              i64 addrspace&#40;1&#41;*, i32&#41;* @julia_memset_kernel, &#33;11, &#33;12&#125;
&#33;12 &#61; &#33;&#123;&#33;13, &#33;14, &#33;15&#125;
&#33;13 &#61; &#33;&#123;i32 0, &#33;&quot;air.buffer&quot;, &#33;&quot;air.location_index&quot;, i32 0, i32 1,
       &#33;&quot;air.read_write&quot;, &#33;&quot;air.address_space&quot;, i32 1,
       &#33;&quot;air.arg_type_size&quot;, i32 16, &#33;&quot;air.arg_type_align_size&quot;, i32 8&#125;
&#33;14 &#61; &#33;&#123;i32 1, &#33;&quot;air.buffer&quot;, &#33;&quot;air.location_index&quot;, i32 1, i32 1,
       &#33;&quot;air.read_write&quot;, &#33;&quot;air.address_space&quot;, i32 1,
       &#33;&quot;air.arg_type_size&quot;, i32 8, &#33;&quot;air.arg_type_align_size&quot;, i32 8&#125;
&#33;15 &#61; &#33;&#123;i32 0, &#33;&quot;air.thread_position_in_grid&quot;&#125;

; other metadata not shown, for brevity
<p>Shout-out to <a href="https://github.com/max-Hawkins">@max-Hawkins</a> for exploring Metal code generation during his internship at Julia Computing&#33;</p>
<h2 id="metal_apis_in_julia">Metal APIs in Julia</h2>
<p>Lacking an Objective-C or C&#43;&#43; FFI, we interface with the Metal libraries using <a href="https://github.com/recp/cmt">a shim C library</a>. Most users won&#39;t have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we&#39;ve designed for the Metal APIs:</p>
<pre><code class="language-julia">julia&gt; dev &#61; MtlDevice&#40;1&#41;
MtlDevice:
  name:             Apple M1 Pro
  lowpower:         false
  headless:         false
  removable:        false
  unified memory:   true

julia&gt; desc &#61; MtlHeapDescriptor&#40;&#41;
MtlHeapDescriptor:
  type:             MtHeapTypeAutomatic
  storageMode:      MtStorageModePrivate
  size:             0

julia&gt; desc.size &#61; 16384
16384

julia&gt; heap &#61; MtlHeap&#40;dev, desc&#41;
MtlHeap:
  type:                 MtHeapTypeAutomatic
  size:                 16384
  usedSize:             0
  currentAllocatedSize: 16384

# etc
<p>These wrappers are based on <a href="https://github.com/PhilipVinc">@PhilipVinc</a>&#39;s excellent work on MetalCore.jl, which formed the basis for &#40;and has been folded into&#41; Metal.jl.</p>
<h2 id="whats_next">What&#39;s next?</h2>
<p>The current release of Metal.jl focuses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:</p>
<ul>
<li><p>performance optimization</p>
</li>
<li><p>integration with Metal Performance Shaders</p>
</li>
<li><p>integration / documentation for use with Xcode tools</p>
</li>
<li><p>fleshing out the array abstraction based on user feedback</p>
</li>
</ul>
<p><strong>Please consider helping out with any of these&#33;</strong> Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you&#39;re not certain, or have any questions, please drop by the <code>#gpu</code> channel on <a href="https://julialang.org/slack/">the JuliaLang Slack</a>, ask questions on our <a href="https://discourse.julialang.org/c/domain/gpu/11">Discourse</a>, or chat to us during the <a href="https://julialang.org/community/#events">GPU office hours</a> every other Monday.</p>
<p>If you encounter any bugs, feel free to let us know on the <a href="https://github.com/JuliaGPU/Metal.jl/issues">Metal.jl issue tracker</a>. For information on upcoming releases, <a href="https://juliagpu.org/post/">subscribe</a> to this website&#39;s blog where we post about significant developments in Julia&#39;s GPU ecosystem.</p>
<hr />
<p><table class="fndef" id="fndef:1">
    <tr>
        <td class="fndef-backref">[1]</td>
        <td class="fndef-content">This relies on Metal 3 from macOS 13, which introduced bindless argument buffers, as we didn&#39;t fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.</td>
    </tr>
</table></p>
]]></content:encoded>
    
  <pubDate>Fri, 24 Jun 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl status update]]></title>
  <link>https://juliagpu.org/post/2022-04-06-oneapi_update/index.html</link>
  <guid>https://juliagpu.org/2022-04-06-oneapi_update/</guid>
  <description><![CDATA[It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs &#40;and other accelerators&#41; using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.]]></description>  
  
  <content:encoded><![CDATA[
<p>It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs &#40;and other accelerators&#41; using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.</p>
<h2 id="atomic_intrinsics"><code>@atomic</code> intrinsics</h2>
<p>oneAPI.jl <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/85">now supports</a> atomic operations, which are required to implement a variety of parallel algorithms. Low-level atomic functions &#40;<code>atomic_add&#33;</code>, <code>atomic_xchg&#33;</code>, etc&#41; are available as unexported methods in the oneAPI module:</p>
<pre><code class="language-julia">a &#61; oneArray&#40;Int32&#91;0&#93;&#41;

function kernel&#40;a&#41;
    oneAPI.atomic_add&#33;&#40;pointer&#40;a&#41;, Int32&#40;1&#41;&#41;
    return
end

@oneapi items&#61;256 kernel&#40;a&#41;
@test Array&#40;a&#41;&#91;1&#93; &#61;&#61; 256</code></pre>
<p>Note that these methods are only available for those types that are supported by the underlying OpenCL intrinsics. For example, the <code>atomic_add&#33;</code> from above can only be used with <code>Int32</code> and <code>UInt32</code> inputs.</p>
<p>Most users will instead rely on the higher-level <code>@atomic</code> macro, which can be easily put in front of many array operations to make them behave atomically. To avoid clashing with the new <code>@atomic</code> macro in Julia 1.7, this macro is also unexported:</p>
<pre><code class="language-julia">a &#61; oneArray&#40;Int32&#91;0&#93;&#41;

function kernel&#40;a&#41;
    oneAPI.@atomic a&#91;1&#93; &#43;&#61; Int32&#40;1&#41;
    return
end

@oneapi items&#61;256 kernel&#40;a&#41;
@test Array&#40;a&#41;&#91;1&#93; &#61;&#61; 256</code></pre>
<p>When used with operations that are supported by OpenCL, this macro will lower to calls like <code>atomic_add&#33;</code>. For other operations, a compare-and-exchange loop will be used. Note that for now, this is still restricted to 32-bit operations, as we do not support the <code>cl_khr_int64_base_atomics</code> extension for 64-bit atomics.</p>
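<p>To clarify the fallback, the compare-and-exchange loop is conceptually as follows &#40;illustrative pseudocode, not the actual implementation; <code>atomic_cmpxchg&#33;</code> stands in for the underlying OpenCL intrinsic&#41;:</p>
<pre><code class="language-julia"># atomically apply op to the value behind ptr, retrying on contention
function atomic_apply&#33;&#40;ptr, op, val&#41;
    old &#61; unsafe_load&#40;ptr&#41;
    while true
        new &#61; op&#40;old, val&#41;
        # cmpxchg returns the value that was actually stored at ptr
        seen &#61; oneAPI.atomic_cmpxchg&#33;&#40;ptr, old, new&#41;
        seen &#61;&#61; old &amp;&amp; return old  # success: no other thread raced us
        old &#61; seen                  # lost the race; retry with new value
    end
end</code></pre>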
<h2 id="initial_integration_with_vendor_libraries">Initial integration with vendor libraries</h2>
<p>One significant missing feature is the integration with vendor libraries like oneMKL. These integrations are required to ensure good performance for important operations like matrix multiplication, which currently fall back to generic implementations in Julia that may not always perform as well.</p>
<p>To improve this situation, we <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/97">are working on</a> a wrapper library that allows us to integrate with oneMKL and other oneAPI and SYCL libraries. Currently, only matrix multiplication is supported, but once the infrastructural issues are worked out we expect to quickly support many more operations.</p>
<p>If you need support for specific libraries, please have a look at this PR. As the API surface is significant, we will need help to extend the wrapper library and integrate it with high-level Julia libraries like LinearAlgebra.jl.</p>
<h2 id="correctness_issues">Correctness issues</h2>
<p>In porting existing Julia GPU applications to oneAPI.jl, we fixed several issues that caused correctness issues when executing code on Intel GPUs:</p>
<ul>
<li><p>when the garbage collector frees GPU memory, <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/157">it now blocks</a> until all outstanding commands &#40;which may include uses of said memory&#41; have completed</p>
</li>
<li><p>the <code>barrier</code> function to synchronize threads <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/162">is now</a> marked as <code>convergent</code> to avoid LLVM miscompilations</p>
</li>
</ul>
<p>Note that if you are using Tiger Lake hardware, there is currently a <a href="https://github.com/intel/compute-runtime/issues/522">known issue</a> in the back-end Intel compiler that affects oneAPI.jl, causing correctness issues that can be spotted by running the oneAPI.jl test suite.</p>
<h2 id="future_work">Future work</h2>
<p>To significantly improve the usability of oneAPI.jl, we will add support for it to the KernelAbstractions.jl package. This library is used by <a href="https://juliahub.com/ui/Packages/KernelAbstractions/aywHT/0.7.2?page&#61;2">many other packages</a> for adding GPU acceleration to algorithms that cannot be easily expressed using only array operations. As such, support for oneAPI.jl will make it possible to use your oneAPI GPUs with all of these packages.</p>
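<p>For reference, KernelAbstractions.jl kernels are vendor-agnostic, so once a oneAPI back-end exists, kernels like the following should run unchanged on Intel GPUs &#40;the kernel definition below uses the KernelAbstractions.jl API as of version 0.7; only the CPU launch is shown, since the oneAPI back-end does not exist yet&#41;:</p>
<pre><code class="language-julia">using KernelAbstractions

# a back-end-agnostic kernel: scale every element of an array
@kernel function scale&#33;&#40;a, s&#41;
    i &#61; @index&#40;Global&#41;
    @inbounds a&#91;i&#93; *&#61; s
end

# instantiate for a back-end; a future oneAPI device would slot in here
kernel &#61; scale&#33;&#40;CPU&#40;&#41;, 64&#41;
a &#61; rand&#40;Float32, 256&#41;
event &#61; kernel&#40;a, 2f0; ndrange&#61;length&#40;a&#41;&#41;
wait&#40;event&#41;</code></pre>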
]]></content:encoded>
    
  <pubDate>Wed, 06 Apr 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.5-3.8]]></title>
  <link>https://juliagpu.org/post/2022-01-28-cuda_3.5_3.8/index.html</link>
  <guid>https://juliagpu.org/2022-01-28-cuda_3.5_3.8/</guid>
  <description><![CDATA[CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.</p>
<h2 id="copies_between_devices">Copies between devices</h2>
<p>Typically, when sending data between devices you need to stage through the CPU. CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1284">now does this automatically</a>, making it possible to directly copy between <code>CuArray</code>s on different devices:</p>
<pre><code class="language-julia-repl">julia&gt; device&#33;&#40;0&#41;;julia&gt; a &#61; CUDA.rand&#40;2,2&#41;
2×2 CuArray&#123;Float32, 2, CUDA.Mem.DeviceBuffer&#125;:
 0.440147  0.986939
 0.622901  0.698119

julia&gt; device&#33;&#40;1&#41;;

julia&gt; b &#61; CUDA.zeros&#40;2,2&#41;;

julia&gt; copyto&#33;&#40;b, a&#41;
2×2 CuArray&#123;Float32, 2, CUDA.Mem.DeviceBuffer&#125;:
 0.440147  0.986939
 0.622901  0.698119</code></pre>
<p>When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can result in significant bandwidth and latency reductions. You can check if this mode of communication is possible:</p>
<pre><code class="language-julia-repl">julia&gt; src &#61; CuDevice&#40;0&#41;
CuDevice&#40;0&#41;: NVIDIA A100-PCIE-40GBjulia&gt; dst &#61; CuDevice&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100-PCIE-32GBjulia&gt; can_access_peer&#40;src, dst&#41;
false</code></pre>
<p>In this case, peer-to-peer communication is not possible because the devices have a different compute capability major revision number. With a compatible device, the function reports <code>true</code>:</p>
<pre><code class="language-julia">julia&gt; src &#61; CuDevice&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100-PCIE-32GBjulia&gt; dst &#61; CuDevice&#40;2&#41;
CuDevice&#40;2&#41;: Tesla V100-PCIE-16GBjulia&gt; can_access_peer&#40;src, dst&#41;
true</code></pre>
<p>Thanks to <a href="https://github.com/kshyatt">@kshyatt</a> for help with this change&#33;</p>
<h2 id="helper_function_to_use_compute-sanitizer">Helper function to use <code>compute-sanitizer</code></h2>
<p>The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the <a href="https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html">compute sanitizer</a>. To make it easier to use this tool, CUDA.jl now ships the binary as part of its artifacts, and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1340">provides a helper function</a> to restart Julia under the <code>compute-sanitizer</code>. Let&#39;s demonstrate, and trigger a memory error to show what the compute sanitizer can detect:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDAjulia&gt; CUDA.run_compute_sanitizer&#40;&#41;
Re-starting your active Julia session...&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61; COMPUTE-SANITIZER
julia&gt; using CUDAjulia&gt; unsafe_wrap&#40;CuArray, pointer&#40;CuArray&#40;&#91;1&#93;&#41;&#41;, 2&#41; .&#61; 1
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61; Invalid __global__ write of size 8 bytes
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892&#40;CuKernelContext, CuDeviceArray&lt;Int64, &#40;int&#41;1, &#40;int&#41;1&gt;, Broadcasted&lt;void, Tuple&lt;OneTo&lt;Int64&gt;&gt;, _identity, Broadcasted&lt;Int64&gt;&gt;, Int64&#41;
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     by thread &#40;1,0,0&#41; in block &#40;0,0,0&#41;
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     Address 0xa64000008 is out of bounds
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes</code></pre>
<p>Other tools are available too, e.g. <code>racecheck</code> for detecting races or <code>synccheck</code> for finding synchronization issues. These tools can be selected using the <code>tool</code> keyword argument to <code>run_compute_sanitizer</code>.</p>
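<p>For example, launching the race detector instead would look something like this &#40;a sketch based on the keyword argument described above&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.run_compute_sanitizer&#40;tool&#61;&quot;racecheck&quot;&#41;
Re-starting your active Julia session...</code></pre>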
<h2 id="updated_binary_dependencies">Updated binary dependencies</h2>
<p>As is common with every release, CUDA.jl now supports newer versions of NVIDIA&#39;s tools and libraries:</p>
<ul>
<li><p>CUDA toolkit <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1256">11.5</a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1326">11.6</a></p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1328">CUDNN 8.3.2</a></p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1327">CUTENSOR 1.4.0</a></p>
</li>
</ul>
<p>The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like <code>compute-sanitizer</code> or <code>cuda-gdb</code>, and you need debug info &#40;the equivalent of <code>nvcc -G</code>&#41;, ensure CUDA.jl can use the latest version of the CUDA toolkit.</p>
<p>To make it easier to use the latest supported toolkit, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1182">now implements</a> CUDA&#39;s so-called <strong>Forward Compatibility mode</strong>: When your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware, refer to <a href="https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title">the NVIDIA documentation</a> for more details.</p>
<h2 id="preserving_array_indices">Preserving array indices</h2>
<p>Julia&#39;s integers are typically 64 bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are typically only 32 bits wide. CUDA.jl&#39;s device array type <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1153">now carefully preserves the type of indices</a> so that 32-bit indices aren&#39;t unnecessarily promoted to 64 bits. With some careful kernel programming &#40;note the use of <code>0x1</code> instead of <code>1</code> below&#41;, this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:</p>
<pre><code class="language-julia-repl">julia&gt; function memset&#40;arr, val&#41;
           i &#61; &#40;blockIdx&#40;&#41;.x-0x1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x
           @inbounds arr&#91;i&#93; &#61; val
           return
       end

julia&gt; CUDA.code_ptx&#40;memset, Tuple&#123;CuDeviceArray&#123;Float32,1,AS.Global&#125;,Float32&#125;&#41;
.func julia_memset&#40;.param .b64 arr, .param .b32 val&#41; &#123;
        .reg .f32       &#37;f&lt;2&gt;;
        .reg .b32       &#37;r&lt;5&gt;;
        .reg .b64       &#37;rd&lt;5&gt;;

        ld.param.u64    &#37;rd1, &#91;arr&#93;;
        ld.param.f32    &#37;f1, &#91;val&#93;;
        mov.u32         &#37;r1, &#37;ctaid.x;
        mov.u32         &#37;r2, &#37;ntid.x;
        mov.u32         &#37;r3, &#37;tid.x;
        mad.lo.s32      &#37;r4, &#37;r2, &#37;r1, &#37;r3;
        ld.u64          &#37;rd2, &#91;&#37;rd1&#93;;
        mul.wide.s32    &#37;rd3, &#37;r4, 4;
        add.s64         &#37;rd4, &#37;rd2, &#37;rd3;
        st.global.f32   &#91;&#37;rd4&#93;, &#37;f1;
        ret;
&#125;</code></pre>
<p>On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:</p>
<pre><code class="language-text">.func julia_memset&#40;.param .b64 arr, .param .b32 val&#41; &#123;
        .reg .f32       &#37;f&lt;2&gt;;
        .reg .b32       &#37;r&lt;5&gt;;
        .reg .b64       &#37;rd&lt;8&gt;;

        ld.param.u64    &#37;rd1, &#91;arr&#93;;
        ld.param.f32    &#37;f1, &#91;val&#93;;
        mov.u32         &#37;r1, &#37;ctaid.x;
        mov.u32         &#37;r2, &#37;ntid.x;
        mul.wide.u32    &#37;rd2, &#37;r2, &#37;r1;
        mov.u32         &#37;r3, &#37;tid.x;
        add.s32         &#37;r4, &#37;r3, 1;
        cvt.u64.u32     &#37;rd3, &#37;r4;
        ld.u64          &#37;rd4, &#91;&#37;rd1&#93;;
        add.s64         &#37;rd5, &#37;rd2, &#37;rd3;
        shl.b64         &#37;rd6, &#37;rd5, 2;
        add.s64         &#37;rd7, &#37;rd4, &#37;rd6;
        st.global.f32   &#91;&#37;rd7&#43;-4&#93;, &#37;f1;
        ret;
&#125;</code></pre>
<h2 id="more_aggressive_memory_management">More aggressive memory management</h2>
<p>Starting with CUDA.jl 3.8, the memory pool used to allocate <code>CuArray</code>s will be configured differently: The pool will now be <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1344">allowed to use all available GPU memory</a>, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and makes synchronization much cheaper.</p>
<p>This behavior can be observed by calling the <code>memory_status&#40;&#41;</code> function:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 13.57&#37; &#40;2.001 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;0 bytes reserved&#41;

julia&gt; a &#61; CuArray&#123;Float32&#125;&#40;undef, &#40;1024, 1024, 1024&#41;&#41;;
julia&gt; Base.format_bytes&#40;sizeof&#40;a&#41;&#41;
&quot;4.000 GiB&quot;

julia&gt; a &#61; nothing

julia&gt; GC.gc&#40;&#41;

julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 40.59&#37; &#40;5.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;4.000 GiB reserved&#41;</code></pre>
<p>So far nothing new. On previous versions of CUDA.jl however, any subsequent synchronization of the GPU &#40;e.g., by copying memory to the CPU&#41; would have resulted in a release of this reserved memory. This is not the case anymore:</p>
<pre><code class="language-julia-repl">julia&gt; synchronize&#40;&#41;julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 40.59&#37; &#40;5.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;4.000 GiB reserved&#41;</code></pre>
<p>If you still want to release this memory, you can call the <code>reclaim&#40;&#41;</code> function:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.reclaim&#40;&#41;julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 13.48&#37; &#40;1.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;0 bytes reserved&#41;</code></pre>
<p>With interactive Julia sessions, this function is called periodically so that the GPU&#39;s memory isn&#39;t held on to unnecessarily. Otherwise it shouldn&#39;t be necessary to call this function, as memory is freed automatically when it is needed.</p>
<h2 id="minor_changes_and_improvements">Minor changes and improvements</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1217">Bitonic sort</a> is now used instead of quicksort &#40;by <a href="https://github.com/xaellison">@xaellison</a>&#41;.</p>
</li>
<li><p><code>CuDeviceArray</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1303">now stores the length of the array</a>, greatly speeding up indexing with high-dimensional arrays.</p>
</li>
<li><p>Device intrinsics <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1305">cannot be called on the CPU anymore</a>, protecting against segfaults when something isn&#39;t dispatching correctly.</p>
</li>
<li><p>Support for Multi-Instance GPU &#40;MIG&#41; <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1199">has been improved</a>, providing the <code>parent_uuid</code> function to look up the UUID of the parent device.</p>
</li>
<li><p><code>randn</code> and <code>randexp</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1236">are now supported in kernel code</a>, which should help with initial support of Distributions.jl-based operations.</p>
</li>
</ul>
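<p>As a small, illustrative sketch of the in-kernel <code>randn</code> support mentioned above &#40;the kernel below is an assumption of ours, not taken from the linked PR&#41;:</p>
<pre><code class="language-julia">using CUDA

function gaussian_fill&#40;a&#41;
    i &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x - 1&#41; * blockDim&#40;&#41;.x
    if i &lt;&#61; length&#40;a&#41;
        a&#91;i&#93; &#61; randn&#40;&#41;   # now callable from device code
    end
    return
end

a &#61; CUDA.zeros&#40;Float32, 1024&#41;
@cuda threads&#61;256 blocks&#61;4 gaussian_fill&#40;a&#41;</code></pre>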
]]></content:encoded>
    
  <pubDate>Fri, 28 Jan 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.4]]></title>
  <link>https://juliagpu.org/post/2021-08-13-cuda_3.4/index.html</link>
  <guid>https://juliagpu.org/2021-08-13-cuda_3.4/</guid>
  <description><![CDATA[The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.]]></description>  
  
  <content:encoded><![CDATA[
<p>The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.</p>
<h2 id="streamlined_atomic_operations">Streamlined atomic operations</h2>
<p>In preparation for integrating with the new standard <code>@atomic</code> macro introduced in Julia 1.7, we have <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1059">streamlined the capabilities of atomic operations in CUDA.jl</a>. The API is now split into two levels: low-level <code>atomic_</code> methods for atomic functionality that&#39;s directly supported by the hardware, and a high-level <code>@atomic</code> macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fallback implementation makes it possible to use more complex operations that do not map onto a single hardware atomic:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1&#93;&#41;;julia&gt; function kernel&#40;a&#41;
         CUDA.@atomic a&#91;&#93; &lt;&lt;&#61; 1
         return
       endjulia&gt; @cuda threads&#61;16 kernel&#40;a&#41;julia&gt; a
1-element CuArray&#123;Int64, 1, CUDA.Mem.DeviceBuffer&#125;:
 65536julia&gt; 1&lt;&lt;16
65536</code></pre>
<p>The only requirement is that the types being used are supported by <code>CUDA.atomic_cas&#33;</code>. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.</p>
<p>Note that on Julia 1.7 and higher, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1097">no longer exports the <code>@atomic</code> macro</a> to avoid conflicts with the version in Base. It is therefore recommended to always fully qualify uses of the macro, i.e., use <code>CUDA.@atomic</code> as in the example above.</p>
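<p>For example, atomic division does not map onto a single hardware instruction, so the macro falls back to a compare-and-swap loop. The kernel below is a small sketch in the spirit of the example above, not from the release notes:</p>
<pre><code class="language-julia">using CUDA

function halve_kernel&#40;a&#41;
    CUDA.@atomic a&#91;1&#93; /&#61; 2f0   # no hardware atomic for division; lowered to a CAS loop
    return
end

a &#61; CuArray&#40;Float32&#91;1024&#93;&#41;
@cuda threads&#61;4 halve_kernel&#40;a&#41;
# four threads each halve the value: 1024 / 2^4 &#61;&#61; 64</code></pre>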
<h2 id="arrays_with_unified_memory">Arrays with unified memory</h2>
<p>You may have noticed that the <code>CuArray</code> type in the example above included an additional parameter, <code>Mem.DeviceBuffer</code>. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it&#39;s now possible to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1023">allocate arrays backed by unified buffers</a> that can be used on multiple devices:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;&#91;0&#93;; unified&#61;true&#41;
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 0julia&gt; a .&#43;&#61; 1
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1julia&gt; device&#33;&#40;1&#41;julia&gt; a .&#43;&#61; 1
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 2</code></pre>
<p>Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.</p>
<h2 id="new_default_random_number_generator">New default random number generator</h2>
<p>CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 the performance and quality of this generator were improved to the point that it could be used by applications. A couple of features were still missing, though, such as generating normally-distributed random numbers, or support for complex numbers. These features have been <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1082">added in CUDA.jl 3.3</a>, and the generator is now used as the default fallback when CURAND does not support the requested element types.</p>
<p>Both the performance and the quality of this generator are much better than those of the previous, GPUArrays.jl-based one:</p>
<pre><code class="language-julia-repl">julia&gt; using BenchmarkTools
julia&gt; cuda_rng &#61; CUDA.RNG&#40;&#41;;
julia&gt; gpuarrays_rng &#61; GPUArrays.default_rng&#40;CuArray&#41;;
julia&gt; a &#61; CUDA.zeros&#40;1024,1024&#41;;julia&gt; @benchmark CUDA.@sync rand&#33;&#40;&#36;cuda_rng, &#36;a&#41;
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range &#40;min … max&#41;:  17.040 μs …  2.430 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 99.04&#37;
 Time  &#40;median&#41;:     18.500 μs              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   20.604 μs ± 34.734 μs  ┊ GC &#40;mean ± σ&#41;:  1.17&#37; ±  0.99&#37;         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs &lt;julia&gt; @benchmark CUDA.@sync rand&#33;&#40;&#36;gpuarrays_rng, &#36;a&#41;
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range &#40;min … max&#41;:  72.489 μs …  2.790 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 98.44&#37;
 Time  &#40;median&#41;:     74.479 μs              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   81.211 μs ± 61.598 μs  ┊ GC &#40;mean ± σ&#41;:  0.67&#37; ±  1.40&#37;  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log&#40;frequency&#41; by time       443 μs &lt;</code></pre>
<pre><code class="language-julia-repl">julia&gt; using RNGTest
julia&gt; test_cuda_rng &#61; RNGTest.wrap&#40;cuda_rng, UInt32&#41;;
julia&gt; test_gpuarrays_rng &#61; RNGTest.wrap&#40;gpuarrays_rng, UInt32&#41;;julia&gt; RNGTest.smallcrushTestU01&#40;test_cuda_rng&#41;
 All tests were passedjulia&gt; RNGTest.smallcrushTestU01&#40;test_gpuarrays_rng&#41;
 The following tests gave p-values outside &#91;0.001, 0.9990&#93;:       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 &#40;eps  means a value &lt; 1.0e-300&#41;:</code></pre>
<h2 id="removal_of_old_memory_pools">Removal of old memory pools</h2>
<p>With the new stream-ordered allocator caching memory allocations at the CUDA library level, much of the need for Julia-managed memory pools has disappeared. To simplify the allocation code, we have <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1015">removed support for those Julia-managed memory pools</a> &#40;i.e., <code>binned</code>, <code>split</code> and <code>simple</code>&#41;. You can now only use the <code>cuda</code> memory pool, or use no pool at all by setting the <code>JULIA_CUDA_MEMORY_POOL</code> environment variable to <code>none</code>.</p>
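<p>For example, to run an application without any memory pool &#40;an illustrative session&#41;:</p>
<pre><code class="language-text">&#36; JULIA_CUDA_MEMORY_POOL&#61;none julia

julia&gt; using CUDA   # allocations now go directly to the CUDA driver</code></pre>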
<p>Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.</p>
<p>Also note that the new stream-ordered allocator has <a href="https://github.com/JuliaGPU/CUDA.jl/issues/1053">turned out to be incompatible with legacy cuIpc APIs</a> as used by OpenMPI. If that applies to you, consider disabling the memory pool, or reverting to CUDA.jl 3.3 if your application&#39;s allocation pattern benefits from a memory pool.</p>
<p>Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.</p>
<h2 id="device_capability-dependent_kernel_code">Device capability-dependent kernel code</h2>
<p>Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that can be used to branch on.</p>
<p>Currently, the device capability and PTX ISA version are embedded and made available using the <code>compute_capability</code> and <code>ptx_isa_version</code> functions, respectively. A simplified version number type, constructible using the <code>sv&quot;...&quot;</code> string macro, can be used to test against these properties. For example:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;a&#41;
           a&#91;&#93; &#61; compute_capability&#40;&#41; &gt;&#61; sv&quot;6.0&quot; ? 1 : 2
           return
       end
kernel &#40;generic function with 1 method&#41;julia&gt; CUDA.code_llvm&#40;kernel, Tuple&#123;CuDeviceVector&#123;Float32, AS.Global&#125;&#125;&#41;
define void @julia_kernel_1&#40;&#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0&#41; &#123;
top:
  &#37;1 &#61; bitcast &#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0 to float addrspace&#40;1&#41;**
  &#37;2 &#61; load float addrspace&#40;1&#41;*, float addrspace&#40;1&#41;** &#37;1, align 8
  store float 1.000000e&#43;00, float addrspace&#40;1&#41;* &#37;2, align 4
  ret void
&#125;julia&gt; capability&#40;device&#33;&#40;1&#41;&#41;
v&quot;3.5.0&quot;julia&gt; CUDA.code_llvm&#40;kernel, Tuple&#123;CuDeviceVector&#123;Float32, AS.Global&#125;&#125;&#41;
define void @julia_kernel_2&#40;&#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0&#41; &#123;
top:
  &#37;1 &#61; bitcast &#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0 to float addrspace&#40;1&#41;**
  &#37;2 &#61; load float addrspace&#40;1&#41;*, float addrspace&#40;1&#41;** &#37;1, align 8
  store float 2.000000e&#43;00, float addrspace&#40;1&#41;* &#37;2, align 4
  ret void
&#125;</code></pre>
<p>The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1084">Support for CUDA 11.4 Update 1</a></p>
</li>
<li><p>Improved thread safety <a href="https://github.com/JuliaGPU/CUDA.jl/pull/993">&#91;1&#93;</a> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1074">&#91;2&#93;</a></p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 13 Aug 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.3]]></title>
  <link>https://juliagpu.org/post/2021-06-10-cuda_3.3/index.html</link>
  <guid>https://juliagpu.org/2021-06-10-cuda_3.3/</guid>
  <description><![CDATA[There have been several releases of CUDA.jl in the past couple of months, with many bugfixes and many exciting new features to improve GPU programming in Julia: &lt;code&gt;CuArray&lt;/code&gt; now supports isbits Unions, CUDA.jl can emit debug info for use with NVIDIA tools, and changes to the compiler make it even easier to use the latest version of the CUDA toolkit.]]></description>  
  
  <content:encoded><![CDATA[
<p>There have been several releases of CUDA.jl in the past couple of months, with many bugfixes and many exciting new features to improve GPU programming in Julia: <code>CuArray</code> now supports isbits Unions, CUDA.jl can emit debug info for use with NVIDIA tools, and changes to the compiler make it even easier to use the latest version of the CUDA toolkit.</p>
<h2 id="cuarray_support_for_isbits_unions"><code>CuArray</code> support for isbits Unions</h2>
<p>Unions are a way to represent values of one type or another, e.g., a value that can be an integer or a floating point. If all possible element types of a Union are so-called bitstypes, which can be stored contiguously in memory, the Union of these types can be stored contiguously too. This kind of optimization is implemented by the Array type, which can store such &quot;isbits Unions&quot; inline, as opposed to storing a pointer to a heap-allocated box. For more details, refer to the <a href="https://docs.julialang.org/en/v1/devdocs/isbitsunionarrays/">Julia documentation</a>.</p>
<p>With CUDA.jl 3.3, the CuArray GPU array type now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/941">supports this optimization too</a>. That means you can safely allocate CuArrays with isbits union element types and perform GPU-accelerated operations on them:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1, nothing, 3&#93;&#41;
3-element CuArray&#123;Union&#123;Nothing, Int64&#125;, 1&#125;:
 1
  nothing
 3julia&gt; findfirst&#40;isnothing, a&#41;
2</code></pre>
<p>It is also safe to pass these CuArrays to a kernel and use unions there:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;a&#41;
         i &#61; threadIdx&#40;&#41;.x
         if a&#91;i&#93; &#33;&#61;&#61; nothing
           a&#91;i&#93; &#43;&#61; 1
         end
         return
       endjulia&gt; @cuda threads&#61;3 kernel&#40;a&#41;julia&gt; a
3-element CuArray&#123;Union&#123;Nothing, Int64&#125;, 1&#125;:
 2
  nothing
 4</code></pre>
<p>This feature is especially valuable to represent missing values, and is an important step towards GPU support for DataFrames.jl.</p>
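<p>For example, the same mechanism should make it possible to compute with <code>missing</code> values on the GPU. The snippet below is a small sketch of ours, not from the original post:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1.0, missing, 3.0&#93;&#41;;

julia&gt; count&#40;ismissing, a&#41;
1</code></pre>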
<h2 id="debug_and_location_information">Debug and location information</h2>
<p>Another noteworthy addition is the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/891">support for emitting debug and location information</a>. The debug level, set by passing <code>-g &lt;level&gt;</code> to the <code>julia</code> executable, determines how much info is emitted. The default of level 1 only enables location information instructions which should not impact performance. Passing <code>-g0</code> disables this, while passing <code>-g2</code> also enables the output of DWARF debug information and compiles in debug mode.</p>
<p>Location information is useful for a variety of reasons. Many tools, like the NVIDIA profilers, use it to correlate instructions to source code:</p>
<figure>
  <img src="https://juliagpu.org/post/2021-06-10-cuda_3.3/nvvp.png" alt="NVIDIA Visual Profiler with source-code location information">
</figure><p>Debug information can be used to debug compiled code using <code>cuda-gdb</code>:</p>
<pre><code class="language-julia">&#36; cuda-gdb --args julia -g2 examples/vadd.jl
&#40;cuda-gdb&#41; set cuda break_on_launch all
&#40;cuda-gdb&#41; run
&#91;Switching focus to CUDA kernel 0, grid 1, block &#40;0,0,0&#41;, thread &#40;0,0,0&#41;, device 0, sm 0, warp 0, lane 0&#93;
macro expansion &#40;&#41; at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
74                  Base.llvmcall&#40;&#40;&#36;ir,&#36;fn&#41;, &#36;rettyp, &#36;argtyp, &#36;&#40;args.args...&#41;&#41;&#40;cuda-gdb&#41; bt
#0  macro expansion &#40;&#41; at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
#1  macro expansion &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#2  _index &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#3  blockIdx_x &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:56
#4  blockIdx &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:76
#5  julia_vadd&lt;&lt;&lt;&#40;1,1,1&#41;,&#40;12,1,1&#41;&gt;&gt;&gt; &#40;a&#61;..., b&#61;..., c&#61;...&#41; at .julia/dev/CUDA/examples/vadd.jl:6&#40;cuda-gdb&#41; f 5
#5  julia_vadd&lt;&lt;&lt;&#40;1,1,1&#41;,&#40;12,1,1&#41;&gt;&gt;&gt; &#40;a&#61;..., b&#61;..., c&#61;...&#41; at .julia/dev/CUDA/examples/vadd.jl:6
6           i &#61; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x&#40;cuda-gdb&#41; l
1       using Test
2
3       using CUDA
4
5       function vadd&#40;a, b, c&#41;
6           i &#61; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x
7           c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
8           return
9       end
10</code></pre>
<h2 id="improved_cuda_compatibility_support">Improved CUDA compatibility support</h2>
<p>As always, new CUDA.jl releases come with updated support for the CUDA toolkit. CUDA.jl is now compatible with <a href="https://github.com/JuliaGPU/CUDA.jl/pull/858">CUDA 11.3</a>, as well as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/945">CUDA 11.3 Update 1</a>. Users don&#39;t have to do anything to update to these versions, as CUDA.jl will automatically select and download the latest supported version.</p>
<p>Of course, for CUDA.jl to use the latest versions of the CUDA toolkit, a sufficiently recent version of the NVIDIA driver is required. Before CUDA 11.0, the driver&#39;s CUDA compatibility was a strict lower bound, and every minor CUDA release required a driver update. CUDA 11.0 comes with an enhanced compatibility option that follows semantic versioning, e.g., CUDA 11.3 can be used on an NVIDIA driver that only supports up to CUDA 11.0. CUDA.jl now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/936">follows semantic versioning</a> when selecting a compatible toolkit, making it easier to use the latest version of the CUDA toolkit in Julia.</p>
<p>For those interested: Implementing semantic versioning required the CUDA.jl compiler to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/892">use <code>ptxas</code> instead of the driver&#39;s embedded JIT</a> to generate GPU machine code. At the same time, many parts of CUDA.jl still use the CUDA driver APIs, so it&#39;s always recommended to keep your NVIDIA driver up-to-date.</p>
<h2 id="high-level_graph_apis">High-level graph APIs</h2>
<p>To overcome the cost of launching kernels, CUDA makes it possible to build computational graphs, and execute those graphs with less overhead than the underlying operations. In CUDA.jl we provide easy access to the APIs <a href="https://github.com/JuliaGPU/CUDA.jl/pull/877">to record and execute</a> these graphs:</p>
<pre><code class="language-julia">A &#61; CUDA.zeros&#40;Int, 1&#41;# ensure the operation is compiled
A .&#43;&#61; 1# capture
graph &#61; capture&#40;&#41; do
    A .&#43;&#61; 1
end
@test Array&#40;A&#41; &#61;&#61; &#91;1&#93;   # didn&#39;t change anything# instantiate and launch
exec &#61; instantiate&#40;graph&#41;
CUDA.launch&#40;exec&#41;
@test Array&#40;A&#41; &#61;&#61; &#91;2&#93;# update and instantiate/launch again
graph′ &#61; capture&#40;&#41; do
    A .&#43;&#61; 2
end
update&#40;exec, graph′&#41;
CUDA.launch&#40;exec&#41;
@test Array&#40;A&#41; &#61;&#61; &#91;4&#93;</code></pre>
<p>This sequence of operations is common enough that we provide a high-level <code>@captured</code> macro that automatically records, updates, instantiates and launches the graph:</p>
<pre><code class="language-julia">A &#61; CUDA.zeros&#40;Int, 1&#41;for i in 1:2
    @captured A .&#43;&#61; 1
end
@test Array&#40;A&#41; &#61;&#61; &#91;2&#93;</code></pre>
<h2 id="minor_changes_and_features">Minor changes and features</h2>
<ul>
<li><p>CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/842">now supports</a> <code>@atomic</code> multiplication and division &#40;by @yuehhua&#41;</p>
</li>
<li><p>Several statistics functions <a href="https://github.com/JuliaGPU/CUDA.jl/pull/509">have been implemented</a> &#40;by @berquist&#41;</p>
</li>
<li><p>The device-side random number generator in <a href="https://github.com/JuliaGPU/CUDA.jl/pull/890">is now based on Philox2x</a>, greatly improving quality of randomness &#40;passing BigCrush&#41; while allowing calls to <code>rand&#40;&#41;</code> from divergent threads.</p>
</li>
<li><p>Dependent libraries like CUDNN and CUTENSOR <a href="https://github.com/JuliaGPU/CUDA.jl/pull/882">are now only downloaded and initialized</a> when they are used.</p>
</li>
<li><p>The <code>synchronize&#40;&#41;</code> function in <a href="https://github.com/JuliaGPU/CUDA.jl/pull/896">now first spins</a> before yielding and sleeping, to improve the latency of short-running operations.</p>
</li>
<li><p>Several additional operations are now supported on Float16 inputs, such as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/904">CUSPARSE and CUBLAS</a> operations, and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/871">various math intrinsics</a>.</p>
</li>
<li><p>Kepler support &#40;compute capability 3.5&#41; <a href="https://github.com/JuliaGPU/CUDA.jl/pull/923">has been reinstated</a> for the time being.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Thu, 10 Jun 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.0]]></title>
  <link>https://juliagpu.org/post/2021-04-09-cuda_3.0/index.html</link>
  <guid>https://juliagpu.org/2021-04-09-cuda_3.0/</guid>
  <description><![CDATA[CUDA.jl 3.0 is a significant, semi-breaking release that features greatly improved multi-tasking and multi-threading, support for CUDA 11.2 and its new memory allocator, compiler tooling for GPU method overrides, device-side random number generation and a completely revamped cuDNN interface.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 3.0 is a significant, semi-breaking release that features greatly improved multi-tasking and multi-threading, support for CUDA 11.2 and its new memory allocator, compiler tooling for GPU method overrides, device-side random number generation and a completely revamped cuDNN interface.</p>
<h2 id="improved_multi-tasking_and_multi-threading">Improved multi-tasking and multi-threading</h2>
<p>Before this release, CUDA operations were enqueued on a single global stream, and many of these operations &#40;like copying memory, or synchronizing execution&#41; were fully blocking. This posed difficulties when using multiple tasks to perform independent operations: blocking operations prevent all tasks from making progress, and sharing a single stream introduces unintended dependencies between otherwise independent operations. <strong>CUDA.jl now uses <a href="https://github.com/JuliaGPU/CUDA.jl/pull/662">private streams for each Julia task</a>, and avoids blocking operations where possible, enabling task-based concurrent execution.</strong> It is also possible to use different devices on each task, and there is experimental support for executing those tasks from different threads.</p>
<p>A <s>picture</s> snippet of code is worth a thousand words, so let&#39;s demonstrate using a computation that uses both a library function &#40;GEMM from CUBLAS&#41; and a native Julia broadcast kernel:</p>
<pre><code class="language-julia">using CUDA, LinearAlgebra

function compute&#40;a,b,c&#41;
    mul&#33;&#40;c, a, b&#41;
    broadcast&#33;&#40;sin, c, c&#41;
    synchronize&#40;&#41;
    c
end</code></pre>
<p>To execute multiple invocations of this function concurrently, we can simply use Julia&#39;s task-based programming interfaces and wrap each call to <code>compute</code> in an <code>@async</code> block. Then, we synchronize execution again by wrapping in a <code>@sync</code> block:</p>
<pre><code class="language-julia">function iteration&#40;a,b,c&#41;
    results &#61; Vector&#123;Any&#125;&#40;undef, 2&#41;
    NVTX.@range &quot;computation&quot; @sync begin
        @async begin
            results&#91;1&#93; &#61; compute&#40;a,b,c&#41;
        end
        @async begin
            results&#91;2&#93; &#61; compute&#40;a,b,c&#41;
        end
    end
    NVTX.@range &quot;comparison&quot; Array&#40;results&#91;1&#93;&#41; &#61;&#61; Array&#40;results&#91;2&#93;&#41;
end</code></pre>
<p>The calls to the <code>@range</code> macro from NVTX, a submodule of CUDA.jl, will visualize the different phases of execution when we profile our program. We now invoke our function using some random data:</p>
<pre><code class="language-julia">function main&#40;N&#61;1024&#41;
    a &#61; CUDA.rand&#40;N,N&#41;
    b &#61; CUDA.rand&#40;N,N&#41;
    c &#61; CUDA.rand&#40;N,N&#41;

    # make sure this data can be used by other tasks&#33;
    synchronize&#40;&#41;

    # warm-up
    iteration&#40;a,b,c&#41;
    GC.gc&#40;true&#41;

    NVTX.@range &quot;main&quot; iteration&#40;a,b,c&#41;
end</code></pre>
<p>The snippet above illustrates one breaking aspect of this release: Because each task uses its own stream, <strong>you now need to synchronize when re-using data in another task.</strong> Although it is unlikely that any user code was relying on the old behavior, it is technically a breaking change, and as such we are bumping the major version of the CUDA.jl package.</p>
<p>If we profile our program using NSight Systems, we can see how the execution of both calls to <code>compute</code> was overlapped:</p>
<figure>
  <img src="https://juliagpu.org/post/2021-04-09-cuda_3.0/task_based_concurrency.png" alt="Overlapping execution on the GPU using task-based concurrency">
</figure><p>The region highlighted in green was spent enqueueing operations from the CPU, which includes the call to <code>synchronize&#40;&#41;</code>. This used to be a blocking operation, whereas now it only synchronizes the task-local stream while yielding to the Julia scheduler so that it can continue execution on another task. <strong>For synchronizing the entire device, use the new <code>device_synchronize&#40;&#41;</code> function.</strong></p>
<p>The remainder of computation was then spent executing kernels. Here, execution was overlapped, but that obviously depends on the exact characteristics of the computations and your GPU. Also note that copying to and from the CPU is always going to block for some time, unless the memory was page-locked. CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/760">now supports</a> locking memory like that using the <code>pin</code> function; for more details refer to <a href="https://juliagpu.github.io/CUDA.jl/dev/usage/multitasking/">the CUDA.jl documentation on tasks and threads</a>.</p>
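<p>A minimal sketch of page-locking memory with <code>pin</code> &#40;the array names and sizes below are illustrative assumptions&#41;:</p>
<pre><code class="language-julia">using CUDA

a &#61; rand&#40;Float32, 1024, 1024&#41;   # an ordinary CPU array
CUDA.pin&#40;a&#41;                     # page-lock its memory, enabling asynchronous copies

d_a &#61; CuArray&#40;a&#41;                # uploads can now overlap with other work
copyto&#33;&#40;a, d_a&#41;                 # and so can downloads</code></pre>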
<h2 id="cuda_112_and_stream-ordered_allocations">CUDA 11.2 and stream-ordered allocations</h2>
<p>CUDA.jl now also fully supports CUDA 11.2, and it will default to using that version of the toolkit if your driver supports it. The release came with several new features, such as <a href="https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/">the new stream-ordered memory allocator</a>. Without going into details, it is now possible to asynchronously allocate memory, obviating much of the need to cache those allocations in a memory pool. Initial benchmarks have shown nice speed-ups from using this allocator, while lowering memory pressure and thus reducing invocations of the Julia garbage collector.</p>
<p>When using CUDA 11.2, CUDA.jl will <a href="https://github.com/JuliaGPU/CUDA.jl/pull/679">default to the CUDA-backed memory pool</a> and disable its own caching layer. If you want to compare performance, you can still use the old allocator and caching memory pool by setting the <code>JULIA_CUDA_MEMORY_POOL</code> environment variable to, e.g. <code>binned</code>. On older versions of CUDA, the <code>binned</code> pool is still used by default.</p>
<h2 id="gpu_method_overrides">GPU method overrides</h2>
<p>With the new <code>AbstractInterpreter</code> functionality in Julia 1.6, it is now much easier to further customize the Base compiler. This has enabled us to develop <a href="https://github.com/JuliaGPU/GPUCompiler.jl/pull/151">a mechanism for overriding methods with GPU-specific counterparts</a>. It used to be required to explicitly pick CUDA-specific versions, e.g. <code>CUDA.sin</code>, because the Base version performed some GPU-incompatible operation. This was problematic as it did not compose with generic code, and the CUDA-specific versions often lacked support for specific combinations of argument types &#40;for example, <code>CUDA.sin&#40;::Complex&#41;</code> was not supported&#41;.</p>
<p>With CUDA 3.0, it is possible to <strong>define GPU-specific methods that override an existing definition, without requiring a new function type</strong>. For now, this functionality is private to CUDA.jl, but we expect to make it available to other packages starting with Julia 1.7.</p>
<p>This functionality has unblocked <em>many</em> issues, as can be seen in the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/750">corresponding pull request</a>. It is now no longer needed to prefix a call with the CUDA module to ensure a GPU-compatible version is used. Furthermore, it also protects users from accidentally calling GPU intrinsics, as doing so will now result in an error instead of a crash:</p>
<pre><code class="language-text">julia&gt; CUDA.saturate&#40;1f0&#41;
ERROR: This function is not intended for use on the CPU
Stacktrace:
 &#91;1&#93; error&#40;s::String&#41;
   @ Base ./error.jl:33
 &#91;2&#93; saturate&#40;x::Float32&#41;
   @ CUDA ~/Julia/pkg/CUDA/src/device/intrinsics.jl:23
 &#91;3&#93; top-level scope
   @ REPL&#91;10&#93;:1</code></pre>
<h2 id="device-side_random_number_generation">Device-side random number generation</h2>
<p>As an illustration of the value of GPU method overrides, CUDA.jl now provides a device-side random number generator that is accessible by simply calling <code>rand&#40;&#41;</code> from a kernel:</p>
<pre><code class="language-julia">julia&gt; function kernel&#40;&#41;
         @cushow rand&#40;&#41;
         return
       end
kernel &#40;generic function with 1 method&#41;julia&gt; @cuda kernel&#40;&#41;
rand&#40;&#41; &#61; 0.668274</code></pre>
<p>This works by overriding the <code>Random.default_rng&#40;&#41;</code> method, and providing a GPU-compatible random number generator: Building on <a href="https://github.com/JuliaGPU/CUDA.jl/pull/772">exploratory work</a> by <a href="https://github.com/S-D-R">@S-D-R</a>, the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/788">current generator</a> is a maximally equidistributed combined Tausworthe RNG that shares 32 bytes of random state across the threads in a warp for performance. The generator performs well, but <a href="https://github.com/JuliaGPU/CUDA.jl/issues/803">does not pass</a> the Crush battery of tests, so PRs to improve the implementation are welcome&#33;</p>
<p>Note that for host-side operations, e.g. <code>rand&#33;&#40;::CuArray&#41;</code>, this generator is not yet used by default. Instead, we use CURAND whenever possible, and fall back to the slower but more fully-featured generator from GPUArrays.jl in other cases.</p>
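<p>As an illustration &#40;a minimal sketch, assuming a CUDA-capable device&#41;, host-side generation can happen in-place or through the convenience constructors, with CURAND used behind the scenes for supported element types:</p>
<pre><code class="language-julia">using CUDA, Random

a = CUDA.zeros(Float32, 2, 2)
rand!(a)                   # fills `a` in-place, using CURAND when possible

b = CUDA.rand(Float64, 4)  # convenience constructor, also CURAND-backed</code></pre>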
<h2 id="revamped_cudnn_interface">Revamped cuDNN interface</h2>
<p>Finally, the cuDNN wrappers have been <a href="https://github.com/JuliaGPU/CUDA.jl/pull/523">completely revamped</a> by <a href="https://github.com/denizyuret">@denizyuret</a>. The goal of the redesign is to more faithfully map the cuDNN API to more natural Julia functions, so that packages like Knet.jl or NNlib.jl can more easily use advanced cuDNN features without having to resort to low-level C calls. For more details, refer to <a href="https://github.com/JuliaGPU/CUDA.jl/blob/da7c6eee82d6ea0eee1cb75c8589c8a92b0bc474/lib/cudnn/README.md">the design document</a>. As part of this redesign, the high-level wrappers of CUDNN <a href="https://github.com/FluxML/NNlib.jl/pull/286">have been moved to</a> a subpackage of NNlib.jl.</p>
]]></content:encoded>
    
  <pubDate>Fri, 09 Apr 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.4 and 2.5]]></title>
  <link>https://juliagpu.org/post/2021-01-08-cuda_2.4_2.5/index.html</link>
  <guid>https://juliagpu.org/2021-01-08-cuda_2.4_2.5/</guid>
  <description><![CDATA[CUDA.jl v2.4 and v2.5 are two almost-identical feature releases, respectively for Julia 1.5 and 1.6. These releases feature greatly improved &lt;code&gt;findmin&lt;/code&gt; and &lt;code&gt;findmax&lt;/code&gt; kernels, an improved interface for kernel introspection, support for CUDA 11.2, and of course many bug fixes.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v2.4 and v2.5 are two almost-identical feature releases, respectively for Julia 1.5 and 1.6. These releases feature greatly improved <code>findmin</code> and <code>findmax</code> kernels, an improved interface for kernel introspection, support for CUDA 11.2, and of course many bug fixes.</p>
<h2 id="improved_findmin_and_findmax_kernels">Improved <code>findmin</code> and <code>findmax</code> kernels</h2>
<p>Thanks to <a href="https://github.com/tkf">@tkf</a> and <a href="https://github.com/Ellipse0934">@Ellipse0934</a>, CUDA.jl now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/576">uses a single-pass kernel for finding the minimum or maximum item in a CuArray</a>. This fixes compatibility with <code>NaN</code>-valued elements, while on average improving performance. Depending on the rank, shape and size of the array these improvements vary from a minor regression to order-of-magnitude improvements.</p>
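<p>For instance, reductions over arrays containing <code>NaN</code> now behave like their CPU counterparts. A minimal sketch &#40;assuming a CUDA-capable device&#41;:</p>
<pre><code class="language-julia">using CUDA

a = cu([1f0, NaN32, 3f0])
findmax(a)  # propagates the NaN, matching `findmax` on a plain `Array`</code></pre>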
<h2 id="new_kernel_introspection_interface">New kernel introspection interface</h2>
<p>It is now possible to obtain a compiled-but-not-launched kernel by passing the <code>launch&#61;false</code> keyword to <code>@cuda</code>. This is useful for reflection, e.g., to query the number of registers or other kernel properties:</p>
<pre><code class="language-julia">julia&gt; kernel &#61; @cuda launch&#61;false identity&#40;nothing&#41;
CUDA.HostKernel&#123;identity,Tuple&#123;Nothing&#125;&#125;&#40;...&#41;

julia&gt; CUDA.registers&#40;kernel&#41;
4</code></pre>
<p>The old API is still available, and will even be extended in future versions of CUDA.jl for the purpose of compiling device functions &#40;not kernels&#41;:</p>
<pre><code class="language-julia">julia&gt; kernel &#61; cufunction&#40;identity, Tuple&#123;Nothing&#125;&#41;
CUDA.HostKernel&#123;identity,Tuple&#123;Nothing&#125;&#125;&#40;...&#41;</code></pre>
<h2 id="support_for_cuda_112">Support for CUDA 11.2</h2>
<p>CUDA.jl now supports the latest version of CUDA, version 11.2. Because CUDNN and CUTENSOR are not compatible with this release yet, CUDA.jl won&#39;t automatically switch to it unless you explicitly request it:</p>
<pre><code class="language-julia">julia&gt; ENV&#91;&quot;JULIA_CUDA_VERSION&quot;&#93; &#61; &quot;11.2&quot;
&quot;11.2&quot;julia&gt; using CUDAjulia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 11.2.0, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4</code></pre>
<p>Alternatively, if you disable use of artifacts through <code>JULIA_CUDA_USE_BINARYBUILDER&#61;false</code>, CUDA 11.2 can be picked up from your local system.</p>
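<p>For example, disabling artifacts could look as follows &#40;a sketch; the environment variable needs to be set before CUDA.jl is loaded&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

julia&gt; using CUDA

julia&gt; CUDA.versioninfo()  # should now report a local installation</code></pre>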
<h2 id="future_developments">Future developments</h2>
<p>Due to upstream compiler changes, CUDA.jl 2.4 is expected to be the last release compatible with Julia 1.5. Patch releases are still possible, but are not automatic: if you need a specific bugfix from a future CUDA.jl release, create an issue or PR to backport the change.</p>
]]></content:encoded>
    
  <pubDate>Fri, 08 Jan 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Introducing: oneAPI.jl]]></title>
  <link>https://juliagpu.org/post/2020-11-05-oneapi_0.1/index.html</link>
  <guid>https://juliagpu.org/2020-11-05-oneapi_0.1/</guid>
  <description><![CDATA[We&#39;re proud to announce the first version of oneAPI.jl, a Julia package for programming accelerators with the &lt;a href&#61;&quot;https://www.oneapi.com/&quot;&gt;oneAPI programming model&lt;/a&gt;. It is currently available for select Intel GPUs, including common integrated ones, and offers a similar experience to CUDA.jl.]]></description>  
  
  <content:encoded><![CDATA[
<p>We&#39;re proud to announce the first version of oneAPI.jl, a Julia package for programming accelerators with the <a href="https://www.oneapi.com/">oneAPI programming model</a>. It is currently available for select Intel GPUs, including common integrated ones, and offers a similar experience to CUDA.jl.</p>
<p>The initial version of this package, v0.1, consists of three key components:</p>
<ul>
<li><p>wrappers for the oneAPI Level Zero interfaces;</p>
</li>
<li><p>a compiler for Julia source code to SPIR-V IR;</p>
</li>
<li><p>and an array interface for convenient data-parallel programming.</p>
</li>
</ul>
<p>In this post, I&#39;ll briefly describe each of these. But first, some essentials.</p>
<h2 id="installation">Installation</h2>
<p>oneAPI.jl is currently only supported on 64-bit Linux with a sufficiently recent kernel, and requires Julia 1.5. Furthermore, it currently only supports a limited set of Intel GPUs: Gen9 &#40;Skylake, Kaby Lake, Coffee Lake&#41;, Gen11 &#40;Ice Lake&#41;, and Gen12 &#40;Tiger Lake&#41;.</p>
<p>If your Intel CPU has an integrated GPU supported by oneAPI, you can just go ahead and install the oneAPI.jl package:</p>
<pre><code class="language-julia">pkg&gt; add oneAPI</code></pre>
<p>That&#39;s right, no additional drivers required&#33; oneAPI.jl ships its own copy of the <a href="https://github.com/intel/compute-runtime">Intel Compute Runtime</a>, which works out of the box on any &#40;sufficiently recent&#41; Linux kernel. The initial download, powered by Julia&#39;s artifact subsystem, might take a while to complete. After that, you can import the package and start using its functionality:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPIjulia&gt; oneAPI.versioninfo&#40;&#41;
Binary dependencies:
- NEO_jll: 20.42.18209&#43;0
- libigc_jll: 1.0.5186&#43;0
- gmmlib_jll: 20.3.2&#43;0
- SPIRV_LLVM_Translator_jll: 9.0.0&#43;1
- SPIRV_Tools_jll: 2020.2.0&#43;1

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1

1 driver:
- 00007fee-06cb-0a10-1642-ca9f01000000 &#40;v1.0.0, API v1.0.0&#41;

1 device:
- Intel&#40;R&#41; Graphics Gen9</code></pre>
<h2 id="the_onearray_type">The <code>oneArray</code> type</h2>
<p>Similar to CUDA.jl&#39;s <code>CuArray</code> type, oneAPI.jl provides an array abstraction that you can use to easily perform data parallel operations on your GPU:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; oneArray&#40;zeros&#40;2,3&#41;&#41;
2×3 oneArray&#123;Float64,2&#125;:
 0.0  0.0  0.0
 0.0  0.0  0.0

julia&gt; a .&#43; 1
2×3 oneArray&#123;Float64,2&#125;:
 1.0  1.0  1.0
 1.0  1.0  1.0

julia&gt; sum&#40;ans; dims&#61;2&#41;
2×1 oneArray&#123;Float64,2&#125;:
 3.0
 3.0</code></pre>
<p>This functionality builds on the <a href="https://github.com/JuliaGPU/GPUArrays.jl/">GPUArrays.jl</a> package, which means that a lot of operations are supported out of the box. Some are still missing, of course, and we haven&#39;t carefully optimized for performance either.</p>
<h2 id="kernel_programming">Kernel programming</h2>
<p>The above array operations are made possible by a compiler that transforms Julia source code into SPIR-V IR for use with oneAPI. Most of this work is part of <a href="https://github.com/JuliaGPU/GPUCompiler.jl">GPUCompiler.jl</a>. In oneAPI.jl, we use this compiler to provide a kernel programming model:</p>
<pre><code class="language-julia-repl">julia&gt; function vadd&#40;a, b, c&#41;
           i &#61; get_global_id&#40;&#41;
           @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
           return
       end

julia&gt; a &#61; oneArray&#40;rand&#40;10&#41;&#41;;

julia&gt; b &#61; oneArray&#40;rand&#40;10&#41;&#41;;

julia&gt; c &#61; similar&#40;a&#41;;

julia&gt; @oneapi items&#61;10 vadd&#40;a, b, c&#41;

julia&gt; @test Array&#40;a&#41; .&#43; Array&#40;b&#41; &#61;&#61; Array&#40;c&#41;
Test Passed</code></pre>
<p>Again, the <code>@oneapi</code> macro resembles <code>@cuda</code> from CUDA.jl. One of the differences with the CUDA stack is that we use OpenCL-style built-ins, like <code>get_global_id</code> instead of <code>threadIdx</code> and <code>barrier</code> instead of <code>sync_threads</code>. Other familiar functionality, e.g. to reflect on the compiler, is available as well:</p>
<pre><code class="language-julia-repl">julia&gt; @device_code_spirv @oneapi vadd&#40;a, b, c&#41;
; CompilerJob of kernel vadd&#40;oneDeviceArray&#123;Float64,1,1&#125;,
;                            oneDeviceArray&#123;Float64,1,1&#125;,
;                            oneDeviceArray&#123;Float64,1,1&#125;&#41;
; for GPUCompiler.SPIRVCompilerTarget; SPIR-V
; Version: 1.0
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 46
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
               OpCapability Float64
               OpCapability Int64
               OpCapability Int8
          &#37;1 &#61; OpExtInstImport &quot;OpenCL.std&quot;
               OpMemoryModel Physical64 OpenCL
               OpEntryPoint Kernel
               ...
               OpReturn
               OpFunctionEnd</code></pre>
<h2 id="level_zero_wrappers">Level Zero wrappers</h2>
<p>To interface with the oneAPI driver, we use the <a href="https://github.com/oneapi-src/level-zero">Level Zero API</a>. Wrappers for this API are available under the <code>oneL0</code> submodule of oneAPI.jl:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPI.oneL0julia&gt; drv &#61; first&#40;drivers&#40;&#41;&#41;
ZeDriver&#40;00000000-0000-0000-1642-ca9f01000000, version 1.0.0&#41;julia&gt; dev &#61; first&#40;devices&#40;drv&#41;&#41;
ZeDevice&#40;GPU, vendor 0x8086, device 0x1912&#41;: Intel&#40;R&#41; Graphics Gen9</code></pre>
<p>This is a low-level interface, and importing this submodule should not be required for the vast majority of users. It is only useful when you want to perform very specific operations, like submitting certain operations to the command queue, working with events, etc. In that case, you should refer to the <a href="https://spec.oneapi.com/level-zero/latest/index.html">upstream specification</a>; the wrappers in the <code>oneL0</code> module closely mimic the C APIs.</p>
<h2 id="status">Status</h2>
<p>Version 0.1 of oneAPI.jl forms a solid base for future oneAPI developments in Julia. Thanks to the continued effort of generalizing the Julia GPU support in packages like GPUArrays.jl and GPUCompiler.jl, this initial version is already much more usable than early versions of CUDA.jl or AMDGPU.jl ever were.</p>
<p>That said, there are crucial parts missing. For one, oneAPI.jl does not integrate with any of the vendor libraries like oneMKL or oneDNN. That means several important operations, e.g. matrix-matrix multiplication, will be slow. Hardware support is also limited, and the package currently only works on Linux.</p>
<p>If you want to contribute to oneAPI.jl, or run into problems, check out the GitHub repository at <a href="https://github.com/JuliaGPU/oneAPI.jl">JuliaGPU/oneAPI.jl</a>. For questions, please use the <a href="https://discourse.julialang.org/c/domain/gpu">Julia Discourse forum</a> under the GPU domain and/or in the #gpu channel of the <a href="https://julialang.org/community/">Julia Slack</a>.</p>
]]></content:encoded>
    
  <pubDate>Thu, 05 Nov 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.1]]></title>
  <link>https://juliagpu.org/post/2020-10-30-cuda_2.1/index.html</link>
  <guid>https://juliagpu.org/2020-10-30-cuda_2.1/</guid>
  <description><![CDATA[CUDA.jl v2.1 is a bug-fix release, with one new feature: support for cubic texture interpolations. The release also partly reverts a change from v2.0: &lt;code&gt;reshape&lt;/code&gt;, &lt;code&gt;reinterpret&lt;/code&gt; and contiguous &lt;code&gt;view&lt;/code&gt;s now return a &lt;code&gt;CuArray&lt;/code&gt; again.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v2.1 is a bug-fix release, with one new feature: support for cubic texture interpolations. The release also partly reverts a change from v2.0: <code>reshape</code>, <code>reinterpret</code> and contiguous <code>view</code>s now return a <code>CuArray</code> again.</p>
<h2 id="generalized_texture_interpolations">Generalized texture interpolations</h2>
<p>CUDA&#39;s texture hardware only supports nearest-neighbour and linear interpolation; for other modes, one is required to perform the interpolation by hand. In CUDA.jl v2.1 we are generalizing the texture interpolation API so that it is possible to use both hardware-backed and software-implemented interpolation modes in exactly the same way:</p>
<pre><code class="language-julia"># N is the dimensionality &#40;1, 2 or 3&#41;
# T is the element type &#40;needs to be supported by the texture hardware&#41;

# source array
src &#61; rand&#40;T, fill&#40;10, N&#41;...&#41;

# indices we want to interpolate
idx &#61; &#91;tuple&#40;rand&#40;1:0.1:10, N&#41;...&#41; for _ in 1:10&#93;

# upload to the GPU
gpu_src &#61; CuArray&#40;src&#41;
gpu_idx &#61; CuArray&#40;idx&#41;

# create a texture array for optimized fetching
# this is required for N&#61;1, optional for N&#61;2 and N&#61;3
gpu_src &#61; CuTextureArray&#40;gpu_src&#41;

# interpolate using a texture
gpu_dst &#61; CuArray&#123;T&#125;&#40;undef, size&#40;gpu_idx&#41;&#41;
gpu_tex &#61; CuTexture&#40;gpu_src; interpolation&#61;CUDA.NearestNeighbour&#40;&#41;&#41;
broadcast&#33;&#40;gpu_dst, gpu_idx, Ref&#40;gpu_tex&#41;&#41; do idx, tex
    tex&#91;idx...&#93;
end

# back to the CPU
dst &#61; Array&#40;gpu_dst&#41;</code></pre>
<p>Here, we can change the <code>interpolation</code> argument to <code>CuTexture</code> to either <code>NearestNeighbour</code> or <code>LinearInterpolation</code>, both supported by the hardware, or <code>CubicInterpolation</code> which is implemented in software &#40;building on the hardware-supported linear interpolation&#41;.</p>
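<p>For example, switching the listing above to software-implemented cubic interpolation only requires constructing the texture differently &#40;a sketch reusing the variables from the previous listing&#41;:</p>
<pre><code class="language-julia">gpu_tex = CuTexture(gpu_src; interpolation=CUDA.CubicInterpolation())</code></pre>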
<h2 id="partial_revert_of_array_wrapper_changes">Partial revert of array wrapper changes</h2>
<p>In CUDA.jl v2.0, we changed the behavior of several important array operations to reuse available wrappers in Base: <code>reshape</code> started returning a <code>ReshapedArray</code>, <code>view</code> a <code>SubArray</code>, and <code>reinterpret</code> was reworked to use <code>ReinterpretArray</code>. These changes were made to ensure maximal compatibility with Base&#39;s array types, and to simplify the implementation in CUDA.jl and GPUArrays.jl.</p>
<p>However, this change turned out to regress the time to precompile and load CUDA.jl. Consequently, the change has been reverted, and these wrappers are now implemented as part of the <code>CuArray</code> type again. Note however that we intend to revisit this change in the future. It is therefore recommended to use the <code>DenseCuArray</code> type alias for methods that need a <code>CuArray</code> backed by contiguous GPU memory. For strided <code>CuArray</code>s, i.e. non-contiguous views, you should use the <code>StridedCuArray</code> alias.</p>
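<p>As a sketch of how these aliases can be used in method signatures &#40;<code>scale_contiguous!</code> is a hypothetical function name&#41;:</p>
<pre><code class="language-julia">using CUDA

# only accepts arrays backed by contiguous GPU memory
function scale_contiguous!(a::DenseCuArray{Float32}, s::Float32)
    a .*= s
    return a
end

# non-contiguous views would instead dispatch to a method
# taking a StridedCuArray argument</code></pre>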
]]></content:encoded>
    
  <pubDate>Fri, 30 Oct 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.0]]></title>
  <link>https://juliagpu.org/post/2020-10-02-cuda_2.0/index.html</link>
  <guid>https://juliagpu.org/2020-10-02-cuda_2.0/</guid>
  <description><![CDATA[Today we&#39;re releasing CUDA.jl 2.0, a breaking release with several new features. Highlights include initial support for Float16, a switch to CUDA&#39;s new stream model, a much-needed rework of the sparse array support and support for CUDA 11.1.]]></description>  
  
  <content:encoded><![CDATA[
<p>Today we&#39;re releasing CUDA.jl 2.0, a breaking release with several new features. Highlights include initial support for Float16, a switch to CUDA&#39;s new stream model, a much-needed rework of the sparse array support and support for CUDA 11.1.</p>
<p>The release now requires <strong>Julia 1.5</strong>, and assumes a GPU with <strong>compute capability 5.0</strong> or higher &#40;although most of the package will still work with an older GPU&#41;.</p>
<h2 id="low-_and_mixed-precision_operations">Low- and mixed-precision operations</h2>
<p>With NVIDIA&#39;s latest GPUs featuring more and more low-precision operations, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/417">now</a> starts to support these data types. For example, the CUBLAS wrappers can be used with &#40;B&#41;Float16 inputs &#40;running under <code>JULIA_DEBUG&#61;CUBLAS</code> to illustrate the called methods&#41; thanks to the <code>cublasGemmEx</code> API call:</p>
<pre><code class="language-julia-repl">julia&gt; mul&#33;&#40;CUDA.zeros&#40;Float32,2,2&#41;,
            cu&#40;rand&#40;Float16,2,2&#41;&#41;,
            cu&#40;rand&#40;Float16,2,2&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_16F&#40;2&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_16F&#40;2&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F&#40;68&#41;
2×2 CuArray&#123;Float32,2&#125;:
 0.481284  0.561241
 1.12923   1.04541</code></pre>
<pre><code class="language-julia-repl">julia&gt; using BFloat16sjulia&gt; mul&#33;&#40;CUDA.zeros&#40;BFloat16,2,2&#41;,
            cu&#40;BFloat16.&#40;rand&#40;2,2&#41;&#41;&#41;,
            cu&#40;BFloat16.&#40;rand&#40;2,2&#41;&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F&#40;68&#41;
2×2 CuArray&#123;BFloat16,2&#125;:
 0.300781   0.71875
 0.0163574  0.0241699</code></pre>
<p>Alternatively, CUBLAS can be configured to automatically down-cast 32-bit inputs to Float16. This is <a href="https://github.com/JuliaGPU/CUDA.jl/pull/424">now</a> exposed through a task-local CUDA.jl math mode:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.math_mode&#33;&#40;CUDA.FAST_MATH; precision&#61;:Float16&#41;julia&gt; mul&#33;&#40;CuArray&#40;zeros&#40;Float32,2,2&#41;&#41;,
            CuArray&#40;rand&#40;Float32,2,2&#41;&#41;,
            CuArray&#40;rand&#40;Float32,2,2&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F_FAST_16F&#40;74&#41;
2×2 CuArray&#123;Float32,2&#125;:
 0.175258  0.226159
 0.511893  0.331351</code></pre>
<p>As part of these changes, CUDA.jl now defaults to using tensor cores. This may affect accuracy; use math mode <code>PEDANTIC</code> if you want the old behavior.</p>
<p>Work is <a href="https://github.com/JuliaGPU/CUDA.jl/issues/391">under way</a> to extend these capabilities to the rest of CUDA.jl, e.g., the CUDNN wrappers, or the native kernel programming capabilities.</p>
<h2 id="new_default_stream_semantics">New default stream semantics</h2>
<p>In CUDA.jl 2.0 we&#39;re <a href="https://github.com/JuliaGPU/CUDA.jl/pull/395">switching</a> to CUDA&#39;s <a href="https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/">simplified stream programming model</a>. This simplifies working with multiple streams, and opens up more possibilities for concurrent execution of GPU operations.</p>
<h3 id="multi-stream_programming">Multi-stream programming</h3>
<p>In the old model, the default stream &#40;used by all GPU operations unless specified otherwise&#41; was a special stream whose commands could not be executed concurrently with commands on regular, explicitly-created streams. For example, if we interleaved kernels executed on a dedicated stream with ones on the default stream, execution was serialized:</p>
<pre><code class="language-julia">using CUDA

N &#61; 1 &lt;&lt; 20

function kernel&#40;x, n&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x
    for i &#61; tid:blockDim&#40;&#41;.x*gridDim&#40;&#41;.x:n
        x&#91;i&#93; &#61; CUDA.sqrt&#40;CUDA.pow&#40;3.14159f0, i&#41;&#41;
    end
    return
end

num_streams &#61; 8

for i in 1:num_streams
    stream &#61; CuStream&#40;&#41;

    data &#61; CuArray&#123;Float32&#125;&#40;undef, N&#41;

    @cuda blocks&#61;1 threads&#61;64 stream&#61;stream kernel&#40;data, N&#41;

    @cuda kernel&#40;data, 0&#41;
end</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multistream_before.png" alt="Multi-stream programming (old)">
</figure><p>In the new model, default streams are regular streams and commands issued on them can execute concurrently with those on other streams:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multistream_after.png" alt="Multi-stream programming (new)">
</figure><h3 id="multi-threading">Multi-threading</h3>
<p>Another consequence of the new stream model is that each thread gets its own default stream &#40;accessible as <code>CuStreamPerThread&#40;&#41;</code>&#41;. Together with Julia&#39;s threading capabilities, this makes it trivial to group independent work in tasks, benefiting from concurrent execution on the GPU where possible:</p>
<pre><code class="language-julia">using CUDA

N &#61; 1 &lt;&lt; 20

function kernel&#40;x, n&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x
    for i &#61; tid:blockDim&#40;&#41;.x*gridDim&#40;&#41;.x:n
        x&#91;i&#93; &#61; CUDA.sqrt&#40;CUDA.pow&#40;3.14159f0, i&#41;&#41;
    end
    return
end

Threads.@threads for i in 1:Threads.nthreads&#40;&#41;
    data &#61; CuArray&#123;Float32&#125;&#40;undef, N&#41;
    @cuda blocks&#61;1 threads&#61;64 kernel&#40;data, N&#41;
    synchronize&#40;CuDefaultStream&#40;&#41;&#41;
end</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multithread_after.png" alt="Multi-threading (new)">
</figure><p>With the old model, execution would have been serialized because the default stream was the same across threads:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multithread_before.png" alt="Multi-threading (old)">
</figure><p>Future improvements will make this behavior configurable, such that users can use a different default stream per task.</p>
<h2 id="sparse_array_clean-up">Sparse array clean-up</h2>
<p>As part of CUDA.jl 2.0, the sparse array support <a href="https://github.com/JuliaGPU/CUDA.jl/pull/409">has been refactored</a>, bringing them in line with other array types and their expected behavior. For example, the custom <code>switch2</code> methods have been removed in favor of calls to <code>convert</code> and array constructors:</p>
<pre><code class="language-julia-repl">julia&gt; using SparseArrays
julia&gt; using CUDA, CUDA.CUSPARSEjulia&gt; CuSparseMatrixCSC&#40;CUDA.rand&#40;2,2&#41;&#41;
2×2 CuSparseMatrixCSC&#123;Float32&#125; with 4 stored entries:
  &#91;1, 1&#93;  &#61;  0.124012
  &#91;2, 1&#93;  &#61;  0.791714
  &#91;1, 2&#93;  &#61;  0.487905
  &#91;2, 2&#93;  &#61;  0.752466

julia&gt; CuSparseMatrixCOO&#40;sprand&#40;2,2, 0.5&#41;&#41;
2×2 CuSparseMatrixCOO&#123;Float64&#125; with 3 stored entries:
  &#91;1, 1&#93;  &#61;  0.183183
  &#91;2, 1&#93;  &#61;  0.966466
  &#91;2, 2&#93;  &#61;  0.064101

julia&gt; CuSparseMatrixCSR&#40;ans&#41;
2×2 CuSparseMatrixCSR&#123;Float64&#125; with 3 stored entries:
  &#91;1, 1&#93;  &#61;  0.183183
  &#91;2, 1&#93;  &#61;  0.966466
  &#91;2, 2&#93;  &#61;  0.064101</code></pre>
<p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/421">Initial support for the COO sparse matrix type</a> has also been added, along with <a href="https://github.com/JuliaGPU/CUDA.jl/pull/351">better support for sparse matrix-vector multiplication</a>.</p>
<h2 id="support_for_cuda_111">Support for CUDA 11.1</h2>
<p>This release also features support for the brand-new CUDA 11.1. As there is no compatible release of CUDNN or CUTENSOR yet, CUDA.jl won&#39;t automatically select this version, but you can force it to by setting the <code>JULIA_CUDA_VERSION</code> environment variable to <code>11.1</code>:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;JULIA_CUDA_VERSION&quot;&#93; &#61; &quot;11.1&quot;julia&gt; using CUDAjulia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 11.1.0, artifact installationLibraries:
- CUDNN: missing
- CUTENSOR: missing</code></pre>
<h2 id="minor_changes">Minor changes</h2>
<p>Many other changes are part of this release:</p>
<ul>
<li><p>Views, reshapes and array reinterpretations <a href="https://github.com/JuliaGPU/CUDA.jl/pull/437">are now represented</a> by the Base array wrappers, simplifying the CuArray type definition.</p>
</li>
<li><p>Various optimizations to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/428">CUFFT</a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/321">CUDNN</a> library wrappers.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/427">Support</a> for <code>LinearAlgebra.reflect&#33;</code> and <code>rotate&#33;</code>.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/435">Initial support</a> for calling CUDA libraries with strided inputs.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 02 Oct 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Paper: Flexible Performant GEMM Kernels on GPUs]]></title>
  <link>https://juliagpu.org/post/2020-09-28-gemmkernels/index.html</link>
  <guid>https://juliagpu.org/2020-09-28-gemmkernels/</guid>
  <description><![CDATA[General Matrix Multiplication or GEMM kernels take center place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA&#39;s Tensor Cores. In this paper we show how it is possible to program these accelerators from Julia, and present abstractions and interfaces that allow to do so efficiently without sacrificing performance.]]></description>  
  
  <content:encoded><![CDATA[
<p>General Matrix Multiplication or GEMM kernels take center place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA&#39;s Tensor Cores. In this paper we show how it is possible to program these accelerators from Julia, and present abstractions and interfaces that allow to do so efficiently without sacrificing performance.</p>
<p>A pre-print of the paper has been published on arXiv: <a href="https://arxiv.org/abs/2009.12263">arXiv:2009.12263</a>. <br/> The source code can be found on GitHub: <a href="https://github.com/thomasfaingnaert/GemmKernels.jl">thomasfaingnaert/GemmKernels.jl</a>.</p>
<p>With the APIs from GemmKernels.jl, it is possible to instantiate GEMM kernels that perform in the same ballpark as, and sometimes even outperform, state-of-the-art libraries like CUBLAS and CUTLASS. For example, performing a mixed-precision multiplication of two 16-bit matrices into a 32-bit accumulator &#40;on different combinations of layouts&#41;:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-09-28-gemmkernels/mixed_precision.png" alt="Performance of mixed-precision GEMM">
</figure><p>The APIs are also highly flexible and allow customization of each step, e.g., to apply the activation function <code>max&#40;x, 0&#41;</code> for implementing a rectified linear unit &#40;ReLU&#41;:</p>
<pre><code class="language-julia">a &#61; CuArray&#40;rand&#40;Float16, &#40;M, K&#41;&#41;&#41;
b &#61; CuArray&#40;rand&#40;Float16, &#40;K, N&#41;&#41;&#41;
c &#61; CuArray&#40;rand&#40;Float32, &#40;M, N&#41;&#41;&#41;
d &#61; similar&#40;c&#41;conf &#61; GemmKernels.get_config&#40;
    gemm_shape &#61; &#40;M &#61; M, N &#61; N, K &#61; K&#41;,
    operator &#61; Operator.WMMAOp&#123;16, 16, 16&#125;,
    global_a_layout &#61; Layout.AlignedColMajor&#123;Float16&#125;,
    global_c_layout &#61; Layout.AlignedColMajor&#123;Float32&#125;&#41;GemmKernels.matmul&#40;
    a, b, c, d, conf;
    transform_regs_to_shared_d &#61; Transform.Elementwise&#40;x -&gt; max&#40;x, 0&#41;&#41;&#41;</code></pre>
<p>The GemmKernels.jl framework is written entirely in Julia, demonstrating the high-performance GPU programming capabilities of this language, but at the same time keeping the research accessible and easy to modify or repurpose by other Julia developers.</p>
]]></content:encoded>
    
  <pubDate>Mon, 28 Sep 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Thomas Faingnaert, Tim Besard, Bjorn De Sutter</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 1.3 - Multi-device programming]]></title>
  <link>https://juliagpu.org/post/2020-07-18-cuda_1.3/index.html</link>
  <guid>https://juliagpu.org/2020-07-18-cuda_1.3/</guid>
  <description><![CDATA[Today we&#39;re releasing CUDA.jl 1.3, with several new features. The most prominent change is support for multiple GPUs within a single process.]]></description>  
  
  <content:encoded><![CDATA[
<p>Today we&#39;re releasing CUDA.jl 1.3, with several new features. The most prominent change is support for multiple GPUs within a single process.</p>
<h2 id="multi-gpu_programming">Multi-GPU programming</h2>
<p>With CUDA.jl 1.3, you can finally use multiple CUDA GPUs within a single process. To switch devices you can call <code>device&#33;</code>, query the current device with <code>device&#40;&#41;</code>, or reset it using <code>device_reset&#33;&#40;&#41;</code>:</p>
<pre><code class="language-julia-repl">julia&gt; collect&#40;devices&#40;&#41;&#41;
9-element Array&#123;CuDevice,1&#125;:
 CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;1&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;2&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;3&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;4&#41;: Tesla V100-PCIE-16GB
 CuDevice&#40;5&#41;: Tesla P100-PCIE-16GB
 CuDevice&#40;6&#41;: Tesla P100-PCIE-16GB
 CuDevice&#40;7&#41;: GeForce GTX 1080 Ti
 CuDevice&#40;8&#41;: GeForce GTX 1080 Ti

julia&gt; device&#33;&#40;5&#41;

julia&gt; device&#40;&#41;
CuDevice&#40;5&#41;: Tesla P100-PCIE-16GB</code></pre>
<p>Let&#39;s define a kernel to show this really works:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;&#41;
           dev &#61; Ref&#123;Cint&#125;&#40;&#41;
           CUDA.cudaGetDevice&#40;dev&#41;
           @cuprintln&#40;&quot;Running on device &#36;&#40;dev&#91;&#93;&#41;&quot;&#41;
           return
       end

julia&gt; @cuda kernel&#40;&#41;
Running on device 5

julia&gt; device&#33;&#40;0&#41;

julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB

julia&gt; @cuda kernel&#40;&#41;
Running on device 0</code></pre>
<p>Memory allocations, like <code>CuArray</code>s, are implicitly bound to the device they were allocated on. That means you should take care to only use an array when the owning device is active, or you will run into errors:</p>
<pre><code class="language-julia-repl">julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB

julia&gt; a &#61; CUDA.rand&#40;1&#41;
1-element CuArray&#123;Float32,1&#125;:
 0.6322775

julia&gt; device&#33;&#40;1&#41;

julia&gt; a
ERROR: CUDA error: an illegal memory access was encountered</code></pre>
<p>Future improvements might make the array type device-aware.</p>
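<p>Until then, one pattern that avoids such errors is to stage data through host memory before switching devices. The following is only a minimal sketch, with illustrative array sizes and device numbers:</p>
<pre><code class="language-julia">using CUDA

a &#61; CUDA.rand&#40;1024&#41;      # allocated on the currently-active device
a_host &#61; Array&#40;a&#41;        # download while the owning device is still active

device&#33;&#40;1&#41;
b &#61; CuArray&#40;a_host&#41;      # re-upload on the newly-selected device</code></pre>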
<h2 id="multitasking_and_multithreading">Multitasking and multithreading</h2>
<p>Dovetailing with the support for multiple GPUs is the ability to use those GPUs from separate Julia tasks and threads:</p>
<pre><code class="language-julia-repl">julia&gt; device&#33;&#40;0&#41;

julia&gt; @sync begin
         @async begin
           device&#33;&#40;1&#41;
           println&#40;&quot;Working with &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
           yield&#40;&#41;
           println&#40;&quot;Back to device &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
         end
         @async begin
           device&#33;&#40;2&#41;
           println&#40;&quot;Working with &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
         end
       end
Working with CuDevice&#40;1&#41; on Task @0x00007fc9e6a48010
Working with CuDevice&#40;2&#41; on Task @0x00007fc9e6a484f0
Back to device CuDevice&#40;1&#41; on Task @0x00007fc9e6a48010

julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB</code></pre>
<p>Each task has its own local GPU state, such as the device it was bound to, handles to libraries like CUBLAS or CUDNN &#40;which means that each task can configure libraries independently&#41;, etc.</p>
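<p>For example, each of the tasks below selects its own device and performs a matrix multiplication, which dispatches to that task&#39;s private CUBLAS handle. This is only a sketch, assuming at least two devices are available:</p>
<pre><code class="language-julia">using CUDA

@sync for dev in &#40;CuDevice&#40;0&#41;, CuDevice&#40;1&#41;&#41;
    @async begin
        device&#33;&#40;dev&#41;
        A &#61; CUDA.rand&#40;256, 256&#41;
        B &#61; CUDA.rand&#40;256, 256&#41;
        C &#61; A * B    # uses this task&#39;s own CUBLAS handle
    end
end</code></pre>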
<h2 id="minor_features">Minor features</h2>
<p>CUDA.jl 1.3 also features some minor changes:</p>
<ul>
<li><p>Reinstated compatibility with Julia 1.3</p>
</li>
<li><p>Support for CUDA 11.0 Update 1</p>
</li>
<li><p>Support for CUDNN 8.0.2</p>
</li>
</ul>
<h2 id="known_issues">Known issues</h2>
<p>Several operations on sparse arrays have been broken since CUDA.jl 1.2, due to the deprecations that were part of CUDA 11. The next version of CUDA.jl will drop support for CUDA 10.0 or older, which will make it possible to use new cuSPARSE APIs and add back missing functionality.</p>
]]></content:encoded>
    
  <pubDate>Sat, 18 Jul 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 1.1]]></title>
  <link>https://juliagpu.org/post/2020-07-07-cuda_1.1/index.html</link>
  <guid>https://juliagpu.org/2020-07-07-cuda_1.1/</guid>
  <description><![CDATA[CUDA.jl 1.1 marks the first feature release after merging several CUDA packages into one. It raises the minimal Julia version to 1.4, and comes with support for the impending 1.5 release.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 1.1 marks the first feature release after merging several CUDA packages into one. It raises the minimal Julia version to 1.4, and comes with support for the impending 1.5 release.</p>
<h2 id="cudajl_replacing_cuarrayscudanativejl">CUDA.jl replacing CuArrays/CUDAnative.jl</h2>
<p>As <a href="https://discourse.julialang.org/t/psa-cuda-jl-replacing-cuarrays-jl-cudanative-jl-cudadrv-jl-cudaapi-jl-call-for-testing/40205">announced a while back</a>, CUDA.jl is now the new package for programming CUDA GPUs in Julia, replacing CuArrays.jl, CUDAnative.jl, CUDAdrv.jl and CUDAapi.jl. The merged package should be a drop-in replacement: All existing functionality has been ported, and almost all exported functions are still there. Applications like Flux.jl or the DiffEq.jl stack are being updated to support this change.</p>
<h2 id="cuda_11_support">CUDA 11 support</h2>
<p>With CUDA.jl 1.1, we support the upcoming release of the CUDA toolkit. This only applies to locally-installed versions of the toolkit, i.e., you need to specify <code>JULIA_CUDA_USE_BINARYBUILDER&#61;false</code> in your environment to pick up the locally-installed release candidate of the CUDA toolkit. New features, like the third-generation tensor cores and its extended type support, or any new APIs, are not yet natively supported by Julia code.</p>
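<p>Concretely, that boils down to an invocation along these lines &#40;a sketch; adapt the paths and commands to your setup&#41;:</p>
<pre><code class="language-bash">&#36; export JULIA_CUDA_USE_BINARYBUILDER&#61;false
&#36; julia -e &#39;using CUDA; CUDA.versioninfo&#40;&#41;&#39;</code></pre>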
<h2 id="nvidia_management_library_nvml">NVIDIA Management Library &#40;NVML&#41;</h2>
<p>CUDA.jl now integrates with the NVIDIA Management Library, or NVML. With this library, it&#39;s possible to query information about the system, any GPU devices, their topology, etc.:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA

julia&gt; dev &#61; first&#40;NVML.devices&#40;&#41;&#41;
CUDA.NVML.Device&#40;Ptr&#123;Nothing&#125; @0x00007f987c7c6e38&#41;

julia&gt; NVML.uuid&#40;dev&#41;
UUID&#40;&quot;b8d5e790-ea4d-f962-e0c3-0448f69f2e23&quot;&#41;

julia&gt; NVML.name&#40;dev&#41;
&quot;Quadro RTX 5000&quot;

julia&gt; NVML.power_usage&#40;dev&#41;
37.863

julia&gt; NVML.energy_consumption&#40;dev&#41;
65330.292</code></pre>
<h2 id="experimental_texture_support">Experimental: Texture support</h2>
<p>It is now also possible to use the GPU&#39;s hardware texture support from Julia, albeit using a fairly low-level and still experimental API &#40;many thanks to <a href="https://github.com/cdsousa">@cdsousa</a> for the initial development&#41;. As a demo, let&#39;s start with loading a sample image:</p>
<pre><code class="language-julia-repl">julia&gt; using Images, TestImages, ColorTypes, FixedPointNumbers

julia&gt; img &#61; RGBA&#123;N0f8&#125;.&#40;testimage&#40;&quot;lighthouse&quot;&#41;&#41;</code></pre>
<p>We use RGBA since CUDA&#39;s texture hardware only supports 1, 2 or 4 channels. This support is also currently limited to &quot;plain&quot; types, so let&#39;s reinterpret the image:</p>
<pre><code class="language-julia-repl">julia&gt; img′ &#61; reinterpret&#40;NTuple&#123;4,UInt8&#125;, img&#41;</code></pre>
<p>Now we can upload this image to the array, using the <code>CuTextureArray</code> type for optimized storage &#40;normal <code>CuArray</code>s are supported too&#41;, and bind it to a <code>CuTexture</code> object that we can pass to a kernel:</p>
<pre><code class="language-julia-repl">julia&gt; texturearray &#61; CuTextureArray&#40;img′&#41;

julia&gt; texture &#61; CuTexture&#40;texturearray; normalized_coordinates&#61;true&#41;
512×768 4-channel CuTexture&#40;::CuTextureArray&#41; with eltype NTuple&#123;4,UInt8&#125;</code></pre>
<p>Let&#39;s write a kernel that warps this image. Since we specified <code>normalized_coordinates&#61;true</code>, we index the texture using values in <code>&#91;0,1&#93;</code>:</p>
<pre><code class="language-julia">function warp&#40;dst, texture&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x - 1&#41; * blockDim&#40;&#41;.x
    I &#61; CartesianIndices&#40;dst&#41;
    @inbounds if tid &lt;&#61; length&#40;I&#41;
        i,j &#61; Tuple&#40;I&#91;tid&#93;&#41;
        u &#61; Float32&#40;i-1&#41; / Float32&#40;size&#40;dst, 1&#41;-1&#41;
        v &#61; Float32&#40;j-1&#41; / Float32&#40;size&#40;dst, 2&#41;-1&#41;
        x &#61; u &#43; 0.02f0 * CUDA.sin&#40;30v&#41;
        y &#61; v &#43; 0.03f0 * CUDA.sin&#40;20u&#41;
        dst&#91;i,j&#93; &#61; texture&#91;x,y&#93;
    end
    return
end</code></pre>
<p>The size of the output image determines how many elements we need to process. This needs to be translated to a number of threads and blocks, keeping in mind device and kernel characteristics. We automate this using the occupancy API:</p>
<pre><code class="language-julia-repl">julia&gt; outimg_d &#61; CuArray&#123;eltype&#40;img′&#41;&#125;&#40;undef, 500, 1000&#41;;

julia&gt; function configurator&#40;kernel&#41;
           config &#61; launch_configuration&#40;kernel.fun&#41;
           threads &#61; Base.min&#40;length&#40;outimg_d&#41;, config.threads&#41;
           blocks &#61; cld&#40;length&#40;outimg_d&#41;, threads&#41;
           return &#40;threads&#61;threads, blocks&#61;blocks&#41;
       end

julia&gt; @cuda config&#61;configurator warp&#40;outimg_d, texture&#41;</code></pre>
<p>Finally, we fetch and visualize the output:</p>
<pre><code class="language-julia-repl">julia&gt; outimg &#61; Array&#40;outimg_d&#41;

julia&gt; save&#40;&quot;imgwarp.png&quot;, reinterpret&#40;eltype&#40;img&#41;, outimg&#41;&#41;</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-07-07-cuda_1.1/imgwarp.png" alt="Warped lighthouse">
</figure><h2 id="minor_features">Minor features</h2>
<p>The test-suite is now parallelized, using up to <code>JULIA_NUM_THREADS</code> processes:</p>
<pre><code class="language-bash">&#36; JULIA_NUM_THREADS&#61;4 julia -e &#39;using Pkg; Pkg.test&#40;&quot;CUDA&quot;&#41;;&#39;

                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        &#40;Worker&#41; | Time &#40;s&#41; | GC &#40;s&#41; | GC &#37; | Alloc &#40;MB&#41; | RSS &#40;MB&#41; | GC &#40;s&#41; | GC &#37; | Alloc &#40;MB&#41; | RSS &#40;MB&#41; |
initialization                   &#40;2&#41; |     2.52 |   0.00 |  0.0 |       0.00 |   115.00 |   0.05 |  1.8 |     153.13 |   546.27 |
apiutils                         &#40;4&#41; |     0.55 |   0.00 |  0.0 |       0.00 |   115.00 |   0.02 |  4.0 |      75.86 |   522.36 |
codegen                          &#40;4&#41; |    14.81 |   0.36 |  2.5 |       0.00 |   157.00 |   0.62 |  4.2 |    1592.28 |   675.15 |
...
gpuarrays/mapreduce essentials   &#40;2&#41; |   113.52 |   0.01 |  0.0 |       3.19 |   641.00 |   2.61 |  2.3 |    8232.84 |  2449.35 |
gpuarrays/mapreduce &#40;old tests&#41;  &#40;5&#41; |   138.35 |   0.01 |  0.0 |     130.20 |   507.00 |   2.94 |  2.1 |    8615.15 |  2353.62 |
gpuarrays/mapreduce derivatives  &#40;3&#41; |   180.52 |   0.01 |  0.0 |       3.06 |   229.00 |   3.44 |  1.9 |   12262.67 |  1403.39 |

Test Summary: |  Pass  Broken  Total
  Overall     | 11213       3  11216
    SUCCESS
    Testing CUDA tests passed</code></pre>
<p>A counterpart of <code>Base.versioninfo&#40;&#41;</code> is available to report on the CUDA toolchain and any devices:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 10.2.89, artifact installation
CUDA driver 11.0.0
NVIDIA driver 450.36.6

Libraries:
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 11.0.0&#43;450.36.6
- CUDNN: 7.6.5 &#40;for CUDA 10.2.0&#41;
- CUTENSOR: 1.1.0 &#40;for CUDA 10.2.0&#41;

Toolchain:
- Julia: 1.5.0-rc1.0
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device&#40;s&#41;:
- Quadro RTX 5000 &#40;sm_75, 14.479 GiB / 15.744 GiB available&#41;</code></pre>
<p>CUTENSOR artifacts have been upgraded to version 1.1.0.</p>
<p>Benchmarking infrastructure based on the Codespeed project has been set up at <a href="https://speed.juliagpu.org/">speed.juliagpu.org</a> to keep track of the performance of various operations.</p>
]]></content:encoded>
    
  <pubDate>Tue, 07 Jul 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDAnative.jl 3.0 and CuArrays.jl 2.0]]></title>
  <link>https://juliagpu.org/post/2020-03-25-cudanative_3.0-cuarrays_2.0/index.html</link>
  <guid>https://juliagpu.org/cudanative_3.0-cuarrays_2.0/</guid>
  <description><![CDATA[This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.]]></description>  
  
  <content:encoded><![CDATA[
<p>This post is located at <a href="https://juliagpu.org/cudanative_3.0-cuarrays_2.0/">/cudanative_3.0-cuarrays_2.0/</a></p>
<p>This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.</p>
<h2 id="api_changes">API changes</h2>
<p>Changes to certain APIs require these releases to be breaking; however, most users should not be affected, and chances are you can just bump your Compat entries without any additional changes. Flux.jl users will have to wait a little longer though, as the package uses non-public APIs that have changed and <a href="https://github.com/FluxML/Flux.jl/pull/1050">requires an update</a>.</p>
<h2 id="artifacts">Artifacts</h2>
<p>CUDA and its dependencies will now be automatically installed using artifacts generated by BinaryBuilder.jl. This greatly improves usability, and only requires a functioning NVIDIA driver:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;JULIA_DEBUG&quot;&#93; &#61; &quot;CUDAnative&quot;

julia&gt; using CUDAnative

julia&gt; CUDAnative.version&#40;&#41;
┌ Debug: Trying to use artifacts...
┌ Debug: Trying to use artifacts...
└ @ CUDAnative CUDAnative/src/bindeps.jl:52
┌ Debug: Using CUDA 10.2.89 from an artifact at /depot/artifacts/...
└ @ CUDAnative CUDAnative/src/bindeps.jl:108
v&quot;10.2.89&quot;</code></pre>
<p>Use of a local installation is still possible by setting the environment variable <code>JULIA_CUDA_USE_BINARYBUILDER</code> to false. For more details, refer to <a href="https://cuda.juliagpu.org/stable/installation/overview/">the documentation</a>.</p>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/492">CUDAnative.jl#492</a> and <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/490">CuArrays.jl#490</a></p>
<h2 id="method_redefinitions">Method redefinitions</h2>
<p>CUDAnative 3.0 now fully supports method redefinitions, commonly referred to as <a href="https://github.com/JuliaLang/julia/issues/265">Julia issue #265</a>, and makes it possible to use interactive programming tools like Revise.jl:</p>
<pre><code class="language-julia-repl">julia&gt; child&#40;&#41; &#61; 0

julia&gt; parent&#40;&#41; &#61; &#40;@cuprintln&#40;child&#40;&#41;&#41;; return&#41;

julia&gt; @cuda parent&#40;&#41;
0

julia&gt; parent&#40;&#41; &#61; &#40;@cuprintln&#40;child&#40;&#41; &#43; 1&#41;; return&#41;

julia&gt; @cuda parent&#40;&#41;
1

julia&gt; child&#40;&#41; &#61; 1

julia&gt; @cuda parent&#40;&#41;
2</code></pre>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/581">CUDAnative.jl#581</a></p>
<h2 id="experimental_multitasking_and_multithreading">Experimental: Multitasking and multithreading</h2>
<p>With CUDAnative 3.0 and CuArrays 2.0 you can now use Julia tasks and threads to organize your code. In combination with CUDA streams, this makes it possible to execute kernels and other GPU operations in parallel:</p>
<pre><code class="language-julia">function my_expensive_kernel&#40;&#41;
    return
end

@sync begin
    @async @cuda stream&#61;CuStream&#40;&#41; my_expensive_kernel&#40;&#41;
    @async @cuda stream&#61;CuStream&#40;&#41; my_expensive_kernel&#40;&#41;
end</code></pre>
<p>Every task, whether it runs on a separate thread or not, can work with a different device, as well as independently work with CUDA libraries like CUBLAS and CUFFT.</p>
<p>Note that this support is experimental, and lacks certain features to be fully effective. For one, the CuArrays memory allocator is not device-aware, and it is currently not possible to configure the CUDA stream for operations like map or broadcast.</p>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/609">CUDAnative.jl#609</a> and <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/645">CuArrays.jl#645</a></p>
<h2 id="minor_changes">Minor changes</h2>
<p>GPU kernels are now name-mangled like C&#43;&#43;, which offers better integration with NVIDIA tools &#40;<a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/559">CUDAnative.jl#559</a>&#41;.</p>
<p>A better N-dimensional <code>mapreducedim&#33;</code> kernel, properly integrating with all Base interfaces &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/pull/602">CuArrays.jl#602</a> and <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/246">GPUArrays#246</a>&#41;.</p>
<p>A <code>CuIterator</code> type for batching arrays to the GPU &#40;by @jrevels, <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/467">CuArrays.jl#467</a>&#41;.</p>
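<p>As a sketch of how that iterator might be used &#40;the batch contents here are illustrative&#41;:</p>
<pre><code class="language-julia">batches &#61; &#91;&#40;rand&#40;Float32, 10&#41;, rand&#40;Float32, 10&#41;&#41; for _ in 1:3&#93;
for &#40;x, y&#41; in CuIterator&#40;batches&#41;
    # x and y arrive as CuArrays; each batch is uploaded lazily
    # and freed eagerly once the next iteration starts
end</code></pre>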
<p>Integration with Base&#39;s 5-arg <code>mul&#33;</code> &#40;by @haampie, <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/641">CuArrays.jl#641</a> and <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/253">GPUArrays#253</a>&#41;.</p>
<p>Integration with Cthulhu.jl for interactive inspection of generated code &#40;<a href="https://github.com/JuliaGPU/CUDAnative.jl/issues/597">CUDAnative.jl#597</a>&#41;.</p>
<h2 id="known_issues">Known issues</h2>
<p>With a release as big as this one there are bound to be some bugs, e.g., with the installation of artifacts on exotic systems, or due to the many changes to make the libraries thread-safe. If you need absolute stability, please wait for a point release.</p>
<p>There are also some known issues. CUDAnative is currently not compatible with Julia 1.5 due to Base compiler changes &#40;<a href="https://github.com/JuliaLang/julia/issues/34993">julia#34993</a>&#41;, the new <code>mapreducedim&#33;</code> kernel appears to be slower in some cases &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/issues/611">CuArrays.jl#611</a>&#41;, and there are some remaining thread-safety issues when using the non-default memory pool &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/issues/647">CuArrays.jl#647</a>&#41;.</p>
]]></content:encoded>
    
  <pubDate>Wed, 25 Mar 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[New website for JuliaGPU]]></title>
  <link>https://juliagpu.org/post/2019-12-12-new_site/index.html</link>
  <guid>https://juliagpu.org/new_site/</guid>
  <description><![CDATA[Welcome to the new landing page for the JuliaGPU organization. This website serves as an introduction to the several packages for programming GPUs in Julia, with pointers to relevant resources for new users.]]></description>  
  
  <content:encoded><![CDATA[
<p>This post is located at <a href="https://juliagpu.org/new_site/">/new_site/</a></p>
<p>Welcome to the new landing page for the JuliaGPU organization. This website serves as an introduction to the several packages for programming GPUs in Julia, with pointers to relevant resources for new users.</p>
<p>The sources for this website are hosted at <a href="https://github.com/JuliaGPU/juliagpu.org">GitHub</a> and generated using Hugo; feel free to open an issue or pull request if you think it could be improved.</p>
]]></content:encoded>
    
  <pubDate>Thu, 12 Dec 2019 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>
</channel></rss>