cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency


Tim Besard

cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. New features include a random number generator and support for array slicing.

Performance: matching cuTile Python

Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):

Kernel                       Julia           Python          Δ
Vector Addition              845 GB/s        846 GB/s        =
Matrix Transpose             812 GB/s        814 GB/s        =
Layer Norm fwd               983 GB/s        716 GB/s        +37%
Layer Norm bwd               248 GB/s        251 GB/s        -1%
Matrix Multiplication        47.5 TFLOPS     43.5 TFLOPS     +9%
Batch Matrix Multiply        34.0 TFLOPS     30.8 TFLOPS     +10%
FFT (3-stage Cooley-Tukey)   529 μs          554 μs          +5%
Mixture of Experts           27.0 TFLOPS     20.1 TFLOPS     +34%
Attention (FMHA, causal)     103.6 TFLOPS    63.4 TFLOPS     +63%
Softmax (TMA)                849 GB/s        857 GB/s        -1%
Softmax (Chunked)            1684 GB/s       1640 GB/s       +3%

Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.

CUDA.jl integration: @cuda backend=cuTile

Until v0.3, launching a cuTile kernel meant calling cuTile.launch(...) directly. cuTile.jl now plugs into CUDA.jl's existing @cuda macro as a first-class backend, making it much easier to launch cuTile.jl kernels:

using CUDA, cuTile
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)

@cuda backend=cuTile blocks=8 vadd(a, b, c)
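The grid size need not be hard-coded: since each block here processes one 128-element tile, the block count can be derived from the array length. A sketch using Base's cld, assuming the array length is a multiple of the tile size (the kernel above always loads full 128-wide tiles):

```julia
# Launch one block per 128-element tile; cld rounds the division up.
# Reuses vadd, a, b and c from the example above.
tile = 128
@cuda backend=cuTile blocks=cld(length(a), tile) vadd(a, b, c)
```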

Time-to-first-launch

Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:

Benchmark 1: julia -e 'using CUDACore;
                       @cuda identity(nothing)'
  Time (mean ± σ):      1.882 s ±  0.012 s    [User: 2.554 s, System: 0.305 s]
  Range (min … max):    1.867 s …  1.906 s    10 runs

Benchmark 2: julia -e 'using CUDACore, cuTile;
                       @cuda backend=cuTile identity(nothing)'
  Time (mean ± σ):      1.840 s ±  0.009 s    [User: 2.488 s, System: 0.329 s]
  Range (min … max):    1.827 s …  1.859 s    10 runs

Array slicing

view and @view now derive sub-range TileArrays from existing ones:

function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :]                         # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))

Each index must be : or a UnitRange; other forms (StepRange, scalar indices, CartesianIndex, ...) are currently rejected at compile time. The result is itself a TileArray, and can be passed to ct.load / ct.store (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.
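Since a slice is itself a TileArray, views compose. A minimal sketch of a nested view (copy_block! is a hypothetical kernel name; shapes match the example above):

```julia
function copy_block!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2})
    sub  = @view A[1:16, :]      # sub-range TileArray: first 16 rows
    sub2 = @view sub[1:8, 1:8]   # nested view: top-left 8×8 of the slice
    t = ct.load(sub2; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_block!(A, B)
```

Both slice sizes are literal here, so the shapes fold to compile-time constants as described above.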

Random number generation

cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side cuTile.RNG handle for filling CuArrays. The kernel API mirrors Base.Random:

function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,))                 # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) noise!(A)

rand covers all of Int{8,16,32,64}, UInt{8,16,32,64}, Float16, BFloat16, Float32, and Float64; randn (via Box-Muller, sharing its uniforms with the existing rand path) and randexp (via -log(U)) cover the four floating-point types. ct.DeviceRNG() opens an independent stream inside a kernel; Random.seed! re-seeds.
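The Box-Muller transform mentioned above turns a pair of uniform variates into a pair of independent normal variates. A plain-Julia CPU sketch of the underlying math (for illustration only, not cuTile.jl's actual device implementation):

```julia
using Random, Statistics

# Box-Muller: two uniforms u1, u2 ∈ (0, 1] → two independent N(0, 1) samples.
function box_muller(u1, u2)
    r = sqrt(-2 * log(u1))
    θ = 2π * u2
    return r * cos(θ), r * sin(θ)
end

rng = MersenneTwister(42)
zs = Float64[]
for _ in 1:500_000
    # 1 - rand() maps [0, 1) to (0, 1], keeping log() finite.
    z1, z2 = box_muller(1 - rand(rng), rand(rng))
    push!(zs, z1, z2)
end
mean(zs), std(zs)   # close to (0, 1)
```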

The host-side cuTile.RNG integrates with Random.rand! / Random.randn! / Random.randexp! and auto-advances its counter, so consecutive fills produce disjoint streams:

A = CUDA.zeros(Float32, 1 << 20)
rng = ct.RNG(42)
randn!(rng, A)                                 # fill via fused tile kernel
B = rand(rng, Float64, 16)                     # out-of-place

Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding that of cuRAND and GPUArrays.jl's new generator.
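To reproduce such a comparison locally, the host-side generator can be timed against CUDA.jl's default randn! path with BenchmarkTools.jl. A sketch (note the CUDA.@sync: fills are asynchronous, so the timing must include device synchronization):

```julia
using CUDA, cuTile, BenchmarkTools
import cuTile as ct
using Random

A = CUDA.zeros(Float32, 1 << 24)
rng = ct.RNG(42)

# cuTile.jl's fused tile kernel vs. CUDA.jl's default generator.
@btime CUDA.@sync randn!($rng, $A)
@btime CUDA.@sync randn!($A)
```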

What's next

If you've been watching cuTile.jl from a distance, now is a good time to try it out: add cuTile from the Julia package REPL, or grab the examples to see how the moving parts fit together.

On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar, covering the design of CUDA Tile, how cuTile.jl is built, and several relevant examples. Click here to sign up.