cuTile.jl 0.3: CUDA.jl integration, and even better performance & latency
Tim Besard
cuTile.jl v0.3 integrates with CUDA.jl, making it even easier to write and run CUDA Tile kernels in Julia. Performance has also been greatly improved, closing the gap with cuTile Python on every benchmark we ship. New features include a random number generator and support for array slicing.
Performance: matching cuTile Python
Three months ago, several of our benchmarks lagged cuTile Python by 5–15%. Today, cuTile.jl matches or outperforms cuTile Python on every kernel we ship. The headline numbers (RTX 5080, tileiras 13.2.51):
| Kernel | Julia | Python | Δ |
|---|---|---|---|
| Vector Addition | 845 GB/s | 846 GB/s | = |
| Matrix Transpose | 812 GB/s | 814 GB/s | = |
| Layer Norm fwd | 983 GB/s | 716 GB/s | +37% |
| Layer Norm bwd | 248 GB/s | 251 GB/s | -1% |
| Matrix Multiplication | 47.5 TFLOPS | 43.5 TFLOPS | +9% |
| Batch Matrix Multiply | 34.0 TFLOPS | 30.8 TFLOPS | +10% |
| FFT (3-stage Cooley-Tukey) | 529 μs | 554 μs | +5% |
| Mixture of Experts | 27.0 TFLOPS | 20.1 TFLOPS | +34% |
| Attention (FMHA, causal) | 103.6 TFLOPS | 63.4 TFLOPS | +63% |
| Softmax (TMA) | 849 GB/s | 857 GB/s | -1% |
| Softmax (Chunked) | 1684 GB/s | 1640 GB/s | +3% |
Most of the gains come from extending the IR-level optimization pipeline introduced in v0.2 with a new dataflow framework that now powers several analyses and transformations.
CUDA.jl integration: @cuda backend=cuTile
Until v0.3, launching a cuTile kernel meant calling `cuTile.launch(...)` directly. cuTile.jl now plugs into CUDA.jl's existing `@cuda` macro as a first-class backend, making it much easier to launch cuTile.jl kernels:
```julia
using CUDA, cuTile
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = CUDA.zeros(Float32, 1024)

@cuda backend=cuTile blocks=8 vadd(a, b, c)
```
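The grid size above is hardcoded to match the 1024-element inputs. For other (tile-multiple) lengths, it can be derived on the host; a sketch, where the `vadd!` wrapper name is ours and handling of lengths that are not a multiple of the tile size is not shown:

```julia
const TILE = 128  # tile size used by the vadd kernel above

# Host-side wrapper: launch vadd over any length that is a multiple of TILE.
# cld (ceiling division) yields one block per 128-element tile.
function vadd!(c::CuArray{Float32,1}, a::CuArray{Float32,1}, b::CuArray{Float32,1})
    n = length(a)
    @assert length(b) == n == length(c)
    @cuda backend=cuTile blocks=cld(n, TILE) vadd(a, b, c)
    return c
end
```

Ragged tails would additionally require kernel-side masking, which this sketch does not attempt.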
Time-to-first-launch
Compiling a cuTile kernel goes through several stages: Julia type inference, our IR rewriting passes, Tile IR bytecode emission, and finally tileiras-driven CUBIN generation. None of these are fast. Significant effort in v0.3 went into reducing the time-to-first-launch, and the latency is now comparable to a typical CUDA.jl kernel launch on the same hardware:
```
Benchmark 1: julia -e 'using CUDA;
                       @cuda identity(nothing)'
  Time (mean ± σ):     1.882 s ± 0.012 s    [User: 2.554 s, System: 0.305 s]
  Range (min … max):   1.867 s … 1.906 s    10 runs

Benchmark 2: julia -e 'using CUDA, cuTile;
                       @cuda backend=cuTile identity(nothing)'
  Time (mean ± σ):     1.840 s ± 0.009 s    [User: 2.488 s, System: 0.329 s]
  Range (min … max):   1.827 s … 1.859 s    10 runs
```
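Packages that ship their own cuTile kernels can shave further first-load latency by precompiling the host-side code paths. A minimal sketch using PrecompileTools.jl; whether the cuTile.jl front half of the pipeline (inference and IR rewriting) is cacheable this way is our assumption, not something v0.3 documents:

```julia
module MyTileKernels  # hypothetical package wrapping a cuTile kernel

using CUDA, cuTile, PrecompileTools
import cuTile as ct

function vadd(a::ct.TileArray{Float32,1}, b::ct.TileArray{Float32,1},
              c::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    ct.store(c; index=pid, tile=ct.load(a; index=pid, shape=(128,)) +
                                ct.load(b; index=pid, shape=(128,)))
    return
end

# Cache the CPU-side compilation stages in the package image at precompile
# time; tileiras-driven CUBIN generation still happens on the target GPU
# at first launch.
@compile_workload begin
    precompile(vadd, (ct.TileArray{Float32,1}, ct.TileArray{Float32,1},
                      ct.TileArray{Float32,1}))
end

end # module
```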
Array slicing
`view` and `@view` now derive sub-range `TileArray`s from existing ones:
```julia
function copy_rows!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2},
                    i::Int32, j::Int32)
    sub = @view A[i:j, :]   # sub-range TileArray
    t = ct.load(sub; index=(1, 1), shape=(8, 8))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_rows!(A, B, Int32(3), Int32(10))
```
Each index must be `:` or a `UnitRange`; other forms (`StepRange`, scalar indices, `CartesianIndex`, ...) are currently rejected at compile time. The result is itself a `TileArray` and can be passed to `ct.load` / `ct.store` (or sliced again, for nested views). The new divisibility analysis sees through the slicing chain, so contiguous-axis fast paths are preserved, while literal slice sizes fold to compile-time-constant shape operands.
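Since a slice is itself a `TileArray`, views compose. A small sketch of a nested view (the kernel name and tile sizes here are illustrative); because the slice bounds are literals, the divisibility analysis should fold the resulting shapes to compile-time constants:

```julia
function copy_corner!(A::ct.TileArray{Float32,2}, B::ct.TileArray{Float32,2})
    outer = @view A[1:64, :]         # sub-range TileArray
    inner = @view outer[1:32, 1:32]  # slicing a slice: still a TileArray
    t = ct.load(inner; index=(1, 1), shape=(32, 32))
    ct.store(B; index=(1, 1), tile=t)
    return
end

@cuda backend=cuTile copy_corner!(A, B)
```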
Random number generation
cuTile.jl now ships a tile-vectorized Philox2x32-7 RNG, both as in-kernel intrinsics and as a host-side `cuTile.RNG` handle for filling `CuArray`s. The kernel API mirrors `Base.Random`:
```julia
function noise!(out::ct.TileArray{Float32,1})
    pid = ct.bid(1)
    t = randn(Float32, (128,))   # in-kernel randn
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) noise!(A)
```
`rand` covers all of `Int{8,16,32,64}`, `UInt{8,16,32,64}`, `Float16`, `BFloat16`, `Float32`, and `Float64`; `randn` (via Box-Muller, sharing its uniforms with the existing `rand` path) and `randexp` (via `-log(U)`) cover the four floating-point types. `ct.DeviceRNG()` opens an independent stream inside a kernel; `Random.seed!` re-seeds it.
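Putting those pieces together, a sketch of an explicitly seeded in-kernel stream; passing the RNG as the first argument mirrors `Base.Random`, but the exact `ct.DeviceRNG` call pattern is our assumption:

```julia
using Random

function seeded_noise!(out::ct.TileArray{Float32,1}, seed::UInt64)
    pid = ct.bid(1)
    rng = ct.DeviceRNG()               # independent in-kernel stream
    Random.seed!(rng, seed)            # re-seed for reproducibility
    t = randexp(rng, Float32, (128,))  # exponential variates, via -log(U)
    ct.store(out; index=pid, tile=t)
    return
end

@cuda backend=cuTile blocks=cld(N, 128) seeded_noise!(A, UInt64(42))
```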
The host-side `cuTile.RNG` integrates with `Random.rand!` / `Random.randn!` / `Random.randexp!` and auto-advances its counter, so consecutive fills produce disjoint streams:
```julia
A = CUDA.zeros(Float32, 1 << 20)
rng = ct.RNG(42)

randn!(rng, A)                 # fill via fused tile kernel
B = rand(rng, Float64, 16)     # out-of-place
```
Performance of both the in-kernel and host-side APIs is excellent, matching or exceeding that of cuRAND and GPUArrays.jl's new generator.
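Because the counter auto-advances, reproducibility comes from the seed rather than from any shared handle: two handles constructed with the same seed should replay the same stream. A sketch of the expected behavior:

```julia
using CUDA, cuTile, Random
import cuTile as ct

A = CUDA.zeros(Float32, 4096)
B = CUDA.zeros(Float32, 4096)

rng1 = ct.RNG(1234)
rng2 = ct.RNG(1234)
rand!(rng1, A)
rand!(rng2, B)
@assert Array(A) == Array(B)   # same seed, same counter: identical streams

rand!(rng1, A)                 # counter has advanced: a disjoint fill
@assert Array(A) != Array(B)
```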
What's next
If you've been watching cuTile.jl from a distance, now's a good time to try it out: run `] add cuTile` in the Julia REPL, or grab the examples to see how the moving parts fit together.
On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar, covering the design of CUDA Tile, how cuTile.jl is built, and several worked examples. Click here to sign up.