cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support
Tim Besard
cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA's tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will present it in a joint webinar with NVIDIA on May 12.
The release ships with two new examples that exercise many of the features described below: a fused Mixture of Experts kernel with token routing via gather/scatter, and a Flash Multi-Head Attention implementation with online softmax and causal masking.
Breaking changes
- ct.where removed: use ifelse.(cond, x, y) (standard Julia broadcast);
- FP mode kwargs removed: per-operation rounding_mode and flush_to_zero kwargs on reductions/scans are replaced with ct.@fpmode blocks (see below);
- Matmul batch dimensions: muladd now uses trailing batch dims (M, K, B...), matching Julia convention, instead of leading (B, M, K).
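As a plain-Julia illustration of the ct.where replacement: the broadcast below runs on ordinary arrays, and the same expression applies elementwise to tiles inside a kernel.

```julia
# Replacement for the removed ct.where: broadcast Base.ifelse.
cond = [true, false, true]
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# Select from x where cond is true, from y elsewhere.
result = ifelse.(cond, x, y)  # [1.0, 20.0, 3.0]
```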
Native for loops
Previously, cuTile.jl required a while-loop workaround for iteration. Starting with v0.2, standard Julia for loops work directly:
for k in Int32(1):K_tiles
    a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
    b = ct.load(B; index=(k, pid_n), shape=(TILE_K, TILE_N))
    acc = muladd(a, b, acc)
end
The compiler recognizes Julia's iterator protocol and lowers for i in start:stop and for i in start:step:stop to Tile IR ForOp directly.
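One practical note, about plain Julia rather than cuTile specifically: writing the bound as Int32, as in the loop above, keeps the loop counter 32-bit, since a range over Int32 endpoints yields Int32 values.

```julia
K_tiles = Int32(4)
r = Int32(1):K_tiles

eltype(r)   # Int32: the loop variable k stays 32-bit
collect(r)  # Int32[1, 2, 3, 4]
```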
Floating-point mode: ct.@fpmode
A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:
ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin
    s = sum(tile; dims=1)
    x = exp2.(tile)
end
Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.
Keyword arguments for operations
Most cuTile operations now use keyword arguments, aligning with cuTile Python's API and making call sites more readable:
- load/store: index, shape, and tile are now kwargs (ct.load(arr; index=pid, shape=(M, N))). Positional syntax still works.
- arange: now works without a type, defaulting to Int32. Use dtype for other types (e.g. ct.arange(16; dtype=Int64)).
- gather/scatter: new mask, padding_value, and check_bounds kwargs. User masks are AND'd with automatic bounds masks; check_bounds=false skips bounds comparisons when indices are known safe.
- Atomics: all atomic operations accept check_bounds to optionally skip the bounds mask.
- allow_tma: default changed from true to nothing (compiler decides).
Experimental host abstractions
cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the ct.Tiled wrapper type, which represents a tiled view of an array.
Broadcasting fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:
ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)
# Or via the convenience macro (wraps all arrays automatically):
ct.@. C = A + sin(B)
mapreduce on Tiled arrays generates a tiled reduction kernel:
mapreduce(identity, +, ct.Tiled(A); dims=1)
These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default CuArray operations in CUDA.jl.
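On plain Julia arrays, the same mapreduce call has the familiar Base semantics; the Tiled version is intended to produce the same result via a generated kernel. A CPU-side sketch, without cuTile:

```julia
A = [1 2 3;
     4 5 6]

# Column sums; the result stays a 1×3 matrix because dims is specified.
colsums = mapreduce(identity, +, A; dims=1)  # [5 7 9]
```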
Debugging with print/println
You can now use standard Julia print and println inside kernels:
function debug_kernel(A, tile_size::Int)
    pid = ct.bid(1)
    tile = ct.load(A; index=pid, shape=(tile_size,))
    println("Block ", pid, ": sum=", sum(tile; dims=1))
    return
end
String constants, scalars, and tiles can be mixed freely. String interpolation ("x=$x") is also supported.
Minor changes
- Atomics: atomic_max, atomic_min, atomic_or, atomic_and, and atomic_xor join the existing atomic_cas, atomic_xchg, and atomic_add;
- fill/zeros/ones overlays: standard Base constructors now work inside kernels (e.g. zeros(Float32, 64, 64)), thanks to @AntonOresten;
- isnan: works via a single unordered float self-comparison, avoiding Julia's bit-manipulation fallback;
- Debug info: source file and line information is now embedded in Tile IR bytecode;
- Julia 1.13: support for the upcoming Julia release.
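The isnan lowering relies on the IEEE 754 rule that NaN is unordered with every value, including itself. In plain Julia terms:

```julia
# NaN is the only floating-point value for which x != x,
# so a single self-comparison suffices to detect it.
selfcmp_isnan(x::AbstractFloat) = x != x

selfcmp_isnan(NaN32)  # true
selfcmp_isnan(1.0f0)  # false
selfcmp_isnan(Inf)    # false: Inf is ordered, unlike NaN
```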
Performance improvements
A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.
The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:
# FMA fusion: mulf + addf → fma
@rewrite addf(mulf(~a, ~b), ~c) => fma(~a, ~b, ~c)
# pow(x, 2) → x * x
@rewrite pow(~x, broadcast(constant(2.0))) => mulf(~x, ~x)
Built on this, the pipeline includes:
- Algebraic simplification cancels matching arithmetic pairs like (x + 1) - 1 → x, even when reshapes or broadcasts sit in between. This eliminates the overhead of Julia's 1-based indexing normalization;
- Comparison strength reduction canonicalizes patterns like (x + 1) <= y into x < y, collapsing the two-instruction arange-plus-compare that results from Julia's 1-based arange mask idiom down to a single comparison. This alone reduced layernorm's SASS from 10036 to 3253 instructions;
- Pow2 strength reduction replaces pow(x, 2) with x * x, eliminating the expensive pow transcendental in layernorm's variance computation;
- LICM hoists loop-invariant operations out of loop bodies;
- Constant folding and propagation evaluates compile-time-known arithmetic and tracks constants through the IR for further optimizations;
- Alias-aware token ordering uses alias analysis to identify independent memory operations on different arrays, avoiding unnecessary serialization that previously blocked instruction-level parallelism.
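The comparison rewrite is the usual integer identity, valid as long as x + 1 does not overflow at typemax. A quick exhaustive check over a small range in plain Julia:

```julia
# (x + 1) <= y  ⟺  x < y  for integers, barring overflow at typemax.
equivalent(x, y) = ((x + 1) <= y) == (x < y)

all(equivalent(x, y) for x in -8:8, y in -8:8)  # true
```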
Thanks to these improvements, all examples from cuTile Python that have been ported to Julia using cuTile.jl perform within 10% of their Python counterparts, and some are even faster. For up-to-date performance comparisons, see the cuTile.jl README.
Upcoming webinar
On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar. We will cover the package's design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what's next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the JuliaHub event.