cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support
Tim Besard
cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA's tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will present it in a joint webinar with NVIDIA on May 12.
The release ships with two new examples that exercise many of the features described below: a fused Mixture of Experts kernel with token routing via gather/scatter, and a Flash Multi-Head Attention implementation with online softmax and causal masking.
Breaking changes
- ct.where removed: use ifelse.(cond, x, y) (standard Julia broadcast);
- FP mode kwargs removed: per-operation rounding_mode and flush_to_zero kwargs on reductions/scans are replaced with ct.@fpmode blocks (see below);
- Matmul batch dimensions: muladd now uses trailing batch dims (M, K, B...), matching Julia convention, instead of leading (B, M, K).
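As a plain-Julia illustration of the ct.where replacement: the broadcast below runs on ordinary arrays, and the same expression applies elementwise to tiles inside a kernel.

```julia
# Replacement for the removed ct.where: broadcast Base.ifelse.
cond = [true, false, true]
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]

# Select from x where cond is true, from y elsewhere.
result = ifelse.(cond, x, y)  # [1.0, 20.0, 3.0]
```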
Native for loops
Previously, cuTile.jl required a while-loop workaround for iteration. Starting with v0.2, standard Julia for loops work directly:
for k in Int32(1):K_tiles
    a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
    b = ct.load(B; index=(k, pid_n), shape=(TILE_K, TILE_N))
    acc = muladd(a, b, acc)
end
The compiler recognizes Julia's iterator protocol and lowers for i in start:stop and for i in start:step:stop to Tile IR ForOp directly.
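One practical note, about plain Julia rather than cuTile specifically: writing the bound as Int32, as in the loop above, keeps the loop counter 32-bit, since a range over Int32 endpoints yields Int32 values.

```julia
K_tiles = Int32(4)
r = Int32(1):K_tiles

eltype(r)   # Int32: the loop variable k stays 32-bit
collect(r)  # Int32[1, 2, 3, 4]
```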
Floating-point mode: ct.@fpmode
A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:
ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin
    s = sum(tile; dims=1)
    x = exp2.(tile)
end
Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.
Keyword arguments for operations
Most cuTile operations now use keyword arguments, aligning with cuTile Python's API and making call sites more readable:
- load/store: index, shape, and tile are now kwargs (ct.load(arr; index=pid, shape=(M, N))). Positional syntax still works.
- arange: now works without a type, defaulting to Int32. Use dtype for other types (e.g. ct.arange(16; dtype=Int64)).
- gather/scatter: new mask, padding_value, and check_bounds kwargs. User masks are AND'd with automatic bounds masks; check_bounds=false skips bounds comparisons when indices are known safe.
- Atomics: all atomic operations accept check_bounds to optionally skip the bounds mask.
- allow_tma: default changed from true to nothing (compiler decides).
Experimental host abstractions
cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the ct.Tiled wrapper type, which represents a tiled view of an array.
Broadcasting fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:
ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)
# Or via the convenience macro (wraps all arrays automatically):
ct.@. C = A + sin(B)
mapreduce on Tiled arrays generates a tiled reduction kernel:
mapreduce(identity, +, ct.Tiled(A); dims=1)
These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default CuArray operations in CUDA.jl.
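On plain Julia arrays, the same mapreduce call has the familiar Base semantics; the Tiled version is intended to produce the same result via a generated kernel. A CPU-side sketch, without cuTile:

```julia
A = [1 2 3;
     4 5 6]

# Column sums; the result stays a 1×3 matrix because dims is specified.
colsums = mapreduce(identity, +, A; dims=1)  # [5 7 9]
```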
Debugging with print/println
You can now use standard Julia print and println inside kernels:
function debug_kernel(A, tile_size::Int)
    pid = ct.bid(1)
    tile = ct.load(A; index=pid, shape=(tile_size,))
    println("Block ", pid, ": sum=", sum(tile; dims=1))
    return
end
String constants, scalars, and tiles can be mixed freely. String interpolation ("x=$x") is also supported.
Minor changes
- Atomics: atomic_max, atomic_min, atomic_or, atomic_and, and atomic_xor join the existing atomic_cas, atomic_xchg, and atomic_add;
- fill/zeros/ones overlays: standard Base constructors now work inside kernels (e.g. zeros(Float32, 64, 64)), thanks to @AntonOresten;
- isnan: works via a single unordered float self-comparison, avoiding Julia's bit-manipulation fallback;
- Debug info: source file and line information is now embedded in Tile IR bytecode;
- Julia 1.13: support for the upcoming Julia release.
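The isnan lowering relies on the IEEE 754 rule that NaN is unordered with every value, including itself. In plain Julia terms:

```julia
# NaN is the only floating-point value for which x != x,
# so a single self-comparison suffices to detect it.
selfcmp_isnan(x::AbstractFloat) = x != x

selfcmp_isnan(NaN32)  # true
selfcmp_isnan(1.0f0)  # false
selfcmp_isnan(Inf)    # false: Inf is ordered, unlike NaN
```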
Performance improvements
A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.
The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:
# FMA fusion: mulf + addf → fma
@rewrite addf(mulf(~a, ~b), ~c) => fma(~a, ~b, ~c)
# pow(x, 2) → x * x
@rewrite pow(~x, broadcast(constant(2.0))) => mulf(~x, ~x)
Built on this, the pipeline includes:
- Algebraic simplification cancels matching arithmetic pairs like (x + 1) - 1 → x, even when reshapes or broadcasts sit in between. This eliminates the overhead of Julia's 1-based indexing normalization;
- Comparison strength reduction canonicalizes patterns like (x + 1) <= y into x < y, collapsing the two-instruction arange-plus-compare that results from Julia's 1-based arange mask idiom down to a single comparison. This alone reduced layernorm's SASS from 10036 to 3253 instructions;
- Pow2 strength reduction replaces pow(x, 2) with x * x, eliminating the expensive pow transcendental in layernorm's variance computation;
- LICM hoists loop-invariant operations out of loop bodies;
- Constant folding and propagation evaluates compile-time-known arithmetic and tracks constants through the IR for further optimizations;
- Alias-aware token ordering uses alias analysis to identify independent memory operations on different arrays, avoiding unnecessary serialization that previously blocked instruction-level parallelism.
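The comparison rewrite is the usual integer identity, valid as long as x + 1 does not overflow at typemax. A quick exhaustive check over a small range in plain Julia:

```julia
# (x + 1) <= y  ⟺  x < y  for integers, barring overflow at typemax.
equivalent(x, y) = ((x + 1) <= y) == (x < y)

all(equivalent(x, y) for x in -8:8, y in -8:8)  # true
```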
Thanks to these improvements, all examples from cuTile Python that have been ported to Julia using cuTile.jl perform within 10% of their Python counterparts, and some are even faster. For up-to-date performance comparisons, see the cuTile.jl README.
Upcoming webinar
On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar. We will cover the package's design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what's next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the JuliaHub event.