cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support


Tim Besard

cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA's tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will present it in a joint webinar with NVIDIA on May 12.

The release is showcased by two new examples that exercise many of the features described below: a fused Mixture of Experts kernel with token routing via gather/scatter, and a Flash Multi-Head Attention implementation with online softmax and causal masking.

Breaking changes

Native for loops

Previously, cuTile.jl required a while-loop workaround for iteration. Starting with v0.2, standard Julia for loops work directly:

for k in Int32(1):K_tiles
    a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
    b = ct.load(B; index=(k, pid_n), shape=(TILE_K, TILE_N))
    acc = muladd(a, b, acc)
end

The compiler recognizes Julia's iterator protocol and lowers for i in start:stop and for i in start:step:stop to Tile IR ForOp directly.
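In context, a complete tiled matmul kernel might look like the following sketch. The ct.store call and ct.zeros initializer are assumptions mirroring ct.load's keyword style, and TILE_M, TILE_N, TILE_K stand for compile-time tile sizes; consult the package's examples for the exact API.

```julia
import cuTile as ct

# Sketch of a tiled matmul kernel. ct.zeros and ct.store are assumed
# here, mirroring ct.load; the actual API may differ.
function matmul_kernel(A, B, C, K_tiles::Int32)
    # Each block computes one (TILE_M, TILE_N) tile of C.
    pid_m = ct.bid(1)
    pid_n = ct.bid(2)
    acc = ct.zeros(Float32, (TILE_M, TILE_N))
    for k in Int32(1):K_tiles
        a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
        b = ct.load(B; index=(k, pid_n), shape=(TILE_K, TILE_N))
        acc = muladd(a, b, acc)
    end
    ct.store(C; index=(pid_m, pid_n), value=acc)
    return
end
```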

Floating-point mode: ct.@fpmode

A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:

ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin
    s = sum(tile; dims=1)
    x = exp2.(tile)
end

Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.
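Based on the inheritance semantics described above, nesting might look like this (a sketch; only flush_to_zero is overridden in the inner block, so it keeps the outer rounding mode):

```julia
ct.@fpmode rounding_mode=ct.Rounding.Approx flush_to_zero=true begin
    x = exp2.(tile)
    ct.@fpmode flush_to_zero=false begin
        # Inherits Approx rounding from the enclosing block,
        # but overrides its flush-to-zero setting.
        s = sum(tile; dims=1)
    end
end
```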

Keyword arguments for operations

Most cuTile operations now use keyword arguments, aligning with cuTile Python's API and making call sites more readable.
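For example, a load now names each of its arguments explicitly (call taken from the loop example above):

```julia
# index and shape are keywords, so the role of each tuple is clear:
a = ct.load(A; index=(pid_m, k), shape=(TILE_M, TILE_K))
```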

Experimental host abstractions

cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the ct.Tiled wrapper type, which represents a tiled view of an array.

Broadcasting fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:

ct.Tiled(C) .= ct.Tiled(A) .+ ct.Tiled(B)

# Or via the convenience macro (wraps all arrays automatically):
ct.@. C = A + sin(B)

mapreduce on Tiled arrays generates a tiled reduction kernel:

mapreduce(identity, +, ct.Tiled(A); dims=1)

These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default CuArray operations in CUDA.jl.

Debugging with print/println

You can now use standard Julia print and println inside kernels:

function debug_kernel(A, tile_size::Int)
    pid = ct.bid(1)
    tile = ct.load(A; index=pid, shape=(tile_size,))
    println("Block ", pid, ": sum=", sum(tile; dims=1))
    return
end

String constants, scalars, and tiles can be mixed freely. String interpolation ("x=$x") is also supported.

Minor changes

Performance improvements

A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.

The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:

# FMA fusion: mulf + addf → fma
@rewrite addf(mulf(~a, ~b), ~c) => fma(~a, ~b, ~c)

# pow(x, 2) → x * x
@rewrite pow(~x, broadcast(constant(2.0))) => mulf(~x, ~x)
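The flavor of these rules can be illustrated with a toy rewriter over plain Julia expressions. This is not the cuTile.jl implementation, only a minimal sketch of matching a pattern and emitting a fused replacement:

```julia
# Toy illustration of pattern-based rewriting, here on Julia Exprs
# rather than Tile IR: match `a * b + c` and fuse it into `fma(a, b, c)`.
function fuse_fma(ex)
    if ex isa Expr && ex.head == :call && ex.args[1] == :+ && length(ex.args) == 3
        lhs, rhs = ex.args[2], ex.args[3]
        if lhs isa Expr && lhs.head == :call && lhs.args[1] == :* && length(lhs.args) == 3
            return :(fma($(lhs.args[2]), $(lhs.args[3]), $rhs))
        end
    end
    return ex  # no match: leave the expression unchanged
end

fuse_fma(:(x * y + z))  # :(fma(x, y, z))
```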

Built on this infrastructure, the compiler runs a series of such rewrite passes, along with other transformations, before emitting bytecode.

Thanks to these improvements, all cuTile Python examples that have been ported to Julia perform within 10% of their Python counterparts, and some are even faster. For up-to-date performance comparisons, see the cuTile.jl README.

Upcoming webinar

On May 12, 2026 at 1 PM ET, Tim Besard (JuliaHub) and Andy Terrel (NVIDIA) will present cuTile.jl in a joint webinar. We will cover the package's design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what's next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the JuliaHub event.