Metal.jl 1.10: Linear algebra, FFTs, and a faster runtime
Christian Guinard, Tim Besard
Metal.jl 1.10 is a big release. It adds native matrix multiplication, GPU-accelerated linear solvers and FFTs, BFloat16 support, and MPS-backed reductions, scans and sorting. The runtime also got considerably faster and leaner, and there is a new in-process profiler.
Before getting into the new features, one thing to flag up front: Metal.jl 1.10 requires macOS 14 or later, up from macOS 13. On older systems the package now refuses to initialize, and Metal.functional() returns false. The supported range is macOS 14 through 26, on Julia 1.10 through 1.13.
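If your code also has to run on machines that don't meet the new requirement, one option is to branch on Metal.functional(). The snippet below is a minimal sketch of that pattern, not something prescribed by the release; the ArrayT name is just an illustrative local variable:

```julia
using Metal

# Use the GPU array type only when Metal.jl can actually initialize
# (macOS 14+ with a supported GPU); otherwise fall back to a plain CPU Array.
ArrayT = Metal.functional() ? MtlArray : Array

x = ArrayT(Float32[1, 2, 3])
y = x .+ 1f0            # same broadcast code on either array type

println(typeof(y))
```

The rest of the program can stay array-type agnostic, so the fallback costs one line.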
Tied to that requirement is a change in how kernels are compiled. Previously Metal.jl pinned a conservative baseline (AIR 2.5 / metallib v1.2.6) regardless of the host. Since Metal.jl only ever compiles for the machine it runs on, it now emits the newest AIR, MSL and metallib versions the host macOS supports, exactly like Apple's offline metal compiler does. That unlocks newer language features for free: AIR 2.6 / Metal 3.1 on macOS 14, up to AIR 2.8 / Metal 4.0 on macOS 26 (and Metal 4.1 on the macOS 27 beta). You can see what your machine targets in versioninfo:
julia> Metal.versioninfo()
macOS 26.6.0, Darwin 25.6.0
Toolchain:
- Julia: 1.12.6
- LLVM: 18.1.7
- Metal: 4.0 (MSL), 2.8 (AIR), 1.2.9 (metallib)
Julia packages:
- Metal.jl: 1.10.0
- GPUArrays: 11.5.8
- GPUCompiler: 1.22.7
- KernelAbstractions: 0.9.42
- ObjectiveC: 6.0.0
- LLVM: 9.10.0
- LLVMDowngrader_jll: 0.8.1+0
1 device:
- Apple M3 Pro (14 GPU cores, 80.000 KiB allocated; Apple9, Metal4 family)
Native matrix multiplication
Up to now, every A * B on an MtlArray went straight to Apple's vendor libraries (Metal Performance Shaders or MPSGraph). That works well on large matrices, but it leaves us at the mercy of the vendor: there are eltypes MPS does not support, small matrices pay a steep launch overhead, and bugs like the M1/M2 matmul NaN issue are out of our hands. Anything unsupported fell back to GPUArrays' generic implementation, resulting in poor performance.
Metal.jl 1.10 ships its own native GEMM kernels. You pick a backend through the Metal.matmul_alg scoped value, which defaults to :auto:
- :scalar is a per-element tiled kernel that handles any Metal-supported eltype (integers, complex, BFloat16) and any transpose or offset. It's the universal fallback.
- :simd is a simdgroup_matrix kernel for Float16/Float32 (and BFloat16) with Float32 accumulation.
- :tensor is a Metal 4 tensor_ops::matmul2d kernel, available on Metal 4-capable devices running macOS 26+.
- :native picks the best of the three, per device and per operand.
- :auto (the default) tries the vendor libraries first, then falls back to :native.
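As a rough sketch of how a backend could be selected for a block of code: the example below assumes Metal.matmul_alg behaves like a standard scoped value that can be rebound with `with` from Base.ScopedValues (Julia 1.11+; on Julia 1.10 the ScopedValues.jl package provides the same function). That assumption is not spelled out above, so treat the exact import as unverified.

```julia
using Metal
using Base.ScopedValues: with   # assumption: Julia 1.11+; use ScopedValues.jl on 1.10

A = MtlArray(rand(Float32, 64, 64))
B = MtlArray(rand(Float32, 64, 64))

# Default behaviour (:auto): vendor libraries first, native kernels as fallback.
C_auto = A * B

# Force Metal.jl's own kernels for everything inside the do-block only.
C_native = with(Metal.matmul_alg => :native) do
    A * B
end

# Both paths should agree up to Float32 rounding.
println(Array(C_auto) ≈ Array(C_native))
```

Because the setting is scoped, it reverts automatically when the block exits and does not leak into other tasks.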
Linear solvers
Closing a long-standing request, Metal.jl now runs many more standard LinearAlgebra operations on Float32/Float16 MtlMatrix inputs on the GPU, through MPS-backed solvers. That covers \, lu, cholesky (including on Symmetric/Hermitian wrappers), triangular solves, and inv/det/logdet:
julia> using Metal, LinearAlgebra
julia> A = MtlArray(rand(Float32, 512, 512) + 512I);
julia> b = MtlArray(rand(Float32, 512));
julia> x = A \ b; # MPS LU solve, on the GPU
julia> norm(Array(A) * Array(x) - Array(b)) # residual, at the Float32 noise floor
4.4064095f-6
julia> M = MtlArray(rand(Float32, 256, 256));
julia> logdet(cholesky(Symmetric(M'M + I))) # cholesky factorization, also on the GPU
638.44586f0
BFloat16
BFloat16 arrays now run natively on the GPU as well:
julia> using Metal, BFloat16s
julia> a = MtlArray(BFloat16[1.5, 2.5, 3.5])
3-element MtlVector{BFloat16, Metal.PrivateStorage}:
1.5
2.5
3.5
julia> sum(a .* BFloat16(2))
BFloat16(15.0f0)
All Julia versions are supported, but before Julia 1.13 operations involving scalar BFloat16 values (e.g. a .+ BFloat16(1)) may be slower because they go through a software emulation path in BFloat16s.jl.
FlashAttention example
To tie the new building blocks together, there is a FlashAttention example that spells out scaled dot-product attention in four different ways, one per programming model Metal.jl exposes:
- with plain array operations (*, broadcasting, maximum, sum, exp);
- with MPSGraph's fused scaledDotProductAttention op;
- with a hand-written kernel built on MtlSimdgroupMatrix{Float16,8,8};
- and with a fused kernel using the Metal 4 tensor_ops::matmul2d primitives.
It's a good read if you want to see how the simdgroup and tensor intrinsics look in practice; you'll find it in examples/flashattention.jl.
Reductions, scans and sorting
Reductions, prefix scans and sorting now route through MPSGraph when it makes sense. This speeds up reductions and scans, and introduces support for sorting:
julia> sort(MtlVector(Int16[5, -3, 2, 9, -7, 0]))
6-element MtlVector{Int16, Metal.PrivateStorage}:
-7
-3
0
2
5
9
julia> accumulate(max, MtlVector(Int32[1, 3, 2, 5, 4]))
5-element MtlVector{Int32, Metal.PrivateStorage}:
1
3
3
5
5
Neural-network primitives
Metal.jl 1.10 also wraps the core MPSGraph neural-network primitives: softmax and logsoftmax, 2D convolution, and max/mean pooling, each with its gradient. These are wired up as the Metal backend for NNlib.jl, so once that release lands, Flux models gain GPU acceleration on Apple hardware through the functions you already use (conv, maxpool, softmax, …) rather than any Metal-specific API.
FFTs
On the back of the MPSGraph work, Metal.jl now supports FFTs through the AbstractFFTs.jl interface:
julia> using Metal, AbstractFFTs
julia> x = MtlArray(rand(ComplexF32, 2048, 2048));
julia> y = fft(x); # just works
julia> Array(ifft(y)) ≈ Array(x)
true
julia> p = plan_fft(x); # reusable plans, too
julia> Array(p * x) ≈ Array(y)
true
Real transforms (rfft/irfft), transforms along specific dimensions, and batched transforms are all supported. Running on the GPU is a large win over a CPU FFT, even one backed by AppleAccelerate. The following are timings on a 30-core M2 Max:
| Size | CPU (FFTW) | CPU (FFTW + AppleAccelerate) | GPU (Metal) | speedup vs. Accelerate |
|---|---|---|---|---|
| 512×512 | 4.2 ms | 766.2 µs | 173.4 µs | 4.4× |
| 1024×1024 | 19.7 ms | 3.7 ms | 246.3 µs | 15× |
| 2048×2048 | 99.5 ms | 20.8 ms | 588.4 µs | 35× |
| 4096×4096 | 580.1 ms | 99.0 ms | 2.5 ms | 39× |
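For the real and per-dimension transforms mentioned above, a minimal sketch using the standard AbstractFFTs.jl calls rfft(x, dims) and irfft(y, n, dims) might look like this. The array sizes are arbitrary, and the sketch assumes these calls dispatch to the Metal backend in the same way fft does:

```julia
using Metal, AbstractFFTs

# 64 independent real signals of length 256, stored as columns.
x = MtlArray(rand(Float32, 256, 64))

# Real-to-complex transform along the first dimension only.
# A real input of length n yields n ÷ 2 + 1 complex frequencies.
y = rfft(x, 1)
println(size(y))

# The inverse needs the original length, since it cannot be recovered
# unambiguously from the half-spectrum.
x2 = irfft(y, size(x, 1), 1)
println(Array(x2) ≈ Array(x))
```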
A faster, leaner runtime
A lot of work in this cycle went into the cost of getting work onto the GPU and back.
Batched command submission. Metal.jl used to create, encode and commit a fresh command buffer for every single launch. It now keeps one command buffer open and submits launches into it, flushing on synchronization or other triggers. That amortizes the per-launch command-buffer overhead, which is the dominant cost for workloads built out of many small kernels.
Non-blocking synchronization. Synchronization now uses a spin-then-yield scheme, ported from CUDA.jl, instead of blocking inside Metal. The primary motivation is correctness (a blocked main thread can deadlock against a Metal callback that needs to do I/O), but it is also dramatically faster on the fast paths:
| Scenario | before | after | speedup |
|---|---|---|---|
| synchronize() on a queue that never ran work | 15.87 µs | 0.19 µs | ~86× |
| synchronize() when the queue is idle | 15.55 µs | 0.37 µs | ~42× |
| small kernel + synchronize() in a tight loop | 359 µs | 149 µs | ~2.4× |
GC under memory pressure. Because MtlArray buffers are allocated by Metal, Julia's garbage collector can't see them, and on a unified-memory Mac that means it happily lets you allocate until the system starts paging and freezes. Metal.jl now reads the memory pressure straight from Metal and triggers an incremental GC when usage gets high: above 75% normally, and at a lower threshold at synchronization points, where the pause is hidden behind a wait anyway. Collections are rate-limited so that they never take more than a small fraction of wall-clock time.
Cheaper object lifetimes. The hand-rolled retain/release/finalizer bookkeeping for Metal objects was replaced with ObjectiveC.jl's automatic reference counting, removing a few hundred lines of fiddly code and simplifying per-launch bookkeeping.
Faster large copies. Shared-storage GPU→GPU copies used to always go through a CPU memcpy. For large arrays it's faster to use a GPU blit, so copies above 32 MB now switch to that path (small copies stay on memcpy, where the API overhead would dominate):
| Size | before (CPU memcpy) | after | speedup |
|---|---|---|---|
| 64 MB | 3.28 ms | 1.19 ms | 2.8× |
| 256 MB | 6.54 ms | 2.08 ms | 3.1× |
| 1024 MB | 21.55 ms | 5.98 ms | 3.6× |
Separately, copies larger than 4 GiB no longer silently fail; they are chunked into pieces Metal can handle.
Time to first kernel. A real precompilation workload plus some despecialization brought the time to a first kernel down significantly:
$ julia -e 'using Metal; a = MtlArray([1, 2, 3]); @time a .+ 1'
0.161035 seconds (178.09 k allocations: 8.552 MiB, 52.46% compilation time: 22% of which was recompilation)
Compare that to the previous version of Metal.jl:
$ julia -e 'using Metal; a = MtlArray([1, 2, 3]); @time a .+ 1'
8.133787 seconds (33.91 M allocations: 1.636 GiB, 3.64% gc time, 99.25% compilation time: 1% of which was recompilation)
A profiler that doesn't need Xcode
Timing a single kernel with BenchmarkTools is easy enough, but understanding where time goes in a larger program used to mean reaching for Xcode's Instruments. Metal.jl 1.10 adds an in-process profiler, Metal.@profile, that captures the GPU operations Metal.jl submits and prints a summary, no Xcode required:
julia> a = Metal.rand(Float32, 1024, 1024); b = similar(a); c = similar(a);
julia> b .= a .+ 1f0; c .= sqrt.(b); Metal.synchronize(); # warm up
julia> Metal.@profile begin
b .= a .+ 1f0
c .= sqrt.(b)
end
Profiled over 58.7 ms.
Host-side activity: 42 Objective-C calls taking 118.0 µs (0.20% of wall-clock)
┌──────────┬────────────┬───────┬──────────────────────────────────────────┐
│ Time (%) │ Total time │ Calls │ Name │
├──────────┼────────────┼───────┼──────────────────────────────────────────┤
│ 0.06% │ 33.29 µs │ 2 │ [MTLCommandBuffer commit] │
│ 0.04% │ 22.17 µs │ 2 │ [MTLCommandQueue commandBuffer] │
│ 0.04% │ 22.0 µs │ 2 │ [MTLCommandBuffer computeCommandEncoder] │
│ ... │ ... │ ... │ ... │
└──────────┴────────────┴───────┴──────────────────────────────────────────┘
Device-side activity: GPU was busy 831.75 µs (1.42% of wall-clock)
┌──────────┬────────────┬───────┬───────────────────────────┬──────────────┐
│ Time (%) │ Total time │ Calls │ Time distribution │ Name │
├──────────┼────────────┼───────┼───────────────────────────┼──────────────┤
│ 1.42% │ 831.75 µs │ 2 │ 415.87 µs ± 220.62 │ broadcast_2d │
└──────────┴────────────┴───────┴───────────────────────────┴──────────────┘
The host table groups the Objective-C calls, the device table groups kernels and blits, and the slowest operations are color-highlighted. Pass trace=true for a chronological timeline (with threadgroup, occupancy and threadgroup-memory columns) instead of a summary, and use Metal.@bprofile to benchmark a snippet by running it repeatedly. The old Xcode-based capture is still there under Metal.@profile external=true. One caveat worth knowing: MPS and MPSGraph operations, including the default matmul backend, submit their own command buffers and don't show up in the integrated trace yet, so reach for the external profiler to inspect those.
Better debugging
Device-side printing. On macOS 15+, you can now print from inside a kernel, built on Apple's os_log. There's @mtlprintf, plus the friendlier @mtlprint, @mtlprintln, and @mtlshow:
julia> function device_println()
@mtlprintln("Hello, world!")
return
end
julia> @metal device_println();
Hello, world!
It also wires up KernelAbstractions' @print, so the same works in KA kernels.
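A KernelAbstractions version of the same idea could look roughly like the following. This is a sketch rather than an excerpt from the release: the kernel name hello! and the out array are made up for illustration, and like the example above it needs macOS 15+ for the printed output to appear.

```julia
using Metal, KernelAbstractions

# Hypothetical kernel: each work-item marks its slot and prints a message.
@kernel function hello!(out)
    i = @index(Global)
    out[i] = 1f0
    @print("Hello from KernelAbstractions!\n")
end

backend = MetalBackend()
out = KernelAbstractions.zeros(backend, Float32, 4)

# Instantiate for the Metal backend with a workgroup size of 4, then launch.
hello!(backend, 4)(out; ndrange = length(out))
KernelAbstractions.synchronize(backend)
```

Since the kernel is written against KernelAbstractions, the same source can be launched on other backends by swapping the backend object.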
Richer exceptions. When a kernel throws, say a bounds error, that used to surface as an opaque failure. Device exceptions are now reported back to the host as a KernelException carrying the actual cause:
julia> function oob(a)
a[2] = 1f0 # a has length 1
return
end
julia> a = MtlArray(zeros(Float32, 1));
julia> @metal threads=1 oob(a)
ERROR: KernelException: A BoundsError was thrown
Launching with debug_level=2 adds a full device-side stacktrace. The detailed machinery only kicks in at the higher debug level, so the common case stays fast.
Device-side allocation. A minimal device-side malloc means kernels that need dynamic allocation (notably exception-throwing code and some broadcasts) now compile and run where they previously failed outright.
New intrinsics
The set of SIMD-group (warp-level) primitives is now much more complete. Metal.jl 1.10 adds the indexed simd_shuffle/simd_shuffle_xor shuffles, the simd_ballot/simd_vote_all/simd_vote_any voting intrinsics, and the full set of quad-group (4-thread) equivalents: quad_shuffle, quad_ballot, quad_vote_all, and friends. There are also UInt16 variants of every thread- and grid-indexing intrinsic (thread_position_in_grid_i16() and so on) for when 16-bit indices are enough.
Please refer to the Metal Shading Language Specification to verify where indexing types must match for a kernel to be valid.
Other improvements
Metal.jl 1.10 includes plenty more:
- The default host RNG switched to the GPUArrays counter-based generator, matching CUDA.jl. It is faster for randn, fixes the NaN issue, and supports types MPS can't (Float16, complex types, etc). The MPS generator is still available via Metal.mps_rng().
- unsafe_wrap can now build multi-dimensional MtlArrays from a pointer without a copy.
One breaking fix to be aware of: launching a kernel with a grid dimension larger than typemax(UInt32) used to silently truncate. It now raises an error instead, so use grid-stride loops for kernels that need to cover arrays larger than that.
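For reference, a grid-stride loop launches a fixed, modest number of threads and has each one walk the array in steps of the total grid size. The sketch below uses a small array and a hypothetical scale! kernel purely for illustration; the indices are widened to Int so the same loop would remain valid past typemax(UInt32) elements.

```julia
using Metal

# Hypothetical kernel: multiply every element of `a` by `s`.
function scale!(a, s)
    # Widen to Int so the running index cannot overflow 32 bits.
    i = Int(thread_position_in_grid_1d())
    stride = Int(threads_per_grid_1d())
    while i <= length(a)
        @inbounds a[i] *= s
        i += stride          # jump ahead by the total number of launched threads
    end
    return
end

a = MtlArray(ones(Float32, 10_000))

# 8 threadgroups of 256 threads = 2048 threads covering 10_000 elements.
@metal threads=256 groups=8 scale!(a, 2f0)
Metal.synchronize()

println(all(Array(a) .== 2f0))
```

The launch configuration no longer depends on the array length, so the grid dimension stays well within the limit regardless of input size.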
As always, update to the latest version to get these improvements, and check out the changelog for the full list.