Technical preview: Programming Apple M1 GPUs in Julia with Metal.jl

Jun 24, 2022
Tim Besard

Julia has gained a new GPU back-end: Metal.jl, for working with Apple's M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who's already programmed GPUs in Julia. In the following post I'll demonstrate some of that functionality and explain how it works.

But first, note that Metal.jl is under heavy development: The package is considered experimental for now, as we're still working on squashing bugs and adding essential functionality. We also haven't optimized for performance yet. If you're interesting in using Metal.jl, please consider contributing to its development! Most of the package is written in Julia, and checking-out the source code is a single Pkg.develop away :-)

Quick start

Start by getting a hold of the upcoming Julia 1.8, launch it, and enter the package manager by pressing ]:

julia> ]

pkg> add Metal
  Installed Metal

Installation is as easy as that, and we'll automatically download the necessary binary artifacts (a C wrapper for the Metal APIs, and an LLVM back-end). Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the versioninfo() method for some details on the toolchain:

julia> using Metal

julia> Metal.versioninfo()
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro (64.000 KiB allocated)

And there we go! You'll note here that I'm using the upcoming macOS 13 (Ventura); this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren't technical limitations. In fact, Metal.jl does work on e.g. macOS Monterey with an Intel GPU, but it's an untested combination that may suffer from bugs.

Array programming

Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the MtlArray type that can be used to manage memory and perform GPU computations:

# allocate + initialize
julia> a = MtlArray(rand(Float32, 2, 2))
2×2 MtlArray{Float32, 2}:
 0.158752  0.836366
 0.535798  0.153554

# perform some GPU-accelerated operations
julia> b = a * a
2×2 MtlArray{Float32, 2}:
 0.473325  0.261202
 0.167333  0.471702

# back to the CPU
julia> Array(b)
2×2 Matrix{Float32}:
 0.473325  0.261202
 0.167333  0.471702

Beyond these simple operations, Julia's higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:

julia> mapreduce(sin, +, a; dims=1)
1×2 MtlArray{Float32, 2}:
 1.15276  0.584146

julia> cos.(a .+ 2) .* 3
2×2 MtlArray{Float32, 2}:
 -2.0472   -1.25332
 -2.96594  -2.60351

Much of this functionality comes from the GPUArrays.jl package, which provides vendor-neutral implementations of common array operations. As a result, MtlArray is already pretty capable, and should be usable with realistic array-based applications.

Kernel programming

Metal.jl's array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:

# a simple kernel that sets elements of an array to a value
function memset_kernel(array, value)
  i = thread_position_in_grid_1d()
  if i <= length(array)
    @inbounds array[i] = value
  end
  return
end

a = MtlArray{Float32}(undef, 512)
@metal threads=512 grid=2 memset_kernel(a, 42)

# verify
@assert all(isequal(42), Array(a))

As can be seen here, we've opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that's similar to Julia's existing back-ends. Some key differences:

we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, ...;
all types of arguments (buffers, indirect buffers, value-typed inputs) are transparently converted to a GPU-compatible structure^[1];
global (task-bound) state is used to keep track of the active device and a queue;
compute pipeline set-up and command encoding is hidden behind a single macro.

Behind the scenes, we compile Julia to LLVM IR and use a tiny LLVM back-end (based on @a2flo's libfloor) that (re)writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using @device_code_metal:

julia> @device_code_metal @metal threads=512 grid=2 memset_kernel(a, 42)

[header]
program_count: 1
...

[program]
name: julia_memset_kernel
type: kernel
...

target datalayout = "..."
target triple = "air64-apple-macosx13.0.0"

; the (rewritten) kernel function:
;  - %value argument passed by reference
;  - %thread_position_in_grid argument added
;  - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel(
    { i8 addrspace(1)*, [1 x i64] } addrspace(1)* %array,
    i64 addrspace(1)* %value,
    i32 %thread_position_in_grid) {
  ...
  %9 = tail call float @air.convert.f.f32.s.i64(i64 %7)
  ...
  ret void
}

; minimal required argument metadata
!air.kernel = !{!10}
!10 = !{void ({ i8 addrspace(1)*, [1 x i64] } addrspace(1)*,
              i64 addrspace(1)*, i32)* @julia_memset_kernel, !11, !12}
!12 = !{!13, !14, !15}
!13 = !{i32 0, !"air.buffer", !"air.location_index", i32 0, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 16, !"air.arg_type_align_size", i32 8}
!14 = !{i32 1, !"air.buffer", !"air.location_index", i32 1, i32 1,
       !"air.read_write", !"air.address_space", i32 1,
       !"air.arg_type_size", i32 8, !"air.arg_type_align_size", i32 8}
!15 = !{i32 0, !"air.thread_position_in_grid"}

; other metadata not shown, for brevity

Shout-out to @max-Hawkins for exploring Metal code generation during his internship at Julia Computing!

Metal APIs in Julia

Lacking an Objective C or C++ FFI, we interface with the Metal libraries using a shim C library. Most users won't have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we've designed for the Metal APIs:

julia> dev = MtlDevice(1)
MtlDevice:
  name:             Apple M1 Pro
  lowpower:         false
  headless:         false
  removable:        false
  unified memory:   true

julia> desc = MtlHeapDescriptor()
MtlHeapDescriptor:
  type:             MtHeapTypeAutomatic
  storageMode:      MtStorageModePrivate
  size:             0

julia> desc.size = 16384
16384

julia> heap = MtlHeap(dev, desc)
MtlHeap:
  type:                 MtHeapTypeAutomatic
  size:                 16384
  usedSize:             0
  currentAllocatedSize: 16384

# etc

These wrappers are based on @PhilipVinc's excellent work on MetalCore.jl, which formed the basis for (and has been folded into) Metal.jl.

What's next?

The current release of Metal.jl focusses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:

performance optimization
integration with Metal Performance Shaders
integration / documentation for use with Xcode tools
fleshing out the array abstraction based on user feedback

Please consider helping out with any of these! Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you're not certain, or have any questions, please drop by the #gpu channel on the JuliaLang Slack, ask questions on our Discourse, or chat to us during the GPU office hours every other Monday.

If you encounter any bugs, feel free to let us know on the Metal.jl issue tracker. For information on upcoming releases, subscribe to this website's blog where we post about significant developments in Julia's GPU ecosystem.

[1]	This relies on Metal 3 from macOS 13, which introduced bindless argument

buffers, as we didn't fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.