<?xml version="1.0" encoding="UTF-8"?>

<rss version="2.0"
  xmlns:content="http://purl.org/rss/1.0/modules/content/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:media="http://search.yahoo.com/mrss/"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:georss="http://www.georss.org/georss">

  <channel>
    <title><![CDATA[JuliaGPU]]></title>
    <link>https://juliagpu.org</link>
    <description><![CDATA[High-performance GPU programming in a high-level language.]]></description>
    <generator>Franklin.jl -- https://github.com/tlienart/Franklin.jl</generator>
    <atom:link
      href="https://juliagpu.org/post/index.xml"
      rel="self"
      type="application/rss+xml" />


<item>
  <title><![CDATA[cuTile.jl 0.2: New features, improved performance, and Julia 1.13 support]]></title>
  <link>https://juliagpu.org/post/2026-04-08-cutile_0.2/index.html</link>
  <guid>https://juliagpu.org/2026-04-08-cutile_0.2/</guid>
  <description><![CDATA[cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA&#39;s tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will be presenting about it in a joint webinar with NVIDIA on May 12.]]></description>  
  
  <content:encoded><![CDATA[
<p>cuTile.jl v0.2 is the first major update of the Julia package for writing GPU kernels using NVIDIA&#39;s tile-based programming model. This release adds many new features, supports more of the Julia language, and greatly improves performance. We will be presenting about it in a joint webinar with NVIDIA on May 12.</p>
<p>The release is showcased by two new examples that exercise many of the features described below: a fused <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/examples/moe.jl">Mixture of Experts</a> kernel with token routing via <code>gather</code>/<code>scatter</code>, and a <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/examples/fmha.jl">Flash Multi-Head Attention</a> implementation with online <code>softmax</code> and causal masking.</p>
<h2 id="breaking_changes">Breaking changes</h2>
<ul>
<li><p><strong><code>ct.where</code> removed</strong>: use <code>ifelse.&#40;cond, x, y&#41;</code> &#40;standard Julia broadcast&#41;;</p>
</li>
<li><p><strong>FP mode kwargs removed</strong>: per-operation <code>rounding_mode</code> and <code>flush_to_zero</code> kwargs on reductions/scans replaced with <code>ct.@fpmode</code> blocks &#40;see below&#41;;</p>
</li>
<li><p><strong>Matmul batch dimensions</strong>: <code>muladd</code> now uses trailing batch dims <code>&#40;M, K, B...&#41;</code> matching Julia convention, instead of leading <code>&#40;B, M, K&#41;</code>.</p>
</li>
</ul>
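<p>To make the new batch-dimension convention concrete, here is a plain-Julia CPU sketch of a multiply-accumulate over a trailing batch dimension. Note that <code>batched_muladd</code> is a hypothetical helper for illustration only, not cuTile.jl API:</p>

```julia
# CPU sketch of the trailing-batch-dim convention: batches live in the trailing
# dimensions, so each (M, K) slice is an ordinary matrix.
# `batched_muladd` is a hypothetical illustration, not cuTile.jl API.
function batched_muladd(A, B, acc)
    out = copy(acc)
    for b in axes(A, 3)  # iterate the trailing batch dimension
        out[:, :, b] .+= A[:, :, b] * B[:, :, b]
    end
    return out
end

A = rand(Float32, 4, 3, 2)    # (M, K, B)
B = rand(Float32, 3, 5, 2)    # (K, N, B)
acc = zeros(Float32, 4, 5, 2) # (M, N, B)
out = batched_muladd(A, B, acc)
```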
<h2 id="native_for_loops">Native <code>for</code> loops</h2>
<p>Previously, cuTile.jl required a <code>while</code>-loop workaround for iteration. Starting with v0.2, standard Julia <code>for</code> loops work directly:</p>
<pre><code class="language-julia">for k in Int32&#40;1&#41;:K_tiles
    a &#61; ct.load&#40;A; index&#61;&#40;pid_m, k&#41;, shape&#61;&#40;TILE_M, TILE_K&#41;&#41;
    b &#61; ct.load&#40;B; index&#61;&#40;k, pid_n&#41;, shape&#61;&#40;TILE_K, TILE_N&#41;&#41;
    acc &#61; muladd&#40;a, b, acc&#41;
end</code></pre>
<p>The compiler recognizes Julia&#39;s iterator protocol and lowers <code>for i in
start:stop</code> and <code>for i in start:step:stop</code> to Tile IR <code>ForOp</code> directly.</p>
<h2 id="floating-point_mode_ctfpmode">Floating-point mode: <code>ct.@fpmode</code></h2>
<p>A new scoped macro controls floating-point rounding and flush-to-zero for all operations in its body, matching how FP modes work in hardware:</p>
<pre><code class="language-julia">ct.@fpmode rounding_mode&#61;ct.Rounding.Approx flush_to_zero&#61;true begin
    s &#61; sum&#40;tile; dims&#61;1&#41;
    x &#61; exp2.&#40;tile&#41;
end</code></pre>
<p>Blocks can be nested, with inner blocks inheriting unspecified settings from the enclosing scope.</p>
<h2 id="keyword_arguments_for_operations">Keyword arguments for operations</h2>
<p>Most cuTile operations now use keyword arguments, aligning with cuTile Python&#39;s API and making call sites more readable:</p>
<ul>
<li><p><strong><code>load</code>/<code>store</code></strong>: <code>index</code>, <code>shape</code>, and <code>tile</code> are now kwargs &#40;<code>ct.load&#40;arr; index&#61;pid, shape&#61;&#40;M, N&#41;&#41;</code>&#41;. Positional syntax still works.</p>
</li>
<li><p><strong><code>arange</code></strong>: now works without a type, defaulting to <code>Int32</code>. Use <code>dtype</code> for other types &#40;e.g. <code>ct.arange&#40;16; dtype&#61;Int64&#41;</code>&#41;.</p>
</li>
<li><p><strong><code>gather</code>/<code>scatter</code></strong>: new <code>mask</code>, <code>padding_value</code>, and <code>check_bounds</code> kwargs. User masks are <code>AND</code>&#39;d with automatic bounds masks; <code>check_bounds&#61;false</code> skips bounds comparisons when indices are known safe.</p>
</li>
<li><p><strong>Atomics</strong>: all atomic operations accept <code>check_bounds</code> to optionally skip the bounds mask.</p>
</li>
<li><p><strong><code>allow_tma</code></strong>: default changed from <code>true</code> to <code>nothing</code> &#40;compiler decides&#41;.</p>
</li>
</ul>
<h2 id="experimental_host_abstractions">Experimental host abstractions</h2>
<p>cuTile.jl now provides a limited set of host-level APIs that generate cuTile kernels automatically, without writing explicit kernel code. They are exposed using the <code>ct.Tiled</code> wrapper type, which represents a tiled view of an array.</p>
<p><strong>Broadcasting</strong> fuses an entire expression into a single cuTile kernel, with tile sizes chosen automatically:</p>
<pre><code class="language-julia">ct.Tiled&#40;C&#41; .&#61; ct.Tiled&#40;A&#41; .&#43; ct.Tiled&#40;B&#41;

# Or via the convenience macro &#40;wraps all arrays automatically&#41;:
ct.@. C &#61; A &#43; sin&#40;B&#41;</code></pre>
<p><strong><code>mapreduce</code></strong> on <code>Tiled</code> arrays generates a tiled reduction kernel:</p>
<pre><code class="language-julia">mapreduce&#40;identity, &#43;, ct.Tiled&#40;A&#41;; dims&#61;1&#41;</code></pre>
<p>These APIs are experimental and may not persist in their current form. The goal is to eventually fold them into the default <code>CuArray</code> operations in CUDA.jl.</p>
<h2 id="debugging_with_printprintln">Debugging with <code>print</code>/<code>println</code></h2>
<p>You can now use standard Julia <code>print</code> and <code>println</code> inside kernels:</p>
<pre><code class="language-julia">function debug_kernel&#40;A, tile_size::Int&#41;
    pid &#61; ct.bid&#40;1&#41;
    tile &#61; ct.load&#40;A; index&#61;pid, shape&#61;&#40;tile_size,&#41;&#41;
    println&#40;&quot;Block &quot;, pid, &quot;: sum&#61;&quot;, sum&#40;tile; dims&#61;1&#41;&#41;
    return
end</code></pre>
<p>String constants, scalars, and tiles can be mixed freely. String interpolation &#40;<code>&quot;x&#61;&#36;x&quot;</code>&#41; is also supported.</p>
<h2 id="minor_changes">Minor changes</h2>
<ul>
<li><p><strong>Atomics</strong>: <code>atomic_max</code>, <code>atomic_min</code>, <code>atomic_or</code>, <code>atomic_and</code>, <code>atomic_xor</code> join the existing <code>atomic_cas</code>, <code>atomic_xchg</code>, and <code>atomic_add</code>;</p>
</li>
<li><p><strong><code>fill</code>/<code>zeros</code>/<code>ones</code> overlays</strong>: standard Base constructors now work inside kernels &#40;e.g. <code>zeros&#40;Float32, 64, 64&#41;</code>&#41;, thanks to <a href="https://github.com/AntonOresten">@AntonOresten</a>;</p>
</li>
<li><p><strong><code>isnan</code></strong>: works via a single unordered float self-comparison, avoiding Julia&#39;s bit-manipulation fallback;</p>
</li>
<li><p><strong>Debug info</strong>: source file and line information is now embedded in Tile IR bytecode;</p>
</li>
<li><p><strong>Julia 1.13</strong>: support for the upcoming Julia release.</p>
</li>
</ul>
<h2 id="performance_improvements">Performance improvements</h2>
<p>A major change in cuTile.jl 0.2 is under the hood: a new multi-pass optimization pipeline that significantly improves the quality of generated Tile IR. In v0.1, the compiler emitted IR almost directly from the structured Julia code; now, a series of passes transform and simplify it before bytecode emission.</p>
<p>The foundation is a declarative IR rewrite infrastructure inspired by MLIR that makes it easy to express pattern-matched transformations:</p>
<pre><code class="language-julia"># FMA fusion: mulf &#43; addf → fma
@rewrite addf&#40;mulf&#40;~a, ~b&#41;, ~c&#41; &#61;&gt; fma&#40;~a, ~b, ~c&#41;

# pow&#40;x, 2&#41; → x * x
@rewrite pow&#40;~x, broadcast&#40;constant&#40;2.0&#41;&#41;&#41; &#61;&gt; mulf&#40;~x, ~x&#41;</code></pre>
<p>Built on this, the pipeline includes:</p>
<ul>
<li><p><strong>Algebraic simplification</strong> cancels matching arithmetic pairs like <code>&#40;x &#43; 1&#41; -
  1 → x</code>, even when reshapes or broadcasts sit in between. This eliminates the overhead of Julia&#39;s 1-based indexing normalization;</p>
</li>
<li><p><strong>Comparison strength reduction</strong> canonicalizes patterns like <code>&#40;x &#43; 1&#41; &lt;&#61; y</code> into <code>x &lt; y</code>, collapsing the two-instruction <code>arange</code>-plus-compare that results from Julia&#39;s 1-based <code>arange</code> mask idiom down to a single comparison. This alone reduced <code>layernorm</code>&#39;s SASS from 10036 to 3253 instructions;</p>
</li>
<li><p><strong>Pow2 strength reduction</strong> replaces <code>pow&#40;x, 2&#41;</code> with <code>x * x</code>, eliminating the expensive <code>pow</code> transcendental in <code>layernorm</code>&#39;s variance computation;</p>
</li>
<li><p><strong>LICM</strong> hoists loop-invariant operations out of loop bodies;</p>
</li>
<li><p><strong>Constant folding and propagation</strong> evaluates compile-time-known arithmetic and tracks constants through the IR for further optimizations;</p>
</li>
<li><p><strong>Alias-aware token ordering</strong> uses alias analysis to identify independent memory operations on different arrays, avoiding unnecessary serialization that previously blocked instruction-level parallelism;</p>
</li>
</ul>
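<p>As a sanity check of the comparison canonicalization, the rewritten form agrees with the original for integer inputs &#40;away from overflow&#41;, which can be verified in plain Julia:</p>

```julia
# (x + 1) <= y must be equivalent to x < y for integers away from typemax.
original(x, y) = (x + 1) <= y
reduced(x, y)  = x < y

agree = all(original(x, y) == reduced(x, y) for x in -100:100, y in -100:100)
```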
<p>Thanks to these improvements, all examples from cuTile Python that have been ported to Julia using cuTile.jl perform within 10&#37; of their Python counterparts, and some are even faster. For up-to-date performance comparisons, see the <a href="https://github.com/JuliaGPU/cuTile.jl/blob/main/README.md">cuTile.jl README</a>.</p>
<h2 id="upcoming_webinar">Upcoming webinar</h2>
<p>On <strong>May 12, 2026 at 1 PM ET</strong>, Tim Besard &#40;JuliaHub&#41; and Andy Terrel &#40;NVIDIA&#41; will present cuTile.jl in a joint webinar. We will cover the package&#39;s design, demonstrate writing GPU kernels in Julia using the tile programming model, and discuss what&#39;s next for cuTile.jl and its integration with the Julia GPU ecosystem. To sign up, see the <a href="https://juliahub.com/events/cutile.jl-for-high-performance-computing-in-julia">JuliaHub event</a>.</p>
]]></content:encoded>
    
  <pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 1.6: Initial MPSGraph Support]]></title>
  <link>https://juliagpu.org/post/2025-05-30-metal_1.6/index.html</link>
  <guid>https://juliagpu.org/2025-05-30-metal_1.6/</guid>
  <description><![CDATA[Metal.jl adds initial support for MPSGraph, wrapping its matrix multiplication functions and resolving some matrix multiplication issues present in the previous implementation.]]></description>  
  
  <content:encoded><![CDATA[
<p>Metal.jl adds initial support for MPSGraph, wrapping its matrix multiplication functions and resolving some matrix multiplication issues present in the previous implementation.</p>
<h2 id="initial_mpsgraph_support">Initial MPSGraph support</h2>
<p><a href="https://github.com/JuliaGPU/Metal.jl/pull/526">PR #526</a> enabled the automatic generation of wrappers for all <code>enum</code>s, <code>struct</code>s, and Objective-C objects for the frameworks that Metal.jl relies upon. This made adding support for <a href="https://developer.apple.com/documentation/metalperformanceshadersgraph?language&#61;objc">MPSGraph</a>, Apple&#39;s MLIR-based GPU compiler interface, realistic.</p>
<p>To try out the new framework, the constructor and method wrappers necessary for matrix multiplication were added and hooked into the LinearAlgebra interface, working around the <a href="https://github.com/JuliaGPU/Metal.jl/pull/381">NaN issue</a> that could show up on M1/M2 devices.</p>
<p>Let&#39;s go through a simple example doing pairwise multiplication followed by pairwise addition using MPSGraph directly:</p>
<pre><code class="language-julia">using Metal, Random
using ObjectiveC: Foundation.NSDictionary
using Metal: encode&#33;
using .MPS: MPSCommandBuffer
using .MPSGraphs: MPSGraph, placeholderTensor, MPSGraphTensorData, MPSGraphTensor,
                  multiplicationWithPrimaryTensor, additionWithPrimaryTensor

T &#61; Float32
a &#61; Metal.rand&#40;10&#41;;
b &#61; Metal.rand&#40;10&#41;;
c &#61; Metal.rand&#40;10&#41;;

# To compare with the MPSGraph equivalent
res &#61; &#40;a .* b&#41; .&#43; c;

graph &#61; MPSGraph&#40;&#41; # Initialize the graph

# Create placeholder tensors to be used to compile our graph
placeA &#61; placeholderTensor&#40;graph, size&#40;a&#41;, T&#41;
placeB &#61; placeholderTensor&#40;graph, size&#40;b&#41;, T&#41;
placeC &#61; placeholderTensor&#40;graph, size&#40;c&#41;, T&#41;

# Link the placeholder tensors to the data via a Dict
feeds &#61; Dict&#123;MPSGraphTensor, MPSGraphTensorData&#125;&#40;
    placeA &#61;&gt; MPSGraphTensorData&#40;a&#41;,
    placeB &#61;&gt; MPSGraphTensorData&#40;b&#41;,
    placeC &#61;&gt; MPSGraphTensorData&#40;c&#41;
&#41;

# Add multiplication to the graph
pwisemul &#61; MPSGraphs.multiplicationWithPrimaryTensor&#40;graph, placeA, placeB&#41;

# Add addition to the graph
pwiseadd &#61; MPSGraphs.additionWithPrimaryTensor&#40;graph, pwisemul, placeC&#41;

# Our output tensor will be our c MtlArray
resultdict &#61; Dict&#123;MPSGraphTensor, MPSGraphTensorData&#125;&#40;
    pwiseadd &#61;&gt; feeds&#91;placeC&#93;
&#41;

# Encode and run the graph
cmdbuf &#61; MPS.MPSCommandBuffer&#40;Metal.global_queue&#40;device&#40;&#41;&#41;&#41;
MPS.encode&#33;&#40;cmdbuf, graph, NSDictionary&#40;feeds&#41;, NSDictionary&#40;resultdict&#41;&#41;
Metal.commit&#33;&#40;cmdbuf&#41;
Metal.wait_completed&#40;cmdbuf&#41;

# The MPSGraph result is equal to the typical way of doing things.
@assert isapprox&#40;res, c&#41;</code></pre>
<p>Clearly, for simple operations like the above example, this is a lot of extra boilerplate without much benefit. For more complex operations, however, MPSGraph will optimize the graph and its operations before running, reducing expensive kernel launches and removing unnecessary operations.</p>
<p>Another exciting aspect of this new framework wrapper is that it is now easier to add functionality that has been long-requested. One can find MPSGraph functionality not yet in Metal.jl and write wrappers using the existing wrappers as a starting point. If anyone is interested in helping out, feel free to open a pull request or an issue on the <a href="https://github.com/JuliaGPU/Metal.jl">Metal.jl repository</a>, and we will do our best to help you get your code merged.</p>
<h2 id="minor_changes">Minor Changes</h2>
<p>Metal.jl 1.6 also includes several other useful updates:</p>
<ul>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/559">Fixes</a> with using irrationals in kernels.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/529">Many</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/531">improvements</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/533">and</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/544">fixes</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/582">to</a> <a href="https://github.com/JuliaGPU/Metal.jl/pull/561">intrinsics</a>.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/pull/557">Support</a> for <code>pow</code> with an <code>Integer</code> exponent .</p>
</li>
</ul>
<p>As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the <a href="https://github.com/JuliaGPU/Metal.jl/releases/tag/v1.6.0">changelog</a> for a full list of changes.</p>
]]></content:encoded>
    
  <pubDate>Fri, 30 May 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Christian Guinard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.8: CuSparseVector broadcasting, CUDA 12.9, and more]]></title>
  <link>https://juliagpu.org/post/2025-05-14-cuda_5.8/index.html</link>
  <guid>https://juliagpu.org/2025-05-14-cuda_5.8/</guid>
  <description><![CDATA[CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting   support for &lt;code&gt;CuSparseVector&lt;/code&gt;. The release also includes support for CUDA 12.9,   and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v5.8 brings several enhancements, most notably the introduction of broadcasting support for <code>CuSparseVector</code>. The release also includes support for CUDA 12.9, and updates to key CUDA libraries like cuTENSOR, cuQuantum, and cuDNN.</p>
<h2 id="broadcasting_for_cusparsevector">Broadcasting for <code>CuSparseVector</code></h2>
<p>A significant enhancement in CUDA.jl v5.8 is the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2733">support for broadcasting <code>CuSparseVector</code></a>. Thanks to <a href="https://github.com/kshyatt">@kshyatt</a>, it is now possible to use sparse GPU vectors in broadcast expressions just like it was already possible with sparse matrices:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA, .CUSPARSE, SparseArrays

julia&gt; x &#61; cu&#40;sprand&#40;Float32, 10, 0.3&#41;&#41;
10-element CuSparseVector&#123;Float32, Int32&#125; with 4 stored entries:
  &#91;2&#93;  &#61;  0.459139
  &#91;3&#93;  &#61;  0.964073
  &#91;8&#93;  &#61;  0.904363
  &#91;9&#93;  &#61;  0.721723

julia&gt; # a zero-preserving elementwise operation
       x .* 2
10-element CuSparseVector&#123;Float32, Int32&#125; with 4 stored entries:
  &#91;2&#93;  &#61;  0.918278
  &#91;3&#93;  &#61;  1.928146
  &#91;8&#93;  &#61;  1.808726
  &#91;9&#93;  &#61;  1.443446

julia&gt; # a non-zero-preserving elementwise operation
       x .&#43; 1
10-element CuArray&#123;Float32, 1, CUDA.DeviceMemory&#125;:
 1.0
 1.4591388
 1.9640732
 1.0
 1.0
 1.0
 1.0
 1.9043632
 1.7217231
 1.0

julia&gt; # combining multiple sparse inputs
       x .&#43; cu&#40;sprand&#40;Float32, 10, 0.3&#41;&#41;
10-element CuSparseVector&#123;Float32, Int32&#125; with 6 stored entries:
  &#91;1&#93;  &#61;  0.906
  &#91;2&#93;  &#61;  0.583197
  &#91;3&#93;  &#61;  0.964073
  &#91;4&#93;  &#61;  0.259103
  &#91;8&#93;  &#61;  0.904363
  &#91;9&#93;  &#61;  0.935917</code></pre>
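<p>The zero-preserving behaviour mirrors the semantics of CPU SparseArrays, which can be checked without a GPU &#40;this is the CPU analogue only, not the CUSPARSE code path&#41;:</p>

```julia
using SparseArrays

x = sprand(Float32, 10, 0.3)

# Zero-preserving: scaling by a nonzero scalar keeps the result sparse and
# leaves the sparsity structure intact.
y = x .* 2
```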
<h2 id="minor_changes">Minor Changes</h2>
<p>CUDA.jl 5.8 also includes several other useful updates:</p>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2772">Added support</a> for CUDA 12.9;</p>
</li>
<li><p>Subpackages <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2776">have been updated</a> to CUDNN 9.10, cuTensor 2.2, and cuQuantum 25.03;</p>
</li>
<li><p><code>CUSPARSE.gemm&#33;</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2769">now supports</a> additional algorithm choices to limit memory usage;</p>
</li>
<li><p>Symbols <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2624">can now be passed</a> to CUDA kernels and stored in <code>CuArray</code>s;</p>
</li>
<li><p><code>CuTensor</code> multiplication <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2775">now preserves</a> the memory type of the input tensors;</p>
</li>
<li><p>Sparse CSR matrices <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2720">are now interfaced with</a> the SparseMatricesCSR.jl package.</p>
</li>
</ul>
<p>As always, we encourage users to update to the latest version to benefit from these improvements and bug fixes. Check out the <a href="https://github.com/JuliaGPU/CUDA.jl/releases/tag/v5.8.0">changelog</a> for a full list of changes.</p>
]]></content:encoded>
    
  <pubDate>Wed, 14 May 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.6 and 5.7: Allocator cache, and asynchronous CUBLAS wrappers]]></title>
  <link>https://juliagpu.org/post/2025-03-11-cuda_5.6_5.7/index.html</link>
  <guid>https://juliagpu.org/2025-03-11-cuda_5.6_5.7/</guid>
  <description><![CDATA[CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved &lt;code&gt;CuRef&lt;/code&gt; type, which enables fully asynchronous CUBLAS calls.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v5.6 adds support for the new GPUArrays.jl caching allocator interface, which should improve performance of repetitive, memory-heavy applications. CUDA.jl v5.7 brings a greatly improved <code>CuRef</code> type, which enables fully asynchronous CUBLAS calls.</p>
<h2 id="reworking_curef_for_asynchronous_cublas">Reworking <code>CuRef</code> for asynchronous CUBLAS</h2>
<p>The <code>CuRef</code> type is similar to Julia&#39;s <code>Ref</code>, a boxed value, often used with C APIs. In CUDA.jl v5.7, we&#39;ve made several changes to this type. First of all, we&#39;ve aligned its API much more closely with the <code>Ref</code> type from Base, e.g., adding <code>getindex</code> and <code>setindex&#33;</code> methods, which should make it more familiar to users:</p>
<pre><code class="language-julia-repl">julia&gt; box &#61; CuRef&#40;1&#41;
CuRefValue&#123;Int64&#125;&#40;1&#41;

julia&gt; box&#91;&#93;
1

julia&gt; box&#91;&#93; &#61; 2
2

julia&gt; box
CuRefValue&#123;Int64&#125;&#40;2&#41;</code></pre>
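<p>For readers without a GPU: <code>CuRef</code> now mirrors the <code>Base.Ref</code> API shown below, which behaves identically on the CPU, minus the GPU-backed storage:</p>

```julia
# The Base.Ref API that CuRef now follows: a boxed value with
# getindex/setindex! access.
box = Ref(1)
v1 = box[]   # read the boxed value
box[] = 2    # update it in place
v2 = box[]
```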
<p>We also <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2645">optimized and improved</a> the <code>CuRef</code> implementation. As part of that work, we <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2625">removed the eager synchronization when copying from unpinned memory</a>. This was done to make it possible for Julia code to execute when waiting for the memory copy to start. However, it turns out that certain &#40;small&#41; copies, such as those performed by <code>CuRef</code>, can be performed without having to wait for the copy to start. By removing eager synchronization from those copies, <code>CuRef</code> objects can now be constructed fully asynchronously, i.e., without having to wait for the GPU to be ready.</p>
<p>Building on these changes, <a href="https://github.com/kshyatt">@kshyatt</a> has <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2616">switched our CUBLAS wrappers</a> over to using GPU-based <code>CuRef</code> boxes for scalar inputs instead of host-based <code>Ref</code> boxes. Although this increases the complexity of invoking CUBLAS APIs – the allocation of <code>CuRef</code> boxes requires CUDA API calls whereas a <code>Ref</code> box is much cheaper to allocate – this results in the API behaving asynchronously, whereas before every CUBLAS API taking scalar inputs would have resulted in a so-called &quot;bubble&quot; waiting for the GPU to finish executing.</p>
<h2 id="a_julia-level_allocator_cache">A Julia-level allocator cache</h2>
<p>To help with the common issue of running out of GPU memory, or to reduce the cost of CUDA.jl hitting the GC too often, <a href="https://github.com/pxl-th">@pxl-th</a> <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/576">has added a reusable caching allocator</a> to GPUArrays.jl, which CUDA.jl now supports and integrates with.</p>
<p>The idea is simple: GPU allocations made in a <code>GPUArrays.@cached</code> block are recorded in a <code>cache</code>, and when the block is exited the allocations are made available for reuse. Only when the cache goes out of scope, or when you call <code>unsafe_free&#33;</code> on it, the allocations will be fully freed. This is useful when you have a repetitive workload that performs the same allocations over and over again, such as in a machine learning training loop:</p>
<pre><code class="language-julia">cache &#61; GPUArrays.AllocCache&#40;&#41;
for epoch in 1:1000
    GPUArrays.@cached cache begin
        # dummy workload
        sin.&#40;CUDA.rand&#40;Float32, 1024^3&#41;&#41;
    end
end

# wait for &#96;cache&#96; to be collected, or optionally eagerly free the memory
GPUArrays.unsafe_free&#33;&#40;cache&#41;</code></pre>
<p>Even though CUDA already has a caching allocator, the Julia-level caching mechanism may still improve performance by lowering pressure on the GC and reducing fragmentation of the underlying allocator. For example, the above snippet only performs two memory allocations that require 8 GiB, instead of 2000 allocations totalling 8 TiB &#40;&#33;&#41; of GPU memory.</p>
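<p>The arithmetic behind those numbers is easy to reproduce: each epoch allocates an input and an output array of <code>1024^3</code> <code>Float32</code> values, i.e. 4 GiB apiece:</p>

```julia
# Back-of-envelope check: 2 reusable allocations with the cache versus
# 2 fresh allocations per epoch for 1000 epochs without it.
bytes_per_array = 1024^3 * sizeof(Float32)  # 4 GiB per array
cached   = 2 * bytes_per_array              # 8 GiB total, reused every epoch
uncached = 1000 * 2 * bytes_per_array       # 8000 GiB, roughly 8 TiB
```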
<p>The cherry on top is that the caching interface is generic, implemented in GPUArrays.jl, and available to all GPU back-ends that are compatible with GPUArrays.jl v11.2.</p>
<h2 id="minor_changes">Minor changes</h2>
<ul>
<li><p>Device-to-host copies <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2648">now eagerly synchronize</a> to improve concurrent execution.</p>
</li>
<li><p>On multi-GPU systems, unified memory <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2626">is not automatically prefetched anymore</a> when launching kernels, making it possible to process a single array on multiple devices.</p>
</li>
<li><p>A change to <code>CuDeviceArray</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2621">should allow eliding additional bounds checks</a> in code that already performs a manual bounds check &#40;such as KernelAbstractions.jl code&#41;.</p>
</li>
<li><p>CUDA toolkit 12.8 <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2634">is now supported</a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2620">as well as Jetson Orin</a> devices.</p>
</li>
<li><p>It is now possible to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2624">pass symbols to kernels</a>.</p>
</li>
<li><p>CUBLAS: <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2642">Support for Givens rotation methods</a>.</p>
</li>
<li><p>CUSPARSE: Support for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2639">using CuSparseMatrixBSR with generic <code>mm&#33;</code></a>.</p>
</li>
<li><p>Windows support for NVTX <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2665">has been fixed</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 11 Mar 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[OpenCL.jl 0.10: Now with native Julia kernels]]></title>
  <link>https://juliagpu.org/post/2025-01-13-opencl_0.10/index.html</link>
  <guid>https://juliagpu.org/2025-01-13-opencl_0.10/</guid>
  <description><![CDATA[Version 0.10 of OpenCL.jl is a significant release that adds support for native Julia kernels. This necessitated a major overhaul of the package&#39;s internals, bringing the package in line with modern Julia GPU programming practices.]]></description>  
  
  <content:encoded><![CDATA[
<p>Version 0.10 of OpenCL.jl is a significant release that adds support for native Julia kernels. This necessitated a major overhaul of the package&#39;s internals, bringing the package in line with modern Julia GPU programming practices.</p>
<h2 id="native_julia_kernels">Native Julia kernels</h2>
<p>The highlight of this release is the addition of <strong>a compiler that makes it possible to write OpenCL kernels in Julia</strong> instead of having to use OpenCL C and accompanying string-based APIs. Let&#39;s illustrate using the typical <code>vadd</code> vector-addition example, which starts by generating some data and uploading it to the GPU:</p>
<pre><code class="language-julia">using OpenCL

dims &#61; &#40;2,&#41;
a &#61; round.&#40;rand&#40;Float32, dims&#41; * 100&#41;
b &#61; round.&#40;rand&#40;Float32, dims&#41; * 100&#41;
c &#61; similar&#40;a&#41;

d_a &#61; CLArray&#40;a&#41;
d_b &#61; CLArray&#40;b&#41;
d_c &#61; CLArray&#40;c&#41;</code></pre>
<p>The typical way to write a kernel is to use a string with OpenCL C code, which is then compiled and executed on the GPU. This is done as follows:</p>
<pre><code class="language-julia">const source &#61; &quot;&quot;&quot;
   __kernel void vadd&#40;__global const float *a,
                      __global const float *b,
                      __global float *c&#41; &#123;
      int i &#61; get_global_id&#40;0&#41;;
      c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;;
    &#125;&quot;&quot;&quot;

prog &#61; cl.Program&#40;; source&#41; |&gt; cl.build&#33;
kern &#61; cl.Kernel&#40;prog, &quot;vadd&quot;&#41;

len &#61; prod&#40;dims&#41;
clcall&#40;kern, Tuple&#123;Ptr&#123;Float32&#125;, Ptr&#123;Float32&#125;, Ptr&#123;Float32&#125;&#125;,
       d_a, d_b, d_c; global_size&#61;&#40;len,&#41;&#41;</code></pre>
<p>With the new GPUCompiler.jl-based compiler, you can now write the kernel in Julia just like with our other back-ends:</p>
<pre><code class="language-julia">function vadd&#40;a, b, c&#41;
    i &#61; get_global_id&#40;&#41;
    @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
    return
end

len &#61; prod&#40;dims&#41;
@opencl global_size&#61;len vadd&#40;d_a, d_b, d_c&#41;</code></pre>
<p>This is of course a much more natural way to write kernels, and it also allows OpenCL.jl to be plugged into the rest of the JuliaGPU ecosystem. Concretely, OpenCL.jl now implements the GPUArrays.jl interface, enabling lots of vendor-neutral functionality, and also provides a KernelAbstractions.jl back-end for use with the many libraries that build on top of KernelAbstractions.jl.</p>
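<p>To sketch the KernelAbstractions.jl route, here is the same <code>vadd</code> kernel written once against KernelAbstractions.jl &#40;assuming that package is installed&#41;. It is shown on the CPU back-end, but the same kernel runs unchanged on a GPU back-end such as the one OpenCL.jl provides:</p>

```julia
using KernelAbstractions

# Portable vadd kernel: the same code runs on any KernelAbstractions back-end.
@kernel function vadd_ka!(c, @Const(a), @Const(b))
    i = @index(Global)
    @inbounds c[i] = a[i] + b[i]
end

a = rand(Float32, 16)
b = rand(Float32, 16)
c = similar(a)

backend = CPU()  # swap for a GPU back-end to run on-device
vadd_ka!(backend)(c, a, b; ndrange=length(c))
KernelAbstractions.synchronize(backend)
```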
<p>There is no free lunch, though, and <strong>the native compiler functionality currently relies on your OpenCL driver supporting SPIR-V</strong>. This is sadly not a common feature: neither NVIDIA&#39;s nor AMD&#39;s OpenCL drivers support it, only Intel&#39;s do. But if you are stuck with a driver that does not support SPIR-V, there is still hope: SPIR-V can be translated back to OpenCL C using the experimental <a href="https://github.com/kpet/spirv2clc"><code>spirv2clc</code></a> translator. If you are interested, check out <a href="https://github.com/JuliaGPU/OpenCL.jl/issues/234">this issue</a> and feel free to reach out.</p>
<h2 id="breaking_api_changes">Breaking API changes</h2>
<p>Existing users of OpenCL.jl will of course have noticed that even the string-based example above uses a different API than before. In order to support the new compiler, and bring OpenCL.jl in line with modern Julia programming practices, we have <strong>significantly overhauled the package&#39;s internals as well as some external APIs</strong>.</p>
<p>The most significant high-level changes include:</p>
<ul>
<li><p>Memory management is now done using <code>CLArray</code>, backed by Shared Virtual Memory &#40;SVM&#41;, instead of opaque buffers. Raw buffers are still supported, but not compatible with native kernel execution &#40;because they can not be converted to a pointer&#41;.</p>
</li>
<li><p>Kernels are called using the new <code>clcall</code> function, which performs automatic conversion of objects much like how <code>ccall</code> works.</p>
</li>
</ul>
<p>At the lower-level &#40;of the <code>cl</code> submodule&#41;, the changes are more extensive:</p>
<ul>
<li><p>Context, device and queue arguments have been removed from most APIs, and are now stored in task-local storage. These values can be queried &#40;<code>cl.platform&#40;&#41;</code>, <code>cl.device&#40;&#41;</code>, etc&#41; and set &#40;<code>cl.platform&#33;&#40;platform&#41;</code>, <code>cl.device&#33;&#40;device&#41;</code>, etc&#41; as needed.</p>
</li>
<li><p>As part of the above change, questionable APIs like <code>cl.create_some_context&#40;&#41;</code> and <code>cl.devices&#40;&#41;</code> have been removed;</p>
</li>
<li><p>The <code>Buffer</code> API has been completely reworked. It now only provides low-level functionality, such as <code>unsafe_copyto&#33;</code> or <code>unsafe_map&#33;</code>, while high-level functionality like <code>copy&#33;</code> is implemented for the CLArray type;</p>
</li>
<li><p>The <code>cl.info</code> method, and the <code>getindex</code> overloading to access properties of OpenCL objects, have been replaced by <code>getproperty</code> overloading on the objects themselves &#40;e.g., <code>cl.info&#40;dev, :name&#41;</code> and <code>dev&#91;:name&#93;</code> are now simply <code>dev.name</code>&#41;;</p>
</li>
<li><p>The blocking <code>cl.launch</code> has been replaced by a nonblocking <code>cl.call</code>, while also removing the <code>getindex</code>-overloading shorthand. However, it&#39;s recommended to use the newly-added <code>cl.clcall</code> function, which takes an additional tuple type argument and performs automatic conversions of arguments to those types. This makes it possible to pass a <code>CLArray</code> to an OpenCL C function expecting Buffer-backed pointers, for example.</p>
</li>
<li><p>Argument conversion has been removed; the user should make sure Julia arguments passed to kernels match the OpenCL argument types &#40;e.g., no empty types, and a 4-element tuple for a 3-element <code>float3</code> argument&#41;.</p>
</li>
<li><p>The <code>to_host</code> function has been replaced by simply calling <code>Array</code> on the <code>CLArray</code>.</p>
</li>
<li><p>Queue and execution capabilities of a device are now to be queried using dedicated functions, <code>cl.queue_properties</code> and <code>cl.exec_capabilities</code>.</p>
</li>
</ul>
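<p>Taken together, a short sketch of the reworked low-level API &#40;the property accesses and capability queries follow the changes listed above; the <code>cl.platforms&#40;&#41;</code> helper used to enumerate platforms is an assumption&#41;:</p>
<pre><code class="language-julia">using OpenCL

# Task-local state: query the active platform/device, or switch them as needed
cl.platform&#33;&#40;first&#40;cl.platforms&#40;&#41;&#41;&#41;  # assumption: cl.platforms&#40;&#41; lists available platforms
dev &#61; cl.device&#40;&#41;

# Object properties are now plain getproperty accesses
@show dev.name

# Queue and execution capabilities via dedicated functions
@show cl.queue_properties&#40;dev&#41;
@show cl.exec_capabilities&#40;dev&#41;</code></pre>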
<p>Working towards the first stable version of this package, we anticipate having to make even more breaking changes. However, we want to get the current changes out there to get feedback from the community. If some of the removed functionality is crucial to your workflow, feel free to reach out and we can discuss how to best support it in the future.</p>
<h2 id="jll-based_opencl_drivers">JLL-based OpenCL drivers</h2>
<p>Another significant change is the <strong>integration with OpenCL drivers built and provided using Julia&#39;s BinaryBuilder infrastructure</strong>. Over time, this should simplify the installation of OpenCL drivers by avoiding the need to install global drivers. For now, the only driver provided as a JLL is a CPU driver based on the <a href="https://portablecl.org/">Portable Computing Language &#40;PoCL&#41; library</a>. This driver can be used by simply installing and loading <code>pocl_jll</code> before you start using OpenCL.jl:</p>
<pre><code class="language-julia-repl">julia&gt; using OpenCL, pocl_jlljulia&gt; OpenCL.versioninfo&#40;&#41;
OpenCL.jl version 0.10.0Toolchain:
 - Julia v1.11.2
 - OpenCL_jll v2024.5.8&#43;1Available platforms: 1
 - Portable Computing Language
   OpenCL 3.0, PoCL 6.0  Apple, Release, RELOC, SPIR-V, LLVM 16.0.6jl, SLEEF, DISTRO, POCL_DEBUG
   · cpu &#40;fp16, fp64, il&#41;</code></pre>
<p>Notice the <code>il</code> capability reported by <code>OpenCL.versioninfo&#40;&#41;</code>, indicating that PoCL supports SPIR-V and can thus be used with the new native Julia kernel compiler. In fact, this is one of the goals of reworking OpenCL.jl: to provide a CPU fallback implementation for use with Julia GPU libraries.</p>
<h2 id="work_towards_opencljl_10">Work towards OpenCL.jl 1.0</h2>
<p>This release is a significant step towards a stable 1.0 release of OpenCL.jl, bringing the package in line with our other Julia GPU back-ends. Our focus is on improving OpenCL.jl in order to support a CPU fallback back-end for KernelAbstractions.jl based on PoCL. If you are a user of OpenCL.jl, or are interested in using the package in the future, please test out this release with your application and/or driver, and provide feedback on the changes we&#39;ve made. Pull requests are greatly appreciated, and we are happy to help you get started with contributing to the package.</p>
]]></content:encoded>
    
  <pubDate>Mon, 13 Jan 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[GPUArrays v11: Port to KernelAbstractions.jl]]></title>
  <link>https://juliagpu.org/post/2025-01-07-gpuarrays-11/index.html</link>
  <guid>https://juliagpu.org/2025-01-07-gpuarrays-11/</guid>
  <description><![CDATA[The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.]]></description>  
  
  <content:encoded><![CDATA[
<p>The latest version of GPUArrays.jl involved a port of all vendor-neutral kernels to KernelAbstractions.jl. This should make it easier to add new functionality and improve the performance of existing kernels.</p>
<h2 id="vendor-neutral_kernel_dsl">Vendor-neutral kernel DSL</h2>
<p>Back in the day, we created GPUArrays.jl to avoid having to write separate kernels for each GPU back-end, by relying on a very simple vendor-neutral domain-specific language &#40;DSL&#41; that could be translated very easily to the back-end&#39;s native kernel language. As a simple example, the following kernel was used to compute the adjoint of a vector:</p>
<pre><code class="language-julia">function LinearAlgebra.adjoint&#33;&#40;B::AbstractGPUMatrix, A::AbstractGPUVector&#41;
    gpu_call&#40;B, A&#41; do ctx, B, A
        idx &#61; @linearidx A
        @inbounds B&#91;1, idx&#93; &#61; adjoint&#40;A&#91;idx&#93;&#41;
        return
    end
    return B
end</code></pre>
<p>This DSL was designed almost a decade ago, by <a href="https://github.com/SimonDanisch">Simon Danisch</a>, and has served us well&#33; Since then, KernelAbstractions.jl has been developed by <a href="https://github.com/vchuravy/">Valentin Churavy</a>, providing a more principled and powerful DSL. With many application developers switching to KernelAbstractions.jl, it was time to port GPUArrays.jl to this new DSL as well.</p>
<p>Thanks to the tireless work by <a href="https://github.com/leios">James Schloss</a>, <strong>GPUArrays.jl v11 now uses KernelAbstractions.jl for all vendor-neutral kernels</strong>. The aforementioned <code>adjoint&#33;</code> kernel now looks like this:</p>
<pre><code class="language-julia">function LinearAlgebra.adjoint&#33;&#40;B::AbstractGPUMatrix, A::AbstractGPUVector&#41;
    @kernel function adjoint_kernel&#33;&#40;B, A&#41;
        idx &#61; @index&#40;Global, Linear&#41;
        @inbounds B&#91;1, idx&#93; &#61; adjoint&#40;A&#91;idx&#93;&#41;
    end
    adjoint_kernel&#33;&#40;get_backend&#40;A&#41;&#41;&#40;B, A; ndrange&#61;size&#40;A&#41;&#41;
    return B
end</code></pre>
<p>As shown above, the KernelAbstractions.jl DSL is very similar to the old DSL, but it provides more flexibility and power &#40;e.g., support for atomics through Atomix.jl&#41;. In addition, many more users are familiar with KernelAbstractions.jl, making it easier for them to contribute to GPUArrays.jl. A good first step here would be to port some of the vendor-specific kernels from CUDA.jl to GPUArrays.jl, making them available to all GPU back-ends. If you are interested in contributing, please reach out&#33;</p>
<p>That said, the change is not without its challenges. The added flexibility offered by KernelAbstractions.jl with respect to indexing currently results in <strong>certain kernels being slower than before</strong>, specifically when there is not much computational complexity to amortise the cost of indexing &#40;e.g., when doing very simple broadcasts&#41;. <a href="https://github.com/JuliaGPU/GPUArrays.jl/issues/565">We are working on improving this</a>, but it will take some time. So as not to hold back the rest of the JuliaGPU ecosystem, we are releasing despite these performance issues. It&#39;s recommended to carefully benchmark your application after upgrading to v11, and to report any performance regressions you encounter.</p>
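<p>As a starting point for such benchmarking, a simple broadcast &#40;the kind of low-arithmetic-intensity kernel most likely to regress&#41; can be timed with BenchmarkTools.jl; CUDA.jl is used here purely as an example back-end:</p>
<pre><code class="language-julia">using CUDA, BenchmarkTools

a &#61; CUDA.rand&#40;Float32, 1024, 1024&#41;

# Synchronize so we time the kernel itself, not just its launch
@btime CUDA.@sync &#36;a .&#43; 1f0;</code></pre>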
<h2 id="back-end_package_versions">Back-end package versions</h2>
<p>As GPUArrays.jl is not a direct dependency of most applications, the update will be pulled in by the following back-end package versions &#40;some of which may not be released yet&#41;:</p>
<ul>
<li><p>CUDA.jl v5.6</p>
</li>
<li><p>Metal.jl v1.5</p>
</li>
<li><p>oneAPI.jl v2.0</p>
</li>
<li><p>AMDGPU.jl v1.1</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 07 Jan 2025 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 1.4: Improved random numbers]]></title>
  <link>https://juliagpu.org/post/2024-10-07-metal-1.4/index.html</link>
  <guid>https://juliagpu.org/2024-10-07-metal-1.4/</guid>
  <description><![CDATA[Metal.jl 1.4 adds higher-quality random number generators from the Metal Performance Shaders library. Some limitations apply, with a fallback to the current implementation in those situations.]]></description>  
  
<content:encoded><![CDATA[<p>Metal.jl 1.4 adds higher-quality random number generators from the Metal Performance Shaders library. Some limitations apply, with a fallback to the current implementation in those situations.</p>
<h2 id="metalrand_and_friends"><code>Metal.rand</code> and friends</h2>
<p>Using functionality provided by the Metal Performance Shaders &#40;MPS&#41; library, Metal.jl now comes with much improved GPU random number generators. Uniform distributions using <code>Metal.rand</code> &#40;and its in-place variant <code>Metal.rand&#33;</code>&#41; are available for all Metal-supported integer types and <code>Float32</code>. However, due to <a href="https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language&#61;objc">Metal API limitations</a>, 8-bit and 16-bit integers may fall back to the lower-quality GPUArrays.jl random number generator if the array&#39;s size in bytes is not a multiple of 4. Normally distributed <code>Float32</code> values can be generated with <code>Metal.randn</code> and <code>Metal.randn&#33;</code>, while <code>Float16</code> is not supported by the MPS library and will always fall back to the GPUArrays implementation.</p>
<p>The easiest way to use these is to call the Metal convenience functions <code>Metal.rand&#91;n&#93;&#91;&#33;&#93;</code> as you would the usual functions from the Random.jl standard library:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; Metal.rand&#40;Float32, 2&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 0.95755994
 0.7110207julia&gt; Metal.randn&#33;&#40;a&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 1.7230463
 0.55636907</code></pre>
<p>However, the Random.jl methods can also be used by providing the appropriate <code>RNG</code> either from <code>MPS.default_rng&#40;&#41;</code> or <code>MPS.RNG&#40;&#41;</code> to the standard <code>Random.rand&#91;n&#93;&#91;&#33;&#93;</code> functions:</p>
<pre><code class="language-julia-repl">julia&gt; using Randomjulia&gt; rng &#61; MPS.RNG&#40;&#41;;julia&gt; Random.rand&#40;rng, 2&#41;
2-element MtlVector&#123;Float32, Metal.PrivateStorage&#125;:
 0.8941469
 0.67628527</code></pre>
<p>Seeding is done by calling <code>Metal.seed&#33;</code> for the global RNG, or <code>Random.seed&#33;</code> when working with an explicit <code>RNG</code> object.</p>
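<p>Putting the two seeding styles together &#40;a sketch; requires a Metal-capable GPU, and the seed value is arbitrary&#41;:</p>
<pre><code class="language-julia">using Metal, Random

# Seed the global RNG used by Metal.rand and friends
Metal.seed&#33;&#40;42&#41;
x &#61; Metal.rand&#40;Float32, 4&#41;

# Or seed an explicit RNG object
rng &#61; MPS.RNG&#40;&#41;
Random.seed&#33;&#40;rng, 42&#41;
y &#61; Random.rand&#40;rng, Float32, 4&#41;</code></pre>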
<h2 id="other_improvements_since_the_last_blog_post">Other improvements since the last blog post</h2>
<ul>
<li><p>Since v0.5: <code>MtlArray</code> storage mode has been parameterized, allowing one to create a shared storage <code>MtlArray</code> by calling <code>MtlArray&#123;eltype, ndims, Metal.SharedStorage&#125;&#40;...&#41;</code>.</p>
</li>
<li><p>Since v0.3: MPS-accelerated decompositions were added.</p>
</li>
<li><p>Various performance improvements.</p>
</li>
<li><p><em>Many</em> bug fixes.</p>
</li>
</ul>
<h2 id="future_work">Future work</h2>
<p>Although Metal.jl is now in v1, there is still work to be done to make it as fast and feature-complete as possible. In particular:</p>
<ul>
<li><p>Metal.jl is now using native ObjectiveC FFI for wrapping Metal APIs. However, these wrappers have to be written manually for every piece of Objective-C code. <em>We are looking for help with improving Clang.jl and ObjectiveC.jl</em> to <a href="https://github.com/JuliaInterop/ObjectiveC.jl/issues/41">enable the automatic generation of these wrappers</a>;</p>
</li>
<li><p>The MPS wrappers are incomplete, automatic wrapper generation would greatly help with full MPS support;</p>
</li>
<li><p>To implement a full-featured KernelAbstractions.jl back-end, Metal atomic operations need to <a href="https://github.com/JuliaGPU/Metal.jl/issues/218">be hooked up to Atomix</a>;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/Metal.jl/issues/298">Full support for BFloat16 values</a>, which has been supported since Metal 3.1 &#40;macOS 14&#41;, is not yet available in Metal.jl. There is, however, a <a href="https://github.com/JuliaGPU/Metal.jl/pull/446">draft PR</a> in the works. Check it out if you&#39;re interested in helping out;</p>
</li>
<li><p>Some functionality present in CUDA.jl <a href="https://github.com/JuliaGPU/Metal.jl/issues/443">could be ported to Metal.jl to improve usability</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Mon, 07 Oct 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Christian Guinard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.5: Maintenance release]]></title>
  <link>https://juliagpu.org/post/2024-09-18-cuda_5.5/index.html</link>
  <guid>https://juliagpu.org/2024-09-18-cuda_5.5/</guid>
  <description><![CDATA[CUDA.jl 5.5 is a minor release that comes with a couple of small improvements and new features.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.5 is a minor release that comes with a couple of small improvements and new features.</p>
<p>The only important change is that the minimum required Julia version has been bumped to 1.10, in anticipation of it becoming the next LTS release.</p>
<h2 id="new_features">New features</h2>
<ul>
<li><p>Support for the upcoming Julia 1.11 release has been added, as well as for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2461">CUDA 12.6 &#40;Update 1&#41;</a>.</p>
</li>
<li><p>Launch overhead has been reduced by <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2472">avoiding double argument conversions</a>. Note that this does not apply to kernels that are obtained using <code>@cuda launch&#61;false</code>.</p>
</li>
<li><p>CUSOLVER&#39;s dense wrappers have been improved by <a href="https://github.com/bjarthur">Ben Arthur</a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2465">now caching workspace buffers</a>. This should greatly reduce the number of allocations needed for repeated calls.</p>
</li>
<li><p><a href="https://github.com/amontoison">Alexis Montoison</a> has improved the CUSPARSE wrappers, adding <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2489">conversions between sparse vectors and sparse matrices</a> that enable <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2488">a version of <code>gemv</code></a> which preserves sparsity of the inputs.</p>
</li>
<li><p>CUDA.jl&#39;s CUFFT wrappers <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2430">now support <code>Float16</code></a>, thanks to <a href="https://github.com/eschnett">Erik Schnetter</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.4: Memory management mayhem]]></title>
  <link>https://juliagpu.org/post/2024-05-28-cuda_5.4/index.html</link>
  <guid>https://juliagpu.org/2024-05-28-cuda_5.4/</guid>
  <description><![CDATA[CUDA.jl 5.4 comes with many memory-management related changes that should improve performance of memory-heavy applications, and make it easier to work with heterogeneous set-ups involving multiple GPUs or using both the CPU and GPU.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.4 comes with many memory-management related changes that should improve performance of memory-heavy applications, and make it easier to work with heterogeneous set-ups involving multiple GPUs or using both the CPU and GPU.</p>
<p>Before anything else, let&#39;s get the breaking changes out of the way. CUDA.jl v5.4 only bumps the minor version, so it should be compatible with existing codebases. However, there are a couple of API changes that applications should be updated for, even though they are covered by appropriate deprecation warnings:</p>
<ul>
<li><p>The <code>CUDA.Mem</code> submodule has been removed. All identifiers have been moved to the parent <code>CUDA</code> module, with a couple being renamed in the process:</p>
<ul>
<li><p><code>Mem.Device</code> and <code>Mem.DeviceBuffer</code> have been renamed to <code>CUDA.DeviceMemory</code> &#40;the same applies to <code>Mem.Host</code> and <code>Mem.Unified</code>&#41;;</p>
</li>
<li><p>enums from the <code>Mem</code> submodule have gained a <code>MEM</code> suffix, e.g., <code>Mem.ATTACH_GLOBAL</code> has been renamed to <code>CUDA.MEM_ATTACH_GLOBAL</code>;</p>
</li>
<li><p><code>Mem.set&#33;</code> has been renamed to <code>CUDA.memset</code>;</p>
</li>
<li><p><code>Mem.info&#40;&#41;</code> has been renamed to <code>CUDA.memory_info&#40;&#41;</code>;</p>
</li>
</ul>
</li>
<li><p><code>CUDA.memory_status&#40;&#41;</code> has been renamed to <code>CUDA.pool_status&#40;&#41;</code>;</p>
</li>
<li><p><code>CUDA.available_memory&#40;&#41;</code> has been renamed to <code>CUDA.free_memory&#40;&#41;</code>.</p>
</li>
</ul>
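<p>In code, the renames map roughly as follows &#40;a sketch based on the list above; the exact array type parameter shown in the final assertion is an assumption&#41;:</p>
<pre><code class="language-julia">using CUDA

# CUDA.available_memory&#40;&#41;  →  CUDA.free_memory&#40;&#41;
@show CUDA.free_memory&#40;&#41;

# Mem.info&#40;&#41;  →  CUDA.memory_info&#40;&#41;
@show CUDA.memory_info&#40;&#41;

# CUDA.memory_status&#40;&#41;  →  CUDA.pool_status&#40;&#41;
CUDA.pool_status&#40;&#41;

# Mem.DeviceBuffer  →  CUDA.DeviceMemory, e.g. as it appears in array types
a &#61; CuArray&#123;Float32&#125;&#40;undef, 4&#41;
@assert a isa CuArray&#123;Float32, 1, CUDA.DeviceMemory&#125;</code></pre>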
<p>The meat of this release is in the memory management improvements detailed below. These changes can have a significant impact on the performance of your application, so it&#39;s recommended to thoroughly test your application after upgrading&#33;</p>
<h2 id="eager_garbage_collection">Eager garbage collection</h2>
<p>Julia is a garbage collected language, which means that &#40;GPU&#41; allocations can fail because garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled this at the allocation site, detecting out-of-memory errors and triggering the GC. This was not ideal, as it could lead to significant pauses and a bloated memory usage.</p>
<p>To improve this, <strong>CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that information to trigger the GC early at appropriate times</strong>, e.g., when waiting for a kernel to finish. This should lead to more predictable performance, both by distributing the cost of garbage collection over time and by potentially masking it behind other operations.</p>
<p>For example, the following toy model implemented with Flux.jl allocates a ton of memory:</p>
<pre><code class="language-julia">using CUDA, Flux
using MLUtils: DataLoader

n_obs &#61; 300_000
n_feature &#61; 1000
X &#61; rand&#40;n_feature, n_obs&#41;
y &#61; rand&#40;1, n_obs&#41;
train_data &#61; DataLoader&#40;&#40;X, y&#41; |&gt; gpu; batchsize &#61; 2048, shuffle&#61;false&#41;

model &#61; Dense&#40;n_feature, 1&#41; |&gt; gpu
loss&#40;m, _x, _y&#41; &#61; Flux.Losses.mse&#40;m&#40;_x&#41;, _y&#41;
opt_state &#61; Flux.setup&#40;Flux.Adam&#40;&#41;, model&#41;
for epoch in 1:100
  Flux.train&#33;&#40;loss, model, train_data, opt_state&#41;
end</code></pre>
<p>Without eager garbage collection, this leads to expensive pauses while freeing a large amount of memory at every epoch. We can simulate this by artificially limiting the memory available to the GPU, while also disabling the new eager garbage collection feature by setting the <code>JULIA_CUDA_GC_EARLY</code> environment variable to <code>false</code> &#40;this is a temporary knob that will be removed in the future, but may be useful now for evaluating the new feature&#41;:</p>
<pre><code class="language-text">❯ JULIA_CUDA_GC_EARLY&#61;false JULIA_CUDA_HARD_MEMORY_LIMIT&#61;4GiB \
  julia --project train.jl
...
&#91; Info: Epoch 90 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 91 train time 0.031s
&#91; Info: Epoch 92 train time 0.027s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 93 train time 0.03s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 94 train time 0.031s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 95 train time 0.03s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 96 train time 0.031s
&#91; Info: Epoch 97 train time 0.027s
retry_reclaim: freed 2.873 GiB
&#91; Info: Epoch 98 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 99 train time 0.031s
retry_reclaim: freed 2.865 GiB
&#91; Info: Epoch 100 train time 0.031s
&#91; Info: Total time 4.307s</code></pre>
<p>With eager garbage collection enabled, more frequent but less costly pauses result in significantly improved performance:</p>
<pre><code class="language-text">❯ JULIA_CUDA_GC_EARLY&#61;true JULIA_CUDA_HARD_MEMORY_LIMIT&#61;4GiB \
  julia --project train.jl
...
&#91; Info: Epoch 90 train time 0.031s
maybe_collect: collected 1.8 GiB
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 91 train time 0.033s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 92 train time 0.031s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 93 train time 0.031s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 94 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 95 train time 0.03s
maybe_collect: collected 1.8 GiB
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 96 train time 0.033s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 97 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 98 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 99 train time 0.03s
maybe_collect: collected 1.8 GiB
&#91; Info: Epoch 100 train time 0.03s
&#91; Info: Total time 3.76s</code></pre>
<p>Eager garbage collection is driven by a heuristic that considers the current memory pressure, how much memory was freed during previous collections, and how much time that took. It is possible that the current implementation is not optimal, so if you encounter performance issues, please file an issue.</p>
<h2 id="tracked_memory_allocations">Tracked memory allocations</h2>
<p>When working with multiple GPUs, it is important to differentiate between the device that memory was allocated on, and the device used to execute code. Practically, this meant that users of CUDA.jl had to manually remember that allocating and using <code>CuArray</code> objects &#40;typically&#41; needed to happen with the same device active. The same is true for streams, which are used to order operations executing on a single GPU.</p>
<p>To improve this, <strong>CUDA.jl now keeps track of the device that owns the memory, and the stream last used to access it, enabling the package to &quot;do the right thing&quot; when using that memory</strong> in kernels or with library functionality. This does <strong>not</strong> mean that CUDA.jl will automatically switch the active device: We want to keep the user in control of that, as it often makes sense to access memory from another device, if your system supports it.</p>
<p>Let&#39;s break down what the implications are of this change.</p>
<p><strong>1. Using multiple GPUs</strong></p>
<p>If you have multiple GPUs, it may be possible that direct P2P access between devices is possible &#40;e.g., using NVLink, or just over PCIe&#41;. In this case, CUDA.jl will now automatically configure the system to allow such access, making it possible to seamlessly use memory allocated on one device in kernels executing on a different device:</p>
<pre><code class="language-julia">julia&gt; # Allocate memory on device 0
       device&#33;&#40;0&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-16GB
julia&gt; a &#61; CuArray&#40;&#91;1&#93;&#41;;julia&gt; # Use on device 1
       device&#33;&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100S-PCIE-32GB
julia&gt; a .&#43; 1;</code></pre>
<p>If P2P access between devices is not possible, CUDA.jl will now raise an error instead of throwing an illegal memory access error as it did before:</p>
<pre><code class="language-julia">julia&gt; # Use on incompatible device 2
       device&#33;&#40;2&#41;
CuDevice&#40;2&#41;: NVIDIA GeForce GTX 1080 Ti
julia&gt; a .&#43; 1
ERROR: cannot take the GPU address of inaccessible device memory.You are trying to use memory from GPU 0 on GPU 2.
P2P access between these devices is not possible;
either switch to GPU 0 by calling &#96;CUDA.device&#33;&#40;0&#41;&#96;,
or copy the data to an array allocated on device 2.</code></pre>
<p>As the error message suggests, you can always copy memory between devices using the <code>copyto&#33;</code> function. In this case, CUDA.jl will fall back to staging the copy on the host when P2P access is not possible.</p>
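<p>For instance, a cross-device copy might look like this &#40;a sketch assuming the device numbering from the snippets above&#41;:</p>
<pre><code class="language-julia">using CUDA

device&#33;&#40;0&#41;
a &#61; CuArray&#40;&#91;1, 2, 3&#93;&#41;          # lives on GPU 0

device&#33;&#40;2&#41;
b &#61; CuArray&#123;Int&#125;&#40;undef, 3&#41;      # lives on GPU 2

# Works regardless of P2P support: CUDA.jl stages the copy
# through host memory when direct access is not possible.
copyto&#33;&#40;b, a&#41;</code></pre>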
<p><strong>2. Using multiple streams</strong></p>
<p>Streams are used to order operations executing on a single GPU. In CUDA.jl, every Julia task has its own stream, making it very easy to group independent operations together, and make it possible for the GPU to potentially overlap execution of these operations.</p>
<p>Before CUDA.jl v5.4, users had to be careful about synchronizing data used in multiple tasks. It was recommended, for example, to end every data-producing task with an explicit call to <code>synchronize&#40;&#41;</code>, or alternatively make sure to <code>device_synchronize&#40;&#41;</code> at the start of a data-consuming task. Now that CUDA.jl keeps track of the stream used to last access memory, it can automatically synchronize streams when needed:</p>
<pre><code class="language-julia"># Allocate some data
a &#61; CUDA.zeros&#40;4096, 4096&#41;
b &#61; CUDA.zeros&#40;4096, 4096&#41;
#synchronize&#40;&#41;  # No longer needed

# Perform work on a task
t &#61; @async begin
  a * b
  #synchronize&#40;&#41;  # No longer needed
end

# Fetch the results
c &#61; fetch&#40;t&#41;</code></pre>
<p><strong>3. Using capturing APIs</strong></p>
<p>All of the above is implemented by piggybacking on the function that converts memory objects to pointers, on the assumption that this will be the final operation before the memory is used. This is generally true, with one important exception: APIs that capture memory. For example, when recording an operation using the CUDA graph APIs, a memory address may be captured and used later without CUDA.jl being aware of it.</p>
<p>CUDA.jl accounts for this by detecting conversions during stream capture; however, some APIs may not be covered yet. If you encounter issues with capturing APIs, let us know, and keep using additional synchronization calls to ensure correctness.</p>
<h2 id="unified_memory_iteration">Unified memory iteration</h2>
<p>Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and the GPU. We have now greatly <strong>improved the performance of using unified memory with CPU code that iterates over elements</strong> of a <code>CuArray</code>. Although this is typically unwanted, triggering the dreaded &quot;scalar indexing&quot; error when accessing device memory in such a way, it can be useful when incrementally porting code to the GPU.</p>
<p>Concretely, accessing elements of a unified <code>CuArray</code> on the CPU is much faster now:</p>
<pre><code class="language-julia-repl">julia&gt; # Reference
       a &#61; &#91;1&#93;;
julia&gt; @btime &#36;a&#91;&#93;;
  1.959 ns &#40;0 allocations: 0 bytes&#41;julia&gt; b &#61; cu&#40;a; unified&#61;true&#41;;julia&gt; # Before
       @btime &#36;b&#91;&#93;
  2.617 μs &#40;0 allocations: 0 bytes&#41;;julia&gt; # After
       @btime &#36;b&#91;&#93;;
  4.140 ns &#40;0 allocations: 0 bytes&#41;</code></pre>
<p>Notice the different unit&#33; This has a massive impact on real-life performance, for example, as demonstrated by calling <code>foldl</code> which does not have a GPU-optimized implementation:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;; unified&#61;true&#41;;julia&gt; # Before
       @b foldl&#40;&#43;, a&#41;
4.210 s &#40;9 allocs: 208 bytes, without a warmup&#41;julia&gt; # After
       @b foldl&#40;&#43;, a&#41;
3.107 ms &#40;9 allocs: 208 bytes&#41;</code></pre>
<p>For completeness, doing this with regular device memory triggers a scalar indexing error:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;rand&#40;1024, 1024&#41;&#41;;julia&gt; foldl&#40;&#43;, a&#41;
ERROR: Scalar indexing is disallowed.</code></pre>
<p>These changes should make it easier to port applications to the GPU by incrementally moving parts of the codebase to the GPU without having to worry about the performance of accessing memory from the CPU. The only requirement is to use unified memory, e.g., by calling <code>cu</code> with <code>unified&#61;true</code>, or setting the CUDA.jl preference <code>default_memory</code> to use unified memory by default. However, as unified memory comes with a slight cost, and results in synchronous allocation behavior, it is still recommended to switch back to regular device memory when your application has been fully ported to the GPU.</p>
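<p>Opting in can be done per-array or globally; a sketch &#40;the Preferences.jl invocation for the <code>default_memory</code> preference is an assumption&#41;:</p>
<pre><code class="language-julia">using CUDA

# Per-array: allocate unified memory explicitly
a &#61; cu&#40;rand&#40;Float32, 1024&#41;; unified&#61;true&#41;
sum&#40;a&#41;   # GPU code works as usual
a&#91;1&#93;     # fast CPU access, no scalar-indexing error

# Globally &#40;assumption: set via Preferences.jl, then restart Julia&#41;:
# using Preferences
# set_preferences&#33;&#40;CUDA, &quot;default_memory&quot; &#61;&gt; &quot;unified&quot;&#41;</code></pre>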
<h2 id="other_changes">Other changes</h2>
<p>To keep this post from becoming even longer, a quick rundown of other changes:</p>
<ul>
<li><p><a href="https://github.com/wsmoses">@wsmoses</a> introduced initial support for automatic differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have to differentiate through host and device code separately, and manually set up rules for crossing the host/device boundary. Now, you can differentiate through entire applications with ease;</p>
</li>
<li><p><code>CUDA.@profile</code> now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2339">automatically detects external profilers</a>, so it should not be required to specify <code>external&#61;true</code> anymore when running under NSight;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2342">Exception output has been improved</a>, only reporting a single error message instead of generating output on each thread, and better forwarding the exception type;</p>
</li>
<li><p>Cached handles from libraries <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2352">will now be freed</a> when under memory pressure;</p>
</li>
<li><p>Tegra devices <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2374">are now supported</a> by our artifacts, obviating the need for a local toolkit;</p>
</li>
<li><p>Support for <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2392">CUDA 12.5</a> has been added, as well as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2390">initial support for Julia 1.12</a>.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 28 May 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl 1.5: Ponte Vecchio support and oneMKL improvements]]></title>
  <link>https://juliagpu.org/post/2024-05-24-oneapi_1.5/index.html</link>
  <guid>https://juliagpu.org/2024-05-24-oneapi_1.5/</guid>
<description><![CDATA[oneAPI.jl v1.5 is a significant release that brings many new features, from extended hardware support to greatly improved wrappers of the oneMKL math library.]]></description>  
  
  <content:encoded><![CDATA[
<p>oneAPI.jl v1.5 is a significant release that brings many new features, from extended hardware support to greatly improved wrappers of the oneMKL math library.</p>
<h2 id="intel_ponte_vecchio">Intel Ponte Vecchio</h2>
<p>In oneAPI.jl v1.5 we introduce support for the Intel Ponte Vecchio &#40;PVC&#41; architecture, which powers the Xe HPC GPUs as found in the Aurora supercomputer:</p>
<pre><code class="language-julia-repl">julia&gt; oneAPI.versioninfo&#40;&#41;
Binary dependencies:
- NEO: 24.13.29138&#43;0
- libigc: 1.0.16510&#43;0
- gmmlib: 22.3.18&#43;0
- SPIRV_LLVM_Translator_unified: 0.4.0&#43;0
- SPIRV_Tools: 2023.2.0&#43;0Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.71 driver:
- 00000000-0000-0000-17d2-6b1e010371d2 &#40;v1.3.29138, API v1.3.0&#41;16 devices:
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550
- Intel&#40;R&#41; Data Center GPU Max 1550</code></pre>
<p>Apart from <a href="https://github.com/JuliaGPU/oneAPI.jl/issues/428">a handful of MKL-related issues</a>, oneAPI.jl is fully functional on PVC, and passes all tests.</p>
<h2 id="onemkl_wrappers">oneMKL wrappers</h2>
<p>Thanks to the work of <a href="https://github.com/amontoison">@amontoison</a>, oneAPI.jl now provides greatly improved wrappers of the oneMKL library. This includes support for:</p>
<ul>
<li><p>LAPACK: <code>geqrf</code>&#40;<code>_batched</code>&#41;, <code>orgqr</code>&#40;<code>_batched</code>&#41;, <code>ormqr</code>, <code>potrf</code>&#40;<code>_batched</code>&#41;, <code>potrs</code>&#40;<code>_batched</code>&#41;, <code>getrf</code>&#40;<code>_batched</code>&#41;, <code>getri</code>&#40;<code>_batched</code>&#41;, <code>gebrd</code>, <code>gesvd</code>, <code>syevd</code>, <code>heevd</code>, <code>sygvd</code>, <code>hegvd</code></p>
</li>
<li><p>Sparse arrays: <code>sparse_gemm</code>, <code>sparse_gemv</code>, <code>sparse_symv</code>, <code>sparse_trmv</code>, <code>sparse_trsv</code>, <code>sparse_optimize_gemv</code>, <code>sparse_optimize_trsv</code></p>
</li>
</ul>
<p>Where possible, these functions are integrated with standard library interfaces, e.g., making it possible to simply call <code>eigen</code>, or to multiply two <code>oneSparseMatrixCSR</code>s.</p>
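<p>As an illustration, a symmetric eigendecomposition can go through the standard <code>eigen</code> interface and dispatch to oneMKL&#39;s <code>syevd</code> under the hood. This is a sketch: the exact set of element types and wrappers covered may differ depending on your oneAPI.jl version and driver:</p>
<pre><code class="language-julia">using oneAPI, LinearAlgebra

# build a small symmetric matrix on the device
A = oneArray(rand(Float32, 4, 4))
A = A + A'

# the symmetric wrapper lets `eigen` dispatch to oneMKL's syevd
vals, vecs = eigen(Hermitian(A))</code></pre>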
<h2 id="minor_changes">Minor changes</h2>
<p>There have of course been many other changes and improvements in oneAPI.jl v1.5. For a full list, please refer to the <a href="https://github.com/JuliaGPU/oneAPI.jl/releases/tag/v1.5.0">release notes</a>, but some highlights include:</p>
<ul>
<li><p>a new launch configuration heuristic that should generally improve performance;</p>
</li>
<li><p>broadcast now preserves the buffer type &#40;host, device, or shared&#41;;</p>
</li>
<li><p>support for very large arrays that exceed the default device memory limit;</p>
</li>
<li><p>several toolchain bumps, with v1.5 using oneAPI 2024.1.0 with driver 24.13.29138.7;</p>
</li>
<li><p>minimal support for native Windows &#40;next to WSL, which is fully supported&#41;.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 24 May 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.2 and 5.3: Maintenance releases]]></title>
  <link>https://juliagpu.org/post/2024-04-26-cuda_5.2_5.3/index.html</link>
  <guid>https://juliagpu.org/2024-04-26-cuda_5.2_5.3/</guid>
  <description><![CDATA[CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.2 and 5.3 are two minor releases of CUDA.jl that mostly focus on bug fixes and minor improvements, but also come with a number of interesting new features. This blog post summarizes the changes in these releases.</p>
<h2 id="profiler_improvements">Profiler improvements</h2>
<p>CUDA.jl 5.1 introduced a new native profiler, which can be used to profile Julia GPU applications without having to use Nsight Systems or other external tools. The tool has seen continued development, mostly improving its robustness, but CUDA.jl now also provides a <code>@bprofile</code> equivalent that runs your application multiple times and reports on the time distribution of individual events:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@bprofile CuArray&#40;&#91;1&#93;&#41; .&#43; 1
Profiler ran for 1.0 s, capturing 1427349 events.

Host-side activity: calling CUDA APIs took 792.95 ms &#40;79.29&#37; of the trace&#41;
┌──────────┬────────────┬────────┬───────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │  Calls │ Time distribution                     │ Name                    │
├──────────┼────────────┼────────┼───────────────────────────────────────┼─────────────────────────┤
│   19.27&#37; │  192.67 ms │ 109796 │   1.75 µs ± 10.19  &#40;  0.95 ‥ 1279.83&#41; │ cuMemAllocFromPoolAsync │
│   17.08&#37; │   170.8 ms │  54898 │   3.11 µs ± 0.27   &#40;  2.15 ‥ 23.84&#41;   │ cuLaunchKernel          │
│   16.77&#37; │  167.67 ms │  54898 │   3.05 µs ± 0.24   &#40;  0.48 ‥ 16.69&#41;   │ cuCtxSynchronize        │
│   14.11&#37; │  141.12 ms │  54898 │   2.57 µs ± 0.79   &#40;  1.67 ‥ 70.57&#41;   │ cuMemcpyHtoDAsync       │
│    1.70&#37; │   17.04 ms │  54898 │ 310.36 ns ± 132.89 &#40;238.42 ‥ 5483.63&#41; │ cuStreamSynchronize     │
└──────────┴────────────┴────────┴───────────────────────────────────────┴─────────────────────────┘

Device-side activity: GPU was busy for 87.38 ms &#40;8.74&#37; of the trace&#41;
┌──────────┬────────────┬───────┬───────────────────────────────────────┬────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                     │ Name               │
├──────────┼────────────┼───────┼───────────────────────────────────────┼────────────────────┤
│    6.66&#37; │   66.61 ms │ 54898 │   1.21 µs ± 0.16   &#40;  0.95 ‥ 1.67&#41;    │ kernel             │
│    2.08&#37; │   20.77 ms │ 54898 │ 378.42 ns ± 147.66 &#40;238.42 ‥ 1192.09&#41; │ &#91;copy to device&#93;   │
└──────────┴────────────┴───────┴───────────────────────────────────────┴────────────────────┘

NVTX ranges:
┌──────────┬────────────┬───────┬────────────────────────────────────────┬─────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                      │ Name                │
├──────────┼────────────┼───────┼────────────────────────────────────────┼─────────────────────┤
│   98.99&#37; │  989.94 ms │ 54898 │  18.03 µs ± 49.88  &#40; 15.26 ‥ 10731.22&#41; │ @bprofile.iteration │
└──────────┴────────────┴───────┴────────────────────────────────────────┴─────────────────────┘</code></pre>
<p>By default, <code>CUDA.@bprofile</code> runs the application for 1 second, but this can be adjusted using the <code>time</code> keyword argument.</p>
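<p>For example, to gather events over a longer window &#40;illustrative invocation; the broadcast being benchmarked is arbitrary&#41;:</p>
<pre><code class="language-julia">using CUDA

x = CuArray([1])

# profile repeated executions for 5 seconds instead of the default 1 second
CUDA.@bprofile time=5 x .+ 1</code></pre>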
<p>Display of the time distribution isn&#39;t limited to <code>CUDA.@bprofile</code>, and will also be used by <code>CUDA.@profile</code> when any operation is called more than once. For example, with the broadcasting example from above we allocate both the input <code>CuArray</code> and the broadcast result, which results in two calls to the allocator:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CuArray&#40;&#91;1&#93;&#41; .&#43; 1Host-side activity:
┌──────────┬────────────┬───────┬─────────────────────────────────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │ Total time │ Calls │ Time distribution                   │ Name                    │
├──────────┼────────────┼───────┼─────────────────────────────────────┼─────────────────────────┤
│   99.92&#37; │   99.42 ms │     1 │                                     │ cuMemcpyHtoDAsync       │
│    0.02&#37; │   21.22 µs │     2 │  10.61 µs ± 6.57   &#40;  5.96 ‥ 15.26&#41; │ cuMemAllocFromPoolAsync │
│    0.02&#37; │   17.88 µs │     1 │                                     │ cuLaunchKernel          │
│    0.00&#37; │  953.67 ns │     1 │                                     │ cuStreamSynchronize     │
└──────────┴────────────┴───────┴─────────────────────────────────────┴─────────────────────────┘</code></pre>
<p>It is also no longer required to specify <code>external&#61;true</code> when using <code>CUDA.@profile</code> in combination with a tool like Nsight Systems, as CUDA.jl will automatically detect the presence of an external profiler:</p>
<pre><code class="language-julia-repl">shell&gt; nsys launch julia# warm-up
julia&gt; CuArray&#40;&#91;1&#93;&#41;.&#43;1
1-element CuArray&#123;Int64, 1, CUDA.Mem.DeviceBuffer&#125;:
 2julia&gt; CUDA.@profile CuArray&#40;&#91;1&#93;&#41;.&#43;1
&#91; Info: This Julia session is already being profiled; defaulting to the external profiler.
Capture range started in the application.
Capture range ended in the application.
Generating &#39;/tmp/nsys-report-c42f.qdstrm&#39;
&#91;1/1&#93; &#91;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;100&#37;&#93; report1.nsys-rep</code></pre>
<p>In case detection fails, the <code>external</code> keyword argument remains available &#40;but please do file an issue&#41;.</p>
<h2 id="kernel_launch_debugging">Kernel launch debugging</h2>
<p>A common issue with CUDA programming is that kernel launches may fail when exhausting certain resources, such as shared memory or registers. This typically results in a cryptic error message, but CUDA.jl will now try to diagnose launch failures and provide a more helpful error message, as suggested by <a href="https://github.com/simonbyrne">@simonbyrne</a>:</p>
<p>For example, when using more parameter memory than allowed by the architecture:</p>
<pre><code class="language-julia-repl">julia&gt; kernel&#40;x&#41; &#61; nothing
julia&gt; @cuda kernel&#40;ntuple&#40;_-&gt;UInt64&#40;1&#41;, 2^13&#41;&#41;
ERROR: Kernel invocation uses too much parameter memory.
64.016 KiB exceeds the 31.996 KiB limit imposed by sm_89 / PTX v8.2.</code></pre>
<p>Or when using an invalid launch configuration, violating a device limit:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;2000 identity&#40;nothing&#41;
ERROR: Number of threads in x-dimension exceeds device limit &#40;2000 &gt; 1024&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
<p>We also diagnose launch failures that involve kernel-specific limits, such as exceeding the number of threads that are allowed in a block &#40;e.g., because of register use&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; @cuda threads&#61;1024 heavy_kernel&#40;&#41;
ERROR: Number of threads per block exceeds kernel limit &#40;1024 &gt; 512&#41;.
caused by: CUDA error: invalid argument &#40;code 1, ERROR_INVALID_VALUE&#41;</code></pre>
<h2 id="sorting_improvements">Sorting improvements</h2>
<p>Thanks to <a href="https://github.com/xaellison">@xaellison</a>, our bitonic sorting implementation now supports sorting specific dimensions, making it possible to implement <code>sortperm</code> for multi-dimensional arrays:</p>
<pre><code class="language-julia-repl">julia&gt; A &#61; cu&#40;&#91;8 7; 5 6&#93;&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 8  7
 5  6

julia&gt; sortperm&#40;A, dims &#61; 1&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 2  4
 1  3

julia&gt; sortperm&#40;A, dims &#61; 2&#41;
2×2 CuArray&#123;Int64, 2, Mem.DeviceBuffer&#125;:
 3  1
 2  4</code></pre>
<p>The bitonic kernel is now used for all sorting operations, replacing the often slower quicksort implementation:</p>
<pre><code class="language-julia-repl"># before &#40;quicksort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
  2.760 ms &#40;30 allocations: 1.02 KiB&#41;

# after &#40;bitonic sort&#41;
julia&gt; @btime CUDA.@sync sort&#40;&#36;&#40;CUDA.rand&#40;1024, 1024&#41;&#41;; dims&#61;1&#41;
  246.386 μs &#40;567 allocations: 13.66 KiB&#41;

# reference CPU time
julia&gt; @btime sort&#40;&#36;&#40;rand&#40;Float32, 1024, 1024&#41;&#41;; dims&#61;1&#41;
  4.795 ms &#40;1030 allocations: 5.07 MiB&#41;</code></pre>
<h2 id="unified_memory_fixes">Unified memory fixes</h2>
<p>CUDA.jl 5.1 greatly improved support for unified memory, and this has continued in CUDA.jl 5.2 and 5.3. Most notably, when broadcasting <code>CuArray</code>s we now correctly preserve the memory type of the input arrays. This means that if you broadcast a <code>CuArray</code> that is allocated as unified memory, the result will also be allocated as unified memory. In case of a conflict, e.g. broadcasting a unified <code>CuArray</code> with one backed by device memory, we will prefer unified memory:</p>
<pre><code class="language-julia-repl">julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; 1
1-element CuArray&#123;Int64, 1, Mem.HostBuffer&#125;:
 2

julia&gt; cu&#40;&#91;1&#93;; host&#61;true&#41; .&#43; cu&#40;&#91;2&#93;; device&#61;true&#41;
1-element CuArray&#123;Int64, 1, Mem.UnifiedBuffer&#125;:
 3</code></pre>
<h2 id="software_updates">Software updates</h2>
<p>Finally, we also did routine updates of the software stack, supporting the latest and greatest from NVIDIA. This includes support for <strong>CUDA 12.4</strong> &#40;Update 1&#41;, <strong>cuDNN 9</strong>, and <strong>cuTENSOR 2.0</strong>. This latest release of cuTENSOR is noteworthy as it revamps the API in a backwards-incompatible way, and CUDA.jl has opted to follow this change. For more details, refer to the <a href="https://docs.nvidia.com/cuda/cutensor/latest/api_transition.html">cuTENSOR 2 migration guide</a> by NVIDIA.</p>
<p>Of course, cuTENSOR.jl also provides a high-level Julia API which has been mostly unaffected by these changes:</p>
<pre><code class="language-julia">using CUDA
A &#61; CUDA.rand&#40;7, 8, 3, 2&#41;
B &#61; CUDA.rand&#40;3, 2, 2, 8&#41;
C &#61; CUDA.rand&#40;3, 3, 7, 2&#41;

using cuTENSOR
tA &#61; CuTensor&#40;A, &#91;&#39;a&#39;, &#39;f&#39;, &#39;b&#39;, &#39;e&#39;&#93;&#41;
tB &#61; CuTensor&#40;B, &#91;&#39;c&#39;, &#39;e&#39;, &#39;d&#39;, &#39;f&#39;&#93;&#41;
tC &#61; CuTensor&#40;C, &#91;&#39;b&#39;, &#39;c&#39;, &#39;a&#39;, &#39;d&#39;&#93;&#41;

using LinearAlgebra
mul&#33;&#40;tC, tA, tB&#41;</code></pre>
<p>This API is still quite underdeveloped, so if you are a user of cuTENSOR.jl and have to adapt to the new API, now is a good time to consider improving the high-level interface instead&#33;</p>
<h2 id="future_releases">Future releases</h2>
<p>The next release of CUDA.jl is gearing up to be a much larger release, with significant changes to both the API and internals of the package. Although the intent is to keep these changes non-breaking, it is always possible that some code will be affected in unexpected ways, so we encourage users to test the upcoming release by simply running <code>&#93; add CUDA#master</code> and report any issues.</p>
]]></content:encoded>
    
  <pubDate>Fri, 26 Apr 2024 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.1: Unified memory and cooperative groups]]></title>
  <link>https://juliagpu.org/post/2023-11-07-cuda_5.1/index.html</link>
  <guid>https://juliagpu.org/2023-11-07-cuda_5.1/</guid>
  <description><![CDATA[CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.1 greatly improves the support of two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice-versa, and cooperative groups which offer a more modular approach to kernel programming.</p>
<h2 id="unified_memory">Unified memory</h2>
<p>Unified memory is a feature of CUDA that allows the programmer to <strong>access memory from both the CPU and GPU</strong>, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.</p>
<p>CUDA.jl already supported unified memory, but only for the most basic use cases. With CUDA.jl 5.1, it is now easier to allocate unified memory, and more convenient to use that memory from the CPU:</p>
<pre><code class="language-julia-repl">julia&gt; gpu &#61; cu&#40;&#91;1., 2.&#93;; unified&#61;true&#41;
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1.0
 2.0

julia&gt; # accessing GPU memory from the CPU
       gpu&#91;1&#93; &#61; 3;

julia&gt; gpu
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 3.0
 2.0</code></pre>
<p>Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is <strong>safe and efficient to perform scalar iteration on <code>CuArray</code>s backed by unified memory</strong>. This greatly simplifies porting applications to the GPU, as it no longer is a problem when code uses <code>AbstractArray</code> fallbacks from Base that process element by element.</p>
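<p>For example, a plain element-by-element loop from the CPU, which would previously have raised a scalar-indexing error, now just works on a unified array &#40;a minimal sketch&#41;:</p>
<pre><code class="language-julia">using CUDA

gpu = cu([1.0, 2.0, 3.0]; unified=true)

# scalar iteration from the CPU is safe on unified memory
total = 0.0
for i in eachindex(gpu)
    total += gpu[i]
end</code></pre>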
<p>In addition, CUDA.jl 5.1 also makes it <strong>easier to convert <code>CuArray</code>s to <code>Array</code> objects</strong>. This is important when wanting to use high-performance CPU libraries like BLAS or LAPACK which do not support <code>CuArray</code>s:</p>
<pre><code class="language-julia-repl">julia&gt; cpu &#61; unsafe_wrap&#40;Array, gpu&#41;
2-element Vector&#123;Float32&#125;:
 3.0
 2.0

julia&gt; LinearAlgebra.BLAS.scal&#33;&#40;2f0, cpu&#41;;

julia&gt; gpu
2-element CuArray&#123;Float32, 1, CUDA.Mem.UnifiedBuffer&#125;:
 6.0
 4.0</code></pre>
<p>The reverse is also possible: CPU-based <code>Array</code>s can now trivially be converted to <code>CuArray</code> objects for use on the GPU, <strong>without the need to explicitly allocate unified memory</strong>. This further simplifies memory management, as it makes it possible to use the GPU inside of an existing application without having to copy data into a <code>CuArray</code>:</p>
<pre><code class="language-julia-repl">julia&gt; gpu &#61; unsafe_wrap&#40;CuArray, cpu&#41;
2-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1
 2

julia&gt; CUDA.@sync gpu .&#43;&#61; 1;

julia&gt; cpu
2-element Vector&#123;Int64&#125;:
 2
 3</code></pre>
<p>Note that the above methods are prefixed <code>unsafe</code> because of how they require <strong>careful management of object lifetimes</strong>: When creating an <code>Array</code> from a <code>CuArray</code>, the <code>CuArray</code> must be kept alive for as long as the <code>Array</code> is used, and vice-versa when creating a <code>CuArray</code> from an <code>Array</code>. Explicit synchronization &#40;i.e. waiting for the GPU to finish computing&#41; is also required, as CUDA.jl cannot synchronize automatically when accessing GPU memory through a CPU pointer.</p>
<p>For now, CUDA.jl still defaults to device memory for unspecified allocations. This can be changed using the <code>default_memory</code> <a href="https://github.com/JuliaPackaging/Preferences.jl">preference</a> of the CUDA.jl module, which can be set to either <code>&quot;device&quot;</code>, <code>&quot;unified&quot;</code> or <code>&quot;host&quot;</code>. When these changes have been sufficiently tested, and the remaining rough edges have been smoothed out, we may consider switching the default allocator.</p>
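<p>Changing this preference can be done with Preferences.jl; for example &#40;a sketch, taking effect after restarting Julia&#41;:</p>
<pre><code class="language-julia">using CUDA, Preferences

# make unified memory the default for new allocations
set_preferences!(CUDA, "default_memory" =&gt; "unified")</code></pre>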
<h2 id="cooperative_groups">Cooperative groups</h2>
<p>Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA cooperative groups API. Cooperative groups are a low-level feature of CUDA that make it possible to <strong>write kernels that are more flexible than the traditional approach</strong> of differentiating computations based on thread and block indices. Instead, cooperative groups allow the programmer to use objects representing groups of threads, pass those around, and differentiate computations based on queries on those objects.</p>
<p>For example, let&#39;s port the example from the <a href="https://developer.nvidia.com/blog/cooperative-groups/">introductory NVIDIA blog post</a>, which provides a function to compute the sum of an array in parallel:</p>
<pre><code class="language-julia">function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each thread adds its partial sum[i] to sum[lane+i]
    i = CG.num_threads(group) ÷ 2
    while i &gt; 0
        temp[lane] = val
        CG.sync(group)
        if lane &lt;= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val  # note: only thread 1 will return full sum
end</code></pre>
<p>When the threads of a group call this function, they cooperatively compute the sum of the values passed by each thread in the group. For example, let&#39;s write a kernel that calls this function using a group representing the current thread block:</p>
<pre><code class="language-julia">function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)

    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i &lt;= length(input)
        sum += input[i]
        i += stride
    end

    return sum
end

n = 1&lt;&lt;24
threads = 256
blocks = cld(n, threads)

data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)</code></pre>
<p>This style of programming makes it possible to write kernels that are safer and more modular than traditional kernels. Some CUDA features also require the use of cooperative groups, for example, asynchronous memory copies between global and shared memory are done using the <code>CG.memcpy_async</code> function.</p>
<p>With CUDA.jl 5.1, it is now possible to use a large part of these APIs from Julia. Support has been added for implicit groups &#40;with the exception of cluster groups and the deprecated multi-grid groups&#41;, all relevant queries on these groups, as well as the many important collective functions, such as <code>shuffle</code>, <code>vote</code>, and <code>memcpy_async</code>. Support for explicit groups is still missing, as are collectives like <code>reduce</code> and <code>invoke</code>. For more information, refer to <a href="https://cuda.juliagpu.org/dev/development/kernel/#Cooperative-groups">the CUDA.jl documentation</a>.</p>
<h2 id="other_updates">Other updates</h2>
<p>Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements:</p>
<ul>
<li><p>Support for CUDA 12.3</p>
</li>
<li><p>Performance improvements related to memory copies, which regressed in CUDA 5.0</p>
</li>
<li><p>Improvements to the native profiler &#40;<code>CUDA.@profile</code>&#41;, now also showing local memory usage, supporting more NVTX metadata, and with better support for Pluto.jl and Jupyter</p>
</li>
<li><p>Many CUSOLVER and CUSPARSE improvements by <a href="https://github.com/amontoison">@amontoison</a></p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 07 Nov 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 5.0: Integrated profiler and task synchronization changes]]></title>
  <link>https://juliagpu.org/post/2023-09-19-cuda_5.0/index.html</link>
  <guid>https://juliagpu.org/2023-09-19-cuda_5.0/</guid>
  <description><![CDATA[CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 5.0 is a major release that adds an integrated profiler to CUDA.jl, and reworks how tasks are synchronized. The release is slightly breaking, as it changes how local toolkits are handled and raises the minimum Julia and CUDA versions.</p>
<h2 id="integrated_profiler">Integrated profiler</h2>
<p>The most exciting new feature in CUDA.jl 5.0 is <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2024">the new integrated profiler</a>, which is similar to the <code>@profile</code> macro from the Julia standard library. The profiler can be used by simply prefixing any code that uses the CUDA libraries with <code>CUDA.@profile</code>:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile CUDA.rand&#40;1&#41;.&#43;1
Profiler ran for 268.46 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 230.79 µs &#40;85.97&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬─────────────────────────┐
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                    │
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼─────────────────────────┤
│   76.47&#37; │ 205.28 µs │     1 │ 205.28 µs │ 205.28 µs │ 205.28 µs │ cudaLaunchKernel        │
│    5.42&#37; │  14.54 µs │     2 │   7.27 µs │   5.01 µs │   9.54 µs │ cuMemAllocFromPoolAsync │
│    2.93&#37; │   7.87 µs │     1 │   7.87 µs │   7.87 µs │   7.87 µs │ cuLaunchKernel          │
│    0.36&#37; │ 953.67 ns │     2 │ 476.84 ns │    0.0 ns │ 953.67 ns │ cudaGetLastError        │
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴─────────────────────────┘

Device-side activity: GPU was busy for 2.15 µs &#40;0.80&#37; of the trace&#41;
┌──────────┬───────────┬───────┬───────────┬───────────┬───────────┬──────────────────────────────
│ Time &#40;&#37;&#41; │      Time │ Calls │  Avg time │  Min time │  Max time │ Name                        ⋯
├──────────┼───────────┼───────┼───────────┼───────────┼───────────┼──────────────────────────────
│    0.44&#37; │   1.19 µs │     1 │   1.19 µs │   1.19 µs │   1.19 µs │ _Z13gen_sequencedI17curandS ⋯
│    0.36&#37; │ 953.67 ns │     1 │ 953.67 ns │ 953.67 ns │ 953.67 ns │ _Z16broadcast_kernel15CuKer ⋯
└──────────┴───────────┴───────┴───────────┴───────────┴───────────┴──────────────────────────────
                                                                                  1 column omitted
1-element CuArray&#123;Float32, 1, CUDA.Mem.DeviceBuffer&#125;:
 1.7242923</code></pre>
<p>The output shown above is a summary of what happened during the execution of the code. It is split into two sections: <strong>host-side activity</strong>, i.e., API calls to the CUDA libraries, and the resulting <strong>device-side activity</strong>. As part of each section, the output shows the time spent and the ratio to the total execution time. These ratios are important, and a good tool to quickly assess the performance of your code. For example, in the above output, we see that most of the time is spent on the host calling the CUDA libraries, and only very little time is actually spent computing things on the GPU. This indicates that the GPU is severely underutilized, which can be solved by increasing the problem size.</p>
<p>Instead of a summary, it is also possible to view a <strong>chronological trace</strong> by passing the <code>trace&#61;true</code> keyword argument:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.@profile trace&#61;true CUDA.rand&#40;1&#41;.&#43;1;
Profiler ran for 262.98 µs, capturing 21 events.

Host-side activity: calling CUDA APIs took 227.21 µs &#40;86.40&#37; of the trace&#41;
┌────┬───────────┬───────────┬─────────────────────────┬────────────────────────┐
│ ID │     Start │      Time │                    Name │ Details                │
├────┼───────────┼───────────┼─────────────────────────┼────────────────────────┤
│  5 │   6.44 µs │   9.06 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│  7 │  19.31 µs │ 715.26 ns │        cudaGetLastError │ -                      │
│  8 │  22.41 µs │ 204.09 µs │        cudaLaunchKernel │ -                      │
│  9 │ 227.21 µs │    0.0 ns │        cudaGetLastError │ -                      │
│ 14 │  232.7 µs │   3.58 µs │ cuMemAllocFromPoolAsync │ 4 bytes, device memory │
│ 18 │ 250.34 µs │   7.39 µs │          cuLaunchKernel │ -                      │
└────┴───────────┴───────────┴─────────────────────────┴────────────────────────┘

Device-side activity: GPU was busy for 2.38 µs &#40;0.91&#37; of the trace&#41;
┌────┬───────────┬─────────┬─────────┬────────┬──────┬────────────────────────────────────────────
│ ID │     Start │    Time │ Threads │ Blocks │ Regs │ Name                                      ⋯
├────┼───────────┼─────────┼─────────┼────────┼──────┼────────────────────────────────────────────
│  8 │ 225.31 µs │ 1.19 µs │      64 │     64 │   38 │ _Z13gen_sequencedI17curandStateXORWOWfiXa ⋯
│ 18 │ 257.73 µs │ 1.19 µs │       1 │      1 │   18 │ _Z16broadcast_kernel15CuKernelContext13Cu ⋯
└────┴───────────┴─────────┴─────────┴────────┴──────┴────────────────────────────────────────────
                                                                                  1 column omitted</code></pre>
<p>Here, we can see a list of events that the profiler captured. Each event has a unique ID, which can be used to correlate host-side and device-side events. For example, we can see that event 8 on the host is a call to <code>cudaLaunchKernel</code>, which corresponds to the execution of a CURAND kernel on the device.</p>
<p>The integrated profiler is a great tool to quickly assess the performance of your GPU application, identify bottlenecks, and find opportunities for optimization. For complex applications, however, it is still recommended to use NVIDIA&#39;s Nsight Systems or Nsight Compute profilers, which provide a more detailed, graphical view of what is happening on the GPU.</p>
<h2 id="synchronization_on_worker_threads">Synchronization on worker threads</h2>
<p>Another noteworthy change affects how tasks are synchronized. To enable concurrent execution, i.e., to make it possible for other Julia tasks to execute while waiting for the GPU to finish, CUDA.jl used to rely on so-called stream callbacks. These callbacks were a significant source of latency, at least 25 µs per invocation but sometimes <em>much</em> longer, and have also been slated for deprecation and eventual removal from the CUDA toolkit.</p>
<p>Instead, on Julia 1.9 and later, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2025">now uses</a> worker threads to wait for GPU operations to finish. This mechanism is significantly faster, taking around 5 µs per invocation, but more importantly offers a much more reliable and predictable latency. You can observe this mechanism using the integrated profiler:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CUDA.rand&#40;1024, 1024, 1024&#41;
julia&gt; CUDA.@profile trace&#61;true CUDA.@sync a .&#43; a
Profiler ran for 12.29 ms, capturing 527 events.

Host-side activity: calling CUDA APIs took 11.75 ms &#40;95.64&#37; of the trace&#41;
┌─────┬───────────┬───────────┬────────┬─────────────────────────┐
│  ID │     Start │      Time │ Thread │                    Name │
├─────┼───────────┼───────────┼────────┼─────────────────────────┤
│   5 │   6.91 µs │  13.59 µs │      1 │ cuMemAllocFromPoolAsync │
│   9 │  36.72 µs │ 199.56 µs │      1 │          cuLaunchKernel │
│ 525 │ 510.69 µs │  11.75 ms │      2 │     cuStreamSynchronize │
└─────┴───────────┴───────────┴────────┴─────────────────────────┘</code></pre>
<p>For some users, this may still be too slow, so we have added two mechanisms that disable nonblocking synchronization and simply block the calling thread until the GPU operation finishes. The first is a global setting: the <code>nonblocking_synchronization</code> preference, which can be set to <code>false</code> using Preferences.jl. <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2060">The second</a> is a fine-grained flag to pass to synchronization functions: <code>synchronize&#40;x; blocking&#61;true&#41;</code>, <code>CUDA.@sync blocking&#61;true ...</code>, etc. Neither mechanism should be used widely; they are only intended for latency-critical code, e.g., when benchmarking or profiling.</p>
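<p>As a sketch of the global mechanism &#40;assuming the preference belongs to the CUDA.jl package; a restart is required for it to take effect&#41;:</p>
<pre><code class="language-julia">using CUDA, Preferences

# Persist the preference in the active environment&#39;s LocalPreferences.toml;
# Julia needs to be restarted for it to take effect.
set_preferences&#33;&#40;CUDA, &quot;nonblocking_synchronization&quot; &#61;&gt; false&#41;</code></pre>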
<h2 id="local_toolkit_discovery">Local toolkit discovery</h2>
<p>One of the breaking changes involves <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2058">how local toolkits are discovered</a> when opting out of the use of artifacts. Previously, this could be enabled by calling <code>CUDA.set_runtime_version&#33;&#40;&quot;local&quot;&#41;</code>, which generated a <code>version &#61; &quot;local&quot;</code> preference. We have now split this into two separate preferences, <code>version</code> and <code>local</code>: the <code>version</code> preference overrides the version of the CUDA toolkit, while the <code>local</code> preference independently indicates whether to use a local CUDA toolkit.</p>
<p>Concretely, this means that you will now need to call <code>CUDA.set_runtime_version&#33;&#40;local_toolkit&#61;true&#41;</code> to enable the use of a local toolkit. The toolkit version will be auto-detected, but can be overridden by also passing a version: <code>CUDA.set_runtime_version&#33;&#40;version; local_toolkit&#61;true&#41;</code>. This may be necessary when CUDA is not available during precompilation, e.g., on the log-in node of a cluster, or when building a container image.</p>
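<p>For example, switching an environment over to a local toolkit might look as follows &#40;a sketch; the auto-detected version will differ per system, and the explicit version shown is only illustrative&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA

julia&gt; CUDA.set_runtime_version&#33;&#40;local_toolkit&#61;true&#41;

julia&gt; CUDA.set_runtime_version&#33;&#40;v&quot;12.2&quot;; local_toolkit&#61;true&#41;</code></pre>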
<h2 id="raised_minimum_requirements">Raised minimum requirements</h2>
<p>Finally, CUDA.jl 5.0 raises the minimum Julia and CUDA versions. The minimum Julia version is now 1.8, which should be enforced by the Julia package manager. The minimum CUDA toolkit version is now 11.4, but this cannot be enforced by the package manager. As a result, if you need to use an older version of the CUDA toolkit, you will need to pin CUDA.jl to v4.4 or below. <a href="https://github.com/JuliaGPU/CUDA.jl/blob/master/README.md">The README</a> will maintain a table of supported CUDA toolkit versions.</p>
<p>Most users will not be affected by this change: If you use the artifact-provided CUDA toolkit, you will automatically get the latest version supported by your CUDA driver.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2034">Support for CUDA 12.2</a>;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/2040">Memory limits</a> are now enforced by CUDA, resulting in better performance;</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1946">Support for Julia 1.10</a> &#40;with help from <a href="https://github.com/dkarrasch">@dkarrasch</a>&#41;;</p>
</li>
<li><p>Support for batched <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1975"><code>gemm</code></a>, <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1981"><code>gemv</code></a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/2063"><code>svd</code></a> &#40;by <a href="https://github.com/lpawela">@lpawela</a> and <a href="https://github.com/nikopj">@nikopj</a>&#41;.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Tue, 19 Sep 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Profiling oneAPI.jl applications with VTune]]></title>
  <link>https://juliagpu.org/post/2023-07-19-oneapi_profiling/index.html</link>
  <guid>https://juliagpu.org/2023-07-19-oneapi_profiling/</guid>
  <description><![CDATA[Profiling GPU applications is hard, so this post shows how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia with oneAPI.jl.]]></description>  
  
  <content:encoded><![CDATA[
<p>Profiling GPU applications is hard, so this post shows how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia with oneAPI.jl.</p>
<p>Because of the asynchronous nature of GPU execution, profiling GPU applications with Julia&#39;s tried and tested tools like <code>@profile</code> or even <code>@time</code> can be misleading: They will only show the time spent on the CPU, and will likely report that your application is spending most of its time waiting for the GPU.</p>
<p>To get a better understanding of what is happening on the GPU, we need specialized tools. In this post, we&#39;ll show how to use Intel&#39;s VTune Profiler to profile GPU applications written in Julia using oneAPI.jl.</p>
<h2 id="set-up">Set-up</h2>
<p>Start by downloading and installing the <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html">Intel VTune Profiler</a>. This does not require administrative permissions, and will install in your home folder under the <code>intel</code> directory. On Linux, binaries will appear in <code>~/intel/oneapi/vtune/latest/bin64</code>. There are three that are particularly important:</p>
<ul>
<li><p><code>vtune</code>: a command-line tool to profile applications;</p>
</li>
<li><p><code>vtune-gui</code>: a graphical user interface to profile applications, or to visualize the results of a command-line profiling session;</p>
</li>
<li><p><code>vtune-backend</code>: a daemon that creates a web interface for VTune, which you can use to profile applications both locally and remotely.</p>
</li>
</ul>
<h2 id="hello_vtune">Hello VTune&#33;</h2>
<p>Let&#39;s start with a simple example: A Julia program that computes the sum of two arrays &#40;i.e., the <a href="https://github.com/JuliaGPU/oneAPI.jl/blob/master/examples/vadd.jl"><code>vadd</code> example</a> from the oneAPI repository&#41;:</p>
<pre><code class="language-julia">using oneAPI

function kernel&#40;a, b, c&#41;
    i &#61; get_global_id&#40;&#41;
    @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
    return
end

function vadd&#40;a, b&#41;
    d_a &#61; oneArray&#40;a&#41;
    d_b &#61; oneArray&#40;b&#41;
    d_c &#61; similar&#40;d_a&#41;

    @oneapi items&#61;size&#40;d_c&#41; kernel&#40;d_a, d_b, d_c&#41;
    Array&#40;d_c&#41;
end

function main&#40;N&#61;256&#41;
    a &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    b &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    c &#61; vadd&#40;a, b&#41;
end
main&#40;&#41;</code></pre>
<p>We&#39;ve tweaked this example to make it more suited for profiling: We&#39;ve enclosed the main application in a function so that it gets compiled, and we&#39;ve increased the array sizes to make the GPU work harder.</p>
<p>There are several ways to profile this application. We&#39;ll start by demonstrating the command-line interface:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-offload julia vadd.jlvtune: Collection started.
vtune: Collection stopped.vtune: Using result path &#96;/home/tim/Julia/pkg/oneAPI/r000gh&#39;
    GPU Time: 0.002s
EU Array Stalled/Idle: 100.0&#37; of Elapsed time with GPU busy
 | The percentage of time when the EUs were stalled or idle is high, which has a
 | negative impact on compute-bound applications.
FPU Utilization: 0.0&#37; of Elapsed time with GPU busy
...</code></pre>
<p>This will run the application, and collect a number of GPU-related metrics. A summary is shown in the terminal, and a more detailed report will be written to a directory in the current working directory. You can open that report with the graphical user interface, possibly even on a different machine:</p>
<pre><code class="language-julia">&#36; vtune-gui r000gh</code></pre>
<h2 id="instrumenting_the_application">Instrumenting the application</h2>
<p>The trace we just collected includes the time spent compiling our application, making it difficult to analyze what is happening. To refine the trace, we can instrument our application with Intel&#39;s Instrumentation and Tracing Technology &#40;ITT&#41; APIs:</p>
<ul>
<li><p>only start the profiler when we&#39;re running code of interest;</p>
</li>
<li><p>add markers to the trace to indicate what is happening.</p>
</li>
</ul>
<p>We can interface with the ITT APIs using the <a href="https://github.com/JuliaPerf/IntelITT.jl">IntelITT.jl</a> package. Let&#39;s update our example:</p>
<pre><code class="language-julia">using oneAPI, IntelITT

# same as before

function main&#40;N&#61;256&#41;
    a &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    b &#61; round.&#40;rand&#40;Float32, N&#41; * 100&#41;
    c &#61; IntelITT.@task &quot;vadd&quot; oneAPI.@sync vadd&#40;a, b&#41;
end

# warm-up
main&#40;&#41;

# actual profile
IntelITT.@collect main&#40;&#41;</code></pre>
<p>Here, the <code>IntelITT.@collect</code> macro will start and stop the collection, so we should launch VTune with the <code>-start-paused</code> option:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-offload -start-paused julia vadd.jl</code></pre>
<p>In the GUI, we can now clearly see a nicely packed stream of API calls, grouped under the <code>vadd</code> task we added. Note that because API calls are asynchronous, i.e., they return immediately, before the GPU has executed them, we grouped them under a <code>oneAPI.@sync</code> call so that the task captures not only the time spent on the CPU, but also the time spent on the GPU. This may not be what you want for your application.</p>
<p><img src="vtune_timeline.png" alt="VTune timeline" /></p>
<h2 id="kernel_details">Kernel details</h2>
<p>The timeline view is great for getting an application-level overview of what is happening, but once you&#39;ve isolated a kernel that doesn&#39;t perform as expected, you may want to switch from the GPU Offload to the GPU Compute Hotspots analysis. Here, you get a more detailed view of what&#39;s happening during execution on the GPU, including the memory bandwidth and execution properties:</p>
<pre><code class="language-julia">&#36; vtune -collect gpu-hotspots -start-paused julia vadd.jl</code></pre>
<p><img src="vtune_gpu_hotspots.png" alt="VTune timeline" /></p>
<p>Many of these analyses can be configured to collect more or less data, at the cost of correspondingly more or less overhead.</p>
<h2 id="working_remotely">Working remotely</h2>
<p>In many cases, your local system will not have a GPU, and you will want to profile an application running on a remote system. As shown above, you can use the <code>vtune</code> CLI to create a trace and open it locally using <code>vtune-gui</code>; however, there is an easier way: the <code>vtune-backend</code> daemon.</p>
<p>Start by launching the VTune back-end on the remote system:</p>
<pre><code class="language-julia">&#36; vtune-backend --enable-server-profiling --web-port 8443 --log-to-console</code></pre>
<p>If your remote system is directly reachable, you can add <code>--allow-remote-access --base-url &quot;https://remoteServer:8443&quot;</code>. Most people, however, will need to set up an SSH tunnel:</p>
<pre><code class="language-julia">&#36; ssh -L 8443:localhost:8443 remoteServer</code></pre>
<p>You can now access the VTune GUI at <code>https://localhost:8443/</code>. Note that the first time you connect, you will need to do so using the one-time URL that is shown in the terminal where you launched the <code>vtune-backend</code> daemon.</p>
<p>The web interface that <code>vtune-backend</code> provides is identical to the GUI from <code>vtune-gui</code>: Start by creating a new project, and configuring an analysis: Select the local VTune profile server, enter the path to the Julia executable along with arguments and a working directory, and select the GPU Offload analysis type:</p>
<p><img src="vtune_webui.png" alt="VTune WebUI" /></p>
<p>To start the analysis, click the big blue play button. If you use <code>IntelITT.@collect</code> to restrict the trace to the code of interest, use the second button with the pause symbol.</p>
<h2 id="give_it_a_try">Give it a try&#33;</h2>
<p>Hopefully, this guide has shed some light on how to accurately profile oneAPI.jl applications using Intel&#39;s VTune Profiler. It turns out that one package could significantly benefit from some rigorous profiling: oneAPI.jl&#33; Until now, development has focused on correctness and usability, leaving considerable room for performance enhancements.</p>
<p>If you have access to an Intel GPU and want to gain experience profiling GPU applications with VTune, we encourage you to get involved&#33; A good starting point would be analyzing some of oneAPI.jl&#39;s array operations like <code>mapreduce</code> or <code>broadcast</code> to identify potential bottlenecks. For more information or any queries, feel free to open an issue on GitHub, or join the discussion on Slack or Discourse. Your help could make a significant difference&#33;</p>
]]></content:encoded>
    
  <pubDate>Wed, 19 Jul 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Metal.jl 0.2: Metal Performance Shaders]]></title>
  <link>https://juliagpu.org/post/2023-03-03-metal_0.2/index.html</link>
  <guid>https://juliagpu.org/2023-03-03-metal_0.2/</guid>
  <description><![CDATA[Metal.jl 0.2 marks a significant milestone in the development of the Metal.jl package. The release comes with initial support for the Metal Performance Shaders &#40;MPS&#41; framework for accelerating common operations like matrix multiplications, as well as various improvements for writing Metal kernels in Julia.]]></description>  
  
  <content:encoded><![CDATA[
<p>Metal.jl 0.2 marks a significant milestone in the development of the Metal.jl package. The release comes with initial support for the Metal Performance Shaders &#40;MPS&#41; framework for accelerating common operations like matrix multiplications, as well as various improvements for writing Metal kernels in Julia.</p>
<h2 id="metal_performance_shaders">Metal Performance Shaders</h2>
<p>Quoting the <a href="https://developer.apple.com/documentation/metalperformanceshaders">Apple documentation</a>: &quot;The Metal Performance Shaders &#40;MPS&#41; framework contains a collection of highly optimized compute and graphics shaders for use in Metal applications.&quot; With Metal.jl 0.2, we have added initial support for this framework, and used it to accelerate the matrix multiplication operation:</p>
<pre><code class="language-julia-repl">julia&gt; using Metal, LinearAlgebra, BenchmarkTools
julia&gt; n &#61; p &#61; m &#61; 2048
julia&gt; flops &#61; n*m*&#40;2p-1&#41;
17175674880julia&gt; a &#61; MtlArray&#40;rand&#40;Float32, n, p&#41;&#41;;
julia&gt; b &#61; MtlArray&#40;rand&#40;Float32, p, m&#41;&#41;;
julia&gt; c &#61; MtlArray&#40;zeros&#40;Float32, n, m&#41;&#41;;julia&gt; using LinearAlgebra
julia&gt; bench &#61; @benchmark Metal.@sync mul&#33;&#40;c, a, b&#41;
BenchmarkTools.Trial: 518 samples with 1 evaluation.
 Range &#40;min … max&#41;:  9.366 ms …  13.354 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 0.00&#37;
 Time  &#40;median&#41;:     9.629 ms               ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   9.646 ms ± 192.169 μs  ┊ GC &#40;mean ± σ&#41;:  0.00&#37; ± 0.00&#37;               ▃▂▅▅▆▆▆▇█▇▇▆▅▄▄▁▁ ▁
  ▄▁▄▄▄▄▆▆▆▄▄▁▇█████████████████▄█▄▁▆▁▄▁▆▁▇▁▄▄▁▁▄▄▇▁▄▆▄▁▁▁▁▁▄ █
  9.37 ms      Histogram: log&#40;frequency&#41; by time      10.1 ms &lt; Memory estimate: 352 bytes, allocs estimate: 12.julia&gt; flops / &#40;minimum&#40;bench.times&#41;/1e9&#41;
1.83e12</code></pre>
<p>The benchmark above shows that, on an 8-core M1 Pro, matrix multiplication now reaches 1.8 TFLOPS &#40;out of the 2.6 TFLOPS of theoretical peak performance&#41;. The accelerated matrix multiplication is available for a variety of input types, including mixed-mode operations, and as shown above is integrated with the LinearAlgebra.jl <code>mul&#33;</code> interface.</p>
<p>Of course, the MPS framework offers more than just matrix multiplication, and we expect to support more of it in the future. If you have a specific operation you would like to use from Julia, please let us know by opening an issue on the Metal.jl repository.</p>
<h2 id="gpu_profiling_support">GPU profiling support</h2>
<p>To support the development of Metal kernels, <a href="https://github.com/max-Hawkins">Max Hawkins</a> has added support for GPU profiling. Similar to how this works in CUDA.jl, you can run code under the <code>Metal.@profile</code> macro to record its execution. However, this does first require setting the <code>METAL_CAPTURE_ENABLED</code> environment variable <em>before</em> importing Metal.jl:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;METAL_CAPTURE_ENABLED&quot;&#93; &#61; 1julia&gt; using Metaljulia&gt; a &#61; mtl&#40;rand&#40;1024, 1024&#41;&#41;
julia&gt; Metal.@profile sum&#40;a&#41;
&#91; Info: GPU frame capture saved to jl_metal.gputrace/</code></pre>
<p>The resulting capture can be opened with Xcode, presenting a timeline that&#39;s similar to other profilers:</p>
<figure>
  <img src="https://juliagpu.org/post/2023-03-03-metal_0.2/xcode.png" alt="XCode viewing a Metal.jl capture trace">
</figure><h2 id="other_improvements">Other improvements</h2>
<ul>
<li><p>Julia 1.9 is supported, but requires an up-to-date macOS version &#40;issues have been encountered on macOS 12.4&#41;;</p>
</li>
<li><p>An <code>mtl</code> function has been added for converting Julia arrays to Metal arrays, similar to the <code>cu</code> function in CUDA.jl;</p>
</li>
<li><p>Multiple GPUs are supported, and the <code>device&#33;</code> function can be used to select one;</p>
</li>
<li><p>Coverage for SIMD Group functions has been improved, so it is now possible to use <code>simdgroup_load</code>, <code>simdgroup_store</code>, <code>simdgroup_multiply</code>, and <code>simdgroup_multiply_accumulate</code> in kernel functions.</p>
</li>
</ul>
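<p>As a quick illustration of the new <code>mtl</code> conversion function &#40;a sketch, assuming a Metal-capable system&#41;:</p>
<pre><code class="language-julia">using Metal

# Upload a Julia array to the GPU &#40;similar to &#96;cu&#96; in CUDA.jl&#41;:
a &#61; mtl&#40;rand&#40;Float32, 1024&#41;&#41;

# Array operations now execute on the GPU:
b &#61; a .* 2f0

# Copy the result back to the CPU:
Array&#40;b&#41;</code></pre>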
<h2 id="future_work">Future work</h2>
<p>Although Metal.jl is now usable for a variety of applications, there is still work to be done before it can be considered production-ready. In particular:</p>
<ul>
<li><p>there are known performance issues with <code>mapreduce</code>, and other operations that rely on <code>CartesianIndices</code>;</p>
</li>
<li><p>the <code>libcmt</code> wrapper library for interfacing with the Metal APIs is cumbersome to use and improve, and we are looking into native ObjectiveC FFI instead;</p>
</li>
<li><p>the MPS wrappers are incomplete, and, like the Metal APIs, require a replacement for <code>libcmt</code> before they can be improved;</p>
</li>
<li><p>support for atomic operations is missing, which is required to implement a full-featured KernelAbstractions.jl back-end.</p>
</li>
</ul>
<p>Once &#40;most of&#41; these issues are addressed, we should be able to release Metal.jl 1.0.</p>
]]></content:encoded>
    
  <pubDate>Fri, 03 Mar 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl 1.0: oneMKL, Intel Arc and Julia 1.9]]></title>
  <link>https://juliagpu.org/post/2023-02-08-oneapi_1.0/index.html</link>
  <guid>https://juliagpu.org/2023-02-08-oneapi_1.0/</guid>
  <description><![CDATA[The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library &#40;oneMKL&#41; to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.]]></description>  
  
  <content:encoded><![CDATA[
<p>The release of oneAPI.jl 1.0 adds integration with the oneAPI Math Kernel Library &#40;oneMKL&#41; to accelerate linear algebra operations on Intel GPUs. It also brings support for Julia 1.9 and Intel Arc GPUs.</p>
<h2 id="onemkl_integration">oneMKL integration</h2>
<p>oneAPI.jl now uses the Intel oneAPI Math Kernel Library &#40;oneMKL&#41;, automatically downloaded as part of <code>oneAPI_Support_jll.jl</code>, to accelerate a great number of BLAS and LAPACK operations on Intel GPUs. Similar to how it is implemented in our other GPU back-ends, these wrappers are available at different levels of abstraction.</p>
<p>At the lowest level, we use a C library that wraps the oneMKL C&#43;&#43; APIs. For example, the <code>oneapi::mkl::blas::column_major::gemm</code> function for matrix-matrix multiplication is wrapped by the C functions <code>onemklSgemm</code>, <code>onemklDgemm</code>, etc. These wrappers are used to implement low-level methods like <code>oneMKL.gemm&#33;</code>:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPIjulia&gt; A &#61; oneArray&#40;rand&#40;Float32, 2, 3&#41;&#41;;
2×3 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.44302   0.125576  0.859145
 0.674291  0.428346  0.0400119
julia&gt; B &#61; oneArray&#40;rand&#40;Float32, 3, 4&#41;&#41;
3×4 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.592748   0.529413   0.0323396  0.659528
 0.22489    0.0872259  0.253291   0.376519
 0.0121506  0.591135   0.706755   0.751686
julia&gt; C &#61; similar&#40;B, &#40;2, 4&#41;&#41;;julia&gt; oneMKL.gemm&#33;&#40;&#39;N&#39;, &#39;N&#39;, true, A, B, true, C&#41;
2×4 oneMatrix&#123;Float32, oneAPI.oneL0.DeviceBuffer&#125;:
 0.301279  0.753365  0.65334   0.985274
 0.496501  0.417994  0.158581  0.63607julia&gt; Array&#40;C&#41; ≈ Array&#40;A&#41; * Array&#40;B&#41;
true</code></pre>
<p>Of course, these low-level functions aren&#39;t very user-friendly, so we also integrate with Julia&#39;s standard libraries where possible:</p>
<pre><code class="language-julia-repl">julia&gt; A &#61; oneArray&#40;rand&#40;Float32, 2, 3&#41;&#41;;
julia&gt; B &#61; oneArray&#40;rand&#40;Float32, 3, 4&#41;&#41;;julia&gt; using LinearAlgebra
julia&gt; C &#61; A * B;julia&gt; Array&#40;C&#41; ≈ Array&#40;A&#41; * Array&#40;B&#41;
true</code></pre>
<p>The most frequently used oneMKL BLAS functions have been wrapped and integrated with Julia’s standard linear algebra libraries. If you run into a missing function, please file a request to add it, or take a look at the source and contribute to oneAPI.jl&#33; The current state of the wrappers should make it easy to extend their functionality, as well as form a good basis for integrating with other libraries like oneDNN.</p>
<h2 id="intel_arc_support">Intel Arc support</h2>
<p>The new Arc series of discrete Intel GPUs are now fully supported by oneAPI.jl. These GPUs offer a significant performance improvement over their integrated predecessors:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPI
julia&gt; oneAPI.versioninfo&#40;&#41;
1 device:
- Intel&#40;R&#41; Arc&#40;TM&#41; A770 Graphics &#91;0x56a0&#93;julia&gt; T &#61; Float32;
julia&gt; n &#61; p &#61; m &#61; 2048;
julia&gt; a &#61; oneArray&#40;rand&#40;T, n, p&#41;&#41;;
julia&gt; b &#61; oneArray&#40;rand&#40;T, p, m&#41;&#41;;
julia&gt; c &#61; oneArray&#40;zeros&#40;T, n, m&#41;&#41;;julia&gt; using BenchmarkTools, LinearAlgebra
julia&gt; bench &#61; @benchmark oneAPI.@sync mul&#33;&#40;c, a, b&#41;
BenchmarkTools.Trial: 1510 samples with 1 evaluation.
 Range &#40;min … max&#41;:  3.233 ms …  3.791 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 0.00&#37;
 Time  &#40;median&#41;:     3.298 ms              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   3.308 ms ± 48.426 μs  ┊ GC &#40;mean ± σ&#41;:  0.00&#37; ± 0.00&#37;        ▁▃▄▇█▅▄▃▂   ▁▁▁
  ▁▁▃▃▅▇██████████████████▇▇▇▅▆▄▅▅▄▂▃▂▂▂▂▂▂▁▂▂▂▁▂▁▂▁▂▂▂▂▁▁▂▂ ▃
  3.23 ms        Histogram: frequency by time        3.47 ms &lt; Memory estimate: 272 bytes, allocs estimate: 11.julia&gt; flops &#61; n*m*&#40;2p-1&#41;
17175674880julia&gt; flops / &#40;minimum&#40;bench.times&#41;/1e9&#41;
5.3131281169900205e12</code></pre>
<p>For example, here we&#39;re getting over 5 TFLOPS of Float32 performance, which is over 10x faster than the Intel Xe Graphics G7 we had previously been using for oneAPI.jl development. At the same time, the A770 used above should be able to deliver close to 20 TFLOPS, so there&#39;s still room for improvement in our software stack.</p>
<p>To use oneAPI.jl with an Arc series GPU, you need to run Linux 6.2. At the time of writing, that kernel is still in beta, so refer to your distribution&#39;s documentation for how to install it. For example, on Arch Linux you can use the <a href="https://aur.archlinux.org/packages/linux-mainline"><code>linux-mainline</code> package from the AUR</a>, Ubuntu has the <a href="https://wiki.ubuntu.com/Kernel/MainlineBuilds"><code>kernel-ppa</code> archive</a>, Fedora provides the <a href="https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories"><code>stable-rc</code> repository</a>, etc.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p>Support for Julia 1.9 has been added.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 4.0]]></title>
  <link>https://juliagpu.org/post/2023-02-01-cuda_4.0/index.html</link>
  <guid>https://juliagpu.org/2023-02-01-cuda_4.0/</guid>
  <description><![CDATA[CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime, and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 4.0 is a breaking release that introduces the use of JLLs to provide the CUDA toolkit. This makes it possible to compile other binary libraries against the CUDA runtime, and use them together with CUDA.jl. The release also brings CUSPARSE improvements, the ability to limit memory use, and many bug fixes and performance improvements.</p>
<h2 id="jlls_for_cuda_artifacts">JLLs for CUDA artifacts</h2>
<p>While CUDA.jl has been using binary artifacts for a while, it was manually managing installation and selection of them, i.e., not by using standardised JLL packages. This complicated use of the artifacts by other packages, and made it difficult to build other binary packages against the CUDA runtime.</p>
<p>With CUDA.jl 4.0, we now use JLLs to load the CUDA driver and runtime. Specifically, there are two JLLs in play: <code>CUDA_Driver_jll</code> and <code>CUDA_Runtime_jll</code>. The former is responsible for loading the CUDA driver library &#40;possibly upgrading it using a forward-compatible version&#41;, and determining the CUDA version that your set-up supports:</p>
<pre><code class="language-julia-repl">❯ JULIA_DEBUG&#61;CUDA_Driver_jll julia
julia&gt; using CUDA_Driver_jll
┌ System CUDA driver found at libcuda.so.1, detected as version 12.0.0
└ @ CUDA_Driver_jll
┌ System CUDA driver is recent enough; not using forward-compatible driver
└ @ CUDA_Driver_jll</code></pre>
<p>With the driver identified and loaded, <code>CUDA_Runtime_jll</code> can select a compatible toolkit. By default, it uses the latest supported toolkit that is compatible with the driver:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA_Runtime_jlljulia&gt; CUDA_Runtime_jll.cuda_toolkits
10-element Vector&#123;VersionNumber&#125;:
 v&quot;10.2.0&quot;
 v&quot;11.0.0&quot;
 v&quot;11.1.0&quot;
 v&quot;11.2.0&quot;
 v&quot;11.3.0&quot;
 v&quot;11.4.0&quot;
 v&quot;11.5.0&quot;
 v&quot;11.6.0&quot;
 v&quot;11.7.0&quot;
 v&quot;11.8.0&quot;

julia&gt; CUDA_Runtime_jll.host_platform
Linux x86_64 &#123;cuda&#61;11.8&#125;</code></pre>
<p>As you can see, the selected CUDA runtime is encoded in the host platform. This makes it possible for Julia to automatically select compatible versions of other binary packages. For example, if we install and load <code>SuiteSparse_GPU_jll</code>, which right now <a href="https://github.com/JuliaPackaging/Yggdrasil/blob/2f5a64d9f61d0f1b619367b03b5cecae979ed6d1/S/SuiteSparse/SuiteSparse_GPU/build_tarballs.jl#L104-L126">provides builds</a> for CUDA 10.2, 11.0 and 12.0, the artifact resolution code knows to load the build for CUDA 11.0 which is compatible with the selected CUDA 11.8 runtime:</p>
<pre><code class="language-julia">julia&gt; using SuiteSparse_GPU_jlljulia&gt; SuiteSparse_GPU_jll.best_wrapper
&quot;~/.julia/packages/SuiteSparse_GPU_jll/.../x86_64-linux-gnu-cuda&#43;11.0.jl&quot;</code></pre>
<p>The change to JLLs requires a breaking change: the <code>JULIA_CUDA_VERSION</code> and <code>JULIA_CUDA_USE_BINARYBUILDER</code> environment variables have been removed, and are replaced by preferences that are set in the current environment. For convenience, you can set these preferences by calling <code>CUDA.set_runtime_version&#33;</code>:</p>
<pre><code class="language-julia-repl">❯ julia --project
julia&gt; using CUDA
julia&gt; CUDA.runtime_version&#40;&#41;
v&quot;11.8.0&quot;

julia&gt; CUDA.set_runtime_version&#33;&#40;v&quot;11.7&quot;&#41;
┌ Set CUDA Runtime version preference to 11.7,
└ please re-start Julia for this to take effect.

❯ julia --project
julia&gt; using CUDA
julia&gt; CUDA.runtime_version&#40;&#41;
v&quot;11.7.0&quot;

julia&gt; using CUDA_Runtime_jll
julia&gt; CUDA_Runtime_jll.host_platform
Linux x86_64 &#123;cuda&#61;11.7&#125;</code></pre>
<p>The changed preference is reflected in the host platform, which means that you can use this mechanism to load different builds of other binary packages. For example, if you rely on a package or JLL that does not yet have a build for CUDA 12, you could set the preference to <code>v&quot;11.x&quot;</code> to load an available build.</p>
<p>For discovering a local runtime, you can set the version to <code>&quot;local&quot;</code>, which replaces the use of <code>CUDA_Runtime_jll</code> with <code>CUDA_Runtime_Discovery.jl</code>, an API-compatible package that implements a local runtime discovery mechanism:</p>
<pre><code class="language-julia-repl">❯ julia --project
julia&gt; CUDA.set_runtime_version&#33;&#40;&quot;local&quot;&#41;
┌ Set CUDA Runtime version preference to local,
└ please re-start Julia for this to take effect.

❯ JULIA_DEBUG&#61;CUDA_Runtime_Discovery julia --project
julia&gt; using CUDA
┌ Looking for CUDA toolkit via environment variables CUDA_PATH
└ @ CUDA_Runtime_Discovery
┌ Looking for binary ptxas in /opt/cuda
│   all_locations &#61;
│    2-element Vector&#123;String&#125;:
│     &quot;/opt/cuda&quot;
│     &quot;/opt/cuda/bin&quot;
└ @ CUDA_Runtime_Discovery
┌ Debug: Found ptxas at /opt/cuda/bin/ptxas
└ @ CUDA_Runtime_Discovery
...</code></pre>
<h2 id="memory_limits">Memory limits</h2>
<p>By popular demand, support for memory limits has been reinstated. This functionality had been removed after the switch to CUDA memory pools, as the memory pool allocator does not yet support memory limits. Awaiting improvements by NVIDIA, we have added functionality to impose memory limits from the Julia side, in the form of two environment variables:</p>
<ul>
<li><p><code>JULIA_CUDA_SOFT_MEMORY_LIMIT</code>: This is an advisory limit, used to configure the memory pool, which will result in the pool being shrunk down to the requested limit at every synchronization point. That means that the pool may temporarily grow beyond the limit. This limit is unavailable when disabling memory pools &#40;with <code>JULIA_CUDA_MEMORY_POOL&#61;none</code>&#41;.</p>
</li>
<li><p><code>JULIA_CUDA_HARD_MEMORY_LIMIT</code>: This is a hard limit, checked before every allocation. Doing so is relatively expensive, so it is recommended to use the soft limit instead.</p>
</li>
</ul>
<p>The value of these variables can be formatted as a number of bytes, optionally followed by a unit, or as a percentage of the total device memory. Examples: <code>100M</code>, <code>50&#37;</code>, <code>1.5GiB</code>, <code>10000</code>.</p>
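<p>For example, a hard limit could be imposed as follows &#40;a sketch: the variable needs to be set before CUDA.jl is initialized, e.g. in the shell or at the very top of a script&#41;:</p>
<pre><code class="language-julia"># cap GPU allocations at 2 GiB; must happen before CUDA.jl initializes
ENV&#91;&quot;JULIA_CUDA_HARD_MEMORY_LIMIT&quot;&#93; &#61; &quot;2GiB&quot;

using CUDA

# a request exceeding the limit should now throw an out-of-memory
# error, even if the device itself has more free memory available
CuArray&#123;Float32&#125;&#40;undef, &#40;1024, 1024, 1024&#41;&#41;  # a 4 GiB request</code></pre>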
<h2 id="cusparse_improvements">CUSPARSE improvements</h2>
<p>Thanks to the work of <a href="https://github.com/amontoison">@amontoison</a>, the CUSPARSE interface has undergone many improvements:</p>
<ul>
<li><p>Better support of the <code>CuSparseMatrixCOO</code> format with, in particular, the addition of <code>CuSparseMatrixCOO * CuVector</code> and <code>CuSparseMatrixCOO * CuMatrix</code> products;</p>
</li>
<li><p>Routines specialized for <code>-</code>, <code>&#43;</code>, <code>*</code> operations between sparse matrices &#40;<code>CuSparseMatrixCOO</code>, <code>CuSparseMatrixCSC</code> and <code>CuSparseMatrixCSR</code>&#41; have been interfaced;</p>
</li>
<li><p>New generic routines for backward and forward sweeps with sparse triangular matrices are now used by <code>\</code>;</p>
</li>
<li><p><code>CuMatrix * CuSparseVector</code> and <code>CuMatrix * CuSparseMatrix</code> products have been added;</p>
</li>
<li><p>Conversions between sparse and dense matrices have been updated to use more recent, optimized routines;</p>
</li>
<li><p>High-level Julia functions have been added for the new set of sparse BLAS 1 routines, such as dot products between <code>CuSparseVector</code>s;</p>
</li>
<li><p>Missing dispatches for the <code>mul&#33;</code> and <code>ldiv&#33;</code> functions have been added;</p>
</li>
<li><p>Almost all new CUSPARSE routines added by the CUDA 11.x toolkits have been interfaced.</p>
</li>
</ul>
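<p>As a small illustration of these improvements, the following sketch multiplies a COO matrix with a dense vector, a product that now dispatches to a specialized CUSPARSE routine &#40;assuming a CUDA-capable system; the constructor call converting a CPU-side sparse matrix to the GPU is as we understand the current API&#41;:</p>
<pre><code class="language-julia">using CUDA, SparseArrays
using CUDA.CUSPARSE

A &#61; CuSparseMatrixCOO&#40;sprand&#40;Float32, 100, 100, 0.1&#41;&#41;  # sparse GPU matrix
x &#61; CUDA.rand&#40;Float32, 100&#41;                             # dense GPU vector

y &#61; A * x  # CuSparseMatrixCOO * CuVector product</code></pre>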
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p>Removal of the CUDNN, CUTENSOR, CUTENSORNET and CUSTATEVEC submodules: These have been moved into their own packages, respectively cuDNN.jl, cuTENSOR.jl, cuTensorNet.jl and cuStateVec.jl &#40;note the change in capitalization, now following NVIDIA&#39;s naming scheme&#41;;</p>
</li>
<li><p>Removal of the NVTX submodule: NVTX.jl should be used instead, which is a more complete implementation of the NVTX API;</p>
</li>
<li><p>Support for CUDA 11.8 &#40;support for CUDA 12.0 is being worked on&#41;;</p>
</li>
<li><p>Support for Julia 1.9.</p>
</li>
</ul>
<h2 id="backport_releases">Backport releases</h2>
<p>Because CUDA.jl 4.0 is a breaking release, two additional releases have been made that backport bugfixes and select features:</p>
<ul>
<li><p>CUDA.jl 3.12.1 and 3.12.2: backports of bugfixes since 3.12</p>
</li>
<li><p>CUDA.jl 3.13.0: additionally adding the memory limit functionality</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Wed, 01 Feb 2023 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Technical preview: Programming Apple M1 GPUs in Julia with Metal.jl]]></title>
  <link>https://juliagpu.org/post/2022-06-24-metal/index.html</link>
  <guid>https://juliagpu.org/2022-06-24-metal/</guid>
  <description><![CDATA[Julia has gained a new GPU back-end: Metal.jl, for working with Apple&#39;s M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who&#39;s already programmed GPUs in Julia. In the following post I&#39;ll demonstrate some of that functionality and explain how it works.]]></description>
  
  <content:encoded><![CDATA[
<p>Julia has gained a new GPU back-end: Metal.jl, for working with Apple&#39;s M1 GPUs. The back-end is built on the same foundations that make up existing GPU packages like CUDA.jl and AMDGPU.jl, so it should be familiar to anybody who&#39;s already programmed GPUs in Julia. In the following post I&#39;ll demonstrate some of that functionality and explain how it works.</p>
<p>But first, note that <strong><a href="https://github.com/JuliaGPU/Metal.jl">Metal.jl</a> is under heavy development</strong>: The package is considered experimental for now, as we&#39;re still working on squashing bugs and adding essential functionality. We also haven&#39;t optimized for performance yet. If you&#39;re interested in using Metal.jl, please consider contributing to its development&#33; Most of the package is written in Julia, and checking out the source code is a single <code>Pkg.develop</code> away :-&#41;</p>
<h2 id="quick_start">Quick start</h2>
<p>Start by getting a hold of the upcoming <a href="https://julialang.org/downloads/#upcoming_release">Julia 1.8</a>, launch it, and enter the package manager by pressing <code>&#93;</code>:</p>
<pre><code class="language-text">julia&gt; &#93;

pkg&gt; add Metal
  Installed Metal</code></pre>
<p>Installation is as easy as that, and we&#39;ll automatically download the necessary binary artifacts &#40;a C wrapper for the Metal APIs, and an LLVM back-end&#41;. Then, leave the package manager by pressing backspace, import the Metal package, and e.g. call the <code>versioninfo&#40;&#41;</code> method for some details on the toolchain:</p>
<pre><code class="language-text">julia&gt; using Metal

julia&gt; Metal.versioninfo&#40;&#41;
macOS 13.0.0, Darwin 21.3.0

Toolchain:
- Julia: 1.8.0-rc1
- LLVM: 13.0.1

1 device:
- Apple M1 Pro &#40;64.000 KiB allocated&#41;</code></pre>
<p>And there we go&#33; You&#39;ll note here that I&#39;m using the upcoming macOS 13 &#40;Ventura&#41;; this is currently the only supported operating system. We also only support M-series GPUs, even though Metal does support other GPUs. These choices were made to simplify development, and aren&#39;t technical limitations. In fact, Metal.jl <em>does</em> work on e.g. macOS Monterey with an Intel GPU, but it&#39;s an untested combination that may suffer from bugs.</p>
<h2 id="array_programming">Array programming</h2>
<p>Just like our other GPU back-ends, Metal.jl offers an array abstraction that greatly simplifies GPU programming. The abstraction centers around the <code>MtlArray</code> type that can be used to manage memory and perform GPU computations:</p>
<pre><code class="language-julia"># allocate &#43; initialize
julia&gt; a &#61; MtlArray&#40;rand&#40;Float32, 2, 2&#41;&#41;
2×2 MtlArray&#123;Float32, 2&#125;:
 0.158752  0.836366
 0.535798  0.153554

# perform some GPU-accelerated operations
julia&gt; b &#61; a * a
2×2 MtlArray&#123;Float32, 2&#125;:
 0.473325  0.261202
 0.167333  0.471702

# back to the CPU
julia&gt; Array&#40;b&#41;
2×2 Matrix&#123;Float32&#125;:
 0.473325  0.261202
 0.167333  0.471702</code></pre>
<p>Beyond these simple operations, Julia&#39;s higher-order array abstractions can be used to express more complex operations without ever having to write a kernel:</p>
<pre><code class="language-julia">julia&gt; mapreduce&#40;sin, &#43;, a; dims&#61;1&#41;
1×2 MtlArray&#123;Float32, 2&#125;:
 1.15276  0.584146

julia&gt; cos.&#40;a .&#43; 2&#41; .* 3
2×2 MtlArray&#123;Float32, 2&#125;:
 -2.0472   -1.25332
 -2.96594  -2.60351</code></pre>
<p>Much of this functionality comes from the <a href="https://github.com/JuliaGPU/GPUArrays.jl/">GPUArrays.jl</a> package, which provides vendor-neutral implementations of common array operations. As a result, <code>MtlArray</code> is already pretty capable, and should be usable with realistic array-based applications.</p>
<h2 id="kernel_programming">Kernel programming</h2>
<p>Metal.jl&#39;s array operations are implemented in Julia, using our native kernel programming capabilities and accompanying JIT-compiler. A small demonstration:</p>
<pre><code class="language-julia"># a simple kernel that sets elements of an array to a value
function memset_kernel&#40;array, value&#41;
  i &#61; thread_position_in_grid_1d&#40;&#41;
  if i &lt;&#61; length&#40;array&#41;
    @inbounds array&#91;i&#93; &#61; value
  end
  return
end

a &#61; MtlArray&#123;Float32&#125;&#40;undef, 512&#41;
@metal threads&#61;512 grid&#61;2 memset_kernel&#40;a, 42&#41;

# verify
@assert all&#40;isequal&#40;42&#41;, Array&#40;a&#41;&#41;</code></pre>
<p>As can be seen here, we&#39;ve opted to deviate slightly from the Metal Shading Language, instead providing a programming experience that&#39;s similar to Julia&#39;s existing back-ends. Some key differences:</p>
<ul>
<li><p>we use intrinsic functions instead of special kernel function arguments to access properties like the thread position, grid size, ...;</p>
</li>
<li><p>all types of arguments &#40;buffers, indirect buffers, value-typed inputs&#41; are transparently converted to a GPU-compatible structure<sup id="fnref:1">[1]</sup>;</p>
</li>
<li><p>global &#40;task-bound&#41; state is used to keep track of the active device and a queue;</p>
</li>
<li><p>compute pipeline set-up and command encoding is hidden behind a single macro.</p>
</li>
</ul>
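<p>These intrinsics generalize to multiple dimensions. As a hedged sketch &#40;<code>thread_position_in_grid_2d</code> and the tuple-valued <code>threads</code> argument are assumed by analogy with the 1D example above&#41;, an element-wise addition over a 2D grid could look like:</p>
<pre><code class="language-julia"># element-wise addition using a 2D grid of threads
function add_kernel&#40;c, a, b&#41;
  i, j &#61; thread_position_in_grid_2d&#40;&#41;
  if i &lt;&#61; size&#40;c, 1&#41; &amp;&amp; j &lt;&#61; size&#40;c, 2&#41;
    @inbounds c&#91;i, j&#93; &#61; a&#91;i, j&#93; &#43; b&#91;i, j&#93;
  end
  return
end

a &#61; MtlArray&#40;rand&#40;Float32, 8, 8&#41;&#41;
b &#61; MtlArray&#40;rand&#40;Float32, 8, 8&#41;&#41;
c &#61; similar&#40;a&#41;
@metal threads&#61;&#40;8, 8&#41; add_kernel&#40;c, a, b&#41;</code></pre>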
<p>Behind the scenes, we compile Julia to LLVM IR and use a <a href="https://github.com/JuliaGPU/llvm-metal">tiny LLVM back-end</a> &#40;based on <a href="https://github.com/a2flo">@a2flo</a>&#39;s <a href="https://github.com/a2flo/floor">libfloor</a>&#41; that &#40;re&#41;writes the bitcode to a Metal-compatible library containing LLVM 5 bitcode. You can inspect the generated IR using <code>@device_code_metal</code>:</p>
<pre><code class="language-julia">julia&gt; @device_code_metal @metal threads&#61;512 grid&#61;2 memset_kernel&#40;a, 42&#41;</code></pre>
<pre><code class="language-text">&#91;header&#93;
program_count: 1
...

&#91;program&#93;
name: julia_memset_kernel
type: kernel
...</code></pre>
<pre><code class="language-llvm">target datalayout &#61; &quot;...&quot;
target triple &#61; &quot;air64-apple-macosx13.0.0&quot;

; the &#40;rewritten&#41; kernel function:
;  - &#37;value argument passed by reference
;  - &#37;thread_position_in_grid argument added
;  - sitofp rewritten to AIR-specific intrinsic
define void @julia_memset_kernel&#40;
    &#123; i8 addrspace&#40;1&#41;*, &#91;1 x i64&#93; &#125; addrspace&#40;1&#41;* &#37;array,
    i64 addrspace&#40;1&#41;* &#37;value,
    i32 &#37;thread_position_in_grid&#41; &#123;
  ...
  &#37;9 &#61; tail call float @air.convert.f.f32.s.i64&#40;i64 &#37;7&#41;
  ...
  ret void
&#125;

; minimal required argument metadata
&#33;air.kernel &#61; &#33;&#123;&#33;10&#125;
&#33;10 &#61; &#33;&#123;void &#40;&#123; i8 addrspace&#40;1&#41;*, &#91;1 x i64&#93; &#125; addrspace&#40;1&#41;*,
              i64 addrspace&#40;1&#41;*, i32&#41;* @julia_memset_kernel, &#33;11, &#33;12&#125;
&#33;12 &#61; &#33;&#123;&#33;13, &#33;14, &#33;15&#125;
&#33;13 &#61; &#33;&#123;i32 0, &#33;&quot;air.buffer&quot;, &#33;&quot;air.location_index&quot;, i32 0, i32 1,
       &#33;&quot;air.read_write&quot;, &#33;&quot;air.address_space&quot;, i32 1,
       &#33;&quot;air.arg_type_size&quot;, i32 16, &#33;&quot;air.arg_type_align_size&quot;, i32 8&#125;
&#33;14 &#61; &#33;&#123;i32 1, &#33;&quot;air.buffer&quot;, &#33;&quot;air.location_index&quot;, i32 1, i32 1,
       &#33;&quot;air.read_write&quot;, &#33;&quot;air.address_space&quot;, i32 1,
       &#33;&quot;air.arg_type_size&quot;, i32 8, &#33;&quot;air.arg_type_align_size&quot;, i32 8&#125;
&#33;15 &#61; &#33;&#123;i32 0, &#33;&quot;air.thread_position_in_grid&quot;&#125;

; other metadata not shown, for brevity
<p>Shout-out to <a href="https://github.com/max-Hawkins">@max-Hawkins</a> for exploring Metal code generation during his internship at Julia Computing&#33;</p>
<h2 id="metal_apis_in_julia">Metal APIs in Julia</h2>
<p>Lacking an Objective-C or C&#43;&#43; FFI, we interface with the Metal libraries using <a href="https://github.com/recp/cmt">a shim C library</a>. Most users won&#39;t have to interface with Metal directly – the array abstraction is sufficient for many – but more experienced developers can make use of the high-level wrappers that we&#39;ve designed for the Metal APIs:</p>
<pre><code class="language-julia">julia&gt; dev &#61; MtlDevice&#40;1&#41;
MtlDevice:
  name:             Apple M1 Pro
  lowpower:         false
  headless:         false
  removable:        false
  unified memory:   true

julia&gt; desc &#61; MtlHeapDescriptor&#40;&#41;
MtlHeapDescriptor:
  type:             MtHeapTypeAutomatic
  storageMode:      MtStorageModePrivate
  size:             0

julia&gt; desc.size &#61; 16384
16384

julia&gt; heap &#61; MtlHeap&#40;dev, desc&#41;
MtlHeap:
  type:                 MtHeapTypeAutomatic
  size:                 16384
  usedSize:             0
  currentAllocatedSize: 16384

# etc
<p>These wrappers are based on <a href="https://github.com/PhilipVinc">@PhilipVinc</a>&#39;s excellent work on MetalCore.jl, which formed the basis for &#40;and has been folded into&#41; Metal.jl.</p>
<h2 id="whats_next">What&#39;s next?</h2>
<p>The current release of Metal.jl focuses on code generation capabilities, and is meant as a preview for users and developers to try out on their system or with their specific GPU application. It is not production-ready yet, and is lacking some crucial features:</p>
<ul>
<li><p>performance optimization</p>
</li>
<li><p>integration with Metal Performance Shaders</p>
</li>
<li><p>integration / documentation for use with Xcode tools</p>
</li>
<li><p>fleshing out the array abstraction based on user feedback</p>
</li>
</ul>
<p><strong>Please consider helping out with any of these&#33;</strong> Since Metal.jl and its dependencies are almost entirely implemented in Julia, any experience with the language is sufficient to contribute. If you&#39;re not certain, or have any questions, please drop by the <code>#gpu</code> channel on <a href="https://julialang.org/slack/">the JuliaLang Slack</a>, ask questions on our <a href="https://discourse.julialang.org/c/domain/gpu/11">Discourse</a>, or chat to us during the <a href="https://julialang.org/community/#events">GPU office hours</a> every other Monday.</p>
<p>If you encounter any bugs, feel free to let us know on the <a href="https://github.com/JuliaGPU/Metal.jl/issues">Metal.jl issue tracker</a>. For information on upcoming releases, <a href="https://juliagpu.org/post/">subscribe</a> to this website&#39;s blog where we post about significant developments in Julia&#39;s GPU ecosystem.</p>
<hr />
<p><table class="fndef" id="fndef:1">
    <tr>
        <td class="fndef-backref">[1]</td>
        <td class="fndef-content">This relies on Metal 3 from macOS 13, which introduced bindless argument buffers, as we didn&#39;t fully figure out how to reliably encode arbitrarily-nested indirect buffers in argument encoder metadata.</td>
    </tr>
</table></p>
]]></content:encoded>
    
  <pubDate>Fri, 24 Jun 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[oneAPI.jl status update]]></title>
  <link>https://juliagpu.org/post/2022-04-06-oneapi_update/index.html</link>
  <guid>https://juliagpu.org/2022-04-06-oneapi_update/</guid>
  <description><![CDATA[It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs &#40;and other accelerators&#41; using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.]]></description>  
  
  <content:encoded><![CDATA[
<p>It has been over a year since the last update on oneAPI.jl, the Julia package for programming Intel GPUs &#40;and other accelerators&#41; using the oneAPI toolkit. Since then, the package has been under steady development, and several new features have been added to improve the developer experience and usability of the package.</p>
<h2 id="atomic_intrinsics"><code>@atomic</code> intrinsics</h2>
<p>oneAPI.jl <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/85">now supports</a> atomic operations, which are required to implement a variety of parallel algorithms. Low-level atomic functions &#40;<code>atomic_add&#33;</code>, <code>atomic_xchg&#33;</code>, etc&#41; are available as unexported methods in the oneAPI module:</p>
<pre><code class="language-julia">a &#61; oneArray&#40;Int32&#91;0&#93;&#41;

function kernel&#40;a&#41;
    oneAPI.atomic_add&#33;&#40;pointer&#40;a&#41;, Int32&#40;1&#41;&#41;
    return
end

@oneapi items&#61;256 kernel&#40;a&#41;
@test Array&#40;a&#41;&#91;1&#93; &#61;&#61; 256</code></pre>
<p>Note that these methods are only available for those types that are supported by the underlying OpenCL intrinsics. For example, the <code>atomic_add&#33;</code> from above can only be used with <code>Int32</code> and <code>UInt32</code> inputs.</p>
<p>Most users will instead rely on the higher-level <code>@atomic</code> macro, which can be easily put in front of many array operations to make them behave atomically. To avoid clashing with the new <code>@atomic</code> macro in Julia 1.7, this macro is also unexported:</p>
<pre><code class="language-julia">a &#61; oneArray&#40;Int32&#91;0&#93;&#41;

function kernel&#40;a&#41;
    oneAPI.@atomic a&#91;1&#93; &#43;&#61; Int32&#40;1&#41;
    return
end

@oneapi items&#61;256 kernel&#40;a&#41;
@test Array&#40;a&#41;&#91;1&#93; &#61;&#61; 256</code></pre>
<p>When used with operations that are supported by OpenCL, this macro will lower to calls like <code>atomic_add&#33;</code>. For other operations, a compare-and-exchange loop will be used. Note that for now, this is still restricted to 32-bit operations, as we do not support the <code>cl_khr_int64_base_atomics</code> extension for 64-bit atomics.</p>
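<p>To clarify the fallback, the compare-and-exchange loop is conceptually as follows &#40;illustrative pseudocode, not the actual implementation; <code>atomic_cmpxchg&#33;</code> stands in for the underlying OpenCL intrinsic&#41;:</p>
<pre><code class="language-julia"># atomically apply op to the value behind ptr, retrying on contention
function atomic_apply&#33;&#40;ptr, op, val&#41;
    old &#61; unsafe_load&#40;ptr&#41;
    while true
        new &#61; op&#40;old, val&#41;
        # cmpxchg returns the value that was actually stored at ptr
        seen &#61; oneAPI.atomic_cmpxchg&#33;&#40;ptr, old, new&#41;
        seen &#61;&#61; old &amp;&amp; return old  # success: no other thread raced us
        old &#61; seen                  # lost the race; retry with new value
    end
end</code></pre>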
<h2 id="initial_integration_with_vendor_libraries">Initial integration with vendor libraries</h2>
<p>One significant missing feature is the integration with vendor libraries like oneMKL. These integrations are required to ensure good performance for important operations like matrix multiplication, which currently fall back to generic implementations in Julia that may not always perform as well.</p>
<p>To improve this situation, we <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/97">are working on</a> a wrapper library that allows us to integrate with oneMKL and other oneAPI and SYCL libraries. Currently, only matrix multiplication is supported, but once the infrastructural issues are worked out we expect to quickly support many more operations.</p>
<p>If you need support for specific libraries, please have a look at this PR. As the API surface is significant, we will need help to extend the wrapper library and integrate it with high-level Julia libraries like LinearAlgebra.jl.</p>
<h2 id="correctness_issues">Correctness issues</h2>
<p>In porting existing Julia GPU applications to oneAPI.jl, we fixed several issues that caused correctness issues when executing code on Intel GPUs:</p>
<ul>
<li><p>when the garbage collector frees GPU memory, <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/157">it now blocks</a> until all outstanding commands &#40;which may include uses of said memory&#41; have completed</p>
</li>
<li><p>the <code>barrier</code> function to synchronize threads <a href="https://github.com/JuliaGPU/oneAPI.jl/pull/162">is now</a> marked as <code>convergent</code> to avoid LLVM miscompilations</p>
</li>
</ul>
<p>Note that if you are using Tiger Lake hardware, there is currently a <a href="https://github.com/intel/compute-runtime/issues/522">known issue</a> in the back-end Intel compiler that affects oneAPI.jl, causing correctness issues that can be spotted by running the oneAPI.jl test suite.</p>
<h2 id="future_work">Future work</h2>
<p>To significantly improve the usability of oneAPI.jl, we will add support for it to the KernelAbstractions.jl package. This library is used by <a href="https://juliahub.com/ui/Packages/KernelAbstractions/aywHT/0.7.2?page&#61;2">many other packages</a> for adding GPU acceleration to algorithms that cannot be easily expressed using only array operations. As such, support for oneAPI.jl will make it possible to use your oneAPI GPUs with all of these packages.</p>
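<p>For reference, KernelAbstractions.jl kernels are vendor-agnostic, so once a oneAPI back-end exists, kernels like the following should run unchanged on Intel GPUs &#40;the kernel definition below uses the KernelAbstractions.jl API as of version 0.7; only the CPU launch is shown, since the oneAPI back-end does not exist yet&#41;:</p>
<pre><code class="language-julia">using KernelAbstractions

# a back-end-agnostic kernel: scale every element of an array
@kernel function scale&#33;&#40;a, s&#41;
    i &#61; @index&#40;Global&#41;
    @inbounds a&#91;i&#93; *&#61; s
end

# instantiate for a back-end; a future oneAPI device would slot in here
kernel &#61; scale&#33;&#40;CPU&#40;&#41;, 64&#41;
a &#61; rand&#40;Float32, 256&#41;
event &#61; kernel&#40;a, 2f0; ndrange&#61;length&#40;a&#41;&#41;
wait&#40;event&#41;</code></pre>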
]]></content:encoded>
    
  <pubDate>Wed, 06 Apr 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.5-3.8]]></title>
  <link>https://juliagpu.org/post/2022-01-28-cuda_3.5_3.8/index.html</link>
  <guid>https://juliagpu.org/2022-01-28-cuda_3.5_3.8/</guid>
  <description><![CDATA[CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl versions 3.5 to 3.8 have brought several new features to improve performance and productivity. This blog post will highlight a couple: direct copies between devices, better performance by preserving array index types and changing the memory pool, and a much-improved interface to the compute sanitizer utility.</p>
<h2 id="copies_between_devices">Copies between devices</h2>
<p>Typically, when sending data between devices you need to stage through the CPU. CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1284">now does this automatically</a>, making it possible to directly copy between <code>CuArray</code>s on different devices:</p>
<pre><code class="language-julia-repl">julia&gt; device&#33;&#40;0&#41;;julia&gt; a &#61; CUDA.rand&#40;2,2&#41;
2×2 CuArray&#123;Float32, 2, CUDA.Mem.DeviceBuffer&#125;:
 0.440147  0.986939
 0.622901  0.698119

julia&gt; device&#33;&#40;1&#41;;

julia&gt; b &#61; CUDA.zeros&#40;2,2&#41;;

julia&gt; copyto&#33;&#40;b, a&#41;
2×2 CuArray&#123;Float32, 2, CUDA.Mem.DeviceBuffer&#125;:
 0.440147  0.986939
 0.622901  0.698119</code></pre>
<p>When your hardware supports it, CUDA.jl will automatically enable so-called peer-to-peer mode, making it possible to copy data directly without going through the CPU. This can result in significant bandwidth and latency reductions. You can check if this mode of communication is possible:</p>
<pre><code class="language-julia-repl">julia&gt; src &#61; CuDevice&#40;0&#41;
CuDevice&#40;0&#41;: NVIDIA A100-PCIE-40GBjulia&gt; dst &#61; CuDevice&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100-PCIE-32GBjulia&gt; can_access_peer&#40;src, dst&#41;
false</code></pre>
<p>In this case, peer-to-peer communication is not possible because the devices have a different compute capability major revision number. With a compatible device, the function reports <code>true</code>:</p>
<pre><code class="language-julia">julia&gt; src &#61; CuDevice&#40;1&#41;
CuDevice&#40;1&#41;: Tesla V100-PCIE-32GBjulia&gt; dst &#61; CuDevice&#40;2&#41;
CuDevice&#40;2&#41;: Tesla V100-PCIE-16GBjulia&gt; can_access_peer&#40;src, dst&#41;
true</code></pre>
<p>Thanks to <a href="https://github.com/kshyatt">@kshyatt</a> for help with this change&#33;</p>
<h2 id="helper_function_to_use_compute-sanitizer">Helper function to use <code>compute-sanitizer</code></h2>
<p>The CUDA toolkit comes with a powerful tool to check GPU kernels for common issues like memory errors and race conditions: the <a href="https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html">compute sanitizer</a>. To make it easier to use this tool, CUDA.jl now ships the binary as part of its artifacts, and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1340">provides a helper function</a> to restart Julia under the <code>compute-sanitizer</code>. Let&#39;s demonstrate, and trigger a memory error to show what the compute sanitizer can detect:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDAjulia&gt; CUDA.run_compute_sanitizer&#40;&#41;
Re-starting your active Julia session...&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61; COMPUTE-SANITIZER
julia&gt; using CUDAjulia&gt; unsafe_wrap&#40;CuArray, pointer&#40;CuArray&#40;&#91;1&#93;&#41;&#41;, 2&#41; .&#61; 1
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61; Invalid __global__ write of size 8 bytes
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     at 0x2a0 in LLVM/src/interop/base.jl:45:julia_broadcast_kernel_1892&#40;CuKernelContext, CuDeviceArray&lt;Int64, &#40;int&#41;1, &#40;int&#41;1&gt;, Broadcasted&lt;void, Tuple&lt;OneTo&lt;Int64&gt;&gt;, _identity, Broadcasted&lt;Int64&gt;&gt;, Int64&#41;
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     by thread &#40;1,0,0&#41; in block &#40;0,0,0&#41;
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     Address 0xa64000008 is out of bounds
&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;&#61;     and is 1 bytes after the nearest allocation at 0xa64000000 of size 8 bytes</code></pre>
<p>Other tools are available too, e.g. <code>racecheck</code> for detecting races or <code>synccheck</code> for finding synchronization issues. These tools can be selected using the <code>tool</code> keyword argument to <code>run_compute_sanitizer</code>.</p>
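<p>For example, launching the race detector instead would look something like this &#40;a sketch based on the keyword argument described above&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.run_compute_sanitizer&#40;tool&#61;&quot;racecheck&quot;&#41;
Re-starting your active Julia session...</code></pre>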
<h2 id="updated_binary_dependencies">Updated binary dependencies</h2>
<p>As is common with every release, CUDA.jl now supports newer versions of NVIDIA&#39;s tools and libraries:</p>
<ul>
<li><p>CUDA toolkit <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1256">11.5</a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1326">11.6</a></p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1328">CUDNN 8.3.2</a></p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1327">CUTENSOR 1.4.0</a></p>
</li>
</ul>
<p>The update to CUDA toolkit 11.6 comes with improved debug info compatibility. If you need to debug Julia GPU code with tools like <code>compute-sanitizer</code> or <code>cuda-gdb</code>, and you need debug info &#40;the equivalent of <code>nvcc -G</code>&#41;, ensure CUDA.jl can use the latest version of the CUDA toolkit.</p>
<p>To make it easier to use the latest supported toolkit, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1182">now implements</a> CUDA&#39;s so-called <strong>Forward Compatibility mode</strong>: When your driver is outdated, CUDA.jl will attempt to load a newer version of the CUDA driver library, enabling use of a newer CUDA toolkit and libraries. Note that this is only supported on select hardware, refer to <a href="https://docs.nvidia.com/deploy/cuda-compatibility/#forward-compatibility-title">the NVIDIA documentation</a> for more details.</p>
<h2 id="preserving_array_indices">Preserving array indices</h2>
<p>Julia&#39;s integers are typically 64 bits wide, which can be wasteful when dealing with GPU indexing intrinsics that are typically only 32 bits wide. CUDA.jl&#39;s device array type <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1153">now carefully preserves the type of indices</a> so that 32-bit indices aren&#39;t unnecessarily promoted to 64 bits. With some careful kernel programming &#40;note the use of <code>0x1</code> instead of <code>1</code> below&#41;, this makes it possible to significantly reduce the register pressure surrounding indexing operations, which may be useful in register-constrained situations:</p>
<pre><code class="language-julia-repl">julia&gt; function memset&#40;arr, val&#41;
           i &#61; &#40;blockIdx&#40;&#41;.x-0x1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x
           @inbounds arr&#91;i&#93; &#61; val
           return
       end

julia&gt; CUDA.code_ptx&#40;memset, Tuple&#123;CuDeviceArray&#123;Float32,1,AS.Global&#125;,Float32&#125;&#41;
.func julia_memset&#40;.param .b64 arr, .param .b32 val&#41; &#123;
        .reg .f32       &#37;f&lt;2&gt;;
        .reg .b32       &#37;r&lt;5&gt;;
        .reg .b64       &#37;rd&lt;5&gt;;

        ld.param.u64    &#37;rd1, &#91;arr&#93;;
        ld.param.f32    &#37;f1, &#91;val&#93;;
        mov.u32         &#37;r1, &#37;ctaid.x;
        mov.u32         &#37;r2, &#37;ntid.x;
        mov.u32         &#37;r3, &#37;tid.x;
        mad.lo.s32      &#37;r4, &#37;r2, &#37;r1, &#37;r3;
        ld.u64          &#37;rd2, &#91;&#37;rd1&#93;;
        mul.wide.s32    &#37;rd3, &#37;r4, 4;
        add.s64         &#37;rd4, &#37;rd2, &#37;rd3;
        st.global.f32   &#91;&#37;rd4&#93;, &#37;f1;
        ret;
&#125;</code></pre>
<p>On CUDA.jl 3.4, this simple function used 3 more 64-bit registers:</p>
<pre><code class="language-text">.func julia_memset&#40;.param .b64 arr, .param .b32 val&#41; &#123;
        .reg .f32       &#37;f&lt;2&gt;;
        .reg .b32       &#37;r&lt;5&gt;;
        .reg .b64       &#37;rd&lt;8&gt;;

        ld.param.u64    &#37;rd1, &#91;arr&#93;;
        ld.param.f32    &#37;f1, &#91;val&#93;;
        mov.u32         &#37;r1, &#37;ctaid.x;
        mov.u32         &#37;r2, &#37;ntid.x;
        mul.wide.u32    &#37;rd2, &#37;r2, &#37;r1;
        mov.u32         &#37;r3, &#37;tid.x;
        add.s32         &#37;r4, &#37;r3, 1;
        cvt.u64.u32     &#37;rd3, &#37;r4;
        ld.u64          &#37;rd4, &#91;&#37;rd1&#93;;
        add.s64         &#37;rd5, &#37;rd2, &#37;rd3;
        shl.b64         &#37;rd6, &#37;rd5, 2;
        add.s64         &#37;rd7, &#37;rd4, &#37;rd6;
        st.global.f32   &#91;&#37;rd7&#43;-4&#93;, &#37;f1;
        ret;
&#125;</code></pre>
<h2 id="more_aggressive_memory_management">More aggressive memory management</h2>
<p>Starting with CUDA.jl 3.8, the memory pool used to allocate <code>CuArray</code>s will be configured differently: The pool will now be <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1344">allowed to use all available GPU memory</a>, whereas previously all cached memory was released at each synchronization point. This can significantly improve performance, and makes synchronization much cheaper.</p>
<p>This behavior can be observed by calling the <code>memory_status&#40;&#41;</code> function:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 13.57&#37; &#40;2.001 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;0 bytes reserved&#41;

julia&gt; a &#61; CuArray&#123;Float32&#125;&#40;undef, &#40;1024, 1024, 1024&#41;&#41;;
julia&gt; Base.format_bytes&#40;sizeof&#40;a&#41;&#41;
&quot;4.000 GiB&quot;

julia&gt; a &#61; nothing

julia&gt; GC.gc&#40;&#41;

julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 40.59&#37; &#40;5.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;4.000 GiB reserved&#41;</code></pre>
<p>So far nothing new. On previous versions of CUDA.jl however, any subsequent synchronization of the GPU &#40;e.g., by copying memory to the CPU&#41; would have resulted in a release of this reserved memory. This is not the case anymore:</p>
<pre><code class="language-julia-repl">julia&gt; synchronize&#40;&#41;julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 40.59&#37; &#40;5.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;4.000 GiB reserved&#41;</code></pre>
<p>If you still want to release this memory, you can call the <code>reclaim&#40;&#41;</code> function:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.reclaim&#40;&#41;julia&gt; CUDA.memory_status&#40;&#41;
Effective GPU memory usage: 13.48&#37; &#40;1.988 GiB/14.751 GiB&#41;
Memory pool usage: 0 bytes &#40;0 bytes reserved&#41;</code></pre>
<p>With interactive Julia sessions, this function is called periodically so that the GPU&#39;s memory isn&#39;t held on to unnecessarily. Otherwise it shouldn&#39;t be necessary to call this function, as memory is freed automatically when it is needed.</p>
<h2 id="minor_changes_and_improvements">Minor changes and improvements</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1217">Bitonic sort</a> is now used instead of quicksort &#40;by <a href="https://github.com/xaellison">@xaellison</a>&#41;.</p>
</li>
<li><p><code>CuDeviceArray</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1303">now stores the length of the array</a>, greatly speeding up indexing with high-dimensional arrays.</p>
</li>
<li><p>Device intrinsics <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1305">cannot be called on the CPU anymore</a>, protecting against segfaults when something isn&#39;t dispatching correctly.</p>
</li>
<li><p>Support for Multi-Instance GPU &#40;MIG&#41; <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1199">has been improved</a>, providing the <code>parent_uuid</code> function to look up the UUID of the parent device.</p>
</li>
<li><p><code>randn</code> and <code>randexp</code> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1236">are now supported in kernel code</a>, which should help with initial support of Distributions.jl-based operations.</p>
</li>
</ul>
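<p>As a small, illustrative sketch of the in-kernel <code>randn</code> support mentioned above &#40;the kernel below is an assumption of ours, not taken from the linked PR&#41;:</p>
<pre><code class="language-julia">using CUDA

function gaussian_fill&#40;a&#41;
    i &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x - 1&#41; * blockDim&#40;&#41;.x
    if i &lt;&#61; length&#40;a&#41;
        a&#91;i&#93; &#61; randn&#40;&#41;   # now callable from device code
    end
    return
end

a &#61; CUDA.zeros&#40;Float32, 1024&#41;
@cuda threads&#61;256 blocks&#61;4 gaussian_fill&#40;a&#41;</code></pre>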
]]></content:encoded>
    
  <pubDate>Fri, 28 Jan 2022 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.4]]></title>
  <link>https://juliagpu.org/post/2021-08-13-cuda_3.4/index.html</link>
  <guid>https://juliagpu.org/2021-08-13-cuda_3.4/</guid>
  <description><![CDATA[The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.]]></description>  
  
  <content:encoded><![CDATA[
<p>The latest version of CUDA.jl brings several new features, from improved atomic operations to initial support for arrays with unified memory. The native random number generator introduced in CUDA.jl 3.0 is now the default fallback, and support for memory pools other than the CUDA stream-ordered one has been removed.</p>
<h2 id="streamlined_atomic_operations">Streamlined atomic operations</h2>
<p>In preparation for integrating with the new standard <code>@atomic</code> macro introduced in Julia 1.7, we have <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1059">streamlined the capabilities of atomic operations in CUDA.jl</a>. The API is now split into two levels: low-level <code>atomic_</code> methods for atomic functionality that&#39;s directly supported by the hardware, and a high-level <code>@atomic</code> macro that tries to perform operations natively or falls back to a loop with compare-and-swap. This fallback implementation makes it possible to use more complex operations that do not map onto a single hardware atomic:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1&#93;&#41;;julia&gt; function kernel&#40;a&#41;
         CUDA.@atomic a&#91;&#93; &lt;&lt;&#61; 1
         return
       endjulia&gt; @cuda threads&#61;16 kernel&#40;a&#41;julia&gt; a
1-element CuArray&#123;Int64, 1, CUDA.Mem.DeviceBuffer&#125;:
 65536julia&gt; 1&lt;&lt;16
65536</code></pre>
<p>The only requirement is that the types being used are supported by <code>CUDA.atomic_cas&#33;</code>. This includes common types like 32 and 64-bit integers and floating-point numbers, as well as 16-bit numbers on devices with compute capability 7.0 or higher.</p>
<p>Note that on Julia 1.7 and higher, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1097">no longer exports the <code>@atomic</code> macro</a> to avoid conflicts with the version in Base. It is therefore recommended to always fully qualify uses of the macro, i.e., use <code>CUDA.@atomic</code> as in the example above.</p>
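<p>For example, atomic division does not map onto a single hardware instruction, so the macro falls back to a compare-and-swap loop. The kernel below is a small sketch in the spirit of the example above, not from the release notes:</p>
<pre><code class="language-julia">using CUDA

function halve_kernel&#40;a&#41;
    CUDA.@atomic a&#91;1&#93; /&#61; 2f0   # no hardware atomic for division; lowered to a CAS loop
    return
end

a &#61; CuArray&#40;Float32&#91;1024&#93;&#41;
@cuda threads&#61;4 halve_kernel&#40;a&#41;
# four threads each halve the value: 1024 / 2^4 &#61;&#61; 64</code></pre>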
<h2 id="arrays_with_unified_memory">Arrays with unified memory</h2>
<p>You may have noticed that the <code>CuArray</code> type in the example above included an additional parameter, <code>Mem.DeviceBuffer</code>. This has been introduced to support arrays backed by different kinds of buffers. By default, we will use an ordinary device buffer, but it&#39;s now possible to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1023">allocate arrays backed by unified buffers</a> that can be used on multiple devices:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; cu&#40;&#91;0&#93;; unified&#61;true&#41;
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 0julia&gt; a .&#43;&#61; 1
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 1julia&gt; device&#33;&#40;1&#41;julia&gt; a .&#43;&#61; 1
1-element CuArray&#123;Int64, 1, CUDA.Mem.UnifiedBuffer&#125;:
 2</code></pre>
<p>Although all operations should work equally well with arrays backed by unified memory, they have not been optimized yet. For example, copying memory to the device could be avoided as the driver can automatically page in unified memory on-demand.</p>
<h2 id="new_default_random_number_generator">New default random number generator</h2>
<p>CUDA.jl 3.0 introduced a new random number generator, and starting with CUDA.jl 3.2 the performance and quality of this generator were improved to the point that it could be used by applications. A couple of features were still missing, though, such as generating normally-distributed random numbers, or support for complex numbers. These features have been <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1082">added in CUDA.jl 3.3</a>, and the generator is now used as the default fallback when CURAND does not support the requested element types.</p>
<p>Both the performance and the quality of this generator are much better than those of the previous, GPUArrays.jl-based one:</p>
<pre><code class="language-julia-repl">julia&gt; using BenchmarkTools
julia&gt; cuda_rng &#61; CUDA.RNG&#40;&#41;;
julia&gt; gpuarrays_rng &#61; GPUArrays.default_rng&#40;CuArray&#41;;
julia&gt; a &#61; CUDA.zeros&#40;1024,1024&#41;;julia&gt; @benchmark CUDA.@sync rand&#33;&#40;&#36;cuda_rng, &#36;a&#41;
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range &#40;min … max&#41;:  17.040 μs …  2.430 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 99.04&#37;
 Time  &#40;median&#41;:     18.500 μs              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   20.604 μs ± 34.734 μs  ┊ GC &#40;mean ± σ&#41;:  1.17&#37; ±  0.99&#37;         ▃▆█▇▇▅▄▂▁
  ▂▂▂▃▄▆███████████▇▆▆▅▅▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▄
  17 μs           Histogram: frequency by time        24.1 μs &lt;julia&gt; @benchmark CUDA.@sync rand&#33;&#40;&#36;gpuarrays_rng, &#36;a&#41;
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range &#40;min … max&#41;:  72.489 μs …  2.790 ms  ┊ GC &#40;min … max&#41;: 0.00&#37; … 98.44&#37;
 Time  &#40;median&#41;:     74.479 μs              ┊ GC &#40;median&#41;:    0.00&#37;
 Time  &#40;mean ± σ&#41;:   81.211 μs ± 61.598 μs  ┊ GC &#40;mean ± σ&#41;:  0.67&#37; ±  1.40&#37;  █                                                           ▁
  █▆▃▁▃▃▅▆▅▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▁▁▁▁▁▁▁▁▄▄▃▄▃▁▁▁▁▁▁▁▁▁▃▃▄▆▄▁▄▃▆ █
  72.5 μs      Histogram: log&#40;frequency&#41; by time       443 μs &lt;</code></pre>
<pre><code class="language-julia-repl">julia&gt; using RNGTest
julia&gt; test_cuda_rng &#61; RNGTest.wrap&#40;cuda_rng, UInt32&#41;;
julia&gt; test_gpuarrays_rng &#61; RNGTest.wrap&#40;gpuarrays_rng, UInt32&#41;;julia&gt; RNGTest.smallcrushTestU01&#40;test_cuda_rng&#41;
 All tests were passedjulia&gt; RNGTest.smallcrushTestU01&#40;test_gpuarrays_rng&#41;
 The following tests gave p-values outside &#91;0.001, 0.9990&#93;:       Test                          p-value
 ----------------------------------------------
  1  BirthdaySpacings                 eps
  2  Collision                        eps
  3  Gap                              eps
  4  SimpPoker                       1.0e-4
  5  CouponCollector                  eps
  6  MaxOft                           eps
  7  WeightDistrib                    eps
 10  RandomWalk1 M                   6.0e-4
 ----------------------------------------------
 &#40;eps  means a value &lt; 1.0e-300&#41;:</code></pre>
<h2 id="removal_of_old_memory_pools">Removal of old memory pools</h2>
<p>With the new stream-ordered allocator caching memory allocations at the CUDA library level, much of the need for Julia-managed memory pools has disappeared. To simplify the allocation code, we have <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1015">removed support for those Julia-managed memory pools</a> &#40;i.e., <code>binned</code>, <code>split</code> and <code>simple</code>&#41;. You can now only use the <code>cuda</code> memory pool, or use no pool at all by setting the <code>JULIA_CUDA_MEMORY_POOL</code> environment variable to <code>none</code>.</p>
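<p>For example, to run an application without any memory pool &#40;an illustrative session&#41;:</p>
<pre><code class="language-text">&#36; JULIA_CUDA_MEMORY_POOL&#61;none julia

julia&gt; using CUDA   # allocations now go directly to the CUDA driver</code></pre>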
<p>Not using a memory pool degrades performance, so if you are stuck on an NVIDIA driver that does not support CUDA 11.2, it is advised to remain on CUDA.jl 3.3 until you can upgrade.</p>
<p>Also note that the new stream-ordered allocator has <a href="https://github.com/JuliaGPU/CUDA.jl/issues/1053">turned out to be incompatible with legacy cuIpc APIs</a> as used by OpenMPI. If that applies to you, consider disabling the memory pool, or reverting to CUDA.jl 3.3 if your application&#39;s allocation pattern benefits from a memory pool.</p>
<p>Because of this, we will be maintaining CUDA.jl 3.3 longer than usual. All bug fixes in CUDA.jl 3.4 have already been backported to the previous release, which is currently at version 3.3.6.</p>
<h2 id="device_capability-dependent_kernel_code">Device capability-dependent kernel code</h2>
<p>Some of the improvements in this release depend on the ability to write generic code that only uses certain hardware features when they are available. To facilitate writing such code, the compiler now embeds metadata in the generated code that can be used to branch on.</p>
<p>Currently, the device capability and PTX ISA version are embedded and made available using the <code>compute_capability</code> and <code>ptx_isa_version</code> functions, respectively. A simplified version number type, constructible using the <code>sv&quot;...&quot;</code> string macro, can be used to test against these properties. For example:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;a&#41;
           a&#91;&#93; &#61; compute_capability&#40;&#41; &gt;&#61; sv&quot;6.0&quot; ? 1 : 2
           return
       end
kernel &#40;generic function with 1 method&#41;julia&gt; CUDA.code_llvm&#40;kernel, Tuple&#123;CuDeviceVector&#123;Float32, AS.Global&#125;&#125;&#41;
define void @julia_kernel_1&#40;&#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0&#41; &#123;
top:
  &#37;1 &#61; bitcast &#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0 to float addrspace&#40;1&#41;**
  &#37;2 &#61; load float addrspace&#40;1&#41;*, float addrspace&#40;1&#41;** &#37;1, align 8
  store float 1.000000e&#43;00, float addrspace&#40;1&#41;* &#37;2, align 4
  ret void
&#125;julia&gt; capability&#40;device&#33;&#40;1&#41;&#41;
v&quot;3.5.0&quot;julia&gt; CUDA.code_llvm&#40;kernel, Tuple&#123;CuDeviceVector&#123;Float32, AS.Global&#125;&#125;&#41;
define void @julia_kernel_2&#40;&#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0&#41; &#123;
top:
  &#37;1 &#61; bitcast &#123; i8 addrspace&#40;1&#41;*, i64, &#91;1 x i64&#93; &#125;* &#37;0 to float addrspace&#40;1&#41;**
  &#37;2 &#61; load float addrspace&#40;1&#41;*, float addrspace&#40;1&#41;** &#37;1, align 8
  store float 2.000000e&#43;00, float addrspace&#40;1&#41;* &#37;2, align 4
  ret void
&#125;</code></pre>
<p>The branch on the compute capability is completely optimized away. At the same time, this does not require re-inferring the function as the optimization happens at the LLVM level.</p>
<h2 id="other_changes">Other changes</h2>
<ul>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/1084">Support for CUDA 11.4 Update 1</a></p>
</li>
<li><p>Improved thread safety <a href="https://github.com/JuliaGPU/CUDA.jl/pull/993">&#91;1&#93;</a> <a href="https://github.com/JuliaGPU/CUDA.jl/pull/1074">&#91;2&#93;</a></p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 13 Aug 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.3]]></title>
  <link>https://juliagpu.org/post/2021-06-10-cuda_3.3/index.html</link>
  <guid>https://juliagpu.org/2021-06-10-cuda_3.3/</guid>
  <description><![CDATA[There have been several releases of CUDA.jl in the past couple of months, with many bugfixes and many exciting new features to improve GPU programming in Julia: &lt;code&gt;CuArray&lt;/code&gt; now supports isbits Unions, CUDA.jl can emit debug info for use with NVIDIA tools, and changes to the compiler make it even easier to use the latest version of the CUDA toolkit.]]></description>  
  
  <content:encoded><![CDATA[
<p>There have been several releases of CUDA.jl in the past couple of months, with many bugfixes and many exciting new features to improve GPU programming in Julia: <code>CuArray</code> now supports isbits Unions, CUDA.jl can emit debug info for use with NVIDIA tools, and changes to the compiler make it even easier to use the latest version of the CUDA toolkit.</p>
<h2 id="cuarray_support_for_isbits_unions"><code>CuArray</code> support for isbits Unions</h2>
<p>Unions are a way to represent values of one type or another, e.g., a value that can be an integer or a floating point. If all possible element types of a Union are so-called bitstypes, which can be stored contiguously in memory, the Union of these types can be stored contiguously too. This kind of optimization is implemented by the Array type, which can store such &quot;isbits Unions&quot; inline, as opposed to storing a pointer to a heap-allocated box. For more details, refer to the <a href="https://docs.julialang.org/en/v1/devdocs/isbitsunionarrays/">Julia documentation</a>.</p>
<p>With CUDA.jl 3.3, the CuArray GPU array type now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/941">supports this optimization too</a>. That means you can safely allocate CuArrays with isbits union element types and perform GPU-accelerated operations on them:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1, nothing, 3&#93;&#41;
3-element CuArray&#123;Union&#123;Nothing, Int64&#125;, 1&#125;:
 1
  nothing
 3julia&gt; findfirst&#40;isnothing, a&#41;
2</code></pre>
<p>It is also safe to pass these CuArrays to a kernel and use unions there:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;a&#41;
         i &#61; threadIdx&#40;&#41;.x
         if a&#91;i&#93; &#33;&#61;&#61; nothing
           a&#91;i&#93; &#43;&#61; 1
         end
         return
       endjulia&gt; @cuda threads&#61;3 kernel&#40;a&#41;julia&gt; a
3-element CuArray&#123;Union&#123;Nothing, Int64&#125;, 1&#125;:
 2
  nothing
 4</code></pre>
<p>This feature is especially valuable to represent missing values, and is an important step towards GPU support for DataFrames.jl.</p>
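<p>For example, the same mechanism should make it possible to compute with <code>missing</code> values on the GPU. The snippet below is a small sketch of ours, not from the original post:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; CuArray&#40;&#91;1.0, missing, 3.0&#93;&#41;;

julia&gt; count&#40;ismissing, a&#41;
1</code></pre>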
<h2 id="debug_and_location_information">Debug and location information</h2>
<p>Another noteworthy addition is the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/891">support for emitting debug and location information</a>. The debug level, set by passing <code>-g &lt;level&gt;</code> to the <code>julia</code> executable, determines how much info is emitted. The default of level 1 only enables location information instructions which should not impact performance. Passing <code>-g0</code> disables this, while passing <code>-g2</code> also enables the output of DWARF debug information and compiles in debug mode.</p>
<p>Location information is useful for a variety of reasons. Many tools, like the NVIDIA profilers, use it to correlate instructions to source code:</p>
<figure>
  <img src="https://juliagpu.org/post/2021-06-10-cuda_3.3/nvvp.png" alt="NVIDIA Visual Profiler with source-code location information">
</figure><p>Debug information can be used to debug compiled code using <code>cuda-gdb</code>:</p>
<pre><code class="language-julia">&#36; cuda-gdb --args julia -g2 examples/vadd.jl
&#40;cuda-gdb&#41; set cuda break_on_launch all
&#40;cuda-gdb&#41; run
&#91;Switching focus to CUDA kernel 0, grid 1, block &#40;0,0,0&#41;, thread &#40;0,0,0&#41;, device 0, sm 0, warp 0, lane 0&#93;
macro expansion &#40;&#41; at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
74                  Base.llvmcall&#40;&#40;&#36;ir,&#36;fn&#41;, &#36;rettyp, &#36;argtyp, &#36;&#40;args.args...&#41;&#41;&#40;cuda-gdb&#41; bt
#0  macro expansion &#40;&#41; at .julia/packages/LLVM/hHQuD/src/interop/base.jl:74
#1  macro expansion &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#2  _index &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:6
#3  blockIdx_x &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:56
#4  blockIdx &#40;&#41; at .julia/dev/CUDA/src/device/intrinsics/indexing.jl:76
#5  julia_vadd&lt;&lt;&lt;&#40;1,1,1&#41;,&#40;12,1,1&#41;&gt;&gt;&gt; &#40;a&#61;..., b&#61;..., c&#61;...&#41; at .julia/dev/CUDA/examples/vadd.jl:6&#40;cuda-gdb&#41; f 5
#5  julia_vadd&lt;&lt;&lt;&#40;1,1,1&#41;,&#40;12,1,1&#41;&gt;&gt;&gt; &#40;a&#61;..., b&#61;..., c&#61;...&#41; at .julia/dev/CUDA/examples/vadd.jl:6
6           i &#61; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x&#40;cuda-gdb&#41; l
1       using Test
2
3       using CUDA
4
5       function vadd&#40;a, b, c&#41;
6           i &#61; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x &#43; threadIdx&#40;&#41;.x
7           c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
8           return
9       end
10</code></pre>
<h2 id="improved_cuda_compatibility_support">Improved CUDA compatibility support</h2>
<p>As always, new CUDA.jl releases come with updated support for the CUDA toolkit. CUDA.jl is now compatible with <a href="https://github.com/JuliaGPU/CUDA.jl/pull/858">CUDA 11.3</a>, as well as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/945">CUDA 11.3 Update 1</a>. Users don&#39;t have to do anything to update to these versions, as CUDA.jl will automatically select and download the latest supported version.</p>
<p>Of course, for CUDA.jl to use the latest versions of the CUDA toolkit, a sufficiently recent version of the NVIDIA driver is required. Before CUDA 11.0, the driver&#39;s CUDA compatibility was a strict lower bound, and every minor CUDA release required a driver update. CUDA 11.0 comes with an enhanced compatibility option that follows semantic versioning, e.g., CUDA 11.3 can be used on an NVIDIA driver that only supports up to CUDA 11.0. CUDA.jl now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/936">follows semantic versioning</a> when selecting a compatible toolkit, making it easier to use the latest version of the CUDA toolkit in Julia.</p>
<p>For those interested: Implementing semantic versioning required the CUDA.jl compiler to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/892">use <code>ptxas</code> instead of the driver&#39;s embedded JIT</a> to generate GPU machine code. At the same time, many parts of CUDA.jl still use the CUDA driver APIs, so it&#39;s always recommended to keep your NVIDIA driver up-to-date.</p>
<h2 id="high-level_graph_apis">High-level graph APIs</h2>
<p>To overcome the cost of launching kernels, CUDA makes it possible to build computational graphs, and execute those graphs with less overhead than the underlying operations. In CUDA.jl we provide easy access to the APIs <a href="https://github.com/JuliaGPU/CUDA.jl/pull/877">to record and execute</a> these graphs:</p>
<pre><code class="language-julia">A &#61; CUDA.zeros&#40;Int, 1&#41;# ensure the operation is compiled
A .&#43;&#61; 1# capture
graph &#61; capture&#40;&#41; do
    A .&#43;&#61; 1
end
@test Array&#40;A&#41; &#61;&#61; &#91;1&#93;   # didn&#39;t change anything# instantiate and launch
exec &#61; instantiate&#40;graph&#41;
CUDA.launch&#40;exec&#41;
@test Array&#40;A&#41; &#61;&#61; &#91;2&#93;# update and instantiate/launch again
graph′ &#61; capture&#40;&#41; do
    A .&#43;&#61; 2
end
update&#40;exec, graph′&#41;
CUDA.launch&#40;exec&#41;
@test Array&#40;A&#41; &#61;&#61; &#91;4&#93;</code></pre>
<p>This sequence of operations is common enough that we provide a high-level <code>@captured</code> macro that automatically records, updates, instantiates and launches the graph:</p>
<pre><code class="language-julia">A &#61; CUDA.zeros&#40;Int, 1&#41;for i in 1:2
    @captured A .&#43;&#61; 1
end
@test Array&#40;A&#41; &#61;&#61; &#91;2&#93;</code></pre>
<h2 id="minor_changes_and_features">Minor changes and features</h2>
<ul>
<li><p>CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/842">now supports</a> <code>@atomic</code> multiplication and division &#40;by @yuehhua&#41;</p>
</li>
<li><p>Several statistics functions <a href="https://github.com/JuliaGPU/CUDA.jl/pull/509">have been implemented</a> &#40;by @berquist&#41;</p>
</li>
<li><p>The device-side random number generator in <a href="https://github.com/JuliaGPU/CUDA.jl/pull/890">is now based on Philox2x</a>, greatly improving quality of randomness &#40;passing BigCrush&#41; while allowing calls to <code>rand&#40;&#41;</code> from divergent threads.</p>
</li>
<li><p>Dependent libraries like CUDNN and CUTENSOR <a href="https://github.com/JuliaGPU/CUDA.jl/pull/882">are now only downloaded and initialized</a> when they are used.</p>
</li>
<li><p>The <code>synchronize&#40;&#41;</code> function in <a href="https://github.com/JuliaGPU/CUDA.jl/pull/896">now first spins</a> before yielding and sleeping, to improve the latency of short-running operations.</p>
</li>
<li><p>Several additional operations are now supported on Float16 inputs, such as <a href="https://github.com/JuliaGPU/CUDA.jl/pull/904">CUSPARSE and CUBLAS</a> operations, and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/871">various math intrinsics</a>.</p>
</li>
<li><p>Kepler support &#40;compute capability 3.5&#41; <a href="https://github.com/JuliaGPU/CUDA.jl/pull/923">has been reinstated</a> for the time being.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Thu, 10 Jun 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 3.0]]></title>
  <link>https://juliagpu.org/post/2021-04-09-cuda_3.0/index.html</link>
  <guid>https://juliagpu.org/2021-04-09-cuda_3.0/</guid>
  <description><![CDATA[CUDA.jl 3.0 is a significant, semi-breaking release that features greatly improved multi-tasking and multi-threading, support for CUDA 11.2 and its new memory allocator, compiler tooling for GPU method overrides, device-side random number generation and a completely revamped cuDNN interface.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 3.0 is a significant, semi-breaking release that features greatly improved multi-tasking and multi-threading, support for CUDA 11.2 and its new memory allocator, compiler tooling for GPU method overrides, device-side random number generation and a completely revamped cuDNN interface.</p>
<h2 id="improved_multi-tasking_and_multi-threading">Improved multi-tasking and multi-threading</h2>
<p>Before this release, CUDA operations were enqueued on a single global stream, and many of these operations &#40;like copying memory, or synchronizing execution&#41; were fully blocking. This posed difficulties when using multiple tasks to perform independent operations: blocking operations prevent all tasks from making progress, and sharing a single stream introduces unintended dependencies between otherwise independent operations. <strong>CUDA.jl now uses <a href="https://github.com/JuliaGPU/CUDA.jl/pull/662">private streams for each Julia task</a>, and avoids blocking operations where possible, enabling task-based concurrent execution.</strong> It is also possible to use different devices on each task, and there is experimental support for executing those tasks from different threads.</p>
<p>A <s>picture</s> snippet of code is worth a thousand words, so let&#39;s demonstrate using a computation that uses both a library function &#40;GEMM from CUBLAS&#41; and a native Julia broadcast kernel:</p>
<pre><code class="language-julia">using CUDA, LinearAlgebra

function compute&#40;a,b,c&#41;
    mul&#33;&#40;c, a, b&#41;
    broadcast&#33;&#40;sin, c, c&#41;
    synchronize&#40;&#41;
    c
end</code></pre>
<p>To execute multiple invocations of this function concurrently, we can simply use Julia&#39;s task-based programming interfaces and wrap each call to <code>compute</code> in an <code>@async</code> block. Then, we synchronize execution again by wrapping in a <code>@sync</code> block:</p>
<pre><code class="language-julia">function iteration&#40;a,b,c&#41;
    results &#61; Vector&#123;Any&#125;&#40;undef, 2&#41;
    NVTX.@range &quot;computation&quot; @sync begin
        @async begin
            results&#91;1&#93; &#61; compute&#40;a,b,c&#41;
        end
        @async begin
            results&#91;2&#93; &#61; compute&#40;a,b,c&#41;
        end
    end
    NVTX.@range &quot;comparison&quot; Array&#40;results&#91;1&#93;&#41; &#61;&#61; Array&#40;results&#91;2&#93;&#41;
end</code></pre>
<p>The calls to the <code>@range</code> macro from NVTX, a submodule of CUDA.jl, will visualize the different phases of execution when we profile our program. We now invoke our function using some random data:</p>
<pre><code class="language-julia">function main&#40;N&#61;1024&#41;
    a &#61; CUDA.rand&#40;N,N&#41;
    b &#61; CUDA.rand&#40;N,N&#41;
    c &#61; CUDA.rand&#40;N,N&#41;

    # make sure this data can be used by other tasks&#33;
    synchronize&#40;&#41;

    # warm-up
    iteration&#40;a,b,c&#41;
    GC.gc&#40;true&#41;

    NVTX.@range &quot;main&quot; iteration&#40;a,b,c&#41;
end</code></pre>
<p>The snippet above illustrates one breaking aspect of this release: Because each task uses its own stream, <strong>you now need to synchronize when re-using data in another task.</strong> Although it is unlikely that any user code was relying on the old behavior, it is technically a breaking change, and as such we are bumping the major version of the CUDA.jl package.</p>
<p>If we profile our program using NSight Systems, we can see how the execution of both calls to <code>compute</code> was overlapped:</p>
<figure>
  <img src="https://juliagpu.org/post/2021-04-09-cuda_3.0/task_based_concurrency.png" alt="Overlapping execution on the GPU using task-based concurrency">
</figure><p>The region highlighted in green was spent enqueueing operations from the CPU, which includes the call to <code>synchronize&#40;&#41;</code>. This used to be a blocking operation, whereas now it only synchronizes the task-local stream while yielding to the Julia scheduler so that it can continue execution on another task. <strong>For synchronizing the entire device, use the new <code>device_synchronize&#40;&#41;</code> function.</strong></p>
<p>The remainder of computation was then spent executing kernels. Here, execution was overlapped, but that obviously depends on the exact characteristics of the computations and your GPU. Also note that copying to and from the CPU is always going to block for some time, unless the memory was page-locked. CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/760">now supports</a> locking memory like that using the <code>pin</code> function; for more details refer to <a href="https://juliagpu.github.io/CUDA.jl/dev/usage/multitasking/">the CUDA.jl documentation on tasks and threads</a>.</p>
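<p>A minimal sketch of page-locking memory with <code>pin</code> &#40;the array names and sizes below are illustrative assumptions&#41;:</p>
<pre><code class="language-julia">using CUDA

a &#61; rand&#40;Float32, 1024, 1024&#41;   # an ordinary CPU array
CUDA.pin&#40;a&#41;                     # page-lock its memory, enabling asynchronous copies

d_a &#61; CuArray&#40;a&#41;                # uploads can now overlap with other work
copyto&#33;&#40;a, d_a&#41;                 # and so can downloads</code></pre>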
<h2 id="cuda_112_and_stream-ordered_allocations">CUDA 11.2 and stream-ordered allocations</h2>
<p>CUDA.jl now also fully supports CUDA 11.2, and it will default to using that version of the toolkit if your driver supports it. The release came with several new features, such as <a href="https://developer.nvidia.com/blog/enhancing-memory-allocation-with-new-cuda-11-2-features/">the new stream-ordered memory allocator</a>. Without going into details, it is now possible to asynchronously allocate memory, obviating much of the need to cache those allocations in a memory pool. Initial benchmarks have shown nice speed-ups from using this allocator, while lowering memory pressure and thus reducing invocations of the Julia garbage collector.</p>
<p>When using CUDA 11.2, CUDA.jl will <a href="https://github.com/JuliaGPU/CUDA.jl/pull/679">default to the CUDA-backed memory pool</a> and disable its own caching layer. If you want to compare performance, you can still use the old allocator and caching memory pool by setting the <code>JULIA_CUDA_MEMORY_POOL</code> environment variable to, e.g. <code>binned</code>. On older versions of CUDA, the <code>binned</code> pool is still used by default.</p>
<h2 id="gpu_method_overrides">GPU method overrides</h2>
<p>With the new <code>AbstractInterpreter</code> functionality in Julia 1.6, it is now much easier to further customize the Base compiler. This has enabled us to develop <a href="https://github.com/JuliaGPU/GPUCompiler.jl/pull/151">a mechanism for overriding methods with GPU-specific counterparts</a>. It used to be required to explicitly pick CUDA-specific versions, e.g. <code>CUDA.sin</code>, because the Base version performed some GPU-incompatible operation. This was problematic as it did not compose with generic code, and the CUDA-specific versions often lacked support for specific combinations of argument types &#40;for example, <code>CUDA.sin&#40;::Complex&#41;</code> was not supported&#41;.</p>
<p>With CUDA 3.0, it is possible to <strong>define GPU-specific methods that override an existing definition, without requiring a new function type</strong>. For now, this functionality is private to CUDA.jl, but we expect to make it available to other packages starting with Julia 1.7.</p>
<p>This functionality has unblocked <em>many</em> issues, as can be seen in the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/750">corresponding pull request</a>. It is now no longer needed to prefix a call with the CUDA module to ensure a GPU-compatible version is used. Furthermore, it also protects users from accidentally calling GPU intrinsics, as doing so will now result in an error instead of a crash:</p>
<pre><code class="language-text">julia&gt; CUDA.saturate&#40;1f0&#41;
ERROR: This function is not intended for use on the CPU
Stacktrace:
 &#91;1&#93; error&#40;s::String&#41;
   @ Base ./error.jl:33
 &#91;2&#93; saturate&#40;x::Float32&#41;
   @ CUDA ~/Julia/pkg/CUDA/src/device/intrinsics.jl:23
 &#91;3&#93; top-level scope
   @ REPL&#91;10&#93;:1</code></pre>
<h2 id="device-side_random_number_generation">Device-side random number generation</h2>
<p>As an illustration of the value of GPU method overrides, CUDA.jl now provides a device-side random number generator that is accessible by simply calling <code>rand&#40;&#41;</code> from a kernel:</p>
<pre><code class="language-julia">julia&gt; function kernel&#40;&#41;
         @cushow rand&#40;&#41;
         return
       end
kernel &#40;generic function with 1 method&#41;julia&gt; @cuda kernel&#40;&#41;
rand&#40;&#41; &#61; 0.668274</code></pre>
<p>This works by overriding the <code>Random.default_rng&#40;&#41;</code> method, and providing a GPU-compatible random number generator: Building on <a href="https://github.com/JuliaGPU/CUDA.jl/pull/772">exploratory work</a> by <a href="https://github.com/S-D-R">@S-D-R</a>, the <a href="https://github.com/JuliaGPU/CUDA.jl/pull/788">current generator</a> is a maximally equidistributed combined Tausworthe RNG that shares 32 bytes of random state across the threads in a warp for performance. The generator performs well, but <a href="https://github.com/JuliaGPU/CUDA.jl/issues/803">does not pass</a> the Crush battery of tests, so PRs to improve the implementation are welcome&#33;</p>
<p>Note that for host-side operations, e.g. <code>rand&#33;&#40;::CuArray&#41;</code>, this generator is not yet used by default. Instead, we use CURAND whenever possible, and fall back to the slower but more fully-featured generator from GPUArrays.jl in other cases.</p>
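<p>As an illustration &#40;a minimal sketch, assuming a CUDA-capable device&#41;, host-side generation can happen in-place or through the convenience constructors, with CURAND used behind the scenes for supported element types:</p>
<pre><code class="language-julia">using CUDA, Random

a = CUDA.zeros(Float32, 2, 2)
rand!(a)                   # fills `a` in-place, using CURAND when possible

b = CUDA.rand(Float64, 4)  # convenience constructor, also CURAND-backed</code></pre>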
<h2 id="revamped_cudnn_interface">Revamped cuDNN interface</h2>
<p>Finally, the cuDNN wrappers have been <a href="https://github.com/JuliaGPU/CUDA.jl/pull/523">completely revamped</a> by <a href="https://github.com/denizyuret">@denizyuret</a>. The goal of the redesign is to more faithfully map the cuDNN API to more natural Julia functions, so that packages like Knet.jl or NNlib.jl can more easily use advanced cuDNN features without having to resort to low-level C calls. For more details, refer to <a href="https://github.com/JuliaGPU/CUDA.jl/blob/da7c6eee82d6ea0eee1cb75c8589c8a92b0bc474/lib/cudnn/README.md">the design document</a>. As part of this redesign, the high-level wrappers of CUDNN <a href="https://github.com/FluxML/NNlib.jl/pull/286">have been moved to</a> a subpackage of NNlib.jl.</p>
]]></content:encoded>
    
  <pubDate>Fri, 09 Apr 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.4 and 2.5]]></title>
  <link>https://juliagpu.org/post/2021-01-08-cuda_2.4_2.5/index.html</link>
  <guid>https://juliagpu.org/2021-01-08-cuda_2.4_2.5/</guid>
  <description><![CDATA[CUDA.jl v2.4 and v2.5 are two almost-identical feature releases, respectively for Julia 1.5 and 1.6. These releases feature greatly improved &lt;code&gt;findmin&lt;/code&gt; and &lt;code&gt;findmax&lt;/code&gt; kernels, an improved interface for kernel introspection, support for CUDA 11.2, and of course many bug fixes.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v2.4 and v2.5 are two almost-identical feature releases, respectively for Julia 1.5 and 1.6. These releases feature greatly improved <code>findmin</code> and <code>findmax</code> kernels, an improved interface for kernel introspection, support for CUDA 11.2, and of course many bug fixes.</p>
<h2 id="improved_findmin_and_findmax_kernels">Improved <code>findmin</code> and <code>findmax</code> kernels</h2>
<p>Thanks to <a href="https://github.com/tkf">@tkf</a> and <a href="https://github.com/Ellipse0934">@Ellipse0934</a>, CUDA.jl now <a href="https://github.com/JuliaGPU/CUDA.jl/pull/576">uses a single-pass kernel for finding the minimum or maximum item in a CuArray</a>. This fixes compatibility with <code>NaN</code>-valued elements, while on average improving performance. Depending on the rank, shape and size of the array these improvements vary from a minor regression to order-of-magnitude improvements.</p>
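<p>For instance, reductions over arrays containing <code>NaN</code> now behave like their CPU counterparts. A minimal sketch &#40;assuming a CUDA-capable device&#41;:</p>
<pre><code class="language-julia">using CUDA

a = cu([1f0, NaN32, 3f0])
findmax(a)  # propagates the NaN, matching `findmax` on a plain `Array`</code></pre>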
<h2 id="new_kernel_introspection_interface">New kernel introspection interface</h2>
<p>It is now possible to obtain a compiled-but-not-launched kernel by passing the <code>launch&#61;false</code> keyword to <code>@cuda</code>. This is useful for reflection, e.g., to query the number of registers or other kernel properties:</p>
<pre><code class="language-julia">julia&gt; kernel &#61; @cuda launch&#61;false identity&#40;nothing&#41;
CUDA.HostKernel&#123;identity,Tuple&#123;Nothing&#125;&#125;&#40;...&#41;

julia&gt; CUDA.registers&#40;kernel&#41;
4</code></pre>
<p>The old API is still available, and will even be extended in future versions of CUDA.jl for the purpose of compiling device functions &#40;not kernels&#41;:</p>
<pre><code class="language-julia">julia&gt; kernel &#61; cufunction&#40;identity, Tuple&#123;Nothing&#125;&#41;
CUDA.HostKernel&#123;identity,Tuple&#123;Nothing&#125;&#125;&#40;...&#41;</code></pre>
<h2 id="support_for_cuda_112">Support for CUDA 11.2</h2>
<p>CUDA.jl now supports the latest version of CUDA, version 11.2. Because CUDNN and CUTENSOR are not compatible with this release yet, CUDA.jl won&#39;t automatically switch to it unless you explicitly request it:</p>
<pre><code class="language-julia">julia&gt; ENV&#91;&quot;JULIA_CUDA_VERSION&quot;&#93; &#61; &quot;11.2&quot;
&quot;11.2&quot;julia&gt; using CUDAjulia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 11.2.0, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4</code></pre>
<p>Alternatively, if you disable use of artifacts through <code>JULIA_CUDA_USE_BINARYBUILDER&#61;false</code>, CUDA 11.2 can be picked up from your local system.</p>
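<p>For example, disabling artifacts could look as follows &#40;a sketch; the environment variable needs to be set before CUDA.jl is loaded&#41;:</p>
<pre><code class="language-julia-repl">julia&gt; ENV["JULIA_CUDA_USE_BINARYBUILDER"] = "false"

julia&gt; using CUDA

julia&gt; CUDA.versioninfo()  # should now report a local installation</code></pre>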
<h2 id="future_developments">Future developments</h2>
<p>Due to upstream compiler changes, CUDA.jl 2.4 is expected to be the last release compatible with Julia 1.5. Patch releases are still possible, but are not automatic: if you need a specific bugfix from a future CUDA.jl release, create an issue or PR to backport the change.</p>
]]></content:encoded>
    
  <pubDate>Fri, 08 Jan 2021 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Introducing: oneAPI.jl]]></title>
  <link>https://juliagpu.org/post/2020-11-05-oneapi_0.1/index.html</link>
  <guid>https://juliagpu.org/2020-11-05-oneapi_0.1/</guid>
  <description><![CDATA[We&#39;re proud to announce the first version of oneAPI.jl, a Julia package for programming accelerators with the &lt;a href&#61;&quot;https://www.oneapi.com/&quot;&gt;oneAPI programming model&lt;/a&gt;. It is currently available for select Intel GPUs, including common integrated ones, and offers a similar experience to CUDA.jl.]]></description>  
  
  <content:encoded><![CDATA[
<p>We&#39;re proud to announce the first version of oneAPI.jl, a Julia package for programming accelerators with the <a href="https://www.oneapi.com/">oneAPI programming model</a>. It is currently available for select Intel GPUs, including common integrated ones, and offers a similar experience to CUDA.jl.</p>
<p>The initial version of this package, v0.1, consists of three key components:</p>
<ul>
<li><p>wrappers for the oneAPI Level Zero interfaces;</p>
</li>
<li><p>a compiler for Julia source code to SPIR-V IR;</p>
</li>
<li><p>and an array interface for convenient data-parallel programming.</p>
</li>
</ul>
<p>In this post, I&#39;ll briefly describe each of these. But first, some essentials.</p>
<h2 id="installation">Installation</h2>
<p>oneAPI.jl is currently only supported on 64-bit Linux with a sufficiently recent kernel, and requires Julia 1.5. Furthermore, it currently only supports a limited set of Intel GPUs: Gen9 &#40;Skylake, Kaby Lake, Coffee Lake&#41;, Gen11 &#40;Ice Lake&#41;, and Gen12 &#40;Tiger Lake&#41;.</p>
<p>If your Intel CPU has an integrated GPU supported by oneAPI, you can just go ahead and install the oneAPI.jl package:</p>
<pre><code class="language-julia">pkg&gt; add oneAPI</code></pre>
<p>That&#39;s right, no additional drivers required&#33; oneAPI.jl ships its own copy of the <a href="https://github.com/intel/compute-runtime">Intel Compute Runtime</a>, which works out of the box on any &#40;sufficiently recent&#41; Linux kernel. The initial download, powered by Julia&#39;s artifact subsystem, might take a while to complete. After that, you can import the package and start using its functionality:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPIjulia&gt; oneAPI.versioninfo&#40;&#41;
Binary dependencies:
- NEO_jll: 20.42.18209&#43;0
- libigc_jll: 1.0.5186&#43;0
- gmmlib_jll: 20.3.2&#43;0
- SPIRV_LLVM_Translator_jll: 9.0.0&#43;1
- SPIRV_Tools_jll: 2020.2.0&#43;1

Toolchain:
- Julia: 1.5.2
- LLVM: 9.0.1

1 driver:
- 00007fee-06cb-0a10-1642-ca9f01000000 &#40;v1.0.0, API v1.0.0&#41;

1 device:
- Intel&#40;R&#41; Graphics Gen9</code></pre>
<h2 id="the_onearray_type">The <code>oneArray</code> type</h2>
<p>Similar to CUDA.jl&#39;s <code>CuArray</code> type, oneAPI.jl provides an array abstraction that you can use to easily perform data parallel operations on your GPU:</p>
<pre><code class="language-julia-repl">julia&gt; a &#61; oneArray&#40;zeros&#40;2,3&#41;&#41;
2×3 oneArray&#123;Float64,2&#125;:
 0.0  0.0  0.0
 0.0  0.0  0.0

julia&gt; a .&#43; 1
2×3 oneArray&#123;Float64,2&#125;:
 1.0  1.0  1.0
 1.0  1.0  1.0

julia&gt; sum&#40;ans; dims&#61;2&#41;
2×1 oneArray&#123;Float64,2&#125;:
 3.0
 3.0</code></pre>
<p>This functionality builds on the <a href="https://github.com/JuliaGPU/GPUArrays.jl/">GPUArrays.jl</a> package, which means that a lot of operations are supported out of the box. Some are still missing, of course, and we haven&#39;t carefully optimized for performance either.</p>
<h2 id="kernel_programming">Kernel programming</h2>
<p>The above array operations are made possible by a compiler that transforms Julia source code into SPIR-V IR for use with oneAPI. Most of this work is part of <a href="https://github.com/JuliaGPU/GPUCompiler.jl">GPUCompiler.jl</a>. In oneAPI.jl, we use this compiler to provide a kernel programming model:</p>
<pre><code class="language-julia-repl">julia&gt; function vadd&#40;a, b, c&#41;
           i &#61; get_global_id&#40;&#41;
           @inbounds c&#91;i&#93; &#61; a&#91;i&#93; &#43; b&#91;i&#93;
           return
       end

julia&gt; a &#61; oneArray&#40;rand&#40;10&#41;&#41;;

julia&gt; b &#61; oneArray&#40;rand&#40;10&#41;&#41;;

julia&gt; c &#61; similar&#40;a&#41;;

julia&gt; @oneapi items&#61;10 vadd&#40;a, b, c&#41;

julia&gt; @test Array&#40;a&#41; .&#43; Array&#40;b&#41; &#61;&#61; Array&#40;c&#41;
Test Passed</code></pre>
<p>Again, the <code>@oneapi</code> macro resembles <code>@cuda</code> from CUDA.jl. One of the differences with the CUDA stack is that we use OpenCL-style built-ins, like <code>get_global_id</code> instead of <code>threadIdx</code> and <code>barrier</code> instead of <code>sync_threads</code>. Other familiar functionality, e.g. to reflect on the compiler, is available as well:</p>
<pre><code class="language-julia-repl">julia&gt; @device_code_spirv @oneapi vadd&#40;a, b, c&#41;
; CompilerJob of kernel vadd&#40;oneDeviceArray&#123;Float64,1,1&#125;,
;                            oneDeviceArray&#123;Float64,1,1&#125;,
;                            oneDeviceArray&#123;Float64,1,1&#125;&#41;
; for GPUCompiler.SPIRVCompilerTarget; SPIR-V
; Version: 1.0
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 46
; Schema: 0
               OpCapability Addresses
               OpCapability Linkage
               OpCapability Kernel
               OpCapability Float64
               OpCapability Int64
               OpCapability Int8
          &#37;1 &#61; OpExtInstImport &quot;OpenCL.std&quot;
               OpMemoryModel Physical64 OpenCL
               OpEntryPoint Kernel
               ...
               OpReturn
               OpFunctionEnd</code></pre>
<h2 id="level_zero_wrappers">Level Zero wrappers</h2>
<p>To interface with the oneAPI driver, we use the <a href="https://github.com/oneapi-src/level-zero">Level Zero API</a>. Wrappers for this API are available under the <code>oneL0</code> submodule of oneAPI.jl:</p>
<pre><code class="language-julia-repl">julia&gt; using oneAPI.oneL0julia&gt; drv &#61; first&#40;drivers&#40;&#41;&#41;
ZeDriver&#40;00000000-0000-0000-1642-ca9f01000000, version 1.0.0&#41;julia&gt; dev &#61; first&#40;devices&#40;drv&#41;&#41;
ZeDevice&#40;GPU, vendor 0x8086, device 0x1912&#41;: Intel&#40;R&#41; Graphics Gen9</code></pre>
<p>This is a low-level interface, and importing this submodule should not be required for the vast majority of users. It is only useful when you want to perform very specific operations, like submitting certain operations to the command queue, working with events, etc. In that case, you should refer to the <a href="https://spec.oneapi.com/level-zero/latest/index.html">upstream specification</a>; the wrappers in the <code>oneL0</code> module closely mimic the C APIs.</p>
<h2 id="status">Status</h2>
<p>Version 0.1 of oneAPI.jl forms a solid base for future oneAPI developments in Julia. Thanks to the continued effort of generalizing the Julia GPU support in packages like GPUArrays.jl and GPUCompiler.jl, this initial version is already much more usable than early versions of CUDA.jl or AMDGPU.jl ever were.</p>
<p>That said, there are crucial parts missing. For one, oneAPI.jl does not integrate with any of the vendor libraries like oneMKL or oneDNN. That means several important operations, e.g. matrix-matrix multiplication, will be slow. Hardware support is also limited, and the package currently only works on Linux.</p>
<p>If you want to contribute to oneAPI.jl, or run into problems, check out the GitHub repository at <a href="https://github.com/JuliaGPU/oneAPI.jl">JuliaGPU/oneAPI.jl</a>. For questions, please use the <a href="https://discourse.julialang.org/c/domain/gpu">Julia Discourse forum</a> under the GPU domain and/or in the #gpu channel of the <a href="https://julialang.org/community/">Julia Slack</a>.</p>
]]></content:encoded>
    
  <pubDate>Thu, 05 Nov 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.1]]></title>
  <link>https://juliagpu.org/post/2020-10-30-cuda_2.1/index.html</link>
  <guid>https://juliagpu.org/2020-10-30-cuda_2.1/</guid>
  <description><![CDATA[CUDA.jl v2.1 is a bug-fix release, with one new feature: support for cubic texture interpolations. The release also partly reverts a change from v2.0: &lt;code&gt;reshape&lt;/code&gt;, &lt;code&gt;reinterpret&lt;/code&gt; and contiguous &lt;code&gt;view&lt;/code&gt;s now return a &lt;code&gt;CuArray&lt;/code&gt; again.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl v2.1 is a bug-fix release, with one new feature: support for cubic texture interpolations. The release also partly reverts a change from v2.0: <code>reshape</code>, <code>reinterpret</code> and contiguous <code>view</code>s now return a <code>CuArray</code> again.</p>
<h2 id="generalized_texture_interpolations">Generalized texture interpolations</h2>
<p>CUDA&#39;s texture hardware only supports nearest-neighbour and linear interpolation; for other modes, one is required to perform the interpolation by hand. In CUDA.jl v2.1 we are generalizing the texture interpolation API so that it is possible to use both hardware-backed and software-implemented interpolation modes in exactly the same way:</p>
<pre><code class="language-julia"># N is the dimensionality &#40;1, 2 or 3&#41;
# T is the element type &#40;needs to be supported by the texture hardware&#41;

# source array
src &#61; rand&#40;T, fill&#40;10, N&#41;...&#41;

# indices we want to interpolate
idx &#61; &#91;tuple&#40;rand&#40;1:0.1:10, N&#41;...&#41; for _ in 1:10&#93;

# upload to the GPU
gpu_src &#61; CuArray&#40;src&#41;
gpu_idx &#61; CuArray&#40;idx&#41;

# create a texture array for optimized fetching
# this is required for N&#61;1, optional for N&#61;2 and N&#61;3
gpu_src &#61; CuTextureArray&#40;gpu_src&#41;

# interpolate using a texture
gpu_dst &#61; CuArray&#123;T&#125;&#40;undef, size&#40;gpu_idx&#41;&#41;
gpu_tex &#61; CuTexture&#40;gpu_src; interpolation&#61;CUDA.NearestNeighbour&#40;&#41;&#41;
broadcast&#33;&#40;gpu_dst, gpu_idx, Ref&#40;gpu_tex&#41;&#41; do idx, tex
    tex&#91;idx...&#93;
end

# back to the CPU
dst &#61; Array&#40;gpu_dst&#41;</code></pre>
<p>Here, we can change the <code>interpolation</code> argument to <code>CuTexture</code> to either <code>NearestNeighbour</code> or <code>LinearInterpolation</code>, both supported by the hardware, or <code>CubicInterpolation</code> which is implemented in software &#40;building on the hardware-supported linear interpolation&#41;.</p>
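<p>For example, switching the listing above to software-implemented cubic interpolation only requires constructing the texture differently &#40;a sketch reusing the variables from the previous listing&#41;:</p>
<pre><code class="language-julia">gpu_tex = CuTexture(gpu_src; interpolation=CUDA.CubicInterpolation())</code></pre>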
<h2 id="partial_revert_of_array_wrapper_changes">Partial revert of array wrapper changes</h2>
<p>In CUDA.jl v2.0, we changed the behavior of several important array operations to reuse available wrappers in Base: <code>reshape</code> started returning a <code>ReshapedArray</code>, <code>view</code> a <code>SubArray</code>, and <code>reinterpret</code> was reworked to use <code>ReinterpretArray</code>. These changes were made to ensure maximal compatibility with Base&#39;s array types, and to simplify the implementation in CUDA.jl and GPUArrays.jl.</p>
<p>However, this change turned out to regress the time to precompile and load CUDA.jl. Consequently, the change has been reverted, and these wrappers are now implemented as part of the <code>CuArray</code> type again. Note however that we intend to revisit this change in the future. It is therefore recommended to use the <code>DenseCuArray</code> type alias for methods that need a <code>CuArray</code> backed by contiguous GPU memory. For strided <code>CuArray</code>s, i.e. non-contiguous views, you should use the <code>StridedCuArray</code> alias.</p>
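<p>As a sketch of how these aliases can be used in method signatures &#40;<code>scale_contiguous!</code> is a hypothetical function name&#41;:</p>
<pre><code class="language-julia">using CUDA

# only accepts arrays backed by contiguous GPU memory
function scale_contiguous!(a::DenseCuArray{Float32}, s::Float32)
    a .*= s
    return a
end

# non-contiguous views would instead dispatch to a method
# taking a StridedCuArray argument</code></pre>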
]]></content:encoded>
    
  <pubDate>Fri, 30 Oct 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 2.0]]></title>
  <link>https://juliagpu.org/post/2020-10-02-cuda_2.0/index.html</link>
  <guid>https://juliagpu.org/2020-10-02-cuda_2.0/</guid>
  <description><![CDATA[Today we&#39;re releasing CUDA.jl 2.0, a breaking release with several new features. Highlights include initial support for Float16, a switch to CUDA&#39;s new stream model, a much-needed rework of the sparse array support and support for CUDA 11.1.]]></description>  
  
  <content:encoded><![CDATA[
<p>Today we&#39;re releasing CUDA.jl 2.0, a breaking release with several new features. Highlights include initial support for Float16, a switch to CUDA&#39;s new stream model, a much-needed rework of the sparse array support and support for CUDA 11.1.</p>
<p>The release now requires <strong>Julia 1.5</strong>, and assumes a GPU with <strong>compute capability 5.0</strong> or higher &#40;although most of the package will still work with an older GPU&#41;.</p>
<h2 id="low-_and_mixed-precision_operations">Low- and mixed-precision operations</h2>
<p>With NVIDIA&#39;s latest GPUs featuring more and more low-precision operations, CUDA.jl <a href="https://github.com/JuliaGPU/CUDA.jl/pull/417">now</a> starts to support these data types. For example, the CUBLAS wrappers can be used with &#40;B&#41;Float16 inputs &#40;running under <code>JULIA_DEBUG&#61;CUBLAS</code> to illustrate the called methods&#41; thanks to the <code>cublasGemmEx</code> API call:</p>
<pre><code class="language-julia-repl">julia&gt; mul&#33;&#40;CUDA.zeros&#40;Float32,2,2&#41;,
            cu&#40;rand&#40;Float16,2,2&#41;&#41;,
            cu&#40;rand&#40;Float16,2,2&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_16F&#40;2&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_16F&#40;2&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F&#40;68&#41;
2×2 CuArray&#123;Float32,2&#125;:
 0.481284  0.561241
 1.12923   1.04541</code></pre>
<pre><code class="language-julia-repl">julia&gt; using BFloat16sjulia&gt; mul&#33;&#40;CUDA.zeros&#40;BFloat16,2,2&#41;,
            cu&#40;BFloat16.&#40;rand&#40;2,2&#41;&#41;&#41;,
            cu&#40;BFloat16.&#40;rand&#40;2,2&#41;&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_16BF&#40;14&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F&#40;68&#41;
2×2 CuArray&#123;BFloat16,2&#125;:
 0.300781   0.71875
 0.0163574  0.0241699</code></pre>
<p>Alternatively, CUBLAS can be configured to automatically down-cast 32-bit inputs to Float16. This is <a href="https://github.com/JuliaGPU/CUDA.jl/pull/424">now</a> exposed through a task-local CUDA.jl math mode:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.math_mode&#33;&#40;CUDA.FAST_MATH; precision&#61;:Float16&#41;julia&gt; mul&#33;&#40;CuArray&#40;zeros&#40;Float32,2,2&#41;&#41;,
            CuArray&#40;rand&#40;Float32,2,2&#41;&#41;,
            CuArray&#40;rand&#40;Float32,2,2&#41;&#41;&#41;
I&#33; cuBLAS &#40;v11.0&#41; function cublasStatus_t cublasGemmEx&#40;...&#41; called:
i&#33;  Atype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  Btype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  Ctype: type&#61;cudaDataType_t; val&#61;CUDA_R_32F&#40;0&#41;
i&#33;  computeType: type&#61;cublasComputeType_t; val&#61;CUBLAS_COMPUTE_32F_FAST_16F&#40;74&#41;
2×2 CuArray&#123;Float32,2&#125;:
 0.175258  0.226159
 0.511893  0.331351</code></pre>
<p>As part of these changes, CUDA.jl now defaults to using tensor cores. This may affect accuracy; use math mode <code>PEDANTIC</code> if you want the old behavior.</p>
<p>Work is <a href="https://github.com/JuliaGPU/CUDA.jl/issues/391">under way</a> to extend these capabilities to the rest of CUDA.jl, e.g., the CUDNN wrappers, or the native kernel programming capabilities.</p>
<h2 id="new_default_stream_semantics">New default stream semantics</h2>
<p>In CUDA.jl 2.0 we&#39;re <a href="https://github.com/JuliaGPU/CUDA.jl/pull/395">switching</a> to CUDA&#39;s <a href="https://developer.nvidia.com/blog/gpu-pro-tip-cuda-7-streams-simplify-concurrency/">simplified stream programming model</a>. This simplifies working with multiple streams, and opens up more possibilities for concurrent execution of GPU operations.</p>
<h3 id="multi-stream_programming">Multi-stream programming</h3>
<p>In the old model, the default stream &#40;used by all GPU operations unless specified otherwise&#41; was a special stream whose commands could not be executed concurrently with commands on regular, explicitly-created streams. For example, if we interleaved kernels executed on a dedicated stream with ones on the default stream, execution was serialized:</p>
<pre><code class="language-julia">using CUDA

N &#61; 1 &lt;&lt; 20

function kernel&#40;x, n&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x
    for i &#61; tid:blockDim&#40;&#41;.x*gridDim&#40;&#41;.x:n
        x&#91;i&#93; &#61; CUDA.sqrt&#40;CUDA.pow&#40;3.14159f0, i&#41;&#41;
    end
    return
end

num_streams &#61; 8

for i in 1:num_streams
    stream &#61; CuStream&#40;&#41;

    data &#61; CuArray&#123;Float32&#125;&#40;undef, N&#41;

    @cuda blocks&#61;1 threads&#61;64 stream&#61;stream kernel&#40;data, N&#41;

    @cuda kernel&#40;data, 0&#41;
end</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multistream_before.png" alt="Multi-stream programming (old)">
</figure><p>In the new model, default streams are regular streams and commands issued on them can execute concurrently with those on other streams:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multistream_after.png" alt="Multi-stream programming (new)">
</figure><h3 id="multi-threading">Multi-threading</h3>
<p>Another consequence of the new stream model is that each thread gets its own default stream &#40;accessible as <code>CuStreamPerThread&#40;&#41;</code>&#41;. Together with Julia&#39;s threading capabilities, this makes it trivial to group independent work in tasks, benefiting from concurrent execution on the GPU where possible:</p>
<pre><code class="language-julia">using CUDA

N &#61; 1 &lt;&lt; 20

function kernel&#40;x, n&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x-1&#41; * blockDim&#40;&#41;.x
    for i &#61; tid:blockDim&#40;&#41;.x*gridDim&#40;&#41;.x:n
        x&#91;i&#93; &#61; CUDA.sqrt&#40;CUDA.pow&#40;3.14159f0, i&#41;&#41;
    end
    return
end

Threads.@threads for i in 1:Threads.nthreads&#40;&#41;
    data &#61; CuArray&#123;Float32&#125;&#40;undef, N&#41;
    @cuda blocks&#61;1 threads&#61;64 kernel&#40;data, N&#41;
    synchronize&#40;CuDefaultStream&#40;&#41;&#41;
end</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multithread_after.png" alt="Multi-threading (new)">
</figure><p>With the old model, execution would have been serialized because the default stream was the same across threads:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-10-02-cuda_2.0/multithread_before.png" alt="Multi-threading (old)">
</figure><p>Future improvements will make this behavior configurable, such that users can use a different default stream per task.</p>
<h2 id="sparse_array_clean-up">Sparse array clean-up</h2>
<p>As part of CUDA.jl 2.0, the sparse array support <a href="https://github.com/JuliaGPU/CUDA.jl/pull/409">has been refactored</a>, bringing them in line with other array types and their expected behavior. For example, the custom <code>switch2</code> methods have been removed in favor of calls to <code>convert</code> and array constructors:</p>
<pre><code class="language-julia-repl">julia&gt; using SparseArrays
julia&gt; using CUDA, CUDA.CUSPARSEjulia&gt; CuSparseMatrixCSC&#40;CUDA.rand&#40;2,2&#41;&#41;
2×2 CuSparseMatrixCSC&#123;Float32&#125; with 4 stored entries:
  &#91;1, 1&#93;  &#61;  0.124012
  &#91;2, 1&#93;  &#61;  0.791714
  &#91;1, 2&#93;  &#61;  0.487905
  &#91;2, 2&#93;  &#61;  0.752466

julia&gt; CuSparseMatrixCOO&#40;sprand&#40;2,2, 0.5&#41;&#41;
2×2 CuSparseMatrixCOO&#123;Float64&#125; with 3 stored entries:
  &#91;1, 1&#93;  &#61;  0.183183
  &#91;2, 1&#93;  &#61;  0.966466
  &#91;2, 2&#93;  &#61;  0.064101

julia&gt; CuSparseMatrixCSR&#40;ans&#41;
2×2 CuSparseMatrixCSR&#123;Float64&#125; with 3 stored entries:
  &#91;1, 1&#93;  &#61;  0.183183
  &#91;2, 1&#93;  &#61;  0.966466
  &#91;2, 2&#93;  &#61;  0.064101</code></pre>
<p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/421">Initial support for the COO sparse matrix type</a> has also been added, along with <a href="https://github.com/JuliaGPU/CUDA.jl/pull/351">better support for sparse matrix-vector multiplication</a>.</p>
<h2 id="support_for_cuda_111">Support for CUDA 11.1</h2>
<p>This release also features support for the brand-new CUDA 11.1. As there is no compatible release of CUDNN or CUTENSOR yet, CUDA.jl won&#39;t automatically select this version, but you can force it to by setting the <code>JULIA_CUDA_VERSION</code> environment variable to <code>11.1</code>:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;JULIA_CUDA_VERSION&quot;&#93; &#61; &quot;11.1&quot;julia&gt; using CUDAjulia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 11.1.0, artifact installationLibraries:
- CUDNN: missing
- CUTENSOR: missing</code></pre>
<h2 id="minor_changes">Minor changes</h2>
<p>Many other changes are part of this release:</p>
<ul>
<li><p>Views, reshapes and array reinterpretations <a href="https://github.com/JuliaGPU/CUDA.jl/pull/437">are now represented</a> by the Base array wrappers, simplifying the CuArray type definition.</p>
</li>
<li><p>Various optimizations to <a href="https://github.com/JuliaGPU/CUDA.jl/pull/428">CUFFT</a> and <a href="https://github.com/JuliaGPU/CUDA.jl/pull/321">CUDNN</a> library wrappers.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/427">Support</a> for <code>LinearAlgebra.reflect&#33;</code> and <code>rotate&#33;</code>.</p>
</li>
<li><p><a href="https://github.com/JuliaGPU/CUDA.jl/pull/435">Initial support</a> for calling CUDA libraries with strided inputs.</p>
</li>
</ul>
]]></content:encoded>
    
  <pubDate>Fri, 02 Oct 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[Paper: Flexible Performant GEMM Kernels on GPUs]]></title>
  <link>https://juliagpu.org/post/2020-09-28-gemmkernels/index.html</link>
  <guid>https://juliagpu.org/2020-09-28-gemmkernels/</guid>
  <description><![CDATA[General Matrix Multiplication or GEMM kernels take center place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA&#39;s Tensor Cores. In this paper we show how it is possible to program these accelerators from Julia, and present abstractions and interfaces that allow to do so efficiently without sacrificing performance.]]></description>  
  
  <content:encoded><![CDATA[
<p>General Matrix Multiplication or GEMM kernels take center place in high performance computing and machine learning. Recent NVIDIA GPUs include GEMM accelerators, such as NVIDIA&#39;s Tensor Cores. In this paper we show how it is possible to program these accelerators from Julia, and present abstractions and interfaces that allow to do so efficiently without sacrificing performance.</p>
<p>A pre-print of the paper has been published on arXiv: <a href="https://arxiv.org/abs/2009.12263">arXiv:2009.12263</a>. <br/> The source code can be found on GitHub: <a href="https://github.com/thomasfaingnaert/GemmKernels.jl">thomasfaingnaert/GemmKernels.jl</a>.</p>
<p>With the APIs from GemmKernels.jl, it is possible to instantiate GEMM kernels that perform in the same ballpark as, and sometimes even outperform, state-of-the-art libraries like CUBLAS and CUTLASS. For example, performing a mixed-precision multiplication of two 16-bit matrices into a 32-bit accumulator &#40;on different combinations of layouts&#41;:</p>
<figure>
  <img src="https://juliagpu.org/post/2020-09-28-gemmkernels/mixed_precision.png" alt="Performance of mixed-precision GEMM">
</figure><p>The APIs are also highly flexible and allow customization of each step, e.g., to apply the activation function <code>max&#40;x, 0&#41;</code> for implementing a rectified linear unit &#40;ReLU&#41;:</p>
<pre><code class="language-julia">a &#61; CuArray&#40;rand&#40;Float16, &#40;M, K&#41;&#41;&#41;
b &#61; CuArray&#40;rand&#40;Float16, &#40;K, N&#41;&#41;&#41;
c &#61; CuArray&#40;rand&#40;Float32, &#40;M, N&#41;&#41;&#41;
d &#61; similar&#40;c&#41;conf &#61; GemmKernels.get_config&#40;
    gemm_shape &#61; &#40;M &#61; M, N &#61; N, K &#61; K&#41;,
    operator &#61; Operator.WMMAOp&#123;16, 16, 16&#125;,
    global_a_layout &#61; Layout.AlignedColMajor&#123;Float16&#125;,
    global_c_layout &#61; Layout.AlignedColMajor&#123;Float32&#125;&#41;GemmKernels.matmul&#40;
    a, b, c, d, conf;
    transform_regs_to_shared_d &#61; Transform.Elementwise&#40;x -&gt; max&#40;x, 0&#41;&#41;&#41;</code></pre>
<p>The GemmKernels.jl framework is written entirely in Julia, demonstrating the high-performance GPU programming capabilities of this language, but at the same time keeping the research accessible and easy to modify or repurpose by other Julia developers.</p>
]]></content:encoded>
    
  <pubDate>Mon, 28 Sep 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Thomas Faingnaert, Tim Besard, Bjorn De Sutter</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 1.3 - Multi-device programming]]></title>
  <link>https://juliagpu.org/post/2020-07-18-cuda_1.3/index.html</link>
  <guid>https://juliagpu.org/2020-07-18-cuda_1.3/</guid>
  <description><![CDATA[Today we&#39;re releasing CUDA.jl 1.3, with several new features. The most prominent change is support for multiple GPUs within a single process.]]></description>  
  
  <content:encoded><![CDATA[
<p>Today we&#39;re releasing CUDA.jl 1.3, with several new features. The most prominent change is support for multiple GPUs within a single process.</p>
<h2 id="multi-gpu_programming">Multi-GPU programming</h2>
<p>With CUDA.jl 1.3, you can finally use multiple CUDA GPUs within a single process. To switch devices you can call <code>device&#33;</code>, query the current device with <code>device&#40;&#41;</code>, or reset it using <code>device_reset&#33;&#40;&#41;</code>:</p>
<pre><code class="language-julia-repl">julia&gt; collect&#40;devices&#40;&#41;&#41;
9-element Array&#123;CuDevice,1&#125;:
 CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;1&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;2&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;3&#41;: Tesla V100-PCIE-32GB
 CuDevice&#40;4&#41;: Tesla V100-PCIE-16GB
 CuDevice&#40;5&#41;: Tesla P100-PCIE-16GB
 CuDevice&#40;6&#41;: Tesla P100-PCIE-16GB
 CuDevice&#40;7&#41;: GeForce GTX 1080 Ti
 CuDevice&#40;8&#41;: GeForce GTX 1080 Ti

julia&gt; device&#33;&#40;5&#41;

julia&gt; device&#40;&#41;
CuDevice&#40;5&#41;: Tesla P100-PCIE-16GB</code></pre>
<p>Let&#39;s define a kernel to show this really works:</p>
<pre><code class="language-julia-repl">julia&gt; function kernel&#40;&#41;
           dev &#61; Ref&#123;Cint&#125;&#40;&#41;
           CUDA.cudaGetDevice&#40;dev&#41;
           @cuprintln&#40;&quot;Running on device &#36;&#40;dev&#91;&#93;&#41;&quot;&#41;
           return
       end

julia&gt; @cuda kernel&#40;&#41;
Running on device 5

julia&gt; device&#33;&#40;0&#41;

julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB

julia&gt; @cuda kernel&#40;&#41;
Running on device 0</code></pre>
<p>Memory allocations, like <code>CuArray</code>s, are implicitly bound to the device they were allocated on. That means you should take care to only use an array when the owning device is active, or you will run into errors:</p>
<pre><code class="language-julia-repl">julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB

julia&gt; a &#61; CUDA.rand&#40;1&#41;
1-element CuArray&#123;Float32,1&#125;:
 0.6322775

julia&gt; device&#33;&#40;1&#41;

julia&gt; a
ERROR: CUDA error: an illegal memory access was encountered</code></pre>
<p>Future improvements might make the array type device-aware.</p>
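<p>Until then, one pattern that avoids such errors is to stage data through host memory before switching devices. The following is only a minimal sketch, with illustrative array sizes and device numbers:</p>
<pre><code class="language-julia">using CUDA

a &#61; CUDA.rand&#40;1024&#41;      # allocated on the currently-active device
a_host &#61; Array&#40;a&#41;        # download while the owning device is still active

device&#33;&#40;1&#41;
b &#61; CuArray&#40;a_host&#41;      # re-upload on the newly-selected device</code></pre>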
<h2 id="multitasking_and_multithreading">Multitasking and multithreading</h2>
<p>Dovetailing with the support for multiple GPUs is the ability to use those GPUs from separate Julia tasks and threads:</p>
<pre><code class="language-julia-repl">julia&gt; device&#33;&#40;0&#41;

julia&gt; @sync begin
         @async begin
           device&#33;&#40;1&#41;
           println&#40;&quot;Working with &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
           yield&#40;&#41;
           println&#40;&quot;Back to device &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
         end
         @async begin
           device&#33;&#40;2&#41;
           println&#40;&quot;Working with &#36;&#40;device&#40;&#41;&#41; on &#36;&#40;current_task&#40;&#41;&#41;&quot;&#41;
         end
       end
Working with CuDevice&#40;1&#41; on Task @0x00007fc9e6a48010
Working with CuDevice&#40;2&#41; on Task @0x00007fc9e6a484f0
Back to device CuDevice&#40;1&#41; on Task @0x00007fc9e6a48010

julia&gt; device&#40;&#41;
CuDevice&#40;0&#41;: Tesla V100-PCIE-32GB</code></pre>
<p>Each task has its own local GPU state, such as the device it was bound to, handles to libraries like CUBLAS or CUDNN &#40;which means that each task can configure libraries independently&#41;, etc.</p>
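<p>For example, each of the tasks below selects its own device and performs a matrix multiplication, which dispatches to that task&#39;s private CUBLAS handle. This is only a sketch, assuming at least two devices are available:</p>
<pre><code class="language-julia">using CUDA

@sync for dev in &#40;CuDevice&#40;0&#41;, CuDevice&#40;1&#41;&#41;
    @async begin
        device&#33;&#40;dev&#41;
        A &#61; CUDA.rand&#40;256, 256&#41;
        B &#61; CUDA.rand&#40;256, 256&#41;
        C &#61; A * B    # uses this task&#39;s own CUBLAS handle
    end
end</code></pre>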
<h2 id="minor_features">Minor features</h2>
<p>CUDA.jl 1.3 also features some minor changes:</p>
<ul>
<li><p>Reinstated compatibility with Julia 1.3</p>
</li>
<li><p>Support for CUDA 11.0 Update 1</p>
</li>
<li><p>Support for CUDNN 8.0.2</p>
</li>
</ul>
<h2 id="known_issues">Known issues</h2>
<p>Several operations on sparse arrays have been broken since CUDA.jl 1.2, due to the deprecations that were part of CUDA 11. The next version of CUDA.jl will drop support for CUDA 10.0 or older, which will make it possible to use new cuSPARSE APIs and add back missing functionality.</p>
]]></content:encoded>
    
  <pubDate>Sat, 18 Jul 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDA.jl 1.1]]></title>
  <link>https://juliagpu.org/post/2020-07-07-cuda_1.1/index.html</link>
  <guid>https://juliagpu.org/2020-07-07-cuda_1.1/</guid>
  <description><![CDATA[CUDA.jl 1.1 marks the first feature release after merging several CUDA packages into one. It raises the minimal Julia version to 1.4, and comes with support for the impending 1.5 release.]]></description>  
  
  <content:encoded><![CDATA[
<p>CUDA.jl 1.1 marks the first feature release after merging several CUDA packages into one. It raises the minimal Julia version to 1.4, and comes with support for the impending 1.5 release.</p>
<h2 id="cudajl_replacing_cuarrayscudanativejl">CUDA.jl replacing CuArrays/CUDAnative.jl</h2>
<p>As <a href="https://discourse.julialang.org/t/psa-cuda-jl-replacing-cuarrays-jl-cudanative-jl-cudadrv-jl-cudaapi-jl-call-for-testing/40205">announced a while back</a>, CUDA.jl is now the new package for programming CUDA GPUs in Julia, replacing CuArrays.jl, CUDAnative.jl, CUDAdrv.jl and CUDAapi.jl. The merged package should be a drop-in replacement: All existing functionality has been ported, and almost all exported functions are still there. Applications like Flux.jl or the DiffEq.jl stack are being updated to support this change.</p>
<h2 id="cuda_11_support">CUDA 11 support</h2>
<p>With CUDA.jl 1.1, we support the upcoming release of the CUDA toolkit. This only applies to locally-installed versions of the toolkit, i.e., you need to specify <code>JULIA_CUDA_USE_BINARYBUILDER&#61;false</code> in your environment to pick up the locally-installed release candidate of the CUDA toolkit. New features, like the third-generation tensor cores and its extended type support, or any new APIs, are not yet natively supported by Julia code.</p>
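<p>Concretely, that boils down to an invocation along these lines &#40;a sketch; adapt the paths and commands to your setup&#41;:</p>
<pre><code class="language-bash">&#36; export JULIA_CUDA_USE_BINARYBUILDER&#61;false
&#36; julia -e &#39;using CUDA; CUDA.versioninfo&#40;&#41;&#39;</code></pre>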
<h2 id="nvidia_management_library_nvml">NVIDIA Management Library &#40;NVML&#41;</h2>
<p>CUDA.jl now integrates with the NVIDIA Management Library, or NVML. With this library, it&#39;s possible to query information about the system, any GPU devices, their topology, etc.:</p>
<pre><code class="language-julia-repl">julia&gt; using CUDA

julia&gt; dev &#61; first&#40;NVML.devices&#40;&#41;&#41;
CUDA.NVML.Device&#40;Ptr&#123;Nothing&#125; @0x00007f987c7c6e38&#41;

julia&gt; NVML.uuid&#40;dev&#41;
UUID&#40;&quot;b8d5e790-ea4d-f962-e0c3-0448f69f2e23&quot;&#41;

julia&gt; NVML.name&#40;dev&#41;
&quot;Quadro RTX 5000&quot;

julia&gt; NVML.power_usage&#40;dev&#41;
37.863

julia&gt; NVML.energy_consumption&#40;dev&#41;
65330.292</code></pre>
<h2 id="experimental_texture_support">Experimental: Texture support</h2>
<p>It is now also possible to use the GPU&#39;s hardware texture support from Julia, albeit using a fairly low-level and still experimental API &#40;many thanks to <a href="https://github.com/cdsousa">@cdsousa</a> for the initial development&#41;. As a demo, let&#39;s start with loading a sample image:</p>
<pre><code class="language-julia-repl">julia&gt; using Images, TestImages, ColorTypes, FixedPointNumbers

julia&gt; img &#61; RGBA&#123;N0f8&#125;.&#40;testimage&#40;&quot;lighthouse&quot;&#41;&#41;</code></pre>
<p>We use RGBA since CUDA&#39;s texture hardware only supports 1, 2 or 4 channels. This support is also currently limited to &quot;plain&quot; types, so let&#39;s reinterpret the image:</p>
<pre><code class="language-julia-repl">julia&gt; img′ &#61; reinterpret&#40;NTuple&#123;4,UInt8&#125;, img&#41;</code></pre>
<p>Now we can upload this image to the array, using the <code>CuTextureArray</code> type for optimized storage &#40;normal <code>CuArray</code>s are supported too&#41;, and bind it to a <code>CuTexture</code> object that we can pass to a kernel:</p>
<pre><code class="language-julia-repl">julia&gt; texturearray &#61; CuTextureArray&#40;img′&#41;

julia&gt; texture &#61; CuTexture&#40;texturearray; normalized_coordinates&#61;true&#41;
512×768 4-channel CuTexture&#40;::CuTextureArray&#41; with eltype NTuple&#123;4,UInt8&#125;</code></pre>
<p>Let&#39;s write a kernel that warps this image. Since we specified <code>normalized_coordinates&#61;true</code>, we index the texture using values in <code>&#91;0,1&#93;</code>:</p>
<pre><code class="language-julia">function warp&#40;dst, texture&#41;
    tid &#61; threadIdx&#40;&#41;.x &#43; &#40;blockIdx&#40;&#41;.x - 1&#41; * blockDim&#40;&#41;.x
    I &#61; CartesianIndices&#40;dst&#41;
    @inbounds if tid &lt;&#61; length&#40;I&#41;
        i,j &#61; Tuple&#40;I&#91;tid&#93;&#41;
        u &#61; Float32&#40;i-1&#41; / Float32&#40;size&#40;dst, 1&#41;-1&#41;
        v &#61; Float32&#40;j-1&#41; / Float32&#40;size&#40;dst, 2&#41;-1&#41;
        x &#61; u &#43; 0.02f0 * CUDA.sin&#40;30v&#41;
        y &#61; v &#43; 0.03f0 * CUDA.sin&#40;20u&#41;
        dst&#91;i,j&#93; &#61; texture&#91;x,y&#93;
    end
    return
end</code></pre>
<p>The size of the output image determines how many elements we need to process. This needs to be translated to a number of threads and blocks, keeping in mind device and kernel characteristics. We automate this using the occupancy API:</p>
<pre><code class="language-julia-repl">julia&gt; outimg_d &#61; CuArray&#123;eltype&#40;img′&#41;&#125;&#40;undef, 500, 1000&#41;;

julia&gt; function configurator&#40;kernel&#41;
           config &#61; launch_configuration&#40;kernel.fun&#41;
           threads &#61; Base.min&#40;length&#40;outimg_d&#41;, config.threads&#41;
           blocks &#61; cld&#40;length&#40;outimg_d&#41;, threads&#41;
           return &#40;threads&#61;threads, blocks&#61;blocks&#41;
       end

julia&gt; @cuda config&#61;configurator warp&#40;outimg_d, texture&#41;</code></pre>
<p>Finally, we fetch and visualize the output:</p>
<pre><code class="language-julia-repl">julia&gt; outimg &#61; Array&#40;outimg_d&#41;

julia&gt; save&#40;&quot;imgwarp.png&quot;, reinterpret&#40;eltype&#40;img&#41;, outimg&#41;&#41;</code></pre>
<figure>
  <img src="https://juliagpu.org/post/2020-07-07-cuda_1.1/imgwarp.png" alt="Warped lighthouse">
</figure><h2 id="minor_features">Minor features</h2>
<p>The test-suite is now parallelized, using up to <code>JULIA_NUM_THREADS</code> processes:</p>
<pre><code class="language-bash">&#36; JULIA_NUM_THREADS&#61;4 julia -e &#39;using Pkg; Pkg.test&#40;&quot;CUDA&quot;&#41;;&#39;

                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        &#40;Worker&#41; | Time &#40;s&#41; | GC &#40;s&#41; | GC &#37; | Alloc &#40;MB&#41; | RSS &#40;MB&#41; | GC &#40;s&#41; | GC &#37; | Alloc &#40;MB&#41; | RSS &#40;MB&#41; |
initialization                   &#40;2&#41; |     2.52 |   0.00 |  0.0 |       0.00 |   115.00 |   0.05 |  1.8 |     153.13 |   546.27 |
apiutils                         &#40;4&#41; |     0.55 |   0.00 |  0.0 |       0.00 |   115.00 |   0.02 |  4.0 |      75.86 |   522.36 |
codegen                          &#40;4&#41; |    14.81 |   0.36 |  2.5 |       0.00 |   157.00 |   0.62 |  4.2 |    1592.28 |   675.15 |
...
gpuarrays/mapreduce essentials   &#40;2&#41; |   113.52 |   0.01 |  0.0 |       3.19 |   641.00 |   2.61 |  2.3 |    8232.84 |  2449.35 |
gpuarrays/mapreduce &#40;old tests&#41;  &#40;5&#41; |   138.35 |   0.01 |  0.0 |     130.20 |   507.00 |   2.94 |  2.1 |    8615.15 |  2353.62 |
gpuarrays/mapreduce derivatives  &#40;3&#41; |   180.52 |   0.01 |  0.0 |       3.06 |   229.00 |   3.44 |  1.9 |   12262.67 |  1403.39 |

Test Summary: |  Pass  Broken  Total
  Overall     | 11213       3  11216
    SUCCESS
    Testing CUDA tests passed</code></pre>
<p>A counterpart of <code>Base.versioninfo&#40;&#41;</code> is available to report on the CUDA toolchain and any devices:</p>
<pre><code class="language-julia-repl">julia&gt; CUDA.versioninfo&#40;&#41;
CUDA toolkit 10.2.89, artifact installation
CUDA driver 11.0.0
NVIDIA driver 450.36.6

Libraries:
- CUBLAS: 10.2.2
- CURAND: 10.1.2
- CUFFT: 10.1.2
- CUSOLVER: 10.3.0
- CUSPARSE: 10.3.1
- CUPTI: 12.0.0
- NVML: 11.0.0&#43;450.36.6
- CUDNN: 7.6.5 &#40;for CUDA 10.2.0&#41;
- CUTENSOR: 1.1.0 &#40;for CUDA 10.2.0&#41;

Toolchain:
- Julia: 1.5.0-rc1.0
- LLVM: 9.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75

1 device&#40;s&#41;:
- Quadro RTX 5000 &#40;sm_75, 14.479 GiB / 15.744 GiB available&#41;</code></pre>
<p>CUTENSOR artifacts have been upgraded to version 1.1.0.</p>
<p>Benchmarking infrastructure based on the Codespeed project has been set up at <a href="https://speed.juliagpu.org/">speed.juliagpu.org</a> to keep track of the performance of various operations.</p>
]]></content:encoded>
    
  <pubDate>Tue, 07 Jul 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[CUDAnative.jl 3.0 and CuArrays.jl 2.0]]></title>
  <link>https://juliagpu.org/post/2020-03-25-cudanative_3.0-cuarrays_2.0/index.html</link>
  <guid>https://juliagpu.org/cudanative_3.0-cuarrays_2.0/</guid>
  <description><![CDATA[This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.]]></description>  
  
  <content:encoded><![CDATA[
<p>This post is located at <a href="https://juliagpu.org/cudanative_3.0-cuarrays_2.0/">/cudanative_3.0-cuarrays_2.0/</a></p>
<p>This release of the Julia CUDA stack contains some exciting new features: automatic installation of CUDA using artifacts, full support for GPU method redefinitions, and experimental support for multitasking and multithreading. The release is technically breaking, but most end-users should not be affected.</p>
<h2 id="api_changes">API changes</h2>
<p>Changes to certain APIs require these releases to be breaking; however, most users should not be affected, and chances are you can just bump your Compat entries without any additional changes. Flux.jl users will have to wait a little longer though, as the package uses non-public APIs that have changed and <a href="https://github.com/FluxML/Flux.jl/pull/1050">requires an update</a>.</p>
<h2 id="artifacts">Artifacts</h2>
<p>CUDA and its dependencies will now be automatically installed using artifacts generated by BinaryBuilder.jl. This greatly improves usability, and only requires a functioning NVIDIA driver:</p>
<pre><code class="language-julia-repl">julia&gt; ENV&#91;&quot;JULIA_DEBUG&quot;&#93; &#61; &quot;CUDAnative&quot;

julia&gt; using CUDAnative

julia&gt; CUDAnative.version&#40;&#41;
┌ Debug: Trying to use artifacts...
┌ Debug: Trying to use artifacts...
└ @ CUDAnative CUDAnative/src/bindeps.jl:52
┌ Debug: Using CUDA 10.2.89 from an artifact at /depot/artifacts/...
└ @ CUDAnative CUDAnative/src/bindeps.jl:108
v&quot;10.2.89&quot;</code></pre>
<p>Use of a local installation is still possible by setting the environment variable <code>JULIA_CUDA_USE_BINARYBUILDER</code> to false. For more details, refer to <a href="https://cuda.juliagpu.org/stable/installation/overview/">the documentation</a>.</p>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/492">CUDAnative.jl#492</a> and <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/490">CuArrays.jl#490</a></p>
<h2 id="method_redefinitions">Method redefinitions</h2>
<p>CUDAnative 3.0 now fully supports method redefinitions, commonly referred to as <a href="https://github.com/JuliaLang/julia/issues/265">Julia issue #265</a>, and makes it possible to use interactive programming tools like Revise.jl:</p>
<pre><code class="language-julia-repl">julia&gt; child&#40;&#41; &#61; 0

julia&gt; parent&#40;&#41; &#61; &#40;@cuprintln&#40;child&#40;&#41;&#41;; return&#41;

julia&gt; @cuda parent&#40;&#41;
0

julia&gt; parent&#40;&#41; &#61; &#40;@cuprintln&#40;child&#40;&#41; &#43; 1&#41;; return&#41;

julia&gt; @cuda parent&#40;&#41;
1

julia&gt; child&#40;&#41; &#61; 1

julia&gt; @cuda parent&#40;&#41;
2</code></pre>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/581">CUDAnative.jl#581</a></p>
<h2 id="experimental_multitasking_and_multithreading">Experimental: Multitasking and multithreading</h2>
<p>With CUDAnative 3.0 and CuArrays 2.0 you can now use Julia tasks and threads to organize your code. In combination with CUDA streams, this makes it possible to execute kernels and other GPU operations in parallel:</p>
<pre><code class="language-julia">function my_expensive_kernel&#40;&#41;
    return
end

@sync begin
    @async @cuda stream&#61;CuStream&#40;&#41; my_expensive_kernel&#40;&#41;
    @async @cuda stream&#61;CuStream&#40;&#41; my_expensive_kernel&#40;&#41;
end</code></pre>
<p>Every task, whether it runs on a separate thread or not, can work with a different device, as well as independently work with CUDA libraries like CUBLAS and CUFFT.</p>
<p>Note that this support is experimental, and lacks certain features to be fully effective. For one, the CuArrays memory allocator is not device-aware, and it is currently not possible to configure the CUDA stream for operations like map or broadcast.</p>
<p>Relevant PRs: <a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/609">CUDAnative.jl#609</a> and <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/645">CuArrays.jl#645</a></p>
<h2 id="minor_changes">Minor changes</h2>
<p>GPU kernels are now name-mangled like C&#43;&#43;, which offers better integration with NVIDIA tools &#40;<a href="https://github.com/JuliaGPU/CUDAnative.jl/pull/559">CUDAnative.jl#559</a>&#41;.</p>
<p>A better N-dimensional <code>mapreducedim&#33;</code> kernel, properly integrating with all Base interfaces &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/pull/602">CuArrays.jl#602</a> and <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/246">GPUArrays#246</a>&#41;.</p>
<p>A <code>CuIterator</code> type for batching arrays to the GPU &#40;by @jrevels, <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/467">CuArrays.jl#467</a>&#41;.</p>
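<p>As a sketch of how that iterator might be used &#40;the batch contents here are illustrative&#41;:</p>
<pre><code class="language-julia">batches &#61; &#91;&#40;rand&#40;Float32, 10&#41;, rand&#40;Float32, 10&#41;&#41; for _ in 1:3&#93;
for &#40;x, y&#41; in CuIterator&#40;batches&#41;
    # x and y arrive as CuArrays; each batch is uploaded lazily
    # and freed eagerly once the next iteration starts
end</code></pre>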
<p>Integration with Base&#39;s 5-arg <code>mul&#33;</code> &#40;by @haampie, <a href="https://github.com/JuliaGPU/CuArrays.jl/pull/641">CuArrays.jl#641</a> and <a href="https://github.com/JuliaGPU/GPUArrays.jl/pull/253">GPUArrays#253</a>&#41;.</p>
<p>Integration with Cthulhu.jl for interactive inspection of generated code &#40;<a href="https://github.com/JuliaGPU/CUDAnative.jl/issues/597">CUDAnative.jl#597</a>&#41;.</p>
<h2 id="known_issues">Known issues</h2>
<p>With a release as big as this one there are bound to be some bugs, e.g., with the installation of artifacts on exotic systems, or due to the many changes to make the libraries thread-safe. If you need absolute stability, please wait for a point release.</p>
<p>There are also some known issues. CUDAnative is currently not compatible with Julia 1.5 due to Base compiler changes &#40;<a href="https://github.com/JuliaLang/julia/issues/34993">julia#34993</a>&#41;, the new <code>mapreducedim&#33;</code> kernel appears to be slower in some cases &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/issues/611">CuArrays.jl#611</a>&#41;, and there are some remaining thread-safety issues when using the non-default memory pool &#40;<a href="https://github.com/JuliaGPU/CuArrays.jl/issues/647">CuArrays.jl#647</a>&#41;.</p>
]]></content:encoded>
    
  <pubDate>Wed, 25 Mar 2020 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>

<item>
  <title><![CDATA[New website for JuliaGPU]]></title>
  <link>https://juliagpu.org/post/2019-12-12-new_site/index.html</link>
  <guid>https://juliagpu.org/new_site/</guid>
  <description><![CDATA[Welcome to the new landing page for the JuliaGPU organization. This website serves as an introduction to the several packages for programming GPUs in Julia, with pointers to relevant resources for new users.]]></description>  
  
  <content:encoded><![CDATA[
<p>This post is located at <a href="https://juliagpu.org/new_site/">/new_site/</a></p>
<p>Welcome to the new landing page for the JuliaGPU organization. This website serves as an introduction to the several packages for programming GPUs in Julia, with pointers to relevant resources for new users.</p>
<p>The sources for this website are hosted at <a href="https://github.com/JuliaGPU/juliagpu.org">GitHub</a> and generated using Hugo; feel free to open an issue or pull request if you think it could be improved.</p>
]]></content:encoded>
    
  <pubDate>Thu, 12 Dec 2019 00:00:00 +0000</pubDate>  
  
  
  <atom:author>
    <atom:name>Tim Besard</atom:name>
  </atom:author>
        
</item>
</channel></rss>