Persistent Code Caching in GPUCompiler.jl

After multiple prior attempts, we just landed the last piece of the puzzle for persistent code caching for GPU codes in Julia.

There are two components to the persistent code caching infrastructure:

  1. Inference caching for GPU kernels.

  2. Disk caching for generated LLVM IR/object files.
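To make the second component concrete, here is a minimal sketch of the general idea behind a disk cache for compilation artifacts. It is a toy illustration, not GPUCompiler.jl's implementation: the helper `cached_compile`, the cache key layout, and the `fake_compile` stand-in are all hypothetical names introduced for this example.

```julia
using Serialization  # stdlib: serialize/deserialize Julia values
using SHA            # stdlib: sha256 for content-addressed file names

# Hypothetical helper: return the cached artifact for `key`,
# or build it once and store it on disk for later sessions.
function cached_compile(build, cache_dir::AbstractString, key)
    path = joinpath(cache_dir, bytes2hex(sha256(repr(key))) * ".jls")
    isfile(path) && return deserialize(path)  # cache hit: skip the expensive build
    artifact = build()                        # cache miss: run the "compiler"
    serialize(path, artifact)
    return artifact
end

# Toy usage: count how often the expensive step actually runs.
cache_dir = mktempdir()
builds = Ref(0)
fake_compile() = (builds[] += 1; "object code for kernel_a")

key = (:kernel_a, Tuple{Float32}, "sm_80")  # hypothetical cache key
first_result  = cached_compile(fake_compile, cache_dir, key)
second_result = cached_compile(fake_compile, cache_dir, key)

println("builds performed: ", builds[])
```

A real cache would also have to invalidate entries when the compiler, the target configuration, or the source code changes; this sketch ignores that entirely.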

TL;DR – Just the numbers

We evaluated the effectiveness on two computational fluid dynamics codes.

WaterLily TGV example

The full example is shown below as a blueprint to follow for your own examples.

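| Backend | Precompilation | Disk Cache | What             | Time                     |
|---------|----------------|------------|------------------|--------------------------|
| CPU     |                |            | First execution  | 5.32s (80% compilation)  |
| CPU     |                |            | Second execution | 0.85s                    |
| CPU     |                |            | First execution  | 2.48s (65% compilation)  |
| CPU     |                |            | Second execution | 0.9s                     |
| CUDA    |                |            | First execution  | 11.07s (70% compilation) |
| CUDA    |                |            | Second execution | 0.02s                    |
| CUDA    |                |            | First execution  | 6.38s (46% compilation)  |
| CUDA    |                |            | Second execution | 0.02s                    |
| CUDA    |                |            | First execution  | 2.70s (97% compilation)  |
| CUDA    |                |            | Second execution | 0.02s                    |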

Caching is effective here, but there is more to be gained: the first execution still takes 2.70s, most of it spent in Julia host compilation. Why that is the case needs further investigation.

Disk caching is effective: it appears to remove almost all of the "GPUCompiler" (i.e. non-native) compilation time.

Benchmark script

using WaterLilyTGV
import CUDA

@assert CUDA.functional()

# First execution
vortex = TGV(T=Float32, mem=CUDA.CuArray)
@time sim_step!(vortex, 1, max_steps=1)

# Second execution
vortex = TGV(T=Float32, mem=CUDA.CuArray)
@time sim_step!(vortex, 1, max_steps=1)
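The first/second execution pattern used in the script above can be tried without a GPU. The following is a minimal sketch in which `workload` is a hypothetical stand-in for a real entry point such as `sim_step!`; the first call includes compilation, while the second reuses the already compiled code.

```julia
# Hypothetical stand-in for a real workload such as `sim_step!`.
workload(x) = sum(abs2, x) / length(x)

x = rand(Float32, 1024)

# First call pays for compilation; the second reuses the compiled code.
t_first  = @elapsed workload(x)
t_second = @elapsed workload(x)

println("first execution:  ", t_first, " s")
println("second execution: ", t_second, " s")
```

`@time` additionally reports the share of time spent compiling, which is where the percentages in the table come from.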

ClimaOcean.jl: Near Global Ocean

We use the near global ocean example, configured to minimize the computational cost.

Julia 1.10.4

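| Backend | Precompilation | Disk Cache | What           | Time (s) |
|---------|----------------|------------|----------------|----------|
| CUDA    |                | NA         | Initialization | 69       |
| CUDA    |                | NA         | Time step 1    | 0.438    |
| CUDA    |                | NA         | Time step 2    | 1.245    |
| CUDA    |                | NA         | Time step 3    | 4.993    |
| CUDA    |                | NA         | Time step 4    | 0.017    |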

Julia 1.11.0-beta2

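| Backend | Precompilation | Disk Cache | What           | Time   |
|---------|----------------|------------|----------------|--------|
| CPU     |                |            | Initialization | 422s   |
| CPU     |                |            | Time step 1    | 75s    |
| CPU     |                |            | Time step 2    | 21s    |
| CPU     |                |            | Time step 3    | 75s    |
| CPU     |                |            | Time step 4    | 0.008s |
| CPU     |                |            | Initialization | 52s    |
| CPU     |                |            | Time step 1    | 0.021s |
| CPU     |                |            | Time step 2    | 0.290s |
| CPU     |                |            | Time step 3    | 0.014s |
| CPU     |                |            | Time step 4    | 0.008s |
| CUDA    |                |            | Initialization | 596s   |
| CUDA    |                |            | Time step 1    | 75s    |
| CUDA    |                |            | Time step 2    | 34s    |
| CUDA    |                |            | Time step 3    | 99s    |
| CUDA    |                |            | Time step 4    | 0.017s |
| CUDA    |                |            | Initialization | 44s    |
| CUDA    |                |            | Time step 1    | 0.339s |
| CUDA    |                |            | Time step 2    | 1.171s |
| CUDA    |                |            | Time step 3    | 3.659s |
| CUDA    |                |            | Time step 4    | 0.018s |
| CUDA    |                |            | Initialization | 12s    |
| CUDA    |                |            | Time step 1    | 0.023s |
| CUDA    |                |            | Time step 2    | 0.207s |
| CUDA    |                |            | Time step 3    | 0.029s |
| CUDA    |                |            | Time step 4    | 0.017s |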

Using the WaterLily TGV example

Using WaterLily.jl, and in particular its 3D Taylor-Green vortex example, we can set up a small Julia application. The application uses package extensions and PrecompileTools.jl to automatically cache the needed functionality and to handle GPU dependencies.

src/WaterLilyTGV.jl

module WaterLilyTGV

using PrecompileTools
@recompile_invalidations begin
    using WaterLily
end

export TGV, sim_step!

import WaterLily: sim_step!

function TGV(; pow=6, Re=1e5, T=Float64, mem=Array)
    # Define vortex size, velocity, viscosity
    L = 2^pow; U = 1; ν = U*L/Re
    # Taylor-Green-Vortex initial velocity field
    function uλ(i,xyz)
        x,y,z = @. (xyz-1.5)*π/L               # scaled coordinates
        i==1 && return -U*sin(x)*cos(y)*cos(z) # u_x
        i==2 && return  U*cos(x)*sin(y)*cos(z) # u_y
        return 0.                              # u_z
    end
    # Initialize simulation
    return Simulation((L, L, L), (0, 0, 0), L; U, uλ, ν, T, mem)
end

@setup_workload let
    @compile_workload begin
        let vortex = TGV(T=Float32)
            sim_step!(vortex, 1, pow=1, max_steps=1)
        end
        let vortex = TGV(T=Float64)
            sim_step!(vortex, 1, pow=1, max_steps=1)
        end
    end
end

end # module WaterLilyTGV

ext/CUDAExt.jl

module CUDAExt

using PrecompileTools
@recompile_invalidations begin
    using WaterLilyTGV
    import CUDA
end

# TODO: How does Preferences work for this?
@setup_workload let
    if CUDA.functional()
        @compile_workload begin
            let vortex = TGV(T=Float32, mem=CUDA.CuArray)
                sim_step!(vortex, 1, max_steps=1)
            end
            let vortex = TGV(T=Float64, mem=CUDA.CuArray)
                sim_step!(vortex, 1, max_steps=1)
            end
        end
    end
end

end

Project.toml

name = "WaterLilyTGV"
uuid = "ff377938-d05a-4d40-b5b0-1f78a12130ea"
authors = ["Valentin Churavy"]
version = "0.1.0"

[deps]
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
WaterLily = "ed894a53-35f9-47f1-b17f-85db9237eebd"

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
CUDAExt = "CUDA"

[compat]
CUDA = "5.4.2"
PrecompileTools = "1.2.1"
WaterLily = "1.1.0"

Persistent Inference Caching

** To Be Written **

Disk Caching

** To Be Written **