Persistent Code Caching in GPUCompiler.jl

After several earlier attempts, we have just landed the last piece of the puzzle for persistent code caching of GPU code in Julia.

There are two components to the persistent code caching infrastructure:

Inference caching for GPU kernels.
Disk caching for generated LLVM IR/object files.

TL;DR – Just the numbers

We evaluated the effectiveness on two computational fluid dynamics codes.

WaterLily TGV example

The full example appears further down and can serve as a blueprint for your own applications.

Backend	Precompilation	Disk Cache	What	Time
CPU	✗		First execution	5.32s (80% compilation)
CPU	✗		Second execution	0.85s
CPU	✓		First execution	2.48s (65% compilation)
CPU	✓		Second execution	0.9s
CUDA	✗	✗	First execution	11.07s (70% compilation)
CUDA	✗	✗	Second execution	0.02s
CUDA	✓	✗	First execution	6.38s (46% compilation)
CUDA	✓	✗	Second execution	0.02s
CUDA	✓	✓	First execution	2.70s (97% compilation)
CUDA	✓	✓	Second execution	0.02s

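Caching is effective here: with precompilation and the disk cache enabled, the first CUDA execution drops from 11.07s to 2.70s. There is still more to gain, and it needs further investigation why that first execution still takes 2.70s, most of it spent in Julia host compilation.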

Disk caching is effective because it appears to remove almost all "GPUCompiler" (i.e. non-native) compilation time.

Benchmark script

using WaterLilyTGV
import CUDA

@assert CUDA.functional()

# First execution
vortex = TGV(T=Float32, mem=CUDA.CuArray)
@time sim_step!(vortex, 1, max_steps=1)

# Second execution
vortex = TGV(T=Float32, mem=CUDA.CuArray)
@time sim_step!(vortex, 1, max_steps=1)
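
A "first execution" timing only means something in a fresh Julia session, so each row of the table needs its own process. The following is a minimal sketch of one way to automate that, not the procedure used for the numbers above. `run_fresh` is a hypothetical helper, and the workload `f` is a CPU-only stand-in for the benchmark script so the sketch runs without a GPU.

```julia
# Hypothetical helper: run `code` in a fresh Julia process and return its stdout.
# Add a `--project=...` flag to the command to select the application's environment.
function run_fresh(code::AbstractString)
    cmd = `$(Base.julia_cmd()) -e $code`
    return read(cmd, String)
end

# Stand-in workload (an assumption); replace it with the benchmark script above.
output = run_fresh("""
    f(n) = sum(abs2, rand(Float32, n))
    @time f(1_000)   # first execution, includes compilation
    @time f(1_000)   # second execution
    """)

print(output)
```

Calling `run_fresh` repeatedly with the same script gives one independent first/second measurement per call.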

`ClimaOcean.jl`: Near Global Ocean

This uses the near-global ocean example, configured to minimize the computational cost.

Julia 1.10.4

Backend	Precompilation	Disk Cache	What	Time (s)
CUDA	✓	NA	Initialization	69
CUDA	✓	NA	Time step 1	0.438
CUDA	✓	NA	Time step 2	1.245
CUDA	✓	NA	Time step 3	4.993
CUDA	✓	NA	Time step 4	0.017

Julia 1.11.0-beta2

Backend	Precompilation	Disk Cache	What	Time
CPU	✗		Initialization	422s
CPU	✗		Time step 1	75s
CPU	✗		Time step 2	21s
CPU	✗		Time step 3	75s
CPU	✗		Time step 4	0.008s
CPU	✓		Initialization	52s
CPU	✓		Time step 1	0.021s
CPU	✓		Time step 2	0.290s
CPU	✓		Time step 3	0.014s
CPU	✓		Time step 4	0.008s
CUDA	✗	✗	Initialization	596s
CUDA	✗	✗	Time step 1	75s
CUDA	✗	✗	Time step 2	34s
CUDA	✗	✗	Time step 3	99s
CUDA	✗	✗	Time step 4	0.017s
CUDA	✓	✗	Initialization	44s
CUDA	✓	✗	Time step 1	0.339s
CUDA	✓	✗	Time step 2	1.171s
CUDA	✓	✗	Time step 3	3.659s
CUDA	✓	✗	Time step 4	0.018s
CUDA	✓	✓	Initialization	12s
CUDA	✓	✓	Time step 1	0.023s
CUDA	✓	✓	Time step 2	0.207s
CUDA	✓	✓	Time step 3	0.029s
CUDA	✓	✓	Time step 4	0.017s
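
To put the CUDA rows of the Julia 1.11.0-beta2 table in proportion, the per-phase timings can be summed into a total time to reach the end of time step 4. The sketch below only does arithmetic on the numbers copied from that table. The labels and variable names are illustrative.

```julia
# Timings in seconds copied from the Julia 1.11.0-beta2 CUDA rows:
# initialization followed by time steps 1 to 4.
timings = [
    "no precompilation"           => [596, 75, 34, 99, 0.017],
    "precompilation"              => [44, 0.339, 1.171, 3.659, 0.018],
    "precompilation + disk cache" => [12, 0.023, 0.207, 0.029, 0.017],
]

# Total wall time per configuration, and the speedup over the uncached run.
totals = [label => sum(times) for (label, times) in timings]
baseline = last(first(totals))

for (label, total) in totals
    println(rpad(label, 30), round(total; digits=1), " s  (",
            round(baseline / total; digits=1), "x)")
end
```

With these inputs the totals come out at roughly 804 s, 49 s, and 12 s, so precompilation accounts for most of the reduction and the disk cache removes most of what remains.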

Using WaterLily TGV example

Using WaterLily.jl, and in particular its 3D Taylor-Green vortex example, we can set up a small Julia application. The application uses package extensions and PrecompileTools.jl to cache the needed functionality automatically and to handle the GPU dependencies.

`src/WaterLilyTGV.jl`

module WaterLilyTGV

using PrecompileTools
@recompile_invalidations begin
    using WaterLily
end

export TGV, sim_step!

import WaterLily: sim_step!

function TGV(; pow=6, Re=1e5, T=Float64, mem=Array)
    # Define vortex size, velocity, viscosity
    L = 2^pow; U = 1; ν = U*L/Re
    # Taylor-Green-Vortex initial velocity field
    function uλ(i,xyz)
        x,y,z = @. (xyz-1.5)*π/L               # scaled coordinates
        i==1 && return -U*sin(x)*cos(y)*cos(z) # u_x
        i==2 && return  U*cos(x)*sin(y)*cos(z) # u_y
        return 0.                              # u_z
    end
    # Initialize simulation
    return Simulation((L, L, L), (0, 0, 0), L; U, uλ, ν, T, mem)
end

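# Precompile the CPU code path for both floating-point types.
# `pow` is a keyword of `TGV`, not of `sim_step!`, so it is not passed to `sim_step!`.
@setup_workload let
    @compile_workload begin
        let vortex = TGV(T=Float32)
            sim_step!(vortex, 1, max_steps=1)
        end
        let vortex = TGV(T=Float64)
            sim_step!(vortex, 1, max_steps=1)
        end
    end
end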

end # module WaterLilyTGV

`ext/CUDAExt.jl`

module CUDAExt

using PrecompileTools
@recompile_invalidations begin
    using WaterLilyTGV
    import CUDA
end

# TODO: How does Preferences work for this?
@setup_workload let
    if CUDA.functional()
        @compile_workload begin
            let vortex = TGV(T=Float32, mem=CUDA.CuArray)
                sim_step!(vortex, 1, max_steps=1)
            end
            let vortex = TGV(T=Float64, mem=CUDA.CuArray)
                sim_step!(vortex, 1, max_steps=1)
            end
        end
    end
end

end # module CUDAExt

`Project.toml`

name = "WaterLilyTGV"
uuid = "ff377938-d05a-4d40-b5b0-1f78a12130ea"
authors = ["Valentin Churavy"]
version = "0.1.0"

[deps]
PrecompileTools = "aea7be01-6a6a-4083-8856-8a6e6704d82a"
WaterLily = "ed894a53-35f9-47f1-b17f-85db9237eebd"

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
CUDAExt = "CUDA"

[compat]
CUDA = "5.4.2"
PrecompileTools = "1.2.1"
WaterLily = "1.1.0"
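
A package extension is only loaded when its trigger package is available, so a mismatch between `[extensions]` and `[weakdeps]` is easy to miss. The sketch below checks that relationship with the `TOML` standard library, assuming (as in this project) that every trigger is declared as a weak dependency. `missing_triggers` is a hypothetical helper, and the embedded TOML is an excerpt of the file above.

```julia
using TOML

# Hypothetical helper: list extension triggers that are not declared in [weakdeps].
function missing_triggers(project::AbstractDict)
    weak = keys(get(project, "weakdeps", Dict{String,Any}()))
    problems = String[]
    for (ext, triggers) in get(project, "extensions", Dict{String,Any}())
        # An extension may name a single trigger or a list of triggers.
        for pkg in (triggers isa AbstractString ? [triggers] : triggers)
            pkg in weak || push!(problems, "$ext => $pkg")
        end
    end
    return problems
end

# Excerpt of the Project.toml shown above.
project = TOML.parse("""
[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
CUDAExt = "CUDA"
""")

problems = missing_triggers(project)
println(isempty(problems) ? "all extension triggers are weak dependencies" : problems)
```

To check a file on disk instead, `TOML.parsefile("Project.toml")` can replace the `TOML.parse` call.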

Persistent Inference Caching

** To Be Written **

Disk Caching

** To Be Written **