A note on benchmarking
Premature optimization is the root of all evil & If you don't measure you won't improve
Tools
BenchmarkTools.jl https://github.com/JuliaCI/BenchmarkTools.jl
Profiler https://docs.julialang.org/en/latest/manual/profile/
ProfileView.jl https://github.com/timholy/ProfileView.jl
PProf.jl https://github.com/JuliaPerf/PProf.jl
Other
VTunes/Perf https://docs.julialang.org/en/latest/manual/profile/#External-Profiling-1
LIKWID.jl
MCAnalyzer.jl
BenchmarkTools.jl
Solid package that tries to eliminate common pitfalls in performance measurment.
@benchmark
macro that will repeatedly evaluate your code to gain enough samplesCaveat: You probably want to escape $ your input data
sum (generic function with 1 method)
BenchmarkTools.Trial: 10000 samples with 998 evaluations.
Range (min … max): 13.170 ns … 33.679 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.191 ns ┊ GC (median): 0.00%
Time (mean ± σ): 13.347 ns ± 1.002 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▄ ▁
██▄▁▄▁█▆▃▁▁▁▁▁▁▄▄▄▄▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▃▃▁▁▁▁▄▁▁▁▃▁▁▁▁▁▁▁▃▁▄▆█ █
13.2 ns Histogram: log(frequency) by time 20.4 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Figuring out what is happening
The stages of the compiler
@code_lowered
@code_typed
&@code_warntype
@code_llvm
@code_native
Where is a function defined @which
& @edit
CodeInfo(
1 ─ acc = Core.typeassert(0, Main.var"workspace#3".Int64)
│ %2 = X
│ @_3 = Base.iterate(%2)
│ %4 = @_3 === nothing
│ %5 = Base.not_int(%4)
└── goto #4 if not %5
2 ┄ %7 = @_3
│ x = Core.getfield(%7, 1)
│ %9 = Core.getfield(%7, 2)
│ %10 = acc + x
│ acc = %10
│ Core.typeassert(%10, Main.var"workspace#3".Float64)
│ @_3 = Base.iterate(%2, %9)
│ %14 = @_3 === nothing
│ %15 = Base.not_int(%14)
└── goto #4 if not %15
3 ─ goto #2
4 ┄ %18 = acc
│ %19 = Core.apply_type(Main.var"workspace#3".Union, Main.var"workspace#3".Int64, Main.var"workspace#3".Float64)
│ %20 = Core.typeassert(%18, %19)
└── return %20
)
A simple example: counting
f (generic function with 1 method)
100000000
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.785 ns … 21.130 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.795 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.816 ns ± 0.424 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▇▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▂ ▂
2.78 ns Histogram: frequency by time 2.81 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
So we are doing 100.0 million additions in 2.785 ns...
That would mean our processor operates at 36 PHz
We wish...
Let's do a basic check, 10x bigger input.
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
Range (min … max): 2.785 ns … 25.678 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.795 ns ┊ GC (median): 0.00%
Time (mean ± σ): 2.817 ns ± 0.455 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▇▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▂ ▂
2.78 ns Histogram: frequency by time 2.81 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
Let's explore what code we are actually running.
Using Julia's reflection macros we can see all of the stages of code-generation.
CodeInfo(
1 ─ acc = 0
│ %2 = 1:N
│ @_3 = Base.iterate(%2)
│ %4 = @_3 === nothing
│ %5 = Base.not_int(%4)
└── goto #4 if not %5
2 ┄ %7 = @_3
│ i = Core.getfield(%7, 1)
│ %9 = Core.getfield(%7, 2)
│ acc = acc + 1
│ @_3 = Base.iterate(%2, %9)
│ %12 = @_3 === nothing
│ %13 = Base.not_int(%12)
└── goto #4 if not %13
3 ─ goto #2
4 ┄ return acc
)
CodeInfo( 1 ─ (acc = 0)::Const(0) │ %2 = (1:N)::PartialStruct(UnitRange{Int64}, Any[Const(1), Int64]) │ (@_3 = Base.iterate(%2))::Union{Nothing, Tuple{Int64, Int64}} │ %4 = (@_3 === nothing)::Bool │ %5 = Base.not_int(%4)::Bool └── goto #4 if not %5 2 ┄ %7 = @_3::Tuple{Int64, Int64} │ (i = Core.getfield(%7, 1))::Int64 │ %9 = Core.getfield(%7, 2)::Int64 │ (acc = acc + 1)::Int64 │ (@_3 = Base.iterate(%2, %9))::Union{Nothing, Tuple{Int64, Int64}} │ %12 = (@_3 === nothing)::Bool │ %13 = Base.not_int(%12)::Bool └── goto #4 if not %13 3 ─ goto #2 4 ┄ return acc )
Int64
CodeInfo( 1 ── %1 = Base.sle_int(1, N)::Bool └─── goto #3 if not %1 2 ── goto #4 3 ── goto #4 4 ┄─ %5 = φ (#2 => N, #3 => 0)::Int64 └─── goto #5 5 ── goto #6 6 ── %8 = Base.slt_int(%5, 1)::Bool └─── goto #8 if not %8 7 ── goto #9 8 ── goto #9 9 ┄─ %12 = φ (#7 => true, #8 => false)::Bool │ %13 = φ (#8 => 1)::Int64 │ %14 = Base.not_int(%12)::Bool └─── goto #15 if not %14 10 ┄ %16 = φ (#9 => %13, #14 => %24)::Int64 │ %17 = φ (#9 => 0, #14 => %18)::Int64 │ %18 = Base.add_int(%17, 1)::Int64 │ %19 = (%16 === %5)::Bool └─── goto #12 if not %19 11 ─ goto #13 12 ─ %22 = Base.add_int(%16, 1)::Int64 └─── goto #13 13 ┄ %24 = φ (#12 => %22)::Int64 │ %25 = φ (#11 => true, #12 => false)::Bool │ %26 = Base.not_int(%25)::Bool └─── goto #15 if not %26 14 ─ goto #10 15 ┄ %29 = φ (#13 => %18, #9 => 0)::Int64 └─── return %29 )
Int64