
How to use b.RunParallel

Initially, I was at odds with Go's parallel benchmark utility function. The documentation on b.RunParallel is somewhat sparse, but it is a tool that makes benchmarking in Go much easier.

RunParallel runs a benchmark in parallel. It creates multiple goroutines and distributes b.N iterations among them. The number of goroutines defaults to GOMAXPROCS.

func Benchmark(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		// set up goroutine local state
		for pb.Next() {
			// execute one iteration of the benchmark
		}
	})
}

It is essential to note that the code within func(pb *testing.PB) runs in multiple goroutines, and there are various ways to leverage this. Before exploring them, let's look at an example that benchmarks write operations on sync.Map using the traditional sequential approach:

func BenchmarkMapStore(b *testing.B) {
	var m sync.Map
	keys := generateKeys(b, size)

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		j := i % size
		m.Store(keys[j], j)
	}
}
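
This benchmark relies on a generateKeys helper and a size constant that are not shown in the post. A minimal sketch of what they might look like, assuming string keys and an arbitrary size:

const size = 100_000 // assumed value; the post does not state it

// generateKeys returns n distinct keys for the benchmark to store.
func generateKeys(b *testing.B, n int) []string {
	b.Helper()
	keys := make([]string, n)
	for i := range keys {
		keys[i] = strconv.Itoa(i) // assumes import "strconv"
	}
	return keys
}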

The loop runs until the benchmark stabilizes: the testing framework keeps increasing b.N until the benchmark has run long enough to produce a stable measurement, at least one second by default.

go test -bench=BenchmarkMapStore -count=10
goos: linux
goarch: amd64
pkg: github.com/konradreiche/benchmark
cpu: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
BenchmarkMapStore-16    	4686368	      271.6 ns/op
BenchmarkMapStore-16    	4415648	      267.0 ns/op
BenchmarkMapStore-16    	4438770	      269.7 ns/op
BenchmarkMapStore-16    	4471004	      267.8 ns/op
BenchmarkMapStore-16    	4319673	      269.8 ns/op
BenchmarkMapStore-16    	4460666	      268.6 ns/op
BenchmarkMapStore-16    	4439848	      267.7 ns/op
BenchmarkMapStore-16    	4456723	      267.5 ns/op
BenchmarkMapStore-16    	4402022	      269.5 ns/op
BenchmarkMapStore-16    	4421964	      290.5 ns/op
PASS
ok  	github.com/konradreiche/benchmark	14.836s

Ensuring a stable environment is important for consistently repeatable results. Although achieving perfect stability might not always be possible, we can increase benchmark accuracy by running the same benchmark multiple times using the -count option.
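
With several runs recorded, a tool like benchstat (from golang.org/x/perf) can summarize the variance across runs and compare results captured before and after a change to the code under test:

go test -bench=BenchmarkMapStore -count=10 | tee old.txt
go test -bench=BenchmarkMapStore -count=10 | tee new.txt
benchstat old.txt new.txt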

Speed up Benchmarks

One benefit of b.RunParallel is its ability to improve the overall speed of benchmark execution. The next benchmark is a modified version that uses b.RunParallel instead.

func BenchmarkMapStore(b *testing.B) {
	keys := generateKeys(b, size)
	b.ResetTimer()

	b.RunParallel(func(pb *testing.PB) {
		var (
			m sync.Map
			i int
		)
		for pb.Next() {
			m.Store(keys[i], i)
			i = (i + 1) % size
		}
	})
}

In this scenario, each goroutine operates on its own map. Comparing this to the previous execution, we notice quicker average stabilization across runs. In slower benchmarks, this can mean faster iteration: testing different variations quickly rather than waiting for results.

go test -bench=BenchmarkMapStore -count=10
goos: linux
goarch: amd64
pkg: github.com/konradreiche/benchmark
cpu: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
BenchmarkMapStore-16    	23450462	       49.32 ns/op
BenchmarkMapStore-16    	23560857	       47.49 ns/op
BenchmarkMapStore-16    	23439856	       48.32 ns/op
BenchmarkMapStore-16    	23390617	       47.83 ns/op
BenchmarkMapStore-16    	23071099	       48.97 ns/op
BenchmarkMapStore-16    	24118270	       48.30 ns/op
BenchmarkMapStore-16    	24307544	       47.80 ns/op
BenchmarkMapStore-16    	25475882	       46.27 ns/op
BenchmarkMapStore-16    	24652126	       46.97 ns/op
BenchmarkMapStore-16    	22172274	       45.47 ns/op
PASS
ok  	github.com/konradreiche/benchmark	11.908s

While the parallel benchmark reports a lower nanoseconds-per-operation value, that doesn't necessarily mean individual operations execute faster. The parallel benchmark aggregates the total number of operations across all CPUs. With 16 goroutines in play, the benchmark completes roughly five times as many iterations before stabilizing, so it reports a store operation as roughly five times faster.
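
As a quick sanity check with the numbers above: the sequential run measures roughly 4,438,770 iterations × 269.7 ns/op ≈ 1.2 s, while the parallel run measures 23,439,856 iterations × 48.32 ns/op ≈ 1.1 s. Both settle around the same wall-clock budget, at least the default -benchtime of one second; the parallel version simply packs about five times as many operations into it.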

Benchmark Concurrency

b.RunParallel shines when it comes to benchmarking Go code for concurrent execution, such as evaluating the performance of synchronization primitives or seeing how a given data structure or function performs under concurrent access.

Introducing concurrency to tests and benchmarks typically involves starting goroutines, wiring up a sync.WaitGroup, and so on, which can quickly compromise the clarity of your tests. In contrast, with b.RunParallel we can benchmark our code with multiple goroutines from the beginning. This changes our code structure to the following:

func BenchmarkMapStore(b *testing.B) {
	var m sync.Map
	keys := generateKeys(b, size)

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		var i int
		for pb.Next() {
			m.Store(keys[i], i)
			i = (i + 1) % size
		}
	})
}

Note that sync.Map is now shared between multiple goroutines, while each goroutine still maintains its own local index.

This allows us to benchmark our code and evaluate its performance under concurrent access. It also helps with data races: sequential tests may never trigger them, so discovering data races often requires tests that incorporate goroutines, and a parallel benchmark provides that coverage when executed with the -race flag.
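
For example, running the parallel benchmark with the race detector enabled:

go test -bench=BenchmarkMapStore -race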

As your benchmarks grow more complex, you might be frustrated by the amount of code added inside the pb.Next loop, which can heavily dilute the time measured. RunParallel reports ns/op as the wall-clock time of the entire benchmark divided by the total number of iterations, not as an average per goroutine, so you might still be interested in narrowing down the measurement.

To address this, you can make use of b.ReportMetric, which allows you not only to report custom metrics but also to overwrite built-in ones. This proves useful for obtaining clearer delta computations when comparing benchmark results with tools like benchstat.
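
As a minimal sketch, overriding the built-in ns/op metric after RunParallel returns could look like this; elapsed here is a hypothetical duration accumulated around the code under test, not part of the testing package:

// elapsed is a hypothetical time.Duration measured around the code
// under test. Passing the unit "ns/op" makes ReportMetric override
// the value the framework would otherwise report for that metric.
b.ReportMetric(float64(elapsed.Nanoseconds())/float64(b.N), "ns/op")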

For my work on an in-memory cache in Go, I wrote a benchmark that covers various aspects such as reads, writes, hit rate, and performance under concurrent access, all in one benchmark. While some argue for separate benchmarks for different objectives, it was a crucial tool for me to iterate on various parameters and understand how the values correlate and change as I optimized the underlying implementation.

func BenchmarkCache(b *testing.B) {
	cb := newBenchmarkCase(b, config{size: 400_000})
	b.SetParallelism(7)
	b.ResetTimer()

	b.RunParallel(func(pb *testing.PB) {
		var br benchmarkResult
		for pb.Next() {
			log := cb.nextEventLog()

			start := time.Now()
			cached, missing := cb.cache.Get(log.keys)
			br.observeGet(start, cached, missing)

			if len(missing) > 0 {
				data := lookupData(missing)
				start := time.Now()
				cb.cache.Set(data)
				br.observeSetDuration(start)
			}
		}
		cb.addLocalReports(br)
	})

	b.ReportMetric(cb.getHitRate(), "hit-rate/op")
	b.ReportMetric(cb.getTimePerGet(b), "read-ns/op")
	b.ReportMetric(cb.getTimePerSet(b), "write-ns/op")
}
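
Note that cb.addLocalReports runs once per goroutine, concurrently, so the aggregation itself needs synchronization. A plausible sketch, with all field names being assumptions rather than the actual implementation:

func (cb *benchmarkCase) addLocalReports(br benchmarkResult) {
	cb.mu.Lock()
	defer cb.mu.Unlock()
	cb.hits += br.hits
	cb.misses += br.misses
	cb.getNanos += br.getNanos
	cb.setNanos += br.setNanos
}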

Learnings

Here are further insights I discovered while delving into parallel benchmarking in Go:

  1. Attempting to stop the iteration prematurely, by returning from the body before pb.Next() returns false, fails with: RunParallel: body exited without pb.Next() == false. If you want the benchmark to execute a specific number of times, use -benchtime instead, for example -benchtime=2000x.
  2. Don’t use b.ResetTimer, b.StartTimer, or b.StopTimer inside the RunParallel body. These functions operate on the benchmark’s shared timer state and are not meant to be used in a concurrent context.
  3. GOMAXPROCS determines the number of goroutines. Experiment with different GOMAXPROCS values to see how performance changes, or better, use -cpu=1,2,4,8 (or similar) to have the benchmark run multiple times with different GOMAXPROCS values.
  4. The hardware on which you run your code may have fewer CPUs than the number of goroutines accessing shared code in production. To benchmark your code under conditions closer to your production environment, call b.SetParallelism(p), which multiplies GOMAXPROCS by p; see the sketch after this list.
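
Here is a minimal sketch of that last point, reusing the shared-map benchmark from above:

func BenchmarkMapStoreContended(b *testing.B) {
	var m sync.Map
	keys := generateKeys(b, size)

	// With GOMAXPROCS=16, SetParallelism(7) makes RunParallel start
	// 7*16 = 112 goroutines, producing more contention on the shared
	// map than the local CPU count alone would.
	b.SetParallelism(7)
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		var i int
		for pb.Next() {
			m.Store(keys[i], i)
			i = (i + 1) % size
		}
	})
}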

Crafting benchmarks is as much an art as a science, and achieving accuracy in this domain is undeniably challenging, with many rabbit holes to get lost in. Nevertheless, try to focus on being practical and remember that benchmarks are not just about numbers; they are about gaining insight into the behavior of our code.