
perf(runtime): add typed-slice fast paths to in operator #960

Open
mingrammer wants to merge 1 commit into expr-lang:master from mingrammer:perf/runtime-in-typed-slice-fast-path


Summary

runtime.In (called by the OpIn opcode for the in operator) uses reflect to iterate its right-hand side. The reflect path is correct for any slice type, but on every typed slice it pays one heap allocation per scanned element, because reflect.Value.Index(i).Interface() must box the element when the slice's element type is not already interface{}.

For []any the boxing is a no-op (each cell is already an interface), so the existing path already costs zero allocations per element. For []string, []float64, []int64, []int, and []bool it adds up to N heap allocations per in evaluation (one for each element scanned until a match is found), which is a significant tax when in runs in a hot loop (e.g. rule engines or expression-based filters over candidate lists).

This PR adds a small type switch at the top of In for those five common shapes. Each case uses a pure-Go for ... range loop, so there is no reflection, no per-element boxing, and no Equal() round-trip.
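The shape of such a type switch can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PR's diff: inTyped and contains are hypothetical names, the loop is factored into a generic helper (Go 1.18+) for brevity rather than written out per case, and handled=false is the assumed signal that the caller should use the reflect path instead.

```go
package main

import "fmt"

// contains is a plain for ... range scan over a typed slice:
// no reflect and no per-element boxing.
func contains[T comparable](xs []T, needle T) bool {
	for _, x := range xs {
		if x == needle {
			return true
		}
	}
	return false
}

// inTyped is a hypothetical fast path for the five typed shapes.
// handled is false when the array is not one of those shapes or the needle's
// dynamic type differs from the element type; the caller should then use
// the generic reflect-based scan so cross-type comparison still applies.
func inTyped(needle, array any) (found, handled bool) {
	switch xs := array.(type) {
	case []string:
		if n, ok := needle.(string); ok {
			return contains(xs, n), true
		}
	case []float64:
		if n, ok := needle.(float64); ok {
			return contains(xs, n), true
		}
	case []int64:
		if n, ok := needle.(int64); ok {
			return contains(xs, n), true
		}
	case []int:
		if n, ok := needle.(int); ok {
			return contains(xs, n), true
		}
	case []bool:
		if n, ok := needle.(bool); ok {
			return contains(xs, n), true
		}
	}
	return false, false
}

func main() {
	fmt.Println(inTyped("b", []string{"a", "b"})) // true true
	fmt.Println(inTyped(int64(3), []int64{1, 2})) // false true
	fmt.Println(inTyped(1, []float64{1, 2}))      // false false: int needle, fall back
	fmt.Println(inTyped("x", []any{"x"}))         // false false: []any keeps the reflect path
}
```

The standard library's slices.Contains (Go 1.21+) performs the same linear scan and could replace the contains helper.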

Behavior preserved

On a needle/element type mismatch the case falls through to the existing reflect path, so Equal()'s cross-type promotion semantics are preserved. For example, an int needle against []float64 still matches via numeric promotion. The existing test suite is untouched and still passes; new tests in vm/runtime/runtime_test.go cover hit/miss for each fast path, an empty typed slice, cross-type needles, and unchanged []any semantics.
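The fall-through rule can be illustrated with a single typed case plus a reflect fallback. This is a hypothetical sketch: looseEqual is a simplified stand-in for expr's runtime.Equal that only promotes int, int64, and float64 to float64, and the fallback only handles slices and arrays, whereas runtime.In also accepts other right-hand sides such as maps.

```go
package main

import (
	"fmt"
	"reflect"
)

// toFloat is a hypothetical, simplified numeric promotion used only for this
// sketch; expr's runtime.Equal covers many more numeric types.
func toFloat(v any) (float64, bool) {
	switch n := v.(type) {
	case int:
		return float64(n), true
	case int64:
		return float64(n), true
	case float64:
		return n, true
	}
	return 0, false
}

// looseEqual stands in for runtime.Equal: numbers are compared after
// promotion to float64, everything else with reflect.DeepEqual.
func looseEqual(a, b any) bool {
	if fa, ok := toFloat(a); ok {
		if fb, ok := toFloat(b); ok {
			return fa == fb
		}
	}
	return reflect.DeepEqual(a, b)
}

// in runs the typed loop only when the needle already has the slice's
// element type; otherwise control reaches the reflect scan, where
// promotion still applies.
func in(needle, array any) bool {
	if xs, ok := array.([]float64); ok {
		if f, ok := needle.(float64); ok {
			for _, x := range xs {
				if x == f {
					return true
				}
			}
			return false
		}
		// Needle is not a float64 (e.g. an int): fall through.
	}
	v := reflect.ValueOf(array)
	if v.Kind() != reflect.Slice && v.Kind() != reflect.Array {
		return false // sketch only; maps and other shapes are out of scope
	}
	for i := 0; i < v.Len(); i++ {
		if looseEqual(v.Index(i).Interface(), needle) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(in(2.0, []float64{1, 2})) // true, via the typed loop
	fmt.Println(in(2, []float64{1, 2}))   // true, int needle promoted on the reflect path
	fmt.Println(in(3, []float64{1, 2}))   // false
}
```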

Benchmarks

Apple M4 Pro, darwin/arm64, go test -benchtime=1s ./vm/runtime/:

| Bench | Before | After | Speedup |
|---|---|---|---|
| StringSlice/N=8 | 112.8 ns/op, 6 allocs | 18.96 ns/op, 1 alloc | 6.0× |
| StringSlice/N=64 | 659.8 ns/op, 34 allocs | 31.69 ns/op, 1 alloc | 20.8× |
| StringSlice/N=256 | 2240 ns/op, 130 allocs | 60.28 ns/op, 1 alloc | 37.2× |
| Float64Slice/N=8 | 85.1 ns/op, 6 allocs | 14.99 ns/op, 1 alloc | 5.7× |
| Float64Slice/N=64 | 442.1 ns/op, 34 allocs | 23.66 ns/op, 1 alloc | 18.7× |
| Float64Slice/N=256 | 1794 ns/op, 130 allocs | 169.1 ns/op, 1 alloc | 10.6× |
| Int64Slice/N=8 | 82.0 ns/op, 6 allocs | 14.77 ns/op, 1 alloc | 5.6× |
| Int64Slice/N=64 | 973.8 ns/op, 34 allocs | 23.15 ns/op, 1 alloc | 42.1× |
| Int64Slice/N=256 | 1610 ns/op, 130 allocs | 166.0 ns/op, 1 alloc | 9.7× |
| AnySliceOfString/N=* | unchanged | unchanged | |

The remaining 1 alloc/op is the call-site itself boxing the needle into any when calling runtime.In; it lives outside the changed code.

Test plan

  • go test ./vm/runtime/... (new + existing)
  • go test ./... (full suite, all green)
  • go vet ./...
  • go test -bench=BenchmarkIn -benchmem ./vm/runtime/ (numbers above)

`in` dispatches through `runtime.In`, which uses reflect to iterate the
right-hand side. The reflect path is correct for any slice type but pays
one heap allocation per scanned element on every typed slice, because
`reflect.Value.Index(i).Interface()` must box the element when the slice's
element type is not already `interface{}`.

For `[]any` this boxing is a no-op (the cell is already an interface), so
the existing path is already zero-alloc-per-element. For `[]string`,
`[]float64`, `[]int64`, `[]int`, and `[]bool` it adds up to N heap
allocations per `in` evaluation (one per element scanned before a match),
which is significant when `in` runs in a hot loop (e.g. rule engines or
expression-based filters over candidate lists).

This patch adds a type switch at the top of `In` for those five common
shapes. Each case uses a pure-Go `for ... range` loop, so there is no
reflection, no per-element boxing, and no Equal() round-trip. On a
needle/element type mismatch the case falls through to the existing
reflect path, so Equal()'s cross-type promotion semantics are preserved
(e.g. an int needle against a []float64 still matches).

Benchmarks (Apple M4 Pro, darwin/arm64, -benchtime=1s):

  bench (N elements)             before                     after                    speedup
  StringSlice/N=8        112.8 ns/op,   6 allocs    18.96 ns/op, 1 alloc       6.0x
  StringSlice/N=64       659.8 ns/op,  34 allocs    31.69 ns/op, 1 alloc      20.8x
  StringSlice/N=256     2240   ns/op, 130 allocs    60.28 ns/op, 1 alloc      37.2x
  Float64Slice/N=8        85.1 ns/op,   6 allocs    14.99 ns/op, 1 alloc       5.7x
  Float64Slice/N=64      442.1 ns/op,  34 allocs    23.66 ns/op, 1 alloc      18.7x
  Float64Slice/N=256    1794   ns/op, 130 allocs   169.1  ns/op, 1 alloc      10.6x
  Int64Slice/N=8          82.0 ns/op,   6 allocs    14.77 ns/op, 1 alloc       5.6x
  Int64Slice/N=64        973.8 ns/op,  34 allocs    23.15 ns/op, 1 alloc      42.1x
  Int64Slice/N=256      1610   ns/op, 130 allocs   166.0  ns/op, 1 alloc       9.7x
  AnySliceOfString/N=*  unchanged (already uses zero-alloc reflect path)

The remaining 1 alloc/op is the call-site boxing the needle into `any`
when calling runtime.In; it lives outside the changed code.

Tests in `vm/runtime/runtime_test.go` cover hit/miss for each fast path,
an empty typed slice, a cross-type needle (which must fall through to
reflect), and unchanged `[]any` semantics. The existing test suite is
untouched and still passes.

Signed-off-by: MinJae Kwon <mingrammer@gmail.com>