
Dynamic partitioning: client side #9732

Open
dnr wants to merge 14 commits into temporalio:main from dnr:dp6

Conversation

Contributor

@dnr dnr commented Mar 28, 2026

What changed?

Add client-side of dynamic partition scaling: partitionCache, PartitionCounts, StalePartitionCounts error, handlePartitionCounts.

The intended behavior is:

For partition-aware calls (add + poll tasks), the client caches the partition count for active task queues and uses those counts for load balancing (no change to the load-balancing algorithm yet). The client sends its cached partition counts to the server in a gRPC header, and receives updated partition counts with the response in a gRPC trailer. If the updated counts differ, it updates its cache for subsequent requests. If the server indicates that the client's view of the partition counts is stale, it returns a special StalePartitionCounts error, and the client makes one immediate retry with the newly received counts.

With just this PR by itself, the server will never send counts, the cache will always be empty, and the client will always fall back to dynamic config for partition counts, so there's no change in behavior yet.
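The cache-then-retry flow described above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code: the names `partitionCache`, `call`, the map-based cache, the fallback default of 4, and the `rpc` callback (standing in for the real gRPC call with its header/trailer plumbing) are all assumptions for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errStalePartitionCounts stands in for the PR's StalePartitionCounts error.
var errStalePartitionCounts = errors.New("stale partition counts")

// partitionCache maps a task queue name to its last known partition count.
// The real cache key and value types in the PR will differ.
type partitionCache struct {
	mu     sync.Mutex
	counts map[string]int
}

func newPartitionCache() *partitionCache {
	return &partitionCache{counts: make(map[string]int)}
}

// get falls back to a default (standing in for dynamic config) on a miss.
func (c *partitionCache) get(taskQueue string, fallback int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if n, ok := c.counts[taskQueue]; ok {
		return n
	}
	return fallback
}

func (c *partitionCache) put(taskQueue string, n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[taskQueue] = n
}

// call sends the cached count (in the real PR: a gRPC header) and applies
// the server's updated count from the response (a gRPC trailer). On a
// stale-counts error it retries exactly once with the fresh count.
func call(c *partitionCache, taskQueue string, rpc func(sent int) (updated int, err error)) error {
	sent := c.get(taskQueue, 4) // 4 = assumed dynamic-config default
	updated, err := rpc(sent)
	if updated != 0 && updated != sent {
		c.put(taskQueue, updated)
	}
	if errors.Is(err, errStalePartitionCounts) {
		// one immediate retry with the newly received counts
		_, err = rpc(c.get(taskQueue, 4))
	}
	return err
}

func main() {
	c := newPartitionCache()
	calls := 0
	// Fake server: 8 partitions; flags the client's 4 as stale.
	err := call(c, "tq", func(sent int) (int, error) {
		calls++
		if sent != 8 {
			return 8, errStalePartitionCounts
		}
		return 8, nil
	})
	fmt.Println(calls, c.get("tq", 4), err) // prints: 2 8 <nil>
}
```

With an empty cache and a server that never sends counts, `get` always returns the fallback, matching the "no change in behavior yet" note above.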

Why?

Implement half of dynamic partition scaling.

How did you test it?

  • built
  • run locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s) (tests are in future PRs)

@dnr dnr requested a review from rkannan82 March 28, 2026 04:47
@dnr dnr requested review from a team as code owners March 28, 2026 04:47
logger.Info("partition count trailer parse error", tag.Error(err2))
// continue with zero value for pc2
}
if pc2 != pc {
Contributor


Should we overwrite it only if err2 == nil? Otherwise we are unnecessarily clearing the cache.

Contributor Author


I think it makes sense to clear the cache on parse error. That's a very exceptional situation: it's just a simple proto message between internal services. What could go wrong?

@@ -0,0 +1,117 @@
package matching
Contributor

@rkannan82 rkannan82 Apr 3, 2026


What do you think about adding some metrics now?
In this file:

  • cache hit/miss rate
  • cache size

In partition.go:

  • StalePartitionCounts retry count

Contributor Author


StalePartitionCounts errors will show up in service_error_with_type, right?

For the cache, I'm not sure hit/miss makes much sense since it's not a size-limited cache. Size does make sense; I can do that.
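The agreed-on cache-size metric could look roughly like this. This is a hypothetical sketch, not Temporal's metrics API: the `gauge` callback type and the idea of reporting size on every update are assumptions for illustration.

```go
package main

import "fmt"

// gauge stands in for whatever gauge-emission API the real code uses.
type gauge func(float64)

// partitionCache is a minimal stand-in for the PR's cache type.
type partitionCache struct {
	counts    map[string]int
	sizeGauge gauge
}

// put stores a count and reports the current cache size.
func (c *partitionCache) put(taskQueue string, n int) {
	c.counts[taskQueue] = n
	c.sizeGauge(float64(len(c.counts)))
}

func main() {
	var last float64
	c := &partitionCache{
		counts:    map[string]int{},
		sizeGauge: func(v float64) { last = v },
	}
	c.put("a", 4)
	c.put("b", 8)
	fmt.Println(last) // prints: 2
}
```

StalePartitionCounts retries would not need a dedicated counter if, as noted above, the errors already surface via the existing service_error_with_type metric.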

@rkannan82
Contributor

Suggest revising the PR description a bit so we get the gist of the flow too.
Example:
Behavior: a matching RPC sends its cached partition counts to the server and receives updated partition counts as part of the response (gRPC header/trailer). If the updated counts are different, it caches them for subsequent requests. If the server indicates that the client's view of the partition counts is stale, the client makes one more attempt with the newly received counts.

This new behavior does not take effect yet since the server does not send any counts.
