client/matching/partition_counts.go
```go
logger.Info("partition count trailer parse error", tag.Error(err2))
// continue with zero value for pc2
}
if pc2 != pc {
```
Should we overwrite it only if err2 == nil? Otherwise we are unnecessarily clearing the cache.
I think it makes sense to clear the cache on parse error. That's a very exceptional situation: it's just a simple proto message between internal services. What could go wrong?
```go
@@ -0,0 +1,117 @@
package matching
```
What do you think about adding some metrics now, as part of this PR?

In this file:
- cache hit/miss rate
- cache size

In partition.go:
- StalePartitionCounts retry count
StalePartitionCounts errors will show up in service_error_with_type, right?
For the cache, I'm not sure hit/miss makes much sense, since it's not a size-limited cache. Size does make sense; I can do that.
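The cache-size gauge agreed on above could look something like this. The type name, metric name, and the callback-style emit function are all hypothetical; the real code would use the server's own metrics API.

```go
package main

// partitionCountCache is a hypothetical sketch for the cache-size
// metric discussed above; the metric name and emit callback are
// illustrative, not the real metrics API.
type partitionCountCache struct {
	counts map[string]int32 // task queue name -> partition count
}

// emitSize reports the number of cached task queues as a gauge.
// A hit/miss rate was dropped since the cache is not size-limited.
func (c *partitionCountCache) emitSize(emit func(name string, value float64)) {
	emit("partition_cache_size", float64(len(c.counts)))
}
```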
Suggest revising the PR description a bit so we get the gist of the flow too. This new behavior does not take effect yet since the server does not send any counts.
What changed?
Add client-side of dynamic partition scaling: `partitionCache`, `PartitionCounts`, `StalePartitionCounts` error, `handlePartitionCounts`.
The intended behavior is:
For partition-aware calls (add+poll tasks), the client caches the partition count for active task queues, and uses those partition counts for load balancing (no change to the load balancing algorithm yet). The client sends its cached partition counts to the server in a gRPC header, and receives updated partition counts with the response in a gRPC trailer. If the updated counts differ, it updates its cache for subsequent requests. If the server indicates that the client's view of the partition counts is stale, it returns a special StalePartitionCounts error, and the client makes one immediate retry with the newly received counts.
With just this PR by itself, the server will never send counts, the cache will always be empty, and the client will always fall back to dynamic config for partition counts, so there's no change in behavior yet.
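The stale-count retry flow described above can be sketched as a small simulation. Everything here is illustrative: the error variable, the `call`/`callWithRetry` helpers, and the way the server's count travels back are stand-ins for the real gRPC trailer mechanics.

```go
package main

import "errors"

// errStalePartitionCounts stands in for the StalePartitionCounts error
// from the PR; the real error carries the updated counts in a gRPC
// trailer. This is an illustrative sketch, not the actual client code.
var errStalePartitionCounts = errors.New("stale partition counts")

// call simulates one partition-aware RPC: the server rejects the
// request when the client's cached count disagrees with its own,
// returning the fresh count alongside the error.
func call(clientCount, serverCount int32) (int32, error) {
	if clientCount != serverCount {
		return serverCount, errStalePartitionCounts
	}
	return serverCount, nil
}

// callWithRetry performs the single immediate retry described in the
// PR flow, reusing the counts the server just sent.
func callWithRetry(cached, serverCount int32) (int32, error) {
	got, err := call(cached, serverCount)
	if errors.Is(err, errStalePartitionCounts) {
		return call(got, serverCount)
	}
	return got, err
}
```

A single bounded retry keeps the worst case to two round trips while still converging the client's cache to the server's view.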
Why?
Implement half of dynamic partition scaling.
How did you test it?