Scaler fails only when failing to get counts from all the interceptor endpoints #903
Mizhentaotuo wants to merge 3 commits into kedacore:main
Conversation
Signed-off-by: mingzhe <whitelmz@hotmail.com>
Hey!
That said, I guess we could try to figure out a better way to hit the interceptors, something like getting the ready pods and calculating the endpoints on the scaler side instead of using the k8s Endpoints object.
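The suggestion above could be sketched roughly as follows. This is a hypothetical illustration with local types, not the add-on's actual code: a real implementation would read `k8s.io/api/core/v1` Pod objects from an informer, and the port and `/queue` path are placeholders.

```go
package main

import "fmt"

// pod is a hypothetical minimal stand-in for a corev1.Pod; in the real
// scaler this information would come from the Kubernetes API.
type pod struct {
	ip    string
	ready bool
}

// readyEndpoints builds interceptor endpoint URLs from ready pods only,
// instead of trusting a possibly stale Endpoints object, so pods on
// drained or not-yet-ready nodes are never queried.
func readyEndpoints(pods []pod, port int) []string {
	var urls []string
	for _, p := range pods {
		if !p.ready {
			continue // skip pods whose readiness probe has not passed
		}
		urls = append(urls, fmt.Sprintf("http://%s:%d/queue", p.ip, port))
	}
	return urls
}

func main() {
	pods := []pod{{"10.0.0.1", true}, {"10.0.0.2", false}}
	fmt.Println(readyEndpoints(pods, 9090)) // prints [http://10.0.0.1:9090/queue]
}
```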
Hey! Thanks a lot for the quick reply.
No, no, I agree that is the expected behavior. But what happens on our cluster is that the scaler fails because one node is removed by GCP, and spinning up a replacement node takes some time (could be a minute). It could be that the endpoints list has been updated, but because the node is not ready yet, the pod is not ready either? This part I do not know much about.
Description of what has been changed:
We observed that the scaler fails and exits the loop when it fails to get counts from any single interceptor replica.
We are not sure this is the intended behavior, but sometimes one interceptor replica is down only because it was running on a spot node. When the node goes down and the endpoints of the interceptor Service have not been updated yet, the scaler still tries to get counts from an endpoint that no longer exists. Most of the time the killed interceptor pod heals itself.
Checklist
- README.md / docs/ directory

Fixes #
Changed the scaler so that it fails only when fetching the counts from every interceptor endpoint has failed.
Comment:
I am new to this, and I am not sure the existing version is the intended behavior. Please let me know if there is a better way, or if this can already be handled by a config value I am not aware of. Appreciated.