Kubernetes: upgrading from autoscaling/v2beta1 to autoscaling/v2beta2
We use a HorizontalPodAutoscaler Kubernetes resource to scale Pods that work off items from our AWS SQS queues. We found the scale-up to be very aggressive and wondered whether the new version would help. I couldn't find any documentation about the syntax change in v2beta2 for object metrics. Since I spent more than an hour working it out from the raw spec, I thought I would put the changes here in case it helps anyone else.
Before (v2beta1):
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metricName: redacted_qname_sqs_approximatenumberofmessages
      target:
        kind: Namespace
        name: default
      targetValue: 250
After (v2beta2):
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metric:
        name: redacted_qname_sqs_approximatenumberofmessages
      describedObject:
        kind: Namespace
        name: default
      target:
        type: Value
        value: 250
The HPA gets these metrics from our Kubernetes custom metrics API, which gets them from Prometheus, which gets them via a ServiceMonitor from sqs-exporter, which gets them from CloudWatch. Simple!
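For reference, the ServiceMonitor link in that chain is only a few lines of YAML. Here's a minimal sketch, assuming sqs-exporter sits behind a Service labelled app: sqs-exporter in a monitoring namespace with a port named metrics (those names and the scrape interval are assumptions for illustration, not our exact config):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sqs-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: sqs-exporter      # label on the sqs-exporter Service (assumed)
  endpoints:
  - port: metrics            # named port on that Service (assumed)
    interval: 30s            # how often Prometheus scrapes the exporter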
Look how aggressive the v2beta1 scale-up is:

See how, the moment the value goes over the target, Pods are scaled up to the max! The problem is that, because we use EKS, which is a managed service, the kube-controller-manager runs on a master node we don't manage, so we can't change some of the key settings like --horizontal-pod-autoscaler-sync-period or --horizontal-pod-autoscaler-downscale-stabilization (ref).
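On a self-managed control plane those flags would be set on kube-controller-manager itself, typically in its static Pod manifest. A minimal sketch of what we can't touch on EKS (the values are illustrative assumptions, not recommendations; the defaults are 15s and 5m0s respectively):

spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --horizontal-pod-autoscaler-sync-period=30s                 # how often the HPA controller evaluates metrics
    - --horizontal-pod-autoscaler-downscale-stabilization=10m0s   # how long it waits before scaling back down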
Update: Unfortunately, upgrading to v2beta2 didn't help with our aggressive scale-up problem:

Figuring out this issue is important because the surge in resource requests makes our cluster grow and shrink needlessly, and the extra Pod churn makes observability harder and generates more logs than necessary.