Kubernetes: upgrading from autoscaling/v2beta1 to autoscaling/v2beta2
We use a HorizontalPodAutoscaler Kubernetes resource to scale Pods that work off items from our AWS SQS queues. We found the scale-up to be very aggressive and wondered whether the new version would help. I couldn’t find any documentation about the syntax change in v2beta2 for object metrics. Since I spent more than an hour working it out from the raw spec, I thought I would put the changes here in case it helps anyone else.
Before (v2beta1):
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metricName: redacted_qname_sqs_approximatenumberofmessages
      target:
        kind: Namespace
        name: default
      targetValue: 250
After (v2beta2):
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: pph-notifications-listener
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: pph-notifications-listener
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Object
    object:
      metric:
        name: redacted_qname_sqs_approximatenumberofmessages
      describedObject:
        kind: Namespace
        name: default
      target:
        type: Value
        value: 250
The HPA gets these metrics from our Kubernetes Custom Metrics API, which gets them from Prometheus, which gets them via a ServiceMonitor from sqs-exporter, which gets them from CloudWatch. Simple!
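If you’re wiring up something similar with prometheus-adapter for the Prometheus-to-Custom-Metrics-API hop, a minimal sketch of a rule that exposes a queue-length series as a namespace-scoped object metric looks something like this (illustrative only; it assumes the scraped series carries a namespace label, and your adapter config may differ):
# prometheus-adapter rules sketch (assumptions: prometheus-adapter is in use,
# and the series from sqs-exporter has a namespace label added at scrape time)
rules:
- seriesQuery: 'redacted_qname_sqs_approximatenumberofmessages'
  resources:
    # map the namespace label onto the Namespace object the HPA's describedObject points at
    overrides:
      namespace: {resource: "namespace"}
  name:
    # expose the metric under its original name
    matches: "^(.*)$"
    as: "${1}"
  # take the highest value across any remaining labels
  metricsQuery: 'max(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'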
Look how aggressive the v2beta1 scale-up is:

See how, the moment the value goes over the target, Pods are scaled up to the max! The problem is that, because we use EKS, a managed service where the kube-controller-manager runs on a master node, we can’t change some of the key settings like --horizontal-pod-autoscaler-sync-period or --horizontal-pod-autoscaler-downscale-stabilization (ref).
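One per-HPA workaround worth knowing about is the spec.behavior field, which sets scaling policies on the HPA itself and so doesn’t require touching the controller-manager flags. This is a sketch only: behavior was only added to autoscaling/v2beta2 in Kubernetes 1.18, so it may not be available on your cluster, and the numbers below are illustrative.
# sketch: requires Kubernetes 1.18+ for spec.behavior in autoscaling/v2beta2
spec:
  behavior:
    scaleUp:
      # add at most 1 Pod per 60s instead of jumping straight to maxReplicas
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleDown:
      # wait 5 minutes of lower metric values before scaling back down
      stabilizationWindowSeconds: 300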
Update: Unfortunately, upgrading to v2beta2 didn’t help with our aggressive scale-up problem:

Figuring out this issue is important: the surge in resource requests makes our cluster size grow and shrink needlessly, and the extra Pod churn makes observability harder and generates more logs than necessary.