Be Lean, Go Far: leveraging Kubernetes for an elastic right-sized platform

Sebastien DOIDO
BlaBlaCar
14 min read · Sep 22, 2022

BlaBlaCar is fully running its infrastructure on Google Cloud Platform, relying on the managed Kubernetes service called GKE. All our applications are deployed into containers, including our databases.

During our migration journey, building the teams, designing our architecture, and deciding what to make or buy were challenging enough topics, so we did not focus on fully leveraging cloud opportunities from day one. Despite our “Be lean, go far” principle, application right-sizing was not part of our top five priorities compared to delivering a robust, scalable platform for our members.

Having fully completed the migration, and consolidated our cloud usage, we had the opportunity to step back and determine what would be a smart move to keep moving forward: building a right-sized elastic platform that would support our business with a more reliable and cost/CO2 efficient setup, leveraging our cloud provider offer.

This article tells the story of that technical and mindset evolution.

one of our BlaBlaCar principles

Problem statement

Despite a worldwide footprint, BlaBlaCar’s activity tends to be periodic: our members travel more during weekends than on Tuesday mornings; local holiday periods and bank holidays add hard-to-anticipate usage peaks to a list of well-known events like New Year’s Eve. As an Operations team, we need to manage the traffic drops and spikes efficiently, and align allocatable and requested resources following a reliability objective. Our members’ experience must not suffer from high activity on the platform, while we should use our resources wisely during low-activity periods.

Like many companies, by going to the cloud and prioritizing the migration, reliability and velocity over other options, we ended up facing the following problems:

  • falling into the trap of resources overprovisioning
  • not completely managing drops and spikes in demand
  • not leveraging the opportunity of volatile instances (preemptible or spot instances)
  • not deep diving into opportunities related to resources bin-packing and scheduling optimizations
  • getting into complex discussions when working on Committed Use Discounts
  • getting haunted by orphaned cloud resources and by backup and snapshot policies for our persistent volumes

While this was acceptable in the first quarter following the GCP migration, we wanted to provide engineering teams with clearer guidelines and improve the situation. As cloud providers offer the flexibility to provision infrastructure dynamically in a couple of minutes, most of the engineering work was on the infrastructure team side!

As a direct consequence, one of our main challenges was to design an elastic platform, adapting itself to our members’ activity, that would have just enough capacity to avoid wasting resources while preserving the experience.

“We don’t live any more in a static provisioning era.”


As going to the cloud mostly means a “pay for what you request” model rather than a “pay for what you use” model, solid engineering practices are needed to build a reliable, performant and cost-efficient platform. By the way, it might be time to mention that this article is not presenting a “cost control” project.

Our main purpose was to design and deploy good engineering practices to level up our resiliency, optimize our workload performance, reduce our carbon footprint, and ideally optimize our costs. BlaBlaCar being a Tech4good company committed to CO2 reduction, we want to avoid starting servers in cloud datacenters that would consume a lot of energy, water and other limited resources. Preserving the environment and fighting climate change are deeply rooted in BlaBlaCar’s DNA, and the company continues to engage in many different ways.

The problem and the dreamed-of solution being identified, how did we start working on such a project?

“At the beginning of the project, we were only using 25% of the requested CPU!”

The pillars of our elastic platform toolkit

We have many infrastructure metrics, but averaged values do not always bring valuable insights. Different service topologies bring different sets of constraints (we run both apps and databases on GKE), so we needed detailed metrics per GKE nodepool to detect meaningful trends. We started implementing a strong data-driven analytical approach, using “you can only improve what you measure” as a mantra.

We then had to redesign our nodepool strategy, better answering the “which nodepool for which usage” question, and deep-dived into the Kubernetes scheduler settings and optimizations to leverage spot instances while eliminating the associated reliability risk.

Finally, we provided some strong defaults to help service teams configure resources, improve low-traffic period management and introduce suitable elasticity to handle traffic spikes… allowing the underlying infrastructure to automatically adjust.

Step 0: “Rally the troops”

A quick word on the organization: the BlaBlaCar Foundations department is built around teams (to build and run technical stacks) and guilds (to promote practices, foster collaboration and gather feedback). To deliver this project, we gathered a dedicated working group composed of one representative from each Foundations team, in order to have strong expertise in observability, system architecture and building standards. We organized our project steps around this working group, and leveraged our guilds to share our project milestones, findings and next steps with the service teams using our product. We organized many hands-on workshops with service teams, aiming to give them maximum autonomy on the stack.

Step 1: “You can only improve what you measure”

Our observability stack exposes a large set of metrics. However, they were too generic to clearly characterize our problem statement.

pods that could be scheduled on preemptible servers ended up scheduled on OnDemand servers

We decided to develop a custom tool called resource-monitor to pre-compute all the metrics we would like to measure… and improve.

For example, we knew that some workloads had invalid settings in their Kubernetes manifests: some applications could run on preemptible nodes but were scheduled on OnDemand nodes because of misconfiguration. How many services were impacted? How much CPU was at stake? Well, quite a few!

We also wanted to understand our elasticity per nodepool, because an average CPU utilization is not really helpful if we compare an oversized workload (like a MariaDB cluster) with application workers that can scale horizontally very easily.

For our right-sizing initiatives, we also deployed the VerticalPodAutoScaler recommender to help us look at applications that don’t scale horizontally:

How far were our workload resources from the recommendations regarding their requested resources?
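
One way to collect those recommendations without letting the autoscaler act on pods is a VerticalPodAutoscaler object in recommendation-only mode. A minimal sketch, where the target StatefulSet name is hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mariadb-vpa                 # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: mariadb                   # hypothetical workload that cannot scale horizontally
  updatePolicy:
    updateMode: "Off"               # recommendation-only: compute suggestions, never evict pods
```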

Here is the list of custom metrics that we added:

Over time, we based our communication on those metrics to check how we were improving the situation.

Step 2: (Re)Design our Nodepools strategy

GKE introduces the concept of nodepools to Kubernetes, i.e. groups of nodes that share the same configuration. In our case, it is mostly about having multiple instance types within the clusters and clear architecture patterns regarding their usage.

Those nodepools are managed with autoscalers, meaning their number of nodes grows and shrinks based on the sum of requested CPU/memory of the k8s workloads (Deployments, StatefulSets and so on).

To better spread our workloads over those nodepools, we decided to rebrand them and clarify their usage and specifications, helping service teams better understand the associated constraints and benefits.

Instance costs can vary a lot between preemptible and OnDemand. Picking the wrong typology can lead to reliability issues, if the application is not fault-tolerant but allocated to volatile instances, or to very high costs, if only OnDemand instances are used.

This volatile concept is interesting from an engineering perspective as it requires introducing fault-tolerance mechanisms into our cloud-native applications. On one hand, it comes with a huge discount compared to OnDemand servers. On the other hand, reclamation of Spot VMs is involuntary and is not covered by the guarantees of PodDisruptionBudgets: your applications have 25 seconds to gracefully terminate.
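
To cope with that short notice, a fault-tolerant workload should at least keep its shutdown path within the window. A minimal, illustrative Deployment fragment (names and values are assumptions, not our actual charts):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: async-worker                      # hypothetical fault-tolerant workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: async-worker
  template:
    metadata:
      labels:
        app: async-worker
    spec:
      terminationGracePeriodSeconds: 20   # must fit within the ~25s spot reclamation window
      containers:
        - name: worker
          image: example.io/async-worker:latest   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # stop accepting new work, then leave time to drain in-flight jobs before SIGTERM
                command: ["sh", "-c", "touch /tmp/draining && sleep 5"]
```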

After several iterations, we ended up with the following nodepools:

  • common nodepool: the “per default” place for workloads which don’t request any specific tolerations. The nodes are OnDemand: expensive but reliable over time.
  • datastores nodepool: a dedicated nodepool for data stores, which are more sensitive to CPU & I/O issues and usually have higher memory requirements. The nodes are OnDemand, but covered by a Committed Use plan as we have a mid to long term vision of our datastore infrastructure requirements.
  • bursty nodepool: a nodepool to be used by fault-tolerant, “CPU spiky” and non-latency-sensitive workloads. Those workloads can be scheduled here without CPU limits, thus impacting only other bursty fault-tolerant pods. Spot nodes: cheap but potentially reclaimed by GCP at any time, with only 25 seconds to get prepared.
  • volatile nodepool: used by fault-tolerant workloads, relying on the spot instances feature.
  • volatilefailover nodepool: used as a failover mechanism by fault-tolerant workloads if no spot instances are available. By default, it has 0 nodes and relies on OnDemand servers.
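
To make these pools opt-in rather than accidental landing zones, the volatile and volatilefailover pools can carry the same label and taint, so the same fault-tolerant workloads can land on either of them. The key names below are assumptions, not our actual configuration (GKE itself also labels real spot nodes with cloud.google.com/gke-spot="true"):

```yaml
# Illustrative node metadata shared by the volatile and volatilefailover pools
metadata:
  labels:
    nodepool-type: volatile         # hypothetical shared label targeted by node affinity
spec:
  taints:
    - key: nodepool-type            # hypothetical shared taint that workloads must tolerate
      value: volatile
      effect: NoSchedule
```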

Step 3: Leverage spot instances

Spot is a “kind” of volatile GCP instance (as is preemptible): those instances are cheap but can be reclaimed by GCP at any time, with only 25 seconds of notice. So if all Spot VMs are taken back, how do we ensure platform reliability?

Instead of creating a single nodepool (volatile), we decided to create a second one called volatilefailover having the same labels/taints but based on OnDemand servers. It can be scaled up and down by the Cluster Autoscaler, a standalone program that adjusts the size of a Kubernetes cluster to meet the current needs. It increases the size of the cluster when there are pods that failed to schedule on any of the current nodes due to insufficient resources.

During our Step #1 (“you can only improve what you measure”), we understood that many workloads tolerating preemptible VMs were wrongly scheduled on OnDemand servers (the common nodepool), because of the capacity already available on those instances and the way the GKE orchestrator works.

“From preferredDuringScheduling to requiredDuringScheduling settings”

Regarding Kubernetes machinery, our workloads were initially deployed to preemptible servers with a preferredDuringSchedulingIgnoredDuringExecution setting in their helm chart. However, we cannot really control how the scheduler is going to decide which host will be selected for a given pod. We decided to switch to a stronger pattern based on requiredDuringSchedulingIgnoredDuringExecution settings.

Now, workloads setting the flag volatileInstances: true enable the following scheduler constraints:
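
A minimal sketch of what those constraints can look like in a pod template, reusing the hypothetical nodepool-type label/taint from the sketch above (our actual Helm chart keys may differ):

```yaml
# Pod template fragment enabled by the volatileInstances: true flag
spec:
  tolerations:
    - key: nodepool-type                  # tolerate the taint shared by volatile and volatilefailover
      operator: Equal
      value: volatile
      effect: NoSchedule
  affinity:
    nodeAffinity:
      # "required": the pod can ONLY land on nodes carrying the shared label,
      # so it can never fall back to the common (OnDemand) nodepool
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nodepool-type
                operator: In
                values: ["volatile"]
```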

By default, the ClusterAutoscaler will scale up the volatile nodepool (real spot instances) and not volatilefailover (OnDemand).

“ClusterAutoscaler has an internal pricing expander to always provision servers on the cheapest nodepool.”

By going from the “preferred” to the “required” syntax, we tackled an issue: our volatile workloads can never be scheduled on the common nodepool anymore, even if some free resources are available there. Before this change, we had to set a quite high minNode value on the preemptible nodepool to make sure pods could be scheduled on it. Doing so, we were pushing the kube-scheduler to choose available preemptible nodes (with the “preferred” syntax) rather than OnDemand nodes, but as a drawback, during low-traffic periods this nodepool was not scaling down, leading to over-allocated resources compared to our real usage. With that setup, cost efficiency was quite good, but infrastructure usage and the associated environmental impact were not.

At this stage, we had already managed to reach our platform elasticity objective and enforce spot instance usage. Globally, we went from a static number of nodes in May to an elastic platform composed of many nodepools that scale up & down efficiently.

we don’t live in static provisioning era anymore :-)

Oscillations represent ~20 nodes per day (e.g. at night), which is not negligible! To ensure the migration of all the identified applications, we started an initiative leveraging mutating controllers to auto-patch our Kubernetes resource specifications and transition from preemptible to spot servers.

Here is an overview of a Kyverno policy:
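
The exact policy differs, but a minimal sketch of such a mutation could look like this, matching workloads through their existing preemptible toleration and reusing the hypothetical nodepool-type key from the earlier sketches:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: preemptible-to-spot               # hypothetical policy name
spec:
  rules:
    - name: add-volatile-toleration-and-affinity
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      preconditions:
        all:
          # only mutate workloads that already tolerate the old preemptible taint
          - key: "{{ request.object.spec.template.spec.tolerations[?key=='cloud.google.com/gke-preemptible'] || `[]` | length(@) }}"
            operator: GreaterThan
            value: 0
      mutate:
        patchesJson6902: |-
          - op: add
            path: "/spec/template/spec/tolerations/-"
            value:
              key: nodepool-type           # hypothetical taint key of the volatile pools
              operator: Equal
              value: volatile
              effect: NoSchedule
          - op: add
            path: "/spec/template/spec/affinity"
            value:
              # note: this replaces any existing affinity; a real policy would be more surgical
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: nodepool-type
                          operator: In
                          values: ["volatile"]
```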

TL;DR: for workloads that previously tolerated cloud.google.com/gke-preemptible, the above policy merges in the new tolerations for spot instances and enforces (required) scheduling onto those nodes.

Step 4: Right size your workloads

We all have this habit: when we don’t really know what we need, we request something larger just in case. But after running our apps for a while, we should right-size our workloads and scale the infrastructure up or down when needed.

Most of BlaBlaCar’s services can be horizontally scaled, but to ensure global efficiency and reliability, our right-sized applications have to leverage many of the GKE orchestrator’s subtleties.

Now that our infrastructure can auto-adapt to our members’ activity, it’s time to right size our workloads and request only what is really needed.

Because sizing can be pretty subjective, here are some general guidelines we tried to follow:

  • reliability before costs: we don’t want to risk impacting the reliability of our infra just to reduce costs; right-sizing is “only” about avoiding the waste of unused resources (and thus money)
  • balance engineering time vs savings: resizing can come with significant engineering time (which has a cost), so automate the resizes when possible (HPA for stateless apps, and maybe VPA for stateful ones in the future)

Resource usage can fall into three zones:

  • too high → dangerous 🔴 🔥: risk of performance issues or incidents (e.g. CPU throttling and higher latency when CPU is saturated, risk of OOMKilled) => we don’t want to go there
  • average → safe & efficient ✅✅: that’s the sweet spot that we target
  • too low → wasting resources ⚠️ 💰: sizing is too large, so resources are provisioned but not really used. We can reach this state sometimes, but we want to detect it and scale down to save energy, water and money

At the beginning of the project, our “CPU usage over request” ratio was not really efficient, dropping to around 20% during the night for example.

low cpu usage during low-traffic period

Our working group provided some strong defaults for our different kinds of workloads (stateful/stateless) and proposed how to configure resources and the HorizontalPodAutoscaler to scale our cloud-native applications up & down.

“One of our recommendations is to rely on smaller pods: our objective is to have a CPU utilization that represents 70% of our requested CPU, even during the night”

Regarding sampling challenges, smaller pods lead to better “sampling”, and therefore better autoscaling. For workloads that can scale horizontally, our strong default for the HorizontalPodAutoscaler is to scale up based on a target CPU utilization of 70%. Below is an example of right-sizing for one of our applications, where the requested CPU resources were reduced to enhance infrastructure efficiency.

CPU usage: being cost efficient
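
To make this strong default concrete, here is a minimal sketch of an autoscaling/v2 HorizontalPodAutoscaler targeting 70% CPU utilization; the workload name and replica bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                  # hypothetical service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # keep CPU usage around 70% of requested CPU
```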

Additionally, we identified some opportunities around the GKE scheduler, a control-plane component managed by Google: its default configuration is “balanced”.

“We have also decided to switch our GKE scheduler settings, from balanced to optimized.”

With this change, we leverage bin-packing opportunities and nodes are filled up to 95% of CPU requests.


To support our service teams, we built a complete Datadog dashboard that helps make right-sizing decisions. We leveraged metrics showing sizing efficiency, but also analyzed application elasticity behavior and presented VPA recommendations for workloads that cannot scale horizontally.

Example:

If we have a look at this min/current/max HPA replicas screenshot, we can see that for this application, 25% of the time (meaning 6 hours out of 24) was spent with current == min pod replicas. This means that the service is at the minimum allowed and cannot scale down below that limit. Applying our recommendations, we reduced the requested CPU from 2.5 to 2 (our smaller pods approach), the min replica limit from 3 to 2 and the targetCPUUtilization from 80% to 70%.

As a direct consequence, this service’s minimum footprint on the infrastructure has been reduced from 7.5 CPU (3 pods x 2.5 CPU) to 4 CPU (2 pods x 2 CPU). It is very important to keep in mind that defining minReplicas = 2 does not mean the service will sit at this minimum replication, as long as the scaling settings (metrics, ratios) are well selected.

Finally, the right-sizing itself is only 4 lines of code :-)
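
Those few lines are not reproduced here, but in a Helm values file they could look roughly like the following, applying the numbers from the example above (the key names are chart-dependent assumptions):

```yaml
resources:
  requests:
    cpu: "2"                          # reduced from 2.5 (smaller pods approach)
autoscaling:
  minReplicas: 2                      # reduced from 3
  targetCPUUtilizationPercentage: 70  # reduced from 80
```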

But the devil is in the details :-)

As a last item on reliability, we took the opportunity of this project to fine-tune topologySpreadConstraints for some workloads. The main objective was to better spread our pods over our different nodes, and eventually trigger a scale-up of the infrastructure if the constraint cannot be satisfied (maxSkew attribute).

Find below an example of a Deployment fragment with those settings. Without fine-tuning this setup, the orchestrator could decide to concentrate a given pod’s replicas on only a few nodes, putting the user experience at risk if one of those key nodes were removed by GCP (a situation we faced before, when we deployed our workloads with preferredDuringScheduling settings and did not want to set a required antiAffinity on hostname for services with many replicas).
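
A minimal sketch, where the app label and values are illustrative rather than our actual configuration:

```yaml
# Pod template fragment
spec:
  topologySpreadConstraints:
    - maxSkew: 1                             # at most one replica of difference between nodes
      topologyKey: kubernetes.io/hostname    # spread across individual nodes
      whenUnsatisfiable: DoNotSchedule       # pods stay Pending, which makes the Cluster Autoscaler add a node
      labelSelector:
        matchLabels:
          app: my-service                    # hypothetical app label
```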

On the following screenshot, you can see, for one of our services, how the pods are spread over our different nodes (and how many pods are running on each node).

pods are well spread on different hosts, respecting the topologySpreadConstraints

“More, smaller pods, well spread, with suitable HPA settings to manage our traffic drops and spikes efficiently and align allocatable and requested resources following a reliability objective”

Conclusion

To avoid falling into the pitfalls of cloud architecture, we have worked on many axes:

  • measuring what we want to improve with a strong data-driven analytical approach
  • choosing the most efficient instance types and size for our apps
  • autoscaling cloud resources to handle members’ activity spikes and drops
  • leveraging spot instances adding a failover mechanism for reliability purposes
  • fine-tuning our backup and disk snapshot policies, enforced by our snapshot operator, to reduce unnecessary expenses
  • eliminating resources that aren’t being used (stale resources are now under control, like detached PersistentVolume for more than N days: our approach could be presented in another article :-))

Our approach, “you can only improve what you measure”, helped us understand the benefits of the project! We improved our CPU sizing efficiency (from 25% to 53%) and avoided wasting resources thanks to a more elastic infrastructure. Resiliency, infrastructure usage, and as a consequence our costs have all been improved. This setup allowed us to handle the summer period without any issue on our bin-packed, optimized GKE workers.

This project was quite challenging because of its transversal aspect, but we succeeded and started many interesting discussions with our engineers to integrate their apps in the best conditions. Through a continuous improvement approach, the follow-up sessions allowed us to gather feedback that was integrated into the right-sizing framework.

“Right sizing is a never ending process :-)”

When service teams introduce new features (or decommission old ones), resource usage can change and requested resources should be fine-tuned accordingly. Our rituals (at the team level but also at the engineering level) should help keep this process under control.

This project was successful thanks to strong cross-team collaboration, leading to a triple-win situation for BlaBlaCar: for resiliency, for the CO2 reduction objective and for billing costs :-)

Special thanks to Nicolas Salvy, Aurelien Beltrame, Guillaume Wuip, Frédéric Gaudet, Maxime Fouilleul, Rudy Deydier & Florent Rivoire for the review !
