Almudena Vivanco

Drowning on Metrics or why Jack could have fit by Almudena Vivanco

How to properly size a service through performance testing and take those metrics into production. The key lies in Observability.

In the last 5 years, the Lidl Plus product has grown from 2 stores in Zaragoza to 13,000 stores across Europe, and from 100,000 users in 2018 to 90 million in 2024. To carry out this titanic work in an organized and budget-friendly manner, emphasis was placed on two points:

Monitoring and Observability

Performance Testing

Basic monitoring transitioned to a culture of Observability, which not only provided visibility into system metrics but also into the complete flow and user experience. When we talk about observability, we no longer talk about isolated systems but about understanding what happens as a whole.

Performance testing was highly relevant throughout the rollout period: the volume each country would bring was inferred from the number of tickets coming in from its stores. Performance tests were conducted for each critical product, and end-to-end tests were run continuously to measure the user experience of the Lidl Plus app.
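As a rough illustration of inferring load from store tickets, the sketch below turns a per-country ticket estimate into a target request rate for a performance test. It is only a minimal sketch under stated assumptions: the `CountryProfile` fields, the `peak_requests_per_second` helper, the headroom factor, and every number are hypothetical and are not Lidl Plus data or the team's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryProfile:
    """Hypothetical inputs for sizing one country's rollout."""
    name: str
    stores: int
    tickets_per_store_per_day: float  # assumed average, not real data
    app_share: float                  # fraction of tickets from app users
    requests_per_ticket: float        # backend calls triggered per ticket
    peak_hour_share: float            # fraction of daily traffic in the busiest hour


def peak_requests_per_second(profile: CountryProfile, headroom: float = 1.5) -> float:
    """Estimate the request rate a load test should target.

    The busiest hour's requests are averaged over 3600 seconds and
    multiplied by a safety margin (headroom).
    """
    daily_app_tickets = (
        profile.stores * profile.tickets_per_store_per_day * profile.app_share
    )
    peak_hour_requests = (
        daily_app_tickets * profile.requests_per_ticket * profile.peak_hour_share
    )
    return peak_hour_requests / 3600 * headroom


# Illustrative values only.
example = CountryProfile(
    name="ExampleLand",
    stores=100,
    tickets_per_store_per_day=1000,
    app_share=0.3,
    requests_per_ticket=4,
    peak_hour_share=0.15,
)
target_rps = peak_requests_per_second(example)
print(f"{example.name}: target about {target_rps:.1f} requests/s")
```

Once a country is live, the same inputs can be replaced with observed values, which is one way to check whether the original workload hypothesis held.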

We lacked real-time visibility from the application to the backend. Over the past 5 years, we have worked on that traceability to measure the "happiness" of our users, moving from tools like Firebase or Dynatrace to the current solution based on OpenTelemetry.

We will show the current stack and how it lets us infer performance data for a product before it goes into production, validate workload hypotheses, and use feedback from production to improve the tests.

Talk Questions

      
  • Question 836
    What tools do you recommend for running load tests? Especially free ones.
  • Question 838
    Did you think about implementing some kind of queueing mechanism (e.g. "you will be served in x minutes") for clients to manage peak requests, like many other companies do during Black Friday?
  • Question 839
    What tool do you use for load testing? k6, Locust?
  • Question 835
    Did you consider using the card number to identify if there were repeated users on the same day?
  • Question 840
    Do you apply any special scaling policies or configurations to your infrastructure before peak business periods?
  • Question 841
    Which load testing tool do you prefer?
  • Question 837
    Don't you apply a "code freeze" before critical campaign periods?
  • Question 842
    Instead of dropping metrics, why not store a small percentage, for example 5%, to check how things behave or at least keep a small sample?
  • Question 843
    Do you use Grafana Cloud or on-premises Grafana?