Ara Pulido

Kubernetes at Datadog Scale by Ara Pulido

Container technologies, although not new, have grown in popularity over the past few years, with container orchestrators allowing companies around the world to adopt them to ship and scale microservices with precision and velocity.

Kubernetes is currently the most popular container orchestration platform, and the one chosen by Datadog to run its infrastructure. We run dozens of clusters, with thousands of nodes, on different public clouds. How are our 1,000+ engineers able to use this infrastructure platform successfully?

Join me in this talk for our story on what we learned while we scaled our Kubernetes clusters, the contributions to Kubernetes we made along the way, how we are building a development platform around it, and how you can apply those learnings when growing your Kubernetes clusters from a handful to hundreds or thousands of nodes.

Talk Questions

  • Question 134
    For what use case are you using direct Pod to Pod communication, instead of Application to Application communication?
  • Question 135
    Kubernetes has some (undoubted) complexity. How did you get all your engineers up to speed with the core concepts they need to understand the system they are working with? Workshops? Peer groups? ...
  • Question 131
    How many people maintain the Kubernetes infrastructure?
  • Question 132
    You mentioned that you have 1000+ nodes per cluster. Is this a natural number, or do you "limit" your cluster size to avoid any performance issues (which could arise with 5k+ nodes)?
  • Question 138
    How do you bridge the gap between a developer who only wants a new app and the knowledge of what is needed in the node group they are creating?
  • Question 133
    What requirement led you to self-manage Kubernetes and not go for managed? Anything related to networking?
  • Question 130
    Did you consider Nomad when moving to orchestration?
  • Question 145
    In terms of accounts, where do nodes managed by dev teams or the platform team run? A central account, or the teams' accounts? How do costs and governance work?
  • Question 137
    What's the average size of Kubernetes clusters at Datadog? Is it preferable to go with smaller ones (and treat them as volatile resources) or bigger but more stable ones?
  • Question 136
    With >10k pods/cluster scale, have you considered using IPv6 as the network layer to reduce the overhead of NAT, ALGs and such?
  • Question 140
    Have you considered Karpenter as a cluster autoscaler? AFAIK it is also multi-cloud and seems to be getting more relevant in the community.
  • Question 142
    Is nodeless just a default node group definition that you abstract developers from?
  • Question 146
    Could you give some examples of how Datadog’s control plane enables developers to work more efficiently? (tools, API, CI/CD, etc.)
  • Question 143
    When you automatically update your pod resources through the vertical autoscaler, how do you manage the IaC?