Get ready for three days full of mind-blowing talks. One single track so you won't miss anything, the best speakers you could ever imagine and the most exciting content!
With an investment of 4.1bn in technology, I want to tell you about the awesome innovations Nationwide Building Society UK has made and the challenges we have overcome as a financial services institution, applying DevOps the right way and building a cloud platform. But wait! Let’s make it raw: I’ll tell you about the pain and the politics, and how we became huge consumers of K8s & ChatOps. Plus, find out how we made a trading card game for the office that ended up driving some phenomenal behaviours! And why do almost all of our platform engineers write code?
Nationwide Building Society UK has invested 4.1bn in the future of technology for the society. Cloud is a sensitive word when you’re a financial services organisation; it’s even more delicate when you’re regulated by the Financial Services Authority and the Prudential Regulation Authority.
I would love to tell you the story of how one of the most risk-averse financial services institutions navigated cloud adoption, the technologies we have used and how, including Kubernetes. Everyone has heard of the Spotify model; if one thing is clear, it’s that you shouldn’t adopt it. Let me tell you how we took agile to the next level, and what was right for us at our scale and maturity.
What does a trading card game have to do with agile or operating a cloud platform at scale? I don’t want to give everything away just yet, but you should listen to the talk and think about giving it a shot yourself.
ChatOps has been a huge technology enabler for us. I’ll talk you through some of the stuff we’ve built into our bot, such as automatic postmortems and diary management, dashboards at a glance, and how we automated our ingress kill mechanism and used Shamir’s secret sharing algorithm to distribute the key.
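For readers unfamiliar with Shamir's secret sharing, here is a minimal Python sketch of the general technique. It is not the implementation described in the talk: the field modulus, the function names `split_secret` and `recover_secret`, and the 3-of-5 threshold are all assumptions chosen for illustration. The idea is that the key becomes the constant term of a random polynomial, each holder receives one point on it, and any `threshold` points are enough to recover the key.

```python
import secrets

# Assumption for this sketch: work in the prime field GF(2**127 - 1).
PRIME = 2**127 - 1


def split_secret(secret, threshold, shares):
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, reduced modulo PRIME at every step.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points):
    """Lagrange interpolation at x = 0 over GF(PRIME)."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    key = secrets.randbelow(PRIME)
    holders = split_secret(key, threshold=3, shares=5)
    print(recover_secret(holders[:3]) == key)
```

With fewer than `threshold` shares the interpolation yields an unrelated value, which is what makes the scheme suitable for spreading a kill-switch key across several people.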
How do we organise our squads, manage demand flow and ensure we’re delivering? Why didn’t Scrum work for us, and why does Kanban make much more sense? How we built, and chat-op’ed the hell out of, the on-call rota using Twilio and Slack! Demo!!! Observability is nice as a concept, but how did we drive adoption when we’re a platform? DevOps Dojos! Why do we encourage our engineers to take a couple of days out a month to teach?
We have a lot to cover but I’m sure you will find this insightful and interesting: FS orgs usually move like frozen snails, and we’re dancing like Neo from The Matrix on a red pill.
ChatOps demos included; they are pretty quick, which makes them excellent demos and seeds of inspiration.
In a perfect world we would run monolithic systems on machines with unlimited CPU, Memory, Disk and Network IO, and with 100% reliability.
The reality is that such a machine does not exist, and to deliver the performance and availability our users demand from our systems, we have to be creative in partitioning workload over multiple machines, geographic regions, and failure domains. This approach introduces complexity to our systems, so we expect failure and design accordingly.
In this talk, Nic will walk through the areas of complexity in a system. We will then look at what patterns you can employ to ensure performance and availability even in a failing world.
This demo-driven talk will showcase the patterns that can help to adopt a distributed architecture for your applications, especially when moving from a monolithic architecture to a distributed one.
Machine-learning systems have become increasingly prevalent in commodity software systems. They are used through cloud-based APIs or embedded through software libraries. However, even though ML systems look like just another data pipeline, they make systems more sensitive and might put system health at risk without proper controls.
Through discussions with engineers engaged in deploying and operating ML systems, we arrived at a set of principles and best practices. These range from input-data validation (for fairness and quality in training), contextual alerting, and deployment and rollback policies to privacy and ethics. We discuss how these practices fit in with established SRE practices, and how ML requires novel approaches in some cases. We look at a few specific cases where ML-based systems did not behave as traditional systems did, and examine the outcomes in light of our recommended best practices.
What's all this fuss about ethics? And why should we care?
In the fastest-paced environments, where we're asked to deliver features as quickly as we can think of them, it's not easy to take a step back, stop for a minute and pause to consider the ethical implications of what we're doing.
Yet awareness of these concerns has been growing lately.
Let's explore what we can expect to get from it, and why we would go down this road of endless questions and very few answers.
DevOps and Agile are great at speeding up development and providing quick results. However, most organisations struggle to transition and adapt their team and leadership structures, which is made of FAIL.
In this talk I'll go through some of the experiences I had and some interesting ways to solve these problems, while keeping it light so you don't have to run for coffee.
In this talk, we show the audience how we combined our experience as an enterprise SaaS provider with our experience as an OpenShift IaaS provider to create an easy-to-use, easy-to-update and easy-to-roll-back `Infrastructure as Code` Toolchain/Monitoring solution.
This solution tackles updating, misconfiguration and diversification issues by using Docker images in combination with OpenShift.
We further discuss our self-service system, which enables us to decrease maintenance and to avoid manual steps by automating the deployment of large numbers of instances.
We will then discuss the idea of saving all configuration inside version control, while totally ignoring the users' UI changes.
In this talk I will explain how to integrate JMeter into your continuous integration system.
I will demo how to create a stack with Grafana, InfluxDB and JMeter in Docker to show results in real time, and how to create more legible tests in YAML format thanks to BlazeMeter Taurus.
Transitioning a monolith to distributed microservices in the cloud is an excruciating process.
A stateless API Gateway can help you preserve your existing API contract while developers chop the monolith into different microservices, and publish the new specification transparently.
KrakenD API Gateway is well-tested open source software with no external dependencies or moving pieces. A cluster will facilitate a journey to cloud and microservices without any supporting databases or single points of failure.
The API Gateway integration with Prometheus, Jaeger, Zipkin and other logging, metrics, and tracing systems will keep the system behavior observable at all times.
During this presentation, we are going to discover the architecture behind this design, how to create the endpoints in a declarative way (no coding), and HORROR STORIES!
Kubernetes has become the standard tool to deploy applications both in the cloud and in on-premises data centers.
This has been possible thanks to its powerful API and its model that allows us to describe the lifecycle of an application.
In this talk we are going to explain how Kubernetes works internally, showing what Kubernetes controllers are and how they work. Once we understand that, we will learn how we can extend the features that Kubernetes already offers to tailor it to our needs, and how big companies are investing in projects and tools that are based on this mechanism.
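As a rough illustration of the controller idea, the following Python sketch shows a reconcile step that compares desired state with observed state and returns the actions needed to converge. It deliberately uses an in-memory stand-in rather than the real Kubernetes API: the `Cluster` class, the `reconcile` function and the pod naming scheme are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    """In-memory stand-in for the API server (an assumption of this sketch)."""
    desired: Dict[str, int] = field(default_factory=dict)        # name -> replicas wanted
    running: Dict[str, List[str]] = field(default_factory=dict)  # name -> pods observed


def reconcile(cluster: Cluster, name: str) -> List[str]:
    """Drive the observed state towards the desired state; return the actions taken."""
    want = cluster.desired.get(name, 0)
    pods = cluster.running.setdefault(name, [])
    actions = []
    # Too few replicas: create the missing ones.
    while len(pods) < want:
        pod = f"{name}-{len(pods)}"
        pods.append(pod)
        actions.append(f"create {pod}")
    # Too many replicas: delete the newest ones first.
    while len(pods) > want:
        actions.append(f"delete {pods.pop()}")
    return actions


if __name__ == "__main__":
    cluster = Cluster(desired={"web": 3})
    print(reconcile(cluster, "web"))  # creates three pods
    print(reconcile(cluster, "web"))  # nothing left to do
    cluster.desired["web"] = 1
    print(reconcile(cluster, "web"))  # scales back down
```

The property to notice is idempotence: running the step again once the state has converged does nothing, which is what allows a controller to run the same loop on every change event.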
Gone are the days of hand-typing commands into network devices one by one: the same benefits of Ansible seen on compute nodes can now be extended to the network nodes.
Through automation, CI/CD is not an application development concept anymore, but an enterprise culture that can also be extended to the network discipline. Learn about the difference between traditional and next-gen network operations, and why network teams need automation.
Tired of wrapping pickled models in server logic? Me too! The biggest bottleneck in delivering machine learning services is the handover from data science to engineering.
Providing scientists with a predictable and safe way to deploy their own models will decouple the engineering and data science efforts, allowing each department to focus on delivering value instead of completing repetitive tasks.
This was achieved by exposing a pub/sub interface where data models just have to register as services and they are automatically exposed to the rest of the organisation.
What do you do when you must monitor the whole infrastructure of the biggest European hosting and cloud provider? How do you choose a tool when the most used ones fail to scale to your needs? How do you build a metrics platform to unify, reconcile and replace years of fragmented, partial legacy solutions?
In this talk we will relate our experience building and maintaining OVH Metrics, the platform used to monitor all OVH infrastructure. We needed to go to places where most monitoring solutions hadn’t gone before: the platform had to operate at the scale of the biggest European hosting and cloud provider, with 27 data centers, more than 300k servers (bare metal!), and hundreds of products to fulfill our mission to host 1.3 million customers.
You will hear about time series, about open source solutions pushed to the limit, about HBase clusters operated at the extreme, and about how a small team leveraged the power of a handful of open source solutions and lots of coding glue to build one of the most performant monitoring solutions ever.
Kubernetes-based PaaS offerings are becoming more and more popular in the enterprise. This means large enterprises need to accommodate the new paradigms and integrate them into their current non-agile (and very often ITIL-based) operational model.
DevOps can help traditional IT operational groups make a smooth transition. In this process, automation technologies need to be considered to provide an automation API for every group and also, very importantly, to help clarify the chain of responsibility and improve traceability.
The goal of the session is to provide an overview of a modern cloud operational model for Kubernetes in a complex scenario. We will introduce Ansible as the key automation technology to provide an API for every context of the model and show how we can implement a Kubernetes life-cycle on OpenShift.
And finally, we will also discuss how this can still match the ITIL operational model.
You might be asking yourself why you would migrate in the first place. I will answer that in a bit; first, let’s talk briefly about the cloud.
Since the advent of the "cloud", and with the rise of multiple players in this space (Azure, Google Cloud, AWS), tools have made their appearance too. Tooling is indispensable for automation, and automation is fundamental for scaling. Tooling will also allow you to do something else: avoid vendor lock-in in favor of a more flexible concept, cost of migration. You can apply this very same concept when you are not as advanced or are working with a minor or mid-sized player in the cloud provider arena, and it gives you the possibility to move to another provider that better suits your needs.
Why would you migrate? There could be several reasons: cost saving, service level and in our case, lack of tooling.
In this talk I’ll cover a one-and-a-half-year journey at Packlink, where we migrated from one cloud provider to another and along the way went from:
I’ll let you know about the problems we found, the solutions we implemented and the tradeoffs we made.
Mobile app releases are a manual, tedious and error-prone process when done at scale.
It is not possible to follow a continuous delivery (CD) approach because binaries need to be submitted to the stores, and once published they can’t be rolled back. Moreover, mobile releases can take several days to complete.
In contrast, we deploy the web version of Shopify around 50 times a day. We identified this gap between web and mobile releases as a problem.
In this talk, I’ll explain how my team built a continuous integration (CI) system for our mobile apps (Android and iOS), and how we used this infrastructure as a foundation to build a Ruby on Rails application and other tools to automate and orchestrate releases to Google Play and the App Store, reducing the need for human interaction as much as possible.
We’ll see the challenges of standardizing the release process and the coordination with third-party services like GitHub and the store APIs, and how we made it easier for developers to test the apps.
At N26, we want to make sure we have resilience and fault tolerance built into our backend service-to-service calls.
Our services used a combination of Hystrix, Retrofit, Retryer, and other tools to achieve this goal. However, Netflix recently announced that Hystrix is no longer under active development.
Therefore, we needed to come up with a replacement solution that maintains the same level of functionality. Since Hystrix provided a big portion of our HTTP client resilience (including circuit breaking, connection thread pool thresholds, easy-to-add fallbacks, response caching, etc.), we used this announcement as a good opportunity to revisit our entire HTTP client resilience stack. We wanted to find a solution that consolidated our fragmented tooling into an easy-to-use and consistent approach.
This talk will share the approach we are currently implementing and the tools we analyzed while making the decision. Its aim is to provide backend devs (primarily working on JVM languages) and SREs with a comprehensive view of the state of the art for service-to-service call tooling (Resilience4j, Envoy, gRPC, Retrofit, etc.), mechanisms to improve service-to-service call resiliency (timeouts, circuit breaking, adaptive concurrency limits, outlier detection, rate limiting, etc.) and a discussion on where these mechanisms should be implemented (client side, side-car proxy, server-side side-car proxy or server side).
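To make one of these mechanisms concrete, here is a minimal client-side circuit breaker with a fallback, sketched in Python. It shows the general pattern only, not N26's implementation or the Hystrix or Resilience4j API: the `CircuitBreaker` class, its parameters and the state names are assumptions made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the circuit is open and no fallback was supplied."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self._threshold = failure_threshold  # consecutive failures before opening
        self._reset_timeout = reset_timeout  # seconds to wait before a trial call
        self._clock = clock                  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, fallback=None, **kwargs):
        if self.state == "open":
            # Fail fast without touching the struggling downstream service.
            if fallback is not None:
                return fallback()
            raise CircuitOpenError("circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._failures >= self._threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            if fallback is not None:
                return fallback()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id, fallback=lambda: cached_profile)`, where `fetch_profile` and `cached_profile` are placeholders. Whether this logic lives in the client library or in a sidecar proxy is exactly the placement question the talk raises.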
It seems clear to everyone that availability and performance are the main concerns when monitoring a platform.
But... what happens if suddenly you discover that you were hacked some weeks ago and you have intruders on your servers?
It has always been scary, but now it might affect personal data, and the GDPR fine can be much more than "scary".
Do not ask "if" your servers can be hacked; ask "when" it will happen instead. And my question is... how long will it take you to notice that you are being, or were, hacked?
In this talk we want to show the architecture we have in place to monitor several different platforms from several different websites (such as Infojobs, Fotocasa, Milanuncios, Vibbo...), with distributed teams and diverse technologies, using a pragmatic approach that requires a very reasonable amount of effort and money. We will explain the options when using an open source but mature component such as OSSEC, combined with commercial software such as Splunk, while heavily optimizing the costs. We will also look at the components that AWS provides to detect attacks and intrusions, covering both the success cases and the not-so-successful ones. We will address monitoring live HTTP requests, log analysis, and intrusion detection on servers and also on the network.
Our global approach: monitor good-quality events rather than gathering large quantities of "simple" logs.
All this with an expense lower than 1K per month, monitoring platforms with millions of monthly users. So it can work for both big and small pockets!
DevOps has been growing in popularity in recent years, particularly in (software) companies that want to reduce their lead time (the time from idea to business value in production) to days or weeks instead of months or years.
If you want your software to do the right things and do these things right, you need to test it relentlessly.
The big problem is that companies see the testing phase as the bottleneck of the process (and it is), slowing down product releases. To change that, we need a new way of testing our applications, making the release of an application a testing process as well, and involving QA within the team from the beginning. QAs are not a separate team anymore (DevTestOps).
What is the role of QAs in this new approach? How is the testing pyramid affected? How can you fail when trying to speed up release frequency?
In this session, we will not only describe but also actively demonstrate several techniques that you can use immediately following the session for testing applications like unicorns.
At Giant Swarm, we have been running Kubernetes clusters for enterprises for more than two years. Our approach is leveraging the Kubernetes philosophy to control the entire lifecycle of our managed clusters.
We have built a control plane cluster, which takes care of maintaining the tenant clusters in the state our users have defined.
In this talk, I will go through the key components of our design and how we apply DevOps practices to deliver value fast in a highly dynamic environment.
Operating a complex distributed system such as Apache Kafka can be a lot of work: so many moving parts need to be understood when something goes wrong.
With brokers, partitions, leaders, consumers, producers, offsets, consumer groups, etc, and security, managing Apache Kafka can be challenging.
In this talk we will review common issues and mitigation strategies, seen from the trenches while helping teams around the globe with their Kafka infrastructure: from managing consistency and numbers of partitions, through understanding under-replicated partitions, to the challenges of setting up security, among others.
By the end of this talk you will have a collection of strategies to detect and prevent common issues with Apache Kafka. In a nutshell: more peace and more nights of sleep for you, and more happiness for your users, which is the best-case scenario.
In any Cloud Native architecture there’s a seemingly endless stream of events that happen at each layer. These events can be used to detect abnormal activity and possible security incidents, as well as providing an audit trail of activity.
In this talk we’ll cover how we extended Falco to ingest events beyond just host system calls, such as Kubernetes audit events or even application level events. We will also show how to create Falco rules to detect behaviors in these new event streams. We show how we implemented Kubernetes audit events in Falco, and how to configure the event stream.
Finally, we will cover how to create additional event streams leveraging the generic implementation Falco provides. Attendees will gain a deep understanding of Falco’s architecture and of how to customize Falco for additional event sources.
Most organizations feel the need to centralize their logs — once you have more than a couple of servers or containers, SSH and tail will not serve you well any more. However, the common question or struggle is how to achieve that.
This talk presents multiple approaches and patterns with their advantages and disadvantages, so you can pick the one that fits your organization best:
* Parse: Take the log files of your applications and extract the relevant pieces of information.
* Send: Add a log appender to send out your events directly without persisting them to a log file.
* Structure: Write your events in a structured file, which you can then centralize.
* Containerize: Keep track of short lived containers and configure their logging correctly.
* Orchestrate: Stay on top of your logs even when services are short lived and dynamically allocated on Kubernetes.
Each pattern has its own demo with the open source Elastic Stack (previously called ELK Stack), so you can easily try out the different approaches in your environment, though the general patterns are applicable with any centralized logging system.
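To contrast the "Parse" and "Structure" patterns from the list above, here is a small Python sketch that is independent of any particular logging stack. The log line layout, the field names and the helpers `parse_line` and `structured_event` are assumptions made for illustration, not something the talk prescribes.

```python
import json
import re

# Parse pattern: recover fields from a free-text line after the fact.
# The "<timestamp> <LEVEL> [<logger>] <message>" layout is an assumed format.
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) \[(?P<logger>[^\]]+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Extract fields from a plain log line, flagging lines that do not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return {"message": line, "parse_error": True}
    return match.groupdict()


# Structure pattern: emit the event as JSON up front, so nothing needs parsing.
def structured_event(timestamp, level, logger, message, **fields):
    """Serialize one event as a single JSON line, including any extra fields."""
    event = {"timestamp": timestamp, "level": level, "logger": logger, "message": message}
    event.update(fields)
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    print(parse_line("2019-06-04T10:15:00Z ERROR [checkout] payment failed for order 42"))
    print(structured_event("2019-06-04T10:15:00Z", "ERROR", "checkout",
                           "payment failed", order_id=42))
```

The trade-off is visible even at this size: the parser breaks when the line format changes, while the structured event carries typed fields such as `order_id` at the cost of changing how the application logs.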
“Data drives us” is one of the most used mottos across organizations. Most of us would like to make our decisions data-driven, because that gives us more confidence in what we have decided. This is not always easy.
It’s even worse when a bunch of people need to agree on the decision, because each of them may bring their own data. Comparing apples to oranges is challenging at best and really hard at worst.
Latency Map was born to help the business make better decisions by bringing latency metrics to the stage. If you manage to make this data the source of truth, you have nailed it.
This is the first step towards stopping never-ending discussions about whether your measurement is better or worse than mine. At Adevinta, we have used Latency Map to decide on the best cloud regions to deploy our services while minimising end customer latency.
We’ll be glad to share our journey, including the tech stack we used, options we dismissed, lessons learnt, and some actual findings from the latency-measurement data we are collecting.
It is the journey that allowed us to come up with a complete product, and its outcome, actual latency data, may help you make the best decisions on your trip to the cloud.
Have you ever heard the saying ‘one apple a day keeps the doctor away’? It makes each one of us responsible for taking a small action that should improve our lives.
If we took this to the DevOps world, the proverb would be brought to us by DevSecOps. It adds security to the process and shifts security from reactive to proactive. It makes each team member responsible for the security of the development, the platform and the deployment; in short, of the entire product.
Eating an apple would be way too easy, and that’s not what we are here for. We are not conformists; we are adaptable and ready to take action based on these terms:
* Teams: everyone is responsible. We must break down the barriers between us: no more traditional silos of expertise, because building and deploying securely is everyone's concern.
* Process: teamwork is encouraged, so that we never again hear “that’s not my problem”.
* Technology: we need to fight against technical security debt, because that’s our ticket to ending up in the news.
To sum it all up, security sets the requirements and DevOps manages the frequency of scans according to the development practices. We will see how to assess the level of maturity of our organization, what metrics we should review and which warning signs to look for before it is too late for an ‘apple a day’ and our company makes the front page.