Transcript
Wanielista: Welcome everyone to my presentation on adopting continuous deployment at Lyft. Let's talk about bad deployments. Maybe someone you know has had a bad deployment. Maybe it's happened to you. Let's face it, everyone's had a bad deployment here and there. It goes something like this. You start deploying your code. It seems to be working pretty well. Then something happens later in the deployment that causes your service to go down. Maybe it's a memory leak, and all of your instances start to run out of memory, and they all need restarting right around the same time. It's not great. Your service is down and you have an issue. Maybe about a week later, you have another one of these bad deployments. Maybe it's not as severe, but you're able to fix it pretty quickly. Maybe the week after that, or even the next day, you have another deployment, and this time it's a little bit less clear what's going wrong, but you know that the latency spiked up. You have 10 commits that you just deployed, and you can't quite figure out which one might have caused this latency spike. Maybe customers just start seeing some bad issues on edge cases here and there. You just don't quite know what to do, and you have to revert each one of these commits to figure out which one's the problem. Maybe eventually, you do figure it out. You have postmortems for all these, and you start talking about ways to mitigate this. How is it that we keep ending up in the situation where we have these bad deployments? Maybe you might hear the dreaded words of a deploy freeze. Maybe the reasonable action here is every time we deploy, something bad happens, so let's just deploy less often. Let's freeze this up, make sure things are stable, and then we could go ahead and deploy like we do normally.
Background
The point of my talk is, what if we did the opposite? Instead of freezing, we sped up deployments. We do them more often. We thaw them out. That's the idea behind continuous deployment: can we do basically the opposite, which is deploy more often, so that when we do deploy, it is not a painful experience anymore? My name is Tom Wanielista. I've been on the infrastructure team here at Lyft for about four years now. The majority of my time here has been spent on this problem: improving the ability to make changes in the infrastructure, and making that a safer thing to do. There's a lot more available online. There's a blog post by my colleague, Miguel, where he covers some other details, https://eng.lyft.com/continuous-deployment-at-lyft-9b457314771a. This talk is more of a behind-the-scenes look at how we made this happen. If you want more information, go ahead and search for this post online, and you should be able to find it on our engineering blog.
Outline
We'll cover why we needed continuous deployment and how we got to continuous deployment. We'll cover some of the technical aspects of our design. Then we'll cover the cultural aspects, which are also very interesting and very critical for doing something like this, how we rolled out continuous deployment, and the effects. What are the results that we see after adoption?
What Is Continuous Deployment?
Let's talk about continuous deployment because it's easy to mix this up with continuous delivery. There is a difference between the two, and I decided to pick this AWS article online as the source of truth for this. The difference essentially is, for continuous delivery, everything is automatic, starting from the merge to the default branch. Your CI tests are automatic. You deploy your staging maybe automatically. Maybe you deploy your canary automatically. For the final deployment into production, you have to have a manual approval for that to happen. Continuous deployment is the same thing all the way up to the end where there's actually no human being involved whatsoever. Everything is completely automated and automatic. That's what we're going to be talking about, continuous deployment.
Why We Needed Continuous Deployment
Let's talk about why we needed continuous deployment. I'll take you back a little bit, probably about four years ago, around the time I joined Lyft. This is the user interface a typical on-call, or someone who is deploying, would use to deploy their service at Lyft. We had this Jenkins pipeline. Everything was built on Jenkins. A user who was going to deploy their code would basically go into the Jenkins user interface. This is an example of a completed pipeline. Essentially, if it was not completed, there would be little buttons to start a deployment. Each one of these is an individual deployment. There's the staging job, the canary, and the single AZ job, which is not interesting here; you could just consider it 20% of production. Then we have production, which is 100% of production. A user would essentially click through each one of these steps at the right time to make the deploy happen. Then depending on the result of that, they would maybe stop their deployment or roll it back.
That's not all, because before they started using this UI, they would actually have to take the ball. What that means is they would have to go into Slack, go into the channel that the service is home to, talk to one of our Slack bots, and take the ball on that project. Essentially, users take the lock on a certain project or a service, and only then can they deploy it. There was nothing enforcing this, it was just best practice to do this. You could, in theory, just deploy without taking the ball. People were so used to doing it, and it was considered a good practice, since everyone was aware that a deployment was happening if you would go into a channel and take the ball. My understanding is that before Lyft had multiple services, there was one big service that made up the entire system. There was an actual physical ball, and if it was on your desk, you were good to deploy; that became digital. That's not all, either, because as you were deploying, it was on you to make sure that the metrics looked right. These are service metrics. Things like, is your latency spiking up? Is your memory usage spiking up? Is your CPU usage spiking up? What are the health metrics of your service that are important? Are you processing your queue fast enough? Have those been affected by the code changes that you just deployed?
Then, in addition to all of that, you also had to consider business analytics. If you had a very critical service that might be part of the main user path for the app, you might cause an issue that increases the number of cancellations users make, and that's not great. You have to keep your eye on the business analytics and the metrics, and make sure you're taking the ball. It's a lot to handle if you want to just deploy your code. This stressed people out. It was a big yikes. Every time you had to deploy something, especially a critical service, people were really scared to do it because it was really fraught with potential problems. What's important to know here is that this was happening many times a day. We had at this point hundreds of services. Therefore, there were many deploys happening at a time from different teams, and this just produced a number of stressed out engineers.
Results
What are the results of this type of system? Anecdotally, because we never really measured this, deploying essentially took half of a developer's day. The typical workflow would be, get in in the morning, get ready to deploy your changes, see what the changes are. Maybe you go have lunch, and then after lunch, you would go ahead and do your deployment. Basically, the second half of your day would be gone. Most importantly, this resulted in what we call deploy trains, where many commits were deployed at once. What I mean by this is that taking the ball and having to have specialized understanding of the service to deploy it took time, and therefore it wasn't done often. You would wait until there were either enough commits to deploy, enough changes to deploy, or important changes to deploy, and then you would deploy.
What this resulted in is that the changes that would go into production at the same time were many commits. They would be commits from many different authors, each of them changing different components of the service. There were all these changes mixed up together that would go into one deploy into production. This was forced by the workflow that had been established as a best practice, which is taking the ball, deploying the big change, then dropping the ball and letting someone else pick it up if they need to deploy something later. This can cause a problem because if you have many of these uncoordinated changes, you put them all together. You deploy them out into production. If you have an incident, you can get into a situation where you don't know which change caused the issue, and you might have to revert certain things or just make a guess, start bisecting, while you have a production incident. That's not a great time. To tackle this problem and the deploy experience in general, there was a team formed around 2018, around the time I joined. I joined onto this team directly, six years into the founding of the company. The team was focused entirely on improving the deploy experience, and its north star was to reach continuous deployment.
Goals
Let's talk about the goals for this team and for our project here. We knew what we wanted. We knew that we wanted to have continuous deployment. What we wanted as our north star was to have discrete changes going out individually. We figured, if we managed to just make it so that one commit was going to production at a time, that would be an improvement, because if there was an incident, you know which commit to roll back, or what to revert, or what went wrong. We figured that was an obvious, objective improvement, and therefore that was our target. Then, in addition, we wanted to make sure that teams felt comfortable shipping often. This buildup of commits was essentially the wrong approach. We wanted to make it ok to ship at any time, ship quickly, and ship frequently. Then in addition to all this, we wanted a new user experience. We knew that Jenkins wasn't going to cut it. It requires manual intervention. You have to click things in there. We instead wanted to provide a system with ambient observability, where you're aware of what's going on in your deployment pipeline, but you don't really have to be there babysitting it. Things are just happening, and if something goes wrong, you'll be notified.
Design Tenets
For the design tenets of our system, we had these three major tentpoles. We needed to make sure it was automated. We wanted to have as little user interaction as possible. The system should be self-healing. The goal was that you merge into the default branch for your project or your service, and you could just walk away from the computer; it should do the right thing. The other aspect was that we wanted the system to not only scale with the number of services, which was constantly growing, but also be extensible and pluggable. The idea here was that we knew we had many services, and there was probably no way we could make it safe for all of them 100% of the time. Maybe we could get to a good enough point for all the services, and then teams could make up the difference and get it to 100%. Teams know their service best. That's not going to go away. They should be able to codify the important metrics, things like that, that matter to their service, and plug that into the deploy system. Then also, ultimately, we wanted it to be responsive. It really is not enough to just have the system be automated. We didn't think it would be ok to have the system be automated but slow, where you merge something into the default branch and it gets deployed maybe six hours later, maybe the next day, because the tests take so long. That wasn't enough. We think it's really important to deploy as quickly as you can, because it is at the time that you merge that you still have the most detailed context about your change. Even just six hours later, right after lunch, it might be hard to remember exactly what was going on. Especially if you have to wait till tomorrow or something like that, you've lost that context at that point. We wanted to capture that mental context that you have, and we wanted to make sure that it was all quick.
How Would We Know It Works?
How would we know that our system was working? We were very focused on these smaller deploy sizes. These discrete changes make it obvious what has changed, so during an incident you could just revert back. We expected our mean time to recovery to be a lot quicker. In addition, we did monitor the time to production metrics. That's the time from a developer creating a pull request to it landing in production. This is a very popular metric to use for productivity reasons. A lot of people like monitoring this. We monitored it as well, and we figured this would improve as we deployed the system, as we adopted continuous deployment more. It was not our primary metric. It did improve over time, but it's a very hard number to really quantify, because even if you're looking at the same team over time, team makeups change. The type of work changes on a team. It's not a very accurate number. It was used to inform us if we were slowing down the process, but not necessarily as our golden metric for success.
System Design
Let's talk about the design of our system here. It was pretty simple. We came up with three major components. In the middle, we had DeployAPI, which stores all of the state on deployments and can actually conduct the deployments. Then we have DeployView, which is the user interface that's going to provide the ambient observability. For the little human interaction that's required, it would live in DeployView. Then we have AutoDeployer, which was the actual automated deployment system. AutoDeployer would essentially be this worker that would continuously try to make progress on deployments. It would update that state in DeployAPI, which would then get reflected inside DeployView.
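To make that division of labor concrete, here is a minimal sketch in Go of the kind of loop the AutoDeployer runs: it repeatedly asks DeployAPI for work, tries to make progress, and writes state back for DeployView to render. The client type and method names here are hypothetical, for illustration only, not the real API.

```go
package main

import (
	"context"
	"log"
	"time"
)

// deployAPI stands in for a REST client to DeployAPI; the methods below are
// hypothetical names, not the real API.
type deployAPI struct{}

// nextWaitingJob asks DeployAPI for the next job worth evaluating, if any.
func (a *deployAPI) nextWaitingJob(ctx context.Context) (jobID string, ok bool) { return "", false }

// startDeploy records the waiting-to-running transition and kicks off the deploy.
func (a *deployAPI) startDeploy(ctx context.Context, jobID string) error { return nil }

func main() {
	api := &deployAPI{}
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
		if jobID, ok := api.nextWaitingJob(ctx); ok {
			// The resulting state lands back in DeployAPI, which DeployView
			// renders, giving the ambient observability described above.
			if err := api.startDeploy(ctx, jobID); err != nil {
				log.Printf("job %s: could not start deploy: %v", jobID, err)
			}
		}
		cancel()
		time.Sleep(time.Second) // continuously retry to make progress
	}
}
```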
Deploy System
The technical details here: the backend is in Go, the frontend is in React. We're using Postgres for our state storage. What's interesting here is that we actually use this Postgres system both for online storage and to do offline analytics queries, for example, analytics and usage metrics of how far people go with adopting continuous deployment. We would just query against a replica of our deploy database and figure out, this team has gone this far and this team has gone this far. We use a REST API. This is a summary of exactly how these three components were working. I only bring this up because I just want to clarify that this is not a complicated system to build. It is a very bare bones type of system. Anybody can really do this.
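As an illustration of the kind of offline adoption query described here, run against a read replica rather than the live database, the snippet below counts how many of each project's recent deploys needed no human approval. The DSN, table, and column names are made up for the example; they are not the actual deploy database schema.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Placeholder DSN pointing at a read replica of the deploy database.
	db, err := sql.Open("postgres", "postgres://readonly@deploy-replica.internal/deploys?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical schema: one row per deploy job, with approved_by set only
	// when a human approved it in DeployView.
	rows, err := db.Query(`
		SELECT project,
		       COUNT(*) FILTER (WHERE approved_by IS NULL) AS automatic,
		       COUNT(*)                                    AS total
		FROM deploy_jobs
		WHERE created_at > now() - interval '30 days'
		GROUP BY project
		ORDER BY project`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var project string
		var automatic, total int
		if err := rows.Scan(&project, &automatic, &total); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s: %d of %d deploys were fully automatic\n", project, automatic, total)
	}
}
```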
Data Models
Let's talk about the data model. We implemented jobs where we modeled the jobs just as they exist inside of Jenkins. We have jobs just like we had before in that pipeline view where we have the staging job, the canary job, the 20% deployment job, and then the 100% deployment job in production. Each one would have an individual ID. If you link them together, that creates a pipeline. This pipeline would deploy one change set into production. This is mostly a model that we copied from Jenkins, because we didn't want to rock the boat too much when we were designing continuous deployment. We wanted to maintain as much as possible of the old stuff, so that it wasn't too new or too unfamiliar to developers when we finally brought in continuous deployment.
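A rough sketch of what that job and pipeline model might look like; the field names here are illustrative, not Lyft's actual schema.

```go
package deploy

import "time"

// Job is one deploy step (staging, canary, 20%, 100%) for a single change set.
type Job struct {
	ID        string
	Project   string
	Stage     string // e.g. "staging", "canary", "one-az", "production"
	Revision  string // the change set being deployed
	State     string // see the state machine described next
	NextJobID string // linking jobs together forms a pipeline
	CreatedAt time.Time
}

// Pipeline is the ordered chain of jobs that takes one change set to production.
type Pipeline struct {
	Jobs []Job
}
```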
That brings us to the state machine. This is the interesting part of the data model, which is, how did we model the state transitions on jobs? Every job was created in the waiting state, that meant it could be deployed. If that job should be deployed at that time, it would transition into the running state. The AutoDeployer would set that. Then once the job was completed, it would either succeed or fail, and end up in the success or failure state. The interesting third final state is skipped. That's a case where a job might just not be deployed ever. If a job enters into the skip phase, it's just not going to be deployed. Let's say you want to actually deploy that change sometime later, you'd have to create a new job at that point.
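The state machine itself is small enough to sketch directly; this is one plausible encoding of the transitions just described, not the production code.

```go
package deploy

import "fmt"

// State of a deploy job.
type State string

const (
	Waiting State = "waiting" // created and eligible to be deployed
	Running State = "running" // the AutoDeployer is deploying it
	Success State = "success"
	Failure State = "failure"
	Skipped State = "skipped" // will never be deployed; deploying it later means a new job
)

// validNext lists the allowed transitions; note there is no rollback state here.
var validNext = map[State][]State{
	Waiting: {Running, Skipped},
	Running: {Success, Failure},
}

// Transition reports whether moving from one state to another is allowed.
func Transition(from, to State) error {
	for _, next := range validNext[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", from, to)
}
```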
One thing you might notice is that we actually didn't model rollbacks in our state machine. The reason for this is, one, they're not really in the Jenkins model. Also, it's simpler to model without them. You could still do it. You could still trigger a new deploy with an earlier build on a job, and that essentially creates a new rollback. We didn't want to have a state for a rollback, just to simplify things. The other question was around automated rollbacks. This was a very hot topic when we first started working on this project. We were not quite sure what to do here. If you think about it, an automated rollback system sounds great, but if you really dig into it, there's a lot of complexities under the hood. You can get into a situation where maybe you deploy new code that writes to a shared state that then old code can't read from anymore, so if you do roll back, you might actually cause a worse problem. What we did was we ultimately did not implement automated rollbacks, and we considered pausing the pipeline safe enough. If a deployment causes an issue, the pipeline will be paused and the on-call will be notified for that service, who will then either force a rollback manually or they could fix it forward. This was in line with our philosophy of having the system be as self-healing as possible, so we didn't want to cause more damage if we got into a tricky state.
When to Deploy
Let's talk about the specific transition here, which is from waiting to running. This is the real meat of the problem. When do you actually do this deployment? You want this light to go from yellow to green when it's good to go. That's essentially what we built. We built this system called the Gate system, where every job goes through this Gate Check. It's essentially more like a checklist than a gate: you start in the Waiting phase and go through your Gate Check. Then depending on the results of the Gate Check, you either go on to the Running phase, meaning you're being deployed, or you go on to the Skipped phase, meaning that you're not going to be deployed at any time. Most importantly, these gates are good defaults. They're basically very simple checks that we do on every single deploy, right before we do the deploy. Things like service alerts are integrated automatically. If your service is paging, we will not continue the deploy, and you need to either solve the page or force the deploy to go on further. You can override these gates. They scale pretty well, because they're very simple. They're just baked into the AutoDeployer. They're constantly reevaluated. Last time I checked, we were doing around 200,000 Gate Checks per minute, across around 13,000 waiting jobs at any given point. It seems like a lot, but it scales pretty well and it's not really a problem for us. Most of these are synchronous checks that are just quick and cheap to evaluate. The goal for a lot of these checks is that they have a quick SLA, a quick turnaround, because we're running them all the time. We do have some asynchronous checks, which check the results of something like a long running integration test, but that's done out of band. If all the gates succeed, the job moves into the Running phase and is deployed. If there's a gate that suggests we should never deploy this, then it moves to Skipped. Otherwise, if the gates are inconclusive, it just remains in the Waiting state, and it gets put back into the deploy queue for evaluation later.
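A condensed sketch of that waiting-job evaluation might look like the following; the Gate interface and the alert gate are hypothetical stand-ins for what the talk describes, not the real AutoDeployer code.

```go
package deploy

import "context"

// GateResult is what a single gate check reports.
type GateResult int

const (
	GatePass         GateResult = iota // this check is satisfied
	GateNeverDeploy                    // this revision should never ship
	GateInconclusive                   // not ready yet; check again later
)

// Gate is one of the cheap, constantly re-evaluated checks run before a deploy.
type Gate interface {
	Check(ctx context.Context, jobID string) GateResult
}

// serviceAlertGate blocks the deploy while the service is paging, as described above.
type serviceAlertGate struct {
	isPaging func(ctx context.Context, jobID string) bool // stand-in for the alerting integration
}

func (g serviceAlertGate) Check(ctx context.Context, jobID string) GateResult {
	if g.isPaging(ctx, jobID) {
		return GateInconclusive // wait until the page is resolved, or the deploy is forced
	}
	return GatePass
}

// Outcome is what the AutoDeployer does with a waiting job after running all gates.
type Outcome int

const (
	Deploy  Outcome = iota // move the job to running
	Skip                   // move the job to skipped
	Requeue                // leave it waiting; re-evaluate on a later pass
)

// evaluate runs every gate for a waiting job and picks an outcome.
func evaluate(ctx context.Context, jobID string, gates []Gate) Outcome {
	for _, g := range gates {
		switch g.Check(ctx, jobID) {
		case GateNeverDeploy:
			return Skip
		case GateInconclusive:
			return Requeue
		}
	}
	return Deploy // all gates passed
}
```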
Here's an example of some of the gates we have out of the box. Has this revision passed CI tests? Have business metrics been affected? Have the integration tests passed? If you want to add your own gate, it's actually very easy. All you have to do is get into the code base, and just go ahead and put it in for your service. That works just fine. What's important to know here is that the AutoDeployer will actually monitor all projects for all services at Lyft. There's a centralized system, that's the AutoDeployer. It's actually monitoring and running Gate Checks on every single service. Every single project at Lyft, they all run through the same system. The question is which one to check. If we want to be responsive, we need to make sure that we can actually do all these Gate Checks quickly enough so that it feels like the deployments are happening in real time. Which of these services or projects are actually important or most important to deploy is actually critical for that responsiveness.
A lot of deploy systems, the way they work in Jenkins for example, model this as a queue. If there's a change that goes out, it gets into a queue, and then eventually gets deployed. Queues can have downsides. If you have a queue, and you have some urgent fix, and this has happened a lot to us, and you need to get that change out now, the urgent fix can get stuck in the middle of the queue, and someone has to come in and either cancel the jobs at the front of the queue or shuffle things around. Either way, a human being has to come in and adjust the queue so that the urgent fix goes out first. You might say, let's just make it parallel, and have multiple queues so that we reduce the chances of this happening. It can still happen. If that urgent fix is really urgent, you could still get stuck somewhere in the queue. You are just stuck there and a human has to come in and fix this so that the fix goes out.
Heaps
I want to talk a little bit about heaps. Heaps are great. They're an interesting little data structure that looks like a queue. They have a pop mechanism just like a queue does, but for a heap, you're popping the top of the queue. The heap has this top item that you can get at any time and pop right off. Then the other items are semi-sorted. What you can do is order this by the minimum or maximum of a certain value. Depending on the values that are in the heap, it will actually re-sort so that the maximum or the minimum value is at the top of the heap. You can use this property to model a queue, but also have the additional ability to adjust the priority of different items inside the queue. You can actually easily change the priority and say, this one actually is more important and should get closer to the top of the queue, or maybe it should be at the top of the queue. With this, if you have an urgent fix, it's actually really simple to solve this. You actually don't even need parallelism or concurrency at all, you just utilize the data structure to your advantage. If you have this urgent fix, you could just bump the priority of your urgent fix so that it hits the top of the queue and it gets deployed right away.
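For illustration, here is that bump-the-priority idea expressed with Go's standard container/heap package. This is just a toy in-memory example; as described next, the real deploy queue lives in Postgres.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a project waiting to be evaluated; a lower priority value means
// it should be looked at sooner (min-heap).
type item struct {
	project  string
	priority int
	index    int // maintained by the heap, needed for heap.Fix
}

type deployQueue []*item

func (q deployQueue) Len() int           { return len(q) }
func (q deployQueue) Less(i, j int) bool { return q[i].priority < q[j].priority }
func (q deployQueue) Swap(i, j int) {
	q[i], q[j] = q[j], q[i]
	q[i].index, q[j].index = i, j
}
func (q *deployQueue) Push(x any) {
	it := x.(*item)
	it.index = len(*q)
	*q = append(*q, it)
}
func (q *deployQueue) Pop() any {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

func main() {
	q := &deployQueue{}
	heap.Init(q)

	items := map[string]*item{}
	for i, p := range []string{"api", "web", "payments"} {
		it := &item{project: p, priority: 10 + i}
		items[p] = it
		heap.Push(q, it)
	}

	// An urgent fix lands in "payments": bump its priority so it pops first,
	// with no human shuffling the queue.
	items["payments"].priority = 0
	heap.Fix(q, items["payments"].index)

	for q.Len() > 0 {
		fmt.Println(heap.Pop(q).(*item).project) // payments, api, web
	}
}
```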
This is how the deploy queue works inside of AutoDeployer. It's a priority queue. We model it in Postgres using a timestamp column. The value of your timestamp determines the position in the priority queue. Urgent changes can immediately jump the line if we need them to. This is what we use to make the system feel responsive, or to actually be responsive. Things like human interaction can bump the priority of a project, and therefore make it deploy a little bit sooner. Merges to the default branch also bump priority. Things that are external to the system that might change the state of the project actually inform when we check this project for deployment. This is what allows us to ensure that changes that need to go out come out faster than others that are not as important. Our SLA for this is simply a measurement of how often we can process each one of these projects so that we deploy their changes as quickly as possible.
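In SQL terms, that timestamp-as-priority design could look something like the queries below (embedded here as Go constants); the table and column names are hypothetical, not the actual deploy database schema.

```go
package deploy

// Illustrative queries for a timestamp-based priority queue. Moving a
// project's timestamp earlier moves it up the queue.
const (
	// The AutoDeployer grabs whichever project is due soonest. SKIP LOCKED
	// lets multiple workers pull from the same queue without colliding.
	nextProjectSQL = `
		SELECT project
		FROM deploy_queue
		ORDER BY check_after ASC
		LIMIT 1
		FOR UPDATE SKIP LOCKED`

	// Bumping priority, e.g. on a merge to the default branch or a human
	// interaction, just rewrites the timestamp to "now" (or earlier, for an
	// urgent fix that must jump the line).
	bumpPrioritySQL = `
		UPDATE deploy_queue
		SET check_after = now()
		WHERE project = $1`
)
```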
Deploying Continuous Deployment
Let's talk about actually deploying continuous deployment, which is deploying it within our organization. Our plan was to deploy this in stages. The general idea was to first replace the underlying plumbing and change as little as possible. The idea at the time was to change absolutely nothing at all. Then integrate into existing processes slowly. Onboard some projects voluntarily, onboard some more. We were sure we were going to learn some scaling lessons along the way. Eventually feel good enough to make continuous deployment opt-out for projects. Eventually make the new hire onboarding use AutoDeployer only. Eventually just force all projects to onboard to continuous deployment. It didn't quite happen that way. We had to change one part of the process midway through, namely, we had to remove that ball mechanism I mentioned earlier. That was really the only change that we had to make here.
At first, we wanted to replace the underlying plumbing. We wanted to make sure there was no deploy process change for developers. What we did was, we actually started deploying to staging on every merge to the default branch right away. There was nothing that really prevented us from doing that. We didn't need gates for that. It was essentially always safe to do, so we just figured, let's send that right now. That helped ease people into the idea. At the same time, we onboarded a few frequently deployed projects to test continuous deployment and stress test it. These are systems that ship automated configuration changes, that ship generated code. They're deployed many times a day. Those were our first customers, because we wanted to make sure that the system would scale well. That was where we came up with the idea of the priority queue, for example. Then we integrated into existing processes, in a sly way. Our idea here was to enhance the experience and not change the experience until we absolutely had to. We had this AutoDeployer that was now deploying every one of these changes. That meant we could use that central system to our advantage to notify people in richer ways.
Now on every pull request, we just started adding this deployment status comment, which has been really nice. As your change is going out, the pull request comment here is updated in real time. If you have a change in a PR, and you want to know if it's been deployed, you just open up that PR and take a look at it. It's really nice that you don't even have to look at the deployment stack to see the progress of your deploy, you could just look at this comment, and see, I've hit staging or I've hit canary at that point. Then we also started integrating into the existing notifications. This is an example of some notifications that we were using for the previous Jenkins deployment. Under the hood, we just replaced that with DeployView so that when a user would click into it, they would see, now I'm using the new user interface for continuous deployment, and that's neat. We didn't really change their process too much. We started, in a very small way, introducing them to the idea.
Where things got a little bit weird was changing the deploy process. Here at the bottom, you'll see a Slack topic for a room where we were tracking the ball status. The topic would reflect who had the ball for a specific project. In this example, I have the ball for project 3. Then for projects 1 and 2, we actually set it so that the AutoDeployer would hold the ball if no human was holding the ball. We figured this made sense. The AutoDeployer was making deployments too, and if the AutoDeployer wanted to deploy, it too would have to take the ball and do the deployment. This just caused more confusion. This was a bigger change, actually, to how people thought of their deployments, because they figured if no human was holding the ball, then there was no deploy happening. We changed the ball locking mechanism. We then eventually made other changes to try to accommodate. Ultimately, this was just confusing. What we did was remove the ball slowly from different projects, talk to teams, and have them feel comfortable. What we found was that as we told teams, we are working on continuous deployment where you don't need the ball anymore, teams were actually pretty excited about the idea, and were very on board. The one thing we had to change was removing the ball entirely, so that things would work well when we introduced continuous deployment.
The other idea was approvals. One thing we didn't realize early on was that we didn't really have a way to ease people into continuous deployment. We had it on or off for a pipeline. We came up with this idea for creating a gradual step to adopting continuous deployment. It was just another gate, called the approval gate. Depending on whether or not your deploy job is automatic, it might require a human to approve the job in the DeployView UI. If the pipeline is not fully continuously deployed to production, when the pipeline is created, a human might have to go into DeployView and say, I approve this job and I approve this job. Then the AutoDeployer would go ahead and deploy those. This allowed for incremental adoption. What we did was we had staging be automatically approved, and then the next job, and the next job, and the next job. This was helpful as a forcing function, because we could just go to teams that had basically everything all the way up to production automatic, and talk to them: what do you need to make it all the way to production automatically?
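Sketching the approval gate in the same hypothetical shape as the earlier gate example: a stage configured as automatic passes immediately, otherwise the job simply waits until someone approves it in DeployView. Both lookup functions are stand-ins for what DeployAPI would provide.

```go
package deploy

import "context"

// approvalGate implements the gradual-adoption step described above.
type approvalGate struct {
	stageIsAutomatic func(ctx context.Context, jobID string) bool // stage opted in to continuous deployment
	humanApproved    func(ctx context.Context, jobID string) bool // approval clicked in DeployView
}

func (g approvalGate) Check(ctx context.Context, jobID string) GateResult {
	if g.stageIsAutomatic(ctx, jobID) || g.humanApproved(ctx, jobID) {
		return GatePass
	}
	// Not approved yet: the job stays in the waiting state and is re-checked later.
	return GateInconclusive
}
```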
Lessons Learned from Cultural Rollout of Continuous Deployment
What did we learn from this process, our cultural rollout of continuous deployment? We were very delicate. We really did not want to disturb our developers in any way. Honestly, a less delicate rollout probably would have been fine, and it would have been a little bit faster. We had less resistance to the idea than we thought. Teams had an opt-out. They could use the approval mechanism to change their pipeline at any time. It was safe to do this. Removing the ball caused a bit of disruption. Some teams were confused by the idea, but for the most part, developers were very excited to hear that we were working on continuous deployment. Then the other thing we learned was that having no automated rollbacks has been fine. So far, this system has been out for about two years, and we have not been in a situation where we felt that automated rollbacks would have saved us.
What else did we learn? The less positive aspects here. The pipeline concept, it's something we adopted from Jenkins. I know a lot of other deployment tools have a similar model as well. It can be ultimately confusing. Rollbacks can be hard to model, because they're just another job somewhere else in a different pipeline. Tying different pipelines together, especially if they change their shape, that can be a little confusing. Commits are associated with multiple deploy jobs. It ultimately is a little bit messy in terms of how a human being might think about doing their deployment. They just want to see that commit land into production, they don't really care about the job aspect of it. This is something that we want to improve over time.
Results of Continuous Deployment Adoption
Let's talk about the results of adopting continuous deployment. We managed to get to 90% adoption. We felt really good about that. This was essentially our target. We figured not everyone was going to need continuous deployment or want continuous deployment. We never really thought we were going to hit 100%, but 90% was better than we expected. What are some of the results we've seen? For one, we've seen less of these oopsies, basically. Before, it was pretty easy for a bad change, like a panic or a divide by zero error or some exception to just make it into production and take down an entire service if the deployer wasn't careful. You're a human. You can't be perfect. After we've rolled out continuous deployment, we've actually done a pretty comprehensive study of the incidents that we've had. We found that now, the oopsies incidents are a small minority of the type of incidents that we have today. A lot of the incidents now are subtle issues that happened over multiple deployments, like off-by-one errors. Those aren't great either, but at least we were able to reduce one class of error.
The other benefit we found is that everything is deployed the same now. We have this uniform deployment system. In the past, the on-call basically had to deploy their own service, because every service essentially had like a ritual that you had to follow. If you didn't know the exact steps of the ritual, or you didn't follow the runbook, or you didn't know there was a runbook, you might cause a problem during your deploy. Now everything deploys the same. It's all in code. Services, configuration, Envoy and our other sidecars, even infrastructure components, those are all deployed using the same system. Teams can make changes across projects when they need to and feel comfortable that the automated deploy system, AutoDeployer, will just take their change out and they don't have to worry about the on-call.
The other thing that was interesting, and this was a little bit of a surprise is we believe we've reduced our exposure to CVEs a little bit. In the past, the CVE announcements, they resulted in a scramble. The CVE would be announced, you'd have to find which projects are using the dependency that has the CVE. Then you have to go ahead and patch, and update it, and then force a rollout, and all that stuff. It required a lot of people involved to find the vulnerabilities and then patch and update them. What's really nice is that now the patch and update part is automatic. Most services deploy automatically so when there is a CVE, it's easy to update a library and know that that's going to go to production pretty quickly. The other thing that's interesting is that we have regular dependency updates that happen now. We're typically on one of the latest versions or near the latest versions of a given library. When CVEs are announced, what we find is that our exposure to them has slowly been less over time as we've adopted continuous deployment. That's pretty interesting.
Finally, the main goal that we always had was having discrete changes going into production. Before, we had those deploy trains where we'd have several commits all jumbled together hit production all at once. That could be confusing. Now we actually have discrete changes being deployed to production. One thing that was interesting is we observed that even before we had the gate design, when we were just doing automatic deployments into staging and into production for less critical projects, just the fact that your deployment went out automatically and so quickly would change the way you thought about changes and your deploy behavior. We saw teams would make smaller commits. They would make smaller changes. They would introduce their changes in sequence, instead of just doing them all together and trying to deploy it all at once. When the friction for deploying a change is so low, and takes so little developer time, you can do them more often, and you can do them more safely. Here's a chart, where before, for staging, we had around 1.8 commits per deploy, and now we have about 1.2. In production, we had about 3 commits per deploy, and now we have about 1.4. I think that's great. That's really great progress. For the most part, when our critical services are deploying, they're deploying one commit at a time.
What Isn't Using Continuous Deployment?
What isn't using continuous deployment? There's about 10% of services that aren't using it. I looked to see what these are. There are two classes, one of which is specialized configuration that just isn't deployed often and doesn't need continuous deployment. The other is systems that operate a data pipeline. These are systems that tend not to do well when they're interrupted. If you have a deployment midway through your processing pipeline, it might cause an issue where you have to reprocess part of the pipeline, or you might miss a deadline. We call these semi-stateful systems. Those systems don't use continuous deployment. A human being still does the approval, and they use continuous delivery, not necessarily continuous deployment.
Future Work
For our future work, we really want to get more adoption; we want to approach that 99%. Hopefully, we can get those data systems to deploy safely. We want to be able to detect lingering issues that span multiple deployments, like those off-by-one errors. We want smarter default gates, like anomaly detection. Right now we kind of just have, is the alarm firing or is it not firing? That means you have to set the alarm yourself. What we would like is to not have to set the alarm yourself, but detect issues like your CPU spiking compared to your last deploy, automatically. We also want to expand the gate data. Right now it's just a string blob. It'd be nice to be able to render things like a chart or an SVG to the user if a gate fails, or succeeds, or has to block. Then, eventually, disallow human approval completely. For those projects where we feel really good about our automated system, let's disable human approval completely and have it forcefully be fully automatic.
Summary
Continuous deployment is worth it. If there's anything you take from this talk, that's the number one thing. Automation and safety in the automation is really, truly a force multiplier. We've given so much time back to our developers as a result of this. In addition, speed really is a feature. It is really critical to make sure your system works well. It has to be fast. The other is, designing a gradual on-ramp is important for continuous deployment, but you don't have to overthink it. We overthought it. We were very careful. It turns out, we probably could have saved a lot of time by just rolling this out a little bit sooner. Continuous deployment, do it for the sanity of your engineering teams. Don't stress out all of your on-calls by making them do scary deployments, let them do continuous deployment.
Questions and Answers
Tucker: What was the biggest hurdle to folks letting go of their manual approval steps? How did you help them work through that?
Wanielista: It was interesting. There are different types of people in this world. There were some people who were very excited to come on board. We were lucky in the sense that we had a team running a critical service that wanted to adopt continuous deployment or delivery immediately, once they heard we were doing it. We worked very closely with them to make sure all the gates were ensuring that all their checks were going to be done the way they wanted them to be. We basically tuned the system to work really well for them. Once we had them on, they were such a critical service, a critical component of the stack, that the fear, uncertainty, and doubt that any other team might have went away, essentially, because if they can do it, you could do it too. That helped a lot with that.
Tucker: What about your security practices or vulnerability analysis stage, do you drive that through your delivery or deployment workflows? Do you have gates in there to catch those things?
Wanielista: We don't have gates right now. It's actually a good idea. We might want to do something like that, where if we detect that you have some known CVE, maybe we block your deployment. We haven't needed to put that in, because the typical flow right now is if there's a known CVE, or if we find something, we have this other system called Cartography, an open source system from Lyft, that actually tracks all the dependencies. Our security team can very quickly see which projects are affected by the CVE, issue a library update. Typically, they're not really a difficult thing to do. We'll update the library and that deployment goes out immediately. They can get that done in less than 24 hours, typically.
Tucker: Are the gate checks centralized or customized per application/service?
Wanielista: They're centralized right now. There are definitely lines of code in AutoDeployer that just say like, if service equals this name, do this. We have tons of those. It's not great, but at least you go into one file and you could see everything. We do want to break that out to have people pick and choose what gate checks they want. For the most part, people are happy with the defaults that we have. There's not a lot of those if checks, but we will make that cleaner.
Tucker: What checks do you have on pull requests before something gets merged to prod?
Wanielista: We call those CI tests. We have the typical ones, integration tests, unit tests. Those are not too complicated, because once that code passes those CI tests, it gets merged and put into the pipeline. Then there will actually be a gate check to confirm that you passed your own CI tests. You can't deploy a commit before it has passed the CI tests. They're very light. A lot of the work is done in the deployment pipeline.
Tucker: Are you deploying every Git commit for every PR merge? I think the concern is, if you're doing commits, and you want to do incremental changes, but you don't necessarily want those all to go out to prod, how do you manage that?
Wanielista: Basically, people just hold off, they don't merge. It's as simple as that. It's pretty simple, but you change your development model. Instead of just putting changes into the default branch, or main, and then having some other system put them out, we've deferred that coordination to the PR stage. That's a very conscious decision. If someone has a change that depends on someone else's change, we'd rather humans just talk to each other and synchronize when they merge the PR and send that out, versus just blindly merging things and hoping it all works out. The simple answer is, don't merge until it's ready. With the don't merge until it's ready philosophy, teams have to change their development model a little bit, because you can't work like you did before. You actually are forced to break down a larger change into multiple pull requests, and think about how you're going to roll that out, which is a blessing and a curse. It's a curse, because you have to take the time to think about it. It's a blessing because your stability is much better, because you actually are forced to think about how this is going to work and how it all plays together.