In this episode, Thomas Betts talks with Micha “Mies” Hernandez van Leuffen about observability and incidents, and the roles of developers, SREs and other team members. One challenge is knowing what metrics to track in the first place. A developer-first approach to observability means focusing on metrics that are specific to your application.
Key Takeaways
- A common first step for system observability is creating a dashboard. However, because dashboards are configured in advance to track known metrics, such as CPU and memory, they are not as useful for investigating active issues.
- Dashboards don’t tell you why the system is down or how urgent an active alert is.
- For investigation needs, observability is more of a data science problem.
- Autometrics is an open-source framework for observability that allows developers to easily add common, function-level metrics.
- Taking the time to understand what to track in the first place is the best advice to improve your incident support process.
Transcript
Intro [00:05]
Thomas Betts: Hello, and thank you for joining us on another episode of the InfoQ podcast. Today I'm chatting with Mies Hernandez van Leuffen, the founder and CEO of Fiberplane, which makes collaborative notebooks for SREs and incident management. He previously founded Wercker, a container-native continuous delivery platform, which was acquired by Oracle in 2017. He also runs a pre-seed fund called NP-Hard Ventures.
We'll be discussing observability, incident management and post-mortems, the roles of developers, SREs, and other team members in that process and maybe a little bit about his journey through that space. Mies, welcome to the InfoQ podcast.
Micha Hernandez van Leuffen: Thank you for having me. Great to be here.
Incident detection [00:36]
Thomas Betts: I want to frame our discussion around the timeline of an incident happening. I don't want to go to the ideal solution where we've got a great process in place, but I want to go with the normal boots on the ground reality of something happens and we need to respond to it. We were talking a little bit before we hit record, sometimes it even happens with wondering, how do I know that there's an incident? How do I know there's a problem going on in my system?
Micha Hernandez van Leuffen: I think that's a great point. Well, I think the worst case is where the customer tells you that something is wrong with the system. I will say the step before that is: are you measuring the right things? Are you capable of tracking that something is up and you need to respond accordingly? And what we've seen is that getting to observability in the first place is a challenge for people.
Thomas Betts: How do people improve their process of getting to observability? How do you even know where to start looking? What do most people have? Is it just, I have logs and I wait for the server to fall over and then some alarm goes off, a really, really big alert of, "Oh, we have a big problem"? Or like you said, the worst case is a customer calls up and says, "This isn't working," and you had no idea it was happening.
Micha Hernandez van Leuffen: I think indeed, to your point, logs are an often-used piece of observability data that people tend to use. I think it comes from the fact that it's kind of the first thing that you do when you program. When you code your first application, to get the debugging information and see what's going on, you write these print line statements, and that's kind of ingrained, I think, in our being and in our DNA to sort of use that as a debugging primitive.
So indeed there's a lot of people that look for 500 errors in their logs and have some kind of notification from that, be it in Slack or some other system, and then they respond accordingly. In our opinion, metrics are actually underutilized, because they're cheap. It's usually a number.
Instead of looking for the 500 in your log, you actually get the metric, the number 500, right? There's an error in your system. That should trigger some kind of event, an alarm that goes off, and then the incident investigation process kicks off.
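To make that contrast concrete, here is a minimal sketch (an editor's illustration, not code from the episode or from Fiberplane) of recording 5xx responses as a cheap counter metric with the Prometheus Python client instead of grepping logs for "500". The metric name, route, and `do_work` helper are hypothetical:

```python
# Illustrative sketch: count 5xx responses as a metric instead of grepping logs.
# The metric name, route, and do_work() helper are hypothetical placeholders.
from prometheus_client import Counter, start_http_server

HTTP_ERRORS = Counter(
    "http_server_errors_total",
    "Count of HTTP 5xx responses",
    ["route", "status"],
)

def do_work(route: str) -> int:
    # Stand-in for real application logic.
    return 500 if route == "/boom" else 200

def handle_request(route: str) -> int:
    status = do_work(route)
    if status >= 500:
        # A single counter increment; an alert rule can fire on its rate.
        HTTP_ERRORS.labels(route=route, status=str(status)).inc()
    return status

if __name__ == "__main__":
    start_http_server(9090)  # exposes /metrics for Prometheus to scrape
    handle_request("/boom")
```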
The need for actionable intelligence [02:42]
Thomas Betts: Once that alarm goes off and for whatever triggered it, what starts the incident? In your examples and your experience, what have you seen as people then jump on board? Is it one developer starts working on it, a call gets set up because it's a wider impact? What's the current state of how people deal with those things?
Micha Hernandez van Leuffen: I think there's sort of varying ways of doing this, right? So there's now a lot of Slack tooling on top of incident investigation where a workflow gets kicked off with "Hey, we're seeing this behavior. Who's the commander? What's the status of the incident? What's the impact? How urgent is it?" So you see a lot of workflow tooling around that.
There's people that kick off a Zoom call or a Google Meet or create a virtual war room around that. But that's only the kicking off of the investigation, right? We need to do something. And then the next question, how do you support that actual investigation? And I think that's where it gets interesting because that comes back to can you get actionable intelligence that will help you in this journey to solve the root cause of the problem?
Modern incident handling and virtual war rooms [03:41]
Thomas Betts: And you mentioned a virtual war room. Pre-pandemic, people would often go into the conference room and say, "Okay, everybody jump over here." There might be a phone bridge set up, but it was just a phone call that people were on, and maybe they're at their desks. Has this changed at all now that everyone's pretty much remote these days, and has that made the process better?
Micha Hernandez van Leuffen: It's an interesting question. I think back in the day, you would have these dashboards and they would be placed on the TV in the office where, indeed, pre-pandemic, everybody was working from. And are people looking at the dashboard? "Yes, no, I don't know." And then something kicks off, something happens.
And maybe at some point the TV went away and it's sort of me scooting over in my chair and looking at your monitor. You see some kind of behavior and we try to solve that thing in tandem. But indeed, nowadays with everybody being remote, what's the tool set that you can utilize to support that sort of remote investigation? And obviously with Fiberplane, I'm a bit biased, but we think a collaborative notebook form factor is a good tool to support that investigation.
Imagine a Google Docs environment that is actually capable of pulling in your observability data. I think the other thing with dashboards, because they kind of stem from this cockpit view of your infrastructure, is that dashboards do tend to be set up in advance, assuming that you know what you need to track. It's kind of like this static entity where you think CPU utilization is important, memory consumption, and you're tracking all these things.
Then reality hits and it turns out that you need to be tracking something else. That kind of set us on this path for a more explorative way, a less static form factor. So dashboards still have their place for the top lines that you want to measure and visualize, but when it comes to an investigation, it's more of a data science problem where you kind of want to pull in the different types of observability data, be it metrics and logs and traces and kind of come to that root cause.
Incident analysis is a data science problem [05:33]
Thomas Betts: I like the idea that it shifts from seeing the problem on a dashboard, just here's the current state or the flow of the system, to what you said, a data science problem. You have to do an investigation. Because if the dashboard just told you... usually they're good at showing you it's healthy, but once the system is not healthy, it doesn't tell you why. That's what we need to understand next: why is it unhealthy?
Micha Hernandez van Leuffen: It doesn't tell you why. It doesn't sort of tell you how urgent this might be. It doesn't tell you necessarily if the thing that you're looking at is actually the underlying problem. It could very well be that users are having trouble maybe resetting their password or something and it turns out that the entire identity service or some other function in the identity service is misbehaving.
I think the thing that's kind of lacking is the intelligence piece. Can you get actual intelligence from all these systems that you've put in place? Can you get some more intel to get to that conclusion quicker?
Thomas Betts: I think that's where we tended to rely on the expert in the room. Like you said, looking over someone's shoulder, two people are looking at the same data and you're thinking different things. Or maybe there's four people swarming on it and one's off in a corner doing deep analysis and then goes away for five minutes and comes back and says, "Hey, I figured something out."
Again, all that data isn't automatically surfaced. Someone has to do deep dives and someone has to be looking at the big picture.
Micha Hernandez van Leuffen: And I do think it's a bit of a cultural challenge as well, whereby what we've seen often is that there's one or two people in the team that kind of have that institutional knowledge or that sort of tribal knowledge, and how do you surface that to the rest of the team? How do you make that more explicit and actually build up a system of record, so that when new people join the team, for instance, they have access to that knowledge as well?
And I think that goes hand in hand with perhaps a bit of DevOps fatigue where it's always the same people that need to solve the problem.
Creating a system of record for incidents [07:20]
Thomas Betts: I think when you're talking about that expert knowledge, capturing it so that future people can see, I've seen that dashboard evolve over time. I actually remember personally installing the TV on a wall and then programming the computer with here's the dashboards we're going to display. It's based on the data you have at the time.
So you start with, "I think we're going to check for CPU and memory consumption." But then it turns out we have this other behavior of the system that we found out because of an incident and then that gets added to that dashboard. So the dashboard sort of has the archeology trail of things that have happened, but sometimes those are a one-off scenario. You want to make sure it never happens again. But that line of the dashboard never changes.
How do you make sure that what is kept around for future people to understand is what's still relevant?
Micha Hernandez van Leuffen: That's a super good question and I think that comes back to building up some kind of system of record. Either how that dashboard has evolved and the new metrics or the new pieces of data that you're gathering and why those are important. It also deals a lot with coming back to the developers where a new service gets created.
As unit testing and integration testing should be part of any new service that gets developed, I would say the same of the observability piece: what are the metrics that we're tracking for this new service? How do we make sure that it has availability? What does uptime for that thing look like? What thresholds are important?
Thomas Betts: I think the problem I saw with dashboards is it's fine for, "We have one server or we have a cluster and it was a monolith." Once you get into, "I'm going to add more services," and whether it's full microservices and we have hundreds of them or we just have five or six that are connected. Being able to see this is a problem affecting the user, but it's actually three services down the stack. Seeing those different pictures.
Like you said, raising those alerts and saying, "You must have these standard uptime and other metrics that all of our services in our platform require so that they all look somewhat the same."
Micha Hernandez van Leuffen: To be honest, that's kind of how we started the new company. Wercker itself was a distributed system consisting of different microservices. We were doing CI/CD, so we were literally running users' arbitrary code inside containers, which I can tell you can lead to interesting results.
So for us, that kind of inspired us like, "Hey, indeed a user might experience that his or her job doesn't get run, but it's actually somewhere else in the system that this problem exists." That kind of set us on this path of there's something to a more explorative investigation and figuring out the root cause of a problem.
But on the dashboard explosion, I think it also deals a lot with people not necessarily knowing what to track in the first place, as I said. You start off with, "Hey, I guess CPU might be important," and then all of a sudden you come to realize, "Oh, we actually need to track the application-level metrics or observability as well." And you're kind of layering the application-level metrics on top of the machine-level metrics, and how do these correlate and how do they relate to each other?
Developers need application-specific metrics [10:11]
Thomas Betts: I think that's the good distinction of the infrastructure isn't your application. Monitoring CPU and memory consumption is somewhat easy. We get that for free almost in any of the cloud providers, but you're building your product to satisfy your customer needs and all of that logic is specific and custom to your application.
The cloud providers don't know that. You have to then make that part of your stuff. You have to make those opinionated metrics for your application, and sometimes you don't know what those are upfront. I think we can do a better job with defining those requirements and outcomes of the system and its behavior for product management in general.
Micha Hernandez van Leuffen: What should we track in the first place, right? So, again, I'm a bit biased, but this is why we started this companion open source project called Autometrics, which does a whole lot of that. It's an open source framework, implemented in different programming languages, that captures function-level metrics.
We actually think error rate, latency, and request rate are three very important metrics, and we're going to track these for you automatically for every function. So the way that works is, say you're working in Python, you would add the autometrics decorator; in Rust it would be a macro. So we kind of use metaprogramming to make that happen. Decorate that function with the autometrics decorator and then you automatically get these metrics for free.
That's a very powerful concept, because that means you don't need to think about, "Hey, what should I track in the first place?" You're at the very least doing this for all these functions. It allows you to do some interesting things that we just spoke about in terms of the call graph. It might be that this function exhibits some kind of behavior, but it's the underlying function that's actually misbehaving.
I think that's a great place to start where what are we tracking? What do we think is important? What defines our uptime? And then actually try to layer these SLOs, service level objectives on top. What thresholds are important for us? Is it 95%? Is it 99%? And I think in the industry we talk a lot about those percentages, X number of nines, but how do you get there?
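The decorator pattern Mies describes looks roughly like this in the Autometrics Python library (a sketch only; exact import paths and behavior may vary by version, and `create_user` is a hypothetical function):

```python
# Sketch of function-level metrics via the Autometrics decorator, as described above.
# Exact library details may differ between versions; create_user is hypothetical.
from autometrics import autometrics

@autometrics  # records request rate, error rate, and latency for this function
def create_user(name: str) -> dict:
    if not name:
        raise ValueError("name is required")  # raised exceptions count as errors
    return {"name": name}
```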
Thomas Betts: It sounds like those function-level metrics you're adding on, that's one thing that helps you with that investigation, that data science approach of, "We have a problem. Well, I still don't know where the line of code is or if it's an infrastructure problem, but what can help me find the path to where the bug is, where the issue is?" So any easier way to start giving those breadcrumbs, so that incidents are easier to manage, seems helpful.
Micha Hernandez van Leuffen: With the Autometrics framework, we can actually tell you which line number the misbehaving function is at, because we sit inside your code, right? Which comes back a bit, I think, to the approach around developer-first observability.
Putting those DevOps superpowers, I would say sort of back in the hands of developers and making it part of, as we just spoke about, when you're creating a new service, making it part of that journey versus the add-on later, right? "Oh, we have a service. Now, we need to do observability." This sort of afterthought of DevOps activities and observability and SRE stuff.
Various DevOps responsibilities require different metrics [13:04]
Thomas Betts: I think that's where DevOps started, as an idea that developers are also responsible for doing operations, and anyone who's worked in a small shop understands there's no one else to hand it over the wall to. It's only when you get to a big enough company that you can say, "I don't have to support this; operations has to do it." And that's where we got the chasm of whose problem is it?
DevOps as developers and operations work together. And then we somehow evolved into having this DevOps as a role that you can apply for and somebody has that responsibility. And SRE is another variation of that, like a specialization of DevOps. I wonder, have we shifted too far into the developers are less responsible for operations now?
Micha Hernandez van Leuffen: I would agree. I think we have another term now, platform engineering, which is also a bit of a spinoff from that. And in large organizations, of course, you've got more people, you've got different teams, and it's very likely that there is a team responsible for infrastructure.
But having said that, I think one, the developer has written the service so she knows best how that service behaves, how the code was written, what behavior it might have, right? So it does make sense to sort of shift those powers a bit to the developer in terms of observability such that even operations can make sense of it as well. And I think that's the friction, right? How do you get them to work in tandem?
That's a hard problem, because maybe operations says, "Thou shalt use this instrumentation library to measure what's going on." But it doesn't really make that developer own that experience, have that knowledge of observability be ingrained.
The limitations of observability tools [14:42]
Thomas Betts: I think the challenge I've seen is getting the developers to take the ownership of maybe creating their application dashboard, their metrics that say, "Hey, for my slice of this application, here's what I need to know so that if something goes wrong, I can say I understand the health of it." The platform teams, the SREs, I still see that as more of the interconnectedness, the infrastructure.
Our services are down, our cloud hosting provider has an outage. This is affecting the big things, but once it comes to the application isn't working, my platform engineering team doesn't understand my specific use cases, that's my problem as a developer. So I need tools that can help me do that, but also tools that can help me work with them so we're on the same page and we're not just, "This is what I see and that's what you see." And is that where you're seeing the developer notebooks or the incident notebooks come into play?
Micha Hernandez van Leuffen: But I would also say that the tooling is an interesting facet to this challenge as well, where I think a lot of the tools around observability kind of treat your system as a black box. Install this agent and then we'll do a bunch of stuff and then we'll layer these dashboards on top, versus the developers knowing what metrics we are tracking, right?
"I've written this function, I should know how it behaves and how to observe it." So there's also a bit of a discrepancy I'd say around the tooling, how to empower developers more to make observability part of their daily work.
Thomas Betts: How do we improve that? Is it a matter of finding tools that help us or is it a cultural shift to say the developers have to want to make it better and then they will find a solution whether they build it themselves or they go and find a tool and ask them to use it?
Micha Hernandez van Leuffen: I think there's a few aspects to it. I do think there is a bit of a weird situation where, I mean I'm oversimplifying it a little bit, and it depends on the team. But when you start off as a small team, how do you get started with observability? The on-ramp doesn't really exist. Observability and SRE activities are kind of this Series B-plus activity. All of a sudden we've got a team and that team is responsible for uptime and availability.
I think it deals with a cultural shift around how we can create more ownership for the developers and engineering teams to bring observability into their daily activities. Then also the tooling around it that is more geared towards the developer. The way we phrase it for Autometrics is "developer first": a micro-framework for developer-first observability, because it literally uses function-level metrics. It's the developer writing some additional lines of code to make this work.
Good developer-first metrics also measure business objectives [17:08]
Thomas Betts: It's developer first, but it's also very business focused in that sense. Is that because the developer is the person most closely aligned to what the product is supposed to do, rather than how the services are supposed to run? That's a business-level measurement.
Micha Hernandez van Leuffen: That's a super interesting point, where with the framework, you can now also define your SLOs, or your objectives, in code, where you say, "Hey, 99% of the time we should respond with a 200 OK, and all our requests should be returned under 250 milliseconds." That's a business objective that you've now defined in your code as a developer. I think that this whole topic of SLOs is, almost to your point, somebody else's problem.
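Here is a hedged sketch of what defining such an objective in code might look like with the Autometrics Python library; the `Objective`, `ObjectiveLatency`, and `ObjectivePercentile` names follow the project's documentation as best recalled and should be treated as assumptions that may differ in your version:

```python
# Sketch only: attaching a function to an SLO in code, per the discussion above.
# The objective-related names are assumptions and may differ in your version.
from autometrics import autometrics
from autometrics.objectives import Objective, ObjectiveLatency, ObjectivePercentile

API_SLO = Objective(
    "api",
    success_rate=ObjectivePercentile.P99,                       # 99% of calls succeed
    latency=(ObjectiveLatency.Ms250, ObjectivePercentile.P99),  # 99% return under 250 ms
)

@autometrics(objective=API_SLO)
def get_profile(user_id: str) -> dict:
    return {"id": user_id}
```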
Create a common language to describe what success looks like [17:50]
Thomas Betts: You're agreeing to how I interact with the other people, the other services. We talked a little bit about the roles and responsibilities, and I think you're right about you don't need SREs on day one. It's sort of like you shouldn't do microservices on day one. Start with a monolith, maybe segment it out and make it modular so you can break pieces out.
But you can't do that unless you have a platform that makes it easy to deploy a hundred services, and then you need the team to do that. And if you're a small startup, you don't have that team yet. Or even if you're a large company, an established late adopter maybe, but you're trying to do these patterns and practices that the smaller, agile companies are doing, where's the place to get started?
Is it, can I get started with just throwing in that autometrics to one service? Is it good on its own or do I need to have it across my platform before it's useful?
Micha Hernandez van Leuffen: No, sorry. I will say I think it's super good for when you're starting out: you add the framework to your application, you're now tracking all of these metrics, and it grows along with you. So you're creating another service. Now you keep implementing the function-level metrics, and you're building up this observability stack and the call graph of which function is related to which function.
For larger organizations, it's also useful, because what you see in larger organizations is different teams, maybe programming in different languages, maybe each with a dedicated SRE for that specific team. And what's the common language around observability? Which framework are we using? Which instrumentation library, even for logs? How do we define the log levels and the structure of the logs? What are we counting for the metrics?
I think this common language is actually an important aspect for larger organizations as well.
Thomas Betts: I think even if you don't have people that are SREs, somebody has to start thinking about those common metrics because you have one team that does one thing and a second team does something else. Well, maybe you can figure it out, but once you have 10 and they're all different metrics, nothing makes sense. You need to have that. "Here's how we as a company," all these products have to standardize on something.
Micha Hernandez van Leuffen: Exactly. And in essence, what you have is multiple services created by multiple teams, multiple people. How do you make sense of that all and indeed create that common language that everybody understands and is on the same page when something does happen?
Thomas Betts: And that's the baseline. Everyone has to agree to that, and then you augment it with your service-specific details that you care about, but that still has to be around a standard way of doing that. So maybe I prefix all of my stuff with a short code that says, "Here's my service versus someone else's," and we have some way to determine where's my data versus your data, but each team can then customize it to their needs as long as they're on that same basic foundation.
Micha Hernandez van Leuffen: Exactly, and I think there's also a lot of value in codifying that. You actually know, "Hey, it's this team and Thomas actually built this subset of this service." Even for alert routing, who do we route the alert to? I think there's a lot of value in coming up with a common language and then codifying that.
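One way to codify that common language, as an illustrative convention rather than anything prescribed in the episode, is to carry the same service and team labels on every metric so dashboards and alert routing share one vocabulary:

```python
# Illustrative convention: every metric carries service, team, and function labels
# so alert routing can page the right owners. All names here are hypothetical.
from prometheus_client import Counter

SERVICE_ERRORS = Counter(
    "service_errors_total",
    "Errors per service, labeled for ownership and routing",
    ["service", "team", "function"],
)

def record_error(function_name: str) -> None:
    # An alert rule keyed on the "team" label routes this to the owning team.
    SERVICE_ERRORS.labels(
        service="checkout", team="payments", function=function_name
    ).inc()

record_error("process_payment")
```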
Postmortems are a good start [20:38]
Thomas Betts: We started the discussion with something happens and we didn't know it was happening or there's an incident or we found out about it. Let's jump ahead. I know some companies do post-mortems. There was an incident, it's now resolved and now it's a week later or whenever it is, we say, "Okay, we're going to analyze this incident." What are your thoughts on that post-mortem process? How do we improve that and how do we make it flow back to having a better incident management process in the future?
Micha Hernandez van Leuffen: So let me start off with I think companies and teams are doing post-mortems as a very good practice. Similar maybe with feature launching that you're doing the retro, it's good to sort of look back and what really happened, what went well, what didn't go well, how can we make it better the next time?
I will say though, I think with writing post-mortems for an incident, and again, I'm a bit biased, ideally you're writing the post-mortem as you are solving the incident. That's what we're aiming for with Fiberplane, this notebook where you're doing this data science approach towards the incident investigation, pulling in all this different data, and that investigation is the post-mortem, right?
What behavior did we see? What did we do to surface the right information? And eventually, what did we do to solve it? And that version of the post-mortem might be a rough cut, where there might be paths of investigation that didn't lead anywhere. It's also fresh in the mind. Ideally you write the thing as it's happening, and maybe a week later you give it a bit more context and polish.
But I think it's hard, two weeks later, to write a post-mortem like, "Okay, what did we really see two weeks ago? What did Thomas do again to make the problem go away?"
Don’t rely only on Splunk as the system of record [22:08]
Thomas Betts: I have a very short memory when it comes to the details, and you might leave out the one thing that was really important, that you didn't realize was really important at the time, but that was actually the key. Getting all that into one standard place, whether it's the product you guys have or, like you said, a Google Doc. A lot of people just go to the Slack thread. It's like, "Oh, we're going to set up a new channel and this is for this incident."
But sometimes that also is just one view of what's happening. Okay, here's everyone who was talking in the Slack channel, but in my experience, that might be the shared space and I've got three or four side conversations going on that aren't part of that. Or I'm looking in Honeycomb or Splunk or wherever for the data and that's not getting captured unless I manually copy it over.
Micha Hernandez van Leuffen: I think there's a few things that you touch upon, right? One, the Slack piece: obviously you're chatting, maybe with your colleagues, to solve this thing, but it is ephemeral, right? It will go away. You have to scroll back in order to find what you were talking about, so ideally you capture that information somewhere. And then second, the design space is rather limited, right? There's only so much you can do inside of Slack, plus you're taking screenshots of charts and throwing them into Slack. So there's a lot of Slack glue around that as well to solve the incident.
Thomas Betts: And then what about the other data? Like you said, you can't design it in Slack. I know you can post links to it, but what do you find is useful for people to think about capturing? Are you just focused on the incident at hand or do you have the culture mindset to say, "I want to make sure this doesn't happen in the future, and what will be useful if I come back and look at this in two weeks?"
Micha Hernandez van Leuffen: I guess, a good question. Again, I think it's useful to codify that knowledge and being able to sort of go back and also for new team members joining where they might run into this issue and what has the rest of the team done in the past to solve it? So I don't think Slack is a great space to point people towards in that regard. So it could be Google Docs or Wikis or Notebooks. I think that makes a lot of sense.
During an incident, think about future actionable intelligence needs [24:00]
Thomas Betts: But you're thinking there's something that should be captured as when we have an incident, we start following this process so that we build that up and so it's not a new fire drill every single time. That we get better over time.
Micha Hernandez van Leuffen: So one, I think that deals with actually being able to query real data and being able to get the relevant log lines and get the relevant metrics. I think even more ideal is that that gets created for you in advance. It's not a blank page where you arrive and you start doing queries and building up that report. Actually, the report gets generated for you.
I think there's already a lot that we can do in that space where again, for instance, the help of autometrics, we know which function is misbehaving, we know what other functions get called by that function. We can build up some kind of report for you and give you actual intelligence around, "Hey, here's some spikes for these and these functions. Is this a great place to start your investigation?"
Thomas Betts: I think capturing what's going on, and that eye for the future. Then you do that a couple times, and the same people tend to show up in these incidents. You might have the SREs who always jump on it, but the team that's involved might change, or somebody new shows up from one time to the next. Is there anything that, if you don't have a specific answer to why this bug is happening, someone else can just be that mentor to say, "Here's where I went and looked"? Or seeing that written down somewhere and saying, "Here's how you go and look for that."
Micha Hernandez van Leuffen: Yes, and that revolves around capturing these incidents and capturing the data that went alongside that incident, and then of course the solution as well, and being able to tag that data and the labels for the service that went down. Being able to go back and look at your knowledge base, a system of record of how you solved that in the past.
Take time to understand what metrics need to be tracked to help solve an incident [25:38]
Thomas Betts: I think we're almost out of time. Is there one last little key bit of advice of someone saying, "Hey, we have our incidents, they happen, we want to get better." What would be your advice for where to focus on the first thing today to try and make your process better?
Micha Hernandez van Leuffen: I would 100% focus on taking the time to figure out: what do we want to track in the first place? Are we measuring the right things to even solve an incident? I think adding a lot of workflow, virtual war rooms and all of that is nice, but it begins with, "Hey, can you supercharge that experience once it does happen with actionable intelligence?" Having the right metrics, having the right log levels, investing in that upfront, and then the rest will follow.
Thomas Betts: That's great. If you help yourself understand what the problem is, it's always going to lead to much better outcomes. Well, Mies, I want to thank you for your time today and for joining us on the InfoQ podcast.
Micha Hernandez van Leuffen: It's been a pleasure. Thank you for having me.
Thomas Betts: And listeners, we hope you join us again for a future episode. Have a good day.