Giant Robots Smashing Into Other Giant Robots: 456: Jeli.io with Laura Maguire

Laura Maguire is a Researcher at Jeli.io, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems.

Victoria talks to Laura about incident management, giving companies a powerful tool to learn from their incidents, and what types of customers are ideal for taking on a platform like Jeli.io.

Jeli.io
Follow Jeli.io on Instagram, Twitter or LinkedIn.
Follow Laura Maguire on Twitter or LinkedIn.
Follow thoughtbot on Twitter or LinkedIn.

Become a Sponsor of Giant Robots!

Transcript:

VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Laura Maguire, Researcher at Jeli, the first dedicated instant analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Laura, thank you for joining me.

LAURA: Thanks for having me, Victoria.

VICTORIA: This might be a very introductory level question but just right off the bat, what is an incident?

LAURA: What we find is a lot of companies define this very differently across the space, but typically, it's where they are seeing an impact, either a customer impact or a degradation of their service. This can be either formally, it kind of impacts their SLOs or their SLAs, or informally it's something that someone on the team notices or someone, you know, one of their users notice as being degraded performance or something not working as intended.

VICTORIA: Gotcha. From my background being in IT operations, I'm familiar with incidents, and it's been a practice in IT for a long time. But what brought you to be a part of building this platform and creating a product around incidents?

LAURA: I am a, let's say, recovering safety professional.

VICTORIA: [chuckles]

LAURA: I started my career in the safety and risk management realm within natural resource industries in the physical world. And so I worked with people who were at the sharp end in high-risk, high-consequence type work. And they were really navigating risk and navigating safety in the real world. And as I was working in this domain, I noticed that there was a delta between what was being said, created safety, and helped risk management and what I was actually seeing with the people that I was working with on the front lines.

And so I started to pull the thread on this, and I thought, is work as done really the same as work as written or work as prescribed? And what I found was a whole field of research, a whole field of practice around thinking about safety and risk management in the world of cognitive work. And so this is how people think about risk, how they manage risk, and how do they interpret change and events in the world around them.

And so as I started to do my master's degree in human factors and system safety and then later my Ph.D. in cognitive systems engineering, I realized that whether you are on the frontlines of a wildland fire or you're on the frontlines of responding to an incident in the software realm, the ways in which people detect, diagnose, and repair the issues that they're facing are quite similar in terms of the cognitive work.

And so when I was starting my Ph.D. work, I was working with Dr. David Woods at the Cognitive Systems Engineering Lab at The Ohio State University. And I came into it, and I was thinking I'm going to work with astronauts, or with fighter pilots, or emergency room doctors, these really exciting domains. And he was like, "We're going to have you work with software engineers."

And at first, I really failed to see the connection there, but as I started to learn more about site reliability engineering, about DevOps, about the continuous deployment, continuous integration world, I realized software engineers are really at the forefront of managing critical digital infrastructure. They're keeping up the systems that run society, both for recreation and pleasure in the sense of Netflix, for example, as well as the critical functions within society like our 911 call routing systems, our financial markets.

And so the ability to study how software engineers detect outages, manage outages, and work together collaboratively across the team was really giving us a way to study this kind of work that could actually feed back into other types of domains like emergency response, like emergency rooms, and even back to the fighter pilots and astronauts.

VICTORIA: Wow, that's so interesting. And so is your research that went into your Ph.D. did that help you help define the product strategy and kind of market fit for what you've been building at Jeli?

LAURA: Yeah, absolutely. So Nora Jones, who is the founder and CEO of Jeli, reached out to me at a conference and told me a little bit about what she was thinking about, about how she wanted to support software engineers using a lot of this literature and a lot of the learnings from these other domains to build this product to help support incident management in software engineering.

So we base a lot of our thinking around how to help support this cognitive work and how to help resilient performance in these very dynamic, these very changing large scale, you know, distributed software systems on this research, as well as the research that we do with our own users and with our own members from learning from incidents in software engineering Slack community that Nora and several other fairly prominent names within the software community started, Lorin Hochstein, John Allspaw Dr. Richard Cook, Jessica DeVita, Ryan Kitchens, and I may be missing someone else but...and myself, oh, Will Galego as well.

Yeah, we based a lot of our understandings, really deep qualitative understandings of what is work like for software engineers when they're, you know, in continuous deployment type environments. And we've translated this into building a product that we think helps but not hinders by getting in the way of engineers while they're under time pressure and there's a lot of uncertainty. And there's often quite a bit of stress involved with responding to incidents.

VICTORIA: Right. And you mentioned resilience engineering. And for those who don't know, David Woods, who you worked on with your Ph.D., wrote "Resilience Engineering: Concepts and Precepts." So maybe you could talk a little bit about resilience engineering and what that really means, not just in technology but in the people who were running the tools, right?

LAURA: Yeah. So resilience engineering is different from how we think about protecting and defending our software systems. And it's different in the sense that we aren't just thinking about how do we prevent incidents from happening again, like, how do we fix things that have happened to us in the past? But how do we better understand the ways in which our systems operate under a wide variety of conditions? So that includes normal operating conditions as well as abnormal or anomalous operating conditions, such as an incident response.

And so resilience engineering was kind of this way of thinking differently about predicting failure, about managing failure, and navigating these kinds of worlds. And one of the fundamental differences about it is it sees people as being the most adaptive component within the system of work. So we can have really good processes and practices around deploying code; we can institute things like cross-checking and peer review of code; we can have really good robust backup and failover systems, but ultimately, it's very likely that in these kinds of complex and adaptive always-changing systems that you're going to encounter problems that you weren't able to anticipate.

And so this is where the resilience part comes in because if you're faced with a novel problem, if you're faced with an issue you've never seen before, or a hidden dependency within your system, or an unanticipated failure mode, you have to adapt. You have to be able to take all of the information that's available to you in the moment. You have to interpret that in real-time. You have to think of who else might have skills, knowledge, expertise, access to information, or access to certain kinds of systems or software components. And you have to bring all of those people together in real-time to be able to manage the problem at hand.

And so this is really quite a different way of thinking about supporting this work than just let's keep the runbooks updated, and let's make sure that we can write prescriptive processes for everything that we're going to encounter. Because this really is the difference that I saw when I was talking about earlier about that work is done versus work is prescribed. The rules don't cover all of the situations. And so you have to think of how do you help people adapt? How do you help people access information in real-time to be able to handle unforeseen failures?

VICTORIA: Right. That makes a lot of sense. It's an interesting evolution of site reliability engineering where you're thinking about the users' experience of your site. It's also thinking about the people who are running your site and what their experience is, and what freedom they have to be able to solve the problems that you wouldn't be able to predict, right?

LAURA: Yeah, it's a really good point, actually, because there is sort of this double layer in the product that we are building. So, as you mentioned earlier, we are an incident analysis platform, and so what does that mean? Well, it means that we pull in data whenever there's been an incident, and we help you to look at it a little bit more deeply than you may if you're just following a template and sort of reconstructing a timeline.

And so we pull in the actual Slack data that, you know, say, an ops channel or an incident channel that's been spun up following a report of a degraded performance or of an outage. And we look very closely at how did people talk to one another? Who did they bring into the incident? What kinds of things did they think were relevant and important at different points in time? And in doing this, it helps us to understand what information was available to people at different points in time.

Because after the incident and after it's been resolved, people often look back and say, "Oh, there's nothing we can learn from that. We figured out what it was." But if we go back and we start looking at how people detected it, how they diagnosed it, who they brought into the event, we can start to unpack these patterns and these ways of understanding how do people work together? What information is useful at different points in time? Which helps us get a deeper understanding of how our systems actually work and how they actually fail.

VICTORIA: Right. And I see there are a few different ways the platform does that: there's a narrative builder, a people view, and also a visual timeline. So, do you find that combining all those things together really gives companies a powerful tool to learn from their incidents?

LAURA: Yeah. So let me talk a little bit about each of those different components. Our MVP of the product we started out with this understanding of the incident analyst and the incident investigator who, you know, was ready to dive in and ready to understand their incident and apply some qualitative analysis techniques to thinking about their incidents. And what we found was there are a number of these people who are really interested in this deep dive within the software industry. But there's a broader subset of folks that they work with who maybe only do these kinds of incident analysis every once in a while, and they're not as interested in going quite as deep.

And so the narrative builder is really this kind of bridge between those two types of users. And what it does is helps construct a timeline which is typically what most companies do to help drive the discussion that they might have in a post-mortem or to drive their kind of findings in their summary report. And it helps them take this closer look at the interactions that happened in that slack transcript and raise questions about what kinds of uncertainties there were, point out who was involved, or interesting aspects of the event at that point in time.

And it helps them to summarize what was happening. What did people think was happening at this point in time to create this story about the incident? And the story element is really important because we all learn from stories. It helps bring to life some of the details about what was hard, who was involved, how did they get brought in, what the sources of technical failure were, and whether those were easy or difficult to understand and to repair once the source of the failure was actually understood. And so that narrative builder helps reconstruct this timeline in a much richer way but also do it very efficiently.

And as you mentioned, the visual timeline is something that we've created to help that lightweight user or that every once in a while user to go a little bit deeper on their analysis. And how we do that is because it lays out the progression of the event in a way that helps you see, oh, this maybe wasn't straightforward. We didn't detect it in the beginning, and then diagnose it, and then repair it at the end. What happened actually was the detection was intermittent. The signals about what was going wrong was intermittent, and so that was going on in parallel with the diagnosis. The diagnosis took a really long time, and that may have been because we can also see the repair was happening concurrently.

And so it starts to show these kinds of characteristics about whether the incident was difficult, whether it was challenging and hard, or whether it was simple and straightforward. This helps lend a bit more depth to metrics like MTTR and TTD by saying, oh, there was a lot more going on in this incident than we initially thought.

The last thing that you mentioned was the people view, and so that really sets our product apart from other products in that we look at the sociotechnical system. So it's not just about the software that broke; it is about who was involved in managing that system, in repairing that system, and in communicating about that system outwardly. And so the people view this kind of pulls in some HR data. It helps us to understand who was involved. How long have they been in their role? Were they on-call? Were they not on-call? And other kinds of irrelevant details that show us what was their engagement or their interaction with this event.

And so when we start to bring in the socio part of the sociotechnical system, we can identify things like what knowledge do we have within the organization? Is that knowledge well-distributed, or is it just isolated in one or two people? And so those people are constantly getting pulled into incidents when they may be not on-call, which can start to show us whether or not these folks are in danger of burning out or whether their knowledge might need to be transferred more broadly throughout the organization.

So this is kind of where the resilience piece comes in because it helps us to distribute knowledge. It helps us to identify who is relevant and useful and how do they partner and collaborate with other people, and their knowledge and skill sets to be able to manage some of the outages that they face?

VICTORIA: That's wonderful because one of my follow-up questions would be, as a CEO, as a founder, what kind of insights or choices do you get to make now that you have this insight to help make your team more resilient? [laughs]

LAURA: So if this is a manager, or a founder, or a CEO that is looking at their data in Jeli, they can start to understand how to resource their teams more appropriately, as I mentioned, how to spread that knowledge around. They can start to see what parts of their system are creating the most problems or what parts of their system do they have maybe less insight into how it works, how it interacts with other parts of the system, and what this actually means for their ability to meet their SLOs or their SLAs. So it gives you a more in-depth understanding of how your business is actually operating on both the technical side of things, as well as on the people side of things.

VICTORIA: That makes a lot of sense. Thank you for that overview of the platform. There's the incident analysis platform, and you also have the bot, the response chatbot. Can you tell me a little bit more about that?

LAURA: Yeah, absolutely. We think that incident management should be conducted wherever your work actually takes place, and so for most of our customers and a lot of folks that we know about in the industry, that's Slack. And so, if you are communicating in real-time with your team in Slack, we think that you should stay there.

And so, we built this incident management bot that is free and will be free for the lifetime of the product. Because we think that this is really the fundamental basis for helping you manage your incidents more efficiently and more effectively. So it's a pretty lightweight bot. It gives kind of some guardrails or some guidance around collaboration by spinning up a new incident channel, helping you to bring the right kinds of responders into that, helping you to communicate to interested stakeholders by broadcasting to channels they might be in.

It kind of nudges you to think about how to communicate about what's happening during different stages of the event progression. And so it's prompting you in a very lightweight way; hey, do you have a status update? Do you have a summary of what the current thinking is? What are the hypotheses about what's going on? Who's conducting what kinds of activities right now? So that if I'm a responder that's coming into the event after 20-30 minutes after it started, I can very quickly come up to speed, understand what's going on, who's doing what, and figure out what's useful for me to do to help step in and not disrupt the incident management that's underway right now.

Our users can choose to use the bot independently of the incident analysis platform. But of course, being able to ingest that incident into Jeli it helps you understand who's been involved in the incident, if they've been involved in similar incidents in the past, and helps them start to see some patterns and some themes that emerge over time when you start to look at incidents across the organization.

VICTORIA: That makes sense. And I love that it's free and that there's something for every type of organization to take advantage of there. And I wonder if at Jeli you have data about what type of customer is it who'd be targeted or really ideal to take on this kind of platform.

LAURA: So most organizations...I was actually recently at SREcon EMEA, and there was a really interesting series of talks; one was SRE for Enterprise, and the next talk was SRE for Startups. And so it was a very thought-provoking discussion around is SRE for everyone, so site reliability engineering? Even smaller teams are starting to have to be responsible for reliability and responsible for running their service.

And so we kind of have built our platform thinking about how do we help not just big enterprises or organizations that may have dedicated teams for this but also small startups to learn from their incidents. So internally, we actually call incidents opportunities as in they are learning opportunities for checking out how does your system actually work? How do your people work together? What things were difficult and challenging about the incident? And how do you talk about those things as a team to help create more resilient performance in future?

So in terms of an ideal customer, it's really folks that are interested in conducting these sort of lightweight but in-depth looks at how their system actually works on both the people side of things and the technical side of things. Those who we found are most successful with our product are interested in not so much figuring out who did the thing and who can they blame for the incident itself but rather how do they learn from what happened?

And would another engineer, or another product owner, another customer service representative, whoever the incident may be sort of focused around, would another person in their shoes have taken the same actions that they took or made the same decisions that they made? Which helps us understand from a systems level how do we repair or how do we adjust the system of work surrounding folks so that they are better supported when they're faced with uncertainty, or with that kind of time pressure, or that ambiguity about what's actually going on?

VICTORIA: And I love that you said that because part of the reason [laughs] I invited you on to the podcast is that a lot of companies I have experience with don't think about incidents until it happens to them, and then it can be a scramble. It can impact their customer base. It can stress their team out. But if you go about creating...the term obviously you all use is psychological safety on your team, and maybe you use some of the free tools from Jeli like the Post-Incident Guide and the Incident Analysis 101 blog to set your team up for success from the beginning, then you can increase your customer loyalty and your team loyalty as well to the company. Is that your experience?

LAURA: Yeah, absolutely. So one thing that I have learned throughout my career, you know, starting way back in forestry and looking at safety and risk in that domain, was as soon as there is an accident or even a serious near miss, right away, everybody gets sweaty palms. Everybody is concerned about, uh-oh, am I going to get blamed for this? Am I going to get fired? Am I going to get publicly shamed for the decisions that I made when I was in this situation?

And what that response, that reaction does is it drives a lot of the communication and a lot of the understanding of the conditions that that person was in. It drives that underground. And it's important to allow people to talk about here's what I was seeing, here's what I was experiencing because, in these kinds of complex systems, information is not readily available to people. The signals are not always coming through loud and clear about what's going on or about what the appropriate actions to take are. Instead, it's messy; it's loud, it's noisy.

There are usually multiple different demands on that person's attention and on their time, and they're often managing trade-offs: do I keep the system down so that I can gather more information about what's actually going on, or do I just try and bring it up as quickly as I can so that there's less impact to users? Those kinds of decisions are having to be made under pressure. So when we create these conditions of psychological safety, when we say you know what? This happened. We want to learn from it. We've already made this investment.

Richard Cook mentioned in the very first SNAFU Catchers Report, which was a report that came out of Ohio State, that incidents are unplanned investments into understanding how your system works. And so you've already had the incident. You've already paid the price of that downtime or of that outage. So you might as well extract some learning from it so that you can help create a safer and more resilient system in the future.

So by helping people to reconstruct what was actually happening in real-time, not what they were retrospectively saying, "Oh, I should have done this," well, you didn't do that. So let's understand why you thought at that moment in time that was the right way to respond because, more than likely, other people in that same position would have made that same choice. And so it helps us to think more broadly about ways that we can support decision-making and sense-making under conditions of stress and uncertainty. And ultimately, that helps your system be more resilient and be more reliable for your customers.

VICTORIA: What a great reframing: unplanned investment. [laughs] And if you don't learn from it, then you're going to lose out on what you've already invested that time in resolving it, right?

LAURA: Absolutely.

MID-ROLL AD:

Are you an entrepreneur or start-up founder looking to gain confidence in the way forward for your idea? At thoughtbot, we know you’re tight on time and investment, which is why we’ve created targeted 1-hour remote workshops to help you develop a concrete plan for your product’s next steps.

Over four interactive sessions, we work with you on research, product design sprint, critical path, and presentation prep so that you and your team are better equipped with the skills and knowledge for success.

Find out how we can help you move the needle at: tbot.io/entrepreneurs.

VICTORIA: Getting more into that psychological safety and how to create that culture where people feel safe telling about what really happened, but how does that relate to...Jeli says that they are a people software. [laughs] Talk to me more about that. Like, what advice do you give founders and CEOs on how to create that psychological safety which makes them be more resilient in these types of incidents?

LAURA: So you mentioned the Howie Guide that we published last year, and this is our guidance around how to do incident analysis, how to help your team start to learn from their incidents, and Howie stands for how we got here. And that's really important, that language because what it says is there's a history that led up to this incident. And most teams, when they've had an outage, they'll kind of look backwards from that outage, maybe an hour, maybe a day, maybe to the last deploy.

But they don't think about how the decisions got made to use that piece of software in the first place. They don't think about how did engineers actually get on-boarded to being on-call. They don't necessarily think about what kinds of skills, and knowledge, and expertise when we're hiring a DevOps engineer, and I'm using air quotes here or an SRE. What kinds of skills and knowledge do they actually have? Those are very broad terms. And what it means to be a DevOps engineer or an SRE is quite underspecified.

And so the knowledge behind the folks that you might hire into the company is going to necessarily be very diverse. It's going to be partial and incomplete in many ways because not everyone can know everything about the system. And so, we need to have multiple diverse perspectives about how the system works, how our customers use that system, what kinds of pressures and constraints exist within our company that allow us some possibilities over others. We need to bring all of those perspectives together to get a more reflective picture of what was actually happening before this incident took place and how we actually got here.

This reframing helps a lot of people disarm that initial defensiveness response or that initial, oh, shoot; I'm going to get in trouble for this kind of response. And it says to them, "Hey, you're a part of this bigger system of work. You are only one piece of this puzzle. And what we want to try and do is understand what was happening within the company, not just what you did, what you said, and what you decided."

So once people realize that you're not just trying to find fault or place blame, but you're really trying to understand their work, and you're trying to understand their work with other teams and other vendors, and trying to understand their work relative to the competing demands that were going on, so those are some of the things that help create psychological safety.

About ten years ago, John Allspaw and the team at Etsy put out The Etsy Debriefing Facilitation Guide, which also poses a number of questions and helps to frame the post-incident learnings in a way that moves it from the individual and looks more collectively at the company as a whole. And so these things are helpful for founders or for CEOs to help bring forward more information about what's really going on, more information about what are the real risks and threats and opportunities within the company, and gives you an opportunity to step back and do what we call microlearning, which is sharing knowledge about how the system works, sharing understandings of what people think is going on, and what people know about the system.

We don't typically talk about those things unless there's a reason to, and incidents kind of give us that reason because they're uncomfortable and they can be painful. They can be very public. They can be very disruptive to what we think about how resilient and reliable we actually are. And so if you can kind of step away from this defensiveness and step away from this need to place blame and instead try and understand the conditions, you will get a lot more learning and a lot more resilience and reliability out of your teams and out of your systems.

VICTORIA: That makes sense to me. And I'd like to draw a connection between that and some other things you mentioned with The 2022 Accelerate State of DevOps Report that highlights that the people who are often responding to those incidents or in that high-stress situation tend to be historically underrepresented or historically excluded groups. And so do you see that having this insight into both who is actually taking on a lot of the work when these incidents happen and creating that psychological safety can make a better environment for diversity, equity, inclusion at a company as well?

LAURA: Well, I think anytime you work to establish trust and transparency, and you focus on recognizing the skills that people do have, the knowledge that they do have, and not over assuming that someone knows something or that they have been involved in the discussions that may have been relevant to an incident, anytime you focus on that trust and transparency you are really signaling to people within your organization that you value their contributions and that you recognize that they've come to work and trying to do a good job. But they have multiple competing demands on their attention and on their time.

And so we're not making assumptions about people being complacent, or people being reckless or being sloppy in their work. So that creates an environment where people feel more willing to speak up and to talk about some of the challenges that they might face, to talk about the ways in which it's not clear to them how certain parts of the system work or how certain teams actually operate.

So you're just opening the channels for communication, which helps to share more knowledge. It helps to share more information about what teams are doing at different points in time. And this helps people to preemptively anticipate how a change that they might be making in their part of the system could be influencing up or downstream teams. And so this helps create more resilience because now you're thinking laterally about your system and about your involvement across teams and across boundary lines.

And an example of this is if a marketing team...this is a story that Nora tells quite a bit; if a marketing team is, say, launching a Super Bowl commercial for their company but they don't actually tell the engineers on-call that that is about to happen, you can create all sorts of breakdowns when all of a sudden you have this surge of traffic to your website because people see the Super Bowl commercial and they want to go to the site. And then you have a single person who's trying to respond to that in real-time.

So, instead, when you do start thinking about that trust and transparency, you're helping teams to help each other and to think more broadly about how their work is actually impacting other parts of the system. So from a diversity and inclusion and underrepresented groups perspective, this is creating the conditions for more people to be involved, more people to feel like their voice is going to be heard, and that their perspective actually matters.

VICTORIA: That sounds really powerful, and I'm glad we were able to touch on that. Shifting gears a little bit, I wanted to talk about two different questions; so one is if you could travel back in time to when Jeli first started, what advice would you give yourself, your past self?

LAURA: I would encourage myself to recognize that our ability to experiment is fundamental to our ability to learn. And learning is what helps us to iterate faster. Learning is what helps us to reflect on the tool that we're building or the feature that we're building and what this actually means to our users. I actually copped that advice to myself from CEO Zoran Perkov of the Long-Term Stock Exchange. They launched a whole new stock market during the pandemic with a fully remote team.

And I had interviewed him for an article that I wrote about resilient leadership. And he said to me, like, "My job as a CEO is 100% about protecting our ability to experiment as a company because if we stop learning, we're not going to be able to iterate. We're not going to be able to adapt to the changes that we see in the market and in our users." So I think I would tell myself to continually experiment.

One of the things that I talk to our customers about a lot because many of them are implementing new incident management programs or they're trying to level up their engineering teams around incident analysis, and I would say, "This doesn't have to be a fully-fleshed out program where you know all of the ways in which this is going to unfold." It's really about trying experiments, conduct some training, start small. Do one incident analysis on a really particularly spicy incident that you may have had or a really challenging incident where a lot of people were surprised by what happened.

Bring together that group and say, "Hey, we're going to try something a little bit different here. We'll use some questions from the Howie Guide. We'll use the format and the structure from the Etsy Debriefing Guide. And we're just going to try and learn what we can about this event. We're not going to try and place blame. We're not going to try and generate corrective actions. We just want to see what we can learn from this." Then ask people that were involved, "How did this go? What did we learn from it? What should we do differently next time?"

And continually iterate on those small, little experiments so that you can grow your product and grow your team's capacity. I think it took us a little bit of time to figure that out within the organization, but once we did, we were just able to collaborate more effectively work more effectively by integrating some of the feedback that we were getting from our users.

And then the last piece of advice that I would give myself is to really invest in cross-discipline coordination and collaboration. Engineers, designers, researchers, CEOs they all have a different view of the product. They all have a different understanding of what the goals and priorities are. And those mental models of the product and of what the right thing to do is are constantly changing. And they all have different language that they use to talk about the product and to talk about their processes for integrating this understanding of the changing conditions and the changing user into the product.

And so I would say invest in establishing common ground across the different disciplines within your team to be able to talk about what people are seeing, to be able to stop and identify when we're making assumptions about what other people know or what other people's orientation towards the problem or towards the product are. And spend a little bit of time saying, "When I say this is important, I'm saying it's important because of XYZ, not just this is important." So spending a little bit of time elaborating on what your mental model is and where you're drawing from can help the teams work more effectively together across those disciplines.

VICTORIA: That's pretty powerful advice. You're iterating and experimenting at Jeli. What's on the horizon that you are...what new experiments are you excited about?

LAURA: One of the things that has been front and center for us since we started is this idea of cross-incident analysis. And so we've kind of built out a number of different features within the product, being able to help tag the incident with the relevant services and technologies that were involved, being able to identify which teams were involved, and also being able to identify different kinds of themes or patterns that emerge from individual incidents.

So all of this data that we can get from mostly just from the ingested incident itself or from the incident that you bring into Jeli but also from the analysis that you do on it this helps us start to be able to see across incidents what's happening not just with the technical side of things. So is it always Travis that is causing a problem? Are there components that work together that kind of have these really hidden and strange interdependencies that are really hard for the team to actually cope with? What kinds of themes are emerging across your suite of opportunities, your suite of incidents that you've ingested?

Some of the things that we're starting to see from those experiments is an ability to look at where are your knowledge islands within your organization? Do you have an engineer who, if they were to leave, would take the majority of your systems knowledge about your database, or about your users, or about some critical aspect of your system that would disappear with all of that tacit knowledge? Or are there engineers that work really effectively together during really difficult incidents?

And so you can start to unpack what are these characteristics of these people, and of these teams, and of these technologies that offer both opportunities or threats to your organization? So basically, what we're doing is we're helping you to see how your system performs under different kinds of conditions, which I think as a safety and risk professional working in a variety of different domains for the last 15 years, I think this is really where the rubber hits the road in helping teams be more reliable, and be more resilient, and more proactive about where investments in maintenance, or training, or headcount are going to have the biggest bang for your buck.

VICTORIA: That makes a lot of sense. In my experience, sometimes those decisions are made more on intuition or on limited data so having a more full picture to rely on probably produces better results. [laughs]

LAURA: Yeah, and I think that we all want to be data-driven, thinking about not only the quantitative data is how many incidents do we have around certain parts of the system, or certain teams, or certain services? But also, the qualitative side of things is what does this actually mean? And what does this mean to our ability to grow and change over time and to scale? The partnership of that quantitative data and qualitative data means we're being data-driven on a whole other level.

VICTORIA: Wonderful. And it seems like we're getting close to the end of our time here. Is there anything else you want to give as a final takeaway to our listeners?

LAURA: Yeah. So I think that we are, you know, as a domain, as a field, software engineering is increasingly becoming responsible for not only critical infrastructure within society, but we have a responsibility to our users and to each other within our companies to help make work better, help make our services more reliable and more resilient over time.

And there's a variety of lessons that we can learn from other domains. As I mentioned before, aviation, healthcare, nuclear power all of those kinds of domains have been thinking about supporting cognitive work and supporting frontline operators. And we can learn from this history and this literature that exists out there.

There is a GitHub repo that Lorin Hochstein has curated with a number of other folks with the industry that points to some of these resources. And as well, we'll be hosting the first Learning From Incidents in Software Engineering Conference in Denver in February, February 15 and 16th. And one feature of this conference that I'm super excited about is affectionately called CasesConf. And it is going to be an opportunity for software engineers from a variety of organizations to tell real stories about incidents that they had, how they handled them, what was challenging, what went surprisingly well, and just what is actually going on within their organizations.

And this is kind of a new thing for the software industry to be talking very publicly about failures and sharing the messy details of our incidents. This won't be a recorded part of the conference. It is going to be conducted under the Chatham House Rule, which is participants who are in the room while these stories are being told can share some of the stories but not any identifying details about the company or the engineers that were involved.

And so this kind of real-world situations helps us to, as I talked about before, with that psychological safety, helps us to say this is the reality of operating complex systems. They're going to fail. We're going to have to learn from them. And the more that we can talk at an industry level about what's going on and about what kinds of things are creating problems or opportunities for each other, the more we're going to be able to lift the bar for the industry as a whole. So you can check out register.learningfromincidents.io for more information about the conference. And we can link Lorin's resilience engineering GitHub repo in the notes as well.

VICTORIA: Wonderful. Well, I was looking for an excuse to come to Denver in February anyways.

LAURA: We would love to have ya.

VICTORIA: Thank you. And thank you so much for taking time to share with us today, Laura.

You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @victori_ousg.

This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore.

Thanks for listening. See you next time.

ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Support Giant Robots Smashing Into Other Giant Robots

Sponsors

thoughtbot

Episode Link

Embeddable Audio Player

Download URL

Social Network Quick Links