Jan. 1, 2026

#560 Why DevOps Alone Is No Longer Enough: Michael Ferranti on FeatureOps and Reliability

In this episode of The CTO Show with Mehmet, Mehmet sits down with Michael Ferranti, a seasoned tech executive and product leader at Unleash, to explore why DevOps alone can no longer meet the reliability, speed, and risk demands of modern software systems.

 

From real-world outages at Google and Cloudflare to the rise of AI-driven delivery, this conversation introduces FeatureOps as the missing control plane that allows teams to move faster without breaking production.

 

 

👤 About the Guest

 

Michael Ferranti is a tech executive with over a decade of experience across DevOps tooling, infrastructure software, open source, and enterprise platforms. He has played key roles in scaling developer-focused technologies and advises organizations on balancing innovation, reliability, and governance at scale. Today, he focuses on FeatureOps as a foundational capability for modern engineering teams.

 

 

🧠 Key Takeaways

• DevOps optimizes deployment, but FeatureOps governs runtime behavior

• Many large-scale outages are caused by “big bang” releases without kill switches

• Feature flags are not just for UI experiments; they are safety mechanisms

• FeatureOps enables faster shipping and lower risk at the same time

• AI-driven engineering increases, not reduces, the need for runtime control

 

 

🎯 What You’ll Learn

• Why DevOps alone breaks down at scale

• How FeatureOps differs from traditional feature flagging

• Lessons from Google and Cloudflare outages

• When open source helps and when it complicates GTM

• How AI changes release management and reliability decisions

• Why human-in-the-loop control still matters in autonomous systems

 

 

⏱️ Episode Highlights & Timestamps

00:02 – Michael’s journey from early cloud evangelism to FeatureOps

04:00 – Scaling Portworx and why technology alone is not enough

07:30 – Open source as a GTM strategy, myths and realities

15:00 – Kubernetes, scale assumptions, and overengineering traps

21:30 – What FeatureOps actually is and why it matters

24:30 – Google outage case study and the cost of big bang releases

27:30 – Cloudflare, kill switches, and runtime control

31:00 – FeatureOps vs DevOps explained clearly

35:00 – AI in release decisions and risk management

43:00 – Human-in-the-loop engineering and future architectures

 

 

🔗 Resources Mentioned

• Unleash Feature Management Platform: https://www.getunleash.io/

• Google SRE Handbook

• DORA Reports on High-Performing Engineering Teams

• ThoughtWorks Feature Management Practices

 

 

🔗 Connect with the Guest

• Michael Ferranti on LinkedIn: https://www.linkedin.com/in/ferrantim/

 

[00:00:00] 

Mehmet: Hello, and welcome back to a new episode of the CTO Show with Mehmet. Today I'm very pleased to have joining me Michael Ferranti from Unleash. Michael, the way I love to do it is I keep it to my guests to introduce themselves. So a little bit more about you, [00:01:00] you know, your journey, what you're currently up to. I just like to give a teaser to the audience.

We're gonna talk a lot about DevOps and FeatureOps, AI and, you know, a lot of, like, uh, topics. And maybe, Michael, you can give us some of your experience working on new technologies and scaling them, so I don't want to steal much of your time. The floor is yours.

Michael: Yeah. Thank you. Um, uh, really, uh, excited to be here.

I'm, I'm eager for a conversation. This is, um, you know, ev everybody's busy these days and, and, um, yeah, I look at this as like a little bit of a holiday, um, from my calendar just to, to talk about tech and to talk about big ideas. Um, and so eager to dive into all of those topics. I guess by way of background, um, I am, um.

Kind of a, I would say a tech executive, um, that, that really specializes in kind of the, the product and marketing side of, um, uh, of infrastructure software, uh, [00:02:00] DevOps, tooling, um, and, and related technologies. Uh, I've been doing that for, um, well over a decade now. Um, in fact, in just in my LinkedIn thread or my LinkedIn feed recently, um, I saw from a, a friend of mine.

That, you know, over a decade ago, um, we did a lot of work together in evangelizing the cloud back when the cloud needed evangelizing. Uh, maybe it does again as well, uh, 'cause there's kind of some, you know, some, some challenges with the cloud as well. But that goes all the way back to my days at Rackspace, um, when, believe it or not,

Rackspace was the number two public cloud provider, um, after Amazon. Um, it seems like, wait, like, that can't possibly be true. Like, that actually was true. Um, this was before Azure, uh, this was before GCP. Um, you know, Amazon only launched, um, uh, S3, I think, if I'm not mistaken, in 2013. So it seems like

Amazon's always been there, but it hasn't always been [00:03:00] there. Uh, and I was part of the cloud team at Rackspace in those early days. And so kind of, you know, fast forward, I've worked in dev tools companies, I've done open source, I've done enterprise software. Um, and so really here to kind of bring that perspective to the CTOs and, and other technical, uh, people that are listening to the show.

Mehmet: Great. And thank you again, Michael, for being here with me today. You know, something, um, regarding a little bit your, your, uh, journey, if you want. So you helped, uh, scale Portworx, you know, from 500K to a 370 million acquisition. Now when, when you look back, right, uh, and because, you know, we're talking about evangelizing the technology, right?

And then, you know, talking to, to, uh, adopters, I would say. What do you think, you know, mattered more the technology itself or how developers were enabled to trust and adopt the technology? 

Michael: Um, that is a really, really good question. Um, and, [00:04:00] um, I feel lucky to give my perspective on it, because this is a conversation that I actually had multiple times with the Portworx founders.

Um, it's definitely both. Like, I would be lying if I said that one was more important than the other. However, the, um, the, the beaches, um, up and down Northern California are lined with very expensive homes, um, from, from technology companies that were not the best in the market. Um, like, you, you can be very, very successful with the number two or number three or number four product in any market.

That is categorically true. And, um, in some ways, when a product is too good, um, I would say that founders can, can mistakenly believe that if you build it, people will come, or that people will [00:05:00] automatically understand how, why this technology is superior to another technology. Um, when in reality, buying decisions are much more complex than that.

Um, there, there's, relationships are certainly part of it. Um, kind of perceived risk is part of it. Fashion is part of it. That's very, fashion is very, very important, um, in the DevOps community. That's not a criticism. Um, that's simply a, uh, a statement of what I believe to be reality that a lot of times, technologies that seem to be the right fit and become very popular, that's what I'm describing as fashion, um, may not in fact be the most practical, right.

A great looking pair of shoes might look great, and yet, like, your feet hurt at the end of the day. Uh, that's Kubernetes. Um, and Kubernetes is actually fantastic for many use cases, but it's not perfect for every use case. So that's where fashion comes in. Um, and when I was at Portworx, um, I would, I joined [00:06:00] after, um, an unsuccessful run at another startup.

For everything that we lacked in product, we made up for in visibility. We were perceived as the number one storage solution on the market for first Docker containers and then Kubernetes. Mm-hmm. Mm-hmm. There was kind of nothing underneath from a product perspective, but we excelled at kind of the PR, um, public relations, analyst relations game, community.

We had an open source, um, uh, solution. We brought in partners that were, like, orders of magnitude bigger than us, in terms of Dell EMC, in terms of VMware, in terms of Huawei and Pure Storage and others, um, in order to build on this open source base that we had. And meanwhile, this, this company Portworx, who unbeknownst to me at the time had a fantastic product that actually solved real enterprise pain points, that was extremely well engineered, that had depth of, um, defensible [00:07:00] value from an IP perspective,

was perceived as number two. And it's like, well, how can these things be? Well, the reason is because all of these things that are not technology also really matter. And I think that's something that's super important for founders to understand, which is, just like you want to invest in having the best technologies, I would, I would always recommend that, if you could do it, hire the best engineers, right?

Like, these, these are, these are all trade-offs, and I'm speaking in, in generalities, right? But yeah, you wanna, you wanna hire the best engineers, but you also wanna hire, you know, the best product people and the best marketing people and the best, you know, communicators. Um, 'cause it's all part of it.

And, and when I joined Portworx as part of that team, we were able to build on this solid foundation of technology, become the market leader, and then have a very successful exit as a result of that body of work. It wasn't just one thing.

Mehmet: Great. Now, you, you mentioned about like it's fashion and, um, you know, the [00:08:00] adoption.

Uh, in general, do you see, Michael, like, some founders, they make the mistake of thinking that, uh, open source is kind of a go-to-market shortcut? Like, uh, we, we just, like, put it open source, especially in the DevOps space, right? We, we see that a lot. Um, so from your experience, like, is it building the open core communities that, you know, most teams get wrong, uh, about open source as a go-to-market strategy, or is it something else?

Because, you know, sometimes we've seen, like, companies who did fantastically well, but they weren't able to, uh, to monetize it, in another sense. And because at the end of the day, you know, people sometimes, you know, I think they forget, I don't know. Like, you, you need, you need to. And this is what you exactly mentioned about having the best product team.

You have the best marketing team, the best engineers, of course. So, so this is, this is money. Like you, you, you need to get some revenue over [00:09:00] there. Uh, and I've seen like sometimes people where they say, okay, like, let's make it free. Let's make it open source. Uh, and they think like, that will solve all their problems.

What's your take on that? 

Michael: Uh, it is not a silver bullet for sure. Um, I think, um, I, um, I would say I've worked for more open source companies than closed source companies. Um, but I've also worked for closed source companies and I've seen both work well and I've seen both not work well. Um, so I think it takes, like, it, there's some strategic thinking that should go into whether or not you open source.

Or not. And one kind of law of open source is that it's much easier to open source something than it is to then close source it. And so it's not like any of these decisions are, um, permanent, however, the pain that you need to go through to undo certain decisions is greater than others. And so the way, the way I like [00:10:00] to encourage founders to think about it when you're first making that decision, um, is.

What is the goal of open source? Which, it sounds like, like, well, duh. Well, not, not really. It's, it's not a duh. Um, there are some companies for which open source actually is a key part of the development project. Um, this is how we're actually gonna build, um, IP. Open source is still IP, it's just, you know, licensed differently.

Uh, I would say, you know, Linux is the, you know, the, the biggest example of this, um, as a pure open source, um, project. Um, and, you know, then had commercial backers, like Red Hat being, being, uh, first among many. Um, software development happens through the open source, um, often unpaid volunteerism.

Uh, Docker is another great example, uh, of an organization that really built the core engine through an open source means. Um, [00:11:00] and if you're gonna do that, you should have people in your organization that enjoy that type of work, and you should want to get feedback, um, and contributions, and hear "no" from your community.

Otherwise, it's gonna be very, very painful. If you, for instance, have a very strong, um, CTO that's very strong working with customers and you have a very commercial driven model, even for open source, sometimes it can be quite painful to hear from your community, Hey, we wanna prioritize X, Y, Z. When you're hearing from your biggest customers, we want to prioritize A, B, C.

Like, if, if you're, if you are open source, that can be hard, um, to, to manage if you're, if you're expecting, um, kind of contributions to your core from open source. Um, on the other hand, um, if open source is really about how do we, how do we make it possible for people to try this software in ways that would be challenging otherwise, then [00:12:00] my first question would be, okay, if it's about trying the software, what other ways besides open source are there also?

Free trial, for instance, uh, right. A, a freemium model, um, a free forever version. Um, and if you have a, a solution that lends itself to a SaaS model, meaning it's kind of obvious that people would want you to run this for them and that, and that long term is your big business model, then I would encourage you to think really twice about whether or not open source is the best model for you.

Um, because again, there are many ways to make, make it easier for people to try software. Um, open source doesn't have to be one of 'em. Um, my, my current company Unleash, the reason we open sourced, um, is because we're actually quite unique in the feature flag market in that you can run us on premises and there there are certain barriers to running software on [00:13:00] premise.

That's not open source, that don't exist in a SaaS model. Um, our biggest competitor, LaunchDarkly, is SaaS-only. And there are many large organizations. We, we kind of, our ICP, or where we specialize, is very large organizations, financial services, um, as an example, that love the idea of using LaunchDarkly. And, like, I work at Unleash.

I'll just say that, like, it's a good product. Mm-hmm. However, it's a non-starter for them to use that SaaS. They want to be able to run it themselves, and open source makes that very, very easy. But it also creates challenges, which is: how do you make sure that you are slicing up your open source in a way that, when a customer gets value, if they want to continue to get incremental value, then they, they need to help fund the company so that you can continue to exist in, in the first place.

Um, and so it's really a conversation. There's no perfect answer on any of these things. And in fact, you know, licensing models evolve over time. Um, sometimes I would say for the [00:14:00] better, sometimes for the worse, potentially, but it depends.

Again, depending on your perspective. HashiCorp, as, um, a post-IBM acquisition, has, mm-hmm, experimented, let's say, with a lot of their licensing model, and that's created friction within the community. Like, that's normal. That's a, that's a part of growth, I think. Um, you know, we're certainly, um, uh, experimenting with, with our own licensing model.

I think that's just part of it. Um, but it's good to to know why you're doing it first. 

Mehmet: This is, you know, probably the, the best, uh, explanation I've heard in a long time about, you know, the, the best approach to do it, which is, like, you know, you do your, your, I would say, strategy, and, and you choose what works for you, and there's no one-size-fits-all.

Uh, and it depends on, on exactly your goals and your, your capacity as well, as you mentioned, Michael. So, so that makes a lot of sense. Now let's do some reality checks, especially in the DevOps, uh, space. And I, I would touch on, you know, Kubernetes mainly. [00:15:00] Right. So, um, so Kubernetes, you know, is no longer new.

Right. We've been talking about Kubernetes for a long time, and yet I still hear, and I see, a lot of, you know, debates. I see a lot of organizations that want to adopt it and struggle with it. You know, from, from your point of view, do you see that the biggest gap is between how the technology is marketed, uh, versus how to actually use it in production?

Michael: Um, that's a, it's a good question. I mean, I think Kubernetes came out of Google, um, and it was kind of, you know, from Borg, which is their, their early implementation of, um, of containers. And clearly Google operates at a scale where you could imagine that having a fleet of millions of containers takes

[00:16:00] its own kind of, um, uh, management platform. And that's, that's what became Kubernetes. And, um, uh, like, I remember being in the room, um, very early in the days. This was probably 2014. This was before Kubernetes was open sourced, um, before Docker, uh, responded with Swarm. And in the room, Solomon Hykes was there, you know, the team from Google, and kind of talking about what happens when there's more than one container, and how do we manage that, and what is the platform to do that?

That's what ended up being Kubernetes. And then, you know, that put Docker a little bit on the back foot with how do, how do they manage multiple containers, which became Swarm, and you had Mesosphere and, like, all, all of these things. Mm-hmm. The unifying factor is that there's, there's, like, more than one of these things that you need to manage.

Um, and it's easy to imagine that most companies have more than one, essentially, compute resource that they want to manage. Um, but very few have [00:17:00] millions, very few in, in the grand scheme of things. So I think what's really, really important to understand is: are we a scale company, or do we want to dramatically increase our revenue?

And many companies, and I'll, I'll take an example of, say, a, a retailer, like a traditional bricks-and-mortar retailer that needs to reinvent themselves for the, um, 21st century, the digital economy, like, whatever buzzword you want to use. And, like, they need to be able to do, you know, um, kind of grocery delivery, and make sure their entire catalog and all of their sales are, like, people can go and see what it is.

They can maybe do click-and-collect. They can do all of these digital experiences. They want to, they want to do that. Um, and this is a national chain. Like, they're doing this in order to dramatically increase revenue, increase, uh, improve [00:18:00] customer experience. That doesn't mean that they're gonna be at Google scale anytime soon, and it might be more effective.

I'm saying might because this, this is, um, um, it really just depends. But if you're not a scale company, you might not need technology that was designed exclusively for scale companies. And you can still do, you know, microservices, you can still do, you know, DevOps and CICD and continuous del uh, well, CICD, continuous delivery.

That, like, that whole kit and caboodle. You can, you can do FeatureOps, which I'm gonna talk about, um, here in a second, hopefully. But, like, you don't necessarily need Kubernetes. And this is not saying, like, I'm not, you know, dumping on Kubernetes. I'm saying these are technology decisions.

And just like it might be appropriate to not open source, it might be appropriate to not go on [00:19:00] Kubernetes. You might still want to use containerization, and you might elect to use, I don't know, like, um, you know, a, a service on Amazon, excuse me, or, or Google or Microsoft,

That abstracts away the Kubernetes layer. It's like, I just, I have these processes that I want to be able to run, run them for me. Well, um, Amazon has a service that does that, where you're not running Kubernetes, and that might be sufficient. Or you might look at, um, you know, serverless, um, to be able to do, um, Lambda functions, and, like, that would enable you to solve these problems.

So I think hiring smart engineers, who understand that the best engineered solutions take into consideration requirements and goals and then come up with an optimal solution, is a really smart way, um, to look at it. And, you know, who wants to be part of, even from a career perspective, part [00:20:00] of a three- or four-year digital transformation program where, like, you know, Kubernetes is the de facto platform that we're gonna use?

You get two years into that and you realize that, okay, we don't actually need this. We, we can do something that's much simpler, we can get there a lot faster. Um, so I I, I've seen it work both ways where it's been, thank God we moved to Kubernetes because it really has enabled us to scale. And I've seen it elsewhere where it's like, hmm, this may not have been the best decision for us.

Mehmet: Right. One of the things that, uh, you know, every, every company, right, especially building software want to do, is to move faster, right? And part of this is, you know, they, they need to work on what we call it, the feature management, right? So I, I had the chance to interact with people who, who manage these things, usually in the, in the product team, right?

Mm-hmm. Um, and one of the things, you know, uh, is about, like, how do we flag, you know, a [00:21:00] feature, you know, and, uh, make sure that, you know, we, we do what we do. So what I wanted, Michael, first, is for you to explain this to us, like, feature management as a strategic layer. And, you know, like, this started as a developer convenience.

But at Unleash you position it as infrastructure, right? So it, it's not only just a simple tag or something like this. So when did feature management, uh, stop being a tool and start becoming kind of a full-fledged control plane?

Michael: Yeah. Um, excellent question. Um. I, I think we can ground the conversation in a couple of recent, um, uh, incidents.

Uh, sure. That really crystallize it, I think, certainly for our customers, um, but I think increasingly for, for people who haven't yet adopted what we call FeatureOps, um, which we see as, like, the evolution of DevOps. [00:22:00] Where, just very quickly on that distinction: DevOps is really about getting code into production, in our view.

It is CICD. Um, it is, it's certainly operating the, the services that, that serve that code; for instance, a container dies, how do I restart it? Like, those, those are DevOps principles. Um, it doesn't really have as much to say about what, what that code is doing in production. Uh, that's where FeatureOps, um, uh, takes over.

And the reason this is important, this distinction between this, this pipeline that gets code into production and manages that code as an object when it's in production, being distinct from what is that code actually enabling from the end user perspective, right? What, what capability or what feature is that code either providing or not providing?

Thinking about these in two distinct layers is really, really important, because there are [00:23:00] limits to how effective DevOps can be in maintaining, um, I would say, uh, systems that are available and reliable at all times. So a couple, couple of examples of that. Um, back in June, uh, Google, uh, GCP had a major outage, uh, related to BigQuery, um, brought down Gmail, um, brought down lots of services that were running on GCP.

Um, kind of one of those, you know, half the internet go, goes offline for a few hours, types of incidents. Yeah. Um, and in their postmortem, first of all, so, so, so that happened. Well, this is Google, right? They invented SRE. So they, they certainly have people that know how to respond to these types of incidents.

Um, they've automated everything. Of course, they have CICD in place. Of course, they can click a button and do a global rollout, right? It's not like, you know, they have these SREs that are SSHing into individual servers and [00:24:00] like, okay, restart this one. Now restart. Like, like everything is automated, pretty much as automated as you could possibly get.

And yet this outage took four-plus hours to, to, to recover from. Right? This is an organization that basically is as far along on that DevOps journey as you could possibly get, and yet a, a recovery for that type of incident for them still takes four hours. Um, and in their postmortem, one of the things that they called out explicitly, they said: if this feature, which by the way was a backend capability.

It was kind of, what they call a critical binary, um, a backend feature. If this feature had been flag protected, it would've been caught in staging. Why? Well, because then they could have done a gradual rollout. They could have realized that, ooh, when this flag is on, we see a spike in [00:25:00] errors.

Let's take a look at that. Let's fix it. But instead, they did what's called a big bang release, released it everywhere. Mm-hmm. And then the complicated, dependency-laden kind of, you know, reality that is a modern application meant that even when you fixed it using your DevOps processes, it still took many, many hours for those changes to propagate through the system.
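The gradual rollout Michael contrasts with a big bang release is typically implemented by hashing a stable identifier into a bucket and comparing it to a rollout percentage. The sketch below is a hand-rolled illustration of that idea, not Google's or Unleash's actual mechanism; the flag name and the 5% figure are made up:

```python
import hashlib

def in_rollout(flag_name: str, user_id: str, percentage: int) -> bool:
    """Deterministically bucket a user into [0, 100), so the same user
    always gets the same answer for a given rollout percentage."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

# Hypothetical flag: start the risky change at 5% of traffic, watch error
# rates, then ratchet the percentage up instead of releasing everywhere.
exposed = sum(in_rollout("critical-binary-change", f"user-{i}", 5)
              for i in range(10_000))
print(f"{exposed} of 10,000 users see the new code path")
```

Because the hash is deterministic, ratcheting the percentage from 5 to 25 to 100 only ever adds users to the exposed group; nobody flips back and forth between code paths as the rollout widens.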

Um, fast forward to just a couple of weeks ago, and Cloudflare had another one of these, um, kind of internet-breaking outages, and this one was just, like, I think a week or 10 days before Black Friday. Like, oh my God. Right? Like, you know, um, what would've happened if this exact same, like, uneventful change?

It was, I think, a database permission in, in this particular instance. Had it happened, like, on Black Friday or on Thanksgiving Day, um, it would've been catastrophic for the businesses that rely on Cloudflare to protect, um, uh, their, [00:26:00] their, um, uh, their websites and their applications.

And in Cloudflare's, um, uh, postmortem, they said essentially the same thing as Google, different words, but they said: we're gonna enable more global kill switches. A kill switch is a feature flag by a different name. Um, and a feature flag, for those who are not familiar, and I mean, feature flags are really the technical foundation of FeatureOps, um, which extends this technology with cultural and engineering practices to optimize for it, is basically an on-off switch that happens at runtime.

So, configuration: you can, you can apply changes kind of broadly to applications, but it, it requires an application restart. Right? I have some configuration, I make an update to it, I'm gonna push that out, my application needs to restart in order to read that new configuration. Um, a feature flag is a runtime mechanism.

I press on, now my application serves [00:27:00] that feature; I press off, now my application goes to a different branch of that code that doesn't include that feature. And so you can recover from incidents a lot faster, because you already have this mechanism that turns off and turns on capabilities instantly.
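That on/off branch can be sketched in a few lines. This is a deliberately minimal, in-process illustration of the idea; real systems such as Unleash evaluate flags through an SDK backed by a server, and the flag and function names here are invented:

```python
# Minimal in-process sketch of the runtime on/off switch described above.
# The "flag store" is just a dict; a real system would fetch flag state
# from a feature-flag server via an SDK.
FLAGS = {"new-checkout": False}

def set_flag(name: str, enabled: bool) -> None:
    """Flip a flag at runtime -- no config push, no application restart."""
    FLAGS[name] = enabled

def handle_request() -> str:
    """Each request re-reads the flag, so a flip takes effect immediately."""
    if FLAGS.get("new-checkout", False):
        return "new checkout flow"    # the feature branch
    return "legacy checkout flow"     # the safe fallback branch

print(handle_request())               # legacy checkout flow
set_flag("new-checkout", True)        # "press on"
print(handle_request())               # new checkout flow
set_flag("new-checkout", False)       # "press off": the kill switch
print(handle_request())               # legacy checkout flow
```

The point is the read-on-every-request pattern: because `handle_request` checks the flag each time, flipping it changes behavior on the next request, with no deploy and no restart.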

And there's a, it's a little bit of a misnomer, um, or unfortunate naming, to call these feature flags, because a feature, the connotation is my, my website, my application, things that users interact with. Increasingly, like, the, the real guts of an application are the things that users never see. It is these, um, uh, these various binaries, these various services, these, you know, API servers, like, these types of things also need this runtime control.

And when you don't have it, in Cloudflare's instance, in Google's instance, you can end up with these multi-hour downtime events that, um, like, to be frank, I'm not exaggerating, cost hundreds of [00:28:00] millions of dollars. Yeah. Both for the organization that's responsible for it as well as their customers. And, just like, you know, I'm gonna take version control as an example.

It's like, we, like, we just, everybody, like, if you're doing software development right now and you, you are applying for a job, and that company doesn't use version control, like, you're gonna go and look for another job. Like, being part of an organization that doesn't do these fundamental engineering practices means that you are guaranteed to be woken up in the middle of the night to fix someone else's problem, and it's gonna be very, very painful in the process.

Feature flags, and FeatureOps more broadly, are becoming that level of, um, uh, capability. Where, um, so first of all, like, ThoughtWorks and Martin Fowler are big advocates of, of, of FeatureOps. Um, you've got the Google SRE handbook, um, talks about FeatureOps. [00:29:00] Um, Amazon's Well-Architected, um, Framework talks about FeatureOps.

The DORA report mentions FeatureOps as kind of a ubiquitous practice of high-performing teams, and the reason is simply that there are just different tools for the job. DevOps is great for getting code into production. Once that code is in production, controlling it in real time, meaning rolling back, rolling out, requires a different mechanism.

And, and that's what feature flags really, uh, excel at, and what FeatureOps enables for teams.

Mehmet: Great explanation. Thank you, Michael. Even for myself, it was very informative. Now, how does this affect, um, decision making, uh, from an engineering perspective, and can it also help, you know, in the governance of, on who takes responsibility, uh, you know, because you, you talked about the kill switches also, right?

So if I want to think about it also from a governance layer, uh, without slowing teams, uh, can, can [00:30:00] I put it in this context as well? Like, basically, without having trade-offs, still have the safety, uh, when, when pushing new updates.

Michael: Yeah. Um, absolutely. And this is, um, like, this is what I call the, the, the magical one-plus-one-equals-three of FeatureOps, which is one of these rare technologies where you, you, you're mitigating risk, for reasons we just described.

Right? It's really advantageous in the middle of an outage to be able to just, like, turn off that outage. Um, but usually, when we think about risk mitigation, it's: we're gonna introduce some friction for the benefit of mitigating risk. And often that is a sound business decision, right? This is, you know, an approval, an approval is, is an example of that, right?

I'm gonna get an approval so that I can move forward in [00:31:00] some process and that approval is gonna take time, but I'm gonna make sure that I'm making the right decision. Um, and, and people introduce approvals all the time. However, they, you, you'd think twice about, Hey, if our goal as an organization is to increase our efficiency, which means moving faster for the same, um, uh, uh, amount of investment, do we really wanna be introducing friction?

Um, FeatureOps actually helps you move faster while you're also mitigating risk. And it's like, well, how is that possible? Well, uh, I'll, I'll use DORA as a, um, as a way to explain this. One of the things that the DORA metrics, the DORA report from, from Google, points out is that teams that ship in smaller batch sizes, um, ship faster.

Mm-hmm. Bring new, new features and new capabilities to market faster. [00:32:00] And what do you need to do, um, to ship in smaller batch sizes? You need to ship more frequently. And the more frequently you ship, actually, the fewer, um, uh, production issues you end up having. That's like, hmm, that's really, really interesting.

'cause you would normally think that if I'm doing things more frequently, I'm gonna see more errors. Actually, when you do things more frequently in smaller batch sizes, you end up with less errors. This is very, very well researched. This is not Michael saying this is, go go to any number of Dora reports, um, as, as they just, uh, launched the, the 20 25 1 20 24 20 like it is.

This is just a core. Kind of feature of, it's like, you know, it, it, um, it's a core feature of how the best engineering teams in the, in the world work. In order to ship that frequently. Oftentimes you have capabilities that are not ready for primetime. Why are you shipping? Well, we wanna ship so we can get feedback.

That could be from an [00:33:00] internal dev/test environment, from a cohort of beta users, or from internal employees. When you ship something, you can see whether it's fit for purpose or not. And when you do that frequently behind a feature flag, you can control who sees that capability.

This is the mechanism that enables engineering teams to ship faster, to experiment more, to see what's working and what's not, and at the same time to mitigate risk. Because let's say I'm experimenting with a new capability, and again, when I say capability, I mean a backend change or a frontend change.

It really doesn't matter, and we encourage teams to use feature flags for both. In fact, I would say, if you have to choose, do more on the backend side than the frontend side, simply because those are typically the riskier changes: a CSS bug is less likely to bring down your entire system than something happening at a key control layer on the backend.
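(Editor's aside: the targeting Michael describes, a beta cohort plus a percentage rollout, can be sketched in a few lines. This is a minimal hypothetical helper, not the Unleash SDK; it buckets users deterministically by hashing the flag name together with the user ID, so the same user always gets the same answer for the same flag.)

```python
import hashlib

def is_enabled(flag: str, user_id: str, rollout_pct: int,
               beta_users=frozenset()) -> bool:
    """Decide whether `flag` is on for `user_id`.

    Beta-cohort users always see the feature; everyone else is
    bucketed deterministically so evaluations are stable over time.
    """
    if user_id in beta_users:
        return True
    # Hash flag + user so each flag buckets the user base independently.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_pct
```

At 0% only the beta cohort sees the change; raising the percentage widens exposure without redeploying, which is the "control who sees that capability" mechanism in miniature.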

[00:34:00] But if something does go wrong, at the click of a button, or via an agent that monitors what we describe as impact metrics, like the error rate associated with the application running that particular feature, you can say: hmm, there's a strong correlation between this feature being turned on and this error rate going up. Let's roll that back.

All of these things happen at the same time, and that's why teams that adopt this way of working actually have fewer outages and ship faster. Normally we think of those as in opposition to one another, that we can do one or the other, but in this case we can do both.
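(Editor's aside: a sketch of that correlation-and-rollback check, with hypothetical names and thresholds rather than any product's real API. The idea is simply that a flag should be killed when traffic with the feature on errors measurably more than the baseline.)

```python
def should_kill(flag_on_error_rate: float,
                baseline_error_rate: float,
                tolerance: float = 0.01) -> bool:
    """Kill-switch heuristic: trip when traffic with the feature on
    errors noticeably more than baseline traffic."""
    return flag_on_error_rate > baseline_error_rate + tolerance

def flags_to_roll_back(flags: dict, impact: dict) -> list:
    """Given flag states and impact metrics as
    (error_rate_with_flag_on, baseline_error_rate) pairs,
    return the enabled flags that should be rolled back."""
    return [name for name, on in flags.items()
            if on and should_kill(*impact[name])]
```

In practice the impact metrics would stream in from monitoring, and "roll back" means flipping the flag off, not redeploying, which is why the recovery is near-instant.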

Mehmet: Right. Now, I can't skip any episode nowadays without asking about AI, and I want to ask you about AI in taking these decisions.

So in DevOps, and correct me if I'm wrong, Michael, I'm not the [00:35:00] guru of this, but I can imagine that people today are using AI to write configurations and, together with automation, to get certain tasks done, much like with coding. I think we're reaching a similar point now with FeatureOps, right?

Do you see AI also becoming a decision engine behind deciding when to expose, when to roll back, all the things you just mentioned, taking these hard decisions? What's the role of AI in all this, in your opinion?

Michael: Yeah, it's already starting to happen. A couple of things I'll point out.

One is that we're hearing organizations say: about 30% of our engineering work is keeping the lights on. There's a patch that needs to be [00:36:00] applied, or we need to rotate all of our credentials, or various other work that isn't really strategic but is needed for the platform to continue to operate.

Well, to operate securely, et cetera. We're hearing that AI is going to be used, from an agentic perspective, first in, well, I shouldn't say first, because code reviews, documentation, refactoring, a first pass at filling out a PR or writing code that responds to a PRD, those types of things are already happening.

But when we're talking about making a significant impact on the overall engineering effort, we're hearing that this keep-the-lights-on work is going to leverage AI. So that's the first thing.

[00:37:00] We're also hearing that a lot of teams are nervous about this type of work. A lot of the tooling that would make this effective in practice relies on signals coming from a variety of systems, and those signals are increasingly made available to the LLM via MCP, the Model Context Protocol.

A lot of organizations are also not allowing MCP servers to be installed by individual developers. For instance, we've heard of multiple cases where all teams are using Microsoft Copilot for AI-assisted development; however, they're not able to install an MCP server.

Unleash, for instance, has an MCP server that can do things like this: when I'm about to create a PR, it can analyze the code repository and tell me whether this change is trivial, in which case it's not even worth [00:38:00] putting behind a feature flag, or on a critical path, where I do want to use a feature flag.

And if I do, it can add that feature flag automatically to the change as part of the PR. It can create the flag within Unleash. Based on your process, it can create a release plan that defines the stage gates a feature needs to go through in order to progress.

And ultimately it can use what we describe as impact metrics, the signals coming back from the application, to say: okay, we've gone from 10% to 20% to 50% to 80%, and there's no digital screaming happening, right? The error rates have not gone above a certain threshold, so I'm comfortable. Or: ooh, when I went from 20% to 40%, I did see an increase in my P95 error rate, so I'm going to roll it back.

Ultimately, teams will be [00:39:00] able to do this. When it happens, I think, is going to vary dramatically between companies and industries. Even within, say, a financial services company, they might have some services considered low risk, and those will be first to adopt this; the higher-risk services will be last.
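(Editor's aside: the staged rollout Michael describes, 10% to 20% to 50% to 80%, advancing only while an impact metric stays healthy and stepping back otherwise, can be sketched as below. All names, the stage ladder, and the use of P95 latency against an assumed SLO are illustrative, not any product's API.)

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile of a list of metric samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def next_stage(current_pct: int, samples, slo: float,
               stages=(10, 20, 50, 80, 100)) -> int:
    """Advance one rung up the rollout ladder while P95 stays within
    the SLO; otherwise fall back one rung. `current_pct` must be a
    value on the ladder."""
    i = stages.index(current_pct)
    if p95(samples) <= slo:
        return stages[min(i + 1, len(stages) - 1)]
    return stages[max(i - 1, 0)]
```

The agentic version of this is the same loop with the human approval removed at the low-risk stages: the gate computation stays deterministic even when an LLM decides whether to invoke it.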

But the public service announcement I need to make right now is that you don't have to use Unleash to use feature flags to mitigate the risk of all of this. But please, as someone who has a bank they rely on, as someone who seeks medical care: please, please, please. The services that your listeners provide

are the things that not only make our lives enjoyable in the moment, but that we rely on, that our children rely on. And I think [00:40:00] there's a certain level of responsibility that comes with that to make sane decisions. There are many, many ways that don't involve using Unleash to make feature management and FeatureOps a core part of your engineering process.

And when we add AI into it, when we add a non-deterministic system into our core engineering processes, where the potential errors themselves are also non-deterministic, I think it requires a certain level of humility. And this is the thing with the Google outage and the Cloudflare outage.

No one at Google, no one at Cloudflare, when they were driving to work that morning or walking down into their home office with a cup of coffee, said: today is the day I'm not going to be able to buy something online because my company brought down half the internet. By definition, these are unplanned outages.

[00:41:00] They were not expected.

Mehmet: right? 

Michael: The unexpected will continue to happen, and FeatureOps is simply one more tool in the tool chest we can use to avoid this. So AI is definitely happening. I think FeatureOps can mitigate a lot of the risk and help us get there faster, which is the promise: all organizations want to be more efficient.

But it's up to engineering teams to say, hey, I'm going to prioritize this work. And again, it's one of those things where you actually move faster and have less risk. What could be better? But I also realize it's a change, and sometimes the change portion of it is hard, even if the outcome is objectively better.

Mehmet: And I think, correct me if I'm wrong, Michael, that as we're calling it today, we're going to have the human in the loop all the time. It's not fully autonomous AI that can take the whole decision, because as you said, it's non-deterministic. [00:42:00] We can't know, right?

Sometimes, and this is the nature of life, I would say, not only in software, things just happen. Outages can happen, and then you need to take these decisions. Maybe AI can narrow things down for us; root cause analysis is something it's good at. I make this analogy because I'm reading a book about it now:

it's like when radiologists rely on AI to predict whether someone has a certain disease, but it's still the human who gives the final verdict. So keeping the human in the loop, I think, is important too. Now if you want to look forward, Michael, at the architectural decisions teams should be aware of today to be ready for this [00:43:00] AI-driven delivery era, as I would call it,

what would that look like?

Michael: We're having a lot of conversations now with very senior engineering leaders at some of the largest organizations in the world. These are people responsible for teams of 10,000-plus developers, and what they're hearing from their boards is clear: go, go, go

with regard to AI investments. They also know that that doesn't mean assuming unlimited risk. Just move fast and break things? These are businesses with literally billions of dollars in revenue and customers that rely on their services. They need to move quickly, but they also need to move safely.

At the same time, it feels early to mandate a way of doing it. And one of the challenges we hear from them is: [00:44:00] I know I need to encourage a thousand flowers to bloom within my organization. I don't know where our killer use case is going to come from. I have some ideas, my team is working on it, but ultimately I think we're all going to be surprised by

where that next, let's call it internal killer app, comes from, and which tool, which process, which people, which team is going to enable it. So I want to be fairly permissive with the ability to experiment, because that's how I'm going to drive innovation.

However, I'm the one responsible for the systemic risk and the existential risk that can come with these technologies. And this is where one of the architectural control points we're advocating for comes in. Again, this is an architecture-level discussion: yes, Unleash has a platform that enables this, but there are other [00:45:00] platforms that enable it, including things you write yourself.

It's not about a particular tool; it's about a process, which is to say: okay, for any non-trivial change, and it doesn't matter if it's Claude or Copilot or whatever you name, we're going to make sure that change is wrapped in a feature flag. That might be a system we've built ourselves, a commercial off-the-shelf product,

open source, or an enterprise offering; we couldn't care less about that part of it. However, each change is going to be wrapped in this feature flag, which again enables speed of innovation, enables more experimentation, faster, and enables the agentic workflows where we can roll out and roll back automatically.

But importantly, it enables that kill switch. And as the CIO or CTO of a massive organization, this is one control [00:46:00] point I can implement by fiat, the same way as: okay, people, I know it's annoying, but we're doing two-factor authentication.

Mehmet: yeah, 

Michael: You don't have a choice about this.

This is not an individual preference issue, like which IDE I like to work with; this is something that could pose existential risk to our company. So I'm making a decision, by fiat, that we're going to do it this way. Within that, you can do whatever you want. And with two-factor authentication, the thing it enables for you is that you can install anything on your device, because we have a central mechanism to know it's actually you doing it and not someone else.

So it brings benefits to the individual developer, even if that developer might not like the individual decision. We see feature management as one of those key control layers for the CIO or the CTO, where individual teams then have a ton of autonomy in terms of the individual tools, how they implement it, and what they consider a material change versus [00:47:00] not.

There's a lot of flexibility in there, but it gets you to a point where what we saw with Cloudflare and what we saw with Google, these hundreds-of-millions-of-dollars outages, no longer happen. And we think that's incredibly important.

Mehmet: Absolutely. I think this is the main takeaway for today, Michael.

And as you said, I'm begging people too: please go implement the kill switch, now more than at any time before, I would say. Final, traditional question, Michael: where can people find more information and maybe get in touch with you?

Michael: Yeah. So on social media, I personally am Ferranti: M-F-E-R-R-A-N-T-I-M.

For my company, it's getunleash.io. We're also open source, so on GitHub you can go to Unleash/unleash and [00:48:00] find our open source project. Or just Google it, or ask ChatGPT, whatever floats your boat.

We're not hard to find, and we'd love to hear from people. In fact, if you think anything I said was off, if you have a different view, hit me up as well, because that's what makes my job interesting. It's when our ideals confront the reality of an engineer with requirements outside what we've conceived that we improve,

both as individuals and as a company. So please do reach out.

Mehmet: Sure, and I'll make people's lives much easier: they don't have to do anything except go to the show notes, where you'll find all the links Michael just mentioned. Michael, I really appreciate the time. It's very interesting, and not just interesting for the sake of being interesting; it's an important topic we discussed today. And the two examples you gave, the Google and Cloudflare outages,

I think will resonate with anyone, from CTOs to engineers, [00:49:00] all the way through that full stack, I would call it, to think about how we can make sure we keep the lights on, as you said. So I also hope, from my side, that whoever listens to or watches this will benefit from it.

And thank you very much for your time, Michael. And this is how I end my episodes; this is for the audience, my call to action, a simple one. If you just discovered us, do me a favor: subscribe and share the show with your friends and colleagues. And if you're one of the people who keep coming back, the loyal fans of the podcast, I really appreciate all the help.

We're probably airing this a bit late because I have a very busy schedule, so this is effectively the first episode of 2026. Happy New Year! And thanks for all the help over the past three years: the podcast has finished its third year, and we're entering the fourth year now.

2025 was exceptional because we were able to enter the top 200 [00:50:00] Apple Podcasts charts across multiple countries. It's an honor that couldn't have happened without the support of the people who keep coming back and listening to the podcast. And as I always say, stay tuned for a new episode very soon.

Thank you. Bye.