Key Takeaways
- Platform engineering is about systems of systems – starting from the very purpose and the "why" for its existence, to the people that build it, operate it, use it, and the ecosystem that consumes, enhances and leverages it.
- A platform is not an end in itself; it evolves organically with its environment while staying anchored to accelerating concrete business value.
- A successful platform makes users highly autonomous: it is intentional about what it offers in order to bring velocity and efficiency to your users, making it easy to do what is right and hard to do what is wrong.
- A successful platform is surprisingly delightful to use; being thoughtful about migrations, onboarding and day zero adoption.
- A successful platform bakes in trust and accountability for safe, sustainable and healthy evolution of itself, and the ecosystem that builds on top of it.
In this article, I will share key lessons I have learned while building and delivering three platforms over the last two decades from VMware and Stripe to Apollo GraphQL, including where we got stuck, how we unblocked ourselves, and what ultimately led to the right outcomes for our users and the business.
Solis
We were tasked with reining in our burgeoning cloud costs, and we didn't know what we were spending on. This is a story about the inception and evolution of our Cost Platform, or Solis.
We started building cloud observability with the aim that 100% of our cloud costs would be owned by our users, and 100% of our users would see every single penny they were spending. Our thesis was that building the platform and providing this observability would automatically help make Stripe efficient and bend the curve of spend. But our platform didn't magically drive impact. The mistake we made was assuming that the platform and its existence itself were the outcome. We thought that we'd build it and the results would just follow.
A North Star
We realized the value we had to create had to be anchored on some business outcomes. The first step towards that was understanding the why. Platforms take time, they take commitment, they take effort, and have a high degree of opportunity cost. Engineering, as we know it, is the single biggest asset in any software-driven company. The opportunity cost of going down the wrong path or fixating on a wrong problem can be very grave for the entire organization. Before you embark on this journey, you want to be very intentional about the why behind your platform.
A part of asking that why is finding out what your North Star is: a goal or a value maximization function that will roughly hold true for the next five or ten years. In our case, cloud cost observability was not a 5+ year problem. It was definitely a stepping stone, but it was not the outcome. So, we moved our metrics from cloud cost observability to cloud efficiency, which we defined as the total spend on cloud infrastructure as a function of the total revenue at Stripe. As a compensating control, we also defined a unit metric, the cost per transaction, to see whether Stripe was getting more or less efficient over time. Those were our business key performance indicators (KPIs), and we used them to get our stakeholders (the CTO, the CFO, and the founders) aligned.
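To make the two KPIs concrete, here is a minimal sketch of how they could be computed per reporting period. It assumes "as a function of" means a simple ratio of spend to revenue, which the article does not spell out. The `Period` type, its field names, and the figures are hypothetical and purely illustrative, not real Stripe data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """One reporting period. All fields are hypothetical illustrations."""
    label: str
    cloud_spend: float   # total cloud infrastructure spend, in dollars
    revenue: float       # total revenue, in dollars
    transactions: int    # number of transactions processed


def cloud_efficiency(p: Period) -> float:
    """Cloud spend as a fraction of revenue (assumed to be a plain ratio)."""
    if p.revenue <= 0:
        raise ValueError("revenue must be positive")
    return p.cloud_spend / p.revenue


def cost_per_transaction(p: Period) -> float:
    """Unit metric: dollars of cloud spend per transaction."""
    if p.transactions <= 0:
        raise ValueError("transactions must be positive")
    return p.cloud_spend / p.transactions


# Made-up numbers: spend grows, but both metrics improve.
q1 = Period("Q1", cloud_spend=200.0, revenue=1000.0, transactions=4000)
q2 = Period("Q2", cloud_spend=240.0, revenue=1600.0, transactions=8000)

for p in (q1, q2):
    print(p.label, cloud_efficiency(p), cost_per_transaction(p))
```

Tracking both numbers together is what makes the unit metric a compensating control: absolute spend can rise while the business still gets more efficient, and the ratio alone can look healthy while unit costs drift.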
Leverage
Once you have a North Star, the second thing is leverage. Platforms have to be able to give you a leg up in some ways. For us, we didn't want all engineering teams to become 5% more efficient, but we wanted those two to three key teams to become a whole lot more efficient. In our case, it was our database teams, our data platform teams, and our machine learning teams that had to account for 60% to 80% of those efficiency targets. To get leverage, remember the 3Rs: reduce, reuse, recycle.
A big leverage point with platforms is that you can abstract away and reduce complexity. This complexity could be domain-shaped, such as when you don't want multiple teams to figure out the guts of your banking vendors or your cash platform. Or it could be technical or skill-set specific, such as when you don't want many teams to have to think about the guts of Kubernetes, service networking, and service mesh.
The next is finding ways to reuse policies and build for governance and standardization on the platform. For us, we wanted to make sure that 100% of our cloud resources would be owned, so at this step we enforced those policies. We baked governance tooling directly into the platform.
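As an illustration of what such an ownership policy might look like, the sketch below checks a resource inventory for a non-empty owner tag and reports coverage. The inventory shape (dictionaries with `id` and `tags` keys) and the `owner` tag name are assumptions made for this example, not the actual Solis data model.

```python
from typing import Dict, List


def unowned_resources(resources: List[Dict]) -> List[str]:
    """Return the ids of resources with a missing or blank 'owner' tag."""
    return [
        r["id"]
        for r in resources
        if not r.get("tags", {}).get("owner", "").strip()
    ]


def ownership_coverage(resources: List[Dict]) -> float:
    """Fraction of resources that have an owner (1.0 for an empty inventory)."""
    if not resources:
        return 1.0
    owned = len(resources) - len(unowned_resources(resources))
    return owned / len(resources)


# Hypothetical inventory, for illustration only.
inventory = [
    {"id": "db-1", "tags": {"owner": "storage"}},
    {"id": "vm-7", "tags": {}},
    {"id": "bucket-3", "tags": {"owner": " "}},
    {"id": "queue-2", "tags": {"owner": "data-platform"}},
]

print(unowned_resources(inventory))
print(ownership_coverage(inventory))
```

A check like this can run as a report first and later as a hard gate that rejects new resources without an owner, which is one way to move toward a 100% ownership target.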
Lastly, leverage is about recycling code through designing the right primitives. We built out data primitives for data archival and retention. Then over time, we could reuse them for different data regulatory needs, whether they were PII, GDPR, or data locality.
In short, when thinking about leverage through your platform, think about what reduce, reuse, and recycle look like as you're building the platform out.
Coverage
The last guiding principle is coverage. If leverage is about being slightly tall, coverage is about being wide. When you're thinking about coverage, the three things that come to mind are being opinionated, cohesive, and coherent.
When I say opinionated, I mean you don't want your platform to be too wide, where it's doing something for everyone, but not really doing anything for anyone. At the same time, you don't want it to be too narrow, where it feels like it's a tool, but not really a platform. You could think of the 80/20 Pareto split where 80% of use cases are solved by 20% of your effort.
In other words, you want to be very opinionated about which users and which use cases you're going to support. To do that, you might have to even be intentional and say no to some. That brings me to the second bit, which is being cohesive. Think about the different nouns and verbs, the building blocks and legos that are going to make up your platform. You want these to cohesively fit with each other.
Lastly, your platform needs to be coherent. As you're building out all these sub-components of your platform, you want to do it in a way that your platform evolves harmoniously and logically, taking into account the changing needs of your users.
Impact
We also dogfooded Solis ourselves, identifying numerous cost saving initiatives, which were small to large engineering projects, from turning off unused instances to migrating key compute-heavy workloads to Graviton.
We drove efficiency wins, saving tens of millions of dollars. We succeeded in changing our engineering culture, moving from reacting to costs to actively engaging in upfront planning during the platform's design phase. This included estimating the expenses of new projects such as multi-region setups and disaster recovery beforehand.
To sum everything up: we had identified the why, a lofty North Star, and we got opinionated about our leverage and coverage. Most importantly, we learned that building a platform was not the outcome. It's not the end in itself. It has to serve a purpose. And that purpose has to be anchored around accelerating value for the business.
ShuttleCrawler
My second story is about our Service Delivery Platform, also called ShuttleCrawler.
We had a monolith and a monorepo with some undesirable dependencies due to years of accrued tech debt. In our developer productivity surveys, we would routinely have users sharing concerns around not being able to confidently know what changes were running in production. We also saw how this would then manifest in incidents due to accidentally rolling out breaking changes to production.
So we decided to build a Service Delivery Platform. We had our why - we were clear about our purpose: make the product engineering teams a whole lot more productive and make change management safer. We would reduce the lead times by over 50%. In doing so, we would also abstract away a lot of those complexities that just inherently come with service deployments, whether it's the guts of Kubernetes, scheduling, service networking, service mesh, and so on.
If the why is about creating value for the business, the what is all about driving velocity for your users, bringing delight to your users, and making your users awesome at what they do. This requires bringing a product mindset to building a platform.
The Double Diamond
This is where I found it very useful to think in terms of the Double Diamond framework, where the first diamond is about product discovery and problem definition and the second is about building a solution. While in the first diamond you can do divergent thinking and ideation, either widely or deeply, the second diamond allows for action-oriented, focused thinking that converges into developing and delivering the solution.
Applying the product mindset requires you to first define who your user is, being very intentional about the user persona and their jobs to be done (JTBD). Here I found it very valuable, time and time again, to focus on ten super delighted users, versus 1000 who are partially satisfied with what we have to offer.
Building delightful user experiences and making users awesome requires us to deeply understand their workflows and the toolkit we can offer in addition to the platform. What are the gardened/paved paths? What are the guardrails baked into the platform? What escape hatches do we need to provide? The goal is for the users to find it easy to do what is right, and hard to do what is wrong.
When we've done this user analysis correctly and understand our persona's skill set, their strengths, and their pain points, then using the platform feels like wielding a Swiss Army knife with a variety of tools, specific to the need and the occasion.
That brings me to the last bit, which is progressive disclosure. One way of looking at it is in terms of layers. Platforms are for developers, and as you're abstracting away some complexity, you will always have some developer who is curious enough to peel the onion, look under the hood, and tinker around to suit their own unique needs. A successful platform should offer programmable and composable building blocks for that class of developers.
Migration strategy
A year in, we had less than 5% of production traffic on ShuttleCrawler. Not only that, we now had the dual cost of maintaining two delivery platforms: the new one, which was not yet at feature parity, and the old one, which continued to see scaled usage. Net-net we were far worse off than when we started 18 months ago!
We realized that we had left the adoption of the platform to a "build and they'll come" mindset. We had not been intentional about how our platform would get adopted or about the level of investment we were expecting the organization to take on once the platform was shipped. Consequently, we couldn't land the impact we had set out to achieve: our work wasn't "done" once the platform was built out.
To fix this, we built out a detailed migration strategy, accounting for the off-ramp from the old platform and the on-ramp onto the new platform. We found that we had to build out an A/B testing tool, so that our developers could incrementally dial up the traffic from the old to the new platform, while promising ~zero downtime throughout the migration.
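One common way to dial traffic up incrementally is deterministic percentage-based routing, sketched below. This is not the tool described above, only an illustration of the idea. The function names and the choice of hashing a request id into 100 buckets are assumptions made for this example.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_id: str, new_platform_percent: int) -> str:
    """Send a request to the new platform if its bucket is below the dial."""
    if not 0 <= new_platform_percent <= 100:
        raise ValueError("new_platform_percent must be between 0 and 100")
    return "new" if bucket(request_id) < new_platform_percent else "old"


# Dial a hypothetical service from 0% to 100% in steps.
request_ids = [f"req-{i}" for i in range(1000)]
for percent in (0, 5, 25, 50, 100):
    on_new = sum(route(r, percent) == "new" for r in request_ids)
    print(f"{percent:3d}% dial -> {on_new} of {len(request_ids)} requests on new")
```

Because the bucket depends only on the request id, a request that is routed to the new platform at a low dial setting stays there as the dial goes up, and setting the dial back to 0 acts as an immediate rollback.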
We white-gloved the first tier of critical services, and applied those learnings to building out further automation that reduced the migration cost for the remaining services. Additionally, we got service developers who had already migrated to talk about how successful they were with the new platform. Their improved developer productivity, their support, and their testimonials greatly helped adoption of the platform for the long tail of services.
Impact
Shifting from the "build and they'll come" mindset to a more intentional and focused objective, we were able to lower the time-to-value and increase delight for our users.
About a year in, 100% of our services were on ShuttleCrawler, and every new service started onboarding directly onto this platform. Our average lead times reduced not just by 50%, but by 65%, and we saw far fewer breaking changes in production.
We deeply cared about making our users awesome and it was critical to bring a product mindset to building and rolling out our platform. But the true impact came from focusing on day zero adoption and reduced friction, which drove high velocity & autonomy for our users.
Monster
Monster was an async processing platform with remote task execution for a monolith written in Ruby. In 2020, during COVID, we saw unprecedented use in online commerce. Some of that demand was also showing up in the state of our systems. Deploy success rates to Monster fell to 20%, which not only impacted developer productivity but also caused a massive business impact. Everyone was rapidly losing trust in this critical platform.
The importance of the How
The big question we were now faced with was how. How would we support sustainable, safe, and healthy growth of our platform? How should the platform be used, and what did "good" look like here? If the why is about creating value, and the what is about velocity for users, the how is about veracity: driving that trust and accountability back, both in the platform and in the ecosystem around it.
It was very helpful for us to remember that the users of our platform have obligations from their users. If users are choosing to use our platform, they're taking a bet on us, they're taking a leap of faith on us, our decisions, our intentions. This trust is the implicit contract baked into a platform.
Platforms cannot be shaky: solid fundamentals (reliability, security, privacy, compliance, disruption) and operational excellence are table stakes, not a nice-to-have. Our platforms have to be stable. In our case, we decided to put a stop to all feature delivery for about a quarter, did a methodical analysis of all the failures that led to the massive drop in deploy success rates, and focused on crucial reliability efforts until we brought this metric back up to 99%+.
We also identified several key traits of trust and accountability: backward compatibility, a long-term support story, and deprecation policies. We communicated those policy changes thoughtfully to our users so they could better plan their feature delivery.
The last mile to success
Once we'd solved for the low-hanging fruit, we hit a roadblock. We found that most of those Monster tasks didn't have owners, were not rate limited, and were not isolated, which gave us "noisy neighbor" issues. We realized we were running into an extreme case of organizational debt, caused either by frequent reorganizations or by people leaving the company. In short, the platform had no accountability built in to boot off the execution tasks that were ruining the experience of other use cases on the platform: a few rotten mangoes were making the environment unsustainable and toxic for the broad majority.
To prevent abuse or misuse, we put in isolation and rate limiting, both in terms of compute and memory, to avoid those "noisy neighbor" issues. We implemented graceful service degradation to manage sudden spikes in traffic. For platforms catering to external developers, incorporating measures for handling DDoS attacks and enforcing governance can greatly enhance accountability within the platform's framework.
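A token bucket is one widely used way to rate limit work per tenant, and the sketch below shows the idea. The article does not say which algorithm Monster used, so treat this as an assumption. The class, its parameters, and the per-owner dictionary are hypothetical names chosen for this example, and the clock is passed in explicitly to keep the behavior easy to follow.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float = 0.0):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.updated = now          # time of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per owning team isolates a noisy tenant from the others.
buckets = {
    "team-a": TokenBucket(rate_per_sec=1.0, capacity=2.0),
    "team-b": TokenBucket(rate_per_sec=1.0, capacity=2.0),
}

# team-a bursts three tasks at t=0; the third is rejected.
print([buckets["team-a"].allow(now=0.0) for _ in range(3)])
# team-b is unaffected by team-a's burst.
print(buckets["team-b"].allow(now=0.0))
```

Rejected work can be queued, delayed, or shed, depending on how the platform chooses to degrade under load.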
We built out a chargeback and accounting model in partnership with the cost efficiency team. We attributed task execution costs to the owning teams, and we leveraged the cost platform, Solis, to drive healthy use of the platform.
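The attribution step can be pictured as a simple roll-up of execution costs by owning team. The sketch below assumes cost is proportional to CPU seconds at a flat rate. That pricing model, the record fields, and the team names are hypothetical and are not taken from the actual chargeback model.

```python
from collections import defaultdict
from typing import Dict, List


def chargeback(executions: List[Dict], rate_per_cpu_second: float) -> Dict[str, float]:
    """Sum the cost of task executions per owning team."""
    totals: Dict[str, float] = defaultdict(float)
    for execution in executions:
        cost = execution["cpu_seconds"] * rate_per_cpu_second
        totals[execution["owner"]] += cost
    return dict(totals)


# Made-up execution records, for illustration only.
records = [
    {"task": "send-receipts", "owner": "payments", "cpu_seconds": 10},
    {"task": "score-batch", "owner": "risk", "cpu_seconds": 5},
    {"task": "reconcile", "owner": "payments", "cpu_seconds": 30},
]

print(chargeback(records, rate_per_cpu_second=0.5))
```

A roll-up like this only works when every task has an owner, which is why attribution and ownership enforcement tend to go hand in hand.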
Lastly, we started enforcing invariants such as 100% ownership (every Monster task had to have an owner, with code owners in the git repo) and compute and memory limits, which required task owners to think through task optimization upfront rather than have the platform fall over after the fact.
Impact
At peak usage, with 10X more load, and 2X more changes a day, Monster not only achieved 99.9% deploy success rate but also eliminated most of our severity 0 incidents (SEV0s).
We led with trust and empathy, drove operational excellence, and provided much higher stability of the platform. Most importantly, we baked in governance for healthy use and sustained growth of the platform, and its ecosystem – through driving accountability both from ourselves & our users.
Summary
Each journey of mine has empowered me to build richer, more empathetic and sustainable platforms – guiding principles which we are today bringing to building Apollo GraphOS for our users. GraphOS provides everything you need to build an API platform for the modern stack. This includes an underlying architecture that promotes API composability and reuse (the supergraph!), secure, governed & highly reliable infrastructure (Apollo Router) and the tools & workflows so that teams can delightfully consume and publish to the graph (Apollo Studio), at any scale.
I've learned that platform engineering is not just a technical problem to solve, and it is not an end in itself. It is a journey toward an ideal that evolves with the evolving needs of your organization and business.
Our role in that journey is to have our feet planted on the ground, being careful and acknowledging what today looks like, but also having our eyes set to the skies, where we're aiming for that ideal.
It's about being intentional around the why of our platform’s existence, the what for its users and the how of sustainably baking trust into it. We can build successful platforms with the three guiding principles of value, velocity, and veracity:
- Drive acceleration for the business, through early alignment toward a lofty North Star, opinionated coverage and leverage;
- Make users awesome & autonomous by not only building a delightful platform, but also being intentional about its adoption; and
- Foster a sustainable ecosystem, leading with trust and baking in accountability directly into the platform.
I hope this gives you a primer for building successful platforms, anchored on the objectives you seek, the outcomes you desire, and the behaviors you want to incentivize or disincentivize within your engineering organization.