Transcript
Mudgal: In this talk, we will discuss how ad serving typically works, and go into some of the details of how we scale ads at Pinterest. Similar to every major consumer platform you might see, like Google or Facebook, a personalized experience is at the heart of Pinterest. This personalized experience is powered by a variety of machine learning applications, each trying to learn complex patterns from large-scale data. For Pinterest, that is about 460 million active users and billions of pieces of content. The decisions and design choices we make are driven by the scale of the data being learned from. At Pinterest, even the touchpoints a user might have can be very different. It could be something as simple as the home feed, where the user doesn't provide any context, and we learn from past engagements and interests the user has provided to the platform. There is another surface like search, which you see on Pinterest and also on Google, where you can do text-based searches. Here, the interest of the user is well defined and comes through the search query. There are also visually similar pins on the related pins surface, where the context comes from the pin the user has clicked on. Behind all of this are different machine learning models powering the ecosystem: models for understanding the user, understanding the content, serving ads, and also many different trust and safety and compliance applications.
Outline
We will discuss how ads marketplaces work and go over the delivery funnel, cover some of the typical parts of the ad serving architecture, and go into two of the main ranking problems: ads retrieval and heavyweight L2 ranking. Then we will go into how we monitor system health as these models are trained. Lastly, we will go into some of the challenges and solutions for serving large models.
Characteristics of the Recommendation System
Just to give an understanding, these are the characteristics of a typical recommendation system. It needs to sift through billions of pieces of content. Every one of these social media platforms has tons of data in terms of what it can show to a particular user. It has to sift through that content at a low latency, around 200 milliseconds to 300 milliseconds. Since the content pool is large, and since the user base is large, we cannot precompute the prediction probabilities, so the system needs to be performant and make these predictions in real time. I'll go into some of the kinds of predictions these models make. Then there is high QPS. Similarly, these systems need to be responsive to user feedback and users' changing interests over time. To capture all of these nuances, we need to make sure these systems solve a multi-objective optimization problem.
Also, let's see how ads fit into the ecosystem for many of the major use cases. For Pinterest, if you click on a pin, you are shown many visually similar images, and that's where ads come in. Ads connect users with content on the platform, so that they can buy or engage with relevant content and take the journey from the platform to the advertiser's website. In all of these scenarios, this is a two-sided marketplace. Advertising platforms like Pinterest help connect users with advertisers and their relevant content. Users visit the platform to engage with websites and apps in general. Advertisers pay the advertising platform so that they can show their content. Users also interact with advertisers, off-site or on-site on the platform, and engage with them. In all of these, the basic scenario is that we want to maximize value for the users, the advertisers, and the platform. Let's see how that's done.
Ads Products
Advertisers can have different marketing goals, depending on how they want their content shown to users. It could be as simple as creating awareness for the brand, or driving more clicks on-site on the platform. When they do this, advertisers can also choose how much they value a particular ad being shown on the platform, by deciding their bids. There are two ways to decide the bids. One is, "I'm willing to pay $1 for every impression or click the platform delivers." The other is, "I have $100, can the platform deliver it more efficiently through auto-bidding?" Advertisers also choose what kind of image creatives, or what creatives, they want to use. To bring this into production, the advertising platform needs to define how likely it is that we should show this particular content to the user. This is defined through predictions, as simple as a click prediction: given a user, the journey they are on, and the content, what's the probability that this user will click on it? That's where machine learning models come into the picture. Not only this, catering just to clicks might not give the best relevance on the platform; it might promote spammy content. The platforms also have shadow predictions such as good clicks, hides, saves, and repins, which try to capture the user journey in a holistic way. There are also other products like conversion optimization, which tries to drive more traffic to the advertiser's website. For that product, the conversion predictions are for events that don't happen on the platform; they happen off-site.
Combining all of these, there are different kinds of predictions the system needs to generate. Let's imagine we want to expand the system to more creatives, like videos and collections. Not only do we need to make the predictions shown here, but we also need to understand what a good video view is on the platform. Combining all of these, it's a very complex system. Not only this, surfaces also have different context behind them. It could be the home feed, where we have no context for relevance at that particular time, or a search query, where the user has an intent behind it. As the product scales, we need to make sure we can make all of these predictions in a performant way and manage the complexity as the product grows. We cannot have a separate model for each of them, because we need to make sure the product can scale in a linear way, not an exponential way. Some of the design decisions we take are to cater for the product's linear growth.
Ads Serving Infrastructure - High Level Overview
Let's go into the ads serving infrastructure. First, the user interacts with a particular app. Through a load balancer, the request is passed in: we need to fetch content to show to the user through an app server. In most cases, this is then passed to an ad server with a request to send some ads to insert into the user's feed. The ad server controls all the logic of which ad to show to this user at this particular time. One consideration is that we need to do this in a very low-latency manner, around 200 milliseconds to 300 milliseconds for the end-to-end funnel. The request the ad server gets is something like a user ID, the time of day, or the IP address. The first service that comes in extracts features for this user. This could be something like the gender, the location, and how this user has interacted in the past. This is where feature enrichment comes into the picture; I'll go into some of the details. This is typically a key-value store where the key is the user ID and you retrieve the features.
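As a rough illustration of this enrichment step, here is a minimal Python sketch against a generic key-value store. The client class, keys, and feature names are hypothetical, not Pinterest's actual schema.

```python
# Minimal sketch of the feature-enrichment step, assuming a generic
# key-value client. Store, keys, and feature names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class AdRequest:
    user_id: str
    timestamp: int
    ip_address: str
    features: dict = field(default_factory=dict)


class FeatureStoreClient:
    """Stand-in for a key-value feature store keyed by user ID."""

    def __init__(self, kv_store: dict):
        self._kv = kv_store

    def fetch(self, key: str) -> dict:
        # In production this would be an RPC with a tight timeout and
        # a fallback to default features on a miss.
        return self._kv.get(key, {})


def enrich_request(request: AdRequest, store: FeatureStoreClient) -> AdRequest:
    """Attach precomputed user features to the incoming ad request."""
    user_features = store.fetch(f"user:{request.user_id}")
    request.features.update(user_features)
    return request


# Usage
store = FeatureStoreClient({"user:123": {"country": "US", "age_bucket": "20-30"}})
req = enrich_request(AdRequest("123", 1700000000, "10.0.0.1"), store)
print(req.features)
```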
The next step is candidate retrieval. Once you have enriched your feature space, the request is passed into a candidate retrieval phase, which I'll cover later in more detail. Candidate retrieval tries to sift through the billions of pieces of content you have to find the best set of candidates. If you can imagine, you have millions and billions of pieces of content you could show to the user, and in the end only three or four ad candidates are shown on the platform. This phase just makes sure we can search the entire ad index and find the best set of candidates to show to the user. These are then passed into a ranking service: not only do we need the best ad candidates, we need to make the predictions we discussed before in a more precise way. That's where the ranking system comes into the picture. These are heavyweight models that try to define the probability of a click in the right way. This ranking service also has access to feature generation. If you look at the RPC call from the ad server to the ranking service, you cannot transmit all the content features over the call, because the payload size would grow. Typically, around 1000 to 2000 candidates are sent to the ranking service, and sending all of their features together would bloat the system. Some considerations go into fetching these features in a more performant way and ensuring we have the maximum cache hits. The ranking service then sends the ads back to the ad server.
At this point, as in most traditional machine learning systems, the values of the features used to show that ad at a particular time are very important for training these machine learning models, so there is also an async request sent to a feature logging service, which logs those features. Also, to make the system more resilient, there are fallback candidates. If any of the systems goes down, or is not able to retrieve candidates, we have a system with fallback candidates so that the user always sees some content on the platform. From the ad server, the ads are sent back and inserted into the feed for the user. As the user interacts with the feed, there is an event logging service, which uses Kafka to log all of these events in real time. This event logging service is very important, and it operates at very high QPS. It defines, if the user interacts with or clicks on an ad, how we charge the advertiser at that particular time. Charging needs to be real time, because the promise the platform makes to advertisers is about the budget, the maximum budget they can spend in a day. If this logging pipeline stops being real time, we might overshoot the budget. That means we might deliver free impressions to advertisers, which is not desired. The events from this logging pipeline also flow into a reporting system, with hourly or daily monitoring. This reporting system also has a linkage to the features we log, because for advertisers we also want to show the ad performance with respect to different dimensions like country, age, or gender, or other features you might have on the platform. The event logging service and the feature logger together form the training data for all of these machine learning models. I will go into some of the details of how those pipelines are set up.
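As a rough illustration of the real-time event logging path, here is a minimal sketch using the kafka-python client. The topic name and event schema are assumptions for illustration, not Pinterest's actual pipeline.

```python
# Minimal sketch of real-time engagement event logging via Kafka,
# using the kafka-python client. Topic and schema are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def log_engagement(user_id: str, ad_id: str, event_type: str) -> None:
    """Emit a click/impression event; downstream consumers handle
    advertiser charging and budget accounting in near real time."""
    event = {
        "user_id": user_id,
        "ad_id": ad_id,
        "event_type": event_type,  # e.g. "impression", "click", "save"
        "ts": int(time.time() * 1000),
    }
    producer.send("ad-engagement-events", value=event)


log_engagement("123", "ad_456", "click")
producer.flush()  # ensure events are delivered before shutdown
```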
Also, let's see what happens when advertisers create a particular ad on the platform. Advertiser details like the budget, the creative type, and the optimization type go into an ads database. This could be as simple as a MySQL database. From this ads database, we want to make these ads servable to our systems. Typically, they go through an index publisher, which reads the ads you have. This index publisher can update the feature stores in a batch and real-time manner, or it can pass this information to our indexes, and the indexes get updated. Because of the size of these indexes, we need to make sure we can update them incrementally, so that updates land in a shorter duration of time. All of this together forms the basic building blocks of a recommendation system, and it is very similar to a content-based recommender system. The other important concept for ads is pacing and budgeting, which reads from the logging service to determine how much budget, or how much money, we have spent for each advertiser. Knowing that in real time is important so that we can decide whether to show an ad to the user, or not if the budget is already spent.
Ads Delivery Funnel
Also, let's look at what a typical ads delivery funnel looks like. In many cases, this is broken down into two steps: a retrieval and a ranking process. I'll go into some details of what each of them caters to. In the retrieval step, there are many candidate generators running in parallel; given a request, their goal is to get the best set of ad candidates. This could be based on different business logic, like fresh content, what the user has interacted with recently, or any embedding-based generators. These are then passed into blending and filtering logic, where we combine candidates and apply diversity constraints or policy enforcement, depending on other business use cases. These are then passed into a ranking model, which tries to make all of the precise predictions we discussed before. Once we have those precise predictions, we see what the value of inserting this ad for a particular user is. That's where the auction comes in. Depending on the value of that ad, we decide whether we want to show it to the user or not. Different business logic and constraints around allocation can also be handled at this time. Do we want to keep two ads together in the feed, or do we want to separate them? Those are some of the considerations that go into the allocation phase. Together, all of this produces the experience of how ads are inserted into the feed.
If you look at the auction, there are five major ingredients. Depending on how you tune each ingredient, you can move the platform metrics in a particular direction. The first is the auction mechanism: how do you want to define the auction? This comes from econometrics. Pinterest, in general, uses a second-price auction. The second most important ingredient is ads ranking: can you define the prediction probabilities in the most accurate way? That's most important for defining the value of showing this ad to the user. Third is allocation: how do you want to structure your ads? Do you want two videos to be together, or do you want your videos separated from images? Do you want ads from the same advertiser together? Those are some business restrictions you might have, depending on the context. If you have few ads to show, maybe that's ok. If you have a big pool of ads to show, you can probably reduce some of the duplication there. Fourth are quality floors. This is for the platform: the platform wants to enforce a minimum quality for its ads. This could be based on relevance or click-through probabilities. Finally, there is reserve pricing. Since we have a second-price auction, the idea is that we want advertisers to pay at least a minimum price, for example at least $1 for every 1000 impressions. That enforces the lowest amount the advertiser needs to pay. If there is little competition, advertisers could otherwise reduce their spend and say they don't want to pay more, so reserve pricing makes sure the platform at least gets the relevant pricing.
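To make these ingredients concrete, here is a toy Python sketch of a second-price auction with a quality floor and a reserve price. The bid-times-pCTR scoring, thresholds, and numbers are illustrative assumptions, not Pinterest's production auction logic.

```python
# Simplified second-price auction with a quality floor and a reserve price.
from dataclasses import dataclass


@dataclass
class AdCandidate:
    ad_id: str
    bid: float      # advertiser's bid per click, in dollars
    p_click: float  # predicted click-through probability


def run_auction(candidates, quality_floor=0.001, reserve_price=0.10):
    """Return (winner, price_per_click) or None if no ad clears the floors."""
    # Quality floor: drop ads whose predicted engagement is too low.
    eligible = [c for c in candidates if c.p_click >= quality_floor]
    if not eligible:
        return None

    # Rank by expected value to the platform: bid * pCTR (an eCPM-style score).
    ranked = sorted(eligible, key=lambda c: c.bid * c.p_click, reverse=True)
    winner = ranked[0]

    if len(ranked) > 1:
        runner_up = ranked[1]
        # Second price: the winner pays just enough to beat the runner-up.
        price = (runner_up.bid * runner_up.p_click) / winner.p_click
    else:
        price = reserve_price

    # Reserve price: never charge below the platform's minimum.
    return winner, max(price, reserve_price)


ads = [
    AdCandidate("a", bid=2.00, p_click=0.03),
    AdCandidate("b", bid=1.50, p_click=0.05),
    AdCandidate("c", bid=5.00, p_click=0.0005),  # filtered by the quality floor
]
print(run_auction(ads))  # ad "b" wins, pays roughly 2.00 * 0.03 / 0.05 = 1.20
```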
In terms of the optimization framework, there are two assumptions people make. One is that satisfied users will engage more with the product: if users are satisfied, they will have more interactions with it. Similarly, if advertisers are satisfied, they will increase their spend. To capture all of this, there are short-term goals the system optimizes for, hoping they will translate into the longer-term goals. It boils down to this equation in general: we want to maximize revenue subject to constraints that the user is satisfied and the advertiser is satisfied.
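Schematically, the short-term objective described here can be written as below; the exact terms and weights are assumptions, since the slide's equation is not reproduced in the transcript.

```latex
% Schematic form of the objective described in the talk.
\max_{\text{allocation}} \;\; \text{Revenue}
\quad \text{s.t.} \quad
\text{UserSatisfaction} \ge U_{\min}, \qquad
\text{AdvertiserSatisfaction} \ge A_{\min}

% In practice this is often relaxed into a weighted combination:
\max_{\text{allocation}} \;\; \text{Revenue}
  + \lambda_{u}\,\text{UserSatisfaction}
  + \lambda_{a}\,\text{AdvertiserSatisfaction}
```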
Ads Retrieval
Let's go into what ads retrieval looks like, and what the main motivation is. The main motivation is: how can we retrieve the best ad candidates with the best efficiency? These models need to be very lightweight, so they can make this decision at a very low computing cost. The way we measure the efficiency of this system is through recall. If this model says we should retrieve this set of ads, we look at how many of those ads are eventually shown to the user in the end. That's how the measurement works in the recall setup. Let's go into the signal enrichment I discussed before. Let's say this is the request: user ID 123 is requesting, and this is the pin the user is seeing. We set up many different graph-based expanders, which fetch extra features, depending on the context, from your key-value feature stores. This is the signal enrichment step that happens first. As an example, the user ID is converted into features such as: the user is female, based in the U.S., age 20 to 30. There are different sets of executions that can happen in parallel, or some that need to happen sequentially. Similarly, we enrich the pin with the different features that exist and are precomputed in the pipeline.
This is how retrieval works: through a scatter-gather approach. There are different components that retrieval caters to; I'll go into each of the objectives. The first component is lightweight scoring and targeting filters. This means we want to understand, given this context, how valuable this pin is using very simple models, and also apply targeting filters. Targeting means advertisers don't want to show their ad to all users in general, because that might not be useful. Advertisers can choose: do I want to restrict my ads to San Francisco, or do I want to restrict my ads to California? Those kinds of filters happen at this stage, to remove the ads that don't satisfy the advertiser's request. The next step is budgeting. This controls, if an ad has already exhausted its budget, whether we should pass it down the funnel or not. It tries to reduce the number of free impressions we might otherwise give.
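As a toy sketch of the targeting-filter stage, here is one way the matching could look; the targeting spec format and field names are hypothetical.

```python
# Toy targeting filter: drop candidates whose targeting spec doesn't match
# the request context. Spec format and field names are hypothetical.
def matches_targeting(spec: dict, context: dict) -> bool:
    """Every constraint the advertiser set must be satisfied by the request."""
    for field, allowed_values in spec.items():
        if context.get(field) not in allowed_values:
            return False
    return True


def targeting_filter(candidates, context):
    return [c for c in candidates if matches_targeting(c["targeting"], context)]


ads = [
    {"ad_id": "a", "targeting": {"region": {"CA"}}},
    {"ad_id": "b", "targeting": {"region": {"NY"}, "gender": {"female"}}},
]
request_context = {"region": "CA", "gender": "female"}
print(targeting_filter(ads, request_context))  # only ad "a" survives
```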
The next step is pacing. Pacing means: let's say an advertiser gives us $100 for the day; we don't want to spend that $100 in the first hour, because if we spend everything in the first hour, we might not get the best value. Typically, advertising platforms try to spend it in line with how users visit the platform. Depending on how users visit the platform, there is a notion that we want to have spent maybe $10 by the first hour. Based on this, you can compute whether this ad is spending more than it is supposed to. If it is spending more, you can remove it from the auction so that its spend slows down. That's where pacing comes in. Similarly, to ensure there is diversity of ads, there is deduping, which limits the number of ads an advertiser can contribute, because we don't want to overwhelm the user with all the ads from a particular advertiser. Then we keep the top K candidates that are passed to the next stage. Finally, since this is a scatter-gather approach, there are different retrieval sources. How do you blend those retrieval sources together? How do you make sure you send a consistent set to the next ranking stage? These are some of the design choices that happen here.
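A minimal sketch of the pacing check described above, assuming a simple spend curve; the uniform traffic curve and the slack factor are illustrative assumptions.

```python
# Budget pacing sketch: compare an ad group's actual spend to a target spend
# curve and hold it out of the auction when it is ahead of plan.
def target_spend(daily_budget: float, hours_elapsed: int, traffic_curve=None) -> float:
    """Cumulative spend we would like to have reached by now."""
    curve = traffic_curve or [1 / 24] * 24  # fraction of daily traffic per hour
    return daily_budget * sum(curve[:hours_elapsed])


def should_enter_auction(actual_spend: float, daily_budget: float,
                         hours_elapsed: int, slack: float = 1.1) -> bool:
    """Skip the auction if the ad group is spending faster than its plan."""
    return actual_spend <= slack * target_spend(daily_budget, hours_elapsed)


# A $100/day budget, one hour into the day, uniform traffic assumed:
print(target_spend(100, 1))               # ~4.17 dollars targeted so far
print(should_enter_auction(3.0, 100, 1))  # True: under plan, keep serving
print(should_enter_auction(20.0, 100, 1)) # False: ahead of plan, throttle
```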
This is the training flow for a typical machine learning model. On one side, we have the serving logs of what the features were for a particular request. On the other side, we have the engagement logs of how users interacted. These form the backbone of all the machine learning models. We join them together based on unique identifiers: these are the features, these are your labels, and you can train models on top of it. The joined data is then passed into a feature stats validation step, to make sure the features of this dataset are consistent and there are no issues in the dataset in general. I'll go into some of the validations, because this is done once, but then every model that's trained has another inner working loop. Every model owner can define how they want to sample, what datasets they need from this, and what their positive and negative labels are. Those are the kinds of logic every model definition can have. Then this passes through model training, offline evaluation, feature importance, and finally the model is pushed to a model store when it's ready for serving.
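As an illustration of joining the feature log with the engagement log to build training data, here is a small pandas sketch; the join key and column names are assumptions.

```python
# Join logged features with engagement labels to build training data.
import pandas as pd

# Features logged at serving time, keyed by a unique request/insertion ID.
feature_log = pd.DataFrame({
    "insertion_id": ["r1", "r2", "r3"],
    "user_country": ["US", "US", "DE"],
    "p_click_at_serve": [0.02, 0.05, 0.01],
})

# Engagement events logged after the fact.
engagement_log = pd.DataFrame({
    "insertion_id": ["r2"],
    "event_type": ["click"],
})

# Left join: insertions with no engagement become negative examples.
training_data = feature_log.merge(engagement_log, on="insertion_id", how="left")
training_data["label"] = (training_data["event_type"] == "click").astype(int)

print(training_data[["insertion_id", "user_country", "label"]])
```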
In terms of retrieval, this is what the funnel looks like. The ads index is the whole pool of ads you have. Retrieval acts on top of this ads index, passing candidates to ranking and auction. In terms of collecting data to train these models, with user feedback you only have data for what was shown to the user. That's what the insertion log at the top is. You have data about the clicks and the engagement the user is doing. For training these models, you don't have labels anywhere else in this pipeline. There are ways to create that data for retrieval: not only looking at labels the user provides, you can also look at what data is sent to ranking and train on top of it. Similarly, there are different kinds of flows in the system. You don't want to show ads with very low pCTR, ads with a low click-through probability. Those are trimmed, so you can also use them as a sampled dataset. One thing to note: as you go down the funnel, the size of the dataset increases, which makes it harder to keep all of those logs. These are typically sampled.
Let me go into some of the motivation for the recent advancements in this field. Traditionally, different business logic had different kinds of candidate retrievers. It could be as simple as matching the word, or matching synonyms of the word, matching based on the textual description, matching based on the image, and so on. As the system grows more complex, this becomes harder to maintain. If you have so many things in the system, one thing can break, and that makes it harder to iterate. Going forward, in 2016 YouTube published a very seminal paper in this direction, which changed the way these retrieval systems work. Just to give an idea, this is something the industry calls Two-Tower Deep Neural Networks. Ignore the model architecture in between; I'll go into some detail of what it's trying to do. The idea is: can we learn a latent representation of the user based on their features, and a latent representation of the pin based on its features? These representations are kept very separate from each other. We are not learning any interactions between them; both representations, and their features, are kept separate. In the end, we train so that if the user has engaged with a particular pin, those representations should be very close to each other. That's the objective of this training. The benefit of shifting to this setup is that, for the ad features, if you want to create these predictions at serving time, the ads tower doesn't change much, because it does not depend on the context features.
What we can do is cache all of these ad embeddings from the model server, index them, and then use that index at serving time. At serving time, we only need to score the user tower once. Then the index we have already created can be used to retrieve those ads. That's the power that this Two-Tower DNN architecture brings. One thing to note: deep neural networks are universal function approximators. In theory, anything you have seen in the interaction history, these are powerful functions that can learn to predict the future. That's why most of the industry is now moving to DNNs and transformer-based solutions, because they are very good at capturing past engagements and using them to predict future engagement. In terms of deployment, we have the ads database, and that ads database feeds an index building service. To build this index, the ads tower we showed before is run through a model server over the ads in the database, and an index is created on top of the embeddings. All of this is done offline. Once you have this ads index, at serving time the retrieval server only needs to call the user part of the model. You infer the user embedding and use approximate nearest neighbor search algorithms like HNSW or FAISS.
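Here is a compact PyTorch sketch of the two-tower idea, with in-batch negatives for training and a brute-force stand-in for the approximate nearest neighbor lookup. The tower sizes, features, and loss are illustrative assumptions, not Pinterest's actual architecture.

```python
# Compact two-tower retrieval sketch in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Tower(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim)
        )

    def forward(self, x):
        # L2-normalize so that dot product == cosine similarity.
        return F.normalize(self.net(x), dim=-1)


user_tower = Tower(in_dim=32)  # user/context features only
ad_tower = Tower(in_dim=48)    # ad features only; no user-ad feature crosses

# Training: engaged (user, ad) pairs should end up close; in-batch negatives.
user_feats = torch.randn(256, 32)
ad_feats = torch.randn(256, 48)
logits = user_tower(user_feats) @ ad_tower(ad_feats).T     # [256, 256]
loss = F.cross_entropy(logits, torch.arange(256))          # diagonal = positives
loss.backward()

# Serving: ad embeddings are precomputed offline and indexed (e.g. HNSW/FAISS);
# online we run the user tower once and do a nearest-neighbor lookup.
with torch.no_grad():
    ad_index = ad_tower(torch.randn(10_000, 48))            # offline index
    query = user_tower(torch.randn(1, 32))                   # one online call
    top_ads = torch.topk(query @ ad_index.T, k=100).indices  # ANN in practice
```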
The Ranking Model
I think this is where the ads delivery funnel was and what we've discussed so far. We discussed the retrieval system; now let's go into the ranking model and what goes on behind it. Back in 2014, the models here were simplistic, like linear regression or logistic regression, which are very linear, simple models. The machine learning community, at least in this area, was using very simple models at this point. The next step in this evolution, to make the models more expressive, was moving from simplistic solutions to GBDT plus logistic regression solutions. This comes from a seminal Facebook paper. The idea is, if you want to learn certain kinds of interactions, there are four types of features a model can have: a user feature, a content feature, an interaction between the user and the content in the history, and what's happening at impression time. There are mainly these four types of features, and you want to learn some nonlinear interactions, and GBDTs are good at that. In this scenario, you learn these nonlinear interactions as part of the machine learning system, and then learn a logistic regression on top, which is a linear model. I'll skip some of the details around the hashing tricks here. The idea is that it learns these different combinations and interactions between users and content.
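A small scikit-learn sketch of the GBDT-plus-logistic-regression recipe described here, on synthetic data: tree leaf indices become one-hot features for a linear model. The hyperparameters and data are illustrative assumptions.

```python
# GBDT + logistic regression sketch (leaf indices as categorical features).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))              # user/content/context features
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # label with a nonlinear interaction

# Stage 1: the GBDT learns nonlinear feature interactions.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)

# Each sample maps to one leaf per tree; treat leaf IDs as categorical features.
leaves = gbdt.apply(X)[:, :, 0]              # shape: (n_samples, n_trees)
encoder = OneHotEncoder(handle_unknown="ignore")
leaf_features = encoder.fit_transform(leaves)

# Stage 2: a linear (logistic regression) model on top of the leaf encodings.
lr = LogisticRegression(max_iter=1000).fit(leaf_features, y)

x_new = rng.normal(size=(1, 20))
leaf_new = encoder.transform(gbdt.apply(x_new)[:, :, 0])
print("pCTR estimate:", lr.predict_proba(leaf_new)[0, 1])
```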
At this point, we had about 60-plus models in the system. These models were growing, the product was growing, and it became complex to maintain all of them. Some of the decisions we took going forward were to make sure we could grow this sustainably. The other issue with ads is that new ad groups keep being created all the time, and some ad groups keep dying. Ads might only have a one- or two-month window when they're alive. You want these models to be responsive, so they can be trained incrementally on the new data distributions coming in. GBDT models are static; there is no way we can train them incrementally. DNNs, on the other hand, can be trained incrementally. Once we had 60-plus models, it became much harder to maintain all of them, leading to very long cycles in feature adoption and deletion, which leads to suboptimal systems. Simplifying the system so that you can iterate faster leads to many more wins in the future. Also, around this time, our machine learning serving was not ready. We were training in one framework, like XGBoost, then translating that into TensorFlow, then translating that into C++, our serving language. These kinds of hops in the system lead to further suboptimality and longer cycles to develop new features.
This was the GBDT framework, and how the models used nonlinear transformations. As the product grew, the next set of advancements was replacing the GBDT with more neural network-based approaches. These are different machine learning techniques; the benefits that DNNs bring are much larger, but these are more complex models. I'll go into some details later on how to handle that complexity. DNNs are universal function approximators: as I discussed before, you can approximate any function. That's where ChatGPT and the GenAI use cases come from: the model is able to learn from the data and approximate whatever you're trying to predict. Because of all these advantages that DNNs bring, the next logical improvement was moving to DNN architectures. This is very similar: whatever combinations we had in the GBDT, this is just a different algorithm for learning those interactions as part of the model architecture. You replace the GBDT structure with a different machine learning algorithm, the DNN.
If you look between the red boxes, this is the relative improvement we see. One thing I want to point out here is that even though DNNs are powerful and you might think they would give you more gains, it takes time for the systems to show incremental improvements. Because you're changing one stack to another, you might miss a few things here and there. Because of that, you might not get the full value as you bring in new changes. Between 2018 and 2020, Pinterest spent a considerable amount of time improving the architecture. This is about a 2-year window. Because we wanted to architect new solutions we hadn't built before, it took additional time. Because of how our systems are placed, we also had legacy systems moving to newer systems, and making sure that production was not impacted was a big challenge.
In 2020 came the deep multitask neural network models. One of the things that changed here is that the previous, traditional machine learning algorithms relied more on feature engineering by hand. Engineers would define which two features might be related, cross them together or learn a feature combination by hand, and then pass it to the models. The models inherently couldn't learn features by themselves. In these architectures, you can specify your raw features and what they mean. You can specify the user's age, you can specify the pin's age, meaning you don't need to create features by hand, and the model can learn these feature interactions. Another benefit of these models is that you can now multitask across different objectives. Part of the network you see here, the latent cross layers or the fully connected layers, holds the bulk of the model weights. You can share these weights across different objectives like clicks, repins, or whatever closeups you have on the platform.
This is the interface of how this model architecture looks. There are many different layers; Pinterest also has a blog on it if you're interested. This is very typical of every recommendation model in the industry. The first is the representation layer, which is like an interface for how you want to define your features and how the model can understand those features. For deep neural networks, feature processing is very important. If the feature scale differs across features, the model can break, which was not the case with traditional machine learning algorithms. So this layer has different logic for squashing or clipping values, or doing some normalization of your features. Then, if you have two features you think are related to each other, you can put them together, summarize them together, and learn a common embedding. Each of these models learns embeddings, which is a latent space: given your input, can you learn floating point representations on top of it? Those mathematical computations power all of these models. This layer also learns multiplicative layer crossings.
The idea is that you can also multitask different objectives here. If you can multitask across different objectives, it means you don't need to train different models for different objectives. The blue dots you see here are different objectives. Each can have its own optimization function. Between these blue dots, part of the network is shared, so now you can learn across tasks. That's where transfer learning comes in. This saves infrastructure cost considerably, because the same model can now make multiple different predictions. There are different ways that feature crossing can be done, depending on the serving infrastructure, as simple as CPU or GPU. If you are on GPU serving, you can have much more complicated mathematical operations. If you're not, there are also simpler versions of feature crossing.
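A sketch of the shared-bottom multitask idea in PyTorch: one shared representation feeding several objective heads. The objectives, layer sizes, and loss weighting are illustrative assumptions.

```python
# Shared-bottom multitask ranking model sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskRanker(nn.Module):
    def __init__(self, in_dim: int, tasks=("click", "repin", "hide")):
        super().__init__()
        # Shared layers hold the bulk of the weights and enable transfer
        # learning across objectives.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, 128), nn.ReLU()
        )
        # One small head per objective.
        self.heads = nn.ModuleDict({t: nn.Linear(128, 1) for t in tasks})

    def forward(self, x):
        h = self.shared(x)
        return {t: torch.sigmoid(head(h)).squeeze(-1) for t, head in self.heads.items()}


model = MultiTaskRanker(in_dim=64)
features = torch.randn(32, 64)
labels = {t: torch.randint(0, 2, (32,)).float() for t in ("click", "repin", "hide")}

preds = model(features)
# Each objective contributes its own loss; here they are simply summed.
loss = sum(F.binary_cross_entropy(preds[t], labels[t]) for t in preds)
loss.backward()
print({t: p.shape for t, p in preds.items()})
```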
Up until 2020, Pinterest invested in improving the architectures. The next set of iterations asked: can we use these architectures to utilize features we were not able to use before? That's where utilizing the sequence of activities the user performs on the platform comes into the picture. The first iteration here is an attention-based sequence model. Let me go into some detail about how a user interacts on the platform. Assume a user has interacted with multiple pins or images on the platform, which could be food-related pins, home decor, or travel-related pins. The idea is: can we use this representation of what the user has been doing to predict what the user might do next? The first thing that comes in is: can you understand the content on the platform well? That's where embeddings come in. From a textual representation, from your content-based representation, and from how users interact with different pins, we try to learn a very rich representation through metric learning approaches. There's also a blog listed here if you're interested. The idea is: can we represent the content on the platform, and how people interact with it, in a way that gives a rich representation? It means that if two images have the same kind of characteristics, their representations should be very similar to each other. Once we get to that point, those embeddings can fuel many more advancements.
Phase 1: Attention Model
If you look at the first use case: let's say you have the different embeddings for the journey the user has taken. This is a concept from the transformer and NLP domains, the attention model. The idea here is that not all of the pins in this history are useful for predicting engagement with this food-related pin. Can you define which pins are more useful and pay more attention to them? It means that when you have this food-related pin, the representation is more related to the first two pins and not the later ones. To keep this low latency, since these computations are much more complex, you cannot easily do it for long sequences. The sequence length here is limited to something like 8 or 10 events. One thing to note: if you want to make this responsive, we also need to make sure our feature logging service can produce these sequences in real time. That's where the importance of streaming datasets and streaming pipelines comes into the picture. Also, in terms of feature computation, at least up to this point, Pinterest was still on CPU serving, so we couldn't expand the sequence beyond what you see here. At this point, Pinterest's system was also training in one language and serving in a different one, which creates bottlenecks for improvements, because you need to translate all of this logic into your serving language. That's when native serving came into the picture, to fuel the next set of advancements. If you look at 2018 to 2020, most of the steps were building the backbone to support further advancements, and that's the bulk of the time spent on the platform. Once we had a good foundation, it sped up our future iterations.
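A minimal sketch of candidate-conditioned attention pooling over a short activity sequence, along the lines described here; the dimensions and scoring are illustrative assumptions.

```python
# Attention pooling over a short user activity sequence: the candidate pin's
# embedding attends over the user's recent pin embeddings, so more related
# history items get more weight.
import torch
import torch.nn.functional as F

emb_dim, seq_len = 64, 8                  # short sequence (e.g. 8-10 events)
history = torch.randn(seq_len, emb_dim)   # user's recent pin embeddings
candidate = torch.randn(emb_dim)          # embedding of the pin being scored

# Attention weights: similarity between the candidate and each history item.
scores = history @ candidate / emb_dim ** 0.5
weights = F.softmax(scores, dim=0)        # sums to 1 over the sequence

# Weighted summary of the history, conditioned on the candidate pin; this
# vector is then fed into the ranking model alongside other features.
user_summary = weights @ history
print(weights, user_summary.shape)
```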
Phase 2: Transformer
The next set of improvements utilizes a transformer over the sequence. This is very famous in the NLP domain these days, with GenAI and ChatGPT-style training. Transformers are at the pareto-optimal frontier in general; they can encode much more powerful information and feature interactions. Since sequences were important: can you increase the sequence length? As you start to increase the sequence length, you are bounded by serving capacity. To increase the sequence length to 100 events, you cannot have the complex features we used before. At this point we moved to very simple features, like: what was the action, did they click on it or not? Very simple features, but a longer sequence gives the model more capacity. Similarly, you can expand the sequence beyond that, to something like Pinterest's PinnerFormer. This is a user understanding of the pins on the platform and how the user interacts with them. If you want a longer-term sequence, the way the industry does it is with two components. One component is engagements from, say, yesterday back to a year ago: all of these engagements are encoded offline to learn an embedding, which you can use as a feature in the model. Then part of the sequence is real-time, coming from recent engagements. The combination of the two can learn what the user is doing on the platform. Utilizing these sequences, taking inspiration from the NLP domain, is what powers most of these recommendation systems. These are some of the iterations that happened, and how the system evolved from very simple models to more complex models recently.
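A sketch of encoding a longer, feature-light action sequence with a small transformer encoder and combining it with an offline long-term embedding; the sequence length, vocabulary, sizes, and pooling are illustrative assumptions.

```python
# Transformer encoding of a longer user action sequence (simple action IDs
# plus item embeddings), using PyTorch built-ins.
import torch
import torch.nn as nn

seq_len, emb_dim, n_actions = 100, 64, 8        # ~100 events, lightweight action feature

action_emb = nn.Embedding(n_actions, emb_dim)   # e.g. click / save / hide / view
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=emb_dim, nhead=4, batch_first=True),
    num_layers=2,
)

# Real-time part of the sequence: recent item embeddings + action type.
item_embs = torch.randn(1, seq_len, emb_dim)           # precomputed pin embeddings
actions = torch.randint(0, n_actions, (1, seq_len))    # simple per-event feature
sequence = item_embs + action_emb(actions)

encoded = encoder(sequence)                  # [1, seq_len, emb_dim]
realtime_summary = encoded.mean(dim=1)       # pooled real-time user representation

# Long-term behavior is typically compressed offline into a single embedding
# (a PinnerFormer-style feature) and concatenated with the real-time summary.
longterm_embedding = torch.randn(1, emb_dim)
user_representation = torch.cat([realtime_summary, longterm_embedding], dim=-1)
print(user_representation.shape)             # [1, 128]
```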
This is just the model architecture. If you look at a recommendation system, bringing this model architecture to production means that machine learning is a very small piece of it. To bring it to production: how do you make sure you can iterate faster? How can you make sure your serving infrastructure can support it? What kind of process management do you have, and how do you manage your resources? How do you store your data, move your data, verify your data? Having all those kinds of checks is very important. Around this time, Pinterest also spent a considerable amount of time building all these blocks, so that we can have the best experience.
How ML is Done at Pinterest
Going into how machine learning is done at Pinterest: before this, every team used to have many different pipelines, and everyone was reinventing the same wheel. We saw that this slows things down. So we stepped back and built an interface that is generalizable across different stacks of the company: ads could be one, trust and safety another, and there are also home feed and the organic pins. How can we do it in a more scalable manner? That's where most of the iterations happened in the last year. We use Docker images and the traditional CI/CD services. The part where the user writes code is a very small, custom piece of the ML project, while the integrations between the different schedulers, TensorFlow, MLflow, TensorBoard, and training schedulers are done seamlessly through API-based solutions. All of these components are shared across different teams, which can then iterate much faster.
Going into how the model training and evaluation pipelines are set up: in most of these scenarios, you can assume the models are trained incrementally. As you train incrementally, you create these datasets. Assume you're at date x: the workflow training at date x takes the dataset for that day and the model checkpoint trained before that date, runs a training job, and creates a new model in the ecosystem. Similarly, there is an evaluation pipeline where you take the model that was trained up to yesterday, infer on the next date, and see how good that model was; the right side here would be date x plus one. You can keep iterating this forward every day. The granularity here is a day, but it could be as small as an hour.
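A schematic of that daily incremental train-and-evaluate loop; the dataset loading, training call, and checkpoint handling below are placeholders, not Pinterest's workflow code.

```python
# Schematic daily incremental train/evaluate loop.
from datetime import date, timedelta


def load_dataset(day: date):
    """Placeholder: fetch the joined feature/label data for one day."""
    return f"dataset({day.isoformat()})"


def train_incrementally(checkpoint, dataset):
    """Placeholder: warm-start from the previous checkpoint, train on new data."""
    return f"model[{checkpoint} + {dataset}]"


def evaluate(model, dataset):
    """Placeholder: score the model on data it has not seen (the next day)."""
    print(f"evaluating {model} on {dataset}")


checkpoint = "model[base]"
day = date(2023, 1, 1)
for _ in range(3):                                               # one iteration per day
    checkpoint = train_incrementally(checkpoint, load_dataset(day))
    evaluate(checkpoint, load_dataset(day + timedelta(days=1)))  # day x+1 holdout
    day += timedelta(days=1)
```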
In terms of model deployment, Pinterest follows a standard model deployment flow using MLflow, which is an open source solution. As these models move into the production pipeline, we need to make sure the models are versioned: we know what is being pushed out, and every model gets a new version as it's incrementally trained, so that if we want to roll back, we can do it pretty easily. It also makes sure we can track how the training is happening. We also want to make sure the models are reproducible. MLflow gives us a UI where we can see the parameters that went in, so if you want to retrain and reevaluate the training process, that's easy to do. Similarly, we need to make sure deployment is fast. Rollback is through a UI-based solution, so that we reduce the steps between model training and deployment to the model servers.
System Monitoring
The other thing is, in terms of systems, how do you develop these systems, and what testing do they have? Whenever anyone writes a piece of code, since this is running on the production system, and Pinterest has very high scale, a lot of revenue, and a lot of users, how do you test it in a reliable way? The first step is integration testing. Whenever someone writes code, are there ways to test it against the production environment through shadow traffic, to see what the intended changes would be? Can you capture those metrics manually, or also automatically? Capturing them automatically ensures nothing is missed during the testing process. In terms of debugging, we also have a debugging system where we can replay: if this is the change, this is the new model, you can hook it up with your backend system to see how a particular request would look when served by this model. The next step is how the code is released once it's merged. Pinterest follows the standard process of canary, staging, and production pipelines. Each of these stages monitors the real-time metrics the business cares about. If there is a deviation day-over-day, or a deviation from the production environment, the rollout is stopped and we have the ability to roll back in a seamless manner. Finally, in terms of the online service, irrespective of all this, bugs can be introduced into the system intentionally or unintentionally, and advertisers might change their behavior. There is also real-time monitoring capturing day-over-day and week-over-week patterns in the system along different dimensions, which could be revenue, insertion rates, and QPS. The system can autoscale based on how much QPS is coming in.
ML Workflow Validation and Monitoring
In terms of ML workflow validation and monitoring, beyond monitoring the system itself, ML workflows have a different set of monitoring patterns. The first step is looking at the training datasets, what's being fed into the model: defining coverage and alerting on top of it, looking at the features and how they change over time, and also ensuring that features are fresh. The next set of testing is offline model evaluation. Once you have a trained model, can you check whether this model will make the right predictions? That's around the metrics you can capture, model metrics like AUC and PR-AUC, but also the predictions themselves and whether there are spikes in them. That can stop the model validation process. It also needs to check what would happen in unseen circumstances. Something that happens frequently in these systems is that you have the right features for training, but at serving time some features are missing. What happens in those scenarios, can the model still make reasonable predictions? As the model moves toward production, there are also validation steps around canary, staging, and production, again, capturing prediction statistics like p50s and p99s. One thing to note here is that, from a model perspective, whatever we predict is used to define how much we charge the advertisers. Even if only a few insertions are predicted very high, say a prediction of 1 versus a typical rate of something like 0.01, that's a 100x charge on that particular insertion. That's something we want to prevent, so there are a lot of alerts on those types of metrics, to reduce spikes in the predictions.
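A toy sketch of the prediction-spike monitoring described here, comparing today's prediction quantiles against a trailing baseline; the quantiles, thresholds, and synthetic data are illustrative assumptions.

```python
# Prediction-spike monitoring sketch: alert when quantiles drift far from baseline.
import numpy as np


def prediction_stats(preds):
    return {"p50": float(np.percentile(preds, 50)),
            "p99": float(np.percentile(preds, 99))}


def check_for_spikes(today, baseline, max_ratio=3.0):
    """Flag quantiles far above baseline (e.g. a 100x pCTR spike would
    overcharge advertisers on those insertions)."""
    alerts = []
    for q, value in prediction_stats(today).items():
        if value > max_ratio * baseline[q]:
            alerts.append(f"{q} spiked: {value:.4f} vs baseline {baseline[q]:.4f}")
    return alerts


baseline = {"p50": 0.01, "p99": 0.08}
healthy = np.random.beta(1, 80, size=10_000)       # typical pCTR-like scores
spiky = np.concatenate([healthy, np.ones(200)])    # ~2% of predictions near 1.0
print(check_for_spikes(healthy, baseline))          # no alerts expected
print(check_for_spikes(spiky, baseline))            # p99 alert fires
```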
Delivery Debugging Tools
Then, in order to be able to debug the system, different tools have been developed. If you look at the different stages of the delivery funnel, they are serving, retrieval, budgeting, indexing, and the advertiser. Depending on how the ads perform, giving this to our sales operators lets us reduce the kinds of tickets that might arise. Depending on the different services, we can see where this particular ad is not being shown that often. If it's not shown on the serving side, it might be that the ad is very low quality, and that's why it's being removed from serving. Or the ad is not competitive in the auction: other advertisers are willing to pay more, and that's why it's not shown to the user. Having some of those suggestions come in automatically helps improve the process. Similarly, in retrieval, if the advertiser says, "I only want to show this ad to one particular user," that's a very tight targeting scenario, so that's why the ad might not be showing. Similarly with budgeting controls: there isn't enough budget to show more ads from this advertiser. Similarly with indexing: each of these layers has different use cases and different questions it answers.
How to Improve Model Performance
In terms of improving model performance, what are the different changes the industry is making? There are different feature-crossing algorithms the industry is moving to, and more auxiliary loss functions to define and learn better representations. That's where transformers come into the picture. In terms of feature addition: utilizing more sequential features through transformer and attention-based models, utilizing richer content-based embedding features, and also interactions between the user and the content in the past, things like, did this user interact with this advertiser a day ago or not? Those kinds of feature generation are important for improving model performance.
Serving Large Models
In terms of serving infrastructure, how can we make sure it has low latency? If you can optimize the model predictions, having low latency enables us to score more ads. Instead of 1000, now you can score maybe 3000 ads, and that can improve your precision. Similarly, you can move to GPU serving if the models are more complex. In terms of serving large models, this is the decision-making that typically happens. If there is a large model, can you serve it on a GPU? If you have the latency budget, you can serve it easily, but that's not the normal case. If that's not the case, there are optimization techniques that come in. Can you quantize your models? Fewer operations and less storage improve latency, but at this point you're trading precision for latency. You can also utilize techniques like distillation, which uses a big model to transfer its knowledge into a much smaller model. Normally, this discussion is about trading dollars for latency: how can we make sure that, within the latency budget we have, we get the best predictions? That's the tradeoff that happens behind these systems.
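As an illustration, here is a PyTorch sketch of the two techniques mentioned: post-training dynamic quantization, and distillation of a large teacher into a smaller student. The model sizes and losses are illustrative assumptions, not Pinterest's production setup.

```python
# Quantization and distillation sketches for serving large models.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 1))
student = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))

# 1) Quantization: int8 weights for Linear layers, trading some precision
#    for lower latency and memory.
quantized_teacher = torch.quantization.quantize_dynamic(
    teacher, {nn.Linear}, dtype=torch.qint8
)

# 2) Distillation: the student learns to mimic the teacher's predictions,
#    in addition to the usual hard-label loss.
x = torch.randn(128, 256)
labels = torch.randint(0, 2, (128, 1)).float()
with torch.no_grad():
    teacher_probs = torch.sigmoid(teacher(x))

student_logits = student(x)
loss = (
    F.binary_cross_entropy_with_logits(student_logits, labels)           # hard labels
    + F.binary_cross_entropy_with_logits(student_logits, teacher_probs)  # soft targets
)
loss.backward()
print(quantized_teacher)
```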