Spotify has recently published a tech blog post detailing its internal machine learning process. One of the main challenges for any online business is extracting actionable insight from its data to support decision-making. Spotify shares its methodology and experience in tackling this problem by clustering diverse data sets with a method that combines dimensionality reduction, recursion, and supervised machine learning.
According to the post, the approach yields strong results and better explainability. It helps user researchers and data scientists deepen their understanding, refine their solutions, and iterate more efficiently toward a final solution. The method also includes an explainability layer, which makes it easier to validate findings and communicate them to stakeholders. The following diagram shows this high-level method.
In the blog post, the author proposes a method with four simple steps:
- Make the data manageable
- Cluster it
- Understand it (and predict it)
- Communicate it
The first step in this process is to find a way to visualize the data so that it is easier to manage. The main challenge is that, in practice, engineers have to handle high-dimensional data. One practical approach is to use a dimensionality reduction technique such as Principal Component Analysis (PCA). The drawback of PCA is that, in many cases, not all of the information can be represented in two dimensions. The author suggests using a state-of-the-art technique, Uniform Manifold Approximation and Projection (UMAP), instead of PCA.
The main difference between PCA and UMAP is that UMAP is a non-linear projection method that aims to preserve both the local and global similarity of points in the lower-dimensional space, whereas PCA is linear. As a result, UMAP can capture non-linear relationships in the data. For example, the author shows the difference in results on the MNIST (Modified National Institute of Standards and Technology) dataset, in which each image of a handwritten digit from 0 to 9 is represented by 784 dimensions (28 x 28 pixels). The following figures show the differences.
Once the data has been visualized and we have an initial sense of its structure, the next step is to create meaningful clusters. According to the article, the clustering should have the following properties for explainability:
- A point belongs to a cluster if the cluster exists
- If you need parameters for your clustering, make them intuitive
- Clusters should be stable, even when changing the order of the data or the starting conditions
Numerous clustering algorithms exist, such as K-Means and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). HDBSCAN combines hierarchical clustering with the density-based approach of DBSCAN to yield more robust and meaningful clusters. According to the post, extensive experimentation at Spotify has shown that HDBSCAN consistently produces more meaningful and stable results.
To understand cluster behavior in more depth, the clustering step is applied recursively, re-clustering the points inside each cluster to expose finer structure. Once a sufficient number of clusters has been established, supervised techniques, notably classification, can be applied. An established classifier such as XGBoost can be trained as a one-versus-all model for each cluster.
Moreover, integrating SHAP (SHapley Additive exPlanations) improves interpretability by revealing the primary drivers of each cluster. Combining HDBSCAN for the initial clustering, XGBoost for the subsequent classification, and SHAP for explainability yields a comprehensive methodology for understanding the behavior of diverse clusters.
In the final stage, the findings are communicated to the data science group and other stakeholders, and the process is iterated on, if needed, to reach a final solution.
A similar methodology has also been used successfully in other disciplines, like anomaly detection in health data.
Many machine learning engineers found this work exciting. As one of them commented on the LinkedIn post about this work:
Umap and Shap are real game changers and foundational elements to advanced analytics workflows