Can the Product Hierarchy Help your Machine Learning Models? Take a look at your Taxonomy Dendrogram




In general, data governance and, more specifically, data quality is the cornerstone of predictive modeling. I will not go into what data governance is or how you measure data quality, nor describe the ideal partnership between a Chief Data Officer and a Chief Analytics Officer (sometimes they may be the same person).


I will focus here on the relationship between product taxonomy and machine learning models.


There is a vast literature on taxonomy; I particularly liked an article I found on Medium, "The do's and don'ts of Product Taxonomy" from Scalia (no affiliation here, I just loved the article).

I agree with the article that "a category tree should have a hierarchy depth of 3 to 5 category levels and between 5 and 15 level-one categories."

Let's do some quick math: a five-level hierarchy with an average of 5 categories under each node gets you to over 3,000 categories to maintain at the lowest level alone (5^5 = 3,125). An average of 15 brings you to roughly 760,000 (15^5 = 759,375).
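To make the arithmetic easy to rerun with your own numbers, here is a minimal sketch. It assumes a perfectly uniform tree, where level 1 has a given number of categories and every category has that same number of children; real taxonomies are rarely this regular. The function name `category_counts` is made up for this illustration.

```python
def category_counts(branching: int, depth: int) -> tuple[int, int]:
    """Return (lowest_level_count, total_count) for a uniform category tree.

    Assumption: level 1 has `branching` categories and every category
    at every level has exactly `branching` children.
    """
    per_level = [branching ** level for level in range(1, depth + 1)]
    return per_level[-1], sum(per_level)


for b in (5, 15):
    lowest, total = category_counts(b, depth=5)
    print(f"{b} per node: {lowest:,} lowest-level categories, {total:,} in total")
```

Under these assumptions, 5 per node gives 3,125 lowest-level categories (3,905 counting every level), and 15 per node gives 759,375 (813,615 counting every level).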



It is somewhat intuitive that you may not be able to use the lower levels of your taxonomy as features for your machine learning models: the data are likely to be spread too thinly across thousands of categories, whether you run a forecasting model or a classification model. Finding similarities across thousands of categories is also hard, even if you run a cluster analysis as a follow-up experiment (e.g. to find which categories exhibit similar growth or similar cash flow over their product lifetime).
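One quick way to see how thinly the data are spread is to count records per category and check how many categories fall below a minimum sample size. The sketch below uses made-up records and an arbitrary threshold of 30; both are assumptions for illustration, as is the helper name `sparse_share`.

```python
from collections import Counter

# Hypothetical sales records, each tagged with its lowest-level category.
records = (
    ["Smartphones"] * 40
    + ["Feature phones"] * 3
    + ["Cookware"] * 2
    + ["Blenders"] * 1
)


def sparse_share(categories, min_records):
    """Share of distinct categories that have fewer than `min_records` rows."""
    counts = Counter(categories)
    sparse = [name for name, n in counts.items() if n < min_records]
    return len(sparse) / len(counts)


share = sparse_share(records, min_records=30)
print(f"{share:.0%} of categories have fewer than 30 records")
```

If most categories at a given level fall below whatever threshold your model needs, that level is probably too granular to use as a feature.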

You usually focus on Level 1 and Level 2. It is therefore critical to maintain a healthy ratio between the first and second levels of the hierarchy.
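A simple check on that ratio is to count the distinct Level 2 categories under each Level 1 category and compare the largest branch with the smallest. The sketch below is only illustrative: the product records and the helper `level2_per_level1` are hypothetical, and what counts as "healthy" is a judgment call for your own data.

```python
from collections import defaultdict

# Hypothetical product records: (level_1, level_2) category assignments.
products = [
    ("Electronics", "Phones"),
    ("Electronics", "Laptops"),
    ("Electronics", "Cameras"),
    ("Electronics", "Audio"),
    ("Electronics", "Wearables"),
    ("Electronics", "Accessories"),
    ("Home", "Kitchen"),
    ("Garden", "Tools"),
]


def level2_per_level1(records):
    """Count the distinct Level 2 categories under each Level 1 category."""
    children = defaultdict(set)
    for level_1, level_2 in records:
        children[level_1].add(level_2)
    return {level_1: len(kids) for level_1, kids in children.items()}


counts = level2_per_level1(products)
imbalance = max(counts.values()) / min(counts.values())
print(counts)
print(f"largest / smallest branch: {imbalance:.1f}")
```

In this toy data one Level 1 category carries six Level 2 categories while the other two carry one each, which is the kind of lopsided first split that makes Level 2 hard to use.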

So if your taxonomy dendrogram looks like this right off the bat (between the first and second levels of the hierarchy), you may have a hard time finding meaningful insights from your Product Category Level 2.


    

You may also face another situation: no consistent number of levels within the same category, which makes the analysis difficult in one of the branches.
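Inconsistent depth is easy to detect if each product carries its full category path. As a rough sketch, assuming paths are stored as tuples from Level 1 downward (the sample paths and both helper names are hypothetical), you can collect the depths seen under each Level 1 category and flag those with more than one:

```python
from collections import defaultdict

# Hypothetical taxonomy paths, one per product, from Level 1 downward.
paths = [
    ("Electronics", "Phones", "Smartphones"),
    ("Electronics", "Phones", "Feature phones"),
    ("Electronics", "Audio"),  # this branch stops at Level 2
    ("Home", "Kitchen", "Cookware"),
    ("Home", "Kitchen", "Appliances"),
]


def depths_by_level1(taxonomy_paths):
    """Map each Level 1 category to the set of path depths found under it."""
    depths = defaultdict(set)
    for path in taxonomy_paths:
        depths[path[0]].add(len(path))
    return dict(depths)


def inconsistent_branches(taxonomy_paths):
    """Return the Level 1 categories whose paths do not all share one depth."""
    by_root = depths_by_level1(taxonomy_paths)
    return sorted(root for root, seen in by_root.items() if len(seen) > 1)


print(inconsistent_branches(paths))
```

Here only "Electronics" is flagged, because one of its branches ends at Level 2 while the others go down to Level 3.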



What you need is consistently healthy ratios across all levels, to give yourself a chance of finding insights across Levels 1 and 2 of your hierarchy.



Last but not least, even if you manage to build a taxonomy with healthy and consistent levels (either in collaboration with your CDO or because you own Data Governance), don't take for granted that each product will have a taxonomy assigned. Unless you enforce it by process or through your master data system, you may have a lot of nulls or N/A values, which are the number one killer of any feature analysis on taxonomy because you'll have to discard those records.
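Before building any feature on the taxonomy, it is worth measuring how many records would be lost. The sketch below assumes product master records are dictionaries and that `None`, an empty string, and "N/A" all mean "no category assigned"; the field names, sample data, and `missing_rate` helper are hypothetical.

```python
# Values treated as "no taxonomy assigned" (an assumption; adjust to your data).
MISSING = {None, "", "N/A"}

# Hypothetical product master records.
products = [
    {"sku": "A1", "level_1": "Electronics", "level_2": "Phones"},
    {"sku": "A2", "level_1": "Electronics", "level_2": None},
    {"sku": "A3", "level_1": "N/A", "level_2": "N/A"},
    {"sku": "A4", "level_1": "Home", "level_2": "Kitchen"},
]


def missing_rate(records, field):
    """Share of records whose `field` is absent or holds a placeholder value."""
    missing = sum(1 for record in records if record.get(field) in MISSING)
    return missing / len(records)


for field in ("level_1", "level_2"):
    print(f"{field}: {missing_rate(products, field):.0%} missing")
```

In this toy data a quarter of the records have no Level 1 and half have no Level 2, so a Level 2 feature would cost you half of the sample.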



