Wilson Lower bound Score and Bayesian Approximation for K star scale rating to Rate products

As a maintainer of an online community, which is having a lot of products where user gives a rating to products based on their experience, then it is definite that at some point you have to find an answer to questions like

How you are going to show the product on the page based on filters i.e. like highest voted or lowest voted, etc.? Or
How can you rate a product based on upvotes and downvotes?
How you can give a score to a product which is rated on a K scale by users?

There are some ways you can find a score and rate products accordingly:

Score = Average rating of products
Score = Positive rating - Negative rating
Score = Proportion of Positive ratings

Evan Miller’s famous blog How not to sort, explains why the above two scores are not good ways to rate the product or sort a product.

Lower bound of Wilson score confidence interval for a Bernoulli parameter provides a way to sort a product based on positive and negative ratings.

The idea here is to treat the existing set of user ratings as a statistical sampling of a hypothetical set of user ratings from all users and then use this score. In other words, what user community would think about upvoting a product with 95% confidence given that we have an existing rating for this product with a sample (subset from the whole community) user ratings.

Therefore if we know what a sample population thinks i.e. user reviews for a product, you can use this to estimate the preferences of the whole community.

If there are X positive votes and Y negative votes for a product and we want to understand how popular the product will be across the whole community. We can estimate that with 95% confidence between wilson_lower_bound_score and wilson_upper_bound_score% of users will upvote this product using Wilson Score of confidence interval.

Wilson Score

$\left(\hat{p}+\frac{z_{\alpha / 2}^{2}}{2 n} \pm z_{\alpha / 2} \sqrt{\left[\hat{p}(1-\hat{p})+z_{\alpha / 2}^{2} / 4 n\right] / n}\right) /\left(1+z_{\alpha / 2}^{2} / n\right)$

where, $\hat{p}$=(# of positive ratings)/(Total ratings)
$n$ = Total ratings
$z_{α/2}$= quantile of the standard normal distribution

import math
import scipy.stats as st

def wilson_lower_bound(pos, n, confidence=0.95):
    """
    Function to provide lower bound of wilson score
    :param pos: No of positive ratings
    :param n: Total number of ratings
    :param confidence: Confidence interval, by default is 95 %
    :return: Wilson Lower bound score
    """
    if n == 0:
        return 0
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    phat = 1.0 * pos / n
    return (phat + z * z / (2 * n) - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) / (1 + z * z / n)

Wilson Confidence Interval considers binomial distribution for score calculation i.e. it considers only positive and negative ratings. If your product is rated on 5 scale rating, then we can convert ratings {1-3} into negative and {4,5} to positive rating and can calculate wilson score.

Lets look at some examples:

If a product is rated across each category uniformly [10, 10, 10, 10, 10], i.e. 10 votes for rating {1-5}, then wilson_lower_bound(20,50,.95), avg_rating([10, 10, 10, 10, 10]) => (0.2760838973025655, 3.0)
A product receives only one rating i.e. positive and one product receives 10 positive and 2 negative ratings: in that case value of product having more ratings should be greater wilson_lower_bound(1,1,0.95) < wilson_lower_bound(10,12,0.95) , which is true.
Product having ratings A: (209 up and 50 down votes) and B: (118 up and 25 down) wilson_lower_bound(209,259,0.95) < wilson_lower_bound(118,143,0.95)
Suppose one product receives [5, 10, 20, 0, 0] ratings, then wilson_lower_bound(0,35,0.95) = 0, If any product does not have any positive ratings associated with it then the Wilson score is zero.
Wilson Score can not be applied to new product which is yet to receive any rating, if using above implementation wilson_lower_bound(0,0,0.95) = 0.

Wilson score gives us the zero value for both the product which does not received any positive user rating and to product which is new and yet to receive any rating, which essentially does not make any sense as this implies no user rated product is same as product having lower ratings. Also, it is not clear how tight the lower bound is i.e., how far it deviates away from the “real” proportion of thumb-ups [1]. It does not seem intuitive to convert items rated on five star scale to convert to up votes and down votes for calculating scores to follow binomial distribution.

Bayesian Approximation

Bayesian Approximation provides a way to give a score to product when they are rated on star scale.

$\large \left(\hat{p}+\frac{z_{\alpha&space;/&space;2}^{2}}{2&space;n}&space;\pm&space;z_{\alpha&space;/&space;2}&space;\sqrt{\left[\hat{p}(1-\hat{p})+z_{\alpha&space;/&space;2}^{2}&space;/&space;4&space;n\right]&space;/&space;n}\right)&space;/\left(1+z_{\alpha&space;/&space;2}^{2}&space;/&space;n\right)$

where, $s_k=k$ (That is, 1 point, 2 points, ….)
$N$ = total ratings, with $n_k$ ratings for $k^{th}$ scale

The above expression provides the lower bound of a normal approximation to a Bayesian credible interval for the average rating. For more mathematical details please check [4].

import math
import scipy.stats as st

def bayesian_rating_products(n, confidence=0.95):
    """
    Function to calculate wilson score for N star rating system. 
    :param n: Array having count of star ratings where ith index represent the votes for that category i.e. [3, 5, 6, 7, 10]
    here, there are 3 votes for 1-star rating, similarly 5 votes for 2-star rating. 
    :param confidence: Confidence interval
    :return: Score
    """
    if sum(n)==0:
        return 0
    K = len(n)
    z = st.norm.ppf(1 - (1 - confidence) / 2)
    N = sum(n)
    first_part = 0.0
    second_part = 0.0
    for k, n_k in enumerate(n):
        first_part += (k+1)*(n[k]+1)/(N+K)
        second_part += (k+1)*(k+1)*(n[k]+1)/(N+K)
    score = first_part - z * math.sqrt((second_part - first_part*first_part)/(N+K+1))
    return score

Lets look at some test samples

bayesian_rating_products([0, 0, 0, 0, 1]) = 2.2290
bayesian_rating_products([0, 2, 0, 10, 0]) = 2.9921
bayesian_rating_products([5, 10, 20, 0, 0]) = 2.2349
bayesian_rating_products([10, 10, 10, 10, 10]) = 2.6296

On comparing 1 and 2, we can observe the second product should have a higher score and that is the case here. Also in point 3, unlike Wilson score it provides a score to the product which does not have a positive rating and still that score is greater than the first product which seems reasonable to me.

Bayesian Approximation does not consider only upvotes unlike Wilson’s score but considers ratings across the K scale and proves to be better in this scenario.

Please let me know if you like the post, or have some suggestions/concerns in comments and feel free to reach out to me on LinkedIn.

References

How to Count Thumb-Ups and Thumb-Downs: User-Rating based Ranking of Items from an Axiomatic Perspective
Star Quality: Aggregating Reviews to Rank Products and Merchants
http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
http://www.evanmiller.org/ranking-items-with-star-ratings.html

Machine learning Code

Wilson Score

Bayesian Approximation

References

Comments

Related Posts

Ranking the results of ML models with a time decay factor for large-scale Anomaly detection 03 Apr 2023

Monitoring Machine Learning Models in Production 02 Mar 2021

Maximal Marginal Relevance to Re-rank results in KeyPhrase Extraction 13 Oct 2019