Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
How Not To Sort By Average Rating (evanmiller.org)
141 points by ntoshev on March 25, 2010 | hide | past | favorite | 31 comments


The wrong solutions the article describes are certainly wrong. It doesn't really make much of a case that the particular solution it proposes is particularly good.

For instance, here's another much simpler formula that avoids the problems described in the article: score = (pos+1)/(pos+neg+2). It's the posterior mean of Pr(random person likes the product), if your prior is uniform on [0,1]. You can adjust this to favour items with more ratings by, e.g., using a prior that "prefers" smaller values of that probability; one easy way to do that is to pretend that every item starts with a certain number of positive reviews and a certain number of negative reviews; you end up with a formula of the form (pos+A)/(pos+neg+B).

Is the article's formula better than this? Probably, but the author hasn't said why.


If we're going to use a Bayesian treatment, the real sin is throwing away your uncertainty.

What I would love to see is a ranking algorithm that provides an elegant, intuitive interface to fuzzy ranking. If we can't be confident of the ordering, why pretend? It's a usability problem, not a statistics problem.

Your estimator is slightly more resilient than Amazon's and somewhat less conservative than the author's. It's definitely simpler, but who cares? Visitors never have to compute it after all.


I think it's a huge usability problem, if we're talking about something Amazon-like that's presenting ranked data to an unsophisticated audience. Why pretend? Because it produces something the user might be able to understand, that's why.

(Yes, it would be extremely cool to make an elegant, intuitive interface to a fuzzy ranking. One reason why it would be extremely cool is that it's totally not obvious how it can be done, or even that it can be done.)

I wasn't suggesting that my estimator is better because it's simpler, nor that anyone should care; just pointing out that the author of the article made a huge leap from "here are two things that don't work" to "and here is the specific arbitrary math-heavy thing I think you should do instead".


With Bayesian average (http://news.ycombinator.com/item?id=1219154 , http://wiki.answers.com/Q/What_does_true_Bayesian_estimate_m...) you can easily put a slider saying how much the missing votes count (or how much preference do you give to items with more ratings).


I agree. I think it's a great problem to motivate a startup on.


Agreed. Even if you find the formula scary, he implemented it in Ruby for you. It's a pretty short trip to implement it in the language of your choice.


This solution is called a "Laplacian correction", and I agree with your recommendation.


I use what reddit uses, works well

http://code.reddit.com/browser/r2/r2/lib/db/sorts.py

      s = score(ups, downs)
      order = log(max(abs(s), 1), 10)
      sign = 1 if s > 0 else -1 if s < 0 else 0
      seconds = epoch_seconds(date) - 1134028003
      return round(order + sign * seconds / 45000, 7)
Let me know if you want my attempt at a MySQL translation of that.

ps. why is my account ( _ck_ ) so slow when logged in and my comments marked as "dead" ? Was I flagged as a spammer by mistake? Example: http://news.ycombinator.com/item?id=1219475


ps. why is my account ( _ck_ ) so slow when logged in and my comments marked as "dead" ? Was I flagged as a spammer by mistake?

Looks like it. Looking at your posting history on _ck_, though, I don't see why you'd be auto-deaded. I suggest emailing pg about it.


That's not really appropriate for what this blog post was talking about, which is rating products. It doesn't make sense for products' ratings to be time sensitive. That would make the most recent items have an artificially inflated score.


But you want to control the weight of a rank for an aging product, no?

Let's say you are ranking cellphones that a store sells and someone has a fond memory of a cellphone they used a few years ago so they vote it up. But the phone itself was posted years ago, compared to the newest phones that came out this year. You don't want their vote to have as much weight because of the product age.


You could make an argument for using that for, say, deciding on which products should be featured on a page that's promoting products to consumers.

But for actual product "scores" that people see that's a very bad idea. A book shouldn't have a 4/5 if everyone rated it a 5/5 just because it came out 50 years ago.


It's very nice to see a mathematically motivated piece on how to sort based on users ratings. But there are two problems with the solution presented, the first minor, and the second more serious.

Minor: Distributions of ratings are not usually normal about the average possible rating -- they are usually skewed to the positive end, because people are pretty good at selecting what they read or view. The solution presented will tend to skew low-frequency ratings too far toward the average. It should be easy to calculate an expected average for your site, and use this instead of 1/2 for the formula given in the post.

Major: What you really need is a missing data point: the number of readers who haven't made any rating at all. It's pretty clear if you think of two examples:

Item 1 1,000,000 views 10 positive ratings 1 negative rating

Item 2 15 views 10 positive ratings 1 negative rating

Clearly, readers feel pretty blah about Item 1. You're much better off displaying Item 2, other things being equal, but the formula doesn't distinguish the two cases.


I agree, especially that the skew needs to be site dependent. For example the iPhone app store actually has the opposite problem where their rate on delete "feature", skews the average ratings way down. Reviews are heavily distributed along 1 star or 5 star ratings.

Would also be curious to see a comparison of different rating systems, for example i'd like to see what the distribution is on a site thumbs up/down rating site like digg vs a 5 star system.


One data point is the netflix rating set from their contest. The system is 1 to 5 stars. The average movie rating was 3.4 in 1998 climbing to 3.8 in 2005.

Back to the original point -- if you used a formula with a default rating of 3 it would underestimate low-frequency movies. People don't view movies at random. They try to pick things they will like, which should be reflected in the statistical model.


The article does actually cover your serious point.

> Indeed, it may be more useful in a "top rated" list to display those items with the highest number of positive ratings per page view, download, or purchase, rather than positive ratings per rating. Many people who find something mediocre will not bother to rate it at all; the act of viewing or purchasing something and declining to rate it contains useful information about that item's quality.


This is a very helpful blog post from about a year ago. We implemented something similar to a Wilson score to power the default sorting for the millions of reviews we have at RateItAll. In general, with a system like this, the items with the most reviews are going to rise to the top. This might not be what you want. You can easily see this effect with something like our "Events of 2010" list: http://www.rateitall.com/t-2987869-events-of-2010.aspx


>items with the most reviews are going to rise to the top

Yes and no. Anything with a larger N is going to make for greater confidence in the rating. But that rating still depends on the number of positive and negative reviews. So if a product has a large number of ratings, but they are 50/50 positive and negative, then it's not going to rise to the top simply because it has the most reviews. The algorithm would presumably be very confident about placing it in the middle of the pack.


>So if a product has a large number of ratings, but they are 50/50 positive and negative, then it's not going to rise to the top simply because it has the most reviews.

What I'm saying is that it will rise to the top because even though only 50% of the ratings are positive, it will have many many more positive ratings than the nearest next item. That 2nd item may have a much higher percentage of positive ratings, but a much lower number of total ratings.


It seems to work as described in the article:

    require 'rubygems'
    require 'statistics2'

    def ci_lower_bound(pos, n, power)
        if n == 0
            return 0
        end
        z = Statistics2.pnormaldist(1-power/2)
        phat = 1.0*pos/n
        (phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
    end

    puts ci_lower_bound(60, 100, 0.10)   # => 0.517809505446319
    puts ci_lower_bound(500, 1000, 0.10) # => 0.474027691168875
    puts ci_lower_bound(50, 100, 0.10)   # => 0.418847795168265
The top ranked result has 60 positives out of 100, and beats 500 out of 1000 - almost 10 times as many positive votes.


A couple of notes there: first the item at the top of the list has more than an order of magnitude more ratings than the rest. For most ordered lists, this is probably an edge case that may have to be dealt with separately.

Finally, I've got to say, 26 reviews (for the #2 item) doesn't seem insignificant, and the #2 item seems to have something like a 15% higher _average_ rating. Also, on another trending rating (http://www.rateitall.com/t-3239938-2010-ncaa-tournament-team...), the #2 has a significantly higher rating (17-18% it looks like), but only one less review: 4 vs. 3. I think if I were sorting either of these based on the average rating and the # of ratings, I would have done both differently.

Based on this, it seems that the Wilson score probably _over-emphasizes_ sample size, especially on things like ratings on high-traffic internet sites that may have orders of magnitude swings in the number of ratings.

It seems like the Amazon method of average works just fine, except for items with very few ratings possibly receiving disproportionately high ratings. Again, edge cases, which should probably just be penalized manually.


I recently needed to do this for sorting search results. My approach was to calculate a Bayesian average (http://en.wikipedia.org/wiki/Bayesian_average ), which was also pretty simple (once I found a formula to copy from here: http://www.thebroth.com/blog/118/bayesian-rating )

It seems to me that the Bayesian average is superior, because it considers the value relative to other items.


The Bayesian example given in the Wikipedia article looks like it works best with interval level data. Not sure it would work as well with ordinal data, because ordinals don't suffer to the same degree from outliers.

The second example also isn't ordinal data—it's just nominal "liked" vs "didn't like."

Both examples are interesting, but don't necessarily get at the problem with ordinal data introduced by the article.


reddit uses the Wilson algo for comment sorting:

http://blog.reddit.com/2009/10/reddits-new-comment-sorting-s...


A more thorough post on the same subject has been posted on HN before: How to build a popularity algorithm you can be proud of http://news.ycombinator.com/item?id=797740.

It's been a great resource for me.

EDIT: Fixed the link.


He shows the flaws of both examples but only gives a solution to the first one. How does one nicely sort a rating scale like Amazon's that does have more than "yes" and "no"?


I don't know offhand how the Wilson thing generalizes. The simple Bayesian posterior mean one I mentioned yesterday is pretty simple: suppose you have a1,a2,...,ak ratings with 1,2,...,k stars; then the posterior mean number of stars is [(1+a1).1 + (1+a2).2 + ... + (1+ak).k] / [(1+a1) + (1+a2) + ... + (1+ak)]. (Of course the denominator can be written more briefly, but I think it's clearer that way.) Equivalently, you just calculate the average star count, but you add in a bunch of fictitious reviews, one with each possible number of stars. Once again, by adjusting the number of each kind of fictitious review you can (e.g.) make less-reviewed items do better or worse relative to more-reviewed ones; by changing the coefficients that are 1,2,...,k above you can (e.g.) make a 5-star review more than 5/4 as good as a 4-star one; etc.


We have used this algorithm in our side project http://movieorwhat.com/month/ - it seems to be working very well!


Nice. I'd figured out the problems with the 2 wrong solutions myself but I wouldn't have known where to start looking for the better approach.


This is the most useful post I've read in a long time - AWESOME.


Oh god, wall of math!

/me hangs head in shame.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: