Is there a way to compare averaged ratings, each with a different count of participants?


— **Closed**. Many satisfying answers. Thanks! —

Let’s say people on the street are asked to rate three different movies on a scale of 1-10, but only if they saw it. We get the following, *already averaged* data:

* Movie A: 1000 people gave A an 8.1 on avg.
* Movie B: 10’000 people gave B a 6.9 on avg.
* Movie C: 20’000 people gave C a 7.4 on avg.

Going simply by the averaged ratings, the movie to watch would be movie A. But taking into account that far fewer people watched A, its rating does not seem as accurate, i.e. as trustworthy, as the ratings of the other two movies: C seems to be best suited for the average viewer, while A seems to be the choice if you like that type of movie (or it is a hidden gem).

So back to my question: Is there a way to calculate one value (preferably in the original rating range) for each movie that allows accurate (approximate) comparison of these movies – i.e. that takes the count of participants into account?



*What I’ve tried:*

* Dividing and multiplying these values:

|Movie|Participants: *ps*|Avg. rating: *rg*|`rg/ps*1000`|`ps/rg`|`ps*rg/1000`|
|---|---|---|---|---|---|
||||useless, I think|better, but probably not fair|maybe even better, but still unfair|


* Getting the total of all participants and averaging the ratings accordingly

In total 31000 people have participated and have given a 7.3 to all three movies on average [sum of (ps*rg for A,B,C) divided by 31000]. I feel like this could be helpful, but then again I have not gained much with this new average. I simply cannot grasp how I could find such a number…
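For reference, the pooled average described above can be checked with a few lines of Python. This is only a sketch of that one calculation; the names `movies` and `pooled_average` are made up for illustration.

```python
# Participant counts and average ratings from the question.
movies = {"A": (1000, 8.1), "B": (10_000, 6.9), "C": (20_000, 7.4)}

def pooled_average(data):
    """Average over all ratings, weighting each movie by its participant count."""
    total_votes = sum(n for n, _ in data.values())
    total_points = sum(n * avg for n, avg in data.values())
    return total_points / total_votes

print(round(pooled_average(movies), 2))  # 7.26, which rounds to the 7.3 above
```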


Any help/explanation appreciated!


In statistics, a random sample of around 1000 individuals is usually considered large enough to be representative. Asking more people still narrows the margin of error, but only slowly, so beyond that point the accuracy barely changes.

This assumes that the sample (the people participating in a study, such as a questionnaire about movies) is randomly selected. If the group is not randomly selected, for example if you pick 1000 people from a horror movie club or from a retirement home, this will influence the results and make them unrepresentative of the general population. The horror movie fans will likely pick a horror movie and the older people a golden age classic. The result is biased.

The problem you bring up is real though. Imagine you could only find 3 people to ask about movie A, 200 people for movie B and 1000 for movie C. You can still calculate an average for each movie, but for movies A and B the average is not representative, since the idiosyncrasies (all the ways that people are NOT like other people) will influence the result. For movie A, one of the three people might straight up just hate movies and rank it a 0. Even if the two other participants both rank it 10, the average is only about 6.7.


The simple answer is: unfortunately no.

The longer answer is: the problem here is not really one of the numbers, it’s the bias in the people watching and rating the films.

If you have a random sample you can work out how reliable your results are. If you’re looking at 1000+ responses, though, they will be pretty reliable, at least for these purposes. That is just as well, because working it out properly runs into two problems. One is that you need all the ratings, not just the average (you need to work out the standard deviation). The even bigger problem is that your sample isn’t random.
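To get a feel for how reliable 1000+ responses are, here is a rough Python sketch of the usual margin of error for an average. The standard deviation of 2.0 is purely an assumption (the question only gives averages), and the formula only holds for a random sample, which this is not. `margin_of_error` and `ASSUMED_STD_DEV` are illustrative names.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Approximate 95% margin of error of a sample mean (random sample assumed)."""
    return z * std_dev / math.sqrt(n)

ASSUMED_STD_DEV = 2.0  # hypothetical spread; not given in the question

for name, n in [("A", 1000), ("B", 10_000), ("C", 20_000)]:
    print(name, round(margin_of_error(ASSUMED_STD_DEV, n), 3))
```

Under that assumed spread, even movie A's average would be pinned down to within roughly 0.13 points, so sample size alone would not explain the gap between 8.1 and 7.4.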

You might look at a film, see it has a really high rating from a small number of people and go “aha, that’s a niche film!” but just from the numbers there’s no way of telling. It might be a film everyone would love if they saw it, or it might not.

Or imagine comparing an older film with a newer one: people who rush out to watch films will likely have different attitudes to those who don’t.

This doesn’t make ratings entirely useless, but it means you have to use your judgement when interpreting them. What matters is not so much the number of ratings (unless those numbers are much smaller than the ones in your example) but who’s doing the rating. If you think, for example, that roughly the same sort of people will have watched a bunch of different romcoms, or if you think Transformers films are mostly watched by Transformers fans, then comparing the ratings will be more valid. But comparing those romcoms with Transformers films… tricky.

You should look into Bayesian shrinkage. This article has an intuitive explanation of how it works. However, it’s not a trivial exercise, and it will be much harder to do with ratings on a 1-10 scale. Comparing averages over scales like that is its own problem: a movie that gets all 4s and 6s can rank the same as a movie that gets all 1s and 9s, but are they really the same?
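A very simplified version of the shrinkage idea (often called a "Bayesian average") can be sketched in Python as below. It is not the full method from the article: it only pulls each movie's average towards the pooled average of all ratings, and the prior weight of 2000 "virtual" ratings is an arbitrary assumption, as are the names `shrunk_rating` and `PRIOR_WEIGHT`.

```python
# Participant counts and average ratings from the question.
movies = {"A": (1000, 8.1), "B": (10_000, 6.9), "C": (20_000, 7.4)}

def shrunk_rating(n, avg, prior_mean, prior_weight):
    """Blend a movie's own average with the prior mean, weighted by vote count."""
    return (prior_weight * prior_mean + n * avg) / (prior_weight + n)

# Prior mean: the pooled average over all ratings (about 7.26 here).
prior_mean = (sum(n * avg for n, avg in movies.values())
              / sum(n for n, _ in movies.values()))

PRIOR_WEIGHT = 2000  # assumption: how many "virtual" ratings the prior is worth

for name, (n, avg) in movies.items():
    print(name, round(shrunk_rating(n, avg, prior_mean, PRIOR_WEIGHT), 2))
```

The fewer ratings a movie has, the more it gets pulled towards the overall average. With this particular weight A is pulled down the most but still stays ahead of C; a much larger prior weight would let C overtake it, so the choice of weight is a judgement call rather than something the data decides.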


The key thing to realize is that statistical techniques can be helpful for letting you answer the question of “Are these two numbers actually different”, but they will never let you *reverse* what you’re seeing in the data. The two possible answers you could find are “Movie A is definitely rated highest” or “Movie A is rated just as good as B or C”.

Should you buy a product with 100% positive ratings from 2 people, 80% positive ratings from 10 people, or 60% from 100 people?

What you can do is add one positive and one negative review. So it would be 3/4 for the first, 9/12 for the second, and 61/102 for the last. That would be 75%, 75% and about 60%. So the first two end up tied, and the perfect score from just 2 people no longer beats the second product.
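The "add one positive and one negative review" trick is easy to try out in Python. This is just a sketch of that rule applied to the three products in the example; `adjusted_share` and `products` are illustrative names.

```python
def adjusted_share(positive, total):
    """Share of positive reviews after adding one positive and one negative review."""
    return (positive + 1) / (total + 2)

# (positive reviews, total reviews) for the three products in the example
products = {"first": (2, 2), "second": (8, 10), "third": (60, 100)}

for name, (pos, total) in products.items():
    print(name, round(100 * adjusted_share(pos, total), 1))
```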

I don’t know if it works with numbers as big as yours, as it would barely affect the result.

Source: 3Blue1Brown