Errors when thinking about degrees of certainty

by Matthew Leitch, 17 May 2002

The previous section argued that in many everyday situations we should think of many things as being uncertain to some extent. Some more common errors arise in this sort of reasoning. Some are the result of the limitations of ordinary language.

Other errors arise when we rely on judgements of quantities such as likelihood, or try to combine those judgements using yet more judgement. This kind of judgement is often necessary in the absence of more objective information, and it is possible to learn to give reasonably accurate probability judgements in a narrow field with long practice, good feedback, and some skill. However, there are also several errors that can undermine reasoning.

This area has been studied intensively for several decades by psychologists. There are many, many scientific articles on the subject, and the debate about why people make the errors they do is particularly complex and hard to follow. The research has also been picked up by authors who have taken the names of the theories, misunderstood them, and over-generalised them in a highly misleading way. Instead of leading with the theories and risking over-generalisation, I present below some examples of questions used in experiments, the answers people usually give, and why they are wrong.

If you've never come across this research before, prepare your ego for a battering. If you get to the end of these examples and you're still thinking "Surely this is some mistake?", forget it. Thousands of people have been through the same thoughts you are having, but there have been so many studies now that you can be pretty confident that anything you can think of as an objection has been tried out and has failed. The fact is we're mostly stupid. Get used to it.

Judgement is unreliable
Question / Typical answers / Why they are wrong
Question: Imagine we have a group of 100 professional men, a mixture of 70 lawyers and 30 engineers. Now we select one at random. What is the probability that he is an engineer? Now suppose we are told that the man is 30 years old, married with no children, has high ability and motivation and promises to do well in his field. He is liked by his colleagues. What would you say was the probability that he is an engineer? Finally, we learn that he builds and races model cars in his spare time. What now is your opinion as to the probability of his being an engineer?
Typical answers: Most people answer the first of the three probability questions with "30%", which is correct. Given some information, but none that helps decide if the person is an engineer or a lawyer, people tend to say it's "50:50". Given some information that suggests an engineer, they give a number derived purely from the extent to which the information seems to represent an engineer rather than a lawyer, without regard to the proportion of engineers in the group. For example, building and racing model cars seems to suggest an engineer to me, so I would say "70%".
Why they are wrong: The first answer is quite correct. The second answer should be the same as the first, but the snippet seems to trigger us into judging probabilities purely on the basis of how much the information seems to represent a stereotypical engineer rather than a stereotypical lawyer. The problem is not so much the use of stereotypes but the failure to consider the proportion of engineers in the group as a whole, which is still an important factor when specific facts about the person selected are not conclusive. The third answer should reflect the original proportion of engineers as well as the biographical clues. In this case it should be lower than the answer based just on model cars.
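The way the proportion of engineers should combine with the biographical clues can be sketched with Bayes' rule in odds form. This is only an illustration: the likelihood ratio of 7 to 3 for the model-car clue is an assumed figure (it is what a judgement of "70%" would amount to if lawyers and engineers were equally common), and the function name is invented for this sketch.

```python
from fractions import Fraction

def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability using Bayes' rule in odds form."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

base_rate = Fraction(30, 100)  # 30 engineers in a group of 100

# An uninformative description is equally likely for either profession,
# so its likelihood ratio is 1 and the base rate is left unchanged.
after_bland_description = posterior_probability(base_rate, Fraction(1))

# ASSUMPTION: the model-car hobby is 7/3 times as likely for an engineer.
after_model_cars = posterior_probability(base_rate, Fraction(7, 3))

print(float(after_bland_description))  # 0.3
print(float(after_model_cars))         # 0.5
```

Under that assumed strength of evidence the answer lands at 50%, below the 70% suggested by the model cars alone, because the clue has to overcome the fact that lawyers outnumber engineers.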
Question: Imagine a town has two hospitals with maternity wards. In the larger hospital about 45 babies are born daily, but in the smaller hospital only about 15 are born each day. Of course about 50% of babies born are boys, though the exact percentage varies from day to day. For a year, each hospital recorded the days on which more than 60 percent of the babies born were boys. Which hospital do you think recorded the most such days?
Typical answers: Most people say "about the same" and other answers are evenly split between the large and small hospital. In other words people generally don't think size matters.
Why they are wrong: Size does matter. An elementary conclusion of statistics is that the larger a sample is the more closely it typically resembles the population from which it is taken. The larger hospital will have far fewer days when more than 60% of babies born are boys. In general people seem to have no concept of the importance of sample size.
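The hospital comparison can be checked with an exact binomial calculation. The sketch below simplifies the question by assuming exactly 15 and exactly 45 births every day, each birth independently a boy with probability 0.5; the function name is made up for the example.

```python
from math import comb

def prob_more_than_60_percent_boys(births):
    """Exact chance that strictly more than 60% of `births` are boys."""
    # Count outcomes with boys/births > 0.6, using integers to avoid rounding.
    favourable = sum(comb(births, boys)
                     for boys in range(births + 1)
                     if 10 * boys > 6 * births)
    return favourable / 2 ** births

small = prob_more_than_60_percent_boys(15)
large = prob_more_than_60_percent_boys(45)
print(f"small hospital: {small:.3f}, large hospital: {large:.3f}")
```

On these assumptions the small hospital should see such a day about 15% of the time, and the large hospital less than half as often.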
Question: Imagine a huge bag into which we cannot see, filled with Lego bricks of two colours, blue and yellow. 2/3 are one colour and 1/3 are the other colour but we don't know which is which. One person pulls out five bricks at random and finds 4 are blue and one yellow. Another person grabs a bigger handful at random and has 20 bricks, of which 12 are blue and 8 are yellow. Which person should be more confident that the bag contains 2/3 blue bricks and 1/3 yellow bricks, rather than the opposite?
Typical answers: Most people say the handful with 4 blues and one yellow is more conclusive.
Why they are wrong: Again, sample size matters but is ignored. The actual odds are twice as strong from the larger handful (16 to 1 against 8 to 1) because of the value of a larger sample, even though it does not contain such a high ratio of blue bricks. Furthermore, when asked to give the actual odds people usually underestimate the confidence given by the larger sample.
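The odds for the two handfuls can be worked out directly. This sketch assumes the bag is large enough that each brick drawn is effectively independent, and that the two possibilities start out equally likely; the function name is illustrative only.

```python
from fractions import Fraction

def odds_mostly_blue(blue, yellow):
    """Odds that the bag is 2/3 blue rather than 2/3 yellow, from even prior odds."""
    p_major, p_minor = Fraction(2, 3), Fraction(1, 3)
    likelihood_if_mostly_blue = p_major ** blue * p_minor ** yellow
    likelihood_if_mostly_yellow = p_minor ** blue * p_major ** yellow
    return likelihood_if_mostly_blue / likelihood_if_mostly_yellow

print(odds_mostly_blue(4, 1))    # 8
print(odds_mostly_blue(12, 8))   # 16
```

The odds depend only on the difference between the number of blue and yellow bricks (3 in the first handful, 4 in the second), not on the ratio that catches the eye.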
Question: A coin is to be tossed six times. Which sequence of outcomes is more likely: H-T-H-T-T-H or H-H-H-T-T-T?
Typical answers: The second sequence does not look random so people say the first sequence is more likely.
Why they are wrong: The sequences are equally likely. However, we have definite views about what a random result looks like. We expect it to be irregular and representative of the sample from which it is drawn or of the process that generated it. Since the second sequence violates those expectations it seems less likely.
Question: Write down the results of a sequence of twelve imaginary coin tosses, assuming the coin is fair, i.e. heads are as likely as tails.
Typical answers: A typical result looks pretty random.
Why they are wrong: Because people expect the results to look irregular and to be representative of the process that generated them, their sequences tend to include far too few long runs of the same outcome and to be too close to 50:50 heads vs tails for a small sample like 12 tosses.
Question: A coin is tossed 20 times and each time the result is heads! What is the probability of tails next time?
Typical answers: Gamblers tend to think that surely it must be time for tails, "by the laws of probability".
Why they are wrong: If you trust the coin the odds are the same as before at 50:50. If there is doubt about the coin the odds should actually favour another head. Those are the real laws of probability.
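To see why doubt about the coin should favour another head, here is a small sketch that weighs just two hypotheses: the coin is fair, or it is double-headed. The 1-in-1000 starting doubt is an assumed figure chosen purely for illustration, as is the function name.

```python
from fractions import Fraction

def prob_heads_next(prior_double_headed, heads_in_a_row):
    """Chance of heads on the next toss after an unbroken run of heads.

    Only two hypotheses are considered: a fair coin or a double-headed one.
    """
    prior_fair = 1 - prior_double_headed
    # Likelihood of the run: 1 for a double-headed coin, (1/2)^n for a fair one.
    weight_double = prior_double_headed * 1
    weight_fair = prior_fair * Fraction(1, 2) ** heads_in_a_row
    posterior_double = weight_double / (weight_double + weight_fair)
    return posterior_double * 1 + (1 - posterior_double) * Fraction(1, 2)

trusted = prob_heads_next(Fraction(0), 20)         # complete trust in the coin
doubted = prob_heads_next(Fraction(1, 1000), 20)   # ASSUMED 1-in-1000 doubt

print(float(trusted))   # 0.5
print(float(doubted))
```

With complete trust the answer stays at exactly 50:50. With even a tiny initial doubt, twenty heads in a row are so much better explained by a biased coin that the chance of another head rises to more than 99%.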
Question: Ten trainee teachers each give a half hour lesson which is observed by an expert teacher and a short report is written on the quality of the lesson. You see only the ten short reports and must decide for each teacher where their lesson ranks against others in percentile terms (e.g. are they in the top x% of trainees?), and where they are likely to be after 5 years of teaching relative to other teachers of the same experience.
Typical answers: Most people's ratings of lesson quality and future prospects are basically the same.
Why they are wrong: The ratings of future prospects should be less extreme than the evaluations of lesson quality. Observing a half hour lesson is not a reliable guide to performance in 5 years' time! The evidence of the lesson should be influential but not the only factor. Without that evidence you would say their prospects were about average; with the evidence you should say they are somewhere between average and the relative quality of the lesson.
Question: You are a sports coach and want to find out, scientifically, what effect your words have on your coachees. Your records show that when you give praise for a particularly good performance they tend to do less well next time. The good news is that when you give them a hard time for a poor performance they tend to do better next time.
Typical answers: Sadly some people conclude that this shows praise does not work but a verbal beating does.
Why they are wrong: Actually it just means that after a particularly good performance the next one is likely to be less good, and a particularly bad performance is likely to be followed by a better one even if you say nothing. Performance varies, in part randomly, and that is the reason for this.
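The coaching pattern appears even in a simulation where the coach's words do nothing at all. The model below is an assumption made for illustration: each performance is a stable skill plus independent luck, both drawn from a standard normal distribution, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

# ASSUMED model: each performance is stable skill plus independent luck,
# and the coach's words have no effect at all.
athletes = []
for _ in range(10_000):
    skill = random.gauss(0, 1)
    first = skill + random.gauss(0, 1)
    second = skill + random.gauss(0, 1)
    athletes.append((first, second))

# Rank by the first performance and pick out the extremes.
athletes.sort(key=lambda pair: pair[0])
worst, best = athletes[:1000], athletes[-1000:]

for label, group in (("praised after best 10%", best),
                     ("criticised after worst 10%", worst)):
    before = mean(first for first, _ in group)
    after = mean(second for _, second in group)
    print(f"{label}: first {before:+.2f}, next {after:+.2f}")
```

In this model the best performers do worse next time and the worst do better, purely because part of any extreme result is luck that is unlikely to repeat.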
Question: You have the academic results of two undergraduates for their first year of two different three year degree courses. One has scored six B grades while the other has a mixture of As, Bs, and Cs across 6 papers. Whose eventual degree result after three years would you feel most confident of predicting correctly?
Typical answers: Most people are more confident they can predict the result of the consistent B person.
Why they are wrong: They should have chosen the other person. Highly consistent patterns are most often observed when the input variables are highly redundant or correlated. The straight Bs probably contain less information than the mixture of As, Bs, and Cs.
Question: You listen to someone read out the names of a number of male and female celebrities, some more famous than others. At the end, to your surprise, you are asked to say the proportion that were female. You hadn't been keeping score so have to use your judgement and memory.
Typical answers: People tend to overestimate the proportion of women if there were more big female celebrities than big male celebrities.
Why they are wrong: More famous people are easier to remember so they come to mind more easily. If more of those really big names are female it influences your judgement much more than the minor celebrities.
Question: Are there more words in normal English text that start with R than have R as the third letter (ignoring words of fewer than 3 letters)?
Typical answers: Most people guess that more begin with R.
Why they are wrong: In fact more have R as the third letter, but it is far easier to call to mind words that start with R so we think they are more numerous.
Question: Imagine you have 10 people to choose from to form a committee. How many different committees can you form from this group with 2 members, 3 members, 4 members, and so on?
Typical answers: Since most people don't know the maths to do this they make a judgement. In one study the median estimate of the number of 2 person committees was 70, while the estimate for committees of 8 members was 20.
Why they are wrong: The number of ways to form a committee of X people out of 10 is the same as the number of ways to form a committee of 10 - X people. If you select one set of people for the committee you are simultaneously selecting the others for the "reject committee". For example, the number of 2 person committees is the same as the number of 8 person committees and is 45. However, as one tries to imagine committees it seems easier to imagine forming lots of small committees than lots of big committees, and so it seems there must be more of them.
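The committee counts are binomial coefficients, which can be listed in a few lines. This is just a quick check of the arithmetic; the variable names are illustrative.

```python
from math import comb

people = 10

# Number of distinct committees of each size that can be drawn from 10 people.
counts = {size: comb(people, size) for size in range(2, 9)}

for size, count in counts.items():
    print(f"{size}-person committees: {count}")

# Choosing who is on the committee also chooses who is left out,
# so sizes 2 and 8 give the same count.
print(counts[2] == counts[8])  # True
```

The counts rise from 45 for 2 members to a peak of 252 for 5 members and then fall back symmetrically to 45 for 8 members.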
Question: You are presented with information about some imaginary mental patients. For each patient you are given a diagnosis (e.g. paranoia, suspiciousness) and a drawing made by the person. Afterwards, you have to estimate the frequency with which each diagnosis is accompanied by various features of the drawings, such as peculiar eyes.
Typical answers: Most people recall correlations between diagnoses and pictures that match natural associations of ideas rather than actual correspondences in the examples given. For example, peculiar eyes and suspiciousness seem to go together and frequency of occurrence is judged high.
Why they are wrong: The illusory correlations are resistant to contrary data. They occur even when there is a negative correlation between diagnosis and feature, and prevent people from seeing correlations which are actually present. Similar experiments have shown that we are surprisingly bad at learning from examples. We do not always learn by experience! Other experiments have shown how hindsight colours our interpretation of experience. We think that we should have seen things coming when in fact we could not have done. Another common and powerful bias is known as the "Fundamental Attribution Error". This is the tendency to explain another person's behaviour as mainly driven by their personality/habits and our own behaviour as mainly driven by circumstances. Both judgements are usually wrong.
Question: Imagine you are asked a series of very difficult general knowledge questions whose answers are all percentages between 0 and 100% (e.g. the percentage of African countries in the United Nations). After each question is given, a sort of roulette wheel is spun to give a number. You are asked if the answer to the question is higher or lower than the random number chosen by the wheel. Then you have to estimate the answer.
Typical answers: Most people's estimates are affected by the number given by the roulette wheel. For example, in one study the median estimates of percentage of African countries were 25 for people given 10 as the starting number, and 45 for groups given 65 to start with.
Why they are wrong: This is called an "anchoring" effect. Our judgements are anchored by the given number, even if it is known to be random, but especially if it is given by someone else we think might know something. We then fail to adjust sufficiently from the anchor.
Question: Which do you think is most likely? (1) Drawing a red marble from a bag containing 50% red and 50% white marbles, (2) drawing a red marble 7 times in succession from a bag containing 90% red marbles and just 10% white (assuming you put the marble back each time), or (3) drawing at least one red marble in 7 tries from a bag containing 10% red marbles and 90% white (assuming you put them back each time).
Typical answers: Most people think drawing 7 reds in succession is most likely, and at least 1 red in 7 tries the least likely.
Why they are wrong: In fact the probabilities are very similar, but the reverse of what most people think: (1) = 0.50, (2) = 0.48, (3) = 0.52. This illustrates our general tendency to underestimate the likelihood of something happening at least once in many tries, and to overestimate the likelihood of something likely happening several times in succession. For example, we might look at the many risks affecting a project and see that each alone is unlikely. Based on this we tend to think there's little to worry about. In fact, because there are lots of unlikely risks, the risk of at least one thing going wrong is much higher than we imagine.
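The three marble probabilities follow from multiplying independent draws, as the short sketch below shows. The final line applies the same formula to a project with 20 independent risks of 5% each; those project figures are hypothetical, chosen only to illustrate the point about many unlikely risks.

```python
def prob_at_least_one(p_each, tries):
    """Chance that an event of probability p_each happens at least once."""
    return 1 - (1 - p_each) ** tries

p_single_red = 0.5                          # (1) one draw, bag is 50% red
p_seven_reds = 0.9 ** 7                     # (2) seven reds in a row, bag is 90% red
p_at_least_one = prob_at_least_one(0.1, 7)  # (3) at least one red in seven, bag is 10% red

print(round(p_single_red, 2), round(p_seven_reds, 2), round(p_at_least_one, 2))
# prints: 0.5 0.48 0.52

# HYPOTHETICAL project: 20 independent risks, each only 5% likely.
print(round(prob_at_least_one(0.05, 20), 2))  # 0.64
```

So the single draw sits in the middle at 0.50, the run of seven reds is slightly less likely, and at least one red in seven tries is slightly more likely.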
Question: You are asked to make some predictions about the future value of the FTSE100 index. For a given date you are asked to say a value that is high enough that you are 90% sure the actual index will not be higher, and another number so low that you are 90% sure the actual index will not be lower.
Typical answers: Most people are too confident of their judgement for a difficult estimate like this and their ceiling is too low while their floor is too high.
Why they are wrong: Actually there are various ways to find out what a person thinks the probability distribution of some future value is. Different procedures give different answers and an anchoring effect from the first numbers mentioned is quite common.
Question: Here's another question with bags of coloured Lego. This time I have two big bags of Lego. One has 700 blue bricks and 300 yellow bricks in it. The other has 300 blue bricks and 700 yellow bricks. I select one by tossing a fair coin and offer it to you. At this point there is nothing to tell you if it is the mainly blue bag or the mainly yellow bag. Your estimate of the probability that it is the mainly blue bag is 0.5 (i.e. 50% likely). Now you take 12 bricks out of the bag (without looking inside!) and write down the colour of each before putting it back in the bag and drawing another. 8 are blue and 4 are yellow. Now what do you think is the probability that the bag is the predominantly blue one?
Typical answers: Most people give an answer between 0.7 and 0.8.
Why they are wrong: The correct answer is 0.97. Hard to believe, isn't it? Most people are very conservative when trying to combine evidence of this nature. This contrasts with some of the other situations described above where we seem to be overconfident. Those were situations where we had little information to go on. Here we have all the information we need to come up with a correct probability but fail to. The more observations we have to combine to make our judgement the more over-conservative we are. Another odd finding is that where there are a number of alternative hypotheses and we have to put probabilities on all of them we tend to assign probabilities across the set that add up to more than 1, unless we are forced to give probabilities that sum to 1 as they should. Our ability to combine evidence is so poor that even crude mathematical models do better than judgement.
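The figure of 0.97 for the two-bag question can be reproduced with Bayes' rule. The sketch below treats each draw as independent (the bricks are replaced) and starts from the 50:50 prior given by the coin toss; the function name is invented for the example.

```python
from fractions import Fraction

def prob_mostly_blue_bag(blue_drawn, yellow_drawn):
    """Posterior probability of the 70%-blue bag, starting from a 50:50 prior."""
    p_blue_bag = Fraction(7, 10) ** blue_drawn * Fraction(3, 10) ** yellow_drawn
    p_yellow_bag = Fraction(3, 10) ** blue_drawn * Fraction(7, 10) ** yellow_drawn
    # Equal priors cancel, so the posterior is just the share of the likelihood.
    return p_blue_bag / (p_blue_bag + p_yellow_bag)

posterior = prob_mostly_blue_bag(8, 4)
print(round(float(posterior), 2))   # 0.97
```

Each surplus blue brick multiplies the odds by 7/3, so a surplus of four gives odds of about 30 to 1 in favour of the mainly blue bag.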
© 2002 Matthew Leitch