For an assortment of reasons, I found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J.C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005).


The premise was (roughly) that elderly people are stereotyped as "warmer" to the extent that they are also perceived as incompetent (as in "Grandma's adorable, but she is a bit doddery"). The authors wrote:

We might expect a competent elderly person to be seen as less warm than a reassuringly incompetent elderly person. The open question is whether this predicted loss of warmth is offset by increases in perceived competence, or whether efforts to gain competence may backfire, decreasing rated warmth without corresponding benefits in competence(*).

The experimental scenario was fairly simple. There were 55 participants in three conditions. In the Control condition, participants read a neutral story about an elderly man named George. In the High Incompetence (hereafter, just High) condition, the story had extra information suggesting George was rather forgetful. In the Low Incompetence (hereafter, just Low) condition, by contrast, the story had extra information suggesting George had a pretty good memory for his age. The dependent variable was a rating of how warmly participants felt towards George: whether they thought he was *warm*, *friendly*, and *good-natured*. Each of those was measured on a 1-9 scale.

Here is the results section:

Let's see. The three warmth ratings were averaged, and then a one-way ANOVA was performed. This was statistically significant, but of course that doesn't tell us exactly where the differences are coming from. You might expect to see this investigated with standard ANOVA post-hoc tests (such as Tukey's HSD), but in this case, the authors apparently chose to report simple *t* tests --- "paired comparisons" (**) --- comparing the groups. Between High and Low, the *t* value was reported as 5.03, and between High and Control, it was 11.14. These values are always going to be statistically significant; for 5.03 with 35 *df*s this is a *p* of around .00001, and for 11.14 with 34 *df*s, the *p* value is bordering on the homeopathic, certainly far below .00000001.

Hold on a minute. The overall 3x1 ANOVA was just about significant at *p* < .03, but two of the three possible *t* tests were slam-dunk certainties? That doesn't feel right.

Let's plug those means and SDs into a *t* test calculator. There are several available online (e.g., this one), or you can build your own in a few seconds with Excel: put the means in A1 and B1, the Ns in C1 and D1, the SDs in E1 and F1, and then put this formula in G1:
=(A1-B1)/SQRT((E1*E1/C1)+(F1*F1/D1))

(That just gives you the Student's *t* statistic; adding *p* values is left as an exercise for the reader, as is the extension to Welch's *t* test.)

Before we can run our *t* test, though, we need the sizes of each sample. We know that nHigh + nLow + nControl equals 55. Also, the *t* test for High/Low had 35 *df*s, meaning nHigh + nLow equals 37, and the *t* test for High/Control had 34 *df*s, meaning nHigh + nControl equals 36. Putting those together gives us 18 for nHigh, 19 for nLow, and 18 for nControl.

OK, now we can do our calculations. Here's what we get:

High/Low: *t*(35) = 1.7961, *p* = .0811

High/Control: *t*(34) = 3.2874, *p* = .0024

Low/Control: *t*(35) = 0.7185, *p* = .4772 (just for completeness)

So there is *no* statistically significant difference between the High and Low conditions. And, while the High/Control comparison is significant, its strength is far less than what was reported. If you ran this experiment, you might conclude that the intervention was maybe doing something, but it's not clear what. Certainly, the authors' conclusions seem to need substantial revision.

But wait... there's more. (Alert readers will recognise some of the ideas in what follows from our GRIM preprint.)
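The comparisons above are easy to reproduce in a few lines of Python. A minimal sketch follows; note that the paper's standard deviations are not reproduced in this post, so the SD values in the demo call are placeholders, not the actual figures:

```python
from math import sqrt

def t_unpooled(m1, sd1, n1, m2, sd2, n2):
    """Same statistic as the Excel one-liner above: unpooled variances."""
    return (m1 - m2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def t_pooled(m1, sd1, n1, m2, sd2, n2):
    """Classic Student's t with pooled variance; df = n1 + n2 - 2."""
    sp2 = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

# Deduced sample sizes: nHigh = 18, nLow = 19, nControl = 18.
# PLACEHOLDER SDs of 1.0 -- substitute the SDs from the paper's table.
print(round(t_pooled(7.47, 1.0, 18, 6.85, 1.0, 19), 4))
```

Feed in the actual SDs from the paper's table and the pooled version gives the *t*(35) and *t*(34) values quoted above.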

Remember our sample sizes: nHigh = 18, nLow = 19, nControl = 18. And the measure of warmth was the means of three items on a 1-9 scale. So the possible total warmth scores across the 18 or 19 participants, when you add up the three-item means, were (18.000, 18.333, 18.666, ..., 161.666, 162.000) for High and Control, and (18.000, 18.333, 18.666, ..., 170.666, 171.000) for Low.

Now, the mean of the High scores was reported as 7.47. Multiply that by 18 and you get 134.46. Of course, 7.47 was probably rounded, so we need to look at what it could have been rounded from. The candidate total scores either side of 134.46 are 134.333 and 134.666. But when you divide 134.333 (recurring) by 18, you get 7.46296, which rounds (and truncates) to 7.46, not 7.47. And when you divide 134.666 (recurring) by 18, you get 7.48148, which rounds (and truncates) to 7.48, not 7.47.

Let's look at the Low scores. The mean was reported as 6.85. Multiply that by 19 and you get 130.15. Candidate total scores in that range are 130.000 and 130.333. But when you divide 130.000 by 19, you get 6.84211, which rounds (and truncates) to 6.84, not 6.85. And when you divide 130.333 (recurring) by 19, you get 6.85956, which rounds to 6.86. (It could be truncated to 6.85 if you really weren't paying attention, I suppose.)

For completeness, the Control mean of 6.59 *is* possible: 6.59 times 18 is 118.62, and 118.666 (recurring) divided by 18 is 6.59259, which rounds and truncates to 6.59.

So this means that, given the *df*s as they are reported in Cuddy et al.'s article, the two means corresponding to the experimentally manipulated conditions are necessarily incorrect.

A possible solution that allows the means to work is if the *df*s of the second *t* test were misreported. If you change *t*(34) to *t*(35), that implies nHigh = 19, nLow = 18, nControl = 18, and now the means can be computed correctly. But one way or another, there's yet more uncertainty here.

To summarise, either:
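That alternative assignment can be verified with the same granularity argument (a quick sketch, again assuming integer 1-9 items and standard rounding to two decimals):

```python
import math

def mean_possible(mean, n, n_items=3):
    """True if some integer item total over n participants gives a mean
    that rounds (2 d.p.) to the reported value."""
    lo = math.ceil((mean - 0.005) * n * n_items)
    return lo < (mean + 0.005) * n * n_items

# As printed (nHigh = 18, nLow = 19): neither manipulated-condition mean works.
print(mean_possible(7.47, 18), mean_possible(6.85, 19))  # False False
# With the df corrected (nHigh = 19, nLow = 18): both means become possible.
print(mean_possible(7.47, 19), mean_possible(6.85, 18))  # True True
```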

/a/ Both of the *t* statistics, both of the *p* values, and one of the *df*s in the sentence about paired comparisons are wrong;

or

/b/ "only" the *t* statistics and *p* values in that sentence are wrong, *and* the means on which they are based are wrong.

And yet, the sentence about paired comparisons is pretty much the *only* evidence for the authors' purported effect. Try removing that sentence from the Results section and see if you're impressed by their findings, especially if you know that the means that went into the first ANOVA are possibly wrong too.

As of today, Cuddy et al.'s article has 523 citations, according to Google Scholar; yet, presumably, none of the people citing it, nor indeed the reviewers, can have actually read it very carefully. So I guess some of the old stereotypes are true, at least when it comes to what people say about social psychology.

(*) Note that the study design arguably did not really test any efforts by the elderly person to gain competence; it tested how participants reacted to *descriptions* of the person's competence by a third party, which is not quite the same thing.

(**) I presume that the term "paired comparisons" refers to the fact that the comparison was between a pair of groups in each case, e.g., High/Low or High/Control. The authors can't have performed a paired samples *t* test, since the samples were independent.

[Update 2016-07-04 13:32 UTC: Thanks to Simon Columbus for his comment, pointing out the PubPeer thread on this article. Apparently a correction has been drafted (or maybe published already?) that fixed the *t* values, and then claims, utterly bizarrely, that this does not change the conclusion of the paper. But even if we accept that for a nanosecond, it does not address the question of why the means were not correctly reported. It looks like a second correction may be in order. I wonder what Lady Bracknell would say?]

[Update 2016-07-09 22:17 UTC: Fixed an error; see comment by John Bullock.]

This was already noted by Andrew Gelman; there's a PubPeer conversation about it: https://pubpeer.com/publications/E043DD982C3CC7F4B2CB4980522684

Yep - if you look at the comments on Gelman's blog you'll see that this issue was first brought to his attention by someone signing themselves as "Nick" (http://andrewgelman.com/2016/01/26/more-power-posing/#comment-261113). For a variety of reasons (principally the desire to get the GRIM preprint out the door, before demonstrating the technique as here), I just didn't get round to blogging it until now, although I drafted most of this post several months ago.

So how did they get the ultra-low p-values?

Maybe they *did* run paired t-tests, incorrectly? But that can't be possible because the n's were inconsistent across groups.

So what on earth happened?

I looked up the article and it starts with a poem. I think that about says it all.

For what it is worth, the statistics from the main effect ANOVA also do not seem to match the reported means and standard deviations; although the result is still significant. Using Nick's calculations of the sample sizes for each condition, one can compute MSb = 3.686269091 and MSw = 0.988803846, so F = MSb/MSw = 3.728008447, which gives p = 0.03069. The original paper reported F = 3.93.

ReplyDeleteA quick check on the numbers does not suggest that this could be a rounding error mistake (although maybe I was not creative enough). It seems like the paper is full of sloppy work all the way around.

The correction posted on PubPeer only changes the t-statistics, but leaves the reported means and standard deviations unchanged, so there is still a minor discrepancy.
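The between-groups mean square quoted in the comment above can be reproduced from the reported means and the deduced sample sizes alone; a sketch follows (the within-groups term, MSw, needs the SDs, which are not reproduced in this post):

```python
means = {"High": 7.47, "Low": 6.85, "Control": 6.59}
ns = {"High": 18, "Low": 19, "Control": 18}

total_n = sum(ns.values())
grand_mean = sum(ns[g] * means[g] for g in means) / total_n

# Between-groups sum of squares, divided by k - 1 degrees of freedom.
ss_between = sum(ns[g] * (means[g] - grand_mean) ** 2 for g in means)
ms_between = ss_between / (len(means) - 1)

print(round(ms_between, 6))  # matches the MSb = 3.686269 quoted above
```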

"Remember our sample sizes: nHigh = 18, nLow = 19, nControl = 18. And the measure of warmth was the means of three items on a 1-9 scale. So the possible total warmth scores across the 18 or 19 participants, when you add up the three-item means, were (18.000, 18.333, 18.666, ..., 143.666, 144.000) for High and Control, and (18.000, 18.333, 18.666, ..., 170.666, 171.000) for Low."

Nick, this first set of possible total warmth scores -- (18.000, 18.333, 18.666, ..., 143.666, 144.000) -- seems a bit off. Don't you want (18.000, 18.333, 18.666, ..., 161.666, 162.000)?