15 February 2017

A different set of problems in an article from the Cornell Food and Brand Lab

[[ Update 2018-09-23 20:30 UTC: Fixed some links that were broken because some original documents had gone missing from sites controlled by the Cornell Food and Brand Lab. ]]
[[ Update 2017-10-19 17:00 UTC: This post now features in a BuzzFeed article here. ]]

(This is the first time I've blogged on the subject of the ongoing kerfuffle around the Cornell Food and Brand Lab, which was started by reactions to this blog post [[2018-09-23: that post is long gone, but an archived version is here]] by the lab's Director.  Those reactions led to our preprint and subsequent peer-reviewed article showing that the lab's research contained a lot of errors and inconsistencies, to some more blog posts describing problems at the same lab, and to some scrutiny in the media.  This story concerns another article from the same lab, but one with some rather different problems from those we've documented so far.)

The article

The American Medical Association has a journal dedicated to all aspects of pediatric medicine, JAMA Pediatrics (formerly known as Archives of Pediatrics & Adolescent Medicine). If you are involved in anything to do with children's health, this is the journal that you really, really want to publish in.  It has an impact factor of 9.5 (which is kind of impressive even if you don't believe in impact factors), making it the "highest-ranking pediatric journal in the world", and it accepts just 9% of the research articles that it receives.

So presumably the editors of JAMA Pediatrics thought that a short article ("research letter") that they accepted for publication in the "Pediatric Forum" section of the journal in 2012, entitled "Can Branding Improve School Lunches?", could result in substantial improvements in nutrition for American schoolchildren. This article, by Brian Wansink and David R. Just at Cornell and Collin R. Payne at New Mexico State, is behind a paywall, but if you know of any alternative ways to get to a published PDF using its DOI (in this case, 10.1001/archpediatrics.2012.999) then you can probably find it.  You can also download a pre-publication draft of the article from here [[2018-09-23: now here]] [PDF].  In fact I encourage you to do that, even if you have access to the published article from the journal, because one of the differences between the two versions is quite important.

The method... a little unclear

Here's how the study worked.  The researchers recruited 208 students at seven elementary schools.  As part of their regular lunch menu, these students were already allowed to take an apple, a cookie, or both, in addition to their main dish.  During the study, the researchers manipulated these "bonus" food items by adding (or not) a sticker of a cartoon character (presumably to the skin of the apple, or the packaging of the cookie).  The sticker depicted either a well-known, branded cartoon character (specifically, "Elmo", presumably meaning the character from Sesame Street) or an "unknown" character.

The study lasted five days.  On days 1 and 5, there were no stickers on either item.  On days 2, 3, and 4, three of the eight other possible combinations were deployed (each item could carry an Elmo sticker, an unknown-character sticker, or no sticker, giving 3 x 3 = 9 combinations, minus the "no stickers on either item" case; see the sketch below), and the percentage of students taking each item was noted.  The authors do not explain why they tested only three of the possible combinations.  Indeed, it's hard to tell exactly which combinations they tested:
On one day, children were offered a choice between an unbranded apple and a cookie that had a sticker of a familiar popular character (ie, Elmo) on it. On another day, children were offered a choice between an unbranded cookie and an apple that had a sticker of the Elmo icon on it. On another day, their choice was between an unbranded cookie and an apple with a sticker of an unknown character.
Does "unbranded" here mean "with the sticker of the unknown character" or "with no sticker"?  It's completely unclear to me.  The way the sentences are written, particularly the last one, seems to imply that "unbranded" means "no sticker".  But elsewhere in the article, the authors noted that the presence of the unknown character had no effect on the consumption of apples; they also claimed that "this study suggests that the use of branding or appealing branded characters [emphasis added] may benefit healthier foods [sic]" (p. 968), which suggests that the branding/no branding distinction is between Elmo and the unknown cartoon.  As a minimum, this is a confusing mix of terminology, which might lead to inaccuracies when this article is cited.
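For concreteness, here is the sketch promised above: a few lines of R (my own reconstruction of the design space, not anything taken from the article) that enumerate the nine possible sticker combinations and the eight candidates for the intervention days.

    # Each item could carry an Elmo sticker, an unknown-character sticker,
    # or no sticker (the labels are mine)
    sticker <- c("Elmo", "unknown", "none")
    design <- expand.grid(apple = sticker, cookie = sticker)   # 9 combinations
    # Drop the all-plain case, which served as the control on days 1 and 5
    design <- subset(design, !(apple == "none" & cookie == "none"))
    nrow(design)   # 8 -- of which only 3 were deployed on days 2, 3, and 4

Whichever three of these eight combinations the authors actually used, the ambiguity of "unbranded" makes it impossible to establish from the text alone.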

It's also worth noting the vocabulary that was used to describe the process: "children were offered a choice between [emphasis added]" an apple or a cookie (p. 968).  That exact phrasing is used three times.  To me, this conveys the idea that the children had to choose one or the other (and abandon the unchosen item).  The fact that they could take both an apple and a cookie is only mentioned once, almost in passing, in the Methods section.  (On a tangent, this also seems to me to cast some doubt on the utility of the study, in that nothing was done to try and reduce the number of cookies being taken --- for example, by putting unpleasant stickers on them.  Eating an apple is fine, but it doesn't seem clear how it would counteract the purported deleterious effects of eating a cookie.)

The results... not what they appear to be

The authors set up a web page here [[2018-09-23: this page has been taken down. The last available snapshot at archive.org is here]] to promote the study.  That page claims that "Compared to the pretest, the apple choice nearly doubled when the Elmo sticker was present. There was no effect of the Elmo sticker on the cookie and no effect of the unknown character on the apple."  But an examination of the Results section of the article shows that, using conventional levels of significance, no such effect was demonstrated.  The key statement here is:
The Elmo sticker led children to nearly double their apple choice compared with the pretest control session (Χ2=2.355; P=.06)
There are two problems here.  First, the implicit claim that a p value of .06 is as good as one of .05 is, frankly, cheating in plain sight.  (Of course, we shouldn't be playing the p value game at all, but if we're going to play it, let's at least all play it by the same rules.)  Second, the actual p value associated with a chi-square of 2.355 with one degree of freedom is not .06 but .12.  If you don't want to trust some random online calculator page, here are some tables of critical values for the chi-square distribution from the National Institute of Standards and Technology (NIST); you can see from the first line (df=1) of the table for the upper tail that the value of 2.355 isn't significant at the .10 level, never mind the .05 level.  Or, you can put "1-pchisq(2.355, 1)" into R.  (The degrees of freedom are not reported, but it seems clear that this is just a 2x1 contingency table: the number of children taking an apple on day 1 versus day 2.  Even if it's a 2x2 contingency table including cookies, there's still just one degree of freedom.  Of course, if this chi-square statistic were derived from a larger contingency table, say one counting consumption across all five days, the degrees of freedom would be larger, and so the p value would be even higher.)
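If you'd rather see that calculation spelled out, here it is in R (these are the standard distribution functions that ship with R; the statistic and degrees of freedom are the ones discussed above):

    # Upper-tail p value for chi-square = 2.355 with 1 degree of freedom
    pchisq(2.355, df = 1, lower.tail = FALSE)   # ~0.125, i.e. .12, not .06
    # The same calculation, written as quoted above
    1 - pchisq(2.355, 1)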

Now, some readers may be thinking that we can apply the universal statistical get-out-of-jail-free card and do a one-tailed test, which enables us to magically halve our p value because, um, reasons.  Indeed, the authors announced that "All analyses were 1-tailed tests" (p. 968), and they made use of this slightly further on in the results section, when they claimed that t(78)=1.65 has a p value "equal to" .05.  The two-tailed p value for those values is actually .103, so a one-tailed test gives a p of .051, which is not "equal to" .05 (unless you are either using an alpha level of .05499 or think that p values can be rounded down before being compared to .05), but hey-ho, this is what a lot of other people do, so let's pretend it's OK.  But this one-tailed shuffle doesn't work for the chi-square, because the p value reported by statistical software for a chi-square test is already, in effect, a one-tailed value.  The tails of the chi-square distribution are asymmetric; the p value for the upper tail is the one you're usually interested in, and it's what your software will tell you by default.  In any case, but especially with only one degree of freedom, the upper tail is much, much longer than the lower tail; compare the values for 0.95 (lower tail) and 0.05 (upper tail) here.  You can't get out of jail with the one-tailed card when the p value from your chi-square statistic turns out to be a dud.  And you certainly can't just cut your p value in half while shouting "Diminuendo!" (I don't think it's very likely that the .06 emerged in that form from SPSS). In short, that p value should have been reported as .12, not .06; and, whether you play the p value game or not, that is not very convincing evidence of anything.
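Again, these are one-line checks in R, using the values reported in the article:

    # Two-tailed and one-tailed p values for t(78) = 1.65
    2 * pt(1.65, df = 78, lower.tail = FALSE)   # ~0.103 (two-tailed)
    pt(1.65, df = 78, lower.tail = FALSE)       # ~0.051 (one-tailed), not .05
    # There is no analogous halving for the chi-square test, because
    # pchisq(..., lower.tail = FALSE) is already an upper-tail probability.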

[[[[[ Update 2017-02-17 14:15 UTC
Carol Nickerson suggests that the design and analysis were not correct for this study.  For example, at the pretest (day 1), the students seem to have had 4 options: plain apple, plain cookie, both, neither.  For the second intervention, they also seem to have had 4 options: Elmo apple, plain cookie, both, neither.  This 4 x 4 design was apparently reduced to a 2 x 2 design with pretest options: no apple, plain apple, and intervention options: no apple, Elmo apple.  Each of the 4 cells of this cross-tabulation should have contained the frequency of the paired observations: (1) no apple, no apple; (2) no apple, Elmo apple; (3) plain apple, no apple; (4) plain apple, Elmo apple, respectively.  This cross-tabulation should have been analyzed with McNemar's test, not the chi-square test.
]]]]]
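For what it's worth, here is a minimal sketch in R of the paired analysis that Carol describes.  The frequencies below are invented purely for illustration, since the article reports no such counts; only the structure of the table follows her description.

    # Hypothetical paired counts for the same 208 children (invented numbers;
    # the article does not report these frequencies)
    tab <- matrix(c(120, 40,    # pretest "no apple":    no apple / Elmo apple
                     18, 30),   # pretest "plain apple": no apple / Elmo apple
                  nrow = 2, byrow = TRUE,
                  dimnames = list(pretest      = c("no apple", "plain apple"),
                                  intervention = c("no apple", "Elmo apple")))
    mcnemar.test(tab)   # tests the change in apple-taking, respecting the pairing

McNemar's test conditions on the discordant pairs (the children who changed their choice between sessions), which is exactly the information that the ordinary chi-square test throws away.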

The figure... no, I don't know either

Possibly the strangest thing about this article is the figure that is supposed to illustrate the results.  Here is how the figure looked in the draft article:

[Figure (draft version): the percentage of children taking an apple or a cookie in each intervention condition.]

That looks pretty reasonable, right?  OK, so it leaves out the days when there were no stickers, but you can see more or less what happened on the days with interventions.  A fairly constant proportion of about 90-92% of the children took a cookie, and between 23% and 37% of them took an apple.

Now let's see how the results were represented graphically when the article appeared in JAMA Pediatrics:

[Figure (published version): baseline percentages for the "Unbranded" condition, plus bars showing the "percentage of change in selection from baseline" for the three intervention conditions.]

Whoa.  What's going on here?  Is this even the same study?

Let's start with the leftmost pair of columns ("Unbranded").  The note with the asterisk (*) tells us that these columns represent the baseline percentage of children taking an apple (about 22%, I reckon) and the baseline percentage of children taking a cookie (about 92%).  This presumably shows the results from Day 1 that were missing from the figure in the draft.  (Apples are now the darker column and cookies are the lighter column.)

The dagger (†) on the titles of the other three pairs of columns sends us to a different note, which describes the bars as representing the "percentage of change in selection from baseline".  This needs to be unpacked carefully.  What it seems to mean is that if 22% of children took an apple on Day 1, and 37% of children took an apple on Day 2 (let's assume that the columns are in time order, so "Branded Apples" is Day 2), then we should calculate (37% - 22%) / 22%, which gives around 0.68, or 68% (or maybe a bit more; there is quite a lot of rounding error here, since we are obliged to get all of these numbers from visual inspection of the figures, in the absence of any actual numerical results).  So the meaning of the height of the bar for apples in "Branded Apples" is that the percentage of children taking an apple increased by 68%.  But it makes absolutely no sense to plot this chart in this way.  The label on the Y axis (%) means something completely different between the pair of columns labelled with an asterisk and the pairs labelled with a dagger; they both happen to be numbers that are expressed as percentages, but those numbers mean completely different things.  And the next two pairs of columns have the same problem.  To see how meaningless this is, consider what would happen if just 0.5% of children took an apple on Day 1 and 2% took an apple on Day 2.  The bar for apples in the second pair ("Branded Apples") would be at 300%, suggesting to the casual reader that the intervention had had a huge effect, even though almost no children actually took an apple on either day.  As published, this figure is almost meaningless, and arguably deceptive: it looks like something spectacular has occurred.  Just the kind of thing that might impress a policymaker in a hurry, for example.
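The arithmetic is trivial, but it's worth making explicit.  A few lines of R (my own illustration; the input values are my eyeball estimates from the figures) show both the calculation and why plotting it this way misleads:

    # "Percentage of change in selection from baseline", as plotted in the
    # published figure (input values read off the figures by eye)
    pct_change <- function(baseline, intervention) {
      100 * (intervention - baseline) / baseline
    }
    pct_change(22, 37)    # ~68: the height of the "Branded Apples" bar
    # A tiny absolute change produces a spectacular-looking bar:
    pct_change(0.5, 2)    # 300, even though almost no child took an apple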

Let's ignore for a moment all of the other obvious problems with this study (including the fact that consumption of cookies didn't decline at all, so that one of the results of the study was a net increase in the calorie consumption of the students who would otherwise not have eaten an apple).  For now, I just want to know how this figure came to be published.  We know that the figure looked much more reasonable in a late version of the draft of the article (the PDF that you can download from the link I gave earlier is dated 2012-05-29, and the article was published online on 2012-08-20, which suggests that the review process wasn't especially long).  I can't help wondering at what point in the submit-revise-accept-typeset process this new figure was added.  I find it very strange that the reviewers at a journal with the reputation, impact factor, and rejection rate of JAMA Pediatrics apparently did not challenge it.

The participants... who exactly?

The article concludes with this sentence: "Just as attractive names have been shown to increase the selection of healthier foods in school lunchrooms, brands and cartoon characters can do the same with preliterate [emphasis added] children" (p. 968).  But we were told in the Methods section that the participants were elementary school students aged 8-11 (let's not worry for now about whether Elmo would be the best choice of cartoon character to appeal to children in this age range).  I have tried, and failed, to imagine how a team of three researchers who were writing up the results of a study in seven elementary schools, during which they identified 208 children aged 8-11 and obtained consent from their parents, could somehow manage to type the word "preliterate" when describing those participants.  The reader could be forgiven for thinking that there might be something about the literacy levels of kids in the third through sixth grades in the state of New York that we should know about.

But it seems that the lead author of the article may have been a little confused about the setting of the study as well.  In an article entitled "Convenient, Attractive, and Normative: The CAN Approach to Making Children Slim by Design" (published in Childhood Obesity in 2013), Dr. Wansink wrote (p. 278): "Even putting an Elmo sticker on apples led 70% more daycare kids [emphasis added] to take and eat an apple instead of a cookie", with a reference to the article I've been discussing here.  Not only have the 8-11 year olds now become "daycare kids", but it is also being claimed that the extra apple was taken instead of a cookie, a claim not supported by the JAMA Pediatrics article; furthermore, the clear implication of "take and eat" is that all of the children ate at least some of their extra apple, whereas the JAMA Pediatrics article claimed only that "The majority of children [emphasis added] who selected a food ate at least a portion of the food" (p. 968).

These claims were repeated in Dr. Wansink's article, "Change Their Choice! Changing Behavior Using the CAN Approach and Activism Research" (published in Psychology & Marketing in 2015): "Even putting an Elmo sticker on apples led to 46% more daycare children taking and eating an apple instead of a cookie" (p. 489).  The efficacy of the intervention seems to have declined somewhat over time, though, as the claimed increase in the number of children taking an apple has dropped from 70% to 46%.  (It's not clear from the JAMA Pediatrics article what the correct figure ought to be, since no percentages were reported at all.)

Conclusion

Dr. Wansink wrote a rather brave blog post recently in which he apologized for the "pizza papers" incident and promised to reform the research practices in his lab.  However, it seems that the problems with the research output from the Cornell Food and Brand Lab go much further than just that set of four articles about the pizza buffet.  In the first paragraph of this post I linked to a couple of blog posts in which my colleague, Jordan Anaya, noted issues similar to those in the "pizza papers" in seven other articles from the same lab, dating back as far as 2005; and here I have presented another article, published in a top-rated medical journal, that seems to have several problems of a rather different kind.  Dr. Wansink would seem to have a long road ahead of him to rebuild the credibility of his lab.

Acknowledgement

Although I had previously glanced briefly at the draft version of the "Elmo" article while looking at the public policy-related output of the Cornell Food and Brand Lab, I want to thank Eric Robinson for bringing the published version to my attention, along with the problems with the figure, the inconsistent p value, and the "preliterate" question.  All errors of analysis in this post are mine alone.