How To Be A Proper Fucking Scientist – Answers Sheet

Previously, I posed 10 questions to tax and tickle people’s scientific methodology instead of, sigh, whether they know the definition of “photosynthesis” (if we really wanted to tax science knowledge, surely we’d ask about the light-dependent and light-independent redox pathways… No? Okay.). This is not easy – many of the questions are classic counter-intuitive examples of reasoning biases – and none of this is taught properly at school.

Anyway, here are the answers. This is long, and each one could take up a post of its own, as they all require a lot of deconstruction and explanation.

Short aside: Unlike most posts I write, I’m not linking out to definitions. The jargon is in bold for people to research at their leisure. I did ponder linking each to LMGTFY, but I’m just so damn lazy.

Question 1

Question: The chaya leaf is claimed as a cancer cure. What is the most important part of the article announcing this?

Answer: (c) The experiments they have done

This should have been easy. The only relevant thing is the experiments done to prove the claim. That is ultimately what science is about: demonstrating a hypothesis through correct methodology and experimental processes. You can have all the cool theories you want, but if it doesn’t turn out to be true, then it means nothing. None of the alternative answers mentioned “it makes sense” or “it was a sound theory”, however.

The other answers on offer presented numerous fallacies and things that are very common in the media – but they are irrelevant. The expert’s qualifications are irrelevant because of the argument from authority fallacy. Ditto the institution. Although who they are and where they come from can be a useful proxy, and are worth raising when people speak outside their field of expertise, they are not directly relevant and have no impact on the validity of experimental work. A nine year-old girl working in a shed can produce science as valid as a professor tenured at Harvard – it all depends on how their theories and experiments hold up (note that the opposite assumption, where you presume the unqualified person is the better one, is known as the Galileo Gambit).

The exact same goes for the reporter’s opinion on the piece. If the reporter is a qualified scientist, the ‘expert’ in question isn’t, and the reporter is saying it’s all bunk, that’s probably a good indication – but it’s still irrelevant and has no effect on the actual experimental evidence for a claim. Similarly, whether friends have tried something or not is anecdotal evidence – anecdotes are not the same as data because they are prone to so many spectacular biases, such as selection bias, selective reporting or cherry picking. Also, the plural of “anecdote” is not “data” (except when it is).

Question 2

Question: Autism correlates with organic food sales. What does this tell us?

Answer: (a) Nothing in particular

The autism and organic food correlation has been around for a while now and has slowly become a classic go-to example for this. The data is reasonably legit, but of course mere correlation is not causation; this rules out answers (b) and (c). From a correlation alone we cannot tell whether one causes the other, or which way around it goes. For that, we need to do a controlled experiment, where one variable is altered and we watch for a change in the other. The existence of a correlation can give us a nice hypothesis to test, but it needs to be tested in a controlled way first.

Answer (e) is obvious nonsense and constitutes both an ad hominem attack, where you dismiss data and evidence on irrelevant personal grounds, and batcrap crazy conspiracy theory thinking (and no, that isn’t an ad hominem attack, it’s just a statement of fact).

When it comes to answer (d), however, it’s complicated. One can assume a hidden third relationship is at work, something known as a confounding variable or confounding factor – the classic example being that shoe size and reading ability correlate, with age as the hidden variable driving both. But that’s not always the case: there’s a correlation between smoking and lung cancer, yet the cause there is direct, demonstrated through other studies and good theoretical links, rather than through a third hidden variable. Alternatively, there may be no causal connection at all, even given a very strong correlation. You need at least a decent theory before you can pick between these options. So the answer is (a): the graph doesn’t really tell us anything particularly interesting apart from the fact that there is a correlation. How we proceed from there to say something more useful is a matter of doing further experiments and analysis.
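To get a concrete feel for how a confounder produces a correlation, here’s a minimal sketch with entirely made-up numbers (nothing to do with the real autism or organic food data): age drives both shoe size and reading ability, so the two correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (invented for illustration): age is the confounder driving both variables.
age = rng.integers(6, 13, size=2000).astype(float)        # children aged 6-12
shoe_size = 0.8 * age + rng.normal(0.0, 0.5, size=2000)   # feet grow with age
reading   = 10.0 * age + rng.normal(0.0, 5.0, size=2000)  # reading improves with age

# Shoe size and reading ability correlate strongly...
print("overall r:", np.corrcoef(shoe_size, reading)[0, 1])   # roughly 0.9

# ...but the correlation vanishes once the confounder is held fixed.
nine = age == 9
print("r among nine-year-olds:",
      np.corrcoef(shoe_size[nine], reading[nine])[0, 1])     # roughly 0
```

Neither variable causes the other; control for the third factor and the “relationship” evaporates.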

Question 3

Question: A company is offering a get-rich-quick scheme. What further evidence do you need?

Answer: (d) Interviews and testimony from the failures

In an ideal world, you want statistics to answer this. You want data. But in this case what we’re looking for is silent evidence, which is a specific corollary of cherry picking and selective reporting. This is evidence where some kind of selection effect stops you from seeing it – ergo, the answer is (d).

For instance, suppose everyone who survived a long sea voyage through stormy weather in the 17th century said they prayed on the journey. But what about those who didn’t survive to tell the tale – did they pray too, or didn’t they? This evidence is silenced. In this particular business example, you can sum it up with the phrase “there are no unsuccessful entrepreneurs”. By self-selecting out of the pool of evidence, the failures make things look better than they really are. This is also common amongst psychics, where selective reporting and cherry picking mean we only get to hear about the predictions that came true – the ones that didn’t are conveniently forgotten.

The other answers allude to confirmation bias. This is where your method of testing focuses on confirming your hypothesis. That sounds counter-intuitive – because, after all, you do want to confirm it, right? – but it is actually the wrong way to go about things. Repeatedly looking at where your hypothesis works gets you nowhere; you need to find the places where it doesn’t work. Trotting out success stories does nothing – you need to look at the failures to get the full picture.
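Here’s a toy simulation of that silent evidence (the 3% success rate is invented, purely for illustration): if only the winners are around to give testimonials, the testimonials will be glowing no matter how bad the underlying odds are.

```python
import random

random.seed(1)

# Hypothetical scheme: 1,000 people sign up, only ~3% actually get rich.
participants = [random.random() < 0.03 for _ in range(1000)]

# The company only interviews the people who succeeded (the survivors).
testimonials = [p for p in participants if p]

print("true success rate:      ", sum(participants) / len(participants))  # about 3%
print("success rate in adverts:", sum(testimonials) / len(testimonials))  # 100%, always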

Question 4

Question: An alarm, which catches 95% of thieves, sounds. What are the odds that it has caught a thief?

Answer: (e) More data needed

This question is about test sensitivity. Simply put, the answer is (e). We need more data; specifically, data on how many false positives the alarm generates. This is governed by the test’s specificity – the false positive rate is one minus the specificity. If the test generated zero false positives, then the answer would be 100% – but in reality this is almost never the case, and all tests generate some degree of false positives, usually as a side effect of having such a high sensitivity. For example, police sniffer dogs usually have a high sensitivity to drugs, but are easily led by the suspicions of their handlers or distracted by similar, but benign, substances. That produces a lot of false positives.

It is only the sensitivity that we’ve been given in the question, 95%. As an approximation we can say it’ll almost certainly be less than 95%, but without that false positive data we cannot be sure. This effect is alarmingly counter-intuitive. Question 8, below, goes into this effect in way more detail.
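To see just how badly intuition fails here, here’s a back-of-the-envelope calculation. The 95% sensitivity is from the question; the 5% false positive rate and the 1-in-1,000 background rate of actual thieves are numbers I’ve made up purely for illustration – you also need some idea of how often a thief actually shows up before you can answer.

```python
sensitivity = 0.95   # P(alarm | thief), from the question
false_pos   = 0.05   # P(alarm | no thief) -- assumed, not given in the question
p_thief     = 0.001  # background rate of thieves -- also assumed

# Bayes' theorem: P(thief | alarm sounded)
p_alarm = sensitivity * p_thief + false_pos * (1 - p_thief)
print(sensitivity * p_thief / p_alarm)   # about 0.019 -- roughly 2%, not 95%
```

With those (made-up but not absurd) figures, the alarm going off means a thief less than one time in fifty.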

Question 5

Question: Some men are doctors; some doctors are tall; therefore, some men are tall. Does the conclusion follow?

Answer: False – the conclusion does not follow

Here, we have a logical syllogism. A fun thing to do with logical arguments is to swap the nouns used in the premises for something else, because this helps highlight a formal logical fallacy. A formal logical fallacy is a breakdown in the reasoning itself, so it should be fallacious no matter what we put in there as variables. In this example we’re talking about three properties and how they are related: “men”, “tall” and “doctors”. We could simply make this generic and replace them with X, Y and Z.

  • Premise 1) Some X are Y
  • Premise 2) Some Y are Z
  • Conclusion) Some X are Z

Now it becomes a little clearer that there’s a problem. The premises don’t formally establish any relationship between X and Z. We can highlight it further by plugging in values where we know the conclusion will be wrong:

  • Premise 1) Some men are doctors
  • Premise 2) Some doctors are women
  • Conclusion) Some men are women

Quibbles about the gender binary aside (the real world is always slightly fuzzier than pure logic suggests), this is obviously incorrect. We only get tricked into thinking the argument relating “tall”, “men” and “doctors” is valid because the conclusion happens to be true outside the context of the argument. Separating the formal and informal aspects of a logical argument, therefore, is an essential skill for spotting flawed reasoning.
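If you want something more mechanical than swapping nouns, the same check can be done with sets: pick any X, Y and Z for which the premises hold but the conclusion fails. The values below are arbitrary.

```python
# Arbitrary counter-example: both premises hold, the conclusion does not.
X = {1, 2}   # "men"
Y = {2, 3}   # "doctors"
Z = {3, 4}   # "tall" (or "women" in the second version)

print("Some X are Y:", bool(X & Y))   # True  -- premise 1 holds
print("Some Y are Z:", bool(Y & Z))   # True  -- premise 2 holds
print("Some X are Z:", bool(X & Z))   # False -- the conclusion fails
```

One counter-example is all it takes: the form “some X are Y, some Y are Z, therefore some X are Z” is invalid regardless of what you plug in.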

This context-dependent assessment of reasoning is common in all of us. In the classic example of the Wason Card Problem, the version using abstract letters and numbers, or colours and numbers, easily trips most people up; but a context-driven version involving ages and whether someone is legally drinking alcohol sees people get the answer right almost every time.

Question 6

Question: 15 people drop out of a drug trial. What is the effect?

Answer: (b) Weaken the study because we lose information from the drop-outs

This is basic Ben Goldacre-style stuff. It’s known that trials that do not report their drop-out rates overestimate their success, and not reporting these losses is a way to cover up potentially disastrous results. This is vastly important because correct data from medical trials saves lives. Knowing as much about a medical trial as possible, therefore, is essential to assessing its validity.

So consider the following: Why would people drop out? Because the drug affected them badly? Because it simply didn’t work and they had to be excluded? Because of some other ethical problems with the administration of the trial? Perhaps something else.

A large drop-out rate is one of the biggest alarm bells for a medical trial, and so the answer is (b). Hopefully I don’t need to state why (a) and (d) are incorrect here, but (c) is incorrect because the statistical significance of a study depends not just on the size of a group but also on the size of the effect. If we’re looking for a small, subtle effect on the order of 1%, we need a large number of people. If the effect is huge and dramatic, we need relatively fewer people to call it significant. Therefore dropping the number in a group from 40 to 25 would not necessarily alter the significance of the outcome – but it does suggest that if those 15 dropped out for a particular reason, the trial’s conclusions could well be erroneous.

This is also related to an issue called publication bias. This is where studies that come up negative simply aren’t published or released to the world. It’s another form of selection bias that prevents us from seeing the full picture of all the data available. If people drop out of a study because a drug didn’t work for them, the final report will over-estimate its effectiveness. If studies aren’t published because they show negative results, our meta-studies (where we analyse multiple published studies together) will also over-estimate the effectiveness. Thankfully, there are ways to spot these biases – funnel plots, for instance – and they are implemented in meta-studies.
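As a toy illustration of why the drop-outs matter (invented numbers, not any real trial): suppose 40 people start, the drug genuinely works for half of them, and 15 of the 20 non-responders drop out and are quietly excluded from the analysis.

```python
started     = 40
responders  = 20   # the drug genuinely works for half of the group
dropped_out = 15   # all of them non-responders, quietly excluded from the analysis

true_rate     = responders / started                  # intention-to-treat view
reported_rate = responders / (started - dropped_out)  # per-protocol view, drop-outs ignored

print(f"true response rate:     {true_rate:.0%}")      # 50%
print(f"reported response rate: {reported_rate:.0%}")  # 80% -- looks far better than it is
```

Same drug, same patients; simply losing the drop-outs from the denominator inflates a 50% response rate to 80%.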

Question 7

Question: Why is the graph misleading?

Answer: (c) Because it doesn’t start at zero

So, another graph question. First of all, I just made it up. It’s bollocks data, purely for illustration. Yet it illustrates something that turns up in news reports quite frequently. The answer is (c): the graph does not start from zero. Now, sometimes there are good reasons for this, but often it’s done deliberately to convey a political message. Namely, it emphasises differences where the differences may, in the grand scheme of things, be minuscule. That’s not to say they’re not significant, nor that they’re not real; it just robs them of a wider context. This kind of manipulation is all too common.

Obviously, (a) is incorrect because we can kinda assume it’s a percentage based on context (although meaningless graphs do crop up in pseudo-science). (b) is incorrect because it isn’t continuous data. (d) is completely irrelevant (it’s illustrative data, and we assume “A” and “B” are defined properly elsewhere) and (e) is just making stuff up – there is no such thing as some subliminal bias that you introduce just through colour. Colour perception doesn’t work like that (though, interestingly, different coloured pills do produce changes in the placebo effect).
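To see the trick in action, here’s a quick sketch plotting two invented values, 62% and 68% (stand-ins for the made-up data above), twice: once on an axis that starts at 60, and once on one that starts at zero.

```python
import matplotlib.pyplot as plt

labels = ["A", "B"]
values = [62, 68]          # made-up percentages, as in the post's graph

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(labels, values)
ax1.set_ylim(60, 70)       # truncated axis: the difference looks enormous
ax1.set_title("Axis starts at 60")

ax2.bar(labels, values)
ax2.set_ylim(0, 100)       # full axis: the same difference looks modest
ax2.set_title("Axis starts at 0")

plt.tight_layout()
plt.show()
```

Identical numbers, two very different impressions – which is exactly the point.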

Question 8

Question: Which test is better?

Answer: (e) It’s complicated

As mentioned above, this is related to Question 4. We’re talking about test sensitivity again – except in this case the question is very explicit about the existence of false positives. Indeed, it gives you enough figures to work out, for each test, the answer to the Question 4-style problem: the chance that a positive result is a true positive, known as the positive predictive value (the false positives themselves being what statisticians call Type I errors). If you work it out for both tests, the new and the old, it comes out roughly the same for both.

A short aside: in fact, the figures given in the question were rounded off from ones I generated after a bit of fudging with a spreadsheet to make this the case. A test with 63% sensitivity and a 1.1% false positive rate has about the same positive predictive value as a test with 99% sensitivity and a 1.8% false positive rate, and those figures just about render the difference between the two tests independent of any realistic background rate (or prior probability, in Bayesian jargon). The actual chance of a true positive when the test says “yes” works out to about 36% in both cases. So the question uses rounded-off figures of 99% and 65%, and 2% and 1%, respectively. A one-percentage-point difference in the false positive rate sounds negligible, but it doubles the number of false positives from one test to the other – and the predictive value is very susceptible to small changes in this figure.

So, in principle, the answer is (c) if you just do that arithmetic: in terms of how much you can trust a positive result, the two tests come out the same.

However, and this is a big HOWEVER, the real answer is (e). It really is much more complicated than that. The new test, with 99% sensitivity and a 2% false positive rate, is marginally worse on paper: the positive predictive value – the chance that a positive result is real – drops from about 40% to 33%. The new test also produces a larger number of false positives in real terms – twice as many. This sort of thing leads to complications. It adds to psychological trauma, as more people are hit with positive test results and end up undergoing further tests that can often be very invasive (such as biopsies) and may transpire to be unnecessary. There are also further practical cost implications that go with testing more people – even though the overall proportion of false positives remains roughly the same. But on the flip side, far fewer people slip past the test through false negatives (only 1%, in fact). This means that despite the cost and psychological implications of the extra false positives, the new test will likely save lives.

How you balance those is a difficult, and indeed very complicated, decision. It also requires looking into a lot of variables, such as how you target and pre-screen your testing based on demographic data, and how you refer people for further testing. These are all very real considerations, yet they’re rarely, if ever, raised in newspapers that discuss the effectiveness of medical treatments and interventions.
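For anyone who wants to see the arithmetic behind those figures, here’s a sketch. The sensitivities and false positive rates are the ones from the question; the 1% background rate and the population of 100,000 are my own assumptions – the 1% rate is what reproduces the 40% and 33% quoted above.

```python
def screen(sensitivity, false_pos_rate, prevalence=0.01, population=100_000):
    """Work out what a screening test does to a population.

    prevalence and population are assumed for illustration; the sensitivities
    and false positive rates are the figures from the question.
    """
    sick    = population * prevalence
    healthy = population - sick

    true_pos  = sensitivity * sick
    false_pos = false_pos_rate * healthy
    false_neg = (1 - sensitivity) * sick

    ppv = true_pos / (true_pos + false_pos)   # chance a positive result is real
    return ppv, false_pos, false_neg

for name, sens, fpr in [("old test", 0.65, 0.01), ("new test", 0.99, 0.02)]:
    ppv, fp, fn = screen(sens, fpr)
    print(f"{name}: PPV {ppv:.0%}, false positives {fp:.0f}, missed cases {fn:.0f}")

# old test: PPV 40%, false positives 990,  missed cases 350
# new test: PPV 33%, false positives 1980, missed cases 10
```

The trade-off is right there in the output: the new test doubles the false positives but misses 10 cases instead of 350.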

Question 9

Question: All toupees look fake. How do you evaluate this statement?

Answer: (b) Examine real looking hair

This is informally known as the toupee fallacy. No, really, look it up. It is another specific case of hidden and silent evidence, and a test of confirmation bias as alluded to in Question 3, above.

First of all, woe be unto anyone who chose (e). Semantic definition games are where we need to roll up a newspaper and bash you on the nose while saying “NO” in a firm tone of voice. Just no. You cannot simply “define” something as true, otherwise your statement is meaningless. A more long-winded justification of this can wait for another day, though.

Next, (d) – no. This is wrong. You can evaluate the statement, because you can evaluate any statement whose real-world consequences are defined properly. This is where I differ greatly (read: “I’m right, they’re wrong”) from people who claim that prayer can’t be tested with science. No. If you say “something has effect [X]”, then we can go looking for effect [X]. If you want to claim that toupees look fake, all you need is a suitable definition of a toupee (easy) and a consistent definition of what you consider “fake” (slightly less easy, but still doable). It’s very possible.

So, (a) – examining “fake”-looking hair. Well, if you keep examining fake-looking hair, you will find bad haircuts and toupees. That’s confirmation bias at work. You have proven nothing, since you haven’t looked at the cases where your hypothesis might fall down. Technically this is also true of answer (c): by examining both, you’re wasting half your time. What you need to do is examine the real-looking hair – because there you might find a toupee that doesn’t look fake. You can only test your hypothesis by looking for things that would break it, and only by attempting to break it can you be sure it holds up. You cannot disprove “all toupees look fake” by examining only fake-looking hair; you must try to find a toupee amongst the real-looking stuff.
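A toy illustration with made-up data: give each hairpiece two attributes – whether it’s a toupee, and whether it looks fake – and compare what the two sampling strategies can ever tell you.

```python
# Toy data, invented for illustration: a mix of real hair and toupees,
# including one toupee good enough to pass as real hair.
hair = [
    {"toupee": True,  "looks_fake": True},
    {"toupee": True,  "looks_fake": True},
    {"toupee": True,  "looks_fake": False},   # the one that falsifies the claim
    {"toupee": False, "looks_fake": False},
    {"toupee": False, "looks_fake": False},
    {"toupee": False, "looks_fake": True},    # real hair, terrible haircut
]

# Answer (a): examine only fake-looking hair -- every toupee you find looks fake, trivially.
fake_looking = [h for h in hair if h["looks_fake"]]
print(all(h["looks_fake"] for h in fake_looking if h["toupee"]))   # True, by construction

# Answer (b): examine the real-looking hair -- and there's the counter-example.
real_looking = [h for h in hair if not h["looks_fake"]]
print(any(h["toupee"] for h in real_looking))                      # True: claim falsified
```

Sampling by “looks fake” can only ever confirm the claim; only the real-looking pile can break it.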

Question 10

Question: Two people died after receiving a new vaccine. What evidence do you need to evaluate this?

Answer: (b) The rate of vaccine uptake

We end with typical and all-too-common newspaper scaremongering. Remember: anecdotes are not evidence. You need data, not stories. That means if you answered (e), you need to take a course in critical thinking. Just like correlations, anecdotes can give us ideas and hypotheses to test, but they are not in themselves sufficient evidence. If we recall life-saving drugs on a whim because of anecdotal evidence that isn’t backed up by data, we can cause more harm precisely because those drugs are life-saving. This isn’t something we should do on a hunch to placate the public, because it could cost lives.

So we need data, but we need to ask what data is relevant. The fact that I’m saying this should immediately tell you that the answer isn’t (d); we don’t need all of it. Indeed, answers (a) and (c) aren’t relevant to the question – precisely because we’re not looking at flu mortality. People die in flu epidemics; that rate might say something about the efficacy of the vaccine, but it says nothing about its safety. We’re not looking at flu-related deaths, we’re looking at vaccine-related deaths. Flu mortality isn’t even a confounding variable here, it’s an irrelevant one. You only need (b), the rate of vaccine uptake in that demographic. I threw in option (d) as an intentional red herring; this is not that complicated a question.

I’ve been cheeky here, though; you do need one more piece of data that isn’t listed, and that’s the death rate of that demographic. You can look this up in an actuarial table, so let’s assume it’s given knowledge; it works out to about 1% per year for this demographic. If 1% of people over 60 are expected to die anyway, then whether 2 individual deaths is a significant event depends only on the rate of vaccine uptake. If only 2-3 people took it and 2 died, we might be onto something – the vaccine uptake rate is the key piece of evidence. However, this is an unlikely scenario. If 50% of the population took the vaccine, that could be millions of people – and 1% of a million is 10,000. If tens of thousands of your demographic were going to die anyway, of any cause, then two cases are hardly a significant event. Of course, you need to re-work those figures from a per-year rate to a per-day rate to get the odds of anyone dying on the same day as, or the day after, receiving a vaccine – but tens of thousands divided by 365 (not taking into account seasonal variations) still isn’t going to make 2 reported deaths anything more than coincidental.

In fact, we’d be expecting potentially hundreds of people to die shortly after they were vaccinated anyway, so if only two died after taking the vaccine, it would be significant evidence that this magical elixir actually stops you dying. Now that is counter-intuitive.
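To put rough numbers on that: the 1% death rate and 50% uptake come from the discussion above, while the 10-million population is my own assumption for illustration.

```python
population   = 10_000_000   # people in the demographic -- assumed for illustration
uptake       = 0.50         # half of them take the vaccine (from the discussion above)
annual_death = 0.01         # ~1% of this demographic die each year anyway

vaccinated = population * uptake
daily_death_prob = annual_death / 365

# Expected number of vaccinated people who die within two days of their jab,
# purely by coincidence, across the whole campaign.
window_days = 2
expected = vaccinated * daily_death_prob * window_days
print(f"{expected:.0f} coincidental deaths expected")   # ~270 -- two is nothing unusual
```

Against a background of hundreds of coincidental deaths, two reported cases tell you nothing on their own.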

Conclusions and Further Work

Well, if you got to the end of this, congratulations. Maybe it was enlightening, maybe it was revision. I’m grateful to a few people for discussions of topics, contexts and statistics. You know who you are.

Complaints to the usual place.


9 thoughts on “How To Be A Proper Fucking Scientist – Answers Sheet”


  2. Great test! I shall be sharing it with my friends, and recommending it to anyone who plans on reading or listening to anything besides review articles published in reputable journals.

    …But I’ve gotta say, if question #2 had included an “at least one of b, c, or d” option, that would have been *more* correct. Correlations with p=0.0001 don’t grow on trees; and a simple game of bayes ball on a causal graph tells us that if A reliably correlates with B, either A=>B, B=>A, or Z=>(A,B).

    • In principle, every question has a secret additional answer of “it’s a bit more complicated because…” hence the lengthy deconstruction in the answer sheet. But this was aimed at offering answers that were as simple as possible for a multiple choice format that could get the point across.

      • > In principle, every question has a secret additional answer

        For sure; and I was pleasantly surprised that “it’s complicated” was the answer for the question which asked us to combine a lower false positive rate and higher false negative rate into a single subjective metric.

        I’m even sure that everyone has what they think is a *really good reason* that their particular complexification is necessary and valid. The difference between me and them is that I’m *right*; the [A=>B,B=>A,Z=>(A,B) ] trilemma remains even if you don’t have a detailed causal model. I cannot say by what bizarre chain of events organic food causes autism, autism causes organic food consumption, or some third factor causes both (although I’d bet $1.50 it’s some side effect of increasing wealth). But the causation *is* in there, somewhere, Judea Pearl guarantees it (http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf).

  3. Thanks for the answers…

    I’m still dubious about “the experiments” being the most useful answer for question 1. It’s obviously the most correct answer but for those people who realise that our science background isn’t enough to let us usefully evaluate experiments’ validity, it’s not so helpful. Real life tends to actually be complicated and we get articles like this – which is the first one I could find on BBC science that looked like it contained an experiment: http://www.bbc.com/news/health-26224812 I can’t evaluate that science – I will tend to believe it because in my experience BBC science doesn’t publish complete bullshit very often, because the guy came from Cambridge, and because it seems to have the sort of details that pseudoscience doesn’t include like noting that the predictive power is only there for males. But I can’t actually tell you whether the experiments were valid either from the article or from the PNAS paper …

    You pointed out to ckp that sometimes news outlets publish good research – that’s true but since I don’t trust my ability to evaluate whether they did that or not it’s not really relevant. Knowing that they also publish bad research is enough to make me distrust most articles and find alternative trusted sources to evaluate things I’m further interested in.

    Question 10 I totally screwed up – I was looking not only at the vaccine’s safety but trying to figure the vaccine’s safety relative to the flu’s safety/danger. After all a vaccine that kills 1% of the people who get it might still be significantly safer than a pandemic disease which will affect 60% of the population and kill 10% of the people who get it … but that’s totally not what the question asked, so as you pointed out it’s irrelevant.

    • Oh and also for question 1, I disagree. Newspaper reporting tends to be so shoddy that I simply don’t trust their description of the experiments done, if they bother to report them at all. I just tend to use the scientist and the institution as a rough-and-ready shit-test to throw out obvious cranks and mark the “reputable” ones as stuff I need to check later by reading the paper itself.

      • Some reporting is better than you might think. For example, there’s a very recent story in the Daily Mail – and this is the Daily F**king Mail of all places – on breastfeeding where they were quite thorough with the data analysis done. I mean, seriously thorough. There was enough description of the analysis for me to demonstrate to a few people that the Cochrane Library’s meta-study on breast-feeding was a significantly better powered study showing a different conclusion. And that’s without looking up the institution and qualifications, which were so-so. Papers usually report at least enough information about what work was done to make a decent assessment of it – even if it is just padding it out by copy-pasting from the press release to say “Professor Doodlewatsit and colleagues exposed cancer cells to chaya leaves and noticed they started dying at a faster rate”. That’s enough to spot the problem in the study whether it’s Professor Doodlewatsit of Cambridge or Paula Nancy Milstone Jennings, aged 9.

        So no matter what the rough-and-ready test regarding qualifications says, the single most *important* factor will always come down to what the experiment is that they did and how they know it. Only the fluffiest and most awful pop-science pieces don’t tell you anything about that at all. And indeed, the complete absence of experimental details in a pop-science piece is the most important detail. If Professor Doodlewatsit says chaya cures cancer but the newspaper doesn’t say why she thinks that, it’s a bigger alarm bell than even a piss-poor study.

    • I originally thought it was implied in the question, then I re-read it and realised that might not be the case. I could change it, but I think the major traffic stream to this post is over now, so it would be largely pointless. You’re right, it’s one problem with the graph, but the actual source of it being deliberately misleading (as opposed to just garbled) is the exaggeration of the differences.

