Put a Number on It! did a piece a while ago going through some psychology studies that were part of a replication effort. They found that half failed to replicate, but that people in a betting market could often tell which ones would replicate. The author also did a blind test himself and was likewise able to guess which ones would replicate. He laid out several rules of thumb, the most relevant to this article being:

Jacob’s Rule of Anti-Significance: A result with a p-value just above 0.05 could well be true. A result with a p-value just below 0.05 is almost certainly false.

More importantly, p=0.06 means that the researchers are honest. They could have easily p-hacked the results below 0.05 but chose not to. The opposite is true when p=0.049.

This is true in more than just psych. A strategy for systems papers is to select the system that is second on the graphs. The author's results are not trustworthy, but the second place system was usable and ran reasonably well in the hands of somebody other than the original author.

> Here's the problem in a nutshell: If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments, you might expect that 5% of these 95 significant effects would be false positives. However, as an example shown later in this blog will show, the actual false positive rate may be 47%.

> […] However, this is a statement about what happens when the null hypothesis is actually true. In real research, we don't know whether the null hypothesis is actually true. If we knew that, we wouldn't need any statistics! In real research, we have a p value, and we want to know whether we should accept or reject the null hypothesis. The probability of a false positive in that situation is not the same as the probability of a false positive when the null hypothesis is true. It can be way higher.

> Here's a more simple thought experiment that gets across the point of why p(null | significant effect) /= p(significant effect | null), and why p-values are flawed as stated in the post.

> Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypothesis that are true. So in this hypothetical society, the null hypothesis in any scientific experiment ever done is true. But statistically using a p value of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

> Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.
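The arithmetic behind both quotes fits in a few lines. This is a minimal sketch, not the blog's own calculation: the function name and the inputs (10% of tested hypotheses real, 50% power) are assumptions, picked because they reproduce the quoted 95 significant results and the 47% figure. Setting the share of real effects to zero gives the thought-experiment society.

```python
def false_discovery_rate(n_experiments, frac_real, power, alpha=0.05):
    """Share of significant results that are false positives."""
    n_real = n_experiments * frac_real   # experiments where an effect exists
    n_null = n_experiments - n_real      # experiments where the null is true
    true_pos = n_real * power            # real effects that reach p < alpha
    false_pos = n_null * alpha           # true nulls rejected by chance
    return false_pos / (true_pos + false_pos)

# Assumed inputs: 10% of tested hypotheses are real, 50% power.
luck = false_discovery_rate(1000, frac_real=0.10, power=0.50)
# The thought-experiment society: no tested effect is ever real.
all_null = false_discovery_rate(1000, frac_real=0.0, power=0.50)
print(f"{luck:.1%}", f"{all_null:.0%}")
```

With those assumed inputs, 45 of the 95 significant results come from true nulls, and with no real effects every significant result is false.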

For about a year or so now I've been wanting to make a game about science. It'd basically be a research and discovery simulator, and there would be a free-play mode. Some of the knobs would be the number of required replications, the required p-value, and how good people are at generating hypotheses.

There is a "people will mostly replicate/extend articles about X, and ignore articles about Y" (groupthink) effect that I imagine is also very relevant.

I hope you do! I think simulations can be a powerful way to teach/learn and I would like to see this one.

I put a reminder to check back with you about this in a few months. Is Keybase your preferred contact method? I know it in name only, but I imagine I can figure it out.

Yes! Riffing off that, I'd want to do a version where all the data in the game is pure randomness, but where your experimental configurations can influence the outcome. (e.g. you can clean the equipment, which alters the reading, or redo experiments that "went wrong")

You'd be given a prize/goal for publishable findings. Then you'd gradually introduce enough bias into the experiments to get something publishable, and then get hit with the reveal that "oh you were generating effects from random data, jerk".
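A rough sketch of what that mechanic could look like, assuming the simplest possible form of "redo experiments that went wrong": rerun a pure-noise experiment until it comes out significant or the attempts run out. All names and parameters here (`publishable`, five attempts, samples of 20, a z-test with known unit variance) are hypothetical choices for illustration, not a design for the game.

```python
import random
from statistics import NormalDist, fmean

def p_value(sample):
    """Two-sided z-test of 'mean is 0', assuming unit variance is known."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def publishable(rng, max_attempts, n=20, alpha=0.05):
    """Redo a pure-noise experiment until it 'works' or attempts run out."""
    for _ in range(max_attempts):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # there is no real effect
        if p_value(sample) < alpha:
            return True
    return False

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 2000
honest = sum(publishable(rng, 1) for _ in range(trials)) / trials
hacked = sum(publishable(rng, 5) for _ in range(trials)) / trials
print(honest, hacked)
```

In theory the chance of at least one significant result in five independent tries on noise is 1 - 0.95^5, about 23%, versus 5% for a single honest attempt.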

Please do. We're getting closer and closer to the tipping point (although it still might be far away) where people go "Oooohhhh... Well, shit.". Where we all realise. Your game would be one of the many catalysers for this.

This is my opinion of course, it's not like I have any scientific basis for this.

> If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments

I'm all for criticism of p-values, but when I read a lot of critiques, I get to this point and simply stop reading.

No statistics textbook that I've read assigns the magic value of p=0.05 and labels it as significant. All the ones I've read tell you to pick a threshold appropriate to your experiment. Yes, I get it that many social scientists don't have much of a clue and use 0.05 as some special threshold, but let's direct the criticism to the guilty parties, instead of blaming a statistical methodology.

I mean, we all know people who misuse the mean ("the average number of breasts a person has is 1") and ignore the shape of the distribution and the standard deviation. Yet we don't say "Let's stop using the mean!"

That's not what the p-value means. It means that if you run 1000 of the experiments in a universe in which the hypothesis is false, around 50 of them will confirm the hypothesis anyway. If the hypothesis is true, then there are no false positives; all positives confirm the hypothesis. In a universe in which the hypothesis is true, there can only be false negatives.

"False positive" means that the effect or condition we're looking for is not true, but the experiment yields a positive answer: the positive answer of the experiment is a falsehood. If the condition we're looking for is true, then there can't be a false positive. Even if the experiment yields a positive due to some flawed step, it's still a true positive.

> The false positive rate (Type I error rate) as defined by NHST is the probability that you will falsely reject the null hypothesis when the null hypothesis is true.
In other words, if you reject the null hypothesis when p < .05, this guarantees that you will get a significant (but bogus) effect in only 5% of experiments in which the null hypothesis is true.

This is just a language issue: the "false positive" here is a false rejection of the null hypothesis.

This blog post is extremely misleading -- it appears to confuse the "null hypothesis" with an actual scientific hypothesis that is being tested. For example:
"However, there are many cases where I am testing bold, risky hypotheses—that is, hypotheses that are unlikely to be true."

Those may be the hypotheses being tested, but they are not NULL hypotheses. NHST is not about testing the NULL hypothesis, it is about testing a non-Null hypothesis (the null hypothesis should always be incredibly boring and expected). There may be problems, but this blog post does not describe one.

I think the problem with p-values is that they train us to think about uncertainty without nuance. The convention hides the inherent trade-off between the cost of taking on risk and the cost of reducing uncertainty, since it sets the threshold at p=0.05. Taken to the extreme, with a large enough sample we can nearly always find significant differences between populations; the differences will just be very small and the sample size enormous.
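To make the large-sample point concrete, here is a small sketch using a two-sample z-test with a known standard deviation of 1. The observed difference of 0.01 standard deviations and the sample sizes are made-up numbers, and the observed difference is simply held fixed as n grows, which is an idealisation.

```python
from statistics import NormalDist

def two_sample_p(diff, n, sd=1.0):
    """Two-sided z-test p-value for an observed mean difference, n per group."""
    standard_error = sd * (2 / n) ** 0.5
    z = diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny difference, increasingly large samples.
for n in (100, 10_000, 1_000_000):
    print(n, two_sample_p(diff=0.01, n=n))
```

The same negligible difference goes from nowhere near significant to overwhelmingly significant purely because n grew.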

Recently I worked with a client to interpret results from an A/B test where A performed better than B with 85% confidence (based on credible intervals, accounting for multiple comparisons). We therefore recommended A. In a group phone call, the client told her colleagues that our company doesn't know what we're talking about because 85% confidence of a difference isn't statistically significant (i.e. isn't 95% confident). We lost their business.

This was a shame because gathering the data for the experiment was expensive and the downside of making the wrong choice was low. It is often the case that taking on more risk makes more sense than hitting diminishing returns on shrinking p-values with extra samples.

I think p-values are actually somewhat demonized and I have grown to like them more and more over time. The standard interpretation is actually overly complicated for some reason and it can be simplified to "your p-value cutoff is an upper-bound on the rate of type I errors," over the long term. That's simple and actionable and is an immediate consequence of the definition of p-values! Frankly, I don't know why text-books don't give this as the definition of p-values, and they should reserve "probability of an event at least as extreme conditional upon the null hypothesis" as the thing-you-show-to-prove-it's-a-p-value.

The current trend of saying that "cutting p-values off at a specific value is bad" makes me worry. Now you can argue that your p=0.06 result shouldn't be rejected when really we should probably be pushing for stricter standards rather than inching towards looser ones. It also destroys the nice interpretation of p-values above. P-values were literally made to be cut off - if you want to stop doing that, you need to show me a coherent philosophy of what to do instead.

What I do think is true is the problem you have where part A of the experiment suggests X so you test X more directly in part B with a weaker but more specific test and get p=0.06 and now you can't publish. That's a dumb cutoff, clearly a p=0.06 test is likely to shift our belief towards X so it does nothing but bolster part A. Typically papers do this several times and the marginal 'failure' of one step should not sink the entire ship. This is a case where a Bayesian analysis might be more useful as it can incorporate weak evidence.

But the problem I see often is not that p-values are misused but that they were junk in the first place. For example, the widely-used DESeq2 (as well as some competitors in RNA-seq differential expression analysis) will happily spit out p-values of 10^-100 for an experiment with only four replicates in each of two conditions! There is no way you can get that level of evidence from just four replicates, even if the values are 0,0,0,0 and 1e6,1e6,1e6,1e6. The assumption of normality is reasonable near the mean but gets increasingly inaccurate in the tail, which is exactly where you end up when you do things like sort 30,000 tests by their p-values. In fact taking a p-value cutoff is probably the only reasonable thing to do here - that way you'll ignore the fact that it's absurdly small and just treat it as "small enough".
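One way to see the "four replicates can't carry that much evidence" intuition is an exact permutation test. To be clear, this is not what DESeq2 computes; it is a model-free test swapped in for illustration, and `permutation_p` is a hypothetical helper. With four values per group there are only 70 ways to split eight observations, so this p-value cannot go below 2/70 no matter how extreme the data are.

```python
from itertools import combinations
from statistics import fmean

def permutation_p(a, b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = a + b
    observed = abs(fmean(a) - fmean(b))
    hits = total = 0
    # Enumerate every way of relabelling len(a) of the pooled values as group a.
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(fmean(group_a) - fmean(group_b)) >= observed:
            hits += 1
    return hits / total

p = permutation_p([0, 0, 0, 0], [1e6, 1e6, 1e6, 1e6])
print(p)
```

Even for 0,0,0,0 versus 1e6,1e6,1e6,1e6 the result is 2/70, roughly 0.029. A reported p-value far below that has to be leaning on distributional assumptions rather than on the eight numbers alone.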

That interpretation makes the test basically useless because we know a priori that any two variables for things within each other's light cone affect each other at least a little. More practically, since the test tells you nothing about the size of the effect, it will pick up on the tiniest bias in your experimental procedure and always reject the null if you have enough data.

From the author of the article: "The general point reminds me of my dictum that statistical hypothesis testing works the opposite way that people think it does. The usual thinking is that if a hyp test rejects, you’ve learned something, but if the test does not reject, you can’t say anything. I’d say it’s the opposite: if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise."

To add to this, the problem with t-tests is not the threshold. It is that the hypothesis you are rejecting (effect is exactly 0.0000000...) is infinitesimally small. You've rejected basically nothing of your hypothesis space.

Your null should have a width. You should always be rejecting "effect is no greater than some margin", and you should have to argue that the margin is greater than any bias you might expect in your experiment. There are always at least tiny biases.
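A minimal sketch of the difference between a point null and a null with a width, assuming a normally distributed estimate with a known standard error. The numbers and the function names are hypothetical, and the margin test is one-sided in the positive direction for simplicity.

```python
from statistics import NormalDist

def p_point_null(estimate, se):
    """Two-sided p-value against 'effect is exactly 0'."""
    return 2 * (1 - NormalDist().cdf(abs(estimate) / se))

def p_margin_null(estimate, se, margin):
    """One-sided p-value against 'effect is at most margin'."""
    # Evaluated at the boundary of the null, its least favourable point.
    return 1 - NormalDist().cdf((estimate - margin) / se)

# Hypothetical numbers: a small effect measured very precisely.
estimate, se, margin = 0.02, 0.005, 0.05
p_point = p_point_null(estimate, se)
p_margin = p_margin_null(estimate, se, margin)
print(p_point, p_margin)
```

Here the point null is rejected comfortably, while the null with a width of 0.05 is not rejected at all.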

P-values are a sub-optimal but okay-ish way of quantifying a Popperian hypothesis (a designed-to-be-refutable conjecture). The mathematics is not the problem; the problem is carving science (which in my view (and Quine's and others') is pretty much defined by the unity of science) into testable morsels.

None of the great achievements of science (Newton, Darwin, Mendeleyev, etc.) were obtained on the basis of Popperian demarcationism/conjectures-and-refutations -- they were obtained by positing a large framework and patching together the empirical case for it.

Your great achievements in science exclude all real-world applications (engineering, pharmaceuticals, etc.), where the critical details of a hypothesis don't fit on a t-shirt.

Note that I said "science", not "engineering". Engineering is driven by usefulness and profit.

Falsificationism isn't a stupid idea; it's even useful at a personal improvement level. But pharma or materials research use it because it tends to lead to good results, not because it's the very definition of what's worthwhile knowledge.

Indeed. People keep saying Bayesianism will solve the replication crisis but as far as I can see it will make it worse: without a hard cut-off you have many more degrees of freedom to "hack" and can push ever more marginal "results".

It's worth noting that at their inception, null hypothesis testing and p-values were separate methodologies; there was even a bitter rivalry between their chief developers (Fisher and Pearson).

It wasn't until much later that textbooks started to merge the two. It may be worth reviewing Neyman and Pearson's attacks on Fisher in this matter.

It looks like the blog author completely missed the point of the statistical significance discussion going on. Most first-tier journals in the social sciences have an acceptance rate of about 5%. At the margins, the difference between acceptance and rejection could be having one more statistically significant result in the table than the paper that was submitted right before or after yours.

The problem with a 0.048 and a 0.052 is not a mathematical one but an interpretation one. Reviewers are conditioned to be very skeptical of non-significant results and use “under power-ness” as grounds for rejection. As a result, we get publication bias and p-hacking.

You point out p-values can be troubling because there's all sorts of bad incentives that lead to p-hacking and publication bias. The author points out that even if those bad incentives didn't exist, p-values aren't all that useful to begin with. That's not "missing the point", it's just pointing out a different aspect of the situation.

I think you should reread the article because it's exactly what the blog author says. Blog author who btw is Andrew Gelman, not just some random guy on Medium, his blog is well well worth reading. Fighting bad stats in science is kind of his hobby/life mission.

There is a whole group of people in the social sciences who are pushing to abandon null hypothesis testing methods. For the reason you mentioned, I cannot believe changing the test (or the threshold of the test) would solve this issue. You ask people to find something significant or fit a model to data, and tell them that's what matters, and they'll do it, either intentionally or unintentionally.

Machine learning will have (and already has) the same issue if the only thing that matters is hitting a certain level of accuracy given your model and data. This has been observed in Kaggle competitions over and over: you ask a group of people to find the best fit, and they will, by learning your train, validation and test datasets.

As mentioned, the problem is not the p-value or null hypothesis testing; the problem is journals that promoted the wrong incentive, and educators who were not aware of the consequences and passed the wrong incentive (interpretation) on to students.

> Most first-tier journals in the social sciences have an acceptance rate of about 5%.

Assuming the null hypothesis, that is precisely our expectation of finding something significant under the p < .05 rule. (That is, assuming that all papers try to falsely reject a true null hypothesis, then we expect 5% of the papers to be successful at that and get published.)

> To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.

I don't really follow this. Could someone clarify what is meant here? At what point would this author say something is not consistent with the null hypothesis?

It's a meta observation of how we view p-values. He's saying that the same logic behind p-values and declaring something "statistically significant" when p < X (often 0.05) falls apart when applied to differences between two observed p-values. Essentially, you would say there is no "statistically significant" difference between an experiment that resulted in a p-value of 0.2 and one that resulted in a p-value of 0.05.

And the argument will hold no matter what threshold you choose for rejecting the null hypothesis. You can choose to reject if p < X, and for any X, there will be values greater than X that, applying this meta-logic, are not statistically different from X.

> At what point would this author say something is not consistent with the null hypothesis?

Gelman's argument, I presume, is against the idea of significance testing as a whole. Declaring something "statistically significant" is in itself a very problematic thing, as it distills the entire phenomenon, the uncertainty surrounding the experiment, and the uncertainty surrounding the researcher's decisions to a single, binary conclusion.

Gelman is a Bayesian (perhaps the most famous modern Bayesian), and the Bayesian philosophy is to focus on producing a posterior distribution of the phenomenon being studied. I presume the alternative to significance and null hypothesis testing that he would suggest would be something closer to a model where people report their priors/data/posteriors, and the discussion focuses on the implications and replication of those.

> P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and vary greatly across repetitions of exactly same protocols under identical stochastic copies of the phenomenon; such volatility makes the minimum p value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

I guess I have to sleep on this because at a quick glance I can't really make sense of how it answers my question.

I do find it ironic though that this is so difficult to explain that I apparently have to read a paper to understand it... I would've thought the blog post was trying to explain things in simple terms...

He is coming to that conclusion from a Bayesian view of statistics. He is seeing the p-value as a random variable that can take values from 0 to 1 and follows some distribution. Under these hypotheses, observing p-values of 0.20 and 0.005 is completely reasonable even if unlikely. Those are just two draws from a random variable.

Edit. Under Bayesian statistics, testing the null hypothesis is a moot point, as it becomes possible to directly model the distribution of the possible effects. Think of it as being able to look at a picture of something (the p-value) vs. looking at a movie of it (the distribution of the effects).

I think that's what he's saying too, but what is that supposed to show? Is he arguing against some claim that every interval of 1 standard deviation is equally significant? Did anybody make this claim? So far as I know, nobody considers (say) a 6-sigma effect to be 6 times stronger than a 1-sigma effect...

I don’t get it either. It’s a trivial consequence of having a threshold: if we say two cities are “far” when they are at least 1000 miles away then Washington D.C. is not far from Jacksonville while Boston is far from Jacksonville, even though Boston is not far from Washington.

I’m commenting on pfortuny’s (correct) interpretation “those events are only separated by 1.1 std deviations, which is little” of the original post by Gelman (the difference between a significant result and a non-significant result may not be significant = a city which is far from Jacksonville and a city which is not far from Jacksonville may not be far from each other).
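The "1.1 standard deviations" figure can be checked by converting each p-value back to a z-score. This sketch assumes two-sided p-values from two replications with the same standard error, so the difference of the two z-scores has standard error sqrt(2); the helper name is made up.

```python
from statistics import NormalDist

std_normal = NormalDist()

def z_from_two_sided_p(p):
    """z-score whose two-sided p-value is p."""
    return std_normal.inv_cdf(1 - p / 2)

z_a = z_from_two_sided_p(0.2)     # about 1.28
z_b = z_from_two_sided_p(0.005)   # about 2.81
# Each z-score has standard error 1, so their difference has standard error sqrt(2).
z_diff = (z_b - z_a) / 2 ** 0.5
p_diff = 2 * (1 - std_normal.cdf(z_diff))
print(round(z_diff, 2), round(p_diff, 2))
```

The two results differ by a bit over one standard error, so a test of the difference between them is itself far from significant.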

I still don't get it. What p-value observations would not be "reasonable" here? It seems to me he's saying that anything between 0 and 1 is completely reasonable, which is a completely pointless statement as I see it.

His point is that the whole point of the field of statistics is that cherry picking results or getting lucky once is not proof of anything, unlike nonstatistical mathematics where one example is enough to justify a claim.

Say you run an experiment. You get some data. From that data you calculate a p-value. The data has randomness, so the p-value is a random variable and has a distribution. By chance you can get results that look significant. By chance you can get results that look insignificant.

Well, if there is no effect (the effect-size is zero), two different experimenters will likely see two different p-values regardless of how large an experiment either of them runs.

So by this logic "it is completely consistent with the null hypothesis to see p-values of 0.00000001 and 0.99999999 from two replications of the same damn experiment"?

At which point, what is even the point of this statement?

My understanding (someone please correct me if I'm wrong) is that while the values can range anywhere between 0 and 1, the actual distribution will be skewed in cases where it is statistically significant. E.g. if you repeat the experiment 100x you could find a p=.999999, but more likely they'll be close to .05 (or whatever alpha you choose).

If the null hypothesis is true, the p-value is distributed uniformly on [0, 1] (at least ignoring discrete data, composite hypotheses and other cases which are not easy).

The distribution will be different when the null hypothesis is not true (but you may also get non-significant results even if the null hypothesis is not true).

I’m not sure if that’s what you mean by “the actual distribution will be skewed in cases where it is statistically significant.”
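A small simulation of both cases, for anyone who wants to check. It assumes normal data with known unit variance and a two-sided z-test; the sample size, the effect of 0.5 under the alternative and the function names are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, fmean

def p_value(sample):
    """Two-sided z-test of 'mean is 0', assuming unit variance is known."""
    z = fmean(sample) * len(sample) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate(rng, true_mean, reps=4000, n=25):
    """p-values from `reps` repetitions of the same experiment."""
    return [p_value([rng.gauss(true_mean, 1) for _ in range(n)])
            for _ in range(reps)]

def share_below(pvals, cutoff):
    """Fraction of p-values at or below the cutoff."""
    return sum(p <= cutoff for p in pvals) / len(pvals)

rng = random.Random(1)  # fixed seed so the run is repeatable
under_null = simulate(rng, true_mean=0.0)
under_alt = simulate(rng, true_mean=0.5)
for cutoff in (0.005, 0.05, 0.2, 0.5):
    print(cutoff, share_below(under_null, cutoff), share_below(under_alt, cutoff))
```

Under the null, the share of p-values below each cutoff should sit near the cutoff itself (the uniform distribution). Under the alternative the p-values pile up near zero, but some still land above 0.05.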

For identical normal populations, repeating an experiment produces p<=0.2 20% of the time, and produces p<=0.005 0.5% of the time.

A coin comes up on the same side 3 times in a row vs 8 times in a row...I have no idea why we should shrug and consider the plausibility of coin bias in these two cases about the same.

---

EDIT: I think he means that in the case of an actual difference in the populations, p=0.2 and p=0.005 are both pretty likely outcomes.

When the populations are the same, p=0.2 and p=0.005 are quite different happenings.

This is because p-value methods don't worry very much about type II errors.

> It means that if the null hypothesis were true, you expect your p-values to be a random variable contained inside a nice bell shaped normal. 0.005 is dead center so it's very likely but 0.2 which seems very improbable is actually only 1 std further, it's well inside the bell curve.

I'm completely lost here. How is 0.005 "dead center"? Are you assuming p = 0 is the center? Are there negative p-values I'm not seeing that somehow balance the positive ones?

How can a random variable that's strictly between 0 and 1 even follow a bell curve?

I'm not following those links either. How is p uniformly distributed under H0? If you assume H0 then obtaining a p-value near 0 is going to be damn impossible. Whereas obtaining one similarly close to 0.5 is going to be ridiculously more likely.

Am I severely lacking sleep and going crazy or something? Maybe I should check back in like half a day to see what people have said, I feel like I must be completely confused right now because literally nothing I've read so far makes sense to me.

They're right. If the null hypothesis is true, then the probability of getting any p value is equal. Put it another way, the p value is the probability of getting the observed data (or more extreme) under the null. So, under the null, 10% of the time you will get data with a p value of 10% or less; 20% of the time, you will get data with a p value of 20% or less; and so on. And that's the uniform distribution!

Here's an R example to play with:

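    # Simulate 10,000 two-sample t-tests where both groups come from the same N(0, 1)
    pvals <- replicate(10000, {
        x <- rnorm(100)
        y <- rnorm(100)
        t.test(x, y)$p.value
    })
    # Under the null the estimated density should be roughly flat
    plot(density(pvals))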

That will plot you a nice uniform line on [0, 1].

(NB: I have no idea why OP talked about p values following a normal distribution. That doesn't make sense to me, and I think the post has been deleted.)

Why is your x variable though? If H0 is true shouldn't your x be fixed?

I think if you remove the subtraction though then you do get a uniform distribution -- in which case I see what the claim is, yeah. Wasn't really clear to me earlier but indeed, getting p = 5% means you have a 5% chance of getting observations that extreme, so I guess it is uniformly distributed!

That's the very definition of a p-value! The mapping of data to p-values is chosen to have a uniform distribution of p-values when the data is distributed according to the null hypothesis. That's the property that makes p-values interesting.

I don’t think you understand what a p-value is. The p-value is a percentage output of testing whether a given normal distribution actually has a non-zero mean. It is phrased in terms of the null hypothesis. So a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true. Conversely, this means that there is a 95% chance that what you’re testing is true.

> So a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true. Conversely, this means that there is a 95% chance that what you’re testing is true.

> a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true.

In other words, P(H0 | X), where H0 is the null hypothesis being true and X is the data observed. But that is not what a p-value is; p-values actually represent P(X | H0), or more precisely the probability of data at least as extreme as X given H0.

mruts' misunderstanding of the basic definition (shared by most people who have studied a little statistics) is exactly what the author and others have been campaigning against for years.

If you want a college textbook, there are literally hundreds with titles similar to "introduction to probability and statistics" and you should choose the cheapest second hand book you can find (or that they have in your library). Content is mostly the same, of course the writing style will be different and some may be more appealing to you personally. Search engines are your friend for narrowing down your list.

For a popsci work, you could check out "how to lie with statistics", a classic.

> Also, to get technical for a moment, the p-value is not the “probability of happening by chance.”

Is it not? According to Wikipedia, it's "[...] the probability that, when the null hypothesis is true, the statistical summary [...] would be equal to, or more extreme than, the actual observed results." This sounds pretty much like "probability of happening by chance".

It sounds pretty much the same, but is distinctly different. That's where much of the confusion in the popular press comes from.

The difference is that, as highlighted in your quote, there is some null hypothesis that is assumed when discussing p-values.

For example: what is the probability of drawing x>2 when the underlying distribution is assumed to be a standard normal distribution N(0,1)?

The probability is small in this case, and could provide evidence to reject the null hypothesis (i.e. the distribution is not standard normal). It doesn't tell you about the probability of drawing x>2, it only gives evidence to reject (or not) the null hypothesis.

The wiki has a more elaborate explanation. And probably better examples than mine.
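The x>2 example in code, using Python's standard library. The `p_under_shifted` value is an added illustration, not part of the example as stated: it asks the same question under a different assumed distribution, N(2, 1), to show that the number depends entirely on the null that was picked.

```python
from statistics import NormalDist

# Probability of drawing x > 2 if the data really are standard normal.
p_under_null = 1 - NormalDist(mu=0, sigma=1).cdf(2)

# Same event under a hypothetical alternative centred at 2.
p_under_shifted = 1 - NormalDist(mu=2, sigma=1).cdf(2)

print(p_under_null, p_under_shifted)
```

Under N(0, 1) the probability is a little over 2%, while under N(2, 1) it is exactly one half, so "how surprising is x>2" only has an answer once a null has been assumed.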

Close, it's the probability of it happening by chance given that the null hypothesis is true. Unless H0 is objectively and absolutely true, it is not the same as "probability of it happening by chance."

Sure. Apply some fuzzy logic. Highly significant. Somewhat significant. And honestly, we're in an age where somewhat significant can often be bolstered later (in the drug industry) by coupling drugs. It's really time to stop believing everything has to be unifactor. Everything that we care about is multifactor and even slight significance could make a difference if added up.

A 5% chance of the results being reproducible by fluke even if the hypothesis is false is obviously too high for anything important. Splitting hairs over .048 and .052 is ridiculous: it revolves around tiny differences in a gaping uncertainty. Neither value is anywhere in the neighborhood of where the benchmark should be.

> Also, to get technical for a moment, the p-value is not the “probability of happening by chance.” But we can just chalk that up to a casual writing style.

Isn't it though? The probability of this large (or larger) of a variance happening purely by chance[1]?

This article is highly critical, but the criticism goes over my head at least.

I think the part that you have correctly included, and that people forget or elide, is that it's the probability under a specific null hypothesis. So it's a function of what you have chosen for that: normal distributions, a certain parameter value of 0, etc. This means that a) it's not the probability you'd see in the real world under repeated performance, and b) it's not the probability under other reasonable null hypotheses. Like maybe under the hypothesis parameter = 0 you get an improbably small p-value, but under parameter = 0.1, or with different assumed underlying distributions, you wouldn't see something so extreme.

I'd guess that the original writer understands this, and that Gelman is only pointing it out because casual readers sometimes don't mentally retain the full baggage that the p-value carries.

I think he was being pedantic about how it's the probability of observations at least as extreme as the ones actually obtained happening if the null hypothesis is true, which is not the same thing as those particular observations happening by chance.

Put a Number on It! did a piece a while ago going through some psychology studies that were part of a replication effort. They found that half failed to replicate, but that people in a betting market could often tell which ones would replicate. The author also did a blind test himself and was likewise able to guess which ones would replicate. He laid out several rules of thumb, the most relevant to this article being:

Jacob’s Rule of Anti-Significance: A result with a p-value just above 0.05 could well be true. A result with a p-value just below 0.05 is almost certainly false.

More importantly, p=0.06 means that the researchers are honest. They could have easily p-hacked the results below 0.05 but chose not to. The opposite is true when p=0.049.

https://putanumonit.com/2018/09/07/the-scent-of-bad-psycholo...

This is true in more than just psych. A strategy for systems papers is to select the system that is second on the graphs. The author's results are not trustworthy, but the second place system was usable and ran reasonably well in the hands of somebody other than the original author.

For ML practitioners out there, this is a great method for a field also in a replication crisis.

That is a pretty interesting conclusion, do you know whether his work has been replicated to confirm the outcomes?

Yep, it's confirmed with p = 0.049

Related, about p-values:

> Here's the problem in a nutshell: If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments, you might expect that 5% of these 95 significant effects would be false positives. However, as an example shown later in this blog will show, the actual false positive rate may be 47%.

> […] However, this is a statement about what happens when the null hypothesis is actually true. In real research, we don't know whether the null hypothesis is actually true. If we knew that, we wouldn't need any statistics! In real research, we have a p value, and we want to know whether we should accept or reject the null hypothesis. The probability of a false positive in that situation is not the same as the probability of a false positive when the null hypothesis is true. It can be way higher.

https://lucklab.ucdavis.edu/blog/2018/4/19/why-i-lost-faith-...

> Here's a more simple thought experiment that gets across the point of why p(null | significant effect) /= p(significant effect | null), and why p-values are flawed as stated in the post.

> Imagine a society where scientists are really, really bad at hypothesis generation. In fact, they're so bad that they only test null hypothesis that are true. So in this hypothetical society, the null hypothesis in any scientific experiment ever done is true. But statistically using a p value of 0.05, we'll still reject the null in 5% of experiments. And those experiments will then end up being published in scientific literature. But then this society's scientific literature now only contains false results - literally all published scientific results are false.

> Of course, in real life, we hope that our scientists have better intuition for what is in fact true - that is, we hope that the "prior" probability in Bayes' theorem, p(null), is not 1.

https://news.ycombinator.com/item?id=16917158
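The thought experiment above is Bayes' theorem with p(null) set to 1. A minimal Python sketch of the same arithmetic follows; the function name and the default power of 0.8 are assumptions made for illustration, not values taken from either quoted post.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """P(null is true | test came out significant), via Bayes' theorem.

    prior_null: assumed share of tested hypotheses whose null is true.
    alpha:      significance cutoff, i.e. P(significant | null true).
    power:      assumed P(significant | null false); 0.8 is illustrative.
    """
    false_pos = alpha * prior_null          # true nulls that get rejected
    true_pos = power * (1.0 - prior_null)   # real effects that get detected
    return false_pos / (false_pos + true_pos)


# prior_null = 1.0 is the hypothetical society that only tests true nulls.
for prior in (1.0, 0.9, 0.5, 0.1):
    print(prior, round(prob_null_given_significant(prior), 3))
```

With the prior at 1.0 every significant result is a false positive, no matter how strict alpha is; the share only drops as the prior moves away from 1.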

For about a year or so now I've been wanting to make a game about science. It'd basically be a research and discovery simulator, and there would be a free-play mode. Some of the knobs would be the number of required replications, the required p-value and how good people are at generating hypotheses.

I think it'd be eye opening.

There is a "people will mostly replicate/extend articles about X, and ignore articles about Y" (groupthink) effect that I imagine is also very relevant.

Oooh, that's a great one to add to the list!

I hope you do! I think simulations can be a powerful way to teach/learn and I would like to see this one.

I put a reminder to check back with you about this in a few months. Is Keybase your preferred contact method? I know it in name only, but I imagine I can figure it out.

Yes! Riffing off that, I'd want to do a version where all the data in the game is pure randomness, but where your experimental configurations can influence the outcome. (e.g. you can clean the equipment, which alters the reading, or redo experiments that "went wrong")

You'd be given a prize/goal for publishable findings. Then you'd gradually introduce enough bias into the experiments to get something publishable, and then get hit with the reveal that "oh you were generating effects from random data, jerk".

(Okay, maybe that's overcomplicating it.)
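The "redo experiments that went wrong" mechanic can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the data is pure N(0, 1) noise, each attempt is a z-test with known sigma on 20 points, and the player stops at the first p < 0.05.

```python
import random
from statistics import NormalDist


def p_value_of_noise(rng, n=20):
    """Two-sided z-test p-value for n draws of pure N(0, 1) noise."""
    mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
    z = mean * n ** 0.5  # sigma is known to be 1 in this toy setup
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def publishable_rate(max_redos, trials=2000, alpha=0.05, seed=0):
    """Fraction of 'studies' that reach p < alpha within max_redos attempts."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # any() stops at the first attempt that comes out significant
        if any(p_value_of_noise(rng) < alpha for _ in range(max_redos)):
            hits += 1
    return hits / trials


# Compare the simulated rate with the closed form 1 - 0.95**k.
for k in (1, 5, 10):
    print(k, publishable_rate(k), round(1 - 0.95 ** k, 3))
```

Under these assumptions the attempts are independent, so the chance of a "finding" after k tries is 1 - 0.95^k, which is about 40% at ten tries.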

Please do. We're getting closer and closer to the tipping point (although it still might be far away) where people go "Oooohhhh... Well, shit.". Where we all realise. Your game would be one of the many catalysers for this.

This is my opinion of course, it's not like I have any scientific basis for this.

>If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments

I'm all for criticism of p-values, but when I read a lot of critiques, I get to this point and simply stop reading.

No statistics text book that I've read assigns the magic value of p=0.05 and labels it as significant. All the ones I've read tell you to pick a p-value appropriate to your experiment. Yes, I get it that many social scientists don't have much of a clue and use 0.05 as some special threshold, but let's direct the criticism to the guilty parties, instead of blaming a statistical methodology.

I mean, we all know people who misuse the mean ("the average number of breasts a person has is 1") and ignore the shape of the distribution and the standard deviation. Yet we don't say "Let's stop using the mean!"

That's not what the p-value means. It means that if you run 1000 of the experiments in a universe in which the hypothesis is false, around 50 of them will confirm the hypothesis anyway. If the hypothesis is true, then there are no false positives; all positives confirm the hypothesis. In a universe in which the hypothesis is true, there can only be false negatives.

"False positive" means that the effect or condition we're looking for is not true, but the experiment yields a true answer: the positive answer of the experiment is a falsehood. If the condition we're looking for is true, then there can't be a false positive. Even if the experiment yields a positive due to some flawed step, it's still a true positive.

> The false positive rate (Type I error rate) as defined by NHST is the probability that you will falsely reject the null hypothesis when the null hypothesis is true. In other words, if you reject the null hypothesis when p < .05, this guarantees that you will get a significant (but bogus) effect in only 5% of experiments in which the null hypothesis is true.

This is just a language issue: a false positive of the rejection of the null hypothesis.

This blog post is extremely misleading -- it appears to confuse the "null hypothesis" with an actual scientific hypothesis that is being tested. For example: "However, there are many cases where I am testing bold, risky hypotheses—that is, hypotheses that are unlikely to be true."

Those may be the hypotheses being tested, but they are not NULL hypotheses. NHST is not about testing the NULL hypothesis, it is about testing a non-Null hypothesis (the null hypothesis should always be incredibly boring and expected). There may be problems, but this blog post does not describe one.

I think the problem with p-values is that it trains us to think about uncertainty without nuance. It hides the inherent trade off between the cost of taking on risk and the cost of reducing uncertainty, since it sets the threshold at p=0.05. Taken to the extreme, with a large enough sample we can nearly always find significant differences between populations, the difference will just be very small and n size will be enormous.
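The large-sample point is easy to check numerically. A small Python sketch, assuming a two-sample z-test with known unit variance and an observed difference fixed at 0.01 standard deviations (both are illustrative choices, not figures from the thread):

```python
from math import erfc, sqrt


def two_sample_p(effect_in_sd, n_per_group):
    """Two-sided z-test p-value when the observed difference equals the effect.

    The standard error of a difference of two means with unit variance
    is sqrt(2 / n), so z = effect * sqrt(n / 2).
    """
    z = effect_in_sd * sqrt(n_per_group / 2.0)
    return erfc(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


# Same tiny effect, growing sample size.
for n in (100, 10_000, 1_000_000):
    print(n, two_sample_p(0.01, n))
```

The effect size never changes; only n does, and the p-value goes from nowhere near significant to vanishingly small.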

Recently I worked with a client to interpret results from an A/B test where A performed better than B with 85% confidence (based on credible intervals, accounting for multiple comparisons). We therefore recommended A. In a group phone call, the client told her colleagues that our company doesn't know what we're talking about because 85% confidence of a difference isn't statistically significant (i.e. isn't 95% confident). We lost their business.

This was a shame because gathering the data for the experiment was expensive and the downside of making the wrong choice was low. It is often the case that taking on more risk makes more sense than hitting diminishing returns on shrinking p-values with extra sample.

I think p-values are actually somewhat demonized and I have grown to like them more and more over time. The standard interpretation is actually overly complicated for some reason and it can be simplified to "your p-value cutoff is an upper-bound on the rate of type I errors," over the long term. That's simple and actionable and is an immediate consequence of the definition of p-values! Frankly, I don't know why text-books don't give this as the definition of p-values, and they should reserve "probability of an event at least as extreme conditional upon the null hypothesis" as the thing-you-show-to-prove-it's-a-p-value.

The current trend of saying that "cutting p-values off at a specific value is bad" makes me worry. Now you can argue that your p=0.06 result shouldn't be rejected when really we should probably be pushing for stricter standards rather than inching towards looser ones. It also destroys the nice interpretation of p-values above. P-values were literally made to be cut off - if you want to stop doing that, you need to show me a coherent philosophy of what to do instead.

What I do think is true is the problem you have where part A of the experiment suggests X so you test X more directly in part B with a weaker but more specific test and get p=0.06 and now you can't publish. That's a dumb cutoff, clearly a p=0.06 test is likely to shift our belief towards X so it does nothing but bolster part A. Typically papers do this several times and the marginal 'failure' of one step should not sink the entire ship. This is a case where a Bayesian analysis might be more useful as it can incorporate weak evidence.

But the problem I see often is not that p-values are misused but that they were junk in the first place. For example, the widely-used DESeq2 (as well as some competitors in RNA-seq differential expression analysis) will happily spit out p-values of 10^-100 for an experiment with only four replicates in each of two conditions! There is no way you can get that level of evidence from just four replicates, even if the values are 0,0,0,0 and 1e6,1e6,1e6,1e6. The assumption of normality is reasonable near the mean but gets increasingly inaccurate in the tail, which is exactly where you end up when you do things like sort 30,000 tests by their p-values. In fact taking a p-value cutoff is probably the only reasonable thing to do here - that way you'll ignore the fact that it's absurdly small and just treat it as "small enough".

That interpretation makes the test basically useless because we know a priori that any two variables for things within each other's light cone affect each other at least a little. More practically, since the test tells you nothing about the size of the effect, it will pick up on the tiniest bias in your experimental procedure and always reject the null if you have enough data.

From the author of the article: "The general point reminds me of my dictum that statistical hypothesis testing works the opposite way that people think it does. The usual thinking is that if a hyp test rejects, you’ve learned something, but if the test does not reject, you can’t say anything. I’d say it’s the opposite: if the test rejects, you haven’t learned anything—after all, we know ahead of time that just about all null hypotheses of interest are false—but if the test doesn’t reject, you’ve learned the useful fact that you don’t have enough data in your analysis to distinguish from pure noise."

(https://statmodeling.stat.columbia.edu/2019/08/18/i-feel-lik...)

To add to this, the problem with t-tests is not the threshold. It is that the hypothesis you are rejecting (effect is exactly 0.0000000...) is infinitesimally small. You've rejected basically nothing of your hypothesis space.

Your null should have a width. You should always be rejecting "effect is greater than some margin" which you should have to argue is greater than any bias you might expect in your experiment. There are always at least tiny biases.

Compound/interval null hypotheses basically solve this effect-size problem and probably should be used more.
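A rough Python sketch of what an interval null looks like in practice, assuming a normally distributed estimate with known standard error. The function name and the numbers are hypothetical; the p-value is taken at the edge of the interval, which is the worst case for the null |effect| <= margin.

```python
from statistics import NormalDist


def interval_null_p(estimate, std_err, margin):
    """Worst-case two-sided p-value over the null |true effect| <= margin.

    The probability of a result at least as extreme as the observed one
    is largest when the true effect sits on the boundary of the interval,
    so both tails are evaluated there. With margin = 0 this reduces to
    the ordinary point-null z-test.
    """
    z = NormalDist()
    x = abs(estimate)
    upper = 1.0 - z.cdf((x - margin) / std_err)  # tail beyond +x
    lower = z.cdf((-x - margin) / std_err)       # tail beyond -x
    return upper + lower


# A tiny but very precisely measured effect (illustrative numbers).
print(interval_null_p(0.01, 0.001, margin=0.0))   # point null
print(interval_null_p(0.01, 0.001, margin=0.05))  # null with a width
```

The point null is rejected overwhelmingly, while the null with a width is not rejected at all, because the estimate sits well inside the margin.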

Yes, it should be required.

P-values are a sub-optimal but okay-ish way of quantifying a Popperian hypothesis (a designed-to-be-refutable conjecture). The mathematics is not the problem, the problem is carving science (which in my view (and Quine's and others') is pretty much defined by the unity of science) in testable morsels.

None of the great achievements of science (Newton, Darwin, Mendeleyev, etc.) were obtained on the basis of Popperian demarcationism/conjectures-and-refutations -- they were obtained by positing a large framework and patching together the empirical case for it.

Turns out that once you invent a system for "acceptance" (p < 0.05 or whatever), people will game it. Who knew?

Your great achievements in science exclude all real world applications (engineering, pharmaceuticals, etc), where the critical details of a hypothesis don't fit on a t-shirt

Note that I said "science", not "engineering". Engineering is driven by usefulness and profit.

Falsificationism isn't a stupid idea; it's even useful at a personal improvement level. But pharma or materials research use it because it tends to lead to good results, not because it's the very definition of what's worthwhile knowledge.

Indeed. People keep saying Bayesianism will solve the replication crisis but as far as I can see it will make it worse: without a hard cut-off you have many more degrees of freedom to "hack" and can push ever more marginal "results".

It's worth noting that at their inception, null hypothesis testing and p-values were separate methodologies; there was even a bitter rivalry between their chief developers (Fisher and Pearson).

It wasn't until much later that textbooks started to merge both. It may be worth reviewing Neyman and Pearson's attacks on Fisher in this matter.

It looks like the blog author completely missed the point of the statistical significance discussion going on. Most first-tier journals in the social sciences have an acceptance rate of about 5%. At the margins, the difference between acceptance and rejection could be having one more statistically significant result in the table than the paper that was submitted right before or after yours.

The problem with a 0.048 and a 0.052 is not a mathematical one but an interpretation one. Reviewers are conditioned to be very skeptical of non-significant results and use “under power-ness” as grounds for rejection. As a result, we get publication bias and p-hacking.

You point out p-values can be troubling because there's all sorts of bad incentives that lead to p-hacking and publication bias. The author points out that even if those bad incentives didn't exist, p-values aren't all that useful to begin with. That's not "missing the point", it's just pointing out a different aspect of the situation.

I think you should reread the article because it's exactly what the blog author says. Blog author who btw is Andrew Gelman, not just some random guy on Medium, his blog is well well worth reading. Fighting bad stats in science is kind of his hobby/life mission.

Yeah, he had a major part in creating the stats discussion that is going on. I doubt that he has missed its point.

There is a whole group of people in social sciences who are pushing for abandoning the null hypothesis testing methods. For the reason you mentioned, I cannot believe changing the test (or threshold of the test) would solve this issue. You ask people to find something significant or fit a model to a data — and tell them that's what matters — and they'll do it, either intentionally or unintentionally.

Machine Learning will have (and already has) the same issue if the only thing that matters is hitting a certain level of accuracy given your model and data. This has been observed in Kaggle competitions over and over: you ask a group of people to find the best fit, and they will, by learning your train, validation and test datasets.

As mentioned, the problem is not the p-value or null hypothesis testing; the problem is journals that promoted the wrong incentive, and educators who were not aware of the consequences and propagated the wrong incentive (interpretation) to students.

> Most first-tier journals in the social sciences have an acceptance rate of about 5%.

Assuming the null hypothesis, that is precisely our expectation of finding something significant under the p < .05 rule. (That is, assuming that all papers try to falsely reject a true null hypothesis, then we expect 5% of the papers to be successful at that and get published.)

> To say it again: it is completely consistent with the null hypothesis to see p-values of 0.2 and 0.005 from two replications of the same damn experiment.

I don't really follow this. Could someone clarify what is meant here? At what point would this author say something is not consistent with the null hypothesis?

It's a meta observation of how we view p-values. He's saying that the same logic behind p-values and declaring something "statistically significant" when p < X (often 0.05), falls apart when applied to differences between two observed p-values. Essentially, you would say there is no "statistically significant" difference between an experiment that resulted in a p-value of 0.2 and one that resulted in a p-value of 0.05.

And the argument will hold no matter what threshold you choose for rejecting the null hypothesis. You can choose to reject if p < X, and for any X, there will be values greater than X that, applying this meta-logic, are not statistically different from X.

> At what point would this author say something is not consistent with the null hypothesis?

Gelman's argument, I presume, is against the idea of significance testing as a whole. Declaring something "statistically significant" is in itself a very problematic thing, as it distills the entire phenomenon, the uncertainty surrounding the experiment, and the uncertainty surrounding the researcher's decisions to a single, binary conclusion.

Gelman is a Bayesian (perhaps the most famous modern Bayesian), and the Bayesian philosophy is to focus on producing a posterior distribution of the phenomenon being studied. I presume the alternative to significance and null hypothesis testing that he would suggest is something closer to a model where people report their priors/data/posteriors, and the discussion focuses on the implications and replication of those.

Nassim Taleb wrote extensively on this.

> P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and vary greatly across repetitions of exactly same protocols under identical stochastic copies of the phenomenon; such volatility makes the minimum p value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

https://arxiv.org/abs/1603.07532

Video: https://www.youtube.com/watch?v=8qrfSh07rT0

Are you sure this answers my question? Note that I was neither asking why p-hacking is bad nor even why p-values are bad.

Yes, it answers your question.

What you call "p-value" is a sample from the "p-value distribution" of your experiment.

Taleb shows you can sample a p-value of 0.05 when the actual "true" p-value is 0.12.

I guess I have to sleep on this because at a quick glance I can't really make sense of how it answers my question.

I do find it ironic though that this is so difficult to explain that I apparently have to read a paper to understand it... I would've thought the blog post was trying to explain things in simple terms...

He is coming at that conclusion from a Bayesian point of view to statistics. He is seeing the p-value as a random variable that can take values from 0 to 1 and follows some distribution. Under these hypotheses, observing a p-value of 0.20 and 0.005 is completely reasonable even if unlikely. Those are just two draws from a random variable.

Edit. Under Bayesian statistics testing the null hypothesis is a moot point as it becomes possible to directly model the distribution of the possible effects. Thinking of it as being able to look at a picture of something (the p-value) vs looking at a movie of it (the distribution of the effects).

That's not even a bayesian point of view, even in a frequentist setting the p-value is a conditional probability, conditioned on the dataset.

What he says is (I gather) worse: those events are only separated by 1.1 standard deviations, which is little.
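The 1.1 figure can be reproduced with a short Python sketch. It assumes both p-values are two-sided, both estimates have the same standard error and the same sign, and the two replications are independent; those are assumptions made here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()


def z_from_two_sided_p(p):
    """z-score whose two-sided tail probability equals p."""
    return std_normal.inv_cdf(1.0 - p / 2.0)


z_a = z_from_two_sided_p(0.2)     # the "non-significant" replication
z_b = z_from_two_sided_p(0.005)   # the "highly significant" replication

# The difference of two independent unit-variance z-scores has
# standard deviation sqrt(2).
gap = (z_b - z_a) / 2 ** 0.5
p_gap = 2.0 * (1.0 - std_normal.cdf(gap))

print(round(z_a, 2), round(z_b, 2), round(gap, 2), round(p_gap, 2))
```

Under those assumptions the gap between the two results is about 1.1 standard errors, which by the usual 0.05 convention is itself nowhere near significant.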

I think that's what he's saying too, but what is that supposed to show? Is he arguing against some claim that every interval of 1 standard deviation is equally significant? Did anybody make this claim? So far as I know, nobody considers (say) a 6-sigma effect to be 6 times stronger than a 1-sigma effect...

I don’t get it either. It’s a trivial consequence of having a threshold: if we say two cities are “far” when they are at least 1000 miles away then Washington D.C. is not far from Jacksonville while Boston is far from Jacksonville, even though Boston is not far from Washington.

That's not quite the problem.

Let's assume you've already decided in advance what "far" means.

Without moving either city from its current location, the same experiment can give you "very far" and "very close" in identical replications.

I’m commenting on pfortuny’s (correct) interpretation “those events are only separated by 1.1 standard deviations, which is little” of the original post by Gelman (the difference between a significant result and a non-significant result may not be significant = a city which is far from Jacksonville and a city which is not far from Jacksonville may not be far from each other).

I still don't get it. What p-value observations would not be "reasonable" here? It seems to me he's saying that anything between 0 and 1 is completely reasonable, which is a completely pointless statement as I see it.

His point is that the whole point of the field of statistics is that cherry picking results or getting lucky once is not proof of anything, unlike nonstatistical mathematics where one example is enough to justify a claim.

> He is seeing the p-value as a random variable that can take values from 0 to 1 and follows some distribution.

That is the frequentist point of view!

The distribution of the p-value, given that the null is true, is uniform between 0 and 1.

IF the null is true, you're equally likely to get a p-value of 0.01 and 0.87.

Say you run an experiment. You get some data. From that data you calculate a p-value. The data has randomness, so the p-value is a random variable and has a distribution. By chance you can get results that look significant. By chance you can get results that look insignificant.
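A quick way to convince yourself of the uniformity claim is to simulate it. This is a minimal Python sketch under assumed conditions: the null is exactly true (data is N(0, 1) noise), the test is a two-sided z-test with known sigma, and the sample size and seed are arbitrary.

```python
import random
from statistics import NormalDist


def null_p_values(trials=5000, n=30, seed=1):
    """p-values from repeated experiments in which the null is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    out = []
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = mean * n ** 0.5  # known sigma = 1
        out.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return out


ps = null_p_values()

# For a uniform distribution, the share of p-values at or below a cutoff
# should be close to the cutoff itself.
for cutoff in (0.05, 0.2, 0.5, 0.9):
    share = sum(p <= cutoff for p in ps) / len(ps)
    print(cutoff, round(share, 3))
```

Each printed share should land near its cutoff, which is the same statement as "a 0.05 threshold rejects about 5% of true nulls".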

Well, if there is no effect (the effect-size is zero), two different experimenters will likely see two different p-values regardless of how large an experiment either of them runs.

So by this logic "it is completely consistent with the null hypothesis to see p-values of 0.00000001 and 0.99999999 from two replications of the same damn experiment"?

At which point, what is even the point of this statement?

My understanding (Someone please correct me if I'm wrong) is that while the values can range from limx->0 and limx->1 , the actual distribution will be skewed in cases where it is statistically significant. E.g. If you repeat the experiment 100x you could find a p=.999999 but more likely they'll be close to .05 (or whatever alpha you choose)

If the null hypothesis is true the p-value is distributed uniformly in [0 1] (at least ignoring discrete data, composite hypothesis and other cases which are not easy).

The distribution will be different when the null hypothesis is not true (but you may also get non-significant results even if the null hypothesis is not true).

I’m not sure if that’s what you mean by “the actual distribution will be skewed in cases where it is statistically significant.”

Sorry I was definitely kinda vague - I meant when the alternative hypothesis is true the distribution should be skewed towards 0.

The point is that you need to replicate experiments and not cherry pick the one with the highest p-value.

I have no idea.

For identical normal populations, repeating an experiment produces p<=0.2 20% of the time, and produces p<=0.005 0.5% of the time.

A coin comes up on the same side 3 times in a row vs 8 times in a row...I have no idea why we should shrug and consider the plausibility of coin bias in these two cases about the same.

---

EDIT: I think he means that in the case of an actual difference in the populations, p=0.2 and p=0.005 are both pretty likely outcomes.

When the populations are the same, p=0.2 and p=0.005 are quite different happenings.

This is because p-value methods don't worry very much about type II errors.

.

> It means that if the null hypothesis were true, you expect your p-values to be a random variable contained inside a nice bell shaped normal. 0.005 is dead center so it's very likely but 0.2 which seems very unprobable is actually only 1std further, it's well inside the bell curve.

I'm completely lost here. How is 0.005 "dead center"? Are you assuming p = 0 is the center? Are there negative p-values I'm not seeing that somehow balance the positive ones?

How can a random variable that's strictly between 0 and 1 even follow a bell curve?

Check out these links [0] [1] or google for "p value distribution" or "p curve"

[0] http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HtestPValue/t...

[1] https://en.wikipedia.org/wiki/P-value#Distribution

I'm not following those links either. How is p uniformly distributed under H0? If you assume H0 then obtaining a p-value near 0 is going to be damn impossible. Whereas obtaining one similarly close to 0.5 is going to be ridiculously more likely.

Am I severely lacking sleep and going crazy or something? Maybe I should check back in like half a day to see what people have said, I feel like I must be completely confused right now because literally nothing I've read so far makes sense to me.

They're right. If the null hypothesis is true, then the probability of getting any p value is equal. Put it another way, the p value is the probability of getting the observed data (or more extreme) under the null. So, under the null, 10% of the time you will get data with a p value of 10% or less; 20% of the time, you will get data with a p value of 20% or less; and so on. And that's the uniform distribution!

Here's an R example to play with:

That will plot you a nice uniform line on [0, 1].

(NB: I have no idea why OP talked about p values following a normal distribution. That doesn't make sense to me, and I think the post has been deleted.)

So I don't have (or know) R, but I do have access to Mathematica, and this is most definitely not giving me a uniform distribution (how could it?!):

Why is your x variable though? If H0 is true shouldn't your x be fixed?

I think if you remove the subtraction though then you do get a uniform distribution -- in which case I see what the claim is, yeah. Wasn't really clear to me earlier but indeed, getting p = 5% means you have a 5% chance of getting observations that extreme, so I guess it is uniformly distributed!

> How is p uniformly distributed under H0?

That's the very definition of a p-value! The mapping of data to p-values is chosen to have a uniform distribution of p-values when the data is distributed according to the null hypothesis. That's the property that makes p-values interesting.

At least for continuous statistics, the p-value is uniformly distributed when the null hypothesis is true.

I don’t think you understand what a p-value is. The p-value is a percentage output of testing whether a given normal distribution actually has a non-zero mean. It is phrased in terms of the null hypothesis. So a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true. Conversely, this means that there is a 95% chance that what you’re testing is true.

> So a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true. Conversely, this means that there is a 95% chance that what you’re testing is true.

See https://en.wikipedia.org/wiki/P-value#Basic%20concepts

You don't understand what a p-value is.

> a P < 0.05 means that there is less than a 5% chance that the null hypothesis is true.

In other words, P(H0 | X) where H0 is the null hypothesis being true and X is the data observed. But that is not what a p-value is, they actually represent P(X | H0).

mruts' (and most people who studied a little statistics) misunderstanding of the basic definition is exactly what the author and others have been campaigning against for years.

All this reminded me of an exchange where similar games were closed the fastest. [0] https://www.bestbitcoindice.com/wp-content/uploads/2017/11/Y...

For somebody who has never had formal training in statistics and similar discussions, what is a good book to start grokking these concepts?

If you want a college textbook, there are literally hundreds with titles similar to "introduction to probability and statistics" and you should choose the cheapest second hand book you can find (or that they have in your library). Content is mostly the same, of course the writing style will be different and some may be more appealing to you personally. Search engines are your friend for narrowing down your list.

For a popsci work, you could check out "how to lie with statistics", a classic.

https://en.m.wikipedia.org/wiki/How_to_Lie_with_Statistics

> Also, to get technical for a moment, the p-value is not the “probability of happening by chance.”

Is it not? According to Wikipedia, it's "[...] the probability that, when the null hypothesis is true, the statistical summary [...] would be equal to, or more extreme than, the actual observed results." This sounds pretty much like "probability of happening by chance".

It sounds pretty much the same, but is distinctly different. That's where much of the confusion in the popular press comes from.

The difference is that, as highlighted in your quote, there is some null hypothesis that is assumed when discussing p-values.

For example: what is the probability of drawing x>2 when the underlying distribution is assumed to be a standard normal distribution N(0,1)?

The probability is small in this case, and could provide evidence to reject the null hypothesis (i.e. the distribution is not standard normal). It doesn't tell you about the probability of drawing x>2, it only gives evidence to reject (or not) the null hypothesis.

The wiki has more elaborate explanation. And probably better examples than mine.
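For what it's worth, the x > 2 example takes two lines in Python. The second null, N(1, 1), is a hypothetical alternative added here only to show that the number depends on which null you assume.

```python
from statistics import NormalDist

# Tail probability of seeing x > 2 if the null N(0, 1) is true.
tail = 1.0 - NormalDist(mu=0.0, sigma=1.0).cdf(2.0)

# Same observation under a different assumed null, N(1, 1) (illustrative).
tail_alt = 1.0 - NormalDist(mu=1.0, sigma=1.0).cdf(2.0)

print(round(tail, 4), round(tail_alt, 4))
```

The observation is identical in both lines; only the assumed null changes, and the tail probability moves from roughly 2% to roughly 16%.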

Close, it's the probability of it happening by chance given that the null hypothesis is true. Unless H0 is objectively and absolutely true, it is not the same as "probability of it happening by chance."

It is not the “probability of [happening by chance]” (as opposed to the probability of [happening for some reason]).

It is the “[probability of happening] by chance” (as opposed to the [probability of not happening] by chance).

Just putting this here in case anyone is interested in a slightly different view on p values.

https://mindbowling.wordpress.com/2016/07/19/p-values/

Sure. Apply some fuzzy logic. Highly significant. Somewhat significant. And honestly, we're in an age where somewhat significant can often be bolstered later (in the drug industry) by coupling drugs. It's really time to stop believing everything has to be unifactor. Everything that we care about is multifactor and even slight significance could make a difference if added up.

A 5% chance of the results being reproducible by fluke even if the hypothesis is false is obviously too high for anything important. Splitting hairs over .048 and .052 is ridiculous: it revolves around tiny differences in a gaping uncertainty. Neither value is anywhere in the neighborhood of where the benchmark should be.

Well, obviously, the result with p=0.052 is the will of the people and should be implemented to its most extreme.

> Also, to get technical for a moment, the p-value is not the “probability of happening by chance.” But we can just chalk that up to a casual writing style.

Isn't it though? The probability of this large (or larger) of a variance happening purely by chance[1]?

This article is highly critical, but the criticism goes over my head at least.

[1] assuming a normally distributed population

I think the part that you have correctly included that people forget or elide is that it's the probability under a specific null hypothesis. So it's a function of what you have chosen for that - normal distributions, a certain parameter value of 0, etc. So this means that a) it's not the probability you'd see in the real world under repeated performance b) it's not the probability under other reasonable null hypotheses. Like maybe under the hypothesis parameter = 0 you get an improbably large p-value, but under parameter = 0.1, or with different assumed underlying distributions, you wouldn't see something so extreme.

I'd guess that the original writer understands this, and that Gelman is only pointing it out because casual readers sometimes don't mentally retain the full baggage that the p-value carries.

I think he was being pedantic about how it's the probability of observations at least as extreme as the ones actually obtained happening if the null hypothesis is true, which is not the same thing as those particular observations happening by chance.

If p = np, what's the chance that n = 0.045 or 0.052?

1/420 :)

The cube has 420 sides?