The method of hypothesis testing is a ubiquitous concept in science, and it shapes the way we see the world. Besides that, it can tip the scales for how successfully we publish a paper – be it in a standard journal of a certain field or the Journal of Negative Results (only the Journal of Universal Rejection does not seem to care about testing strategies when rejecting a paper). Anyway, hypothesis testing is so present in our everyday scientific lives that it is worth recapitulating what it means and what it is good for.
Hypothesis testing has its philosophical roots in Karl Popper's view that theories can't be proven but only falsified. Since the first half of the last century, generations of scientists have been formulating null and alternative hypotheses, hoping to disprove the former and accept the latter. The workhorses of hypothesis testing are statistical significance tests, the first of which was introduced by Ronald Fisher in the 1920s.
A statistical test is commonly considered significant if the p-value is below 0.05. This significance level is the probability of wrongly rejecting the null hypothesis although it is true (also called the type I or alpha error). For p = 0.05 this means that if you repeated an experiment a hundred times, you would get significant results in about five cases just by chance, despite the absence of any effect (because of this imagined repetition it is also called the frequentist approach). The significance level of 0.05 is arbitrary but generally accepted. Getting a significant result does not mean that your alternative hypothesis is true (although many people believe that), and the magnitude of the p-value does not tell you anything about the size of the effect. P-values depend not only on the effect size but also on the variation in your data and the sample size.
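You can see that "five in a hundred by chance" behaviour in a little simulation. The sketch below (my own illustration, not from any particular stats package) draws two samples from the very same normal distribution – so the null hypothesis is true by construction – and counts how often a two-sample z-test still comes out "significant" at the 0.05 level. The z-test with known standard deviation is an assumption made here to keep the example self-contained with the standard library:

```python
import math
import random

random.seed(1)

ALPHA = 0.05
N = 30          # sample size per group
RUNS = 10_000   # number of simulated "experiments"

def z_test_pvalue(a, b, sigma=1.0):
    """Two-sided two-sample z-test p-value, assuming a known sd (illustrative shortcut)."""
    se = sigma * math.sqrt(2.0 / len(a))
    z = (sum(a) / len(a) - sum(b) / len(b)) / se
    # p-value from the standard normal CDF, via the error function
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

false_positives = 0
for _ in range(RUNS):
    a = [random.gauss(0, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]  # same distribution: null is true
    if z_test_pvalue(a, b) < ALPHA:
        false_positives += 1

print(f"type I error rate: {false_positives / RUNS:.3f}")
```

Run it and the rate lands close to 0.05 – about one "discovery" in twenty, with no effect anywhere in sight.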
For these and other reasons, hypothesis testing has been criticised in the scientific literature and elsewhere, e.g. by Bob O'Hara here or by John Johnson here, and if you are more into the arts, check out Mick McCarthy's dance of significance (ok, he is not dancing himself, but simulating it in software).
To go beyond the restricted meaning of a hypothesis test and its p-value, it can be useful to accompany the test statistic with a measure of effect size, such as the proportion of variance explained or the steepness of a regression line. And if you are more ambitious, why not dive into the realm of Bayesian inference, where you can calculate the probability that your hypothesis is true rather directly (provided you can cope with the other challenges that come with Bayesian approaches).
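Both of those effect-size measures fall out of a simple linear regression. As a minimal sketch with made-up data (the slope of 0.5 and the noise level are assumptions for illustration), here is the slope and the R² computed from the usual sums of squares:

```python
import random

random.seed(7)

# Hypothetical data: y depends linearly on x (true slope 0.5) plus noise
x = [i / 10 for i in range(100)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)                       # variation in x
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))    # covariation
syy = sum((yi - my) ** 2 for yi in y)                       # variation in y

slope = sxy / sxx                    # steepness of the regression line
r_squared = sxy ** 2 / (sxx * syy)   # proportion of variance explained

print(f"slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

Unlike the p-value, these two numbers stay put as you crank up the sample size – they describe the effect itself, not your chance of detecting it.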
Anyway, good luck with your next test. I surely will dance and play the trumpet the next time some stars pop up in my favourite stats software.