I have lately refereed journal submissions where authors have totally evaded the concept of a statistical test while trying to demonstrate the importance of their research. I will give somewhat altered (to protect the identities of the authors and avoid lawsuits) examples later. Mainly, each of these authors has developed an innovation, compared the innovation to a straw-man technique (or two), and concluded that their innovation is good. However, in many cases the slight differences in performance would likely fail a statistical test had a statistical test been performed.
Witness the “Reproducibility Project: Psychology”. When experiments cannot be replicated, the cause might be bad statistics, but the cause could also involve investigator bias. In transportation we see much research with no statistical testing at all. I suspect that a lot of what has been published in our journals is unusable at best and misleading at worst.
When we do statistical estimation with canned software, the more important statistical tests are automatically generated. They are unavoidable and referees know to ask for them. However, when developing new methods with custom code, those lovely statistical tests are often absent. The writing of a custom statistical test can be difficult.
What does a statistical test do? If we determine experimentally that A is better than B, then a significant statistical test will tell us that if we were to repeat the experiment, thereby obtaining entirely new data, then it is highly likely that A will still be better than B. This also applies to data we collect through surveys or data that are obtained from traffic detectors. If an innovation is best within a single trial, it does not mean it will be best on the next trial, unless we have statistics to lend confidence to that assertion.
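As a minimal sketch of that idea, consider a paired sign-flip (permutation) test written in Python. It asks how often an average advantage for A as large as the observed one would arise if A and B were really interchangeable on each trial. The function name, the six error values, and the choice of this particular test are illustrative assumptions, not taken from any paper discussed here.

```python
import random


def paired_sign_flip_test(errors_a, errors_b, n_resamples=10000, seed=0):
    """One-sided p-value for 'A has lower mean error than B' on paired trials.

    Under the null hypothesis that A and B are interchangeable, the sign of
    each paired difference is equally likely to be plus or minus.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("errors_a and errors_b must be paired, trial by trial")
    diffs = [b - a for a, b in zip(errors_a, errors_b)]  # positive: A did better
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical absolute errors for two methods on the same six trials.
errors_a = [10.2, 11.5, 9.8, 12.0, 11.1, 10.9]
errors_b = [10.9, 11.2, 10.5, 12.4, 10.8, 11.6]
print("one-sided p-value:", paired_sign_flip_test(errors_a, errors_b))
```

A small p-value suggests that A's advantage would probably survive a repeat of the experiment. A large one says that the single trial, by itself, does not support that claim.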
There are rare cases where we might be willing to accept the claim of an innovation without a demonstration of statistical significance. Maybe an author has done an outstanding job of developing a new theory, or maybe the test data set is not entirely appropriate or available. In these cases the referees must make judgement calls as to whether the potential value of the innovation is worth the risk of a wrong conclusion. In these situations it is absolutely essential that the author provide the appropriate caveats.
I am dismissing the possibility of outright deception. I personally know of only one instance of deception happening, and this instance came from a colleague of mine rather than from a random author. My unscientific opinion is that most wrong conclusions stem from investigator bias fueled by wishful thinking. An author accrues far more benefits from getting a shaky paper into print than he or she risks losing by publishing bad results.
Case Study A. An author proposes a method for forecasting a composite traffic variable one day ahead using only information about that traffic variable from the day before. He finds that he needs just three equations. One equation predicts Tuesday through Friday and Sunday. Another equation predicts Saturday. And the last equation predicts Monday. He ignores seasonality and does not attempt to do anything with holidays. He calibrates his equations on a year’s worth of data, but tests the results on a 6-day holdout sample. He compares his method to simple averages for each day of the week, but develops a different average for each season. The author reports that the day-ahead method has a standard error of 11.1% and the averaging method has a worse standard error of 12.3%. Anyone who has ever worked with the F-test knows immediately that a difference of 1.2 percentage points is unlikely to be significant with only 6 samples. The author still claims his innovation is a success, but the author never did a statistical test to prove it.
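A rough check of Case Study A can be sketched in Python. A Monte Carlo simulation is swapped in here for an F-table lookup so that the code needs only the standard library. It assumes the reported standard errors are root-mean-square errors over the 6 test days, that the errors are independent, zero-mean, and normal, and that the two methods' errors are independent of each other (they are not, since both were tested on the same days). The function names are hypothetical.

```python
import math
import random


def rms(values):
    """Root-mean-square of a list of errors."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def ratio_p_value(observed_ratio, n_samples, n_sims=20000, seed=0):
    """Share of simulations in which two equally good methods produce an
    RMS-error ratio at least as large as observed_ratio."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both methods draw errors from the same distribution (the null case).
        a = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        if rms(b) / rms(a) >= observed_ratio:
            hits += 1
    return hits / n_sims


observed_ratio = 12.3 / 11.1  # averaging method over day-ahead method
p_value = ratio_p_value(observed_ratio, n_samples=6)
print("approximate one-sided p-value:", p_value)
```

Under these assumptions the simulated share comes out well above the usual 0.05 threshold, which is consistent with the intuition that 6 samples cannot separate 11.1% from 12.3%.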
Case Study B. An author develops an artificial intelligence technique to do a short-term (less than 1 hour) forecast of traffic speed. His technique is a modification of an existing technique. The author compares his innovation to the existing technique and a well-known time-series method. Inexplicably, the author omits a rather important variable, included in the AI techniques, from the time-series method. The author uses a month of data across 10 detectors to train/calibrate the techniques, then tests them on 5 full days of traffic data, divided into 15-minute intervals. This gives him a lot of samples in his test data set. His technique, of course, does best, with a standard error of 4.80 mph against a standard error of 4.88 mph for the existing AI technique and 6.11 mph for the time-series technique. A statistical test would be difficult to perform because his data are seriously heteroskedastic and serially correlated, so the assumptions of elementary statistical tests are violated. It looks to me like the difference between the two AI techniques is too small to matter. And while the innovation outperforms the time-series method, the comparison is flawed because of an unlevel playing field.
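One family of methods sometimes used when forecast errors are serially correlated is the moving-block bootstrap, which resamples contiguous runs of observations instead of single points. The Python sketch below applies it to the per-interval difference in squared error between two methods. It is not something the author of the paper did, it only partly addresses heteroskedasticity, and the function name, block length, and synthetic data are all assumptions made for illustration.

```python
import random


def block_bootstrap_ci(loss_diff, block_len, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean of loss_diff, resampling
    contiguous blocks so that short-range serial correlation is preserved."""
    n = len(loss_diff)
    if not 1 <= block_len <= n:
        raise ValueError("block_len must be between 1 and len(loss_diff)")
    rng = random.Random(seed)
    n_blocks = -(-n // block_len)  # ceiling division
    max_start = n - block_len
    means = []
    for _ in range(n_resamples):
        sample = []
        for _ in range(n_blocks):
            start = rng.randint(0, max_start)  # inclusive on both ends
            sample.extend(loss_diff[start:start + block_len])
        sample = sample[:n]  # trim to the original series length
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Synthetic, serially correlated loss differences (NOT data from the paper):
# squared error of method B minus squared error of method A, per interval.
demo_rng = random.Random(1)
loss_diff, prev = [], 0.0
for _ in range(480):  # 5 days of 15-minute intervals
    prev = 0.8 * prev + demo_rng.gauss(0.0, 1.0)
    loss_diff.append(prev)

lower, upper = block_bootstrap_ci(loss_diff, block_len=16)
print("95% interval for the mean loss difference:", (lower, upper))
```

If the interval comfortably excludes zero, the difference between the methods is less likely to be an accident of the test period. If it straddles zero, a gap such as 4.80 mph versus 4.88 mph should not be read as an improvement. The block length matters and would need to be justified for real detector data.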
Note: These case studies are lightly fictionalized from actual papers submitted to a reputable academic journal. I also simplified some of the statistical issues to speed up this discussion, but I think I got the essence of the situations correct. Most referees thought these papers were worthy of publication.
Believe me, the omission of a quality statistical test is not rare, and many referees are quite lax in holding the authors to account.
Neil Turok recently said, “All of the theoretical work that’s been done since the 1970s has not produced a single successful prediction.” He is speaking of physics, not traffic or travel forecasting, but we are in danger of falling into the same predicament.
We need to avoid granting free passes to authors who omit careful validation of their work or who validate it too casually.
As always, I am interested in hearing your opinion.
Alan Horowitz, Whitefish Bay, July 8, 2018