I remember 17 years ago, I crashed a "big data for statisticians" meeting at the UCSD supercomputing center (they were kind enough to let me in). Emanuel Parzen, who sadly passed away this year (http://today.tamu.edu/2016/...), made an impassioned plea to the statisticians there that they needed to take the lead in data mining (at that time, this was the 'big data' analytics label) because the data mining community needed an understanding of how to apply sound principles to big data, the 'statistical thinking' discussed here. That need is even greater today as we see the same mistakes being made all over again.
The problem is that there is no real consensus on what "statistical thinking" might be, as different well-qualified statisticians often contradict one another, as we have seen in the Ioannidis reviews showing lack of replication in many biomedical studies with statisticians as authors. But we do know what good science means, and that means findings that are well replicated when the biases of statisticians/researchers and data sampling error are controlled, as in FDA trials. Maybe we should keep our focus on good science and require evidence of replication with appropriate blinding (yes, even statisticians building models should be blinded to one another to control their biases) and with appropriate controls for data sampling error (yes, we should even ask the unthinkable and give each of these blinded statisticians/modelers independent model development data). This would seem much better than somebody saying "trust me, I am a statistician" when their model predictions could very well be contradicted by another statistician.
Daniel, No, we are fine.
RE: But we do know what good science means
RESP: 'Good science' contains 'statistical thinking' so we must have a pretty good idea of what 'statistical thinking' is too. If you do not understand statistics, that is not a flaw of statistics.
Randeroid, prove you are fine by proper blinded research that controls human bias like good science always does. When reports emerge that the findings of many statisticians do not replicate, this is cause for concern in the scientific community. There are good statisticians who try to perform objective science with proper blinded controls, but there are also many who do not. If you were really concerned about the bad ones, you would not be making blanket statements like "We are fine". No profession is always fine and all have bad apples, so proper controls like blinded research have been introduced in medical science to handle this. In statistical modeling and analysis, lack of replication may occur just due to arbitrary or biased choices on the part of modelers who may be perfectly honest but wrong. P.S. Much of science is done without statistics, and some of the worst science is by otherwise qualified statisticians who simply can't replicate each other but believe such randomly varying conclusions are perfectly "fine".
Daniel, I have seen your anti-statistician comments before. I do not recall any that were positive or informative, which leads me to doubt the objectivity. If someone took your comments at face value, they might get the completely false impression that science is better without statistics. Respectfully.
Randeroid, I am against methods that do not replicate due to human bias or error, whether they are practiced by statisticians or by any other profession that claims to be doing science. The only way to expose such methods is through blinded controlled science. Blinded controlled methods are something that dishonest statisticians, who know that they are biased and will not be replicated, absolutely hate, because their results are immediately shown to be contradicted. Honest scientific statisticians have no problem with introducing proper blinded controls into their methods to ensure that their conclusions are not related to their biases, and I have tremendous respect for these professionals.
Daniel, well no one would have gathered that you are for more statistics and more rigorous statistics based on what you wrote above.
Randeroid, I did not say that I was in favor of "more statistics". You are right that I am in favor of well-controlled and yes rigorous statistical methods, but I hope my posts make this clear.
RE: I said that I was an advocate for well-controlled and yes rigorous statistical methods including analyses and modeling which is what all my posts say.
RESP: That means more statistics, not less.
Having statisticians solve statistics problems would dramatically reduce the number of these statistics mistakes and the statistical dishonesty that concerns you. So, you must be a HUGE fan of having the statisticians solve the statistics problems.
Randeroid, by the way, I just looked at the 2009 original Nature article on Google Flu Trends, and two of the authors appear to be epidemiologists. One has an MPH and another was and still is an epidemiologist at the Centers for Disease Control. According to the American Statistical Association's explanation of careers in epidemiology at www.amstat.org, epidemiologists are statisticians whose "statistical work consists of directly analyzing and designing studies, advising others on the analysis and design of studies, and developing statistical methods to improve the design and analysis of studies". So if you and Leek (assuming you are not the same person) are honest about the pitfalls of statistical methods, then you need to be honest and admit that statisticians were involved in the Google Flu Trends work, but things still went badly.
Daniel, The folly of Google Flu Trends was that they lacked the statistical muscle. The team consisted of data management guys and domain experts (MPH and epi). I wrote a case study on it.
Again, having statisticians solve statistics problems would dramatically reduce the number of these statistics mistakes and the statistical dishonesty that concerns you. So, you must be a HUGE fan of having the statisticians solve the statistics problems.
Randeroid, a copy of that case study would be interesting. I am obviously in favor of scientific statisticians, with all the experimental design controls that they bring to the table, including appropriate blinding of modelers from one another to ensure that modeler biases and error are not influencing predictions. When this was done in MAQC-II, for example, the best modelers were not academic statisticians but had domain expertise and even engineering PhD degrees, and the academics did not fare well. You are right, I am a HUGE fan of quantitative scientists who may or may not have actual academic degrees in statistics, but who introduce appropriate blinding into their model design and/or prove themselves very capable in objective, well-controlled, blinded scientific tests. P.S. R.A. Fisher was a great statistician, but was extremely biased in tobacco lung cancer studies, for example. Bias and statistical ability appear to be orthogonal dimensions.
Daniel, applied statisticians are NOT academic statisticians, just like econometricians are NOT macroeconomists. Also, you should not assume that we are their friends, teaming up on you. We are problem-based and academic statisticians are tool-based; there is a disconnect. A statistics degree is NOT required to be an applied statistician. Also, applied statisticians must complete their training in the field, which should tell you about the disconnect.
The GFT (Google Flu Trends) team thought that they did not need statistics because they had Big Data. Their approach and their explanations are thin on statistics. If you go by their academic degrees, the team consisted of domain experts on the flu (2 MPH/MDs); a CDC liaison (1 Epi); and the rest of the expertise was geared toward managing large amounts of data with a little biostat and econ thrown in. They did not grasp the statistics problem they faced. If they had not been so arrogant ('Big Data hubris'), I would be more understanding.
Daniel, while you contemplate how best to apologize to applied statisticians for your reckless comments, I wish to make another point. The first step in leveraging expertise is finding it. The next step is to actually employ the expertise, right? Just because an expert is on the team does not mean their expertise is being leveraged (and they are not to be held responsible for actions with which they disagree).
This discussion supports why I have disdain for the job title of "data scientist." It's far too broad to be meaningful. It encompasses far too many fields, roles and functions. It is no less general than the broad field of business intelligence.
If you're a data scientist, then at best you're a generalist -- which negates the "scientist" part of the label. But as this discussion suggests, no one can be a specialist across the broad spectrum of all things data. I hope to see this title dissolve into far more specific, meaningful and functional titles.
Such as Statistical Data Scientist and IT Data Scientist.