There is something about data, a certain quality it has. A list of numbers can, and frequently does, tell a story about reality that is not at all apparent at first glance. There are numbers that hide within these numbers. These hidden numbers are called statistics, and statistics summarize, in some sense, the larger set of numbers. Statistics is a very humane subject, for the discipline allows humans to access and use larger sets of data without having to memorize or otherwise internalize the whole.
I didn't use to like data. There was a personality/interest test given along with the American College Test (ACT) when I took that exam long ago. The test placed your interests on a circle: the top was labelled "Data", the left-hand side "People", the bottom "Ideas", and the right-hand side "Things". My responses to the questions placed me at the very bottom of the circle, squarely in the "Ideas" area and about as far from the "Data" area as you could get. I believe my recommended occupation was graphic designer. I wasn't sure what a graphic designer was back then, but I did ignore that advice and became pretty much what was expected of mildly bright boys from mining towns in the Southwest United States: I became an engineer.
Time passed, and with it came the rise of computers, networks, massive data stores, and new disciplines like "data science". Computers and programming are now second nature to me, and I have demonstrated on more than one occasion a facility for analyzing data, but I still didn't really care for Data with a capital D. Since this post is about Data, you might suspect that my attitude has changed, and in some sense it has. What drove that change of heart is difficult to say, but much of it has to do with an ongoing controversy within the academic disciplines of probability and statistics. Now, the fact that controversy exists at all in fields perceived to be as staid as Prob and Stats is surprising. The fierceness of the controversy is surprising as well, at least on the side of one of the camps. The controversy is between the Bayesian and Frequentist camps, and it has as much to do with epistemology as with anything else, for the difference lies in how we say we know things that are fundamentally uncertain.
It is the idea of this controversy that appeals to me, which keeps me near my roots as a person grounded, so to speak, in ideas, but with a new and evolving interest in data. For the Bayesians and Frequentists have very different ideas about what data tells us, and what data is good for. It has taken me a while to understand the differences between the two camps, and I can't say that I completely understand what is going on, but I have learned a few things.
The first is the connection of the Bayesians with what I think of as the scientific method. Bayesians see probability as a measure of belief. The term that is used is plausibility -- a measure of the strength of our belief in the truth of a statement. Bayesians see data as grist for a mill that generates belief. As new data comes in, our beliefs in the plausibility of statements are updated. This aligns with the scientific method, which posits that the state of knowledge is a fluid, changing thing. Inference in Bayesian statistics has to do with computing the probability that a random variable, in this case the parameter being inferred, lies within some bounds.
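To make the updating idea concrete, here is a minimal sketch (my own illustration, not from any particular textbook) using the conjugate Beta-Binomial model for a coin's unknown heads probability. The prior Beta(1, 1) is uniform -- every bias is equally plausible -- and each batch of flips updates it.

```python
# Bayesian updating with a Beta prior on a coin's heads probability.
# Conjugacy makes the update exact: observing `heads` successes and
# `tails` failures turns Beta(a, b) into Beta(a + heads, b + tails).

def update(alpha, beta, heads, tails):
    """Return the Beta posterior parameters after observing coin flips."""
    return alpha + heads, beta + tails

# Start with a uniform prior, then observe 7 heads in 10 flips.
alpha, beta = 1, 1
alpha, beta = update(alpha, beta, heads=7, tails=3)

# The posterior mean is the updated degree of belief in "heads".
posterior_mean = alpha / (alpha + beta)
print(alpha, beta, round(posterior_mean, 3))  # 8 4 0.667
```

Notice that belief is the output: the parameter itself is treated as a random variable, and more data simply sharpens its distribution.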
The Frequentists just sort of go with the numbers. A probability is just the number of times a thing did happen divided by the number of times that thing could have happened. Parameters are not random variables in themselves, but constants that we attempt to estimate from the data, and the data can only give a probability that the procedure's bounds capture the parameter. You assume that you have all of the data -- if not, then why are you fooling around? And there is no fussing around with a prior belief. The data doesn't update a prior belief; it just is.
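The same coin data, viewed the Frequentist way, might be sketched like this (again my own illustration): the bias is a fixed constant, and the data yield a point estimate plus a confidence interval. I use the normal (Wald) approximation here purely for simplicity.

```python
# Frequentist estimation of a proportion: the parameter p is fixed,
# and the interval is a statement about the procedure -- roughly 95%
# of intervals built this way would capture the true p.
import math

def wald_interval(heads, n, z=1.96):
    """Point estimate and ~95% confidence interval (normal approximation)."""
    p_hat = heads / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p_hat
    return p_hat, p_hat - z * se, p_hat + z * se

# Same data as before: 7 heads in 10 flips.
p_hat, lo, hi = wald_interval(heads=7, n=10)
print(round(p_hat, 2), round(lo, 3), round(hi, 3))
```

No prior appears anywhere; the only inputs are the counts, and the randomness lives in the interval, not in the parameter.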
The statistics that I'm being taught is of the Frequentist variety, although there have been brief excursions into Bayesian territory. The Frequentist posture seems to be one of benign superiority -- we are doing math; the Bayesians are, for the most part, doing computer science. The Bayesians see the Frequentists as old-fashioned fuddy-duddies who have yet to get a clue.
Where am I? Well, I'm mostly pragmatic. I'm learning the Frequentist theory, and I appreciate its power and understand why it is used everywhere. I also understand, I think, the Bayesian point of view and its power, especially given the prevalence of data and computing. I also appreciate that Bayesianism offers a theory of knowledge that Frequentism lacks. I am looking forward to learning more about the Bayesian techniques and understanding some of the edge cases where one technique seems to dominate the other.