BayesRules

STAT 330 November 29, 2012

bayesrules — Sat, 01 Dec 2012 02:29:43 +0000

We finished off the discussion of model selection/averaging.

We then went into missing data models. In some cases the fact that data are not observed has no effect on the inference whatsoever. In other cases, the effects can be profound. Sometimes (as in Example 2) the effect only appears through a prior that connects the probability of selection to the parameter being estimated. Censoring (where you know that an observation has been made, but the answer is only that the observation is not within the range of the measurement process) and truncation (where you are only given the successful measurements and do not know how many of the measurements were unsuccessful, as in a survey of voters where the voters who hung up the phone are not reported) are examples of situations where missing data affect the likelihood function and hence the inference.
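To make the effect of censoring on the likelihood concrete, here is a minimal sketch. It is not one of the examples from the chart set: it assumes exponentially distributed lifetimes with right-censoring, and the data values and function names are hypothetical. Fully observed values contribute the density to the likelihood, while censored values contribute only the probability of exceeding the censoring point.

```python
import math

def censored_exponential_log_likelihood(rate, observed, censored_at):
    """Log-likelihood for exponential lifetimes with right-censoring (assumed model)."""
    # Fully observed lifetimes contribute the log density: log(rate) - rate * t.
    ll = sum(math.log(rate) - rate * t for t in observed)
    # Censored lifetimes contribute only the log survival probability: -rate * c.
    ll += sum(-rate * c for c in censored_at)
    return ll

observed = [0.8, 1.3, 2.1, 0.4]   # hypothetical measured lifetimes
censored_at = [3.0, 3.0]          # hypothetical: two units still running at t = 3

# The rate that maximizes this likelihood is (number of observed events) / (total exposure).
total_time = sum(observed) + sum(censored_at)
rate_mle = len(observed) / total_time

# Ignoring the censored units altogether gives a different (larger) rate estimate.
naive_rate = len(observed) / sum(observed)

print(rate_mle, naive_rate)
```

Dropping the censored observations would bias the rate upward here, which is one way to see that the missing data cannot simply be ignored in this kind of problem.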

We discussed a general approach to problems of this sort, where the likelihood is explicitly written to include both the observed and the missing data. The missing data are considered as latent variables, which thus require a prior and should be marginalized out. We verified that (in Example 4) this approach gives the same results as we got earlier.

In a sampling scheme, we sample on all parameters, including the latent variables that represent the missing data. We started on discussing a sampling scheme for Example 4, and will finish it next time.

STAT 330 November 27, 2012

bayesrules — Mon, 26 Nov 2012 21:43:43 +0000

The next chart set is on Model Selection and Model Averaging.

The chart set about missing data can be found here.

We started by discussing how our prior belief in a hypothesis can affect our approach to the data, e.g., guessing three cards right if the subject is a child who looks at the face of the cards, or a magician who looks only at the backs of the cards, is a priori more believable than that the subject is a real psychic. We looked at the Soal-Bateman example, where it was eventually shown that the amazing results were due to cheating. I discussed a general framework whereby very unlikely data can raise a more plausible hypothesis (cheating) over a less plausible one (psychic powers are real) when we take care to include all possible hypotheses in the analysis. A student asked, how can we do this? In fact we can never consider all possible hypotheses, there are just too many of them. The best we can do is to consider all hypotheses that might be plausible enough to have their posterior probabilities raised to significance. I finished this chart set by pointing out that just doing hypothesis tests isn’t really how decisions should be made. Really we should be using decision theory, and it turns out that essentially all admissible decision rules in classical (frequentist) decision theory are Bayes rules. Decision theory looks not only at the probabilities, but also at the costs of the decisions we make.

We then turned to model selection and model averaging. Model selection is an obvious extension of the hypothesis testing situation, except that instead of having just two models we have multiple models. The trickiest part of model selection (and averaging) is assigning priors on the parameters of the different models since if they are too vague, they will artificially favor models with fewer parameters. Unlike likelihood ratio tests, the Bayesian model selection idea can be used on non-nested models.

I also discussed approximate methods for model selection, the AIC (Akaike Information Criterion) and Schwarz’s BIC (Bayesian Information Criterion, which is not particularly Bayesian since it ignores the priors on the model parameters). BIC penalizes larger models more than AIC. Closer looks at BIC show that it reduces to the usual asymptotic form for hypothesis testing of a simple vs. a complex hypothesis.
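As a small illustration of how the two criteria differ, the sketch below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of parameters, n the number of observations and L the maximized likelihood. The two models and their log-likelihood values are hypothetical numbers chosen only to show the penalties at work.

```python
import math

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L (smaller is better)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Schwarz's criterion: k ln n - 2 ln L (smaller is better)."""
    return k * math.log(n) - 2 * log_likelihood

n = 100  # hypothetical number of observations

# Hypothetical maximized log-likelihoods for a small and a large model.
small = {"log_likelihood": -120.0, "k": 2}
large = {"log_likelihood": -116.0, "k": 5}

aic_small = aic(small["log_likelihood"], small["k"])
aic_large = aic(large["log_likelihood"], large["k"])
bic_small = bic(small["log_likelihood"], small["k"], n)
bic_large = bic(large["log_likelihood"], large["k"], n)

print("AIC:", aic_small, aic_large)
print("BIC:", bic_small, bic_large)
```

With these made-up numbers AIC prefers the larger model while BIC prefers the smaller one, because the BIC penalty per parameter, ln n, exceeds the AIC penalty of 2 once n is 8 or more.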

I demonstrated the Zellner g-prior idea for linear models.

Model averaging is related but different. In model selection we average (marginalize) with respect to all the parameters and look at the posterior probabilities of the models. In model averaging there is a parameter that is common to all models that we are interested in, and we marginalize with respect to all the other model parameters and with respect to the models, so that all models contribute to the estimate of the parameter that we are interested in, in proportion to the posterior probabilities of the models themselves.
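In symbols, the distinction can be sketched as follows. The notation is my own shorthand rather than that of the chart set: $D$ is the data, $M_k$ the models, $\theta$ the parameter of interest common to all models, and $\phi_k$ the remaining parameters of model $M_k$.

\[
p(M_k \mid D) \propto p(M_k) \int p(D \mid \theta, \phi_k, M_k)\, p(\theta, \phi_k \mid M_k)\, d\theta\, d\phi_k
\]

\[
p(\theta \mid D) = \sum_k p(\theta \mid M_k, D)\, p(M_k \mid D),
\qquad
p(\theta \mid M_k, D) \propto \int p(D \mid \theta, \phi_k, M_k)\, p(\theta, \phi_k \mid M_k)\, d\phi_k
\]

The first line is model selection: everything is integrated out and the models are compared. The second line is model averaging: each model's posterior for $\theta$ is weighted by that model's posterior probability.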

Finally, we looked at normal linear models. We saw that the prior on cannot be chosen to be a Jeffreys prior, because the resulting integral blows up at infinity.

We finished by looking at Gull’s approximation and a polynomial example that he presented. I noted that in his Figure 8, the y-axis is the log of the posterior probability, so it actually decreases much more rapidly after the peak is reached at N=10 than the figure seems to indicate.

STAT 330 (Vacation special)

bayesrules — Sun, 18 Nov 2012 23:14:20 +0000

I was pointed to this discussion on Andrew Gelman’s blog. “The sample size is huge, so a p-value of 0.007 is not that impressive”. It reinforces the lesson of the past two lectures. The comments contain links to this paper and this paper that I mentioned in class.

STAT 330 November 15, 2012

bayesrules — Fri, 16 Nov 2012 01:06:07 +0000

I started out discussing the cartoon that I linked to a few days ago. I pointed out that both the sensitivity and specificity of the test in the cartoon were very high (35/36 or about 97.2%). Nonetheless, the cartoon test is rather silly, and it reinforces the idea that frequentist tests only talk about what happens if you repeat them many times. The Bayesian probably knows (background information) that it is physically impossible for the Sun to go nova (it will die in an entirely different fashion, since its mass is too small), and even if it were possible, the bet is an entirely safe one since if the Sun had gone nova, no one would be around to collect the bet!
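The Bayesian side of the cartoon can be sketched numerically. The 1/36 error rate comes from the figure quoted above; the prior probability used below is a made-up small number for illustration only, not a physical estimate, and the function name is hypothetical.

```python
def posterior_nova(prior, p_error=1 / 36):
    """Posterior probability that the Sun went nova, given the detector says 'yes'.

    Assumes the detector reports the wrong answer with probability p_error,
    whether or not the Sun has gone nova.
    """
    p_yes_given_nova = 1 - p_error       # sensitivity, 35/36
    p_yes_given_no_nova = p_error        # false-positive rate, 1/36
    numerator = p_yes_given_nova * prior
    return numerator / (numerator + p_yes_given_no_nova * (1 - prior))

# Hypothetical tiny prior: even a very accurate detector barely moves it.
tiny_prior_posterior = posterior_nova(1e-9)

# With a 50:50 prior the posterior equals the sensitivity, 35/36.
even_prior_posterior = posterior_nova(0.5)

print(tiny_prior_posterior, even_prior_posterior)
```

The point of the sketch is only that the posterior is driven by the prior when the prior is extremely small, however good the test's sensitivity and specificity are.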

I then showed a cartoon that Larry Wasserman put on his blog. Larry’s point here is that (under most circumstances) Bayesian credible intervals don’t say anything about frequentist coverage. There are no coverage guarantees. It is true that under some special circumstances, such as the Berger-Mossman example that you calculated for an assignment, it is possible for a Bayesian credible interval to have good frequentist coverage; but in this example, it was by design, and happened because Berger and Mossman used a standard objective prior. These objective priors probably will give decent coverage in most situations (but it should be checked if coverage is important to you), just as they usually give similar results in parameter-estimation problems (e.g., regression). But in general, informative priors will not necessarily have these properties.

We returned to the perihelion motion of Mercury. The bottom line here is that the “fudge factor” theory F spreads its bets over a large area of outcomes. It’s got to match the actual outcome, but it wastes prior probability on outcomes that do not pan out. On the other hand, Einstein’s theory E makes a very sharp and risky prediction. And, since the data lie close to that prediction, it wins big time, just as when a gambler bets all his chips on one outcome and that outcome is the one that happens.

I noted Berger’s “objective” prior that is symmetric about “no effect” and decreases monotonically away from “no effect”. It doesn’t support Einstein quite as much, but it provides an objective lower bound on the evidence for E.

Even if you put all the prior probability under F on the alternative hypothesis, you get probabilities that are significantly higher than the corresponding p-values. So p-values overestimate the evidence against the null.

Another danger is that the likelihood ratio (Bayes factor) in favor of the simpler hypothesis will increase proportionally to $\sqrt{n}$, where $n$ is the sample size, so the larger the data set (for a given p-value), the more strongly the null hypothesis will be supported. Jack Good suggested a way to convert p-values to Bayes factors and posterior probabilities that, as we calculated, does a pretty good job (but it is approximate).
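The growth of the Bayes factor with sample size at a fixed p-value can be checked in a simple setting. The sketch below is not the calculation from class: it assumes normal data with known standard deviation `sigma`, a point null at zero, and a normal prior with standard deviation `tau` on the mean under the alternative. Under those assumptions the Bayes factor in favor of the null has a closed form in terms of the z-score.

```python
import math

def bayes_factor_null(z, n, tau=1.0, sigma=1.0):
    """Bayes factor for H0: mean = 0 against H1: mean ~ N(0, tau^2).

    Assumes n observations from N(mean, sigma^2) whose sample mean has z-score z.
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))

z = 2.576  # two-sided p-value of about 0.01 in every case below

bf_small_n = bayes_factor_null(z, 10)
bf_large_n = bayes_factor_null(z, 1_000_000)
bf_larger_n = bayes_factor_null(z, 4_000_000)

print(bf_small_n, bf_large_n, bf_larger_n)
```

For small samples the same z-score counts against the null, while for very large samples it supports the null, and quadrupling the sample size roughly doubles the Bayes factor.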

This led to a discussion of the Jeffreys-Lindley “paradox”, whereby you can have data that simultaneously give strong evidence in favor of the null hypothesis and a very small p-value that would reject it. I gave a real-life example that I wrote a paper on, from some parapsychology research.

Finally I discussed sampling to a foregone conclusion and the Stopping Rule Principle. If you are doing frequentist analysis, you are not supposed to watch the data and stop when the data give you a small enough p-value. Frequentist theory disallows this (but people do it a lot, and the parapsychologists did it in a huge way). The good news is that Bayesian analysis does not have this defect. This means that ethical problems using frequentist principles can be avoided by using Bayesian methods. The notes discuss this.
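Sampling to a foregone conclusion is easy to simulate. In the sketch below (my own illustration, with hypothetical function names and settings) the null hypothesis is true, the data are standard normal, and the experimenter checks the z-statistic after every observation from the tenth onward, stopping as soon as it exceeds 1.96 in absolute value.

```python
import math
import random

def stops_with_rejection(max_n, rng, z_crit=1.96, min_n=10):
    """Simulate one experiment under H0 with optional stopping.

    Returns True if |z| exceeds z_crit at any look between min_n and max_n.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)          # data generated under the null
        if n >= min_n and abs(total) / math.sqrt(n) > z_crit:
            return True                        # stop and declare "significance"
    return False

def rejection_rate(runs=400, max_n=500, seed=0):
    """Fraction of simulated experiments that reach 'significance' at some look."""
    rng = random.Random(seed)
    hits = sum(stops_with_rejection(max_n, rng) for _ in range(runs))
    return hits / runs

rate = rejection_rate()
print(rate)
```

A fixed-sample test at this threshold rejects a true null about 5% of the time; with repeated looks the rate is much higher, which is why frequentist theory forbids this kind of stopping.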

STAT 330 November 14, 2012

bayesrules — Wed, 14 Nov 2012 01:20:49 +0000

We spent the first part of the period looking at this code example. In this example we are studying the evidence that a coin is fair or not, given that we observed 60 heads and 40 tails. The code illustrates with this simple example the idea of Reversible Jump MCMC, where we propose simultaneously a new model and a value of the parameter (which is 0.5 if the coin is fair, but uniformly distributed on (0,1) if the coin is not fair, in this – highly unrealistic – example). The code allows you to use one of three proposals for the parameter in the unfair case – the exact beta distribution on 60 heads and 40 tails, a normal approximation, and a flat distribution. I pointed out that the first two ought to do pretty well, but that the flat distribution will often propose the parameter out in the tails of the distribution where the posterior probability is low, in which case the proposal is likely to be rejected. We ran the program and found that these predictions were borne out.
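For this particular example the answer that the sampler is estimating can also be computed exactly, which gives a useful check. The sketch below is not the class program; it simply evaluates the two marginal likelihoods directly, assuming equal prior probabilities for the two models.

```python
from math import comb

heads, tails = 60, 40
n = heads + tails

# Fair coin: binomial probability of the data with the parameter fixed at 0.5.
like_fair = comb(n, heads) * 0.5**n

# Unfair coin with a uniform prior on (0, 1): integrating the binomial
# probability over the parameter gives 1 / (n + 1), whatever the number of heads.
like_unfair = 1 / (n + 1)

# Bayes factor in favor of the fair coin, and its posterior probability
# under equal prior probabilities for the two models (an assumption).
bayes_factor_fair = like_fair / like_unfair
posterior_fair = bayes_factor_fair / (1 + bayes_factor_fair)

print(bayes_factor_fair, posterior_fair)
```

The Bayes factor comes out close to 1, so the posterior probability of the fair coin is a little above one half; a correctly working reversible jump sampler should spend about that fraction of its time in the fair-coin model.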

We spent the rest of the class looking at various aspects of the Ockham’s razor idea, that one should choose models that are as simple as possible but which still fit the data adequately. Too complex a model is likely to follow “noise” in the data, and too simple a model will not be adequate to predict the data. We looked at it from the point of view of the idea that a simple hypothesis predicts fewer outcomes than a complex one can. We saw how this worked in the case of an alleged planet that had been announced around a pulsar; of proving plagiarism by encoding errors or other unique information into maps, mathematical tables, etc.; and how it might be used to detect cheating on multiple choice tests. We also saw how the hypothesis of copying DNA from ancestors provides evidence for evolution, e.g., pseudogenes that used to code for vitamin C production have the same defects in humans and chimpanzees, indicating descent from a common ancestor, and the redundancy of the genetic code provides independent evidence as well (there are 64 combinations of three bases in the genetic code, but only 20 amino acids are coded for).

I ended by describing Mercury’s perihelion motion. We will finish this example next time.

STAT 330 November 8, 2012

bayesrules — Fri, 09 Nov 2012 00:37:06 +0000

Here is the link to the VPN client mentioned in class today, that allows you to connect to websites as if you were on campus.

Here are the links to the papers I mentioned this morning. First the paper by Berger and Delampady. Next, the paper by Berger and Sellke. And finally, the link to Jim Berger’s website with the Java Applet that allows you to try the thought experiment I discussed in class. The URLs for the first two have changed as jstor.org has changed their method of assigning stable URLs.

Finally, here is a link to the paper by Dellaportas, Forster and Ntzoufras, on reversible jump MCMC.

It appears that Romney has conceded Florida. This comment to today’s Nate Silver blog is very cool (difference between mean and mode).

Here’s a cartoon about frequentism vs. Bayesianism. Don’t take it too seriously. If you mouse over the picture, a hidden message appears.

I pointed out some shortcomings of classical hypothesis tests and p-values. I then outlined how Bayesian tests might be conducted, first in the context of two simple hypotheses and then in the context of one simple and one complex hypothesis. In the latter case there is an additional parameter which requires a prior. Then to consider just the two hypotheses we have to marginalize the posterior probability on the complex hypothesis with respect to that parameter.

I pointed out that the results will depend sensitively on the prior, which means that Bayesian hypothesis tests must be conducted with great care. There are some results that are more robust with respect to the prior, and we will discuss them in subsequent classes. I showed an example (biased coin) and demonstrated that the results of a Bayesian hypothesis test can be very different from frequentist ones.

I briefly outlined how reversible jump MCMC can be used to evaluate the posterior probabilities of hypotheses. I’ll show you a program on Tuesday to make this more concrete. I mentioned that the same ideas can be used to compare multiple models with varying numbers of parameters.

I discussed some other problems with p-values, in particular that they overstate the evidence against the null. I pointed to Jim Berger’s website for the Java applet (link above).

I finished with the beginning of a discussion of philosophical issues that relate to Bayesian epistemology.

STAT 330 November 7, 2012

bayesrules — Wed, 07 Nov 2012 13:54:06 +0000

Nate Silver nailed it. See here. Hooray for soberly analyzed statistics.

STAT 330 November 6, 2012

bayesrules — Tue, 06 Nov 2012 21:30:02 +0000

In class I mentioned an article in Scientific American by Efron and Morris on the Stein problem. Here it is! There’s also a Wikipedia article on the Stein problem here.

At the start of class I mentioned an NPR story indicating that more reliable answers to polling questions about elections would be obtained, not by asking whom a person is going to vote for, but by asking who that person thinks will win the election. Here is the story, and here is a link to a related article by the respected pollster Andrew Kohut, of Pew Research.

Here are the notes on Bayesian hypothesis testing, which we started on today.

And here is the next (and final) assignment, due after the holiday break.

I continued the discussion of the Stein problem; we saw how an estimator that dominates the obvious estimator shrinks the estimated batting averages towards the common mean (this was the Efron-Morris estimator). This is very typical of shrinkage estimators, which are a common feature of hierarchical Bayes models. I mentioned that the Efron-Morris estimator (and the James-Stein estimator) are themselves inadmissible, although they are better than the naive estimator. I noted that every admissible decision rule (with exceptions concerning finiteness) is a Bayes rule. This is interesting because admissibility is a frequentist idea, and this observation unites frequentism and Bayesian ideas in the area of decision theory.
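The shrinkage idea can be sketched in a few lines. This is not the Efron-Morris batting-average calculation itself: it is the positive-part James-Stein form that shrinks toward the grand mean, applied to simulated normal observations with known variance. The data, the seed and the function name are all hypothetical.

```python
import random

def james_stein_toward_mean(y, sigma2):
    """Positive-part James-Stein estimates, shrinking toward the grand mean.

    Assumes each y[i] is a single normal observation with known variance sigma2,
    and that there are more than three observations.
    """
    k = len(y)
    ybar = sum(y) / k
    spread = sum((yi - ybar) ** 2 for yi in y)
    # Shrinkage factor between 0 (collapse to the mean) and 1 (no shrinkage).
    shrink = max(0.0, 1.0 - (k - 3) * sigma2 / spread)
    return [ybar + shrink * (yi - ybar) for yi in y]

rng = random.Random(1)
true_means = [rng.gauss(0.0, 1.0) for _ in range(18)]   # hypothetical abilities
y = [rng.gauss(m, 1.0) for m in true_means]             # one noisy observation each

estimates = james_stein_toward_mean(y, sigma2=1.0)

print(min(y), max(y))
print(min(estimates), max(estimates))
```

The most extreme observations are pulled in toward the common mean, which is the behavior described above for the players with the best and worst early-season averages.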

I then discussed several examples; one was a normal model analogous to the binomial model that we discussed the other day for the baseball batting averages. I also discussed an oil well logging problem that one of my Texas students suggested some years ago. I went through writing down the posterior probability, but I did not discuss the sampling strategy (which is in the notes). The important points are two: First, enforcing the condition that by including a factor of in the likelihood; and second, using a hierarchical independence Jeffreys prior on and .

Between those two examples I discussed an example that shows how a bunch of independent MCMC calculations can be combined, after the fact, into a single hierarchical model, by using Peter Müller’s “slick trick” of using the samples from the individual calculations to provide the proposals for the hierarchical model.

STAT 330 November 1, 2012

bayesrules — Thu, 01 Nov 2012 22:14:55 +0000

Today we first looked at several examples of Jeffreys priors: first, known variance but unknown mean; second, known mean but unknown variance. The first was flat, the second was the usual $1/\sigma$ prior. We then looked at unknown mean and unknown variance and (with apologies) we finally ground through to get $1/\sigma^2$. Jeffreys didn’t like this (it is what you get for the left-invariant Haar prior, which Jim Berger thinks we should not be using). Instead he favored the “independence Jeffreys prior”, which is flat $\times\ 1/\sigma$.
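For reference, the calculation can be summarized as follows, assuming the normal model is parameterized by $(\mu, \sigma)$; the notation here is my own shorthand. The Fisher information for a single observation is

\[
I(\mu, \sigma) =
\begin{pmatrix}
1/\sigma^2 & 0 \\
0 & 2/\sigma^2
\end{pmatrix},
\qquad
\sqrt{\det I(\mu, \sigma)} = \frac{\sqrt{2}}{\sigma^2} \propto \frac{1}{\sigma^2}.
\]

Treating the parameters one at a time instead gives

\[
\pi(\mu) \propto \sqrt{1/\sigma^2} \propto 1 \quad (\sigma \text{ known}),
\qquad
\pi(\sigma) \propto \sqrt{2/\sigma^2} \propto \frac{1}{\sigma} \quad (\mu \text{ known}),
\]

and the product of these two is the independence Jeffreys prior, $\pi(\mu, \sigma) \propto 1/\sigma$.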

I pointed out that none of these is perfect. There may be no underlying group structure, so those priors may not be useful for some problems. The maximum entropy priors are not invariant under coordinate transformations, meaning that if you work a problem out in one set of coordinates, you may get a result that is incompatible with the working out of the problem in a different set of coordinates. And, since the Jeffreys prior is constructed from the likelihood as a sampling distribution (but the data are integrated out), some think that it is incompatible with the Likelihood Principle.

There are other ideas for constructing priors of this sort.

I again noted that if you have actual prior information, you should use it, and illustrated it with an example from astronomy.

We then turned to hierarchical Bayes models. Here the idea is that we may introduce new parameters that are not in the likelihood via a prior that is conditioned on the new parameters. We looked at an example involving baseball batting averages (trying to predict the end-of-season batting averages based on the results of the first 45 at-bats). I pointed out that because of sampling error, the averages at the extremes might be more extreme than they really should be, so that the player with the best batting average after 45 tries might just have been lucky, whereas the one with the worst batting average might have just been unlucky. There are differences in the ability of players, to be sure, but the first few at-bats are also affected by sampling error. So we modeled the individual players as a binomial with a probability that is unique to the player, but assumed that the individual probabilities are drawn from a distribution that represents the varying abilities of all players (modeled as a beta distribution). I demonstrated a program that calculates this. We’ll take this up again on Tuesday.

I finished with a short discussion of admissibility in frequentist decision theory.

STAT 330 October 30, 2012

bayesrules — Sun, 28 Oct 2012 20:30:22 +0000

I am hoping and expecting that Sandy will not prevent me from making class on Tuesday, but it all depends on how bad the storm will be. If I cannot make it I will tweet at bayesrulez, and send email (assuming that we have power!) and if that doesn’t work I will try to contact the department and get a message posted on the blackboard.

UPDATE: I am in Burlington and there will be class today.

Here is the next set of charts, on Hierarchical Bayes models.

Nate Silver, whom I have mentioned before as a Bayesian who tries to predict election outcomes (and more…) has a new book.

Andrew Gelman has an op-ed in the NY Times today on how to interpret the probabilities that Nate (and others) are calculating. And here is another similar discussion from today’s Salon.com.

Today we talked about Maximum Entropy priors; I explained how mathematical entropy can be used to quantify the amount of information that we stand to gain by learning which of a number of states happens to be the case, when all we have is a probability distribution on those states. We would have maximum uncertainty, that is to say, minimum amount of prior information, by maximizing the entropy of a distribution, subject to constraints that reflect what we do know. That maximization is accomplished using Lagrange multipliers. In the case of a continuous distribution we must also use the calculus of variations. I gave several examples of how to do this.
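A discrete example shows the Lagrange-multiplier machinery at work. The sketch below is an illustration of my own choosing, not necessarily one of the examples from class: it finds the maximum entropy distribution on the faces of a die subject to a constraint on the mean. The solution has the form p_i ∝ exp(λ·i), and the multiplier λ is found here by bisection; the function name is hypothetical.

```python
import math

def maxent_die(target_mean, faces=(1, 2, 3, 4, 5, 6), tol=1e-12):
    """Maximum entropy distribution on `faces` with the given mean.

    The constrained maximum has the form p_i proportional to exp(lam * face_i);
    lam (the Lagrange multiplier) is located by bisection on the implied mean.
    """
    def weights(lam):
        return [math.exp(lam * f) for f in faces]

    def mean_for(lam):
        w = weights(lam)
        return sum(f * wi for f, wi in zip(faces, w)) / sum(w)

    lo, hi = -50.0, 50.0            # the implied mean increases with lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid

    w = weights(0.5 * (lo + hi))
    z = sum(w)
    return [wi / z for wi in w]

p_loaded = maxent_die(4.5)    # constraint: the mean is 4.5 instead of 3.5
p_fair = maxent_die(3.5)      # constraint matches a fair die

print(p_loaded)
print(p_fair)
```

When the constraint is the fair-die mean of 3.5 the multiplier is zero and the result is the uniform distribution; a larger mean tilts the probabilities smoothly toward the high faces without adding any other structure.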

We then took up Jeffreys priors, introduced by the statistician Harold Jeffreys. The Jeffreys prior is the square root of the determinant of the Fisher information of the likelihood function. It has the advantage that if you transform the parameters of a problem, the Jeffreys prior in the new coordinates is the prior that will give the same results as the Jeffreys prior in the original parameter set would give. So you can decide which parameters are most convenient, and then just calculate and use the Jeffreys prior in those coordinates (if you have decided that the Jeffreys prior is the right one for the problem).

Next time I will give you some examples; then we will proceed with the next chart set.