## Archive for October, 2012

### STAT 330 October 30, 2012

October 28, 2012

I am hoping and expecting that Sandy will not prevent me from making class on Tuesday, but it all depends on how bad the storm will be. If I cannot make it I will tweet at bayesrulez, and send email (assuming that we have power!) and if that doesn’t work I will try to contact the department and get a message posted on the blackboard.

UPDATE: I am in Burlington and there will be class today.

Here is the next set of charts, on Hierarchical Bayes models.

Nate Silver, whom I have mentioned before as a Bayesian who tries to predict election outcomes (and more…) has a new book.

Andrew Gelman has an op-ed in the NY Times today on how to interpret the probabilities that Nate (and others) are calculating. And here is another similar discussion from today’s Salon.com.

Today we talked about Maximum Entropy priors; I explained how mathematical entropy can be used to quantify the amount of information that we stand to gain by learning which of a number of states happens to be the case, when all we have is a probability distribution on those states. We would have maximum uncertainty, that is to say, minimum amount of prior information, by maximizing the entropy of a distribution, subject to constraints that reflect what we do know. That maximization is accomplished using Lagrange multipliers. In the case of a continuous distribution we must also use the calculus of variations. I gave several examples of how to do this.

We then took up Jeffreys priors, introduced by the statistician Harold Jeffreys. The Jeffreys prior is the square root of the determinant of the Fisher information of the likelihood function. It has the advantage that if you transform the parameters of a problem, the Jeffreys prior in the new coordinates is the prior that will give the same results as the Jeffreys prior in the original parameter set would give. So you can decide which parameters are most convenient, and then just calculate and use the Jeffreys prior in those coordinates (if you have decided that the Jeffreys prior is the right one for the problem).

Next time I will give you some examples; then we will proceed with the next chart set.

### STAT 330 October 25, 2012

October 25, 2012

Here is the next problem set, due on Thursday, November 1.

We discussed the Errors-In-Variables case and showed how to sample this in a Gibbs sampler exactly.

We then discussed priors. I noted that if you have substantial information about the parameters of a problem you should NOT use the automatic priors we’re discussing here, you should use a prior that makes use of that information. Not to use information that you have is not a good idea.

But in many cases you do not have such information, and in these cases you can use one of these “automatic” priors with integrity.

The first example we discussed was if there is some sort of group invariance, like dice or coins (where you cannot distinguish a priori between various states, so that a prior that does not inject unknown information would be equal on all cases). Or we may think that the origin of our coordinate system should not make any difference about our results, in which case a reasonable prior would not respect translations of the origin, and would be constant. We also discussed the case where there is a scale variable, and if we think that it should not matter what units we use to describe that variable, then we should choose a prior that looks like the reciprocal of the scale variable, i.e., if s is the scale, then use prior 1/s; or transform to variable u=log(s), and in u the prior would be constant. (This is a good idea just for the improvements that sampling on u would provide).

I also discussed the situation that can present itself with some situations, where the prior generated by a left-invariant Haar prior can be different from those generated by a right-invariant Haar prior. Jim Berger strongly advocates the right-invariant choice.

### STAT 330 October 23, 2012

October 21, 2012

Here is the next chart set, on priors.

We discussed the highly correlated case that we discussed earlier, and gave a better solution, using exact simulation via a Cholesky decomposition of the matrix that describes the problem.

We briefly noted how somewhat similar problems could be attacked using proposal distributions that have fatter tails than a normal, for example a “t” distribution with a low degree-of-freedom, with a description matrix derived from the corresponding Laplace approximation.

We then discussed matrix methods for studying multivariate linear models.

Then, we looked at how the conjugate priors for a normal linear model can be used to produce a model in that class of models. The means end up being an “average” of what the data tells us and what the prior information tells us.

Finally, we talked about “posterior prediction”. The idea here is that we want to hit some target (e.g., sending a rocket to Mars), and want to know the prediction of where the rocket will end up. Or even, what the observations that we will observe will look like. We talked about the Dirac “delta function” and its uses under an integral sign.

### STAT 330 October 18, 2012

October 19, 2012

We finished looking at fitting straight lines using centered variables. However, I showed you that although the centered variables are uncorrelated, they are not independent. We mentioned some generalizations of linear regression: the heteroskedastic case, where the variances if different observations are different, error in x instead of y (which we solved by introducing latent variables and eliminating them explicitly), the errors-in-variables case where both x and y have errors, again solved by introducing latent variables, but this time sampling on the latent variables since we can’t use the trick of eliminating them explicitly. We ran some code to do this. I’ll be posting this later. I mentioned that you have to do something special to avoid $\sigma_x$ and $\sigma_y$ from being unidentified (and thus having a posterior distribution that can’t be normalized). I gave the example of $p(x,y)\propto exp(-(x-y)^2)$, for which the marginal distributions exist and are perfectly good, whereas there is no joint distribution.

We finished up by looking at a case where two variables are highly correlated and noting that sampling in the simplistic way we have been doing it doesn’t mix very well. For some reason there were several errors in the programs on the charts, and with help from you all we were able to fix the code and get it to run.

### STAT 330 October 16, 2012

October 16, 2012

I’ll get to the regular blog later, but I just looked at Steve Strogatz’s article in the New York Times today. Click here to get it. Then go down to the movie near the end of the article (but before the comments). It will take you about nine minutes to watch the whole thing, and (since this came up in our discussion today) give you a better idea of the scale of the universe. Please note, the visible universe is more than 100 times larger than the last frame showing galaxies, and we observe galaxies that are 50 times farther away than that last frame.

If you prefer to go directly to the movie without reading the Strogatz article, click here instead (although you’ll miss an interesting article).

Speaking of astronomy, here’s an article about false discoveries, and how some astronomers ethically admit mistakes, and others do not. This sort of thing happens in any field that depends on statistics (and many that do not). If you are an aspiring statistician, you should be aware of the comment that Richard Feynman made, and which was quoted in this article: “The first principle is that you should not fool yourself, and you are the easiest person to fool.”

We started out by looking at the problem of estimating the time that the leading edge of a burst of neutrinos from a supernova would have arrived, from the arrival times of the neutrinos that were detected (about two dozen of them). We looked at a 90% frequentist confidence interval based on three data points, and discovered that the time could not be in that confidence interval, which did not include the earliest neutrino. Although this confidence interval was based on an unbiased estimator, it was not based on a sufficient statistic. We looked at a Bayesian credible interval, which is not only easier to compute than the confidence interval, but also manifestly does not have the problem that the confidence interval did.

We then turned to estimation based on the normal distribution. The take-away is as follows: When not estimating the standard deviation of the errors $\sigma$ on the data, the posterior on $\mu$ is normal, but when estimating $\sigma$, the posterior is a t distribution, with a number of degrees of freedom equal to the number of data points minus the number of parameters being estimated. We saw that centered variables simplified the solution of basic linear regression of a straight line with slope and intercept by producing a posterior distribution in which the parameters are independent.

### STAT 330 October 11, 2012

October 10, 2012

Here is the next problem set, due on Thursday, October 18.

Here is the next chart set, on Normal Linear Models.

If you are interested in more about Nate Silver, whom I mentioned a few days ago, he was interviewed on the NPR show “Fresh Air” yesterday. You can listen to the interview here.

NEW: Here’s an interesting article about Bayes and children, who may start out as natural Bayesian processors. It mentions Nate Silver.

Today we looked at the simulations for the normal inference problem we discussed last time. I fixed the code by removing the offending invisible character (I’ve posted the code already). We looked at two versions, one of which sampled on $\mu$ and $\sigma$ separately, and the other that sampled them simultaneously. I pointed out that the second example might not sample quite as well with the same values of a and b, but might be competitive if they were somewhat smaller; I also noted that the second example evaluates the likelihood function half as many times as the first solution. This could be important in cases where the likelihood is expensive to compute.

We then resumed discussion of the Poisson distribution. We noted that in realistic cases, we may have to evaluate both a signal (the brightness of a star, for example) and a noisy background (due to a collection of very faint and more distant stars near the one we are studying, for example). The straightforward method would be to measure star+background=s+b=u, and then move the telescope off of the star to measure background=b by itself. Then we would subtract b from u to get s. If the signals are large, this works well; but if they are small it is statistically possible to measure b>u which would result in s<0, which is unphysical.

We developed a method for doing this, and simulated it by sampling.

Here is a link to the code for sampling in this problem.

### STAT 330 October 9, 2012

October 8, 2012

Here is a link to the next set of charts, on Poisson inference.

Here are links to the two code files I mentioned today, MCMC4.R and MCMC5.R. The problem with the first file was that a stray invisible character had somehow inserted itself into the file. I eliminated it and the code ran just fine. We’ll discuss this on Thursday.

Here is a link to a download of a ten-minute discussion with Nate Silver, the statistician who is using Bayesian methods to make forecasts of elections on his blog, fivethirtyeight. He discusses the basic ideas of what he’s doing.

Today we discussed more on MCMC simulation and looked at some of the things that have to be taken into account when programming a sampler using Metropolis-Hastings, Gibbs sampling, or hybrids such as Metropolis-Within-Gibbs. I then attempted to run the MCMC4.R file, which failed to run because as I later discovered it had acquired an invisible character somehow that compromised its integrity. I simply redid that bit of the file and now it runs fine. I did not change the code, that is, the surmise that we needed an explicit “return” statement was not true.

We then turned to Poisson inference, which describes situations in which events happen (stars, arrivals at the ER, radioactive decays, photon counting) at some rate, and we want to understand the counting process. I motivated the discussion by talking about filling a bowl with oatmeal that had some raisins in it. We briefly discussed priors and some other issues before class let out.

### STAT 330 October 4, 2012

October 5, 2012

Here is the next problem set, due on Thursday, October 11.

Today we spent some time discussing Markov chains and stationary distributions. It is our goal to simulate a sample by constructing our Markov chain so that we approach the stationary distribution (hopefully rapidly).

We then looked at a simple Metropolis-Hastings MCMC sampler for generating a sample from a standard normal distribution. The code is in the charts. We saw how varying the proposal distribution’s width can affect the samples: Too narrow, and although the acceptance rated is high, it takes a long time for the sampler to get anywhere because successive samples are highly correlated; too broad and the acceptance rate goes way down to the point where the sampler hardly moves at all. Something in the middle works best. We also saw that if we start very far away from the peak of the distribution, it takes a while for the program, which is “hill-climbing”, to get to where there is a significant amount of probability. This is why people frequently discard the first part of the Markov chain as “burn-in”

### STAT 330 October 2, 2012

October 3, 2012

We discussed the problem set. I said that in the future I want one paper per group; if there are disagreements within the group, they should be discussed in the paper.

On Problem 2, I noted that some groups were using set theory to try to solve it. That was not my intention. I wanted you to use just the axioms of probability theory as we discussed in class. Probability theory is a calculus on propositions A, B, …; a proposition such as A=”the Nile is longer than 1000 miles” isn’t a set. There are similarities between set theory and probability theory, but you should not confuse the two.

On Problem 3, the presentation is simplest if you ignore the actual numbers on the dice and just note whether the outcome is odd or even.

On Problem 4, the commonest problem was thinking that P(y|x=1) is a number. It is an array, because y is a vector with three components, and each component needs a probability. What many teams computed was P(x=1).

On Problem 5, most groups did find, but some didn’t note that the first two probabilities I gave were conditional probabilities, but the third was a joint probability. The trick is to figure out how to compute the conditional probability corresponding to the third number (0.05). This fills out the first row for independence. Then the second row is a constant multiple of the first row, and to get a table for dependence, just alter the numbers in the second row to make everything add up to 1.

On Problem 6, the main problem was that some teams didn’t give enough detail or explanation of how they reached the result or would explain it to a member of the general population.

We spent the majority of the time on #7. Here there are several issues: Is this sampling with replacement, or without?  My position is that in general this should be considered sampling with replacement, which would make a factor of 1/N for each observation of a taxi. The reason is that if one had independent repeats of an observation of a taxi (as for example, observing taxi #3 the next morning) then that would reinforce our opinion that the number of taxis was small, as against the next morning observing (at random) taxi number 89. Just because we observed taxi #3 in the evening doesn’t mean that we can’t use our next-morning observation as new evidence. There are some subtleties, as we can’t follow a taxi down the road and glance away and glance back and think that’s another independent observation. They aren’t independent, and so you can’t just multiply!

Issue number two was the prior. N is the number of taxis in the city, and you are more likely to be parachuted into a small city than a large one (because there are more small cities than large ones); so the prior P(N) should be smaller for large N than for small ones. It’s not certain what one should use, but several groups used P(N)=1/N (improper prior but OK if the posterior is proper). More generally one might want to consider something like a power law which would look like $P(N)\propto N^{-\alpha}$ for $\alpha$ positive. There could be other choices.

Issue number 3 was minor, but just to note that not every 5% credible interval should look like a [0.025, 0.975] interval. That is because in this particular case, the largest amount of posterior probability is in the [0, 0.025] interval. The whole point of a credible interval is to tell your readers where most of the posterior probability lies. It would be perverse, if this is your goal, to exclude the maximum of the posterior probability!

We spent the rest of the period looking at the example of normal data with unknown mean and variance. We did it with simulation, and saw that as we increased the sample size, we did not come closer and closer to the mean and variance from which the data were drawn; the data are fixed, and all increasing the sample size does is to increase the precision to which we can squeeze the information out of our data, but getting better credible intervals, posterior means, variances, and so forth. But the histograms of the posterior probabilities get smoother and smoother as our simulation sample size gets bigger and bigger.