Archive for September, 2012

STAT 330 September 27, 2012

September 27, 2012

The next assignment is to do problems from Michael Lavine’s book, “Introduction to Statistical Thought”. Please do problems 2, 28, 29 and 36 at the end of Chapter 1 (starts on p. 80). This is due on Thursday, October 4.

Here is a link to the proof that the average of two standard Cauchys is a standard Cauchy; the proof can be generalized to the average of N Cauchys.

I already mentioned Nate Silver’s blog, where he uses statistics (mostly Bayesian) to predict election outcomes. He was interviewed today by Salon.com; it’s a very interesting interview with lots of insights about sampling and interpretation of data. Nate has just come out with a book based on his experiences, which looks quite interesting.

Today’s lecture looked at MCMC in general and at Gibbs sampling in particular. Gibbs sampling can be used when you can sample from the conditional distribution of each group of parameters given all the other parameters. You start somewhere in the sample space, draw a new value for one group of parameters conditional on the current values of the remaining ones, and replace the old value of that group with the new draw. Then, conditional on this updated set, you draw a new value for the next group of parameters, and you continue in this way until all the parameters have been updated. The result is that we will have jumped to a new place in the sample space, generating a new sample of all the parameters. We then repeat this procedure as many times as we wish to generate a large number of samples.

We illustrated this with a very simple (trivial) example on a finite state space; we then derived a sampling scheme for inference on normally distributed data with unknown mean and variance, and were about to discuss an R program for doing the sampling.
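Here is a minimal sketch of a Gibbs sampler for normally distributed data with unknown mean and variance. It is written in Python rather than R and is not the program from class. It assumes the prior p(mu, sigma^2) proportional to 1/sigma^2, which may differ from the prior we used; the function name `gibbs_normal` and the simulated data are hypothetical.

```python
import random
import statistics

def gibbs_normal(data, n_iter=5000, burn_in=500, seed=1):
    """Gibbs sampler for (mu, sigma^2) of normal data.

    Assumption: prior p(mu, sigma^2) proportional to 1/sigma^2.
    """
    rng = random.Random(seed)
    n = len(data)
    ybar = statistics.fmean(data)
    mu, sigma2 = ybar, statistics.variance(data)  # starting point
    draws = []
    for i in range(n_iter):
        # mu | sigma2, data ~ Normal(ybar, sigma2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # sigma2 | mu, data ~ Inverse-Gamma(n/2, S/2), S = sum (y - mu)^2.
        # Draw the precision 1/sigma2 from Gamma(shape=n/2, scale=2/S).
        s = sum((y - mu) ** 2 for y in data)
        precision = rng.gammavariate(n / 2, 2 / s)
        sigma2 = 1 / precision
        if i >= burn_in:
            draws.append((mu, sigma2))
    return draws

# Hypothetical data: 50 draws from a normal with mean 10 and sd 2.
data_rng = random.Random(0)
data = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

draws = gibbs_normal(data)
mu_mean = statistics.fmean(d[0] for d in draws)
sigma2_mean = statistics.fmean(d[1] for d in draws)
print("posterior mean of mu:", mu_mean)
print("posterior mean of sigma^2:", sigma2_mean)
```

Each pass through the loop updates mu given the current sigma^2, then sigma^2 given the new mu, which is exactly the "sample one group conditional on the rest" step described above.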

STAT 330 September 25, 2012

September 27, 2012

Today’s class centered on likelihoods. First we looked at the Cauchy distribution. I asserted without proof that the average of N (standard) Cauchy-distributed random variables has the same distribution as a single standard Cauchy random variable, so that taking averages doesn’t improve our estimates. The likelihood does, however, get more and more peaked as we get more and more data. We looked at the likelihood for 1, 2, 3 and more Cauchy variables. We noted that for 2 we often get a “two-peaked” likelihood, and with 3 we sometimes have odd bumps on the likelihood. These quiet down as more and more data are obtained.
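The two-peaked behavior is easy to check numerically. The sketch below (Python, with hypothetical data values and helper names) evaluates the Cauchy likelihood for the location parameter on a grid, assuming the scale is known and equal to 1, and counts the local maxima.

```python
import math

def cauchy_likelihood(theta, data):
    """Likelihood of the location theta for Cauchy data with scale 1."""
    return math.prod(1.0 / (math.pi * (1.0 + (x - theta) ** 2)) for x in data)

def count_peaks(data, lo=-10.0, hi=10.0, steps=2001):
    """Count strict interior local maxima of the likelihood on a grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    like = [cauchy_likelihood(t, data) for t in grid]
    return sum(
        1
        for i in range(1, steps - 1)
        if like[i] > like[i - 1] and like[i] > like[i + 1]
    )

peaks_far = count_peaks([-3.0, 3.0])    # two observations far apart
peaks_near = count_peaks([-0.5, 0.5])   # two observations close together
print("peaks (far apart):", peaks_far)
print("peaks (close together):", peaks_near)
```

For two observations with unit scale, setting the derivative of the likelihood to zero shows that two peaks appear exactly when the observations are more than 2 units apart, which is why the widely separated pair is bimodal and the close pair is not.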

I then stated the Likelihood Principle and discussed the fact that it is a consequence of the Sufficiency Principle and the Conditionality Principle, both of which seem unremarkable. Yet the Likelihood Principle is quite controversial, and many frequentist procedures violate it. Bayesian procedures never violate it, because in Bayesian inference, the information about the parameters contained in the data is always in the likelihood, which the Bayesian mantra automatically uses.

STAT 330 September 20, 2012

September 20, 2012

A student asked me to post the hidden charts 36-40 from today’s lecture. Here they are.

I have posted the next Chart Set, on MCMC.

We looked at the problem of measuring the bias of a coin that is not fair, when we have observed a certain number h of heads and t of tails. Again, we need to identify the states of nature. The states of nature are always things that you do not know but would like to know. We know how many heads and tails we observed, so that cannot be the states of nature. What we do not know is the bias b of the coin, so the possible values of b, between 0 and 1, are the states of nature. For a prior we chose one that is uniform in the bias, that is, all values of the bias are equally probable, but we considered the possibility of a triangular or bell-shaped prior as well. For the likelihood, if the tosses are independent, the likelihood is proportional to b^h(1-b)^t. If we know the exact sequence of heads and tails, that is the probability of the particular sequence we observed; if we only know the number of heads and tails, then there is also a binomial coefficient as a factor: C^{h+t}_h. However, as this binomial coefficient is independent of b, the two likelihoods differ only by a constant factor and we can use either one.

Then the Bayesian mantra: posterior is proportional to prior times likelihood. We’d have to normalize by dividing this product by its integral from 0 to 1. That’s the posterior distribution. See Chart #40 in the supplemental chart set you can download above.
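As a rough numerical illustration (a Python sketch, not the code behind the charts), the posterior can be computed on a grid of bias values. The counts h = 7 and t = 3 are hypothetical, and the function name `grid_posterior` is my own. The sketch also checks that including the binomial coefficient makes no difference once we normalize.

```python
import math

def grid_posterior(h, t, n_grid=1001, with_binomial=False):
    """Posterior density of the bias b on a grid, uniform prior on [0, 1]."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    coef = math.comb(h + t, h) if with_binomial else 1
    # prior (= 1 everywhere) times likelihood
    unnorm = [1.0 * coef * b ** h * (1 - b) ** t for b in grid]
    db = 1.0 / (n_grid - 1)
    # trapezoid rule for the integral from 0 to 1
    integral = db * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return grid, [u / integral for u in unnorm]

h, t = 7, 3  # hypothetical data: 7 heads, 3 tails
grid, post_seq = grid_posterior(h, t)
_, post_counts = grid_posterior(h, t, with_binomial=True)

# The constant factor cancels in the normalization.
max_diff = max(abs(p - q) for p, q in zip(post_seq, post_counts))

db = grid[1] - grid[0]
post_mean = db * sum(b * p for b, p in zip(grid, post_seq))
print("largest difference between the two posteriors:", max_diff)
print("posterior mean on the grid:", post_mean)
print("exact mean (h+1)/(h+t+2):", (h + 1) / (h + t + 2))
```

With a uniform prior the exact posterior is a Beta(h+1, t+1) distribution, so the grid mean should agree closely with (h+1)/(h+t+2).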

We talked about various summaries of the posterior. As a point summary, the mean is superior to the mode (Chart #43).

We then showed how to get the results by simulation, and how to summarize the results from the sample drawn from the posterior distribution. As the data get more and more numerous, the results will become better and better.
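A minimal sketch of the simulation approach, in Python rather than the R we used in class: assuming a uniform prior and hypothetical counts h = 7, t = 3, the posterior is Beta(h+1, t+1), so we can draw from it directly and summarize the draws.

```python
import random
import statistics

rng = random.Random(42)
h, t = 7, 3  # hypothetical data: 7 heads, 3 tails

# Uniform prior + binomial likelihood -> posterior is Beta(h + 1, t + 1).
sample = sorted(rng.betavariate(h + 1, t + 1) for _ in range(20000))

post_mean = statistics.fmean(sample)
post_median = statistics.median(sample)
post_sd = statistics.stdev(sample)
# Central 95% credible interval from the sorted draws.
lo = sample[int(0.025 * len(sample))]
hi = sample[int(0.975 * len(sample))]

print("mean:", post_mean)
print("median:", post_median)
print("sd:", post_sd)
print("95% credible interval:", (lo, hi))
```

Any other summary (for example the posterior probability that b exceeds 0.5) can be read off the same sample by counting.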

We also showed how you can use a beta prior with the binomial likelihood to get a beta posterior, thus staying within the family of beta distributions (“conjugate priors”); and how you can use the beta posterior as a prior with a new set of data to get an even more accurate result. We also pointed out that the rules of probability theory (when you explicitly consider the stuff to the right of the conditioning bar) don’t let you “cheat” by using the same data twice!
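The conjugate update is just bookkeeping on the two beta parameters. In this Python sketch the batches of data are hypothetical and the helper names are my own; it shows that updating in two steps gives the same posterior as using all the data at once.

```python
def update_beta(a, b, heads, tails):
    """Beta(a, b) prior + binomial data -> Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

prior = (1, 1)     # uniform prior is Beta(1, 1)
batch1 = (7, 3)    # hypothetical first data set: 7 heads, 3 tails
batch2 = (12, 8)   # hypothetical second data set: 12 heads, 8 tails

post1 = update_beta(*prior, *batch1)   # posterior after the first batch
post2 = update_beta(*post1, *batch2)   # use post1 as the prior for batch 2
post_all = update_beta(
    *prior, batch1[0] + batch2[0], batch1[1] + batch2[1]
)

print("after batch 1:", post1, "variance", beta_var(*post1))
print("after batch 2:", post2, "variance", beta_var(*post2))
print("all data at once:", post_all)
```

Here `post2` and `post_all` are both Beta(20, 12), and the posterior variance shrinks as the second batch is added.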

Finally, I introduced the Cauchy distribution and stated that the mean of N Cauchy-distributed random variables is again Cauchy, from the same distribution. Taking averages of Cauchy-distributed data does not improve things; the average is no better than a single observation. So that’s not how we should approach data that happens to have a Cauchy distribution.
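A quick simulation makes the point. This Python sketch (replicate counts and names are arbitrary choices of mine) compares the interquartile range of single standard Cauchy draws with that of averages of 100 draws. The quartiles of a standard Cauchy are at -1 and +1, so both interquartile ranges should come out near 2.

```python
import math
import random
import statistics

rng = random.Random(7)

def std_cauchy():
    """One standard Cauchy draw via the inverse-CDF method."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr(values):
    """Interquartile range of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

n_rep, n_avg = 5000, 100
singles = [std_cauchy() for _ in range(n_rep)]
means = [
    statistics.fmean(std_cauchy() for _ in range(n_avg))
    for _ in range(n_rep)
]

iqr_single = iqr(singles)
iqr_mean = iqr(means)
print("IQR of single draws:", iqr_single)
print("IQR of averages of", n_avg, "draws:", iqr_mean)
```

For normal data the second number would be ten times smaller than the first; for Cauchy data averaging buys nothing.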

STAT 330 September 18, 2012

September 17, 2012

Here is a link to Problem Set #1, due on September 27.

I remarked in class on Nate Silver’s election blog in the New York Times. He’s using Bayesian methods and simulation to make predictions about the upcoming elections.

In class we first looked at the difference between Bayesian credible intervals and frequentist confidence intervals. The key to understanding this is that in the Bayesian case, the data are considered fixed and known, and the unknown parameter is a random variable that has a probability distribution that is described by the (fixed and known) credible interval computed from the data. On the other hand, in the frequentist case the unknown parameter is considered fixed, and the probability model generates hypothetical data from which we compute hypothetical confidence intervals. A certain fraction of these hypothetical intervals (for example, 95%) will contain the unknown parameter; the remainder (5% in this example) will not. The data we actually observed will generate one of those confidence intervals, the one we observe. But we cannot conclude that the parameter has a 95% probability of lying within this particular confidence interval. The confidence interval definition only refers to the statistical properties of confidence intervals in general, not to the particular one we calculate from the one data set we happen to have observed.
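The frequentist statement can be checked by simulation. In this Python sketch (all numbers hypothetical) the parameter is held fixed, many hypothetical data sets are generated, and we count how often the usual 95% interval for a normal mean with known standard deviation covers the fixed parameter.

```python
import random
import statistics

def coverage(true_mu=5.0, sigma=2.0, n=25, n_rep=4000, seed=3):
    """Fraction of hypothetical 95% confidence intervals containing true_mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # known-sigma interval
    hits = 0
    for _ in range(n_rep):
        # One hypothetical data set and its sample mean.
        ybar = statistics.fmean(rng.gauss(true_mu, sigma) for _ in range(n))
        if ybar - half_width <= true_mu <= ybar + half_width:
            hits += 1
    return hits / n_rep

frac = coverage()
print("fraction of intervals covering the true mean:", frac)
```

The fraction comes out close to 0.95, but that number describes the collection of hypothetical intervals. Any single interval in the loop either contains the fixed parameter or it does not.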

I did not mention this in class, but later in the course I will give an example of a valid frequentist 90% confidence interval that has the property that we know for certain that the parameter in question does not lie in that interval. This cannot happen with Bayesian credible intervals. Sorry, it will take some time to get to this example, as we have some other machinery to develop first.

We talked about estimating functions of a parameter, with examples by simulation. We talked about robustness under prior variation and locally uniform priors, and looked at our voting example under a triangular prior. Using the voting example, we showed how resampling lets us quickly examine the effect of changing the prior. We finished by looking at continuous versus discrete parameters.

STAT 330 September 13, 2012

September 14, 2012

We started by looking at the natural frequencies interpretation of the remark made by Alan Dershowitz, O. J. Simpson’s attorney, and got the same results we got by solving Bayes’ theorem directly. We then proceeded to look at three different problems involving sampling: sampling with replacement (testing parts), sampling without replacement (polling voters) and a catch-and-release problem. In the first of these, the samples are independent, but in the other two they are dependent and care must be taken to ensure a correct likelihood function. We concluded this chart set by seeing how R can be used to draw the posterior distribution of the voting example, and then how the same problem can be solved by sampling and then using the sample to compute credible intervals, means, medians, variances, standard deviations, and so forth.

We continued with the discussion of the next chart set: How to interpret the posterior probability. We discussed Bayesian credible intervals; I also briefly discussed means, medians and modes in terms of loss functions and decision theory.

People asked when the first problem set will be assigned. I expect to hand out a problem set on Tuesday.

STAT 330 September 11, 2012

September 11, 2012

Today we looked at some simple examples of Bayesian inference, in medical situations and legal situations.

Interestingly, the New York Times today has an article on another medical test that they are now recommending not be used because it does not reduce mortality and when positives are found, they are mostly false positives and have significant adverse consequences.
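To see how positives can be mostly false positives, here is a small Python sketch with made-up numbers (1% prevalence, 90% sensitivity, 9% false positive rate; none of these come from the article or from class). It computes the posterior probability of disease given a positive test using Bayes’ theorem, and again using natural frequencies for 10,000 people.

```python
def posterior_disease(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) from Bayes' theorem."""
    p_pos = (sensitivity * prevalence
             + false_positive_rate * (1 - prevalence))
    return sensitivity * prevalence / p_pos

# Hypothetical numbers, for illustration only.
prevalence, sensitivity, false_positive_rate = 0.01, 0.90, 0.09

p_bayes = posterior_disease(prevalence, sensitivity, false_positive_rate)

# Same calculation with natural frequencies for 10,000 people.
population = 10_000
diseased = round(population * prevalence)                            # 100
true_positives = round(diseased * sensitivity)                       # 90
false_positives = round((population - diseased) * false_positive_rate)  # 891
p_freq = true_positives / (true_positives + false_positives)

print("Bayes' theorem:", p_bayes)
print("natural frequencies:", p_freq)
```

With these numbers only 90 of the 981 positive tests, about 9%, come from people who actually have the disease.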

Here is Chart Set 4, “Interpretation

STAT 330 September 6, 2012

September 6, 2012

Today we talked about independence. In connection with the notes, obviously you shouldn’t be dividing by P(B) when P(B)=0. In practice this is less of a problem than it first appears. For example, suppose that B is very, very implausible. I mentioned Russell’s teapot, the statement B that there is a teapot orbiting the Sun out beyond the orbit of Neptune. Now, this supposed teapot almost certainly does not exist, but you cannot say for sure that P(B)=0; maybe some aliens set a teapot going around the Sun 100 million years ago. We simply cannot say with certainty that the teapot isn’t there, only that the probability that it is there is very, very small. About the only time that we can say for sure that the probability is zero is if the proposition is an absurdity, a logical contradiction, such as B=(A&(not-A)).

We had a discussion about medical testing that may not be as valuable as it first seems, such as PSA testing for prostate cancer. This was in connection with the charts that talked about dependence not being equivalent to causality.

I noted that any method of proving independence or dependence is OK; personally when faced with tables such as on Charts 50&51 my preference is to compute the marginals and just verify that the individual joint probabilities are or are not the products of the corresponding marginals.
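The marginals check can be sketched in a few lines of Python. The two joint tables below are made up for illustration (they are not the tables on the charts), and `is_independent` is my own helper name.

```python
def is_independent(joint, tol=1e-9):
    """True if every joint probability equals the product of its marginals."""
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    return all(
        abs(joint[i][j] - row_marg[i] * col_marg[j]) <= tol
        for i in range(len(row_marg))
        for j in range(len(col_marg))
    )

# Hypothetical joint tables; both have row marginals (0.4, 0.6)
# and column marginals (0.3, 0.7).
table_indep = [[0.12, 0.28],
               [0.18, 0.42]]
table_dep = [[0.20, 0.20],
             [0.10, 0.50]]

print("first table independent?", is_independent(table_indep))
print("second table independent?", is_independent(table_dep))
```

The two tables share the same marginals, so the marginals alone cannot tell you about independence; you have to compare each cell with the product.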

I made some remarks about frequentist estimators and showed a simple example in R.

Then we went to the next chart set. We talked about Bayes’ theorem and the Bayesian mantra: Posterior is proportional to prior times likelihood. I mentioned that the word “posterior” engenders lots of Bayesian humor, which comes out in places like the Cabaret that closes typical Bayesian meetings, or the skits and songs that have been written for those performances.

I noted that the likelihood is numerically proportional to the sampling distribution, but that whereas the sampling distribution is a probability that describes hypothetical data given some fixed hypothesis, and is thus a function of the hypothetical data, the likelihood is not a probability, and is a function of the hypotheses given some fixed observed data. The likelihood can be multiplied by a (non-zero) constant and still be a valid likelihood, as the constant will cancel out when we divide by the denominator in Bayes’ theorem. I noted that there are strategies that allow us in many cases to bypass the calculation of the denominator (which is known as the marginal likelihood or the probability of the data).

STAT 330 September 4, 2012

September 5, 2012

Today we looked at some more examples…a more complex version of Bertrand’s Box, the famous Monty Hall problem (see the video that Anna found), examples from my experience with the Hubble Telescope, coin tossing examples, and other examples that illustrated conditionalization and marginalization. We finished with the definition of independence.

I did not mention several other Monty Hall variations. We discussed the standard one, where it pays to switch, and I mentioned a version where Monty randomly opens another door (I call this one “Ignorant Monty” since he doesn’t know where the prize is). There’s also “Angelic Monty”, where Monty only opens a door if you have chosen the wrong door, and shows you that you are a winner if you chose the right door, and “Monty from Hell”, where Monty only opens a door and asks if you want to switch if you have chosen the right door, and opens the door with the prize to show you that you’ve lost if you chose the wrong door. There’s also “Mixture Monty,” where Monty flips a fair coin in advance, and if it is heads, behaves like “Angelic Monty”, and if it is tails, like “Monty from Hell”. Think about these variations. In which ones does it pay to switch if you choose Door #1 and he shows you a goat behind Door #2? In which ones does it not pay?
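If you want to check your reasoning by simulation, here is a Python sketch for the two versions we discussed in class, the standard Monty and “Ignorant Monty”. It assumes you always choose Door #1 and that the standard Monty picks at random when both other doors hide goats. It keeps only the games in which Door #2 was opened and showed a goat, and reports how often switching to Door #3 wins. The other variations are left for you to code.

```python
import random

def switch_win_rate(monty_knows, n_trials=200_000, seed=11):
    """P(switching wins | you chose Door #1, Door #2 opened, goat shown)."""
    rng = random.Random(seed)
    kept = wins = 0
    for _ in range(n_trials):
        prize = rng.randrange(3)  # 0, 1, 2 stand for Doors #1, #2, #3
        if monty_knows:
            # Standard Monty: opens a goat door other than your pick.
            opened = rng.choice([d for d in (1, 2) if d != prize])
        else:
            # Ignorant Monty: opens one of the other doors at random.
            opened = rng.choice((1, 2))
        # Condition on what we observed: Door #2 opened, goat behind it.
        if opened != 1 or prize == 1:
            continue
        kept += 1
        wins += (prize == 2)  # switching to Door #3 wins
    return wins / kept

standard = switch_win_rate(monty_knows=True)
ignorant = switch_win_rate(monty_knows=False)
print("standard Monty, switch wins:", standard)
print("ignorant Monty, switch wins:", ignorant)
```

The conditioning step is the important part: the same observation (a goat behind Door #2) carries different information depending on how Monty decides which door to open.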

STAT 330 September 3, 2012

September 3, 2012

The new chart set is on Bayes’ Theorem. Click here to get a copy.