Assignment 5: Complicated surveys
Due: April 9, 1996

Most large surveys involve a combination of the ideas we have discussed: e.g. several layers of stratification on top of several layers of clustering, with ratio and poststratified estimates sprinkled throughout. The formulas for estimating errors can become horrendous, if they are derivable at all, especially when there are several layers of clustering. Cochran, sections 11.17 to 11.21, gives some of the techniques used to handle this problem. We discuss here some simple principles which have fairly wide applicability.

1. Principle of stratification: Let Y(h) be the population total in stratum h and let Yhat(h) be an unbiased sample estimate of Y(h). Then Yhat = SUM[Yhat(h)] unbiasedly estimates Y = SUM[Y(h)] and MSE(Yhat) = SUM[MSE(Yhat(h))]. This principle follows from the independence of the samples in the various strata.

2. Principle of cluster sampling with replacement: Suppose the primary sampling units (PSU's, the first stage clusters) are chosen with probability proportional to z(i) (SUM[z(i)] = 1) and with replacement. Let Yhat(i) be an unbiased estimate of the population total Y(i) in cluster i. Then an unbiased estimate of the population total Y is Yhat = {SUM[Yhat(i)/z(i)]}/n, where n is the number of PSU's chosen and the sum is over the sampled PSU's. The variance of Yhat can be unbiasedly estimated by Vhat(Yhat) = SUM[(Yhat(i)/z(i) - Yhat)^2]/(n(n-1)). In this principle the design of the subsample in each cluster can depend upon the cluster, but if the same cluster is chosen more than once, the design must be executed separately, and a separate value of Yhat(i) found, for each time the cluster appears in the first stage sample.

3. Principle of random groups: A generalization of 2. above is the principle of random groups. In this principle, a design is independently executed (replicated) k times, yielding k random groups. Let Xhat(i) be an estimate of a population total X derived from the data in the ith random group.
Then Xhat(1),...,Xhat(k) are independently and identically distributed. It follows that Xhat = SUM[Xhat(i)]/k satisfies E(Xhat) = E(Xhat(i)) and that V(Xhat) can be unbiasedly estimated by Vhat(Xhat) = SUM[(Xhat(i) - Xhat)^2]/[k(k-1)]. Usually when random groups are applied k is relatively small, whereas the sample size in each random group is large enough to make a normal distribution assumption for each Xhat(i) reasonable. In this case, one approximates the distribution of (Xhat - E(Xhat))/sqrt(Vhat(Xhat)) with a t-distribution with k-1 degrees of freedom. If each random group is a cluster chosen with replacement and with probability proportional to z, then setting Xhat(i) = Yhat(i)/z(i) shows that the principle of random groups generalizes the principle of cluster sampling with replacement.

4. Principle of ratio estimation: Suppose we want to estimate the population ratio R = Y/X. Let Yhat and Xhat be unbiased estimates of Y and X respectively. Then for "large" sample sizes Rhat = Yhat/Xhat is an approximately unbiased estimate of R, and the variance of Rhat can be approximately unbiasedly estimated by taking the formula for the estimated variance of Yhat, replacing each sample value y in that formula by (y - Rx) (with Rhat in place of R when R is unknown, as it is in practice), and dividing the resulting number by X^2 (or Xhat^2 if X is unknown). Examples of this procedure are the combined ratio estimate for stratified designs and the formula for estimating the variance of a ratio in 2-stage cluster designs. The main assumptions are that Xhat and Yhat are linear estimators (that is, linear combinations of the sample values), that they come from the same sample, and that they are of the same type (that is, the formula for Xhat is obtained from the formula for Yhat by replacing each y with an x). Thus, for example, using a stratified mean for Yhat and a simple random sample mean for Xhat is excluded.
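
As a hedged illustration of the substitution described in principle 4, the sketch below works out the simplest case: a ratio estimated from one simple random sample. It assumes sampling without replacement from N units, so that the estimated variance of Yhat = N*ybar is N^2 (1 - n/N) s^2 / n. The function name srs_ratio_estimate and its arguments are hypothetical and not part of the assignment.

```python
import math


def srs_ratio_estimate(y, x, N, X_total=None):
    """Ratio estimate Rhat = Yhat/Xhat and its standard error.

    Assumes (y[i], x[i]) come from one simple random sample drawn
    without replacement from a population of N units.
    X_total is the known population total of x, if available.
    """
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("need paired samples with n >= 2")

    ybar = sum(y) / n
    xbar = sum(x) / n
    r_hat = ybar / xbar  # equals (N*ybar)/(N*xbar) = Yhat/Xhat

    # Replace each y by d = y - Rhat*x; these residuals sum to zero.
    d = [yi - r_hat * xi for yi, xi in zip(y, x)]
    s2_d = sum(di ** 2 for di in d) / (n - 1)

    # Variance formula for Yhat = N*ybar, applied to d instead of y.
    v_total = N ** 2 * (1 - n / N) * s2_d / n

    # Divide by X^2, or by Xhat^2 when X is unknown.
    x_denominator = X_total if X_total is not None else N * xbar
    v_ratio = v_total / x_denominator ** 2
    return r_hat, math.sqrt(v_ratio)
```

Calling srs_ratio_estimate(y, x, N) returns the pair (Rhat, estimated standard error of Rhat). For other qualifying designs only the line computing v_total would change, since it is the design's own variance formula with y replaced by the residuals.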
Examples of qualifying estimators are the sample mean in a simple random sample, the poststratified mean in a simple random sample, the stratified mean, and the ppz estimator of principle 2 above. In the case of the poststratified mean, the resulting formula for the variance is correct in the 1/n term, but not in the 1/n^2 term.

Although these four principles are only a small proportion of the formulas we discussed in this course, everything else we discussed sheds light on how to design a survey for greatest effectiveness. For example, the equations given above for multistage cluster sampling with replacement give no information about the role of between and within cluster variance in the total variance of the estimator. This information is provided by the formula for the true variance (as opposed to its estimate) of the ppz estimator.

We are interested in estimating the average price a household is willing to pay for cable TV service in Stephens County and separately in the following four strata:

STRATUM I: Rural areas, districts 1-43
STRATUM II: Three villages, districts 44-46
STRATUM III: Eavesville, districts 47-50
STRATUM IV: Lockhart City, districts 51-75

It is proposed to make the probability that any given house appears in the sample approximately 1%. More specifically, the design in each stratum will be

STRATUM I: 16 districts chosen with probability proportional to size (with replacement), 5 houses sampled in each chosen district, unbiased estimate.
STRATUM II: A stratified random sample with 3, 6 and 3 houses chosen in districts 44, 45, and 46 respectively; stratified (unbiased) mean estimates.
STRATUM III: Simple random sample of size 32 with a ratio to house assessed value estimate.
STRATUM IV: Simple random sample of size 197 with a poststratified (on house value) estimate.

For this purpose, the Stephens County assessor has supplied the following information:

HOUSE VALUE          NUMBER OF HOUSES
0 TO 39999           1021
40000 TO 49999       1786
50000 TO 59999       2724
60000 TO 69999       2603
70000 TO 79999       4592
80000 TO 89999       4608
90000 TO 99999       1788
100000 AND OVER       542

Execute a sample according to the above design and estimate, with standard error, the average price a household in Stephens County is willing to pay for cable TV service. Also estimate, with standard error, the average price a household in each of the four strata is willing to pay for service.
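
The following is a minimal sketch, not a solution to the assignment, of how principles 1 and 2 above translate into a computation. The function names ppz_total and combine_strata are hypothetical, and any numbers passed to them would have to come from the sample actually drawn.

```python
import math


def ppz_total(yhat, z):
    """Principle 2: PSU's drawn with replacement, probability z[i].

    yhat[i] is an unbiased estimate of the cluster total for the
    i-th draw (one entry per draw, even if a cluster repeats).
    Returns (Yhat, Vhat(Yhat)).
    """
    n = len(yhat)
    if n != len(z) or n < 2:
        raise ValueError("need n >= 2 draws with matching z values")
    t = [yi / zi for yi, zi in zip(yhat, z)]
    total = sum(t) / n
    var = sum((ti - total) ** 2 for ti in t) / (n * (n - 1))
    return total, var


def combine_strata(estimates):
    """Principle 1: add independent stratum estimates.

    estimates is a list of (Yhat_h, Vhat_h) pairs, one per stratum.
    Returns (Yhat, standard error of Yhat).
    """
    total = sum(est for est, _ in estimates)
    var = sum(v for _, v in estimates)
    return total, math.sqrt(var)
```

To turn an estimated total into an estimated average per household, divide both the total and its standard error by the known number of households. Note that adding the stratum variances relies on the stratum samples being independent, and that the ratio and poststratified estimates are only approximately unbiased, so the combined standard error is itself an approximation.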