Assignment 6
TWO STAGE CLUSTER DESIGNS
due: December 11, 1997

WARNING: START IMMEDIATELY

In stratification, we design the strata to be homogeneous within strata, but heterogeneous between strata. We then sample in each stratum with the aim of improving on the precision of simple random sampling by ensuring a representative sample. In cluster sampling, we try to make the clusters resemble one another (homogeneous between clusters), with each cluster roughly representative of the population with respect to the variables being measured. When the cost of visiting each cluster is high (such as when travel costs are significant), we try to reduce costs by limiting our sample to a small number of clusters. Although for a fixed sample size cluster sampling is usually less precise than simple random sampling, we hope that by controlling our sampling costs we can increase the sample size enough to more than compensate for the lower precision.

The sampling costs in Stephens County are as follows:

$60 for each rural district (1-46) visited
$20 for each urban district (51-75) visited
$6 for each rural household visited
$3 for each urban household visited
$10 for each completed interview

We consider in this assignment three possible two stage designs:

DESIGN I--Clusters chosen with equal probability, unbiased estimate
DESIGN II--Clusters chosen with equal probability, ratio to cluster size estimate
DESIGN III--Clusters chosen with probability proportional to size and with replacement, unbiased estimate

A design is said to be "self-weighting" if each subunit (in this case household) has an equal chance of being chosen in the sample. This translates into each subunit being equally weighted in the final estimate. The first two designs above are self-weighted if the size of the subsample in each selected cluster is proportional to the size of the cluster. The third design is self-weighted if the size of the subsample is constant from cluster to cluster.
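The self-weighting conditions just stated can be checked numerically. The following is a minimal sketch, not part of the assignment: the cluster sizes, the number of clusters drawn, the subsampling rate, and the function names are all hypothetical and chosen only for illustration. Exact fractions are used so that equality can be seen directly.

```python
from fractions import Fraction

# Hypothetical cluster sizes M(i); NOT taken from the Stephens County data.
sizes = [40, 60, 100]
N = len(sizes)          # clusters in the population
total = sum(sizes)      # households in the population
n = 2                   # clusters drawn at the first stage (illustrative)


def equal_prob_chance(M_i, rate):
    """Designs I and II: a cluster is chosen with probability n/N, then
    m(i) = rate * M(i) of its M(i) households are subsampled."""
    m_i = rate * M_i
    return Fraction(n, N) * Fraction(m_i) / M_i


def pps_expected_draws(M_i, m):
    """Design III: each of the n draws picks cluster i with probability
    M(i)/total, then m households are subsampled. Returns the expected
    number of times a given household enters the sample."""
    return n * Fraction(M_i, total) * Fraction(m, M_i)


design_1_2 = [equal_prob_chance(M_i, Fraction(1, 10)) for M_i in sizes]
design_3 = [pps_expected_draws(M_i, 5) for M_i in sizes]

print(design_1_2)  # the same value for every cluster
print(design_3)    # the same value for every cluster
```

In both lists the cluster size M(i) cancels, which is why proportional subsampling suits Designs I and II while a constant subsample size suits Design III.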
We will use self-weighting forms of the designs above.

APPROXIMATING FORMULAE:

We use the following notation:

N=number of districts
n=number of districts in the sample
Mbar=average number of households per district
mbar=average number of sampled households per cluster
Y(i)=sum of all values in district i
Ybar(i)=Y(i)/M(i)=mean value per household in district i
S(2,i)^2=variance of the values in district i
M(i)=number of households in district i
Y=sum of all values in the population
Ybar=Y/N
Ydblbar=Y/(NMbar)
f1=n/N=first stage sampling fraction
f2=mbar/Mbar=second stage sampling fraction

Consider DESIGN I, II, and III in their self-weighting form. Then the formulae for the true MSE (not the sample estimate) of the estimator of the population mean per subunit reduce to:

DESIGN I:   MSE = ((1-f1)/n) B1 + (1-f2)W/(nmbar)
DESIGN II:  MSE = ((1-f1)/n) B2 + (1-f2)W/(nmbar)
DESIGN III: MSE = B3/n + (1-f2)W/(nmbar)

where the between cluster variances are:

B1 = SUM (Y(i)-Ybar)^2/((N-1)(Mbar^2))
B2 = SUM (Y(i)-M(i)Ydblbar)^2/((N-1)(Mbar^2))
B3 = SUM M(i)(Ybar(i)-Ydblbar)^2/(NMbar)

and the within cluster variance is:

W = SUM M(i)S(2,i)^2/(NMbar).

Here SUM represents a sum over all the districts in the population. These formulas give the true variance. They should not be used to estimate that variance from the sample. Furthermore, the formula for Design III is only approximate when the clusters have unequal sizes.

PROBLEMS

0. Derive these formulas.

We are interested in seeing if a cluster design can improve on the precision of a simple random sample of size 100 from the rural areas of Stephens County while keeping the same cost. To do this we need to know the cost of sampling 100 houses randomly in districts 1-43. Fortunately we can generate ten samples from districts 1-43 using ADDGEN and then, without actually incurring any sampling costs, calculate the cost of executing each of those ten samples. Using this procedure, I have been able to show that a simple random sample of size 100 from districts 1-43 would visit an average of 37.5 districts and cost an average of $3850.

1. Design a two stage cluster sampling scheme for the rural areas (districts 1-43) of Stephens County which chooses between 25% and 50% of the districts (clusters) with probability proportional to size and with replacement, and has a constant subsample size within each chosen district. The total survey cost should be about $3850. Execute the sample and estimate the average price a rural household is willing to pay for cable TV using both an unbiased estimate and a ratio estimate using the house assessed value as an auxiliary variable. Be sure to give standard errors. The remainder of this problem set refers to the unbiased estimate.

2. Take a simple random sample of size 100 from districts 1-43 and use this sample to estimate the average price a rural household is willing to pay for cable TV service (using the sample mean estimate).

For future reference in studying a similar population, it is often useful to consider what would have occurred if a different sampling technique had been adopted.

3. Estimate from your sample the within cluster variance VSSU and the between cluster variance VPSU, and hence B3 and W, for the rural areas. What would have been the optimal allocation between PSUs and SSUs for a total expected budget of $3850?

4. I have been able to determine that a simple random sample of size 200 from districts 1-43 would visit about 42 districts and cost about $5720. Are any two stage designs an improvement over simple random sampling to determine the average price a rural household is willing to pay if a budget of $5720 is adopted for sampling? Use the results of problem 3; you do not need to take a new sample.

5. A simple random sample of size 200 from Lockhart City would most likely visit all 25 districts and cost $3100. Using the estimates of B1, B2, B3, and W given below for Lockhart City, design a two stage sampling scheme for Lockhart City that costs $3100.
Does it appear that cluster sampling is cost effective in Lockhart City for the purpose of estimating the average price a household is willing to pay for cable service? Comment in view of the SURVEY program assumptions.

B1=16.820
B2=20.448
B3=19.940
W =23.196
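As a rough aid for comparing allocations, the cost schedule and the DESIGN III formula from this assignment can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, every visited household is assumed to yield a completed interview (which reproduces the three simple random sampling costs quoted in the problems), and f2 defaults to 0, which ignores the second stage correction. For a with-replacement design the number of distinct districts visited can be smaller than n, so the cost computed this way is an upper bound.

```python
def survey_cost(districts, households, rural=True):
    """Cost in dollars under the schedule quoted in the assignment.
    ASSUMPTION: every visited household yields a completed interview."""
    per_district = 60 if rural else 20
    per_household = 6 if rural else 3
    per_interview = 10
    return per_district * districts + (per_household + per_interview) * households


def mse_design_iii(B3, W, n, mbar, f2=0.0):
    """MSE = B3/n + (1-f2)W/(n*mbar), the formula given for DESIGN III.
    ASSUMPTION: f2 defaults to 0 (second stage correction ignored)."""
    return B3 / n + (1 - f2) * W / (n * mbar)


# The three simple random sampling costs quoted in the problems.
print(survey_cost(37.5, 100))              # 3850.0
print(survey_cost(42, 200))                # 5720
print(survey_cost(25, 200, rural=False))   # 3100

# Illustrative allocation only (n=10 districts, 20 households each),
# evaluated with the Lockhart City estimates; not a proposed answer.
print(survey_cost(10, 10 * 20, rural=False))
print(mse_design_iii(B3=19.940, W=23.196, n=10, mbar=20))
```

The allocation in the last two lines is arbitrary and does not meet the $3100 budget; it only shows how a candidate pair (n, mbar) would be costed and scored.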