Assignment 4: Two stage cluster designs Due date: October 16, 1997 1. Write an Splus macro to calculate the pi-estimator for the population total ty in a two stage cluster design. Subsampling within clusters will be by simple random sampling. Sampling of clusters (PSU's) will be specified by the inclusion probabilities pi1i and pi1ij. Let n be the total sample size and n1 be the number of first stage PSU's in the sample. The input for your macro should be: y - data vector of length n iclus - vector giving the cluster numbers. cluster numbers should be recoded to be of the form 1, 2, ..., n1 pi1i - vector of length n1 giving first stage inclusion probabilities pi1ij - matrix of dimensions n1 by n1 giving first stage double inclusion probabilities Ni - vector of length n1 giving the number of SSU's within each chosen psu The output of your macro should be a list with components tpihat - pi estimator of ty se - standard error of tpihat vpsu, vssu - estimated components of var(tpihat). Note that se^2 = vpsu + vssu As much as possible you should call previously written macros, especially SRSAMPLE and HTEST. 2. Test your macro using the address list lc2stg.a. The variable of interest is the average price a house is willing to pay for cable TV service. lc2stg.a was created by Poisson sampling from Lockhart City with the districts playing the role of PSU's. PSU's were chosen with probability proportional to size and with an expected first stage sample size of 10. Thus pi1i = Prob of selecting cluster i = 10*Ni/N The overall sampling scheme is supposed to be self-weighting with weights pik = .01. Thus .01 = pik = pi1i * pik|i = (10*Ni/N) * (ni/Ni), or ni = .01 * N/10 = 19.664. Here is the Splus code to select the sample. einstein: /home/tcc8v/msta718.dir % Splus S-PLUS : Copyright (c) 1988, 1995 MathSoft, Inc. S : Copyright AT&T. Version 3.3 Release 1 for Sun SPARC, SunOS 5.3 : 1995 Working data will be in .Data > #Poisson sampling of districts in Lockhart City > #probability proportional to size, expected count=10 > district.dat<-matrix(scan("district.dat"),byrow=T,ncol=5) > pi1i<-10*district.dat[51:75,2]/19664 > pi1i [1] 0.2669854 0.3692026 0.3427583 0.2974980 0.2812246 0.2964809 0.4632832 [8] 0.5344793 0.4668430 0.4063263 0.2771562 0.4551465 0.6677177 0.4922701 [15] 0.3646257 0.3310618 0.4505696 0.4637917 0.4566721 0.3859845 0.3671684 [22] 0.3829333 0.4032750 0.3686941 0.4078519 > sum(pi1i) [1] 10 > randno<-runif(25) > ind<-ifelse(randno lcdist<-(51:75)[ind==1] > lcdist [1] 52 53 54 55 56 57 58 59 60 64 65 67 69 72 73 > #wow! 9 districts in a row, 15 districts total > #such are the disasters possible with Poisson sampling > #self weighting 1% design - .01 = pi1i * ni/Ni > distno<-NULL > hseno<-NULL > for (i in 1:length(lcdist)) {dist<-lcdist[i] + distno<-c(distno,rep(dist,20)) + hseno<-c(hseno,sample(district.dat[dist,2],size=20)) + } > write(rbind(distno,hseno),file="lc2stg.a",ncol=2) > q() For the purpose of debugging your program, be sure to use the same file lc2stg.a. Here is my answer (which hopefully will be the same as your answer.) einstein: /home/tcc8v/msta718.dir % survey.out DEMONSTRATION EDUCATIONAL SAMPLE SURVEY PROGRAM COPYRIGHT (C) 1992, TED CHANG AND SHARON LOHR ENTER FILENAME CONTAINING ADDRESSES--8 OR FEWER LETTERS IF ENTERING FROM TERMINAL, TYPE T lc2stg.a ENTER FILENAME FOR OUTPUT--8 OR FEWER LETTERS lc2stg.d ENTER DESIRED THREE NONRESPONSE RATES: NOT-AT-HOMES, REFUSALS, RANDOM ANSWERS 0 0 0 THE COST OF THIS SESSION IS 4200 DOLLARS. HIT CARRIAGE RETURN WHEN READY einstein: /home/tcc8v/msta718.dir % Splus S-PLUS : Copyright (c) 1988, 1995 MathSoft, Inc. S : Copyright AT&T. Version 3.3 Release 1 for Sun SPARC, SunOS 5.3 : 1995 Working data will be in .Data > pi1i [1] 0.2669854 0.3692026 0.3427583 0.2974980 0.2812246 0.2964809 0.4632832 [8] 0.5344793 0.4668430 0.4063263 0.2771562 0.4551465 0.6677177 0.4922701 [15] 0.3646257 0.3310618 0.4505696 0.4637917 0.4566721 0.3859845 0.3671684 [22] 0.3829333 0.4032750 0.3686941 0.4078519 > ind [1] 0 1 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0 > pi1i<-pi1i[ind==1] > pi1ij<-matrix(pi1i,ncol=1)%*%matrix(pi1i,nrow=1) > dim(pi1ij) [1] 15 15 > diag(pi1ij)<-pi1i > lcdat<-matrix(scan("lc2stg.d"),byrow=T,ncol=12) > y<-lcdat[,7] > lcdist [1] 52 53 54 55 56 57 58 59 60 64 65 67 69 72 73 > distno<-lcdat[,1] > for (i in 1:length(lcdist)) distno<-ifelse(distno==lcdist[i],i,distno) > distno [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 [26] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 [51] 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 [76] 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 [101] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 [126] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 [151] 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 [176] 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 [201] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 [226] 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 [251] 13 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 [276] 14 14 14 14 14 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 > Ni<-district.dat[51:75,2] > Ni<-Ni[ind==1] > source("ht2stage.splus") > temp<-HT2STAGE(y,distno,pi1i,pi1ij,Ni) > tpihat<-temp$tpihat/19664 > tpihat [1] 12.475 > se<-temp$se/19664 > se [1] 2.83595 > vpsu<-temp$vpsu/(19664^2) > vpus Error: Object "vpus" not found > vpsu [1] 7.873017 > vssu<-temp$vssu/(19664^2) > vssu [1] 0.1695934 > sqrt(vpsu+vssu) [1] 2.83595 > q()