version January 5, 1995

This mailing has 4 parts:
   1.  The SURVEY program
   2.  The ADDGEN program to generate random addresses
   3.  The CENSUS program to census the population
   4.  Some suggested exercises in text file format

The transmission is complete if a box of asterisks enclosing the words
"end of transmission" appears at the end.

These programs are self-contained and should run on any system that has
a FORTRAN 77 compiler.  Some modifications may be necessary, however, in
the statements that open and close files on different machines.  For
instance, the value of INSYS needs to be changed when using the programs
with some versions of MacFortran.

I continue to teach sampling using these exercises.  Some later versions
of these exercises can be downloaded by anonymous ftp from
wald.stat.virginia.edu or through the Division of Statistics www server
http://www.stat.virginia.edu/
Later versions of the exercises contain grading macros written in S-PLUS.
Enjoy!

The SURVEY simulation programs and their associated exercises are made
available free of charge to the educational community.  Permission to
copy for classroom use is hereby granted.  Appropriate attribution would
be appreciated.  All other rights are reserved.  No warranty express or
implied is provided and the use of the SURVEY simulation programs and
their associated exercises is completely at the risk of the user.

**************************************************************************
*                            SURVEY program                              *
**************************************************************************
      PROGRAM SURVEY
C DEMONSTRATION EDUCATIONAL SAMPLE SURVEY PROGRAM
C ------------- ----------- ------ ------ -------
C
C RELEASE VERSION: AUGUST 1992, bug fix Jan. 1995
C
C TED CHANG, UNIVERSITY OF KANSAS, MARCH 1984, JUNE 1985, SEPT 1986.
C MODIFIED BY S. LOHR, UNIVERSITY OF MINNESOTA, DECEMBER 1988.
C             T. CHANG, UNIVERSITY OF VIRGINIA, APRIL 1991.
C             S. LOHR, ARIZONA STATE UNIVERSITY, JULY 1991, JULY 1992.
C COPYRIGHT (C) 1992, T. CHANG AND S. LOHR
C
C REMARK: INSYS MUST BE SET TO SYSTEM TERMINAL INPUT FILE NUMBER.
C REMARK: comment lines below contain modifications for use with
C         Language Systems FORTRAN for the Macintosh
C         These modifications set the 'creator' on output files;
C         allowing double click opening of output files
C
C
C THE FOLLOWING PARAMETERS CAN BE SET TO CHANGE THE PERFORMANCE OF
C THE PROGRAM.  ALL ARE FOUND IN BLOCK DATA INITL.
C
C ISEED = INTEGER RANDOM NUMBER GENERATOR SEED.  ISEED MUST BE IN
C   THE RANGE 1 TO 2**31-1 INCLUSIVE.
C NHOUSE(I) = NUMBER OF HOUSEHOLDS IN DISTRICT I.
C XHP(I) = HOUSE PRICE CODE FOR DISTRICT I; HOUSE PRICES WILL BE
C   GENERATED WITH A MEAN VALUE OF $30000+$7000*XHP.  IN THE RURAL
C   DISTRICTS THEY WILL BE UNCORRELATED WITH A STANDARD DEVIATION
C   OF $15000.  IN URBAN DISTRICTS THE STANDARD DEVIATION IS $4667 AND
C   EACH HOUSE WILL BE CORRELATED WITH ITS EIGHT CLOSEST NEIGHBORS.
C FPD(I) = MEAN FAMILY SIZE IN DISTRICT I.
C NCRUR = COST TO VISIT A RURAL DISTRICT.
C NCURB = COST TO VISIT AN URBAN DISTRICT.
C NCINT = COST TO COMPLETE AN INTERVIEW
C NCRVIS = COST TO VISIT A RURAL HOUSEHOLD
C NCUVIS = COST TO VISIT AN URBAN HOUSEHOLD
C
C NONRESPONSE PROBABILITIES:
C PNRSP1=APPROXIMATE PROPORTION OF HOUSEHOLDS 'NOT AT HOME'.  THESE
C   HOUSEHOLDS WILL CHANGE FROM RUN TO RUN, WITH THE LARGER FAMILIES
C   MORE LIKELY TO BE AVAILABLE.
C PNRSP2=PROPORTION OF HOUSEHOLDS 'UNWILLING TO ANSWER QUESTION 4 OF THE
C   QUESTIONNAIRE'.  THESE HOUSEHOLDS WILL CHANGE FROM RUN TO RUN, WITH
C   THE FAMILIES WITH MIDDLE VALUE HOUSES TENDING TO BE WILLING TO ANSWER.
C PNRSP3=PROPORTION OF HOUSEHOLDS GIVING RANDOM ANSWERS.  THESE
C   HOUSEHOLDS ARE FIXED FROM RUN TO RUN, BUT THEIR RESPONSES CHANGE.
C   80% OF THE FAMILIES IN THIS CATEGORY WILL GIVE MILDLY VARYING ANSWERS
C   TO QUESTIONS 4-9.  THE OTHER 20% ARE PATHOLOGICAL LIARS AND WILL GIVE
C   WILDLY RANDOM ANSWERS TO ALL 9 QUESTIONS.
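C
C WORKED EXAMPLE (ILLUSTRATION ONLY; IT ASSUMES THE DEFAULT VALUES IN
C BLOCK DATA INITL ARE UNCHANGED AND ALL NONRESPONSE RATES ARE ZERO):
C   DISTRICT 1 HAS XHP(1) = 5.0, SO ITS HOUSE PRICES ARE GENERATED
C   WITH A MEAN VALUE OF $30000 + $7000*5.0 = $65000.
C   INTERVIEWING 5 HOUSEHOLDS IN THAT ONE RURAL DISTRICT COSTS
C   NCRUR + 5*NCRVIS + 5*NCINT = 60 + 5*6 + 5*10 = $140.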
C
C--------------------------------------------------------------------------
C
      IMPLICIT INTEGER (I-N)
      IMPLICIT REAL(A-H,O-Z)
      DIMENSION NHOUSE(75),NVISIT(75),NHOURS(5),IRURAL(75)
      DIMENSION FPD(75),XHP(75)
      CHARACTER*8 INUNIT,OUTUNI
      LOGICAL TERM
      COMMON IRURAL,JSEED,NHP,NADULT,NCHIL,NTV,PRICE,NHOURS
      COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS
      DATA PNRSP1/0.0/, PNRSP2/0.0/, PNRSP3/0.0/
      DATA INSYS/5/
C
C INITIALIZATION MODULE
      WRITE(*,5000)
   23 WRITE(*,1200)
 1200 FORMAT(
     &' ENTER FILENAME CONTAINING ADDRESSES--8 OR FEWER LETTERS'
     &/' IF ENTERING FROM TERMINAL, TYPE T')
      READ(*,'(A)',ERR=23,END=23) INUNIT
      IF (INUNIT .EQ. 'T' .OR. INUNIT .EQ. 't') THEN
         IUNIT=INSYS
         TERM = .TRUE.
      ELSE
         IUNIT=11
         OPEN(UNIT=IUNIT,FILE=INUNIT)
         REWIND 11
         TERM = .FALSE.
      END IF
   24 WRITE(*,1210)
 1210 FORMAT(' ENTER FILENAME FOR OUTPUT--8 OR FEWER LETTERS')
      READ(*,'(A)',ERR=24,END=24) OUTUNI
      OPEN(UNIT=12,FILE=OUTUNI)
c in Language Systems FORTRAN for the Mac use the following instead
c     OPEN(UNIT=12,FILE=OUTUNI,CREATOR='EDIT')
c creator 'ttxt' for teach text
   25 WRITE(*,1220)
 1220 FORMAT(' ENTER DESIRED THREE NONRESPONSE RATES:'/
     & ' NOT-AT-HOMES, REFUSALS, RANDOM ANSWERS')
      READ(*,*,ERR=25,END=25) PNRSP1,PNRSP2,PNRSP3
      IF(PNRSP1 .LT. 0.0 .OR. PNRSP2 .LT. 0.0 .OR. PNRSP3 .LT. 0.0
     &.OR. PNRSP1 .GT. 1.0 .OR. PNRSP2 .GT. 1.0 .OR. PNRSP3 .GT. 1.0)
     & THEN
         WRITE(*,1230)
 1230    FORMAT(' NONRESPONSE RATES MUST BE BETWEEN 0 AND 1')
         GO TO 25
      END IF
      PNRSP=PNRSP1+PNRSP2+PNRSP3
      IF (PNRSP.GT.0.) GO TO 5
      ISEED1=25
      GO TO 6
    4 WRITE(*,1030)
    5 WRITE(*,1150)
      READ(*,*,ERR=4,END=4) ISEED1
      IF ((ISEED1.LT.0).OR.(ISEED1.GT.1000000)) GO TO 4
    6 WRITE(12,1000)
      DO 3 I=1,75
      NVISIT(I)=0
    3 CONTINUE
      NRESP=0
      NCOST=0
C
      NHST=0
      AVFAM=0.
      AVNHP = 0.
      SDNHP = 0.
      DO 1 I=1,43
      NHST=NHST+NHOUSE(I)
      AVNHP = AVNHP + XHP(I)*NHOUSE(I)
      SDNHP = SDNHP + XHP(I)*XHP(I)*NHOUSE(I)
      AVFAM=AVFAM+NHOUSE(I)*FPD(I)
    1 CONTINUE
      AVRUR=FLOAT(NHST)/43.
C IRURAL IS USED TO MAKE PEOPLE IN MORE ISOLATED DISTRICTS
C MORE WILLING TO SUBSCRIBE.
      DO 7 I=1,43
      IRURAL(I)=2
      IF (NHOUSE(I).LE.AVRUR) IRURAL(I)=3
    7 CONTINUE
      DO 2 I=44,75
      NHST = NHST + NHOUSE(I)
      AVNHP = AVNHP + XHP(I)*NHOUSE(I)
      SDNHP = SDNHP + XHP(I)*XHP(I)*NHOUSE(I)
      AVFAM=AVFAM+NHOUSE(I)*FPD(I)
      IRURAL(I)=0
      IF (I.GE.51) GO TO 2
      IRURAL(I)=1
      IF (I.GE.47) GO TO 2
      IRURAL(I)=2
    2 CONTINUE
      AVFAM=AVFAM/NHST
C Line below recovers within district variation in house price.
C Note: 4.591837 = (15000/7000)^2 and .4445079 = (4667/7000)^2
      WITHIN = (43.*AVRUR*4.591837 + (NHST-43.*AVRUR)*0.4445079)/NHST
      SDNHP = 7000.*SQRT(SDNHP/NHST - (AVNHP/NHST)**2 + WITHIN)
      AVNHP = 7000.*(AVNHP/NHST) + 30000.
      P4=.2*PNRSP3
C
C ADDRESS INPUT MODULE
   10 IF (TERM) WRITE(*,1010)
      I2=0
      READ(IUNIT,*,ERR=805,END=800) I2,J
      IF (I2.EQ.0) GO TO 900
      I=IABS(I2)
      IF (I.GT.75) GO TO 810
      IF ((J.LE.0).OR.(J.GT.NHOUSE(I))) GO TO 820
      IF (I2.GT.0) NVISIT(I)=NVISIT(I)+1
      CALL ANSWER(I,J)
C
C NONRESPONSE MODULE
C HOUSE VALUE ONLY
      IF (I2.LT.0) GO TO 710
C CHECK FOR NOT AT HOME
      JSEED1=NSEED(ISEED1,ISEED,JSEED)
      FRESP=RANDU(JSEED1)
      PNHOME=PNRSP1*EXP(1.-(NADULT+NCHIL)/AVFAM)
      IF (FRESP.LT.PNHOME) GO TO 720
      NRESP=NRESP+1
C CHECK FOR UNWILLINGNESS TO ANSWER QUESTION 4.
C Assume Z=(NHP - mean)/s.d. ~ N(0,1) and let W ~ Unif(0,1) (indep. of Z).
C Let p = PNRSP2 and k = (1/(1 - p)^2 - 1)/2.  Then
C   P{W > exp[-k Z^2]} = E[P{W > exp[-k Z^2] | Z}] = E [1 - exp[-k Z^2] ]
C                      = 1 - 1/sqrt(1+2k) = p
      IF (PNRSP2 .GT. 0.999) GO TO 740
      FRESP=RANDU(JSEED1)
      DEV = (NHP - AVNHP)/SDNHP
      PREFUS = 1.-EXP(- .5 * DEV**2 * (1./(1.-PNRSP2)**2 - 1.) )
      IF (FRESP.LT.PREFUS) GO TO 740
C CHECK FOR RANDOM ANSWERS
      FRESP=RANDU(JSEED)
      IF (FRESP.GE.PNRSP3) GO TO 700
      CORAND=.3
      IF (FRESP.GT.P4) GO TO 610
C WE HAVE A PATHOLOGICAL LIAR HERE
      NADULT=1+(NADULT-1)*2.*RANDU(JSEED1)
      NCHIL=NCHIL*2.*RANDU(JSEED1)
      NTV=NTV*2.*RANDU(JSEED1)
      NTV=MIN0(NTV,5)
      CORAND=1.0
  610 NHRMX=70.+46.*SQRT(FLOAT(NTV))+10.*RANDU(JSEED1)
      IF (NTV.EQ.0) NHRMX=0
      PRICE=PRICE*(1.-CORAND+2.*CORAND*RANDU(JSEED1))
      NHRT=0
      DO 620 II=2,5
      NHOURS(II)=NHOURS(II)*(1.-CORAND+2.*CORAND*RANDU(JSEED1))
      NHRT=NHRT+NHOURS(II)
  620 CONTINUE
      NHOURS(1)=NHRT*(1.+.5*ABS(RANDN(JSEED1)))
      IF (NHOURS(1).LE.NHRMX) GO TO 700
      FACT=FLOAT(NHRMX)/NHOURS(1)
      DO 630 II=1,5
      NHOURS(II)=FACT*NHOURS(II)
  630 CONTINUE
C
C STANDARD RETURNS
  700 NPRICE=5*MIN0(IFIX(PRICE/5.),5)
      WRITE(12,1020) I,J,NHP,NADULT,NCHIL,NTV,NPRICE,NHOURS
      IF (TERM) WRITE(*,1090)
      GO TO 10
  710 NCOST=NCOST+1
      IF (TERM) WRITE(*,1090)
      WRITE(12,1120) I,J,NHP
      GO TO 10
  720 IF (TERM) WRITE(*,1070)
      WRITE(12,1080) I,J,NHP
      GO TO 10
  740 WRITE(12,1140) I,J,NHP,NADULT,NCHIL,NTV,NHOURS
      GO TO 10
C
C ERROR RETURNS
  800 IF (I2.EQ.0) GO TO 900
  805 WRITE(*,1030)
      WRITE(*,1040)
      GO TO 10
  810 WRITE(*,1050)
      WRITE(*,1040)
      GO TO 10
  820 WRITE(*,1060) I,NHOUSE(I)
      WRITE(*,1040)
      GO TO 10
C
C COST MODULE
  900 DO 910 I=1,46
      IF (NVISIT(I).EQ.0) GO TO 910
      NCOST=NCOST+NCRUR+6*NVISIT(I)
  910 CONTINUE
      DO 920 I=47,75
      IF (NVISIT(I).EQ.0) GO TO 920
      NCOST=NCOST+NCURB+3*NVISIT(I)
  920 CONTINUE
      NCOST=NCOST+10*NRESP
      WRITE(*,1100) NCOST
      WRITE(12,1100) NCOST
      IF (PNRSP.GT.0.) WRITE(12,1110) ISEED1
      STOP
C
C
 1000 FORMAT(' ADDRESS',4X,'VALUE',3X,'1',3X,'2',3X,'3',3X,'4',4X,
     &'5',4X,'6',4X,'7',4X,'8',4X,'9')
 1010 FORMAT(' ENTER DISTRICT NUMBER, HOUSE NUMBER')
 1020 FORMAT(1X,I3,I5,2X,I6,4(2X,I2),5(2X,I3))
 1030 FORMAT(' THERE HAS BEEN AN INPUT ERROR.')
 1040 FORMAT(' RE-ENTER DISTRICT NUMBER, HOUSE NUMBER'/
     &' SET DISTRICT NUMBER = 0 TO STOP PROGRAM.')
 1050 FORMAT(' DISTRICT NUMBERS MUST BE BETWEEN -75 AND 75.')
 1060 FORMAT(' IN DISTRICT ',I2,' HOUSE NUMBERS MUST BE BETWEEN 1',
     &' AND ',I4)
 1070 FORMAT(' THIS HOUSEHOLD IS NOT AT HOME.')
 1080 FORMAT(1X,I3,I5,2X,I6,3X,'NOT AT HOME')
 1090 FORMAT(' DONE')
 1100 FORMAT(' THE COST OF THIS SESSION IS ',I6,' DOLLARS.')
 1110 FORMAT(' SEED NUMBER: ',I12)
 1120 FORMAT(1X,I3,I5,2X,I6,3X,'FROM COUNTY RECORDS')
 1140 FORMAT(1X,I3,I5,2X,I6,3(2X,I2),3X,'*',5(2X,I3))
 1150 FORMAT(' ENTER AN INTEGER BETWEEN 1 AND 1,000,000')
 5000 FORMAT(' DEMONSTRATION EDUCATIONAL SAMPLE SURVEY PROGRAM'/
     &' COPYRIGHT (C) 1992, TED CHANG AND SHARON LOHR')
      END
C
C************************************************************************
C
      SUBROUTINE ANSWER(I,J)
C
C SUBROUTINE TO GENERATE ANSWERS TO QUESTIONNAIRE FOR ADDRESS I,J
C
      INTEGER NHOUSE(75),NHOURS(5),IRURAL(75)
      REAL FPD(75),XHP(75),ERRHSE(5)
      INTEGER ISEED,JSEED,NHP,NADULT,NCHIL,NTV
      REAL PRICE,RANDN,RANDU
      INTEGER IRANDP
C
      INTEGER I,J,J1,NEIGH,NSEED,II,NFAM,NHRT,NHRMX
      REAL ERR1,ERR2
      REAL RLAM,FCHIL,FACT
      COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS
      COMMON IRURAL,JSEED,NHP,NADULT,NCHIL,NTV,PRICE,NHOURS
C
C HOUSE PRICE GENERATION MODULE
      IF (I.GT.43) GO TO 150
      JSEED=NSEED(I,J,ISEED)
      ERR1=1.5*RANDN(JSEED)
      GO TO 190
  150 DO 160 NEIGH=1,5
      J1=MOD(NHOUSE(I)+J+NEIGH-2,NHOUSE(I))
      JSEED=NSEED(I,J1,ISEED)
      ERRHSE(NEIGH)=RANDN(JSEED)
  160 CONTINUE
      ERR1=.25*(.5*ERRHSE(1)+ERRHSE(2)+ERRHSE(3)+ERRHSE(4)+.5*ERRHSE(5))
  190 NHP=30000+7000*XHP(I)+10000*ERR1
      NHP=MAX0(10000,NHP)
C
C FAMILY COUNT MODULE
      RLAM=FPD(I)+ERR1-1
      RLAM=AMAX1(RLAM,.5)
      NFAM=IRANDP(RLAM,JSEED)+1
      FCHIL=RANDU(JSEED)
      NCHIL=FCHIL*(NFAM-1)
      NADULT=NFAM-NCHIL
C
C NUMBER OF TV'S AND PRICE RESPONSE MODULE
      NTV=.2*NFAM+.00001*NHP+RANDN(JSEED)+1.0
      NTV=MAX0(0,NTV)
      NTV=MIN0(5,NTV)
      ERR2=RANDN(JSEED)
      PRICE=1.75*XHP(I)+2.5*ERR1+IRURAL(I)+NFAM+3.*ERR2
      PRICE=AMAX1(PRICE,0.)
      IF (NTV.EQ.0) PRICE=0.
C
C WATCHING HABITS MODULE
      NHOURS(2)=(NFAM+ERR2+IRURAL(I)+RANDN(JSEED))*1.5
      NHOURS(3)=(NFAM+ERR2+IRURAL(I)+3.*RANDN(JSEED))*4.0
      NHOURS(4)=6+NCHIL*3+IRURAL(I)+4.0*RANDN(JSEED)
      IF (NCHIL.EQ.0) NHOURS(4)=0
      NHOURS(5)=(NFAM+ERR2+IRURAL(I)+RANDN(JSEED))*4.0
      NHRT=0
      DO 510 II=2,5
      NHOURS(II)=MAX0(NHOURS(II),0)
      NHRT=NHOURS(II)+NHRT
  510 CONTINUE
      NHOURS(1)=NHRT*(1+.5*ABS(RANDN(JSEED)))
      NHRMX=70.+46.*SQRT(FLOAT(NTV))+10.*RANDU(JSEED)
      IF (NTV.EQ.0) NHRMX=0
      IF (NHOURS(1).LE.NHRMX) RETURN
      FACT=FLOAT(NHRMX)/NHOURS(1)
      DO 520 II=1,5
      NHOURS(II)=NHOURS(II)*FACT
  520 CONTINUE
      RETURN
      END
C
C************************************************************************
C
      INTEGER FUNCTION NSEED(I1,I2,I3)
C
C FUNCTION TO COMBINE THREE INTEGERS INTO A SEED
C
      INTEGER I1,I2,I3,J(3),JSEED,K,IND1,NSEED1
      REAL X
      DATA JSEED/2147483645/
      J(1) = MOD(IABS(I1),JSEED) + 1
      J(2) = MOD(IABS(I2),JSEED) + 1
      J(3) = MOD(IABS(I3),JSEED) + 1
      X=RANDU(J(1))
      IND1=1
      NSEED1=0
      DO 20 K=1,9
      NSEED1=NSEED1*10
      IND1=MOD(J(IND1),2)+IND1
      IND1=MOD(IND1,3)+1
      X=RANDU(J(IND1))
      NSEED1=NSEED1+MOD(J(IND1),10)
      X=RANDU(J(IND1))
   20 CONTINUE
      X=RANDU(NSEED1)
      NSEED=NSEED1
      RETURN
      END
C
C**************************************************************************
C
      REAL FUNCTION RANDU(IX)
C
C UNIFORM RANDOM NUMBER GENERATOR.
C REF: SCHRAGE, L. 1979 'A MORE PORTABLE FORTRAN RANDOM NUMBER GENERATOR'
C      A. C. M. TRANSACTIONS ON MATHEMATICAL SOFTWARE V5 132-138.
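C
C WORKED EXAMPLE (ILLUSTRATION ONLY, TRACED BY HAND THROUGH THE
C STATEMENTS BELOW): EACH CALL REPLACES IX BY 16807*IX MODULO
C 2147483647 AND RETURNS IX*4.656612875E-10.
C   STARTING FROM IX = 1, THE FIRST CALL LEAVES IX = 16807 AND
C   RETURNS ABOUT 0.0000078; THE SECOND CALL LEAVES IX = 282475249
C   AND RETURNS ABOUT 0.1315.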
C
      INTEGER A,P,IX,B15,B16,XHI,XALO,LEFTLO,FHI,K
      DATA A/16807/,B15/32768/,B16/65536/,P/2147483647/
      XHI=IX/B16
      XALO=(IX-XHI*B16)*A
      LEFTLO=XALO/B16
      FHI=XHI*A+LEFTLO
      K=FHI/B15
      IX=(((XALO-LEFTLO*B16)-P)+(FHI-K*B15)*B16)+K
      IF (IX.LT.0) IX=IX+P
      RANDU=FLOAT(IX)*4.656612875E-10
      RETURN
      END
C
C**************************************************************************
C
      REAL FUNCTION RANDN(ISEED)
C
C STANDARD NORMAL RANDOM NUMBER GENERATOR
C
      REAL RANDU
      INTEGER ISEED,ISEED1,I
      REAL X1,X2,S
      ISEED1=ISEED
      DO 10 I=1,1000
      X1=2.*RANDU(ISEED)-1.
      X2=2.*RANDU(ISEED)-1.
      S=X1*X1+X2*X2
      IF (S.LT.1.) GO TO 20
   10 CONTINUE
      WRITE(*,1000) ISEED1
      RANDN=X1
      RETURN
   20 RANDN=X1*SQRT(-2.*ALOG(S)/S)
      RETURN
 1000 FORMAT(' ERROR IN RANDN. PLEASE COPY THE FOLLOWING NUMBER AND',
     &' INFORM INSTRUCTOR'/1X,I12)
      END
C
C**************************************************************************
C
      INTEGER FUNCTION IRANDP(RLAM,ISEED)
C
C POISSON RANDOM NUMBER GENERATOR.
C
      INTEGER ISEED,ISEED1,I
      REAL S,RLAM,RANDU
      ISEED1=ISEED
      S=0.
      DO 10 I=1,1000
      S=S-ALOG(RANDU(ISEED))
      IF (S.GT.RLAM) GO TO 20
   10 CONTINUE
      WRITE(*,1000) ISEED1
      IRANDP=1000
      RETURN
   20 IRANDP=I-1
      RETURN
 1000 FORMAT(' ERROR IN IRANDP. PLEASE COPY THE FOLLOWING NUMBER AND',
     &' INFORM INSTRUCTOR.'/1X,I12)
      END
C **************************************************************
      BLOCK DATA INITL
      INTEGER NHOUSE(75),ISEED
      REAL FPD(75),XHP(75)
      COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS
      DATA ISEED/4214202/
      DATA NHOUSE/142,153,135,128,110,103,105,385,296,287,
     &253,172,198,432,248,251,221,297,235,171,
     &135,254,203,244,202,103,102,115,180,190,
     &152,141,143,135,178,221,174,101,95,130,
     &152,169,91,283,562,312,897,734,963,642,
     &525,726,674,585,553,583,911,1051,918,799,
     &545,895,1313,968,717,651,886,912,898,759,
     &722,753,793,725,802/
      DATA XHP/ 5.0, 4.1, 5.0, 3.7, 3.9, 4.0, 5.7, 6.8, 6.3, 5.6,
     & 4.3, 3.6, 5.0, 6.7, 5.3, 3.4, 5.4, 6.9, 5.9, 3.2,
     & 5.3, 5.0, 6.7, 6.7, 5.6, 3.7, 4.5, 4.5, 5.4, 5.6,
     & 5.6, 4.5, 3.8, 3.8, 4.5, 4.4, 3.5, 3.3, 4.0, 3.7,
     & 3.6, 3.8, 3.0, 4.3, 3.9, 3.1, 4.6, 4.5, 4.3, 3.5,
     & 9.3, 5.5, 3.5, 2.6, 1.9, 9.4, 7.8, 4.0, 1.0, 2.1,
     &10.3, 6.5, 3.7, 4.7, 5.6, 9.2, 7.6, 6.7, 6.0, 7.2,
     & 8.2, 7.1, 7.1, 7.7, 7.2/
      DATA FPD/7*3.8,4*3.6,2*3.8,3*3.6,3.8,2*3.6,2*3.8,3*3.6,19*3.8,
     &7*3.6,2*3.5,3.0,2*2.0,3.5,3.0,2.5,2*2.0,3.5,3.0,2.0,2.5,3.0,3.5,
     &3.2,2*3.0,6*3.5/
      DATA NCRUR/60/, NCURB/20/, NCINT/10/, NCRVIS/6/, NCUVIS/3/
      END

******************************************************************
*                        ADDGEN program                          *
******************************************************************
C PROGRAM TO GENERATE RANDOM ADDRESSES FOR SURVEY SIMULATION EXERCISES
C --------------------------------------------------------------------
C
C C. G. MCLAREN, SIMON FRASER UNIVERSITY, DEC. 1984
C MODIFIED BY T. CHANG, UNIVERSITY OF KANSAS, FEB. 1985, SEPT. 1986
C             T. CHANG, UNIVERSITY OF VIRGINIA, APRIL 1991
C maintenance fix January 1995
C COPYRIGHT (C) 1992, T. CHANG AND C. G. MCLAREN
C
C THIS PROGRAM REQUIRES THE FOLLOWING INPUT
C
C A RANDOM NUMBER GENERATOR SEED: INTEGER BETWEEN 1 AND 1000000.
C THE LIST OF DISTRICTS TO BE SAMPLED: THE DISTRICT NUMBERS ARE
C   TO BE SEPARATED BY COMMAS; CONSECUTIVE DISTRICTS CAN BE ENTERED
C   BY TYPING FIRST AND LAST DISTRICT NUMBERS SEPARATED BY - (DASH);
C   THE LIST CAN BE CONTINUED TO A NEW LINE BY ENDING THE PREVIOUS LINE WITH
C   A DOLLAR SYMBOL $.
C THE NUMBER OF ADDRESSES TO BE GENERATED (AT MOST 1000).
C
C THIS PROGRAM PRODUCES ADDRESSES FOR A SIMPLE RANDOM SAMPLE FROM THE
C LISTED DISTRICTS.  THE LIST OF ADDRESSES TOGETHER WITH THE RANDOM
C START IS PUT IN FILE 11.
C
C REMARK: comment lines below contain modifications for use with
C         Language Systems FORTRAN for the Macintosh
C         These modifications set the 'creator' on output files;
C         allowing double click opening of output files
C
C**********************************************************************
C
      INTEGER H(1000), DS(75), DIS(75)
      CHARACTER YES1, YES2,CH
      CHARACTER*8 INUNIT
      INTEGER I,IX,IIX,NADS,NTOT,IDIS,NDIS,N,IH
      DATA DS/142,153,135,128,110,103,105,385,296,287,
     &253,172,198,432,248,251,221,297,235,171,
     &135,254,203,244,202,103,102,115,180,190,
     &152,141,143,135,178,221,174,101,95,130,
     &152,169,91,283,562,312,897,734,963,642,
     &525,726,674,585,553,583,911,1051,918,799,
     &545,895,1313,968,717,651,886,912,898,759,
     &722,753,793,725,802/
      DATA YES1/'y'/, YES2/'Y'/
C
   22 WRITE(*,1100)
 1100 FORMAT(' ENTER FILENAME FOR ADDRESS SET--8 OR FEWER LETTERS')
      READ(*,'(A)',END=22,ERR=22) INUNIT
      OPEN(UNIT=11,FILE=INUNIT)
c in Language Systems FORTRAN for the Mac use the following instead
c     OPEN(UNIT=11,FILE=INUNIT,CREATOR='EDIT')
c creator 'ttxt' for teach text
    5 WRITE(*,109)
  109 FORMAT(' ENTER RANDOM START--ANY INTEGER BETWEEN 1 AND 1000000')
      READ(*,*,END=5,ERR=5) IX
      IF (IX.LT.1.OR.IX.GT.1000000) GO TO 5
      IIX = IX
C
      NADS=0
   20 WRITE(*,104)
  104 FORMAT(' ENTER DISTRICTS FROM WHICH YOU WISH TO SAMPLE')
      CALL RDIS(NDIS,DIS)
      CALL ISRTA(DIS,NDIS)
      NTOT=0
      DO 1 I=1,NDIS
    1 NTOT=NTOT+DS(DIS(I))
      WRITE(*,122) NDIS, NTOT
  122 FORMAT(I8,' DISTRICTS WITH ',I7,' HOUSEHOLDS HAVE BEEN SPECIFIED')
    4 WRITE(*,105)
  105 FORMAT(' ENTER NUMBER OF ADDRESSES TO BE GENERATED (MAX 1000)')
      READ(*,*,END=4,ERR=4) N
      IF (N.GT.1000.OR.N.LT.1) GO TO 4
      IF (N.GT.NTOT) GO TO 905
      CALL NRSAM(N,NTOT,H,IX)
      IDIS=1
      NTOT=0
      DO 8 I=1,N
    6 IF (H(I)-NTOT.LE.DS(DIS(IDIS))) GO TO 7
      NTOT=NTOT+DS(DIS(IDIS))
      IDIS=IDIS+1
      GO TO 6
    7 IH=H(I)-NTOT
      WRITE(11,101) DIS(IDIS),IH
    8 CONTINUE
  101 FORMAT(I4,I6,I25)
      NADS=NADS+N
      WRITE(*,120)
  120 FORMAT(' DO YOU WANT TO SPECIFY A NEW DISTRICT SET'/
     &' ANSWER YES OR NO')
      READ(*,121) CH
  121 FORMAT(A1)
      IF (CH.EQ.YES1.OR.CH.EQ.YES2) GO TO 20
      IH=0
      WRITE(11,101) IH,IH,IIX
      WRITE(*,108) NADS,IIX
  108 FORMAT(' ',I5,' RANDOM ADDRESSES GENERATED WITH RANDOM START',I8)
      CLOSE(11)
      STOP
C
  905 WRITE(*,111) N, NTOT
  111 FORMAT(' SAMPLE SIZE ',I4,' IS GREATER THAN POPULATION SIZE '
     &, I6,' --REENTER.')
      GO TO 4
      END
C
C***********************************************************************
C
      SUBROUTINE RDIS(NDIS,DIS)
C SUBROUTINE TO READ A DISTRICT SPECIFICATION LIST
      INTEGER NUM(2),IVAL(2),DIS(1)
      CHARACTER*1 ICHR(14),LIN(80)
      INTEGER I,J,K,L,M,MM,NDASH,NDIS
      DATA ICHR/'0','1','2','3','4','5','6','7','8','9',' ',',','-','$'/
    1 NDIS=0
      L=1
      K=0
      NDASH=0
   18 READ(*,100) LIN
  100 FORMAT(80A1)
      DO 20 J=1,80
      DO 2 I=1,14
      IF (LIN(J).EQ.ICHR(I)) GO TO 3
    2 CONTINUE
      WRITE(*,101) LIN(J)
  101 FORMAT(' ILLEGAL CHARACTER ',A1,' REENTER LIST.')
      GO TO 1
    3 IF (I.EQ.11.AND.J.LT.80) GO TO 20
      IF (I.EQ.11.AND.J.EQ.80) I=12
      IF (I.GT.10) GO TO 5
      IF (K.LT.2) GO TO 4
      WRITE(*,102) NUM(1), NUM(2), LIN(J)
  102 FORMAT(' 3 DIGIT NUMBER - ',2I1,A1,' ILLEGAL--REENTER LIST')
      GO TO 1
    4 K=K+1
      NUM(K)=I-1
      GO TO 20
    5 IF (I.EQ.14) GO TO 18
      IF (K.GT.0) GO TO 6
      WRITE(*,103)
  103 FORMAT(' NO NUMBER BETWEEN SEPARATORS--REENTER LIST')
      GO TO 1
    6 IF (I.EQ.12) GO TO 7
      NDASH=NDASH+1
      IF (NDASH.EQ.1) GO TO 8
      WRITE(*,104)
  104 FORMAT(' TWO CONSECUTIVE - SEPARATORS NOT ALLOWED--REENTER LIST')
      GO TO 1
    7 IF (NDASH.EQ.1) L=2
    8 IVAL(L)=0
      DO 9 I=1,K
      M=K-I
    9 IVAL(L)=IVAL(L)+NUM(I)*10**M
      K=0
      IF (IVAL(L).GE.1.AND.IVAL(L).LE.75) GO TO 10
      WRITE(*,105) IVAL(L)
  105 FORMAT(' DISTRICT NUMBER ',I5,' OUT OF RANGE--REENTER LIST')
      GO TO 1
   10 IF (L.EQ.2) GO TO 13
      IF (NDASH.EQ.1) GO TO 20
      IF (NDIS.EQ.0) GO TO 12
      DO 11 I=1,NDIS
      IF (IVAL(L).NE.DIS(I)) GO TO 11
C REPORT THE REPEATED DISTRICT ON THE TERMINAL, LIKE THE OTHER MESSAGES
      WRITE(*,106) DIS(I)
      GO TO 1
   11 CONTINUE
  106 FORMAT(' DISTRICT ',I5,' REPEATED--REENTER LIST')
   12 NDIS=NDIS+1
      DIS(NDIS)=IVAL(L)
      GO TO 20
   13 IF (IVAL(1).LT.IVAL(2)) GO TO 14
      WRITE(*,107) IVAL(1),IVAL(2)
  107 FORMAT(' DISTRICTS',I5,' TO ',I5,' NOT ALLOWED--REENTER LIST')
      GO TO 1
   14 IF (NDIS.EQ.0) GO TO 16
      DO 15 I=1,NDIS
      IF (DIS(I).LT.IVAL(1).OR.DIS(I).GT.IVAL(2)) GO TO 15
      WRITE(*,106) DIS(I)
      GO TO 1
   15 CONTINUE
   16 M=IVAL(1)
      MM=IVAL(2)
      DO 17 I=M,MM
      NDIS=NDIS+1
   17 DIS(NDIS)=I
      NDASH=0
      L=1
   20 CONTINUE
      RETURN
      END
C
C*************************************************************************
C
      SUBROUTINE NRSAM(N,NTOT,NOS,IX)
C SUBROUTINE TO GENERATE A RANDOM NON REPLACEMENT SAMPLE
      INTEGER NOS(1),KL(1000)
      INTEGER I,J,K,L,M,N,NN,NTOT,IX
      REAL X,RANDU
      IF (N.LT.NTOT) GO TO 2
      DO 1 I=1,NTOT
    1 NOS(I)=I
      RETURN
    2 NN=N
      IF (N.GT.NTOT/2) NN=NTOT-N
      X=RANDU(IX)
      NOS(1)=MOD(IX,NTOT)+1
      DO 7 I=2,NN
      K=I-1
    3 X=RANDU(IX)
      NOS(I)=MOD(IX,NTOT)+1
      DO 4 L=1,K
      IF (NOS(I).LT.NOS(L)) GO TO 5
      IF (NOS(I).EQ.NOS(L)) GO TO 3
    4 CONTINUE
      GO TO 7
    5 K=K-L+1
      J=NOS(I)
      DO 6 M=1,K
    6 NOS(I-M+1)=NOS(I-M)
      NOS(L)=J
    7 CONTINUE
      IF (N.LE.NTOT/2) RETURN
      DO 8 I=1,NN
    8 KL(I)=NOS(I)
      KL(NN+1)=NTOT+1
      J=1
      K=1
      DO 10 I=1,NTOT
      IF (I.LT.KL(J)) GO TO 9
      J=J+1
      GO TO 10
    9 NOS(K)=I
      K=K+1
   10 CONTINUE
      RETURN
      END
C
C**************************************************************************
C
      SUBROUTINE ISRTA(Y,M)
C SUBROUTINE TO SORT M ELEMENTS IN ARRAY Y INTO ASCENDING ORDER
C
      INTEGER M,MM,I,K,KK,L
      INTEGER Y(M), Z
      IF (M.LE.1) RETURN
      MM=M-1
      DO 4 I=1,MM
      Z=Y(I+1)
      DO 1 K=1,I
      IF (Z-Y(I+1-K)) 1,1,2
    1 CONTINUE
      K=I+1
    2 KK=K-1
      IF (KK.EQ.0) GO TO 4
      DO 3 L=1,KK
    3 Y(I+2-L)=Y(I+1-L)
      Y(I+2-K)=Z
    4 CONTINUE
      RETURN
      END
C
C**************************************************************************
C
      REAL FUNCTION RANDU(IX)
C
C UNIFORM RANDOM NUMBER GENERATOR.
C REF: SCHRAGE, L. 1979 'A MORE PORTABLE FORTRAN RANDOM NUMBER GENERATOR'
C      A. C. M. TRANSACTIONS ON MATHEMATICAL SOFTWARE V5 132-138.
C
      INTEGER A,P,IX,B15,B16,XHI,XALO,LEFTLO,FHI,K
      DATA A/16807/,B15/32768/,B16/65536/,P/2147483647/
      XHI=IX/B16
      XALO=(IX-XHI*B16)*A
      LEFTLO=XALO/B16
      FHI=XHI*A+LEFTLO
      K=FHI/B15
      IX=(((XALO-LEFTLO*B16)-P)+(FHI-K*B15)*B16)+K
      IF (IX.LT.0) IX=IX+P
      RANDU=FLOAT(IX)*4.656612875E-10
      RETURN
      END

*********************************************************************
*                         CENSUS program                            *
*********************************************************************
C CENSUS PROGRAM FOR EDUCATIONAL SAMPLE SURVEY
C ------ ------- --- ----------- ------ ------
C
C TED CHANG, UNIVERSITY OF KANSAS, JUNE 1984, JUNE 1985, SEPT 1986.
C REVISED TED CHANG, UNIVERSITY OF VIRGINIA, APRIL 1991
C REVISED S. LOHR, ARIZONA STATE UNIVERSITY, JULY 1992
C bug fix Aug. 1992
C
C COPYRIGHT (C) 1992, T. CHANG
C
C THIS PROGRAM REQUIRES THE FOLLOWING INPUT:
C ISEED1 = INTEGER NONRESPONSE FUNCTION SEED
C NDIST = NUMBER OF DISTRICTS TO BE ENUMERATED.
C   IF NDIST=75, ALL DISTRICTS WILL BE ENUMERATED.  DISTRICT AND
C   STRATA MEAN AND VARIANCES WILL BE PRINTED ON FILE 11; INDIVIDUAL
C   HOUSE DATA WILL BE PRINTED ON FILE 10.  NO FURTHER INPUT IS
C   NEEDED.
C   IF NDIST IS LESS THAN 75, THE DISTRICTS TO BE ENUMERATED SHOULD
C   BE LISTED--ONE DISTRICT TO A LINE.  INDIVIDUAL HOUSE DATA WILL BE
C   PRINTED ON FILE 10; DISTRICT STATISTICS ON FILE 11.
C
C NSTRAT=THE NUMBER OF STRATA.
C ISTRAT(J,I)=1 IF DISTRICT J IS IN STRATUM I; OTHERWISE ISTRAT(J,I)=0.
C THE STRATA CAN BE DEFINED USING THE FOLLOWING DIMENSION AND DATA
C STATEMENTS.
C
C THIS PROGRAM CALCULATES A VARIETY OF BETWEEN AND WITHIN DISTRICT
C VARIANCES FOR USE WITH TWO STAGE CLUSTER DESIGNS WITH DISTRICTS AS PSU'S
C IN THE NOTATION OF EXERCISE 5 (TWO STAGE CLUSTER SAMPLING):
C    B1, B2, AND B3 ARE AS DEFINED IN THAT EXERCISE
C    W2 IS W OF EXERCISE 5
C    W1 = SUM[(M(i)S(2,i))^2]/[N(Mbar^2)]
C THEN ASSUMING SECOND STAGE FPC'S ARE NEGLIGIBLE, THE VARIANCES OF A
C TWO STAGE ESTIMATE OF A POPULATION MEAN ARE
C DESIGN I:   Clusters chosen with equal probability,
C             subsampling proportional to cluster size
C             unbiased estimate
C             MSE = ((1-f1)/n) B1 + W2/(nmbar)
C DESIGN IA:  Clusters chosen with equal probability,
C             subsampling with constant size
C             unbiased estimate
C             MSE = ((1-f1)/n) B1 + W1/(nmbar)
C DESIGN II:  Clusters chosen with equal probability,
C             subsampling proportional to cluster size
C             ratio to cluster size estimate
C             MSE = ((1-f1)/n) B2 + W2/(nmbar)
C DESIGN IIA: Clusters chosen with equal probability,
C             subsampling with constant size
C             ratio to cluster size estimate
C             MSE = ((1-f1)/n) B2 + W1/(nmbar)
C DESIGN III: Clusters chosen with probability proportional to size
C             and with replacement
C             subsampling with constant size
C             unbiased estimate
C             MSE = B3/n + W2/(nmbar)
C
C FOR ESTIMATING OVER THE POPULATION OF HOUSEHOLDS WILLING TO PAY AT
C LEAST $N FOR CABLE SERVICE A RATIO ESTIMATE WAS USED:
C    ESTIMATED TOTAL OF ITEM IN HOUSEHOLDS WILLING TO PAY $N
C    (each household not willing to pay $N is counted as zero)
C    ---------------------------------------------------------
C    ESTIMATED NUMBER OF HOUSES WILLING TO PAY $N
C SEE COCHRAN, SECTION 11.12.  NOTICE THAT IN THIS CASE THE RATIO AND
C UNBIASED ESTIMATORS COINCIDE
C
C NOTE: NONRESPONSE OF TYPE 3 (RANDOM ANSWERS) IS BROKEN INTO TWO
C       CATEGORIES: PATHOLOGICAL LIARS AND THE OTHERS.
C       THERE IS AN EXTRA COLUMN DUE TO THE ELIMINATION OF A FORM OF
C       NONRESPONSE FROM THE SURVEY PROGRAM.
C
C NOTE: A CORRELATION OF -9.999 WITH HOUSE PRICE IS GIVEN FOR ANY
C       ITEM WITH ZERO STANDARD DEVIATION.  THIS IS NOT A PROGRAMMING ERROR.
C
C--------------------------------------------------------------
C
      DIMENSION ISTRAT(75,11)
      DIMENSION NHOUSE(75), FPD(75), XHP(75)
      DIMENSION NHOURS(5), IRURAL(75)
      COMMON IRURAL,JSEED,NHP,NADULT,NCHIL,NTV,PRICE,NHOURS
      COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS
      DIMENSION NR(5),FMEAN(75,15),STAN(75,15),NPOP(75),NNR(75,5)
      DIMENSION PNR(75,5),CORR(75,15)
      DOUBLE PRECISION SUM(75,15,6),SUMSQ(75,15,6),SUMCRS(75,15,6)
      CHARACTER*1 ANS,YCAP,YSMALL
      CHARACTER*4 ACOMMA,ABLANK,ATO,LINE2(50)
      CHARACTER*8 QUESTN(15)
      LOGICAL IPRINT
      DIMENSION LINE1(50)
      DOUBLE PRECISION R(15),Y,YSQ,YCR,STRSUM(15,6),XHST,XHST2
      DIMENSION PNRS(5), STRSTA(15,9)
      DIMENSION NHPD(75,8), NHPS(8), NHPDIV(8)
C
      DATA NSTRAT/11/, ISTRAT/75*1,
     & 46*1,29*0,
     & 43*1,32*0,
     & 43*0,3*1,29*0,
     & 46*0,4*1,25*0,
     & 50*0,25*1,
     & 50*0,1,4*0,1,4*0,1,4*0,1,9*0,
     & 56*0,1,9*0,1,2*0,6*1,
     & 51*0,1,9*0,1,2*0,1,2*0,2*1,6*0,
     & 52*0,1,4*0,1,4*0,2*1,11*0,
     & 53*0,2*1,3*0,2*1,15*0/
C
C YCAP AND YSMALL ARE SET IN A DATA STATEMENT, AS STANDARD FORTRAN 77
C DOES NOT ALLOW INITIAL VALUES IN A TYPE DECLARATION.
      DATA YCAP/'Y'/, YSMALL/'y'/
      DATA ACOMMA/', '/, ATO/' TO '/, ABLANK/' '/
      DATA QUESTN/'H. VALUE','1 ADULTS','2 CHILDN','3 TVS ',
     &'4 CPRICE','$5 ','$10 ','$15 ',
     &'$20 ','$25 ','5 TV HRS','6 NEWS ','7 SPORTS',
     &'8 CHILDN','9 MOVIES'/
      DATA NHPDIV/0,40000,50000,60000,70000,80000,90000,100000/
C
      NHST=0
      AVFAM=0.
      AVNHP = 0.
      SDNHP = 0.
      DO 1 I=1,43
      NHST=NHST+NHOUSE(I)
      AVNHP = AVNHP + XHP(I)*NHOUSE(I)
      SDNHP = SDNHP + XHP(I)*XHP(I)*NHOUSE(I)
      AVFAM=AVFAM+NHOUSE(I)*FPD(I)
    1 CONTINUE
      AVRUR=FLOAT(NHST)/43.
      DO 7 I=1,43
      IRURAL(I)=2
      IF (NHOUSE(I).LE.AVRUR) IRURAL(I)=3
    7 CONTINUE
      DO 2 I=44,75
      NHST=NHST+NHOUSE(I)
      AVNHP = AVNHP + XHP(I)*NHOUSE(I)
      SDNHP = SDNHP + XHP(I)*XHP(I)*NHOUSE(I)
      AVFAM=AVFAM+NHOUSE(I)*FPD(I)
      IRURAL(I)=0
      IF (I.GE.51) GO TO 2
      IRURAL(I)=1
      IF (I.GE.47) GO TO 2
      IRURAL(I)=2
    2 CONTINUE
      AVFAM=AVFAM/NHST
      WITHIN = (43.*AVRUR*4.591837 + (NHST-43.*AVRUR)*0.4445079)/NHST
      SDNHP = 7000.*SQRT(SDNHP/NHST - (AVNHP/NHST)**2 + WITHIN)
      AVNHP = 7000.*(AVNHP/NHST) + 30000.
   25 WRITE(*,1220)
 1220 FORMAT(' ENTER DESIRED THREE NONRESPONSE RATES:'/
     & ' NOT-AT-HOMES, REFUSALS, RANDOM ANSWERS')
      READ(*,*,ERR=25,END=25) PNRSP1,PNRSP2,PNRSP3
      IF(PNRSP1 .LT. 0.0 .OR. PNRSP2 .LT. 0.0 .OR. PNRSP3 .LT. 0.0
     &.OR. PNRSP1 .GT. 1.0 .OR. PNRSP2 .GT. 1.0 .OR. PNRSP3 .GT. 1.0)
     & THEN
         WRITE(*,1230)
 1230    FORMAT(' NONRESPONSE RATES MUST BE BETWEEN 0 AND 1')
         GO TO 25
      END IF
      P4=.2*PNRSP3
      WRITE(*,*) 'ENTER NONRESPONSE SEED'
      READ(*,*) ISEED1
C
      DO 4 I=1,75
      DO 6 J=1,8
    6 NHPD(I,J)=0.
      DO 5 J=1,5
      NNR(I,J)=0
    5 CONTINUE
      DO 4 J=1,15
      DO 4 K=1,6
      SUM(I,J,K)=0.
      SUMSQ(I,J,K)=0.
      SUMCRS(I,J,K)=0.
    4 CONTINUE
      WRITE(*,*) 'DO YOU WANT TO PRINT INDIVIDUAL HOUSE ANSWERS?'
      WRITE(*,*) 'BE SURE YOU HAVE ADEQUATE DISK SPACE!!!'
      READ(*,1125) ANS
      IPRINT=((ANS.EQ.YCAP).OR.(ANS.EQ.YSMALL))
      WRITE(*,*) 'ENTER NUMBER OF DISTRICTS TO BE CENSUSED'
      READ(*,*) NDIST
C
      DO 790 K=1,NDIST
      IF (NDIST.EQ.75) THEN
         I=K
         WRITE(*,*) I
      ELSE
         WRITE(*,*) 'ENTER DISTRICT NUMBER'
         READ(*,*) I
      ENDIF
      IF (IPRINT) WRITE(10,1000) I, ISEED1
      NHOUSD=NHOUSE(I)
      DO 690 J=1,NHOUSD
      DO 40 I2=1,5
      NR(I2)=0
   40 CONTINUE
      J1=J
      CALL ANSWER(I,J1)
C
C NONRESPONSE MODULE
      JSEED1=NSEED(ISEED1,ISEED,JSEED)
      FRESP=RANDU(JSEED1)
      PNHOME=PNRSP1*EXP(1.-(NADULT+NCHIL)/AVFAM)
      IF (FRESP.LT.PNHOME) NR(1)=1
      IF (PNRSP2 .GT. 0.999) THEN
         NR(2) = 1
      ELSE
         FRESP=RANDU(JSEED1)
         DEV = (NHP - AVNHP)/SDNHP
         PREFUS = 1.-EXP(- .5 * DEV**2 * (1./(1.-PNRSP2)**2 - 1.) )
         IF (FRESP.LT.PREFUS) NR(2) = 1
      END IF
      FRESP=RANDU(JSEED)
      IF (FRESP.LT.P4) NR(4)=1
      IF ((FRESP.GT.P4).AND.(FRESP.LT.PNRSP3)) NR(5)=1
C
      DO 20 KK=1,7
      KK1=KK+1
      IF (NHP.GE.NHPDIV(KK1)) GO TO 20
      NHPD(I,KK)=NHPD(I,KK)+1
      GO TO 600
   20 CONTINUE
      NHPD(I,8)=NHPD(I,8)+1
C
  600 IPRICE=MIN0(IFIX(PRICE/5.),5)+1
      NPRICE=5*IPRICE-5
      IF (IPRINT) WRITE(10,1010) J,NHP,NADULT,NCHIL,NTV,NPRICE
     &,NHOURS,NR
      Y=NHP
      SUM(I,1,IPRICE)=SUM(I,1,IPRICE)+Y
      SUMSQ(I,1,IPRICE)=SUMSQ(I,1,IPRICE)+Y*Y
      SUM(I,2,IPRICE)=SUM(I,2,IPRICE)+NADULT
      SUMSQ(I,2,IPRICE)=SUMSQ(I,2,IPRICE)+NADULT*NADULT
      SUM(I,3,IPRICE)=SUM(I,3,IPRICE)+NCHIL
      SUMSQ(I,3,IPRICE)=SUMSQ(I,3,IPRICE)+NCHIL*NCHIL
      SUM(I,4,IPRICE)=SUM(I,4,IPRICE)+NTV
      SUMSQ(I,4,IPRICE)=SUMSQ(I,4,IPRICE)+NTV*NTV
      SUM(I,5,IPRICE)=SUM(I,5,IPRICE)+NPRICE
      SUMSQ(I,5,IPRICE)=SUMSQ(I,5,IPRICE)+NPRICE*NPRICE
      SUMCRS(I,1,IPRICE)=SUMSQ(I,1,IPRICE)
      SUMCRS(I,2,IPRICE)=SUMCRS(I,2,IPRICE)+Y*NADULT
      SUMCRS(I,3,IPRICE)=SUMCRS(I,3,IPRICE)+Y*NCHIL
      SUMCRS(I,4,IPRICE)=SUMCRS(I,4,IPRICE)+Y*NTV
      SUMCRS(I,5,IPRICE)=SUMCRS(I,5,IPRICE)+Y*NPRICE
      DO 610 II=1,5
      NNR(I,II)=NNR(I,II)+NR(II)
      I2=II+5
      IF (IPRICE.GT.II) THEN
         SUM(I,I2,IPRICE)=SUM(I,I2,IPRICE)+1
         SUMSQ(I,I2,IPRICE)=SUM(I,I2,IPRICE)
         SUMCRS(I,I2,IPRICE)=SUMCRS(I,I2,IPRICE)+Y
      ENDIF
      I2=II+10
      SUM(I,I2,IPRICE)=SUM(I,I2,IPRICE)+NHOURS(II)
      SUMSQ(I,I2,IPRICE)=SUMSQ(I,I2,IPRICE)+NHOURS(II)*NHOURS(II)
      SUMCRS(I,I2,IPRICE)=SUMCRS(I,I2,IPRICE)+NHOURS(II)*Y
  610 CONTINUE
  690 CONTINUE
C
C DISTRICT STATISTICS MODULE
      DO 710 I2=1,5
  710 PNR(I,I2)=FLOAT(NNR(I,I2))/NHOUSD
      NPOP(I)=0
      DO 720 KK=1,6
  720 NPOP(I)=.1+NPOP(I)+SUM(I,2,KK)+SUM(I,3,KK)
      DO 700 I2=1,15
      FMEAN(I,I2)=0.
      STAN(I,I2)=0.
      CORR(I,I2)=0.
      DO 705 KK=1,6
      FMEAN(I,I2)=FMEAN(I,I2)+SUM(I,I2,KK)
      STAN(I,I2)=STAN(I,I2)+SUMSQ(I,I2,KK)
  705 CORR(I,I2)=CORR(I,I2)+SUMCRS(I,I2,KK)
      FMEAN(I,I2)=FMEAN(I,I2)/NHOUSD
      TEMP=(STAN(I,I2)-NHOUSD*FMEAN(I,I2)**2)/(NHOUSD-1)
      STAN(I,I2)=SQRT(ABS(TEMP))
      TEMP=(CORR(I,I2)-NHOUSD*FMEAN(I,I2)*FMEAN(I,1))/(NHOUSD-1)
      IF (STAN(I,I2).GT.0.) THEN
         CORR(I,I2)=TEMP/(STAN(I,I2)*STAN(I,1))
      ELSE
         CORR(I,I2)=-9.999
      ENDIF
  700 CONTINUE
      IF (.NOT.IPRINT) GO TO 790
      WRITE(10,1020)
      WRITE(10,1030)
      WRITE(10,1040) I,NHOUSE(I),NPOP(I),(FMEAN(I,I2),I2=1,15)
      WRITE(10,1050)
      WRITE(10,1040) I,NHOUSE(I),NPOP(I),(STAN(I,I2),I2=1,15)
      WRITE(10,1110)
      WRITE(10,1045) I,NHOUSE(I),NPOP(I),(CORR(I,I2),I2=1,15)
      WRITE(10,1095) (PNR(I,I2),I2=1,5)
  790 CONTINUE
C
C SUMMARY OF DISTRICT STATISTICS MODULE
      WRITE(11,1070) ISEED1
      WRITE(11,1020)
      WRITE(11,1030)
      DO 791 I=1,75
      IF (NPOP(I).EQ.0) GO TO 791
      WRITE(11,1040) I,NHOUSE(I),NPOP(I),(FMEAN(I,I2),I2=1,15)
  791 CONTINUE
      WRITE(11,1070) ISEED1
      WRITE(11,1020)
      WRITE(11,1050)
      DO 792 I=1,75
      IF (NPOP(I).EQ.0) GO TO 792
      WRITE(11,1040) I,NHOUSE(I),NPOP(I),(STAN(I,I2),I2=1,15)
  792 CONTINUE
      WRITE(11,1070) ISEED1
      WRITE(11,1020)
      WRITE(11,1110)
      DO 793 I=1,75
      IF (NPOP(I).EQ.0) GO TO 793
      WRITE(11,1045) I,NHOUSE(I),NPOP(I),(CORR(I,I2),I2=1,15)
  793 CONTINUE
      WRITE(11,1070) ISEED1
      WRITE(11,1060)
      DO 794 I=1,75
      IF (NPOP(I).EQ.0) GO TO 794
      WRITE(11,1065) I,(PNR(I,I2),I2=1,5)
  794 CONTINUE
C
C STRATUM STATISTICS MODULE: DISTRICT LIST SUBMODULE
      IF (NSTRAT.LE.0) STOP
      DO 999 I=1,NSTRAT
      NLINE=1
      J=0
  800 J=J+1
      IF (J.GE.76) GO TO 890
      IF (ISTRAT(J,I).EQ.0) GO TO 800
C START OF A SERIES OF DISTRICTS IN STRATUM
      ILO=J
  820 IHI=J
      IF (NPOP(J).EQ.0) GO TO 999
      J=J+1
      IF ((J.LE.75).AND.(ISTRAT(J,I).EQ.1)) GO TO 820
C END OF A SERIES OF DISTRICTS IN STRATUM
      LINE1(NLINE)=ILO
      LINE2(NLINE)=ACOMMA
      NLINE=NLINE+1
      IF (IHI.EQ.ILO) GO TO 800
      LINE1(NLINE)=IHI
      LINE2(NLINE)=ACOMMA
      NLINE=NLINE+1
      IF (IHI.NE.(ILO+1)) LINE2(NLINE-2)=ATO
      GO TO 800
  890 IF (NLINE.LE.2) GO TO 999
      NLINE=NLINE-1
      LINE2(NLINE)=ABLANK
C
C CALCULATION OF STRATUM STATISTICS MODULE
      N1=0
      NPOPS=0
      NHST=0
      XHST2=0.
      DO 900 K=1,5
  900 PNRS(K)=0.
      DO 901 K=1,8
  901 NHPS(K)=0
C
      DO 990 IPRICE=1,6
      WRITE(11,1070) ISEED1
      WRITE(11,1080) (LINE1(K),LINE2(K),K=1,NLINE)
      IF (IPRICE.NE.1) THEN
         NPRICE=5*IPRICE-5
         WRITE(11,1120) NPRICE
      ENDIF
      IPR1=IPRICE-1
      IPR4=IPRICE+4
      DO 910 K=1,15
      STRSTA(K,8)=0.
STRSTA(K,9)=0. DO 910 J=1,6 910 STRSUM(K,J)=0. NHSTC=0 C DO 950 J=1,75 IF (ISTRAT(J,I).EQ.0) GO TO 950 DO 916 K=1,8 916 NHPS(K)=NHPS(K)+NHPD(J,K) NHOUSD=NHOUSE(J) IF (IPRICE.EQ.1) THEN N1=N1+1 NPOPS=NPOPS+NPOP(J) NHST=NHST+NHOUSD NHSTC=NHST XHST2=XHST2+NHOUSD**2 DO 920 K=1,5 920 PNRS(K)=PNRS(K)+NNR(J,K) ELSE NPOPS=.1+NPOPS-SUM(J,2,IPR1)-SUM(J,3,IPR1) DO 921 KK=IPRICE,6 921 NHSTC=.1+NHSTC+SUM(J,IPR4,KK) ENDIF C DO 930 K=1,15 Y=0. YSQ=0. YCR=0. DO 935 KK=IPRICE,6 Y=Y+SUM(J,K,KK) YSQ=YSQ+SUMSQ(J,K,KK) 935 YCR=YCR+SUMCRS(J,K,KK) STRSUM(K,1)=STRSUM(K,1)+Y STRSUM(K,5)=STRSUM(K,5)+YSQ STRSUM(K,6)=STRSUM(K,6)+YCR 930 CONTINUE 950 CONTINUE C DO 905 K=1,15 IF (NHSTC.LE.1) GO TO 975 R(K)=STRSUM(K,1)/NHSTC 905 IF (IPRICE.EQ.1) R(K)=0. DO 915 J=1,75 IF (ISTRAT(J,I).EQ.0) GO TO 915 NHOUSD=NHOUSE(J) NHOUSC=0 DO 906 KK=IPRICE,6 906 NHOUSC=.1+NHOUSC+SUM(J,IPR4,KK) IF (IPRICE.EQ.1) NHOUSC=NHOUSD DO 925 K=1,15 Y=0. YSQ=0. DO 945 KK=IPRICE,6 Y=Y+SUM(J,K,KK) 945 YSQ=YSQ+SUMSQ(J,K,KK) YSQ=YSQ+(R(K)**2)*NHOUSC-2.*R(K)*Y Y=Y-R(K)*NHOUSC STANR=(YSQ-Y*Y/NHOUSD)/(NHOUSD-1) STANR=SQRT(ABS(STANR)) STRSUM(K,2)=STRSUM(K,2)+Y*Y STRSUM(K,3)=STRSUM(K,3)+NHOUSD*Y STRSUM(K,4)=STRSUM(K,4)+Y*Y/NHOUSD STRSTA(K,8)=STRSTA(K,8)+(NHOUSD*STANR)**2 STRSTA(K,9)=STRSTA(K,9)+NHOUSD*(STANR**2) 925 CONTINUE 915 CONTINUE C XHST=NHST XHSTC=NHSTC DO 960 K=1,15 Y=STRSUM(K,1) STRSTA(K,1)=Y/XHSTC STRSTA(K,3)=(XHSTC*STRSUM(K,5)-Y*Y)/(XHSTC*(XHSTC-1.)) STRSTA(K,2)=SQRT(ABS(STRSTA(K,3))) STRSTA(K,4)=(XHSTC*STRSUM(K,6)-Y*STRSUM(1,1))/(XHSTC*(XHSTC-1)) IF (STRSTA(K,2).GT.0.) THEN STRSTA(K,4)=STRSTA(K,4)/(STRSTA(K,2)*STRSTA(1,2)) ELSE STRSTA(K,4)=-9.999 ENDIF IF (IPRICE.GT.1) Y=0. 
STRSTA(K,5)=N1*(N1*STRSUM(K,2)-Y*Y)/((N1-1)*XHST**2) STRSTA(K,6)=N1**2*(XHST**2*STRSUM(K,2)-2.*Y*XHST*STRSUM(K,3) &+XHST2*Y**2)/((XHST**4)*(N1-1)) STRSTA(K,7)=(XHST*STRSUM(K,4)-Y**2)/(XHST**2) STRSTA(K,8)=STRSTA(K,8)*N1/(XHST**2) STRSTA(K,9)=STRSTA(K,9)/XHST IF (IPRICE.EQ.1) GO TO 960 DO 965 KK=5,9 965 STRSTA(K,KK)=(XHST**2)*STRSTA(K,KK)/(XHSTC**2) 960 CONTINUE 975 WRITE(11,1090) N1,NHSTC,NPOPS IF (NHSTC.LE.1) GO TO 990 IF (IPRICE.EQ.1) THEN DO 970 K=1,5 970 PNRS(K)=PNRS(K)/NHST WRITE(11,1095) PNRS DO 971 K=1,7 K1=K+1 NTOP=NHPDIV(K1)-1 971 WRITE(11,1130) NHPDIV(K),NTOP,NHPS(K) WRITE(11,1135) NHPDIV(8),NHPS(8) ENDIF WRITE(11,1100) (QUESTN(K),(STRSTA(K,J),J=1,9) ,K=1,15) 990 CONTINUE 999 CONTINUE STOP C C 1000 FORMAT(' DISTRICT ',I2,' CENSUS WITH ISEED1 = ',I12/'0NO.',4X, &'VALUE',3X,'1',3X,'2',3X,'3',3X,'4',4X,'5',4X,'6',4X,'7',4X, &'8',4X,'9',1X,'N1',1X,'N2',1X,'N3',1X,'N4',1X,'N5') 1010 FORMAT(1X,I4,2X,I6,4(2X,I2),5(2X,I3),5(2X,I1)) 1020 FORMAT(4X,'HOUSE',5X,'POP',3X,'HVALUE',5X,'1',5X,'2',5X,'3', &6X,'4',5X,'$5',3X,'$10',3X,'$15',3X,'$20',3X,'$25',7X,'5',5X,'6' &,5X,'7',5X,'8',5X,'9') 1030 FORMAT(9X,'COUNT',6X,4(2X,'MEAN'),3X,'MEAN',10X,'PROPORTIONS', &12X,5(2X,'MEAN')) 1040 FORMAT((1X,I2,2X,I4,2X,I6,2X,F7.0,3(2X,F4.2),2X,F5.2,1X,5(1X,F5.3) &,2X,F6.2,4(1X,F5.2))) 1045 FORMAT(1X,I2,2X,I4,2X,I6,F9.3,3F6.3,2F7.3,4F6.3,F8.3,4F6.3) 1050 FORMAT(9X,'COUNT',8X,'STANDARD DEVIATIONS') 1060 FORMAT(5X,'1',5X,'2',11X,'3A',4X,'3B',4X,'NONRESPONSE & PROPORTIONS') 1065 FORMAT(1X,I2,5(2X,F4.2)) 1070 FORMAT(' CENSUS WITH ISEED1 = ',I12) 1080 FORMAT(' STRATUM: ',10(I2,A4)/(10X,10(I2,A4))) 1090 FORMAT(' DISTRICTS: ',I2,5X,'HOUSEHOLDS: ',I5,5X,'POPULATION: ' &,I6) 1095 FORMAT(' NONRESPONSE PROPORTIONS--TYPE 1: ',F5.3,3X,'TYPE 2: ', &F5.3,3X,'UNUSED',F5.3,3X,'TYPE 3A: ',F5.3,3X,'TYPE 3B: ',F5.3) 1100 FORMAT(11X,'MEAN',7X,'ST. 
DEV.',3X,'VARIANCE',6X,'CORR.',4X, &'BETWEEN DISTRICTS VARIANCE',16X,'WITHIN DISTRICTS VARIANCE'/ &56X,'B1',12X,'B2',12X,'B3',12X,'W1',12X,'W2'/ &1X,A8,2X,F8.0,3X,F8.0,3X,E11.5,3X,F6.3,5(3X,E11.5)/ &(1X,A8,2X,F8.3,3X,F8.3,3X,E11.5,3X,F6.3,5(3X,E11.5))) 1110 FORMAT(9X,'COUNT',8X,'CORRELATIONS WITH HOUSE PRICE') 1120 FORMAT(' PROFILE OF HOUSES WILLING TO PAY $',I2) 1125 FORMAT(A1) 1130 FORMAT(' NUMBER OF HOUSES VALUE',I7,' TO',I7,':',I7) 1135 FORMAT(' NUMBER OF HOUSES VALUE',I7,' AND OVER: ',I7) END C C******************************************************************* C SUBROUTINE ANSWER(I,J) C C SUBROUTINE TO GENERATE ANSWERS TO QUESTIONAIRE FOR ADDRESS I,J C DIMENSION NHOUSE(75),FPD(75),XHP(75),NHOURS(5),ERRHSE(5) DIMENSION IRURAL(75) COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS COMMON IRURAL,JSEED,NHP,NADULT,NCHIL,NTV,PRICE,NHOURS C C HOUSE PRICE GENERATION MODULE IF (I.GT.43) GO TO 150 JSEED=NSEED(I,J,ISEED) ERR1=1.5*RANDN(JSEED) GO TO 190 150 DO 160 NEIGH=1,5 J1=MOD(NHOUSE(I)+J+NEIGH-2,NHOUSE(I)) JSEED=NSEED(I,J1,ISEED) ERRHSE(NEIGH)=RANDN(JSEED) 160 CONTINUE ERR1=.25*(.5*ERRHSE(1)+ERRHSE(2)+ERRHSE(3)+ERRHSE(4)+.5*ERRHSE(5)) 190 NHP=30000+7000*XHP(I)+10000*ERR1 NHP=MAX0(10000,NHP) C C FAMILY COUNT MODULE RLAM=FPD(I)+ERR1-1 RLAM=AMAX1(RLAM,.5) NFAM=IRANDP(RLAM,JSEED)+1 FCHIL=RANDU(JSEED) NCHIL=FCHIL*(NFAM-1) NADULT=NFAM-NCHIL C C NUMBER OF TV'S AND PRICE RESPONSE MODULE NTV=.2*NFAM+.00001*NHP+RANDN(JSEED)+1.0 NTV=MAX0(0,NTV) NTV=MIN0(5,NTV) ERR2=RANDN(JSEED) PRICE=1.75*XHP(I)+2.5*ERR1+IRURAL(I)+NFAM+3.*ERR2 PRICE=AMAX1(PRICE,0.) IF (NTV.EQ.0) PRICE=0. 
C
C     WATCHING HABITS MODULE
      NHOURS(2)=(NFAM+ERR2+IRURAL(I)+RANDN(JSEED))*1.5
      NHOURS(3)=(NFAM+ERR2+IRURAL(I)+3.*RANDN(JSEED))*4.0
      NHOURS(4)=6+NCHIL*3+IRURAL(I)+4.0*RANDN(JSEED)
      IF (NCHIL.EQ.0) NHOURS(4)=0
      NHOURS(5)=(NFAM+ERR2+IRURAL(I)+RANDN(JSEED))*4.0
      NHRT=0
      DO 510 II=2,5
      NHOURS(II)=MAX0(NHOURS(II),0)
      NHRT=NHOURS(II)+NHRT
  510 CONTINUE
      NHOURS(1)=NHRT*(1+.5*ABS(RANDN(JSEED)))
      NHRMX=70.+46.*SQRT(FLOAT(NTV))+10.*RANDU(JSEED)
      IF (NTV.EQ.0) NHRMX=0
      IF (NHOURS(1).LE.NHRMX) RETURN
      FACT=FLOAT(NHRMX)/NHOURS(1)
      DO 520 II=1,5
      NHOURS(II)=NHOURS(II)*FACT
  520 CONTINUE
      RETURN
      END
C
C*******************************************************************
C
      INTEGER FUNCTION NSEED(I1,I2,I3)
C
C     SUBROUTINE TO COMBINE THREE INTEGERS INTO A SEED
C
      INTEGER I1,I2,I3,J(3),JSEED,K,IND1,NSEED1
      REAL X
      DATA JSEED/2147483645/
      J(1) = MOD(IABS(I1),JSEED) + 1
      J(2) = MOD(IABS(I2),JSEED) + 1
      J(3) = MOD(IABS(I3),JSEED) + 1
      X=RANDU(J(1))
      IND1=1
      NSEED1=0
      DO 20 K=1,9
      NSEED1=NSEED1*10
      IND1=MOD(J(IND1),2)+IND1
      IND1=MOD(IND1,3)+1
      X=RANDU(J(IND1))
      NSEED1=NSEED1+MOD(J(IND1),10)
      X=RANDU(J(IND1))
   20 CONTINUE
      X=RANDU(NSEED1)
      NSEED=NSEED1
      RETURN
      END
C
C*********************************************************************
C
      REAL FUNCTION RANDU(IX)
C
C     UNIFORM RANDOM NUMBER GENERATOR.
C     REF: SCHRAGE, L. 1979 'A MORE PORTABLE FORTRAN RANDOM NUMBER GENERATOR'
C     A. C. M. TRANSACTIONS ON MATHEMATICAL SOFTWARE V5 132-138.
C
      INTEGER A,P,IX,B15,B16,XHI,XALO,LEFTLO,FHI,K
      DATA A/16807/,B15/32768/,B16/65536/,P/2147483647/
      XHI=IX/B16
      XALO=(IX-XHI*B16)*A
      LEFTLO=XALO/B16
      FHI=XHI*A+LEFTLO
      K=FHI/B15
      IX=(((XALO-LEFTLO*B16)-P)+(FHI-K*B15)*B16)+K
      IF (IX.LT.0) IX=IX+P
      RANDU=FLOAT(IX)*4.656612875E-10
      RETURN
      END
C
C*********************************************************************
C
      REAL FUNCTION RANDN(ISEED)
C
C     STANDARD NORMAL RANDOM NUMBER GENERATOR
C
      ISEED1=ISEED
      DO 10 I=1,1000
      X1=2.*RANDU(ISEED)-1.
      X2=2.*RANDU(ISEED)-1.
      S=X1*X1+X2*X2
      IF (S.LT.1.)
GO TO 20 10 CONTINUE WRITE(11,1000) ISEED1 RANDN=X1 RETURN 20 RANDN=X1*SQRT(-2.*ALOG(S)/S) RETURN 1000 FORMAT('0ERROR IN RANDN. PLEASE COPY THE FOLLOWING NUMBER AND & INFORM INSTRUCTOR'/1X,I12) END C C********************************************************************* C INTEGER FUNCTION IRANDP(RLAM,ISEED) C C POISSON RANDOM NUMBER GENERATOR. C ISEED1=ISEED S=0. DO 10 I=1,1000 S=S-ALOG(RANDU(ISEED)) IF (S.GT.RLAM) GO TO 20 10 CONTINUE WRITE(11,1000) ISEED1 IRANDP=1000 RETURN 20 IRANDP=I-1 RETURN 1000 FORMAT('0ERROR IN IRANDP. PLEASE COPY THE FOLLOWING NUMBER AND & INFORM INSTRUCTOR.'/1X,I12) END C BLOCK DATA INITL INTEGER NHOUSE(75),ISEED REAL FPD(75),XHP(75) COMMON /PARAM/ISEED,NHOUSE,XHP,FPD,NCRUR,NCURB,NCINT,NCRVIS,NCUVIS DATA ISEED/4214202/ DATA NHOUSE/142,153,135,128,110,103,105,385,296,287, &253,172,198,432,248,251,221,297,235,171, &135,254,203,244,202,103,102,115,180,190, &152,141,143,135,178,221,174,101,95,130, &152,169,91,283,562,312,897,734,963,642, &525,726,674,585,553,583,911,1051,918,799, &545,895,1313,968,717,651,886,912,898,759, &722,753,793,725,802/ DATA XHP/ 5.0, 4.1, 5.0, 3.7, 3.9, 4.0, 5.7, 6.8, 6.3, 5.6, & 4.3, 3.6, 5.0, 6.7, 5.3, 3.4, 5.4, 6.9, 5.9, 3.2, & 5.3, 5.0, 6.7, 6.7, 5.6, 3.7, 4.5, 4.5, 5.4, 5.6, & 5.6, 4.5, 3.8, 3.8, 4.5, 4.4, 3.5, 3.3, 4.0, 3.7, & 3.6, 3.8, 3.0, 4.3, 3.9, 3.1, 4.6, 4.5, 4.3, 3.5, & 9.3, 5.5, 3.5, 2.6, 1.9, 9.4, 7.8, 4.0, 1.0, 2.1, &10.3, 6.5, 3.7, 4.7, 5.6, 9.2, 7.6, 6.7, 6.0, 7.2, & 8.2, 7.1, 7.1, 7.7, 7.2/ DATA FPD/7*3.8,4*3.6,2*3.8,3*3.6,3.8,2*3.6,2*3.8,3*3.6,19*3.8, &7*3.6,2*3.5,3.0,2*2.0,3.5,3.0,2.5,2*2.0,3.5,3.0,2.0,2.5,3.0,3.5, &3.2,2*3.0,6*3.5/ DATA NCRUR/60/, NCURB/20/, NCINT/10/, NCRVIS/6/, NCUVIS/3/ END ************************************************************* * sample exercises (text form) * ************************************************************* Sample Exercises Using the SURVEY Simulation Programs The following sample exercises were designed for an introductory course using Cochran (1977). 
Most of them take a long time, and, depending upon how quickly your class catches on, you may find some of them too repetitious.

The population values given in Assignment 1 depend upon the BLOCK DATA sections of SURVEY and can be checked by using the CENSUS program. In changing the BLOCK DATA values in SURVEY it is important to keep a reasonable geographic variation in district house value and population density.

These files were made by manually changing the formulas in formatted Microsoft Word files for the MacIntosh. We hope that we have converted all the mathematical symbols and apologize if our failure to do so causes gibberish to print out. The original formatted files, for those of you with a Mac running Word, are included.

The references are

Cochran, W. G. (1977). Sampling Techniques. Wiley.
Wolter, K. M. (1985). Introduction to Variance Estimation. Springer-Verlag.

Your comments and suggestions are always welcome. We hope you find the simulation programs useful.

Ted Chang tcc8v@virginia.edu
Sharon Lohr lohr@cholla.la.asu.edu
C. Graham McLaren

The SURVEY simulation programs and their associated exercises are made available free of charge to the educational community. Permission to copy for classroom use is hereby granted. Appropriate attribution would be appreciated. All other rights are reserved. No warranty express or implied is provided and the use of the SURVEY simulation programs and their associated exercises is completely at the risk of the user.

Copyright (C) 1992, Sharon Lohr, Ted Chang, C. G. McLaren

Assignment 1

SIMPLE RANDOM SAMPLING AND THE SURVEY SIMULATIONS

WARNING: This assignment takes longer than you think. Start early.

INTRODUCTION TO SURVEY:

The computer program SURVEY simulates the results and costs that might be experienced in actual sample surveys.
The exercises using SURVEY are designed to provide a practical illustration of the theoretical aspects of survey design, and to allow comparisons between the different designs discussed in the course. THE SIMULATED SURVEY ENVIRONMENT: Stephens County Stephens County is a fictitious county in the midwestern part of the United States with a population of approximately 103,000. It has two main cities: Lockhart City, population 57,500, and Eavesville, population 11,700. Both cities are commercial and transportation centers and boast a variety of light industries. Among the county's industrial products are farm chemicals, pet foods, cable and wire, aircraft radios, greeting cards, corrugated paper boxes, industrial gases, and pipe organs. The county has three smaller municipalities: Villegas, Weldon, and Routledge with populations between one and two thousand. These cities are local commercial centers. The surrounding areas are agricultural, although a sizeable number of persons commute to the larger cities. The county's main agricultural products are beef cattle, wheat, sorghum and soybeans. Stephens County has been organized into 75 districts with the houses within a district numbered consecutively starting with 1. For the purposes of these exercises, you may assume that houses in the same district with close numbers are physically close. The Stephens County Cablevision Company has been formed to provide cable TV service to Stephens County. It has commissioned this survey to help it with its pricing and programming decisions. 
DISTRICT MAP OF STEPHENS COUNTY

SCALE I---------I 5 miles

I---------I---------I---------I---------I---------I---------I
I         I         I         I         I         I         I
I    1    I    2    I    3    I    4    I    5    I    6    I
I         I         I         I         I         I         I
I   44    I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I
I         I         I         I         I         I         I
I    7    I    8    I    9    I   10    I   11    I   12    I
I         I         I         I         I         I         I
I         I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I
I         I         I  51 52 53 54 55   I         I   45    I
I   13    I   14    I                   I   15    I         I
I         I         I  56 57 58 59 60   I         I   16    I
I         I         I                   I         I         I
I---------I---------I  61 62 63 64 65   I---------I---------I
I         I         I                   I         I         I
I   17    I   18    I  66 67 68 69 70   I   19    I   20    I
I         I         I                   I         I         I
I         I         I  71 72 73 74 75   I         I         I
I---------I---------I---------I---------I---------I---------I
I         I         I         I         I         I         I
I   21    I   22    I   23    I   24    I   25    I   26    I
I         I         I         I         I         I         I
I         I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I
I         I         I         I         I         I         I
I   27    I   28    I   29    I   30    I   31    I   32    I
I         I         I         I         I         I         I
I         I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I
I   46    I         I         I         I         I         I
I         I   34    I   35    I   36    I   37    I   38    I
I   33    I         I         I         I         I         I
I         I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I
I         I         I         I  47 48  I         I         I
I   39    I   40    I   41    I         I   42    I   43    I
I         I         I         I  49 50  I         I         I
I         I         I         I         I         I         I
I---------I---------I---------I---------I---------I---------I

INCORPORATED MUNICIPALITIES:
   LOCKHART CITY - 51 TO 75
   EAVESVILLE    - 47 TO 50
   WELDON        - 45
   VILLEGAS      - 44
   ROUTLEDGE     - 46

STEPHENS COUNTY DISTRICT INFORMATION

Column 1: District number
Column 2: Number of houses
Column 3: Cumulative house count
Column 4: Population
Column 5: Mean assessed house valuation

 (1)     (2)     (3)     (4)       (5)
   1     142     142     526    65248.
   2     153     295     624    58759.
   3     135     430     508    62319.
   4     128     558     560    59416.
   5     110     668     455    57202.
   6     103     771     404    59290.
   7     105     876     421    71122.
   8     385    1261    1488    79265.
   9     296    1557    1112    75921.
  10     287    1844     994    68254.
  11     253    2097     929    60660.
  12     172    2269     628    53569.
  13     198    2467     768    65182.
  14     432    2899    1595    77907.
  15     248    3147     864    65739.
  16     251    3398     915    53771.
  17     221    3619     864    68257.
  18     297    3916    1099    78449.
  19     235    4151     812    70772.
  20     171    4322     687    52711.
  21     135    4457     525    66739.
  22     254    4711     923    66249.
  23     203    4914     708    74757.
  24     244    5158     825    75766.
  25     202    5360     799    68989.
  26     103    5463     388    56994.
  27     102    5565     398    58940.
  28     115    5680     448    60448.
  29     180    5860     693    69111.
  30     190    6050     766    69685.
  31     152    6202     633    70276.
  32     141    6343     572    63819.
  33     143    6486     610    58636.
  34     135    6621     491    55554.
  35     178    6799     699    62361.
  36     221    7020     811    60052.
  37     174    7194     719    55699.
  38     101    7295     390    53322.
  39      95    7390     312    57174.
  40     130    7520     446    55702.
  41     152    7672     533    53285.
  42     169    7841     672    56866.
  43      91    7932     371    50710.
  44     283    8215    1029    60057.
  45     562    8777    2079    57233.
  46     312    9089    1149    52719.
  47     897    9986    3263    62034.
  48     734   10720    2623    60764.
  49     963   11683    3490    60010.
  50     642   12325    2318    54498.
  51     525   12850    1825    95123.
  52     726   13576    2497    68406.
  53     674   14250    1948    53634.
  54     585   14835    1219    48643.
  55     553   15388    1090    43493.
  56     583   15971    1977    95110.
  57     911   16882    2691    84394.
  58    1051   17933    2663    57657.
  59     918   18851    1824    36706.
  60     799   19650    1636    44308.
  61     545   20195    1853   101906.
  62     895   21090    2588    74815.
  63    1313   22403    2642    55560.
  64     968   23371    2457    62813.
  65     717   24088    2203    69846.
  66     651   24739    2197    93771.
  67     886   25625    2711    82902.
  68     912   26537    2750    76832.
  69     898   27435    2671    72062.
  70     759   28194    2650    79887.
  71     722   28916    2568    87383.
  72     753   29669    2652    80341.
  73     793   30462    2763    79833.
  74     725   31187    2560    83354.
  75     802   31989    2870    80522.

  (1)       (2)      (4)      (5)
 1-43      7932    29985    65511   RURAL
44-46      1157     4257    56706   VILLEGAS, WELDON, ROUTLEDGE
 1-46      9089    34242    64390   RURAL
47-50      3236    11694    59649   EAVESVILLE
51-75     19664    57505    71117   LOCKHART CITY
 1-75     31989   103441    68045   STEPHENS COUNTY

The house counts (and cumulative house counts) will be correct on any computer. The population and average house assessed values will be correct for programs run on the Math. Department Sun system, but will be slightly inaccurate for other computers.

THE INTERVIEW QUESTIONNAIRE:

The Stephens County Cablevision Company has supplied an interview questionnaire for your use:

Stephens County Cablevision Inc.
Lockhart City

1. How many persons aged 12 and older live at this address?
2. How many persons aged 11 and younger live at this address?
3. How many television sets do you have?
4. (Interviewer: Ask questions 4a,b,c,d,e and record the highest price--$0, $5, $10, $15, $20, or $25--respondent is willing to pay.)
   a. If Cable TV cost $5 per month, would you subscribe?
   b. If Cable TV cost $10 per month, would you subscribe?
   c. If Cable TV cost $15 per month, would you subscribe?
   d. If Cable TV cost $20 per month, would you subscribe?
   e. If Cable TV cost $25 per month, would you subscribe?
5. How many hours per week does your household watch TV?

Estimate the number of hours in question 5 that are spent watching each of the following types of programming:

6. News and "public affairs"
7. Sports
8. Children's programming
9. Movies

In addition, for each surveyed household, the Company has obtained from the county tax assessor the assessed valuation of that household's living quarters. This information is meant to provide a measure of family income (without having to ask about it).

The SURVEY program was written in 1983. Clearly it has been dated by inflation and the addition of other types of optional cable channels (e.g. MTV).

RUNNING THE SURVEY PROGRAM:

SURVEY is available in FORTRAN source code. It will hopefully compile with most FORTRAN compilers. It has been successfully compiled on the MacIntosh, IBM PC (running OS/2), and a variety of mainframes.

SURVEY first asks you to enter the desired nonresponse rates. For now, we're assuming that everyone in Stephens County is always at home and cooperative, so enter

   0 0 0

Then, when asked, enter the address of each household to be questioned in the form

   district number, house number

SURVEY responds "DONE" to each correctly entered address. When you have finished your list of houses enter zero for the district number followed by any house number. SURVEY places the answers that each house gives in the file you have specified.
You may then edit or print the file from within the word processor. You will also need to use a statistical package (MINITAB is recommended) with macro capabilities. On request the instructor will put an operating version of SURVEY onto the University Prime computer system or the Mathematics Department SUN system. The Primes have MINITAB available, but users of the SUN system will have to use S.

SAMPLE RUN (IBM PC under OS/2):

DEMONSTRATION EDUCATIONAL SAMPLE SURVEY PROGRAM
TED CHANG, UNIVERSITY OF KANSAS, SEPT 1986
ENTER FILENAME CONTAINING ADDRESSES--8 OR FEWER LETTERS
IF ENTERING FROM TERMINAL, TYPE T
t
ENTER FILENAME FOR OUTPUT--8 OR FEWER LETTERS
myoutput
ENTER DESIRED THREE NONRESPONSE RATES:
NOT-AT-HOMES, REFUSALS, RANDOM ANSWERS
0 0 0
ENTER DISTRICT NUMBER, HOUSE NUMBER
23,45
DONE
ENTER DISTRICT NUMBER, HOUSE NUMBER
22,96
DONE
ENTER DISTRICT NUMBER, HOUSE NUMBER
53,47
DONE
ENTER DISTRICT NUMBER, HOUSE NUMBER
583,22
DISTRICT NUMBERS MUST BE BETWEEN -75 AND 75.
RE-ENTER DISTRICT NUMBER, HOUSE NUMBER
SET DISTRICT NUMBER = 0 TO STOP PROGRAM.
ENTER DISTRICT NUMBER, HOUSE NUMBER
55,9999
IN DISTRICT 55 HOUSE NUMBERS MUST BE BETWEEN 1 AND 553
RE-ENTER DISTRICT NUMBER, HOUSE NUMBER
SET DISTRICT NUMBER = 0 TO STOP PROGRAM.
ENTER DISTRICT NUMBER, HOUSE NUMBER
0,0
THE COST OF THIS SESSION IS 205 DOLLARS.

This is what the file myoutput looks like:

 ADDRESS    VALUE   1   2   3   4    5    6    7    8    9
  23   45   62673   2   1   3  15  130   11   28    7   20
  22   96   85553   2   2   1  20   86   10   29    7   32
  53   47   83183   2   0   3  10   52    4   39    0    8
THE COST OF THIS SESSION IS 205 DOLLARS.

VALUE = house value, and the numbers in columns labelled 1 through 9 are the household's responses to questions 1 through 9.

A sample MINITAB session using the output from the SURVEY program is included. Notice that the first and last lines of the output file 'myoutput' must be stripped before it can be processed in MINITAB.

SAMPLE MINITAB RUN:

MINITAB RELEASE 6.2 *** Macintosh VERSION
COPYRIGHT 1987,1990 by Minitab, Inc.
- all rights reserved
MINITAB is a registered trademark of Minitab, Inc.
Macintosh is a registered trademark of Apple Computer, Inc.
U.S. FEDERAL GOVERNMENT USERS SEE HELP FGU
DEC. 4, 1990 *** STORAGE AVAILABLE 38000

MTB > help
You are in Minitab (Macintosh Version). Minitab is a general purpose
statistics package. There is a HELP facility that helps you learn about
Minitab. To see how it works, type: HELP HELP
To leave Minitab, type: STOP

MTB > READ 'myoutput' c1-c12
      3 ROWS READ

 ROW  C1  C2     C3  C4  C5  C6  C7   C8  C9  C10  C11  C12
   1  23  45  62673   2   1   3  15  130  11   28    7   20
   2  22  96  85553   2   2   1  20   86  10   29    7   32
   3  53  47  83183   2   0   3  10   52   4   39    0    8

MTB > mean c3 k1
   MEAN    = 77136
MTB > stan c3 k2
   ST.DEV. = 12582
MTB > let k3=sqrt(3.)*k1/k2
MTB > print k1-k3
K1    77136.3
K2    12581.5
K3    10.6191
MTB > let c13=c4+c5
MTB > name c3 'value' c4 'adult' c5 'child' c6 'TVs' c7 'price'
MTB > print c3-c7 c13

 ROW  value  adult  child  TVs  price  C13
   1  62673      2      1    3     15    3
   2  85553      2      2    1     20    4
   3  83183      2      0    3     10    2

MTB > stop

SURVEY PROGRAM ASSUMPTIONS:

To make as realistic a simulation as possible, certain assumptions have been programmed into SURVEY. These assumptions should be used in efficient design.

Some of the assumptions of SURVEY are quite obvious. For example each (occupied) address has at least one adult and anyone who does not have a TV will not be willing to subscribe to cable service. Some of the other SURVEY program assumptions are:

1. All other factors being equal, a household with a higher income will tend to have a more expensive house.
2. Assessed valuation is a reasonably accurate estimate of house price.
3. All other factors being equal, a household with a higher income will tend to be willing to pay more for cable service.
4. All other factors being equal, a household with a higher income will tend to own more television sets. This tendency is much weaker than that of assumption 3 because of the low cost and longevity of most TV sets.
In addition cable service involves a monthly (as opposed to one time) payment for a service much of which can be obtained at low cost by using an antenna.
5. Due to zoning and development practices, urban neighborhoods tend to be more homogeneous than rural neighborhoods.
6. Within a neighborhood, larger families will tend to have larger houses and these houses will tend to have higher assessed values.
7. Larger families tend to watch more TV (not per person, but in total), have more TV's, and be more willing to subscribe to cable TV.
8. All other factors being equal, a family's willingness to subscribe to cable TV decreases as the other entertainment options available to it increase. These options decrease the further one moves from the population concentrations in Stephens County.

PROBLEMS

In this and in all future simulations, your writeup must include sufficient details so that I can follow your calculations without reading your computer output. In particular, all formulae used must be clearly identified and a reasonable collection of intermediate values must be given. The computer output should be appended and be sufficiently well annotated so that I can easily find the proper place to check more completely any particular computation. Efficient grading requires that IF I CAN'T QUICKLY DECIPHER WHAT YOU DID, THE PROBLEM WILL BE ASSUMED TO BE INCORRECT.

1. Why is the following procedure not suitable for drawing a simple random sample of addresses in Lockhart City?
   a. Randomly select a district between 51 and 75.
   b. Randomly select a house from those in the chosen district.
   c. Reject both district and house selection if the house is already in the sample.
   d. Repeat a - c until the desired sample size is achieved.

2. Notice that no district in Lockhart City has more than 1313 houses. Prove that the following procedure does produce a simple random sample of houses in Lockhart City:
   a. Randomly select a district between 51 and 75.
   b.
Randomly select a number (the potential house selection) between 1 and 1313.
   c. Reject both district and house selection if the house number exceeds the number of houses in the district or if the house is already in the sample.
   d. Repeat a - c until the desired sample size is achieved.

3. Let N=6, n=3. For purposes of studying sampling distributions, we assume that all population values are known:

      y1 = 98    y2 = 102   y3 = 154
      y4 = 133   y5 = 190   y6 = 175

   We are interested in mu, the population mean. What is the value of mu? For each of the following sampling plans, find (i) E[ybar]; (ii) V[ybar]; (iii) Bias(ybar); (iv) MSE(ybar). Which sampling plan do you think is the best? Why?

   Plan 1. Eight possible samples

      Sample   Units     Prob.
        1      (1,3,5)   1/8
        2      (1,3,6)   1/8
        3      (1,4,5)   1/8
        4      (1,4,6)   1/8
        5      (2,3,5)   1/8
        6      (2,3,6)   1/8
        7      (2,4,5)   1/8
        8      (2,4,6)   1/8

   Plan 2. Three possible samples

      Sample   Units     Prob.
        1      (1,4,6)   1/4
        2      (2,3,6)   1/2
        3      (1,3,5)   1/4

4. Select a simple random sample of size 10 from Lockhart City. Use any random number table. Hand in a list of the random numbers you selected and the addresses to which they correspond. Describe exactly how you converted a random number to an address.

5. Use the SURVEY program to obtain the answers to the questionnaire for your 10 randomly selected addresses. Hand in a printout of the output file. Estimate the following from your sample of 10 households. Give standard errors for your estimates:
   a. The average number of TVs per household in Lockhart City.
   b. The average price a household in Lockhart City is willing to pay for cable TV service. Actually we only know for each sampled household the price it is willing to pay for service rounded down to the nearest $5. Recognizing this limitation to question 4 of the survey questionnaire, use the answers to that question as the prices that the sampled houses are willing to pay.
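
The four quantities requested in problem 3 follow directly from the definition of a sampling plan as a list of possible samples with their selection probabilities. The Python sketch below is not part of the SURVEY distribution; the function name design_summary and the toy population are hypothetical and chosen only for illustration, and the plans of problem 3 are deliberately not worked. Exact fractions are used so that no rounding enters the comparison of plans.

```python
from fractions import Fraction


def design_summary(y, design):
    """Return (mu, E[ybar], V[ybar], Bias(ybar), MSE(ybar)) for a plan.

    y      -- list of population values, assumed fully known
    design -- list of (units, prob) pairs; units are 1-based labels into y
    """
    mu = Fraction(sum(y), len(y))
    # Sample mean of every possible sample, paired with its probability.
    means = [
        (Fraction(sum(y[u - 1] for u in units), len(units)), Fraction(p))
        for units, p in design
    ]
    if sum(p for _, p in means) != 1:
        raise ValueError("selection probabilities must sum to 1")
    expect = sum(m * p for m, p in means)
    var = sum((m - expect) ** 2 * p for m, p in means)
    bias = expect - mu
    mse = var + bias ** 2          # MSE = variance + squared bias
    return mu, expect, var, bias, mse


# Hypothetical toy population with N=4 and n=2 (not the exercise data).
toy_y = [1, 2, 3, 4]
plan_a = [((1, 2), Fraction(1, 2)), ((3, 4), Fraction(1, 2))]
plan_b = [((1, 2), Fraction(3, 4)), ((3, 4), Fraction(1, 4))]

for name, plan in (("A", plan_a), ("B", plan_b)):
    print(name, [str(v) for v in design_summary(toy_y, plan)])
```

In this toy case plan A is unbiased and plan B is biased, yet both have the same MSE, which is the kind of trade-off problem 3 asks you to weigh.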
Assignment 2 RATIO ESTIMATION IN SIMPLE RANDOM SAMPLING COMPUTER GENERATION OF RANDOM ADDRESSES: For any sampling scheme to work effectively, the units must be randomly selected. This is a laborious process and many sample surveys are ruined by attempts to short-cut it. As we shall be selecting thousands of addresses for the simulation studies, a program called ADDGEN has been developed to randomly select addresses from any specified set of districts. ADDGEN will ask the user for a random start. This is any integer between 1 and 1000000 which the program uses as a start point in a long table of random numbers. Given the same start, districts, sample size, and type of computer, ADDGEN will always produce the same sample of addresses. It is extremely important that you record the start in order to repeat a particular sample for further analysis in future assignments. The random start is written on the last line of the output file from ADDGEN. The program then asks for the districts from which you wish to sample. Any subset of the districts 1 to 75 can be specified. You simply enter the desired district numbers along a line separated by commas. If you want some consecutive districts you only need to type the first and last district numbers separated by a - (dash symbol). If you need to continue your list onto a new line simply end the previous line with a $ (dollar symbol), press return, and continue on the next line. Finally the program asks for the number of addresses to be selected from the specified districts. The program ADDGEN generates an output file named by you in a format suitable for input into the survey program. When running SURVEY, you merely type in the name of the file you created using ADDGEN. The program ADDGEN was written by C. Graham McLaren, a graduate student at Simon Fraser University in British Columbia. SAMPLE RUN: The following is a journal of a sample run which was made using the above procedure. 
The program ADDGEN was used to create a random sample of size 5 from districts 1-49,60,70. The output file "address" from ADDGEN can be fed to SURVEY.

ENTER FILENAME FOR ADDRESS SET--8 OR FEWER LETTERS
address
ENTER RANDOM START--ANY INTEGER BETWEEN 1 AND 1000000
219654
ENTER DISTRICTS FROM WHICH YOU WISH TO SAMPLE
1-49,60,70
51 DISTRICTS WITH 13241 HOUSEHOLDS HAVE BEEN SPECIFIED
ENTER NUMBER OF ADDRESSES TO BE GENERATED (MAX 1000)
5
DO YOU WANT TO SPECIFY A NEW DISTRICT SET
ANSWER YES OR NO
no
5 RANDOM ADDRESSES GENERATED WITH RANDOM START 219654

Below are the contents of the file 'address':

   4   67
   8  246
  18   94
  18  191
  24  244
   0    0
219654

PROBLEMS

1. Use the program ADDGEN to generate 200 random addresses in Lockhart City and then the program SURVEY to obtain the responses of these houses. Using a sample mean estimate, estimate
   a. The average price a household is willing to pay for cable TV.
   b. The average number of TV's in a household in Lockhart City.
   Be sure to give standard errors for all estimates. (Use the fpc, even though it may not be strictly necessary.)

2. Using the same sample of size 200, repeat problem 1 using a ratio to house assessed value estimate. Given two estimates ybar1 and ybar2 of YBAR, the relative precision of ybar1 to ybar2 is Var(ybar2)/Var(ybar1). Equivalently, if sample sizes n1 and n2 are required to achieve the same standard error using ybar1 and ybar2, respectively, the relative precision is n2/n1. Estimate from your sample the relative precision of the ratio to the sample mean estimate for both the average number of TV's in a household in Lockhart City and the average price a household is willing to pay for cable TV. How are your results related to the SURVEY program assumptions?

3. Estimate using your sample of size 200 (with standard errors):
   a. The proportion of houses willing to pay $10 for cable service. This really means, of course, at least $10.
   b.
The average number of hours spent per week watching TV in households willing to pay $10 for cable service.
   c. The total number of adults in households willing to pay $10 for cable service.

4. Using your sample of size 200, estimate the average assessed valuation in Lockhart City. Does a 95% confidence interval include the known value of $71117? This approach, of estimating from a sample a known quantity, is often used to check the representativeness of a sample.

5. Draw a histogram or stem-and-leaf diagram of the responses in your sample to question 8 (number of hours watching children's TV). Does the distribution of number of hours spent watching children's TV for households in Lockhart City appear normal? Find an approximate 95% confidence interval for the mean number of hours spent watching children's TV. Based on your histogram, is constructing a confidence interval an appropriate thing to do? Why or why not? (Hint: do you think that the sampling distribution of the mean viewing time for children's TV could be normal?)

Assignment 3

STRATIFICATION

The objectives of stratification are to control the error in estimation by ensuring that samples are representative of the population; to ease administration of a survey by partitioning the task; and to provide separate and independent estimates in different parts of the population.

The theory indicates that we will be successful in the first objective if our strata differ from each other but have units with little variation within each stratum. This observation leads to the idea of using knowledge of our population to group similar units together to form strata. In our case we are rather fortunate to have extensive knowledge of the characteristics of Stephens County in the tables and district map of Assignment 1.

PROBLEMS

We are interested in estimating the average price a household in Stephens County is willing to pay for cable TV service.
Because the cable TV company may not necessarily extend service to all parts of the county, separate and independent estimates are also desired for Lockhart City and Eavesville.

1. Is it reasonable to believe that the information in Assignment 1 can be used to stratify Stephens County in order to improve the precision of our estimates? Why? Give any other reasons for stratification. Are these relevant to Stephens County?

2. Use any considerations you like to divide Stephens County into strata. Your stratification should divide Lockhart City into approximately five strata. Shade your strata on a map of the county. Why did you choose them this way? Count the total number of households in each of your strata. (You may use the program ADDGEN to do this.)

The remainder of these exercises concern Lockhart City ONLY.

3. Using ADDGEN generate a stratified random sample of size 200 from Lockhart City with your stratification of part 2 above and proportional allocation. Find the responses using the program SURVEY. Estimate the average price a household in Lockhart City is willing to pay for cable service and the average number of TV's per household in Lockhart City. How do these estimates compare with those obtained in Assignment 2 with simple random sampling and sample mean and ratio estimates?

General theory indicates that to obtain estimates with minimum variance for a fixed cost, or equivalently to minimize cost for a specified variance, the sample allocation should be proportional to

   v(h) = N(h)S(h)/sqrt[c(h)],

where N(h) is the size, S(h) the standard deviation, and c(h) the unit sampling cost in stratum h.

4. Pilot studies are often used to estimate S(h). In this case we are fortunate to have a very large pilot study from Assignment 2. Divide your sample from Assignment 2 into the strata you chose in part 2 and thus obtain estimates of the variances S(h)^2 in each of the strata for the average price a household is willing to pay for cable TV service.
MINITAB can be very helpful here.

The sampling costs in Stephens County are given below:

SAMPLING COSTS IN STEPHENS COUNTY

$60 per rural district visited (1-46)
$20 per urban district visited (47-50)
$6 per rural household visited (whether home or not)
$3 per urban household visited (whether home or not)
$10 processing cost per completed interview

At the present time, all households will be home to be interviewed. We will study nonresponse at a later time.

As an example of the above costs, if the addresses visited and interviewed were 3-47, 3-25, 5-16, 51-25, and 51-36, the sampling cost printed at the end of the output from the program SURVEY would be

2*60 + 1*20 + 3*6 + 2*3 + 5*10 = $214

5. Using your estimates of S(h), optimally allocate a sample of size 200 to estimate the average price a household in Lockhart City is willing to pay for cable TV service. Using that allocation take a stratified random sample of Lockhart City and estimate the average price a household is willing to pay for cable TV service and the average number of TV's per household.

6. Under what conditions can optimal allocation be expected to perform much better than proportional allocation? Do these conditions occur in Lockhart City? Comment on the relative performance that you observed between these two allocations.

7. Using the variances estimated in assignment 2, what size sample would be needed with simple random sampling to achieve the same precision in estimating the average price a household is willing to pay as a stratified sample of size 200 using the strata you have designed and optimal allocation? With proportional allocation?

8. Are there any deficiencies in your design? How would you correct them if you were to do this exercise a second time?
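The cost arithmetic and the allocation rule v(h) = N(h)S(h)/sqrt[c(h)] used in Assignment 3 can be sketched in a few lines of Python. This is only an illustration, not part of the SURVEY or ADDGEN programs: the function names are hypothetical, the rule that districts numbered 47 and above are charged as urban is an assumption chosen to agree with the $214 example, and the allocation is returned unrounded so that the rounding choice is left to the reader.

```python
import math

# Cost figures are taken from the sampling cost table in Assignment 3.
RURAL_DISTRICT_COST = 60
URBAN_DISTRICT_COST = 20
RURAL_VISIT_COST = 6
URBAN_VISIT_COST = 3
INTERVIEW_COST = 10
# ASSUMPTION for this sketch: districts 1-46 are rural, 47 and above urban.
FIRST_URBAN_DISTRICT = 47


def sampling_cost(addresses, completed=None):
    """Cost of visiting addresses written as 'district-house' strings."""
    districts = [int(a.split("-")[0]) for a in addresses]
    rural = [d for d in districts if d < FIRST_URBAN_DISTRICT]
    urban = [d for d in districts if d >= FIRST_URBAN_DISTRICT]
    if completed is None:
        # With no nonresponse, every visit yields a completed interview.
        completed = len(addresses)
    return (len(set(rural)) * RURAL_DISTRICT_COST
            + len(set(urban)) * URBAN_DISTRICT_COST
            + len(rural) * RURAL_VISIT_COST
            + len(urban) * URBAN_VISIT_COST
            + completed * INTERVIEW_COST)


def optimal_allocation(n, sizes, sds, costs):
    """Split n in proportion to v(h) = N(h) S(h) / sqrt(c(h)), unrounded."""
    v = [N * S / math.sqrt(c) for N, S, c in zip(sizes, sds, costs)]
    total = sum(v)
    return [n * vh / total for vh in v]


# The worked example from the text: two rural districts, one urban district.
print(sampling_cost(["3-47", "3-25", "5-16", "51-25", "51-36"]))
```

When every stratum has the same unit cost c(h), the cost terms cancel and the allocation is simply proportional to N(h)S(h).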

Assignment 4

MORE ON STRATIFICATION

General theory indicates that stratification is most effective when division into strata divides the population into groups which are internally relatively homogeneous with respect to the characteristic being measured, but which differ as much as possible from each other. Based upon the SURVEY program assumptions discussed in Assignment 1, a stratification of Lockhart City based on district average house valuation should improve our estimates of the average price a household is willing to pay for Cable TV. Consider the following design for a survey of 200 households:

DESIGN I: AN EFFICIENT STRATIFICATION OF LOCKHART CITY BASED UPON DISTRICT AVERAGE HOUSE VALUE--PROPORTIONAL ALLOCATION

Districts            Number of Houses   Sample Size   Range of Average House Values
51,56,61,66                2304              24        93771-101906
57,67,70-75                6351              65        79833-87383
52,62,65,68,69             4148              42        68406-76832
53,58,63,64                4006              40        53634-62813
54,55,59,60                2855              29        36706-48643

We will also consider two other designs: a stratification of Lockhart City based upon district population and simple random sampling.

DESIGN II: AN INEFFICIENT STRATIFICATION OF LOCKHART CITY BASED UPON DISTRICT POPULATION--PROPORTIONAL ALLOCATION

Districts            Number of Houses   Sample Size   Range of District Populations
54,55                      1138              12        1090-1219
51,53,56,59-61             4044              41        1636-1977
52,64-66                   3062              31        2197-2497
58,62,63,69-72,74          7116              72        2560-2671
57,67,68,73,75             4304              44        2691-2870

DESIGN III: SIMPLE RANDOM SAMPLE OF LOCKHART CITY, SIZE 200

There is no reason to believe that district population would have any direct relationship to the price a family is willing to pay for Cable TV service and the design of the SURVEY program is consistent with this reasoning. We expect therefore that DESIGN II should not work as well as DESIGN I.
The SURVEY program does assume a relationship between family size and willingness to subscribe to Cable TV and a stratification of Lockhart City based upon district average family size will be somewhere between DESIGN I and DESIGN II in efficiency.

General theory indicates that for large populations any stratification with proportional allocation will be more efficient than simple random sampling and so we expect both DESIGN I and DESIGN II to perform better than DESIGN III.

50 samples using each of these three designs were run and the results are summarized below:

Design I:    Average of the sample means:         9.8859
             Average of the estimated variances:   .12055
Design II:   Average of the sample means:         9.9089
             Average of the estimated variances:   .19652
Design III:  Average of the sample means:         9.8970
             Average of the estimated variances:   .21702

PROBLEMS

1. Based upon the above results what size sample using Design II is needed to achieve the same precision as a sample of size 200 and Design I? What size sample with simple random sampling achieves the same precision as a sample of size 200 and Design I?

2. Take a simple random sample of size 200 from Lockhart City and post stratify it according to house valuation using the following information obtained from the County Tax Assessor:

DISTRIBUTION OF RESIDENTIAL ASSESSED VALUATIONS
LOCKHART CITY:

House Value          Number
0 to 39999             1007
40000-49999            1835
50000-59999            2768
60000-69999            2565
70000-79999            4546
80000-89999            4449
90000-99999            1923
100000 and above        571
TOTAL                 19664

Estimate the average price a household is willing to pay for Cable TV and the average number of TV's per household.

3. We know the distribution of the number of hours spent watching TV for the U.S. as a whole, given in the table below.
DISTRIBUTION OF TV VIEWING TIME--U.S.:

Number of hours spent watching TV   Proportion of households
Less than 25                                 23.4%
25-50                                        22.6%
50-100                                       36.4%
More than 100                                17.6%

Use this information to post-stratify your SRS from problem 2, and compare and critically evaluate your estimates for problems 2 and 3.

4. We have studied 3 ways to use house value to improve the precision of our estimate of the average price a household is willing to pay for Cable TV:
a. We can use it as the 'X-variable' in a ratio estimate.
b. We can use district averages to stratify Lockhart City.
c. We can use it for post stratification.
After correcting if necessary your calculations in Assignment 2, compare these three methods with simple random sampling. Comment on their relative performance in improving an estimate of the average price a household is willing to pay for Cable TV. Comment also on how 4a and 4c compare to simple random sampling and the sample mean to estimate the average number of TV's. What would you surmise would happen if DESIGN I were used to estimate the average number of TV's? How are the results of our experiments related to the SURVEY program assumptions? Discuss also the types of situations in which you expect each of 4a, 4b, and 4c to be preferable to the other two.

5. Stratify the rural areas (districts 1 to 43) to estimate the average price a household is willing to pay for Cable TV. Using proportional allocation, allocate a sample of size 200 among your strata. Take your sample and estimate the average price a rural household is willing to pay for Cable TV. Using the method discussed in Cochran, Section 5A.11, estimate the variance you would have achieved if simple random sampling were used in Districts 1 to 43. Compare the gain in precision from stratification for Lockhart City and for the rural areas. Examine carefully your samples in Lockhart City and the rural areas and see if you can account for the differences, if any.
Consider also if the SURVEY program assumptions might explain such differences.

Assignment 5

TWO STAGE CLUSTER DESIGNS

In stratification, we design the strata to be homogeneous within strata, but heterogeneous between strata. We then sample in each stratum with the aim of improving on the precision of simple random sampling by ensuring a representative sample. In cluster sampling, we try to make the clusters homogeneous between clusters, but with each cluster roughly representative of the population with respect to the variables being measured. When the cost of visiting each cluster is high (such as when travel costs are significant), we try to reduce costs by limiting our sample to a small number of clusters. Although for a fixed sample size cluster sampling is usually less precise than simple random sampling, we hope that by controlling our sampling costs we can increase the sample size to more than compensate for the lower precision.

The sampling costs in Stephens County are as follows:

$60 for each rural district (1-46) visited
$20 for each urban district (51-75) visited
$6 for each rural household visited
$3 for each urban household visited
$10 for each completed interview

We consider in this assignment three possible two stage designs:

DESIGN I--Clusters chosen with equal probability, unbiased estimate
DESIGN II--Clusters chosen with equal probability, ratio to cluster size estimate
DESIGN III--Clusters chosen with probability proportional to size and with replacement, unbiased estimate

A design is said to be "self-weighting" if each subunit (in this case household) has an equal chance of being chosen in the sample. This translates into each subunit being equally weighted in the final estimate. The first two designs above are self-weighted if the size of the subsample in each selected cluster is proportional to the size of the cluster. The third design is self-weighted if the size of the subsamples is constant from cluster to cluster.
We will use self-weighting forms of the designs above.

APPROXIMATING FORMULAE:

We use the following notation:

N = number of districts
n = number of districts in the sample
f1 = n/N, the first stage sampling fraction
Mbar = average number of households per district
mbar = average number of sampled households per cluster
M(i) = number of households in district i
Y(i) = sum of all values in district i
Ybar(i) = Y(i)/M(i), the mean per household in district i
S(2,i)^2 = variance of the values in district i
Y = sum of all values in the population
Ybar = Y/N
Ydblbar = Y/(NMbar)

Consider DESIGN I, II, and III in their self weighting form and assume the second stage fpc's are negligible. Then the formulae for the true MSE (not the sample estimate) of the estimator of the population mean per subunit reduce to:

DESIGN I:    MSE = ((1-f1)/n) B1 + W/(nmbar)
DESIGN II:   MSE = ((1-f1)/n) B2 + W/(nmbar)
DESIGN III:  MSE = B3/n + W/(nmbar)

where the between cluster variances are:

B1 = SUM (Y(i)-Ybar)^2/((N-1)(Mbar^2))
B2 = SUM (Y(i)-M(i)Ydblbar)^2/((N-1)(Mbar^2))
B3 = SUM M(i)(Ybar(i)-Ydblbar)^2/(NMbar)

and the within cluster variance is:

W = SUM M(i)S(2,i)^2/(NMbar).

Here SUM represents a sum over all the districts in the population. These formulas give the true variance. They should not be used to estimate that variance from the sample.

PROBLEMS

We are interested in seeing if a cluster design can improve on the precision of a simple random sample of size 100 from the rural areas of Stephens County while keeping the same cost. To do this we need to know the cost of sampling 100 houses randomly in districts 1-43. Fortunately we can generate ten samples from districts 1-43 using ADDGEN and then, without actually incurring any sampling costs, we can calculate the cost of executing each of those ten samples. Using this procedure, I have been able to show that a simple random sample of size 100 from districts 1-43 would visit an average of 37.5 districts and cost an average of $3850.

1.
Design a two stage cluster sampling scheme for the rural areas (districts 1-43) of Stephens County which chooses between 25% and 50% of the districts (clusters) with equal probability, subsamples within each chosen district with sample size proportional to district size (number of houses), and costs about $3850. Execute the sample and estimate the average price a rural household is willing to pay for cable TV using both an unbiased estimate and a ratio to district size estimate. Be sure to give standard errors.

2. Correcting, if necessary, your answer to part 5 of assignment 4 and making any necessary modifications, what would have been the standard error of a simple random sample of size 100 from districts 1-43? The correct answer is about 70 cents.

3. Using the same number of clusters as you did for exercise 1, design a self-weighted sampling scheme along the lines of DESIGN III for districts 1-43 which costs about $3850. The method of choosing districts given by Lahiri (Cochran page 251) would probably be easiest. In your writeup include enough details on how you chose the districts to be sampled so that I can check if you did it correctly. Estimate the average price a rural household is willing to pay for cable TV with standard error.

For future reference in studying a similar population, it is often useful to consider what would have occurred if a different sampling technique had been adopted. We used, for example, in assignment 4 a stratified sample of districts 1-43 and estimated from it the precision which would have occurred if we had adopted simple random sampling.

4. Estimate from your samples the within cluster variance W and the between cluster variances B1, B2, B3 for the rural areas (1-43). Finally using your estimates of B1, B2, B3, and W, find the approximate MSE of designs with a variety of values for n and m. What is the optimal average second stage sample per cluster?
What cluster designs, if any, are an improvement over simple random sampling when a fixed budget of $3850 is adopted?

5. I have been able to determine that a simple random sample of size 200 from districts 1-43 would visit about 42 districts and cost about $5720. Are any two stage designs an improvement over simple random sampling to determine the average price a rural household is willing to pay if a budget of $5720 is adopted for sampling? Use the results of problem 4; you do not need to take a new sample. What happens to an optimally chosen cluster design as the budget gets larger and larger?

6. Comment on the relative performance of DESIGNs I, II, and III. How does your experience with these estimators compare to the theoretical discussion of their properties given in your textbook and in class?

7. A simple random sample of size 200 from Lockhart City would most likely visit all 25 districts and cost $3100. Using the estimates of B1, B2, B3, and W given in class for Lockhart City, design a two stage sampling scheme for Lockhart City that costs $3100. Does it appear that cluster sampling is cost effective in Lockhart City for the purpose of estimating the average price a household is willing to pay for cable service?

Assignment 6

VARIANCE ESTIMATION FOR COMPLEX SURVEYS

Most surveys taken by the Census Bureau or large survey organizations such as Gallup or Research Triangle Institute are complex, with several stages of clustering as well as stratification. In addition, weights are often used to make the demographic information of the sample agree with information for the population from the U.S. Decennial Census, and to adjust for nonresponse. In such a situation, calculating variances for each response would be extremely tedious, requiring a new computer program to calculate the variance for each response and survey. Since many of the quantities of interest are nonlinear functions of population means (e.g.
mu_y/mu_x), we need a flexible method for calculating variances.

Several general methods for the estimation of variances of complex functions of the population means have been derived, methods that can be applied to almost any theory. These are described in Wolter (1985). One of the simplest is the random groups method, also known as interpenetrating subsampling. In this method, the basic survey design is replicated k times. This may be done by drawing k different samples, or by drawing one sample, and later splitting it into k parts, each part being a miniature version of the basic sampling design.

The jackknife is also useful for estimating the variance of nonlinear quantities. If we are interested in the parameter theta, then we can also estimate theta by dividing the sample into k subgroups, as in the random groups method, but instead of estimating theta in each subgroup separately, we estimate theta using all of the data except that in the jth subgroup, for j = 1,...,k. The estimate obtained by omitting the jth subgroup is theta-hat(j). Let theta-hat(.) be the average of all of the theta-hat(j). Then theta-hat(.) also estimates theta, and the variance of theta-hat, the estimate of theta using all of the data, can be estimated by

(k-1) SUM (theta-hat(j) - theta-hat(.))^2 / k,

where the sum is over j = 1,...,k.

PROBLEMS

For these exercises, draw a simple random sample of size 200 from Lockhart City. You may use the sample from assignment 2 if you wish. We want to estimate R = mu_y/mu_x, the ratio of the price a household is willing to pay for cable TV (Y) to the assessed value of the house (X).

1. Randomly divide your sample into 10 different subsamples, each of size 20. This can be done by creating a new variable SCRAMBLE which has 200 uniform random numbers between 0 and 1. Sort the data by the variable SCRAMBLE; then assign the first 20 observations to group 1, the second 20 to group 2, etc. The group means can be easily calculated by doing a one-way analysis of variance on the data.
Now find the ratio r_i = ybar_i/xbar_i for each group. The r_i are 10 (almost) independent observations; if we use rbar to estimate R, then the estimated variance of rbar is SUM (r_i - rbar)^2/90.

2. Calculate 200 different estimates of R, each using all but one of the 200 data points. One way of doing the calculations is to define two new variables, SUMX and SUMY, to be the sum of all 200 observations for X and Y, respectively. Then the variable YJACK = (SUMY - Y)/199 contains the 200 values of ybar(j), and XJACK = (SUMX - X)/199 contains the 200 values of xbar(j). Use the variables YJACK and XJACK to calculate the jackknife estimate of the variance of r = ybar/xbar.

3. How do your variance estimates from 1 and 2 compare with the usual Taylor series-based estimate of the variance of this ratio?

Alternative assignment 6

COMPLICATED SURVEYS

Most large surveys will involve a combination of the ideas we have discussed: e.g. several layers of stratification on top of several layers of clustering with ratio and poststratified estimates sprinkled throughout. The formulas for estimating errors can become horrendous, if they are derivable, especially if there are several layers of clustering. Cochran, sections 11.17 to 11.21, gives some of the techniques used to handle this problem. We discuss here some simple principles which have fairly wide applicability.

1. Principle of stratification: Let Y(h) be the population total in stratum h and let Yhat(h) be an unbiased sample estimate of Y(h). Then Yhat = SUM[Yhat(h)] unbiasedly estimates Y = SUM[Y(h)] and MSE(Yhat) = SUM[MSE(Yhat(h))]. This principle follows from the independence of the samples in the various strata.

2. Principle of cluster sampling with replacement: Suppose the primary sampling units (first stage clusters) are chosen with probability proportional to z(i) (SUM[z(i)] = 1) and with replacement. Let Yhat(i) be an unbiased estimate of the population total Y(i) in cluster i.
Then an unbiased estimate of the population total Y is Yhat = {SUM[Yhat(i)/z(i)]}/n where n is the number of PSU's chosen and the sum is a sum over the sampled PSU's. The variance of Yhat can be unbiasedly estimated by Vhat(Yhat) = SUM[(Yhat(i)/z(i)-Yhat)^2]/(n(n-1)). In this principle the design of the subsample in each cluster can depend upon this cluster but if the same cluster is chosen more than once, this design must be executed and a separate value of Yhat(i) found for each time the cluster appears in the first stage sample.

3. Principle of random groups: A generalization of principle 2 above is the principle of random groups. In this principle, a design is independently executed (replicated) k times, yielding k random groups. Let Xhat(i) be an estimate of a population total X derived from the data in the ith random group. Then Xhat(1),...,Xhat(k) are independently and identically distributed. It follows that Xhat=SUM[Xhat(i)]/k satisfies E(Xhat)=E(Xhat(i)) and V(Xhat) can be unbiasedly estimated by Vhat(Xhat)=SUM[(Xhat(i)-Xhat)^2]/[k(k-1)]. Usually when random groups are applied k is relatively small, whereas the sample size in each random group is sufficiently large to make a normal distribution assumption for each Xhat(i) reasonable. In this case, one approximates the distribution of (Xhat-E(Xhat))/sqrt(Vhat(Xhat)) with a t-distribution with k-1 degrees of freedom. If each random group represents a cluster which is chosen with replacement and with probability proportional to z, letting Xhat(i) = Yhat(i)/z(i) leads to the conclusion that the principle of random groups generalizes the principle of cluster sampling with replacement.

4. Principle of ratio estimation: Suppose we want to estimate the population ratio R = Y/X. Let Yhat and Xhat be unbiased estimates of Y and X respectively.
Then for "large" sample sizes Rhat = Yhat/Xhat is an approximately unbiased estimate of R and the variance of Rhat can be approximately unbiasedly estimated by taking the formula for the estimated variance of Yhat and replacing each sample value y in that formula by (y - Rx) and dividing the resulting number by X^2 (or Xhat^2 if X is unknown). Examples of this procedure are the combined ratio estimate for stratified designs and the formula for estimating the variance of a ratio in 2-stage cluster designs. The main assumptions are that Xhat and Yhat are linear estimators (namely a linear combination of the sample values), that they come from the same sample, and that they are the same type (that is, the formula for Xhat is obtained from the formula for Yhat by replacing each y with an x). Thus, for example, using a stratified mean for Yhat and a simple random sample mean for Xhat is excluded. Examples of qualifying estimators are the sample mean in a simple random sample, the post-stratified mean in a simple random sample, the stratified mean, and the ppz estimator in (2) above. In the case of the post-stratified mean, the resulting formula for the variance is correct in the 1/n term, but not in the 1/n^2 term.

Although these four principles are only a small proportion of the formulas we discussed in this course, everything else we discussed sheds light on how to design a survey for greatest effectiveness. For example, the equations given above for multistage cluster sampling with replacement give no information about the role of between and within cluster variance in the total variance of the estimator. This information is provided by the formula for the true variance (as opposed to its estimate) of the ppz estimator.
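Principles 2 and 3 above share one computation: average k independent, identically distributed estimates and use their spread about that average to estimate the variance. The following Python sketch shows this under stated assumptions. The function names are hypothetical, the inputs are whatever estimates Yhat(i) and draw probabilities z(i) the reader has already computed, and the numbers in the usage lines are made up purely to exercise the code.

```python
def random_groups(estimates):
    """Principle 3: return (Xhat, Vhat) from k replicate estimates Xhat(i)."""
    k = len(estimates)
    if k < 2:
        raise ValueError("at least two random groups are needed")
    x_hat = sum(estimates) / k
    # Deviations are taken about the combined estimate itself.
    v_hat = sum((x - x_hat) ** 2 for x in estimates) / (k * (k - 1))
    return x_hat, v_hat


def ppz_total(cluster_estimates, z):
    """Principle 2: clusters drawn with replacement with probabilities z(i).

    cluster_estimates[i] is Yhat(i) for the i-th DRAW, so a cluster drawn
    twice appears twice, each time with its own independent subsample.
    """
    return random_groups([y / p for y, p in zip(cluster_estimates, z)])


# Made-up numbers, for illustration only.
total, variance = ppz_total([5.0, 6.0, 7.0], [0.5, 0.5, 0.5])
print(total, variance)
```

Writing ppz_total as a call to random_groups mirrors the remark in principle 3 that setting Xhat(i) = Yhat(i)/z(i) turns cluster sampling with replacement into a special case of random groups.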

PROBLEMS

We are interested in estimating the average price a household is willing to pay for cable TV service in Stephens County and separately in the following four strata:

STRATUM I: Rural areas, districts 1-43
STRATUM II: Three villages, districts 44-46
STRATUM III: Eavesville, districts 47-50
STRATUM IV: Lockhart City, districts 51-75

You have been commissioned to execute this survey for the Stephens County Cablevision Company. Your contracted price is $8000 of which approximately 80% is for the sampling costs including analysis. The remaining 20% is the markup for your expert services.

Design a survey for the cable company and execute it. Use your knowledge of Stephens County to make an efficient design. Write a short report to the company outlining your conclusions. Although I do not expect you to spend an excessive amount of your time analyzing your sample, I think the Company is entitled to more than just a single estimate of average price. Try to include a reasonable selection of other items it might be interested in. For example, the company is probably more interested in profiles of households willing to pay certain preset sums (e.g. $15) as opposed to averages for all households since it is less important to design a package of attractive optional services for households which have no intention of subscribing. For this purpose, it is useful to note that MINITAB allows you to execute the commands in a file you have previously written into. This allows you to efficiently execute the same collection of commands repeatedly.

Second alternative assignment 6

COMPLICATED SURVEYS

Most large surveys will involve a combination of the ideas we have discussed: e.g. several layers of stratification on top of several layers of clustering with ratio and poststratified estimates sprinkled throughout. The formulas for estimating errors can become horrendous, if they are derivable, especially if there are several layers of clustering.
Cochran, sections 11.17 to 11.21, gives some of the techniques used to handle this problem. We discuss here some simple principles which have fairly wide applicability.

1. Principle of stratification: Let Y(h) be the population total in stratum h and let Yhat(h) be an unbiased sample estimate of Y(h). Then Yhat = SUM[Yhat(h)] unbiasedly estimates Y = SUM[Y(h)] and MSE(Yhat) = SUM[MSE(Yhat(h))]. This principle follows from the independence of the samples in the various strata.

2. Principle of cluster sampling with replacement: Suppose the primary sampling units (first stage clusters) are chosen with probability proportional to z(i) (SUM[z(i)] = 1) and with replacement. Let Yhat(i) be an unbiased estimate of the population total Y(i) in cluster i. Then an unbiased estimate of the population total Y is Yhat = {SUM[Yhat(i)/z(i)]}/n where n is the number of PSU's chosen and the sum is a sum over the sampled PSU's. The variance of Yhat can be unbiasedly estimated by Vhat(Yhat) = SUM[(Yhat(i)/z(i)-Yhat)^2]/(n(n-1)). In this principle the design of the subsample in each cluster can depend upon this cluster but if the same cluster is chosen more than once, this design must be executed and a separate value of Yhat(i) found for each time the cluster appears in the first stage sample.

3. Principle of random groups: A generalization of principle 2 above is the principle of random groups. In this principle, a design is independently executed (replicated) k times, yielding k random groups. Let Xhat(i) be an estimate of a population total X derived from the data in the ith random group. Then Xhat(1),...,Xhat(k) are independently and identically distributed. It follows that Xhat=SUM[Xhat(i)]/k satisfies E(Xhat)=E(Xhat(i)) and V(Xhat) can be unbiasedly estimated by Vhat(Xhat)=SUM[(Xhat(i)-Xhat)^2]/[k(k-1)].
Usually when random groups are applied k is relatively small, whereas the sample size in each random group is sufficiently large to make a normal distribution assumption for each Xhat(i) reasonable. In this case, one approximates the distribution of (Xhat-E(Xhat))/sqrt(Vhat(Xhat)) with a t-distribution with k-1 degrees of freedom. If each random group represents a cluster which is chosen with replacement and with probability proportional to z, letting Xhat(i) = Yhat(i)/z(i) leads to the conclusion that the principle of random groups generalizes the principle of cluster sampling with replacement.

4. Principle of ratio estimation: Suppose we want to estimate the population ratio R = Y/X. Let Yhat and Xhat be unbiased estimates of Y and X respectively. Then for "large" sample sizes Rhat = Yhat/Xhat is an approximately unbiased estimate of R and the variance of Rhat can be approximately unbiasedly estimated by taking the formula for the estimated variance of Yhat and replacing each sample value y in that formula by (y - Rx) and dividing the resulting number by X^2 (or Xhat^2 if X is unknown). Examples of this procedure are the combined ratio estimate for stratified designs and the formula for estimating the variance of a ratio in 2-stage cluster designs. The main assumptions are that Xhat and Yhat are linear estimators (namely a linear combination of the sample values), that they come from the same sample, and that they are the same type (that is, the formula for Xhat is obtained from the formula for Yhat by replacing each y with an x). Thus, for example, using a stratified mean for Yhat and a simple random sample mean for Xhat is excluded. Examples of qualifying estimators are the sample mean in a simple random sample, the post-stratified mean in a simple random sample, the stratified mean, and the ppz estimator in (2) above. In the case of the post-stratified mean, the resulting formula for the variance is correct in the 1/n term, but not in the 1/n^2 term.
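As a concrete instance of principle 4, take the simplest qualifying estimator, the sample mean of a simple random sample. The estimated variance of ybar is (1-f)s^2/n, so the recipe is to compute s^2 from the residuals y - Rhat*x and divide by xbar^2. The Python sketch below assumes exactly that setting. The function name is hypothetical, Rhat stands in for the unknown R, sample means are used in place of totals (which leaves the ratio unchanged), and the population size N is optional so that the fpc can be ignored.

```python
import math


def ratio_estimate_srs(y, x, N=None):
    """Return (Rhat, standard error) for R = Y/X from a simple random sample."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    r_hat = ybar / xbar
    # Replace each y by the residual y - Rhat*x; these residuals sum to zero,
    # so their sample variance needs no further centering.
    resid = [yi - r_hat * xi for yi, xi in zip(y, x)]
    s2 = sum(d * d for d in resid) / (n - 1)
    fpc = 1.0 if N is None else 1.0 - n / N
    variance = fpc * s2 / (n * xbar ** 2)
    return r_hat, math.sqrt(variance)


# Made-up numbers, for illustration only.
r_hat, se = ratio_estimate_srs([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(r_hat, se)
```

The same substitution applies to the other qualifying estimators: start from the variance formula for Yhat in that design and feed it the residuals instead of the y values.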

Although these four principles are only a small proportion of the formulas we discussed in this course, everything else we discussed sheds light on how to design a survey for greatest effectiveness. For example, the equations given above for multistage cluster sampling with replacement give no information about the role of between and within cluster variance in the total variance of the estimator. This information is provided by the formula for the true variance (as opposed to its estimate) of the ppz estimator.

PROBLEMS

We are interested in estimating the average price a household is willing to pay for cable TV service in Stephens County and separately in the following four strata:

STRATUM I: Rural areas, districts 1-43
STRATUM II: Three villages, districts 44-46
STRATUM III: Eavesville, districts 47-50
STRATUM IV: Lockhart City, districts 51-75

It is proposed to make the probability that any given house appears in the sample be approximately 1%. More specifically the design in each stratum will be

STRATUM I: 16 districts chosen with probability proportional to size (with replacement), 5 houses sampled in each chosen district.
STRATUM II: A stratified random sample with 3, 6 and 3 houses chosen in districts 44, 45, and 46 respectively; stratified mean estimates.
STRATUM III: Simple random sample size 32 with a ratio to house assessed value estimate.
STRATUM IV: Simple random sample size 197 with a poststratified (on house value) estimate.

This design was chosen partially for educational reasons and partially because it is likely to be reasonably efficient. The choice of 5 houses in each subsample in STRATUM I was chosen because it was felt that this was the maximum one could reasonably expect an interviewer to accomplish in a half day. This preference for 5 houses over any other size subsample cannot be adequately modeled using the cost function we have hypothesized for Stephens County.

1.
Execute a sample according to the above design and estimate with standard error the average price a household in Stephens County is willing to pay for cable TV service. Also estimate with standard error the average price a household in each of the four strata is willing to pay for service.

2. The Stephens County Cable TV Company has just spent $8000 on this study ($6169 in sampling costs plus a reasonable markup for your expert services). Write a short report to the company outlining your conclusions. Although I do not expect you to spend an excessive amount of your time analyzing your sample, I think the Company is entitled to more than just a single estimate of average price. Try to include a reasonable selection of other items it might be interested in. For this purpose, it is useful to note that MINITAB allows you to execute the commands in a file you have previously written into. This allows you to efficiently execute the same collection of commands repeatedly.

3. Do you have any other ideas for designs that would cost about the same amount ($6200)?

Assignment 7

ADJUSTING FOR NONRESPONSE AFTER DATA COLLECTION

So far, in this class, we have been concentrating on designing surveys so as to have the sampling error (i.e., the standard error of our estimates) as small as possible for the amount of money spent. A small sampling error, however, is meaningless if the data are of poor quality, either because of a high nonresponse rate or because responses were not accurate. To date, the residents of Stephens County have been an ideal population--they have always been home, have agreed to participate in the survey, and have always answered truthfully.

Unfortunately, most counties in the U.S. do not have such cooperative residents. In government surveys carried out by the Census Bureau, typically about 5% of the households contacted are either unreachable or the participants refuse to participate in the survey.
For surveys carried out by private organizations, the nonresponse rate tends to be closer to 30%, with higher nonresponse in amateurish efforts. Many surveys carried out by graduate students in their research projects have nonresponse rates of 80 or 90%; these surveys are worthless for describing characteristics of the population.

Nonresponse has a critical effect on estimates from the data: often, nonrespondents differ from the respondents. For example, in the National Crime Survey, nonrespondents are more likely to be members of the demographic groups with higher victimization rates. Thus, an estimate of the overall victimization rate that is based only on the respondents is likely to be too small, and, since we don't have the missing data, we do not know what the bias is!

There are various ways of adjusting for nonresponse after the fact, but THEY DO NOT COMPENSATE FOR NOT HAVING THE DATA. That said, I still think it is better to adjust the estimates than not to adjust the estimates--if you just report the results from the respondents as valid for the whole population, you are assuming that the nonrespondents are just like the respondents, i.e., that the respondents comprise a random sample of the population. This is usually not true. Even though the methods for adjustment are inadequate, when used appropriately they are still better than doing nothing at all about the nonresponse.

When running SURVEY, you may have noticed the prompt

    ENTER DESIRED THREE NONRESPONSE RATES: NOT-AT-HOMES, REFUSALS, RANDOM ANSWERS

If you enter

    .3 0 0

in response, about 30% of the households in Stephens County will "not be home." If you enter

    0 .3 0

about 30% of the households in Stephens County will refuse to say how much they would be willing to pay to subscribe to cable TV. If you enter

    0 0 .3

about 30% of the households in Stephens County will give random answers to certain questions.

PROBLEMS

1.
Generate 200 random addresses for a simple random sample of the households in Stephens County. You will use this same list of addresses for all of the problems in this assignment. Draw a simple random sample of size 200 with no nonresponse to give the base values you would have if all households responded. Estimate the means for the assessed value of the house, and for each of questions 1 through 9.

2. Using the list of addresses from question 1, draw a simple random sample of size 200 with 30% unit nonresponse rate. You will find that about 30% of the households have the information on district, household number, and assessed value, but the words "NOT AT HOME" instead of answers to questions 1 through 9. Find the means for the assessed value of the house, and for questions 1 through 9 for just the responding households. How do these compare with the results from the full simple random sample? Is there evidence of nonresponse bias?

3. Poststratify your sample from question 2 using the strata you constructed in assignment 3. For the responding households in each stratum, assign a weight of

    WEIGHT = (number of households from sample in the stratum) / (number of responding households in the stratum)

Now calculate the weighted estimates for the price a household is willing to pay for cable TV and the number of TV's. Are these closer to the values from question 1? What are you assuming about the nature of the nonresponse when you use this weighting scheme? Do you think these assumptions are justified?

4. For the respondents, fit the linear regression model y = a + b x, where y = price household is willing to pay for cable, and x = assessed value of the house. Now, for the nonrespondents, impute the predicted value from this regression model for the missing y values, and use the "completed" data set to estimate the average price a household is willing to pay for cable.
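
As a rough illustration of the regression imputation described in problem 4, the following Python sketch fits the line by least squares on the respondents only and fills in predicted values for the nonrespondents. It is not part of the SURVEY package: the function names and the four-household data set are made up for illustration, and a missing response is assumed to be coded as None.

```python
def fit_line(x, y):
    """Least-squares intercept a and slope b for y = a + b*x.

    Assumes the x values are not all equal (otherwise the slope is undefined).
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


def impute_by_regression(values, prices):
    """Return a completed price list.

    values: assessed house values (known for every sampled household).
    prices: cable prices, with None for nonresponding households.
    The regression is fitted on the respondents only.
    """
    respondents = [(v, p) for v, p in zip(values, prices) if p is not None]
    a, b = fit_line([v for v, _ in respondents], [p for _, p in respondents])
    return [p if p is not None else a + b * v for v, p in zip(values, prices)]


# Hypothetical data: four households, the second one "NOT AT HOME".
values = [40000, 50000, 60000, 70000]
prices = [10.0, None, 14.0, 16.0]

completed = impute_by_regression(values, prices)
mean_price = sum(completed) / len(completed)
print(round(mean_price, 2))
```

Every imputed value lies exactly on the fitted line, which is worth keeping in mind when looking at the variability of the completed data set.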
Compare this estimate to the previous one, and to the estimate from the full data set. Is the standard error given by your statistical package correct here? Why or why not?

5. Generate another set of data from the same address list, this time with a 30% item nonresponse rate. (The nonresponse parameters are 0, .3, 0.) What is the average price the respondents are willing to pay for cable? Using the respondents, develop a regression model for cable price based on the other variables. Impute the predicted values from this model into your missing observations and recalculate your estimate.

6. Perform another imputation on the data, this time using a hot-deck procedure. Impute the value of the household immediately preceding the one with the missing item (if that one also has missing data, move up through the previous households until you find one that has the data and then impute that value). How does the value using this imputation scheme differ from the estimate in question 5? One way to do the imputation in SAS would be to impute the value for the first household, if missing, by hand. Then carry the most recent available value forward within the DATA step. Note that a conditional call such as IF PRICE = . THEN PRICE = LAG(PRICE); does not do this, because LAG returns the value stored the last time the function was executed rather than the value from the preceding observation. A RETAIN statement can be used instead (LASTP is an arbitrary variable name):

    RETAIN LASTP;
    IF PRICE = . THEN PRICE = LASTP; ELSE LASTP = PRICE;


Alternative assignment 7

NONRESPONSE AND DOUBLE SAMPLING DESIGNS

Quite often in surveys some of the sampled units will have missing data. This is the problem of "nonresponse". Experience has shown, especially in surveys of people, that nonresponders differ in critical ways from responders, and if the nonresponse rate is significant, inference based only upon the responders will be substantially flawed.

A well designed survey will make a valiant attempt to control its nonresponse. For example, a mail survey, for which response rates under 50% are not uncommon, might attempt to telephone the nonresponders. Unfortunately eliciting a response from a nonresponder is usually quite expensive. For example the personnel costs in eliciting a response by telephone are much higher than those in obtaining a response by mail.
Thus if nonresponders form a significant proportion of the population and the budget is limited, complete elimination of nonresponse is not feasible. A solution to this dilemma is provided by a double sampling scheme. In this design a preset proportion of the nonresponders are vigorously resampled to get a response and their answers are then used to estimate the results for the population of nonresponders.

Unfortunately, despite all efforts it is usually impossible to eliminate nonresponse completely, and often the best one can do is to use double sampling or other means to get the nonresponse rate down to an acceptably low level and then to assume the remaining nonresponders are identical to the responding population. This is sometimes called "imputing" values to the nonresponders.

Nonresponse is an example of a nonsampling error. The errors considered until now, that is those due to the random selection of the sample, are called sampling errors. Other nonsampling errors include inaccurate answers and biases and correlations introduced by interviewers. In poorly executed surveys the nonsampling errors can exceed the sampling errors.

NONRESPONSE IN THE SURVEY PROGRAM:

The SURVEY program allows one to simulate the problem of nonresponse in the form of households which are "not at home." Other forms of nonresponse such as uncooperativeness or orneriness do not exist in Stephens County.

To simulate "not at home" in the SURVEY simulation enter .3 for the first nonresponse rate and 0 for the other two. SURVEY will make approximately 30% of households in Stephens County unavailable for sampling. SURVEY assumes that a smaller household is more likely to fail to be at home than a larger household, so the nonresponders will be biased towards the smaller households.

SURVEY will also ask for a nonresponse seed. The nonresponding houses are determined from the nonresponse seed.
If you try to elicit a response from a nonresponding house during the same run of SURVEY, that house will still fail to respond. However, if SURVEY is run a second time and the nonresponse seed changed, the collection of nonresponding houses will also change.

OTHER USES OF DOUBLE SAMPLING IN THE SURVEY SIMULATIONS:

Double sampling is also used when information about a variable x that can be used to improve the design of a survey is inexpensive to obtain. In Stephens County, the tax assessor will supply the assessment of a house for $1 per house (no "district visiting" fees apply). This assessment can be obtained by making the sign of the district number in the household address negative. Thus, for example, if the SURVEY program is fed the address -32, 65, the output file will include the value of the house whose address is 32, 65.

PROBLEMS

1. Take a simple random sample of size 200 from Lockhart City using SURVEY. Use nonresponse rates of .3, 0, 0. Estimate, using the responders only, the average number of persons per household in Lockhart City. It is known that Lockhart City has 19664 houses and a population of 57505. Is there any evidence that the responding households are not representative of Lockhart City?

2. A simple random sample of size 200 from Lockhart City will cost $3100 to execute if everybody responds. Suppose now we have a budget of $4000 to spend on a survey of Lockhart City. The sampling costs in Lockhart City are as follows:

    $20 for each district visited
    $3 for each household visited, whether home or not
    $10 interview and processing costs for each completed interview

Note that if we revisit a district to sample a household that failed to respond to a previous run of SURVEY, we will incur the district visiting charge again. Decide upon a policy of how many times you will visit a household before giving up on it. Then try to design a double sampling scheme for Lockhart City which will cost approximately $4000.
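
The Lockhart City cost structure listed above can be written as a small function for trying out candidate designs. The Python sketch below is not part of SURVEY; the function name and its arguments are hypothetical. The first check assumes that a simple random sample of 200 reaches all 25 districts of Lockhart City (districts 51-75), which reproduces the $3100 figure quoted in the problem. The second call is only an illustration of a first pass in which exactly 30% of the 200 households happen to be not at home.

```python
DISTRICT_COST = 20    # dollars for each district visited
VISIT_COST = 3        # dollars for each household visited, home or not
INTERVIEW_COST = 10   # dollars for each completed interview


def lockhart_cost(districts_visited, household_visits, completed_interviews):
    """Total cost in dollars of one pass of sampling in Lockhart City.

    A later pass (call backs) is costed by a separate call, since the
    district visiting charge is incurred again on each revisit.
    """
    return (DISTRICT_COST * districts_visited
            + VISIT_COST * household_visits
            + INTERVIEW_COST * completed_interviews)


# Full response: 25 districts, 200 households visited, 200 interviews.
print(lockhart_cost(25, 200, 200))   # 3100

# Hypothetical first pass with 60 of the 200 households not at home.
print(lockhart_cost(25, 200, 140))   # 2500
```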
This means choosing a sample size for the initial sample and a fraction of the nonresponding households that you will visit repeatedly until you either get a response or the maximum number of "call backs" that you have set is reached. Try to use the optimum allocation formula to get some insight on how these design parameters are set. The cost structure for Lockhart City does not fit precisely the cost structure hypothesized in the optimum allocation scheme and so you may find it quite difficult to use that formula with any great confidence. Similarly you should not be distressed if after executing your scheme you find that you have not come very close to the target budget of $4000.

3. Execute the design you chose for Lockhart City. You may find that despite your Herculean efforts some of the nonresponding houses that you choose to repeatedly revisit are never home. In that case, you must impute to those houses the average of the responses from the initially nonresponding houses that you did succeed in resampling. After doing so, estimate using the formulas for double sampling the average price a household is willing to pay for cable service in Lockhart City together with its standard error. Note also the total cost of your survey.

4. Suppose we decide to sample the rural areas (districts 1-43) of Stephens County using a two stage cluster design with each district representing a cluster and an overall sampling fraction of 2%. 8 districts are to be chosen with probability proportional to size and with replacement; 20 households are to be sampled in each chosen district. 50% of the initially nonresponding households are to be revisited a maximum of 3 times to elicit a response. Denote by M(i) the number of households in district i, x(i,j) the responses of the sampled houses in district i that responded in the initial survey and y(i,j) the elicited responses of the sampled houses in district i that failed to respond to the initial survey.
The double sampling estimate of the total amount that the households in district i are willing to pay for cable service is

    Yhat(i) = (M(i)/20) [SUM(x(i,j)) + 2 SUM(y(i,j))].

Since districts 1-43 have a total of 7932 houses, it follows that

    Yhat = SUM[7932 Yhat(i) / (8 M(i))],

where the sum is over the 8 chosen districts, estimates the total amount that households in districts 1-43 are willing to pay for cable service, and the principle of cluster sampling with replacement can be used to estimate V(Yhat). Using this design and a "not at home" rate of 30%, estimate the average amount that a household in districts 1-43 is willing to pay for cable TV service.

The factor 7932 M(i)/(8 M(i) 20), or approximately 1/.02, that each x(i,j) is multiplied by in Yhat is called its weight. The term weighting the data is also used. Similarly each y(i,j) has a weight of about 1/.01.

************************************************************************
*                          end of transmission                         *
************************************************************************