Assignment 6
NONRESPONSE AND DOUBLE SAMPLING DESIGNS
due: April 23, 1996

Quite often in surveys some of the sampled units will have missing data. This is the problem of "nonresponse". Experience has shown, especially in surveys of people, that nonresponders differ in critical ways from responders, and if the nonresponse rate is significant, inference based only upon the responders will be substantially flawed. A well designed survey will make a valiant attempt to control its nonresponse. For example, a mail survey, for which response rates under 50% are not uncommon, might attempt to telephone the nonresponders.

Unfortunately, eliciting a response from a nonresponder is usually quite expensive. For example, the personnel costs of eliciting a response by telephone are much higher than those of obtaining a response by mail. Thus if nonresponders form a significant proportion of the population and the budget is limited, complete elimination of nonresponse is not feasible.

A solution to this dilemma is provided by a double sampling scheme. In this design a preset proportion of the nonresponders is vigorously resampled to get a response, and their answers are then used to estimate the results for the population of nonresponders. Unfortunately, despite all efforts it is usually impossible to eliminate nonresponse completely. Often the best one can do is to use double sampling or other means to bring the nonresponse rate down to an acceptably low level and then to assume that the remaining nonresponders are identical to the responding population. This is sometimes called "imputing" values to the nonresponders.

Nonresponse is an example of a nonsampling error. The errors considered until now, that is, those due to the random selection of the sample, are called sampling errors. Other nonsampling errors include inaccurate answers and biases and correlations introduced by interviewers. In poorly executed surveys the nonsampling errors can exceed the sampling errors.
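The double sampling idea described above can be illustrated with a minimal sketch. It assumes the usual textbook form of the point estimate, in which the follow-up mean stands in for all initial nonresponders; the formulas used in the course may differ in detail. The function name and the toy data are hypothetical and are not SURVEY output.

```python
def double_sampling_mean(responder_values, followup_values, n_nonresponders):
    """Point estimate of a population mean under double sampling for nonresponse.

    responder_values: answers from units that responded to the initial survey
    followup_values:  answers obtained from the followed-up subsample of nonresponders
    n_nonresponders:  number of ALL initial nonresponders, not just the subsample
    """
    n1 = len(responder_values)
    n2 = n_nonresponders
    if n1 + n2 == 0:
        raise ValueError("the initial sample is empty")
    if n2 > 0 and len(followup_values) == 0:
        raise ValueError("no follow-up answers to represent the nonresponders")

    mean_responders = sum(responder_values) / n1 if n1 > 0 else 0.0
    mean_followup = sum(followup_values) / len(followup_values) if n2 > 0 else 0.0

    # Each group is weighted by its share of the initial sample.
    return (n1 * mean_responders + n2 * mean_followup) / (n1 + n2)


# Hypothetical toy data: 10 households sampled, 7 respond, 3 do not,
# and 2 of those 3 are successfully followed up.
responders = [3, 4, 2, 5, 3, 4, 3]
followup = [1, 2]

naive_mean = sum(responders) / len(responders)
adjusted_mean = double_sampling_mean(responders, followup, n_nonresponders=3)
print(naive_mean, adjusted_mean)
```

In this toy data the followed-up households are smaller than the initial responders, so the adjusted estimate falls below the responders-only mean. That is the direction of bias one would expect when smaller households are less likely to be at home.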
NONRESPONSE IN THE SURVEY PROGRAM:
The SURVEY program allows one to simulate the problem of nonresponse in the form of households which are "not at home." Other forms of nonresponse such as uncooperativeness or orneriness do not exist in Stephens County. To simulate "not at home" in the SURVEY simulation, enter .3 for the first nonresponse rate and 0 for the other two. SURVEY will make approximately 30% of households in Stephens County unavailable for sampling. SURVEY assumes that a smaller household is more likely to fail to be at home than a larger household, so the nonresponders will be biased towards the smaller households.

SURVEY will also ask for a nonresponse seed. The nonresponding houses are determined from the nonresponse seed. If you try to elicit a response from a nonresponding house during the same run of SURVEY, that house will still fail to respond. However, if SURVEY is run a second time and the nonresponse seed is changed, the collection of nonresponding houses will also change.

OTHER USES OF DOUBLE SAMPLING IN THE SURVEY SIMULATIONS:
Double sampling is also used when information about a variable x that can be used to improve the design of a survey is inexpensive to obtain. In Stephens County, the tax assessor will supply the assessment of a house for $1 per house (no "district visiting" fees apply). This assessment can be obtained by making the sign of the district number in the household address negative. Thus, for example, if the SURVEY program is fed the address -32, 65, the output file will include the value of the house whose address is 32, 65.

PROBLEMS

1. Take a simple random sample of size 200 from Lockhart City using SURVEY. Use nonresponse rates of .3, 0, 0. Estimate, using the responders only, the average number of persons per household in Lockhart City. It is known that Lockhart City has 19664 houses and a population of 57505. Is there any evidence that the responding households are not representative of Lockhart City?

2.
A simple random sample of size 200 from Lockhart City will cost $3100 to execute if everybody responds. Suppose now we have a budget of $4000 to spend on a survey of Lockhart City. The sampling costs in Lockhart City are as follows:

   $20 for each district visited
   $3 for each household visited, whether home or not
   $10 interview and processing costs for each completed interview

Note that if we revisit a district to sample a household that failed to respond in a previous run of SURVEY, we will again incur the district visiting charge. Decide upon a policy of how many times you will visit a household before giving up on it. Then try to design a double sampling scheme for Lockhart City which will cost approximately $4000. This means choosing a sample size for the initial sample and a fraction of the nonresponding households that you will visit repeatedly until you either get a response or reach the maximum number of "call backs" that you have set. Try to use the optimum allocation formula to get some insight into how these design parameters are set. The cost structure for Lockhart City does not fit precisely the cost structure hypothesized in the optimum allocation scheme, and so you may find it quite difficult to use that formula with any great confidence. Similarly, you should not be distressed if after executing your scheme you find that you have not come very close to the target budget of $4000.

3. Execute the design you chose for Lockhart City. You may find that despite your Herculean efforts some of the nonresponding houses that you choose to revisit repeatedly are never home. In that case, you must impute to those houses the average of the responses from the initially nonresponding houses that you did succeed in resampling. After doing so, estimate, using the formulas for double sampling, the average price a household is willing to pay for cable service in Lockhart City, together with its standard error. Note also the total cost of your survey.

4.
Suppose we decide to sample the rural areas (districts 1-43) of Stephens County using a two stage cluster design with each district representing a cluster and an overall sampling fraction of 2%. Eight districts are to be chosen with probability proportional to size and with replacement; 20 households are to be sampled in each chosen district. 50% of the initially nonresponding households are to be revisited a maximum of 3 times to elicit a response.

Denote by M(i) the number of households in district i, by x(i,j) the responses of the sampled houses in district i that responded in the initial survey, and by y(i,j) the elicited responses of the sampled houses in district i that failed to respond to the initial survey. The double sampling estimate of the total amount that the households in district i are willing to pay for cable service is

   Yhat(i) = (M(i)/20)[SUM(x(i,j)) + 2SUM(y(i,j))].

Since districts 1-43 have a total of 7932 houses, it follows that

   Yhat = SUM[7932Yhat(i)/(8M(i))],

where the sum runs over the 8 chosen districts, estimates the total amount that households in districts 1-43 are willing to pay for cable service, and the principle of cluster sampling with replacement can be used to estimate V(Yhat).

Using this design and a "not at home" rate of 30%, estimate the average amount that a household in districts 1-43 is willing to pay for cable TV service (that is, the estimated total divided by 7932).

The factor 7932M(i)/(8M(i)20), or approximately 1/.02, by which each x(i,j) is multiplied in Yhat is called its weight. The term "weighting the data" is also used. Similarly each y(i,j) has a weight of about 1/.01.
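The estimator in problem 4 can be sketched in a few lines of code. The district totals and the overall total follow the formulas Yhat(i) and Yhat given in the problem. The variance step is an assumption: it uses the standard textbook estimate for sampling with replacement, namely the sample variance of the per-draw estimates 7932Yhat(i)/M(i) divided by the number of draws, which may not match the course notes exactly. The function names and the toy data are hypothetical and are not SURVEY output.

```python
import math

N_HOUSES = 7932            # total houses in districts 1-43 (from the assignment)
HOUSES_PER_DISTRICT = 20   # households sampled in each chosen district
FOLLOWUP_FACTOR = 2        # 50% of nonresponders revisited, so each counts twice


def district_total(M_i, x, y):
    """Yhat(i) = (M(i)/20) [SUM x(i,j) + 2 SUM y(i,j)]."""
    return (M_i / HOUSES_PER_DISTRICT) * (sum(x) + FOLLOWUP_FACTOR * sum(y))


def pps_estimate(districts):
    """districts: one (M_i, x_values, y_values) tuple per draw (repeats allowed).

    Returns (estimated total, estimated mean per household, SE of the mean).
    """
    n = len(districts)
    if n < 2:
        raise ValueError("need at least two draws to estimate the variance")

    # Per-draw estimates of the overall total: 7932 * Yhat(i) / M(i).
    z = [N_HOUSES * district_total(M, x, y) / M for (M, x, y) in districts]

    total = sum(z) / n
    # Assumed with-replacement variance estimate of the total.
    var_total = sum((zi - total) ** 2 for zi in z) / (n * (n - 1))

    mean = total / N_HOUSES
    se_mean = math.sqrt(var_total) / N_HOUSES
    return total, mean, se_mean


# Hypothetical toy data with only two draws, to keep the arithmetic visible.
# In each district 14 houses respond initially and 3 of the 6 nonresponders
# are followed up successfully.
toy = [
    (100, [10] * 14, [10] * 3),   # every household answers $10
    (200, [20] * 14, [20] * 3),   # every household answers $20
]
total, mean, se_mean = pps_estimate(toy)
print(total, mean, se_mean)
```

With this toy data the two per-draw household averages are $10 and $20, so the estimated mean is $15 with a standard error of $5. Note that M(i) cancels in each per-draw estimate, which is why every initially responding household ends up with the same weight of about 1/.02.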