Tuesday, October 26, 2010

Scrap the National Household Survey?

I left off on my last post suggesting that if the voluntary National Household Survey is to go forward, it should be completely decoupled from the Census. (The recent court challenge brought by the Canadian Council for Social Development provides faint hope for another possible outcome, and I will comment on this in a future post.)

But, assuming that the content previously collected in the census long form is now to be collected on a voluntary basis, what would be the best way to proceed if this proposition were to be considered coolly and rationally, instead of in the rushed, under the gun, last minute manner that was played out after Statistics Canada was punk'd by the government last spring? The NHS as currently planned has problems numerous and deep, to the point of being fatally flawed. Running it as part of the census operation, in the weeks immediately following the census, puts the census itself at risk. I am convinced that it will be much more difficult this time to get the cooperation of Canadians even for the mandatory census. I think we can safely predict that thousands will refuse to fill out the Census, even if it is mandatory, and openly challenge the government to coerce them to do so, under the threat of fines or jail, knowing full well that the Prime Minister and his cabinet support their resistance. It will take all of Statistics Canada's skill and energy, over a much longer period than usual, to get a good outcome just for the basic Census. Throwing the voluntary NHS into the mix as part of the Census operation will only compound the problem and risk a meltdown of the whole system. (As an aside, when Statistics Canada was ordered to bring forth a voluntary option for the long form content, I imagine they thought that this one, the NHS, was too outlandish ever to be selected by Cabinet: more response burden, less reliable data, more expensive - a loser all down the line.)

So, move the NHS as far away from the Census as possible for the sake of preserving the integrity of the census itself. There is no deadline for the NHS, and, unlike the census, no requirement that it be held at a given time. Since 1986, post-censal surveys, linked to the Census, but operationally distinct from it, have been conducted many months after the census. For example, in the case of the 1991 Health and Activity Limitation Survey, data were collected from August to October 1991, several months after the 1991 Census Day, which was June 4, 1991. More recently, data collection for the 2006 Survey on the Vitality of Official-Language Minorities took place from October 2006 to January 2007, nearly six months after the 2006 Census. So, there should be no problem shifting the NHS later in time so that it is well clear of Census data collection. Even though there may be a lag in reference dates between the Census and the follow-up survey, this has not caused any quality issues for post-censal surveys in the past nor should it pose any problem for the NHS now.

The second problem will be getting any cooperation at all from Canadians to fill out this voluntary survey. Moving it several months after the census will help, but all those who resisted filling out the census proper will almost surely refuse to complete the NHS. Many of those who agree with the government's point of view that these questions are intrusive or outright silly will gladly take advantage of its voluntary nature and refuse to respond. Among those who disagree with the government's decision to discontinue the mandatory long form, a not uncommon reaction will be to show their displeasure by boycotting the NHS. And those who simply find it burdensome, who are too busy, or who feel that they have already done their civic duty by completing the mandatory census, will just let it slide. So, I believe the controversy surrounding this decision and the confused and mixed messages coming from the government have poisoned the well and have made it nearly impossible to achieve even the modest 50% response rate that Statistics Canada now expects.

The third problem is the sheer size of the thing. One third of households: over 4 million questionnaires! That's crazy. You can get reliable estimates of social characteristics of the population for all Census Metropolitan Areas, representing over 85% of the Canadian population, with a sample of 25,000 (see the General Social Survey). The Labour Force Survey, which provides estimates of employment for all CMAs, economic regions and EI regions, uses a sample of around 54,000 households. The Canadian Community Health Survey uses a sample of 65,000 respondents annually to produce detailed health variables for 121 subprovincial health regions. How can they do it with such small numbers, orders of magnitude smaller than the planned sample size for the NHS?

Three reasons. First, the sample sizes for these surveys do not support estimates for small areas, such as city blocks, census tracts (which are like neighbourhoods) and small rural communities. The expectation, or rather the hope, is that the very large sample size of the NHS will support the production of this type of small area data (which is the main strength of the mandatory census). Second, these surveys achieve higher response rates than what is anticipated for the NHS. The sample for the voluntary NHS consists of one in three households because a response rate of no better than 50% is expected, yielding about the same number of usable responses as the 1 in 5 sample did for the mandatory census. Third, these surveys do not fear non-response bias (i.e. where the characteristics of non-respondents are systematically different from those of respondents, thus skewing estimates to represent only respondents and not the whole population). This is because, up until now, they have been able to compare their estimates of these characteristics to a benchmark, the mandatory census, and to correct for any biases they find. This is what the NHS will not be able to do, regardless of its sample size. So in summary, the very large sample size was chosen so as to support the production of small area data, and it was bumped up by 50% to account for high expected non-response, but in the end the resulting data will nonetheless contain unmeasurable biases that will make it suspect. Why try to produce small area data if you can't stand behind the results? Actually, why try to produce estimates for any area if you can't stand behind the results? It makes no sense.
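To put rough numbers on the second reason, here is a back-of-the-envelope sketch in Python. It uses the 50% NHS response rate mentioned above and, as an illustrative assumption only, a 97% response rate for the old mandatory long form. The function name is made up for this example.

```python
from fractions import Fraction

def usable_share(sampling_fraction, response_rate):
    """Share of all households that end up as usable responses."""
    return sampling_fraction * response_rate

# Voluntary NHS: 1 in 3 households sampled, 50% expected to respond.
nhs_share = usable_share(Fraction(1, 3), Fraction(1, 2))

# Mandatory long form: 1 in 5 sampled; the 97% response rate is an assumption.
long_form_share = usable_share(Fraction(1, 5), Fraction(97, 100))

print(f"NHS:       {float(nhs_share):.1%} of households")
print(f"Long form: {float(long_form_share):.1%} of households")
```

Under these assumptions the NHS ends up with usable responses from roughly one household in six, a little under the share the mandatory long form would have delivered, which is the "about the same" referred to above.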

And that takes us to the last, and most serious, problem with the NHS as currently planned: non-response bias. All surveys are subject to non-response bias. In the case of sample surveys like those mentioned above (GSS, LFS, CCHS), the presence of bias can be detected by comparing their estimates to the census estimates, which are taken as an accurate benchmark, and correcting for any bias detected. How do we know that the Census itself does not contain non-response bias? Actually, we don't, but because it is mandatory and response rates of 97% or 98% are achieved, any bias is so small as to have no practical effect on the estimates. So the census can confidently be used as a benchmark to detect and correct for biases in other sample surveys. As previously mentioned, this cannot be done for the NHS. With response rates of 50% or less, the potential for non-response bias is huge, but without a benchmark against which to compare, such as the mandatory census, the biases cannot be measured or corrected. The only remedy for this would be to have a data source with a 98% response rate, for the same population for the same reference period, a practical impossibility. (In addition, the NHS estimates containing unknown and unmeasurable bias cannot serve as a benchmark for other sample surveys, as the census did, leaving surveys such as the GSS, LFS and CCHS, and all other household sample surveys, high and dry.) So this leads to a major impasse. Why conduct a massive survey, burdening one third of Canadian households, and costing $110 million, to produce data no one can trust?
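The standard textbook approximation for the bias of a respondent-only average makes the point concrete: the bias is the non-response rate multiplied by the gap between respondents and non-respondents. A minimal Python sketch follows. The income figures are invented purely for illustration and are not census or NHS numbers, and the function name is hypothetical.

```python
def nonresponse_bias(response_rate, mean_respondents, mean_nonrespondents):
    """Bias of the respondent mean relative to the full-population mean.

    Population mean = r * mean_r + (1 - r) * mean_nr, so the respondent
    mean is off by (1 - r) * (mean_r - mean_nr).
    """
    return (1 - response_rate) * (mean_respondents - mean_nonrespondents)

# Hypothetical figures: respondents average $60,000, non-respondents $50,000.
MEAN_R, MEAN_NR = 60_000, 50_000

for rate in (0.98, 0.50):
    bias = nonresponse_bias(rate, MEAN_R, MEAN_NR)
    print(f"response rate {rate:.0%}: estimate overstated by about ${bias:,.0f}")
```

With the same gap between the two groups, the overstatement is about $200 at a 98% response rate and $5,000 at 50%. The catch, as described above, is that without a benchmark the gap itself is unknown, so the bias can be neither measured nor corrected.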

So what's to be done? The long form variables are very valuable to data users, as demonstrated by the near unanimous outcry against the government's decision. So it's not a question of just scrapping the NHS and leaving it at that. If I were blue-skying about what to do for a replacement, I think I would give up on the small area data. That is specifically the price of abandoning the mandatory nature of the census. It is what the mandatory census can give you that no other vehicle can. If this was not made clear when the decision was taken, it should have been. But what is done is done and there is no way to make it better. So aim for a sample design to support quality data at a higher level of geography. Maybe all CMAs and CAs and some broad rural areas per province. Then, I would try to reduce expected non-response. Divorce it from the census and the name "National Household Survey", which is now like response kryptonite. Break it down into manageable, less burdensome, content modules and spread it out over time, thus exercising the longer term strategy of using the census infrastructure for sample surveys. Then, I would try to deal with non-response bias by running a split panel for each content module, with a mandatory and a voluntary component, allowing the survey to essentially benchmark itself.

In the end, while still not as useful as the mandatory census, such an approach would provide more useful, usable data than the NHS as currently planned, for less money, less response burden and less damage to the national statistical system.

Monday, October 18, 2010

End of the line for the 2011 Census long form

With the announcement that the Fédération des communautés francophones et acadiennes will not appeal the Federal Court's ruling concerning the 2011 Census long form, it's all over now for the 2011 Census.

For the first time since Confederation, that's 143 years ago, the decennial census will not include questions on housing, religion, education, race and occupation of each person (1871 Census), birthplace, citizenship and period of immigration (1901 Census) and all the other population characteristics that make the census such a powerful source of information, beyond basic demographics. Instead, the 2011 Census will be a very basic affair, just 8 questions. For this, Canadian taxpayers will pay $550 million over the full eight-year cycle it takes to plan, develop, run and publish the results of a census of population. The full census would have cost $80 million more, for a total of $630 million. That's right, over 85% of the cost of taking the census is taken up just finding and counting the population. It's very cost effective to ask the extra, mandatory, questions to 1 in 5 households as part of the census.
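For anyone who wants to check the arithmetic, here is a quick Python sketch using only the two dollar figures quoted above. The variable names are mine.

```python
# Figures from the paragraph above, in millions of dollars.
BASIC_CENSUS = 550     # short-form-only 2011 Census
LONG_FORM_EXTRA = 80   # additional cost of the mandatory long form

full_census = BASIC_CENSUS + LONG_FORM_EXTRA
counting_share = BASIC_CENSUS / full_census
long_form_share = LONG_FORM_EXTRA / full_census

print(f"Full census:        ${full_census} million")
print(f"Finding + counting: {counting_share:.1%}")
print(f"Long form content:  {long_form_share:.1%}")
```

The basic count works out to a bit more than 87% of the total, so the long form content would have added only about one eighth to the bill.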

Now instead, that $80 million, plus a promised additional $30 million, will be used to conduct the untested, voluntary National Household Survey, which will be sent to one in three households. Census systems, which were completely re-engineered for the 2006 Census to enable a mail-out/mail-back method of collection, with an Internet response option, will need to be re-jigged to trigger the mail-out of the NHS questionnaire upon receipt of a completed Census questionnaire. This wasn't the methodology that was used in the National Census Test in 2009, which was supposed to be a dress rehearsal for the real thing.

So this is definitely a high risk gambit, which may not only produce poor results for the $110 million spent on the NHS, but which could jeopardize the $550 million being spent on the Census itself. The previous, previous Chief Statistician, Ivan Fellegi, used to say that there is only one way of running a census: running scared. It's a huge beast, which can easily spin out of control, with a high potential for huge cost overruns and unacceptable undercounts. That's why everything is tested and tried in advance, extensive consultation is conducted and high profile endorsements are sought.

I don't know about you, but after filling out and returning my mandatory Census questionnaire, I am not sure I would be that enthusiastic about receiving another, longer questionnaire. Having performed my civic duty, I might be tempted to give it a pass, especially as it is voluntary. And if I were of the political persuasion that shares the Prime Minister's, and Tony Clement's, and Maxime Bernier's, low opinion of government data collection, expressed clearly and frequently on the national airwaves - it's almost like PSAs for not completing the census - I would very certainly not fill it out. So I think we can predict very low response rates for the NHS and probably more difficulty getting cooperation for the mandatory census itself.

And why are we doing this again? Oh yes,

"We recognize that some people are a bit hesitant regarding their private life," Harper said in the House of Commons on Tuesday in response to a question from Bloc Quebecois Leader Gilles Duceppe. "We intend to work co-operatively with the population. We don't threaten to prosecute them for being hesitant. We work with them as adults."

Except when it comes to Swiss bank accounts, of course.

Prime Minister Stephen Harper vowed Thursday that the government would pursue with "the full extent of the law" Canadians who are using secret Swiss bank accounts to avoid paying taxes.

Anyway, at this time, the fiscally prudent thing to do would be to decouple the NHS from the Census altogether in a bid to reduce the risks for the Census proper. If the content that was previously collected in the census long form is to be collected henceforth on a voluntary basis, do the research and testing that will provide for the best, most cost-effective results, including, in the absence of a mandatory census as a benchmark, how to quantify and correct for the biases that will inevitably be found in voluntary surveys (for example, it's pretty safe to predict that Conservative supporters will be underrepresented in the NHS as it now stands).

Maybe it's too late. The train has left the station and it's barreling down the track. Let's hope against hope that we are not headed for a train wreck.