The Indexer uses a wide range of data sources, most from public agencies. To make the best use of the data in the Indexer, it is critical to understand the origins, structure and limitations of these data sources. No data is perfect, and it is incumbent on users of data to understand the imperfections and to make accommodations for them in interpretation and analysis.
Nearly all data originally comes from one of two sources:
Survey data. As the name suggests, survey data is gathered intentionally through surveys that ask consistent questions to a defined group of respondents. Most surveys—the decennial census being a notable exception—use sampling techniques to survey a subset of the total universe. This results in sampling errors that must be taken into consideration.
Administrative data. This is data collected as a by-product of another function, such as tax collection, the recording of births and deaths, or school enrollment. Administrative data generally covers larger groups than surveys do, but because data gathering is not the primary purpose of the activity, any group not included in the administrative process will not be included in the data. Errors and omissions can go unnoticed or unheeded. Whereas errors in survey data are estimated as part of the survey process, errors and omissions in administrative data are not generally estimated.
Agencies may combine survey and administrative data when making estimates in a climate of uncertainty. For example, to estimate local population change, the Census Bureau uses survey data from the American Community Survey as well as administrative data from the Internal Revenue Service. Similarly, the Bureau of Labor Statistics uses both employer survey data and unemployment insurance data to estimate employment growth. Each primary data source has its strengths and weaknesses, and agencies charged with producing quality data will often use both.
Following are the major data sources used by the Indexer, with notes about important limitations.
Census Population and Housing Unit Estimates. The Census Bureau combines a number of data sources to provide annual intercensal estimates of population and housing. This data includes population counts, age, gender and ethnicity. Estimates are provided for cities, counties and metropolitan areas. The program also provides data on the components of county and metro area population growth (births, deaths, net migration), including estimates of foreign in-migration.
Limitations. It must be emphasized that these are estimates, as distinguished from the survey count of the decennial census. The estimates are derived from a number of data sources. Margins of error are not given. Importantly, the annual estimates of past years are updated each year as new or more accurate data becomes available.
Decennial Census—short form. Indexer will use data from the decennial censuses of 1990, 2000 and 2010. The basic census questionnaire, known as the “short form,” is supposed to be collected from every household and individual in the country and has a limited number of questions. It covers age, ethnicity, family and household relationships, and tenure (rent vs. own).
Limitations: Since the short form census covers nearly all households and individuals, it is generally considered quite accurate. As with any survey there will be some level of non-response, and that is likely to be concentrated in certain groups.
Decennial Census—long form. The Census Bureau used a “long form” questionnaire, but discontinued this practice after the 2000 census. The long form contained detailed questions about homes, incomes, commutes, education and other factors in people’s lives. The long form was given to one out of every six households in the country in 2000, but sampling varied by area. Indexer uses long form data for 1990 and 2000.
Limitations. The long-form sampling was large enough that no error margins are given. As with the short form, some groups will likely be underrepresented due to non-response.
American Community Survey. The Census Bureau replaced the decennial census long form with the annual American Community Survey (ACS). The first ACS data is available for 2005. The ACS samples about one percent of the households in the country each year, using a lengthy questionnaire that covers a long list of topics. ACS data can be retrieved for a wide range of geographies, including cities, school districts and legislative districts. There is very good overlap between the old long form and the ACS, making comparisons possible over time. The annual schedule of the ACS makes it invaluable in tracking rapidly changing areas like Puget Sound. The Indexer makes extensive use of ACS data.
ACS data for 2005 through 2017 is available at American FactFinder. ACS data for 2010 through 2018 is available at the new Census data portal. All decennial census and ACS data will migrate to the new data portal in 2020 and American FactFinder will be discontinued.
Limitations. The use of a 1 percent sample by the ACS introduces the problem of sampling error. All but the very highest level ACS data points are accompanied by a “margin of error” (MOE). The MOE is stated as a “plus or minus” figure that is added to and subtracted from the reported figure to create a range. The MOE is set such that there is a 90 percent chance that the “correct” answer lies within that range. For example, ACS may report a population figure of 9,500 with an error margin of 300, indicating that there is a 90 percent chance that the actual population of that place is between 9,200 and 9,800.
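The range arithmetic in that example can be sketched in a few lines of Python. The function names are illustrative only and are not part of any Census tool. The figures are the ones used in the example above.

```python
def moe_range(estimate, moe):
    """Return the (low, high) range implied by an estimate and its margin of error."""
    return estimate - moe, estimate + moe


def relative_moe(estimate, moe):
    """Express the margin of error as a share of the estimate itself."""
    return moe / estimate


low, high = moe_range(9500, 300)
print(low, high)                           # 9200 9800
print(round(relative_moe(9500, 300), 3))   # 0.032
```

Looking at the MOE as a share of the estimate (about 3 percent here) is one quick way to judge whether a figure is precise enough to be useful.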
MOEs can be set aside where sample sizes are large and the MOEs are, consequently, small. But in some cases, the MOE is so large that the data is meaningless. For a universe like King County, with over two million people, a one percent sample yields 20,000 responses, so most general questions within that sample will have a small MOE. But a one percent sample in a city of 5,000 people yields only 50 responses, and no information from this sample will have reasonable statistical validity. In general, the ACS does not report data for single years for areas with fewer than 65,000 people.
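The sample sizes quoted above, and their effect on precision, can be approximated as follows. This is a rough sketch that assumes simple random sampling. The ACS uses a more complex design and weighting, so published MOEs will differ from these figures. The function names are hypothetical, and 1.645 is the standard z-score for a 90 percent confidence level.

```python
import math

Z_90 = 1.645  # z-score for a 90 percent confidence level


def expected_responses(population, sampling_rate=0.01):
    """Approximate number of responses from sampling a share of the population."""
    return round(population * sampling_rate)


def rough_moe_for_share(share, responses):
    """Rough 90 percent MOE for a proportion, ASSUMING simple random sampling."""
    return Z_90 * math.sqrt(share * (1 - share) / responses)


for name, population in [("large county", 2_000_000), ("small city", 5_000)]:
    n = expected_responses(population)
    moe = rough_moe_for_share(0.5, n)
    print(f"{name}: {n} responses, MOE about +/-{moe * 100:.1f} percentage points")
```

Under these assumptions, a 50 percent share measured in the large county carries an MOE of well under one percentage point, while the same measure in the small city carries an MOE of more than ten points.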
The Census Bureau has two ways to address the problem of statistical validity for small jurisdictions. First is the designation of Public Use Microdata Areas (PUMAs) of around 100,000 people. This can mean a portion of a larger city, an entire medium-sized city, or a group of smaller cities. Census will report one-year data for PUMAs, but some PUMA data can still have large MOEs.
The second way that Census has of dealing with high MOEs is to provide data with a trailing five-year average. For example, the 2018 five-year average will include all data for 2014 through 2018, but presented as a one-year equivalent. This significantly reduces MOEs, but for small areas the MOEs can still be quite large for fine-grained data. And use of five-year data can miss rapid change.
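The way a five-year figure can lag a fast-changing area is easy to see with made-up numbers. The sketch below uses a simple mean of five hypothetical single-year values as a stand-in. The Census Bureau pools survey responses rather than averaging published annual figures, so this only approximates the effect.

```python
# Hypothetical single-year values for a fast-growing small area.
annual = {2014: 100, 2015: 110, 2016: 120, 2017: 130, 2018: 140}


def five_year_value(series, end_year):
    """Simple mean of the five years ending in end_year (a stand-in for pooling)."""
    years = range(end_year - 4, end_year + 1)
    return sum(series[y] for y in years) / 5


pooled = five_year_value(annual, 2018)
print(pooled)                 # 120.0
print(annual[2018] - pooled)  # 20.0
```

In this illustration the five-year figure labeled 2018 sits at the level the area reached in 2016, which is the kind of lag to keep in mind when reading five-year data for rapidly changing places.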
For clarity, the Indexer will generally not include MOEs. Where MOEs render a measure or a trend statistically insignificant, the Indexer will not publish that data.
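The text does not spell out how significance is judged. One common approach, described in Census Bureau guidance for ACS users, converts each MOE to a standard error by dividing by 1.645 and then compares the difference between two estimates to their combined error. The sketch below assumes the two estimates are independent. The function name and the second set of figures are hypothetical.

```python
import math

Z_90 = 1.645  # ACS margins of error are published at the 90 percent level


def is_significant_change(est_a, moe_a, est_b, moe_b):
    """True if two estimates differ by more than their combined sampling error."""
    se_a = moe_a / Z_90
    se_b = moe_b / Z_90
    z = abs(est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return z > Z_90


print(is_significant_change(9500, 300, 9900, 350))   # False
print(is_significant_change(9500, 300, 10200, 350))  # True
```

In the first comparison the apparent gain of 400 falls inside the combined margin of error, so it could not be reported as a real change. The gain of 700 in the second comparison could.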
Washington State Office of Financial Management (OFM) Population Estimates. OFM is charged by statute to produce intercensal population estimates for counties and cities. These estimates are used for state program funding as well as for planning under the state Growth Management Act (GMA). OFM produces population estimates for cities and counties as of April 1 of each year, and delivers those estimates by July 1 of that year. OFM also produces estimates of the components of population change (births, deaths, net migration) for counties.
To arrive at local estimates, OFM first estimates the population of the entire state. It then divides that population among the 39 counties. Then, it estimates the population of cities and unincorporated areas within those counties. It relies on a number of data sources, including housing construction, vacancy reports, school enrollments and program participation.
Limitations: Because of the tight timelines under which OFM must produce data, it cannot always have solid data from which to work. Birth and death records are released with some lags, but they are relatively consistent. Migration data is the real challenge, since there is no direct administrative capture of migration. There are several proxies, but these are all incomplete.
OFM uses a “residual” method to estimate net migration. First it estimates the county population and subtracts the prior year’s county population estimate from it to arrive at an annual growth figure. Then it estimates births and deaths, subtracts the “natural” growth (births minus deaths) from the estimated growth, and assigns the difference (the residual) to net migration. Thus, since birth and death records are fairly accurate, all the error in the original growth estimate shows up as error in net migration.
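The residual calculation can be sketched as follows, taking growth as the current estimate minus the prior year’s estimate. The county figures are hypothetical and the function is an illustration of the arithmetic, not OFM’s actual model.

```python
def residual_net_migration(pop_now, pop_prior, births, deaths):
    """Net migration as the part of growth not explained by births and deaths."""
    total_growth = pop_now - pop_prior
    natural_growth = births - deaths
    return total_growth - natural_growth


# Hypothetical county figures.
base = residual_net_migration(pop_now=512_000, pop_prior=500_000,
                              births=6_000, deaths=4_000)
# Same county, but the current population estimate is 3,000 too high.
overstated = residual_net_migration(pop_now=515_000, pop_prior=500_000,
                                    births=6_000, deaths=4_000)
print(base)               # 10000
print(overstated - base)  # 3000
```

The second call shows the error propagation: an overstatement of 3,000 in the population estimate passes straight through to net migration.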
OFM acknowledges that its migration figures are not very accurate from year to year. It suggests that data users create rolling averages of migration to arrive at more accurate figures. But unlike the Census Population Estimate program, OFM does not update past estimates based on new data, such as that available from the ACS and from the Internal Revenue Service (IRS).
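A rolling average of the kind suggested here might look like the sketch below. The three-year window is an assumption, since the text does not say what window to use, and the migration figures are hypothetical.

```python
def rolling_mean(values, window=3):
    """Trailing moving average; the first window-1 positions have no value."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical annual net migration estimates for one county.
net_migration = [4_000, 9_000, 2_000, 7_000, 6_000]
print(rolling_mean(net_migration))  # [None, None, 5000.0, 6000.0, 5000.0]
```

The averaged series moves within a much narrower band than the raw annual figures, at the cost of responding more slowly to real shifts.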
Migration is central to understanding population growth in the Puget Sound region, both in terms of people moving to the region from outside, and in terms of movements within the region. The Indexer recognizes the challenges OFM faces with its timelines. Nonetheless, the Indexer will mostly rely on the Census estimates for net migration and will use IRS data for local area migration.
Office of the Superintendent of Public Instruction (OSPI) Assessment Data. Nearly all students in the state take the Smarter Balanced Assessment test each year. Test scores are reported in detail by OSPI, at the individual school level and by ethnicity and gender.
Limitations: The data is quite complete, but some students can receive exemptions from testing and some can take an alternative test. These numbers are low and likely consistent across schools and districts, so should not make it difficult to compare results across areas or years.
Office of the Superintendent of Public Instruction Enrollment Data. In October of each year schools report a count of all students in each school by grade and a variety of demographics.
Limitations: Enrollments can change over a year, and some areas have much higher turnover than others. The student population in June may be quite different from that in October. There are different race classifications for state reporting and federal reporting. Indexer generally uses the federal ethnicity reporting categories.
National Transit Database Agency Profiles. The Federal Transit Administration collects common data from all transit systems in the country that receive federal aid (which is nearly all agencies) and reports that data in one-page agency profiles. Data includes ridership, budgets, service levels and capital stock, with these data combined into measures of service effectiveness.
Limitations: Smaller transit systems do not report some data. Some agencies break out bus service by commuter (long distance) and local, and cost allocations may not be transparent.
Census Bureau On The Map, Longitudinal Employer-Household Dynamics (LEHD) program. On The Map is a part of the Census Bureau’s LEHD program that tracks the movements of individuals through various parts of their work life through administrative data. On The Map provides estimations of commuting patterns by matching individual workers’ home addresses on their federal tax returns with their work address as submitted by their employer on unemployment insurance (UI) filings. While this match does not translate directly to an actual commute (for example, the person may work from home) it does provide useful estimates.
Limitations: On The Map, like other programs that use UI filings, suffers from a data problem. While employers are required to file UI reports based on the actual work site of employees, many skip this step and simply report all employees with a headquarters address. This practice can be seen in data that report large numbers of implausible commutes. Census does not have an estimate of the size of this error. It will tend to affect smaller communities more than larger ones.
Internal Revenue Service (IRS) migration data. The IRS tracks address changes on federal tax returns and reports them by county pairs. For each county pair, the IRS reports the number of returns that had address changes, the number of exemptions on those returns, and the total income reported by those returns. The county moves are aggregated into groups that cover moves within a county, between counties in the same state, from outside the state and from abroad.
Limitations. Overall, the IRS migration data is considered very good administrative data. Nearly all U.S. citizens and permanent residents appear on a tax return, so the coverage is very high. There are, however, some important limitations.
First, the data only cover people who have filed a tax return the previous year. This will exclude non-citizens who have just moved to the U.S. from other countries (the “overseas” counts in the IRS migration data cover U.S. citizens living abroad, such as retirees or military personnel stationed overseas). It will also exclude people who fail to file tax returns for a variety of reasons.
Second, some filers may use addresses other than their current address of residence, and there is no penalty for this. Young people away at college may use a parent’s address. Military personnel may also use a parent’s address. With electronic filing and automatic withdrawal and deposit of tax payments, there is no compelling need to use a current address.
Third, there are long lags in release of the data. The IRS waits almost a year to ensure that it has collected all possible tax returns from the prior year. Then it takes another year to release the data. So, for example, address changes on tax returns filed during 2018 for the 2017 tax year were released at the end of 2019.
Fourth, the IRS suppresses data where there are fewer than 10 returns in a county pair. This is not a problem for in-state migration, but can be a problem when trying to understand migration from the more rural areas of nearby states. The IRS provides state-to-state summaries, but does not provide state-to-county summaries. Because of data suppression, some counties in a state may be missing, so the published county figures may not add up to the state total.
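The effect of suppression on a state-to-county total can be sketched as follows. The county names and counts are hypothetical, and the layout (a mapping with None marking a suppressed pair) is an assumption made for illustration rather than the IRS file format.

```python
SUPPRESSION_THRESHOLD = 10  # pairs with fewer returns than this are withheld


def summarize_inflows(county_returns):
    """Sum published county-pair returns and bound what suppression hides.

    county_returns maps origin county -> returns, with None for suppressed pairs.
    """
    published = sum(v for v in county_returns.values() if v is not None)
    suppressed = sum(1 for v in county_returns.values() if v is None)
    max_hidden = suppressed * (SUPPRESSION_THRESHOLD - 1)
    return published, suppressed, max_hidden


# Hypothetical inflows to one Washington county from counties in a nearby state.
inflows = {"County A": 120, "County B": 45, "County C": None, "County D": None}
print(summarize_inflows(inflows))  # (165, 2, 18)
```

Here the published pairs sum to 165 returns, and the two suppressed pairs could hide up to 18 more. For a state with many small rural counties, the hidden share can be a meaningful part of the total.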
Washington State Department of Licensing (DOL) drivers reports. DOL issues monthly reports on the number of people who apply for a Washington State driver’s license from out-of-state. The reports provide state-to-county figures, as well as data from other countries. This data provides an overview of in-migration to all parts of the state. Reports are issued within a month of the closing date, so are very timely.
Limitations. This data is quite accurate as far as it goes. But it does not count people who did not have a driver’s license from another state, so excludes youth, non-drivers and foreign immigrants (the overseas data is for U.S. citizens who have been living abroad). The inclusion of military personnel will be uneven, as uniformed personnel and their families stationed in Washington can continue to use their home-state license and are not required to obtain a Washington license.
The other limitation is that it does not cover out-migration. There is no accurate count of people who trade in a Washington driver’s license in another state. So while the data is very helpful in understanding where new residents come from, and giving a general sense of the pace of migration, it does not tell us anything about net migration flows.
Environmental Protection Agency (EPA) Air Quality Index. The EPA collects data on a wide range of air pollutants and publishes this data by metropolitan area. Several data points are combined into a single Air Quality Index that indicates the overall level of toxic air pollution. Index levels are grouped into five categories, ranging from “Good” to “Very Unhealthy.” The number of days each year that meet the criteria for each category is reported.
Limitations. The measurements themselves are quite accurate, but trends may not indicate any issues that can be addressed locally. Weather patterns have a large impact on air quality, as atmospheric conditions will determine the degree to which pollution is dissipated or stays around. Recent years have seen an increase in unhealthy air due to wildfires that are often over 100 miles away. The number and intensity of fires is also weather dependent. It is best to look at air quality data through multi-year averaging to get a clearer picture of trends in locally-generated pollution.
Bureau of Economic Analysis (BEA) employment and income data. BEA provides a wide range of economic data at the state, county and metropolitan area level. BEA gets its basic data from other government agencies and creates unique datasets that illustrate important aspects of local economies. The Indexer uses BEA data for much of its reporting on county-level economics.
Limitations. BEA data is very high quality. By using several data sources it diminishes limitations inherent in each source. Each BEA dataset has notations on methodology which will discuss limitations of that particular data.
Bureau of Labor Statistics (BLS) employment data. BLS collects employment data through surveys and through administrative sources such as unemployment insurance filings. Data is provided at the state, county and metro area level. Data is available on a quarterly or monthly basis, and much of the data can be seasonally adjusted.
Limitations. BLS surveys are large, but even so, they will be subject to some sampling error. Unemployment insurance data, while very detailed with respect to industries, does not capture the self-employed and is subject to mis-reporting of job locations.
Bureau of Labor Statistics (BLS) price data. BLS provides the most commonly used measures of prices and inflation. Inflation indices are calibrated for different geographies and baskets of goods. The Indexer uses the BLS “All urban consumers” index to adjust data for inflation.
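An inflation adjustment of this kind amounts to scaling a dollar figure by the ratio of two index values. The sketch below uses placeholder index values, not actual published CPI figures, and the function name is hypothetical.

```python
def adjust_for_inflation(amount, index_then, index_target):
    """Restate a dollar amount in the prices of the target year."""
    return amount * index_target / index_then


# Placeholder index values, NOT actual published CPI figures.
cpi = {2010: 200.0, 2018: 250.0}
print(adjust_for_inflation(50_000, cpi[2010], cpi[2018]))  # 62500.0
```

With these placeholder values, $50,000 in 2010 is equivalent to $62,500 in 2018 dollars. Real comparisons would substitute the published index values for the years in question.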
Limitations. Price measures are always based on baskets of goods and services across many geographic areas, so can only approximate the actual inflation experienced by consumers in any one place. Traded goods tend to have stable prices across the country, but untraded goods, like housing and utilities, can vary widely and are captured imperfectly in inflation measures.
Washington State Department of Employment Security (DES) employment data. DES is part of the national network of state agencies that coordinate employment data collection with the U.S. Bureau of Labor Statistics (BLS). Data will be similar, using a combination of survey and UI tax return information. DES reconciles its data more often than the BLS, and will have more up-to-date data in many cases. DES also issues “covered employment” reports that only include employees covered by UI. These reports go into significant industry detail, using six-digit NAICS codes at the state level and three-digit codes at the county level.
Limitations. DES notes that its data “excludes proprietors, self-employed, members of armed forces, and private household employees.” Self-employment levels vary across the state, and many people have both employer income (subject to UI) and self-employed income. The BEA, which reports both, shows that about 20 percent of all jobs in Washington State are classified as self-employment, but not all of those will be the sole source of income for an individual.
Unionstats.com, union membership and coverage. This privately produced database is maintained by Barry Hirsch at Georgia State University and David Macpherson at Trinity University. It is based on data from the Current Population Survey (CPS), which is undertaken jointly by the Census Bureau and the Bureau of Labor Statistics (BLS).
Limitations. The CPS is a very large and sophisticated survey, so error rates can be expected to be quite low. Sampling at the metro area level should be adequate to identify trends in union membership and coverage.
Washington State Department of Health (DOH) death statistics. DOH publishes detailed data on locations and causes of death in the state. This includes details on death by various natural causes as well as death by accidents. Data is provided at the county level.
Limitations. The quality of the data is determined by the accuracy of the cause of death coding on death certificates. In many cases a decedent will have multiple causes of death (e.g. substance abuse leading to an accident) and only one will typically be reported. The data is accurate in reporting what appears on death certificates.
Defense Manpower Data Center (DMDC) military personnel. The DMDC publishes monthly reports on the uniformed, National Guard, reserve and civilian personnel in each state for the Army, Navy, Air Force, Marine Corps and Coast Guard. Data also includes personnel in overseas installations.
Limitations. The data are highly accurate and up to date. Data is not provided for individual military installations.