School of Geography

Synthesising the UK Population at individual level from the 2011 Census

Supervisors: Dr Alison Heppenstall and Prof John Stillwell

Due to issues of confidentiality, individual level data for the UK are rarely available. Access to data at the individual level would be an invaluable resource for both researchers and practitioners, allowing greater realism in predictions and better insights into population dynamics and applied modelling. Most population census data in the UK are released at an aggregate level, with only a relatively small sample of individuals available from different microdata sets (Samples of Anonymised Records, Longitudinal Study). Researchers have used various methods (for example aggregate models, microsimulation and statistical approaches) to create individual level data from these sources. However, simulating large populations is computationally intensive and is generally based on only 6 to 8 characteristics (from the census or alternative survey/data sources). Many current approaches are also limited because they create populations that have no household relationships or sense of geography within them.

The next stage is to develop a methodology that creates a realistic population, places individuals into households (preserving relationships) and situates them accurately in a geographical area.

Aim of project

This research aims to create an individual level population for the 2011 census year. It is important to note that the outputs from this project are not intended to be a replacement for Census information, or for the SARs, but a dataset that researchers and practitioners alike can use to gain insights into population dynamics and applied modelling where the actual census information is not complete or appropriate.
The aims of the project can be broken down into the following:

  1. The development of a methodology for building (bottom up) a realistic synthetic population that will be located in houses and in different spatial units
  2. Creation of an individual level population for 2011 for publication to the wider academic community
  3. Use of a series of validation tests to assess how realistic and accurate the data set is
  4. To perform analysis on the 2011 synthesised data sets through specific case-studies

Knowledge of the Java programming language and databases would be an advantage for applicants interested in this PhD.

Proposed Methodology

  1. Creation of the population
    The first run of the algorithm creates a population of individuals that is sampled from the SAR and is representative of the aggregate census information for a given Output Area / Enumeration District.
  2. Validation: what is a good fit? How do we measure this?
    There are several methods available for assessing how well a synthetic population “matches” the real population (for example, Z2-score, Total Absolute Error (TAE) and Classification Error (CE)). However, the error in each synthetic population produced can be assessed at different levels. Which do we choose? At the most abstract level, attributes can be used to assess the model performance or the contribution of each constraint attribute to the model output. At a more detailed level, individual attributes can be explored zone by zone, providing insights into spatial variations within the model. Do we explore the categories within individual attributes to assess whether one particular category is especially difficult to fit? Is it best to use one validation method, or several in combination?
  3. Allocation of individuals into households
    The second run of the algorithm takes the individuals created and allocates them into households that accurately represent the household population observed in the Census information for the given geographical zone.
  4. Incorporation of geodemographic and lifestyle information
    The final stage of the process is to allocate the households to finer geographical areas than the census zones for which the population has been created. The proposed process is to allocate to each household the geographical coordinates of a suitable postcode centroid. The appropriate postcode will be located using a combination of geodemographic data sources and commercial databases. Data from the large sample of individual households captured by the Acxiom Research Opinion Poll will be used to generate additional lifestyle characteristics relating to household income and expenditure.
  5. Aggregation of populations to different geographies
    Synthetic households with fine geographical locations and resident synthetic individuals with realistic relationships to each other (married couples, parents and children, and single parent families) could be aggregated into different boundaries to produce population estimates across different years or different administrative areas. Alternatively, these synthetic populations can be used to populate dynamic microsimulation models of population progression and emerging agent-based models in applied research areas such as health, town planning, transport planning or education.
  6. Distributing the resulting populations
    Once the synthetic population representing 2011 has been built and validated, the data will be made available to the research community to download. It is anticipated that this will be through the use of a dedicated server based at Leeds. The original 2011 synthesised data will remain at Leeds. Aside from the academic community, likely users are Education Leeds, Safer Leeds and the NHS.
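
As an illustration of steps 1 and 2 above, the following is a minimal Python sketch of how individuals might be drawn from a sample and then swapped to reduce the Total Absolute Error (TAE) against aggregate zone counts. It is not the project's algorithm: the hill-climbing replacement strategy, the function names (tabulate, total_absolute_error, population_error, synthesise_zone), the attribute categories and the zone counts are all assumptions made for illustration only.

```python
import random
from collections import Counter


def tabulate(individuals, attribute):
    """Count individuals in each category of one attribute."""
    return Counter(person[attribute] for person in individuals)


def total_absolute_error(synthetic, observed):
    """TAE: sum over categories of |synthetic count - observed count|."""
    categories = set(synthetic) | set(observed)
    return sum(abs(synthetic.get(c, 0) - observed.get(c, 0)) for c in categories)


def population_error(individuals, constraints):
    """Sum of TAE over every constraint table (attribute -> observed counts)."""
    return sum(
        total_absolute_error(tabulate(individuals, attribute), observed)
        for attribute, observed in constraints.items()
    )


def synthesise_zone(sample, constraints, size, iterations=2000, seed=0):
    """Hill-climbing sketch (assumed strategy): start from a random draw,
    then keep a random replacement only if it does not increase the error."""
    rng = random.Random(seed)
    population = [rng.choice(sample) for _ in range(size)]
    error = population_error(population, constraints)
    for _ in range(iterations):
        if error == 0:
            break
        slot = rng.randrange(size)
        previous = population[slot]
        population[slot] = rng.choice(sample)
        new_error = population_error(population, constraints)
        if new_error <= error:
            error = new_error            # keep the swap
        else:
            population[slot] = previous  # undo the swap
    return population, error


# Hypothetical stand-in for a sample of anonymised records.
sample = [
    {"age": "0-15", "sex": "F"}, {"age": "0-15", "sex": "M"},
    {"age": "16-64", "sex": "F"}, {"age": "16-64", "sex": "M"},
    {"age": "65+", "sex": "F"}, {"age": "65+", "sex": "M"},
]

# Hypothetical aggregate counts for one zone (invented numbers).
constraints = {
    "age": {"0-15": 20, "16-64": 60, "65+": 20},
    "sex": {"F": 52, "M": 48},
}

population, error = synthesise_zone(sample, constraints, size=100)
print("Remaining TAE across both constraint tables:", error)
```

Here the error is summed over whole constraint tables, which corresponds to the most abstract level of assessment discussed in step 2. Calling total_absolute_error on a single attribute, zone by zone, would give the more detailed view.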

Appropriate data sources

The research within this PhD will utilise the following data sets:

  • SARs individual datasets (SAMs files and household SARs)
  • Census information for 2011
  • Access to appropriate postcode directories (2011)
  • Access to the Ordnance Survey (OS) Address-Point database, possibly for 2011
  • Access to geodemographic databases (Experian) for 2011
  • Access to Acxiom ROP data

Case-studies

Once the population has been built and validated, case-studies focusing on spatio-temporal changes over past census years will be performed. This could be, for example, spatio-temporal changes in ethnicity (see Norman, P., 2008; Sabater, A. and Simpson, L., 2009).

Funding

For information on funding opportunities click here

Enquiries

For project related enquiries please contact the supervisors.
For application enquiries please contact Jacqui Manton