Disaggregative Spatial Interpolation

Andy Turner and Stan Openshaw

Centre for Computational Geography
www.ccg.leeds.ac.uk

Paper presented at the GISRUK 2001 conference, Glamorgan, Wales, April, 2001.

Abstract

One type of spatial interpolation is the process of transforming values of a spatial variable (for a given region) from one set of sub-regions, called source regions, into values for a different set of sub-regions, called target regions. This paper focuses on computer-based methods designed to perform this type of process. In general, the smaller the target regions are compared with the source regions, the harder spatial interpolation is. In 1998 an evaluation of spatial interpolation methods (SIMs) was undertaken for a European Commission funded project, called MEDALUS III, which was concerned with various aspects of Mediterranean land degradation and land use change. One type of SIM was developed for this project so as to cope better where interpolation involved a large change in scale. For various reasons it has taken until now to publish this work, but hopefully doing so will still be useful. The paper introduces various types of SIMs, outlines a selection of SIMs that interpolate values of a spatial variable from source to target regions, and details some developments that can be useful where target regions are much smaller than source regions. These developments are based on the Smart SIM, which involves using various auxiliary information to guide the interpolation. The use of neural networks to represent the relationships between generated predictor variables and the variable being interpolated, and more sophisticated preprocessing to generate the predictors, are described. Various applications of the Smart SIM can be criticised because the relationships between predictors were subjectively evaluated and the GIS preprocessing involved was simplistic. The benefits of developing the Smart SIM are illustrated with a population density example.

Key words

interpolation, disaggregation, neural networks

1. Introduction

A major difficulty concerns how to transform the values of a spatial variable for one set of regions, called source regions, into values of the same variable for another set of regions, called target regions. This is a type of spatial interpolation. If the specified target regions have a much higher general level of spatial resolution than the source regions, the problem is distinctly difficult. This special case, which is the subject of this paper, is referred to as disaggregative spatial interpolation (DSI).

Spatial interpolation is a type of spatial modelling. All existing spatial interpolation methods (SIMs) transform values of a specific spatial variable related to source points, lines or regions (maybe using auxiliary spatial data and autocorrelation assumptions) into values of the same spatial variable for different target points, lines or regions. Some SIMs transform point values into density surfaces, such as those described in Bracken (1994) and Martin and Bracken (1991). It would seem the first step in this is to assign data for zones or area type regions to points that are their centroids. The next step involves passing a kernel across the surface in a uniform manner to generate an output density surface. There are various SIMs that deal with transforming data for lines into data for regular gridded surfaces. For example, consider the methods available in many GIS for transforming a set of contours to grids containing information about slope, aspect or level. Lam (1983) provides a dated review of SIMs available at the time. Flowerdew and Openshaw (1987) review the problems of transferring data from one set of areal units to another incompatible set. Flowerdew and Green (1991) outline some statistical methods for transferring data between zonal systems. More recently Deichmann (1996) provides some useful information about the benefits and problems of various modelling approaches to generating socio-economic data, and Hunting Technical Services (1998) provides an overview and classification of SIMs. As always, apologies for incompleteness; if more recent or better reviews of spatial interpolation and SIMs are available it would be good to consider and integrate this information.

The MEDALUS III project addressed issues of land use change and land degradation in the Mediterranean climate region of the European Union, see Medalus (URL) for details. The research on this project undertaken at Leeds Centre for Computational Geography involved developing a land use and land degradation modelling system that integrated available environmental data at a spatial resolution of 1 decimal-minute, see Openshaw and Turner (1998, 1999, URL) for details. For modelling purposes a common spatial framework for all the available data was chosen and the process of manipulating data into this framework was a key step. The chosen spatial framework was a coverage of grid cells arranged in terms of latitude and longitude, in a so-called geographical projection, with an origin to the south west of the Iberian peninsula. Available data related to the physical environment was easily transformed into the chosen 1 decimal-minute spatial framework in a satisfactory manner using GIS, as most of the source data was at a similar spatial resolution and was in a similar gridded format. A big problem was that the highest resolution demographic and socio-economic data were only available for Nuts3 regions. Nuts3 regions in the Mediterranean cover many hundreds, and in some cases thousands, of 1 decimal-minute cells in the chosen spatial framework. Nuts3 regions are 2D irregular polygons mapped onto the surface of the European Union that vary considerably in size and are subject to boundary changes over time. A SIM was needed so as to make disaggregate estimates at a 1 decimal-minute resolution of the desirable demographic and socio-economic variables using source Nuts3 data as constraints. The first step was to review existing SIMs with respect to DSI.

Section 2 outlines three SIMs that have been used to interpolate population density. Section 3 considers data issues. Section 4 describes how neural networks and more sophisticated preprocessing techniques were used to enhance the smart SIM outlined in Section 2. Section 5 concludes and provides some ideas for further research.

2. Spatial Interpolation Methods

This section outlines three SIMs for transforming spatial variable values from one set of source regions into another set of target regions. The first is the most simplistic, the second is a clever extension of this, and the last is the smart one.

2.1 Areal Weighting

Areal weighting involves proportionally distributing source region values based on the area of overlap between each source region and each target region. The method is summarised by the following algorithm:

Step 1 Calculate the area of source regions.

Step 2 Divide source region values of the spatial variable to be interpolated by the area of source regions. (This generates a measure of density.)

Step 3 Intersect the source and target region areas and calculate the area of each intersection.

Step 4 Multiply intersected region areas calculated in Step 3 by the density of the spatial variable calculated in Step 2.

Step 5 Sum the products from Step 4 for each target region.
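The five steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the project: it assumes, purely for brevity, that every region is an axis-aligned rectangle given as (xmin, ymin, xmax, ymax), and the function names are hypothetical.

```python
# Areal weighting sketch. Assumption: regions are axis-aligned rectangles
# (xmin, ymin, xmax, ymax); real regions would need a polygon overlay.

def area(rect):
    """Area of a rectangle; zero if the rectangle is empty."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def intersection(a, b):
    """Overlap of two rectangles (may be empty, in which case area() is 0)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def areal_weighting(sources, source_values, targets):
    # Steps 1 and 2: density of the variable in each source region.
    densities = [value / area(src) for src, value in zip(sources, source_values)]
    estimates = []
    for tgt in targets:
        total = 0.0
        for src, density in zip(sources, densities):
            # Step 3: area of overlap; Step 4: multiply by source density.
            total += area(intersection(src, tgt)) * density
        # Step 5: the sum of the products is the target estimate.
        estimates.append(total)
    return estimates

sources = [(0, 0, 2, 2), (2, 0, 4, 2)]
targets = [(0, 0, 1, 2), (1, 0, 3, 2), (3, 0, 4, 2)]
print(areal_weighting(sources, [100.0, 20.0], targets))  # [50.0, 60.0, 10.0]
```

The middle target straddles both source regions and so receives a share from each; the estimates still sum to the combined source total of 120.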

Errors associated with areal weighting are higher the more clustered the interpolated variable is and the smaller target regions are relative to source regions. Like all the SIMs outlined in this paper, it is not too hard to see how areal weighting can be extended to 3D regions. The Pycnophilactic SIM described next modifies areal weighted interpolation estimates by making neighbouring target values more similar.

2.2 The Pycnophilactic SIM

Tobler (1979) proposed a pycnophilactic SIM, which is an extension of simple areal weighting. It calculates target region values based on the values of, and weighted distance from the centre of, neighbouring source regions, keeping the mass consistent within source regions using the following iterative approximation routine:

Step 1 Intersect a dense grid over the study region.

Step 2 Assign each grid cell a value using areal weighting as described above.

Step 3 Smooth the values of all the cells by replacing each cell value with the average of its neighbours.

Step 4 Calculate the value in each source region.

Step 5 Weight the values of target cells in each source region equally so that source region values are consistent.

Step 6 Repeat Steps 3 to 5 until there are no further changes to a prespecified tolerance.
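The routine above can be sketched as follows. This is a simplified illustration and not Tobler's published code: it assumes equal-sized grid cells, a hypothetical `zone_of` grid that names the source region of each cell, the four edge-sharing cells as the neighbourhood in Step 3, and a single multiplicative factor per source region in Step 5. An iteration cap is added because this plain neighbour average is not guaranteed to settle within the tolerance.

```python
def pycnophylactic(zone_of, totals, max_iterations=100, tolerance=1e-6):
    """zone_of: grid (list of rows) of source region ids; totals: id -> value."""
    rows, cols = len(zone_of), len(zone_of[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    counts = {}
    for r, c in cells:
        counts[zone_of[r][c]] = counts.get(zone_of[r][c], 0) + 1
    # Steps 1 and 2: with equal-sized cells, areal weighting shares each
    # source total equally among the cells of that source region.
    grid = [[totals[zone_of[r][c]] / counts[zone_of[r][c]] for c in range(cols)]
            for r in range(rows)]
    for _ in range(max_iterations):
        # Step 3: replace each cell by the average of its neighbours.
        new = [[0.0] * cols for _ in range(rows)]
        for r, c in cells:
            nbrs = [grid[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols]
            new[r][c] = sum(nbrs) / len(nbrs)
        # Step 4: total of the smoothed values in each source region.
        sums = dict.fromkeys(counts, 0.0)
        for r, c in cells:
            sums[zone_of[r][c]] += new[r][c]
        # Step 5: rescale so each source region keeps its original total.
        for r, c in cells:
            z = zone_of[r][c]
            if sums[z] > 0:
                new[r][c] *= totals[z] / sums[z]
            else:
                new[r][c] = totals[z] / counts[z]
        # Step 6: stop when no cell changes by more than the tolerance.
        change = max(abs(new[r][c] - grid[r][c]) for r, c in cells)
        grid = new
        if change < tolerance:
            break
    return grid

zone_of = [["A", "A", "B", "B"] for _ in range(4)]
surface = pycnophylactic(zone_of, {"A": 160.0, "B": 16.0})
```

Whatever the number of iterations, the returned surface always comes straight from Step 5, so the cell values within each source region sum to the source total.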

The NCGIA Global Demography Project used this SIM and the Smart SIM described next to create global population density surfaces at a 5 decimal-minute resolution; see Tobler et al. (1995) and CIESIN (URL) for details. The interpolated surface produced by the Pycnophilactic SIM is smooth with relatively small changes in attribute values at target region boundaries. The sum or mass of combined target attribute values within each source region is kept consistent at Step 5; this is what lends the SIM its name.

The underlying assumption is that the values of a spatial variable in neighbouring target regions tend to be similar. When applying this SIM to interpolate population for only small reductions in scale, the underlying assumption does not at first seem unreasonable, as population density tends to be spatially autocorrelated. However, for DSI in general the underlying assumption is not that reasonable. As it is, using the above algorithm for DSI is little better in a statistical sense than what could be produced simply using areal weighting, as is shown later in a way by the figures in Table 2. Consider a source region with a very high population and population density compared with all nearby regions. Applying the algorithm for DSI in this case will gravitate population in the comparatively low population density neighbouring source regions to the target regions near its boundary, which is not generally likely to be a close representation of reality. The method can work but there is no real mechanism for testing alternative hypotheses.

On the positive side, a useful feature of the method is that target region totals are rescaled at Step 5 so that they remain consistent within source regions. The algorithm is also modifiable and is a generally useful one. For example Step 3 can be modified so as to use other statistics than the mean, such as the weighted mean as suggested above, and the region can be extended beyond the immediate neighbourhood.

Population maps generated using the Pycnophilactic SIM look good because the resulting surface is smooth, provided that the tolerance allows for sufficient iterations of the algorithm, but is this really what is wanted? Unlike the smart SIM described next, the Pycnophilactic SIM provides no mechanism for using other available geographical information to guide the interpolation.

2.3 The Smart SIM

Deichmann and Eklundh (1991) describe a Smart SIM for interpolating census population data using available digital map information regarding the location and size of urban settlements and other physical features related to population density. The smart SIM is also described in Willmott and Matsuura (1995) for interpolating temperature and precipitation using elevation and exposure data. Sweitzer and Langaas (1994) also describe its use in creating a population density map of the Baltic drainage basin at a 1 km resolution. The SIM is smart because it uses available spatial information to guide the interpolation. It manipulates this information into a weighting surface coincident with target regions to transform source region values into target region values. The way this has been reported can be criticised for being simplistic and overly reliant on subjective assessments of the strengths and nature of the relationships between all the predictor spatial variables and the variable of interest. Nonetheless, the use of available auxiliary information is a generic modelling concept.
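The role of the weighting surface can be illustrated with a short sketch. It is not taken from any of the cited reports: the function name, the flat list layout of target cells and the fallback for source regions whose weights are all zero are assumptions made for the example. Each source value is shared among the target cells of that source region in proportion to the cell weights.

```python
def allocate_by_weight(zone_of, totals, weights):
    """zone_of[i]: source region of target cell i; weights[i]: its weight."""
    weight_sums, counts = {}, {}
    for zone, weight in zip(zone_of, weights):
        weight_sums[zone] = weight_sums.get(zone, 0.0) + weight
        counts[zone] = counts.get(zone, 0) + 1
    estimates = []
    for zone, weight in zip(zone_of, weights):
        if weight_sums[zone] > 0:
            # Share of the source total in proportion to the cell weight.
            estimates.append(totals[zone] * weight / weight_sums[zone])
        else:
            # Assumed fallback: no guidance available, so share equally.
            estimates.append(totals[zone] / counts[zone])
    return estimates

zone_of = ["A", "A", "A", "B", "B"]
weights = [3.0, 1.0, 0.0, 1.0, 1.0]   # e.g. derived from urban area data
print(allocate_by_weight(zone_of, {"A": 100.0, "B": 40.0}, weights))
# [75.0, 25.0, 0.0, 20.0, 20.0]
```

How the weights are derived from the auxiliary data is the subjective part of the method; the allocation itself guarantees that target estimates sum to the source values.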

3 Data Issues

The availability of data is a major issue. The Smart SIM described above and the Smarter SIMs described in Section 4 require available data that can be used to guide the DSI of a spatial variable. If there is no such auxiliary data then DSI can only be approached theoretically. In doing this the hope might be that sufficient data may become available to test and improve the estimates at a later date.

Data quality is another major issue. All data is abstracted and generalised in some way. Often little information is readily available regarding the nature and scale of these abstractions and the quality of these generalisations. There are often spatial variations in the quality of data, partly owing to the fact that they have been collected and generalised by different organisations for different purposes. Nonetheless it is always appropriate to assess the quality and applicability of data for any intended purpose. Of course this is hard if, as is often the case, there is little or no information about how the data was collected and compiled. Map scales provide some indication of what level of detail to expect in digital map data. Map scale has implications for what the data can be used for, but alone it does not help much in estimating the uncertainties in subsequent models that use these data.

If analysing data for large geographical areas it is important to consider the effects caused by the curvature of the earth. Different projection systems result in different amounts of distance, area and direction distortion. This can become significant and so it is important to choose a projection system wisely and be aware of the effects on uncertainty of data transformations.

Data issues are important and need to be kept in mind. This in itself is the subject of much work and much more could have been written about it here and in the following sections. It is left to the reader to keep data issues in mind and think about the consequences of only having limited available data. Many of the data issues relevant to spatial interpolation are covered in more detail in Martin (1996).

4 Developing SIMs for DSI

During MEDALUS III the Smart SIM was enhanced by using more sophisticated preprocessing of spatial predictor variables and by applying neural networks (NNs) to map the input predictors to the variable of interest. NNs are universal approximators capable of learning to represent virtually any function no matter how complex or discontinuous. NNs can learn to cope with noisy data and represent complex non-linear geographical data patterns. Using the technology requires decisions about what architectures and training schemes to employ. A related problem is that training can be computationally expensive. NNs are also criticised because it is difficult to understand what internal parameters mean in terms of mapping inputs onto outputs. This is why the technology has been described as black box and is somewhat under-utilised given its capability. Some work has been done to alleviate this and as a start readers are referred to Bremner et al. (1994). The next section describes the first phase of development, a so-called Smarter SIM, and Section 4.2 describes the second phase of development, a so-called Clever SIM.

4.1 Smarter SIM

This was developed for interpolating Nuts3 population data to make estimates of population for 1 decimal-minute grid cells in Great Britain. This involved creating a set of indicators for the chosen spatial framework, selecting a training data set, then training NNs to represent patterns between indicators and population density using some target data based on Surpop. Surpop is a 0.2 km2 grid cell population data surface of Great Britain that is generated from census data and described on the Internet, see Surpop (URL).

There are several different types of NN and a great many different ways to train them to recognise complex patterns which map sets of inputs onto sets of outputs. The best training scheme to employ depends as much on the nature (configuration, structure and other properties) of the network as it does on the pattern recognition task itself. Four types of NNs commonly used in research are: multilayer perceptrons, radial basis function networks, learning vector quantisation networks, and self organising map or Kohonen networks. Probably the simplest and easiest to understand are back propagating feedforward multilayer perceptrons. These feed inputs in at one end and process in one direction from layer to layer to produce an output at the other end, and were the type used for the work reported in this paper.

A sigmoidal function was used to calculate each neuron output and each network configuration was initialised using a genetic optimiser for a specified number of hidden layers, each with a specified number of neurons interconnected from layer to layer to map the inputs onto a single output. The genetic optimisation involved randomly assigning values to the weights and thresholds (parameters) of the network a predefined number of times. The various sets of parameters were then encoded as bit strings (a concatenated binary representation of the NN parameter values). The performance of the NN model was measured for each parameter set by passing training data through the network and calculating the sum of squared errors between the network output and the target value. A number of the best performing parameter sets were then selected to be parents and their bit string representations were bred using the genetic operations of crossover, inversion and mutation to produce a number of children. The genetic optimisation process of evaluating, selecting and breeding was repeated a predefined number of times. When this completed, the best set of weights was used to initialise the NN for further training using a standard conjugate non-linear optimisation method. This involves iteratively reducing the difference between observed and expected values by adjusting the parameters of the network (weights, threshold values and those of the sigmoidal function used to generate neuron outputs) by a small amount, working backwards from the output layer towards the input layer. Trained NNs were applied to estimate the population of the remaining grid cells and the various different predictions were mapped and analysed. The following steps summarise the process of creating a result:

Step 1 A look up table was constructed so that each 1 decimal-minute grid cell in the chosen spatial framework was referenced in terms of which Nuts3 region contained it.

Step 2 Target population estimates for the target cells in the chosen spatial framework were then generated by transforming and aggregating Surpop data.

Step 3 Available digital map data was then processed into location, density and distance grid layers. These were variables that could be used as a basis for estimating population density, in other words, they were population indicators. Location grids were effectively binary distance or density grids and a cell had values of 1 or 0 depending upon whether it contained a geographical feature or had a specific spatial data value. Distance grids had cells with values related to the distance to the closest of a set of selected geographical features or set of spatial variable values. Density grids had cells with values related to the density of a set of selected geographical features or set of spatial variable values.

Step 4 Involved selecting a training data set.

Step 5 Involved training NNs to represent the relationship between the indicators and population.

Step 6 Involved applying the NNs to estimate population density for each 1 decimal-minute grid cell in the UK. The estimates were constrained so that total populations within each Nuts3 source region matched the observed totals. The results were then mapped.
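The genetic initialisation that precedes the training in Step 5 can be sketched as below. This is an illustrative reconstruction from the description given before the steps, not the code used in the project. The 8-bit encoding, the parameter range of -4 to 4, the inversion and mutation rates, the single hidden layer and all function names are assumptions. Parents are carried over unchanged to the next generation, which is also an assumption.

```python
import math
import random

BITS = 8                 # bits per parameter (assumed)
LOW, HIGH = -4.0, 4.0    # parameter range covered by the encoding (assumed)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(params, inputs, n_hidden):
    """One hidden layer, one output; params is a flat list of weights and thresholds."""
    n_in, pos, hidden = len(inputs), 0, []
    for _ in range(n_hidden):
        weights, threshold = params[pos:pos + n_in], params[pos + n_in]
        pos += n_in + 1
        hidden.append(sigmoid(sum(w * x for w, x in zip(weights, inputs)) - threshold))
    weights, threshold = params[pos:pos + n_hidden], params[pos + n_hidden]
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) - threshold)

def encode(params):
    """Concatenate a fixed-width binary code for each parameter."""
    levels = 2 ** BITS - 1
    codes = [round((min(HIGH, max(LOW, p)) - LOW) / (HIGH - LOW) * levels) for p in params]
    return "".join(f"{code:0{BITS}b}" for code in codes)

def decode(bits):
    levels = 2 ** BITS - 1
    return [LOW + int(bits[i:i + BITS], 2) / levels * (HIGH - LOW)
            for i in range(0, len(bits), BITS)]

def sse(bits, samples, n_hidden):
    """Sum of squared errors of the decoded network over (inputs, target) samples."""
    params = decode(bits)
    return sum((forward(params, x, n_hidden) - y) ** 2 for x, y in samples)

def crossover(a, b, point):
    return a[:point] + b[point:]

def inversion(bits, start, end):
    return bits[:start] + bits[start:end][::-1] + bits[end:]

def mutation(bits, rate, rng):
    return "".join(("1" if b == "0" else "0") if rng.random() < rate else b for b in bits)

def evolve(population, fitness, generations, n_parents, rng):
    """Evaluate, select and breed; lower fitness (error) is better."""
    for _ in range(generations):
        parents = sorted(population, key=fitness)[:n_parents]
        children = []
        while len(children) < len(population) - n_parents:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng.randrange(1, len(a)))
            if rng.random() < 0.1:                      # assumed inversion rate
                i, j = sorted(rng.sample(range(len(child) + 1), 2))
                child = inversion(child, i, j)
            children.append(mutation(child, 0.01, rng))  # assumed mutation rate
        population = parents + children
    return min(population, key=fitness)
```

The best bit string returned by `evolve` would be decoded and used as the starting point for the subsequent gradient-based training.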

The grander experiment involved analysing the results and repeating Steps 4-6 a large number of times using different NN configurations, different inputs and different sampling strategies. After substantial experimentation it was thought that the best population surface generated used the predictor variables listed in Table 1, and had a single hidden layer with 30 neurons.

Table 1. A list of predictor variables and constraints

Location of parks

Distance to main road

Density of Motorway and main road

Density of main and minor road

Distance to a river

Location of urban areas

Distance to small urban areas

Distance to medium sized urban areas

Distance to large and extra large urban areas

Density of urban areas

Height from a Digital Elevation Model

The Pycnophilactic SIM population density estimate

The RIVM Smart SIM population estimate

Areal Weighted Nuts3 source zone population

The main deficiencies at this stage were thought to result from the GIS preprocessing being overly simplistic. Experimentation was long winded, and generalisation properties and uncertainties were not looked at in detail. In hindsight it may have been more useful to have spent time testing the uncertainties using Monte Carlo testing as outlined in Fisher and Langford (1995). Nonetheless a great deal was learned about using and developing NN technology. The use of NNs was in some way justified because they produced significantly better results compared with other SIMs and were better than linear and log-linear models set up as a more reasonable benchmark, see Table 2. At this stage it was thought that the best way of improving the results was by finding better ways of representing digital map information.

Table 2. The performance of different SIMs used to interpolate population from Nuts3 zones to 1 decimal-minute regions in Great Britain

Method                 Absolute Percentage Error
Areal Weighting        316.2
Pycnophilactic SIM     276.3
Smart SIM              129.1
Linear Model           84.5
Log Linear Model       84.1
Smarter SIM            81.8

4.2 Clever SIM

This involved three major changes from the Smarter SIM described above. One is that data preprocessing is more sophisticated as described in the following subsection, another is that a type of bootstrap is used during the neural network training as described in Section 4.2.2, and the third is that the problem was broken down as described in Section 4.2.3.
 

4.2.1 Improved GIS preprocessing

As in Step 3 of the Smarter SIM described above, this involved manipulating available data into surfaces that contain information about either the location of, distance to, direction to, or density of a selected set of geographical data. From a single geographical data layer it is possible to create an immense number of different generalised surfaces and these surfaces can be combined in various ways. Not all these surfaces will be useful or are sensible to use for a given modelling task, but some might reasonably be expected to represent useful information. Surfaces for different geographical data layers can also be combined (this is subtly different). A common difficulty concerns what to create that is meaningful and another difficulty concerns what to combine and what to leave to the NNs to model.
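As a concrete illustration of location, distance and density surfaces, the sketch below derives all three from a set of grid cells that contain a feature. It is a simplification made for this example only: distances are straight-line distances in cell units, density is the share of feature cells in a square window, and the function names are hypothetical.

```python
import math

def location_grid(features, rows, cols):
    """1 where a cell contains a feature, 0 elsewhere."""
    return [[1 if (r, c) in features else 0 for c in range(cols)] for r in range(rows)]

def distance_grid(features, rows, cols):
    """Distance (in cell units) from each cell to the closest feature cell."""
    return [[min(math.hypot(r - fr, c - fc) for fr, fc in features)
             for c in range(cols)] for r in range(rows)]

def density_grid(location, radius):
    """Share of feature cells in the square window around each cell."""
    rows, cols = len(location), len(location[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            window = [location[rr][cc]
                      for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                      for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

features = {(0, 0)}                      # e.g. a cell crossed by a main road
location = location_grid(features, 2, 3)
distance = distance_grid(features, 2, 3)
density = density_grid(location, 1)
print(location)  # [[1, 0, 0], [0, 0, 0]]
```

Varying the feature selection, the distance measure and the window radius is what produces the very large number of possible surfaces referred to above.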

To alleviate some of these difficulties a way of constructing cross-scale density layers was developed as a preprocessing technique, see Turner (2000) for details. Basically this involves combining density layers across a range of different scales at a specific resolution. Consider for example a cross-scale density surface of roads. This would be created by first calculating the density of roads at the analysis resolution, and then calculating the density of roads at a coarser resolution where the cell width and height are a factor larger than at the analysis resolution. These surfaces would then be combined at the analysis resolution, trying to ensure that only a minimum spatial bias is added. The way of ensuring a minimum spatial bias involves generating aggregate density surfaces in all possible ways given the granularity of the data.

Experiments were performed to test the differences between combining density layers from a range of spatial scales and including density layers from a selection of scales as separate inputs. These experiments confirmed that cross-scale density surfaces were indeed very useful, as it appeared they would be. A geometric aggregation and combination in which the resolution (height and width of each grid cell) doubled each time was considered sufficient and was computationally feasible given available resources.
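One possible reading of the cross-scale idea is sketched below; the actual procedure is given in Turner (2000) and may differ. The sketch assumes that "all possible ways" means every alignment of the coarse grid relative to the fine grid, that the density of a coarse cell is the mean of the fine cells it covers, and that scales are combined by a simple average. The function names are hypothetical.

```python
def coarse_density(grid, size, off_r, off_c):
    """Mean density in coarse cells of size x size fine cells, for one alignment."""
    rows, cols = len(grid), len(grid[0])
    sums, counts = {}, {}
    for r in range(rows):
        for c in range(cols):
            key = ((r + off_r) // size, (c + off_c) // size)
            sums[key] = sums.get(key, 0.0) + grid[r][c]
            counts[key] = counts.get(key, 0) + 1
    return [[sums[((r + off_r) // size, (c + off_c) // size)]
             / counts[((r + off_r) // size, (c + off_c) // size)]
             for c in range(cols)] for r in range(rows)]

def unbiased_density(grid, size):
    """Average of the coarse densities over every alignment of the coarse grid."""
    rows, cols = len(grid), len(grid[0])
    total = [[0.0] * cols for _ in range(rows)]
    for off_r in range(size):
        for off_c in range(size):
            layer = coarse_density(grid, size, off_r, off_c)
            for r in range(rows):
                for c in range(cols):
                    total[r][c] += layer[r][c]
    return [[v / (size * size) for v in row] for row in total]

def cross_scale_density(grid, n_scales):
    """Combine densities at cell sizes 1, 2, 4, ... at the analysis resolution."""
    rows, cols = len(grid), len(grid[0])
    total = [[0.0] * cols for _ in range(rows)]
    size = 1
    for _ in range(n_scales):
        layer = unbiased_density(grid, size)
        for r in range(rows):
            for c in range(cols):
                total[r][c] += layer[r][c]
        size *= 2                      # resolution doubles each time
    return [[v / n_scales for v in row] for row in total]

roads = [[0.0] * 4 for _ in range(4)]
roads[1][1] = 1.0
surface = cross_scale_density(roads, 3)
```

Averaging over every alignment means no single placement of the coarse grid is favoured, which is the sense in which the added spatial bias is kept small in this sketch.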

4.2.2 Bootstrap

The second major development in the Clever SIM involved using a kind of bootstrap. Basically this periodically fed the best output or prediction from the model back in as an input. This can be criticised for various reasons (such feedback is not a generally accepted way of modelling things), but experiments suggested it was not without merit in some circumstances. It can help speed up training, it can be useful in examining convergence properties, and it made visualising and translating NN parameters even more interesting. To do this kind of bootstrap, the NN training process was stopped after a given number of training cycles or after some prespecified level of convergence was reached. An output surface was then generated, substituted into the model and training was restarted. This seemed to help the NN converge by effectively giving it a memory.

This process is summarised as follows:

Step 1 Train the neural network

Step 2 Generate a result

Step 3 Substitute the result (or some function of the result) in place of the previous result

Step 4 Repeat Steps 1-3
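The control flow of these steps can be sketched as follows. Only the loop is illustrated: `train` and `predict` are hypothetical callables standing in for the NN training and application code, the feedback input is assumed to start at zero, and the toy stand-ins at the end exist only so that the sketch runs.

```python
def bootstrap_training(train, predict, predictors, targets, rounds,
                       transform=lambda value: value):
    """Repeatedly train, predict and feed the prediction back in as an input."""
    feedback = [0.0] * len(targets)          # assumed initial "previous result"
    model = None
    for _ in range(rounds):
        # The fed-back result is appended to the other predictors of each cell.
        rows = [list(x) + [f] for x, f in zip(predictors, feedback)]
        model = train(rows, targets, model)               # Step 1
        result = [predict(model, row) for row in rows]    # Step 2
        feedback = [transform(value) for value in result] # Step 3
    return model, feedback                                # Step 4 is the loop

# Toy stand-ins (not neural networks), used only to exercise the loop.
def toy_train(rows, targets, model):
    return sum(targets) / len(targets)       # the "model" is the target mean

def toy_predict(model, row):
    return 0.5 * model + 0.5 * row[-1]       # blend mean with fed-back value

model, result = bootstrap_training(toy_train, toy_predict,
                                   [[1.0], [2.0]], [10.0, 20.0], rounds=3)
```

Passing the previous model back into `train` is what allows training to be restarted from where it stopped rather than from scratch.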

4.2.3 Breaking down the problem domain

The idea for this fed back from some subsequent land use modelling. In that modelling different NNs were trained to classify cells into one of four different land use classes. Each NN was trained separately, in that case using the same input predictor data, and the results from applying each NN were combined to produce a single output. Applying this to the population example involved training some NNs to classify zero population density, based on one set of input predictors and one training data set, while other NNs were geared to estimate high population density and others were geared to predict low population density. Combining the various outputs was done using fuzzy inference and the benefits of doing so were tested by an extensive set of experiments. It became clear during the experiments that better ways of analysing the errors and estimating the uncertainties were needed.
 

5 Conclusions and ideas for further research

The Pycnophilactic SIM assumes a degree of spatial autocorrelation of the variable being interpolated. Neither it nor areal weighting is generally very useful when it comes to DSI. The Smart SIM is better but only works where there is available data that can be used to guide the interpolation. The use of NNs to replace subjective interpretation of input predictors as described in Section 4.1, the use of cross-scale density information as introduced in Section 4.2.1, and breaking down the problem as described in Section 4.2.3 offer extremely useful enhancements. The bootstrap described in Section 4.2.2 is also useful but needs more careful consideration.

Fuzzy logic based techniques provide more alternatives and possibilities for developing an Intelligent SIM. Consider the following statement: If road network density is high and land use is residential then population density is high. Imagine now a new form of SIM that takes many similar statements, codes them using fuzzy membership functions and parses these using an inference engine. This is an idea for extending the smart SIM in another way. It is also worth attempting to interpret NN parameters by translating them into fuzzy linguistic statements. Perhaps the grandest challenge is to develop automated intelligent general purpose analysis and modelling systems.
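To make the example statement concrete, the sketch below codes it as a single fuzzy rule. Everything numerical here is hypothetical: the linear membership functions, their thresholds, and the use of a residential share between 0 and 1 to express "land use is residential". The minimum operator is used for "and", which is one common choice rather than the only one.

```python
def ramp(x, low, high):
    """Membership rising linearly from 0 at low to 1 at high."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def road_density_is_high(road_density):
    return ramp(road_density, 1.0, 5.0)      # hypothetical thresholds

def land_use_is_residential(residential_share):
    return ramp(residential_share, 0.2, 0.8)  # hypothetical thresholds

def population_density_is_high(road_density, residential_share):
    # "If road network density is high AND land use is residential
    #  then population density is high", with AND taken as the minimum.
    return min(road_density_is_high(road_density),
               land_use_is_residential(residential_share))

print(population_density_is_high(3.0, 1.0))  # 0.5
```

A full inference engine would evaluate many such rules for each target cell and combine their degrees of support into a single estimate.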

References

Bracken I (1994) A surface model approach to the representation of population-related social indicators. In Spatial Analysis and GIS edited by Fotheringham S and Rogerson P, Taylor and Francis, 245-259.

Bremner F, Gotts S, Denham D (1994) Hinton diagrams: viewing connections strengths in neural networks. Behaviour Research Methods, Instruments and Computers 26(2), 215-218.

CIESIN (URL) http://www.ciesin.org/datasets/gpw

Clarke J, Rhind D (1991) Population Data and Global Environmental Change. International Social Science Council Programme on Human Dimensions of Global Environmental Change, Working Group on Demographic Data.

Deichmann U, Eklundh L (1991) Global digital datasets for land degradation studies: A GIS approach, United Nations Environment Programme, Global Resource Information Database, Case Study No. 4, Nairobi, Kenya.

Deichmann U (1996) A Review of Spatial Population Database Design and Modelling. Report prepared for the United Nations Environment Programme, Global Resource Information Database, and Consultative Group for International Agricultural Research initiative on the Use of Geographic Information Systems in Agricultural Research.

Fisher P, Langford M (1995) Modelling the errors in areal interpolation between zonal systems by Monte Carlo simulation. Environment and Planning A, 27, 211-224.

Flowerdew R, Openshaw S (1987) A review of the problems of transferring data from one set of areal units to another incompatible set. RR4, Northern Regional Research Laboratory, University of Newcastle Upon Tyne, England.

Flowerdew R, Green M (1991) Data integration: statistical methods for transferring data between zonal systems. In Handling geographic information: Methodology and potential applications edited by Masser E and Blakemore M, Longman, 38-54.

Hunting Technical Services (1998) GIS Application Development. Technical Report No 2 (Interim Report) compiled in association with Nene University College for the European Commission Statistical Office EUROSTAT, June.

Goodchild M, Lam S (1980) Areal interpolation: a variant of the traditional spatial problem. Geoprocessing 1, 297-312.

Goodchild M, Anselin L, Deichmann U (1993) A framework for the areal interpolation of socio-economic data. Environment and Planning A, 25, 383-397.

Lam N (1983) Spatial interpolation methods: a review. American Cartographer 10, 129-149.

Langford M, Maguire D, Unwin D (1991) The areal interpolation problem: estimating population using remote sensing in a GIS framework. In Handling geographic information: Methodology and potential applications edited by Masser E and Blakemore M, Longman, 55-77.

Martin D, Bracken I (1991) Techniques for modelling population-related raster databases. Environment and Planning A, 23, 1069-1075.

Medalus (URL) http://www.medalus.demon.ac.uk

Martin D (1996) Geographical Information Systems: socioeconomic applications, Routledge, 116-139.

Openshaw S, Turner A (1998) Predicting the impact of global climate change on land use patterns in Europe. Paper presented at the International Conference on Modelling Geographical and Environmental Systems GIS, Hong Kong, June.

Openshaw S, Turner A (1999) Forecasting global climate change impacts on Mediterranean agricultural land use in the 21st Century. Paper presented at the 11th European Colloquium on Quantitative and Theoretical Geography, Durham, England, September.

Openshaw S, Turner A (URL) http://www.ccg.leeds.ac.uk/projects/medalus

Surpop (URL) http://census.ac.uk/cdu/surpop

Sweitzer J, Langaas S (1994) Modelling population density in the Baltic States using the Digital Chart of the World and other small scale data sets. Technical Paper, Beijer International Institute of Ecological Economics, Stockholm.

Tobler W (1979) Smooth Pycnophylactic Interpolation for Geographical Regions. Journal of the American Statistical Association, 74, 519-529.

Tobler W, Deichmann U, Gottsegen J, Maloy K (1995) The Global Demography Project. NCGIA Technical Report TR-95-6.

Turner A (2000) Density data generation for spatial data mining applications. Paper presented at GeoComputation, Chatham, UK, September.

Willmott C, Matsuura K (1995) Smart interpolation of annually averaged air temperature in the United States. Journal of Applied Meteorology 34, 2577-2586.