Some further experiments with designing output areas for the 2001 UK census

Stan Openshaw, Seraphim Alvanides and Simon Whalley

Centre for Computational Geography
School of Geography
University of Leeds
Leeds LS2 9JT, U.K.


1 Introduction

Martin (1997, 1998) describes a method for automating the design of census output geography and presents some results he obtained for a test area. This paper seeks to confirm these results and build upon this pioneering approach via the use of finer resolution spatial data and a more sophisticated version of zone design algorithms than was used by Martin. The topic is very important because of proposals by the Office for National Statistics to consider using output areas in England and Wales based on a new and uniform geography that nest within wards. This new geography would have consistently defined properties, resulting in zones that are as small as possible, of similar size, compact in shape, and reasonably homogenous in terms of a few key variables. Additionally, the digital representations of these new output areas will be accurately known (without any need for digitisation) and their composition in terms of postal addresses precisely defined. If these plans materialise, they will constitute the most significant advance in census geography for 30 years.

The 2001 census will be the first UK census for which the definition of the areas used to report the small area census results (the output areas) could be designed by a computer procedure to have a consistent set of common properties. The release of small area census data for census enumeration districts was a major innovation in census outputs that dates back to the 1961 census. Census enumeration districts (EDs) are a Victorian concept and reflected the desire to divide up the country so as to equalise the work load of the census enumerators. As such it was traditionally the smallest geographical entity for which census data could be fairly readily provided. The 1961 small area statistics for EDs experiment was a consequence of the computerisation of the census and has ever since been repeated for each subsequent census. The task of designing the EDs (traditionally a paper exercise based on large scale paper maps) was also broadened to allow local authorities to make an input if they so wished. Increasingly census EDs became a very useful census data reporting framework that has been used in a wide range of research, planning and commercial geodemographic applications. It can be classed as geographical information since each census ED has been given an approximate 100m grid-reference.

Unfortunately census EDs are a non-ideal basic spatial unit reporting small area census data for a number of reasons:

  1. they are not homogenous,
  2. they vary greatly in population size,
  3. they are not stable over time,
  4. they vary in shape and physical size,
  5. they are not designed in any formal or consistent way to conform to an explicit set of fixed rules,
  6. they only poorly relate to unit postcode geography in England and Wales and the situation in Scotland is fast becoming chaotic due to changes occurring in postal geography,
  7. the digital boundaries and the maps belong to agencies other than the census or as Openshaw (1996) put it the "arcs" are owned by the OS and the "info" by ONS as the census has not been traditionally viewed as integrated geographical data source,
  8. end-users have not previously been permitted to design their own EDs but have to make do with the official ones and
  9. the approximate x, y coordinate point references for EDs are an inadequate form of spatial representation for an increasing number of purposes which require more accurate levels of spatial resolution.

Until fairly recently nothing much could be done about it because of a mix of technical, methodological, and data reasons. However, the situation is now quite different. There are a number of major changes which are very relevant, in particular:

  1. GIS is now a universally available and mature technology,
  2. the OS Address Point product is complete and this claims to provide an accurate 1m grid-reference for each postal address in the UK,
  3. the users of census data are far more experienced and aware than previously and have increasingly demanding and more diverse expectations,
  4. much more is now known about the degrees of geographical freedom involved in zone design as a result of research by Openshaw and Rao (1995), Openshaw and Alvanides (1998a,b), and Martin (1997, 1998), and others,
  5. computer speeds have increased dramatically since the 1991 census was taken and are continuing to do so, whilst hardware prices have fallen equally dramatically making what would have been prohibitively expensive in 1991 extremely affordable by 1998, and
  6. the census faces an increasing number of alternative "census like" data products that offer far higher levels of geographical precision and flexible outputs.

In an age when users are more aware of modifiable areal unit problems and zone design tools (such as ZDES) are becoming widely available over the internet, it is no longer acceptable to merely ignore the zone design aspects involved in the creation of census output areas on the grounds that it is not possible or feasible or affordable to do anything different or that most users do not care and are happy with the traditional products. If users claim so now, are they likely to be so unconcerned when the 2001 census results are eventually produced? Once ONS had no choice about how to perform this task, now they do and it would be quite remarkable if they were to continue to use the same old legacy style of output areas in 2001 as they did in 1991. Whether or not they are prepared to allow census users complete freedom to design their own output geographies subject to various confidentiality restrictions is still unknown and may be regarded as too ambitious for 2001. However, if not, then the onus is placed firmly on ONS to perform the output area design process in a far more rigorous and consistent manner than has historically been practical and seek to identify an output geography that is at a sufficiently fine level of resolution that it allows flexible re-aggregation sufficient to meet existing and still emerging new user needs.

This paper presents a methodology and a case study of how this task can be performed. Section 2 describes the zone design method being proposed here and how it builds on previous work by Martin (1997) and Openshaw and Rao (1995). Section 3 presents an empirical case study based on simulated Address Point census data. Finally Section 4 makes a number of suggestions about how to design better census output areas relevant to the 2001 census.

2 Zone DEsign System (ZDES)

The automated zone design program (AZP) algorithm of Openshaw (1977) was originally developed to explore modifiable areal unit effects. Subsequently it was used to experiment with zone design. However, AZP was far ahead of its time and it was not until the mid-1990s that the increased availability of digital map data and fast workstation hardware resulted in its revival. The original AZP technology has been extended to form the ZDES system based on the work of Openshaw and Rao (1995), but subsequently dramatically extended and improved; see Openshaw and Alvanides (1998a, b). ZDES, in common with AZP, views the zone design task as a special kind of combinatorial optimisation problem. The aim is to optimise a function of the data generated by a zoning system defining an aggregation of N original zones into M regions or output zones (M<N). Expressed in mathematical notation it looks like any other mathematical optimisation problem, viz

optimise F(Z)

where F(Z) is a general function of Z, except that Z is not a simple set of linear or non-linear parameters but defines an aggregation of N initial zones into M output zones. Additionally, there are implicit constraints on Z such that each of the original N zones has to be assigned to exactly one output zone and all the members of the same output zone have to be connected so that when the internal boundaries are dissolved they form a single polygon. This optimisation task might be categorised as a constrained non-linear integer optimisation problem. It can only be solved via heuristic methods that may not find the global optimum result; indeed, there is no way of knowing whether there is a single global optimum result to find! The view here is that finding a global optimum may be less relevant than finding extremely 'good' results, however 'goodness' is to be measured. In practice the principle of caveat emptor needs to be applied.
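The two implicit constraints on Z (every building block belongs to exactly one output zone, and every output zone is a single connected piece) can be illustrated with a short check. This is only a sketch under stated assumptions, not ZDES code: the function name `is_valid_zoning`, the list-of-labels representation of Z and the neighbour dictionary are all hypothetical choices made for illustration.

```python
from collections import deque

def is_valid_zoning(assignment, adjacency, n_regions):
    """Return True if every region 0..n_regions-1 is non-empty and connected.

    assignment: list where assignment[i] is the region label of block i
                (one label per block, so each block is in exactly one region).
    adjacency:  dict mapping each block index to the set of its neighbours.
    """
    members = {region: [] for region in range(n_regions)}
    for block, region in enumerate(assignment):
        if region not in members:
            return False              # label outside the allowed range
        members[region].append(block)

    for region, blocks in members.items():
        if not blocks:
            return False              # an empty region is not a zone
        # Breadth-first search restricted to blocks of the same region.
        seen = {blocks[0]}
        queue = deque([blocks[0]])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if assignment[neighbour] == region and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(seen) != len(blocks):
            return False              # region splits into separate pieces
    return True

# Four blocks in a row: 0 - 1 - 2 - 3
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
contiguous = is_valid_zoning([0, 0, 1, 1], chain, 2)
split = is_valid_zoning([0, 1, 0, 1], chain, 2)
```

In the second call region 0 consists of blocks 0 and 2, which do not touch, so the zoning is rejected.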

The general function F(Z) need not be continuous nor convex for all feasible values of Z. Despite this complexity, the fairly simple Monte Carlo optimisation method, formerly called AZP (Openshaw, 1976), seems capable of providing what appear to be good solutions to many zone design functions. The other innovation with AZP was the benefit of viewing F(Z) as being any relevant function, thus suggesting that a single algorithm could in principle solve any zone design problem, ranging from electoral redistricting to location-allocation modelling simply by plugging in a different objective function. Additionally, the original ZDES code has also been extended so that ZDES v5 will now handle a wide range of unconstrained and constrained zone design problems. Note that the term constrained relates here to applications in which there are extra user defined constraints imposed on the zone design process which are additional to those contiguity and coverage restrictions already implicit in ZDES. The ZDES software can be downloaded from a WWW site together with some recent papers (ZDES, 1998). At present, we are working on developing UNIX portable and NT versions which will work on any UNIX or NT system supporting Arc/Info release 7.0 or above.
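The idea that one search procedure can serve many zone design problems simply by plugging in a different F(Z) can be sketched as follows. This is a deliberately simplified border-swapping local search written for illustration; it is not the published AZP algorithm nor the ZDES implementation, and the names `border_swap_search`, `is_connected` and `size_imbalance`, as well as the toy data, are hypothetical.

```python
import random

def is_connected(blocks, adjacency):
    """True if the given blocks form a single connected, non-empty piece."""
    blocks = set(blocks)
    if not blocks:
        return False
    start = next(iter(blocks))
    seen, stack = {start}, [start]
    while stack:
        for neighbour in adjacency[stack.pop()]:
            if neighbour in blocks and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == blocks

def border_swap_search(assignment, adjacency, objective, sweeps=20, seed=0):
    """Move border blocks between regions while `objective` (minimised) improves."""
    rng = random.Random(seed)
    assignment = list(assignment)
    best = objective(assignment)
    for _ in range(sweeps):
        improved = False
        order = list(range(len(assignment)))
        rng.shuffle(order)
        for block in order:
            donor = assignment[block]
            recipients = {assignment[n] for n in adjacency[block]} - {donor}
            rest = [b for b, r in enumerate(assignment) if r == donor and b != block]
            # Skip interior blocks and moves that would empty or split the donor.
            if not recipients or not is_connected(rest, adjacency):
                continue
            for recipient in sorted(recipients):
                assignment[block] = recipient
                value = objective(assignment)
                if value < best:
                    best, improved = value, True
                    break
                assignment[block] = donor   # undo a move that did not help
        if not improved:
            break
    return assignment, best

# Toy data: six blocks in a row with made-up household counts.
households = [1, 1, 1, 1, 1, 5]
chain = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

def size_imbalance(assignment):
    """One possible F(Z): squared deviation of region sizes from their mean."""
    totals = [0, 0]
    for block, region in enumerate(assignment):
        totals[region] += households[block]
    target = sum(households) / 2
    return sum((t - target) ** 2 for t in totals)

zones, score = border_swap_search([0, 0, 0, 1, 1, 1], chain, size_imbalance)
```

Replacing `size_imbalance` with any other function of the assignment changes the problem being solved without touching the search itself, which is the point being made in the text.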

3 Designing census output zones

3.1 Overview of the process

Martin (1997) performed his research within the ONS data confidentiality barrier. This greatly complicated the research task and it is a tribute to his skill and enthusiasm that he managed to achieve so much. His building blocks were unit postcodes and he tried to retain ED boundaries wherever it was sensible to do so because of the belief that at least some of the boundaries had a more general utility. Each ward was treated separately. Wards are currently the smallest "official" geography in Britain and this restriction breaks the census output area design task into about 10,420 separate pieces as each ward has to be processed separately. Seemingly this is a very daunting task until it is realised that the same procedure is applied to each ward and the entire process can be automated. Additionally, this type of problem is very well suited for a parallel supercomputer as a separate processor can be assigned to each ward if necessary. Martin's use of postcode building blocks and his attempt to preserve ED boundaries may be questioned as unit postcodes are inherently unstable building blocks, they do not nest into wards in England and Wales, and the old ED 1991 boundaries are of immensely variable meaningfulness. Moreover, the use of linear features from OS digital databases also carries with it additional, perhaps large, data cost and royalty penalties which may well be passed on to the end users. The current belief is that the use of land line data may well cost far more than any value it adds to the output zones. This is because the most important linear barrier features (e.g. rivers, railway lines, major roads, open spaces) are already implicit in the distribution of address points and these aspects may be more readily (and far more cheaply) handled by the choice of an appropriate objective function for the zone design process.

Another aspect is that unit postcode geography is probably reaching the end of its life. Postcodes are not viewed as a zoning system, they are optimised for postal delivery, they are unstable, and there is no standard specified for them. Nevertheless, postcodes are an extremely convenient geography. Postcodes trivialised the historic address matching task and provide a very easy way of attaching approximate X,Y coordinates to data relating to postal addresses and of matching address data to census EDs. However, now that a British Standard for addresses exists and there is a national address level gazetteer for the UK (Address Point), it becomes clear that postcodes are no longer so effective and may have already reached the end of their really useful life. As address matching methods improve over the next few years, postcode geography will become far less relevant and will increasingly come to be viewed as what it always was, namely a quick and dirty fix to a problem that once could not be solved in any other way. Accordingly, it is argued that the obvious building blocks for the 2001 and subsequent census output areas are the individual addresses in the OS Address Point file. These are the smallest, most accurate, and most stable basic spatial units that can exist. The 2001 census should, ideally, be based on an updated and corrected set of address points either based on the OS Address Point product or on an ONS created version using hand held GPS or some equivalent technology capable of yielding sub 1 metre locational references for all the addresses in the UK.

Address points are, by far, the most obvious building blocks for the 2001 census and for designing small area census and output area geography. They have postcodes attached so they can also be linked to postcode data and they should increasingly be machine address matchable. Unfortunately, the Address Point data are not yet available for academic research, so one of the authors (S.Whalley) spent many hours digitising all the address points within a ward in order to create a simulated address point coverage for a part of Sheffield.

3.2 Choice of constraints and zone design function

The question now is what properties should an ideal census output geography possess?

Cliff et al (1975) suggest that in general a good zoning system should be

  1. as simple as possible,
  2. homogenous and
  3. compact.

Wise et al (1997) use similar criteria to (2) and (3) and suggest that the zones should also be of equal size. Martin (1997) uses population size, shape, and homogeneity. He writes "..the population objective is based on the minimisation of the total squared difference between output area population sizes and target population size ... the boundary constraint seeks to minimise the total squared boundary length of the output area ... the homogeneity constraint ... seeks to minimise the sum of the squared differences ..." (p.14)

In effect in all these zone design applications the objective function is the weighted sum of up to three different design functions. This is, unfortunately, not the best or most appropriate way of creating zoning systems that meet a set of design constraints. There is also another problem. In a census context obtaining identically sized zones is far less important than ensuring the smallest one exceeds an ONS-specified minimum size. Similarly, there is no need to optimise homogeneity or squared boundary lengths per se because the desired optimal target values are not unknown. There is merely a need to ensure that all the output zones exceed a minimum comparable degree of compactness and homogeneity. It is far more important to avoid the occasional straggling or spindly output zone than it is to insist that all zones are approximately circular in shape, which is a highly unrealistic goal.

In essence the design functions used by Cliff et al., Wise et al., and Martin should be re-cast as inequality constraints. The homogeneity constraint can be set at some reasonably high arbitrary value, say 75%. The shape constraint is far more problematical because it is difficult to know what limits to use and it may not matter much, or at all, if occasionally a strangely shaped zone is produced. One solution would be to convert it into an objective function and then seek to minimise it; however, this may still produce highly eccentric shapes. It is suggested that a better alternative is to use a population weighted accessibility function. A compact zoning system will tend to have a minimum sum of within zone travel distances around the point of minimum aggregate travel. This can be simply expressed as a set of M separate P-median problems, one for the members of each of the M output zones. Optimising the sum of these M P-median problems will tend to produce naturally compact zones that automatically adapt to the local distribution of address points.

3.3 Handling constraints in zone design

The problem that Martin, Wise et al and Cliff et al were previously unable to solve is how to optimise one design function (within zone population weighted distances) subject to inequality constraints on the values of the other zone design functions (size and homogeneity). It is this problem that ZDES can now solve. In ZDES some constraints are implicit in the algorithm and can never be violated (e.g. contiguity). Other design constraints that the user may wish to impose on the zoning systems are of the more usual type characteristic of mathematical programming, but even here the nature of the zone design task creates additional complications. These user-defined constraints can be applied to each of the individual M output zones and/or relate to some characteristics of the data generated by all the output zones. For example, a minimum zone size inequality constraint can be applied to all M zones and an additional inequality restriction placed on the nature of the data generated by the complete set of zones; viz. that skewness is less than some threshold value. Examples of the latter are provided in Openshaw (1978a), where he generates zoning systems that maximise a correlation coefficient (as the objective function), subject to the data being approximately normally distributed and with zero spatial autocorrelation; both the latter are specified as inequality constraints specifying ranges of acceptable values. In a census output area design context the constraints can be restricted to size and homogeneity.

In general, there are two ways of handling these user defined constraints:

  1. convert them into a single weighted function or
  2. handle them via a far more sophisticated penalty function method borrowed from non-linear optimisation literature.

Most geographers have adopted the first approach, perhaps, without realising its key deficiencies.

Thus Cliff et al (1975), Wise et al (1997), and Martin (1997) all use a sum of weighted constraint violations. For example, the SAGE system of Wise et al (1997) allows the user to provide differential weights to three zone design functions (homogeneity, equality, and compactness). The user supplies weights to them (0 to 100%) as a means of trying to balance the different objectives. The problem with all these methods is that each function is measured in different units and they need to be standardised in an appropriate way. Martin (1997) writes "Setting each weight to 1.0 means that each is given an equal weighting, such that a 1% overall improvement in one measure is of equal attractiveness to a 1% overall improvement in another" (p14). The problem is that this scaling is not straightforward. The principal criticisms can be summarised as follows.

  1. The zone design functions of the Cliff-Wise-Martin type are not constraints in a conventional mathematical optimisation sense; instead they are really multiple separate objective functions, the weighted sum of which is to be minimised in the hope that this delivers a good result. Unfortunately, there is no assurance that either any or all of these design functions will meet whatever minimum or maximum limits are placed upon them. The overall result is uncontrolled inconsistency from one ward to the next in relation to the design criteria.
  2. The quality of the solutions depends entirely on the respective weighting given to the competing design functions and especially how each of them is scaled or standardised so that the previous quote from Martin (1997) is satisfied.
  3. Function standardisation is not easy and it is almost impossible to determine how best to scale the competing functions. For example, how do you relate within zone sum of squares to squared boundary length of the output areas to squared deviations from a target population size? Each function has a different mean and variance that constantly change as the zoning system is modified.
  4. In essence the design functions are being used as equality constraints but with extreme and quite possibly unreasonable target values (e.g. minimising the within zone sum of squares is equivalent to an equality constraint of zero and a goal of complete homogeneity) but sometimes with far more attainable, but impossible to exactly meet, targets (e.g. minimise deviations from mean zone size). Is it necessary or sensible to have equality constraints with right hand values (i.e. targets) of zero?
  5. If non-zero targets are used then the problem reduces to finding a feasible zoning system that satisfies a set of inequality constraints. However, there are potentially multiple different solutions that may achieve this goal and for it to work there has to be an overall objective function to define which of the feasible solutions is best.
  6. However, if the intention is to use multiple objective functions in zone design then consideration should be given to various multiple objective function handling methods, such as goal programming, in which there is an explicit trade-off between alternative functions. The problem is that this process is usually very complex and non-automatic involving interactive trade-offs.

A far better and simpler strategy is to select one of the design functions as the objective function and treat the others as equality or inequality constraints, setting realistic, explicit, and attainable target values for them.
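The scaling criticism can be made concrete with a small numerical illustration. The sketch below uses entirely made-up values for two hypothetical zoning systems and a naive unstandardised weighted sum; it is not the standardisation used by Martin (1997) or by SAGE, and only shows why some form of scaling is unavoidable once functions in different units are added together.

```python
def weighted_sum(values, weights):
    """Naive weighted sum of design function values (smaller is better)."""
    return sum(w * v for w, v in zip(weights, values))

def preferred(a, b, weights):
    """Label of the zoning system with the smaller weighted sum."""
    return "A" if weighted_sum(a, weights) < weighted_sum(b, weights) else "B"

# (squared population deviation, squared boundary length in square metres)
# Both pairs of values are invented purely for illustration.
zoning_a = (400.0, 9.0e6)
zoning_b = (900.0, 4.0e6)
weights = (1.0, 1.0)

def boundary_in_km(zoning):
    """Re-express the squared boundary length in square kilometres."""
    return (zoning[0], zoning[1] / 1.0e6)

choice_in_metres = preferred(zoning_a, zoning_b, weights)
choice_in_km = preferred(boundary_in_km(zoning_a), boundary_in_km(zoning_b), weights)
```

With equal weights the boundary term dominates when measured in metres and zoning B is chosen, whereas the same two zoning systems measured in kilometres lead to zoning A being chosen.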

3.4 Penalty function methods used in non-linear optimisation

Once the zone design problem is thought of as a mathematical optimisation problem (i.e. an objective function with constraints) then there are various methods for handling constraints on non-linear optimisation problems that may be used in a zone design context. The simplest is to add a penalty function to the objective function that reflects these constraint violations. The weighting given to these constraint violations is then gradually increased so that, hopefully, a sequence of unconstrained optimisations will gradually move towards a solution of the original constrained problem.

One way of operationalising this approach is as follows.

Let Cj be the jth constraint violation; for example, if a minimum size constraint is applied so that for any zone j

Cj = max (target - Sj, 0)

where

target is the minimum household size so that the constraint being imposed is in fact Sj ≥ target

and Sj is the household size of zone j

So, Cj either has a value of zero if Sj equals or exceeds the size threshold or is set equal to the degree by which the constraint is violated (i.e. the number of households that needs to be added to zone j to reach the size threshold).
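As a small worked example of this definition, the following sketch computes Cj for four hypothetical zones against a threshold of 25 households. The function name `size_violation` and the household counts are assumptions made for illustration.

```python
def size_violation(households, target):
    """Households that must be added for a zone to reach the size threshold."""
    return max(target - households, 0)

zone_households = [30, 25, 18, 7]      # made-up zone sizes
violations = [size_violation(s, 25) for s in zone_households]
```

The first two zones meet the threshold and so contribute zero, while the last two fall short by 7 and 18 households respectively.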

A penalty function approach would now seek to re-arrange the zones in order to optimise a new objective function that has a penalty term which represents the constraint violations. For example, minimise a function such as

$$F(Z) + \frac{1}{r} \sum_{j} C_j^2$$

for successive values of the positive parameter r chosen so that it slowly moves to zero. The hope is that this sequence of unconstrained problems will converge to a solution to the constrained problem. The problem with these and related methods is that they tend to involve terms that can easily create very large numbers which causes major difficulties to the optimiser with results that are highly sensitive to small changes making them hard to handle. Also, the differential scaling issue involved in trading off one set of constraints against another is still not being properly addressed. Finally, the optimisation method used to solve a particular unconstrained problem may easily become stuck and experiments indicate that this is particularly likely in a zone design context.
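The numerical difficulty described here is easy to reproduce. The sketch below assumes a quadratic penalty weighted by the reciprocal of a positive parameter r, which is one common textbook form and may differ from the exact function intended in the text; the objective value and the violations are invented for illustration.

```python
def penalised(objective_value, violations, r):
    """Objective plus a quadratic penalty weighted by 1/r (an assumed form)."""
    return objective_value + sum(c * c for c in violations) / r

objective_value = 1250.0        # made-up value of the unpenalised objective
violations = [0, 0, 7, 18]      # households short of the threshold, per zone

# The penalty term grows without bound as r is moved towards zero.
sequence = [(r, penalised(objective_value, violations, r))
            for r in (1.0, 0.1, 0.01, 0.001)]
for r, value in sequence:
    print(r, value)
```

Even with only two violated constraints the penalty term quickly dwarfs the objective itself, which is the source of the ill-conditioning noted above.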

3.5 Powell-Fletcher method

Powell (1969) describes a far superior approach to handling constraints. He developed a modified penalty function approach that has two sets of controlling parameters. This was later generalised to handle inequality constraints; see Fletcher (1987). It involves optimising a new function

$$\phi(Z, \theta, \sigma) = F(Z) + \frac{1}{2} \sum_{i} \sigma_i \left( C_i(Z) - \theta_i \right)^2$$

where

φ(Z, θ, σ) is a penalty function that is dependent on the zoning system Z, θi and σi;

θi and σi are a series of parameters that are estimated to ensure gradual satisfaction of the constraints;

Ci(Z) is a constraint violation which is some function of the zoning system and, or the zones in the zoning system;

F(Z) is the objective function and Z is the set of unknowns in the optimisation which is in fact the zoning system.

The required solution can usually be obtained for moderate values of the parameters, if one exists. The attraction is that it is easy to change these parameters to generate a suitable sequence of unconstrained problems. The key point here is that the two sets of control parameters are changed slowly to ensure that, if at all possible, a constrained solution will be obtained. Fletcher and Powell provide convergence proofs and it only requires that the objective function and the various constraint functions are scaled so as to have similar magnitudes. As Openshaw (1978b) demonstrated it works extremely well on zone design problems and so far no better methods have been developed. It has the added benefit that this penalty function approach avoids many of the local minima that the original AZP algorithm is prone to discover because it can go up and down function gradients depending on the respective values of the control parameters. A version of ZDES has now been developed which implements this approach in a form suitable for designing census output areas. The question now is how well it works.
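The behaviour of this kind of shifted penalty function can be seen on a toy problem with one continuous variable, which is far simpler than zone design and is offered only as a sketch. It assumes the inequality form in which a constraint c(x) >= 0 contributes 0.5 * sigma * min(c(x) - theta, 0)^2 and the shift is updated by theta <- theta - min(c(x), theta); the scale parameter sigma is held fixed here for simplicity, whereas a full scheme would also adjust it. All function names are hypothetical.

```python
def minimise_scalar(func, lo, hi, iterations=200):
    """Ternary search for the minimiser of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        third = (hi - lo) / 3.0
        left, right = lo + third, hi - third
        if func(left) < func(right):
            hi = right
        else:
            lo = left
    return (lo + hi) / 2.0

def objective(x):
    return x * x                # toy objective, unconstrained minimum at x = 0

def constraint(x):
    return x - 1.0              # toy constraint, satisfied when x >= 1

def augmented(x, theta, sigma):
    """Objective plus a shifted quadratic penalty on the constraint shortfall."""
    shortfall = min(constraint(x) - theta, 0.0)
    return objective(x) + 0.5 * sigma * shortfall ** 2

def solve(sigma=10.0, outer_iterations=25):
    theta, x = 0.0, 0.0
    for _ in range(outer_iterations):
        # Solve one unconstrained problem for the current shift theta.
        x = minimise_scalar(lambda t: augmented(t, theta, sigma), -10.0, 10.0)
        # Move the shift by the remaining violation (zero once feasible).
        theta -= min(constraint(x), theta)
    return x, theta

x_star, theta_star = solve()
```

The constrained minimum is at x = 1, and the iteration approaches it with sigma fixed at a moderate value of 10 rather than requiring the penalty weight to grow without limit.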

4 Census output area experiments

4.1 Data creation

The study region used here is a ward in Sheffield that was selected because of the diversity of the characteristics it contained. The labour involved in creating simulated Address Point data was sufficiently great as to preclude examination of more than one ward. The address points were digitised from a Sheffield City large scale map (1:2000) representing the centroids of properties identified on the map (see Figures 1 and 5). The number of households at each address was inserted as an attribute; this assumes that each house number on the map represents one household. The resulting dataset replicates the Address Point data to sub-metre accuracy and consists of 5406 points representing 7700 households.

As no real 1991 individual census data were available for this area it was necessary to create other variables used to represent a measure of zonal homogeneity. The homogeneity criteria probably only really need to be at a fairly gross level of detail because it is unreasonable to expect high levels of social homogeneity even at this scale. Additionally, there may be some concern that designing zones to be "too homogenous" may cause subsequent statistical problems in analysing the data (affecting variances). On the other hand complete lack of any control for homogeneity is also undesirable, because systematic geographical differences in the levels of homogeneity do need to be controlled to some extent. Probably the best and most relevant variables to use are, therefore, either tenure type or building type. Here the former is used as it could be captured directly from the map that was used for the digitisation. There are two house tenures; council or privately owned and these are applied via the homogeneity criteria as used by Martin (1997).

The last stage of data creation involved the generation of two sets of 5406 Thiessen polygons each, to be used as building blocks. The first set comprised Thiessen polygons restricted to linear features in order to experiment with Martin's (1997) approach; due to lack of OS data, ED boundaries were used as linear features (Figure 2). The second set was based on free Thiessen polygons without any restrictions on their shape, clipped only to the ward boundary (Figure 6). The Thiessen polygons were examined for inconsistencies and entered into ZDES for the creation of the contiguity matrix and the necessary attribute tables.

4.2 Objective function and constraints

The objective function most suitable here is the sum of population weighted within zone distances around the point of minimum aggregate travel for each output area, the so called P-median point. These accessibilities are summed for all M output regions to provide a useful indicator of output area compaction. Thus, the objective function is:

$$F(Z) = \sum_{j=1}^{M} \sum_{i \in M_j} P_i D_{ij}$$

where

Pi is the population value for small area Ni (in this case the number of households);

Dij is the Euclidean distance between address point i and the population centroid of the output area Mj of which i is a member; note that this P-median centroid depends on the current membership of region j.
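A minimal sketch of this compactness measure is given below. It assumes, for simplicity, that the centre of each output area is restricted to one of its own member address points (whichever gives the smallest household weighted distance sum); the text does not say how ZDES locates the centre, so this is an illustrative choice, and the function names and coordinates are hypothetical.

```python
from math import hypot

def zone_cost(points, households):
    """Smallest household-weighted distance sum to any member point of a zone."""
    best = float("inf")
    for cx, cy in points:
        total = sum(h * hypot(x - cx, y - cy)
                    for (x, y), h in zip(points, households))
        best = min(best, total)
    return best

def compactness(points, households, assignment):
    """Sum of zone costs over all output zones (smaller means more compact)."""
    zones = {}
    for point, count, zone in zip(points, households, assignment):
        member_points, member_counts = zones.setdefault(zone, ([], []))
        member_points.append(point)
        member_counts.append(count)
    return sum(zone_cost(p, h) for p, h in zones.values())

# Two clusters of made-up address points, about 100 units apart.
points = [(0, 0), (3, 4), (0, 10), (100, 0), (103, 4)]
households = [1, 2, 1, 1, 1]

clustered = compactness(points, households, [0, 0, 0, 1, 1])
straggling = compactness(points, households, [0, 0, 1, 1, 1])
```

Assigning the third point to the distant cluster produces a straggling zone and a much larger value of the function, which is how the measure discourages spindly shapes without any explicit shape constraint.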

In addition the following constraints are used. The percentage homogeneity of any output zone is calculated as

Hi = 100 (max (X1, X2) / (X1 + X2))

where X1 is council tenure and X2 is non-council tenure in zone i.

For current purposes each output zone is constrained so that: Hi > 75

where Hi is the homogeneity measure previously defined. A second set of size constraints is applied so that every zone reaches an arbitrary minimum size of 25, more formally: Sj ≥ 25

where Sj is the number of households in output zone j. Note that with M output zones there are 2M inequality constraints that have to be satisfied before an acceptable solution is found.
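The 2M constraints can be checked zone by zone as in the following sketch. The function names and the example tenure counts are hypothetical; the thresholds are the 75% homogeneity and 25 household values used in the text, and every zone is assumed to contain at least one household.

```python
def homogeneity(council, private):
    """Percentage of households in the dominant tenure class of a zone."""
    return 100.0 * max(council, private) / (council + private)

def failed_constraints(zones, min_households=25, min_homogeneity=75.0):
    """List (zone index, constraint name) for every violated constraint.

    zones: list of (council households, non-council households) per zone.
    """
    failures = []
    for index, (council, private) in enumerate(zones):
        if council + private < min_households:
            failures.append((index, "size"))
        if homogeneity(council, private) <= min_homogeneity:
            failures.append((index, "homogeneity"))
    return failures

zones = [(30, 2), (10, 20), (20, 3)]   # made-up tenure counts
failures = failed_constraints(zones)
```

Here the second zone is large enough but only about 67% homogenous, and the third is homogenous but has only 23 households, so neither zoning would be accepted as it stands.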

4.3 An automated census output areas procedure

A brief outline of the principal stages in an automated census output area design process is shown in Figure 11. This is a modification of Martin (1997) but with a few key changes in the mechanisms for handling the constraints and it has a different objective function. It may be of passing interest to note that most of this processing is performed automatically inside the ZDES system. The choice of constraint values is fixed by the global design process, but the number of output areas is variable and is specific to each ward. In addition, the ward specific nesting could be replaced by a higher level of geography. ZDES compute times are linear in N and almost indifferent to the magnitude of M. So a Local Authority nesting would be quite feasible and the results would be better because of the reduced effects of ward boundaries.

The best way of dealing with the problem of how many output zones can be engineered is to guess the maximum number (given a minimum size level) and then slowly reduce it until a feasible and visually pleasing solution is obtained. For the 1991 census an output area had a minimum household count of 16. The objective here is to generate as many output areas as possible, given the design constraints. It is useful to note that the principal constraint on the nature of the output areas used to report census data is the need to preserve the confidentiality of the data. Currently this is defined only indirectly via minimum household and people size counts. This being the case, the obvious strategy from a user's point of view is to seek the maximum number of output zones that meet these restrictions. This is a far more useful design objective than merely insisting that most zones are of broadly the same size. In the limit, the two approaches are identical; for instance when all zones exactly match the ONS minimum (population and household) size. This objective is probably unrealistic but there is no reason to assume that zoning systems cannot be engineered that come close to meeting this goal.
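The search for the largest feasible number of output zones can be sketched as a simple downward loop. The upper bound follows from the household total and the minimum zone size; the `is_feasible` argument is a hypothetical stand-in for a complete constrained zone design run, and the threshold used in the example is invented purely to exercise the loop. The sketch covers feasibility only, not the visual inspection mentioned above.

```python
def max_feasible_zones(total_households, min_size, is_feasible):
    """Largest zone count, at or below the size bound, that is feasible."""
    upper_bound = total_households // min_size
    for zone_count in range(upper_bound, 0, -1):
        if is_feasible(zone_count):
            return zone_count
    return None

# With 7700 households and a minimum of 25 per zone, no zoning system
# can contain more than 308 zones.
upper_bound = 7700 // 25

# Hypothetical feasibility rule standing in for a full zone design run.
best_count = max_feasible_zones(7700, 25, lambda zone_count: zone_count <= 290)
```

In practice each feasibility test is a full optimisation, so a coarser step or a bisection over the zone count may be preferable to reducing the count one zone at a time.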

The reason for wishing to maximise the number of output areas is, therefore, simply that this will maximise the subsequent utility of the small area data. Having more rather than fewer output areas will give those census users who are denied an opportunity to design their own geographies maximum flexibility in the re-aggregation of the published small area census data. Indeed, if the areas are sufficiently small then there may well be no need for any other form of output geography, as users could re-engineer these small zones to match their needs, also using ZDES but with the OAs as the building blocks.

4.4 Discussion of the results

Martin (1998) notes that in the three experiments he performed it was sometimes possible to more than double the numbers of output areas. In the Sandwell ward of Birmingham the 50 EDs of 1991 could be increased to 111 output areas and in Petersfield ward the number rose from 25 to 48, but in the Craven ward it fell from 19 to 18. However, these increases are almost certainly smaller than a more sophisticated constraint handling zone design method would manage, as is demonstrated here.

Table 1 shows the statistical properties of the results of our experiments with the restricted Thiessen polygons for the Sharrow ward in Sheffield, together with the ED properties for comparison. The minimum number of households in an output area was arbitrarily set to 25, well above the official census figure of 16; this figure is the minimum zone size constraint. As can be seen, there comes a point where no feasible result can be obtained. This point can be fine tuned to whatever arbitrary degree of precision is desired. Selected outputs from these runs are mapped in Figures 3 and 4 for 50 and 200 output zones respectively. The algorithm performed well, but we expected that it might perform better with the unrestricted Thiessen polygons.
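The summary columns reported in Table 1 can be derived from any candidate zoning along the following lines. This is a minimal sketch with hypothetical names (zone_household_stats, households, zone_of) and made-up data. It checks only the minimum household threshold, whereas the status reported by a real design run may also reflect other constraints.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def zone_household_stats(
    households: Sequence[int],
    zone_of: Sequence[Hashable],
    min_households: int,
) -> Dict[str, object]:
    """Summarise a zoning: households[i] belongs to building block i,
    and zone_of[i] is the output zone that block i is assigned to."""
    totals = defaultdict(int)
    for count, zone in zip(households, zone_of):
        totals[zone] += count  # accumulate households per output zone
    sizes = list(totals.values())
    return {
        "n_zones": len(sizes),
        "minimum": min(sizes),
        "maximum": max(sizes),
        "mean": sum(sizes) / len(sizes),
        # Feasible here means only that every zone meets the size threshold.
        "feasible": min(sizes) >= min_households,
    }


# Made-up example: five building blocks grouped into three zones.
stats = zone_household_stats(
    households=[10, 15, 30, 5, 25],
    zone_of=["a", "a", "b", "c", "c"],
    min_households=25,
)
```

In this toy example zone "a" holds 25 households and zones "b" and "c" hold 30 each, so the zoning just meets a threshold of 25.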

Table 2 shows the results for the unrestricted Thiessen polygons. Again the algorithm performed extremely well, suggesting that about 290 output areas could be identified that met the design constraints (cf. the 36 census EDs of 1991). However, as Figures 7 to 10 show, some of the maps with the larger numbers of output areas have started to disintegrate and no longer show compact polygon shapes. This is probably due to the homogeneity constraint. It would therefore be sensible to select fewer than the maximum number of output areas; in this case, somewhere between 200 (Figure 8) and 250 (Figure 9). Some indication of the shape disintegration can be seen in Table 3 via a crude shape index that measures the total length of internal arcs divided by the number of polygons.
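One way such a crude shape index might be computed is sketched below. The arc representation is an assumption made for illustration: each arc is a (left zone, right zone, length) triple, None marks the outside of the study area, and an arc is treated as internal when it separates two different zones. The function name in_shape_index is hypothetical.

```python
from typing import Hashable, Iterable, Optional, Tuple

# (zone on the left, zone on the right, arc length); None = outside the ward
Arc = Tuple[Optional[Hashable], Optional[Hashable], float]


def in_shape_index(arcs: Iterable[Arc]) -> float:
    """Total length of internal arcs divided by the number of polygons."""
    zones = set()
    internal_length = 0.0
    for left, right, length in arcs:
        for zone in (left, right):
            if zone is not None:
                zones.add(zone)
        # Count only arcs that separate two different zones.
        if left is not None and right is not None and left != right:
            internal_length += length
    if not zones:
        raise ValueError("no zones found in the arc list")
    return internal_length / len(zones)


# Made-up example: two 100 m squares side by side share one 100 m arc.
example_arcs = [
    ("a", "b", 100.0),   # shared internal boundary
    ("a", None, 300.0),  # outer boundary of zone a
    ("b", None, 300.0),  # outer boundary of zone b
]
index = in_shape_index(example_arcs)
```

For the two-square example the internal arc length is 100 m shared between two polygons, giving an index of 50.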

5 Conclusions

The results confirm and add further support for the creation of a new style of output area geography consisting of zones that have been consistently defined, satisfy confidentiality restrictions, and yet are small enough to be used as building blocks in other user specific geographies and re-aggregations. The use of a more sophisticated zone design algorithm that permits a re-specification of the problem in an optimisation framework improves the quality of the results and puts the entire process on a stronger methodological basis. The same technology could also be used to create a second tier of geography by re-assembling the small areas into zones that are 1991 ED-like or ward-like in terms of numbers but consistently defined to have common properties. This is an important point. If output areas 4 to 8 times smaller than 1991 EDs can be created then it would be fairly easy to re-zone these output areas to provide uniform geographies at other scales of resolution without risking differencing problems. It may also permit best fitting to historic census geographies that are now defunct (e.g. 1981 wards, frozen postcode sectors for 1991, parishes for 1971). Finally, these methods could also be used as a basis for an entirely user controlled flexible output geography design process merely by adding an explicit confidentiality risk constraint to the zones that are produced. This is possible now, or soon could be, but it may be more appropriate for the 2011 rather than the 2001 census, or for user testing in 2005.

If the new output areas are to be successful then there are five conditions that may need to be met: (1) a global agreement as to the design criteria and the values of the constraints, (2) the availability of a digital representation of the resulting areas (created from the zone outputs), (3) the creation of a fully connected topology without edges to permit subsequent re-aggregation to be easily performed, (4) the diffusion of the necessary zone design software to allow census end users to re-engineer their census geographies and (5) an end-user community sufficiently aware, now, well in advance of the census, to see the benefits of such a revolution in four or five years' time. It is this last condition that is probably the hardest of all to meet and it requires a proactive rather than a reactive response by ONS.

Acknowledgements

The authors would like to acknowledge the use of 1991 Census data purchased by ESRC/JISC for most of the analysis and mapping needs in this paper. We would also like to thank the GIS Unit, Planning Department of Sheffield City Council, which kindly provided the 1:2000 map for the modelling of the Sharrow ward address data. All data and maps are Crown Copyright.

References

Cliff, A.D., Haggett, P., Ord, K., Bassett, K., Davies, R., 1975, Elements of Spatial Structure. CUP, Cambridge

Fletcher, R., 1987, Practical Methods of Optimisation. Chichester, Wiley

Martin, D., 1997, Implementing an automated census output geography design procedure. Department of Geography, University of Southampton, Southampton. Draft 20/01/97 (Copies obtained by the author)

Martin, D., 1998, Optimising census geography: the separation of collection and output geographies. International Journal of Geographical Information Science 12 (in press)

Openshaw, S., 1977, A geographical solution to scale and aggregation problems in region-building, partitioning, and spatial modelling. Transactions of the Institute of British Geographers 2, 459-72

Openshaw, S., 1978a, An empirical study of some zone design criteria. Environment and Planning A 10, 781-794

Openshaw, S., 1978b, An optimal zoning approach to the study of spatially aggregated data. In I. Masser, P.J.B. Brown (eds) Spatial representation and spatial interaction. Boston MA, Martinus Nijhoff, 95-113

Openshaw, S., 1984, The modifiable areal unit problem. Concepts and Techniques in Modern Geography 38. Norwich, UK, GeoBooks

Openshaw, S., 1996, Census Users Handbook

Openshaw, S., 1996, Developing GIS relevant zone based spatial analysis methods. In P. Longley, M. Batty (eds) Spatial analysis: modelling in a GIS environment. Cambridge, GeoInformation International, 55-73

Openshaw, S., Alvanides, S., 1998, Applying GeoComputation to the analysis of spatial distributions. In P. Longley, M.F. Goodchild, D.J. Maguire, D.W. Rhind (eds) GIS: Principles, Techniques, Management and Applications. GeoInformation International (forthcoming)

Openshaw, S., Rao, L., 1995, Algorithms for re-engineering 1991 Census geography. Environment and Planning A 27, 425-46

Openshaw, S., Schmidt, J., 1996, Parallel simulated annealing and genetic algorithms for re-engineering zoning systems. Geographical Systems 3, 201-20

Powell, M.J.D., 1969, A method for nonlinear constraints in minimisation problems. In R. Fletcher (ed) Optimisation. Academic Press, London, 283-298

Wise, S., Haining, R., Ma, J., 1997, Regionalisation tools for the exploratory spatial analysis of health data. In A. Getis, M.M. Fischer (eds) Recent Developments in Spatial Analysis. Springer, Berlin, 83-100

ZDES, 1998, http://www.geog.leeds.ac.uk/research/ccg/zdes3.html for further details

Table 1: Output area statistics with ED boundary barriers for Sharrow ward, Sheffield


Unit   Number     Status    Minimum      Maximum      Mean         in_shape
type   of units             households   households   households   index

EDs    36         n/a       73           408          214          600
ZDES   50         OK        40           269          154          545
ZDES   100        OK        26           206          77           368
ZDES   150        OK        25           141          51           297
ZDES   200        OK        25           109          39           263
ZDES   250        no f.s.   25           80           31           280

Notes: in_shape: (internal arcs length) / (number of polygons); no f.s.: no feasible solution

Table 2: Output area statistics without boundary barriers for Sharrow ward, Sheffield


Number     Status    Minimum      Maximum      Mean         in_shape
of units             households   households   households   index

50         OK        65           295          154          530
100        OK        27           171          77           362
150        OK        25           129          51           295
200        OK        25           83           38           264
240        OK        25           125          32           284
250        OK        25           170          31           273
275        OK        25           62           28           267
285        OK        25           72           27           291
290        OK        25           60           27           304
295        no f.s.   25           63           26           302


Notes: no f.s.: no feasible solution

Table 3: Output area shapes

Number of units    Shape measure

50                 530
100                362
150                295
200                264
240                284
250                273
275                267
285                291
290                304



                                                                        

                                                                        
Figure 11: Overview of an automated census geography design process