Developing GIS relevant zone based spatial analysis methods

Stan Openshaw

School of Geography
University of Leeds
Leeds LS2 9JT, U.K.


1. Background

Despite a number of conferences, workshops, books, and papers on the Spatial Analysis and GIS theme in the last five years, there appears to have been a marked reluctance to face up to the problems of doing spatial analysis in the GIS age. This partly reflects the lack of interest in spatial analysis from the late 1970s onwards, as the first quantitative revolution in geography faded away, and also the absence of any large-scale funding of explicitly spatial analysis research. It is true that GIS has greatly stimulated interest in spatial analysis, but it is equally apparent that none of the major GIS research programmes has supported much new spatial analysis activity. For example, in the UK the ESRC's Regional Research Laboratory Initiative (1987-91) targeted spatial analysis as a priority area and then probably spent the equivalent of about three person-years of effort on it. In the USA, the National Center for Geographic Information and Analysis organised a Spatial Analysis Initiative (I-14, the 14th initiative in four years), produced a book (Fotheringham and Rogerson, 1994), and had a budget of about £35K to be spent mainly on stimulating awareness of the problems, with no real resource to fund research. In Europe, the current ESF-funded GISDATA Research Programme has targeted spatial analysis as one of only sixteen key areas for attention, and seems to be well on the way to repeating the US NCGIA model, which means that little new research is being directly funded as distinct from a raised awareness and some stimulation.

So whereas it would appear that there has been considerable research activity in the spatial analysis domain since the late 1980s, in practice there has been very little more than the efforts of a small number of individual enthusiasts. It is argued that spatial analysis has been grossly and unduly neglected, especially as it is in many ways fundamental to the effective use and further exploitation of GIS in many different applied contexts. From a GIS perspective, the problem is that the analysis and modelling tasks only become important once GIS has become an established technology. It is a second generation GIS research agenda item. It is apparent that this post-GIS revolution era has been reached and that the focus of research attention now needs to move on from Geographic Information Handling to Geographic Information Using, with a greatly increased emphasis on creating appropriate analysis and modelling functionality. Sadly, these new needs have not yet been reflected in any major new research initiatives. Indeed, it is amazing that, for example, in the UK the Analysis of Large and Complex Data Sets Initiative (one of the largest ever run by the ESRC with potentially over three times as much funding as was spent on the highly successful GIS research initiative) should have somehow almost totally missed the need for the analysis of large and complex spatial databases, focusing instead on complex survey data of probably limited academic interest and of relatively little (by comparison) applied value and relevance.

GIS has created a large number of spatial databases that now need to be analysed. Openshaw (1994a) talks about "spatial analysis crime", which is a label applied to those agencies who hold spatial information and fail to adequately use or analyse it. This type of "crime" is particularly widespread and occurs wherever data exist in a suitable form, since that in itself creates a strong imperative for their analysis. In a commercial context, the failure to make good or full use of spatial information impacts on profitability. In the public sector the failure to analyse data can be against the public good and may affect the efficiency with which billions of pounds of public money are spent; for example, every year about £22 billion is allocated to health authorities based on a simple spatial model (Carr-Hill et al, 1994). Of course, there are civil liberties, privacy, and data protection excuses that can be applied, but these are often grossly exaggerated, misunderstood, and treated only in a highly negative manner. It can be argued that in the UK the Data Protection Act of 1984 is not so much protecting sensitive data as preventing or scaring people from performing analysis (Openshaw, 1994b). This is clearly not right, more especially if there is a public goods case that can be made. Privacy, confidentiality, and civil rights should not prevent analysis so much as ensure that the information is not released or misused in a way that impacts on the privacy of identifiable individuals. This is very important, but it is not a valid excuse for non-analysis within appropriate confidentiality envelopes and barriers.

However, maybe the lack of interest in spatial analysis by those with the data also reflects a lack of suitable methods, a lack of awareness of what can now be done in the 1990s (compared with the quantitative revolution years of the 1960s), and the more or less complete absence of suitable tools. There is a golden opportunity here for geographers to start a new revolution related to the Analysis and Use of Geographic Information, now that the problems of handling and obtaining it are more or less resolved, or need only another decade or so for their total resolution. Spatial analysis and modelling of human spatial systems is now rapidly emerging as a new grand challenge area for the late 1990s (Openshaw, 1995a).

Twenty years ago a conference was held in Newcastle on "Whither spatial analysis?" at a time when the topic really was withering away. However, this is still a relevant question in the GIS age. It would seem that even now there are few GIS relevant spatial analysis tools and no clear view of how to proceed. This is particularly true in the case of the analysis of data aggregated to zones. Maybe too many of the experts date from the pre-GIS era and merely view GIS as a rich new source of data on which to apply their pre-GIS technologies. Maybe also it is the hardness of the challenge that has made progress slow, and explains why many advocates of GIS have favoured other forms of geoprocessing technology that are more readily do-able, with clearly defined sets of needs and markets, over spatial analysis. Perhaps this is the real reason why GIS developers and vendors have been slow to offer spatial analysis tools and then justify their neglect by arguing that there is in any case no market demand for it. Certainly confusion as to the nature of spatial analysis technology that might be considered GIS-relevant and GIS-friendly has not helped; neither have the efforts of the more statistically minded geographers who, either deliberately or accidentally, have fostered the view that the technology is likely to be statistically complex, demanding skills that are seemingly alien to many GIS developers and users alike.

Yet how much longer is it safe to let matters drift? How much longer can geographers continue their relative neglect of an activity so important to GIS? How much longer before the users start to demand spatial analysis functionality? The default of doing next to nothing will almost certainly hold back the full exploitation of GIS in general, and may well result in the re-discovery and misguided application of old-fashioned quantitative geographic analysis methods that are probably wholly inappropriate, purely on the grounds that there is nothing else available. This "make do with what we have" philosophy is already in evidence; for example, why are undergraduate geographers still taught how to apply classical statistical methods to spatial data even though it has been known for about 20 years that these methods are largely inappropriate in a spatial setting, in that any classical inferences will probably be wrong? Statistics in geography now has to be much more than the mere application of old statistical methods, widely available via standard packages, to data that reflect a geographical context. Yet even statistical geographers who know better still consider that conventional statistical packages are useful in exploring spatial data and argue against the more radical view that such methods have little or no value when applied to zonal data.

2. So what are GIS relevant spatial analysis methods?

It is useful to try and identify the types of spatial analysis needs that appear to exist in the GIS era and devise a set of basic criteria that will help guide those seeking to develop operationally useful methods. Openshaw (1991) outlines some basic generic spatial analysis procedures. These are listed in Table 1. The question then is how to meet these needs. There are a number of styles of approach which can be recognised; in particular descriptive, exploratory, inferential, and model based. In each case an increasingly sharp distinction can be drawn between those who see the analysis functions as being driven by the user, e.g. the SPACESTAT package (Anselin, 1992) and the SPIDER and REGARD packages of Haslett et al (1990, 1991), with the emphasis on interactive spatial statistics (e.g. using XLisp-Stat) and various graphic devices providing a means of visualising patterns and relationships; and those who regard the analysis as being performed largely automatically by machine, but with some user guidance, intuition and insight, and with the results presented back for further interpretation, e.g. Openshaw et al (1987, 1990). Anselin (1994) and others who prefer the highly man-machine interactive data exploration approach often see this distinction in style as a problem; in fact it is not. As Openshaw (1995b) points out, the aim is to create an intelligent partnership that allows machines to do what they are best at (e.g. pattern sifting) and lets human beings do what they are good at (interpretation, application of experience, the ability to think laterally, etc). The optimal approach is clearly a combination of the two. Now the graphical-interactivists might well claim that this is what their systems already seek to do, but an important caveat needs to be added: this is correct only for smallish problems with relatively few variables and possibly fairly straightforward stationary relationships and patterns. Once you turn up the levels of complexity (i.e. more variables, more observations) and then introduce a spatial dimension, there must come a point when interactive map graphic approaches no longer work. Alternatively, if you begin with large and complex spatial databases, the question might also be asked: how much data reduction and simplification is needed before these graphics-based methods can begin to cope? This type of "toy town" technology is going to be most useful as a front-end to a much more powerful machine-based automated analysis process.

It is important to be realistic and attempt to create a technology that can handle the problems of exploratory spatial data analysis, and clearly this needs technical and methodological advances to occur on a broad front rather than just in certain areas. However, even if it is possible to agree about the broad needs, there still remains the task of defining basic criteria that the new spatial methods need to meet. Openshaw (1994c, d) presents 10 basic rules for identifying appropriate GISable spatial analysis technology. They are summarised in Table 2.

Some of these criteria relate to the design of spatial statistics, for example, the fundamental importance of identifying local rather than only globally recurrent spatial patterns and relationships; in particular the need to move on from whole-map statistics. Anselin (1994) certainly agrees; he writes "The degree of non-stationarity (spatial instability) in large spatial data sets is likely to be such that several regimes of spatial association would be present... Therefore, the sole emphasis on global measures of spatial association... is misplaced" (p47). However, one might wish to question the linkage here of non-stationarity with spatial instability. What is really being looked for are geographically localised patterns, and it is these features that are often of greatest interest and also the hardest to find. It is particularly important in developing new spatial statistical techniques that they can look within the study region for any such localised patterns. Indeed, some very interesting local measures of spatial association are being developed; for example, the spatial lag pies of Anselin et al (1993), the Moran scatterplot (Anselin, 1994), the Getis-Ord G statistic (Getis and Ord, 1992), and other local indicators of spatial association, including the local autocorrelation diagnostic of Nass and Garfinkle (1992). It is absolutely fundamental that we develop tools able to detect, by any feasible means, the patterns and localised associations that exist within the map. There is a considerable amount of interesting research underway, and the mid-1990s may well see the greatest advances in spatial analysis since the original work of Cliff and Ord (1973).
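
To make this local emphasis concrete, the following sketch computes a simple zone-by-zone association score (a neighbourhood-weighted cross-product of standardised values, in the general spirit of the local indicators cited above, though not a reproduction of any one published formulation) together with a conditional permutation pseudo-significance value. The contiguity matrix, the data, and the particular form of the score are purely illustrative assumptions.

```python
import numpy as np

def local_association(x, W, n_perm=999, seed=0):
    """For each zone i compute a local association score
    c_i = z_i * sum_j W[i, j] * z_j   (z = standardised x)
    and a conditional-permutation pseudo p-value: how often does a random
    reshuffle of the other zones' values give an equally extreme score?"""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    z = (x - x.mean()) / x.std()
    obs = z * (W @ z)                      # observed local scores
    p = np.ones(n)
    for i in range(n):
        others = np.delete(z, i)
        w_i = np.delete(W[i], i)
        sims = np.empty(n_perm)
        for k in range(n_perm):
            sims[k] = z[i] * (w_i @ rng.permutation(others))
        # two-sided pseudo p-value: rank of |observed| among the simulations
        p[i] = (1 + np.sum(np.abs(sims) >= abs(obs[i]))) / (n_perm + 1)
    return obs, p

# toy example: 5 zones on a line, binary contiguity, a 'hot' pair at one end
W = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
x = np.array([10.0, 12.0, 11.0, 30.0, 29.0])
scores, pvals = local_association(x, W)
for i, (s, pv) in enumerate(zip(scores, pvals)):
    print(f"zone {i}: local score {s:6.2f}  pseudo p {pv:.3f}")
```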

Some of the other criteria in Table 2 also need emphasis. In particular, the post-GIS era is a time of large spatial databases. It is not unusual for large numbers of point and zone entities to be present. Methods of analysis are needed that can cope with the problems of spatial analysis in GIS domains. The problems are not necessarily the same as they were during the pre-GIS statistical geographic era. Much more of the analysis is now largely data driven and is relatively hypothesis free. Many end-users merely want answers to fairly abstract questions such as: are there any patterns, where are they, what do they look like? Additionally, it is apparent that spatial analysis will soon no longer be the exclusive preserve of the expert researcher but that the technology needs to be packaged for much more general use. This implies that it must be understandable, intrinsically safe, usable by the non-expert, and useful. These design criteria are fundamental and cannot simply be added on as afterthoughts. GIS is an applied technology and it is unavoidable that fairly soon large numbers of "idiots" will end up doing, and having to do, spatial analysis of one form or another. It is important that they are not misled by exaggerated claims of a statistical kind or rely on methods that are wrong when applied in a spatial context. In many ways what is needed is not a precise spatial analysis technology but methods able to spatially analyse data without misleading the inexperienced or causing measurable harm. One approach is automation of the search and validation functions. Another is the default of leaving it to users to make their own mistakes! However, the subject is too important to be so neglected. It is critical that we discover how best to meet generic spatial analysis needs in a way that is least likely to cause subsequent harm. This will not be easy but equally it is a responsibility that geographers cannot shirk.

3. Confirmatory options

One of the key areas in which users are misled is the appeal of confirmatory analysis. Many users of GIS will want to test hypotheses or will want to know if this or that result is "significant" even if they have no clear idea of what precisely significance means in a particular GIS context. However, most users of spatial analysis technology know only that significance means importance and non-significance implies irrelevance. It is an important task to develop the clearest possible view of the nature and role of confirmatory methods in a spatial analytical context. Currently there is considerable confusion and urgent steps are needed to clarify the situation.

This problem is not easily resolved, particularly as it is still not clear how best to use inference in an exploratory spatial data analysis context, if indeed it should be used at all. It is all very well arguing against the frequentist philosophy and emphasising instead the attractions of Bayesianism, but currently most of the available technology is non-Bayesian in nature. The confirmatory dilemma is as follows: either you test a single whole-map statistic against a null hypothesis or you test N hypotheses relating to zone or locality specific statistics. In the former, the test is silly from a geographical point of view because of its whole-map nature, its dependency on the definition of the study region, and the nature of the underlying hypothesis. In the latter case, there is the problem of multiple testing, which, as N increases beyond a small number, usually becomes insuperable. If that is not sufficient, there are two additional difficulties that need to be faced: (1) the nature and origin of the null hypothesis, since testing against randomness is not entirely useful and it would be far better to be much more specific than this universally applicable null hypothesis of last resort; and (2) the power of the test.

Much more attention needs to be given to what significance means in a spatial context and the nature of the spatial null hypothesis. For example, most spatial phenomena are non-random even when they display no patterning of any substantive interest; population densities, for instance, are systematically patterned. A null hypothesis of zero spatial autocorrelation is not meaningful because the random state of any variable associated with population density would be positively spatially autocorrelated. Some of these difficulties may be overcome by the use of Monte Carlo significance tests (instead of reliance on asymptotic distributions) and other forms of computationally intensive testing (bootstrapping and permutation tests). However, the question remains as to what they are being used for.
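
The following is a minimal sketch, under toy assumptions about the contiguity matrix and data, of the computationally intensive alternative being referred to: a Monte Carlo randomisation test in which the observed values are repeatedly re-assigned to the zones at random and the whole-map statistic (here a global Moran's I) is ranked against its simulated distribution rather than an asymptotic one.

```python
import numpy as np

def morans_I(x, W):
    """Global Moran's I: (n / S0) * (z'Wz) / (z'z), where z are deviations from the mean."""
    z = x - x.mean()
    S0 = W.sum()
    return (x.size / S0) * (z @ W @ z) / (z @ z)

def monte_carlo_test(x, W, n_sim=999, seed=0):
    """Monte Carlo (randomisation) test: re-assign the observed values to the zones
    at random n_sim times and rank the observed statistic among the simulated ones."""
    rng = np.random.default_rng(seed)
    obs = morans_I(x, W)
    sims = np.array([morans_I(rng.permutation(x), W) for _ in range(n_sim)])
    p = (1 + np.sum(sims >= obs)) / (n_sim + 1)   # one-sided: positive association
    return obs, p

# toy example: a 2 x 3 grid of zones with rook contiguity
W = np.array([[0, 1, 0, 1, 0, 0],
              [1, 0, 1, 0, 1, 0],
              [0, 1, 0, 0, 0, 1],
              [1, 0, 0, 0, 1, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 1, 0, 1, 0]], dtype=float)
x = np.array([5.0, 6.0, 7.0, 5.5, 6.5, 7.5])
I, p = monte_carlo_test(x, W)
print(f"Moran's I = {I:.3f}, Monte Carlo pseudo p = {p:.3f}")
```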

If there is a genuine a priori spatial hypothesis to test then that is the one case where confirmatory methods might actually be very useful; provided it is a genuine a priori hypothesis and not one generated via knowledge of the data being investigated. The problem here is that post-hoc hypothesis testing following on from an initial exploratory analysis of spatial data (e.g. perhaps via knowledge of mapped data) is a very common practice in geography. However, this is not an acceptable confirmatory approach. The hypothesis needs to be formulated prior to any knowledge of the data on which it is to be tested. This clearly limits its usefulness in a GIS context.

If the hypothesis testing is merely to flag areas of the map which are "significant" (a widespread practice), then this may be highly misleading due to multiple testing. For example, a map of 1000 zones showing significant Getis-Ord G statistics at α = 0.05 would, even if the data were random, show on average 50 significant zones. Worse follows: the variance of this expected number of false results cannot be computed except by simulation. So how will the investigator know whether any particular set of mapped results is showing anything meaningful or not? Also, how would you separate the really significant zones from the falsely (due to multiple testing) significant zones when mapped? In any case, with a large number of zones, who can cope with large numbers of supposedly significant results scattered all over the map? If you reduce the α value to allow for the testing of 1000 hypotheses, then it is highly likely that not even a single zone will be identified, because α is now vanishingly small. Another difficulty is that data errors often appear as a highly significant departure from whatever might have been expected. So even if you find sufficiently strong results able to survive corrections for multiple testing, it is quite likely that some or many of them will be due to data error rather than a 'real' result. How do you cope with that?
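
The arithmetic of the multiple testing trap, and of the Bonferroni-style correction alluded to, is easily demonstrated; the short sketch below assumes independent tests, which is itself optimistic for spatially dependent local statistics, and that is precisely why in practice the variance has to be obtained by simulation.

```python
import numpy as np

n_zones, alpha = 1000, 0.05

expected_false = n_zones * alpha            # 50 "significant" zones expected under a true null
alpha_bonferroni = alpha / n_zones          # Bonferroni-style per-test level: 0.00005

# spread of the false-positive count when the 1000 tests are treated as independent;
# spatial dependence between the local statistics would alter this, which is why
# the variance really has to be obtained by simulation in a spatial setting
rng = np.random.default_rng(0)
counts = (rng.uniform(size=(2000, n_zones)) < alpha).sum(axis=1)
print(expected_false, alpha_bonferroni, counts.mean(), counts.std())
```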

One possibility is to use the spatial pattern of significant results as a second-order test. Whilst 50 significant results might be expected in the previous example, what is the probability that 40 of the 50 are all located in the same part of the map? Now this can be computed by simulation, and it provides the basis for a geographically relevant approach to the use of confirmatory methods in an exploratory spatial data context, provided there is sufficient computing power to make it feasible. A variation of this approach is that of 'descriptive inference'. In a way the words are contradictory, but it retains the flavour of an exploratory style, in that significance is being used purely as a data filter, not as a test of an explicit hypothesis. This is very much the style of Besag and Newell (1991) and also Openshaw and Craft (1991). It reduces the importance of many of the problems but equally it reduces the value of the confirmatory approach. It is abundantly clear that confirmatory technology needs further urgent attention if it is to be evolved into spatial analysis techniques that are safe and suitable for use in GIS. Currently it is not.
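
A sketch of such a second-order test is given below. The spatial concentration of the flagged zones is summarised here by their mean pairwise centroid distance (one of several possible choices, and an assumption rather than a prescription), and its pseudo-significance is obtained by repeatedly flagging the same number of zones at random.

```python
import numpy as np

def mean_pairwise_distance(points):
    """Average distance between all pairs of flagged-zone centroids:
    small values indicate the flagged zones are concentrated in one part of the map."""
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    iu = np.triu_indices(len(points), k=1)
    return d[iu].mean()

def second_order_test(centroids, flagged, n_sim=999, seed=0):
    """How unusual is the observed concentration of the flagged zones, compared
    with flagging the same number of zones at random?"""
    rng = np.random.default_rng(seed)
    obs = mean_pairwise_distance(centroids[flagged])
    n_flagged = flagged.sum()
    sims = np.empty(n_sim)
    for s in range(n_sim):
        idx = rng.choice(len(centroids), size=n_flagged, replace=False)
        sims[s] = mean_pairwise_distance(centroids[idx])
    p = (1 + np.sum(sims <= obs)) / (n_sim + 1)   # smaller distance = more clustered
    return obs, p

# toy example: 200 zone centroids with 12 "significant" zones packed into one corner
rng = np.random.default_rng(1)
centroids = rng.uniform(0, 100, size=(200, 2))
flagged = np.zeros(200, dtype=bool)
flagged[np.argsort(centroids[:, 0] + centroids[:, 1])[:12]] = True
obs, p = second_order_test(centroids, flagged)
print(f"mean pairwise distance of flagged zones = {obs:.1f}, pseudo p = {p:.3f}")
```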

4. Spatial representation issues

Another set of very important but little understood issues concerns spatial representation. It can be observed that much spatial analysis with zonal data has so far been performed with little regard for the basic spatial representation issue. Zonal data typically provide an aggregation of more micro-scale observations. This is true of remote sensing: the spectral values for a pixel reflect the size of the pixel and its homogeneity, in the presence of aggregation and aliasing effects. The same is true of a census zone; data for individual people and households are aggregated, and in the process information is lost and an area profile created that may or may not be a good representation of the micro data that were used to construct it. Sometimes zones might be re-aggregated a number of times. Only if the zones are completely homogeneous with respect to the micro data they represent will there be no possibility of spatial representation error. In practice the heterogeneity of the micro data patterns interacts with the location of zonal boundaries and with zone size to generate all manner of complexity. However, it is important to appreciate that the entities and geographical objects a zoning system might be expected to represent need not only be micro data; they could be much larger geographical features such as a town, village or neighbourhood, and similar spatial representation issues occur here too.

It is tempting to ignore these problems and focus on the more statistical questions. However, from a geographical perspective this cannot be sensible. As spatial statistics becomes more sophisticated, so the interaction with the underlying geographical representation issue cannot be ignored. Otherwise the situation will emerge, if it hasn't already, whereby attempts will be made to overcome the consequences of studying spatial data for artificially fixed and arbitrary geographical units by the application of increasingly clever and sophisticated statistical procedures that can never totally resolve the problems that relate to the geography of the data. However, this is very much the approach of the geographically ignorant statistician. There is currently a tremendous lack of imagination as to how best to deal with the problems of spatial data analysis. For instance, it may be far easier and much more sensible to seek a geographical zoning that simplifies the statistical problems rather than assume the geography is fixed and thus further complicate an already difficult spatial analysis task. Ultimately, no degree of statistical and mathematical sophistication can ever overcome the problem of analysing data for areal entities which are silly from a geographical point of view. It is particularly important in a GIS context to adopt a 1990s rather than a 1930s view of zonal data. In particular the 1990s (and beyond) are an age of increasingly flexible geographic representations, whereas previously they were largely fixed by administrators for their own ends. It is no longer possible simply to assume fixed spatial data except in special circumstances. The same microdata can be given a large number of broadly equivalent but different spatial representations; for example, there are many different ways of aggregating 1 m resolution data to a 1 km grid because there are many different 1 km grids that could be used, created, for instance, by moving the grid origin or by rotating the grid. It is just not sound geographic practice to regard zonal data as fixed. Most spatial statisticians, and regrettably also many geographers, appear to have been slow to accept this point.
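
The point is easily demonstrated. The sketch below aggregates the same hypothetical fine-resolution point records to three equally reasonable 1 km grids that differ only in their origin (rotation is omitted for brevity) and shows that the zonal counts, and hence any subsequent analysis, differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical fine-resolution data: 5000 georeferenced records in a 10 km x 10 km window
points = rng.uniform(0, 10_000, size=(5000, 2))        # coordinates in metres

def aggregate_to_grid(points, cell=1000.0, origin=(0.0, 0.0)):
    """Count records per grid cell for a given cell size and grid origin."""
    ij = np.floor((points - np.asarray(origin)) / cell).astype(int)
    cells, counts = np.unique(ij, axis=0, return_counts=True)
    return counts

# the same micro data aggregated to three equally 'reasonable' 1 km grids
for origin in [(0.0, 0.0), (250.0, 250.0), (500.0, 500.0)]:
    counts = aggregate_to_grid(points, 1000.0, origin)
    print("origin", origin, "-> cells:", counts.size,
          "max count:", counts.max(), "std of counts:", round(float(counts.std()), 1))
```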

The consequences of this failure to reject methods of spatial analysis and views of spatial data that are inappropriate can be severe. For example, in the UK large amounts of public funds are allocated on the basis of simple-minded indexes used to rank areas, usually either local authorities or wards. Simple-minded technology is clearly attractive to end-users, because the results are easy to understand, but the results can also be wrong! Fundamental geographic knowledge and principles are being ignored. For example, if you are ranking "areas" on the basis of a Z or chi-square statistic, how are the effects of spatial autocorrelation being handled? This might seem a little technical, so let's ask an even more fundamental question. If you are ranking or comparing area A with area B, what makes you think that these areal entities are comparable? It is true that they may well be of the same class of areal objects (i.e. Districts or wards) and that people seldom compare Districts with wards because they are perceived to be different. Fine, that is sensible. However, are all Districts comparable? Are some District-with-District comparisons in fact equivalent to comparing a District with a ward, or a District with a county?

If you wish to compare zonal objects then you need to be certain that, from a geographical representation point of view, they are comparable objects. Otherwise there is a real risk that you are comparing chalk with cheese. Given that these areal objects are not natural units (such as a cow or an egg or a person) but are variously subjectively defined zonations of two-dimensional map space that may or may not have any intrinsic meaning, there is immediately a problem of the nature of what is being studied. For example, if you wish to compare unemployment rates in cities then using local government units will not necessarily be that helpful, since they may encompass more than one city in one location and only part of a single city in another. Similarly, if you wish to study deprivation for small areas then wards may provide a good representation of the underlying patterns in large towns but a totally inadequate one in small towns. Likewise, a census enumeration district in a large town centred on a problem housing area may well provide an excellent representation of the local spatial situation, whereas a census enumeration district of the same size centred on the same type of problem housing area in a small town may provide a much poorer representation, due to the smaller scale of the patterns introducing a greater degree of heterogeneity. It is, therefore, highly dangerous to study zonal entities purely because data are available for them. Too many geographically illiterate statisticians actually seem to believe that zonal entities, such as wards, are really meaningful objects. The truth of the matter is that for any particular application some are, and some are not! The ones that are not either wreck the analysis, are poorly and unfairly handled, or spoil the results for other areas.

In the pre-GIS era, it was quite an achievement to gain access to any small area data. The spatial data handling technology needed to allow flexible geographic areal definitions did not exist. People used what was offered or available. However, in the post-GIS era this is no longer the case. The technical constraints have dissolved. The re-discovery of the modifiable areal unit problem in the late 1980s is one reflection of the greatly increased degree of geographic flexibility that now exists. The problem is that it is being treated as an aggregation effect or as an ecological fallacy, whereas in reality it is a reflection of a far more fundamental problem of geographic representation. The Modifiable Areal Unit Problem (MAUP) will disappear once geographers know what areal objects they wish to study. Of course, the answer to this question may well be application-specific, but nevertheless it needs to be faced; otherwise people will continue reporting results for zones that have little meaning or, worst of all, a different degree of meaningfulness in different places. The MAUP needs to be brought under control, not left out of control. It is important to either discover viable ways of designing and engineering application-appropriate zoning systems that make sense and are consistent in the quality of the spatial representation they offer, or devise methods that are robust to the problems caused by varying levels of spatial meaningfulness, or else develop appropriate frame-independent methods. This is a very important challenge for spatial analysts in the GIS era. It is also a fundamental geographical problem that needs to be resolved; see also Fotheringham (1988) for a further discussion of some of the strategic alternatives.

5. Developing new tools for the spatial analysis of modifiable zonal data

The discussion leads to the view that a key area for geographical attention concerns how to analyse spatial information aggregated to zones. It is noted that much of the information that needs spatial analysis is zonal in nature. The challenge is that of discovering methods of analysis that are appropriate for zonal spatial data that are, by their very nature, modifiable. This is probably the most important of all the GIS relevant spatial analysis tasks that still need to be addressed.

It is well known, or should be by now, that the analysis and modelling of spatial data reported for zonal entities is affected by and dependent on (to some varying but usually unknown degree) the zones that are used (Openshaw and Taylor, 1979). The modifiable areal unit problem is endemic to all zonal data and it reflects the absence of a geographical equivalent to natural and indivisible entities for study. The MAUP cannot sensibly be removed, nor is it likely to go away, nor can it be ignored. The zonal entities or objects of spatial study are almost infinitely modifiable and are in practice usually arbitrary, or fixed for reasons which are incidental to the purpose of study but which may, nevertheless, exert an influence over it. GIS and the resulting explosion in spatially referenced data has greatly accentuated the importance of the MAUP. Previously researchers had to use whatever data for whatever zones were available; there was often little or no real choice in the matter. The decisions were made by others. So it was quite understandable for analysts to regard zonal data as fixed and to focus attention on the subsequent analysis and modelling problems for fixed spatial data. GIS has changed all this. Increasingly, the researcher can access sufficiently fine-resolution data in a digital form that allows the design of whatever zoning system is considered best. This process of freeing the researcher from the tyranny of imposed zoning systems is not yet complete, although the concept of user-defined flexible geographic representation is gradually being accepted. For example, in the UK the 1991 census was the first for which the basic spatial data collection units (census enumeration districts) are available in digital form (Openshaw, 1995c). This provides the first real opportunity for census users to re-engineer their zoning systems using building blocks that are sufficiently small to make the enterprise worthwhile (Openshaw and Rao, 1995). However, at present the process is non-ideal; even better would be for the 2001 census to allow the user complete geographic freedom in designing zones, subject to basic confidentiality constraints and assurances.

Unfortunately, allowing users to choose their own zonal representations, a task that GIS trivialises, merely emphasises the importance of the MAUP. The user-Modifiable Areal Unit Problem (UMAUP) has many more degrees of freedom than the classic MAUP and thus an even greater propensity to generate an even wider range of results than before. This is not very helpful unless the variability in results that this causes can either be estimated (as aggregation confidence limits), removed (probably impossible), or controlled (feasible, but we do not yet really know what to control for).

Zoning systems are of central importance to much GIS-inspired spatial analysis, so it is essential that appropriate methods are developed for handling the problems. Zones distort the data they represent: they may add noise, they may create new but false patterns of the ecological fallacy type, they may remove or hide real patterns, and they can easily generate a wide range of different results (although the differences need not always be great). Nevertheless, they are a major source of uncontrolled variation that needs to be explicitly handled rather than ignored. Maybe the statistical analyst's task can be eased by the simple expedient of exploiting the UMAUP to design zones that overcome at least some of the problems, or which better meet certain statistical assumptions, or which yield data that are much more readily analysed using existing methods.

It is disingenuous to argue that the answer is to minimise the impact of the MAUP by using only data at the lowest possible level of aggregation (Goodchild, 1992) or by doing away with zones altogether and moving to frame-independent forms of spatial data representation and analysis (Tobler, 1988). There are problems with both approaches. In the former, there is an assumption that data for finer and finer zones are best, which is false. It ignores the important role of a zoning system as a generalisation operator that can serve a useful spatial analysis function, as well as add noise. Many geographic patterns of interest have an innate scale dependency to them. Finer-resolution data mainly increase flexibility in usage and storage; however, this is not really the problem. Increasingly, tools are needed to generalise and define more abstract geographic patterns and relationships that cannot be seen (or found) at a micro-scale level of detail. Fine zones merely give the user more freedom in the design of geography, but the extra detail often cannot be properly exploited for analysis until zone design can be performed in a purposeful and controlled way. If frame-free methods are used, it is hard to see how all relevant spatial information can be accurately, reliably, and sensibly expressed in a continuous and frameless form. The continuous approximation of discrete information merely introduces other sorts of error, inaccuracy, and distortion. Geographical analysis is about the variation between and within places, which are discrete lumps of space. It is hard to see how and why surface representations of such phenomena constitute a replacement technology. There is too much detail; it is too hard to relate the surfaces back onto the real world, particularly in an object-orientated way; and of course not all (or even many) phenomena of interest can be sensibly re-expressed in this form. In a way it changes a UMAUP based on a finite but large number of possibilities into an IUMAUP where there is an infinite number of possible discretisations, since there are no longer any spatial building-block restrictions. This might be optimal for areal unit interpolation and spatial data manipulation, but it will cause many other problems for spatial analysis and in the end may well prove unhelpful in a spatial analysis context unless linked to clever pattern and object recognition technology (Openshaw, 1994e).

One of the nicest features about zones is that they provide a major simplification of real-world complexity that is naturally geographical. Frameless continuous space representations are more mathematically appealing than geographically relevant. The real problem at present is not really the clumsiness of discrete zonal representations of space but that geographers have not yet faced up to the problems they present. There is a need to bring the MAUP under control and to change attitudes towards it. It is important also to drop the -P; the MAU issue is only a MAUP whilst we ignore it! Openshaw (1978a) argued that it is not a problem to be removed by sophisticated means but a potentially extremely useful spatial analysis tool.

6. Zone design as a spatial analysis tool

There are two principal ways of dealing with zone design: (1) as a problem in the definition or identification of areal entities that make some sense in a particular substantive context (e.g. functional regionalisation, spatial classification to identify residential neighbourhoods), and (2) as a spatial analysis tool that provides a two-dimensional, map-based representation of the interaction between a function (or statistic) that is being optimised and the data being aggregated (e.g. political redistricting, optimal catchment planning). It is important not to confuse the two approaches, and also to develop the second further as a spatial analysis tool of much wider potential utility than its use on specific issues in spatial planning or for optimising the spatial arrangement of facilities in a location-allocation context. Zone design is regarded here as providing a generic basis for the development of new types of spatial analysis applicable to data that are to be analysed via zone-based entities. The principal spatial analysis functions that can be served are shown in Table 3. They are all variations of the same approach in which N building blocks are aggregated into M zones such that some function (or functions) defined on the M zone data is optimised, subject to various constraints on the topology of the M zones (e.g. internal connectivity) and, perhaps also, on the configuration of the M zones and/or on the nature of the data they generate. This Automatic Zoning Problem (AZP) is a hard combinatorial optimisation problem. Plugging in different objective functions and sets of constraints allows it to handle the spatial analysis tasks described in Table 3.
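
To make the AZP concrete, a deliberately naive local-search sketch is given below: N building blocks on a toy lattice are grown into M contiguous zones, and border blocks are then moved between neighbouring zones whenever the move improves the objective function without breaking the donor zone's internal connectivity. The lattice, the equal-population objective and the simple hill-climbing move set are illustrative assumptions only; Openshaw and Rao (1995) and Openshaw and Schmidt (1995) describe far more capable search procedures, including simulated annealing.

```python
import numpy as np
from collections import deque

def grid_adjacency(rows, cols):
    """Rook adjacency for building blocks laid out on a rows x cols lattice."""
    adj = {i: set() for i in range(rows * cols)}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:
                adj[i].add(i + cols)
                adj[i + cols].add(i)
            if c + 1 < cols:
                adj[i].add(i + 1)
                adj[i + 1].add(i)
    return adj

def seed_partition(adj, m, rng):
    """Grow M contiguous zones outwards from random seed blocks (a crude initial zoning)."""
    n = len(adj)
    zone = -np.ones(n, dtype=int)
    frontier = deque()
    for z, s in enumerate(rng.choice(n, size=m, replace=False)):
        zone[s] = z
        frontier.append(s)
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if zone[v] == -1:
                zone[v] = zone[u]
                frontier.append(v)
    return zone

def still_contiguous(members, adj):
    """Is this set of building blocks internally connected?"""
    members = set(members)
    if not members:
        return False
    start = next(iter(members))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in members and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(members)

def azp(adj, zone, objective, n_sweeps=50, seed=0):
    """Naive AZP-style local search: move border blocks into neighbouring zones whenever
    the move lowers the objective and does not split or empty the donor zone."""
    rng = np.random.default_rng(seed)
    zone = zone.copy()
    best = objective(zone)
    n = len(zone)
    for _ in range(n_sweeps):
        improved = False
        for u in rng.permutation(n):
            donor = zone[u]
            donor_rest = [i for i in range(n) if zone[i] == donor and i != u]
            if not donor_rest or not still_contiguous(donor_rest, adj):
                continue  # moving u would empty or disconnect its current zone
            for target in {zone[v] for v in adj[u]} - {donor}:
                zone[u] = target
                val = objective(zone)
                if val < best:
                    best, improved = val, True
                    break
                zone[u] = donor  # reject the move and restore
        if not improved:
            break
    return zone, best

# toy problem: aggregate a 20 x 20 lattice of building blocks into 8 zones whose
# populations are as equal as possible (one of many objectives that could be plugged in)
rows, cols, m = 20, 20, 8
rng = np.random.default_rng(2)
pop = rng.integers(50, 500, size=rows * cols)
adj = grid_adjacency(rows, cols)

def equal_population(zone):
    totals = np.bincount(zone, weights=pop, minlength=m)
    return float(((totals - pop.sum() / m) ** 2).sum())

initial = seed_partition(adj, m, rng)
final, score = azp(adj, initial, equal_population)
print("initial objective:", equal_population(initial), " final objective:", score)
```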

A zoning system is a very simple form of description. The boundaries may themselves have a substantive meaning; for example, representing edges or changes in the underlying patterns, as found in multivariate regionalisation or in traditional geographical regional description. The areas encompassed by the boundaries may also have a geographic meaning; for example, units of space that can be labelled for descriptive purposes, either as being in some way distinctive or similar entities (e.g. of approximately equal population size) or as being dominated by different characteristics (e.g. regional types). The zonation functions needed to serve these descriptive needs might include: minimising multivariate within-region sums of squares, or equalising functions based on areal configuration (e.g. shape compaction) or on data generated by the regions (e.g. approximate uniformity of population size or of the number of households with two cars, etc). The resulting geographical partitioning of space has a value as a visualisation or visible spatial description of geographic patterns of potential substantive interest. The zoning system acts as a spatial pattern detector.

The challenge of spatial pattern capture is only slightly different. Here the need is to re-arrange the data so that zones are created which are highly kurtotic with respect to a variable or indicator of interest. For example, in searching for excess disease zones, the objective function could be to maximise the kurtosis of the disease rate in question. The aim is to find excessively deficient as well as excessively large zonal anomalies. A minimum zone size criterion should also apply to avoid identifying small-number problems in the database. The locations and configurations of the highest-rate zones would be of direct interest, as well as the magnitude of the rates obtained. The zoning system acts here as a spatial pattern finder. If it cannot be engineered to produce a desired result, then clearly it cannot be achieved and the associated hypothesis should be rejected. If the target result can be obtained by zone design (or gerrymandering or boundary tightening) then the question arises as to whether it means anything. An indication of meaningfulness can be provided by Monte Carlo simulation: if the incidence of the disease were randomly distributed throughout the at-risk population, how often could equal or more extreme results be obtained by re-applying the same approach a few hundred or several thousand times? Maybe it is possible to use the increasing availability of high-performance computers to compute our way out of this and related problems in spatial analysis. The emerging new era of parallel supercomputing certainly offers a new tool that is very relevant to many previously poorly handled problems in spatial analysis. Switching to a computationally intensive methodology provides the basis for much more flexible and more realistic approaches.
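
Within such a search, pattern capture amounts to plugging in a different objective function. The sketch below (an illustration, not a prescription; the minimum-size threshold and penalty are arbitrary assumptions) returns the negative kurtosis of the zonal disease rates, penalised for undersized zones, so that a minimising zone-design search is driven towards zonings with excessively high and excessively low rate zones. The accompanying Monte Carlo loop indicates the style of benchmark described above, although in a full analysis the zone-design search itself would be re-run for every simulated allocation of cases.

```python
import numpy as np

MIN_POP = 5000        # illustrative minimum at-risk population per zone (an assumption)

def pattern_capture_objective(zone, cases, at_risk, m):
    """Zone-design objective for pattern capture: return the negative kurtosis of the
    zonal disease rates (so that a minimising search seeks zones with excessively high
    and excessively low rates), penalised for zones below the minimum at-risk size."""
    pop = np.bincount(zone, weights=at_risk, minlength=m)
    obs = np.bincount(zone, weights=cases, minlength=m)
    rate = np.where(pop > 0, obs / np.maximum(pop, 1.0), 0.0)
    z = (rate - rate.mean()) / (rate.std() + 1e-12)
    kurtosis = (z ** 4).mean()                  # heavy tails = extreme rate zones
    penalty = 1e3 * np.sum(pop < MIN_POP)       # crude small-number safeguard
    return -kurtosis + penalty

# hypothetical data for n building blocks assigned (here arbitrarily) to m zones
rng = np.random.default_rng(0)
n, m = 400, 8
zone = rng.integers(0, m, size=n)
at_risk = rng.integers(100, 2000, size=n).astype(float)
cases = rng.binomial(10, 0.01, size=n).astype(float)
observed = pattern_capture_objective(zone, cases, at_risk, m)

# Monte Carlo benchmark: redistribute the same total number of cases at random over the
# at-risk population; in a full analysis the zone-design search itself would be re-run
# for each replicate rather than the objective simply re-evaluated on a fixed zoning
total = int(cases.sum())
sims = [pattern_capture_objective(zone,
                                  rng.multinomial(total, at_risk / at_risk.sum()).astype(float),
                                  at_risk, m)
        for _ in range(199)]
rank = 1 + sum(s <= observed for s in sims)     # smaller objective = more extreme pattern
print("observed objective:", observed, " Monte Carlo rank:", rank, "of", len(sims) + 1)
```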

Testing spatial hypotheses by attempting to identify zoning systems that match the hypothesis of interest, or which would deny it, is another powerful way of performing explicitly spatial analysis with zonal data. For example, if you define a cancer hot spot as having twice the average rate, then can you define zones that match this target and, if so, what do they look like? How often would you obtain a similar result if the zones were generated randomly? Another example would be that you wish to test the hypothesis that there is an abnormally high cancer rate near point x, y. Proceed as before and see whether one or more hot spots occur near this location, and whether such results have a very low frequency of occurrence in the random zonations that are examined. Note that purely random zone design along the lines of Openshaw's (1977a) algorithm is not relevant here, as distinct from its use in estimating aggregation-induced confidence intervals. The arbitrary zoning systems that exist in the world are highly non-random, being characterised by at least a degree of uniformity of shape or compaction, within some size distribution. More appropriate would be other types of randomly generated zonations that possess a relatively high degree of compaction, or which occur within a given size range. There is no reason why this cannot be done; one possible generator is sketched below.
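
One possible generator of such constrained random zonations is sketched here: M regions are grown outwards from random seeds in round-robin fashion over a building-block adjacency graph, which tends to produce contiguous, reasonably compact zones of broadly similar size. This is an illustrative assumption about how the randomisation might be done, not a description of any published algorithm.

```python
import numpy as np
from collections import deque

def grid_adjacency(rows, cols):
    """Rook adjacency for building blocks on a rows x cols lattice (as in the earlier sketch)."""
    adj = {i: set() for i in range(rows * cols)}
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:
                adj[i].add(i + cols)
                adj[i + cols].add(i)
            if c + 1 < cols:
                adj[i].add(i + 1)
                adj[i + 1].add(i)
    return adj

def compact_random_zones(adj, m, rng):
    """One random zoning of the building blocks into M contiguous zones, produced by
    growing all M regions outwards from random seeds in round-robin fashion; this tends
    to give reasonably compact zones of broadly similar size rather than long straggles."""
    n = len(adj)
    zone = -np.ones(n, dtype=int)
    seeds = rng.choice(n, size=m, replace=False)
    frontiers = []
    for z, s in enumerate(seeds):
        zone[s] = z
        frontiers.append(deque([s]))
    assigned = m
    while assigned < n:
        for z in range(m):
            while frontiers[z]:
                u = frontiers[z][0]
                free = [v for v in adj[u] if zone[v] == -1]
                if free:
                    v = free[rng.integers(len(free))]
                    zone[v] = z
                    frontiers[z].append(v)
                    assigned += 1
                    break
                frontiers[z].popleft()  # this block has no unassigned neighbours left
    return zone

# usage: a handful of random compact zonations of a 20 x 20 lattice into 8 zones, of the
# kind that could be used to build a reference distribution for an observed zone-design result
rng = np.random.default_rng(3)
adj = grid_adjacency(20, 20)
for k in range(3):
    zone = compact_random_zones(adj, 8, rng)
    print("zonation", k, "zone sizes:", np.bincount(zone, minlength=8))
```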

A final application is that of spatial modelling. Traditional spatial statistical and mathematical models are fitted to fixed spatial data and are thus prone to MAUP effects. In the emerging UMAU era, the user can improve the performance of a model by seeking a spatial representation (i.e. zoning system) for it that matches it in some way. Traditional zone design is usually performed as an a priori process. However, it can be developed further and used as a tool for optimising model fit; see Openshaw (1977b, 1978a, b). Clearly, if zones are engineered such that they yield data that conform to what a model can represent, then the level of model error will decrease. Openshaw (1984) argued that this is a form of spatial calibration, the geographer's equivalent to parameter estimation. If it makes no apparent statistical sense to optimise model performance by fine-tuning the zoning system so that it fits any reasonable but arbitrary set of parameter values, does it make any geographical sense to fine-tune the model's parameters to fit data generated by any reasonable but arbitrary zoning system? A much more realistic approach is needed. The optimal zoning system that best fits a model must be saying something about the nature of the model and how it interacts with the spatial data. The zoning system provides a visualisation of this interaction; it is therefore the natural domain in which to learn and discover how the model works in a spatial context. Another way of asking the same question is as follows: what must the zoning system look like if a model is to have a good fit? It would be interesting, indeed, to be able to visualise this model-zoning system interaction in real time, turning the level of data scale and resolution (i.e. the number of zones being used) up or down as and when appropriate. On top of this display a backcloth could be shown, giving aspects of the data and various graphical views of the insides of the model.
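
In zone-design terms, spatial calibration again reduces to a choice of objective function: aggregate the building-block data to the M zones, fit the model to the zonal data, and return its error. The sketch below does this for a deliberately trivial model (ordinary least squares of zonal y on zonal x); the data and the model are illustrative assumptions, but the same wrapper applies to a spatial interaction model or any other zone-based model.

```python
import numpy as np

def model_error_objective(zone, x_block, y_block, m):
    """Zone-design objective for spatial calibration: aggregate the building-block data
    to the M zones, fit a simple model to the zonal data (here ordinary least squares of
    zonal y on zonal x), and return its residual sum of squares. Plugged into a
    zone-design search, this asks what the zoning must look like for the model to fit well."""
    X = np.bincount(zone, weights=x_block, minlength=m)
    Y = np.bincount(zone, weights=y_block, minlength=m)
    A = np.column_stack([np.ones(m), X])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ coef
    return float(resid @ resid)

# hypothetical building-block data and an arbitrary assignment of blocks to m zones
rng = np.random.default_rng(4)
n, m = 400, 8
x_block = rng.uniform(0, 100, size=n)
y_block = 2.0 * x_block + rng.normal(0, 25, size=n)
zone = rng.integers(0, m, size=n)
print("zonal model error for this arbitrary zoning:",
      model_error_objective(zone, x_block, y_block, m))
```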

7. Conclusions

One key requirement in developing new zone design based spatial analysis methods is an ability to re-aggregate N small zones (ideally the finest available building blocks) into M larger regions such that a general function computed from the aggregated data is optimised. The function may be directly computed from the data or else be based on other computations applied to the data, for example, a measure of model error. Openshaw and Rao (1995) describe zone re-engineering techniques that can deal with these general problems. The only remaining requirement is to incorporate constraint handling procedures into the most sophisticated methods. The final task is then to provide parallel versions of the algorithms so that large N values can be efficiently handled. Openshaw and Schmidt (1995) present a parallel algorithm for zone design that is scalable and, as a result, offers the potential of being able to handle problems of virtually any size, given a sufficiently fast parallel supercomputer.
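
The parallelism being appealed to is not sketched here in the form used by Openshaw and Schmidt (1995); the fragment below merely illustrates the simplest, embarrassingly parallel idea of running many independent random-restart searches at once and keeping the best result, with a toy objective and with the contiguity constraints ignored for brevity.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def one_restart(seed, n=400, m=8, n_trials=2000):
    """One independent random-restart search: score many random zonings of n building
    blocks into m zones (contiguity is ignored in this toy) against an equal-population
    objective and return the best value found."""
    data_rng = np.random.default_rng(0)                 # shared synthetic block populations
    pop = data_rng.integers(50, 500, size=n)
    target = pop.sum() / m
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_trials):
        zone = rng.integers(0, m, size=n)
        totals = np.bincount(zone, weights=pop, minlength=m)
        best = min(best, float(((totals - target) ** 2).sum()))
    return best

if __name__ == "__main__":
    # embarrassingly parallel restarts: each worker searches independently and only the
    # best result is kept, so throughput scales with the number of processors available
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(one_restart, range(8)))
    print("best objective over 8 parallel restarts:", min(results))
```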

Fotheringham and Ding (1994) clearly have the right concept when they describe a Spatial Aggregation Machine (SAM), although it does no more than randomly generate zoning systems. Openshaw and Rao's (1995) Zone Design System (ZDES) deals with the optimisation problem as well, but not with the related spatial analysis tools. This chapter suggests how both the SAM and ZDES now need to be extended to provide a full set of spatial analysis tools for zonal data analysis. The pieces exist, the data almost exist, and the latest generation of high-performance computers is just about fast enough. The challenge now is to do it and investigate further how best to exploit zone design as a powerful, explicitly geographical spatial analysis tool relevant to GIS.


Table 1 Basic generic spatial analysis needs

_________________________________________________________

pattern measures and spotters

relationship measurers and finders

data simplifiers

spatial response modellers

fuzzy pattern recognisers

visualisers and animators

___________________________________________________________


Table 2 Ten GISable spatial analysis criteria

___________________________________________________________

Can handle large N values

Study region invariant

Sensitive to the nature of spatial data

Results are mappable

Generic analysis procedures

Useful and valuable

Interfacing problems are irrelevant

Ease of use and understandable

Safe technology

Applied technology

___________________________________________________________

Source: Openshaw (1994b): Table 2


Table 3 Spatial analysis by zones (SAZ)

___________________________________________________________

spatial description

visualisation

pattern capture

testing spatial hypotheses

spatial modelling

___________________________________________________________


References

Anselin, L., 1994, 'Exploratory spatial data analysis and geographic information systems', in New Tools for Spatial Analysis, Eurostat, Luxembourg, 45-54

Anselin, L., 1992, 'SPACESTAT: a program for the analysis of spatial data', NCGIA, Santa Barbara, CA

Anselin, L., Dodson, R., Hudak, S., 1993, 'Linking GIS and spatial data analysis in practice', Geographical Systems 1, 3-23

Besag, J., Newell, J., 1991, 'The detection of clusters in rare diseases', Journal of the Royal Statistical Society A 154, 143-155

Carr-Hill, R.A., Sheldon, T.A., Smith, P., Martin, S., Peacock, S., Hardman, G., 1994, 'Allocating resources to health authorities: development of method for small area analysis of use of inpatient services', British Medical Journal 309, 1046-1049

Cliff, A., Ord, J.K., 1973, Spatial Autocorrelation, Pion, London

Fotheringham, S., 1988, 'Scale-independent spatial analysis', in M Goodchild and S Gopal (eds) The Accuracy of Spatial Databases, Taylor and Francis, London, 221-228

Getis, A., Ord, K., 1992, 'The analysis of spatial association by the use of distance statistics', Geographical Analysis 24, 189-206

Goodchild, M., 1992, 'Research Initiative 1: Accuracy of spatial databases final report' NCGIA, Santa Barbara, CA

Haslett, J., Wills, G., Unwin, A., 1990, 'SPIDER: an interactive statistical tool for the analysis of spatially distributed data', Int. J. of GIS 4, 285-296

Haslett, J., Bradley, R., Craig, P., Unwin, A., Wills, G., 1991, 'Dynamic graphics for exploring spatial data with applications to locating global and local anomalies', The American Statistician 45, 234-242

Nass, C., Garfinkle, D., 1992, 'Localised autocorrelation diagnostic statistic (LADS) for spatial models: conceptualisation, utilisation and computation', Regional Science and Urban Economics 22, 333-346

Openshaw, S., 1977a, 'Algorithm 3: a procedure to generate pseudo random aggregations of N zones into M zones, where M is less than N', Environment and Planning A 9, 1423-1428

Openshaw, S., 1977b, 'Optimal zoning systems for spatial interaction models', Environment and Planning A 9, 169-184

Openshaw, S, 1978a, 'An optimal zoning approach to the study of spatially aggregated data', in I. Masser, P.J.B. Brown (eds) Spatial Representation and Spatial Interaction Martinus Nijhoff, Boston, 95-113

Openshaw, S., 1978b, 'An empirical study of some zone design criteria', Environment and Planning A 10 781-794

Openshaw, S., Taylor, P.J., 1979, 'A million or so correlation coefficients: three experiments on the modifiable areal unit problem' in N. Wrigley (ed) Statistical Applications in the Spatial Sciences Pion, London

Openshaw, S., 1984, 'The modifiable areal unit problem', GeoAbstracts, Norwich

Openshaw, S., Charlton, M., Wymer, C., Craft, A., 1987, 'A mark I Geographical Analysis Machine for the automated analysis of point data sets', Int. J. of GIS 1, 335-358

Openshaw, S., Cross, A., Charlton, M., 1990, 'Building a prototype geographical correlates exploration machine', Int. J. of GIS 3, 297-312

Openshaw, S., 1991, 'Developing appropriate spatial analysis methods for GIS', in D Maguire, M F Goodchild, D Rhind (eds) GIS Principles and Applications Volume 1 Longman, London 389-402

Openshaw S., Craft, A., 1991, 'Using Geographical Analysis Machines to search for evidence of clusters and clustering in childhood leukaemia and non-Hodgkin Lymphomas in Britain' in G Draper (ed) The geographical epidemiology of childhood leukaemia and non-Hodgkin Lymphomas in Great Britain, 1966-83 OPCS, HMSO, London

Openshaw, S., 1994a, 'GIS crime and spatial analysis', Proceedings of the GIS and Public Policy Conference, Ulster Business School, Ulster, 22-35

Openshaw, S., 1994b, 'Social costs and benefits of the census', Proceedings of the XVth International Conference of the Data Protection and Privacy Commissioners, Manchester, 89-97

Openshaw, S., 1994c, 'What is GISable spatial analysis?', in New Tools for Spatial Analysis, Eurostat, Luxembourg 36-44

Openshaw, S., 1994d, 'A framework for research on spatial analysis relevant to geo-statistical information systems', in New Tools for Spatial Analysis, Eurostat, Luxembourg, 157-162

Openshaw, S., 1994e, 'A concepts rich approach to spatial analysis, theory generation, and scientific discovery in GIS using massively parallel computing', in M F Worboys (ed) Innovations in GIS Taylor and Francis, London p123-138

Openshaw, S., Rao, L., 1995, 'Algorithms for re-engineering 1991 census geography', Environment and Planning A (forthcoming)

Openshaw, S., Schmidt, J., 1995, 'A parallel simulated annealing algorithm for re-engineering zoning systems', (forthcoming)

Openshaw, S., 1995a, 'Human systems modelling as a new grand challenge area in science', Environment and Planning (forthcoming)

Openshaw, S., 1995b, 'Developing automated and smart spatial pattern exploration tools for geographical information systems applications', The Statistician (forthcoming)

Openshaw, S., 1995c, The Census Users Handbook, Longman, London

Tobler, W., 1988, 'Frame independent spatial analysis', in M Goodchild and S Gopal (eds) The Accuracy of Spatial Databases, Taylor and Francis, London, 115-122