Home
 

Ascending Hierarchical Classification

Inputs Continuous data (unlimited number)
or
1 Specific P.C.A. Matrix
Output Qualitative data

Display :

Description :
This module enables an automatic classification of geographical entities to be obtained as a function of a number of statistical data items.
The entities on the working map are called the individual values, and the input data are called the variables.
The Ascending Hierarchical Classification consists of carrying out progressive grouping of individual values in accordance with their degrees of similarity to obtain a single class that groups them all.
Once this calculation has been made, the individual values are divided up into various classes; the class number is fixed by the user.

This module can use a Principal Component Analysis module as an input: the data will then be the coordinates of the individual values on the m first axes (m being the number of axes, varying from 1 to the number of inputs from the P.C.A. module).

Calculations:

  • Stage 1:
    We consider the n individual values characterized by their values as compared with the p variables, and we determine the 2 individual values that have the highest degree of similarity as compared with this set of variables. These 2 individual values are grouped to form a class.
  • Stage 2:
    On this level, we have (n-1) classes, one made up of the 2 individual values grouped above, and the others containing only a single individual value. We can then determine the 2 classes that have the highest degree of similarity, and group them.
  • Following stages:
    We repeat the same operation until we obtain a single class grouping all the individual values.

This procedure entails 2 choices:

  1. Determining a criterion of similarity between the individual values. The euclidian distance is used.
  2. Determining a distance between classes. This process is called an aggregation criterion; we can propose several criteria.

Aggregation criteria:

  • nearest neighbour
    The distance between 2 classes c1 and c2 is defined as the smallest distance separating an individual value in c1 from an individual value in c2:
       
  • maximum diameter
    The distance between 2 classes c1 et c2 is defined as the greatest distance separating an individual value in c1 from an individual value in c2:
       
  • average distance
    This criterion consists of calculating the average distance between all the elements of c1 and all the elements of c2:
       
            where ni = number of elements in class ci
  • centers of gravity
    The distance between 2 classes c1 and c2 is defined as the distance between their centers of gravity:
       
            where Gi = center of gravity of class ci
  • centered moment of order 2
    This criterion consists of looking for a partition giving minimum internal variance within the classes and maximum variance between the classes:

            where refers to the inertia of the partition under which c1 and c2 are separated
                      refers to the inertia of the partition under which c1 and c2 are together
                       Gi = center of gravity of class ci
    i.e.:
     
            where Gi = center of gravity of the class ci
                       G = center of gravity of the cloud
                       mi = relative mass of the class ci =

The difficulty in the choice of the aggregation criterion is that these criteria can lead to differing results.
The most widely used criterion is that of the centered moment of order 2, applied to a centered, reduced matrix.

Numerical example:
Initial matrix:

 

v1

v2

v3

A

4

2

4

B

2

1

6

C

3

4

4

D

4

5

2

E

2

4

5

Table of distances between individual values:

d2

A

B

C

D

E

A

0

       

B

9

0

     

C

5

14

0

   

D

13

36

6

0

 

E

9

10

2

14

0

The individual values C and E can clearly be seen to have the highest level of similarity with a squared distance of 2 (the lowest value in the table).
They are thus grouped together in a class, and the first phase of classification, independent of the pooling criterion, has been finished.

Example of A.H.C. using the criterion of maximum diameter:

Table of distances between classes:

d2

A

B

{C,E}

D

A

0

     

B

9

0

   

{C,E}

9

14

0

 

D

13

36

14

0

The problem here is finding out which of the classes are closest: {A} and {B}, or {A} and {C,E}. No criterion other than arbitrary choice enables us to find the best aggregation. We choose the distance that was calculated first: grouping {A} and {B}.

Table of distances between classes:

d2

{A,B}

{C,E}

D

{A,B}

0

   

{C,E}

14

0

 

D

36

14

0

There is a problem here, too, to find the closest classes: {C,E}and {D}, or {A,B}and {C,E}. We choose, for example, to group {C,E}and {D}.

There are thus 2 classes left:
{A,B} and {C,D,E}

The following hierarchical tree is obtained:

If we carry out a division in 2 classes, we obtain: {A,B} and {C,D,E}.
If we carry out a division in 3 classes, we obtain: {A,B}, {C,E} and {D}.

With the criterion of the nearest neighbour, we obtain the following tree:

If we carry out a division in 2 classes, we obtain: {A,C,D,E} and {B}.
If we carry out a division in 3 classes, we obtain: {A,C,E}, {B} and {D}.

The results are thus different depending on the pooling criterion chosen.

Elements of vocabulary:
A rather precise set of vocabulary is used within the framework of ascending hierarchical classification. We present it here using the numerical example above.
The individual values A, B, C, D and E form classes with a single element, called terminal classes.
A group of 2 classes is called a node.
The first of these nodes carries the number of the total number of terminal classes +1 ( = n+1). Here, the number is 6.
Each group gives rise to a new node whose number follows that of the preceding group.
The last node will thus have the number: 2 x n -1. Here, the number is thus 2 x 5 - 1 = 9.
Each node is thus made up of 2 classes that we call its successors; one is called the elder and the other the junior (chosen arbitrarily).

By extension, we also designate the individual values using a node number between 1 and n. If you need to know the exact node number corresponding to a particular individual value, you must click on the Save in a file button and read, in the text file created, the paragraph entitled "The classes of the hierarchy".

With the criterion of the nearest neighbour, we thus obtain the following tree:

             which is the equivalent of           

Node 8 has thus as its successors classes {D} and 7. Arbitrarily, {D} is designated as the elder and 7 as the junior.

The level indices enable us to measure and represent the importance of the difference between 2 classes that we are grouping. They are defined by the distance between the 2 classes forming the node.
In this way, node 6 has the square root of 2 level index (distance between the elements C and E, see above).
These indices are very useful in studying the hierarchical tree.

Interpretation:
The graph shows the hierarchical tree obtained after classification, together with the histogram of the level indices. It depends on the number of nodes drawn, which varies from 3 to 21. If you modify the number, the graph changes as a result.

If you modify the window size, the drawing adapts itself automatically to the new size.

Choice of class number:
The user must choose the desired class number. The number varies between 1 and the number of entities in the working map.
A choice based on the level index is a judicious one. It is determined by reading off the histogram of level indices: the user must notice large leaps in the values of the level index. If these leaps concern the k last nodes of the tree, a division in (k+1) classes is appropriate.
For example, on the display copy at the top of the present document, we can see that the level index shows large leaps for the last five nodes in the tree. The histogram is then more regular. A division in 6 classes would thus seem appropriate.

To make it easier to read off the histogram and find large leaps in the level index, we advise you to increase to the maximum the number of nodes to be drawn.
If you need to know the exact value of a level index, move the mouse pointer onto the corresponding histogram; the value of the index is displayed in the text area below the graph.

Building up the classes:
To build up a division into 4 classes, we divide the hierarchy starting from the final node. The successors of this node are the classes i and j. At this level, we thus have a division in 2 classes. We then subdivide the highest node in the hierarchy (i for example). This node is made up of the 2 classes k and l, and at this level we have 3 classes: j, k and l. The highest node is again subdivided into 2 classes to obtain the desired division into 4 classes.

Results:
By clicking on the Save in a file button, you create a tabulated text file in which the A.H.C. results are inscribed.

In this file, you will find:

  • Your choices concerning the parameters of the module that have an effect on the classification:
    - The aggregation criterion selected.
    - If the Center and reduce the matrix box has been checked, this choice will also be inscribed.
    - The class number chosen.
  • The list of the non-valid individual values, if any.
    An individual value is considered to be non-valid if at least one of the data items of the input does not have a value associated with the individual value.
  • The overall inertia of the cloud.
  • The classes of the hierarchy:
    This table enables us to analyse in detail the resemblances between individual values. It also enables the content of the classes corresponding to the division chosen to be established.
    This is a table in which each line corresponds to a node in the hierarchy. For each line, we have the following columns:
    - Number of the class (if the node corresponds to a class)
    - Number of the node
    - Elder of the node
    - Junior of the node
    - Number of elements contained in the node
    - Identifying elements of the individual values making up the node
  • The parameters of the hierarchy:
    This is a table in which each line corresponds to a node in the hierarchy. For each line, we have the following columns:
    - Number of the node
    - Level index of the node
    - Inertia rate of the node (e.g.: if it is shown as 230, this means that the node represents 23 % of the overall inertia)
    - Sum total of the inertia rates (e.g.: if it is shown as 548 for the kth last node in the tree, this means that 54.8 % of the differences between individual values are expressed through this division into (k+1) classes)
  • Interpreting aids:
    This table shows some general parameters on each of the nodes. It enables us to set out directly the characteristics of each class of the division.
    This is a table in which each line corresponds to a node in the hierarchy. For each line, we have the following columns:
    - Number of the class (if the node corresponds to a class)
    - Number of the node
    - Relative mass of the node
    - Contribution of the node to the dispersion of the cloud
    - Deviation separating the node from the general average
    - In the following columns: deviation of the node as compared with the average of each variable. The highest values, in absolute terms, of these columns indicate which variables best characterize the node (e.g.: for a given class, if the variable V1 has a value of 1500, V2 2700, V3 100, V4 200, V5 -1200, then the characters contributing the most to differentiate this class from the group of the other individual values are, in this order: the importance of the characters V2 and then V1, and the relative weakness of the character V5; the other variables make only an insignificant contribution to the specificity of this class).

Formulas used:
Total inertia of the cloud:
   
        where G = center of gravity of the cloud
                   mi = relative mass of the individual value i =

Rate of inertia of the node J:
   
        where I(J) = level index of the node J

Relative mass of the class c:
   

Contribution of the class c to the dispersion of the cloud:
   
        where mc = relative mass of the class c =

Deviation separating the class c from the overall average:
   

Remark:
All the values given in the paragraphs "The parameters of the hierarchy" and "Interpreting aids" are multiplied by 1000, except RHO2 which is multiplied by 100.
This method has been adopted by most data analysis programs; its purpose is to simplify the presentation of the results.

Graphics :
Under the graphic, you will find a button : that creates a "graphical window" from current drawing. This window let you compare graphics, export them, or print them.

For further details, you can consult the following book:
L'analyse statistique des données en géographie. (Statistical data analysis in geography). Lena Sanders. Collection Alidade. Groupement d'Intérêt Public Reclus. Montpellier. 1989.

Script:

2      module untyped_list ""
3        mod_type integer "103"
3        mod_subtype integer "527"
3        mod_name string "CAH"
3        mod_dads integer_list ""
4          ? integer "5"
4          ? integer "8"
4          ? integer "9"
4          ? integer "4"
4          ? integer "6"
4          ? integer "7"
3        class_nb integer "6"
3        cah_type integer "705"
3        center_matrix boolean "T"
3        draw_nb integer "17"
3        ind_nb integer "20"
3        var_nb integer "6"
3        acp_dad boolean "F"
3        axis_nb integer "1"

Values for typical hac:
nearest neighbour          701
maximum diameter           702
average distance           703
centers of gravity         704
centered moment of order 2 705


Samples