Create clusters based on dissimilarity matrix

This script creates clusters of species based on a distance matrix between those species. Several metrics are computed to evaluate these clusters, and a graphic is produced to help the user choose the best number of clusters.

PRE_FATE.speciesClustering_step1(mat.species.DIST, opt.no_clust_max = 15)

Arguments

mat.species.DIST: a dist object, or a list of dist objects (one for each GROUP value), corresponding to the dissimilarity distance between each pair of species.
Such an object can be obtained with the PRE_FATE.speciesDistance function.
opt.no_clust_max: (optional) default 15.
an integer corresponding to the maximum number of clusters to be tested for each distance matrix

Value

A list containing one list, one data.frame with the following columns, and two ggplot2 objects :

clust.dendrograms

a list with as many objects of class hclust as data subsets

clust.evaluation

GROUP: name of data subset
no.clusters: number of clusters used for the clustering
variable: evaluation metrics' name
value: value of evaluation metric

plot.clustMethod

ggplot2 object, representing the different values of metrics to choose the clustering method

plot.clustNo

ggplot2 object, representing the different values of metrics to choose the number of clusters

One PRE_FATE_CLUSTERING_STEP1_numberOfClusters.pdf file is created containing two types of graphics :

clusteringMethod: to account for the chosen clustering method
numberOfClusters: for decision support, to help the user to choose the adequate number of clusters to be given to the PRE_FATE.speciesClustering_step2 function
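
The clust.evaluation table described above is in long format (one row per group, number of clusters and metric), so it can also be queried directly. The following is a minimal sketch on a made-up table with the same column names: toy.eval and its random values are placeholders for illustration only, not real results, and av.sil is used as an example of a metric to maximize.

```r
# Hypothetical table mimicking the columns of clust.evaluation
set.seed(7)
toy.eval <- expand.grid(GROUP = c("Phanerophyte", "Herbaceous"),
                        no.clusters = 2:6,
                        variable = c("mdunn", "av.sil"),
                        stringsAsFactors = FALSE)
toy.eval$value <- runif(nrow(toy.eval))  # placeholder values, not real results

# Keep one metric that should be maximized (average silhouette width)
sil <- toy.eval[toy.eval$variable == "av.sil", ]

# For each group, the number of clusters with the highest value
best.k <- sapply(split(sil, sil$GROUP),
                 function(tab) tab$no.clusters[which.max(tab$value)])
print(best.k)
```

In practice several metrics rarely agree, which is why the .pdf output shows them side by side rather than returning a single answer.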

Details

This function builds dendrograms from a dissimilarity matrix between species.

As for the PRE_FATE.speciesDistance function, clustering can be run for data subsets, provided that mat.species.DIST is given as a list of dist objects (instead of a single dist object).
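
To illustrate the expected input shape, the sketch below builds a named list of dist objects, one per group, from made-up trait values. The table toy.traits, its columns and the use of Euclidean distance are assumptions for this example only; real distances would normally come from PRE_FATE.speciesDistance.

```r
# Hypothetical trait table: 12 species split into two groups
set.seed(1)
toy.traits <- data.frame(species = paste0("sp", 1:12),
                         GROUP = rep(c("A", "B"), each = 6),
                         height = runif(12),
                         dispersal = runif(12),
                         stringsAsFactors = FALSE)

# One dist object per GROUP value, stored in a named list
toy.DIST.list <- lapply(split(toy.traits, toy.traits$GROUP), function(tab) {
  mat <- as.matrix(tab[, c("height", "dispersal")])
  rownames(mat) <- tab$species   # keep species names as labels
  dist(mat)                      # Euclidean distance, as a placeholder
})

str(toy.DIST.list)
```

The list names play the role of the GROUP values reported in the outputs.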

The process is as follows :

1. Choice of the optimal clustering method

Hierarchical clustering on the dissimilarity matrix is performed with the hclust function.

Several methods are available for the agglomeration : complete, ward.D, ward.D2, single, average (UPGMA), mcquitty (WPGMA), median (WPGMC) and centroid (UPGMC).
Mouchet et al. (2008) proposed a measure of dissimilarity between the input distance and the one obtained with the clustering, which must be minimized to help find the best clustering method : $$ 1 - cor( \text{mat.species.DIST}, \text{clustering.DIST} ) ^ 2 $$

For each agglomeration method, this measure is calculated. The method that minimizes it is kept and used for further analyses (see .pdf output file).
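
The selection step above can be sketched with base R alone. This is a minimal illustration, not the package's actual code: toy.DIST is a hypothetical stand-in for mat.species.DIST, and the cophenetic distances of each tree are assumed to play the role of clustering.DIST.

```r
# Hypothetical distance matrix between 20 made-up species
set.seed(42)
toy.traits <- matrix(rnorm(20 * 3), nrow = 20,
                     dimnames = list(paste0("sp", 1:20), NULL))
toy.DIST <- dist(toy.traits)

methods <- c("complete", "ward.D", "ward.D2", "single",
             "average", "mcquitty", "median", "centroid")

# For each agglomeration method: build the tree, extract its cophenetic
# distances (assumption: these stand for clustering.DIST), compute 1 - cor^2
mouchet <- sapply(methods, function(m) {
  hc <- hclust(toy.DIST, method = m)
  1 - cor(toy.DIST, cophenetic(hc))^2
})

# The method with the smallest value distorts the input distances the least
best.method <- names(which.min(mouchet))
print(round(mouchet, 3))
print(best.method)
```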

2. Evaluation of the clustering

Once the hierarchical clustering is done, the number of clusters to keep must be chosen.
To do that, several metrics are computed :

Dunn index (mdunn): ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Value between 0 and $\infty$; should be maximized.
Meila's Variation of Information index (mVI): measures the amount of information lost and gained in changing between two clusterings. Should be minimized.
Coefficient of determination (R2): value between 0 and 1. Should be maximized.
Calinski and Harabasz index (ch): the higher the value, the "better" the solution.
Corrected Rand index (Rand): measures the similarity between two clusterings. Value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the two clusterings are exactly the same.
Average silhouette width (av.sil): average over all observations of the silhouette width s(i). Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. Should be maximized.

A graphic is produced, giving the values of these metrics as a function of the number of clusters used. Numbers of clusters are highlighted according to the values of the evaluation metrics to help the user make the best choice : the brighter (yellow-ish) the better (see .pdf output file).
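
To make the evaluation step concrete, the sketch below computes two of the metrics listed above (Dunn index and average silhouette width) for a range of cluster numbers, using base R only. It is an illustration under stated assumptions rather than the package's implementation: the data are made up, the helper functions dunn.index and avg.silhouette are hypothetical names written for this example, and singleton clusters are given a silhouette width of 0 by convention.

```r
# Made-up data: two loose groups of 10 species each
set.seed(42)
toy.traits <- rbind(matrix(rnorm(30, mean = 0), ncol = 3),
                    matrix(rnorm(30, mean = 4), ncol = 3))
toy.DIST <- dist(toy.traits)
hc <- hclust(toy.DIST, method = "average")
d.mat <- as.matrix(toy.DIST)

# Dunn index: smallest between-cluster distance / largest within-cluster distance
dunn.index <- function(d.mat, clusters) {
  same <- outer(clusters, clusters, "==")
  off.diag <- row(d.mat) != col(d.mat)
  min(d.mat[!same]) / max(d.mat[same & off.diag])
}

# Average silhouette width: mean of s(i) = (b - a) / max(a, b)
avg.silhouette <- function(d.mat, clusters) {
  s <- sapply(seq_along(clusters), function(i) {
    own <- clusters == clusters[i]
    if (sum(own) == 1) return(0)                 # singleton: s(i) set to 0
    a <- sum(d.mat[i, own]) / (sum(own) - 1)     # mean distance within own cluster
    b <- min(tapply(d.mat[i, !own], clusters[!own], mean))  # nearest other cluster
    (b - a) / max(a, b)
  })
  mean(s)
}

# Evaluate every number of clusters from 2 to a chosen maximum
no.clust.max <- 6
evaluation <- do.call(rbind, lapply(2:no.clust.max, function(k) {
  cl <- cutree(hc, k = k)
  data.frame(no.clusters = k,
             variable = c("mdunn", "av.sil"),
             value = c(dunn.index(d.mat, cl), avg.silhouette(d.mat, cl)))
}))
print(evaluation)
```

The resulting table mirrors the no.clusters, variable and value columns of clust.evaluation, so the same reading applies: both metrics shown here should be maximized.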

Mouchet M., Guilhaumon F., Villeger S., Mason N.W.H., Tomasini J.A. & Mouillot D., 2008. Towards a consensus for calculating dendrogram-based functional diversity indices. Oikos, 117, 794-800.

Note

The function does not return ONE dendrogram (or as many as given dissimilarity structures) but a LIST with all tested numbers of clusters. One final dendrogram can then be obtained using this result as a parameter in the PRE_FATE.speciesClustering_step2 function.

Author

Isabelle Boulangeat, Maya Guéguen

Examples


## Load example data
Champsaur_PFG = .loadData('Champsaur_PFG', 'RData')

## Species dissimilarity distances (niche overlap + traits distance)
tab.dist = list('Phanerophyte' = Champsaur_PFG$sp.DIST.P$mat.ALL
                , 'Chamaephyte' = Champsaur_PFG$sp.DIST.C$mat.ALL
                , 'Herbaceous' = Champsaur_PFG$sp.DIST.H$mat.ALL)
str(tab.dist)
as.matrix(tab.dist[[1]])[1:5, 1:5]

## Build dendrograms ---------------------------------------------------------
sp.CLUST = PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
names(sp.CLUST)
str(sp.CLUST$clust.evaluation)
plot(sp.CLUST$plot.clustMethod)
plot(sp.CLUST$plot.clustNo)

if (FALSE) { # \dontrun{
require(foreach)
require(ggplot2)
require(ggdendro)
pp = foreach(x = names(sp.CLUST$clust.dendrograms)) %do%
{
  hc = sp.CLUST$clust.dendrograms[[x]]
  pp = ggdendrogram(hc, rotate = TRUE) +
    labs(title = paste0('Hierarchical clustering based on species distance '
                        , ifelse(length(names(sp.CLUST$clust.dendrograms)) > 1
                                 , paste0('(group ', x, ')')
                                 , '')))
  return(pp)
}
plot(pp[[1]])
plot(pp[[2]])
plot(pp[[3]])
} # }