R/PRE_FATE.speciesClustering_step1.R
PRE_FATE.speciesClustering_step1.Rd
This script creates clusters of species from a distance matrix between those species. Several metrics are computed to evaluate these clusters, and a graphic is produced to help the user choose the best number of clusters.
PRE_FATE.speciesClustering_step1(mat.species.DIST, opt.no_clust_max = 15)
mat.species.DIST: a dist object, or a list of dist objects (one for each GROUP value), corresponding to the dissimilarity distance between each pair of species. Such an object can be obtained with the PRE_FATE.speciesDistance function.
opt.no_clust_max: (optional) default 15. An integer corresponding to the maximum number of clusters to be tested for each distance matrix.
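As a minimal sketch of the expected input shape, the snippet below builds a named list of dist objects with base R only. The trait values, species names and group names are invented for illustration; real inputs would normally come from PRE_FATE.speciesDistance.

```r
# Toy trait table: 6 hypothetical species, 2 numeric traits (illustrative values only)
set.seed(1)
traits <- data.frame(
  height = runif(6),
  dispersal = runif(6),
  row.names = paste0("sp", 1:6)
)

# Hypothetical assignment of the species to two data subsets (GROUP values)
group <- c("A", "A", "A", "B", "B", "B")

# One dist object per GROUP value, stored in a named list
toy.dist <- lapply(split(traits, group), dist)

str(toy.dist)
```

A single dist object (for example `dist(traits)`) corresponds to the case without data subsets.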
A list containing one list, one data.frame with the following columns, and two ggplot2 objects:
clust.dendrograms: a list with as many objects of class hclust as data subsets
clust.evaluation: a data.frame with the following columns:
GROUP: name of data subset
no.clusters: number of clusters used for the clustering
variable: name of the evaluation metric
value: value of the evaluation metric
plot.clustMethod: ggplot2 object, representing the different values of metrics to choose the clustering method
plot.clustNo: ggplot2 object, representing the different values of metrics to choose the number of clusters
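To illustrate how a long-format evaluation table with these four columns can be queried, here is a small sketch on a mock data.frame. All values are invented, and only two metrics that should be maximized are included; this is not output from the function.

```r
# Mock table with the same four columns as the evaluation data.frame (invented values)
toy.eval <- data.frame(
  GROUP = rep(c("A", "B"), each = 6),
  no.clusters = rep(2:4, times = 4),
  variable = rep(rep(c("mdunn", "av.sil"), each = 3), times = 2),
  value = c(0.20, 0.35, 0.30, 0.41, 0.52, 0.47,
            0.15, 0.12, 0.28, 0.33, 0.30, 0.45)
)

# For metrics that should be maximized, keep the best row per GROUP and metric
best <- do.call(rbind, lapply(
  split(toy.eval, list(toy.eval$GROUP, toy.eval$variable)),
  function(x) x[which.max(x$value), ]
))
best
```

Metrics that should be minimized would need `which.min` instead.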
One PRE_FATE_CLUSTERING_STEP1_numberOfClusters.pdf file is created, containing two types of graphics: one to account for the chosen clustering method, and one for decision support, to help the user choose the adequate number of clusters to be given to the PRE_FATE.speciesClustering_step2 function.
This function builds dendrograms from a dissimilarity distance matrix between species.
As for the PRE_FATE.speciesDistance method, clustering can be run for data subsets, provided that mat.species.DIST is given as a list of dist objects (instead of a single dist object).
The process is as follows:
Hierarchical clustering on the dissimilarity matrix is performed with the hclust function.
Several agglomeration methods are available: complete, ward.D, ward.D2, single, average (UPGMA), mcquitty (WPGMA), median (WPGMC) and centroid (UPGMC).
Mouchet et al. (2008) proposed a measure of the mismatch between the input distance and the distance obtained from the clustering, which must be minimized to find the best clustering method: $$ 1 - cor( \text{mat.species.DIST}, \text{clustering.DIST} ) ^ 2 $$
This measure is calculated for each agglomeration method. The method that minimizes it is kept and used for further analyses (see .pdf output file).
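The method selection described above can be sketched with base R as follows. This is an illustration under stated assumptions, not the package's code: the toy distance matrix is random, and the distance "obtained with the clustering" is assumed here to be the cophenetic distance of the dendrogram.

```r
# Toy dissimilarity matrix between 10 hypothetical species
set.seed(42)
toy.dist <- dist(matrix(rnorm(40), nrow = 10,
                        dimnames = list(paste0("sp", 1:10), NULL)))

methods <- c("complete", "ward.D", "ward.D2", "single",
             "average", "mcquitty", "median", "centroid")

# For each agglomeration method: 1 - cor(input distance, dendrogram distance)^2
# Assumption: the dendrogram distance is the cophenetic distance
criterion <- sapply(methods, function(m) {
  hc <- hclust(toy.dist, method = m)
  1 - cor(as.vector(toy.dist), as.vector(cophenetic(hc)))^2
})

print(round(criterion, 3))

# Method with the smallest mismatch
best.method <- names(which.min(criterion))
best.method
```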
Once the hierarchical clustering is done, the number of clusters to keep must be chosen. To do so, several metrics are computed:
Dunn index (mdunn): ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Value between 0 and \(\infty\); should be maximized.
Meila's Variation of Information index (mVI): measures the amount of information lost and gained in changing between two clusterings. Should be minimized.
Coefficient of determination (R2): value between 0 and 1. Should be maximized.
Calinski and Harabasz index (ch): the higher the value, the "better" the solution.
Rand index (Rand): measures the similarity between two data clusterings. Value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the two clusterings are exactly the same.
Average silhouette width (av.sil): observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. Should be maximized.
A graphic is produced, giving the values of these metrics as a function of the number of clusters used. Numbers of clusters are highlighted according to the values of the evaluation metrics to help the user make their choice: the brighter (yellow-ish) the better (see .pdf output file).
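As an illustration of how one of these metrics varies with the number of clusters, the sketch below computes the Dunn index by hand with base R, following the definition given above. The toy data, the helper name `dunn.index` and the variable `no.clust.max` (a stand-in for opt.no_clust_max) are hypothetical; the function itself may rely on dedicated packages instead.

```r
set.seed(42)
# Toy data: 3 hypothetical groups of 5 species in a 2-trait space
pts <- rbind(matrix(rnorm(10, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(10, mean = 6, sd = 0.3), ncol = 2))
toy.dist <- dist(pts)
hc <- hclust(toy.dist, method = "average")

# Dunn index: smallest between-cluster distance / largest within-cluster distance
dunn.index <- function(d, clusters) {
  m <- as.matrix(d)
  same <- outer(clusters, clusters, "==")  # TRUE when both species share a cluster
  pairs <- upper.tri(m)                    # count each pair of species once
  min(m[pairs & !same]) / max(m[pairs & same])
}

no.clust.max <- 6  # hypothetical stand-in for opt.no_clust_max
dunn.values <- sapply(2:no.clust.max, function(k) {
  dunn.index(toy.dist, cutree(hc, k = k))
})
names(dunn.values) <- 2:no.clust.max

round(dunn.values, 3)
```

The helper assumes that at least one cluster contains two or more observations, which holds here because there are more species than tested clusters.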
Mouchet M., Guilhaumon F., Villeger S., Mason N.W.H., Tomasini J.A. &
Mouillot D., 2008. Towards a consensus for calculating dendrogram-based
functional diversity indices. Oikos, 117, 794-800.
The function does not return ONE dendrogram (or as many as
given dissimilarity structures) but a LIST with all tested numbers
of clusters. One final dendrogram can then be obtained using this result
as a parameter in the PRE_FATE.speciesClustering_step2
function.
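The idea that one dendrogram yields a whole series of candidate partitions, and that a single one is kept once the number of clusters is chosen, can be sketched with base R. The toy data and the chosen value of 3 are hypothetical, and PRE_FATE.speciesClustering_step2 may do more than this simple cut.

```r
set.seed(7)
# Toy dissimilarity matrix between 12 hypothetical species
toy.dist <- dist(matrix(rnorm(24), nrow = 12,
                        dimnames = list(paste0("sp", 1:12), NULL)))
hc <- hclust(toy.dist, method = "average")

# Every tested number of clusters is a different cut of the same dendrogram
cuts <- cutree(hc, k = 2:5)  # matrix: one row per species, one column per k
head(cuts)

# Once a number of clusters is chosen (here, hypothetically, 3),
# a single partition is obtained
final <- cutree(hc, k = 3)
table(final)
```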
hclust,
cutree,
cluster.stats,
dunn,
PRE_FATE.speciesDistance,
PRE_FATE.speciesClustering_step2
## Load example data
Champsaur_PFG = .loadData('Champsaur_PFG', 'RData')
## Species dissimilarity distances (niche overlap + traits distance)
tab.dist = list('Phanerophyte' = Champsaur_PFG$sp.DIST.P$mat.ALL
                , 'Chamaephyte' = Champsaur_PFG$sp.DIST.C$mat.ALL
                , 'Herbaceous' = Champsaur_PFG$sp.DIST.H$mat.ALL)
str(tab.dist)
as.matrix(tab.dist[[1]])[1:5, 1:5]
## Build dendrograms ---------------------------------------------------------
sp.CLUST = PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
names(sp.CLUST)
str(sp.CLUST$clust.evaluation)
plot(sp.CLUST$plot.clustMethod)
plot(sp.CLUST$plot.clustNo)
if (FALSE) { # \dontrun{
  require(foreach)
  require(ggplot2)
  require(ggdendro)
  ## One dendrogram plot per data subset
  pp = foreach(x = names(sp.CLUST$clust.dendrograms)) %do%
    {
      hc = sp.CLUST$clust.dendrograms[[x]]
      pp = ggdendrogram(hc, rotate = TRUE) +
        labs(title = paste0('Hierarchical clustering based on species distance '
                            , ifelse(length(names(sp.CLUST$clust.dendrograms)) > 1
                                     , paste0('(group ', x, ')')
                                     , '')))
      return(pp)
    }
  plot(pp[[1]])
  plot(pp[[2]])
  plot(pp[[3]])
} # }