R/PRE_FATE.speciesClustering_step1.R
PRE_FATE.speciesClustering_step1.Rd
This script creates clusters of species based on a distance matrix between those species. Several metrics are computed to evaluate these clusters, and a graphic is produced to help the user choose the best number of clusters.
PRE_FATE.speciesClustering_step1(mat.species.DIST, opt.no_clust_max = 15)
mat.species.DIST: a dist object, or a list of dist objects (one for each GROUP value), corresponding to the dissimilarity distance between each pair of species. Such an object can be obtained with the PRE_FATE.speciesDistance function.
opt.no_clust_max: (optional) default 15. An integer corresponding to the maximum number of clusters to be tested for each distance matrix.
A list containing one list, one data.frame and two ggplot2 objects :
clust.dendrograms: a list with as many objects of class hclust as data subsets
clust.evaluation: a data.frame with the following columns :
  GROUP: name of the data subset
  no.clusters: number of clusters used for the clustering
  variable: name of the evaluation metric
  value: value of the evaluation metric
plot.clustMethod: a ggplot2 object representing the metric values used to choose the clustering method
plot.clustNo: a ggplot2 object representing the metric values used to choose the number of clusters
One PRE_FATE_CLUSTERING_STEP1_numberOfClusters.pdf file is created, containing two types of graphics :
one to account for the chosen clustering method
one for decision support, to help the user choose the adequate number of clusters to be given to the PRE_FATE.speciesClustering_step2 function
This function builds dendrograms from a dissimilarity distance matrix between species.
As for the PRE_FATE.speciesDistance method, clustering can be run for data subsets, provided that mat.species.DIST is given as a list of dist objects (instead of a single dist object).
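As a rough illustration of the two accepted input shapes, the sketch below builds a named list of dist objects, one per data subset, using base R only. The toy trait table, its column names and the use of Euclidean distance are assumptions made for the example; in the documented workflow the distances would come from PRE_FATE.speciesDistance.

```r
## Toy trait table (hypothetical data, for illustration only)
set.seed(1)
traits <- data.frame(GROUP = rep(c("Herbaceous", "Chamaephyte"), each = 5),
                     height = runif(10),
                     light = runif(10),
                     row.names = paste0("sp", 1:10))

## One dist object per GROUP value, stored in a named list
tab.dist <- lapply(split(traits[, c("height", "light")], traits$GROUP), dist)
str(tab.dist)

## A single dist object (no subsets) would simply be:
one.dist <- dist(traits[, c("height", "light")])

## Either object could then be passed as mat.species.DIST, e.g.
## sp.CLUST <- PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
```

The names of the list are what identify each subset, so a named list is preferable to an unnamed one.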
The process is as follows :
Hierarchical clustering on the dissimilarity matrix is performed with hclust.
Several methods are available for the agglomeration : complete, ward.D, ward.D2, single, average (UPGMA), mcquitty (WPGMA), median (WPGMC) and centroid (UPGMC).
Mouchet et al. (2008) proposed a measure of the dissimilarity between the input distance and the distance obtained from the clustering. It must be minimized to find the best clustering method : $$ 1 - cor( \text{mat.species.DIST}, \text{clustering.DIST} ) ^ 2$$
For each agglomeration method, this measure is calculated. The method that minimizes it is kept and used for further analyses (see .pdf output file).
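The selection of the agglomeration method can be sketched in base R as follows. This is a minimal sketch, not the package's implementation: it assumes that clustering.DIST is the cophenetic distance of each dendrogram, and it uses a random toy distance matrix in place of mat.species.DIST.

```r
## Toy dissimilarity matrix between 20 hypothetical species
set.seed(42)
traits <- matrix(rnorm(20 * 3), nrow = 20,
                 dimnames = list(paste0("sp", 1:20), NULL))
d <- dist(traits)

## Agglomeration methods listed above
methods <- c("complete", "ward.D", "ward.D2", "single",
             "average", "mcquitty", "median", "centroid")

## 1 - cor(input distance, cophenetic distance)^2 for each method
## (assumption: the clustering distance is the cophenetic distance)
mouchet <- sapply(methods, function(m) {
  hc <- hclust(d, method = m)
  1 - cor(as.vector(d), as.vector(cophenetic(hc)))^2
})

## Keep the method that minimizes the measure
best.method <- names(which.min(mouchet))
round(mouchet, 3)
best.method
```

A value close to 0 means the dendrogram preserves the original pairwise distances well.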
Once the hierarchical clustering is done, the number of clusters to keep should be chosen. To do that, several metrics are computed :
Dunn index (mdunn) : ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. Value between 0 and \(\infty\), and should be maximized.
Meila's Variation of Information index (mVI) : measures the amount of information lost and gained in changing between two clusterings. Should be minimized.
Coefficient of determination (R2) : value between 0 and 1. Should be maximized.
Calinski and Harabasz index (ch) : the higher the value, the "better" the solution.
Corrected Rand index (Rand) : measures the similarity between two data clusterings. Value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that they are exactly the same.
Average silhouette width (av.sil) : observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. Should be maximized.
A graphic is produced, giving the values of these metrics as a function of the number of clusters used. Numbers of clusters are highlighted according to the values of the evaluation metrics to help the user make an optimal choice : the brighter (yellow-ish) the better (see .pdf output file).
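To make the second step concrete, the sketch below cuts a dendrogram at several numbers of clusters with cutree and computes one of the metrics above, the Dunn index, for each cut. It is only an illustration under stated assumptions: dunn_index is a hypothetical helper written in base R from the definition given above (the functions listed under See also, cluster.stats and dunn, are not called here), the data are random, and the "average" method is an arbitrary choice.

```r
## Toy dissimilarity matrix and dendrogram (hypothetical data)
set.seed(42)
traits <- matrix(rnorm(20 * 3), nrow = 20,
                 dimnames = list(paste0("sp", 1:20), NULL))
d <- dist(traits)
hc <- hclust(d, method = "average")

## Hypothetical helper: smallest between-cluster distance divided by
## the largest within-cluster distance
dunn_index <- function(d, cl) {
  m <- as.matrix(d)
  same <- outer(cl, cl, "==")       # TRUE when two species share a cluster
  off.diag <- row(m) != col(m)      # ignore the zero diagonal
  min(m[!same]) / max(m[same & off.diag])
}

## Evaluate each candidate number of clusters
k.range <- 2:6
eval.tab <- data.frame(
  no.clusters = k.range,
  mdunn = sapply(k.range, function(k) dunn_index(d, cutree(hc, k = k)))
)

## The Dunn index should be maximized
best.k <- eval.tab$no.clusters[which.max(eval.tab$mdunn)]
eval.tab
best.k
```

In practice the metrics rarely agree on a single optimum, which is why the function reports all of them rather than picking a number of clusters automatically.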
Mouchet M., Guilhaumon F., Villeger S., Mason N.W.H., Tomasini J.A. & Mouillot D., 2008. Towards a consensus for calculating dendrogram-based functional diversity indices. Oikos, 117, 794-800.
The function does not return ONE dendrogram (or as many as given dissimilarity structures) but a LIST with all tested numbers of clusters. One final dendrogram can then be obtained by passing this result to the PRE_FATE.speciesClustering_step2 function.
hclust, cutree, cluster.stats, dunn, PRE_FATE.speciesDistance, PRE_FATE.speciesClustering_step2
## Load example data
Champsaur_PFG = .loadData('Champsaur_PFG', 'RData')
## Species dissimilarity distances (niche overlap + traits distance)
tab.dist = list('Phanerophyte' = Champsaur_PFG$sp.DIST.P$mat.ALL
                , 'Chamaephyte' = Champsaur_PFG$sp.DIST.C$mat.ALL
                , 'Herbaceous' = Champsaur_PFG$sp.DIST.H$mat.ALL)
str(tab.dist)
as.matrix(tab.dist[[1]])[1:5, 1:5]
## Build dendrograms ---------------------------------------------------------
sp.CLUST = PRE_FATE.speciesClustering_step1(mat.species.DIST = tab.dist)
names(sp.CLUST)
str(sp.CLUST$clust.evaluation)
plot(sp.CLUST$plot.clustMethod)
plot(sp.CLUST$plot.clustNo)
if (FALSE) {
  require(foreach)
  require(ggplot2)
  require(ggdendro)
  ## One dendrogram plot per data subset
  pp = foreach(x = names(sp.CLUST$clust.dendrograms)) %do%
    {
      hc = sp.CLUST$clust.dendrograms[[x]]
      pp = ggdendrogram(hc, rotate = TRUE) +
        labs(title = paste0('Hierarchical clustering based on species distance '
                            , ifelse(length(names(sp.CLUST$clust.dendrograms)) > 1
                                     , paste0('(group ', x, ')')
                                     , '')))
      return(pp)
    }
  plot(pp[[1]])
  plot(pp[[2]])
  plot(pp[[3]])
}