Clustering Method
This parameter selects the method to be used for clustering the objects.
The possible values are:
- Repeated Bisection
- In this method, the desired k-way clustering solution is computed
by performing a sequence of k-1 {\be repeated bisections}.
In this approach, the matrix is first clustered into two groups,
then one of these groups is selected and bisected further. This process
continuous until the desired number of clusters is found. During each
step, the cluster is bisected so that the resulting 2-way clustering
solution optimizes a particular clustering criterion function (which
is selected using the {\em -crfun} parameter). Note that this approach
ensures that the criterion function is locally optimized within each
bisection, but in general is not globally optimized. The cluster that
is selected for further partitioning is controlled by the {\em -cstype}
parameter. By default, \vcluster uses this approach to find the k-way
clustering solution.
- Repeated Bisection (K Way)
- In this method the desired $k$-way clustering solution is computed
in a fashion similar to the repeated-bisecting method but at the end,
the overall solution is globally optimized. Essentially, \vcluster
uses the solution obtained by {\em -clmethod=rb} as the initial
clustering solution and tries to further optimize the clustering
criterion function.
- Direct
- In this method, the desired $k$-way clustering solution is computed
by simultaneously finding all $k$ clusters. In general, computing a
$k$-way clustering directly is slower than clustering via repeated
bisections. In terms of quality, for reasonably small values of $k$
(usually less than 10--20), the direct approach leads to better
clusters than those obtained via repeated bisections. However, as
$k$ increases, the repeated-bisecting approach tends to be better
than direct clustering.
- Agglomerative
- In this method, the desired $k$-way clustering solution is computed
using the {\be agglomerative} paradigm whose goal is to locally
optimize (minimize or maximize) a particular clustering criterion
function (which is selected using the {\em -crfun} parameter).
The solution is obtained by stopping the agglomeration process when
$k$ clusters are left.
- Graph
- In this method, the desired $k$-way clustering solution is computed
by first modeling the objects using a nearest-neighbor graph (each
object becomes a vertex, and each object is connected to its most
similar other objects), and then splitting the graph into $k$-clusters
using a min-cut graph partitioning algorithm. {\bf Note that if the graph
contains more than one connected component, then \vcluster and \scluster
return a $(k+m)$-way clustering solution, where $m$ is the number
of connected components in the graph.}
{Note about suitability of these for Microarray datasets}
Similarity Function
Selects the similarity function to be used for clustering. The possible values are:
- Cosine
- The similarity between objects is computed using the cosine function.
This is the default setting.
- Correlation Coefficient
- The similarity between objects is computed using the correlation
coefficient.
- Euclidean Distance
- The similarity between objects is computed to be inversely
proportional to the Euclidean distance between the objects.
This similarity function is only applicable when {\em -clmethod=graph}.
- Jaccard's Coefficient
- The similarity between objects is computed using the extended Jaccard
coefficient. This similarity function is only applicable when
{\em -clmethod=graph}.
Critereon Function
This parameter selects the particular clustering criterion function to be used in
finding the clusters. A total of seven different clustering criterion functions are
provided that are selected by specifying the appropriate integer value. The
possible values for {\em -crfun} are:
- I1
- Selects the I1 criterion function.
- I2
- Selects the I2 criterion function.
- E1
- Selects the E1 criterion function.
- G1
- Selects the G1 criterion function.
- G1'
- Selects the G1$'$ criterion function.
- H1
- Selects the H1 criterion function.
- H2
- Selects the H2 criterion function.
- Single Link
- Selects the traditional single-link criterion function.
- Wt. Single Link
- Selects a cluster-weighted single-link criterion function.
- Complete Link
- Selects the traditional complete-link criterion function.
- UPGMA
- Selects the traditional UPGMA criterion function.
This is the default setting for the {\em agglo} and {\em bagglo}
clustering methods.
The precise mathematical definition of the first seven functions is shown in
\tblref{tbl:crfundef}. The reader is referred to \cite{zhao01tr-vpcluster} for
both a detailed description and evaluation of the various criterion functions.
The {\em slink}, {\em wslink}, {\em clink}, {\em wclink}, and {\em upgma}
criterion functions can only be used within the context of agglomerative
clustering, and cannot be used for partitional clustering.
The {\em wslink} and {\em wclink} criterion function were designed for building
an agglomerative solution on top of an existing clustering solution (see {\em
-agglofrom}, or {\em -showtree} options). In this context, the weight of the
``link'' between two clusters $S_i$ and $S_j$ is set equal to the aggregate
similarity between the objects of $S_i$ to the objects in $S_j$ divided by
the total similarity between the objects in $S_i \bigcup S_j$.
The various criterion functions can sometimes lead to significantly
different clustering solutions. In general, the \itwo and \htwo criterion
functions lead to very good clustering solutions, whereas the \eone and
\goneprime criterion functions leads to solutions that contain clusters
that are of comparable size. However, the choice of the {\em right}
criterion function depends on the underlying application area, and the
user should perform some experimentation before selecting one appropriate
for his/her needs.
Note that the computational complexity of the agglomerative clustering
algorithms (\ie., {\em -clmethod=agglo} or {\em -clmethod=bagglo}) depend
on the criterion function that is selected. In particular, if $n$ is the
number of objects, the complexity for \hone and \htwo criterion functions
is $O(n^3)$, whereas the complexity of the remaining criterion functions
is $O(n^2\log n)$. The higher complexity for \hone and \htwo is due to
the fact that these two criterion functions are defined globally over the
entire solution and they cannot be accurately evaluated based on the local
combination of two clusters.
\begin{table}[htb]
\begin{center}
\input{formula.tbl}
\end{center}