Clustering Help

Clustering Method

This parameter selects the method to be used for clustering the objects. The possible values are:

Repeated Bisection: In this method, the desired k-way clustering solution is computed by performing a sequence of k-1 {\be repeated bisections}. In this approach, the matrix is first clustered into two groups, then one of these groups is selected and bisected further. This process continuous until the desired number of clusters is found. During each step, the cluster is bisected so that the resulting 2-way clustering solution optimizes a particular clustering criterion function (which is selected using the {\em -crfun} parameter). Note that this approach ensures that the criterion function is locally optimized within each bisection, but in general is not globally optimized. The cluster that is selected for further partitioning is controlled by the {\em -cstype} parameter. By default, \vcluster uses this approach to find the k-way clustering solution.
Repeated Bisection (K Way): In this method the desired $k$-way clustering solution is computed in a fashion similar to the repeated-bisecting method but at the end, the overall solution is globally optimized. Essentially, \vcluster uses the solution obtained by {\em -clmethod=rb} as the initial clustering solution and tries to further optimize the clustering criterion function.
Direct: In this method, the desired $k$-way clustering solution is computed by simultaneously finding all $k$ clusters. In general, computing a $k$-way clustering directly is slower than clustering via repeated bisections. In terms of quality, for reasonably small values of $k$ (usually less than 10--20), the direct approach leads to better clusters than those obtained via repeated bisections. However, as $k$ increases, the repeated-bisecting approach tends to be better than direct clustering.
Agglomerative: In this method, the desired $k$-way clustering solution is computed using the {\be agglomerative} paradigm whose goal is to locally optimize (minimize or maximize) a particular clustering criterion function (which is selected using the {\em -crfun} parameter). The solution is obtained by stopping the agglomeration process when $k$ clusters are left.
Graph: In this method, the desired $k$-way clustering solution is computed by first modeling the objects using a nearest-neighbor graph (each object becomes a vertex, and each object is connected to its most similar other objects), and then splitting the graph into $k$-clusters using a min-cut graph partitioning algorithm. {\bf Note that if the graph contains more than one connected component, then \vcluster and \scluster return a $(k+m)$-way clustering solution, where $m$ is the number of connected components in the graph.}

{Note about suitability of these for Microarray datasets}

Similarity Function

Selects the similarity function to be used for clustering. The possible values are:

Cosine: The similarity between objects is computed using the cosine function. This is the default setting.
Correlation Coefficient: The similarity between objects is computed using the correlation coefficient.
Euclidean Distance: The similarity between objects is computed to be inversely proportional to the Euclidean distance between the objects. This similarity function is only applicable when {\em -clmethod=graph}.
Jaccard's Coefficient: The similarity between objects is computed using the extended Jaccard coefficient. This similarity function is only applicable when {\em -clmethod=graph}.

Critereon Function

This parameter selects the particular clustering criterion function to be used in finding the clusters. A total of seven different clustering criterion functions are provided that are selected by specifying the appropriate integer value. The possible values for {\em -crfun} are:

I1: Selects the I1 criterion function.
I2: Selects the I2 criterion function.
E1: Selects the E1 criterion function.
G1: Selects the G1 criterion function.
G1': Selects the G1$'$ criterion function.
H1: Selects the H1 criterion function.
H2: Selects the H2 criterion function.
Single Link: Selects the traditional single-link criterion function.
Wt. Single Link: Selects a cluster-weighted single-link criterion function.
Complete Link: Selects the traditional complete-link criterion function.
UPGMA: Selects the traditional UPGMA criterion function. This is the default setting for the {\em agglo} and {\em bagglo} clustering methods.

The precise mathematical definition of the first seven functions is shown in \tblref{tbl:crfundef}. The reader is referred to \cite{zhao01tr-vpcluster} for both a detailed description and evaluation of the various criterion functions. The {\em slink}, {\em wslink}, {\em clink}, {\em wclink}, and {\em upgma} criterion functions can only be used within the context of agglomerative clustering, and cannot be used for partitional clustering. The {\em wslink} and {\em wclink} criterion function were designed for building an agglomerative solution on top of an existing clustering solution (see {\em -agglofrom}, or {\em -showtree} options). In this context, the weight of the ``link'' between two clusters $S_i$ and $S_j$ is set equal to the aggregate similarity between the objects of $S_i$ to the objects in $S_j$ divided by the total similarity between the objects in $S_i \bigcup S_j$. The various criterion functions can sometimes lead to significantly different clustering solutions. In general, the \itwo and \htwo criterion functions lead to very good clustering solutions, whereas the \eone and \goneprime criterion functions leads to solutions that contain clusters that are of comparable size. However, the choice of the {\em right} criterion function depends on the underlying application area, and the user should perform some experimentation before selecting one appropriate for his/her needs. Note that the computational complexity of the agglomerative clustering algorithms (\ie., {\em -clmethod=agglo} or {\em -clmethod=bagglo}) depend on the criterion function that is selected. In particular, if $n$ is the number of objects, the complexity for \hone and \htwo criterion functions is $O(n^3)$, whereas the complexity of the remaining criterion functions is $O(n^2\log n)$. The higher complexity for \hone and \htwo is due to the fact that these two criterion functions are defined globally over the entire solution and they cannot be accurately evaluated based on the local combination of two clusters. \begin{table}[htb] \begin{center} \input{formula.tbl} \end{center}