This menu can be used to perform a two-way clustering of probes (which may be thought of as representing genes)
and slide or target effects.
The clustering method can be either hierarchical or non-hierarchical using the k-means algorithm.
A range of clustering criteria are available for each method.
The probes are grouped together so that the responses of each group are similar, with
the groups as distinct as possible, and similarly the slides or target effects are grouped together.
For hierarchical clustering, the allocation to groups is specified by providing a threshold
for the level of similarity within a group; the dendrogram is cut at this level, so the number
of groups is not known in advance. For the k-means algorithm, the number of groups must be specified.
The dendrogram for a hierarchical cluster analysis may be plotted, but for a large
number of probes this may not be useful as individual probes cannot be read. The responses
of each probe across the targets/slides can also be plotted, but again with large numbers
of probes this is slow, in which case the mean response for each group can be plotted.
A spreadsheet containing the grouped data can also be generated using the Store
button.
With large numbers of probes, the limit of RAM on a PC can be quickly reached, so an
option to only cluster probes with the largest mean absolute response is available.
Available Data
This lists data structures appropriate for the edit box which currently has focus.
You can double-click a name to enter it in the edit box.
Data Format
The data can be supplied in either of the following formats:
- Single Variate for Log-ratios with Slide Factor - All the log-ratios are stacked
into a single variate, with factors that index the slide and probe/gene
- Pointer to Log-ratio Variates for each Slide - Each slide has its data in
a variate, and a pointer which points to this set of variates is provided. The Slides
factor is not required, but if supplied it should just have one entry for each slide in the order of
the variates in the pointer. The Probes/Genes factor is that for a single slide, and
all slides must have a common layout.
The spreadsheet stack and unstack menus can be used to reorganise the data between these two formats.
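As a rough illustration of how the two formats relate, the following Python sketch converts between a stacked layout and a per-slide layout. It is not GenStat code; the function names `unstack` and `stack` and the example values are hypothetical, and it assumes every slide contains the same probes.

```python
# Hypothetical illustration only: these are plain Python lists, not GenStat structures.

def unstack(log_ratios, slides, probes):
    """Stacked format -> one list of log-ratios per slide, in a common probe order."""
    probe_order = []
    for p in probes:
        if p not in probe_order:
            probe_order.append(p)
    cells = {}
    for value, s, p in zip(log_ratios, slides, probes):
        cells.setdefault(s, {})[p] = value
    # Assumes all slides share a common layout (every probe present on every slide).
    columns = {s: [by_probe[p] for p in probe_order] for s, by_probe in cells.items()}
    return probe_order, columns


def stack(probe_order, columns):
    """Per-slide lists -> a single stacked list with slide and probe factors."""
    log_ratios, slides, probes = [], [], []
    for s, values in columns.items():
        for p, v in zip(probe_order, values):
            log_ratios.append(v)
            slides.append(s)
            probes.append(p)
    return log_ratios, slides, probes


stacked_values = [0.5, -1.2, 0.1, 0.7, -0.9, 0.0]
slide_factor = ["S1", "S1", "S1", "S2", "S2", "S2"]
probe_factor = ["g1", "g2", "g3", "g1", "g2", "g3"]

probe_order, columns = unstack(stacked_values, slide_factor, probe_factor)
round_trip = stack(probe_order, columns)
```

In the per-slide layout the probe factor has one entry per probe, whereas in the stacked layout it has one entry per log-ratio.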
Log-ratios
The log-ratios on which the probes and targets are to be clustered.
Probes/Genes
The factor that identifies the probes or genes on a slide. If the data are in pointer format, this
has just one entry per probe, but if the data are in variate (stacked) format,
this factor indexes the probes in the log-ratio variate.
Targets or Slides
The factor that identifies the slides. If the data are in pointer format, this
has just one entry per slide, but if the data are in variate (stacked) format,
this factor indexes the slides in the log-ratio variate.
Clustering Method
The type of clustering to be used:
| Hierarchical | Hierarchical clustering using the method selected within the Link Method option. |
| K-Means | Non-hierarchical clustering using the k-means method. |
When a clustering method is selected the options and controls change to allow you to specify settings for
the chosen method.
Link Method (Hierarchical only)
A number of methods for clustering are available and vary
according to the way in which 'closest' is defined at each stage of merging
groups. The following possibilities are available:
| Single Link |
defines the similarity between two clusters as the maximum similarity
between any two samples in those clusters |
| Nearest Neighbour |
synonym for Single link |
| Complete Link |
defines the similarity between two clusters as the minimum similarity
between any two samples in those clusters |
| Furthest Neighbour |
synonym for Complete Link |
| Average Link |
defines the similarity between a cluster and two merging clusters as the
average of the similarities with each of the original clusters. It therefore
replaces two merging clusters by their mean, unweighted by cluster
size |
| Group Average |
an average is taken over all the samples in the two merging clusters. Thus,
the original clusters are replaced by their mean, weighted by cluster
size |
| Median Sorting |
can be thought of in terms of clusters being represented by points in a
multidimensional space; when two clusters join, the new cluster is represented
by the midpoint of the original cluster points |
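The first four definitions above can be read as update rules: when clusters A and B merge, each rule gives the similarity between the merged cluster and any other cluster C. The Python sketch below is an illustration under that reading, not GenStat's implementation; the function name `merged_similarity` and the numbers are hypothetical, and Median Sorting is omitted.

```python
def merged_similarity(method, s_ac, s_bc, n_a, n_b):
    """Similarity between cluster C and the merger of clusters A and B.

    s_ac, s_bc: similarities of C with A and with B before the merge.
    n_a, n_b:   numbers of samples in A and B.
    """
    if method == "single":          # nearest neighbour: maximum similarity
        return max(s_ac, s_bc)
    if method == "complete":        # furthest neighbour: minimum similarity
        return min(s_ac, s_bc)
    if method == "average":         # mean of the two, unweighted by cluster size
        return (s_ac + s_bc) / 2
    if method == "group_average":   # mean weighted by cluster size
        return (n_a * s_ac + n_b * s_bc) / (n_a + n_b)
    raise ValueError(f"unknown method: {method}")


# Cluster A has 3 samples, cluster B has 1; C is 0.9 similar to A and 0.6 to B.
results = {
    m: merged_similarity(m, 0.9, 0.6, n_a=3, n_b=1)
    for m in ("single", "complete", "average", "group_average")
}
```

With these values the group average lies closer to 0.9 than the unweighted average does, because the larger cluster A carries more weight.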
Distance Method (Hierarchical only)
The measure used to calculate the distances (dissimilarities) between items, from which the clusters are formed.
Probe Groups Threshold% (Hierarchical only)
The minimum percentage similarity within probe/gene groups.
Target Groups Threshold% (Hierarchical only)
The minimum percentage similarity within slide/target groups.
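To show how a threshold determines the number of groups, here is a minimal Python sketch for the single-link case only. It relies on the fact that cutting a single-link dendrogram at a given similarity gives the same groups as joining every pair of items whose similarity is at least that value. The function name `cut_single_link` and the similarity matrix are hypothetical; other link methods would need the full merging process.

```python
def cut_single_link(similarity, threshold_pct):
    """Group labels (1, 2, ...) from cutting a single-link tree at threshold_pct."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the representative item, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair whose similarity reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold_pct:
                parent[find(i)] = find(j)

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1) for i in range(n)]


# Percentage similarities between four items (made-up values).
sim = [
    [100, 95, 40, 35],
    [95, 100, 45, 30],
    [40, 45, 100, 90],
    [35, 30, 90, 100],
]

groups_at_80 = cut_single_link(sim, 80)  # [1, 1, 2, 2]
groups_at_92 = cut_single_link(sim, 92)  # [1, 1, 2, 3]
groups_at_40 = cut_single_link(sim, 40)  # [1, 1, 1, 1]
```

Raising the threshold splits the items into more groups, and lowering it merges them into fewer.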
Criterion (K-means only)
The criterion to be optimized by the clustering. This can be set to one of the
following four choices:
| Within-class dispersion |
Minimizes the determinant of the pooled within-class dispersion matrix (W).
Under the assumption that the data originated from a mixture of k multivariate
Normal distributions, with equal variance-covariance matrix V, the MLE of V is
obtained when the grouping into k classes minimizes det(W). Obtains compact
groups. |
| Mahalanobis squared distance |
Maximizes the total between-groups Mahalanobis squared distance. This will
obtain separation of groups, possibly at the cost of compactness. Equivalent to
the Within-class dispersion criterion when there are only two groups. |
| Between-group sum of squares |
Minimizes the trace of the pooled within-class dispersion matrix (W).
Equivalent to maximizing the total between-group sum of squares, or Euclidean
distance between groups. |
| Maximal predictive classification |
Maximal predictive classification is suitable for binary data. Each group
has a class predictor, a binary indicator for each variate set to 0 or 1
according to whichever value is more frequent in the group. The criterion to be
maximized is the total number of agreements between units and their respective
class predictors. |
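Two of these criteria are simple enough to evaluate by hand for a given grouping, which the Python sketch below does. It is an illustration rather than GenStat's algorithm: the function names and data are hypothetical, and the within-group sum of squares is taken here as the trace of W (any constant divisor in W does not change which grouping is best). The search over groupings is not shown.

```python
def within_group_ss(rows, groups):
    """Sum over groups and variates of squared deviations from the group mean."""
    total = 0.0
    for g in set(groups):
        members = [r for r, label in zip(rows, groups) if label == g]
        for column in zip(*members):
            mean = sum(column) / len(column)
            total += sum((x - mean) ** 2 for x in column)
    return total


def predictive_agreements(rows, groups):
    """Agreements between binary units and their group's class predictor.

    The class predictor for a variate is whichever of 0 or 1 is more frequent
    in the group, so the agreements for that variate are the larger count.
    """
    total = 0
    for g in set(groups):
        members = [r for r, label in zip(rows, groups) if label == g]
        for column in zip(*members):
            ones = sum(column)
            total += max(ones, len(column) - ones)
    return total


# Continuous data: a compact grouping gives a much smaller within-group SS.
data = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 6.2]]
ss_compact = within_group_ss(data, [1, 1, 2, 2])
ss_mixed = within_group_ss(data, [1, 2, 1, 2])

# Binary data: a grouping of similar units gives more agreements.
binary = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1]]
agree_good = predictive_agreements(binary, [1, 1, 2, 2])
agree_poor = predictive_agreements(binary, [1, 2, 1, 2])
```

The first criterion is minimized and the second maximized, so in both cases the grouping `[1, 1, 2, 2]` is preferred here.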
Number of Probe Groups (K-means only)
The number of clusters to group the probes into.
Number of Target Groups (K-means only)
The number of clusters to group the targets/slides into.
Use only top % of responding probes
Cluster only a percentage of the probes. The probes chosen will be those
with the largest average absolute responses.
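A minimal Python sketch of this kind of filtering is given below. It assumes that the ranking uses the mean of the absolute log-ratios across slides and that the number of probes kept is rounded up; both are assumptions, and the function name `top_responding` and the data are hypothetical.

```python
import math


def top_responding(probe_values, percent):
    """Return the probes with the largest mean absolute response.

    probe_values: dict mapping probe name -> list of log-ratios across slides.
    percent:      percentage of probes to keep (rounded up, at least one probe).
    """
    mean_abs = {
        probe: sum(abs(v) for v in values) / len(values)
        for probe, values in probe_values.items()
    }
    keep = max(1, math.ceil(len(mean_abs) * percent / 100))
    ranked = sorted(mean_abs, key=mean_abs.get, reverse=True)
    return ranked[:keep]


responses = {
    "g1": [0.1, -0.2, 0.0],
    "g2": [2.0, -1.5, 1.8],
    "g3": [-0.9, 1.1, -1.0],
    "g4": [0.3, 0.2, -0.1],
}

top_half = top_responding(responses, 50)     # ['g2', 'g3']
top_quarter = top_responding(responses, 25)  # ['g2']
```

Using absolute values means that a probe with strong responses in both directions, such as g3, is retained even though its plain mean is close to zero.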
Action Buttons
| Run | Run the analysis. |
| Cancel | Close the menu without further changes. |
| Options | Opens a dialog where additional options and settings can be
specified for the analysis. |
| Defaults | Resets the menu to its default settings.
Right-clicking this button produces a pop-up menu where you can choose to set
the menu using the currently stored defaults or the GenStat default settings. |
| Store | Opens a dialog to specify names of structures to store the results from the analysis.
The names of the structures to be saved should be supplied before running the analysis. |