BKEY procedure
Constructs an identification key (R.W. Payne).
Options
Parameters
Description
Identification keys provide efficient ways of identifying objects, or taxa, whose properties can be described by a set of discrete-valued tests. Many applications are biological. For example, in botanical work, the taxa may be species of plant and the tests may require the observation of characters like the colours of petals or numbers of leaves. Similarly, in microbiology, the tests may involve the ability of an organism to grow in various media. Using a key involves doing a sequence of tests which continues until the unknown specimen can be identified.
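As a minimal sketch of what using a key involves, the following Python fragment walks a small key from its root until a taxon is reached. The nested-dictionary representation, the function name `identify` and the botanical data are all hypothetical, chosen only for illustration; they are not the structure that BKEY saves.

```python
def identify(key, observe):
    """Follow the key from its root, doing one test per step, until a taxon is reached."""
    tests_done = []
    node = key
    while isinstance(node, dict):          # internal nodes are dicts, leaves are taxon names
        result = observe(node["test"])     # carry out the test on the specimen
        tests_done.append(node["test"])
        node = node["branches"][result]    # move to the branch for the observed result
    return node, tests_done


# Hypothetical key for three plant species.
key = {
    "test": "petal colour",
    "branches": {
        "white": "species 1",
        "yellow": {
            "test": "leaf number",
            "branches": {3: "species 2", 5: "species 3"},
        },
    },
}

specimen = {"petal colour": "yellow", "leaf number": 5}
taxon, tests_done = identify(key, specimen.get)
```

Here the specimen needs two tests before it is identified, whereas a white-petalled specimen would need only one.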
The characters that are available for constructing the key are specified, as a list of factors, using the CHARACTER parameter. Each factor has a level for each possible value of the character concerned, and you can insert a missing value for a particular taxon to indicate that its value for the character is either variable or unknown. If an "extra" text has been defined for the factor (using the EXTRA parameter of the FACTOR directive), BKEY will use this when printing the textual forms of the key instead of the identifier of the factor. (So the characters can be described in the key using any printable symbol, not just those that may be used in identifiers.) The COST parameter allows you to specify a cost for each character. This may be how much the character costs to observe, or it may simply record your own personal preferences between the characters. By default all the costs are 1.
The names of the taxa can be specified in a text using the TAXONNAMES option. If this is omitted, they are simply numbered 1, 2 and so on. If the taxa are classified into groups, BKEY can construct a key to identify the group of a specimen rather than the taxon itself. These groupings can be supplied, as a factor, using the GROUPS option.
The efficiency of a key is usually measured by its expected cost of identification. Finding the optimal key for a particular set of data essentially requires the construction and comparison of all possible keys for the taxa that could be formed with the available tests, which is impracticable even for moderate numbers of tests and taxa. Heuristic algorithms are therefore used, which construct the key sequentially. They first select the test that "best" divides the taxa into sets (where set k for test i contains all the taxa that can give result k to test i), then select the best test to use with each set, and continue until the sets each contain only one taxon or until no further separation is possible. The "best" test can be defined using a selection criterion function (Gower & Payne 1975). BKEY provides three criteria, which can be selected using the CRITERION option, with settings:
CMe and CMv′ (and two other criteria) were studied by Payne & Thompson (1989), who found that each of them produced the best key for some sets of data. They thus concluded that programs for key construction should allow their users to try several so that they can choose the one that behaves best with any particular set of data.
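The sequential construction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm inside BKEY: the function names, the data layout (a tuple of character values per taxon, with `None` for a variable or unknown value) and, in particular, the selection criterion are hypothetical. The criterion used here (cost multiplied by the size of the largest resulting set) is a simple stand-in and is not CMe, CMv′ or any other criterion offered by the CRITERION option.

```python
def split(taxa, names, test):
    """Set k holds every taxon that can give result k to the test.

    A taxon with a missing value (None) can give any result, so it joins every set.
    """
    levels = sorted({taxa[n][test] for n in names if taxa[n][test] is not None})
    return {k: [n for n in names if taxa[n][test] in (k, None)] for k in levels}


def build_key(taxa, costs, names=None):
    """Greedily build a key: pick the 'best' test, then recurse into each set."""
    names = sorted(taxa) if names is None else names
    if len(names) == 1:
        return names[0]                      # a single taxon: identification complete
    best = None
    for test, cost in enumerate(costs):
        sets = split(taxa, names, test)
        if len(sets) < 2:
            continue                         # identical or missing values: no separation
        # Stand-in criterion (an assumption, not one of BKEY's): prefer cheap tests
        # that leave the smallest worst-case set.
        score = cost * max(len(s) for s in sets.values())
        if best is None or score < best[0]:
            best = (score, test, sets)
    if best is None:
        return names                         # no further separation is possible
    _, test, sets = best
    return {
        "test": test,
        "branches": {k: build_key(taxa, costs, s) for k, s in sets.items()},
    }


# Hypothetical data: four taxa, two binary characters, the first twice as costly.
taxa = {"A": (1, 1), "B": (1, 2), "C": (2, 1), "D": (2, 2)}
costs = [2, 1]
key = build_key(taxa, costs)
```

With these costs the cheaper second character (index 1) is chosen at the root, and the first character is then used within each of the two resulting sets. Swapping in a different scoring line is all that is needed to compare criteria on the same data.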
Usually construction of the key stops when the possible taxa at that point share identical values or have missing values for all the characters. However, if the missing values represent variable rather than unknown values, it may still be worth using these tests in case a specimen of the taxon concerned is obtained that happens to give a level different from the shared level. This partial separation can be requested by setting option PARTIAL=yes.
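One way to read the two stopping rules is sketched below in Python. The function `worth_testing` and its argument layout are hypothetical, and the treatment of partial separation is an interpretation of the description above rather than the rule that BKEY itself applies: a character is taken to offer partial separation when the remaining taxa share a single definite level but at least one of them has a missing (variable) value.

```python
def worth_testing(values, partial=False):
    """Decide whether a character can still help among the remaining taxa.

    values holds the character's value for each remaining taxon, None if missing.
    """
    definite = {v for v in values if v is not None}
    if len(definite) >= 2:
        return True                      # ordinary separation is possible
    has_missing = any(v is None for v in values)
    # Partial separation (assumed reading): one shared level plus a variable taxon,
    # which a specimen might reveal by giving a different level.
    return partial and len(definite) == 1 and has_missing
```

Under this reading, values `[1, None]` stop the construction by default but are still tested when `partial=True`, while `[1, 1]` and `[None, None]` stop it in both cases.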
The key can be printed in various formats, as requested by the PRINT option, or it can be saved using the KEY option. The settings of PRINT are:
BKEY stores the information required for printing as part of the tree. The labels for the diagram are formed as "identifier==n1", where n1 is the first level of the factor. The lines of the indented and bracketed keys are formed similarly if the factor has no extra text and no labels. Otherwise, the form is "fname lname", where fname is the extra text if this has been defined (by the EXTRA parameter of the FACTOR directive) or else the identifier of the factor, and lname is the label if available or the level if not.
Options: PRINT, TAXONNAMES, GROUPS, CRITERION, PARTIAL, KEY.
Parameters: CHARACTER, COST.
Method
BKEY calls procedure BCONSTRUCT to form the key. BCONSTRUCT in turn uses BSELECT, a special-purpose procedure that is customized for keys and stored with BKEY. The methodology involved in the construction of keys is reviewed by Payne & Preece (1980). Statistical applications of keys are described by Payne (1992).
Action with RESTRICT
Any restrictions on the CHARACTER factors or on TAXONNAMES or GROUPS are removed.
References
Gower, J.C. & Payne, R.W. (1975). A comparison of different criteria for selecting binary tests in diagnostic keys. Biometrika, 62, 665-671.
Payne, R.W. & Preece, D.A. (1980). Identification keys and diagnostic tables: a review (with discussion). Journal of the Royal Statistical Society, Series A, 143, 253-292.
Payne, R.W. (1981). Selection criteria for the construction of efficient diagnostic keys. Journal of Statistical Planning and Inference, 5, 27-36.
Payne, R.W., Yarrow, D. & Barnett, J.A. (1982). The construction by computer of a diagnostic key to the genera of yeasts and other such groups of taxa. Journal of General Microbiology, 128, 1265-1277.
Payne, R.W. & Thompson, C.J. (1989). A study of selection criteria for constructing identification keys containing tests with different costs. Computational Statistics Quarterly, 5, 43-52.
Payne, R.W. (1992). The use of identification keys and diagnostic tables in statistical work. In: COMPSTAT 92 Proceedings in Computational Statistics (Ed. Y. Dodge & J. Whittaker), Volume 2, 239-244. Heidelberg: Physica-Verlag.