HIV sequence database

PCOORD Explanation

PCOORD (Principal Coordinate Analysis) is a procedure to find meaningful patterns in sequence data with no a priori knowledge about them. The procedure attempts to summarize the variation in the sequences in a limited number of axes or dimensions. A 'dimension' is basically a combination of positions in a sequence that behave similarly (for example, position 133 usually has an A when position 250 has a G).

One way to describe the process of finding these dimensions is as follows. If we have a two-dimensional swarm of data points, we need two dimensions (the X and Y axes) to describe the variation in the data. However, if the swarm is very elongated and the points lie almost on a straight line, a single dimension along that line captures nearly all of the variation, even though the data were recorded in two. PCOORD uses a mathematical method to find the best way to describe a multi-dimensional dataset in a smaller number of dimensions, which are linear combinations of the original dimensions.
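
The elongated-swarm idea can be sketched in a few lines of Python. This is a minimal illustration of classical principal coordinate analysis (double-centering the squared distances, then extracting the leading eigenvector), not the PCOORD code itself. The six data points, the function names, and the use of power iteration to find the eigenvector are all assumptions made for the example.

```python
import math

# Hypothetical elongated 2-D swarm: points lie close to the line y = 0.5 * x.
points = [(0.0, 0.1), (1.0, 0.4), (2.0, 1.1), (3.0, 1.4), (4.0, 2.1), (5.0, 2.4)]

def pairwise_euclidean(pts):
    """Full symmetric matrix of Euclidean distances between points."""
    return [[math.dist(p, q) for q in pts] for p in pts]

def double_center(dist):
    """Turn a distance matrix into the centered matrix B = -1/2 * J * D^2 * J."""
    n = len(dist)
    a = [[-0.5 * d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    # a is symmetric, so the column means equal the row means.
    return [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]

def leading_axis(b, iterations=200):
    """Power iteration for the largest eigenvalue of b and the matching coordinates."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    # Principal coordinates: unit eigenvector scaled by sqrt(eigenvalue).
    coords = [x * math.sqrt(eigenvalue) for x in v]
    return eigenvalue, coords

dist = pairwise_euclidean(points)
b = double_center(dist)
eigenvalue, coords = leading_axis(b)

# The trace of b is the total variation; the leading eigenvalue is the part
# captured by the first dimension.
total = sum(b[i][i] for i in range(len(b)))
fraction = eigenvalue / total
print(f"fraction of variation on the first axis: {fraction:.4f}")
print("first-axis coordinates:", [round(c, 3) for c in coords])
```

For a swarm this elongated, the first axis should account for nearly all of the variation, which is the sense in which one dimension is enough. The same steps apply to any symmetric distance matrix, including one computed from sequences.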

The dimensions are not necessarily biologically meaningful, but they can be. Quite frequently, some dimensions that are extracted correspond to an epidemiological variable or some other feature of the data. The patterns that are found using PCOORD usually can be seen in a phylogenetic tree as well, but they may be much less pronounced there.

The results from PCOORD depend to some extent on the distance scoring method that is used. For nucleotides, PCOORD computes simple Hamming distances. For amino acids, a similar same/different scoring scheme is available, called ID distances. Also implemented is the Smith and Smith (1990) scoring method, which results in Euclidean distances. Their scoring matrix looks like this:

   D   0
   E   1   0
   K   2   2   0
   R   2   2   1   0
   H   2   2   1   1   0
   N   2   2   2   2   2   0
   Q   2   2   2   2   2   1   0
   S   2   2   2   2   2   2   2   0
   T   2   2   2   2   2   2   2   1   0
   I   3   3   3   3   3   3   3   3   3   0
   L   3   3   3   3   3   3   3   3   3   1   0
   V   3   3   3   3   3   3   3   3   3   1   1   0
   F   3   3   3   3   3   3   3   3   3   2   2   2   0
   W   3   3   3   3   3   3   3   3   3   2   2   2   1   0
   Y   3   3   3   3   3   3   3   3   3   2   2   2   1   1   0
   C   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   0
   M   3   3   3   3   3   3   3   3   3   2   2   2   2   2   2   2   0
   A   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   0
   G   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   1   0
   P   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   0

       D   E   K   R   H   N   Q   S   T   I   L   V   F   W   Y   C   M   A   G   P
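
The lower-triangular matrix above can be expanded into a symmetric lookup table, sketched here in Python alongside the simple same/different (Hamming) count. The function names are hypothetical, and summing the per-position scores over two aligned sequences is an assumption made for illustration; how PCOORD itself combines the per-position scores, and how it treats gaps, is not stated here.

```python
# Lower triangle of the Smith and Smith (1990) matrix, copied from the table above.
SMITH_ROWS = """\
D 0
E 1 0
K 2 2 0
R 2 2 1 0
H 2 2 1 1 0
N 2 2 2 2 2 0
Q 2 2 2 2 2 1 0
S 2 2 2 2 2 2 2 0
T 2 2 2 2 2 2 2 1 0
I 3 3 3 3 3 3 3 3 3 0
L 3 3 3 3 3 3 3 3 3 1 0
V 3 3 3 3 3 3 3 3 3 1 1 0
F 3 3 3 3 3 3 3 3 3 2 2 2 0
W 3 3 3 3 3 3 3 3 3 2 2 2 1 0
Y 3 3 3 3 3 3 3 3 3 2 2 2 1 1 0
C 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 0
M 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 0
A 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
G 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 0
P 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0
"""

def build_matrix(text):
    """Expand the lower triangle into a symmetric {(a, b): score} lookup."""
    order = []
    scores = {}
    for line in text.strip().splitlines():
        residue, *values = line.split()
        order.append(residue)
        if len(values) != len(order):
            raise ValueError(f"row {residue} has the wrong number of entries")
        for other, value in zip(order, values):
            scores[residue, other] = scores[other, residue] = int(value)
    return scores

def hamming(a, b):
    """Number of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))

def smith_score_sum(a, b, scores):
    """Sum of per-position matrix scores (an illustrative aggregation only)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(scores[x, y] for x, y in zip(a, b))

scores = build_matrix(SMITH_ROWS)
print(hamming("ACGT", "ACGA"))               # nucleotide same/different count
print(smith_score_sum("DKI", "EKL", scores)) # amino acid score sum
```

Under this matrix, residues in the same group (for example D and E, or I, L and V) score 1, so a conservative substitution contributes less to the distance than it would under same/different scoring.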

The PCOORD program can label each sequence with a single character (a number, a letter, or a symbol such as * or ^). To use this feature, supply a file containing one character for each sequence. In the dimension plot, the point representing each sequence will then be identified by the corresponding character.
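
As a rough illustration of how such labels pair up with sequences, the Python sketch below reads one character per sequence and attaches it to a coordinate. The exact layout PCOORD expects for the label file is not described here, so the sketch simply ignores whitespace and assumes the characters appear in the same order as the sequences; the function name, sequence names, and coordinates are hypothetical.

```python
def read_labels(text, n_sequences):
    """Return one label character per sequence, ignoring all whitespace."""
    labels = [ch for ch in text if not ch.isspace()]
    if len(labels) != n_sequences:
        raise ValueError(
            f"expected {n_sequences} label characters, found {len(labels)}"
        )
    return labels

# Hypothetical sequence names and first-dimension coordinates.
names = ["seq1", "seq2", "seq3", "seq4"]
axis1 = [-1.8, -1.2, 1.1, 1.9]

# Assumed label file contents: one character per sequence, in alignment order.
labels = read_labels("A A\nB B\n", len(names))

labeled_points = list(zip(labels, names, axis1))
for label, name, x in labeled_points:
    print(f"{label}  {name}  {x:+.1f}")
```

Checking the label count against the number of sequences catches the most likely mistake, a label file that has drifted out of step with the alignment.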

The Principal Coordinate Analysis method is very similar to Principal Component Analysis. The method was developed by J.C. Gower. The PCOORD program suite was developed by Des Higgins (then at the European Molecular Biology Laboratory, EMBL), and adapted for UNIX machines by Jack Leunissen of the CAOS/CAMM institute in Nijmegen, The Netherlands.

For more detailed information, you can view the manual for the Spacer code, which is the basis of our PCOORD tool.

References:

Higgins DG (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. Comput Appl Biosci 8(1):15-22.

Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325-328.

Smith RF, Smith TF (1990) Automatic generation of primary sequence patterns from sets of related protein sequences. Proc Natl Acad Sci U S A 87(1):118-122.

last modified: Wed Nov 7 13:37 2007
