HIV sequence database

Advanced Consensus Explanation

Consensus Maker takes an input file of aligned sequences in most standard formats and calculates a consensus sequence for those sequences. The consensus alone may be returned by the program or the user has the option to prepend the consensus to the original alignment. A copy of the output file may be downloaded. If the input alignment comprises blocks of sequences (e.g., HIV sequences grouped by subtype) then the program can calculate a consensus for each sequence block and a consensus of the consensuses. The program recognizes sequence blocks by how the component sequences are named.

A good way to understand the options available in this program is to click the blue Sample Input button at the top of the submission page. This causes a simple, hypothetical alignment (in table format) to be loaded into the form.

A.seq1	A-CGTATTAG
A.seq2	A-CG-AT
A.seq3	A-CT-CT
A.seq4	A-TT-CX
B.seq1	A-CG-AT
B.seq2	A-CG-CT
B.seq3	A-CG-TT

You can then calculate the consensus of this alignment under varying input options to see the results of those options. Each column of the Sample Input has been chosen to illustrate the workings of the various options. The output looks like:

CON_OF_CONS  ACG-?TTAG
CON_A        Acg-?TTAG
A.seq1       ACGTATTAG
A.seq2       ACG-AT   
A.seq3       ACT-CT   
A.seq4       ATT-CX   
CON_B        ACG-?T???
B.seq1       ACG-AT   
B.seq2       ACG-CT   
B.seq3       ACG-TT

Columns of the sample input illustrate the following. Col. 1: unanimity; Col. 2: all gaps, column squeezed; Col. 3: majority; Col. 4: no majority letter but resolvable by common character; Col. 5: gaps; Col. 6: irresolvable tie in consensus; Col. 7: undefined character; Cols. 8-10: missing information (trailing blanks).

Input file options

Format of input alignment. Consensus Maker recognizes most standard alignment formats. If the program fails to decipher your format, try resubmitting the alignment in FASTA or table format.
Squeeze gaps. If your alignment contains columns that are entirely gaps they will be removed before a consensus is calculated. Default = squeeze gaps. You can also specify what character is used in your alignment to signify gaps. The default is "-".

Note: if your alignment contains sequences of varying length, Consensus Maker will equalize the lengths by adding spaces to the ends of short sequences. Those spaces will not be considered in calculating the consensus unless the space character is added to the set of "Characters to count when making consensus."
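The padding and gap-squeezing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual code: the function name pad_and_squeeze is hypothetical, and it assumes a column is squeezed only when every sequence holds the gap character there (padding spaces do not count as gaps).

```python
def pad_and_squeeze(seqs, gap="-"):
    """Pad sequences to equal length with spaces, then drop all-gap columns."""
    width = max(len(s) for s in seqs)
    # Short sequences are right-padded with spaces (the "trailing blanks").
    padded = [s.ljust(width) for s in seqs]
    # Assumption: keep a column unless every sequence has the gap character in it.
    keep = [i for i in range(width) if any(s[i] != gap for s in padded)]
    return ["".join(s[i] for i in keep) for s in padded]


# The Sample Input sequences from this page.
sample = ["A-CGTATTAG", "A-CG-AT", "A-CT-CT", "A-TT-CX",
          "A-CG-AT", "A-CG-CT", "A-CG-TT"]
squeezed = pad_and_squeeze(sample)
print(squeezed[0])        # first sequence with the all-gap column removed
print(repr(squeezed[1]))  # short sequence, padded with trailing spaces
```

With the Sample Input, the second column (all gaps) is removed and the short sequences end in three spaces, which matches the layout of the sample output shown earlier.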

Consensus output options

Do consensus for each block. If the input contains blocks of sequences, calculate a consensus for each block, not just a single consensus for the alignment as a whole. Default = false. If false, only a single consensus is computed for the entire alignment. If true, you must ensure that the names of the sequences in your alignment follow a conventional format that the program can read. Sequences must have names like "A.US.57866". The program reads the letter(s) before the first dot ("A") and uses them to define an "A" group of sequences. Another group of sequences is defined when that prefix changes, e.g., B.FR.98332 (see the example of using names to identify alignment blocks under Examples). This naming convention is the one followed by the HIV database. The output will have a CONSENSUS_A and a CONSENSUS_B. If more than one character is present before the first dot, those characters become the block name; e.g., CRF01AE.X34577 will define a CRF01AE consensus block. If you don't want to calculate multiple consensuses, your sequences can be named however you like.
Min. no. seqs. for consensus. If a block contains fewer than "n" sequences, then don't calculate a consensus for that block. Default = 3. This number only applies to blocks within an alignment.
Do consensus of consensuses. If consensuses are to be computed for each block in the alignment, also calculate a consensus of these consensuses. Default = false.
Consensus + alignment. Results will show the consensus prepended to the top of the user's alignment. Default = true. When false, the output consists of the consensus alone.
Show number of sequences. If consensuses are to be computed for each block in the alignment, this option shows how many sequences occurred in each block. The number is shown following each consensus name, e.g., CON_A(23). The default is to not show numbers.
Output format. A "pretty print" output shows your alignment aligned to the consensus, with 50 characters per line and a space every 10 characters. The "output aligned" format is like "pretty print" except that identities are shown by the "-" character and gaps by ".". Alternatively, you can have your output presented in the same format as your submission. Each format is illustrated under Examples.
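As a rough illustration of how block names, the minimum sequence count, and the "show number of sequences" label fit together, here is a minimal Python sketch. The function names group_blocks and consensus_label are hypothetical, and the sketch assumes that all sequences sharing a prefix are pooled into one block even if they are not adjacent in the alignment; the program itself may treat that case differently.

```python
def group_blocks(names, min_seqs=3):
    """Group sequence names by the text before the first dot."""
    blocks = {}
    for name in names:
        prefix = name.split(".", 1)[0]  # "A1.FR.83..." -> "A1"
        blocks.setdefault(prefix, []).append(name)
    # Blocks with fewer than min_seqs members get no consensus.
    return {p: m for p, m in blocks.items() if len(m) >= min_seqs}


def consensus_label(prefix, members, show_count=False):
    """Build a consensus name such as CON_A or CON_A(23)."""
    label = "CON_" + prefix
    return f"{label}({len(members)})" if show_count else label


# Names taken from the examples on this page.
names = [
    "A1.FR.83.IIIB_A04321", "A1.FR.83.IIIC_A04322", "A1.DE.96.POIURR_A04322",
    "B.FR.82.LAI_K03455", "B._._.N833_AF76511", "B.US.99.JK77_AF76511",
    "CRF01AE.X34577",
]
blocks = group_blocks(names, min_seqs=3)
labels = [consensus_label(p, m, show_count=True) for p, m in blocks.items()]
print(labels)
```

Here the single CRF01AE sequence falls below the minimum of 3 and so produces no block consensus.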

Consensus calculation options

Unanimous value. The fraction of characters in a column of the alignment needed to establish unanimity (shown as a capital letter) for that column. For example, if unanimous = 1.0 then all characters in a column must be the same in order for the consensus to show a capital letter. A value of .9 requires 90% agreement to show a capital. Default = 1.0
Majority value. Default = 0.5. The fraction of characters in a column of the alignment needed to establish majority (shown as a lowercase letter) for that column. For example, if majority = 0.5 then at least half the characters in a column must agree in order for the consensus to show a lowercase letter. If there is no majority letter for a column the consensus indicates this with either a ? or by the most common character in that column.
Use most common character. This option determines what symbol to enter in the consensus for a column that has no majority character. Suppose a column contained letters AAAGGTTC. Does the user want that column to be represented in the consensus by "a" (i.e., the most common letter)? If so, then set this value to its default, true. Or does the user want that column to be represented in the consensus by "?" (i.e., no letter forms a majority)? If so, then set this value to false.
Tie breaking. If two or more letters in a column occur in equal numbers, e.g., AAAGGGT, how does the consensus tool represent the consensus for this column? There are two options. First (the default), if multiple blocks are present in the alignment and there is a tie between two letters in one block, the program will try to resolve the tie by looking at that column of the alignment in all other blocks as well. For example, if column 1 of block 1 is AAAGGGT, and column 1 of block 2 is AAAAG, then the consensus for column 1 of block 1 will be "a", not "?". Second, ties can be broken by using the IUPAC character that represents the set of tied characters. In the example above, A and G occur in equal numbers. The IUPAC character that represents A or G is R, and that will be the consensus for that column.
Characters to count when making consensus. This is a set of characters ("letters") that the program considers when making a consensus. The default for nucleotide alignments is the set of valid nucleotide characters and the gap character, "ACGTU-". Using these defaults, the alignment column AAAAAXAA would have a consensus of "A" because the "X" character is ignored -- it's not in the set of valid characters. If we edit the ACGTU- set by adding "X" to it, then the consensus for that column would be "a" (majority A, not unanimous). A similar set of amino acid codes, also editable, is defined on the input form. You should first run your alignment with the default character sets to see if that produces the consensus you want. If not, you can edit the character sets so the resulting consensus matches your intent.
Use any character when making consensus. Finally, if you want to consider ALL characters (including blanks, *, x, $, etc.) when making a consensus, check that box.
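The calculation options above can be combined into a single per-column rule. The Python sketch below is one possible reading of that rule, not the program's implementation. The function column_consensus and its parameter names are hypothetical. It assumes that input is compared case-insensitively, that a tie for the top count is handled before the majority test, and that "looking at the other blocks" means counting only the tied letters in the same column of the other blocks.

```python
from collections import Counter

# IUPAC ambiguity codes for sets of tied nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def column_consensus(column, other_blocks="", unanimous=1.0, majority=0.5,
                     most_common=True, valid="ACGTU-", iupac_ties=False):
    """Return a consensus symbol for one alignment column (illustrative only)."""
    # Only characters in the "characters to count" set contribute.
    counts = Counter(c for c in column.upper() if c in valid)
    if not counts:
        return "?"  # nothing countable, e.g. trailing blanks
    ranked = counts.most_common()
    top, n = ranked[0]
    tied = [c for c, k in ranked if k == n]
    if len(tied) > 1:
        if iupac_ties:
            return IUPAC.get(frozenset(tied), "?")
        # Assumed default: count the tied letters in the other blocks' column.
        extra = Counter(c for c in other_blocks.upper() if c in tied).most_common(2)
        if extra and (len(extra) == 1 or extra[0][1] > extra[1][1]):
            return extra[0][0].lower()
        return "?"  # tie could not be resolved
    fraction = n / sum(counts.values())
    if fraction >= unanimous:
        return top          # capital letter: unanimity threshold met
    if fraction >= majority:
        return top.lower()  # lowercase letter: majority threshold met
    return top.lower() if most_common else "?"


print(column_consensus("AAAAAXAA"))                       # X is ignored
print(column_consensus("AAAAAXAA", valid="ACGTU-X"))      # X is counted
print(column_consensus("AAAGGTTC", most_common=False))    # no majority
print(column_consensus("AAAGGGT", other_blocks="AAAAG"))  # tie, other block
print(column_consensus("AAAGGGT", iupac_ties=True))       # tie, IUPAC code
```

Applied column by column to the squeezed Sample Input, these assumptions give Acg-?TTAG for block A and ACG-?T??? for block B, the same strings as the sample output, although that agreement does not show the program works the same way internally.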

Examples

Example of using names to identify alignment blocks:

In the table-formatted file below there are two blocks, an "A1" block and a "B" block recognizable by the "A1." and "B." (note the dot) with which the names begin. Two consensuses will be calculated for this alignment if "Do consensus for each block" is true and "Min. no. seqs. for consensus" is 3.

A1.FR.83.IIIB_A04321 aaactatcgtagctagctagctgatcgatgctagctgatcg.... etc
A1.FR.83.IIIC_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
A1.DE.96.POIURR_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
B.FR.82.LAI_K03455 aaactatcgtagctagctttctgatcgatgctagctgatcg.... etc
B._._.N833_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc
B.US.99.JK77_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc

Example of "pretty print" output:

CON                     gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA 
1a.-.COLONEL_AF290978   ---------- --TTGGGGGC GACACTCCAC CATGAATCAC CCCCCTGTGA 
1a.-.H77_AF009606       GCCAGCCCCC TGATGGGGGC GACACTCCAC CATGAATCAC TCCCCTGTGA 
1a.-.HEC278830_AJ278830 GCCAGCCCCC TGATGGGGGC GACGCTCCAC CATGAATCAC TCCCCTGTGA 

CON                     GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG 
1a.-.COLONEL_AF290978   GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG 
1a.-.H77_AF009606       GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG 
1a.-.HEC278830_AJ278830 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCGTGGCG TTAGTATGAG 

CON                     TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG 
1a.-.COLONEL_AF290978   TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG 
1a.-.H77_AF009606       TGTCGTGCAG CCTTCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG 
1a.-.HEC278830_AJ278830 TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
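The "pretty print" layout shown above (names in a left column, a fixed number of characters per line, a space after every group) can be approximated as follows. This is a sketch only: the function pretty_print and the short demo sequences are made up for illustration, and the name column width (longest name plus one space) is inferred from the example rather than documented.

```python
def pretty_print(names, seqs, line_width=50, group=10):
    """Lay out named sequences in interleaved blocks of grouped characters."""
    name_width = max(len(n) for n in names) + 1
    length = max(len(s) for s in seqs)
    blocks = []
    for start in range(0, length, line_width):
        lines = []
        for name, seq in zip(names, seqs):
            chunk = seq[start:start + line_width]
            # Insert a space between every `group` characters.
            grouped = " ".join(chunk[i:i + group]
                               for i in range(0, len(chunk), group))
            lines.append(name.ljust(name_width) + grouped)
        blocks.append("\n".join(lines))
    # Blocks are separated by a blank line.
    return "\n\n".join(blocks)


# Small demo with a narrow width so the wrapping is easy to see.
text = pretty_print(["CON", "s1"], ["acGTACGTACGT", "ACGTACGTACGT"],
                    line_width=10, group=5)
print(text)
```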

Example of "output aligned" output:

CON                     gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA 
1a.-.COLONEL_AF290978   .......... ..T------- ---------- ---------- C--------- 
1a.-.H77_AF009606       ---------- ---------- ---------- ---------- ---------- 
1a.-.HEC278830_AJ278830 ---------- ---------- ---G------ ---------- ---------- 

CON                     GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG 
1a.-.COLONEL_AF290978   ---------- ---------- ---------- ---------- ---------- 
1a.-.H77_AF009606       ---------- ---------- ---------- ---------- ---------- 
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ----G----- ---------- 

CON                     TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG 
1a.-.COLONEL_AF290978   ---------- ---------- ---------- ---------- ---------- 
1a.-.H77_AF009606       ---------- ---T------ ---------- ---------- ---------- 
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ---------- ----------
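The masking used by the "output aligned" format can be sketched as a character-by-character comparison against the consensus. The function name mask_against_consensus is hypothetical. Based on the example above, the sketch assumes that the comparison ignores case (a sequence "G" matches a consensus "g") and that a gap in the sequence is always shown as ".", whatever the consensus holds.

```python
def mask_against_consensus(consensus, seq, gap="-"):
    """Show identities as '-', gaps as '.', and differences as the letter."""
    out = []
    for c, s in zip(consensus, seq):
        if s == gap:
            out.append(".")   # gap in the sequence
        elif s.upper() == c.upper():
            out.append("-")   # identical to the consensus (case ignored)
        else:
            out.append(s)     # difference: keep the sequence's own letter
    return "".join(out)


# First 20 columns of the CON and COLONEL rows from the example above.
masked = mask_against_consensus("gccagccccctgaTGGGGGC", "------------TTGGGGGC")
print(masked)
```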

Example of formatted output (nexus):

#NEXUS

begin taxa;
dimensions ntax=4;
taxlabels
CON
1a._.COLONEL_AF290978
1a._.H77_AF009606
1a._.HEC278830_AJ278830
;
end;

begin characters;
dimensions nchar=150;
format interleave datatype=dna;
matrix
CON                     gccagccccctgaTGGGGGCGACaCTCCACCATGAATCACtCCCCTGTGA
1a._.COLONEL_AF290978   ------------TTGGGGGCGACACTCCACCATGAATCACCCCCCTGTGA
1a._.H77_AF009606       GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGA
1a._.HEC278830_AJ278830 GCCAGCCCCCTGATGGGGGCGACGCTCCACCATGAATCACTCCCCTGTGA

CON                     GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCaTGGCGTTAGTATGAG
1a._.COLONEL_AF290978   GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.H77_AF009606       GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.HEC278830_AJ278830 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCGTGGCGTTAGTATGAG

CON                     TGTCGTGCAGCCTcCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.COLONEL_AF290978   TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.H77_AF009606       TGTCGTGCAGCCTTCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.HEC278830_AJ278830 TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG

;
end;

last modified: Thu Jul 19 10:59 2007
