Advanced Consensus Explanation
Consensus Maker takes an input file of aligned
sequences in most standard formats and calculates a consensus
sequence for those sequences. The program can return the consensus
alone, or the user can choose to prepend the consensus to
the original alignment. A copy of the output file may be downloaded.
If the input alignment comprises blocks of sequences (e.g., HIV
sequences grouped by subtype) then the program can calculate a
consensus for each sequence block and a consensus of the consensuses.
The program recognizes sequence blocks by how the component sequences
are named.
A good way to understand the options available in this program is
to click the blue Sample Input button at the top of the submission
page. This causes a simple, hypothetical alignment (in table format)
to be loaded into the form.
A.seq1 A-CGTATTAG
A.seq2 A-CG-AT
A.seq3 A-CT-CT
A.seq4 A-TT-CX
B.seq1 A-CG-AT
B.seq2 A-CG-CT
B.seq3 A-CG-TT
You can then calculate the consensus of
this alignment under varying input options to see the results of
those options. Each column of the Sample Input has been chosen to
illustrate the workings of the various options. The output looks like:
CON_OF_CONS ACG-?TTAG
CON_A Acg-?TTAG
A.seq1 ACGTATTAG
A.seq2 ACG-AT
A.seq3 ACT-CT
A.seq4 ATT-CX
CON_B ACG-?T???
B.seq1 ACG-AT
B.seq2 ACG-CT
B.seq3 ACG-TT
Column numbers refer to the sample input, before the all-gap column 2 is squeezed out:
Col. 1: unanimity,
Col. 2: all gaps, column squeezed,
Col. 3: majority,
Col. 4: no majority letter but resolvable by common character,
Col. 5: gaps,
Col. 6: irresolvable tie in consensus,
Col. 7: undefined character,
Cols. 8-10: missing information (trailing blanks).
Input file options
- Format of input alignment. Consensus Maker
recognizes most standard alignment formats. If the
program fails to decipher your format try resubmitting
the alignment in fasta or table format.
- Squeeze gaps. If your
alignment contains columns that are entirely gaps
they will be removed before a consensus is calculated.
Default = squeeze gaps. You can also specify what
character is used in your alignment to signify gaps. The
default is "-".
Note: if your alignment contains sequences of varying
length, Consensus Maker will equalize the lengths by
adding spaces to the ends of short sequences. Those
spaces will not be considered in calculating the
consensus unless the space character is added to the
set of "Characters to count when making consensus."
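The padding and gap-squeezing steps described above can be sketched in Python. This is only an illustration of the described behavior, not Consensus Maker's own code; the function name pad_and_squeeze and its parameters are hypothetical.

```python
def pad_and_squeeze(seqs, gap="-", squeeze=True):
    """Pad sequences to equal length with spaces, then drop all-gap columns.

    Hypothetical helper: a column is removed only if every sequence has the
    gap character there. Padding spaces do not count as gaps.
    """
    width = max(len(s) for s in seqs)
    padded = [s.ljust(width) for s in seqs]  # short sequences get trailing spaces
    if not squeeze:
        return padded
    # Keep a column if at least one sequence has a non-gap character in it.
    keep = [i for i in range(width) if any(row[i] != gap for row in padded)]
    return ["".join(row[i] for i in keep) for row in padded]


sample = ["A-CGTATTAG", "A-CG-AT", "A-CT-CT", "A-TT-CX",
          "A-CG-AT", "A-CG-CT", "A-CG-TT"]
for row in pad_and_squeeze(sample):
    print(repr(row))
```

Applied to the Sample Input, column 2 is removed and the shorter sequences end in three padding spaces.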
Consensus output options
- Do consensus for
each block. If the input contains blocks of sequences
then calculate a consensus for each block, not just a
single consensus for the alignment as a whole. Default =
false. If false, only a single consensus is computed for
the entire alignment. If true, you must ensure that
the names of the sequences in your alignment follow a
conventional format that can be read by the program.
Sequences must have names like "A.US.57866". The program
reads the letter(s) before the first dot ("A") and uses
them to define an "A" group of sequences. Another group of
sequences will be defined in the alignment if that
prefix changes, e.g., B.FR.98332 (see the example of using
names to identify alignment blocks under Examples).
This naming convention is that followed by the HIV
database. The output will have a CONSENSUS_A and a
CONSENSUS_B. If more than one character is present before
the first dot those characters will become the block
name; e.g., CRF01AE.X34577 will define a CRF01AE
consensus block. If you don't want to calculate multiple
consensuses then your sequences can be named however you
like.
- Min. no. seqs. for
consensus. If a block contains fewer than "n"
sequences, then don't calculate a consensus for that
block. Default = 3. This number only applies to blocks
within an alignment.
- Do consensus of
consensuses. If consensuses are to be computed for
each block in the alignment also calculate a consensus of
these consensuses. Default = false.
- Consensus + alignment.
Results will show the consensus prepended to the
user's alignment. Default = true. When false, the output
consists of the consensus alone.
- Show number of sequences.
If consensuses are to be computed for
each block in the alignment this option will show how many
sequences occurred in each block. The number will be shown
following each consensus name, e.g., CON_A(23). The default
is to not show numbers.
- Output format. A
"pretty print" output shows your alignment aligned
to the consensus. The
alignment contains 50 characters per line with spaces
every 10 characters (see the "pretty print" example under Examples).
The "output aligned" format is like "pretty print" except that
identities are shown by the "-" character and gaps by "."
(see the "output aligned" example under Examples).
Alternatively, you can have your output presented in the same
format as your submission (see the formatted output example under Examples).
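The "pretty print" layout (50 characters per line, a space every 10 characters) can be sketched in Python. The function name pretty_rows and its parameters are hypothetical, and the sketch covers only the sequence text, not the name column.

```python
def pretty_rows(seq, line_width=50, group=10):
    """Split a sequence into lines of `line_width` characters,
    with a space between every `group` characters (hypothetical helper)."""
    lines = []
    for start in range(0, len(seq), line_width):
        chunk = seq[start:start + line_width]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        lines.append(" ".join(groups))
    return lines


for line in pretty_rows("ACGTACGTAC" * 6):
    print(line)
```

A 60-character sequence yields one full line of five groups followed by a second line holding the remaining group.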
Consensus calculation options
- Unanimous value. The
fraction of characters in a column of the alignment
needed to establish unanimity (shown as a capital letter)
for that column. For example, if unanimous = 1.0 then all
characters in a column must be the same in order for the
consensus to show a capital letter. A value of 0.9
requires 90% agreement to show a capital. Default =
1.0.
- Majority value. Default = 0.5.
The fraction of characters in a column of the alignment
needed to establish majority (shown as a lowercase
letter) for that column. For example, if majority = 0.5
then at least half the characters in a column must agree
in order for the consensus to show a lowercase letter. If
there is no majority letter for a column, the consensus
shows either "?" or the most common character in that
column, depending on the "Use most common character" option.
- Use most common character.
This option determines what symbol to enter in the
consensus for a column that has no majority character.
Suppose a column contained letters AAAGGTTC. Does the
user want that column to be represented in the consensus
by "a" (i.e., the most common letter)? If so, then set
this value to its default, true. Or does the user want
that column to be represented in the consensus by "?"
(i.e., no letter forms a majority)? If so, then set this
value to false.
- Tie breaking.
If there are two or more letters in a column that occur
in equal numbers, e.g., AAAGGGT, how does the consensus tool
represent the consensus for this column? There are two
options. If multiple blocks are present in the
alignment and there is a tie between two letters in one
block, the program will try to resolve the tie by looking
at that column of the alignment in all other blocks as
well. For example, if column 1 of block 1 is AAAGGGT, and
column 1 of block 2 is AAAAG, then the consensus for
column 1 block 1 will be "a", not "?" This is the default.
Ties can also be broken by using the IUPAC character that
represents the set of tied characters. In the AAAGGGT example
above, A and G occur in equal numbers. The IUPAC character that
represents A or G is R, and that will be the consensus for
that column.
- Characters to count when
making consensus. This is a set of characters
("letters") that the program considers when making a
consensus. The default for nucleotide alignments is the
set of valid nucleotide characters and the gap character
"ACGTU-". Using these defaults, the alignment column
AAAAAXAA would have a consensus of "A" because the "X"
character is ignored -- it's not in the set of valid
characters. If we edit the ACGTU- set by adding "X" to
it, then the consensus for that column would be "a"
(majority A, not unanimous). A similar set of amino acid
codes, also editable, is defined on the input form. You
should first run your alignment with the default
character sets to see if that produces the consensus you
want. If not, you can edit the character sets so the
resulting consensus matches your intent.
- Use any character when making
consensus. Finally, if you want to consider ALL
characters (including blanks, *, x, $, etc.) when making
a consensus, check this box.
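The per-column rules described in the options above can be sketched in Python. This is one reading of the text, not the tool's actual implementation: the function name column_consensus, its parameter names, the "blocks"/"iupac" labels, upper-casing of input, and the choice to return "?" for a column with no countable characters are all assumptions.

```python
from collections import Counter

# IUPAC nucleotide ambiguity codes, keyed by the set of tied bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def column_consensus(column, other_columns=(), chars="ACGTU-",
                     unanimous=1.0, majority=0.5,
                     use_most_common=True, tie_break="blocks"):
    """Consensus symbol for one alignment column (hypothetical helper).

    `column` holds this block's characters; `other_columns` holds the same
    column from the other blocks and is used only to break ties.
    """
    counts = Counter(c for c in column.upper() if c in chars)
    if not counts:
        return "?"  # assumption: nothing countable, e.g. only trailing blanks
    total = sum(counts.values())
    top = max(counts.values())
    leaders = sorted(c for c, n in counts.items() if n == top)
    if len(leaders) > 1:  # two or more characters share first place
        if tie_break == "iupac":
            return IUPAC.get(frozenset(leaders), "?")
        # Otherwise pool this column with the other blocks and recount.
        pooled = Counter(counts)
        for other in other_columns:
            pooled.update(c for c in other.upper() if c in chars)
        best = max(pooled[c] for c in leaders)
        winners = [c for c in leaders if pooled[c] == best]
        return winners[0].lower() if len(winners) == 1 else "?"
    letter, fraction = leaders[0], top / total
    if fraction >= unanimous:
        return letter.upper()  # capital letter: unanimity
    if fraction >= majority or use_most_common:
        return letter.lower()  # lowercase: majority or most common
    return "?"


print(column_consensus("AAAAAXAA"))                     # X is ignored
print(column_consensus("AAAGGTTC", use_most_common=False))
print(column_consensus("AAAGGGT", ["AAAAG"]))           # tie broken by block 2
print(column_consensus("AAAGGGT", tie_break="iupac"))
```

Under these assumptions the sketch reproduces the worked examples in the option descriptions, and applying it column by column to the squeezed Sample Input gives the CON_A and CON_B rows shown earlier.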
Examples
Example of using names to identify
alignment blocks:
In the table-formatted file below there are two blocks, an "A1"
block and a "B" block recognizable by the "A1." and "B." (note the
dot) with which the names begin. Two consensuses will be calculated
for this alignment if "Do consensus for each block" is true and "Min.
no. seqs. for consensus" is 3.
A1.FR.83.IIIB_A04321 aaactatcgtagctagctagctgatcgatgctagctgatcg.... etc
A1.FR.83.IIIC_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
A1.DE.96.POIURR_A04322 aaactatcgtagctagctag------gatgctagctgatcg.... etc
B.FR.82.LAI_K03455 aaactatcgtagctagctttctgatcgatgctagctgatcg.... etc
B._._.N833_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc
B.US.99.JK77_AF76511 acactatcgtagctagctagctgatcgatgctagctgatcg.... etc
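As a rough illustration of how block names could be derived from sequence names, the Python sketch below groups names by the text before the first dot and drops blocks smaller than the minimum. The function name group_blocks is hypothetical, and grouping by prefix regardless of sequence order is an assumption; the program itself may require a block's sequences to be contiguous.

```python
def group_blocks(names, min_seqs=3):
    """Group sequence names by the text before the first dot.

    Hypothetical helper: blocks with fewer than `min_seqs` members are
    dropped, mirroring the "Min. no. seqs. for consensus" option.
    A name without a dot becomes its own block name (an assumption).
    """
    blocks = {}
    for name in names:
        prefix = name.split(".", 1)[0]
        blocks.setdefault(prefix, []).append(name)
    return {b: members for b, members in blocks.items()
            if len(members) >= min_seqs}


names = [
    "A1.FR.83.IIIB_A04321", "A1.FR.83.IIIC_A04322", "A1.DE.96.POIURR_A04322",
    "B.FR.82.LAI_K03455", "B._._.N833_AF76511", "B.US.99.JK77_AF76511",
]
for block, members in group_blocks(names).items():
    print(block, len(members))
```

With the six names above and a minimum of 3, both the A1 and B blocks qualify; raising the minimum to 4 would leave no block eligible for a consensus.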
Example of "pretty print" output:
CON gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA
1a.-.COLONEL_AF290978 ---------- --TTGGGGGC GACACTCCAC CATGAATCAC CCCCCTGTGA
1a.-.H77_AF009606 GCCAGCCCCC TGATGGGGGC GACACTCCAC CATGAATCAC TCCCCTGTGA
1a.-.HEC278830_AJ278830 GCCAGCCCCC TGATGGGGGC GACGCTCCAC CATGAATCAC TCCCCTGTGA
CON GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG
1a.-.COLONEL_AF290978 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG
1a.-.H77_AF009606 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCATGGCG TTAGTATGAG
1a.-.HEC278830_AJ278830 GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCGTGGCG TTAGTATGAG
CON TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.COLONEL_AF290978 TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.H77_AF009606 TGTCGTGCAG CCTTCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.HEC278830_AJ278830 TGTCGTGCAG CCTCCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
Example of "output aligned" output:
CON gccagccccc tgaTGGGGGC GACaCTCCAC CATGAATCAC tCCCCTGTGA
1a.-.COLONEL_AF290978 .......... ..T------- ---------- ---------- C---------
1a.-.H77_AF009606 ---------- ---------- ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---G------ ---------- ----------
CON GGAACTACTG TCTTCACGCA GAAAGCGTCT AGCCaTGGCG TTAGTATGAG
1a.-.COLONEL_AF290978 ---------- ---------- ---------- ---------- ----------
1a.-.H77_AF009606 ---------- ---------- ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ----G----- ----------
CON TGTCGTGCAG CCTcCAGGAC CCCCCCTCCC GGGAGAGCCA TAGTGGTCTG
1a.-.COLONEL_AF290978 ---------- ---------- ---------- ---------- ----------
1a.-.H77_AF009606 ---------- ---T------ ---------- ---------- ----------
1a.-.HEC278830_AJ278830 ---------- ---------- ---------- ---------- ----------
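The "output aligned" masking shown above can be sketched in Python: gaps in a sequence become ".", characters matching the consensus become "-", and differences are printed as-is. The function name mask_identities is hypothetical, and the case-insensitive comparison is inferred from the example rather than stated by the documentation.

```python
def mask_identities(consensus, seq, gap="-"):
    """Render one sequence against the consensus in "output aligned" style
    (hypothetical helper; assumes both strings have the same length)."""
    out = []
    for con_char, seq_char in zip(consensus, seq):
        if seq_char == gap:
            out.append(".")            # gap in the sequence
        elif seq_char.upper() == con_char.upper():
            out.append("-")            # identical to the consensus
        else:
            out.append(seq_char)       # difference is shown explicitly
    return "".join(out)


print(mask_identities("gccagccccctgaTGGGGGC", "------------TTGGGGGC"))
```

The call uses the first 20 columns of the COLONEL sequence from the example, without the spacing between groups.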
Example of formatted output (nexus):
#NEXUS
begin taxa;
dimensions ntax=4;
taxlabels
CON
1a._.COLONEL_AF290978
1a._.H77_AF009606
1a._.HEC278830_AJ278830
;
end;
begin characters;
dimensions nchar=150;
format interleave datatype=dna;
matrix
CON gccagccccctgaTGGGGGCGACaCTCCACCATGAATCACtCCCCTGTGA
1a._.COLONEL_AF290978 ------------TTGGGGGCGACACTCCACCATGAATCACCCCCCTGTGA
1a._.H77_AF009606 GCCAGCCCCCTGATGGGGGCGACACTCCACCATGAATCACTCCCCTGTGA
1a._.HEC278830_AJ278830 GCCAGCCCCCTGATGGGGGCGACGCTCCACCATGAATCACTCCCCTGTGA
CON GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCaTGGCGTTAGTATGAG
1a._.COLONEL_AF290978 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.H77_AF009606 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCATGGCGTTAGTATGAG
1a._.HEC278830_AJ278830 GGAACTACTGTCTTCACGCAGAAAGCGTCTAGCCGTGGCGTTAGTATGAG
CON TGTCGTGCAGCCTcCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.COLONEL_AF290978 TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.H77_AF009606 TGTCGTGCAGCCTTCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
1a._.HEC278830_AJ278830 TGTCGTGCAGCCTCCAGGACCCCCCCTCCCGGGAGAGCCATAGTGGTCTG
;
end;
last modified: Thu Jul 19 10:59 2007