HIV sequence database



Highlighter Explanation


Data input limitations

Highlighter can process a maximum of 500 nucleotide or amino acid sequences.

Input format permitted

Highlighter takes nucleotide and amino acid alignments as input. The alignments can be in any of the Common Sequence Formats. The input must be codon-aligned if you wish to use the silent and non-silent statistics.
For nucleotide sequences, Highlighter handles sequences of different lengths: sequences shorter than the master sequence are extended with dashes ('-') to the length of the master, and longer sequences are trimmed to the length of the master.
For amino acid sequences, Highlighter does not currently have this feature and instead prompts the user to correct the data set.
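The length handling described above for nucleotide sequences can be sketched as follows. This is only an illustration, not the tool's code: the function name `fit_to_master` is hypothetical, and padding at the right-hand end is an assumption, since the text does not say where the dashes are added.

```python
def fit_to_master(master: str, seq: str, gap: str = "-") -> str:
    """Return seq padded with gaps or trimmed to the length of the master."""
    target = len(master)
    if len(seq) < target:
        # Assumption: dashes are appended at the right-hand end.
        return seq + gap * (target - len(seq))
    # Longer sequences lose everything past the master's last position.
    return seq[:target]


master = "ATGACTAATTAG"
print(fit_to_master(master, "ATGACC"))           # ATGACC------
print(fit_to_master(master, "ATGACCGTTTAAGGG"))  # ATGACCGTTTAA
```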

Ignore alphabet validation

If unchecked, each sequence is classified as nucleotide, amino acid or indeterminate. If more than 2% of the characters are unambiguous amino acid codes [QEILFP], the sequence is treated as protein. If more than 94% of the characters are ATGCUNRY, it is treated as a nucleotide sequence. Otherwise it is treated as indeterminate.

Checking this option allows the user to submit sequences that show amino acid characters only at mutated positions, with all other positions represented by, for example, dashes.
For example:

Master: TFQPSSGGDLEI
Seq1  : --M---------
Seq2  : --D------D--
will result in regular output only if the 'Ignore alphabet validation' checkbox is checked.
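The alphabet check described in this section can be sketched as below. The thresholds and character sets come from the text; the function name `classify` is hypothetical, and leaving gap characters out of the percentages is an assumption, since the text does not say how they are counted.

```python
PROTEIN_ONLY = set("QEILFP")   # letters that are never nucleotide codes
NUCLEOTIDE = set("ATGCUNRY")


def classify(seq: str) -> str:
    """Classify a sequence as 'protein', 'nucleotide' or 'indeterminate'."""
    # Assumption: gap characters are not counted in the percentages.
    chars = [c for c in seq.upper() if c not in "-."]
    if not chars:
        return "indeterminate"
    protein_frac = sum(c in PROTEIN_ONLY for c in chars) / len(chars)
    nucleotide_frac = sum(c in NUCLEOTIDE for c in chars) / len(chars)
    if protein_frac > 0.02:
        return "protein"
    if nucleotide_frac > 0.94:
        return "nucleotide"
    return "indeterminate"


for name, seq in [("Master", "TFQPSSGGDLEI"),
                  ("Seq1", "--M---------"),
                  ("Seq2", "--D------D--")]:
    print(name, classify(seq))
```

Under these assumptions Seq1 and Seq2 come out as indeterminate, because M and D are neither in [QEILFP] nor in ATGCUNRY. That is the situation in which the checkbox is needed.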

Change masters

This feature enables you to select the sequence(s) that will act as master.

If the box is not checked, the top sequence is taken as the master by default. When multiple masters are required for the Match option, the top n sequences are taken as masters, where n is a number specified by the user. When the change masters button is clicked, however, the sequences are extracted from the input file and checkboxes are displayed. Use these checkboxes to select the sequences you want to use as masters. All master sequences must be the same length.

NOTE: When there are multiple selections but only one master is needed, any one of the selections will be used as master.

Mismatches

This option finds the nucleotides/amino acids in a query sequence that differ from those in a SINGLE master sequence. Each nucleotide/amino acid that differs from the master is assigned a color. The example below shows mismatches in a nucleotide sequence:

Color key: T, A, C and G are each shown in their own color.

Consider the following sequences,

Master: AGTTAG
Query : TAGCAG
Result: ||||

In the above example, the query sequence differs from the master in positions 1, 2, 3 and 4. Each of these changes is indicated by a "|" whose color depends on which nucleotide the query has at that position.
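A minimal sketch of the mismatch comparison, using the example above. The function name `mismatches` is hypothetical, and this version compares characters literally, without any of the gap or IUPAC options.

```python
def mismatches(master: str, query: str) -> list[tuple[int, str]]:
    """Return (1-based position, query residue) for every difference."""
    return [
        (pos, q)
        for pos, (m, q) in enumerate(zip(master, query), start=1)
        if m != q
    ]


print(mismatches("AGTTAG", "TAGCAG"))
# [(1, 'T'), (2, 'A'), (3, 'G'), (4, 'C')]
```

The residue returned with each position is the one that would determine the color of the "|".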

Mark APOBEC signatures

This selection will highlight APOBEC signatures and G->A conversions that are encountered in the sequences. The following illustrates an example of how this function works.
Master: GTGCGT
Query : AAAGAC
The above master and query will produce the following plot. Pink filled circles denote APOBEC signatures and open diamonds represent G->A conversions.

To identify APOBEC signatures, whenever the program encounters a G->A change, it looks at the query nucleotides in the next two positions. If the first of these positions contains an A or a G, and the second does not contain a C, the position is marked as an APOBEC signature. For example, in the above plot, position 1 is marked as APOBEC because the query has a G->A change in position 1, an A in position 2, and another A (not a C) in position 3. Similarly, position 3 is also marked as an APOBEC signature. If there is a G->A change in a position but the next two positions do not qualify as APOBEC, the position is marked as a G->A conversion, as shown in position 5. In short, all changes to A are marked in light green, all G->A changes have an open diamond in addition to the green bar, and all G->A changes that are also APOBEC signatures have a filled circle in addition to the green bar. To learn more about APOBEC signatures see here.
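The rule above can be sketched as follows. The function name `classify_g_to_a` is hypothetical. Treating a G->A change with fewer than two following positions as a plain G->A conversion is an assumption; the text does not cover that case, nor gaps in the following positions.

```python
def classify_g_to_a(master: str, query: str) -> dict[int, str]:
    """Map each 1-based G->A position to 'APOBEC' or 'G->A'."""
    marks = {}
    for i, (m, q) in enumerate(zip(master, query)):
        if not (m == "G" and q == "A"):
            continue
        following = query[i + 1 : i + 3]  # next two query positions
        # Assumption: with fewer than two following positions the
        # change is reported as a plain G->A conversion.
        is_apobec = (
            len(following) == 2
            and following[0] in "AG"
            and following[1] != "C"
        )
        marks[i + 1] = "APOBEC" if is_apobec else "G->A"
    return marks


print(classify_g_to_a("GTGCGT", "AAAGAC"))
# {1: 'APOBEC', 3: 'APOBEC', 5: 'G->A'}
```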

Transitions and transversions

This option enables you to compare query sequences with the master and highlight transitions, transversions and A<->G transitions with the following colors:

Color key: transversions (pink), C<->T transitions, A<->G transitions (light blue), transition or transversion (green).

Consider the following sequences,

Master: ATGCATM
Query : -AAGGCG
Result: |||||||

In the above example, every transition except the one in position 6 is an A<->G transition and is therefore marked in light blue. All the transversions are marked in pink. In the last position (7), the M in the master represents A or C and the query has a G in the corresponding position; this could be either a transition or a transversion, so it is marked in green.
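The classification in this example can be sketched as below. The names `base_change` and `classify_site` are hypothetical, and reporting any position where exactly one of the two sequences has a gap as "gap" is an assumption made only for this sketch.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "H": "ACT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def base_change(m: str, q: str) -> str:
    """Classify a change between two different unambiguous bases."""
    if {m, q} == {"A", "G"}:
        return "A<->G transition"
    if {m, q} == {"C", "T"}:
        return "C<->T transition"
    return "transversion"


def classify_site(m: str, q: str):
    """Return the kind of change at one position, or None if no change."""
    if m == q:
        return None
    if "-" in (m, q):
        return "gap"  # assumption: gaps are reported separately
    m_bases, q_bases = set(IUPAC[m]), set(IUPAC[q])
    if m_bases & q_bases:
        return None  # the codes share a base, so no difference is shown
    kinds = {base_change(a, b) for a in m_bases for b in q_bases}
    return kinds.pop() if len(kinds) == 1 else "transition or transversion"


for pos, (m, q) in enumerate(zip("ATGCATM", "-AAGGCG"), start=1):
    print(pos, classify_site(m, q))
```

For position 7, M expands to A or C: A to G is a transition and C to G is a transversion, so the sketch reports "transition or transversion".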

Silent and non-silent mutations

This option enables you to compare query NUCLEOTIDE sequences with the master and highlight silent and non-silent mutations with the following colors.

Color key: silent and non-silent mutations are each shown in their own color.

Consider the following sequences,

Master: ATGACTAATTAG
Query : ATGACCGTTTAA
Result:      |||   |

The tool translates the nucleotide sequences into the corresponding amino acid sequences and highlights silent and non-silent mutations. In the above example, the second codon in the master (ACT) encodes threonine, and the corresponding codon in the query (ACC) also encodes threonine, so this is shown as a silent mutation. In contrast, the third codon in the master (AAT) encodes asparagine, while the corresponding codon in the query (GTT) encodes valine, so this is shown as a non-silent mutation.

This tool uses SNAP to calculate the statistics. It only compares the Master sequence with the other sequences and does not compare all pairs of sequences. For a more detailed analysis of silent and non-silent mutations, please use SNAP.
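The codon comparison in the example can be sketched as below. This is not SNAP and does not reproduce its statistics; it is a plain codon-by-codon translation with the standard genetic code. The function name `silent_nonsilent` is hypothetical, and skipping codons that contain gaps or ambiguity codes is an assumption.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
# Standard genetic code: codons in TCAG order mapped to one-letter codes.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def silent_nonsilent(master: str, query: str) -> list[tuple[int, str]]:
    """Return (1-based nucleotide position, kind) for each changed base."""
    marks = []
    for start in range(0, min(len(master), len(query)) - 2, 3):
        m_codon = master[start:start + 3]
        q_codon = query[start:start + 3]
        if m_codon == q_codon:
            continue
        m_aa, q_aa = CODON_TABLE.get(m_codon), CODON_TABLE.get(q_codon)
        if m_aa is None or q_aa is None:
            continue  # assumption: codons with gaps or ambiguity are skipped
        kind = "silent" if m_aa == q_aa else "non-silent"
        marks += [
            (start + i + 1, kind)
            for i in range(3)
            if m_codon[i] != q_codon[i]
        ]
    return marks


print(silent_nonsilent("ATGACTAATTAG", "ATGACCGTTTAA"))
# [(6, 'silent'), (7, 'non-silent'), (8, 'non-silent'), (12, 'silent')]
```

The fourth codon changes from one stop codon (TAG) to another (TAA), which this sketch also counts as silent.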

Match

This option enables you to identify the matching nucleotides/amino acids between a single or multiple masters and a query sequence. If the number of masters is entered as 2, the top 2 sequences in the file will be considered as master sequences.

Each master is assigned a unique color and is matched against each of the query sequences. Matching nucleotides/amino acids in the query are highlighted in the color of the master they match. If a nucleotide/amino acid in the query matches more than one master, the match is ignored; only unique matches are colored.

Consider the following example,

Master1: ATTGGC
Master2: AGGCAT
Query1 : AGTTAG
Result : A||T|G

In the above example, the G in position 2 of the query matches master2 and is indicated by a green "|" in that position of the result. The query also matches the T of master1 in position 3, which is indicated by a red "|". In position 1 the query matches both master1 and master2, so this position is left uncolored, as only unique matches are displayed in the result. In positions 4 and 6 there are no matches, so these are treated as polymorphic sites and, depending on the option chosen for labeling polymorphisms, are either left uncolored or colored black. For more information on labeling polymorphisms, see below.
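The unique-match rule can be sketched as follows, using the example above. The function name `unique_matches` and the labels it returns are hypothetical, and characters are compared literally, without the gap or IUPAC options.

```python
def unique_matches(masters: list[str], query: str) -> list[str]:
    """Label each query position by the single master it matches, if any."""
    labels = []
    for pos, q in enumerate(query):
        hits = [n for n, m in enumerate(masters, start=1) if m[pos] == q]
        if len(hits) == 1:
            labels.append(f"master{hits[0]}")  # drawn in that master's color
        elif hits:
            labels.append("shared")            # matches several masters
        else:
            labels.append("none")              # polymorphic site
    return labels


print(unique_matches(["ATTGGC", "AGGCAT"], "AGTTAG"))
# ['shared', 'master2', 'master1', 'none', 'master2', 'none']
```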

Label polymorphisms

Consider the following sequences,

Master1: ATTGATA
Master2: ATTGTTA
Query1 : ATTGCTA

All the sequences above are identical except for the nucleotides in the 5th position. While the master sequences have an A and T in their respective positions, Query1 has a C. By selecting the option to label polymorphisms, these mismatches will also be indicated in black, while they will be ignored if this option is not selected.

Indicate successive matches

This option highlights successive matches with bars as shown below:

Master1: ATTGGC
Master2: AGGCAT
Query1 : ATTTAG
Result : ===T|G
(=== stands for the single long bar spanning positions 1 to 3)

In the example, the first three successive matches are represented as a single long bar as shown when the 'Use bars to indicate successive matches' option is selected. If the option is not selected, the matches are represented as regular vertical lines as shown in position 5.

Treat gaps as character

Match:
When a match is done with the option "treat gaps as character", the gaps are treated as a "fifth nucleotide" ("21st amino acid") and a gap in the query is matched with a gap in the master in the same position.

Consider the following example:

Master1: -TTGG-
Master2: AGGCA-
Query1 : -GTTA-
Result : |||T|-

In the above example, when "treat gaps as character" is selected, the gap in the 1st position of the query is matched with the gap in the first position of master 1. However, the gap in the 6th position is not taken into account and is ignored because it matches with more than one master. For more details on match see above.

Mismatch, Transition & transversion, and Silent & non-silent:
When any of the above options is run with the option "treat gaps as character", a gap IN THE QUERY is highlighted in gray, for both nucleotides and amino acids, if there is no gap in the master sequence at the same position. If the "treat gaps as character" option is not chosen, such a gap is ignored.

Handling of IUPAC codes

When the option "match IUPAC codes" is selected, the following IUPAC codes are also considered during a match or mismatch:

Character  Meaning
M          A or C
R          A or G
W          A or T
S          C or G
Y          C or T
K          G or T
B          C or G or T
H          A or C or T
V          A or C or G
D          A or G or T
N          A or C or G or T
?          Any state or nothing

Ignore

When the ignore option is selected, the tool skips over the IUPAC code in the sequence and does not perform any comparison in that position.

Treat IUPAC codes as characters

In this case, the tool treats IUPAC codes as regular characters and does not use the nucleotides they stand for when comparing. For example, although R stands for A or G, with this option it will match ONLY another R.

Use IUPAC codes to compare

Match:
When IUPAC codes are included in the match, the codes are also matched based on the nucleotides they represent. For example, if the master had an M in the 2nd position and the query had an R in the corresponding position, then this is considered a match because M could be an A and so could R. Whereas, if the master had a C in the 2nd position and the query had a D in the same position then this would not be a match because D includes everything but C.

Consider the following example:

Master1: AMRW?TGC??----?
Master2: MHND?Y?T?G-AA-T
Query1 : BTACAT?M?T--??-
Result : |||||||||||||||

In the above example, the query matches Master 2 in the first position because the B in the query and the M in Master 2 can both represent C. Position 2 shows a similar case. In position 3, the A in the query matches both Master 1 and Master 2, so this match is not shown. In position 4, the C in the query matches neither W (A or T) nor D (A, G or T). If the option to label polymorphisms is selected, this position is labeled black; otherwise it is ignored. To learn about labeling polymorphisms, see the "Label polymorphisms" section above. When a question mark is present in the query during a match, it is considered a match with all the masters and is ignored. If instead it is present in a master and the query does not match any other master in the corresponding position, this is considered a match and is shown. This is demonstrated in positions 7, 9, 13, 14 and 15 of the example.

Mismatch, transitions & transversions and silent & non-silent:
When IUPAC codes are included in any of the above cases, a difference is shown only if there are no common nucleotides between the IUPAC codes. Consider the following example during mismatch:

Master: RMATGC-D??
Query : AHWGGDABA?
Result: AHW|G||BA?

In the above example, the first three positions of the query match the first three positions of the master and hence are not shown. At position 4, however, the T in the master does not match the G in the query, and this is shown in yellow. Similarly, at position 6 the master has a C while the query has a D, which represents the nucleotides A, G or T, so this is considered a mismatch.
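The mismatch example can be reproduced with the sketch below, which reports a difference only when the two codes share no nucleotide. The names `states` and `iupac_mismatches` are hypothetical. Two points are assumptions: '?' is modeled as compatible with everything, including a gap, and a gap in the query is left unmarked, as when "treat gaps as character" is not selected.

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "H": "ACT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def states(code: str) -> set[str]:
    """Return the set of states a character can stand for."""
    if code == "?":
        return set("ACGT-")  # "any state or nothing"
    if code == "-":
        return {"-"}
    return set(IUPAC[code])


def iupac_mismatches(master: str, query: str) -> str:
    """Return a result row: '|' at mismatches, the query character elsewhere."""
    row = []
    for m, q in zip(master.upper(), query.upper()):
        if q == "-":
            row.append(q)    # assumption: query gaps are ignored
        elif states(m) & states(q):
            row.append(q)    # the codes share a state: no difference shown
        else:
            row.append("|")
    return "".join(row)


print(iupac_mismatches("RMATGC-D??", "AHWGGDABA?"))
# AHW|G||BA?
```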

Sort sequences

1. by similarity

When the "Sort sequences by similarity" option is chosen, the sequences are compared to the first sequence in the alignment file and are sorted according to their similarity with this sequence. The most similar sequence is placed at the top of the result graph and the least similar at the bottom. When there are multiple masters, the sequences are sorted according to their similarity with the first sequence in the alignment file.
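A minimal sketch of sorting by similarity to the first sequence. The text does not define the similarity measure, so the fraction of identical aligned positions is used here as an assumption. The names `identity` and `sort_by_similarity` are hypothetical, as is keeping the reference sequence at the top.

```python
def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which a and b are identical."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sort_by_similarity(alignment: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (name, sequence) records by similarity to the first record."""
    reference = alignment[0][1]
    rest = sorted(
        alignment[1:],
        key=lambda record: identity(reference, record[1]),
        reverse=True,  # most similar first
    )
    return [alignment[0]] + rest


alignment = [
    ("Master", "AGTTAG"),
    ("Query1", "TAGCAG"),
    ("Query2", "AGTTAA"),
    ("Query3", "AGTCAA"),
]
print([name for name, _ in sort_by_similarity(alignment)])
# ['Master', 'Query2', 'Query3', 'Query1']
```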

2. by tree

When the "Sort sequences by tree" option is chosen, the sequences are sorted based on their evolutionary relationship. If the user does not supply a tree file, the program will generate one using the phylogenetic analysis tool PAUP.

3. do not sort

When the "Do not sort" option is chosen, the sequences in the result set will appear in the same order they are in the supplied alignment file.

Options for coloring matches/mismatches (amino acids only)

1. Standard

His
Asp, Glu
Lys, Asn, Gln, Arg
Met
Ile, Leu, Val
Phe, Trp, Tyr
Cys
Ala, Gly, Ser, Thr
Pro
Gap
Other

2. Se-Al (default)

For information about the Se-Al software, click here.
Ala, Gly, Pro, Ser, Thr
His, Lys, Arg
Asp, Glu, Asn, Gln
Cys
Ile, Leu, Met, Val
Phe, Trp, Tyr
Gap
Other

3. Se-Al (polar/non-polar)

For information about the Se-Al software, click here.
Ala, Phe, Ile, Leu, Met, Pro, Val, Trp
Cys, Gly, Asn, Gln, Ser, Thr, Tyr
Asp, Glu
His, Lys, Arg
Gap
Other

4. BioEdit

For information about the BioEdit software, click here.
Ala Gly Pro Ser
Asp Glu Trp Tyr
His Lys Arg Ile
Leu Met Val Asn
Gln Thr Phe Cys
Gap Other


last modified: Thu Jan 3 10:45 2013


Questions or comments? Contact us at seq-info@lanl.gov.