HIV sequence database

ElimDupes Explanation

What is a duplicate sequence?

There are various ways of defining "duplicateness" in two sequences.

1. The strongest definition would be the case in which two sequences match exactly as in:

ACCCTGATTAGC  seq 1
ACCCTGATTAGC  seq 2

2. Slightly less strong than perfect match is the situation in which the sequences match in all respects except the case of the letters:

ACCCTGATTAGC  seq 1
aCCCtGATTaGC  seq 2

3. A third consideration is the case of gaps and other non-letter or "extraneous" characters. Are these two sequences duplicates?

ACCCTGATTAGC      seq 1
ACCCT----GATTAGC  seq 2

4. Fourth, there is the case of one sequence that matches part of another:

ACCCTGATTAGC  seq 1
    TGAT      seq 2

5. A final consideration is the similarity of sequences. Here nine of the ten bases of seq 2 are covered by seq 1, so seq 1 covers 90% of seq 2.

ACCCTGATTAGC   seq 1
   CTGATTAGCT  seq 2

Tool defaults

The ElimDupes tool is configured, by default, to

Uppercase all letters in the set of submitted sequences. Therefore actg is a duplicate of ACTG.
Remove all non-letter characters from each sequence. ACCCT----GATTAGC is therefore a duplicate of ACCCTGATTAGC, because the gaps are removed before the sequences are compared. The sequence ACCC?----GATTAGC, however, is not a duplicate: the "?" character is removed along with the gaps, leaving ACCCGATTAGC, which lacks the "T" of ACCCTGATTAGC.
Classify as a "duplicate" any sequence that, after the two manipulations above, is identical to another sequence or is contained in its entirety within a longer one. In other words, a sequence that is contained within a longer sequence is a duplicate (100% similarity).
Restore original sequences in output files if they were changed by uppercasing and removal of extraneous characters.
Treat entire input as one group of sequences. Do not split input into groups of sequences.

The default settings produce the most liberal interpretation of what constitutes a "duplicate" or, to put it another way, the smallest list of unique sequences. To illustrate the results of the default settings, consider the sample input (in FASTA format) from the submission page.

>seq1
ABCDE
>seq2
CDEFG
>seq3
A--B
>seq4
abc
>seq5
DEF
>seq6
ABCDeF
>seq7
CDEFG

Under the tool default settings only two of these sequences are unique, seq2 and seq6. Note that sequences are shown uppercase and have had their gaps removed.

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq2     CDEFG         DEF(5),  CDEFG(7)
seq6     ABCDEF        ABCDE(1),  AB(3), ABC(4)
---------------------------------------------------------------------------
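
The default behavior can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the tool's actual implementation: the function names normalize and elim_dupes are hypothetical, sequence names are assumed to be distinct, and the rule that a duplicate is listed under the first unique sequence (in input order) that contains it is an assumption chosen because it reproduces the table above.

```python
import re

def normalize(seq):
    """Remove every non-letter character, then uppercase the rest."""
    return re.sub(r"[^A-Za-z]", "", seq).upper()

def elim_dupes(records):
    """Map each unique sequence name to the names of its duplicates.

    A sequence counts as a duplicate if, after normalization, it is
    contained in a longer sequence or is identical to an earlier one.
    """
    cleaned = [(name, normalize(seq)) for name, seq in records]

    # First pass: find the unique sequences, keeping input order.
    uniques = []
    for i, (name, seq) in enumerate(cleaned):
        is_dupe = any(
            (len(other) > len(seq) and seq in other) or (seq == other and j < i)
            for j, (_, other) in enumerate(cleaned)
            if j != i
        )
        if not is_dupe:
            uniques.append((name, seq))

    # Second pass: list each duplicate under the first unique that contains it
    # (assumed tie-breaking rule, see the text above).
    result = {name: [] for name, _ in uniques}
    for name, seq in cleaned:
        if name not in result:
            owner = next(uname for uname, useq in uniques if seq in useq)
            result[owner].append(name)
    return result

records = [
    ("seq1", "ABCDE"), ("seq2", "CDEFG"), ("seq3", "A--B"), ("seq4", "abc"),
    ("seq5", "DEF"), ("seq6", "ABCDeF"), ("seq7", "CDEFG"),
]
print(elim_dupes(records))
# {'seq2': ['seq5', 'seq7'], 'seq6': ['seq1', 'seq3', 'seq4']}
```

Note that DEF is contained in both CDEFG and ABCDEF; the table lists it under seq2, which is why the sketch assigns duplicates in input order rather than to the longest container.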

Tool options

The ElimDupes tool default settings can be changed by modifying the six options listed at the bottom of the submission page:

Remove extraneous characters from sequences
Make all letters uppercase
Consider subsequences as duplicates
Eliminate sequences by similarity
Restore original sequences in output
Analyze input by groups

Using the same list of sequences as above let us see how changing each of these options individually to "no" affects the tool's behavior.

Remove extraneous characters

Setting only the "Remove extraneous characters from sequences" option to "no" means that gaps and other non-letter characters will NOT be removed. Therefore seq3 retains its gaps and becomes a unique sequence. The result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq2     CDEFG         DEF(5),  CDEFG(7)
seq3     A--B
seq6     ABCDEF        ABCDE(1), ABC(4)
---------------------------------------------------------------------------

Make all letters uppercase

Setting "Make all letters uppercase" to "no" means that lowercase characters will be preserved. This option produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq1     ABCDE         AB(3)
seq2     CDEFG         DEF(5),  CDEFG(7)
seq4     abc
seq6     ABCDeF        
---------------------------------------------------------------------------
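
The effect of the two preprocessing options described so far can be illustrated with a small Python sketch. The function prepare and its two flags are hypothetical names introduced for illustration only; they are not part of the tool.

```python
import re

def prepare(seq, remove_extraneous=True, uppercase=True):
    """Apply the two preprocessing steps; each can be switched off."""
    if remove_extraneous:
        seq = re.sub(r"[^A-Za-z]", "", seq)  # drop gaps and other non-letters
    if uppercase:
        seq = seq.upper()
    return seq

seq6 = "ABCDeF"  # the longest sequence in the sample input

# Defaults: seq3 (A--B) and seq4 (abc) are both contained in seq6.
print(prepare("A--B") in prepare(seq6))  # True
print(prepare("abc") in prepare(seq6))   # True

# Gaps kept: A--B no longer matches any part of seq6, so seq3 becomes unique.
print(prepare("A--B", remove_extraneous=False)
      in prepare(seq6, remove_extraneous=False))  # False

# Case kept: neither abc nor ABCDE is contained in ABCDeF.
print(prepare("abc", uppercase=False) in prepare(seq6, uppercase=False))    # False
print(prepare("ABCDE", uppercase=False) in prepare(seq6, uppercase=False))  # False
```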

Consider subsequences as duplicates

Setting "Consider subsequences as duplicates" to "no" means that a shorter sequence that is contained by a larger sequence will be considered unique. This option produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq1     ABCDE        
seq2     CDEFG         CDEFG(7)
seq3     AB
seq4     ABC
seq5     DEF
seq6     ABCDEF
---------------------------------------------------------------------------

Eliminate sequences by similarity

Setting "Eliminate sequences more similar than" to 70% means that a shorter sequence is considered a duplicate when more than 70% of its length is covered by a longer sequence. When this option is used, extraneous characters are removed and all letters are made uppercase, because the program aligns each pair of sequences. The aligned sequences must match exactly; no mismatch is tolerated. "Consider subsequences as duplicates" is the special case in which the similarity equals 100%. This setting produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq6     ABCDEF        ABCDE(1), ABC(4), AB(3), CDEFG(2), DEF(5), CDEFG(7)
---------------------------------------------------------------------------
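
The coverage idea can be sketched in Python as follows. This is an illustration under assumptions, not the tool's algorithm: the names exact_overlap, coverage and elim_by_similarity are hypothetical, the "alignment" is modeled as an ungapped overlap with no mismatches, sequences are compared longest first, and the threshold test is inclusive (>=) so that 100% reduces to the subsequence rule. Whether the tool treats the boundary value as inclusive is not stated.

```python
import re

def exact_overlap(a, b):
    """Length of the longest ungapped overlap of a and b with no mismatch."""
    best = 0
    # shift is the position in a that lines up with b[0]; it may be negative.
    for shift in range(-(len(b) - 1), len(a)):
        start_a, start_b = max(shift, 0), max(-shift, 0)
        n = min(len(a) - start_a, len(b) - start_b)
        if n > best and a[start_a:start_a + n] == b[start_b:start_b + n]:
            best = n
    return best

def coverage(a, b):
    """Fraction of the shorter sequence covered by the longer one."""
    shorter = min(len(a), len(b))
    return exact_overlap(a, b) / shorter if shorter else 0.0

def elim_by_similarity(records, threshold=0.70):
    """Greedy sketch: longest sequences first, duplicates go to the first match."""
    cleaned = [(n, re.sub(r"[^A-Za-z]", "", s).upper()) for n, s in records]
    ordered = sorted(cleaned, key=lambda r: -len(r[1]))  # stable for equal lengths
    uniques = []  # (name, sequence, duplicate names)
    for name, seq in ordered:
        for _, useq, dupes in uniques:
            if coverage(seq, useq) >= threshold:
                dupes.append(name)
                break
        else:
            uniques.append((name, seq, []))
    return {uname: dupes for uname, _, dupes in uniques}

records = [
    ("seq1", "ABCDE"), ("seq2", "CDEFG"), ("seq3", "A--B"), ("seq4", "abc"),
    ("seq5", "DEF"), ("seq6", "ABCDeF"), ("seq7", "CDEFG"),
]
print(coverage("ACCCTGATTAGC", "CTGATTAGCT"))  # 0.9, as in definition 5
print(elim_by_similarity(records, 0.70))       # only seq6 remains unique
```

With the sample input, CDEFG shares the four bases CDEF with ABCDEF, which is 80% of its length, so at a 70% threshold it is absorbed by seq6 along with everything else.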

Restore original sequences in output

The "Restore original sequences in output" option works as follows. In order to decide if two sequences are duplicates, the ElimDupes tool, by default, changes the sequences to uppercase and removes extraneous characters like gaps. In this way the program can "see" that input sequence ABC is the same as sequence a--Bc. The resulting sequences in the downloadable output file can be presented in their new changed form or in their original form. The latter is the default, but you can override this by selecting "no" for this option.
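
One way to picture this is to compare sequences on a normalized key while carrying the original text along. The sketch below is a simplified illustration that only handles exact duplicates; the function names normalize and unique_exact and the restore_original parameter are hypothetical, not the tool's own.

```python
import re

def normalize(seq):
    """Remove non-letter characters and uppercase the rest."""
    return re.sub(r"[^A-Za-z]", "", seq).upper()

def unique_exact(records, restore_original=True):
    """Keep the first of each set of identical (normalized) sequences."""
    seen = set()
    kept = []
    for name, seq in records:
        key = normalize(seq)  # the comparison always uses the normalized form
        if key in seen:
            continue
        seen.add(key)
        # Output either the sequence as submitted or its normalized form.
        kept.append((name, seq if restore_original else key))
    return kept

records = [("x", "a--Bc"), ("y", "ABC"), ("z", "a-d")]
print(unique_exact(records))                          # [('x', 'a--Bc'), ('z', 'a-d')]
print(unique_exact(records, restore_original=False))  # [('x', 'ABC'), ('z', 'AD')]
```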

Analyze input by sequence groups

The "Analyze input by groups" option lets you analyze your input by sequence groups. Say your input consists of multiple sequences from two patients, "A1" and "A3", like this:

>A1_seq1
CDEFG
>A1_seq2
A--B
>A1_seq3
DEF
>A3_seq1
ABCDE
>A3_seq2
abc
>A3_seq3
ABCDeF
>A3_seq4
CDEFG

You want to recognize unique and duplicate sequences for the two patients separately. Thus the sequence "DEF" (A1_seq3) should appear as a duplicate of A1_seq1, but not of any A3 sequence, because the A3 sequences belong to another patient group. You can activate this feature in the ElimDupes tool by telling the program how to distinguish the different groups by their sequence names. You do this by entering the number of characters from the beginning of each sequence name that are necessary and sufficient to distinguish the groups. This number should be entered in the text box for this option on the submission form page. In the example above, two characters are enough to distinguish the two groups, A1 and A3.

If your input contained sequences from two countries, the US and France, named in a style like:

B.US.97.ARES2
B.US.85.Ba-L
B.US.85.Ba_L2
B.FR.85.LW123
B.FR.85.LW124

you could distinguish the groups by using the first 4 characters. It is not necessary that the different groups be contiguous in the input. The program will sort them out.
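
Splitting the input by a name prefix can be sketched in Python as below. The function name split_by_prefix and its prefix_len parameter are hypothetical; the sketch only shows the grouping step, after which each group would be checked for duplicates on its own.

```python
from collections import defaultdict

def split_by_prefix(records, prefix_len):
    """Group (name, sequence) pairs by the first prefix_len characters of the name."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[name[:prefix_len]].append((name, seq))
    return dict(groups)

# Groups do not have to be contiguous in the input.
records = [
    ("A1_seq1", "CDEFG"), ("A3_seq1", "ABCDE"), ("A1_seq2", "A--B"),
    ("A3_seq2", "abc"), ("A1_seq3", "DEF"),
]
for prefix, members in split_by_prefix(records, 2).items():
    print(prefix, [name for name, _ in members])
# A1 ['A1_seq1', 'A1_seq2', 'A1_seq3']
# A3 ['A3_seq1', 'A3_seq2']
```

For the country example, a prefix length of 4 would yield the groups "B.US" and "B.FR".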

last modified: Tue Jan 19 14:42 2010
