ElimDupes Explanation

What is a duplicate sequence?

There are various ways of defining "duplicateness" in two sequences.

1. The strongest definition would be the case in which two sequences match exactly as in:

ACCCTGATTAGC  seq 1
ACCCTGATTAGC  seq 2

2. Slightly less strong than perfect match is the situation in which the sequences match in all respects except the case of the letters:

ACCCTGATTAGC  seq 1
aCCCtGATTaGC  seq 2

3. A third consideration is the case of gaps and other non-letter or "extraneous" characters. Are these two sequences duplicates?

ACCCTGATTAGC      seq 1
ACCCT----GATTAGC  seq 2

4. Fourth, there is the case of one sequence that matches part of another:

ACCCTGATTAGC  seq 1
    TGAT      seq 2

5. The final consideration is the similarity of sequences. In the example below, nine of the ten bases of seq 2 are covered by seq 1, so seq 1 covers 90% of seq 2.

ACCCTGATTAGC   seq 1
   CTGATTAGCT  seq 2

Tool defaults

The ElimDupes tool is configured, by default, to

The default settings produce the most liberal interpretation of what constitutes a "duplicate" or, to put it another way, produce the smallest list of unique sequences. To illustrate the results of the default settings, consider the sample input (in FASTA format) from the submission page.

>seq1
ABCDE
>seq2
CDEFG
>seq3
A--B
>seq4
abc
>seq5
DEF
>seq6
ABCDeF
>seq7
CDEFG

Under the tool default settings only two of these sequences are unique, seq2 and seq6. Note that sequences are shown uppercase and have had their gaps removed.

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq2     CDEFG         DEF(5),  CDEFG(7)
seq6     ABCDEF        ABCDE(1),  AB(3), ABC(4)
---------------------------------------------------------------------------
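
The default behavior can be approximated in a few lines of Python. This is only a sketch that reproduces the table above, not the tool's actual code. The function names normalize and elim_dupes, the keyword arguments, and the rule for deciding which unique sequence a duplicate is listed under (the first unique sequence in input order that contains it) are all assumptions made for illustration.

```python
import re

def normalize(seq, strip_extraneous=True, uppercase=True):
    """Apply the two clean-up steps to one sequence (hypothetical helper)."""
    if strip_extraneous:
        seq = re.sub(r"[^A-Za-z]", "", seq)  # drop gaps and other non-letters
    if uppercase:
        seq = seq.upper()
    return seq

def elim_dupes(records, strip_extraneous=True, uppercase=True, subsequences=True):
    """Return {unique_name: [duplicate_names]} for a list of (name, sequence).

    Assumes sequence names are distinct and sequences are non-empty.
    """
    cleaned = [(name, normalize(seq, strip_extraneous, uppercase))
               for name, seq in records]

    def duplicates(short, long):
        # With subsequences enabled, containment counts; otherwise only equality.
        return short in long if subsequences else short == long

    # Assumed rule: a sequence is unique unless a longer sequence contains it
    # or an identical sequence appears earlier in the input.
    unique = []
    for i, (name, seq) in enumerate(cleaned):
        shadowed = any(
            duplicates(seq, other) and (len(other) > len(seq) or j < i)
            for j, (_, other) in enumerate(cleaned)
            if j != i
        )
        if not shadowed:
            unique.append((name, seq))

    result = {name: [] for name, _ in unique}
    for name, seq in cleaned:
        if name in result:
            continue
        # Assumed rule: list each duplicate under the first unique sequence
        # (in input order) that covers it.
        owner = next(u_name for u_name, u_seq in unique if duplicates(seq, u_seq))
        result[owner].append(name)
    return result

SAMPLE = [("seq1", "ABCDE"), ("seq2", "CDEFG"), ("seq3", "A--B"), ("seq4", "abc"),
          ("seq5", "DEF"), ("seq6", "ABCDeF"), ("seq7", "CDEFG")]

print(elim_dupes(SAMPLE))
# {'seq2': ['seq5', 'seq7'], 'seq6': ['seq1', 'seq3', 'seq4']}
```

In this sketch, passing strip_extraneous=False, uppercase=False or subsequences=False mimics switching the corresponding option to "no".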

Tool options

The ElimDupes tool default settings can be changed by modifying the six options listed at the bottom of the submission page:

  1. Remove extraneous characters from sequences
  2. Make all letters uppercase
  3. Consider subsequences as duplicates
  4. Eliminate sequences by similarity
  5. Restore original sequences in output
  6. Analyze input by groups

Using the same list of sequences as above let us see how changing each of these options individually to "no" affects the tool's behavior.

Remove extraneous characters

Setting only the "Remove extraneous characters from sequences" option to "no" means that gaps and other non-letter characters will NOT be removed. Therefore seq3 retains its gaps and becomes a unique sequence. The result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq2     CDEFG         DEF(5),  CDEFG(7)
seq3     A--B
seq6     ABCDEF        ABCDE(1), ABC(4)
---------------------------------------------------------------------------

Make all letters uppercase

Setting "Make all letters uppercase" to "no" means that lowercase characters will be preserved. This option produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq1     ABCDE         AB(3)
seq2     CDEFG         DEF(5),  CDEFG(7)
seq4     abc
seq6     ABCDeF        
---------------------------------------------------------------------------

Consider subsequences as duplicates

Setting "Consider subsequences as duplicates" to "no" means that a shorter sequence that is contained in a longer sequence will be considered unique. This option produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq1     ABCDE        
seq2     CDEFG         CDEFG(7)
seq3     AB
seq4     ABC
seq5     DEF
seq6     ABCDEF
---------------------------------------------------------------------------

Eliminate sequences by similarity

Setting "Eliminate sequences more similar than" to 70% means that a shorter sequence will be considered a duplicate when more than 70% of its length is covered by a longer sequence. When this option is used, extraneous characters are removed and all letters are made uppercase, because the program aligns the sequences pairwise. The overlapping region must match exactly; no mismatches are tolerated. Considering subsequences as duplicates is the special case in which the similarity equals 100%. This setting produces the following result:

Name     Unique seq    Duplicate sequences and their (number)
---------------------------------------------------------------------------
seq6     ABCDEF        ABCDE(1), ABC(4), AB(3), CDEFG(2), DEF(5), CDEFG(7)
---------------------------------------------------------------------------
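
The coverage rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the names coverage and eliminate_by_similarity are hypothetical, the overlap is found by sliding one sequence along the other without gaps, and sequences are processed longest first so that each shorter sequence is compared against the unique sequences already kept. The comparison uses "greater than or equal to" so that a threshold of 100% behaves like the subsequence option; that choice is also an assumption.

```python
import re

def coverage(a, b):
    """Fraction of the shorter sequence covered by the longer one in the
    best ungapped overlap that contains no mismatches."""
    short, long = sorted((a, b), key=len)
    if not short:
        return 0.0
    best = 0
    # offset = position in `long` that lines up with short[0]
    for offset in range(-len(short) + 1, len(long)):
        start = max(offset, 0)
        end = min(offset + len(short), len(long))
        if long[start:end] == short[start - offset:end - offset]:
            best = max(best, end - start)
    return best / len(short)

def eliminate_by_similarity(records, threshold=0.70):
    """Return {unique_name: [duplicate_names]} (greedy, longest first)."""
    cleaned = [(name, re.sub(r"[^A-Za-z]", "", seq).upper())
               for name, seq in records]
    kept = []      # (name, sequence) pairs accepted as unique so far
    result = {}
    for name, seq in sorted(cleaned, key=lambda r: len(r[1]), reverse=True):
        owner = next((k_name for k_name, k_seq in kept
                      if coverage(seq, k_seq) >= threshold), None)
        if owner is None:
            kept.append((name, seq))
            result[name] = []
        else:
            result[owner].append(name)
    return result

SAMPLE = [("seq1", "ABCDE"), ("seq2", "CDEFG"), ("seq3", "A--B"), ("seq4", "abc"),
          ("seq5", "DEF"), ("seq6", "ABCDeF"), ("seq7", "CDEFG")]

print(coverage("ABCDEF", "CDEFG"))      # 0.8, so CDEFG falls under seq6 at 70%
print(eliminate_by_similarity(SAMPLE))  # only seq6 remains unique
```

The same coverage function gives 0.9 for the pair ACCCTGATTAGC and CTGATTAGCT used in definition 5 at the top of this page. The order in which duplicates are listed here differs from the table; only the grouping is meant to match.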

Restore original sequences in output

The fifth option, "Restore original sequences in output", works as follows. In order to decide if two sequences are duplicates, the ElimDupes tool, by default, changes the sequences to uppercase and removes extraneous characters like gaps. In this way the program can "see" that input sequence ABC is the same as sequence a--Bc. The resulting sequences in the downloadable output file can be presented in their new changed form or in their original form. The latter is the default, but you can override this by selecting "no" for this option.
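
The idea of comparing cleaned sequences while reporting the originals can be sketched in Python. The function names clean and unique_sequences and the restore_original flag are hypothetical, and for brevity the sketch only detects exact duplicates after cleaning; it is not the tool's code.

```python
import re

def clean(seq):
    """Comparison form: non-letters removed, letters uppercased."""
    return re.sub(r"[^A-Za-z]", "", seq).upper()

def unique_sequences(records, restore_original=True):
    """Keep the first record of each cleaned sequence.

    The cleaned form is used only to decide what is a duplicate; the
    output shows either the original text or the cleaned text.
    """
    seen = set()
    output = []
    for name, seq in records:
        key = clean(seq)
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        output.append((name, seq if restore_original else key))
    return output

records = [("s1", "a--Bc"), ("s2", "ABC")]
print(unique_sequences(records))                          # [('s1', 'a--Bc')]
print(unique_sequences(records, restore_original=False))  # [('s1', 'ABC')]
```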

Analyze input by sequence groups

The sixth option lets you analyze your input by sequence groups. Say your input consists of multiple sequences from two patients, "A1" and "A3", like this:

>A1_seq1
CDEFG
>A1_seq2
A--B
>A1_seq3
DEF
>A3_seq1
ABCDE
>A3_seq2
abc
>A3_seq3
ABCDeF
>A3_seq4
CDEFG

You want to recognize unique and duplicate sequences for the two patients separately. Thus the sequence "DEF" in A1_seq3 should appear as a duplicate of A1_seq1 (CDEFG), but not of A3_seq4 (also CDEFG), because the A3 sequence is from another patient group. You can activate this feature in the ElimDupes tool by telling the program how to distinguish the different groups by their sequence names. You do this by entering the number of characters from the beginning of each sequence name that are necessary and sufficient to distinguish the groups. This number should be entered in the text box in the sixth option on the submission form page. In the example above, two characters are enough to distinguish the two groups, A1 and A3.

If your input contained sequences from two countries, the US and France, in a style like:

B.US.97.ARES2
B.US.85.Ba-L
B.US.85.Ba_L2
B.FR.85.LW123
B.FR.85.LW124

you could distinguish the groups by using the first 4 characters. It is not necessary that the different groups be contiguous in the input. The program will sort them out.
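
The grouping step itself is simple to sketch in Python. The function name group_by_prefix is hypothetical and the sketch covers only the splitting of names into groups; duplicate detection would then be run on each group separately.

```python
from collections import defaultdict

def group_by_prefix(names, prefix_length):
    """Group sequence names by their first `prefix_length` characters.

    The groups do not need to be contiguous in the input.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:prefix_length]].append(name)
    return dict(groups)

# Interleaved on purpose to show that input order does not matter.
names = ["B.US.97.ARES2", "B.FR.85.LW123", "B.US.85.Ba-L",
         "B.FR.85.LW124", "B.US.85.Ba_L2"]

for prefix, members in group_by_prefix(names, 4).items():
    print(prefix, members)
# B.US ['B.US.97.ARES2', 'B.US.85.Ba-L', 'B.US.85.Ba_L2']
# B.FR ['B.FR.85.LW123', 'B.FR.85.LW124']
```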

last modified: Tue Jan 19 14:42 2010


Questions or comments? Contact us at seq-info@lanl.gov.