HIV Databases
HIV sequence database



Gene Cutter Explanation

How Gene Cutter aligns the sequences

Gene Cutter accepts aligned or unaligned sequences and aligns them. Because it contains an internal reference sequence, Gene Cutter frequently gives a better multiple alignment than many purely computational alignment programs. (Gene Cutter uses HMMER v 2.3.2 with a training set of the full-length genome alignment.) NOTE: Misalignments at the ends of a coding region may result in a few amino acids or bases not appearing in the output.

The current version of Gene Cutter does not require the HXB2 (Accession #K03455) reference sequence to be included in the input nucleotide alignment. However, an input alignment that contains it may produce a better result.

How Gene Cutter finds the genes and proteins

Gene Cutter clips the coding regions from a nucleotide alignment and (optionally) codon-aligns the sequences. To define the boundaries of genes or domains of interest, and to codon-align the sequences, Gene Cutter uses the coordinates of the HIV reference sequence HXB2 (Accession #K03455).
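As a rough illustration of clipping by reference coordinates, the Python sketch below maps ungapped reference positions to alignment columns and slices every sequence at those columns. It is not Gene Cutter's code: the function names, the toy alignment, and the coordinates 2 to 7 are hypothetical and stand in for a real alignment and real HXB2 gene boundaries.

```python
def column_of(ref_aligned, position):
    """Return the 0-based alignment column holding the given
    1-based position of the ungapped reference sequence."""
    seen = 0
    for col, ch in enumerate(ref_aligned):
        if ch != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError("position lies beyond the end of the reference")


def clip_region(alignment, ref_name, start, end):
    """Cut the columns spanning reference positions start..end
    (inclusive) out of every sequence in the alignment."""
    ref = alignment[ref_name]
    first = column_of(ref, start)
    last = column_of(ref, end)
    return {name: seq[first:last + 1] for name, seq in alignment.items()}


# Toy alignment; "REF" plays the role of the reference sequence.
toy = {
    "REF": "AC-GTAC--GT",
    "S1":  "ACTGTACAAGT",
}
clipped = clip_region(toy, "REF", 2, 7)
print(clipped)
```

Because the boundaries are looked up in the aligned reference, insertions in other sequences (columns where the reference has a gap) are carried along inside the clipped region.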

How Gene Cutter codon-aligns

The sequences in the alignment are internally aligned to the HXB2 (Accession #K03455) reference sequence (provided by the program). This reference sequence is annotated with the correct reading frame for all genes, so the program knows where to start the translation. Gaps are inserted in groups of 3, or shifted to form groups of 3, and are placed only between codons, never in the middle of a codon. In some sequences, an insertion is compensated within a short distance by a deletion, or vice versa. Because such frameshifts may not inactivate the protein, if a compensating mutation lies within 5 amino acids of the initial frameshift, Gene Cutter shifts it so that the reading frame is left intact. Otherwise, the frameshift is marked in the output with the hash symbol (#), and the translation continues in the correct reading frame beyond that codon. Stop codons are marked by a dollar sign ($).
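The output markers can be illustrated with a small Python sketch that translates an already codon-aligned sequence. It is a simplified assumption, not Gene Cutter's algorithm: here a codon made entirely of gaps becomes "-", a codon that is only partly gapped is treated as a frameshift and becomes "#", and stop codons become "$". The frameshift compensation within 5 amino acids is not modeled, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_codon_aligned(seq):
    """Translate a codon-aligned nucleotide sequence, three columns at a time."""
    protein = []
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = seq[i:i + 3].upper()
        if codon == "---":
            protein.append("-")   # whole-codon gap
        elif "-" in codon:
            protein.append("#")   # partly gapped codon, assumed frameshift
        else:
            # Unknown codons (e.g. with ambiguity codes) fall back to "X".
            protein.append(CODON_TABLE.get(codon, "X").replace("*", "$"))
    return "".join(protein)


print(translate_codon_aligned("ATG---GA-TAA"))
```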

How Gene Cutter translates and deals with IUPAC codes

Translations use the standard 1-letter amino acid alphabet. Codons containing "-" are translated to either "-" or "#". Stop codons are represented by "$". If you request translated output and your sequences contain IUPAC (ambiguity) codes, they can be translated in three different ways:

  1. all codons containing IUPAC/IUB multistate characters become "X"
  2. codons with multistate characters in silent positions are translated to an amino acid; codons with multistate characters in a non-silent position become "X"
  3. codons with multistate characters in silent positions are translated to an amino acid; for multistate characters in a non-silent position, up to 3 possible translations are given

Note: regardless of which translation option is selected, the presence of IUPAC characters will result in a translation that cannot be read by sequence editor and analysis programs!
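The three translation options can be sketched in Python as follows. This is a hedged illustration rather than Gene Cutter's implementation: the function names are hypothetical, the "/" separator for multiple translations is an assumed output format, and falling back to "X" when more than 3 amino acids are possible is also an assumption. Gapped codons are not handled here.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G; stops written as "$".
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO.replace("*", "$"))}

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_amino_acids(codon):
    """Return the sorted set of amino acids a (possibly ambiguous) codon can encode."""
    expansions = product(*(IUPAC[base] for base in codon.upper()))
    return sorted({CODON_TABLE["".join(c)] for c in expansions})


def translate_ambiguous(codon, option):
    """Translate one codon under option 1, 2, or 3 from the list above."""
    codon = codon.upper()
    amino_acids = possible_amino_acids(codon)
    if all(base in "ACGT" for base in codon):
        return amino_acids[0]          # no ambiguity: ordinary translation
    if option == 1:
        return "X"                     # any multistate character gives X
    if len(amino_acids) == 1:
        return amino_acids[0]          # silent ambiguity (options 2 and 3)
    if option == 3 and len(amino_acids) <= 3:
        return "/".join(amino_acids)   # assumed format for multiple translations
    return "X"


for opt in (1, 2, 3):
    print(opt, translate_ambiguous("GGN", opt), translate_ambiguous("ARA", opt))
```

In this sketch "GGN" is a silent ambiguity (every expansion encodes glycine), while "ARA" is non-silent because it can encode either lysine or arginine.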

 

last modified: Wed Nov 7 11:26 2007


Questions or comments? Contact us at seq-info@lanl.gov.