######################################################################
RepeatMasker, Arian Smit 03/12/96, most recent change 07/13/2002       
Please refer to:  Smit, AFA & Green, P "RepeatMasker" at               
http://repeatmasker.genome.washington.edu                              
                                                                       
The interspersed repeat databases are modified versions of those found 
in "RepBase Update" (http://www.girinst.org/~server/repbase.html)
######################################################################


RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). Sequence comparisons in RepeatMasker are performed by
the program cross_match, an efficient implementation of the
Smith-Waterman-Gotoh algorithm developed by Phil Green, or,
optionally, by WU-Blast developed by Warren Gish.


This helpfile discusses the following topics:

0       Basic input and output

1       Options
1.1     Species and contamination check options
1.2     Options affecting which repeats get masked
1.3     Speed, engine and search parameters
1.4     Output and formatting
1.5     ProcessRepeats options
1.6     WU-blast search-engine option

2       Methodology and quality of output
2.1     Methodology
2.2     Scoring matrices
2.3     Databases
2.4     Sensitivity and speed
2.5     Selectivity and matches to coding sequences
2.6     Low complexity DNA and simple repeats

3       How to read the results
3.1     The annotation (.out) file
3.2     Alignments
3.3     The summary (.tbl) file

4       Applications
4.1     Use in database searches
4.2     Identification of DNA source and bacterial insertions
4.3     Use with gene prediction programs and other applications

5       References


0  INPUT and OUTPUT

Input format:

Sequences have to be in the 'fasta format':

>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA

The program accepts most common erroneous 'almost fasta format' and
raw sequence files, but does not yet work with other formats (GenBank,
Staden, etc.).
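
To make the format concrete, here is a minimal Python sketch of a fasta
reader. It is not part of RepeatMasker; the function names parse_fasta
and _finish are hypothetical, and the sketch handles only well-formed
fasta text (it does not try to repair 'almost fasta' or raw sequence
files the way the program does).

```python
def parse_fasta(text):
    """Return a list of (name, description, sequence) tuples from fasta text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


def _finish(header, chunks):
    # The first word of the header is the sequence name; the rest is free text.
    name, _, description = header.partition(" ")
    return name, description, "".join(chunks)
```

Sequence lines belonging to one entry are simply concatenated, so it
does not matter how the sequence is wrapped in the file.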

You can use RepeatMasker on a file containing multiple fasta format
sequences and on multiple sequence files at the same time:

RepeatMasker *.fasta

This command will mask all files that end with .fasta in the current
directory and give separate reports for each file. Note that if you
have multiple small sequences it is considerably faster to run
RepeatMasker on one batch file than on many single sequence files. The
summary file will be more informative as well. However, analysis on
single files (when larger than 2 kb each) can be slightly more
accurate, since GC levels for each sequence will be calculated and
used to choose appropriate parameters.


Standard output:

RepeatMasker returns a .masked file containing the query sequence(s)
with all identified repeats and low complexity sequences masked. These
masked sequences are listed and annotated in the .out file. The masked
sequences are returned in the same order as they are in the submitted
file, whereas the sequences are presented alphabetically in the
annotation table. The .tbl file is a summary of the repeat content of
the analyzed sequence.


1 OPTIONS

1.1 Species options

-m(us)         masks rodent specific and mammalian wide repeats
-rod(ent)      same as -mus
-cow           masks artiodactyl, whale, and mammalian wide repeats
-pig, -cet(acea), -art(iodactyl)  same as -cow
-car(nivore)   masks carnivore-specific and mammalian wide repeats
-cat -dog      same as -car
-mam(mal)      masks repeats found in mammals not mentioned above
-ch(icken)     masks repeats found in chicken and related birds
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila)  masks repeats found in Drosophilas
-el(egans)     masks repeats found in C. elegans
-fugu          masks repeats found in Takifugu rubripes
-lib [filename] allows usage of a custom library (e.g. from another species)

contamination checking options
-is_only       only clips E coli insertion elements out of fasta and .qual files
-is_clip       clips IS elements before analysis (default: IS only reported)
-no_is         skips bacterial insertion element check
-rodspec       only checks for rodent specific repeats (no repeatmasker run)
-primspec      only checks for primate specific repeats (no repeatmasker run)

For detailed explanation of the contamination detection options, see
"4.2 Identification of DNA source" below.


The default settings of RepeatMasker are for masking a primate (human)
sequence. 

Interspersed repeats are specific to a (group of) species, dependent
on the time of activity of the source transposable element. Less than
half of the repeats identified in human DNA are specific to primates,
i.e. over half amplified before the eutherian radiation some 100
million years ago. Most repeats that can be identified in mouse DNA
are specific to rodents though, due to higher activity and faster
mutation rates in the rodent lineage.  RepeatMasker has separate
protocols optimized for analysis of genomes of different mammalian orders.

These are the numbers of repeat consensus sequences, and their total
length in bp, in the May 5 2002 databases:

species         # of consensi  total bp
mammalian-wide       421         459641
primate-specific     324         550887
rodent-specific      150         231345
cetartiodactyl-spec   44          23658
carnivore-specific    20          22694
spec to other mammals 11          17998
chicken (birds)       34          39512
Xenopus               27          26515
pufferfish            94         169989
zebrafish             49          59425
other vertebrates     42          38699
Drosophila           184         524645
C. elegans           122         161140
Arabidopsis          435        1419710
maize, rice           71         246544

Mammalian sequences are compared to both order specific and
mammalian-wide repeats (transpositional activity predates the
mammalian radiation). One can see that the majority of sequences
against which other mammals are compared are repeats that have been
identified in the human genome but are thought to predate the
mammalian radiation.

Six libraries are extracted from the 'RepBase Update' fasta libraries
with very limited curation. The large C. elegans, Arabidopsis, and
Drosophila libraries have been built primarily by the people at the
Genetic Information Research Institute (GIRI). The Xenopus, danio
(zebrafish), other vertebrate (a rather useless mixture) and grasses
(maize and rice) libraries are still fetal. The latter smaller
libraries are accessed with the -lib option. In 2001/2002 I created
the cetartiodactyl, carnivore, chicken and pufferfish libraries, and
significantly extended the rodent databases.

RepBase Update contains repeats for many other species. These are not
included here, either because interspersed repeats are not an
analytical problem for these species (e.g. prokaryotes, yeast) and/or
the number of repeats is impractically small (i.e. you don't need
RepeatMasker to compare your query to one repeat sequence).  RepBase
Update is maintained by GIRI and me at
(http://www.girinst.org/~server/repbase.html).


-lib 
With the -lib option you can specify a custom library of sequences to
be masked in the query. The library file needs to contain sequences in
fasta format. Unless a full path is given on the command line,
such a file should be in the current directory or in the
.../RepeatMaskerxx/Libraries directory with the other library files.
I've provided libraries for some vertebrate (vertebrate.lib, danio.lib
xenopus.lib), and grasses (grasses.lib) repeats, which are not yet
fully integrated and have to be accessed by using the -lib option.

'RepeatMasker -lib xenopus.lib bigfrog.seq'
will mask all sequences similar to repeats in the Xenopus database
as well as all low complexity and simple repetitive DNA in
"bigfrog.seq". 

I recommend formatting your own repeat library like a RepeatMasker .lib
file (the file name does not need to end with .lib), like this:
>repeatname#class/subclass
or simply
>repeatname#class

In that format, the data will be processed (overlapping repeats are
merged etc), alternative output (.ace or .gff) can be created and an
overview .tbl file will be created. Classes that will be displayed in
the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA','Satellite', anything
with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the
latter defaults when class is missing). Subclasses are plentiful, but
are not all tabulated in the .tbl file. Check the accompanying
.lib files for names that can be parsed into the .tbl file.
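
The header convention above can be illustrated with a short Python
sketch that splits a library header into name, class and subclass. This
is not RepeatMasker code: the function name parse_lib_header and the
example repeat names are hypothetical, and returning None for a missing
subclass is an assumption of the sketch.

```python
def parse_lib_header(header):
    """Split '>repeatname#class/subclass' into (name, class, subclass)."""
    # Keep only the ID, ignoring any free text after the first space.
    text = header.lstrip(">").split()[0]
    name, _, classification = text.partition("#")
    repeat_class, _, subclass = classification.partition("/")
    # Per the text above, a missing class defaults to 'Unknown'.
    return name, repeat_class or "Unknown", subclass or None
```

For example, parse_lib_header(">MyRepeat#LINE") gives the class 'LINE'
with no subclass, while a bare ">MyRepeat" falls into 'Unknown'.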


-no_is, -is_clip, -is_only, -primspec, -rodspec 

See "4.2 Identification of DNA source and bacterial insertions" below.


1.2 Masking options (options that determine what kind of repeats are masked)

-cutoff [number] sets cutoff score for masking repeats when using -lib
               (default cutoff 225)
-nolow         does not mask low_complexity DNA or simple repeats
-l(ow)         same as -nolow (historical)
-(no)int       only masks low complexity/simple repeats (no interspersed repeats)
-alu           only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number]  masks only those repeats that are less than [number] percent
               diverged from the consensus sequence

-cutoff
When using a local library you may want to change the minimum score
for reporting a match. The default is 225. Lowering it below 200 will
usually start to give you significant numbers of false matches, whereas
raising it to 250 will guarantee that all matches are real. Note that
low complexity regions in otherwise complex repeat sequences in your
library are most likely to give false matches.


-nolow / -l(ow)
With the option -nolow or -l(ow) only interspersed repeats are
masked. By default simple tandem repeats and low complexity
(polypurine, AT-rich) regions are masked besides the interspersed
repeats. For database searches the default setting is recommended, but
sometimes, e.g. when using the masked sequence to predict the presence
of exons, it may be better to skip the low complexity masking.


-noint / -int
When using the -noint or -int option only low complexity DNA and
simple repeats will be masked in the query sequence ("minus
interspersed repeats").
Since the 03032000 release, A-rich simple repeats derived from the
poly A tails of SINEs and LINEs are merged with the annotation of the
SINE or LINE (i.e. you can't tell there is a simple repeat). Thus, if
you're interested in finding the location of potentially polymorphic
simple repeats, this option is recommended.


-alu
-div
You can limit the masking and annotation to (primate) Alu repeats with
the -alu option and to a subset of less diverged (younger) repeats
with the option -div. For example,
"RepeatMasker -div 20 -mus mysequence"
will mask only those rodent repeats and simple repeats that are less
than 20% diverged from the consensus sequence and
"RepeatMasker -div 10 -alu mysequence"
will mask Alus that are less than 10% diverged from the Alu consensus
sequences and no other repeats.

The -div option may be used to limit the masking to those repeats that
are either specific to primates or another mammalian order for use in
subsequent comparison of orthologous mammalian loci. On average,
interspersed repeats have diverged 16% in human (~28% in mouse) from
their consensus since the mammalian orders separated (average
substitution levels are 18% and 35%, respectively).  Note that this
method is rather crude, mostly since the range of deterioration of
repeats of the same age is wide; some order specific repeats may
remain unmasked and shared repeats may be masked.


1.3   Options affecting speed and search parameters

-q             quick search; 5-10% less sensitive, 3-4 times faster than default
-qq            rush job; about 10% less sensitive,
-s             slow search; 0-5% more sensitive, 2.5 times slower than default
-pa(rallel) [number]  number of processors to use in parallel (only works for
               batch files or sequences larger than 50 kb)
-w(ublast)     use WU-blast, rather than cross_match as engine (see 1.6)

-frag [number] maximum sequence length masked without fragmenting (default 51000)
-maxsize [nr]  maximum length for which IS- or repeat clipped sequences can be produced 
               (default 4000000). Memory requirements go up with higher maxsize.
-gc [number]   use matrices calculated for 'number' percentage background GC level
-gccalc        program calculates the GC content even for batch files/small seqs
-nocut         skips the steps in which repeats are excised
-noisy         prints cross_match progress report to screen (defaults to .stderr file)


-s -q -qq
RepeatMasker can be run at four different sensitivity/speed levels,
with the option -q providing quick (less sensitive) and -s slow
(sensitive) results compared to default. The option -qq has been added
for when you're in a frightful hurry. Each higher gear is about 2-3
times faster, and 90% as sensitive as the next lower gear. See "2.4
Sensitivity and speed" below for details.


-pa(rallel)
For sequences over 50 kb long or files with multiple sequences,
RepeatMasker can use multiple processors. When you type:

RepeatMasker -par 10 <file>

A batch file of sequences will run with up to 10 sequences at a
time, until all sequences are done, while a file with one large
sequence will analyze the sequence in up to 10 fragments at the same
time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all
sequences over 100 kb are divided in 33-66 kb fragments). For the
batch files no minimum size exists. Thus, 

If <file> contains:    RM runs in parallel:
one 60 kb seq          two 30 kb fragments
one 400 kb seq         ten 40 kb fragments
one 1 Mb seq           ten 50 kb fragments, twice
ten 500 bp sequences   ten 500 bp sequences
two 500 kb sequences   ten 50 kb fragments, twice

Processing of the detected matches takes place after all batches or
fragments have been cross matched with the databases.
Beware that, generally, you have a limited number of process IDs
allotted. RepeatMasker uses 4 PIDs for each parallel job, so if you're
allotted 64 user PIDs, you can 'only' run 16 fragments/batches in
parallel.
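
The arithmetic in this section can be written out as a small Python
sketch. It only restates the numbers given above (4 PIDs per parallel
job, jobs run in rounds of at most -par at a time); the function names
are hypothetical and nothing here is taken from the RepeatMasker script.

```python
import math

PIDS_PER_JOB = 4  # stated in the text above


def max_parallel_jobs(user_pids):
    """Largest number of fragments/batches that fit in the PID allotment."""
    return user_pids // PIDS_PER_JOB


def rounds_needed(n_jobs, par):
    """Number of consecutive rounds when at most 'par' jobs run at once."""
    return math.ceil(n_jobs / par)
```

With 64 user PIDs, max_parallel_jobs(64) gives 16, and the twenty 50 kb
fragments of a 1 Mb sequence run in rounds_needed(20, 10) = 2 rounds
with -par 10, matching the table above.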


-frag 
Even when the -par option is not used, RepeatMasker transparently
divides sequences over 51 kb into fragments of equal size with 1 kb
overlaps. Similarly, sequence batches containing more than 51 kb are
subdivided into batches of 51 kb or less. The -frag option sets the
maximum fragment and batch size.
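
One way to picture equal-sized fragments with a fixed overlap is the
following Python sketch. It is an illustration under stated assumptions,
not the arithmetic RepeatMasker itself uses: the function name
fragment_ranges is hypothetical, coordinates are 0-based and half-open,
and the number of fragments is taken to be the smallest one for which
every fragment fits within the maximum size.

```python
import math


def fragment_ranges(seq_len, max_frag=51000, overlap=1000):
    """Return 0-based half-open (start, end) ranges covering seq_len."""
    if seq_len <= max_frag:
        return [(0, seq_len)]
    # Each fragment advances by 'step' and then extends 'overlap' further,
    # so the step itself may be at most max_frag - overlap.
    step_max = max_frag - overlap
    n = math.ceil((seq_len - overlap) / step_max)
    step = math.ceil((seq_len - overlap) / n)
    ranges = []
    for i in range(n):
        start = i * step
        end = min(start + step + overlap, seq_len)
        ranges.append((start, end))
    return ranges
```

Under these assumptions a 60 kb sequence is split into two fragments of
30.5 kb that share 1 kb, and a sequence of 51 kb or less is left whole.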

The only visible effect of the fragmentation is in the alignment
files, where alignments at the edges of the fragments can be
duplicated and/or truncated.  The 1 kb overlap between fragments
almost guarantees that there is no loss in sensitivity at the
edges. Fragmentation initially was implemented to allow the size of
sequences and sequence batches to be unlimited. Cross_match can be
very memory intensive when SW alignments have to be performed in large
matrices. This may happen with short minmatch and large bandwidth
settings. Note that RepeatMasker should not croak when cross_match
runs out of memory; it will redo the failed search with a higher word
length or smaller bandwidth until it succeeds. However, this will lead
to gradually less sensitive comparisons.

Fragmentation also can improve repeat detection when a genomic
sequence contains large regions of DNA with significantly different GC
levels (isochores), since sets of scoring matrices are chosen based on
the GC level of a fragment.

Since April 2002 the maximum fragment size is hardwired to be half of
"maxsize" (see below).


-maxsize
To limit the memory requirements of the script an upper boundary to the
amount of sequence stored in a single array in the script is set to 4
million bp. This parameter can be reduced with the -maxsize option to
a minimum of 500000, for severely memory impaired computers.

The size of maxsize further determines the largest length single
sequence from which E. coli insertion sequences and full-length
repeats can be clipped. Increase the size of maxsize to allow removal
of IS elements from larger sequences, like:
RepeatMasker -is_clip -maxsize 9999999999 muntjakchromosome1


-gc
-gccalc
Neutral mutation patterns differ significantly depending on the GC
richness of a locus and we have calculated optimal scoring matrices
for the alignment to consensus sequences in a range of background GC
levels (see 2.2). Usually, RepeatMasker calculates the percentage of
the sequence consisting of Gs and Cs and uses the appropriate
matrices.  However, the program defaults to using 'average' 43% GC
matrices when the query is shorter than 2000 bp or a batch file is
analyzed. Short sequences are less likely to share the GC level of the
locus. For example, CpG islands and exons are more GC rich than the
surrounding DNA, whereas a LINE1 element usually is more AT rich than
the background. In a batch file, RepeatMasker analyses all sequences
together with the same matrices. The percentage GC in all the
sequences combined may be inappropriate for some sequence entries;
using high GC level matrices in AT rich sequences (and vice versa) may
result in false masking.

One can override this behavior in two ways:
With the option -gc you can set the GC level to a certain percentage:
RepeatMasker -gc 37 mybatchofsequences.fa
lets the program use matrices appropriate for 37% GC background. The
batch could, for example, contain ESTs from a single locus with a
known GC level.  Alternatively, the -gccalc option forces RepeatMasker
to use the actual GC level of a short sequence or the average GC level
of a batch of sequences. The latter sequences, for example, may be
contigs in a sequencing project.
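
The decision described above can be summarized in a short Python sketch.
It is a paraphrase of the text, not RepeatMasker code: the function
names are hypothetical, and counting only A, C, G and T when computing
the GC percentage is an assumption of the sketch.

```python
DEFAULT_GC = 43        # 'average' matrices, per the text above
MIN_LEN_FOR_GC = 2000  # shorter queries fall back to the default


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else float(DEFAULT_GC)


def choose_gc_level(sequences, gc=None, gccalc=False):
    """sequences: list of strings; more than one entry means a batch file."""
    if gc is not None:  # -gc [number] overrides everything else
        return float(gc)
    is_batch = len(sequences) > 1
    is_short = sum(len(s) for s in sequences) < MIN_LEN_FOR_GC
    if (is_batch or is_short) and not gccalc:
        return float(DEFAULT_GC)
    # Single long sequence, or -gccalc: use the actual (pooled) GC level.
    return gc_percent("".join(sequences))
```

The returned percentage would then be used to pick the matrix set
closest to that background GC level (see 2.2).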


-nocut
The option -nocut skips a step in the default procedure for human and
rodent queries, in which full-length younger inserts are spliced out of
the query to reconstruct a preinsertion situation. RepeatMasker is
generally more sensitive including the deletion step as it can unearth
older repeats that were interrupted by these younger elements. 


1.4  Output options

-a      shows the alignments in a .align output file; -ali(gnments) also works
-inv    alignments are presented in the orientation of the repeat (with option -a)

-cut    saves a sequence (in file.cut) from which full-length repeats are excised
-small  returns complete .masked sequence in lower case
-xsmall returns repetitive regions in lowercase (rest capitals) rather than masked
-x      returns repetitive regions masked with Xs rather than Ns

-poly   reports simple repeats that may be polymorphic (in file.poly)
-ace    creates an additional output file in ACeDB format
-gff    creates an additional output file in General Feature Format (GFF)
-u      creates an untouched annotation file besides the manipulated file
-xm     creates an additional output file in cross_match format (for parsing)

-fixed  creates an (old style) annotation file with fixed width columns
-no_id  leaves out final column with unique ID for each element
-e(xcln) calculates repeat densities (in .tbl) excluding runs of >25 Ns in query

-noisy  prints cross_match progress report to screen (defaults to .stderr file)


-a / -ali(gnments) 
-inv
Alignments are saved in a .align file when using the option -a. They
are shown in the orientation of the query sequence, unless you use the
option -inv as well, which will return alignments in the orientation
of the repeats (see 3.2 Alignments).


-cut
The option -cut lets the program save a file "file.cut" which contains
an intermediate sequence in the masking process. In this sequence all
full-length elements, young LINE1 3' ends, and close to perfect simple
repeats are deleted. 
*This option currently only works with mammalian queries.*
Because of programming complications, no .cut file is saved if a
single sequence is larger than the 'maxsize' parameter, by default
set to 4 Mbp. The parameter can be changed with the option -maxsize.

Another option will grow out of this that returns a sequence in which
only order specific repeats are deleted, allowing superior alignments
of mammalian orthologous sequences.


-x
When -x is used the repeat sequences are replaced by Xs instead of
Ns. This allows one to distinguish the masked areas from
possibly existing ambiguous bases in the original sequence. However,
when running BLAST searches (and maybe other programs) Xs are deleted
out of the query and the returned BLAST matches will have position
numbers not necessarily corresponding to that of the original
sequence.

-xsmall
When the option -xsmall is used a sequence is returned in the .masked
file in which repeat regions are in lower case and non-repetitive
regions are in capitals.
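
The three masking styles (default Ns, -x and -xsmall) differ only in
what is written at the repeat positions. The following Python sketch
shows the idea on a toy sequence; the function name mask_sequence and
the use of 1-based inclusive coordinates are assumptions of the sketch,
not a description of the RepeatMasker internals.

```python
def mask_sequence(seq, repeats, mode="N"):
    """Mask 1-based inclusive (start, end) ranges; mode is 'N', 'X' or 'xsmall'."""
    if mode not in ("N", "X", "xsmall"):
        raise ValueError("mode must be 'N', 'X' or 'xsmall'")
    # For xsmall, non-repetitive sequence is returned in capitals.
    chars = list(seq.upper() if mode == "xsmall" else seq)
    for start, end in repeats:
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = chars[i].lower() if mode == "xsmall" else mode
    return "".join(chars)
```

For the sequence ACGTACGTAC with a repeat at positions 3-5, the three
modes return ACNNNCGTAC, ACXXXCGTAC and ACgtaCGTAC, respectively.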


-poly
You can get a list of potentially polymorphic microsatellites with the
option -poly. This is simply a subset of the list in .out, with
dimeric to tetrameric repeats less than 10% diverged from perfection.


-xm
When using the -xm option an additional output file (.out.xm) is
created that contains the same information as the .out file (excluding
the low-complexity/simple DNA), but then in the original cross_match
format. This output is harder to read but there are programs that
require the exact cross_match output format.


-u
The script ProcessRepeats adjusts the original RepeatMasker output so
that the annotation more closely reflects reality. With the option -u
a .ori.out file is created that contains the original (but sorted)
cross_match summary lines.


-ace
With the -ace option an .ace file is created by the script. This is
merely a suggestion. The columns in the table currently are:

Motif_homol <repeat-name> RepeatMasker(method) <percent divergence>
<start in query> <end in query> <orientation> <start in consensus> <end in consensus>


-gff
A .gff file is created by the script with the annotation in 'General
Feature Format'. See http://www.sanger.ac.uk/Software/GFF for
details. The current output follows a Sanger convention:

<seqname> RepeatMasker Similarity <start in query> <end in query>
<percent divergence> <orientation> . Target "Motif:<repeat-name>" <start in
consensus> <end in consensus>

In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match
is against an Alu. I don't know why.
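
As an illustration of the line layout above, this Python sketch
assembles one such line from its fields. The function name gff_line is
hypothetical; separating the fields with tabs and printing the
divergence with one decimal are assumptions of the sketch and are not
stated in this helpfile.

```python
def gff_line(seqname, start, end, divergence, orientation,
             repeat_name, cons_start, cons_end, source="RepeatMasker"):
    """Build one annotation line in the field order described above."""
    fields = [
        seqname,
        source,                 # 'RepeatMasker' (or 'RepeatMasker_SINE')
        "Similarity",
        str(start),             # start in query
        str(end),               # end in query
        f"{divergence:.1f}",    # percent divergence
        orientation,            # '+' or '-'
        ".",
        f'Target "Motif:{repeat_name}" {cons_start} {cons_end}',
    ]
    return "\t".join(fields)
```

Passing source="RepeatMasker_SINE" reproduces the variant mentioned
above for matches against an Alu.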


-fixed
Since April 1999 the column widths in the annotation table are
adjusted to the maximum length of any string occurring in a column;
this allows long sequence names to be spelled out completely.
Previously, a fixed column width table was returned, which can still
be obtained by using the -fixed option. Parsing should not be affected
by this change of default behavior, as the same number of columns with
the same formatted text are still separated by white-space.


-no_id
Since September 2000 a column displaying a unique number (ID) for each
integrated element is printed by default. This used to be optional
(-id). Fragments of a single element, separated from each other by
subsequent insertions of other elements, deletions or recombinations,
carry the same number. This feature allows better interpretation of
the data and should greatly help proper graphical display of the
repeats.  

The column follows all other columns, except for the (rare) indication
that an annotation overlaps another annotation (*). This change, which
was announced in the previous release, should not hinder most parsing
scripts. If it causes problems, the old format can be retrieved with
the option -no_id.


-excln
The percentages displayed in the .tbl file are calculated using a
total sequence length excluding runs of 25 Ns or more. This is useful
when analyzing draft sequences that are often concatenated contigs
separated by (sometimes very) long stretches of Ns.  This option can
be used with ProcessRepeats as well. The number of Ns in long runs in
the query is apparent in the .tbl file, and you only need to run
ProcessRepeats with the option on the .cat file.
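
The effect of -excln on the percentages can be sketched in Python as
follows. This is an illustration only: the function names are
hypothetical, and the threshold of "25 Ns or more" follows the
description in this section (the option summary in 1.4 says ">25", so
the exact boundary is an assumption).

```python
import re


def effective_length(seq, min_run=25):
    """Sequence length after discarding runs of min_run or more Ns."""
    long_runs = re.finditer(r"[Nn]{%d,}" % min_run, seq)
    return len(seq) - sum(m.end() - m.start() for m in long_runs)


def repeat_density(masked_bp, seq, exclude_n_runs=False):
    """Percentage of the sequence that is masked, optionally as with -excln."""
    total = effective_length(seq) if exclude_n_runs else len(seq)
    return 100.0 * masked_bp / total if total else 0.0
```

Short runs of Ns (ambiguous bases) still count towards the total
length; only the long runs that typically separate contigs are left out.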


-noisy
RepeatMasker used to print the voluminous cross_match progress reports
to the screen. Since the Dec 1998 version this output is stored in a
.stderr file and a more informative much smaller progress report is
printed to the screen. The option -noisy allows one to see the
cross-match reports coming by on the screen (yeah).


1.5  ProcessRepeats options

When you have already run RepeatMasker and want to recreate the .out
or .tbl file, you only need to rerun ProcessRepeats on the .cat
file(s), which will take just a small fraction of the time required to
rerun RepeatMasker. Such a situation can occur when you've accidentally
deleted the .out or .tbl file or want additional or differently
formatted output files. Note that alignment files cannot be created
unless RepeatMasker was run with the -a option and that the original
.tbl and .out file will be overwritten unless you rename them.

ProcessRepeats -mus -nolow -gff -excln myhumongousmousesequence.cat 

Repeat matches are processed differently for rodent and primate
queries, so the -mus option is necessary.  With the -nolow option, the
.out file will not contain information on simple repeats and low
complexity DNA anymore. The -gff option creates an additional output
file in GFF format, and the -excln option displays the density of
repeats in the .tbl file as a percentage of those bp that are not
contained in long stretches of Ns.


The options available for ProcessRepeats are:

       default settings are for handling a human sequence .cat file
-mus   adjusts the processing and .tbl file for rodent repeats
-cow   adjusts processing for cetartiodactyl repeats
-car(nivore)   adjusts processing for carnivore repeats
-ar(abidopsis) adjusts the .tbl file for Arabidopsis thaliana repeats
-dr(osophila)  adjusts the .tbl file for Drosophila repeats
-el(egans) adjusts the .tbl file for nematode repeats
-lib   skips most of processing, does not produce a .tbl file unless the
       custom library is in the >name#class format.

-l(ow) does not display simple repeats or low_complexity DNA in the annotation
-u     creates an untouched annotation file besides the manipulated file
-xm    creates an additional output file in cross_match format (for parsing)
-ace   creates an additional output file in ACeDB format
-gff   creates an additional output file in General Feature Format (GFF)
-poly  creates an output file listing only potentially polymorphic simple repeats
-no_id leaves out final column with unique number for each element (was default)
-fixed creates an (old style) annotation file with fixed width columns
-excln calculates repeat densities excluding long stretches of Ns in the query
-orf2   results in sometimes negative coordinates for L1 elements; all L1 subfamilies
       are aligned over the ORF2 region, sometimes improving interpretation of data
-a     shows the alignments in a .align output file


1.6    WU-blast option

-w(ublast) 

Joey Bedell and Warren Gish at the St Louis Washington University
Genome Center have written a script, MaskerAid, that makes it possible
to call Warren's WU-blast as if it is cross_match
(http://sapiens.wustl.edu/MaskerAid). Through MaskerAid, WU-blast can
accept cross_match options and the cross_match complexity adjustment
to the alignment score is applied.  Although the script has wider
applications, the primary idea was to provide RepeatMasker with a
faster search engine.

For longer sequences, default MaskerAided RepeatMasker runs take about
as long as cross_match powered runs at -qq settings (see "2.4
Sensitivity and speed"). The speed settings have relatively little
effect on the speed when using MaskerAid, with the fastest settings
only 1.25-1.75 times as fast as the slowest settings, while the
sensitivity increases significantly at the slower settings.  Thus, I
recommend always running RepeatMasker in sensitive (-s) or default mode
when using MaskerAid. I've made the
difference in parameters between sensitive and default settings larger
when using MaskerAid, to make these speed options more meaningful and
gain more sensitivity (with little cost in speed).

Even with these more extreme parameters, the sensitivity can't quite
reach that of the sensitive settings using cross_match, but it comes
very close, and the huge difference in speed will make this option very
attractive.

Among the few caveats the most important is that, when using MaskerAid,
currently no alignment files can be returned by RepeatMasker. Also, I
haven't done quite such an extensive quality control on the
MaskerAided output, so that false positives could be more common
(still will be close to non-existent).

When using the wublast option, hyphens in the sequence are retained
(in default mode all non-letters are deleted from the sequence).
WU-blast uses hyphens to indicate insurmountable barriers and
alignments will not span hyphens.


2 METHODOLOGY AND QUALITY OF OUTPUT

2.1 Methodology

RepeatMasker compares the query sequence against one or more files of
fasta sequences. The sequences in the libraries provided with
RepeatMasker are consensus sequences derived from alignment of
multiple copies of interspersed or satellite repeats. For interspersed
repeats, a consensus tends to approach the sequence of the
transposable element from which the repeat is derived.

Both cross_match and WU-blast perform their Smith-Waterman (SW)
alignments by first identifying exact word matches and restricting the
alignment to a band or matrix surrounding these exact matches.
Overlapping matrices are merged. The speed settings of RepeatMasker
are purely changes in the minimum word length from which an alignment
can be seeded and, in some cases, changes in the width of the band. A
wider bandwidth allows more gaps in the alignment and, more
importantly, increases the likelihood that neighboring matrices
overlap.

Cross_match does a low complexity adjustment of the raw SW score. This
adjustment is performed by the MaskerAid script when WU-blast is
used. Low complexity matches are the primary cause of false matches,
and this adjustment contributes significantly to the high selectivity
of RepeatMasker (see 2.5)

As a result of the existence of many related consensus sequences in
the database, usually multiple repeats match one region in the query
at the same time. Generally, cross_match and WU-blast report to the
script only those matches that are less than 80-90% overlapped by a
higher scoring match. This implies that, at first approximation, names
are assigned to repeats based on the highest SW score. Given
appropriate consensus sequences and alignment parameters, this is
intuitively correct as well. However, the scripts have a lot of code
to improve on this first approximation, primarily to deal with partial
matches.
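
The 'first approximation' described above, keeping a match only if it
is not largely covered by a higher scoring match, can be sketched in
Python. This is a simplified illustration, not the logic of cross_match,
WU-blast or the RepeatMasker scripts: the function name, the tuple
layout, the example repeat names and the single 80% threshold are all
assumptions, and each match is compared with one higher scoring match
at a time.

```python
def filter_overlapped(matches, max_overlap=0.8):
    """matches: list of (start, end, score, name), 1-based inclusive coordinates."""
    kept = []
    # Visit matches from highest to lowest score.
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        length = m[1] - m[0] + 1
        # Largest overlap (in bp) with any single higher scoring kept match.
        covered = max(
            (min(m[1], k[1]) - max(m[0], k[0]) + 1 for k in kept),
            default=0,
        )
        if covered < max_overlap * length:
            kept.append(m)
    return sorted(kept)
```

In this sketch a lower scoring match that lies almost entirely inside
a higher scoring one is dropped, while a match that only partly
overlaps it is kept for the later processing steps.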

The cut-off SW score above which matches are reported is empirically
derived (see '2.5 selectivity' below). Note that there is no cut-off
divergence level; reported matches can be less than 60% identical.

The alignment parameters (substitution matrices, and gap initiation
and extension penalties) are derived from data harbored in multiple
alignments of a special subset of interspersed repeats. The derived
matrices are theoretically optimal for a series of conditions (see
below). The gap penalties are sub-optimal, primarily because gap
lengths have a non-linear distribution and are poorly represented by a
single gap-extension penalty.

For primate, rodent and other mammalian DNA, the query is compared to
consecutive subsets of repeat libraries. For primates, perfect simple
repeats, full-length Alus, full-length short interspersed repeats, and
young L1 3' ends are first (and in that order) clipped from the
sequence to expose underlying older elements. Subsequently, the query
is compared to most repeats, a set of ancient elements under
especially sensitive settings, a large set of long retroviral
sequences under faster settings (to save time), and AT-rich L1 3' ends
that may have been discarded earlier as low complexity matches.
Finally, simple repeats and low complexity regions are masked.


2.2  Scoring matrices

We have calculated statistically optimal scoring matrices for the
alignment of neutrally diverging (non-selected) sequences in human DNA
to their original sequence. These matrices have been in use since the
May 1998 release. The matrices were derived from alignments of DNA
transposon fossils to their consensus sequences (Arian Smit, Arnie Kas
& Phil Green, in preparation...). A series of different matrices are
used dependent on the divergence level (14-25%) of the repeats and the
background GC level (35-53%, neutral mutation patterns differ
significantly in different isochores).

These matrices are (close to) optimal for human genomic sequences
longer than 10 kb, for which length the GC level usually is
representative of the isochore in which the sequence lives. However,
the GC level of small fragments can diverge a lot from that of the
surrounding sequence (e.g. a fragment spanning a CpG island, a GC rich
exon or an AT-rich LINE1 element), and RepeatMasker defaults to using
matrices derived for
a 43% GC background when a sequence is shorter than 2000 bp or when a
batch file is submitted. When the appropriate background GC level is
known, this can be entered with the -gc option.
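
The default behavior just described can be summarized in a few lines
of Python. This is only a sketch of the decision, not the code in the
scripts; the function names are made up, and the step that maps a GC
level to one of the actual matrices is left out.

```python
DEFAULT_GC = 43.0   # background assumed for short sequences and batches
MIN_LENGTH = 2000   # below this length the GC level is not trusted


def gc_percent(seq):
    """GC percentage over the A, C, G and T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return 100.0 * gc / (gc + at) if gc + at else DEFAULT_GC


def background_gc(seqs, user_gc=None):
    """Pick the background GC level for a list of input sequences.

    user_gc stands for a value entered with the -gc option.
    """
    if user_gc is not None:
        return float(user_gc)
    if len(seqs) != 1 or len(seqs[0]) < MIN_LENGTH:
        return DEFAULT_GC
    return gc_percent(seqs[0])
```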

(Note that these matrices are an integral portion of RepeatMasker and
are covered under the same restrictions as the scripts and databases
as described in the signed software agreement).


2.3  Repeat databases

The interspersed repeat databases provided in the RepeatMasker package
are maintained in synch with the repeat databases (Repbase Update)
copyrighted by the Genetic Information Research Institute (G.I.R.I.).
Whereas non-mammalian libraries currently are identical to the RepBase
Update fasta files except for formatting, mammalian databases are
extensively modified. The modification primarily entails inclusion of
complete sets of subfamilies for Alu and L1, modifications to avoid
false matches and false annotations, and subdivision in multiple sets
for optimization of the analysis.

We transformed the RepBase database from a set of prototypes to a set
of consensus sequences (see my dissertation if you're interested) to
allow both determination of the origin of these repeats and improved
detection. A consensus properly derived from a multiple alignment of
copies closely approaches the original transposable element, since
substitutions accumulate by-and-large unselected in copies of
transposable elements. Because of the latter, a copy is on average
twice as close to the consensus as to any other copy. Consensus
sequences are also more sensitive search tools because directional
substitution matrices can be used (see above).
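
The "twice as close" statement can be checked with a small simulation.
The sketch below is an illustration under simple assumptions: random
substitutions only, an equal rate at every site, no gaps, and the
source sequence standing in for a well derived consensus. None of it
is RepeatMasker code.

```python
import random


def mutate(seq, rate, rng):
    """Substitute each base with probability rate by a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def divergence(a, b):
    """Fraction of mismatching positions between two equal length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def simulate(seed=0, length=20000, rate=0.10):
    """Return (copy to source, copy to copy) divergence for two copies."""
    rng = random.Random(seed)
    source = "".join(rng.choice("ACGT") for _ in range(length))
    copy1 = mutate(source, rate, rng)
    copy2 = mutate(source, rate, rng)
    return divergence(copy1, source), divergence(copy1, copy2)
```

The copy to copy divergence comes out at a little under twice the copy
to source divergence, because a small number of sites are hit in both
copies.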

Consensus sequences would be identical to the original transposable
element if all copies were inserted at about the same time from a
single source.  DNA transposon copies approach this ideal, but
retroposons (giving rise to most repeats in our genome) live for long
periods in a genome and evolve doing so. Thus, over time the sequence
of the transposable element has changed, and a single consensus does
not describe the original sequence of each copy. Also, usually at any
time multiple distinct sequences with a common origin, cousins if you
will, were active. This situation is reflected by the presence in the
databases of multiple subfamilies for the more common retroposons
(usually having the same name ending in a different number or letter).

The mammalian repeat libraries contain, besides consensus sequences
for transposon derived repeats, consensus satellite units, and a set
of *small structural RNA sequences*. The latter have created a large
amount of processed pseudogenes in our genome, and in that way are
interspersed repeats.


2.4 Sensitivity and speed  

The program can be run at four levels of sensitivity. The only
difference between these settings is the minimum match or word length
in the initial (not quite) hashing step of the cross_match program
(see the cross_match/phrap documentation). The "slow" setting will
find and mask 0-5% more repetitive DNA sequences than by default,
whereas the "quick" settings miss 5-10% of the sequences masked by
default. The alignments may extend more or be somewhat more accurate
in the more sensitive settings as well. The -s (slow/sensitive)
setting will take on average 2.5 x as long as the default setting,
whereas the -q (quick) setting is 3 to 6 times faster than the
default. 

Because of the continuing growth of the human repeat databases,
RepeatMasker's speed, when using the same settings, has actually
decreased over time.  For when you're in a hurry, I've added a -qq
(rush job) option that runs with the same speed as the old -q option,
but is less sensitive.

Several developments should allow you to do RepeatMasker analyses at
an agreeable speed though: (1) your computers are faster, (2) there are
multithreaded versions of cross_match available, (3) you can run batch
files and larger sequences on multiple processors with -par, and (4)
you can choose to run RepeatMasker with WU-blast.  Note that the speed
gains from multiple processors and from multithreaded cross_match or
WU-blast are mostly additive.

Here are some user times (in seconds) of human sequences on a single
Digital UNIX V4.0D processor (


       cross_match(default)  WU-blast (-w)
length  -qq   -q  def   -s   -qq  def   -s
 5 kb     8   14   29   64    11   13   15   
10 kb    11   21   57  134    14   15   20 
20 kb    16   33  117  290    19   21   34
40 kb    25   55  227  572    30   33   54
80 kb    41   99  448 1145    55   58   99


Bedell and Gish do a more extensive comparison in their paper on
MaskerAid (Bioinformatics 16:1040-1). The -s times are a bit slower
here, because, after they performed their comparisons, I've made the
-s settings more sensitive when using WU-blast. The sensitivity of
runs with MaskerAid/WU-blast is approximately half a step behind that
obtained with the same settings using cross_match, except for the -s
settings, which I've tuned to be almost like the -s settings with
cross_match.

The relative analysis speeds are very dependent on the computer; for
example, our Linux server is 'better' in short sequences than this
DEC, though slower in analyzing long sequences, and Bedell and Gish
achieved a 30-fold speed-up at sensitive settings using MaskerAid
on their computers.
The speed is also dependent on the repeat content of the sequence. For
human sequences, Alu rich sequences are analyzed fastest, LINE rich
sequences somewhat slower, repeat poor regions slower still, and long
satellite regions can take a while.

If you have several shorter sequences it is much faster to run
RepeatMasker on a batch file (all sequences in one file). On the above
computer, in the rush mode (cross_match), a batch of 10 5 kb sequences
is analyzed in 23 seconds, 20 5 kb sequences in 34 seconds, etc.

The user time for sequences or sequence batches over 100 kb (or
whatever the fragment size is set to) is linearly related to the
length of the query due to the fragmentation of the query sequence.

The increase in speed by using multiple processors is dependent on
the usage of the computer and the above mentioned non-linear
relationships of sequence length and processing time. However, under
the right circumstances, using 2 processors can increase the speed
close to twofold, because the most time-consuming processes are
performed in parallel.


2.5 Selectivity and matches to coding sequences 

The cutoff Smith-Waterman scores for masking interspersed repeats are
conservative, since masking of one short potentially interesting
region generally is more harmful than not masking a number of hard to
find matches.  If there are any false matches, they tend to have
scores close to the cutoff, which is 225 for most repeats, 300 for the
low-complexity LINE1 search*, and 180 for the very old MIR, LINE2 and
MER5 sequences.
* most LINE1s are detected with a 225 cut-off, but in one step in
RepeatMasker the low-complexity score adjustment is turned off to find
ancient A-rich L1 elements.
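
If you post-process the output yourself, the cutoffs quoted above can
be kept in a small table to flag matches that score close to the
cutoff. The Python below is a sketch; the stage labels, the 25 point
margin and the choice to treat a score equal to the cutoff as reported
are assumptions, not RepeatMasker settings.

```python
# Cutoff SW scores as quoted in the text; the keys are made-up labels.
CUTOFFS = {
    "most_repeats": 225,
    "low_complexity_line1": 300,
    "ancient_mir_line2_mer5": 180,
}


def is_reported(score, search):
    """True if a match from the given search stage reaches its cutoff."""
    return score >= CUTOFFS[search]


def is_borderline(score, search, margin=25):
    """True if a reported match scores within margin of its cutoff."""
    cutoff = CUTOFFS[search]
    return cutoff <= score < cutoff + margin
```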

We tested for the occurrence of false matches in randomized and in
inverted (but not complemented) DNA. To check a variety of conditions,
four 150 to 400 kb DNA fragments were analyzed ranging in GC level
from 36% to 54%. To retain seeds for Smith-Waterman alignments,
randomization was done at the 10 bp word level. Note that the inverted
sequences retain the low complexity and simple repeat patterns of the
original sequences. Even at sensitive settings, for which false
matches are most likely, the 1998-2002 versions of RepeatMasker have
reported no (false) matches at all to interspersed repeats in the
randomized or inverted sequences. No simple repeats were reported in
the randomized queries.
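
Control sequences of this kind are easy to produce. The Python sketch
below shows one way to do it; cutting the sequence into consecutive,
non-overlapping 10 bp words before shuffling is an assumption, as the
text only states the word level at which randomization was done.

```python
import random


def randomize_words(seq, word_len=10, seed=0):
    """Shuffle consecutive word_len sized words of seq."""
    words = [seq[i:i + word_len] for i in range(0, len(seq), word_len)]
    random.Random(seed).shuffle(words)
    return "".join(words)


def invert(seq):
    """Reverse seq without complementing it."""
    return seq[::-1]
```

Shuffling keeps every 10 bp word intact, so exact word matches can
still seed alignments, while inversion keeps the base composition and
the low complexity and simple repeat patterns of the original.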

RepeatMasker returned only a single probably false match (71 bp) when
analyzing a batch of 4440 coding regions in human mRNAs (7,200,000 bp)
at sensitive settings. The coding regions were collected from GenBank,
based on annotations, filtered for the presence of complete ORFs and
initiator methionines, and made more or less non-redundant. When each
coding region was analyzed individually using the -gccalc option, 5
matches (414 bp, 0.006%) were falsely masked (156 bp at default speed,
76 bp at quick settings). In this analysis each sequence was analyzed
with matrices chosen based on the actual GC level, even for very short
sequences, while in the batch analysis of the coding regions the
'average' 43% GC matrices were used.

The 1998 and later versions of RepeatMasker show somewhat more false
masking when a pre-1998 version of cross_match is used. These are
primarily the result of improper assumptions of the background
nucleotide frequency used in the scoring matrix calculation when
adjusting for the complexity of a match. Specifically, a very GC rich
region in an AT-rich isochore (like an exon) may improperly match a GC
rich repeat, since the scores for C/G matches are higher in the used
scoring matrix than for A/T matches (calculated for this AT rich
background), whereas the old cross_match assumed a 50% GC background
in these calculations and equal scores for A/T and G/C matches. The
new version of cross_match reads the correct nucleotide background
level from the matrix used.


2.6 Simple repeats and low complexity DNA

Low-complexity DNA 

By default, along with the interspersed repeats, RepeatMasker masks
low-complexity DNA. Simple repeats (micro-satellites) can originate at
any site in the genome, and therefore have an interspersed
character. Other low-complexity DNA, primarily poly-purine/
poly-pyrimidine stretches, or regions of extremely high AT or GC
content will result in spurious matches in some database searches as
well (especially in the ungapped BLASTN searches). For example,
extremely AT-rich regions consistently will give very low probability
matches to mitochondrial DNA in BLASTN searches. The settings are very
stringent, and we think that few if any sequences informative in
database searches are masked as low-complexity DNA. However, you can
skip the low-complexity DNA masking using the option -nolow or -l(ow).

Under the current settings a 100 bp stretch of DNA is masked when it
is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C)
nucleotides. The settings are slightly more stringent than the
original settings, partly because the gapped BLAST programs are less
sensitive to short regions of low complexity than the old gapless
BLAST. In coding regions I have not yet found extensive regions (>10
bp) masked as low complexity DNA that would not be masked by the
combined XNU and SEG filters routinely used in BLASTX.
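
To get a feel for how stringent these numbers are, the thresholds can
be applied to your own sequence with a simple window scan. The Python
below is only an illustration of the quoted thresholds, not the way
RepeatMasker finds low complexity DNA internally; the function names
and the sliding window approach are assumptions.

```python
def flag_windows(seq, window, min_at, min_gc):
    """Return 0-based starts of windows with enough A/T or G/C bases."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        at = chunk.count("A") + chunk.count("T")
        gc = chunk.count("G") + chunk.count("C")
        if at >= min_at or gc >= min_gc:
            hits.append(start)
    return hits


def is_low_complexity(seq):
    """Apply the two quoted rules to seq.

    100 bp windows: more than 87% AT (88 bases) or 89% GC (90 bases).
    30 bp windows: at least 29 A/T or 29 G/C bases.
    """
    return bool(flag_windows(seq, 100, 88, 90)
                or flag_windows(seq, 30, 29, 29))
```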


Annotation of simple repeats  

Although RepeatMasker does a good job in masking simple repeats to
avoid spurious matches in database searches, it is not written to find
and indicate all possibly polymorphic simple repeat sequences. Only
di- to pentameric and some hexameric repeats are scanned for and
simple repeats shorter than 20 bp are ignored. The -poly option prints
out a separate list of simple repeats of < 10% divergence from a
perfect repeat. However, even long perfect repeats may not be
presented in this list; e.g. two perfect 40 bp long (CA)n repeats
interrupted by 10 Ts are aligned in one piece and may be reported as
having > 10% divergence from the consensus. Many perfect hexameric or
longer unit repeats will be listed as more or less diverged smaller
unit repeats and may not appear in the .polyout file.

Also note that, in the default output, simple repeats expanded from
the poly A tails of ALUs and LINE1 are now included in the Alu or
LINE1 annotation. This cleans up the annotation a bit and lets the
stand-alone poly A regions stand out (they may indicate the presence of
a processed pseudogene). However, even perfect simple repeats in such
tails will be hidden in the .out file.

A program optimized to quickly find all dimeric to pentameric repeats
is sputnik, available at ftp://ftp.nhgri.nih.gov/pub/software/sputnik/
or http://www.abajian.com/sputnik/. Any local duplications (tandem,
inverted, or otherwise) can be detected with the program miropeats
(http://www.ebi.ac.uk/~jparsons/packages/miropeats/miropeats.html). 
Web sites dedicated to identifying tandem repeats are
http://pompous.swmed.edu and http://c3.biomath.mssm.edu/trf.html


3  HOW TO READ THE RESULTS

3.1 The annotation (.out) file

The annotation file contains the cross_match summary lines. It lists
all best matches (above a set minimum score) between the query
sequence and any of the sequences in the repeat database or with low
complexity DNA. The term "best matches" reflects that a match is not
shown if its domain is over 80% or 90% contained within the domain of
a higher scoring match, where the "domain" of a match is the region in
the query sequence that is defined by the alignment start and
stop. These domains have been masked in the returned masked sequence
file. In the output, matches are ordered by query name, and for each
query by position of the start of the alignment.

Example: 

  SW  perc perc perc  query    position in query     matching repeat      position in  repeat
score div. del. ins.  sequence begin  end  (left)    repeat  class/family   begin end (left) ID
...
 1320 15.6  6.2  0.0  HSU08988  6563 6781 (22462) C  MER7A   DNA/MER2_type    (0)  337  104  20
12279 10.5  2.1  1.7  HSU08988  6782 7718 (21525) C  Tigger1 DNA/MER2_type    (0) 2418 1486  19
 1769 12.9  6.6  1.9  HSU08988  7719 8022 (21221) C  AluSx   SINE/Alu         (0)  317    1  17
12279 10.5  2.1  1.7  HSU08988  8023 8694 (20549) C  Tigger1 DNA/MER2_type  (932) 1486  818  19
 2335 11.1  0.3  0.7  HSU08988  8695 9000 (20243) C  AluSg   SINE/Alu         (5)  305    1  18
12279 10.5  2.1  1.7  HSU08988  9001 9695 (19548) C  Tigger1 DNA/MER2_type (1600)  818    2  19
  721 21.2  1.4  0.0  HSU08988  9696 9816 (19427) C  MER7A   DNA/MER2_type  (224)  122    2  20

This is a sequence in which a Tigger1 DNA transposon has integrated
into a MER7 DNA transposon copy. Subsequently two Alus integrated in
the Tigger1 sequence. The first line is interpreted as follows:

  1320     = Smith-Waterman score of the match, usually complexity adjusted
	The SW scores are not always directly comparable. Sometimes
	the complexity adjustment has been turned off, and a variety of
	scoring-matrices are used dependent on repeat age and GC level.

  15.6     = % divergence = mismatches/(matches+mismatches) **
  6.2      = % of bases opposite a gap in the query sequence (deleted bp)
  0.0      = % of bases opposite a gap in the repeat consensus (inserted bp)
  HSU08988 = name of query sequence
  6563     = starting position of match in query sequence
  6781     = ending position of match in query sequence
  (22462)  = no. of bases in query sequence past the ending position of match
  C        = match is with the Complement of the repeat consensus sequence
  MER7A    = name of the matching interspersed repeat
  DNA/MER2_type = the class of the repeat, in this case a DNA transposon 
	    fossil of the MER2 group (see below for list and references)
  (0)      = no. of bases in (complement of) the repeat consensus sequence 
             prior to beginning of the match (0 means that the match extended 
             all the way to the end of the repeat consensus sequence)
  337      = starting position of match in repeat consensus sequence
  104      = ending position of match in repeat consensus sequence
  20       = unique identifier for individual insertions 

  An asterisk (*) following the final column (see below example)
  indicates that there is a higher-scoring match whose domain partly
  (<80%) includes the domain of the current match.

** This has changed in August 2001: cross_match output gives the
percent mismatches/(matches+mismatches+unaligned bases in query). I
don't think this definition is otherwise commonly used and most users
will assume the divergence level would be mismatches/(matches+mismatches).
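
For those who want to read the .out file with their own scripts, the
field layout described above can be parsed as in the following Python
sketch. The function and the dictionary keys are made up for this
example, and it expects a data line, not the two header lines.

```python
def parse_out_line(line):
    """Split one .out data line into named fields."""
    f = line.split()
    strand = f[8]
    if strand == "C":
        # complement: (left) comes first, then start and end in the repeat
        left, start, end = f[11], f[12], f[13]
    else:
        start, end, left = f[11], f[12], f[13]
    return {
        "score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "q_begin": int(f[5]),
        "q_end": int(f[6]),
        "q_left": int(f[7].strip("()")),
        "strand": strand,
        "repeat": f[9],
        "class_family": f[10],
        "r_start": int(start),
        "r_end": int(end),
        "r_left": int(left.strip("()")),
        "id": int(f[14]),
        # trailing asterisk: partly inside a higher scoring match
        "overlapped": len(f) > 15 and f[15] == "*",
    }
```

Note that for matches on the complement strand the parenthesized
"left" number comes before the two repeat positions, as in the
example line.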

Note that the SW score and divergence numbers for the three Tigger1
lines are identical. This is because the information is derived from a
single alignment (the Alus were deleted from the query before the
alignment with the Tigger element was performed). The ProcessRepeats
script makes educated guesses if any pair of fragments is derived from
the same element or not; if so, the fragments will have the same ID in
the last column. In this example it figured that the MER7A fragments
represent one insert.

Here is another example that shows how much trouble ProcessRepeats
goes to in order to defragment elements and how the ID can be useful in
interpreting the results:

 7120 19.9 0.6 0.3 NT_001227  85631  87837 (19816) + L1PA16    LINE/L1       1 1885 (4964)  123  
 2503 14.9 6.5 0.7 NT_001227  87839  88241 (19412) + MSTA      LTR/MaLR      1  428    (0)  100  
  867 12.9 2.7 0.0 NT_001227  88242  88388 (19265) + MSTA-int  LTR/MaLR      1  151 (1500)  100 *
 5219 19.5 2.9 0.6 NT_001227  88386  89342 (18311) + MSTA-int  LTR/MaLR    629 1607   (44)  100  
 8003  3.5 0.8 0.0 NT_001227  89362  90773 (16880) C L1PA3     LINE/L1     (0) 6155   4745  103  
 7677  3.5 0.0 0.0 NT_001227  90795  94059 (13594) C L1PA3     LINE/L1     (0) 6155   2872  104  
 9050  6.5 0.4 0.1 NT_001227  94060  95127 (12526) C MER11C    LTR/ERVK    (0) 1071      1  106  
 7677  3.5 0.0 0.0 NT_001227  95128  97101 (10552) C L1PA3     LINE/L1  (3282) 2873    900  104  
 5619  7.8 0.3 0.9 NT_001227  97097  97865  (9788) C L1PA3     LINE/L1  (5370)  776     13  104 *
  320 16.9 0.0 1.7 NT_001227  97876  97934  (9719) + MSTA-int  LTR/MaLR   1594 1651    (0)  100  
 1475 19.0 4.8 5.6 NT_001227  97935  98255  (9398) + MSTA      LTR/MaLR      1  323   (48)  100  
 2322 14.4 0.8 1.6 NT_001227  98256  98629  (9024) + THE1C     LTR/MaLR      1  371    (0)  112  
10051 12.9 3.5 4.3 NT_001227  98630 100221  (7432) + THE1C-int LTR/MaLR      1 1580    (0)  112  
 2359 15.7 0.3 1.9 NT_001227 100224 100598  (7055) + THE1C     LTR/MaLR      3  371    (0)  112  
 1475 19.0 4.8 5.6 NT_001227 100599 100646  (7007) + MSTA      LTR/MaLR    323  371    (0)  100  
 1360 19.4 8.2 1.7 NT_001227 100662 100955  (6698) + MSTA      LTR/MaLR    114  426    (0)  113  
11892 24.7 1.9 2.0 NT_001227 100968 101243  (6410) + L1PA16    LINE/L1    1881 2143 (4706)  123  
 2062 11.9 8.4 0.0 NT_001227 101244 101563  (6090) C L1PA12    LINE/L1    (10) 6164   5818  116  
11892 24.7 1.9 2.0 NT_001227 101564 105425  (2228) + L1PA16    LINE/L1    2137 5989  (860)  123  
  257  0.0 0.0 2.9 NT_001227 105436 105469  (2184) + (TAA)n    Simple        2   34    (0)  118  
 2189 18.2 0.2 0.7 NT_001227 105470 105893  (1760) + L1PA16    LINE/L1    6062 6483  (386)  123  
  255  6.1 0.0 0.0 NT_001227 105896 105928  (1725) + (TA)n     Simple        1   33    (0)  120 *
  369  0.0 0.0 0.0 NT_001227 105928 105968  (1685) + (GA)n     Simple        2   42    (0)  121  
  305 18.8 0.0 1.0 NT_001227 105971 106066  (1587) + (TA)n     Simple        2   96    (0)  122  
 1589 21.2 1.6 1.1 NT_001227 106068 106449  (1204) + L1PA16    LINE/L1    6485 6868    (1)  123  

This entire 20,819 bp block of sequence consists of an L1PA16
(#123), in which 7 or 8 elements have integrated (it is unclear to me
if the MSTA #113 is a separate integration or a tandem duplication).
There are at least four layers with MER11 (#106) inserted in L1PA3
(#104) inserted in MSTA (#100, maybe in #113) inserted in L1PA16.
L1PA16 is already primate specific, so that all these insertions took
place in primate evolution.

The ID column helps much in deciphering the events. It also should be
a basis for the graphic display of RepeatMasker output.
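
As a starting point for such use of the ID column, the fragments of
each insertion can be collected with a few lines of Python. This is a
sketch with made-up names; it assumes whitespace separated .out data
lines and skips anything that does not start with a numeric score.

```python
from collections import defaultdict


def fragments_by_id(out_lines):
    """Group (repeat name, query begin, query end) tuples by ID."""
    groups = defaultdict(list)
    for line in out_lines:
        f = line.split()
        if len(f) < 15 or not f[0].isdigit():
            continue  # header, blank or otherwise non-data line
        groups[int(f[14])].append((f[9], int(f[5]), int(f[6])))
    return dict(groups)
```

Applied to the example above, ID 104 collects the three L1PA3
fragments that flank the MER11C insertion.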


3.2  Alignments

When using the -a option, a .align file is created that contains the
alignments of your query sequence to the matching repeat consensus
sequences. The alignments are given in the same order as listed in the
.out file. 

These alignments may be most generally useful for people designing PCR
primers in a region full of repeats. It is possible to get primers
that work in a whole genome when the 3' end of the primer lies in a
region of a repeat (even a common one) that is very different from the
consensus.

Here is an example of an alignment of a MIR spanning an Alu element
deleted in an earlier step:

665  28.45  2.93  5.02  g5129s420  7350  7882  (1924)  C  MIR#SINE/MIR  (1)  261  28  3

  g5129s420         7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397   
                           v     v           i i  i v     viv    v i v v  v
C MIR#SINE/MIR       261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212    

  g5129s420         7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443   
                           vi  i iv   i        i i   i  i    i  v    i     
C MIR#SINE/MIR       211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164    

  g5129s420         7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787   
                         iiv              v i         iii   v      i  i  i 
C MIR#SINE/MIR       163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123    

  g5129s420         7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837   
                          v i v  i      i v  v  v     ii     v      i  ii  
C MIR#SINE/MIR       122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74     

  g5129s420         7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882   
                          iivi    i iiii  i    i i         i  v     i  
C MIR#SINE/MIR        73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28     

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)  


In cross_match alignments mismatches caused by transitions are
indicated with an i and those by transversions with a v. The position
of the deleted Alu in the query is indicated with an X in the
g5129s420 sequence. You can use the -inv option to produce alignments
in the orientation of the consensus sequence.
The lines in the .out file describing this match appear as:

 578  28.4  2.9  5.0  g5129s420  7350  7467 (533) C  MIR   SINE/MIR  (1)   261  149
2222  10.2  2.7  0.0  g5129s420  7468  7762 (238) C  AluSg SINE/Alu  (7)   303    1
 578  28.4  2.9  5.0  g5129s420  7763  7882 (118) C  MIR   SINE/MIR  (113) 149   28
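
The i and v marks in the alignment follow the usual definition:
a transition exchanges two purines (A, G) or two pyrimidines (C, T),
and a transversion exchanges a purine for a pyrimidine or vice versa.
The Python sketch below counts both for a pair of aligned lines; the
function name is made up, and gap characters and the X that marks the
deleted Alu are simply skipped.

```python
PURINES = {"A", "G"}


def count_substitutions(query_aln, repeat_aln):
    """Return (transitions, transversions) for two aligned strings."""
    transitions = transversions = 0
    for q, r in zip(query_aln.upper(), repeat_aln.upper()):
        if q not in "ACGT" or r not in "ACGT" or q == r:
            continue  # gap, X, other symbol, or a matching base
        if (q in PURINES) == (r in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions
```

For the first block of the alignment above (query positions 7350 to
7397) this gives 5 transitions and 9 transversions, in agreement with
the i and v marks printed between the two sequences.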


Discrepancies between alignments and the .out file

Discrepancies between alignments and annotation result from the
adjustments made by the ProcessRepeats script to produce more legible
annotation. This annotation also tends to be closer to the biological
reality than the raw cross_match output. 

For example, adjustments often are necessary when a repeat is
fragmented through deletions, insertions, or an inversion.  Many
subfamilies of repeats closely resemble each other, and when a repeat
is fragmented these fragments can be assigned different subfamily
names in the raw output.  ProcessRepeats often can decide if fragments
are derived from the same integrated transposable element and which
subfamily name is appropriate (subsequently given to all fragments).
This can result in discrepancies in the repeat name and matching
positions in the consensus sequence (subfamily consensus sequences
differ in length).

In many cases matches are fused into one annotation. To give just four
common examples: (1) A-rich simple repeats originated from the poly A
tail of ALUs and LINEs are incorporated in the annotation of the Alu
or LINE1.  (2) In large sequences that are analyzed in fragments
consecutive fragments overlap and repeats in these overlaps will
appear twice (partially or wholly) in the alignment file.  (3) There is
an 'endless' number of subfamilies for retroposons which can not all
be represented in the databases and sometimes an element is matched by
overlapping pieces of two related subfamilies (which will be
merged). (4) You may find large discrepancies in position numbering if
an element includes tandem repeat units. For example, MER109 contains
multiple ~300 bp repeat units which can lead to overlapping
matches. In the annotation such matches are fused.


Specific LINE1 problems:

Some other discrepancies are specific to LINE elements. These repeats
do not appear as complete elements in the consensus database. This is
mostly due to the contrast in conservation over the length of its
sequence during its evolution in the mammalian genome; the ~3 kb ORF2
region of LINE1 has been very conserved, whereas the untranslated
regions and ORF1 to a lesser degree have evolved very fast. Thus the
3' end or 5' end of an ancient LINE1 does not even remotely resemble
that of the currently active LINE1, whereas the coding region for
reverse transcriptase is closely related. Thus, many subfamilies have
been defined for both the 5' and 3' UTRs (30 and 52, resp.) of LINE1
elements in human DNA, whereas only four ORF2 entries are present in
the database. Besides some remaining uncertainties about which 5' ends
go with which 3' ends, including 50 full length (6 to 8 kb) LINE1
elements in the database would make the program very slow. LINE1
elements therefore are presented in the database in 3 pieces, and the
ProcessRepeats script puts these pieces together. As a result both the
names of the repeats and position numbering in the consensus sequence
are generally different in the alignments than in the output file.
The currently 3.3 kb LINE2 elements are likewise broken up in 3' UTRs
for different subfamilies and one (complete!) ORF2 region.

Between LINE1 subfamilies, the 3' UTR ranges from 500 bp to over 2000
bp (in L1MC/D3), and the length of the 5' UTR is even more variable,
even between subfamilies that show strong similarity in the 3' UTR.
To allow the LINE1 fragments to be put together, all position numbers
in older LINE1 subfamilies are normalized relative to the position of
ORF2 (the conserved part of LINE1) in a complete L1PA2 element. Since
some older elements have much longer 5' UTRs or ORF1-ORF2 linker
regions than L1PA2, this often results in the assignment of negative
position numbers for the 5' end of LINEs. Since the March 2000 release,
such positions and all positions in fragments thought to be part of
the same LINE1 insert are readjusted to count from the 5' end (which
is not necessarily the very 5' end of the LINE1 source gene, as these
are hard to derive for old elements). One problem with this approach
is that positions are not adjusted in detached 3' fragments that are
somehow not recognized by the program as originating from the same
insertion. Thereby, the common origin of the 5' fragments and 3'
fragments may become completely obscured. Use the option '-orf2' of
ProcessRepeats to retrieve an output in which all LINE1s are numbered
so that position 1 of ORF2 is aligned (resulting in occasionally
negative positions).


3.3  The summary (.tbl) file

The summary file is pretty much self-explanatory. Below is an example.

==================================================
file name: AC027410.fa
sequences:            1
total length:    152192 bp (148791 bp excl N-runs)
GC level:         39.59 %
bases masked:     88734 bp ( 59.64 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              195        45195 bp    30.37 %
      ALUs          178        43249 bp    29.07 %
      MIRs           17         1946 bp     1.31 %

LINEs:               54        31173 bp    20.95 %
      LINE1          36        24602 bp    16.53 %
      LINE2          18         6571 bp     4.42 %
      L3/CR1          0            0 bp     0.00 %

LTR elements:        13         5833 bp     3.92 %
      MaLRs           8         4079 bp     2.74 %
      ERVL            0            0 bp     0.00 %
      ERV_classI      5         1754 bp     1.18 %
      ERV_classII     0            0 bp     0.00 %

DNA elements:        17         4459 bp     3.00 %
      MER1_type      12         1903 bp     1.28 %
      MER2_type       4         2466 bp     1.66 %

Unclassified:         0            0 bp     0.00 %

Total interspersed repeats:    86660 bp    58.24 %


Small RNA:            2          124 bp     0.08 %

Satellites:           0            0 bp     0.00 %
Simple repeats:      22         1151 bp     0.77 %
Low complexity:      22          799 bp     0.54 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
Runs of >20 Ns in query were excluded in % calcs

The sequence(s) were assumed to be of primate origin.
RepeatMasker version 09/09/2000 , default mode
run with cross_match version 0.990329
RepBase 5.08, vs 09092000
----------------------------------------------------

Since the Sept 2000 release, the table indicates which version of
cross_match or WU-blast, and which database, was used for the
analysis.
AC027410 was a draft sequence, with individual contigs separated by
poly N linkers. In this case, the option -excln was used, so that
these strings of Ns were ignored for the percent calculations.
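
The percentages in the example table can be reproduced from the
numbers shown. The Python sketch below illustrates the calculation
under the rule stated in the table footer (runs of more than 20 Ns are
excluded); the function names are made up and this is not the code
used by the scripts.

```python
import re


def effective_length(seq, max_kept_run=20):
    """Length of seq without runs of more than max_kept_run Ns."""
    pattern = r"[Nn]{%d,}" % (max_kept_run + 1)
    in_runs = sum(len(m.group()) for m in re.finditer(pattern, seq))
    return len(seq) - in_runs


def percent(bp, total_bp):
    """Percentage of total_bp, rounded to two decimals as in the table."""
    return round(100.0 * bp / total_bp, 2)
```

With the 88734 masked bases and the 148791 bp that remain after
excluding the N-runs, percent(88734, 148791) gives the 59.64 shown in
the table.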


The classification in this table is well defined (see my reviews in
COGD) and forms a good basis for visual presentation and tabulation of
the repeats in your study.

We've been able to classify almost all human repeats, most of them
even in subclasses. The totals for the classes often are higher than
the sum of the subclasses, because not all elements fit in a subclass
and minor subclasses are not listed separately in the table (e.g. for
the human table the Mariner, Tc2, Piggybac, Zaphod, and Arthur families
of DNA transposons). The HAL1 element, ancestral to or derived from
LINE1, is added to the LINE1 total in this table.

Note that the "MER" subclasses have no relationship to each other. The
term MER (MEdium Reiterated repeats) was introduced for purely
administrative purposes to give the beast a name. The MER1 and MER2
groups were named after the first member of these groups identified as
an interspersed repeat in our genome. I'm considering renaming them
Tigger and Charlie group, which may be more memorable.

The nomenclature of mammalian repeats derived from retrovirus-like
elements is different from older versions. I've now divided this class
into the traditional class I, class II (ERVK), class III (ERVL)
retroviruses and the ERVL-derived but very distinct non-autonomous
MaLR elements. Since 'class III' is not an accepted classification
yet, for now this class is called ERVL. The large MER4-group of
non-autonomous LTR elements merges seamlessly with class I endogenous
retroviruses, making it hard to define, and is now incorporated in the
latter group. The ERV classes are most readily distinguished by the
size of the insertion site duplication: 4 bp in class I, 6 bp in class
II, 5 bp in class III. However, my LTR classification is based on
internal sequences and matches to LTRs with internal sequences, not on
the size of the target site duplication.


As described above, the ProcessRepeats script tries very hard to find
out which repeat fragments were derived from the same insertion event
of a transposable element, but there still will be a slight
overestimate of the copy numbers. 


The 'bases masked' number is calculated from the total number of Xs in
the masked sequences (before these are changed to Ns or lower case
letters). The other numbers are derived from the annotation (.out)
file.  Discrepancies between the 'bases masked' number and the sum of
'total interspersed repeats', small RNA, satellites and low complexity
are generally very small. Most of these are accounted for by unmasked
regions between flanking identical simple repeats, annotated as one
stretch if fewer than 10 bases separate them, and fragments of repeats
shorter than 10 bp which are not annotated but are masked. The numbers
may be quite different if you started out with a query sequence
containing Xs.


4    APPLICATIONS

4.1  Use in database searches

RepeatMasker is most commonly used to avoid spurious matches in
database searches. Generally this step is strongly recommended before
doing BLASTN or BLASTX equivalent searches with mammalian DNA
sequence.

The most common concern is of course whether RepeatMasker ever masks
coding regions.
We found that false matches in coding regions are extremely rare, but
did identify 38 genuine fragments of interspersed repeats (4214 bp) in
the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb)
analyzed (excluding annotated coding sequences of LINE1 elements and
endogenous retroviruses). We verified matches with lower scores by
comparing the translation products to closely homologous or redundant
entries in the database (the repeat-matching regions were always
exactly missing). In the majority of these cases, the sequences appear
to be improperly annotated or to represent either artificially or
naturally defective mRNAs (e.g.  alternatively spliced exons comprised
of a small fragment of a repeat).  Genuine overlaps of interspersed
repeats with coding sequences usually involve terminal regions of the
ORFs. Since the transposable element derived region is unique to the
protein in that (group of) species, the masking does not interfere
with database searches.

However, some cautionary comments are necessary. First, a few active
cellular genes are derived from transposable elements (see my 1999
review for a list of 19 in our genome). Some of these genes will be
partially masked by a (related) transposon in the repeat database. EST
and cDNA matches beyond the masked region should alert you.

Also be aware that RepeatMasker screens for small RNA pseudogenes and
will therefore mask the active small RNA genes as well (I think the
tRNA list is complete; I stopped adding snRNAs unless I found an
indication that they have created many pseudogenes). The number of
matches to small RNAs is listed in the overview table; (close to)
exact matches are possibly active genes, although related active genes
not in the database may show diverged matches.

A final caution relates to the fact that 3' UTRs of transcripts are
about as dense in interspersed repeats as intergenic regions
are. Thus, many ESTs are completely masked as repetitive DNA. I
recommend that, when you compare a genomic sequence against the EST
database or use ESTs as a query in nucleotide searches, you search
with the unmasked sequence as well; use a long minimum match (word
length/ word size) like 40 bp to identify exact matches and avoid most
background. Unfortunately the maximum word length that can be used in
the NCBI BLASTN program is 18 (due to memory limitations).
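
The idea behind a long minimum match can be illustrated with a small
Python sketch that reports exact shared words of 40 bp between an
unmasked genomic sequence and an EST. It is not a replacement for a
real search program: the function name is hypothetical, only the
forward strand is compared, and the index is held in memory.

```python
def shared_words(genomic, est, word=40):
    """Return (genomic_pos, est_pos) pairs (0-based) at which an exact
    match of `word` bases starts in both sequences."""
    g = genomic.upper()
    e = est.upper()
    # Index every word of the genomic sequence by its start position
    index = {}
    for i in range(len(g) - word + 1):
        index.setdefault(g[i:i + word], []).append(i)
    hits = []
    for j in range(len(e) - word + 1):
        for i in index.get(e[j:j + word], ()):
            hits.append((i, j))
    return hits
```

A run of consecutive hits on the same diagonal (genomic_pos minus
est_pos constant) corresponds to one exact match longer than the word
size.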


4.2  Identification of DNA source (contamination detection)

Bacterial insertion elements 

Bacterial insertion sequences (IS elements) often crop up in foreign
sequences, as their activity in E. coli is not always successfully
suppressed during cloning. As late as 2002, human entries in the
'finished' section of GenBank contained over a hundred IS elements.

With each run, RepeatMasker includes a quick check for bacterial
insertion elements that may have inserted during cloning. You can turn
this off with the -no_is option. The -is_only option limits the run to
this check only.

When a full-length element is found and a target site duplication is
confirmed, its location is both reported to the screen and stored in a
.alert file. The latter also contains information on possible
mouse<->human contamination.

-is_clip, -is_only
With the -is_only and -is_clip options, the detected IS element and one
of the flanking repeats (the duplicated target site) are clipped out to
restore the situation before the cloning artifact, prior to comparison
with the repeat databases. The original query fasta file will remain
unchanged. An insertion sequence-clipped, but otherwise unmasked query
sequence is printed to <file>.withoutIS.

For single sequences larger than 4 Mbp, the -maxsize option needs to 
be set to a number larger than the sequence length to retrieve this file.

With either of these options, a properly adjusted quality string is
printed to a file with the suffix .qual.withoutIS when a corresponding
phred quality file (.qual) is in the same directory. Note that these
names won't be such that the clipped sequence and quality file form a
pair for subsequent cross_match/phrap work. They need to be renamed,
as I assume one wants to do anyway.
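
The renaming can be scripted. The Python sketch below assumes that
the clipped files are called <file>.withoutIS and
<file>.qual.withoutIS, and that the downstream program expects the
quality file to carry the sequence file name plus ".qual"; check both
assumptions against your own output before using it. The function
names are hypothetical.

```python
import os

def pairing_renames(query_name):
    """Return (old, new) file name pairs that turn the clipped
    quality file into a partner of the clipped sequence file."""
    clipped_seq = query_name + ".withoutIS"
    clipped_qual = query_name + ".qual.withoutIS"   # assumed name
    return [(clipped_qual, clipped_seq + ".qual")]

def apply_renames(renames, directory="."):
    """Rename the files that exist; missing files are skipped."""
    for old, new in renames:
        old_path = os.path.join(directory, old)
        if os.path.exists(old_path):
            os.rename(old_path, os.path.join(directory, new))

print(pairing_renames("clone7.fa"))
```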


Most but not all IS elements can be precisely cut out. The element may
be at the edge of a sequence, or (rarely) the element may have
inserted improperly, lacking target site dups or missing terminal
bases (internal deletion products are generally handled okay). These
matches are reported, but are left untouched even in -is_only or
-is_clip mode.

The location of any IS element is both reported to the screen and
stored in an .alert file. The latter also contains information on
possible mouse<->human contamination.  

Here are the specifics of IS element insertions:

IS1     9 bp duplication
IS2     5 bp duplication; published sequence was too short
IS3     3 bp duplication
IS4     No examples of clonal artifacts; no dup site info
IS5     4 bp duplication; preferred target TCTAGA
IS10    9 bp duplication; extreme preference for CGCTNAGCN; published
        sequence for IS5 & 10 were too long, included preferred target site
IS30    2 bp duplication
IS150   3 bp dup, with one exception (4 bp); strong pref for CAGNNTGGGGCY
IS186   6 or 7 bp dup, extremely specific for CG rich hairpin: 
        SSSGGAGGGAGGCGGGG(6-7)CCCCGCCCSSSSSSSSSSS
Tn1000  5 bp duplication;
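
To illustrate how a confirmed target site duplication allows an
element to be cut out cleanly, here is a minimal Python sketch. It is
not the RepeatMasker implementation: the function name and the 0-based
coordinates are assumptions, and the duplication sizes are simply
copied from the table above (IS4 is left out because no size is
known).

```python
# Target site duplication sizes in bp, from the table above
TARGET_DUP_BP = {
    "IS1": (9,), "IS2": (5,), "IS3": (3,), "IS5": (4,), "IS10": (9,),
    "IS30": (2,), "IS150": (3, 4), "IS186": (6, 7), "Tn1000": (5,),
}

def clip_insertion(seq, start, end, element):
    """Remove the element at seq[start:end] plus one copy of its
    target site duplication.

    Returns the restored sequence, or None when no duplication of an
    expected size flanks the element (the match is then left alone).
    """
    for dup in sorted(TARGET_DUP_BP.get(element, ()), reverse=True):
        if start < dup or end + dup > len(seq):
            continue  # element too close to the edge of the sequence
        left = seq[start - dup:start]
        right = seq[end:end + dup]
        if left.upper() == right.upper():
            # keep the left copy, drop the element and the right copy
            return seq[:start] + seq[end + dup:]
    return None
```

For an element at the edge of a sequence, or one lacking the expected
duplication, the function returns None, matching the behaviour
described above where such matches are reported but not clipped.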


Human <-> mouse sequence contamination or mix-up.

A straightforward way to distinguish murine and human DNA is by
checking for either rodent-specific or primate-specific
repeats. Likewise, rodent or primate contamination in any other
mammalian or non-mammalian background can be picked up as well. If
your lab has, say, a rat and a pink fairy armadillo sequencing
project, rat DNA in a supposedly armadillo sequence can be picked up
quite reliably, depending on the length of the query.

When the option -rodspec or -primspec is used, RepeatMasker only
checks the query against a small library of repeats which have not
(yet) been observed in the 'other' species. The locations of the
matches are printed to <file>.alert. This function will be expanded to
other mammals when these species start to be sequenced in earnest.

I've checked for the specificity of the reported matches quite
extensively. Whenever two or more types of repeats are reported, the
odds are that the alert is correct. Very occasionally, a single
reported match could be a false alert. This is especially possible
when a 'new' mammalian species is analyzed, because, unbeknownst to
me, a related repeat may have amplified in such a genome.


Other species contamination.

When a supposedly rodent or primate clone is of non-mammalian origin,
very few if any interspersed repeats will be reported by
RepeatMasker. Human and mouse genomic sequences are on average 40-50%
dense in recognizable interspersed repeats, so that any stretch of
genomic DNA of significant length (say 30 kb or more) showing less
than 10% density in interspersed repeats is of suspect origin. An
automated alert for such a situation is not included, as query
sequences of coding regions or transcripts, generally of very low
repeat density, would be constantly alerted.
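
If you want such a check for your own genomic clones, the rule of
thumb above is easy to script. The Python sketch below uses the 30 kb
and 10% figures from the text as defaults; the function name is
hypothetical and the two input numbers are assumed to come from the
summary (.tbl) file.

```python
def suspect_origin(query_bp, interspersed_bp,
                   min_query_bp=30000, min_density_pct=10.0):
    """True when a long query has an unusually low density of
    interspersed repeats for human or mouse genomic DNA."""
    if query_bp < min_query_bp:
        return False  # too short to judge
    density = 100.0 * interspersed_bp / query_bp
    return density < min_density_pct

print(suspect_origin(40000, 2000))    # 5% in 40 kb
print(suspect_origin(40000, 18000))   # 45% in 40 kb
```

As noted above, this should not be applied to coding regions or
transcripts, which are expected to be low in repeats.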


4.3  Use in gene prediction and other applications

Predicting genes from a masked sequence has several problems. First,
one should use the option -nolow to avoid masking low complexity
regions and trinucleotide repeats in coding regions. But even with
only interspersed repeats masked, gene prediction programs may fail to
identify exons correctly. As pointed out above, sometimes tail ends of
coding regions may have originated from transposable elements. Some
gene prediction programs suggest the extent of 3' UTRs. These will
often be overestimated in masked DNA, as many genuine poly A signals are
located in interspersed repeats. Finally, even if no coding regions
have been masked, splice sites may be compromised; e.g. the
polypyrimidine region that contributes to an acceptor splice site may
be contained within a repeat.

Thus, I generally recommend running a gene prediction program on
unmasked DNA (as well) and comparing the predicted genes and exons with
the RepeatMasker output. Some gene prediction programs allow you to
force certain exons out of the predictions (e.g. often the old ORFs of
LINE1 elements and endogenous retroviruses are included in
genes). Work is also in progress at several sites to incorporate
RepeatMasker into gene prediction programs, in which cases matches to
repeats are weighted in along with the other parameters used.
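
Comparing predicted exons with the repeat annotation comes down to an
interval overlap test. A minimal Python sketch, assuming both sets
have already been reduced to (start, end) pairs in 1-based inclusive
query coordinates; the function name is hypothetical and no file
parsing is attempted here.

```python
def exons_overlapping_repeats(exons, repeats):
    """Return (exon, repeat, overlap_bp) for every overlapping pair.

    exons, repeats: lists of (start, end), 1-based and inclusive.
    """
    hits = []
    for exon in exons:
        for rep in repeats:
            overlap = min(exon[1], rep[1]) - max(exon[0], rep[0]) + 1
            if overlap > 0:
                hits.append((exon, rep, overlap))
    return hits

# Made-up coordinates for illustration
exons = [(100, 200), (500, 650)]
repeats = [(180, 420), (700, 900)]
print(exons_overlapping_repeats(exons, repeats))
```

Exons that lie entirely within an annotated LINE1 or endogenous
retrovirus are the first candidates to force out of a prediction.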


Other uses

Many people mask repeats before designing primers or oligo probes from
sequence data. I've often been told that primers/probes designed from
regions unmasked by RepeatMasker have a much better success rate. A
cautionary note here is that unmasked regions are not necessarily
unique in the genome (e.g. many lower copy repeats are not in the
database yet) and experiments should be performed as if no filtering
against repeats has been done.  The alignments can help in designing
primers from sequences that are completely masked. Regions that
diverge much from the consensus are less likely to misbehave than
others.

RepeatMasker is sometimes used during assembly of large genomic
sequences.  This procedure probably is most useful in very Alu rich
regions; in that situation I recommend masking only the Alus, and
maybe limiting the masking to those Alus less than 15% diverged (-div
15).

There are plenty of other uses, e.g. analysis of repeats can reveal a
lot about the evolution of a locus (deletions vs insertions,
inversions, approximate time of these events). When you're doing that
you're a specialist and don't need any help from this help file (maybe
from some of the literature cited below though).


5  REFERENCES

Reference for RepeatMasker

We haven't published a paper on RepeatMasker yet, but would appreciate
it if you could refer to the web page (Smit, AFA & Green, P RepeatMasker
at http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) or
otherwise to Smit, AFA & Green, P., unpublished results.


Literature and further information on specific repeats

The EMBL format of the Repbase Update database contains references for
specific repeats, as well as annotation with respect to divergence
level, affiliation, copy number, etc. Much if not most of the
information in this database is not published elsewhere. It can be
accessed at http://www.girinst.org/~server/repbase.html. 
We are trying to keep the nomenclature of the interspersed repeats in
the output of RepeatMasker identical to that of the reference
database. In most cases the names correspond to those most commonly
used in the literature.


The following list of literature is minimal and restricted to human
interspersed repeat articles.


Overviews 

Smit, A.F.A. (1999) Interspersed repeats and other mementos of
transposable elements in mammalian genomes. Curr Opin Genet Devel 9
(6), 657-663.

Jurka, J. (1998) Repeats in genomic DNA: mining and meaning. Curr Opin
Struct Biol 8 (3), 333-337

Smit, A.F.A. (1996) Origin of interspersed repeats in the human
genome. Curr Opin Genet Devel 6 (6), 743-749.

Smit, A.F.A. (1995) Origin and evolution of mammalian interspersed
repeats. PhD dissertation, USC.


SINE/Alu

Schmid, C.W. (1998) Does SINE evolution preclude Alu function? Nucleic
Acids Res 26, 4541-4550.

Schmid, C.W. (1996). Alu: structure, origin, evolution, significance,
and function of one-tenth of human DNA. Prog Nucleic Acids Res Mol
Biol 53, 283-319.

Jurka, J. (1996) Origin and evolution of Alu repetitive elements. In
"The impact of short interspersed elements (SINEs) on the host
genome", Maraia, R.J., editor. Springer Verlag.


SINE/MIR & LINE/L2

Smit, AFA, and Riggs, AD. (1995). MIRs are classic, tRNA-derived SINEs
that amplified before the mammalian radiation. Nucleic Acids Res 23,
98-102.


LINE/L1

Smit, AFA, Toth, G, Riggs, AD, and Jurka, J. (1995). Ancestral
mammalian-wide subfamilies of LINE-1 repetitive sequences. J Mol Biol
246, 401-417.


LTR/MaLR

Smit, A. F. A. (1993). Identification of a new, abundant superfamily
of mammalian LTR-transposons. Nucleic Acids Res 21, 1863-72.


LTR/Retroviral

Wilkinson, D. A., Mager, D. L., and Leong, J. C. (1994). Endogenous
Human Retroviruses. In The Retroviridae, J. A. Levy, ed. (New York:
Plenum Press), pp. 465-535.


DNA/all types

Smit, A.F.A. and Riggs, A. D. (1996). Tiggers and other DNA
transposon fossils in the human genome. Proc Natl Acad Sci USA 93,
1443-8.


Update history:

Improvements and new features in the April 1997 version compared to
the June 1996 version:

Besides a massive (2.5 fold) expansion of the databases, the program
itself is more sensitive and selective, has several new features and
an improved output. The script is now divided in two; one
(RepeatMasker) performs the cross_match searches, the other
(ProcessRepeats) takes the RepeatMasker output to create the overview
table and to improve the output in the .out file. The cross_match
searches have been optimized, especially with regard to detection of
low complexity sequences and old LINE1 elements. The most obvious
changes in the processed output file compared to the unprocessed file
are (i) overlapping matches are usually resolved, (ii) LINE1 fragments
are annotated with position numbers as in a full L1 element, and (iii)
when an Alu or LINE1 is fragmented information from both or all
fragments is used to assign a subfamily name. New features in the
program include the ability to screen a custom library and to create
an output file with alignments in positional order.

Improvements May 97: (minor update)
- added option to only mask low complexity DNA
- added version information to .tbl output
- changed artreps.lib to othermamreps.lib, adjusted parameters to
  accommodate larger size of db
- many improvements in estimating number of elements in query
- added name adjustments for MLT2
- fixed many bugs...

Improvements September 1997 (minor update)
- major expansion of the rodent libraries and significant update 
	of the human libraries as well, especially in LINE1 elements.
- scripts modified to accommodate new entries in databases
- simple repeats masking optimized by including pentamers and
	using a more stringent matrix
- several bugs fixed (e.g. sequences without repeats are now counted)
- table now displays parameters used
- temporarily, for comparison with the human LINE library the same 
	minimum match is used in the selective settings as in the default
	settings to avoid masking small inserts in the LINE elements
- forthcoming release of cross_match has improved performance on a
	tandemly repeated element (currently sometimes the lower scoring 
	unit may go unmasked, even when it is a common repeat)

Improvements and new features in the May 1998 version compared to
the September 1997 version:

- the program now accepts most 'not quite fasta' format files 
- large sequences are analyzed in fragments of 100 kb to reduce the
	memory requirements of the program. Similarly files with very
	many sequence entries are divided up. You shouldn't notice any
	of this in the output files.
- matrices are used that are optimal for the divergence level of the
	repeats to which the query is compared and the background
	nucleotide composition.  
- another big update of the human repeat databases.
- the small RNA sequences have been corrected and expanded (all tRNAs
	should be there now) 
- close to perfect simple repeats, full-length shorter interspersed
	repeats and young LINE1 3' ends are excised from the sequence
	(in both human and rodent analysis) to allow better detection
	of any underlying repeats. A sequence file with these repeats
	deleted can be saved.
- the -low option doesn't mask out any type of simple repeats anymore
- alignments are shown in the orientation of the query sequence
- new options include 
	masking Alus only
	obtaining a sequence with full-length repeats deleted
	obtaining a(n incomplete) list of possibly polymorphic microsatellites
	setting a cutoff score when using the -lib option.
minor fixes
- the .out.xm and .ace files now also contain the simple repeats and
	low complexity DNA (can still be omitted by running
	ProcessRepeats with the -low option on the .cat file)
- sequence names including a number between parentheses used to
	confuse the program thoroughly; now fixed
- many that you wouldn't find interesting


Improvements and new features December 1998

- This version is optimized for use with the 1998 cross_match release
  The difference for RepeatMasker is mainly in the complexity adjusted
  length of the matches that function as kernels for Smith Waterman
  alignments and the matrix dependent adjustment of the score for
  complexity of the alignment.
- Among bugs in the May 1998 version fixed are those resulting in
  bogus output when the sequence name ends with .seq and when a raw
  sequence is submitted. Also, sequence files that contain carriage
  returns from PCs and Mac are handled better now.
- You can now limit the masking to younger repeats by setting a
  maximum allowed divergence of repeats from their consensus sequence
- A mRNA/EST option is available that prevents false masking due to
  inappropriate matrix choice and low complexity matches to LINE1 elements.
- You can set the background GC level (determining which matrices are
  used) overriding the program's calculations.
- The full description ('>') lines are retained in the masked file.
- The .out file table can be returned with flexible length columns
  allowing the full length of long query sequence names to be displayed
- The sequences identified as repeats can be returned in lower case
  (rest in capitals) rather than masked out by Ns or Xs.
- Output to the screen is more informative and less panicky
- Simple repeat and satellite masking has been improved again; their
  annotation has changed a bit, most notably they are now all listed in
  the orientation of the query sequence


April 1999

The default return format of the annotation file is changed, hopefully
in a way that does not interfere with any type of parsing; the width
of the columns is now adjusted to the longest entry in that column,
allowing query names to be spelled out in full, and usually leading to
narrower tables.

Arabidopsis, Drosophila, and grass repeat libraries were added; other
repeat libraries were updated.

Three measures were taken to eliminate the (few) false positives:
- Use of the actual average GC level of sequences in a batch file may
  sometimes lead to false masking (or failure to mask) in sequences that
  diverge largely from the average. Thus, by default, all batch files
  are now analyzed with the innocuous 43% matrices.
- one entry, responsible for 90% of false masking in GC rich regions,
  is deleted from the 'tough L1' library.
- the matrix used for identification of the most diverged sequences in
  very GC rich regions, based on too little data and too much
  extrapolation, was 'too easy' on the mismatches and has been
  adjusted. 
Thanks to these measures the 'mrna' option is not necessary and has
been removed.

A bug is fixed that led to (wildly) improper annotation for some
sequences fully consisting of repeats (all bases masked). A series of
lesser bugs were taken care of. New bugs were skillfully introduced,
probably.

May 5 1999
- Eliminated a really dumb bug that resulted in having the percent
  deletions replaced by the percent insertions.
- Made it easier to use your own database with repeatmasker. The
  database does not have to be in the repeatmasker directory.


March 2000

Besides a long overdue update of the databases the following
improvements have been made:

speed, sensitivity, user-friendliness
- It is now possible to run large sequences and batch files on
  multiple processors.
- An even faster option (-qq) is available for people in a serious hurry
- More repeats are cut out, in particular LINE1 3' fragments, to better
  uncover underlying repeats
- I've reduced the default fragment length to 51000 bp (incl 1000 bp
  overlaps); this gives a slightly lower chance of running out of memory
  (followed by resorting to a larger wordlength) and sometimes better
  choice of substitution matrices.
- The -cut option does not overrule fragmentation anymore
- RepeatMasker now handles zipped (.gz) and compressed (.Z) sequence files
- You can now quit the program at any point with 'control-c'.
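
The 51000 bp fragments with 1000 bp overlaps mentioned in the list
above can be pictured with the following Python sketch. It only shows
one way overlapping windows of that size could be laid out; the
function name, the 0-based half-open coordinates and the handling of
the last window are assumptions, not a description of the script's
actual fragmentation code.

```python
def fragment_coordinates(seq_len, frag=50000, overlap=1000):
    """Return (start, end) windows, 0-based and half-open, of
    frag + overlap bases that advance by frag bases each time."""
    if seq_len <= frag + overlap:
        return [(0, seq_len)]  # short enough to analyze in one piece
    windows = []
    start = 0
    while start < seq_len:
        end = min(start + frag + overlap, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break
        start += frag
    return windows

print(fragment_coordinates(120000))
```

Each window shares its last 1000 bases with the start of the next
one, so a repeat spanning a boundary is seen in both fragments.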


annotation, display, summary
- An option is added providing unique IDs for individually integrated
  elements, labeling fragments of the same element with the same number
- Classification of mammalian LTR elements has changed (now includes
  the conventional three ERV classes)
- Some repeat names have been adjusted (notably the MLT2 subfamilies) to
  be consistent with the RepBase nomenclature
- Improved interpretation of fragmented sequences resulting in more
  accurate counts (for the .tbl file) of total insertions in the query
  sequence
- Negative coordinates in LINE1 elements are now avoided (but see
  'Specific LINE1 problems' in helpfile above)
- Improved accounting of LTR elements; now most LTR elements receive
  the same name for the LTRs and internal sequence and are counted as
  one insertion.
- Divergence and insertion/deletion levels are calculated for
  annotations that are derived from two or more fused fragments
- Fixed the .ace output so that the orientation of the match is displayed.
- Output can be retrieved in the GFF (General Feature Format)
  format. The current output is following a Sanger convention.

bugs
- The .tbl file format was not prepared for sequences over 10 million bp.
  It's now ready for sequences up to 1 billion bp. For larger sequences,
  I'd recommend doing the analysis in two or more steps...
- A bug has been fixed that crashed scripts trying to start several
  RepeatMasker jobs simultaneously
- A bug is fixed that resulted in sometimes incorrect output, when
  multiple files were fed to repeatmasker and one was masked in full
- Sequences and fragments >> 100 kb completely consisting of Ns (no ACGT)
  used to crash the program
- Drosophila and Arabidopsis masking allowed no overlaps in matches..
- Several other bugs were fixed that gave slightly incorrect output
  under cruel and unusual circumstances


May 2000
- When using a -frag value lower than 50000 on a sequence <50000 bp, the
  sequence would be analyzed in one piece anyway. Fixed.
- A cut file was created under some circumstances without asking. When
  masking non-mammalian DNA, this '.cut file' had any reported repeat
  deleted, rather than all full-length elements. Fixed too.

June 2000
- An option -w(ublast) has been added that lets RepeatMasker run with
  WU blast rather than cross_match via the MaskerAid script by Joey
  Bedell and Warren Gish. I have not implemented an automatic update
  of the blast formatted databases and matrices.
- When using the wublast option, hyphens in the sequence are retained
  (previously all non-letters were deleted from the sequence). WU blast
  uses hyphens to indicate insurmountable barriers.
- A half dozen bugs have been fixed that led to crashes of the
  ProcessRepeats script or negative substitution or deletion levels
  in the .out file
- Some changes have been made in the forking procedure and system
  calls, avoiding some reported problems with large batch files 
- LINE1 consensi have been updated

August 2000
- Added a feature to check for human and rodent DNA contamination.
- Included a step that surgically removes bacterial insertion sequences,
  arrived by transposition during cloning
- Improved satellite detection somewhat
- Improved statistics in the .tbl file. Among others, long stretches
  of Ns are counted in the query and can be ignored in calculating
  percent coverage, and most repeats that have spawned a satellite
  sequence are now counted as a single copy.
- The temporary files created by the program are reduced and are written
  to a temporary subdirectory of the working directory, rather than to the
  directory containing the query file. Besides a reduction in clutter,
  this can make a big difference in speed if the working directory and
  file are across system boundaries.
- Removed a very rare but awful bug leading to false extension of an
  annotation all the way to bp 1, several divide by zero bugs, a bug
  occasionally crashing the script when the query name contained an
  asterisk, and a few minor bugs

September 2000
- The contamination checking step is separated from the repeatmasker run
- IS elements are only optionally clipped from the query sequence
- C elegans repeats can be screened with -elegans and get a .tbl file;
  zebrafish repeats are separated from other vertebrates in the tiny
  danio.lib database.
- More information is stored in .cat files, e.g. number of Ns in
  linker regions and GC level, so that this information does not have
  to be hand fed to ProcessRepeats. 
- Version and type of alignment programs indicated in .tbl file
- Spontaneous running of all fragments in parallel when using WU-blast fixed
- Hang-up when last fragment fully-masked when using WU-blast avoided.
- False matches to a fragment of a particular L1 subfamily avoided

October 2000
- Fixed failing IS element check when using WU-blast
- Fixed typo causing some B1 elements to be reported as primate specific Alu

November/December 2000
- Fixed the following bad bug (present since April): when using -xsmall on
  large sequences that are analyzed in multiple fragments, the lower
  case replacement sequences in fragments 2 and up were taken from the
  first fragment. Please check old results that may have depended on 
  the masked sequence being identical to the non-masked sequence.
- Changed some code so that files with many sequences are processed
  more efficiently
- When choosing two species options, the program won't continue as if
  everything's right, but now gives an error message.
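
Old -xsmall results can be checked for the bug described above by
comparing the masked and unmasked files while ignoring case. A
minimal Python sketch, assuming plain fasta text in both files and
identical sequence names; the function names are hypothetical.

```python
def read_fasta(text):
    """Parse fasta text into {name: sequence}; the name is the first
    word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0] if len(line) > 1 else ""
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def xsmall_mismatches(original_text, masked_text):
    """Names of sequences whose lower-case masked version differs
    from the original in anything other than case."""
    original = read_fasta(original_text)
    masked = read_fasta(masked_text)
    return [name for name, seq in original.items()
            if masked.get(name, "").upper() != seq.upper()]
```

An empty list means every masked sequence is identical to its
original apart from case; any name returned points at a result that
should be rerun.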

January-March 2001
- The script now works on Windows systems with a Cygwin port;
  I can't test it too often, so there may be more bugs than elsewhere
- Simple repeats are cut out in a more sophisticated way; main
  behavioral change is that poly A tails are less likely to be cut out
  before the SINEs and LINEs (with poly A tails) are recognized
- I added a cutting step in rodent analysis, just like the first steps
  in human analysis; also elsewhere rodent analysis a bit more sensitive
fixed bugs:
- closed all opened files; 
- replaced a -M option for cross_match with -matrix (as -M wasn't
  recognized by MaskerAid)
- false SVA (iso Alu) annotations with the option -nocut eliminated

April 2001
- Made the creation of the .align file far more efficient 
fixed bugs:
- The -frag option wasn't functioning when analyzing sequences <50 kb
- A .cut file was created under some circumstances without asking.
  When masking non-mammalian DNA, this file contained the query with any
  reported repeat deleted, rather than all full-length elements.

May 2001
- Large sequences (>2x the -frag settings, i.e. default > 100 kb) in a
  batch file are analyzed in fragments, just like single-large
  sequence files 
- Duplicate alignments of repeats in regions of overlap between two
  fragments are eliminated if one is contained in the other.
fixed bugs:
- The -frag option wasn't functioning when analyzing sequences  below 50 kb
- A .cut file was created under some circumstances without asking. 
  When masking non-mammalian DNA, this file contained the query with any
  reported repeat deleted, rather than all full-length elements.
- Output files would not be written to the query file directory if the
  query file was read only

June 2001
- Alternative output formats (.out.xm, .ace, .gff) now also available
  when using your own repeat sequence files (-lib option).
- Now usually skips a step in which entire query file is read into one
  scalar. This caused memory problems for *very* large files.

August 2001
- The divergence level indicated in all output files now is
  "mismatches/(matches+mismatches)" rather than the cross_match
  idiosyncratic "mismatches/(matches+mismatches+unaligned bases in query)"
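
The difference between the two definitions is easiest to see with
numbers. The Python sketch below writes out both formulas as quoted
above; expressing the result as a percentage and the function names
are assumptions made for the example.

```python
def divergence_current(matches, mismatches):
    """mismatches/(matches+mismatches), as a percentage."""
    return 100.0 * mismatches / (matches + mismatches)

def divergence_cross_match(matches, mismatches, unaligned_query_bases):
    """mismatches/(matches+mismatches+unaligned bases in query),
    as a percentage."""
    total = matches + mismatches + unaligned_query_bases
    return 100.0 * mismatches / total

# Made-up alignment: 270 matches, 30 mismatches, 100 unaligned bases
print(divergence_current(270, 30))
print(divergence_cross_match(270, 30, 100))
```

Unaligned query bases lower the second value, so divergence levels
from runs before and after this change are not directly comparable
for alignments with large insertions in the query.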

November 2001
- Fixed bad bug that caused the -lib and -u options to not report
  matches or alignments if a single match actually had been found.
- When a library invoked with the -lib option is formatted like a
  repeatmasker library, the repeats are processed (merged etc.) and an
  overview (.tbl) file is created.

December 2001
- Improved memory requirements of ProcessRepeats.

January 2002
- On rare occasions up to 500 bp were missing from the 3' end of masked
  sequences when using the -frag option (ouch)

February 2002
- By not always slurping the entire query into an array, significantly
  reduced memory requirements for large query sequences (thereby
  saving much time as well)

March 2002
- Added chicken, carnivores, and artiodactyl libraries and code
- Script backs up previous repeatmasker output it encounters, rather
  than deleting it after a warning
- Most 'cross_match error (1)' etc error messages replaced with
  something more user friendly

April 2002
- More rewriting to reduce array sizes and the size of sequences analyzed
  for IS elements
- Made the unique IDs in the .out file start at 1 and not skip any numbers
- Some rare combinations of input format errors are now handled correctly
- Estimates for masked bases and total bases excluding Ns in the query
  were sometimes off by a few bases due to small overlaps of repeats
  that were temporarily cut out of the query in the masking process
- Lots of work on the mouse libraries.

May 2002
- 2 bugs leading to improper annotation of > 100 kb contigs in > 4 Mbp
  batch files and failure to mask certain B1 elements in rodent
  DNA. Fixed between 20020505 general and 20020515 fixed releases.

June 2002
- Improved memory efficiency of ProcessRepeats and its speed when only
  investigating simple repeats (-noint)
- Added -maxsize option. Since February large sequences are handled in
  pieces of 4 Mbp each to avoid having several arrays of 4 Mbytes each
  (this is different than the fragment setting, which determines the
  size of fragments cross_matched at once). This 'maxsize' can now be
  adjusted, among others, because for sequences > maxsize, IS elements
  and full-length, young repeats can not be clipped out.
- Fixed several problems with E coli insertion element detection in >
  maxsize queries
- Fixed a serious bug causing the script to go in an infinite loop on
  sequences with ^M type carriage returns (thanks Mitch Skinner)
- Fixed several bugs in ProcessRepeats leading to (innocuous) warnings
  when using a personal repeat library (thanks Alfred Beck)
- Ambiguous sequence (N) strings >4 Mbp (maxsize) in the query were
  not incorporated in the masked output file. These do occur in some
  chromosome assemblies (to replace centromeres, etc.). Fixed.
- The name of the custom library was not displayed in the .tbl file. Fixed.

July 2002
- Reduced false positives (due to new elements in the libraries)
- Fixed several strange progress messages 
  (like 'analyzing fragment 9 of 8 of sequence 0')

If you have ideas for improvements or found a problem, drop a note
at asmit@hoh.biotech.washington.edu or afasmit@pacbell.net

/*****************************************************************************
#   Copyright (C) 1996-2002 by Arian Smit                          
#   All rights reserved.                           
#        
#   The software and databases should not be redistributed or used for
#   any commercial purpose, including commercially funded sequencing,
#   without written permission from Geospiza Inc, Seattle
#   (http://www.geospiza.com/)
/*****************************************************************************