######################################################################
RepeatMasker, Arian Smit 03/12/96, most recent change 07/13/2002

Please refer to: Smit, AFA & Green, P "RepeatMasker" at
http://repeatmasker.genome.washington.edu

The interspersed repeat databases are modified versions of those found
in "RepBase Update" (http://www.girinst.org/~server/repbase.html)
######################################################################

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences. The output of the program is a detailed annotation of the repeats that are present in the query sequence as well as a modified version of the query sequence in which all the annotated repeats have been masked (default: replaced by Ns). Sequence comparisons in RepeatMasker are performed by the program cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or, optionally, by WU-Blast, developed by Warren Gish.

This helpfile discusses the following topics:

0   Basic input and output
1   Options
  1.1 Species and contamination check options
  1.2 Options affecting which repeats get masked
  1.3 Speed, engine and search parameters
  1.4 Output and formatting
  1.5 ProcessRepeats options
  1.6 WU-blast search-engine option
2   Methodology and quality of output
  2.1 Methodology
  2.2 Scoring matrices
  2.3 Databases
  2.4 Sensitivity and speed
  2.5 Selectivity and matches to coding sequences
  2.6 Low complexity DNA and simple repeats
3   How to read the results
  3.1 The annotation (.out) file
  3.2 Alignments
  3.3 The summary (.tbl) file
4   Applications
  4.1 Use in database searches
  4.2 Identification of DNA source and bacterial insertions
  4.3 Use with gene prediction programs and other applications
5   References


0 INPUT and OUTPUT

Input format:

Sequences have to be in the 'fasta format':

>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA

The program accepts the most common erroneous 'almost fasta' formats and raw sequence files, but does not yet work with other formats (GenBank, Staden, etc.).

You can use RepeatMasker on a file containing multiple fasta format sequences and on multiple sequence files at the same time:

RepeatMasker *.fasta

This command will mask all files that end with .fasta in the current directory and give separate reports for each file. Note that if you have multiple small sequences it is considerably faster to run RepeatMasker on one batch file than on many single sequence files. The summary file will be more informative as well. However, analysis of single files (when larger than 2 kb each) can be slightly more accurate, since the GC level of each sequence will be calculated and used to choose appropriate parameters.

Standard output:

RepeatMasker returns a .masked file containing the query sequence(s) with all identified repeats and low complexity sequences masked.
These masked sequences are listed and annotated in the .out file. The masked sequences are returned in the same order as they are in the submitted file, whereas the sequences are presented alphabetically in the annotation table. The .tbl file is a summary of the repeat content of the analyzed sequence.


1 OPTIONS

1.1 Species options

-m(us)             masks rodent specific and mammalian wide repeats
-rod(ent)          same as -mus
-cow               masks artiodactyl, whale, and mammalian wide repeats
-pig, -cet(acea), -art(iodactyl)
                   same as -cow
-car(nivore)       masks carnivore-specific and mammalian wide repeats
-cat, -dog         same as -car
-mam(mal)          masks repeats found in mammals not mentioned above
-ch(icken)         masks repeats found in chicken and related birds
-ar(abidopsis)     masks repeats found in Arabidopsis
-dr(osophila)      masks repeats found in Drosophilas
-el(egans)         masks repeats found in C. elegans
-fugu              masks repeats found in Takifugu rubripes
-lib [filename]    allows usage of a custom library (e.g. from another species)

contamination checking options

-is_only           only clips E coli insertion elements out of fasta and .qual files
-is_clip           clips IS elements before analysis (default: IS only reported)
-no_is             skips bacterial insertion element check
-rodspec           only checks for rodent specific repeats (no repeatmasker run)
-primspec          only checks for primate specific repeats (no repeatmasker run)

For a detailed explanation of the contamination detection options, see "4.2 Identification of DNA source" below.

The default settings of RepeatMasker are for masking a primate (human) sequence. Interspersed repeats are specific to a (group of) species, dependent on the time of activity of the source transposable element. Less than half of the repeats identified in human DNA are specific to primates, i.e. over half amplified before the mammalian radiation some 100 million years ago.
Most repeats that can be identified in mouse DNA are specific to rodents though, due to higher activity and faster mutation rates in the rodent lineage. RepeatMasker has separate protocols optimized for analysis of genomes of different mammalian orders.

These are the numbers and total bp of the repeat consensus sequences for each group in the May 5 2002 databases:

species                  # of consensi    total bp
mammalian-wide                 421          459641
primate-specific               324          550887
rodent-specific                150          231345
cetartiodactyl-spec             44           23658
carnivore-specific              20           22694
spec to other mammals           11           17998
chicken (birds)                 34           39512
Xenopus                         27           26515
pufferfish                      94          169989
zebrafish                       49           59425
other vertebrates               42           38699
Drosophila                     184          524645
C. elegans                     122          161140
Arabidopsis                    435         1419710
maize, rice                     71          246544

Mammalian sequences are compared to both order specific and mammalian-wide repeats (those whose transpositional activity predates the mammalian radiation). One can see that the majority of sequences against which other mammals are compared are repeats that have been identified in the human genome but are thought to predate the mammalian radiation.

Six libraries are extracted from the 'RepBase Update' fasta libraries with very limited curation. The large C. elegans, Arabidopsis, and Drosophila libraries have been built primarily by the people at the Genetic Information Research Institute (GIRI). The Xenopus, danio (zebrafish), other vertebrate (a rather useless mixture) and grasses (maize and rice) libraries are still fetal. The latter, smaller libraries are accessed with the -lib option. In 2001/2002 I've created the cetartiodactyl, carnivore, chicken and pufferfish libraries, and significantly extended the rodent databases.

RepBase Update contains repeats for many other species. These are not included here, either because interspersed repeats are not an analytical problem for these species (e.g. prokaryotes, yeast) and/or the number of repeats is impractically small (i.e.
you don't need RepeatMasker to compare your query to one repeat sequence). RepBase Update is maintained by GIRI and me at http://www.girinst.org/~server/repbase.html.

-lib

With the -lib option you can specify a custom library of sequences to be masked in the query. The library file needs to contain sequences in fasta format. Unless a full path is given on the command line, such a file should be in the current directory or in the .../RepeatMaskerxx/Libraries directory with the other library files.

I've provided libraries for some vertebrate (vertebrate.lib, danio.lib, xenopus.lib) and grasses (grasses.lib) repeats, which are not yet fully integrated and have to be accessed by using the -lib option.

'RepeatMasker -lib xenopus.lib bigfrog.seq' will mask all sequences similar to repeats in the Xenopus database as well as all low complexity and simple repetitive DNA in "bigfrog.seq".

I recommend formatting your own repeat library like a RepeatMasker .lib file (the file name does not need to end with .lib), like this:

>repeatname#class/subclass

or simply

>repeatname#class

In that format, the data will be processed (overlapping repeats are merged etc.), alternative output (.ace or .gff) can be created, and an overview .tbl file will be created. Classes that will be displayed in the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA', 'Satellite', anything with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the latter is the default when the class is missing). Subclasses are plentiful, but are not all tabulated in the .tbl file. Check the accompanying .lib files for names that can be parsed into the .tbl file.
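
The header convention described above can be checked before a custom library is handed to -lib. The following is only a sketch in Python, not part of RepeatMasker: the function names and the example headers are hypothetical, and the rule that an unlisted class returns None simply means that this sketch does not know how the .tbl file would treat it.

```python
# Sketch only: parses headers of the form '>repeatname#class/subclass'.
# Function names and example headers are illustrative assumptions,
# not code or data taken from the RepeatMasker scripts.

# Classes that the help text says are displayed in the .tbl file.
TBL_CLASSES = ("SINE", "LINE", "LTR", "DNA", "Satellite",
               "Simple_repeat", "Other", "Unknown")


def parse_repeat_header(header):
    """Split a library header into (name, class, subclass).

    A missing class becomes 'Unknown', as the help text describes.
    A missing subclass is returned as None.
    """
    fields = header.strip().lstrip(">").split()
    identifier = fields[0] if fields else ""
    name, _, classification = identifier.partition("#")
    repeat_class, _, subclass = classification.partition("/")
    return name, repeat_class or "Unknown", subclass or None


def tbl_class(repeat_class):
    """Return the class if the help text lists it for the .tbl file, else None."""
    if repeat_class in TBL_CLASSES or "RNA" in repeat_class:
        return repeat_class
    return None


if __name__ == "__main__":
    for line in (">MyRepeat1#SINE/Alu", ">MyRepeat2#LTR", ">MyRepeat3 no class given"):
        name, repeat_class, subclass = parse_repeat_header(line)
        print(name, repeat_class, subclass, tbl_class(repeat_class))
```

Running such a check over every '>' line of a library is a cheap way to spot misspelled class names before a long masking run.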

-no_is, -is_clip, -is_only, -primspec, -rodspec

See "4.2 Identification of DNA source and bacterial insertions" below.

1.2 Masking options (options that determine what kind of repeats are masked)

-cutoff [number]   sets cutoff score for masking repeats when using -lib (default cutoff 225)
-nolow             does not mask low_complexity DNA or simple repeats
-l(ow)             same as -nolow (historical)
-(no)int           only masks low complexity/simple repeats (no interspersed repeats)
-alu               only masks Alus (and 7SLRNA, SVA and LTR5) (only for primate DNA)
-div [number]      masks only those repeats that are less than [number] percent diverged from the consensus sequence

-cutoff

When using a local library you may want to change the minimum score for reporting a match. The default is 225; lowering it below 200 will usually start to give you significant numbers of false matches, while raising it to 250 will guarantee that all matches are real. Note that low complexity regions in otherwise complex repeat sequences in your library are most likely to give false matches.

-nolow / -l(ow)

With the option -nolow or -l(ow) only interspersed repeats are masked. By default, simple tandem repeats and low complexity (polypurine, AT-rich) regions are masked in addition to the interspersed repeats. For database searches the default setting is recommended, but sometimes, e.g. when using the masked sequence to predict the presence of exons, it may be better to skip the low complexity masking.

-noint / -int

When using the -noint or -int option only low complexity DNA and simple repeats will be masked in the query sequence ("minus interspersed repeats").
Since the 03032000 release, A-rich simple repeats derived from the poly A tails of SINEs and LINEs are merged with the annotation of the SINE or LINE (i.e. you can't tell there is a simple repeat). Thus, if you're interested in finding the location of potentially polymorphic simple repeats, this option is recommended.

-alu -div

You can limit the masking and annotation to (primate) Alu repeats with the -alu option and to a subset of less diverged (younger) repeats with the option -div. For example,

"RepeatMasker -div 20 -mus mysequence"

will mask only those rodent repeats and simple repeats that are less than 20% diverged from the consensus sequence, and

"RepeatMasker -div 10 -alu mysequence"

will mask Alus that are less than 10% diverged from the Alu consensus sequences and no other repeats.

The -div option may be used to limit the masking to those repeats that are specific to primates or to another mammalian order, for use in subsequent comparison of orthologous mammalian loci. On average, interspersed repeats have diverged 16% in human (~28% in mouse) from their consensus since the mammalian orders separated (average substitution levels are 18% and 35%, respectively). Note that this method is rather crude, mostly because the range of deterioration of repeats of the same age is wide; some order specific repeats may remain unmasked and shared repeats may be masked.

1.3 Options affecting speed and search parameters

-q                   quick search; 5-10% less sensitive, 3-4 times faster than default
-qq                  rush job; about 10% less sensitive,
-s                   slow search; 0-5% more sensitive, 2.5 times slower than default
-pa(rallel) [number] number of processors to use in parallel (only works for batch files or sequences larger than 50 kb)
-w(ublast)           use WU-blast, rather than cross_match as engine (see 1.6)
-frag [number]       maximum sequence length masked without fragmenting (default 51000)
-maxsize [nr]        maximum length for which IS- or repeat clipped sequences can be produced (default 4000000).
                     Memory requirements go up with higher maxsize.
-gc [number]         use matrices calculated for 'number' percentage background GC level
-gccalc              program calculates the GC content even for batch files/small seqs
-nocut               skips the steps in which repeats are excised
-noisy               prints cross_match progress report to screen (defaults to .stderr file)

-s -q -qq

RepeatMasker can be run at four different sensitivity/speed levels, with the option -q providing quick (less sensitive) and -s slow (sensitive) results compared to default. The option -qq has been added for when you're in a frightful hurry. Each higher gear is about 2-3 times faster, and 90% as sensitive as the next lower gear. See "2.4 Sensitivity and speed" below for details.

-pa(rallel)

For sequences over 50 kb long or files with multiple sequences, RepeatMasker can use multiple processors. When you type:

RepeatMasker -par 10

a batch file of sequences will run with up to 10 sequences at a time, until all sequences are done, while a file with one large sequence will be analyzed in up to 10 fragments at the same time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all sequences over 100 kb are divided in 33-66 kb fragments). For the batch files no minimum size exists. Thus,

If the file contains:        RM runs in parallel:
one 60 kb seq                two 30 kb fragments
one 400 kb seq               ten 40 kb fragments
one 1 Mb seq                 ten 50 kb fragments, twice
ten 500 bp sequences         ten 500 bp sequences
two 500 kb sequences         ten 50 kb fragments, twice

Processing of the detected matches takes place after all batches or fragments have been cross matched with the databases.

Beware that, generally, you have a limited number of processor IDs allotted. RepeatMasker uses 4 PIDs for each parallel job, so if you're allotted 64 user PIDs, you can 'only' run 16 fragments/batches in parallel.

-frag

Even when the -par option is not used, RepeatMasker transparently divides sequences over 51 kb into fragments of equal size with 1 kb overlaps.
Similarly, sequence batches containing more than 51 kb are subdivided in batches of 51 kb or less. The -frag option sets the maximum fragment and batch size.

The only visible effect of the fragmentation is in the alignment files, where alignments at the edges of the fragments can be duplicated and/or truncated. The 1 kb overlap between fragments almost guarantees that there is no loss in sensitivity at the edges.

Fragmentation initially was implemented to allow the size of sequences and sequence batches to be unlimited. Cross_match can be very memory intensive when SW alignments have to be performed in large matrices. This may happen with short minmatch and large bandwidth settings. Note that RepeatMasker should not croak when cross_match runs out of memory; it will redo the failed search with a higher word length or smaller bandwidth until it succeeds. However, this will lead to gradually less sensitive comparisons.

Fragmentation also can improve repeat detection when a genomic sequence contains large regions of DNA with significantly different GC levels (isochores), since sets of scoring matrices are chosen based on the GC level of a fragment.

Since April 2002 the maximum fragment size is hardwired to be half of "maxsize" (see below).

-maxsize

To limit the memory requirements of the script, an upper boundary of 4 million bp is set on the amount of sequence stored in a single array in the script. This parameter can be reduced with the -maxsize option to a minimum of 500000, for severely memory impaired computers. The value of maxsize further determines the largest single sequence from which E. coli insertion sequences and full-length repeats can be clipped.
Increase maxsize to allow removal of IS elements from larger sequences, like:

RepeatMasker -is_clip -maxsize 9999999999 muntjakchromosome1

-gc -gccalc

Neutral mutation patterns differ significantly depending on the GC richness of a locus, and we have calculated optimal scoring matrices for the alignment to consensus sequences in a range of background GC levels (see 2.2). Usually, RepeatMasker calculates the percentage of the sequence consisting of Gs and Cs and uses the appropriate matrices. However, the program defaults to using 'average' 43% GC matrices when the query is shorter than 2000 bp or a batch file is analyzed.

Short sequences are less likely to share the GC level of the locus. For example, CpG islands and exons are more GC rich than the surrounding DNA, whereas a LINE1 element usually is more AT rich than the background. In a batch file, RepeatMasker analyses all sequences together with the same matrices. The percentage GC in all the sequences combined may be inappropriate for some sequence entries; using high GC level matrices in AT rich sequences (and vice versa) may result in false masking.

One can override this behavior in two ways. With the option -gc you can set the GC level to a certain percentage:

RepeatMasker -gc 37 mybatchofsequences.fa

lets the program use matrices appropriate for a 37% GC background. The batch could, for example, contain ESTs from a single locus with a known GC level.

Alternatively, the -gccalc option forces RepeatMasker to use the actual GC level of a short sequence or the average GC level of a batch of sequences. The latter sequences, for example, may be contigs in a sequencing project.

-nocut

The option -nocut skips a step in the default procedure for human and rodent queries, in which full-length younger inserts are spliced out of the query to reconstruct a preinsertion situation.
RepeatMasker is generally more sensitive when the deletion step is included, as it can unearth older repeats that were interrupted by these younger elements.

1.4 Output options

-a            shows the alignments in a .align output file; -ali(gnments) also works
-inv          alignments are presented in the orientation of the repeat (with option -a)
-cut          saves a sequence (in file.cut) from which full-length repeats are excised
-small        returns complete .masked sequence in lower case
-xsmall       returns repetitive regions in lowercase (rest capitals) rather than masked
-x            returns repetitive regions masked with Xs rather than Ns
-poly         reports simple repeats that may be polymorphic (in file.poly)
-ace          creates an additional output file in ACeDB format
-gff          creates an additional General Feature Format (GFF) output file
-u            creates an untouched annotation file besides the manipulated file
-xm           creates an additional output file in cross_match format (for parsing)
-fixed        creates an (old style) annotation file with fixed width columns
-no_id        leaves out final column with unique ID for each element
-e(xcln)      calculates repeat densities (in .tbl) excluding runs of >25 Ns in query
-noisy        prints cross_match progress report to screen (defaults to .stderr file)

-a / -ali(gnments) -inv

Alignments are saved in a .align file when using the option -a. They are shown in the orientation of the query sequence, unless you use the option -inv as well, which will return alignments in the orientation of the repeats (see 3.2 Alignments).

-cut

The option -cut lets the program save a file "file.cut" which contains an intermediate sequence in the masking process. In this sequence all full-length elements, young LINE1 3' ends, and close to perfect simple repeats are deleted. *This option currently only works with mammalian queries.*

Because of programming complications, no .cut file is saved if a single sequence is larger than the 'maxsize' parameter, by default set to 4 Mbp. The parameter can be changed with the option -maxsize.
Another option will grow out of this that returns a sequence in which only order specific repeats are deleted, allowing superior alignments of mammalian orthologous sequences.

-x

When -x is used the repeat sequences are replaced by Xs instead of Ns. This allows one to distinguish the masked areas from ambiguous bases that may exist in the original sequence. However, when running BLAST searches (and maybe other programs) Xs are deleted from the query, and the returned BLAST matches will have position numbers that do not necessarily correspond to those of the original sequence.

-xsmall

When the option -xsmall is used, a sequence is returned in the .masked file in which repeat regions are in lower case and non-repetitive regions are in capitals.

-poly

You can get a list of potentially polymorphic microsatellites with the option -poly. This is simply a subset of the list in .out, with dimeric to tetrameric repeats less than 10% diverged from perfection.

-xm

When using the -xm option an additional output file (.out.xm) is created that contains the same information as the .out file (excluding the low-complexity/simple DNA), but in the original cross_match format. This output is harder to read, but there are programs that require the exact cross_match output format.

-u

The script ProcessRepeats adjusts the original RepeatMasker output so that the annotation more closely reflects reality. With the option -u a .ori.out file is created that contains the original (but sorted) cross_match summary lines.

-ace

With the -ace option an .ace file is created by the script. This is merely a suggestion. The columns in the table currently are:

Motif_homol RepeatMasker(method)

-gff

A .gff file is created by the script with the annotation in General Feature Format (GFF). See http://www.sanger.ac.uk/Software/GFF for details. The current output follows a Sanger convention:

RepeatMasker Similarity .
Target "Motif:"

In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match is against an Alu. I don't know why.

-fixed

Since April 1999 the column widths in the annotation table are adjusted to the maximum length of any string occurring in a column; this allows long sequence names to be spelled out completely. Previously, a fixed column width table was returned, which can still be obtained by using the -fixed option. Parsing should not be affected by this change of default behavior, as the same number of columns with the same formatted text are still separated by white-space.

-no_id

Since September 2000 a column displaying a unique number (ID) for each integrated element is printed by default. This used to be optional (-id). Fragments of a single element, separated from each other by subsequent insertions of other elements, deletions or recombinations, carry the same number. This feature allows better interpretation of the data and should greatly help proper graphical display of the repeats.

The column follows all other columns, except for the (rare) indication that an annotation overlaps another annotation (*). This change, which was announced in the previous release, should not hinder most parsing scripts. If it causes problems, the old format can be retrieved with the option -no_id.

-excln

With this option, the percentages displayed in the .tbl file are calculated using a total sequence length that excludes runs of 25 Ns or more. This is useful when analyzing draft sequences, which are often concatenated contigs separated by (sometimes very) long stretches of Ns. This option can be used with ProcessRepeats as well. The number of Ns in long runs in the query is apparent in the .tbl file, and you only need to run ProcessRepeats with the option on the .cat file.

-noisy

RepeatMasker used to print the voluminous cross_match progress reports to the screen.
Since the Dec 1998 version this output is stored in a .stderr file and a more informative, much smaller progress report is printed to the screen. The option -noisy allows one to see the cross_match reports coming by on the screen (yeah).

1.5 ProcessRepeats options

When you have already run RepeatMasker and want to recreate the .out or .tbl file, you only need to rerun ProcessRepeats on the .cat file(s), which will take just a small fraction of the time required to rerun RepeatMasker. Such a situation can occur when you've accidentally deleted the .out or .tbl file or want additional or differently formatted output files. Note that alignment files cannot be created unless RepeatMasker was run with the -a option, and that the original .tbl and .out files will be overwritten unless you rename them.

ProcessRepeats -mus -nolow -gff -excln myhumongousmousesequence.cat

Repeat matches are processed differently for rodent and primate queries, so the -mus option is necessary. With the -nolow option, the .out file will no longer contain information on simple repeats and low complexity DNA. The -gff option creates an additional output file in GFF format, and the -excln option displays the density of repeats in the .tbl file as a percentage of those bp that are not contained in long stretches of Ns.

The options available for ProcessRepeats are listed below; default settings are for handling a human sequence .cat file.

-mus            adjusts the processing and .tbl file for rodent repeats
-cow            adjusts processing for cetartiodactyl repeats
-car(nivore)    adjusts processing for carnivore repeats
-ar(abidopsis)  adjusts the .tbl file for Arabidopsis thaliana repeats
-dr(osophila)   adjusts the .tbl file for Drosophila repeats
-el(egans)      adjusts the .tbl file for nematode repeats
-lib            skips most of processing, does not produce a .tbl file unless the custom library is in the >name#class format.
-l(ow)          does not display simple repeats or low_complexity DNA in the annotation
-u              creates an untouched annotation file besides the manipulated file
-xm             creates an additional output file in cross_match format (for parsing)
-ace            creates an additional output file in ACeDB format
-gff            creates an additional output file in General Feature Format (GFF)
-poly           creates an output file listing only potentially polymorphic simple repeats
-no_id          leaves out final column with unique number for each element (was default)
-fixed          creates an (old style) annotation file with fixed width columns
-excln          calculates repeat densities excluding long stretches of Ns in the query
-orf2           results in sometimes negative coordinates for L1 elements; all L1 subfamilies are aligned over the ORF2 region, sometimes improving interpretation of data
-a              shows the alignments in a .align output file

1.6 WU-blast option

-w(ublast)

Joey Bedell and Warren Gish at the St Louis Washington University Genome Center have written a script, MaskerAid, that makes it possible to call Warren's WU-blast as if it were cross_match (http://sapiens.wustl.edu/MaskerAid). Through MaskerAid, WU-blast can accept cross_match options, and the cross_match complexity adjustment to the alignment score is applied. Although the script has wider applications, the primary idea was to provide RepeatMasker with a faster search engine.

For longer sequences, default MaskerAided RepeatMasker runs take about as long as cross_match powered runs at -qq settings (see "2.4 Sensitivity and speed"). The speed settings have relatively little effect on the speed when using MaskerAid: the fastest settings are only 1.25-1.75 times as fast as the slowest settings, while the sensitivity increases significantly with the slower settings. Thus, I recommend always running RepeatMasker in sensitive (-s) or default mode when using MaskerAid.
I've made the difference in parameters between sensitive and default settings larger when using MaskerAid, to make these speed options more meaningful and gain more sensitivity (with little cost in speed). Even with these more extreme parameters, the sensitivity can't quite reach that of the sensitive settings using cross_match, but it comes very close, and the huge difference in speed will make this option very attractive.

Among the few caveats, the most important is that, when using MaskerAid, currently no alignment files can be returned by RepeatMasker. Also, I haven't done quite such extensive quality control on the MaskerAided output, so false positives could be more common (they will still be close to non-existent).

When using the -w(ublast) option, hyphens in the sequence are retained (in default mode all non-letters are deleted from the sequence). WU-blast uses hyphens to indicate insurmountable barriers, and alignments will not span hyphens.


2 METHODOLOGY AND QUALITY OF OUTPUT

2.1 Methodology

RepeatMasker compares the query sequence against one or more files of fasta sequences. The sequences in the libraries provided with RepeatMasker are consensus sequences derived from alignment of multiple copies of interspersed or satellite repeats. For interspersed repeats, a consensus tends to approach the sequence of the transposable element from which the repeat is derived.

Both cross_match and WU-blast perform their Smith-Waterman (SW) alignments by first identifying exact word matches and restricting the alignment to a band or matrix surrounding the exact match(es). Overlapping matrices are merged. The speed settings of RepeatMasker are purely changes in the minimum word length from which an alignment can be seeded and, in some cases, changes in the width of the band. A wider bandwidth allows more gaps in the alignment and, more importantly, increases the likelihood that neighboring matrices overlap.
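
The effect of the two parameters mentioned above can be illustrated with a toy sketch in Python. This is not the cross_match or WU-blast implementation: the function names, the representation of a band as a range of diagonals (query position minus consensus position), and the merging rule are all simplifying assumptions made for illustration only.

```python
# Toy illustration of word seeding and band merging. This is NOT how
# cross_match or WU-blast are implemented; names and the band model
# (a range of diagonals around each seed) are assumptions.
from collections import defaultdict


def word_seeds(query, consensus, minmatch):
    """Return (query_pos, consensus_pos) pairs sharing an exact word of length minmatch."""
    index = defaultdict(list)
    for j in range(len(consensus) - minmatch + 1):
        index[consensus[j:j + minmatch]].append(j)
    seeds = []
    for i in range(len(query) - minmatch + 1):
        for j in index.get(query[i:i + minmatch], ()):
            seeds.append((i, j))
    return seeds


def merged_bands(seeds, bandwidth):
    """Put a band of +/- bandwidth diagonals around each seed and merge overlapping bands."""
    diagonals = sorted({i - j for i, j in seeds})
    bands = []
    for d in diagonals:
        low, high = d - bandwidth, d + bandwidth
        if bands and low <= bands[-1][1]:
            # This band overlaps the previous one, so the two are merged.
            bands[-1][1] = max(bands[-1][1], high)
        else:
            bands.append([low, high])
    return [tuple(band) for band in bands]


if __name__ == "__main__":
    seeds = word_seeds("AAACCCGGGTTT", "CCCGGG", 4)
    print(seeds)
    print(merged_bands([(3, 0), (10, 2)], 2))  # two separate bands
    print(merged_bands([(3, 0), (10, 2)], 3))  # wider bands overlap and merge
```

In this toy model a longer minimum word length yields fewer seeds (faster, less sensitive), and a wider band makes it more likely that neighboring bands merge into one region.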

Cross_match does a low complexity adjustment of the raw SW score. This adjustment is performed by the MaskerAid script when WU-blast is used. Low complexity matches are the primary cause of false matches, and this adjustment contributes significantly to the high selectivity of RepeatMasker (see 2.5).

As a result of the existence of many related consensus sequences in the database, usually multiple repeats match one region in the query at the same time. Generally, cross_match and WU-blast report to the script only those matches that are less than 80-90% overlapped by a higher scoring match. This implies that, at first approximation, names are assigned to repeats based on the highest SW score. Given appropriate consensus sequences and alignment parameters, this is intuitively correct as well. However, the scripts have a lot of code to improve on this first approximation, primarily to deal with partial matches.

The cut-off SW score above which matches are reported is empirically derived (see '2.5 Selectivity' below). Note that there is no cut-off divergence level; reported matches can be less than 60% identical.

The alignment parameters (substitution matrices, and gap initiation and extension penalties) are derived from data harbored in multiple alignments of a special subset of interspersed repeats. The derived matrices are theoretically optimal for a series of conditions (see below). The gap penalties are sub-optimal, primarily because gap lengths have a non-linear distribution and are poorly represented by a single gap-extension penalty.

For primate, rodent and other mammalian DNA, the query is compared to consecutive subsets of repeat libraries. For primates, perfect simple repeats, full-length Alus, full-length short interspersed repeats, and young L1 3' ends are first (and in that order) clipped from the sequence to expose underlying older elements.
Subsequently, the query is compared to most repeats, a set of ancient elements under especially sensitive settings, a large set of long retroviral sequences under faster settings (to save time), and AT-rich L1 3' ends that may have been discarded earlier as low complexity matches. Finally, simple repeats and low complexity regions are masked.

2.2 Scoring matrices

We have calculated statistically optimal scoring matrices for the alignment of neutrally diverging (non-selected) sequences in human DNA to their original sequence. These matrices have been in use since the May 1998 release. The matrices were derived from alignments of DNA transposon fossils to their consensus sequences (Arian Smit, Arnie Kas & Phil Green, in preparation...). A series of different matrices are used dependent on the divergence level (14-25%) of the repeats and the background GC level (35-53%; neutral mutation patterns differ significantly in different isochores).

These matrices are (close to) optimal for human genomic sequences longer than 10 kb, for which length the GC level usually is representative of the isochore in which the sequence lives. However, the GC level of small fragments can diverge a lot from that of the surrounding DNA (e.g. a fragment spanning a CpG island, a GC rich exon or an AT-rich LINE1 element), and RepeatMasker defaults to using matrices derived for a 43% GC background when a sequence is shorter than 2000 bp or when a batch file is submitted. When the appropriate background GC level is known, it can be entered with the -gc option.

(Note that these matrices are an integral portion of RepeatMasker and are covered under the same restrictions as the scripts and databases as described in the signed software agreement.)

2.3 Repeat databases

The interspersed repeat databases provided in the RepeatMasker package are maintained in synch with the repeat databases (Repbase Update) copyrighted by the Genetic Information Research Institute (G.I.R.I.).
Whereas non-mammalian libraries currently are identical to the RepBase Update fasta files except for formatting, mammalian databases are extensively modified. The modifications primarily entail inclusion of complete sets of subfamilies for Alu and L1, changes to avoid false matches and false annotations, and subdivision into multiple sets to optimize the analysis.

We transformed the RepBase database from a set of prototypes to a set of consensus sequences (see my dissertation if you're interested) to allow both determination of the origin of these repeats and improved detection. A consensus properly derived from a multiple alignment of copies closely approaches the original transposable element, since substitutions accumulate by-and-large unselected in copies of transposable elements. Because of the latter, a copy is on average twice as close to the consensus as to any other copy. Consensus sequences are also more sensitive search tools because directional substitution matrices can be used (see above).

Consensus sequences would be identical to the original transposable element if all copies were inserted at about the same time from a single source. DNA transposon copies approach this ideal, but retroposons (giving rise to most repeats in our genome) live for long periods in a genome and evolve while doing so. Thus, over time the sequence of the transposable element has changed, and a single consensus does not describe the original sequence of each copy. Also, at any given time multiple distinct sequences with a common origin, cousins if you will, usually were active. This situation is reflected by the presence in the databases of multiple subfamilies for the more common retroposons (usually having the same name ending in a different number or letter).

The mammalian repeat libraries contain, besides consensus sequences for transposon derived repeats, consensus satellite units, and a set of *small structural RNA sequences*.
The latter have created a large number of processed pseudogenes in our genome, and in that way are interspersed repeats.

2.4 Sensitivity and speed

The program can be run at four levels of sensitivity. The only difference between these settings is the minimum match or word length in the initial (not quite) hashing step of the cross_match program (see the cross_match/phrap documentation). The "slow" setting will find and mask 0-5% more repetitive DNA sequences than the default, whereas the "quick" setting misses 5-10% of the sequences masked by default. The alignments may extend further or be somewhat more accurate in the more sensitive settings as well.

The -s (slow/sensitive) setting will take on average 2.5 times as long as the default setting, whereas the -q (quick) setting is 3 to 6 times faster than the default. Because of the continuing growth of the human repeat databases, RepeatMasker's speed, when using the same settings, has actually decreased over time. For when you're in a hurry, I've added a -qq (rush job) option that runs with the same speed as the old -q option, but is less sensitive.

Several developments should allow you to do RepeatMasker analyses at an agreeable speed though: (1) your computers are faster, (2) there are multithreaded versions of cross_match available, (3) you can run batch files and larger sequences on multiple processors with -par, and (4) you can choose to run RepeatMasker with WU-blast. Note that the gains from multiple processors and from multithreaded cross_match or WU-blast are mostly additive.

Here are some user times (in seconds) of human sequences on a single Digital UNIX V4.0D processor:

          cross_match (default)      WU-blast (-w)
length    -qq    -q   def    -s     -qq   def    -s
 5 kb       8    14    29    64      11    13    15
10 kb      11    21    57   134      14    15    20
20 kb      16    33   117   290      19    21    34
40 kb      25    55   227   572      30    33    54
80 kb      41    99   448  1145      55    58    99

Bedell and Gish do a more extensive comparison in their paper on MaskerAid (Bioinformatics 16:1040-1).
The -s times are a bit slower here, because, after they performed their comparisons, I've made the -s settings more sensitive when using WU-blast. The sensitivity of runs with MaskerAid/WU-blast is approximately half a step behind that obtained with the same settings using cross_match, except for the -s settings, which I've tuned to be almost like the -s settings with cross_match.

The relative analysis speeds are very dependent on the computer; for example, our Linux server is 'better' with short sequences than this DEC, though slower in analyzing long sequences, and Bedell and Gish achieved a 30 fold speed up at sensitive settings using MaskerAid on their computers.

The speed is also dependent on the repeat content of the sequence. For human sequences, Alu rich sequences are analyzed fastest, LINE rich sequences somewhat slower, repeat poor regions slower still, and long satellite regions can take a while.

If you have several shorter sequences it is much faster to run RepeatMasker on a batch file (all sequences in one file). On the above computer, in the rush mode (cross_match), a batch of ten 5 kb sequences is analyzed in 23 seconds, twenty 5 kb sequences in 34 seconds, etc.

The user time for sequences or sequence batches over 100 kb (or whatever the fragment size is set to) is linearly related to the length of the query due to the fragmentation of the query sequence.

The increase in speed from using multiple processors depends on the usage of the computer and on the above mentioned non-linear relationship between sequence length and processing time. However, under the right circumstances, using 2 processors can increase the speed close to twofold, because the most time-consuming processes are performed in parallel.

2.5 Selectivity and matches to coding sequences

The cutoff Smith-Waterman scores for masking interspersed repeats are conservative, since masking of one short potentially interesting region generally is more harmful than not masking a number of hard to find matches.
If there are any false matches, they tend to have scores close to the cutoff, which is 225 for most repeats, 300 for the low-complexity LINE1 search*, and 180 for the very old MIR, LINE2 and MER5 sequences.

* Most LINE1s are detected with a 225 cut-off, but in one step in RepeatMasker the low-complexity score adjustment is turned off to find ancient A-rich L1 elements.

We tested for the occurrence of false matches in randomized and in inverted (but not complemented) DNA. To check a variety of conditions, four 150 to 400 kb DNA fragments were analyzed, ranging in GC level from 36% to 54%. To retain seeds for Smith-Waterman alignments, randomization was done at the 10 bp word level. Note that the inverted sequences retain the low complexity and simple repeat patterns of the original sequences. Even at sensitive settings, for which false matches are most likely, the 1998-2002 versions of RepeatMasker have reported no (false) matches at all to interspersed repeats in the randomized or inverted sequences. No simple repeats were reported in the randomized queries.

RepeatMasker returned only a single probably false match (71 bp) when analyzing a batch of 4440 coding regions in human mRNAs (7,200,000 bp) at sensitive settings. The coding regions were collected from GenBank, based on annotations, filtered for the presence of complete ORFs and initiator methionines, and made more or less non-redundant. When each coding region was analyzed individually using the -gccalc option, 5 matches (414 bp, 0.006%) were falsely masked (156 bp at default speed, 76 bp at quick settings). In this analysis each sequence was analyzed with matrices chosen based on the actual GC level, even for very short sequences, while in the batch analysis of the coding regions the 'average' 43% GC matrices were used.

The 1998 and later versions of RepeatMasker show somewhat more false masking when a pre-1998 version of cross_match is used.
These are primarily the result of improper assumptions about the background nucleotide frequency used in the scoring matrix calculation when adjusting for the complexity of a match. Specifically, a very GC rich region in an AT-rich isochore (like an exon) may improperly match a GC rich repeat, since the scores for C/G matches in the scoring matrix used (calculated for this AT rich background) are higher than those for A/T matches, whereas the old cross_match assumed a 50% GC background in these calculations and equal scores for A/T and G/C matches. The new version of cross_match reads the correct nucleotide background level from the matrix used.

2.6 Simple repeats and low complexity DNA

Low-complexity DNA

By default, along with the interspersed repeats, RepeatMasker masks low-complexity DNA. Simple repeats (micro-satellites) can originate at any site in the genome, and therefore have an interspersed character. Other low-complexity DNA, primarily poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content, will result in spurious matches in some database searches as well (especially in the ungapped BLASTN searches). For example, extremely AT-rich regions consistently will give very low probability matches to mitochondrial DNA in BLASTN searches.

The settings are very stringent, and we think that few if any sequences informative in database searches are masked as low-complexity DNA. However, you can skip the low-complexity DNA masking using the option -nolow or -l(ow).

Under the current settings a 100 bp stretch of DNA is masked when it is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C) nucleotides. The settings are slightly more stringent than the original settings, partly because the gapped BLAST programs are less sensitive to short regions of low complexity than the old gapless BLAST.
In coding regions I have not yet found extensive regions (>10 bp) masked as low complexity DNA that would not be masked by the combined XNU and SEG filters routinely used in BLASTX.

Annotation of simple repeats

Although RepeatMasker does a good job in masking simple repeats to avoid spurious matches in database searches, it is not written to find and indicate all possibly polymorphic simple repeat sequences. Only di- to pentameric and some hexameric repeats are scanned for, and simple repeats shorter than 20 bp are ignored.

The -poly option prints out a separate list of simple repeats of < 10% divergence from a perfect repeat. However, even long perfect repeats may not be presented in this list; e.g. two perfect 40 bp long (CA)n repeats interrupted by 10 Ts are aligned in one piece and may be reported as having > 10% divergence from the consensus. Many perfect hexameric or longer unit repeats will be listed as more or less diverged smaller unit repeats and may not appear in the .polyout file.

Also note that, in the default output, simple repeats expanded from the poly A tails of Alus and LINE1s are now included in the Alu or LINE1 annotation. This cleans up the annotation a bit and lets the stand-alone poly A regions stand out (they may indicate the presence of a processed pseudogene). However, even perfect simple repeats in such tails will be hidden in the .out file.

A program optimized to quickly find all dimeric to pentameric repeats is sputnik, available at ftp://ftp.nhgri.nih.gov/pub/software/sputnik/ or http://www.abajian.com/sputnik/. Any local duplications (tandem, inverted, or otherwise) can be detected with the program miropeats (http://www.ebi.ac.uk/~jparsons/packages/miropeats/miropeats.html). Web sites dedicated to identifying tandem repeats are http://pompous.swmed.edu and http://c3.biomath.mssm.edu/trf.html

LINE rich sequences are analyzed somewhat slower, Alu rich sequences faster, and long satellites can take quite a while.
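
As an illustration only, the composition thresholds quoted in this section for low-complexity masking (>87% AT or >89% GC in a 100 bp stretch, 29 of 30 bases A/T or G/C in a 30 bp stretch) can be written as a small Python check. The function name and the restriction to exactly these two window lengths are assumptions made for this sketch; this help file does not describe how RepeatMasker itself scans for low-complexity DNA, so the code should not be read as its implementation.

```python
def exceeds_quoted_thresholds(window):
    """Hypothetical helper: True if a 100 bp or 30 bp window passes the
    composition thresholds quoted in the text (illustration only)."""
    seq = window.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if len(seq) == 100:
        # >87% AT or >89% GC over a 100 bp stretch
        return at > 87 or gc > 89
    if len(seq) == 30:
        # 29 of the 30 bases must be A/T, or 29 of the 30 must be G/C
        return at >= 29 or gc >= 29
    raise ValueError("the text only quotes thresholds for 100 bp and 30 bp")


# 88 A/T bases out of 100 is above the 87% threshold; 87 out of 100 is not.
print(exceeds_quoted_thresholds("A" * 88 + "G" * 12))   # True
print(exceeds_quoted_thresholds("A" * 87 + "G" * 13))   # False
```

Counting bases rather than comparing fractions keeps the check exact for the two window sizes and avoids floating point rounding at the threshold.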
3 HOW TO READ THE RESULTS

3.1 The annotation (.out) file

The annotation file contains the cross_match summary lines. It lists all best matches (above a set minimum score) between the query sequence and any of the sequences in the repeat database or with low complexity DNA. The term "best matches" reflects that a match is not shown if its domain is over 80% or 90% contained within the domain of a higher scoring match, where the "domain" of a match is the region in the query sequence that is defined by the alignment start and stop. These domains have been masked in the returned masked sequence file. In the output, matches are ordered by query name, and for each query by position of the start of the alignment.

Example:

   SW perc perc perc query    position in query    matching repeat         position in repeat
score div. del. ins. sequence begin   end  (left)    repeat  class/family  begin  end (left) ID
...
 1320 15.6  6.2  0.0 HSU08988  6563  6781 (22462) C MER7A   DNA/MER2_type    (0)  337  104 20
12279 10.5  2.1  1.7 HSU08988  6782  7718 (21525) C Tigger1 DNA/MER2_type    (0) 2418 1486 19
 1769 12.9  6.6  1.9 HSU08988  7719  8022 (21221) C AluSx   SINE/Alu         (0)  317    1 17
12279 10.5  2.1  1.7 HSU08988  8023  8694 (20549) C Tigger1 DNA/MER2_type  (932) 1486  818 19
 2335 11.1  0.3  0.7 HSU08988  8695  9000 (20243) C AluSg   SINE/Alu         (5)  305    1 18
12279 10.5  2.1  1.7 HSU08988  9001  9695 (19548) C Tigger1 DNA/MER2_type (1600)  818    2 19
  721 21.2  1.4  0.0 HSU08988  9696  9816 (19427) C MER7A   DNA/MER2_type  (224)  122    2 20

This is a sequence in which a Tigger1 DNA transposon has integrated into a MER7 DNA transposon copy. Subsequently two Alus integrated in the Tigger1 sequence. The first line is interpreted as such:

1320     = Smith-Waterman score of the match, usually complexity adjusted
           The SW scores are not always directly comparable. Sometimes the complexity adjustment has been turned off, and a variety of scoring-matrices are used dependent on repeat age and GC level.
15.6     = % divergence = mismatches/(matches+mismatches) **
6.2      = % of bases opposite a gap in the query sequence (deleted bp)
0.0      = % of bases opposite a gap in the repeat consensus (inserted bp)
HSU08988 = name of query sequence
6563     = starting position of match in query sequence
6781     = ending position of match in query sequence
(22462)  = no. of bases in query sequence past the ending position of match
C        = match is with the Complement of the repeat consensus sequence
MER7A    = name of the matching interspersed repeat
DNA/MER2_type = the class of the repeat, in this case a DNA transposon fossil of the MER2 group (see below for list and references)
(0)      = no. of bases in (complement of) the repeat consensus sequence prior to beginning of the match (0 means that the match extended all the way to the end of the repeat consensus sequence)
337      = starting position of match in repeat consensus sequence
104      = ending position of match in repeat consensus sequence
20       = unique identifier for individual insertions

An asterisk (*) following the final column (see below example) indicates that there is a higher-scoring match whose domain partly (<80%) includes the domain of the current match.

** This has changed in August 2001: cross_match output gives the percent mismatches/(matches+mismatches+unaligned bases in query). I didn't think this definition is otherwise commonly used, and most users will assume the divergence level to be mismatches/(matches+mismatches).

Note that the SW score and divergence numbers for the three Tigger1 lines are identical. This is because the information is derived from a single alignment (the Alus were deleted from the query before the alignment with the Tigger element was performed). The ProcessRepeats script makes educated guesses whether any pair of fragments is derived from the same element or not; if so, the fragments will have the same ID in the last column. In this example it figured that the MER7A fragments represent one insert.
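
For readers who want to post-process annotation lines, the column layout described above can be sketched in Python as follows. This is a minimal, hedged example: the names OutRecord and parse_out_line are hypothetical, the column order for "+" and "C" lines is inferred from the example lines in this help file rather than from a format specification, and header lines of a real .out file would have to be skipped by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutRecord:
    sw_score: int
    perc_div: float
    perc_del: float
    perc_ins: float
    query: str
    query_begin: int
    query_end: int
    query_left: int
    strand: str            # "+" or "C" (complement)
    repeat: str
    repeat_class: str
    repeat_begin: int
    repeat_end: int
    repeat_left: int
    ident: Optional[str]   # None when the line carries no ID column
    overlapped: bool       # True when the line ends in an asterisk


def _unparen(token):
    """Turn a parenthesised count such as '(22462)' into an int."""
    return int(token.strip("()"))


def parse_out_line(line):
    """Parse one annotation line (assumed layout, see the lead-in)."""
    fields = line.split()
    overlapped = fields[-1] == "*"
    if overlapped:
        fields = fields[:-1]
    if len(fields) not in (14, 15):
        raise ValueError("unexpected number of columns: %d" % len(fields))
    strand = fields[8]
    if strand == "C":
        # complement lines list (left) first, then start and end
        rep_left = _unparen(fields[11])
        rep_begin, rep_end = int(fields[12]), int(fields[13])
    else:
        # forward lines list begin and end first, then (left)
        rep_begin, rep_end = int(fields[11]), int(fields[12])
        rep_left = _unparen(fields[13])
    ident = fields[14] if len(fields) == 15 else None
    return OutRecord(
        int(fields[0]), float(fields[1]), float(fields[2]), float(fields[3]),
        fields[4], int(fields[5]), int(fields[6]), _unparen(fields[7]),
        strand, fields[9], fields[10], rep_begin, rep_end, rep_left,
        ident, overlapped,
    )


rec = parse_out_line(
    "1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A "
    "DNA/MER2_type (0) 337 104 20"
)
print(rec.repeat, rec.strand, rec.repeat_begin, rec.repeat_end, rec.ident)
```

Splitting on whitespace rather than on fixed column widths is a deliberate choice here, since the examples show column widths that vary with the length of names and positions.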
Here is another example that shows how much trouble ProcessRepeats takes to defragment elements and how the ID can be useful in interpreting the results:

 7120 19.9  0.6  0.3 NT_001227  85631  87837 (19816) + L1PA16    LINE/L1       1 1885 (4964) 123
 2503 14.9  6.5  0.7 NT_001227  87839  88241 (19412) + MSTA      LTR/MaLR      1  428    (0) 100
  867 12.9  2.7  0.0 NT_001227  88242  88388 (19265) + MSTA-int  LTR/MaLR      1  151 (1500) 100 *
 5219 19.5  2.9  0.6 NT_001227  88386  89342 (18311) + MSTA-int  LTR/MaLR    629 1607   (44) 100
 8003  3.5  0.8  0.0 NT_001227  89362  90773 (16880) C L1PA3     LINE/L1     (0) 6155   4745 103
 7677  3.5  0.0  0.0 NT_001227  90795  94059 (13594) C L1PA3     LINE/L1     (0) 6155   2872 104
 9050  6.5  0.4  0.1 NT_001227  94060  95127 (12526) C MER11C    LTR/ERVK    (0) 1071      1 106
 7677  3.5  0.0  0.0 NT_001227  95128  97101 (10552) C L1PA3     LINE/L1  (3282) 2873    900 104
 5619  7.8  0.3  0.9 NT_001227  97097  97865  (9788) C L1PA3     LINE/L1  (5370)  776     13 104 *
  320 16.9  0.0  1.7 NT_001227  97876  97934  (9719) + MSTA-int  LTR/MaLR   1594 1651    (0) 100
 1475 19.0  4.8  5.6 NT_001227  97935  98255  (9398) + MSTA      LTR/MaLR      1  323   (48) 100
 2322 14.4  0.8  1.6 NT_001227  98256  98629  (9024) + THE1C     LTR/MaLR      1  371    (0) 112
10051 12.9  3.5  4.3 NT_001227  98630 100221  (7432) + THE1C-int LTR/MaLR      1 1580    (0) 112
 2359 15.7  0.3  1.9 NT_001227 100224 100598  (7055) + THE1C     LTR/MaLR      3  371    (0) 112
 1475 19.0  4.8  5.6 NT_001227 100599 100646  (7007) + MSTA      LTR/MaLR    323  371    (0) 100
 1360 19.4  8.2  1.7 NT_001227 100662 100955  (6698) + MSTA      LTR/MaLR    114  426    (0) 113
11892 24.7  1.9  2.0 NT_001227 100968 101243  (6410) + L1PA16    LINE/L1    1881 2143 (4706) 123
 2062 11.9  8.4  0.0 NT_001227 101244 101563  (6090) C L1PA12    LINE/L1    (10) 6164   5818 116
11892 24.7  1.9  2.0 NT_001227 101564 105425  (2228) + L1PA16    LINE/L1    2137 5989  (860) 123
  257  0.0  0.0  2.9 NT_001227 105436 105469  (2184) + (TAA)n    Simple        2   34    (0) 118
 2189 18.2  0.2  0.7 NT_001227 105470 105893  (1760) + L1PA16    LINE/L1    6062 6483  (386) 123
  255  6.1  0.0  0.0 NT_001227 105896 105928  (1725) + (TA)n     Simple        1   33    (0) 120 *
  369  0.0  0.0  0.0 NT_001227 105928 105968  (1685) + (GA)n     Simple        2   42    (0) 121
  305 18.8  0.0  1.0 NT_001227 105971 106066  (1587) + (TA)n     Simple        2   96    (0) 122
 1589 21.2  1.6  1.1 NT_001227 106068 106449  (1204) + L1PA16    LINE/L1    6485 6868    (1) 123

This entire 20,819 bp block of sequence is made up of an L1PA16 (#123), in which 7 or 8 elements have integrated (it is unclear to me if the MSTA #113 is a separate integration or a tandem duplication). There are at least four layers, with MER11 (#106) inserted in L1PA3 (#104) inserted in MSTA (#100, maybe in #113) inserted in L1PA16. L1PA16 is already primate specific, so that all these insertions took place in primate evolution. The ID column helps much in deciphering the events. It also should be a basis for the graphic display of RepeatMasker output.

3.2 Alignments

When using the -a option, a .align file is created that contains the alignments of your query sequence to the matching repeat consensus sequences. The alignments are given in the same order as listed in the .out file. These alignments may be most generally useful for people designing PCR primers in a region full of repeats. It is possible to get a primer that works in a whole genome when its 3' end lies in a region of a repeat (even a common one) that is very different from the consensus.
Here is an example of an alignment of a MIR spanning an Alu element deleted in an earlier step:

665 28.45 2.93 5.02 g5129s420 7350 7882 (1924) C MIR#SINE/MIR (1) 261 28 3

  g5129s420      7350 ATCATAACAAACATTTAT--GGTGCCTCCTATGGAGCAGGGATTTTGCTT 7397
                        v     v           i i  i v     viv    v i v v  v
C MIR#SINE/MIR    261 ATAATAACCAACATTTATTGAGCGCTTACTATGTGCCAGGCACTGTTCTA 212

  g5129s420      7398 AGGACTCTGAACTATAT---CTTACTT-GTCTTCATTAAAAACCTTATGA 7443
                        vi  i iv   i        i i   i  i    i  v    i
C MIR#SINE/MIR    211 AGCGCTTTACA-TGTATTAACTCATTTAATCCTCA-CAACAACCCTATGA 164

  g5129s420      7444 AAAAGGTACTATTATTAACTGGGGXTGGGTTGTTTAACAGATAAGAAAGC 7787
                      iiv              v i         iii   v      i  i  i
C MIR#SINE/MIR    163 GGTAGGTACTATTATTATCC---------CCATTTTACAGATGAGGAAAC 123

  g5129s420      7788 TTAAGAATTAGAGAGATAAATTATCTTGCTTAAGGTAACACAGTTAACAA 7837
                       v i v  i      i v  v  v     ii     v      i  ii
C MIR#SINE/MIR    122 TGAGGCA-CAGAGAGGTTAAGTAACTTGCCCAAGGTCACACAGCTAGTAA 74

  g5129s420      7838 GCATTAG-GTCAAAGTTTGAACTCGGGCAGTCTGACTACAGAGCCC 7882
                       iivi    i iiii  i    i i         i  v     i
C MIR#SINE/MIR     73 GTGGCAGAGCCGGGATTCGAACCCAGGCAGTCTGGCTCCAGAGTCC 28

Transitions / transversions = 1.96 (45 / 23)
Gap_init rate = 0.03 (8 / 234), avg. gap size = 2.38 (19 / 8)

In cross_match alignments, mismatches caused by transitions are indicated with an i and those caused by transversions with a v. The position of the deleted Alu in the query is indicated with an X in the g5129s420 sequence.

You can use the -inv option to produce alignments in the orientation of the consensus sequence.

The lines in the .out file describing this match appear as:

  578 28.4 2.9 5.0 g5129s420 7350 7467 (533) C MIR   SINE/MIR   (1) 261 149
 2222 10.2 2.7 0.0 g5129s420 7468 7762 (238) C AluSg SINE/Alu   (7) 303   1
  578 28.4 2.9 5.0 g5129s420 7763 7882 (118) C MIR   SINE/MIR (113) 149  28

Discrepancies between alignments and the .out file

Discrepancies between alignments and annotation result from the adjustments made by the ProcessRepeats script to produce more legible annotation. This annotation also tends to be closer to the biological reality than the raw cross_match output.
For example, adjustments often are necessary when a repeat is fragmented through deletions, insertions, or an inversion. Many subfamilies of repeats closely resemble each other, and when a repeat is fragmented these fragments can be assigned different subfamily names in the raw output. ProcessRepeats often can decide if fragments are derived from the same integrated transposable element and which subfamily name is appropriate (subsequently given to all fragments). This can result in discrepancies in the repeat name and matching positions in the consensus sequence (subfamily consensus sequences differ in length).

In many cases matches are fused into one annotation. To give just four common examples:

(1) A-rich simple repeats originating from the poly A tail of Alus and LINEs are incorporated in the annotation of the Alu or LINE1.

(2) In large sequences that are analyzed in fragments, consecutive fragments overlap, and repeats in these overlaps will appear twice (partially or wholly) in the alignment file.

(3) There is an 'endless' number of subfamilies for retroposons, which cannot all be represented in the databases, and sometimes an element is matched by overlapping pieces of two related subfamilies (which will be merged).

(4) You may find large discrepancies in position numbering if an element includes tandem repeat units. For example, MER109 contains multiple ~300 bp repeat units, which can lead to overlapping matches. In the annotation such matches are fused.

Specific LINE1 problems:

Some other discrepancies are specific to LINE elements. These repeats do not appear as complete elements in the consensus database. This is mostly due to the contrast in conservation over the length of the LINE1 sequence during its evolution in the mammalian genome; the ~3 kb ORF2 region of LINE1 has been very conserved, whereas the untranslated regions and, to a lesser degree, ORF1 have evolved very fast.
Thus the 3' end or 5' end of an ancient LINE1 does not even remotely resemble that of the currently active LINE1, whereas the coding region for reverse transcriptase is closely related. Accordingly, many subfamilies have been defined for both the 5' and 3' UTRs (30 and 52, resp.) of LINE1 elements in human DNA, whereas only four ORF2 entries are present in the database. Besides some remaining uncertainties about which 5' ends go with which 3' ends, including 50 full length (6 to 8 kb) LINE1 elements in the database would make the program very slow. LINE1 elements therefore are presented in the database in 3 pieces, and the ProcessRepeats script puts these pieces together. As a result both the names of the repeats and the position numbering in the consensus sequence are generally different in the alignments than in the output file. The currently 3.3 kb LINE2 elements are likewise broken up into 3' UTRs for different subfamilies and one (complete!) ORF2 region.

Between LINE1 subfamilies, the 3' UTR ranges from 500 bp to over 2000 bp (in L1MC/D3), and the length of the 5' UTR is even more variable, even between subfamilies that show strong similarity in the 3' UTR. To allow the LINE1 fragments to be put together, all position numbers in older LINE1 subfamilies are normalized relative to the position of ORF2 (the conserved part of LINE1) in a complete L1PA2 element. Since some older elements have much longer 5' UTRs or ORF1-ORF2 linker regions than L1PA2, this often results in the assignment of negative position numbers for the 5' end of LINEs. Since the March 2000 release, such positions and all positions in fragments thought to be part of the same LINE1 insert are readjusted to count from the 5' end (which is not necessarily the very 5' end of the LINE1 source gene, as these are hard to derive for old elements).
One problem with this approach is that positions are not adjusted in detached 3' fragments that are somehow not recognized by the program as originating from the same insertion. Thereby, the common origin of the 5' fragments and 3' fragments may become completely obscured. Use the option '-orf2' of ProcessRepeats to retrieve an output in which all LINE1s are numbered so that position 1 of ORF2 is aligned (resulting in occasionally negative positions).

3.3 The summary (.tbl) file

The summary file is pretty much self explanatory. Below is an example.

==================================================
file name: AC027410.fa
sequences:            1
total length:    152192 bp  (148791 bp excl N-runs)
GC level:         39.59 %
bases masked:     88734 bp ( 59.64 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:              195     45195 bp   30.37 %
      ALUs          178     43249 bp   29.07 %
      MIRs           17      1946 bp    1.31 %

LINEs:               54     31173 bp   20.95 %
      LINE1          36     24602 bp   16.53 %
      LINE2          18      6571 bp    4.42 %
      L3/CR1          0         0 bp    0.00 %

LTR elements:        13      5833 bp    3.92 %
      MaLRs           8      4079 bp    2.74 %
      ERVL            0         0 bp    0.00 %
      ERV_classI      5      1754 bp    1.18 %
      ERV_classII     0         0 bp    0.00 %

DNA elements:        17      4459 bp    3.00 %
      MER1_type      12      1903 bp    1.28 %
      MER2_type       4      2466 bp    1.66 %

Unclassified:         0         0 bp    0.00 %

Total interspersed repeats: 86660 bp   58.24 %

Small RNA:            2       124 bp    0.08 %

Satellites:           0         0 bp    0.00 %
Simple repeats:      22      1151 bp    0.77 %
Low complexity:      22       799 bp    0.54 %
==================================================

* most repeats fragmented by insertions or deletions
  have been counted as one element
  Runs of >20 Ns in query were excluded in % calcs

The sequence(s) were assumed to be of primate origin.
RepeatMasker version 09/09/2000 , default mode
run with cross_match version 0.990329
RepBase 5.08, vs 09092000
----------------------------------------------------

Since the Sept 2000 release, the table indicates which version of cross_match or WU-blast and which database were used for the analysis.

AC027410 was a draft sequence, with individual contigs separated by poly N linkers. In this case, the option -excln was used, so that these strings of Ns were ignored for the percent calculations.

The classification in this table is well defined (see my reviews in COGD) and forms a good basis for visual presentation and tabulation of the repeats in your study. We've been able to classify almost all human repeats, most of them even in subclasses. The totals for the classes often are higher than the sum of the subclasses, because not all elements fit in a subclass and minor subclasses are not listed separately in the table (e.g. for the human table the Mariner, Tc2, Piggybac, Zaphod, and Arthur families of DNA transposons). The HAL1 element, ancestral to or derived from LINE1, is added to the LINE1 total in this table.

Note that the "MER" subclasses have no relationship to each other. The term MER (MEdium Reiterated repeats) was introduced for purely administrative purposes to give the beast a name. The MER1 and MER2 groups were named after the first member of these groups identified as an interspersed repeat in our genome. I'm considering renaming them Tigger and Charlie group, which may be more memorable.

The nomenclature of mammalian repeats derived from retrovirus-like elements is different from older versions. I've now divided this class up into the traditional class I, class II (ERVK), class III (ERVL) retroviruses and the ERVL-derived but very distinct non-autonomous MaLR elements. Since 'class III' is not an accepted classification yet, for now this class is called ERVL.
The large MER4-group of non-autonomous LTR elements merges seamlessly with class I endogenous retroviruses, making it hard to define, and is now incorporated in the latter group. The ERV classes are most readily distinguished by the size of the insertion site duplication: 4 in class I, 6 in class II, 5 in class III. However, my LTR classification is based on internal sequences and matches to LTRs with internal sequences, not on the size of the target site duplication.

As described above, the ProcessRepeats script tries very hard to find out which repeat fragments were derived from the same insertion event of a transposable element, but there still will be a slight overestimate of the copy numbers.

The 'bases masked' number is calculated from the total number of Xs in the masked sequences (before these are changed to Ns or lower case letters). The other numbers are derived from the annotation (.out) file. Discrepancies between the 'bases masked' number and the sum of 'total interspersed repeats', small RNA, satellites and low complexity are generally very small. Most of these are accounted for by unmasked regions between flanking identical simple repeats, annotated as one stretch if fewer than 10 bases separate them, and fragments of repeats shorter than 10 bp which are not annotated but are masked. The numbers may be quite different if you started out with a query sequence containing Xs.

4 APPLICATIONS

4.1 Use in database searches

RepeatMasker is most commonly used to avoid spurious matches in database searches. Generally this step is strongly recommended before doing BLASTN or BLASTX equivalent searches with mammalian DNA sequence.

The most common concern is of course whether RepeatMasker ever masks coding regions.
We found that false matches in coding regions are extremely rare, but did identify 38 genuine fragments of interspersed repeats (4214 bp) in the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb) analyzed (excluding annotated coding sequences of LINE1 elements and endogenous retroviruses). We verified matches with lower scores by comparing the translation products to closely homologous or redundant entries in the database (in which precisely the repeat matching regions were always missing). In the majority of these cases, the sequences appear to be improperly annotated or to represent either artificially or naturally defective mRNAs (e.g. alternatively spliced exons comprised of a small fragment of a repeat). Genuine overlaps of interspersed repeats with coding sequences usually involve terminal regions of the ORFs. Since the transposable element derived region is unique to the protein in that (group of) species, the masking does not interfere with database searches.

However, some cautionary comments are necessary. First, a few active cellular genes are derived from transposable elements (see my 1999 review for a list of 19 in our genome). Some of these genes will be partially masked by a (related) transposon in the repeat database. EST and cDNA matches beyond the masked region should alert you.

Also be aware that RepeatMasker screens for small RNA pseudogenes and will therefore mask the active small RNA genes as well (I think the tRNA list is complete; I stopped adding snRNAs unless I found an indication that they have created many pseudogenes). The number of matches to small RNAs is listed in the overview table; (close to) exact matches are possibly active genes, although related active genes not in the database may show diverged matches.

A final caution relates to the fact that 3' UTRs of transcripts are about as dense in interspersed repeats as intergenic regions are. Thus, many ESTs are completely masked as repetitive DNA.
I recommend that, when you compare a genomic sequence against the EST database or use ESTs as a query in nucleotide searches, you search with the unmasked sequence as well; use a long minimum match (word length/word size) like 40 bp to identify exact matches and avoid most background. Unfortunately the maximum word length that can be used in the NCBI BLASTN program is 18 (due to memory limitations).

4.2 Identification of DNA source (contamination detection)

Bacterial insertion elements

Bacterial insertion sequences (IS elements) often crop up in foreign sequences, as their activity in E. coli is not always successfully suppressed during cloning. As late as 2002, human entries in the 'finished' section of GenBank contained over a hundred IS elements.

With each run, RepeatMasker includes a quick check for bacterial insertion elements that may have inserted during cloning. You can turn this off with the -no_is option. The -is_only option limits the run to this check only. When a full-length element is found and a target site duplication is confirmed, its location is both reported to the screen and stored in a .alert file. The latter also contains information on possible mouse<->human contamination.

-is_clip, -is_only

With the -is_only and -is_clip options, the detected IS and one of the flanking repeats are clipped out to restore the pre-cloning artifact situation before comparison with the repeat databases. The original query fasta file will remain unchanged. An insertion sequence-clipped, but otherwise unmasked query sequence is printed to .withoutIS. For single sequences larger than 4 Mbp, the -maxsize option needs to be set to a number larger than the sequence length to retrieve this file.

With either of these options, a properly adjusted quality string is printed to a file with the suffix .qual.withoutIS when a corresponding phred quality file (.qual) is in the same directory.
Note that, as named, the clipped sequence and quality files do not form a pair for subsequent cross_match/phrap work. They need to be renamed, as I assume one wants to do anyway.

Most but not all IS elements can be precisely cut out. The element may be at the edge of a sequence, or (rarely) the element may have inserted improperly, lacking target site duplications or missing terminal bases (internal deletion products are generally handled okay). These matches are reported, but are left untouched even in -is_only or -is_clip mode.

The location of any IS element is both reported to the screen and stored in an .alert file. The latter also contains information on possible mouse<->human contamination.

Here are the specifics of IS element insertions:

  IS1     9 bp duplication
  IS2     5 bp duplication; published sequence was too short
  IS3     3 bp duplication
  IS4     No examples of clonal artifacts; no dup site info
  IS5     4 bp duplication; preferred target TCTAGA
  IS10    9 bp duplication; extreme preference for CGCTNAGCN;
          published sequences for IS5 & IS10 were too long, included
          preferred target site
  IS30    2 bp duplication
  IS150   3 bp dup, with one exception (4 bp); strong pref for
          CAGNNTGGGGCY
  IS186   6 or 7 bp dup, extremely specific for CG rich hairpin:
          SSSGGAGGGAGGCGGGG(6-7)CCCCGCCCSSSSSSSSSSS
  Tn1000  5 bp duplication

Human <-> mouse sequence contamination or mix-up.

A straightforward way to distinguish murine and human DNA is by checking for either rodent-specific or primate-specific repeats. Likewise, rodent or primate contamination in any other mammalian or non-mammalian background can be picked up as well. If your lab has, say, a rat and a pink fairy armadillo sequencing project, rat DNA in a supposedly armadillo sequence can be picked up quite reliably, depending on the length of the query.

When the option -rodspec or -primspec is used, RepeatMasker only checks the query against a small library of repeats which have not (yet) been observed in the 'other' species.
The locations of the matches are printed to .alert. This function will be expanded to other mammals once these species start to be sequenced in earnest.

I've checked the specificity of the reported matches quite extensively. Whenever two or more types of repeats are reported, the odds are that the alert is correct. Very occasionally, a single reported match could be a false alert. This is especially possible when a 'new' mammalian species is analyzed, because, unbeknownst to me, a related repeat may have amplified in such a genome.

Other species contamination.

When a supposedly rodent or primate clone is of non-mammalian origin, very few if any interspersed repeats will be reported by RepeatMasker. Human and mouse genomic sequences are on average 40-50% dense in recognizable interspersed repeats, so that any stretch of genomic DNA of significant length (say 30 kb or more) showing less than 10% density in interspersed repeats is of suspect origin. An automated alert for such a situation is not included, as query sequences of coding regions or transcripts, generally of very low repeat density, would be constantly alerted.

4.3 Use in gene prediction and other applications

Predicting genes from a masked sequence has several problems. First, one should use the option -nolow to avoid masking low complexity regions and trinucleotide repeats in coding regions. But even with only interspersed repeats masked, gene prediction programs may fail to identify exons correctly. As pointed out above, sometimes tail ends of coding regions may have originated from transposable elements. Some gene prediction programs suggest the extent of 3' UTRs. These will often be overestimated in masked DNA, as many genuine poly A signals are located in interspersed repeats. Finally, even if no coding regions have been masked, splice sites may be compromised; e.g. the polypyrimidine region that contributes to an acceptor splice site may be contained within a repeat.
Thus, I generally recommend running a gene prediction program on unmasked DNA (as well) and comparing the predicted genes and exons with the RepeatMasker output. Some gene prediction programs allow you to force certain exons out of the predictions (e.g. often the old ORFs of LINE1 elements and endogenous retroviruses are included in genes). Work is also in progress at several sites to incorporate RepeatMasker into gene prediction programs, in which case matches to repeats are weighted in along with the other parameters used.

Other uses

Many people mask repeats before designing primers or oligo probes from sequence data. I've often been told that primers/probes designed from regions unmasked by RepeatMasker have a much better success rate. A cautionary note here is that unmasked regions are not necessarily unique in the genome (e.g. many lower copy repeats are not in the database yet) and experiments should be performed as if no filtering against repeats has been done. The alignments can help in designing primers from sequences that are completely masked. Regions that diverge much from the consensus are less likely to misbehave than others.

RepeatMasker is sometimes used during assembly of large genomic sequences. This procedure is probably most useful in very Alu rich regions; in that situation I recommend masking only the Alus, and maybe limiting the masking to those Alus less than 15% diverged (-div 15).

There are plenty of other uses, e.g. analysis of repeats can reveal a lot about the evolution of a locus (deletions vs insertions, inversions, approximate time of these events). When you're doing that you're a specialist and don't need any help from this help file (maybe from some of the literature cited below though).
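
As a rough illustration of the repeat-density rule of thumb from section 4.2 (a genomic stretch of 30 kb or more showing less than 10% interspersed repeats is of suspect origin), the following Python sketch estimates the masked fraction of each sequence in a .masked file. It is not part of RepeatMasker: the function names are hypothetical, and it assumes the default masking with Ns (or Xs), not lower-case masking. Ns already present in the query and masked low complexity DNA also count as masked here, so the result only approximates interspersed repeat density; running with -nolow should bring it closer.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from fasta-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_fraction(seq: str, mask_chars: str = "NX") -> float:
    """Fraction of bases replaced by a masking character."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in mask_chars)
    return masked / len(seq)


def low_density_suspects(
    records: Iterable[Tuple[str, str]],
    min_length: int = 30_000,   # "say 30 kb or more"
    min_density: float = 0.10,  # "less than 10% density"
) -> List[str]:
    """Names of long sequences whose masked fraction is below min_density."""
    return [
        name
        for name, seq in records
        if len(seq) >= min_length and masked_fraction(seq) < min_density
    ]
```

To use it, pass an open .masked file to read_fasta and the resulting records to low_density_suspects. As noted in section 4.2, coding regions and transcripts will be flagged routinely, so a hit is a reason to look closer and not proof of contamination.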
5 REFERENCES

Reference for RepeatMasker

We have not yet published a paper on RepeatMasker, but would appreciate it if you could refer to the web page (Smit, AFA & Green, P RepeatMasker at http://repeatmasker.genome.washington.edu/cgi-bin/RM2_req.pl) or otherwise to Smit, AFA & Green, P., unpublished results.

Literature and further information on specific repeats

The EMBL format of the Repbase Update database contains references for specific repeats, as well as annotation with respect to divergence level, affiliation, copy number, etc. Much if not most of the information in this database is not published elsewhere. It can be accessed at http://www.girinst.org/~server/repbase.html.

We are trying to keep the nomenclature of the interspersed repeats in the output of RepeatMasker identical to that of the reference database. In most cases the names correspond to those most commonly used in the literature.

The following list of literature is minimal and restricted to human interspersed repeat articles.

Overviews

Smit, A.F.A. (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr Opin Genet Devel 9 (6), 657-663.

Jurka, J. (1998) Repeats in genomic DNA: mining and meaning. Curr Opin Struct Biol 8 (3), 333-337.

Smit, A.F.A. (1996) Origin of interspersed repeats in the human genome. Curr Opin Genet Devel 6 (6), 743-749.

Smit, A.F.A. (1995) Origin and evolution of mammalian interspersed repeats. PhD dissertation, USC.

SINE/Alu

Schmid, C.W. (1998) Does SINE evolution preclude Alu function? Nucleic Acids Res 26, 4541-4550.

Schmid, C.W. (1996) Alu: structure, origin, evolution, significance, and function of one-tenth of human DNA. Prog Nucleic Acids Res Mol Biol 53, 283-319.

Jurka, J. (1996) Origin and evolution of Alu repetitive elements. In "The impact of short interspersed elements (SINEs) on the host genome", Maraia, R.J., editor. Springer Verlag.

SINE/MIR & LINE/L2

Smit, A.F.A. and Riggs, A.D. (1995) MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res 23, 98-102.

LINE/L1

Smit, A.F.A., Toth, G., Riggs, A.D. and Jurka, J. (1995) Ancestral mammalian-wide subfamilies of LINE-1 repetitive sequences. J Mol Biol 246, 401-417.

LTR/MaLR

Smit, A.F.A. (1993) Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic Acids Res 21, 1863-72.

LTR/Retroviral

Wilkinson, D.A., Mager, D.L. and Leong, J.C. (1994) Endogenous Human Retroviruses. In The Retroviridae, J.A. Levy, ed. (New York: Plenum Press), pp. 465-535.

DNA/all types

Smit, A.F.A. and Riggs, A.D. (1996) Tiggers and other DNA transposon fossils in the human genome. Proc Natl Acad Sci USA 93, 1443-8.

Update history:

Improvements and new features in the April 1997 version compared to the June 1996 version:

Besides a massive (2.5 fold) expansion of the databases, the program itself is more sensitive and selective, has several new features and an improved output. The script is now divided in two; one (RepeatMasker) performs the cross_match searches, the other (ProcessRepeats) takes the RepeatMasker output to create the overview table and to improve the output in the .out file. The cross_match searches have been optimized, especially with regard to detection of low complexity sequences and old LINE1 elements. The most obvious changes in the processed output file compared to the unprocessed file are (i) overlapping matches are usually resolved, (ii) LINE1 fragments are annotated with position numbers as in a full L1 element, and (iii) when an Alu or LINE1 is fragmented, information from both or all fragments is used to assign a subfamily name. New features in the program include the ability to screen a custom library and to create an output file with alignments in positional order.

Improvements May 97: (minor update)
- added option to only mask low complexity DNA
- added version information to .tbl output
- changed artreps.lib to othermamreps.lib, adjusted parameters to accommodate larger size of db
- many improvements in estimating number of elements in query
- added name adjustments for MLT2
- fixed many bugs...

Improvements September 1997 (minor update)
- major expansion of the rodent libraries and significant update of the human libraries as well, especially in LINE1 elements.
- scripts modified to accommodate new entries in databases
- simple repeats masking optimized by including pentamers and using a more stringent matrix
- several bugs fixed (e.g. sequences without repeats are now counted)
- table now displays parameters used
- temporarily, for comparison with the human LINE library, the same minimum match is used in the selective settings as in the default settings to avoid masking small inserts in the LINE elements
- forthcoming release of cross_match has improved performance on a tandemly repeated element (currently sometimes the lower scoring unit may go unmasked, even when it is a common repeat)

Improvements and new features in the May 1998 version compared to the September 1997 version:
- the program now accepts most 'not quite fasta' format files
- large sequences are analyzed in fragments of 100 kb to reduce the memory requirements of the program. Similarly, files with very many sequence entries are divided up. You shouldn't notice any of this in the output files.
- matrices are used that are optimal for the divergence level of the repeats to which the query is compared and the background nucleotide composition.
- another big update of the human repeat databases.
- the small RNA sequences have been corrected and expanded (all tRNAs should be there now)
- close to perfect simple repeats, full-length shorter interspersed repeats and young LINE1 3' ends are excised from the sequence (in both human and rodent analysis) to allow better detection of any underlying repeats. A sequence file with these repeats deleted can be saved.
- the -low option doesn't mask out any type of simple repeats anymore
- alignments are shown in the orientation of the query sequence
- new options include
    masking Alus only
    obtaining a sequence with full-length repeats deleted
    obtaining a(n incomplete) list of possibly polymorphic microsatellites
    setting a cutoff score when using the -lib option.

minor fixes
- the .out.xm and .ace files now also contain the simple repeats and low complexity DNA (can still be omitted by running ProcessRepeats with the -low option on the .cat file)
- sequence names including a number between parentheses used to confuse the program thoroughly; now fixed
- many that you wouldn't find interesting

Improvements and new features December 1998
- This version is optimized for use with the 1998 cross_match release. The difference for RepeatMasker is mainly in the complexity adjusted length of the matches that function as kernels for Smith Waterman alignments and the matrix dependent adjustment of the score for complexity of the alignment.
- Among bugs in the May 1998 version fixed are those resulting in bogus output when the sequence name ends with .seq and when a raw sequence is submitted. Also, sequence files that contain carriage returns from PCs and Macs are handled better now.
- You can now limit the masking to younger repeats by setting a maximum allowed divergence of repeats from their consensus sequence
- An mRNA/EST option is available that prevents false masking due to inappropriate matrix choice and low complexity matches to LINE1 elements.
- You can set the background GC level (determining which matrices are used), overriding the program's calculations.
- The full description ('>') lines are retained in the masked file.
- The .out file table can be returned with flexible length columns, allowing the full length of long query sequence names to be displayed
- The sequences identified as repeats can be returned in lower case (rest in capitals) rather than masked out by Ns or Xs.
- Output to the screen is more informative and less panicky
- Simple repeat and satellite masking has been improved again; their annotation has changed a bit, most notably they are now all listed in the orientation of the query sequence

April 1999

The default return format of the annotation file is changed, hopefully in a way that does not interfere with any type of parsing; the width of the columns is now adjusted to the longest entry in that column, allowing query names to be spelled out in full, and usually leading to narrower tables.

Arabidopsis, Drosophila, and grass repeat libraries were added; other repeat libraries were updated.

Three measures were taken to eliminate the (few) false positives:
- Use of the actual average GC level of sequences in a batch file may sometimes lead to false masking (or failure to mask) in sequences that diverge largely from the average. Thus, by default, all batch files are now analyzed with the innocuous 43% matrices.
- one entry, responsible for 90% of false masking in GC rich regions, is deleted from the 'tough L1' library.
- the matrix used for identification of the most diverged sequences in very GC rich regions, based on too little data and too much extrapolation, was 'too easy' on the mismatches and has been adjusted.

Thanks to these measures the 'mrna' option is not necessary and has been removed.

A bug is fixed that led to (wildly) improper annotation for some sequences fully consisting of repeats (all bases masked). A series of lesser bugs were taken care of.
New bugs were skillfully introduced, probably.

May 5 1999
- Eliminated a really dumb bug that resulted in having the percent deletions replaced by the percent insertions.
- Made it easier to use your own database with repeatmasker. The database does not have to be in the repeatmasker directory.

March 2000

Besides a long overdue update of the databases, the following improvements have been made:

speed, sensitivity, user-friendliness
- It is now possible to run large sequences and batch files on multiple processors.
- An even faster option (-qq) is available for people in a serious hurry
- More repeats are cut out, in particular LINE1 3' fragments, to better uncover underlying repeats
- I've reduced the default fragment length to 51000 bp (incl 1000 bp overlaps); this gives a slightly lower chance of running out of memory (followed by resorting to a larger wordlength) and sometimes better choice of substitution matrices.
- The -cut option does not overrule fragmentation anymore
- RepeatMasker now handles zipped (.gz) and compressed (.Z) sequence files
- You can now quit the program at any point with 'control-c'.

annotation, display, summary
- An option is added providing unique IDs for individually integrated elements, labeling fragments of the same element with the same number
- Classification of mammalian LTR elements has changed (now includes the conventional three ERV classes)
- Some repeat names have been adjusted (notably the MLT2 subfamilies) to be consistent with the RepBase nomenclature
- Improved interpretation of fragmented sequences, resulting in more accurate counts (for the .tbl file) of total insertions in the query sequence
- Negative coordinates in LINE1 elements are now avoided (but see 'Specific LINE1 problems' in helpfile above)
- Improved accounting of LTR elements; now most LTR elements receive the same name for the LTRs and internal sequence and are counted as one insertion.
- Divergence and insertion/deletion levels are calculated for annotations that are derived from two or more fused fragments
- Fixed the .ace output so that the orientation of the match is displayed.
- Output can be retrieved in the GFF (General Feature Format) format. The current output follows a Sanger convention.

bugs
- The .tbl file format was not prepared for sequences over 10 million bp. It's now ready for sequences up to 1 billion bp. For larger sequences, I'd recommend doing the analysis in two or more steps...
- A bug has been fixed that crashed scripts trying to start several RepeatMasker jobs simultaneously
- A bug is fixed that sometimes resulted in incorrect output when multiple files were fed to repeatmasker and one was masked in full
- Sequences and fragments >> 100 kb completely consisting of Ns (no ACGT) used to crash the program
- Drosophila and Arabidopsis masking allowed no overlaps in matches.
- Several other bugs were fixed that gave slightly incorrect output under cruel and unusual circumstances

May 2000
- When using a -frag setting lower than 50000 on a sequence shorter than 50000 bp, the sequence would be analyzed in one piece anyway. Fixed.
- A cut file was created under some circumstances without asking. When masking non-mammalian DNA, this '.cut file' had any reported repeat deleted, rather than all full-length elements. Fixed too.

June 2000
- An option -w(ublast) has been added that lets RepeatMasker run with WU blast rather than cross_match via the MaskerAid script by Joey Bedell and Warren Gish. I have not implemented an automatic update of the blast formatted databases and matrices.
- When using the wublast option, hyphens in the sequence are retained (previously all non-letters were deleted from the sequence). WU blast uses hyphens to indicate insurmountable barriers.
- A half dozen bugs have been fixed that led to crashes of the ProcessRepeats script or negative substitution or deletion levels in the .out file
- Some changes have been made in the forking procedure and system calls, avoiding some reported problems with large batch files
- LINE1 consensi have been updated

August 2000
- Added a feature to check for human and rodent DNA contamination.
- Included a step that surgically removes bacterial insertion sequences that arrived by transposition during cloning
- Improved satellite detection somewhat
- Improved statistics in the .tbl file. Among other things, long stretches of Ns are counted in the query and can be ignored in calculating percent coverage, and most repeats that have spawned a satellite sequence are now counted as a single copy.
- The temporary files created by the program are reduced and are written to a temporary subdirectory of the working directory, rather than to the directory containing the query file. Besides a reduction in clutter, this can make a big difference in speed if the working directory and file are across system boundaries.
- Removed a very rare but awful bug leading to false extension of an annotation all the way to bp 1, several divide by zero bugs, a bug occasionally crashing the script when the query name contained an asterisk, and a few minor bugs

September 2000
- The contamination checking step is separated from the repeatmasker run
- IS elements are only optionally clipped from the query sequence
- C. elegans repeats can be screened with -elegans and get a .tbl file; zebrafish repeats are separated from other vertebrates in the tiny danio.lib database.
- More information is stored in .cat files, e.g. number of Ns in linker regions and GC level, so that this information does not have to be hand fed to ProcessRepeats.
- Version and type of alignment programs indicated in .tbl file
- Spontaneous running of all fragments in parallel when using WU-blast fixed
- Hang-up when last fragment is fully masked when using WU-blast avoided.
- False matches to a fragment of a particular L1 subfamily avoided

October 2000
- Fixed failing IS element check when using WU-blast
- Fixed typo causing some B1 elements to be reported as primate specific Alu

November/December 2000
- Fixed the following bad bug (present since April): when using -xsmall on large sequences that are analyzed in multiple fragments, the lower case replacement sequences in fragments 2 and up were taken from the first fragment. Please check old results that may have depended on the masked sequence being identical to the non-masked sequence.
- Changed some code so that files with many sequences are processed more efficiently
- When choosing two species options, the program won't continue as if everything's right, but now gives an error message.

January-March 2001
- The script now works on Windows systems with a Cygwin port; I can't test it too often, so there may be more bugs than elsewhere
- Simple repeats are cut out in a more sophisticated way; main behavioral change is that poly A tails are less likely to be cut out before the SINEs and LINEs (with poly A tails) are recognized
- I added a cutting step in rodent analysis, just like the first steps in human analysis; rodent analysis is also a bit more sensitive elsewhere

fixed bugs:
- closed all opened files
- replaced a -M option for cross_match with -matrix (as -M wasn't recognized by MaskerAid)
- false SVA (instead of Alu) annotations with the option -nocut eliminated

April 2001
- Made the creation of the .align file far more efficient

fixed bugs:
- The -frag option wasn't functioning when analyzing sequences <50 kb
- A .cut file was created under some circumstances without asking.
  When masking non-mammalian DNA, this file contained the query with any reported repeat deleted, rather than all full-length elements.

May 2001
- Large sequences (>2x the -frag settings, i.e. default > 100 kb) in a batch file are analyzed in fragments, just like single large sequence files
- Duplicate alignments of repeats in regions of overlap between two fragments are eliminated if one is contained in the other.

fixed bugs:
- The -frag option wasn't functioning when analyzing sequences below 50 kb
- A .cut file was created under some circumstances without asking. When masking non-mammalian DNA, this file contained the query with any reported repeat deleted, rather than all full-length elements.
- Output files would not be written to the query file directory if the query file was read only

June 2001
- Alternative output formats (.out.xm, .ace, .gff) now also available when using your own repeat sequence files (-lib option).
- Now usually skips a step in which the entire query file is read into one scalar. This caused memory problems for *very* large files.

August 2001
- The divergence level indicated in all output files is now "mismatches/(matches+mismatches)" rather than the cross_match idiosyncratic "mismatches/(matches+mismatches+unaligned bases in query)"

November 2001
- Fixed bad bug that caused the -lib and -u options to not report matches or alignments if a single match actually had been found.
- When a library invoked with the -lib option is formatted like a repeatmasker library, the repeats are processed (merged etc.) and an overview (.tbl) file is created.

December 2001
- Improved memory requirements of ProcessRepeats.

January 2002
- On rare occasions up to 500 bp were missing from the 3' end of masked sequences when using the -frag option (ouch)

February 2002
- By not always slurping the entire query into an array, significantly reduced memory requirements for large query sequences (thereby saving much time as well)

March 2002
- Added chicken, carnivores, and artiodactyl libraries and code
- Script backs up previous repeatmasker output it encounters, rather than deleting it after a warning
- Most 'cross_match error (1)' etc. error messages replaced with something more user friendly

April 2002
- More rewriting to reduce array sizes and the size of sequences analyzed for IS elements
- Made the unique IDs in the .out file start at 1 and not skip any numbers
- Some rare combinations of input format errors are now handled correctly
- Estimates for masked bases and total bases excluding Ns in the query were sometimes off by a few bases due to small overlaps of repeats that were temporarily cut out of the query in the masking process
- Lots of work on the mouse libraries.

May 2002
- 2 bugs leading to improper annotation of > 100 kb contigs in > 4 Mbp batch files and failure to mask certain B1 elements in rodent DNA. Fixed between 20020505 general and 20020515 fixed releases.

June 2002
- Improved memory efficiency of ProcessRepeats and its speed when only investigating simple repeats (-noint)
- Added -maxsize option. Since February large sequences are handled in pieces of 4 Mbp each to avoid having several arrays of 4 Mbytes each (this is different than the fragment settings, which determine the size of fragments cross_matched at once). This 'maxsize' can now be adjusted, among other reasons because for sequences > maxsize, IS elements and full-length, young repeats can not be clipped out.
- Fixed several problems with E. coli insertion element detection in > maxsize queries
- Fixed a serious bug causing the script to go into an infinite loop on sequences with ^M type carriage returns (thanks Mitch Skinner)
- Fixed several bugs in ProcessRepeats leading to (innocuous) warnings when using a personal repeat library (thanks Alfred Beck)
- Ambiguous sequence (N) strings >4 Mbp (maxsize) in the query were not incorporated in the masked output file. These do occur in some chromosome assemblies (to replace centromeres, etc.). Fixed.
- The name of the custom library was not displayed in the .tbl file. Fixed.

July 2002
- Reduced false positives (due to new elements in the libraries)
- Fixed several strange progress messages (like 'analyzing fragment 9 of 8 of sequence 0')

If you have ideas for improvements or found a problem, drop a note at asmit@hoh.biotech.washington.edu or afasmit@pacbell.net

/*****************************************************************************
# Copyright (C) 1996-2002 by Arian Smit
# All rights reserved.
#
# The software and databases should not be redistributed or used for
# any commercial purpose, including commercially funded sequencing,
# without written permission from Geospiza Inc, Seattle
# (http://www.geospiza.com/)
/*****************************************************************************