biscan program

Documentation for the biscan program is below, with links to related programs in the "see also" section.

{   version = 1.32; (* of biscan.p 2007 Mar 21}

(* begin module describe.biscan *)
(*
name
   biscan: multiple part scanning program

synopsis
program biscan(ribla: in, riblb: in, scanpa: in, scanpb: in, histog: in,
               book: in, biscanp: in, scanfeatures: out, data: out,
               scaninst: out, output: out);

files

   book: a book from the delila system

   ribla and riblb: weight matrices from the sites or ri programs.
      Lines that start with * are notes.  The next line contains the matrix
      FROM-TO coordinates.  This is followed by the matrix in the order A, C,
      G, T from FROM to TO.  Ribla is the ribl for model A and riblb is the
      ribl for model B.

   scanpa and scanpb: parameters to control the program.

      parameterversion: the version number of the program.  This allows the
         user to be warned if an old parameter file is used.

      seqs: One integer on the first line is the number of sequences to scan
         to produce the vector.  0 = none, positive = that number; negative =
         all.

      Ri range: Two real numbers on the second line give the range of
         information content to report in the data file.

      Z score range: Two real numbers on the third line give the range of the
         Z score to report in the data file.  A negative sign will be
         converted to a positive sign so that this parameter limits the range
         of acceptable sites to an interval on the real line.  Note: normally
         one would want the lower number to be zero.

      Probability range: Two real numbers on the fourth line give the range
         of probability to report in the data file.  The probability of a
         site is determined from the mean and standard deviation of the Ri
         distribution.  Note: normally one would want the lower number to be
         zero.
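
         A minimal Python sketch of how Z and the probability could be
         derived from the mean and standard deviation of the Ri
         distribution.  It assumes a one-sided Gaussian (lower tail), which
         agrees with the example feature line shown under scanfeatures
         (Z = -0.473212, probability = 0.318031) but is not taken from the
         biscan source.  The function names are hypothetical.

```python
import math

def z_score(ri, mean, stdev):
    # Number of standard deviations that Ri lies away from the mean.
    return (ri - mean) / stdev

def lower_tail_probability(z):
    # Assumption: Gaussian probability of a value at or below z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Z value copied from the example feature line in this manual page.
print("%.4f" % lower_tail_probability(-0.473212))
```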

      fromwanted towanted range: two integers that define the FROM-TO range
         of the ribl matrix to use for computations.  This is independent of
         the range displayed in the walker.
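
         As an illustration of what the fromwanted-towanted range does, the
         Python sketch below sums weight matrix entries over a chosen
         sub-range to get the individual information of one site.  The
         in-memory matrix layout, the function name and the toy values are
         assumptions for this example, not the ribl file format or the
         biscan code.

```python
def site_ri(matrix, matrix_from, seq, zero_index, fromwanted, towanted):
    # matrix[k] holds the four weights (in A, C, G, T order) for matrix
    # position matrix_from + k.  zero_index is the index in seq of the
    # base aligned with coordinate zero of the matrix.
    total = 0.0
    for pos in range(fromwanted, towanted + 1):
        base = seq[zero_index + pos]
        total += matrix[pos - matrix_from]["ACGT".index(base)]
    return total

# Hypothetical 3-position matrix covering -1 to +1 (made-up weights).
toy = [[1.0, -2.0, -2.0, 0.5],
       [-1.0, 2.0, -3.0, -1.0],
       [0.0, -1.0, 1.5, -2.0]]

print(site_ri(toy, -1, "TACGT", 2, -1, 1))   # whole matrix
print(site_ri(toy, -1, "TACGT", 2, 0, 1))    # restricted to 0..+1
```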

      ways:  One integer.  2 means scan both the sequence and its
         complement.  1 means simply scan the sequence.  0 means to let the
         program figure it out.  The Ri program determines the symmetry of
         the matrix.  If it is symmetrical, it will only scan one way.  If it
         is asymmetrical, both scans are done.

      sitedefinition:  If the first non-blank character on the line is 'd',
         then the rest of the line contains a definition of how to write out
         the sites.  If no site is defined, the scanfeatures file will not be
         written to.  See program lister.p for details.  The basic format for
         an ASCII definition looks like this:

         define "Fis" "-" "[0]" "[0]" -7  0 +7

         For a walker it looks like:

         define "Fis" " w" "  " "  " -7 +7

         NOTE: the range for walker display (given in this site definition)
         is independent of the range of the weight matrix used for
         computation (given in the fromwanted and towanted parameters).

      print definitions:  Any number of lines that define how to print the
         "other" feature string in each feature definition.  The data that may
         be printed are the same as those in the data file.  They are:

         #           width
         length      width
         name        width
         coordinate  width
         orientation width
         Ri          width decimal
         Z           width decimal
         probability width decimal
         string      "quote string"
         .           end of print definitions

         If the first character on a line is '#', the line defines the
         width for the number of the DNA piece from the book.

         If the first character on a line is 'l' or 'L', the line defines the
         width for the length of the DNA piece in the book.

         If the first character on a line is 'n' or 'N', the line defines the
         width for the name of the DNA piece in the book.

         If the first character on a line is 'c' or 'C', the line defines the
         width for the coordinate of the zero base of the site.

         If the first character on a line is 'o' or 'O', the line defines the
         width for the orientation of the site.  If the width is 1, the
         orientation is given as + or -; if the width is larger, the
         orientation is given as -1 or +1.

         If the first character on a line is 'r' or 'R', the line defines the
         width and decimal fields for the individual information in bits.  The
         word "bits" is attached to the end of the string.

         If the first character on a line is 'p' or 'P', the line defines the
         width and decimal fields for the probability of the site.

         If the first character on a line is 'z' or 'Z', the line defines the
         width and decimal fields for the Z score of the site.

         If the first character is 's' or 'S' then the line defines a string to
         insert.

         The end of the file or a period "." ends the print definitions.

         The lines may be put in any order and this defines the order that they
         will be printed to the "other" string.  If the first character is not
         found (as, for example by having a blank in front of it), the
         corresponding data will not be printed.  This gives the user full
         control of the "other" string contents.

         The only kind of definition that may be repeated is the "string".
         This allows the user to put whatever they desire between the data
         items.
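
         The rules above can be sketched as a small Python reader.  This is
         an illustration only: the function name and the returned tuples
         are hypothetical, and the handling of malformed lines in the real
         program is not described here.

```python
def parse_print_definitions(lines):
    # Returns a list of (kind, arguments) tuples in file order, which is
    # also the order in which the items would be printed.
    width_only = "#lnco"      # followed by one integer: width
    width_decimal = "rzp"     # followed by two integers: width, decimal
    items = []
    for line in lines:
        if line.startswith("."):
            break                          # a period ends the definitions
        key = line[:1].lower()
        if key == "s":
            # The string is whatever lies between the first two quotes.
            items.append(("string", line.split('"')[1]))
        elif key and key in width_only + width_decimal:
            count = 1 if key in width_only else 2
            fields = line.split()[1:1 + count]
            items.append((key, tuple(int(f) for f in fields)))
        # Any other first character (for example a blank) is skipped.
    return items

example = ['string "data at:" a string listed at the feature',
           'coordinate 5',
           'Ri 5 1  Riwidth Ridecimal',
           ' Z 4 1  skipped because of the leading blank',
           '.       end of print definitions']
print(parse_print_definitions(example))
```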

      file output definitions:  The first three characters on the line define
         which files will be output.  Capital characters turn on the output.
         Small characters turn it off.  The files are data, (scan)features,
         and (scan)inst so the characters are d, f and i, respectively.  Thus
         DfI turns on the data and scaninst files and leaves the scanfeatures
         off.  (Unidentified characters default to upper case.)
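
         A short Python sketch of this rule, assuming the three characters
         are read by position in the order d, f, i.  The function name is
         hypothetical.

```python
def output_files(flags):
    # flags: the first three characters of the parameter line.
    names = ("data", "scanfeatures", "scaninst")
    selected = []
    for name, letter, ch in zip(names, "dfi", flags[:3]):
        # Only the small letter turns a file off.  A capital letter or
        # an unidentified character leaves the file on.
        if ch != letter:
            selected.append(name)
    return selected

print(output_files("DfI"))
```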

      normalizeRi:  The first character defines how to normalize
         the reported Ri values.  The Ri value at coordinate zero
         is called Ri0 (written Ri(0) below).

         n: normal: scan and report Ri

         s: subtract: compute Ri(l) - Ri(0) at each position l

         d: divide: compute Ri(l) / Ri(0) at each position l

         The s and d modes are usually to be used in conjunction with
         renumbering by Delila (the 'default coordinate zero' command).
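
         The three modes amount to the following, shown as a Python sketch
         with a hypothetical function name:

```python
def normalize_ri(mode, ri_l, ri_0):
    # mode is the first character of the normalizeRi parameter line.
    if mode == "n":
        return ri_l            # normal: report Ri unchanged
    if mode == "s":
        return ri_l - ri_0     # subtract: Ri(l) - Ri(0)
    if mode == "d":
        return ri_l / ri_0     # divide: Ri(l) / Ri(0)
    raise ValueError("normalizeRi must be n, s or d")

print(normalize_ri("s", 8.0, 10.0))
```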

      instfrom, instto: range of Delila instructions produced in scaninst
         if that file is created.

   histog:  output of the genhis program.  It is the distribution used to
      compute the uncertainty due to various distances between the two
      models.  Histog must be in increments of 1 over the range.

      To create the data file for genhis, start with a Delila instruction
      file.  Then use the malign program to get improved alignments.  Use
      malin to extract one of the alignments as a second Delila instruction
      file.  Then use the diffinst program to make the data file.  Finally,
      run genhis to get the histog.

   biscanp:  parameters to control the program.  The file must contain the
      following parameters, one per line:
 
      parameter version number, needs to be compatible with current program
         version
 
      lowhistog, highhistog (two integers): range of the histog distribution
         to use, in the order lowest value to highest
 
      ricutoff: total Ri cutoff.  This value is compared to the individual
         info of site 1 + the individual info of site 2 + the sample
         correction - the -log2 of the distance probability.
 
      singleA (character):  If singleA is 'f' then filter out any feature
         pairs that contain the same A coordinate that have lower information
         than another pair.  Note: singleA and singleB cannot be both 'f' at
         the same time.
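
         One way to picture the singleA filter is the Python sketch below.
         The tuple layout, the function name and the treatment of ties
         (the first pair wins) are assumptions, not the program's code.

```python
def filter_single_a(pairs):
    # pairs: list of (a_coordinate, b_coordinate, total_information).
    best = dict()
    for pair in pairs:
        a = pair[0]
        if a not in best or pair[2] > best[a][2]:
            best[a] = pair
    # Keep the surviving pairs in their original order.
    return [p for p in pairs if best[p[0]] is p]

pairs = [(100, 110, 9.5), (100, 112, 11.0), (200, 212, 7.0)]
print(filter_single_a(pairs))
```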

   data: The results.  Comments are lines that begin with '*'.  The columns are
      defined in comments in the file.  The matrix is searched over both the
      sequence and its complement.  Ri is reported, as is the Z and probability
      based on the mean and st.dev.

   scanfeatures: The results in the "features" format for input to the
      lister program.  This consists of comment lines (beginning with "*"),
      definition lines (as shown above), and features of the form:

      @ K01789 229.0 -1 "dnaA" "+12.2 bits " 12.200338    -0.473212     0.318031

      See program lister.p for details.
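
      For downstream scripts, a feature line like the one above can be
      split with Python's shlex module, which keeps each quoted string as
      one token.  The field names are assumptions inferred from the
      example (the last three numbers are read as Ri, Z and probability);
      lister.p remains the reference for the format.

```python
import shlex

def parse_feature(line):
    # shlex.split treats each double-quoted string as a single token.
    t = shlex.split(line)
    if t[0] != "@":
        raise ValueError("not a feature line")
    return dict(piece=t[1], coordinate=float(t[2]), orientation=int(t[3]),
                name=t[4], other=t[5], ri=float(t[6]), z=float(t[7]),
                probability=float(t[8]))

line = ('@ K01789 229.0 -1 "dnaA" "+12.2 bits " '
        '12.200338    -0.473212     0.318031')
print(parse_feature(line)["name"])
```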

   scaninst:  The results are given in the form of delila instructions:
      name "dnaA"; piece K01789; get from 229 -100 to 229 +100 direction -;
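
      A Python sketch that reproduces the example line above from a site
      and the instfrom, instto parameters.  The function name is
      hypothetical, and only the layout of this one example is covered.

```python
def delila_instruction(name, piece, coordinate, orientation,
                       instfrom, instto):
    # Orientation +1 or -1 becomes the Delila direction + or -.
    direction = "+" if orientation > 0 else "-"
    template = 'name "%s"; piece %s; get from %d %+d to %d %+d direction %s;'
    return template % (name, piece, coordinate, instfrom,
                       coordinate, instto, direction)

print(delila_instruction("dnaA", "K01789", 229, -1, -100, 100))
```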

   output: messages to the user

description
 
   This program is used to compare the binding patterns of 2 different
   binding site models.  It selects sites that are within a certain range of
   each other, adds their individual information together, and subtracts a
   distance-based probability value (the -log2 of the probability of that
   distance in the histog distribution) to determine the new total
   information.
 
   The theory is that the distances have a distribution.  So one can assign
   probabilities to each distance.  One can compute the uncertainty of such a
   distribution, so one can also compute the individual information
   (surprisal) of each gap - it's just -log2(gap distance frequency) + small
   sample correction.  Ryan's discan program does this.
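
   A minimal Python sketch of this arithmetic.  The form of the small
   sample correction is not given here, so it is left as a value supplied
   by the caller.  The function names and the counts in the usage line
   are made up for illustration.

```python
import math

def gap_surprisal(count, total, correction=0.0):
    # -log2 of the observed frequency of this distance in the histogram,
    # plus an optional small sample correction supplied by the caller.
    return -math.log2(count / total) + correction

def total_information(ri_a, ri_b, surprisal):
    # Information of the two parts minus the cost of the gap, in bits.
    return ri_a + ri_b - surprisal

# A distance seen in 25 of 100 pairs costs 2 bits.
print(total_information(8.0, 6.0, gap_surprisal(25, 100)))
```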
 
   What I like about this is it combines the information of the parts
   smoothly with the information about gap distance!  There are *no*
   arbitrary gap penalties or other arbitrary parameters!  Of course, if the
   model fails we are in big trouble ...
 
   To make the distribution file (histog) you need to know which model
   should be A and which should be B: the model that was subtracted is A
   and the model it was subtracted from is B.  For example, if your
   distribution is negative, you probably subtracted the model further
   downstream from the model upstream, so the downstream model would be A
   and the upstream model would be B.  With ribosome binding sites, sd
   always came before aug (atg).  Aug was subtracted from sd, giving a
   negative distribution, so aug was assigned as the A model and sd was
   assigned as the B model.
 

examples

   An example scanp file:

-1      number of seqs to scan 0 = none, positive = that number; negative = all
0       information content at or above which to report in the data file.
100     Z score below which to report in the data file
0       probability at or above which to report in the data file.
-10 +10 desired region of the ribl weight matrix to use
0       0: program figures it out; 1: one way scan; 2: two way scan.
define "Fis" "-" "[0]" "[0]" -7  0 +7
string "data at:" string: A string listed at the feature
coordinate 5
string " Ri = "   string: A string listed at the feature
Ri 5 1  Riwidth Ridecimal: character places for reporting bits to scanfeatures
string " Z = "    string: A string listed at the feature
Z 4 1   z score
string " p = "    string: A string listed at the feature
probability 5 2
.       end of print definitions
DFI     dfi: data, features, inst: files output
n       normalizeRi: n: normal, s: Ri(l)-Ri(0), d: Ri(l)/Ri(0)
-50 +50 instfrom, instto: range to make the scaninst file (if made)

   An example biscanp file:

1.00    version of discan that this parameter file is designed for.
-18 -4  range
4       total Ri cutoff
f       f means filter out duplicates of the features

documentation

@article{Schneider.Ri,
author = "T. D. Schneider",
title = "Information Content of Individual Genetic Sequences",
journal = "J. Theor. Biol.",
volume = "189",
number = "4",
pages = "427-441",
note = "http://www.lecb.ncifcrf.gov/$\sim$toms/paper/ri/",
comment = "indiv.tex",
comment = "Submitted, April 1997",
year = "1997"}

@article{Schneider.walker,
author = "T. D. Schneider",
title = "Sequence Walkers:
a graphical method to display how binding proteins
interact with {DNA} or {RNA} sequences",
journal = "Nucl. Acids Res.",
volume = "25",
comment = "walker.tex, November 1, issue 21",
note = "http://www.lecb.ncifcrf.gov/$\sim$toms/paper/walker/,
erratum: NAR 26(4): 1135, 1998",
pages = "4408-4415",
year = "1997"}

see also
   sites.p, ri.p, genhis.p, lister.p, dnaplot.p

author
   Ryan Shultzaberger  stealing Thomas Schneider's scan.p code

bugs
   * The quote strings in the parameter file are not recorded and so are not
   reproduced in the data file comments.
   * Blank characters are placed around the quote strings.

   * Complementary scans should work, but I haven't tested them completely.
     The A model can be scanned both ways, but B is fixed.
     It is not clear that this works.

   * If sites have an even symmetry then there is a problem.  (The
     riblarray^.symmetry parameter needs to be used to fix the problem.)

technical notes
   The mean and standard deviation of the Ri distribution are stored just
   after the Ri(b,l) table in the ribl file.  They are produced automatically
   by the ri program.

   To provide upward compatibility, scanp files of version 2.90 or less will
   be interpreted by the old definitions for the bounds of Ri, Z and p:

      Ri cutoff: One real on the second line is the information content at
      or above which to report in the data file.

      Z score cutoff: One real on the third line is the Z score at or below
      which to report in the data file.  A negative sign will be converted to
      a positive sign so that this parameter limits the range of acceptable
      sites to an interval on the real line.

      Probability cutoff: One real on the fourth line is the lowest
      probability to report in the data file.  The probability of a
      site is determined from the mean and standard deviation of the Ri
      distribution.

   It is not advisable to rely on this feature, as it will go away at some
   point.

*)
(* end module describe.biscan *)
{This manual page was created by makman 1.44}
{created by htmlink 1.52}