Poisson-Fitter
Overview
Poisson-Fitter analyzes genetic data from homogeneous sequences, such as HIV or HCV sequences from a single patient early in infection. The maximum genetic diversity from the consensus should not exceed 10%. (For sets of sequences with >10% diversity, we suggest using BEAST or other tools.) Poisson-Fitter performs statistical tests on Hamming
Distance (HD) frequency distributions: it computes the best fitting Poisson distribution through Maximum Likelihood, performs a Χ2
Goodness of Fit (GOF) test, and tests for Star-Phylogeny.
If desired, the tool also checks for hypermutation enrichment and, if found, removes positions
and/or sequences presenting the APOBEC3G/F signature. Because APOBEC mutations happen at a faster rate than random mutations, they can bias the Poisson fit. We recommend first testing the sample with NO APOBEC correction. However, if some early samples show strong divergence from a Poisson, it is advisable to check whether or not this divergence is be due to APOBEC enrichment. We provide two ways to do so: after uploading the input file the user is redirected to a screen with "APOBEC choices." The user at that point can choose to either have all positions found in an APOBEC context removed, or all sequences found to be significantly enriched for APOBEC (tested via the hypermut tool), or both. The tool will then analyze the additional, APOBEC-removed files, as well as the original file, for easy comparison.
Reference: Giorgi EE, Funkhouser B, Athreya G, Perelson AS, Korber BT, Bhattacharya T. Estimating time since infection in early homogeneous HIV-1 samples using a Poisson model. BMC Bioinformatics 2010 Oct 25;11:532. PMID: 20973976
Input
The tool accepts a single input file. This file may contain alignments from multiple patient sets, to spare the user from uploading each file separately. The alignment file can be in any of the Common Sequence Formats.
The following are REQUIREMENTS for the input file:
- Each sample set must be aligned. If not, an alignment can be obtained for HIV-1 sequences through the tool HIValign.
- For optimal results, we recommend that all sequences within each patient set have the same start and end position; see example below.
- Sequences do NOT need to be codon-aligned. The program will gapstrip any columns that contain only gaps.
- Each sample set must have a consensus sequence and the consensus should be the first sequence in the set. The user can create a consensus using Consensus Maker. When using the Consensus Maker tool remember to check "Like input" under the option "Appearance of output" in order to ensure that the consensus sequence will be aligned with the rest of the sample. If the samples do not have a consensus or are not properly aligned, the program will stop and give a warning. Note: Because of the several options available when creating an alignment and a consensus, it is recommended that the user performs these
steps for each sample. For sexual transmissions sampled very early into the infection,
the consensus will often coincide with the most represented sequence in the sample.
- The very first sequence in the intrapatient alignment should be the consensus. We recommend that you name that sequence XXXX.CONSENSUS, where XXXX is the sample name. The program will look for the string ".CONSENSUS" as the identifier of the first sequence in a new alignment and will use the string immediately preceding it as the unique identifier of that sample. This text string will be used to label the figures and output files for that particular sample.
- All sequences must have unique names.
(note: the software will change sequence names to all upper case)
Example of valid input format:
>CH40.CONSENSUS
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ1
AGCCTGCATGGGATGGACGACCCAGA
>CH40.SEQ2
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ3
AGCCAGCATGGGATGGACGACCCAGA
>SUMA.CONSENSUS
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
>SUMA.SEQ1
TCCAGACCCGACAGGCCCAAA---AAAAAAGTAGACAA
>SUMA.SEQ2
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
This input file contains two alignments, each from a different patient and different genomic region. Separate result files will be generated for CH40 and for SUMA. The
sample ID need not be in every sequence name, but it is required that the user provides an ID in the consensus sequence.
Large-scale input
Deep sequencing detection: Poisson-Fitter can handle big alignments such as the
ones obtained through deep sequencing. When the program detects more than 150 sequences in at least one of the
alignments, it will automatically reformat the alignments and treat them as "deep sequencing" input.
Formatting: The standard
format we chose for deep sequencing input (Fischer, Ganusov, et al. PLoS ONE 5(8):e12303)
is the following: every unique sequence is represented only
once, and its name should be of the type XXX.xxx_yyy, where XXX is an arbitrary string (for example the sample ID), xxx
is the unique identifier of the sequence, and the yyy is the multiplicity of such sequence, i.e. how many sequences
identical to that one are represented in the alignment. The user can choose to input the alignment already in this
format, however, if it is NOT formatted, and the number of sequences exceeds 150, the program will automatically format the
sample in the manner explained above. Note: When the original input is not formatted in this manner, a link to the formatted files will be provided in the output page.
Mailback: When the number of sequences in at least one alignment exceeds 35,000 the program will
run as a background job and the user will be informed when the job is complete. The user will be asked to provide an
email address, and a link to the results will be sent. All result files
will be available for 4 days.
Calculations
The Poisson-Fitter provides a null model of an early HIV-1 infection initiated by a single genetic strain, and prior to the
onset of host-driven selection. Such assumption is based on the occurrence of a genetic bottleneck in HIV sexual or mother to infant
transmissions (Wolinsky et al. Science 1992 (255): 1134-1137; Derdeyn et al. Science 2004 (303): 2019-2022; Delwart et al. AIDS 2001 (15):
1-7; Zhang et al. J Virol 1993 (67): 3345-3356). This results in a majority of new infections being homogeneous, i.e., initiated by a single
genetic strain (Keele et al. Proc Natl Acad Sci 2008 May 27;105(21):7552-7). Furthermore, the viral population grows exponentially during
the early phases of infection prior to the onset of the host immune response. In this simple setting, the Poisson Fitter provides a tool for
estimating evolutionary and demographic parameters.
The null model provided by the tool has a two-fold use: it can be used as a method of comparison to determine whether or not the sample had
indeed originated from one unique genetic patriarch; and, for those samples that do meet such assumptions, the tool can provide an estimate
on the time since the Most Recent Common Ancestor (MRCA) and a test for star-phylogeny evolution.
Time since the MRCA is estimated based on a model of exponential evolution and random accumulation of mutations. This is the case early in
the infection, when the population is expanding, and selective pressure from the host has not yet started. Under such scenario, the pairwise
Hamming Distance (HD, the number of bases at which any two sequences differ) frequency counts are expected to follow a Poisson distribution
with main parameter given by (from minimizing the Log-Likelihood function)
| | (0.1) |
where
Y=(Y0, ... ,Yn) are the pairwise HD frequency counts (i.e.
Y0=number of identical pairs,
Y1=number of pairs that differ at one position, etc.).
Assuming a generation time τ of 2 days (Markowitz et al., J. Virol. 2003 (77) 5037-5038), a mutation rate ε =2.16 x 10
-5 (Mansky and Temin,
J. Virol. 1995 (69) 5087-5094), and a basic reproductive ratio
R0=6 (Stafford et al., J. Theor. Biol. 2000 (203) 285-301), we theorize a
model of asynchronous infections in which every infected cells gives rise to
R0/2 newly infected cells at time
τ1 and
another
R0/2 at time
τ2. Under such model, the time since MRCA is estimated as
| | (0.2) |
where λ has been computed as in equation (0.1), and
(for details on how the above formula has been derived see Lee et
al., J Theor Biol. 2009 Aug 4). As mentioned above, the basic reproductive ratio and the mutation rate were implemented according to the parameters studied in HIV. However, the model can be successfully applied to any other population that shows similar growth patterns (such as HCV, for example). In that case, in the upload page, the user is given the option to change the default mutation rate to any appropriate value between 0 and 10
-5.
After estimating the time since MRCA, Poisson Fitter performs a Χ
2 GOF test for dependent cells (details
in Lee et al., J Theor Biol. 2009 Aug 4) and computes theoretical frequency counts for the pairwise Hamming Distance if the sample were to follow a star-phylogeny.
This is done as follows: let
X0 the number of sequences identical to the consensus,
X1 the number that differ from the
consensus at exactly one position, and so on, up to
Xm. We derive the self-convolution frequencies
, where
n=2m,
from the arithmetic convolution of the frequencies
Xk with themselves using the formula
| | (0.3) |
for k=1, ..., n. For example,
, and so on. When the sample follows a star-like phylogeny, one finds that
where
Yi
is the number of pairs with HD=
i (the pairwise HD frequency counts as defined above). The exploration of the output will help the reader in deciding whether
the analyzed samples do meet the assumptions implicit in the null model or not.
The standard deviation is calculated with different methods, depending on whether the sample size on input is greater or less than 100. For "small" samples (<100 sequences per patient), a jackknife resampling is used to estimate the deistribution of the parameter lambda (the mean of the Poisson distribution), and then a standard deviation estimated from the resampling distribution. For samples with 100 or more sequences instead, we estimate the standard deviation using U-statistics (Gilbert, Rossini, Shankarappa, Biometrics, March 2005).
Output
Compressed files: For each input alignment, a fasta file gets created with the consensus sequence followed by a summary sequence that
displays all mutations found across the entire sample. This type of output is useful when testing for overall APOBEC enrichment (see below).
Hypermut results: If the user chooses to remove APOBEC positions and/or APOBEC enriched sequences, each alignment is
tested for APOBEC enrichment through the tool Hypermut, and p-values
below the threshold selected by the user are displayed in the following format:
The tool tests hypermutation for each sequence in the alignment, as well as for "overall" enrichment, that is, a situation in which APOBEC mutations are scattered throughout the sample, but no single sequence is significantly hypermutated. When testing for overall enrichment, the tool creates a "compressed" file, which consists of the consensus sequence together with a sequence containing all the mutations found across the sample. This "summary" sequence is then tested for APOBEC enrichment using Hypermut. Compressed files will appear in the above table in one row, and the sequence name will be "compressedMutations". If the corresponding p value is below the set threshold, it means that the alignment was found to be OVERALL
significantly enriched for APOBEC. The user sets a threshold of significance (default p=0.1) below which the sample or sequence is declared
hypermutated. The corresponding rows will be highlighted in the above table.
The user can decide how to correct for hypermutation. The
options are:
- Remove the sequences in the alignment that are found to be significantly hypermutated (hypermut p value below threshold). The created
file will be named SampleIDnoAPOBECseq.fasta
- Remove APOBEC positions, in which case all positions in the alignment that correspond to a G in the consensus and follow an APOBEC3G/F
pattern (Harris et al. Nat. Rev. Imm. 2004 (4) 868-877; Simon et al. PLoS Path. 2005 1:e6; Bourara et al. PLoS Path. 2007 (3) 1477-1485)
will be removed throughout the alignment (resulting in shorter sequences). The created file will be named SampleIDnoAPOBECpos.fasta
NOTE: If the user selects a certain hypermutation p-value threshold (first option), then new, APOBEC-removed files will be created only if the hypermut tool will detect APOBEC enrichment below the p-value threshold. If the user wishes to remove APOBEC in all samples regardless of whether hypermutation is significant of not, then the second option should be checked.
In either case both corrected and non-corrected files will be analyzed by the program, so that
the user can evaluate the effect of removing APOBEC from the alignment. Links to the APOBEC-cleaned files are provided in the output for the
user to download.
Log Likelihood - Estimated Parameters: This is a table with all the computed statistics for each alignment. It can be viewed by clicking on the link in the output page, and can be downloaded as a text file by clicking on the link "View As Data File" next to the table title. The table contains the following columns:
SAMPLE: unique alignment identifier, extrapolated from the name in the consensus sequence.
LAMBDA: parameter of the Poisson distribution that best fits the HD frequency counts.
ST.DEV.: standard deviation on the parameter estimated above.
NSEQ: number of sequences in the alignment.
NBASES: number of bases in the alignment.
meanHD: mean intersequence Hamming Distance (should be equal to or very close to Lambda).
maxHD: maximum intersequence Hamming Distance.
DAYS(CI): number of days since MRCA with 95% Confidence Intervals.
Chi2:
χ2 statistic from Goodness of Fit test.
DF: degrees of freedom on the
χ2 statistic.
GOF_PVAL: P value from the Goodness of Fit (low p-values indicate divergence from a Poisson).
NOTE: sometimes the fit is so poor (for
example when the maximum Hamming Distance is too high to be compatible with a homogeneous infection) that the GOF fails and NAs are
outputted. The user should take that as an indication of divergence from a Poisson distribution.
Convolution Estimates: This is a text file to be used as an internal check for star-phylogeny. As the program runs, for each alignment the following data is appended to the file:
The first column displays the HD value, the second column the pairwise HD frequency counts (from 0 to the maximum), and the third the expected values if the sample were to follow a star-like phylogeny. In the above example, because column 2 is nearly a perfect
match of column 3 (ignore NAs), we deem the sample 1006 star-like. A divergence of 10% or less is allowed between the two columns, but notice that this is not a statistical test, rather a completely arbitrary value. When the two columns don't match perfectly we recommend to also check the actual tree for a more accurate verification on whether the sample follows a star phylogeny or not. If a
divergence greater than 10% is found, the above output will display "does not follow a star-phylogeny" in a separate row at the end of the counts.
Figure files: For each alignment, the program outputs two figures. The first one plots the histogram of the pairwise HD frequency counts and
the best fitting Poisson distribution (in red, as a continuous line for better visualization). The following is an example:
The second figure plots once again the pairwise HD histogram (in blue) and the theoretical pairwise HD frequency counts (in red) if the
sample were to follow a star-phylogeny, as computed in (0.3). The alignment has evolved according to a star-like topology when the red line
follows closely the histogram, as illustrated in the following example:
The following instead is an example where the test for star-phylogeny fails:
When in doubt, the user should check the convolution result file where the actual numbers are displayed, to see if they coincide and if not
how divergent they are.
Note: When the sample size exceeds 35,000, the star-phylogeny graphs are plotted on a base 10
logarithmic scale. This is motivated by the fact that for large alignments, often the number of identical sequences far
exceeds the unique ones, hence a linear scale masks the values at the tail of the distribution. Here is an example of a
logarithmic plot:
In the above figure, the actual counts of the HD distribution are represented by the black dots (connected by a line for
easier visualization), the star-phylogeny values by the blue dots, and finally the best fitting Poisson by the red line.
Acknowledgements
This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.
Return to Poisson-Fitter
Go Back to the Poisson Fitter submission form.
last modified: Tue Jan 22 13:27 2013