HIV Databases HIV Databases home HIV Databases home
HIV sequence database



Poisson-Fitter

Overview

Poisson-Fitter analyzes genetic data from homogeneous sequences, such as HIV or HCV sequences from a single patient early in infection. The maximum genetic diversity from the consensus should not exceed 10%. (For sets of sequences with >10% diversity, we suggest using BEAST or other tools.) Poisson-Fitter performs statistical tests on Hamming Distance (HD) frequency distributions: it computes the best fitting Poisson distribution through Maximum Likelihood, performs a Χ2 Goodness of Fit (GOF) test, and tests for Star-Phylogeny.

If desired, the tool also checks for hypermutation enrichment and, if found, removes positions and/or sequences presenting the APOBEC3G/F signature. Because APOBEC mutations happen at a faster rate than random mutations, they can bias the Poisson fit. We recommend first testing the sample with NO APOBEC correction. However, if some early samples show strong divergence from a Poisson, it is advisable to check whether or not this divergence is be due to APOBEC enrichment. We provide two ways to do so: after uploading the input file the user is redirected to a screen with "APOBEC choices." The user at that point can choose to either have all positions found in an APOBEC context removed, or all sequences found to be significantly enriched for APOBEC (tested via the hypermut tool), or both. The tool will then analyze the additional, APOBEC-removed files, as well as the original file, for easy comparison.

Reference: Giorgi EE, Funkhouser B, Athreya G, Perelson AS, Korber BT, Bhattacharya T. Estimating time since infection in early homogeneous HIV-1 samples using a Poisson model. BMC Bioinformatics 2010 Oct 25;11:532. PMID: 20973976

Input

The tool accepts a single input file. This file may contain alignments from multiple patient sets, to spare the user from uploading each file separately. The alignment file can be in any of the Common Sequence Formats.

The following are REQUIREMENTS for the input file:

(note: the software will change sequence names to all upper case)

Example of valid input format:

>CH40.CONSENSUS
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ1
AGCCTGCATGGGATGGACGACCCAGA
>CH40.SEQ2
AGCCAGCATGGGATGGACGACCCAGA
>CH40.SEQ3
AGCCAGCATGGGATGGACGACCCAGA
>SUMA.CONSENSUS
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA
>SUMA.SEQ1
TCCAGACCCGACAGGCCCAAA---AAAAAAGTAGACAA
>SUMA.SEQ2
TCCAGACCCGACAGGCCCAAAATCAAAAAAGTAGACAA

This input file contains two alignments, each from a different patient and different genomic region. Separate result files will be generated for CH40 and for SUMA. The sample ID need not be in every sequence name, but it is required that the user provides an ID in the consensus sequence.

Large-scale input

Deep sequencing detection: Poisson-Fitter can handle big alignments such as the ones obtained through deep sequencing. When the program detects more than 150 sequences in at least one of the alignments, it will automatically reformat the alignments and treat them as "deep sequencing" input.

Formatting: The standard format we chose for deep sequencing input (Fischer, Ganusov, et al. PLoS ONE 5(8):e12303) is the following: every unique sequence is represented only once, and its name should be of the type XXX.xxx_yyy, where XXX is an arbitrary string (for example the sample ID), xxx is the unique identifier of the sequence, and the yyy is the multiplicity of such sequence, i.e. how many sequences identical to that one are represented in the alignment. The user can choose to input the alignment already in this format, however, if it is NOT formatted, and the number of sequences exceeds 150, the program will automatically format the sample in the manner explained above. Note: When the original input is not formatted in this manner, a link to the formatted files will be provided in the output page.

Mailback: When the number of sequences in at least one alignment exceeds 35,000 the program will run as a background job and the user will be informed when the job is complete. The user will be asked to provide an email address, and a link to the results will be sent. All result files will be available for 4 days.

Calculations

The Poisson-Fitter provides a null model of an early HIV-1 infection initiated by a single genetic strain, and prior to the onset of host-driven selection. Such assumption is based on the occurrence of a genetic bottleneck in HIV sexual or mother to infant transmissions (Wolinsky et al. Science 1992 (255): 1134-1137; Derdeyn et al. Science 2004 (303): 2019-2022; Delwart et al. AIDS 2001 (15): 1-7; Zhang et al. J Virol 1993 (67): 3345-3356). This results in a majority of new infections being homogeneous, i.e., initiated by a single genetic strain (Keele et al. Proc Natl Acad Sci 2008 May 27;105(21):7552-7). Furthermore, the viral population grows exponentially during the early phases of infection prior to the onset of the host immune response. In this simple setting, the Poisson Fitter provides a tool for estimating evolutionary and demographic parameters.

The null model provided by the tool has a two-fold use: it can be used as a method of comparison to determine whether or not the sample had indeed originated from one unique genetic patriarch; and, for those samples that do meet such assumptions, the tool can provide an estimate on the time since the Most Recent Common Ancestor (MRCA) and a test for star-phylogeny evolution.

Time since the MRCA is estimated based on a model of exponential evolution and random accumulation of mutations. This is the case early in the infection, when the population is expanding, and selective pressure from the host has not yet started. Under such scenario, the pairwise Hamming Distance (HD, the number of bases at which any two sequences differ) frequency counts are expected to follow a Poisson distribution with main parameter given by (from minimizing the Log-Likelihood function)

(0.1)

where Y=(Y0, ... ,Yn) are the pairwise HD frequency counts (i.e. Y0=number of identical pairs, Y1=number of pairs that differ at one position, etc.).

Assuming a generation time τ of 2 days (Markowitz et al., J. Virol. 2003 (77) 5037-5038), a mutation rate ε =2.16 x 10-5 (Mansky and Temin, J. Virol. 1995 (69) 5087-5094), and a basic reproductive ratio R0=6 (Stafford et al., J. Theor. Biol. 2000 (203) 285-301), we theorize a model of asynchronous infections in which every infected cells gives rise to R0/2 newly infected cells at time τ1 and another R0/2 at time τ2. Under such model, the time since MRCA is estimated as

(0.2)

where λ has been computed as in equation (0.1), and (for details on how the above formula has been derived see Lee et al., J Theor Biol. 2009 Aug 4). As mentioned above, the basic reproductive ratio and the mutation rate were implemented according to the parameters studied in HIV. However, the model can be successfully applied to any other population that shows similar growth patterns (such as HCV, for example). In that case, in the upload page, the user is given the option to change the default mutation rate to any appropriate value between 0 and 10-5. After estimating the time since MRCA, Poisson Fitter performs a Χ2 GOF test for dependent cells (details in Lee et al., J Theor Biol. 2009 Aug 4) and computes theoretical frequency counts for the pairwise Hamming Distance if the sample were to follow a star-phylogeny. This is done as follows: let X0 the number of sequences identical to the consensus, X1 the number that differ from the consensus at exactly one position, and so on, up to Xm. We derive the self-convolution frequencies , where n=2m, from the arithmetic convolution of the frequencies Xk with themselves using the formula

(0.3)

for k=1, ..., n. For example, , and so on. When the sample follows a star-like phylogeny, one finds that where Yi is the number of pairs with HD=i (the pairwise HD frequency counts as defined above). The exploration of the output will help the reader in deciding whether the analyzed samples do meet the assumptions implicit in the null model or not.

The standard deviation is calculated with different methods, depending on whether the sample size on input is greater or less than 100. For "small" samples (<100 sequences per patient), a jackknife resampling is used to estimate the deistribution of the parameter lambda (the mean of the Poisson distribution), and then a standard deviation estimated from the resampling distribution. For samples with 100 or more sequences instead, we estimate the standard deviation using U-statistics (Gilbert, Rossini, Shankarappa, Biometrics, March 2005).

Output

Compressed files: For each input alignment, a fasta file gets created with the consensus sequence followed by a summary sequence that displays all mutations found across the entire sample. This type of output is useful when testing for overall APOBEC enrichment (see below).

Hypermut results: If the user chooses to remove APOBEC positions and/or APOBEC enriched sequences, each alignment is tested for APOBEC enrichment through the tool Hypermut, and p-values below the threshold selected by the user are displayed in the following format:


The tool tests hypermutation for each sequence in the alignment, as well as for "overall" enrichment, that is, a situation in which APOBEC mutations are scattered throughout the sample, but no single sequence is significantly hypermutated. When testing for overall enrichment, the tool creates a "compressed" file, which consists of the consensus sequence together with a sequence containing all the mutations found across the sample. This "summary" sequence is then tested for APOBEC enrichment using Hypermut. Compressed files will appear in the above table in one row, and the sequence name will be "compressedMutations". If the corresponding p value is below the set threshold, it means that the alignment was found to be OVERALL significantly enriched for APOBEC. The user sets a threshold of significance (default p=0.1) below which the sample or sequence is declared hypermutated. The corresponding rows will be highlighted in the above table.

The user can decide how to correct for hypermutation. The options are:

  1. Remove the sequences in the alignment that are found to be significantly hypermutated (hypermut p value below threshold). The created file will be named SampleIDnoAPOBECseq.fasta
  2. Remove APOBEC positions, in which case all positions in the alignment that correspond to a G in the consensus and follow an APOBEC3G/F pattern (Harris et al. Nat. Rev. Imm. 2004 (4) 868-877; Simon et al. PLoS Path. 2005 1:e6; Bourara et al. PLoS Path. 2007 (3) 1477-1485) will be removed throughout the alignment (resulting in shorter sequences). The created file will be named SampleIDnoAPOBECpos.fasta

NOTE: If the user selects a certain hypermutation p-value threshold (first option), then new, APOBEC-removed files will be created only if the hypermut tool will detect APOBEC enrichment below the p-value threshold. If the user wishes to remove APOBEC in all samples regardless of whether hypermutation is significant of not, then the second option should be checked.

In either case both corrected and non-corrected files will be analyzed by the program, so that the user can evaluate the effect of removing APOBEC from the alignment. Links to the APOBEC-cleaned files are provided in the output for the user to download.

Log Likelihood - Estimated Parameters: This is a table with all the computed statistics for each alignment. It can be viewed by clicking on the link in the output page, and can be downloaded as a text file by clicking on the link "View As Data File" next to the table title. The table contains the following columns:
SAMPLE: unique alignment identifier, extrapolated from the name in the consensus sequence.
LAMBDA: parameter of the Poisson distribution that best fits the HD frequency counts.
ST.DEV.: standard deviation on the parameter estimated above.
NSEQ: number of sequences in the alignment.
NBASES: number of bases in the alignment.
meanHD: mean intersequence Hamming Distance (should be equal to or very close to Lambda).
maxHD: maximum intersequence Hamming Distance.
DAYS(CI): number of days since MRCA with 95% Confidence Intervals.
Chi2: χ2 statistic from Goodness of Fit test.
DF: degrees of freedom on the χ2 statistic.
GOF_PVAL: P value from the Goodness of Fit (low p-values indicate divergence from a Poisson). NOTE: sometimes the fit is so poor (for example when the maximum Hamming Distance is too high to be compatible with a homogeneous infection) that the GOF fails and NAs are outputted. The user should take that as an indication of divergence from a Poisson distribution.


Convolution Estimates: This is a text file to be used as an internal check for star-phylogeny. As the program runs, for each alignment the following data is appended to the file:


The first column displays the HD value, the second column the pairwise HD frequency counts (from 0 to the maximum), and the third the expected values if the sample were to follow a star-like phylogeny. In the above example, because column 2 is nearly a perfect match of column 3 (ignore NAs), we deem the sample 1006 star-like. A divergence of 10% or less is allowed between the two columns, but notice that this is not a statistical test, rather a completely arbitrary value. When the two columns don't match perfectly we recommend to also check the actual tree for a more accurate verification on whether the sample follows a star phylogeny or not. If a divergence greater than 10% is found, the above output will display "does not follow a star-phylogeny" in a separate row at the end of the counts.

Figure files: For each alignment, the program outputs two figures. The first one plots the histogram of the pairwise HD frequency counts and the best fitting Poisson distribution (in red, as a continuous line for better visualization). The following is an example:


The second figure plots once again the pairwise HD histogram (in blue) and the theoretical pairwise HD frequency counts (in red) if the sample were to follow a star-phylogeny, as computed in (0.3). The alignment has evolved according to a star-like topology when the red line follows closely the histogram, as illustrated in the following example:


The following instead is an example where the test for star-phylogeny fails:


When in doubt, the user should check the convolution result file where the actual numbers are displayed, to see if they coincide and if not how divergent they are.

Note: When the sample size exceeds 35,000, the star-phylogeny graphs are plotted on a base 10 logarithmic scale. This is motivated by the fact that for large alignments, often the number of identical sequences far exceeds the unique ones, hence a linear scale masks the values at the tail of the distribution. Here is an example of a logarithmic plot:


In the above figure, the actual counts of the HD distribution are represented by the black dots (connected by a line for easier visualization), the star-phylogeny values by the blue dots, and finally the best fitting Poisson by the red line.

Acknowledgements

This tool uses R software. Thanks to the R Team:
R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, www.R-project.org.

Return to Poisson-Fitter

Go Back to the Poisson Fitter submission form.


last modified: Tue Jan 22 13:27 2013


Questions or comments? Contact us at seq-info@lanl.gov.