1997-98 by
Input sequence:
The sequence that will be analyzed. Multiple sequences can be entered
at once with several sequence formats.
Format:
Input sequence format. Various standards
are supported.
Sequence type:
Input sequence type
Purge sequences (highly recommended)
When checked,
large duplicated regions (alignments of >= 40 bp with fewer than 3
mismatches) are filtered out before analysis. Purging is essential
for any motif discovery process, to avoid a bias due to
non-independence of sequences. Purging is performed with the programs
mkvtree and vmatch developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).
Oligonucleotide size:
The analysis can be performed with oligonucleotides of any size between
1 and 8. Selecting size 1 amounts to counting the alphabet utilization
within the input sequences. For the detection of regulatory sites, we
recommend starting with an analysis of hexanucleotides (size=6), and
scanning sizes between 4 and 8. When a pattern is significantly
overrepresented, it generally shows up in the analyses at several
sizes.
Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide
are summed on both strands. This allows the detection of elements which act
in an orientation-insensitive way (as is generally the case for yeast
upstream elements).
Group reverse complement together in the output
(only
valid for two strand analysis). This parameter does not affect the
counting itself, but only the format of output. If this option is NOT
checked, two separate lines are used to show a word and its reverse
complement. This is redundant but might be useful for compatibility
with other programs.
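The two options above can be illustrated with a minimal Python sketch. It is not the code of the program itself: the function names and the dictionary output are assumptions made for illustration, and the grouped line is labelled by the alphabetically first member of each pair, which is also an assumption. Reverse palindromes are counted once, following the double strand treatment described further down in this page.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of a DNA word (upper-case ACGT)."""
    return word.translate(COMPLEMENT)[::-1]

def count_oligos(sequences, k, both_strands=True, group_rc=True):
    """Count oligonucleotides of size k (hypothetical helper)."""
    # Count every word of size k on the direct strand.
    direct = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            direct[seq[i:i + k]] += 1
    if not both_strands:
        return dict(direct)
    # Sum each word with its reverse complement.
    summed = {}
    for word in set(direct) | {reverse_complement(w) for w in direct}:
        rc = reverse_complement(word)
        # A reverse palindrome is its own reverse complement: count it once.
        summed[word] = direct[word] if word == rc else direct[word] + direct[rc]
    if group_rc:
        # One line per pair, kept under its alphabetically first member.
        return {w: n for w, n in summed.items() if w <= reverse_complement(w)}
    # Two separate (redundant) lines per pair.
    return summed
```

With grouping disabled, GATAAG and CTTATC appear on two lines carrying the same count; with grouping enabled, only one line is kept for the pair.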
Prevent overlapping matches
Periodic patterns (e.g. AAAAAA, ATATAT) have an aggregative tendency,
i.e. each occurrence of such a pattern strongly favors additional
occurrences in its immediate vicinity. This introduces a bias to most
statistics (binomial, log-likelihood). A simple way to correct for
this bias is to avoid counting mutually overlapping occurrences
twice.
For example, TATATATATATA would represent
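The difference between the two counting modes can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and the rule used here (scan from left to right and skip past each retained match) is an assumption, since the text does not say which of the overlapping occurrences are kept.

```python
def count_occurrences(sequence, word, allow_overlap=True):
    """Count occurrences of word in sequence on a single strand."""
    count = 0
    start = 0
    # With overlaps allowed, resume one position after each match;
    # otherwise resume after the end of the match.
    step = 1 if allow_overlap else len(word)
    while True:
        pos = sequence.find(word, start)
        if pos == -1:
            return count
        count += 1
        start = pos + step

overlapping = count_occurrences("TATATATATATA", "TATATA")             # 4
non_overlapping = count_occurrences("TATATATATATA", "TATATA", False)  # 2
```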
Background model
Various probabilistic models can be used to estimate the expected
frequency of each oligonucleotide.
Attention! The results are dramatically affected by the choice of expected frequencies, which is the main specificity of this program. It has been shown that, for the detection of regulatory sites in yeast upstream sequences, the best choice is to estimate the expected oligonucleotide frequencies on the basis of the frequencies observed in the set of all upstream non-coding sequences from the genome. For the same purpose, choosing "equiprobable residues" would be totally inefficient, and "Residue frequencies from input sequence" works poorly.
Pre-calculated tables are used to estimate expected oligonucleotide frequencies (background frequencies). These tables were obtained by counting all oligonucleotide frequencies (from size 1 to 8) in different sequence types, for each organism.
For example, with a Markov chain of order 4 :
P(GATAAC) = P(GATAA) * P(C|GATAA) = P(GATAA) * P(ATAAC) / P(ATAA)
Expected(GATAAC) = observed(GATAA) * observed(ATAAC) / observed(ATAA)
For words of size k, the highest possible order is k-2. A Markov order of 0 amounts to using the observed residue frequencies to calculate expected oligomer frequencies (no dependency between neighbouring residues).
The higher the Markov order, the more stringent the analysis: specificity is increased, but there is a loss of sensitivity, i.e. some relevant patterns might be overlooked. The optimal Markov order depends on the size of the sequence set. For small gene families (e.g. 10 sequences of 800 bp), taking an order > 1 would result in a loss of sensitivity. For sequence sets of 1 Mb, a Markov chain of order 3 is optimal for hexanucleotides.
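The Markov chain estimate generalizes to any order between 0 and k-2. The Python sketch below is one way to write it, assuming that the relative frequencies of all the required sub-words are available in a dictionary; the function name and the freq argument are hypothetical, and the numbers in the example are made up.

```python
def markov_expected_frequency(word, order, freq):
    """Expected frequency of word under a Markov chain of the given order.

    freq maps sub-words to their observed relative frequencies
    (hypothetical input; it must contain every sub-word of size
    order and order + 1 that occurs in word).
    """
    k = len(word)
    if not 0 <= order <= k - 2:
        raise ValueError("order must be between 0 and k-2")

    def f(sub):
        # The empty context (order 0) has probability 1.
        return 1.0 if sub == "" else freq[sub]

    # Probability of the prefix, then one transition per remaining residue.
    prob = f(word[:order])
    for i in range(order, k):
        context = word[i - order:i]
        prob *= f(context + word[i]) / f(context)
    return prob

# Made-up frequencies, for illustration only.
toy = {"GATA": 0.004, "GATAA": 0.002, "ATAA": 0.005, "ATAAC": 0.001}
expected = markov_expected_frequency("GATAAC", 4, toy)
```

For order 4 the product simplifies to freq(GATAA) * freq(ATAAC) / freq(ATAA), which is the formula given above. For order 0 it reduces to the product of the residue frequencies.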
GATAAG
    G & ATAAG
    GA & TAAG
    GAT & AAG
    GATA & AG
    GATAA & G
The expected frequency of each segmented pair is the product of the expected frequencies of its members. The expected word frequency is the maximum expected pair frequency.
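The segmentation rule can be sketched in a few lines of Python. The function names and the member_freq dictionary (expected frequencies of the shorter words) are assumptions made for this illustration.

```python
def segmentations(word):
    """All ways to cut word into a (prefix, suffix) pair of non-empty parts."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]

def max_pair_expected_frequency(word, member_freq):
    """Maximum, over all segmented pairs, of the product of member frequencies.

    member_freq is a hypothetical dictionary giving the expected
    frequency of every prefix and suffix of word.
    """
    return max(member_freq[a] * member_freq[b] for a, b in segmentations(word))

pairs = segmentations("GATAAG")
```

For a word of size k there are k-1 segmented pairs, so a hexanucleotide such as GATAAG gives five pairs.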
You can upload your own table of expected frequencies. This option can be useful if you are working with an organism which is not supported on the web server.
File format: The expected frequency file must be a tab-delimited text file, with one row per oligonucleotide. The first column contains the oligonucleotide, the second column the expected frequency. Oligonucleotides must be of the size selected for the analysis. Examples can be found in the Data folder.
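A file in this format can be checked before upload with a short Python sketch such as the one below. The function name is hypothetical, and skipping empty lines is an assumption that goes beyond the format description above.

```python
import csv
import io

def read_expected_frequencies(handle, k):
    """Read a tab-delimited table: oligonucleotide, expected frequency."""
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines (assumption: they are tolerated).
        if not row or not row[0].strip():
            continue
        if len(row) < 2:
            raise ValueError("missing expected frequency for " + row[0])
        oligo = row[0].strip()
        if len(oligo) != k:
            raise ValueError(oligo + " is not of size " + str(k))
        table[oligo] = float(row[1])
    return table

# Made-up content, for illustration only.
example = io.StringIO("gataag\t0.0003\ncttatc\t0.0003\n\n")
frequencies = read_expected_frequencies(example, 6)
```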
How to generate an expected frequency file ?
An expected frequency file can be generated with oligo-analysis
itself.
When the background frequencies are based on a small sequence set, there is a risk of observing in the test sequences some oligomers which were totally absent from the background sequences. This is a problem, since these words are considered to have a probability of 0.
To circumvent this problem, a pseudo-weight can be defined, which must be a number between 0 and 1. Expected frequencies are then corrected by a pseudo-frequency, which is the pseudo-weight divided by the number of possible patterns.
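The text does not spell out how the correction is applied. The Python sketch below assumes one common convention, in which the observed frequency is scaled by (1 - pseudo-weight) before the pseudo-frequency is added, so that the corrected frequencies still sum to 1. This mixing rule and the function name are assumptions, not a description of the program.

```python
from itertools import product

def apply_pseudo_frequency(freq, k, pseudo_weight):
    """Return corrected frequencies for all 4^k words of size k.

    Assumed rule: corrected = (1 - pseudo_weight) * freq + pseudo_frequency,
    where pseudo_frequency = pseudo_weight / number of possible patterns.
    """
    if not 0 <= pseudo_weight <= 1:
        raise ValueError("pseudo_weight must be between 0 and 1")
    words = ["".join(letters) for letters in product("ACGT", repeat=k)]
    pseudo_frequency = pseudo_weight / len(words)
    # Words absent from the background table get the pseudo-frequency only.
    return {w: (1 - pseudo_weight) * freq.get(w, 0.0) + pseudo_frequency
            for w in words}

# Made-up background in which only two dinucleotides were observed.
corrected = apply_pseudo_frequency({"AA": 0.5, "AC": 0.5}, 2, 0.01)
```

With this rule, a word that was absent from the background table receives a small non-zero expected frequency instead of 0.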
EXPECTED OCCURRENCES

    Exp_occ = p * T = p * SUM_{j=1}^{S} (Lj + 1 - k)

where
    p  = probability of the pattern. Several models are supported for
         estimating the prior probability (see options -a, -expfreq and -bg).
    S  = number of sequences in the sequence set.
    Lj = length of the jth regulatory region.
    k  = length of the oligomer.
    T  = number of possible matching positions.

PROBABILITY OF SEQUENCE MATCHING

The probability to find at least one occurrence of the pattern within a single sequence is:

    q = 1 - (1 - p)^T

with the same abbreviations as above.

EXPECTED NUMBER OF MATCHING SEQUENCES

In this counting mode, only the first occurrence in each sequence is taken into consideration. We thus have to calculate a probability of first occurrence.

    Exp_ms = n (1 - (1 - p)^T)

with the same abbreviations as above.

Correction for autocorrelation (from Mireille Regnier):

    Exp_ms_corrected = n (1 - (1 - p/a)^T)

where a is the coefficient of autocorrelation.

PROBABILITY OF THE OBSERVED NUMBER OF OCCURRENCES (BINOMIAL)

The probability to observe exactly obs occurrences in the whole family of sequences is calculated by the binomial:

    P(obs) = bin(p,T,obs) = T! / (obs! * (T-obs)!) * p^obs * (1-p)^(T-obs)

where
    obs is the observed number of occurrences,
    p   is the expected frequency for the pattern,
    T   is the number of possible matching positions, as defined above.

The probability to observe obs or more occurrences in the whole family of sequences is calculated by the sum of binomials:

    P(>=obs) = SUM_{i=obs}^{T} P(i) = 1 - SUM_{i=0}^{obs-1} P(i)

OVER/UNDER-REPRESENTATION

By default, the program calculates the probability to have at least obs occurrences:

    P(>=obs) = SUM_{i=obs}^{T} P(i)

With the option -under, the program calculates the probability of having less than obs occurrences:

    P(<obs) = SUM_{i=0}^{obs-1} P(i)

The option -under does not affect the other statistics (z-score, log-likelihood). For the z-score, negative values can be used to assess word under-representation.
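The expected occurrences and the binomial probability of over-representation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program's own code: the function names are hypothetical, sequences shorter than k are assumed to contribute no position, and the binomial terms are computed through logarithms to avoid overflow of the factorials.

```python
from math import exp, lgamma, log

def possible_positions(seq_lengths, k):
    """T = SUM_j (Lj + 1 - k); shorter sequences contribute 0 (added guard)."""
    return sum(max(length + 1 - k, 0) for length in seq_lengths)

def expected_occurrences(p, seq_lengths, k):
    """Exp_occ = p * T."""
    return p * possible_positions(seq_lengths, k)

def binomial_pmf(i, T, p):
    """P(i) = T! / (i! (T-i)!) * p^i * (1-p)^(T-i); requires 0 < p < 1."""
    log_coef = lgamma(T + 1) - lgamma(i + 1) - lgamma(T - i + 1)
    return exp(log_coef + i * log(p) + (T - i) * log(1.0 - p))

def binomial_tail(obs, T, p):
    """P(>=obs) = SUM_{i=obs}^{T} P(i), summed directly from obs upwards."""
    total = 0.0
    for i in range(obs, T + 1):
        term = binomial_pmf(i, T, p)
        total += term
        # Past the mode the terms decrease; stop once they are negligible.
        if i > (T + 1) * p and term < total * 1e-16:
            break
    return min(total, 1.0)

# Example: 10 sequences of 800 bp analyzed with hexanucleotides.
T = possible_positions([800] * 10, 6)
exp_occ = expected_occurrences(1 / 4096, [800] * 10, 6)
```

Summing the upper tail directly, instead of computing 1 minus the lower tail, keeps some precision when the probability is very small, which is the case of interest for over-represented words.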
SPECIFIC TREATMENT FOR DOUBLE STRAND COUNTS

When occurrences are counted on both strands, each pattern is grouped with its reverse complement. For reverse-palindromic patterns, probabilities are calculated on the basis of the single strand count, since the occurrence on the reverse complement strand is completely dependent on that on the direct strand. A more biological justification is that, although the word is found on both strands in a string representation of the sequences, at the structural level there is a single binding site for the factor.

On the contrary, for non-palindromic patterns, occurrences on the direct and reverse complement strands represent distinct binding sites. Thus,

    Obs_occ(W|Wr)  = Obs_occ(W)  + Obs_occ(Wr)
    Exp_freq(W|Wr) = Exp_freq(W) + Exp_freq(Wr)

where
    W  is a given word
    Wr is the reverse complement of W

Probabilities are then calculated as above, on the basis of the event W|Wr instead of simply W.

E-VALUE

The probability of occurrence by itself is not fully informative, because the threshold must be adapted to the number of patterns considered. Indeed, a simple hexanucleotide analysis amounts to considering 4096 hypotheses. The E-value represents the expected number of patterns which would be returned at random for a given P-value (probability).

    E-value = NPO * P(>=obs)

where NPO is the number of possible oligomers of the chosen length (e.g. 4096 for hexanucleotides).

Note that when searches are performed on both strands, NPO is corrected for the fact that non-palindromic patterns are grouped by pairs (for example, there are 2080 patterns when hexanucleotides are counted on both strands).

SIGNIFICANCE INDEXES

The significance index is simply a negative logarithm conversion of the E-value (in base 10). It is calculated as follows:

    Sig_occ = -log10(E-value)

This index is very convenient to interpret: the highest values correspond to the most exceptional patterns.
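The number of possible oligomers, the E-value and the significance index can be sketched in Python as follows. The function names are hypothetical. The count for both strands relies on the fact that reverse palindromes only exist for even sizes, where there are 4^(k/2) of them; this reproduces the 2080 patterns quoted above for hexanucleotides.

```python
from math import log10

def n_possible_oligos(k, both_strands=False):
    """NPO: number of distinct patterns of size k."""
    total = 4 ** k
    if not both_strands:
        return total
    # Reverse palindromes stay alone; all other words are grouped by pairs.
    palindromes = 4 ** (k // 2) if k % 2 == 0 else 0
    return (total - palindromes) // 2 + palindromes

def e_value(p_value, k, both_strands=False):
    """E-value = NPO * P(>=obs)."""
    return n_possible_oligos(k, both_strands) * p_value

def significance(p_value, k, both_strands=False):
    """Sig_occ = -log10(E-value)."""
    return -log10(e_value(p_value, k, both_strands))

npo_single = n_possible_oligos(6)        # 4096
npo_both = n_possible_oligos(6, True)    # 2080
```

A pattern with a significance index of 0 has an E-value of 1, i.e. one such pattern is expected by chance in each analysis.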
OVERLAP COEFFICIENT

The overlap coefficient is calculated as follows (after Pevzner et al. (1989). J. Biomol. Struct & Dynamics 5:1013-1026):

    Kov = SUM_{j=1}^{l} kj (1/4)^j

where
    l  is the pattern length.
    j  is the overlap position, comprised between 0 and l.
    kj takes the value 1 if there is an overlap at pos j, 0 otherwise.

When counts are performed on both strands, overlaps between the pattern and its reverse complement are also taken into account in the same formula.

Z-SCORE

The Z-score is calculated in the following way:

    Zsc = (obs_occ - exp_occ)/sd_occ = (obs_occ - exp_occ)/sqrt(var_occ)

where
    obs_occ is the observed number of occurrences
    exp_occ is the expected number of occurrences
    sd_occ and var_occ are the estimated standard deviation and variance of the occurrences, respectively.

The estimation of the variance is derived from Pevzner et al. (1989). J. Biomol. Struct & Dynamics 5:1013-1026:

    var_occ = exp_occ * (2*Kov - 1 - (2*w-1)*exp_occ)

where w is the oligonucleotide length.

In random sequences, Z-scores are normally distributed. The probability to observe a given number of occurrences can thus be read in the normal table from any book of statistics.

Advantages of the Z-score:
- The Z-score corrects the bias due to self-overlapping of a word, which often leads to overestimating the overrepresentation of such words (e.g. AAAAAA, TATATA).
- Its calculation is very fast. This is especially critical when analyzing very big sequences (whole genomes), where the expected oligonucleotide occurrences are very high (and the binomial calculation very slow).
- The Z-score provides a way to detect both over- and under-represented patterns.

Disadvantages:
- The use of the Z-score assumes that the sequences are infinite.

Recommended thresholds:
=======================
strand  w  P(>=oc)  z-score
-------------------------------
1str    3  0.98437  2.155
1str    4  0.99609  2.66
1str    5  0.99902  3.095
1str    6  0.99976  3.49
1str    7  0.99994  3.83
1str    8  0.99998  4.1
2str    3
2str    4
2str    5
2str    6  0.99952  3.30
2str    7
2str    8
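Instead of reading a normal table, the conversion between Z-score and probability can be done with the Python standard library. The sketch below also shows one possible reading of the threshold table, which is an assumption: in the rows for w = 3 and w = 6, the probability column equals 1 - 1/NPO (NPO being the number of possible patterns), and the z-score column is the corresponding normal quantile. The function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def upper_tail_probability(z):
    """Probability that a standard normal variable is >= z."""
    return 1.0 - STANDARD_NORMAL.cdf(z)

def z_threshold(n_patterns):
    """Z-score whose upper tail probability is 1 / n_patterns.

    Assumed reading of the threshold table: at this threshold, one
    pattern per analysis is expected to be returned by chance.
    """
    return STANDARD_NORMAL.inv_cdf(1.0 - 1.0 / n_patterns)

z_w3_single = z_threshold(4 ** 3)   # close to 2.155
z_w6_single = z_threshold(4 ** 6)   # close to 3.49
z_w6_both = z_threshold(2080)       # close to 3.30
```

Under the same assumption, the empty rows for both strands could be filled from the number of grouped patterns of each size, but the values obtained this way are not part of the original table.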