Calculates the positional distribution of oligonucleotides in a set of sequences, and detects those which significantly depart from a homogeneous distribution.
This program is useful for detecting patterns with a positional bias in large sets of sequences (e.g. a few hundred or a few thousand sequences) aligned on some reference (e.g. the start codon). Significant patterns are not likely to be found in smaller sequence sets. This tool is thus typically designed for pattern discovery in full genomes rather than in small families of co-expressed genes.
van Helden, J., Olmo, M. & Perez-Ortin, J. E. (2000). Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 28(4), 1000-1010. Pubmed 10648794
- Increasing the class interval reduces the resolution of the analysis, but it smoothes the distributions.
- When class intervals are too short, the number of occurrences per class may become too low, and the condition of applicability for the chi2 will not be fulfilled for some oligonucleotides.
- For short oligonucleotides, smaller class intervals can be used, since there are fewer distinct possible words, and the number of occurrences per word thus increases.
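The trade-off between resolution and counts per class can be illustrated with a minimal Python sketch. It is not the program's own code: the function name `positional_distribution`, the sample positions, and the choice to align class boundaries on multiples of the interval are assumptions made for illustration.

```python
from collections import Counter

def positional_distribution(positions, class_interval):
    """Count occurrences per position class of width class_interval.

    Assumption: class boundaries fall on multiples of class_interval.
    """
    counts = Counter()
    for pos in positions:
        # Floor division also groups negative positions consistently:
        # with an interval of 10, position -1 falls in the class starting at -10.
        class_start = (pos // class_interval) * class_interval
        counts[class_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical positions of one oligonucleotide, relative to the sequence end
positions = [-95, -88, -61, -47, -42, -41, -38, -12, -5, -1]

narrow = positional_distribution(positions, 10)  # high resolution, few hits per class
wide = positional_distribution(positions, 50)    # low resolution, more hits per class
print(narrow)
print(wide)
```

With the 10-letter interval most classes contain a single occurrence, whereas the 50-letter interval gathers the same ten occurrences into two classes.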
Origin: reference position for class grouping. Positive values specify a position relative to the sequence start, negative values a position relative to the sequence end. The default is to use the sequence end as reference, which is more relevant for upstream sequences. Beware: the default value -0 has a specific meaning. For example:
- 0 : sequence start is used as reference, i.e. the first letter of the sequence is the position 1.
- 100 : the 100th letter of the sequence is the reference (position 0), i.e. the 101st letter is position 1, the 99th letter position -1, ...
- -100 : the 100th letter counting from the sequence end is the reference (position 0), i.e. the letter that follows it is position 1, the letter that precedes it position -1, ...
- -0 (default) : sequence end is used as reference, i.e. the last letter of the sequence is the position -1.
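The coordinate conversion implied by these origin values can be sketched as follows. This is an illustrative Python sketch, not the program's implementation: the function `relative_position` is hypothetical, and it assumes that negative origins count back from just after the last letter, in line with the rule that negative values are relative to the sequence end.

```python
def relative_position(index, seq_len, origin="-0"):
    """Convert a 1-based letter index into a position relative to the origin.

    The origin is passed as a string so that "-0" (sequence end) can be
    told apart from "0" (sequence start).
    """
    offset = int(origin.lstrip("+-"))
    if origin.startswith("-"):
        # Assumption: negative origins count back from just after the last letter.
        reference = seq_len + 1 - offset
    else:
        reference = offset
    return index - reference

# Checks on a hypothetical 500-letter sequence
print(relative_position(1, 500, "0"))     # first letter, origin at sequence start
print(relative_position(100, 500, "100")) # the 100th letter is the reference
print(relative_position(500, 500, "-0"))  # last letter, origin at sequence end
```

Passing the origin as a string is a design choice of this sketch: an integer cannot distinguish -0 from 0.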
Condition of applicability for the chi2 test:
A condition of applicability for the chi2 test is that the expected frequency should be >= 5 for each class. By default, this condition is tested for each oligonucleotide, and, for those which do not fulfil it, the observed chi2 is displayed between curly brackets.
- The test for the condition of applicability can be inactivated, in order to obtain real values in this column. It is not recommended to inactivate it, since this may lead to false positives.
- This option filters out the oligonucleotides which do not fulfil the condition of applicability for the chi2 test. The output is lighter and easier to read.
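The test and its condition of applicability can be sketched in Python as follows. This is a simplified illustration, not the program's code: the functions `chi2_homogeneity` and `format_chi2` are hypothetical, and the sketch assumes that all classes contain the same number of positions, so that the expected count is identical in every class.

```python
def chi2_homogeneity(observed, min_expected=5.0):
    """Return (chi2, applicable) for one oligonucleotide.

    observed: occurrences per position class.
    Assumption: all classes are of equal size, so a homogeneous
    distribution gives the same expected count in every class.
    """
    total = sum(observed)
    if total == 0:
        return 0.0, False
    expected = total / len(observed)
    chi2 = sum((obs - expected) ** 2 / expected for obs in observed)
    applicable = expected >= min_expected
    return chi2, applicable

def format_chi2(chi2, applicable):
    """Display the chi2 between curly brackets when the condition fails."""
    text = f"{chi2:.2f}"
    return text if applicable else "{" + text + "}"

frequent = chi2_homogeneity([10, 10, 40, 20])  # expected count 20 per class
rare = chi2_homogeneity([1, 0, 6, 1])          # expected count 2 per class

print(format_chi2(*frequent))
print(format_chi2(*rare))
```

The second word has an expected count of 2 per class, so its chi2 is flagged with curly brackets. Filtering would simply drop every word for which `applicable` is false.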
The analysis can be performed with oligonucleotides of any size between 1 and 8. Selecting size 1 amounts to counting the alphabet utilization within the input sequences. For the detection of regulatory sites, we recommend starting with an analysis of hexanucleotides (size=6), and scanning sizes between 4 and 8. When a pattern is significantly overrepresented, it generally appears in the analyses with various sizes.
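The effect of the oligonucleotide size can be illustrated with a short Python sketch. The function `count_oligos` and the two toy sequences are hypothetical. The sketch counts overlapping words on a single strand and ignores positions, which is enough to show that size 1 reduces to letter counts and that the number of distinct words grows as 4 to the power of the size.

```python
from collections import Counter

def count_oligos(sequences, size):
    """Count every word of the given size, scanning one strand with overlaps."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - size + 1):
            counts[seq[start:start + size]] += 1
    return counts

seqs = ["ACGTAC", "TTACGT"]  # toy input, far too small for a real analysis

letters = count_oligos(seqs, 1)  # size 1: alphabet utilization
tetra = count_oligos(seqs, 4)    # size 4: tetranucleotides

# Number of distinct possible words for each size from 1 to 8
possible_words = {size: 4 ** size for size in range(1, 9)}

print(dict(letters))
print(dict(tetra))
print(possible_words)
```

There are 4096 possible hexanucleotides and 65536 possible octanucleotides, so the occurrences are spread over many more words as the size increases.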
Count on: (single or both strands)
By selecting "both strands", the occurrences of each oligonucleotide are summed on both strands. This makes it possible to detect elements which act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
Group reverse complement together in the output
(only valid for two strand analysis). This parameter does not affect the counting itself, but only the format of output. If this option is NOT checked, two separate lines are used to show a word and its reverse complement. This is redundant but might be useful for compatibility with other programs.
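Two-strand counting and the grouping of reverse complements can be sketched in Python as follows. The function names, the sample counts and the `word|reverse complement` label are illustrative assumptions, not the program's actual output format. The sketch also assumes that a word identical to its own reverse complement (e.g. ACGCGT) is counted once rather than twice.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    """Return the reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]

def sum_both_strands(single_strand_counts):
    """Add to each word the occurrences of its reverse complement."""
    both = Counter()
    for word, n in single_strand_counts.items():
        both[word] += n
        rc = reverse_complement(word)
        if rc != word:
            # Assumption: reverse-palindromic words are not counted twice.
            both[rc] += n
    return both

def group_reverse_complements(both_strand_counts):
    """Report one entry per pair instead of two redundant entries."""
    grouped = {}
    for word, n in both_strand_counts.items():
        rc = reverse_complement(word)
        grouped[f"{min(word, rc)}|{max(word, rc)}"] = n
    return grouped

single = Counter({"GATAAG": 3, "CTTATC": 1, "ACGCGT": 2})  # hypothetical counts
both = sum_both_strands(single)
grouped = group_reverse_complements(both)
print(dict(both))
print(grouped)
```

Grouping changes only the presentation: GATAAG and CTTATC carry the same summed count, so a single line is enough to report the pair.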
Prevent overlapping matches
Periodic patterns (e.g. AAAAAA, ATATAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favours additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences twice.
For example, TATATATATATA would represent
- 2 occurrences of TATATA when self-overlap is prevented
- 4 occurrences of TATATA when self-overlap is allowed
Note that Z-scores introduce a correction for self-overlapping patterns (see van Helden et al., 1999), but Z-scores are only valid for very large sequence sets (for example a set of 6000 downstream sequences), and are not appropriate for small gene clusters such as those extracted from DNA chip experiments.
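The two counting modes for self-overlapping words such as TATATA can be sketched in Python as follows. The function `count_occurrences` is hypothetical, and scanning from left to right while skipping the full word length after each match is one simple way to prevent overlaps, assumed here for illustration.

```python
def count_occurrences(sequence, word, allow_overlap=True):
    """Count occurrences of word in sequence, scanning from left to right."""
    # After a match, move on by one letter (overlaps allowed)
    # or by the full word length (overlaps prevented).
    step = 1 if allow_overlap else len(word)
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + step)
    return count

sequence = "TATATATATATA"
without_overlap = count_occurrences(sequence, "TATATA", allow_overlap=False)
with_overlap = count_occurrences(sequence, "TATATA", allow_overlap=True)
print(without_overlap, with_overlap)
```

On the 12-letter string TATATATATATA this sketch returns 2 occurrences without overlaps and 4 with overlaps (matches starting at letters 1, 3, 5 and 7).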