## RSAT - dyad analysis manual

## Name

dyad-analysis

1998 by

## Description

Detects overrepresented spaced dyads in a set of DNA sequences. A dyad is a pair of oligonucleotides of the same size. They can be separated by a fixed number of bases.

This algorithm can detect binding sites that oligo-analysis misses because of the variability within the spacer region. A typical example of a pattern that dyad analysis detects efficiently is the binding site for the yeast Gal4p transcription factor, which has the consensus

CGGNNNNNWNNNNNCCG
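To make the dyad structure of this consensus concrete, here is a minimal Python sketch that scans a sequence for it. It is an illustration only, not part of dyad-analysis: the function names, the reduced IUPAC table and the test sequence are all hypothetical. Note that the two elements, CGG and CCG, are reverse complements of each other and are separated by 11 bases.

```python
import re

# Subset of the IUPAC nucleotide codes, enough for this consensus.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus.upper())

def find_sites(sequence, consensus):
    """Return the 0-based start positions of all matches (overlaps included)."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]

GAL4_CONSENSUS = "CGGNNNNNWNNNNNCCG"
# Hypothetical test sequence: 3 flanking bases, one site, 3 flanking bases.
example = "TTT" + "CGGAGCACTGTTGACCG" + "AAA"
print(find_sites(example, GAL4_CONSENSUS))
```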

## Options

## Title

(optional)

Title of the data set. It is displayed as the title of the result page.

## Input sequence:

The sequence to be analyzed. Multiple sequences can be entered at once, in any of several sequence formats.

## Format:

Input sequence format. Various standards are supported.

## Purge sequences (highly recommended)

When checked, large duplicated regions (>= 40 bp alignment with fewer than 3 mismatches) are filtered out before analysis. Purging is essential for any motif discovery process, to avoid a bias due to non-independence of sequences. Purging is performed with the programs mkvtree and vmatch, developed by Stefan Kurtz (kurtz@zbh.uni-hamburg.de).

## Oligonucleotide size

This is the size of a single element (half a dyad).

## Spacing

Spacing between the elements of the dyad. The spacing is the number of bases between the end of the first element and the start of the second one.

A single integer value means that the spacing is fixed. Variable spacing can be introduced by entering the min and max values separated by a hyphen. For example, 8-12 means that all occurrences of the dyad with a spacing between 8 and 12 will be counted together and their significance estimated globally. Warning: this is different from scanning the spacing values 8 to 12 one by one.
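The difference between pooled and one-by-one counting can be sketched in Python as follows. This is a simplified illustration, not the program's own code: the function name `count_dyad` and the test sequence are hypothetical, and only one strand is scanned.

```python
def count_dyad(sequence, first, second, min_spacing, max_spacing):
    """Count occurrences of first n{s} second, pooled over min <= s <= max."""
    total = 0
    for start in range(len(sequence)):
        if not sequence.startswith(first, start):
            continue
        # Every spacing in the range contributes to the same pooled count.
        for spacing in range(min_spacing, max_spacing + 1):
            second_start = start + len(first) + spacing
            if sequence.startswith(second, second_start):
                total += 1
    return total

# Hypothetical sequence with one CGGn{4}CCG site and one CGGn{5}CCG site.
seq = "CGGAAAACCGTTCGGAAAAACCG"
pooled = count_dyad(seq, "CGG", "CCG", 4, 5)
one_by_one = [count_dyad(seq, "CGG", "CCG", s, s) for s in (4, 5)]
print(pooled, one_by_one)
```

With a spacing range, a single pooled count (and hence a single significance estimate) is obtained for the dyad, whereas scanning one spacing at a time yields a separate count for each spacing value.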

## Dyad type

To speed up execution, the program can be asked to restrict its analysis to symmetric dyads, with 3 possibilities:

- direct repeats: the second element is the same as the first one
- inverted repeats: the second element is the reverse complement of the first one
- any repeat: analyse both direct and inverted repeats

When the option "any dyad" is selected, the analysis is performed on all dyads, symmetric as well as non-symmetric.

Warning: the number of dyads increases dramatically with this option, and it should not be used for elements wider than 3 nucleotides.
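The following Python sketch enumerates the element pairs for each dyad type, which shows why the "any dyad" option is so much more expensive. The function names and the type labels (`"direct"`, `"inverted"`, `"repeat"`, `"any"`) are hypothetical and chosen for this illustration only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(oligo):
    """Return the reverse complement of a DNA oligonucleotide."""
    return oligo.translate(COMPLEMENT)[::-1]

def enumerate_dyads(w, dyad_type):
    """Return the (first, second) element pairs for the given dyad type."""
    oligos = ["".join(p) for p in product("ACGT", repeat=w)]
    direct = [(o, o) for o in oligos]
    inverted = [(o, reverse_complement(o)) for o in oligos]
    if dyad_type == "direct":
        return direct
    if dyad_type == "inverted":
        return inverted
    if dyad_type == "repeat":
        # Both kinds of repeats; a pair that is both direct and inverted
        # (a palindromic element) is listed only once.
        return sorted(set(direct) | set(inverted))
    if dyad_type == "any":
        return [(a, b) for a in oligos for b in oligos]
    raise ValueError("unknown dyad type: " + dyad_type)

for kind in ("direct", "inverted", "repeat", "any"):
    print(kind, len(enumerate_dyads(3, kind)))
```

For elements of 3 nucleotides, the symmetric types give 64 pairs each, against 4096 pairs when any dyad is allowed.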

## Count on:

(single or both strands)

When "both strands" is selected, the occurrences of each oligonucleotide are summed on both strands. This makes it possible to detect elements that act in an orientation-insensitive way (as is generally the case for yeast upstream elements).
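As a rough illustration of strand-insensitive counting, the Python sketch below sums the occurrences found on the input strand and on its reverse complement. The function names and the test sequence are hypothetical. How the program treats reverse-palindromic patterns is not described here, so the simple convention used below (each strand counted separately) is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def count_occurrences(sequence, oligo):
    """Count occurrences of oligo on the given strand (overlaps included)."""
    return sum(
        1
        for i in range(len(sequence) - len(oligo) + 1)
        if sequence.startswith(oligo, i)
    )

def count_both_strands(sequence, oligo):
    """Sum the occurrences on the input strand and on the reverse strand."""
    forward = count_occurrences(sequence, oligo)
    reverse = count_occurrences(reverse_complement(sequence), oligo)
    return forward + reverse

seq = "CGGAATTTCCGA"  # hypothetical sequence
print(count_occurrences(seq, "CGG"), count_both_strands(seq, "CGG"))
```

With this convention an oligonucleotide and its reverse complement always receive the same two-strand count, which is what makes the result independent of orientation.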

## Prevent overlapping matches

Periodic patterns (e.g. AAAn{0}AAA, TATn{1}TAT) have an aggregative tendency, i.e. each occurrence of such a pattern strongly favours additional occurrences in its immediate vicinity. This introduces a bias in most statistics (binomial, log-likelihood). A simple way to correct for this bias is to avoid counting mutually overlapping occurrences twice.
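The effect of preventing overlaps can be sketched in Python as below. This is a simplified illustration with hypothetical names: it scans one strand from left to right and, when overlaps are prevented, skips past each accepted match. That greedy rule is an assumption, and the exact convention used by the program may differ.

```python
def count_dyad_occurrences(sequence, first, spacing, second, allow_overlap=True):
    """Count occurrences of first n{spacing} second on one strand."""
    dyad_length = len(first) + spacing + len(second)
    count = 0
    pos = 0
    while pos + dyad_length <= len(sequence):
        is_match = sequence.startswith(first, pos) and sequence.startswith(
            second, pos + len(first) + spacing
        )
        if is_match:
            count += 1
            # Without overlap, resume right after the end of this occurrence.
            pos += 1 if allow_overlap else dyad_length
        else:
            pos += 1
    return count

periodic = "TA" * 8  # hypothetical periodic sequence, 16 bases
with_overlap = count_dyad_occurrences(periodic, "TAT", 1, "TAT", allow_overlap=True)
without_overlap = count_dyad_occurrences(periodic, "TAT", 1, "TAT", allow_overlap=False)
print(with_overlap, without_overlap)
```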

For example, the string AAAAAAAAAAAAAA would represent

- 7 occurrences of AAAn{1}AAA when self-overlap is allowed
- 2 occurrences of AAAn{1}AAA when self-overlap is prevented

## Expected frequency calibration

## Background model

Compare dyad frequencies observed in the query sequence to those of a reference sequence (the background model).

Pre-calculated tables are used to estimate expected oligonucleotide frequencies (background frequencies). These tables were obtained by counting all dyad frequencies (monad size 3, spacing from 0 to 20) in different sequence types, for each organism.

- upstream: all upstream regions, allowing overlap with upstream ORFs.
- upstream-noorf: all upstream regions, preventing overlap with upstream ORFs (sequences are clipped to discard upstream ORF sequences).

## Monad frequencies from the input sequence

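The expected frequency of each dyad is the product of the expected frequencies of its two monads (oligonucleotides), each estimated from its observed frequency in the sequence file.

$$\mathrm{exp}(\mathrm{dyad}) = \mathrm{exp}(\mathrm{oligo1}) \cdot \mathrm{exp}(\mathrm{oligo2})$$

## Threshold of significance: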

Thresholds can be imposed to select the most significantly overrepresented patterns. A threshold of 0 on the occurrence significance index is selected by default. This is the most efficient way we found to automatically select the biologically significant regulatory sites, irrespective of oligonucleotide size and of the number and size of the sequences in the input set.

## Output columns

- Expected frequency (exp_frq): the probability of observing the dyad at each position. This value is calculated on the basis of the expected frequency table (see Expected frequency calibration).
- Observed occurrences (obs_occ): the number of occurrences observed for each dyad. Overlapping matches are detected and summed in the counting.
- Expected number of occurrences (exp_occ): the number of occurrences expected for each dyad. This value is calculated on the basis of the oligonucleotide frequency table selected.
- Occurrence probability (occ_pro): the probability of having N or more occurrences, given the expected number of occurrences (where N is the observed number of occurrences).
- Occurrence significance (occ_sig): a conversion of the occurrence probability, taking into account the number of possible dyads (which varies with oligo size) and applying a logarithmic transformation. The highest sig values correspond to the most overrepresented patterns. Sig values higher than 0 indicate overrepresentation.

## Probabilities

Various calibration models can be used to estimate the probability of each oligonucleotide (see above). From there, an expected number of occurrences is calculated and compared to the observed number of occurrences. The significance of the observed number of occurrences is calculated with the binomial formula.

### Expected dyad frequency

If exp(oligo1) is the expected frequency for the first element, and exp(oligo2) is the expected frequency for the second element, then

$$\mathrm{exp}(\mathrm{dyad}) = \mathrm{exp}(\mathrm{oligo1}) \cdot \mathrm{exp}(\mathrm{oligo2})$$

### Number of possible dyads

This number depends on the dyad type selected by the user. When the analysis is restricted to inverted repeats, or to direct repeats, the first element unambiguously determines the second one, thus

$$\mathrm{nb\_poss\_dyads} = \mathrm{nb\_poss\_oligo} = 4^{w}$$

where $w$ is the oligonucleotide length. When any dyad is allowed, each oligonucleotide can combine with any other or with itself, thus

$$\mathrm{nb\_poss\_dyads} = \mathrm{nb\_poss\_oligo} \cdot \mathrm{nb\_poss\_oligo} = 4^{2w}$$

### Expected occurrences

$$\mathrm{Exp\_occ} = p \cdot 2 \cdot \sum_{j=1}^{n} (L_j + 1 - d) = p \cdot T$$

where

- $p$ = expected dyad frequency
- $n$ = number of input sequences
- $L_j$ = length of the jth input sequence
- $d$ = length of the dyad, calculated as $d = 2w + s$, where $w$ is the oligonucleotide length and $s$ is the spacer length
- $T$ = the number of possible matching positions in the whole set of input sequences

The factor 2 stands for the fact that occurrences are summed on both strands (it is omitted when the option -1str is active).

### Probability of the observed number of occurrences

The probability of observing exactly obs occurrences in the whole set of sequences is calculated by the binomial

$$P(\mathrm{obs}) = \mathrm{bin}(p, T, \mathrm{obs}) = \frac{T!}{\mathrm{obs}! \, (T - \mathrm{obs})!} \; p^{\mathrm{obs}} \, (1 - p)^{T - \mathrm{obs}}$$

where obs is the observed number of dyad occurrences, $p$ is the expected dyad frequency, and $T$ is the number of possible matching positions, as defined above.

The probability of observing obs or more occurrences in the whole set of sequences is calculated by the sum of binomials:

$$P(\geq \mathrm{obs}) = 1 - \sum_{j=0}^{\mathrm{obs}-1} P(j)$$

### Significance index

The significance index is a conversion of the occurrence probability, calculated as follows:

$$\mathrm{Sig\_occ} = -\log_{10}\left(\mathrm{NPD} \cdot P(\geq \mathrm{obs})\right)$$

where NPD is the number of possible dyads, calculated as above.
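The chain of calculations described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's implementation: all function names and the numeric inputs are hypothetical, sequences shorter than the dyad are assumed to contribute no positions, and the upper tail is summed directly in log space (mathematically equal to one minus the lower tail, but less prone to rounding errors for very small probabilities).

```python
import math

def expected_dyad_frequency(exp_oligo1, exp_oligo2):
    """exp(dyad) = exp(oligo1) * exp(oligo2)."""
    return exp_oligo1 * exp_oligo2

def possible_dyads(w, any_dyad=False):
    """4^w for direct or inverted repeats, 4^(2w) when any dyad is allowed."""
    return 4 ** (2 * w) if any_dyad else 4 ** w

def matching_positions(seq_lengths, w, s, both_strands=True):
    """T: number of possible matching positions in the whole sequence set."""
    d = 2 * w + s  # total dyad length
    # Assumption: a sequence shorter than the dyad contributes 0 positions.
    positions = sum(max(0, length + 1 - d) for length in seq_lengths)
    return 2 * positions if both_strands else positions

def log_binomial_pmf(p, T, k):
    """Natural log of the binomial probability of exactly k occurrences."""
    log_choose = math.lgamma(T + 1) - math.lgamma(k + 1) - math.lgamma(T - k + 1)
    return log_choose + k * math.log(p) + (T - k) * math.log1p(-p)

def prob_at_least(p, T, obs):
    """P(>= obs), summed over the upper tail k = obs .. T."""
    tail = sum(math.exp(log_binomial_pmf(p, T, k)) for k in range(obs, T + 1))
    return min(1.0, tail)

def significance(p_tail, nb_possible_dyads):
    """Sig_occ = -log10(NPD * P(>= obs))."""
    return -math.log10(nb_possible_dyads * p_tail)

# Hypothetical inputs: two sequences, element size 3, spacer 11, both strands.
p = expected_dyad_frequency(0.02, 0.03)
T = matching_positions([100, 200], w=3, s=11)
exp_occ = p * T
p_tail = prob_at_least(p, T, obs=3)
print(exp_occ, p_tail, significance(p_tail, possible_dyads(3)))
```

The direct summation takes time proportional to T, which is acceptable for a sketch but may be slow for very large sequence sets.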

## For information, contact