matrix-scan

DESCRIPTION

Scan sequences with one or several position-specific scoring matrices (PSSM) to identify instances of the corresponding motifs (putative sites). This program supports a variety of background models (Bernoulli, Markov chains of any order).

AUTHORS

Jacques van Helden <Jacques.van-Helden\@univ-amu.fr>
Jean Valery Turatsinze <jturatsi@bigre.ulb.ac.be>
Morgane Thomas-Chollier <morgane@bigre.ulb.ac.be>

sequences
pattern matching
PSSM

USAGE

matrix-scan -m matrixfile [-i inputfile] [-o outputfile] [-v] [-bgfile backgroundfile|-bgorder #]

INPUT FORMATS

Sequence file

All the formats supported in RSAT can be used as input (default: fasta).

Matrix file

The matrix format is specified with the option -matrix_format. Supported : tab,cb,consensus,gibbs,meme,assembly. Default : tab.

For a description of these format, see convert-matrix -h

OUTPUT FORMAT

The output is a tab-delimited file, with one row per match. This file can directly be used as input for the program feature-map.

SCORING SCHEME

WEIGHT SCORE

The program scans the input sequences with a position-specific scoring matrix (PSSM) by selecting, at each position, a sequence segment (S) of the same length as the matrix, and assigning a score to this segment.

The segment score (weight) is calculated according to the theory developed by Jerry Hertz and Gary Stormo (1999), but with the capability to use Markov chain-based background models, as proposed by Thijs et al. (2001).

The weight of a sequence segment (Ws) is calculated as the log-ratio between two probabilities:

Ws = log[P(S|M)/P(S|B)]

where

proba_M = P(S|M):

The probability to generate the sequence segment given the matrix.

proba_B = P(S|B):

The probability to generate the sequence segment given the background model.

By default, the program uses natural logarithms, but the option -base allows to specify any alternative base (e.g. 2 to obtain bit units);

MOTIF MODEL

The position-specific scoring matrix is generally built from an alignment of transcription factor binding sites. The matrix indicates the absolute frequency (Nij = number of occurrences) of each residue (i = row) at each position (j = column) of the alignment.

Note: some programs use ``vertical'' matrices, where rows represents positions and columns residues. See convert-matrix for a description of PSSM formats.

Pseudo-counts

Relative frequencies can be corrected by a pseudo-count (b) to reduce the bias due to the small number of observations.

The pseudo-count can be shared either in an equiprobable way,

`  S<F''ij=(Nij + b/A)/[SUMi(Nij)+b]>`

or according to residue prior frequencies.

`  S<F''ij=(Nij + b*Pi)/[SUMi(Nij)+b]>`

where

Pi

is the prior frequency for residue i

A

is the size of the alphabet (A=4 for DNA).

b

is the pseudo-count, which is ``shared'' between residues according to their prior frequencies.

BACKGROUND MODELS

An essential parameter of any pattern detection approach is the choice of an appropriate background model. This model is used to estimate the probability for each site to occur by chance, rather than as an instance of the motif.

The program matrix-scan supports Markov models of arbitrary order as background models. A Markov model of order m means that the probability of each residue depends on the m preceding residues in the sequence. Note that a Markov model of order 0 corresponds to a Bernoulli model (each residue is independent from the preceding ones).

Markov models are represented as transition matrices, where each row represents a prefix and each column a residue (suffix), and each cell represents the conditional probability P(r|prefix) of observing residue r at a given position, given the prefix (the m preceding letters).

Background model specification

The background model can be specified in different ways.

-bgfile

This option allows to enter the background model from a background model file. Background model files are tab-delimited files containing the specification of oligonucleotide frequencies. A Markov model of order m=k-1 is automatically obtained from the frequencies of oligonucleotides of length k. There is thus no need to use the option -markov when the background model is secified with a bg file.

The RSAT data folder contains pre-calibrated background model files for all the supported organisms.

\$RSAT/public_html/data/genomes/My_organism/oligo-frequencies/

-bginput

The backgound model is calculated from the whole set of input sequences. This option requires to specify the order of the background model with the option -markov.

-window

The background model is calculated locally at each step of the scan, by computing transition frequencies from a sliding window centred around the considered segment. The model is thus updated at each scanned position. This model is called ``adaptive''. Note that the sliding window must be large enough to train the local Markov model. The required sequence length increases exponentially with the Markov order. This option is thus usually suitable for low order models only (-markov 0 to 1).

Pseudo-frequencies for the background model

The concept of pseudo-count can be extended to pseudo-frequencies for the background model, in order to increase the robustness of BG models when the training sequence set is too small. This is particularly important for the adaptive models, which are trained on relatively short sliding windows (a few hundreds of bases).

The reason for using pseudo-frequencies rather than pseudo-counts is that background models are usually defined in terms of relative frequencies, considered as estimates of prior frequencies. Since the absolute counts used for estimating those probabilities are not always available, we introduce the correction in terms of pseudo-frequencies.

## Jean Valery & Morgane continuent l'explication

P'(r|prefix)=((P(r|prefix) + b/A)/(SUMi[P(i|prefix)]+b)

REFERENCES

The probabilities use in this program were derived from the following papers.

Aerts, S., Thijs, G., Coessens, B., Staes, M., Moreau, Y. & De Moor, B. (2003).

Toucan: deciphering the cis-regulatory logic of coregulated genes. Nucleic Acids Res 31, 1753-64.

Bailey, T. L. & Gribskov, M. (1998).

Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14, 48-54.

Hertz, G.Z., G.W. Hartzell, 3rd, and G.D. Stormo (1990).

Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci, 6(2): p. 81-92.

Hertz, G.Z. and G.D. Stormo (1999).

Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15(7-8): p. 563-77.

Methods for calculating the probabilities of finding patterns in sequences. Comput Appl Biosci 5, 89-96.

Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze, P. & Moreau, Y. (2001).

A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17, 1113-22.

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Display only program options

-bgfile background_file

Background model file.

-bg_format format

Format for the background model file.

Supported formats: all the input formats supported by convert-background-model.

-bginput

Calculate background model from the input sequence set.

-markov

Order of the markov chain for the background model.

This option is incompatible with the option -bgfile.

-window

Size of the sliding window for the background model calculation. When this option is specified, the matrix pseudo-count is equally distributed.

-m matrixfile

Matrix file.

This argument can be used iteratively to scan the sequence with multiple matrices.

-top_matrices #

Only scan with the top # matrices per matrix file. This option is valid for some file formats containing multiple matrices (e.g. consensus, meme, MotifSampler), where the top matrix is generally the most informative.

If several matrix files are specified, the # top matrices of each file are used for scanning the sequences.

-consensus_name

Use the motif (degenerate) consensus as matrix name.

-mlist matrix_list

Matrix list.

Indicate a file containing a list of matrices to be used for scanning the region. This facilitates the scanning of a sequence with a library of matrices (e.g. all the matrices from RegulonDB, or TRANSFAC).

Format: the matrix list file is a text file. The first word of each row is suppose to indicate a file name. Any further information on the same row is ignored.

-matrix_format matrix_format

Matrix format.

-i inputfile

File containing the sequences to scan.

If no input file is specified, the standard input is used. This allows to use the command within a pipe.

-seq_format sequence_format

Sequence format.

Mask lower or uppercases, respecively, i.e. replace selected case by N characters.

-n [skip|score]

Treatment of N characters. These characters are often used in DNA sequences to represent undefined or masked nucleotides.

skip

N-containing regions are skipped.

score

N-containing regions are scored. The probability of an N is 1 for both the background model and the matrix. The N residues will thus contribute neither positively nor negatively to the weight score of the N-containing fragment. This option can be useful to detect sites which are at the border of N-containing regions, or in cases there are isolated N in the sequences.

-o outputfile

If no output file is specified, the standard output is used. This allows to use the command within a pipe.

-pseudo pseudo_counts

Pseudo-count for the matrix (default: 1).

The pseudo-count reflects the possibility that residues that were not (yet) observed in the model might however be valid for future observations. The pseudo-count is used to compute the corrected residue frequencies.

-equi_pseudo

If this option is called, the pseudo-weight is distributed in an equiprobable way between residues.

By default, the pseudo-weight is distributed proportionally to residue priors, except for the -window option where equipseudo is default.

-bg_pseudo #

Pseudo frequency for the background model. Value must be a real between 0 and 1.

If this option is not specified, the pseudo-frequency value depends on the background calculation.

For -bginput and -window, the pseudo frequency is automatically calculated from the length (L) of the sequence following this formula:

`  sqrt(L)/(L+sqrt(L))`

For -bgfile, default value is 0.01.

In other cases, if the length (L) of the training sequence is known (e.g. all promoters for the considered organism), the value can be set manually by using the option -bg_pseudo. In such case, the background pseudo-frequency might be set, as suggested by Thijs et al., to the following value:

`  sqrt(L)/(L+sqrt(L))`
-origin pos

Define pos as the origin for the calculation of positions.

-origin -0 defines the end of each sequence as the origin. The matching positions are then negative values, providing the distance between the match and the end of the sequence.

-base #

Base for the logarithms used in the scores involving a log-likelihood (weight and information content). Default: `exp(1)` (natural logarithms).

A common alternative to natural logarithms is to use logarithms in base 2, in which case the information content is computed in bits.

-decimals #

Number of decimals displayed for the weight score.

Warning: the computation of P-values increases exponentially with the number of decimals. For matrices wih many columns, this can become non-tractable. We thus recommend to use the default value (2 decimals).

-uth param upper_threshold

Threshold on some parameter (-lth: lower, -uth: upper threshold).

Supported threshold fields for the matches : score, pval, eval, sig, normw, proba_M, proba_B, rank , crer_sites , crer_size

Supported threshold fields for score distributions: occ occ_sum inv_cum exp_occ occ_pval occ_eval occ_sig occ_sig_rank

-2str

Scan both strands for DNA sequences.

-1str

single-strand search for DNA sequences.

-return return_fields

List of fields to return.

Supported fields: sites,pval,seq_scores,p_score,rank,normw,limits,weight_limits,disrib,occ_proba,bg_model,bg_residues,matrix,freq_matrix,weight_matrix,crer

sites: Matching sites.

Return the position of each matching site, in a tab-delimited format which can be sed as input by feature-map (format .ft).

pval: site-wise P-value

The site-wise P-value estimates the significance of the weight associated to each site. It is computed from the matrix, according to the probabilities described in Staden (1989), Bailey (1998).

In addition to the P-value, the program exports a column with the significance, defined as sig = -log(P-value). By default, logarithms are calculated in base 10, but this can be modified with the option -base.

seq_scores: sequence-wise score

Score each sequence according to Bailey (1998), with the difference that, instead of computing the product of P-values, we compute the sum of significances.

p_score: return the p_score

This score is given by -log(Pval(w)/Pval_tresh). Where `Pval(w)` is the P-value of the wheight and Pval_tresh the threshold on P-value given by the user.

rank: Rank.

Sort the sites per decreasing values of score (weight), and return the rank value.

The rank is calculated independently for each sequence. In addition, a matrix-specidic rank is calculated for each sequence (rank_pm). This allows to distinguish between multiple matches for a single matrix (homotypic modules), and separate matches for distinct matrices (heterotypic modules reflecting synergy between distinct transcription factors).

A common usage of the rank is to select the top scoring site per sequence (-uth rank 1) or the 3 top scoring sites per sequence (-uth rank 3).

Another possibility is to define a maximal number of matches per matrix in the same sequence (-uth rank_pm 3).

normw: normalized weights.

Normailzed weights are calculated according to Thijs' formula : normw = (W -Wmin)/(Wmax - Wmin). Note that Wmin and Wmax are approximated using a Bernoulli model, for reasons of commputational efficiency.

limits: limits (start, end) of the input sequences.

This is useful for drawing feature maps with sequences of different lengths.

weight_limits: Wmin and Wmax.

For each site, returns the minimal and maximal weight. This is useful with adaptative background models.

crer

Return Cis-Regulatory elements Enriched-Regions.

Calculate the statistical significance of the number of hits in windows of variable sizes. The number of hits is the sum of matches above a predefined threshold set on hits p-values, for all matrices and on both strands (if -2str). The maximum size for a CRER is defined by the option -crer_max.

The prior probability to find an instance of the motif is the same for all matrices, and corresponds to the chosen pval threshold. Within a region of maximal CRER size, subwindows are defined between each hits, and the observed number of matches in a subwindow is the sum of hits above the threshold. The significance of the observed number of matches in a subwindow is estimated by calculating a P-value using the binomial distribution (Aerts et al., 2003).

Example of CRER search: -return crer -uth pval 0.0001 -lth crer_size 20 -uth crer_size 200

` => the returned CRER lengths are between 20 and 200bp, and are constructed with hits having a pval less than 0.0001`
distrib Score distribution.

Return the score distribution for each matrix.

occ_proba Probability of the number of matches in the input sequence

For each matrix and each score value, calculate the statistical significance of the number of matches. This allows to select the score associated with te maximal significance, on the basis of the matrix-specific distribution, rather than by selecting some a priori threshold.

For each motif (M) and each score value (s), the program estimates the significance of the observed number of matches (x), given the prior probability (p) to find an instance of this motif with at least this score at a given position of the sequence. The P-value is calculated using the binomial distribution (Aerts et al., 2003).

This option requires to specify a background score distribution (option -bg_distrib) to estimate the prior probabilities of motif instances.

bg_model: Background model.

Report the transition matrix of the background model. Note that this option only makes sense for fixed background models (-bgfile or -bginput), since when the background model is adaptive (-mindow), the transition matrix changes along he sequence.

bg_residues

Return for each site the composition in A,C,G and T of the background model.

matrix

Return as comments the count matrix (or matrices) which were used for scanning.

freq_matrix

Return as comments the frequency matrix (or matrices) which were used for scanning.

weight_matrix

Return as comments the weight matrix (or matrices) which were used for scanning.

-sort_distrib

Sort score distribution by decreasing value of significance. By default, the score distributions are sorted by score (weight).

-bg_distrib bg_distrib_file

File specifying the background score distribution used to estimate prior probabilities with the option -return occ_proba. When this file is specified, the prior probabilities of motif occurrences are estimated from the frequencies of the background file, rather than using the theoretical site-wise P-value.

This background distribution can be generated by running matrix-scan on a set of background sequences, with the options

matrix-proba -v 1 -return distrib -mlist matrix_list.txt -i bg_seq.fta [...]

Various types of background sequences can be used as background model: whole genome, whole set of upstream sequences, randomly generated sequences, ... The choice of the background model has a strong effect on the estimated significance, and should thus be done carefully, according to the biological question.

-crer_ids

Assign one separate feature ID per CRER This option is convenient to distinguish separate CRERs, but it can be heavy for feature-map legends, especially when many CRERs are detected.

-recursive

Run matrix-scan separately for each sequence.

-batch #

Dispatch matrix-scan jobs on a cluster. Number of sequences to be analyzed by job (= on each node of the cluster)

INTERNAL PROCEDURES

score_segment

Assign a score to a sequence segment and print it if it passes the thresholds. Return value is 1 if the segment passed the thresholds.

p-score

Compute the p-score as in Bailey 2003.

print_match

Print the matching site.

INTERNAL PROCEDURES

score_crer

Calculate the score of the crer

print_crer

Print the CRER.

check_thresholds

Check the lower and upper threshold for a given hit.

check_distrib_thresholds

Check the lower and upper threshold for a given parameter in the score distribution.