- NAME
- VERSION
- DESCRIPTION
- AUTHORS
- CATEGORY
- USAGE
- INPUT FORMATS
- OUTPUT FORMATS
- ALGORITHM
- (DIS)SIMILARITY METRICS
- Symbols used for the metrics
- Sum of Squared Distances (SSD)
- Sandelin-Wasserman similarity (SW)
- Normalized Sandelin-Wasserman similarity (NSW)
- Euclidian distance (dEucl)
- Normalized Euclidian distance (NdEucl)
- Normalized Euclidian similarity (NsEucl)
- Kullback-Leibler distance (dKL)
- Covariance (cov)
- Coefficient of correlation (cor)
- Normalized correlation (Ncor)
- Correlation of the information content (Icor)
- REFERENCES
- OPTIONS
- SEE ALSO
- WISH LIST

compare-matrices

$program_version

Compare two collections of position-specific scoring matrices (PSSM), and return various similarity statistics + matrix alignments (pairwise, one-to-n).

**Jacques van Helden <Jacques.van-Helden@univ-amu.fr>**

**sequences**

**pattern matching**

**PSSM**

compare-matrices -file1 inputfile1 -file2 inputfile2 [-o outputfile] [-v #] [...]

The user has to specify exactly two input files (options *-file1* and
*-file2*), each containing one or several PSSMs. Each matrix of file1
is compared with each matrix of file2.

Any PSSM format supported in RSAT (type *convert-matrix -h* for a
description).

By default, the output format is a tab-delimited file with one row per
matrix comparison, and one column per statistic. Depending on the
requested return fields, *compare-matrices* can also export a series
of additional files.

**[output_prefix].tab**-
Tab-delimited text file containing the primary result (comparison score table): one row per comparison (match or profile position), one column per field (score, matrix descriptor, ...).

**[output_prefix].html**-
HTML file presenting the comparison table in a user-friendly way. Clicking a column header re-orders the table according to that column.

**[output_prefix]_alignments_pairwise.tab**-
Tab-delimited text file containing the shifted matrices resulting from pairwise alignments.

**[output_prefix]_alignments_pairwise.html**-
HTML file presenting the pairwise alignments in a user-friendly way: motifs are presented as sequence logos.

**[output_prefix]_alignments_1ton.tab**-
Tab-delimited text file containing the shifted matrices resulting from 1-to-n alignments.

**[output_prefix]_alignments_1ton.html**-
HTML file presenting the 1-to-n alignments in a user-friendly way: motifs are presented as sequence logos.

The program successively computes one or several (dis)similarity metrics between each matrix of the first input file and each matrix of the second input file.

Since the matrices are not supposed to be in phase, for each pair of
matrices, the program tests all possible *offset* (shift) values
between the two matrices.
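The offset scan can be sketched as follows. This is a minimal illustration, not the program's implementation: the function names `all_offsets` and `aligned_columns` are hypothetical, and the sign convention (column 0 of m2 placed at column `offset` of m1) is an assumption.

```python
def all_offsets(w1, w2):
    """Enumerate every offset that leaves at least one aligned column."""
    return range(-(w2 - 1), w1)


def aligned_columns(w1, w2, offset):
    """Return (start1, start2, w) for matrix widths w1, w2 at a given offset.

    Assumed convention: column 0 of m2 is placed at column `offset` of m1,
    so negative offsets shift m2 to the left of m1.
    """
    start1 = max(0, offset)       # first aligned column in m1
    start2 = max(0, -offset)      # first aligned column in m2
    end1 = min(w1, offset + w2)   # one past the last aligned column in m1
    w = end1 - start1             # number of aligned columns
    return start1, start2, w


# Scan all offsets for a 4-column matrix against a 3-column matrix.
for offset in all_offsets(4, 3):
    print(offset, aligned_columns(4, 3, offset))
```

Each offset yields a different aligned length, which is why the length-dependent symbols and metrics have to be recomputed for every tested offset.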

In the formulas below, symbols are defined as follows.

*m1, m2*-
Two position-specific scoring matrices.

*w1,w2*-
Number of columns of matrices m1 and m2, respectively.

**Row number** *r*-
Number of rows in each matrix, which correspond to the number of residues in the alphabet (A,C,G,T for DNA motifs).

**Aligned length** *w*-
Number of aligned columns between matrices m1 and m2 (depends on the offset between the two matrices).

w <= w1

w <= w2

**Total length** *W*-
Total length of the alignment between matrices m1 and m2.

W = w1 + w2 - w

**Relative length** *Wr*-
A measure of the mutual overlap between the aligned matrices.

Wr = w / W

This actually corresponds to the Jaccard coefficient (intersection / union), applied to the alignment lengths.

*s1, s2*-
Number of sites in matrices m1 and m2, respectively.

*n*-
Number of cells in the aligned portion of the matrices.

n = w * r

*i*-
Index of a row of the aligned PSSM (corresponds to a residue).

*j*-
Index of a column of the aligned PSSM (corresponds to an aligned position).

*f1{i,j}*-
Frequency of residue *i* in the *j*th column of the aligned subset of the first matrix (taking the offset into account).

*f2{i,j}*-
Frequency of residue *i* in the *j*th column of the aligned subset of the second matrix (taking the offset into account).

*f1m, f2m*-
Mean frequency computed over all cells of matrices m1 and m2, respectively.
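The length symbols defined above (*w*, *W*, *Wr*) are related by simple arithmetic, sketched here in Python. The function name `alignment_lengths` is hypothetical and the input check is an added safeguard, not something described in this manual.

```python
def alignment_lengths(w1, w2, w):
    """Return (W, Wr) from the matrix widths w1, w2 and the aligned length w."""
    if not 0 < w <= min(w1, w2):
        raise ValueError("w must satisfy 0 < w <= min(w1, w2)")
    W = w1 + w2 - w   # total length of the alignment (union of columns)
    Wr = w / W        # relative length (intersection / union)
    return W, Wr


# A 10-column matrix and a 6-column matrix sharing 4 aligned columns.
print(alignment_lengths(10, 6, 4))
```

When two matrices of equal width are aligned without shift, w = w1 = w2 = W and the relative length equals 1.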

BEWARE: this metric is the real SSD, i.e. the simple sum of squared distances. It is a distance metric, in contrast with the "SSD" defined in STAMP, which is converted to a similarity metric (see Sandelin-Wasserman below).

SSD = SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f2{i,j})^2
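As an illustration of the SSD formula, here is a minimal Python sketch (not the program's code). It assumes the two matrices have already been trimmed to their aligned columns and are stored as lists of rows, one row per residue; the function name `ssd` and the example frequencies are illustrative choices.

```python
def ssd(f1, f2):
    """Sum of squared differences between two aligned frequency matrices.

    f1 and f2 are lists of r rows (residues), each with w aligned columns.
    """
    return sum(
        (a - b) ** 2
        for row1, row2 in zip(f1, f2)
        for a, b in zip(row1, row2)
    )


# Two aligned columns over the DNA alphabet (rows: A, C, G, T).
f1 = [[1.0, 0.5], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
f2 = [[0.0, 0.5], [1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

# The first column is fully discordant (contributes 2),
# the second column is identical (contributes 0).
print(ssd(f1, f2))
```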

Also implemented in STAMP (under the name SSD) and TOMTOM (under the name Sandelin-Wasserman). This is a distance-to-similarity conversion of the SSD. The conversion is done by subtracting the squared distance of each column from a constant 2 (the maximal squared distance between two columns containing relative frequencies, reached when one residue has frequency 1 in one column and another residue has frequency 1 in the other column).

SW = SUM{j=1->w} [2 - SUM{i=1->r} (f1{i,j} - f2{i,j})^2]

Source: Sandelin A & Wasserman WW (2004) J Mol Biol 338:207-215.

Sandelin-Wasserman (SW) similarity normalized by the number of aligned
columns (*w*).

NSW = SW / (2*w)

NSW takes a value between 0 (not a single corresponding residue) and 1 (matrices are identical for all the aligned columns).
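The following Python sketch illustrates SW and NSW. It assumes that the constant 2 is applied once per aligned column (the reading under which NSW ranges from 0 to 1); the function names and example frequencies are hypothetical, and this is not the program's code.

```python
def sw(f1, f2):
    """Sandelin-Wasserman similarity, with the constant 2 applied per column."""
    w = len(f1[0])
    total = 0.0
    for j in range(w):
        # Squared distance between column j of both matrices.
        col_ssd = sum((row1[j] - row2[j]) ** 2 for row1, row2 in zip(f1, f2))
        total += 2 - col_ssd
    return total


def nsw(f1, f2):
    """SW normalized by its maximal value 2*w."""
    w = len(f1[0])
    return sw(f1, f2) / (2 * w)


# Rows: A, C, G, T. First column discordant, second column identical.
f1 = [[1.0, 0.5], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
f2 = [[0.0, 0.5], [1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
print(sw(f1, f2), nsw(f1, f2))
```

In this example one of the two aligned columns matches perfectly and the other not at all, so NSW is halfway between its two extremes.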

dEucl = sqrt( SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f2{i,j})^2)

Since relative frequencies can take values from 0 to 1, the Euclidian distance can take values from 0 to sqrt(2)*w.

Euclidian distance normalized by the number of aligned columns
(*w*).

NdEucl = dEucl / w

NdEucl can take values from 0 to sqrt(2).

Note that this differs from the definition provided in Pape et al. (2008).

A similarity metric derived from the normalized Euclidian distance.

NsEucl = (Max(NdEucl) - NdEucl) / Max(NdEucl) = (sqrt(2) - NdEucl) / sqrt(2)

where *Max(NdEucl)* = sqrt(2) is the maximal possible normalized
Euclidian distance for the current pair of matrices. The normalized
Euclidian similarity can vary from 0 (matrices with a single residue
per column, and those residues systematically differ between the two
matrices) to 1 (identical matrices).
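The three Euclidian metrics can be sketched in Python by following the formulas as written above. This is an illustration only: the function names are hypothetical, and the matrices are assumed to be already trimmed to their aligned columns (lists of rows, one row per residue).

```python
import math


def d_eucl(f1, f2):
    """Euclidian distance over all aligned cells (rows = residues)."""
    return math.sqrt(sum(
        (a - b) ** 2
        for row1, row2 in zip(f1, f2)
        for a, b in zip(row1, row2)
    ))


def nd_eucl(f1, f2):
    """dEucl divided by the number of aligned columns w."""
    return d_eucl(f1, f2) / len(f1[0])


def ns_eucl(f1, f2):
    """Similarity derived from NdEucl, using sqrt(2) as the maximum."""
    return (math.sqrt(2) - nd_eucl(f1, f2)) / math.sqrt(2)


# One aligned column; rows are A, C, G, T.
only_a = [[1.0], [0.0], [0.0], [0.0]]
only_c = [[0.0], [1.0], [0.0], [0.0]]
print(ns_eucl(only_a, only_a))  # identical columns
print(ns_eucl(only_a, only_c))  # fully discordant columns
```

Identical columns give the highest similarity and fully discordant columns the lowest, which is the direction expected from a similarity metric.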

As defined in Aerts et al. (2003). Also called **Mutual Information**.

dKL = 1/(2w) * SUM{i=1->r} SUM{j=1->w} ( f1{i,j}*log(f1{i,j}/f2{i,j}) + f2{i,j}*log(f2{i,j}/f1{i,j}))

Note that the KL distance is problematic for matrices containing zero values: for example, if f1{i,j}=0 and f2{i,j}=1, we have KL{i,j} = 0*log(0/1) + 1*log(1/0) = 0 + log(Inf) = Inf

One can circumvent this problem by using pseudo-count corrected matrices (f'(i,j)), but then the KL distance is strongly dependent on the somewhat arbitrary choice of the pseudo-count value.
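The sensitivity to the pseudo-count can be seen in a small Python sketch. Everything here is an assumption made for illustration: the natural logarithm, the function names, and the simple correction in `smooth` (add a constant to each cell, then renormalize each column), which is not necessarily the correction applied by the program.

```python
import math


def smooth(f, pseudo):
    """Hypothetical correction: add `pseudo` to each cell, renormalize columns."""
    r, w = len(f), len(f[0])
    out = [[0.0] * w for _ in range(r)]
    for j in range(w):
        total = sum(f[i][j] + pseudo for i in range(r))
        for i in range(r):
            out[i][j] = (f[i][j] + pseudo) / total
    return out


def dkl(f1, f2):
    """Symmetrized KL distance averaged over aligned columns (natural log)."""
    w = len(f1[0])
    total = 0.0
    for row1, row2 in zip(f1, f2):
        for a, b in zip(row1, row2):
            total += a * math.log(a / b) + b * math.log(b / a)
    return total / (2 * w)


# One aligned column; rows are A, C, G, T. Raw matrices contain zeros,
# so dkl(only_a, only_c) would fail without a correction.
only_a = [[1.0], [0.0], [0.0], [0.0]]
only_c = [[0.0], [1.0], [0.0], [0.0]]

for pseudo in (0.01, 0.1):
    print(pseudo, dkl(smooth(only_a, pseudo), smooth(only_c, pseudo)))
```

The same pair of columns gives a markedly smaller distance with the larger pseudo-count, which illustrates why the choice of this value matters.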

cov = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m) * (f2{i,j} - f2m)

Beware: this is the classical covariance defined in statistical textbooks. It has nothing to do with the "natural covariance" of Pape (which still needs to be implemented here). What we compute here is simply the covariance between the frequencies in the aligned cells of the respective matrices.

v1 = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m)^2

v2 = 1/n * SUM{i=1->r} SUM{j=1->w} (f2{i,j} - f2m)^2

cor = cov / sqrt(v1*v2)
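A minimal Python sketch of the covariance and correlation formulas follows. It assumes that the means f1m and f2m are taken over the aligned cells only, and the function name `cov_cor` is hypothetical. Note that a completely flat aligned region has zero variance, in which case the correlation is undefined and this sketch raises a `ZeroDivisionError`.

```python
import math


def cov_cor(f1, f2):
    """Return (cov, cor) computed over the n = w * r aligned cells."""
    x = [a for row in f1 for a in row]
    y = [b for row in f2 for b in row]
    n = len(x)
    m1 = sum(x) / n   # mean over aligned cells of the first matrix
    m2 = sum(y) / n   # mean over aligned cells of the second matrix
    cov = sum((a - m1) * (b - m2) for a, b in zip(x, y)) / n
    v1 = sum((a - m1) ** 2 for a in x) / n
    v2 = sum((b - m2) ** 2 for b in y) / n
    cor = cov / math.sqrt(v1 * v2)
    return cov, cor


# Rows: A, C, G, T. First column discordant, second column identical.
f1 = [[1.0, 0.5], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
f2 = [[0.0, 0.5], [1.0, 0.5], [0.0, 0.0], [0.0, 0.0]]
print(cov_cor(f1, f1))
print(cov_cor(f1, f2))
```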

The normalized correlation penalizes matches covering only a small fraction of the matrix (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).

The normalization factor is the relative length (Wr), i.e. the number of aligned columns divided by the total number of columns of the alignment.

Ncor = cor * Wr

This correction is particularly important to avoid selecting spurious
alignments between short fragments of the flanks of the matrices
(e.g. single-column alignments). For this reason, *Ncor* generally
gives a better estimation of motif similarity than *cor*, and we
recommend it as similarity score.

Imposing too stringent a lower threshold on Ncor may however reduce the sensitivity, and in particular prevent the detection of matches between half-motifs (e.g. in the case of dimeric transcription factors recognizing composite motifs).

An alternative would be to use as normalizing factor the length of the alignment (w) relative to the length of the shorter motif.

Ncor = cor * w / min(w1,w2)

This however tends to favour matches between very short motifs (4-5 residues) which cover only a fraction of the query motif.
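The two normalizations discussed above can be compared with a short Python sketch. The function names `ncor` and `ncor_shortest` are hypothetical, and the correlation value of 1.0 is an arbitrary illustration rather than a program result.

```python
def ncor(cor, w1, w2, w):
    """Correlation scaled by the relative length Wr = w / (w1 + w2 - w)."""
    return cor * w / (w1 + w2 - w)


def ncor_shortest(cor, w1, w2, w):
    """Alternative scaling by the aligned fraction of the shorter motif."""
    return cor * w / min(w1, w2)


# A single-column overlap between the flanks of two 10-column matrices.
print(ncor(1.0, 10, 10, 1))

# A 4-column motif fully included in a 12-column motif.
print(ncor(1.0, 12, 4, 4), ncor_shortest(1.0, 12, 4, 4))
```

With the alternative scaling, the short motif keeps its full correlation even though it covers only a third of the longer one, which is the bias mentioned above.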

Pearson's correlation computed on the information content matrices (I1, I2) rather than on the frequencies.

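Icov = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m) * (I2{i,j} - I2m)

Iv1 = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m)^2

Iv2 = 1/n * SUM{i=1->r} SUM{j=1->w} (I2{i,j} - I2m)^2

Icor = Icov / sqrt(Iv1*Iv2)

where *I1m* and *I2m* are the mean information content values over the aligned cells of matrices m1 and m2, respectively.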

The *Icor* score fixes a weakness of the *cor* score and all
other metrics above, which only take into account the residue
frequencies whilst ignoring the background frequencies.

A typical manifestation of this problem is that the *cor* score
occasionally returns alignments between non-informative pieces of the
matrices, which appear flat on the aligned logos. The reason why
uninformative columns may have a good correlation is that, if both
matrices have the same compositional bias (for example 30%A, 20%C,
20%G and 30%T), they will be correlated. Consequently, the columns
reflecting the background will contribute to increase the correlation
coefficient.

The information content corrects this bias by relativizing the matrix frequencies with respect to the background residue probabilities.

I{i,j} = f{i,j} * log(f{i,j}/p{i})

where *p{i}* is the prior probability of residue *i*.
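The effect of the background correction can be sketched in Python. The sketch assumes the natural logarithm, one prior probability per residue (row), and the usual convention that a cell with frequency 0 contributes 0; the function name `information_content` is hypothetical.

```python
import math


def information_content(f, prior):
    """Convert a frequency matrix to information content, cell by cell.

    f is a list of rows (residues); prior holds one probability per row.
    Cells with frequency 0 are set to 0 (assumed convention for 0*log(0)).
    """
    return [
        [x * math.log(x / p) if x > 0 else 0.0 for x in row]
        for row, p in zip(f, prior)
    ]


# Background with the compositional bias used as example above (A, C, G, T).
prior = [0.3, 0.2, 0.2, 0.3]

# First column mirrors the background, second column is a conserved A.
f = [[0.3, 1.0], [0.2, 0.0], [0.2, 0.0], [0.3, 0.0]]
for row in information_content(f, prior):
    print(row)
```

The column that merely reflects the background becomes entirely zero, so it no longer contributes to a correlation computed on the information content.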

Distances between PSSMs have been treated in many ways. The most recent and relevant articles are cited hereafter.

**Aerts et al. Computational detection of cis-regulatory modules. Bioinformatics (2003) vol. 19 Suppl 2 pp. ii5-14.**

**Gupta et al. Quantifying similarity between motifs. Genome Biol (2007) vol. 8 (2) pp. R24.**

**Pape, U.J., Rahmann, S., and Vingron, M. (2008). Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics 24 (3) pp. 350-7.**

**-v #**-
Level of verbosity (detail in the warning messages during execution)

**-h**-
Display full help message

**-help**-
Same as -h

**-file1 matrix_file1**-
The first input file containing one or several matrices.

**-file2 matrix_file2**-
The second input file containing one or several matrices.

**-file single_matrix_file**-
Use a single matrix file as input. Each matrix of this file is compared with each matrix of the same file. This is equivalent to: -file1 single_matrix_file -file2 single_matrix_file

**-mlist1 matrix list**-
The first input file containing a list of matrix files (given as paths).

**-mlist2 matrix list**-
The second input file containing a list of matrix files (given as paths). The reverse complement is computed for this set of matrices.

**-format1 matrix_format1**-
Specify the matrix format for the first input file only (requires -format2).

**-format2 matrix_format2**-
Specify the matrix format for the second input file only (requires -format1).

**-format matrix_format**-
Specify the matrix format for both input files (alternatively, see options -format1 and -format2).

**-bgfile background_file**-
Background model file.

**-bg_format format**-
Format for the background model file.

Supported formats: all the input formats supported by convert-background-model.

**-top1 X**-
Only analyze the first X motifs of the first file. This option is convenient for quick testing before starting the full analysis.

**-top2 X**-
Only analyze the first X motifs of the second file. This option is convenient for quick testing before starting the full analysis.

**-o output_prefix**-
Prefix for the output files. The output prefix is mandatory for some return fields (alignments, graphs, ...).

This prefix will be appended with a series of suffixes for the different output types (see section OUTPUT FORMATS above for the detail).

**-mode matches | profiles**-

*-mode matches* (default)-
Return matches between any matrix of the file1 and any matrix of file2.

This is the typical use of *compare-matrices*: comparing one or several query motifs (e.g. obtained from motif discovery) with a collection of reference motifs (e.g. a database of experimentally characterized transcription factor binding motifs, such as JASPAR, TRANSFAC, RegulonDB, ...).

For a given pair of matrices (one from file1 and one from file2), the program tests all possible offsets, and measures one or several matching scores (see section "(Dis)similarity metrics" above). The program only returns the score of the best alignment between the two matrices. The "best" alignment is the combination of offset and strand (with the option -strand DR) that maximizes the default score (Ncor). Alternative scores can be used as optimality criteria with the option -sort.

*-mode profiles*-
Return a table with one row for each possible alignment offset between two matrices, and various columns indicating the matching parameters (offset, strand, aligned width,...), the matching scores, and the consensus of the aligned columns of the matrices.

Matching profiles are convenient for drawing the similarity profiles, or for analyzing the correlations between various similarity metrics, but they are too verbose for the typical use of *compare-matrices* (detecting matches between a query matrix and a database of reference matrices). The formats "matches" and "table" are more convenient for basic use.

**-distinct**-
Skip comparison between a matrix and itself.

This option is useful when the program is used to compare all matrices of a given file to all matrices of the same file, to avoid comparing each matrix to itself.

Beware: the criterion for considering two matrices identical is that they have the same identifier. If two matrices have exactly the same content (in terms of occurrences per position) but different identifiers, they will be compared.

**-strand D | R | DR**-
Perform matrix comparisons in direct (D), reverse complementary (R), or both orientations (DR, default option).

When the R or DR options are activated, all matrices of the second matrix file are converted to the reverse complementary matrix.

This option is useful to answer very particular questions, for example:

**Comparing motifs in a strand-insensitive way (-strand DR)**-
DNA-binding motifs are usually strand-insensitive. A motif may be detected in one given orientation by a motif-discovery algorithm, but annotated in the reverse complementary orientation in a motif database. For DNA binding motifs, we thus recommend the DR option.

On the contrary, RNA-related signals (termination, poly-adenylation, miRNA) are strand-sensitive, and should be compared in a single orientation (-strand D).

**Detecting reverse complementary palindromic motifs**-
An example of reverse complementary palindromic motif is tCAGswwsGTGa. When a motif is reverse complementary palindromic, the matrix is correlated to its own reverse complement.

*Remark about a frequent misconception of biological palindromes*

Reverse complementary palindromes are frequent in DNA signals (e.g. transcription factor binding sites, restriction sites, ...) because they correspond to a rotational symmetry in the 3D structure. Such symmetrical motifs are often characteristic of sites recognized by homodimeric complexes.

By contrast, simple string-based palindromes (e.g. CAGTTGAC) do not correspond to any symmetry from a biochemical point of view, because the 3D structure of the corresponding double helix is not symmetrical. The apparent symmetry is an artifact of the string-based representation, but the corresponding molecule has neither rotational nor translational symmetry.

DNA signals can either be symmetrical (reverse complementary palindromes, tandem repeats) or asymmetrical.
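The reverse complement of a matrix, and the difference between the two kinds of palindromes described above, can be sketched in Python. The sketch assumes that rows are stored in A, C, G, T order; the helper names are hypothetical, and the words CACGTG and ACT are examples chosen here for illustration.

```python
def reverse_complement(m):
    """Reverse complement of a matrix whose rows are in A, C, G, T order.

    Reversing the row order swaps A<->T and C<->G (complement);
    reversing each row reads the columns from the other strand.
    """
    return [row[::-1] for row in m[::-1]]


def matrix_from_word(word):
    """Build a 0/1 matrix (rows A, C, G, T) from a single DNA word."""
    return [[1.0 if base == residue else 0.0 for base in word]
            for residue in "ACGT"]


# ACT on one strand reads AGT on the other strand.
print(reverse_complement(matrix_from_word("ACT")) == matrix_from_word("AGT"))

# CACGTG is a reverse complementary palindrome: unchanged by the operation.
ebox = matrix_from_word("CACGTG")
print(reverse_complement(ebox) == ebox)

# CAGTTGAC is only a string-based palindrome: its reverse complement differs.
word = matrix_from_word("CAGTTGAC")
print(reverse_complement(word) == word)
```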

**-matrix_id #**-
Obsolete option for returning matrix names. Replaced by -return matrix_name. Maintained for backward compatibility.

**-return return_fields**-
List of fields to return (only valid for the formats "profiles" and "matches").

Supported return fields:

*offset*-
ascending (default for the profile mode)

*Ncor*-
decreasing (default for the matching mode)

*cor*-
decreasing

*cov*-
decreasing

*SSD*-
ascending

*NSW*-
decreasing

*SW*-
decreasing

*dEucl*-
ascending

*NdEucl*-
ascending

*NsEucl*-
decreasing

*dKL*-
ascending

*matrix_number*-
Number of the matrices in the input files

*matrix_id*-
Identifiers of the matrices

*matrix_name*-
Names of the matrices

*matrix_ac*

*width*-
Width of the matrices and the alignment

*strand*-
Direct (D) or Reverse complementary (R) comparison

*offset*-
Offset between the positions of the first and second matrix

*pos*-
Relative positions of the aligned matrices (start, end, strand, width)

*consensus*

*rank*

*alignments_pairwise*-
Shifted matrices resulting from the pairwise alignments.

*alignments_1ton*-
Shifted matrices resulting from the 1-to-N alignments.

*alignments*-
Shifted matrices resulting from the alignments (pairwise and 1-to-N).

*all*-
All supported output fields, including all metrics.

**-sort sort_field**-
Field to sort the results. The sorting direction depends on the metric: ascending for dissimilarity metrics, decreasing for similarity metrics.

Supported sort fields:

*offset*-
ascending (default for the profile mode)

*Ncor*-
decreasing (default for the matching mode)

*cor*-
decreasing

*cov*-
decreasing

*SSD*-
ascending

*SW*-
decreasing

*NSW*-
decreasing

*dEucl*-
ascending

*NdEucl*-
ascending

*NsEucl*-
decreasing

*dKL*-
ascending
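The sorting rule described above can be sketched as follows. This is an illustration only: the row structure (one dictionary per comparison), the function name `sort_results`, and the score values are invented for the example and do not reflect the program's internal data or actual results.

```python
# Fields sorted in ascending order (offset and dissimilarity metrics);
# every other supported sort field is sorted in decreasing order.
ASCENDING_FIELDS = {"offset", "SSD", "dEucl", "NdEucl", "dKL"}


def sort_results(rows, field):
    """Sort comparison rows by `field`, with the direction set by the metric."""
    return sorted(
        rows,
        key=lambda row: row[field],
        reverse=field not in ASCENDING_FIELDS,
    )


# Invented scores for three comparisons.
rows = [
    {"id": "m1_vs_a", "Ncor": 0.42, "dEucl": 1.10},
    {"id": "m1_vs_b", "Ncor": 0.91, "dEucl": 0.20},
    {"id": "m1_vs_c", "Ncor": 0.65, "dEucl": 0.75},
]
print([row["id"] for row in sort_results(rows, "Ncor")])
print([row["id"] for row in sort_results(rows, "dEucl")])
```

In both cases the best match comes first: the highest value for a similarity, the lowest value for a distance.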

**-lth param lower_threshold**

**-uth param upper_threshold**-
Threshold on some parameter (-lth: lower, -uth: upper threshold).

Supported threshold fields: rank, dEucl, cor, cov, ali_len, offset

**convert-matrix**

**matrix-scan**

**Additional metrics**

**Mutual information**-
We should check if this fixes the problems of 0 values that we have with the KL distance.

**The "natural covariance"**-
Pape, U. J., Rahmann, S. and Vingron, M. (2008). Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics 24, 350-7.

This metric measures the covariance between hits of two matrices above a given threshold for each of them.

**chi2 P-value (for the sake of comparison).**-
Note that a condition of applicability of the chi2 P-value is that the expected value should be >= 5 for each cell of the matrix. This condition is usually not fulfilled for the PSSMs we use for motif scanning.

**Average Log Likelihood Ratio (ALLR)**-
Source: Wang T & Stormo GD (2003) Bioinformatics 19:2369-2380. Also implemented in STAMP.

**-pseudo**-
Pseudo-counts to be added to all matrices.

**-comparison_mode table | consensus**

*-return clusters*-
Cluster motifs (only valid with a single input file).

*-return crosstable field*-
Export a table with one row per matrix of the file 1, one column per matrix of file 2, where each cell indicates the value of the selected field for the corresponding pair of matrices.

*-return graph*-
Export a graph where nodes correspond to input matrices, and edges indicate similarities between them.