RSAT - compare-matrices manual

NAME
VERSION
DESCRIPTION
AUTHORS
CATEGORY
USAGE
INPUT FORMATS
OUTPUT FORMATS
ALGORITHM
(DIS)SIMILARITY METRICS

Symbols used for the metrics
Sum of Squared Distances (SSD)
Sandelin-Wasserman similarity (SW)
Normalized Sandelin-Wasserman similarity (NSW)
Euclidian distance (dEucl)
Normalized Euclidian distance (NdEucl)
Normalized Euclidian similarity (NsEucl)
Kullback-Leibler distance (dKL)
Covariance (cov)
Coefficient of correlation (cor)
Normalized correlation (Ncor)

Note

Correlation of the information content (Icor)

REFERENCES
OPTIONS
SEE ALSO
WISH LIST

NAME

compare-matrices

VERSION

$program_version

DESCRIPTION

Compare two collections of position-specific scoring matrices (PSSMs), and return various similarity statistics as well as matrix alignments (pairwise, one-to-n).

AUTHORS

Jacques van Helden <Jacques.van-Helden@univ-amu.fr>

USAGE

compare-matrices -file1 inputfile1 -file2 inputfile2 [-o outputfile] [-v #] [...]

INPUT FORMATS

The user has to specify exactly two input files (options -file1 and -file2), each containing one or several PSSMs. Each matrix of file1 is compared with each matrix of file2.

Any PSSM format supported in RSAT can be used (type convert-matrix -h for a description).

OUTPUT FORMATS

By default, the output format is a tab-delimited file with one row per matrix comparison, and one column per statistic. Depending on the requested return fields, compare-matrices can also export a series of additional files.

[output_prefix].tab: Tab-delimited text file containing the primary result (comparison score table): one column per comparison (match or profile position), one row per field (score, matrix descriptor, ...).
[output_prefix].html: HTML file presenting the comparison table in a user-friendly way. The clickable headers allow the table to be re-ordered according to any column.
[output_prefix]_alignments_pairwise.tab: Tab-delimited text file containing the shifted matrices resulting from pairwise alignments.
[output_prefix]_alignments_pairwise.html: HTML file presenting the pairwise alignments in a user-friendly way: motifs are presented as sequence logos.
[output_prefix]_alignments_1ton.tab: Tab-delimited text file containing the shifted matrices resulting from 1-to-n alignments.
[output_prefix]_alignments_1ton.html: HTML file presenting the 1-to-n alignments in a user-friendly way: motifs are presented as sequence logos.

ALGORITHM

The program successively computes one or several (dis)similarity metrics between each matrix of the first input file and each matrix of the second input file.

Since the matrices are not supposed to be in phase, for each pair of matrices, the program tests all possible offset (shift) values between the two matrices.
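
As a rough illustration of the offset scan described above, the Python sketch below enumerates every placement in which two matrices overlap by at least one column. The function name aligned_offsets and the sign convention of the offset (position of the first column of the second matrix relative to the first column of the first matrix) are assumptions made for this example; they are not taken from the program itself.

```python
def aligned_offsets(w1, w2):
    """Yield (offset, start1, start2, w) for each overlapping placement.

    Assumed convention: offset is the position of the first column of
    matrix 2 relative to the first column of matrix 1.
    """
    for offset in range(-(w2 - 1), w1):
        start1 = max(0, offset)             # first aligned column in matrix 1
        start2 = max(0, -offset)            # first aligned column in matrix 2
        w = min(w1 - start1, w2 - start2)   # number of aligned columns
        yield offset, start1, start2, w


# Example: a 4-column matrix against a 3-column matrix
for offset, start1, start2, w in aligned_offsets(4, 3):
    print(offset, start1, start2, w)
```

With this convention, two matrices of widths w1 and w2 give w1 + w2 - 1 candidate offsets.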

(DIS)SIMILARITY METRICS

Symbols used for the metrics

In the formulas below, symbols are defined as follows.

m1, m2

Two position-specific scoring matrices.

w1,w2

Number of columns of matrices m1 and m2, respectively.

Row number r

Number of rows in each matrix, which corresponds to the number of residues in the alphabet (A,C,G,T for DNA motifs).

Aligned length w

Number of aligned columns between matrices m1 and m2 (depends on the offset between the two matrices).

 w <= w1
 w <= w2

Total length W

Total length of the alignment between matrices m1 and m2.

 W = w1 + w2 - w

Relative length Wr

A measure of the mutual overlap between the aligned matrices.

Wr = w / W

This actually corresponds to the Jaccard coefficient (intersection / union), applied to the alignment lengths.

s1, s2

Number of sites in matrices m1 and m2, respectively.

n

Number of cells in the aligned portion of the matrices.

 n = w * r

i

Index of a row of the aligned PSSM (corresponds to a residue).

j

Index of a column of the aligned PSSM (corresponds to an aligned position).

f1{i,j}

Frequency of residue i in the jth column of the aligned subset of the first matrix (taking the offset into account).

f2{i,j}

Frequency of residue i in the jth column of the aligned subset of the second matrix (taking the offset into account).

f1m, f2m

Mean frequency computed over all cells of matrices m1 and m2, respectively.
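
The length-related symbols defined above (w, W, Wr) can be computed as in the following Python sketch. The function name alignment_lengths is hypothetical and only serves this example.

```python
def alignment_lengths(w1, w2, w):
    """Return (W, Wr) for matrices of widths w1 and w2 with w aligned columns."""
    if not (1 <= w <= min(w1, w2)):
        raise ValueError("w must satisfy 1 <= w <= min(w1, w2)")
    W = w1 + w2 - w   # total length of the alignment (union of columns)
    Wr = w / W        # relative length (intersection / union)
    return W, Wr


print(alignment_lengths(10, 8, 8))   # shorter matrix fully covered by the longer one
print(alignment_lengths(10, 8, 1))   # single-column overlap
```

The second call shows how a single-column overlap between two reasonably long matrices yields a very small Wr.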

Sum of Squared Distances (SSD)

BEWARE: this metric is the real SSD, i.e. the simple sum of squared distances. It is a distance metric, in contrast with the "SSD" defined in STAMP, which is converted to a similarity metric (see Sandelin-Wasserman below).

 SSD = SUM{i=1->r} SUM{j=1->w} [(f1{i,j} - f2{i,j})^2]

Sandelin-Wasserman similarity (SW)

Also implemented in STAMP (under the name SSD) and TOMTOM (under the name Sandelin-Wasserman). This is a distance-to-similarity conversion of the SSD. The conversion is done by subtracting the squared distance of each aligned column from a constant 2 (the maximal distance between two columns containing relative frequencies, i.e. one residue has frequency 1 in one column, and another residue has frequency 1 in the other column).

 SW = SUM{j=1->w} [2 - SUM{i=1->r} (f1{i,j} - f2{i,j})^2]

Source: Sandelin A & Wasserman WW (2004) J Mol Biol 338:207-215.

Normalized Sandelin-Wasserman similarity (NSW)

Sandelin-Wasserman (SW) similarity normalized by the number of aligned columns (w).

 NSW = SW / (2*w)

NSW takes a value comprised between 0 (not a single corresponding residue) and 1 (matrices are identical for all the aligned columns).
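
The relation between SSD, SW and NSW can be illustrated with the short Python sketch below. It assumes that the aligned part of each matrix is stored as a list of columns, each column holding the frequencies of A, C, G and T, and that the constant 2 is applied once per aligned column (which is what lets NSW range from 0 to 1). The function names are chosen for this example only.

```python
def ssd(cols1, cols2):
    """Sum of squared differences over all aligned cells."""
    return sum((a - b) ** 2
               for c1, c2 in zip(cols1, cols2)
               for a, b in zip(c1, c2))


def sw(cols1, cols2):
    """Sandelin-Wasserman similarity: 2 minus the squared distance, per column."""
    return sum(2 - sum((a - b) ** 2 for a, b in zip(c1, c2))
               for c1, c2 in zip(cols1, cols2))


def nsw(cols1, cols2):
    """SW normalized by its maximal value, 2 * w."""
    return sw(cols1, cols2) / (2 * len(cols1))


# Two aligned columns; the first ones are maximally different, the second identical
m1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
m2 = [[0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
print(ssd(m1, m2), sw(m1, m2), nsw(m1, m2))
```

Under this per-column reading, SW equals 2*w - SSD, so the two metrics rank alignments of the same width in opposite order.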

Euclidian distance (dEucl)

 dEucl = sqrt( SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f2{i,j})^2)

Since relative frequencies can take values from 0 to 1, the Euclidian distance can take values from 0 to sqrt(2)*w.

Normalized Euclidian distance (NdEucl)

Euclidian distance normalized by the number of aligned columns (w).

 NdEucl = dEucl / w

NdEucl can take values from 0 to sqrt(2).

Note that this differs from the definition provided in Pape et al. (2008).

Normalized Euclidian similarity (NsEucl)

A similarity metrics derived from the normalized Euclidian distance.

 NsEucl = (Max(NdEucl) - NdEucl) / Max(NdEucl)
        = (sqrt(2) - NdEucl) / sqrt(2)

where Max(NdEucl)=sqrt(2) is the maximal possible normalized Euclidian distance for the current pair of matrices. The Normalized Euclidian similarity can vary from 0 (matrices with a single residue per column, and those residues systematically differ between the two matrices) to 1 (identical matrices).

Kullback-Leibler distance (dKL)

As defined in Aerts et al. (2003). Also called Mutual Information.

 dKL = 1/(2w) * SUM{i=1->r} SUM{j=1->w} (
                   f1{i,j}*log(f1{i,j}/f2{i,j})
                   + f2{i,j}*log(f2{i,j}/f1{i,j}))

Note that the KL distance is problematic for matrices containing zero values: for example, if f1{i,j}=0 and f2{i,j}=1, we have: KL{i,j} = (0*log(0) + 1*log(1/0)) = 0 + log(Inf) = Inf

One can circumvent this problem by using pseudo-count corrected matrices (f'(i,j)), but then the KL distance is strongly dependent on the somewhat arbitrary choice of the pseudo-count value.
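
The dependence on the pseudo-count can be seen in the following Python sketch. The pseudo-count scheme used in add_pseudo (a uniform pseudo-count spread over the residues) and the use of the natural logarithm are assumptions made for illustration; the scheme applied by RSAT may differ.

```python
import math


def add_pseudo(col, pseudo=0.01):
    """Mix a uniform pseudo-count into a frequency column (assumed scheme)."""
    r = len(col)
    return [(f + pseudo / r) / (1 + pseudo) for f in col]


def dkl(cols1, cols2):
    """Symmetrical KL distance, averaged over the w aligned columns."""
    w = len(cols1)
    total = 0.0
    for c1, c2 in zip(cols1, cols2):
        for a, b in zip(c1, c2):
            total += a * math.log(a / b) + b * math.log(b / a)
    return total / (2 * w)


# One aligned column with no residue in common: undefined without pseudo-count
m1 = [[1.0, 0.0, 0.0, 0.0]]
m2 = [[0.0, 1.0, 0.0, 0.0]]
for pseudo in (0.1, 0.01, 0.001):
    p1 = [add_pseudo(c, pseudo) for c in m1]
    p2 = [add_pseudo(c, pseudo) for c in m2]
    print(pseudo, dkl(p1, p2))
```

The printed distance keeps growing as the pseudo-count shrinks, which is the arbitrariness mentioned above.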

Covariance (cov)

 cov = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m) * (f2{i,j} - f2m)

Beware: this is the classical covariance defined in statistical textbooks. It has nothing to do with the "natural covariance" of Pape (which still needs to be implemented here). What we compute here is simply the covariance between the counts in the aligned cells of the respective matrices.

Coefficient of correlation (cor)

 v1 = 1/n * SUM{i=1->r} SUM{j=1->w} (f1{i,j} - f1m)^2
 v2 = 1/n * SUM{i=1->r} SUM{j=1->w} (f2{i,j} - f2m)^2
 cor = cov / sqrt(v1*v2)
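
The covariance and correlation formulas above translate directly into the Python sketch below. The column-list representation of the aligned matrices and the function names are assumptions made for this example. The means are computed here over the aligned cells; for frequency columns summing to 1 this gives 1/r.

```python
import math


def flatten(cols):
    """List all cells of the aligned columns."""
    return [f for col in cols for f in col]


def cov_cor(cols1, cols2):
    """Return (cov, cor) computed over the n = w * r aligned cells."""
    x, y = flatten(cols1), flatten(cols2)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n      # f1m, f2m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    v1 = sum((a - mx) ** 2 for a in x) / n
    v2 = sum((b - my) ** 2 for b in y) / n
    cor = cov / math.sqrt(v1 * v2)       # undefined if a matrix is flat (variance 0)
    return cov, cor


m1 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
m2 = [[1.0, 0.0, 0.0, 0.0]]
m3 = [[0.0, 1.0, 0.0, 0.0]]
print(cov_cor(m1, m1))   # identical matrices
print(cov_cor(m2, m3))   # one column, different residues
```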

Normalized correlation (Ncor)

The normalized correlation penalizes matches covering only a small fraction of the matrix (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).

The normalization factor is the relative length (Wr), i.e. the number of aligned columns divided by the total number of columns of the alignment.

Ncor = cor * Wr

This correction is particularly important to avoid selecting spurious alignments between short fragments of the flanks of the matrices (e.g. single-column alignments). For this reason, Ncor generally gives a better estimation of motif similarity than cor, and we recommend it as similarity score.

Imposing a too stringent lower threshold on Ncor may however reduce the sensitivity, and in particular prevent the detection of matches between half-motifs (e.g. in the case of dimeric transcription factors recognizing composite motifs).

Note

An alternative would be to use as normalizing factor the length of the alignment (w) relative to the length of the shorter motif.

 Ncor = cor * w / min(w1,w2)

This however tends to favour matches between very short motifs (4-5 residues) which cover only a fraction of the query motif.
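
The effect of the two normalizing factors discussed in this section can be compared with the small Python sketch below. The function names and the numerical values are made up for illustration.

```python
def ncor(cor, w1, w2, w):
    """Correlation normalized by the relative length Wr = w / (w1 + w2 - w)."""
    return cor * w / (w1 + w2 - w)


def ncor_shorter(cor, w1, w2, w):
    """Alternative: correlation normalized by the width of the shorter motif."""
    return cor * w / min(w1, w2)


# Perfect correlation restricted to a single-column overlap of two 10-column matrices
print(ncor(1.0, 10, 10, 1), ncor_shorter(1.0, 10, 10, 1))

# 4-column motif fully aligned inside a 12-column motif, cor = 0.9
print(ncor(0.9, 12, 4, 4), ncor_shorter(0.9, 12, 4, 4))
```

In the second case the alternative factor leaves the score of the short motif untouched, whereas Wr reduces it to one third of cor.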

Correlation of the information content (Icor)

Pearson's correlation computed on the information content matrices (I1, I2) rather than on the frequencies.

 Icov = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m) * (I2{i,j} - I2m)
 Iv1 = 1/n * SUM{i=1->r} SUM{j=1->w} (I1{i,j} - I1m)^2
 Iv2 = 1/n * SUM{i=1->r} SUM{j=1->w} (I2{i,j} - I2m)^2
 Icor = Icov / sqrt(Iv1*Iv2)

where I1m and I2m are the mean values of I1 and I2, respectively.

The Icor score fixes a weakness of the cor score and all other metrics above, which only take into account the residue frequencies whilst ignoring the background frequencies.

A typical manifestation of this problem is that the cor score occasionally returns alignments between non-informative pieces of the matrices, which appear flat on the aligned logos. The reason why uninformative columns may have a good correlation is that, if both matrices have the same compositional bias (for example 30%A, 20%C, 20%G and 30%T), they will be correlated. Consequently, the columns reflecting the background will contribute to increase the correlation coefficient.

The information content corrects this bias by relativizing the matrix frequencies with respect to the background residue probabilities.

 I{i,j} = f{i,j} * log(f{i,j}/p{i})

where p{i} is the prior probability of residue i.
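
As a minimal sketch of the conversion from frequencies to information content, the Python function below applies the formula cell by cell. The function name info_matrix, the use of the natural logarithm, and the convention of setting cells with a null frequency to 0 are assumptions made for this example.

```python
import math


def info_matrix(cols, priors):
    """Information content per cell: f * log(f / p), with 0 where f = 0 (assumed)."""
    return [[f * math.log(f / p) if f > 0 else 0.0
             for f, p in zip(col, priors)]
            for col in cols]


priors = [0.3, 0.2, 0.2, 0.3]          # background probabilities of A, C, G, T
flat = [[0.3, 0.2, 0.2, 0.3]]          # column identical to the background
specific = [[1.0, 0.0, 0.0, 0.0]]      # column with a single residue

print(info_matrix(flat, priors))
print(info_matrix(specific, priors))
```

A column that merely reflects the background gets a null information content in every cell, so it no longer contributes to the correlation computed on I1 and I2.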

REFERENCES

Distances between PSSMs have been treated in many ways. The most recent and relevant articles are cited hereafter.

Aerts et al. Computational detection of cis-regulatory modules. Bioinformatics (2003) vol. 19 Suppl 2 pp. ii5-14
Gupta et al. Quantifying similarity between motifs. Genome Biol (2007) vol. 8 (2) pp. R24.
Pape, U.J., Rahman, S., and Vingron, M. (2008). Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics 24 (3) pp. 350-7.

OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

-file1 matrix_file1

The first input file containing one or several matrices.

-file2 matrix_file2

The second input file containing one or several matrices.

-file single_matrix_file

Use a single matrix file as input. Each matrix of this file is compared with every matrix of the same file. This is equivalent to: -file1 single_matrix_file -file2 single_matrix_file

-mlist1 matrix list

The first input file containing a list of matrix files (given as paths).

-mlist2 matrix list

The second input file containing a list of matrix files (given as paths). The reverse complement is computed for this set of matrices.

-format1 matrix_format1

Specify the matrix format for the first input file only (requires -format2).

-format2 matrix_format2

Specify the matrix format for the second input file only (requires -format1).

-format matrix_format

Specify the matrix format for both input files (alternatively, see options -format1 and -format2).

-bgfile background_file

Background model file.

-bg_format format

Format for the background model file.

Supported formats: all the input formats supported by convert-background-model.

-top1 X

Only analyze the first X motifs of the first file. This option is convenient for quick testing before starting the full analysis.

-top2 X

Only analyze the first X motifs of the second file. This option is convenient for quick testing before starting the full analysis.

-o output_prefix

Prefix for the output files. The output prefix is mandatory for some return fields (alignments, graphs, ...).

This prefix will be appended with a series of suffixes for the different output types (see section OUTPUT FORMATS above for details).

-mode matches | profiles

-mode matches (default)

Return matches between any matrix of the file1 and any matrix of file2.

This is the typical use of compare-matrices: comparing one or several query motifs (e.g. obtained from motif discovery) with a collection of reference motifs (e.g. a database of experimentally characterized transcription factor binding motifs, such as JASPAR, TRANSFAC, RegulonDB, ...).

For a given pair of matrices (one from file1 and one from file2), the program tests all possible offsets, and measures one or several matching scores (see section "(Dis)similarity metrics" above). The program only returns the score of the best alignment between the two matrices. The "best" alignment is the combination of offset and strand (with the option -strand DR) that maximizes the default score (Ncor). Alternative scores can be used as optimality criteria with the option -sort.

-mode profiles

Return a table with one row for each possible alignment offset between two matrices, and various columns indicating the matching parameters (offset, strand, aligned width,...), the matching scores, and the consensus of the aligned columns of the matrices.

Matching profiles are convenient for drawing the similarity profiles, or for analyzing the correlations between various similarity metrics, but they are too verbose for the typical use of compare-matrices (detect matches between a query matrix and a database of reference matrices). The formats "matches" and "table" are more convenient for basic use.
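
The selection of the best alignment described for the matches can be sketched in Python as follows: for each strand and each offset, the aligned columns are extracted, Ncor is computed, and the combination with the highest score is kept. This is only an illustration under stated assumptions (matrices stored as lists of columns in A, C, G, T order, offset measured as the position of the second matrix relative to the first, first-found alignment kept in case of ties); the names one_hot, revcomp, cor and best_match are hypothetical and do not describe the program's internals.

```python
import math


def one_hot(seq):
    """Build a frequency matrix (list of columns) from a consensus string."""
    return [[1.0 if base == x else 0.0 for x in "ACGT"] for base in seq]


def revcomp(cols):
    """Reverse the column order and swap A<->T, C<->G within each column."""
    return [col[::-1] for col in reversed(cols)]


def cor(cols1, cols2):
    """Pearson correlation over the aligned cells (0 if a matrix is flat)."""
    x = [f for c in cols1 for f in c]
    y = [f for c in cols2 for f in c]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    v = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / v if v > 0 else 0.0


def best_match(m1, m2, strands="DR"):
    """Return (Ncor, offset, strand, w) of the highest-scoring alignment."""
    w1, w2 = len(m1), len(m2)
    best = None
    for strand in strands:
        target = revcomp(m2) if strand == "R" else m2
        for offset in range(-(w2 - 1), w1):
            s1, s2 = max(0, offset), max(0, -offset)
            w = min(w1 - s1, w2 - s2)
            score = cor(m1[s1:s1 + w], target[s2:s2 + w]) * w / (w1 + w2 - w)
            if best is None or score > best[0]:
                best = (score, offset, strand, w)
    return best


# GTT is the reverse complement of AAC: the best match is found on strand R
print(best_match(one_hot("AAC"), one_hot("GTT")))
print(best_match(one_hot("AAC"), one_hot("GTT"), strands="D"))
```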

-distinct

Skip comparison between a matrix and itself.

This option is useful when the program is used to compare all matrices of a given file to all matrices of the same file, to avoid comparing each matrix to itself.

Beware: the criterion for considering two matrices identical is that they have the same identifier. If two matrices have exactly the same content (in terms of occurrences per position) but different identifiers, they will be compared.

-strand D | R | DR

Perform matrix comparisons in direct (D), reverse complementary (R), or both orientations (DR, default option).

When the R or DR options are activated, all matrices of the second matrix file are converted to the reverse complementary matrix.

This option is useful to answer very particular questions, for example:

Comparing motifs in a strand-insensitive way (-strand DR)

DNA-binding motifs are usually strand-insensitive. A motif may be detected in one given orientation by a motif-discovery algorithm, but annotated in the reverse complementary orientation in a motif database. For DNA binding motifs, we thus recommend the DR option.

On the contrary, RNA-related signals (termination, poly-adenylation, miRNA) are strand-sensitive, and should be compared in a single orientation (-strand D).

Detecting reverse complementary palindromic motifs

An example of reverse complementary palindromic motif is tCAGswwsGTGa. When a motif is reverse complementary palindromic, the matrix is correlated to its own reverse complement.

Remark about a frequent misconception of biological palindromes

Reverse complementary palindromes are frequent in DNA signals (e.g. transcription factor binding sites, restriction sites, ...) because they correspond to a rotational symmetry in the 3D structure. Such symmetrical motifs are often characteristic of sites recognized by homodimeric complexes.

By contrast, simple string-based palindromes (e.g. CAGTTGAC) do not correspond to any symmetry from the biochemical point of view, because the 3D structure of the corresponding double helix is not symmetrical. The apparent symmetry is an artifact of the string-based representation, but the corresponding molecule has neither rotational nor translational symmetry.

DNA signals can either be symmetrical (reverse complementary palindromes, tandem repeats) or asymmetrical.
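
The difference between reverse complementary palindromes and string-based palindromes can be checked with a few lines of Python. The sketch below builds idealized matrices (one residue per column) from consensus strings and tests whether a matrix equals its own reverse complement. Exact equality is a simplification chosen for this example: with real matrices one would rather look at the correlation between the matrix and its reverse complement. The sequence CACGTG and all function names are illustrative assumptions.

```python
def one_hot(seq):
    """Build a frequency matrix (list of columns, A, C, G, T order) from a string."""
    return [[1.0 if base == x else 0.0 for x in "ACGT"] for base in seq]


def revcomp(cols):
    """Reverse the column order and swap A<->T, C<->G within each column."""
    return [col[::-1] for col in reversed(cols)]


def is_rc_palindrome(cols):
    """True if the matrix is identical to its reverse complement."""
    return cols == revcomp(cols)


print(is_rc_palindrome(one_hot("CACGTG")))     # reverse complementary palindrome
print(is_rc_palindrome(one_hot("CAGTTGAC")))   # string-based palindrome only
```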

-matrix_id #

Obsolete option for returning matrix names. Replaced by -return matrix_name. Maintained for backward compatibility.

-return return_fields

List of fields to return (only valid for the formats "profiles" and "matches").

Supported return fields:

offset: ascending (default for the profile mode)
Ncor: decreasing (default for the matching mode)
cor: decreasing
cov: decreasing
SSD: ascending
NSW: decreasing
SW: decreasing
dEucl: ascending
NdEucl: ascending
NsEucl: decreasing
dKL: ascending
matrix_number: Number of the matrices in the input files
matrix_id: Identifiers of the matrices
matrix_name: Names of the matrices
matrix_ac
width: Width of the matrices and the alignment
strand: Direct (D) or Reverse complementary (R) comparison
offset: Offset between the positions of the first and second matrix
pos: Relative positions of the aligned matrices (start, end, strand, width)
consensus
rank
alignments_pairwise: Shifted matrices resulting from the pairwise alignments.
alignments_1ton: Shifted matrices resulting from the 1-to-N alignments.
alignments: Shifted matrices resulting from the alignments (pairwise and 1-to-N).
all: All supported output fields, including all metrics.

-sort sort_field

Field to sort the results. The sorting direction depends on the metric: ascending for dissimilarity metrics, decreasing for similarity metrics.

Supported sort fields:

offset: ascending (default for the profile mode)
Ncor: decreasing (default for the matching mode)
cor: decreasing
cov: decreasing
SSD: ascending
SW: decreasing
NSW: decreasing
dEucl: ascending
NdEucl: ascending
NsEucl: decreasing
dKL: ascending
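
The sorting directions listed above can be summarized in a lookup table, as in the Python sketch below. The representation of result rows as dictionaries and the numerical values are made up for this example; only the field names and directions come from the list above.

```python
# Sorting direction per field: ascending for dissimilarities, decreasing for similarities
SORT_DIRECTION = {
    "offset": "ascending", "Ncor": "decreasing", "cor": "decreasing",
    "cov": "decreasing", "SSD": "ascending", "SW": "decreasing",
    "NSW": "decreasing", "dEucl": "ascending", "NdEucl": "ascending",
    "NsEucl": "decreasing", "dKL": "ascending",
}


def sort_rows(rows, field):
    """Sort result rows (hypothetical dicts keyed by field name) on one field."""
    reverse = SORT_DIRECTION[field] == "decreasing"
    return sorted(rows, key=lambda row: row[field], reverse=reverse)


rows = [{"id": "m1", "Ncor": 0.42, "dEucl": 0.1},
        {"id": "m2", "Ncor": 0.91, "dEucl": 0.2},
        {"id": "m3", "Ncor": 0.65, "dEucl": 0.8}]
print([row["id"] for row in sort_rows(rows, "Ncor")])
print([row["id"] for row in sort_rows(rows, "dEucl")])
```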

-lth param lower_threshold

-uth param upper_threshold

Threshold on some parameter (-lth: lower, -uth: upper threshold).

Supported threshold fields: rank, dEucl, cor, cov, ali_len, offset

WISH LIST

Additional metrics
-pseudo: Pseudo-counts to be added to all matrices.
-comparison_mode table | consensus