Name
Description
Performs inter-conversions between various formats of
position-specific scoring matrices (PSSM).
The program also performs a statistical analysis of the original
matrix to provide different position-specific scores (weights,
frequencies, information contents), general statistics (E-value, total
information content), and synthetic descriptions (consensus).
PSSM can be used to represent the binding specificity of a transcription
factor or the conserved residues of a protein domain.
Each row of the matrix corresponds to one residue (nucleotide or
amino acid depending on the sequence type). Each column corresponds
to one position in the alignment. The value within each cell
represents the frequency of the given residue at the given position.
INPUT/OUTPUT FORMATS
Some formats are supported only for input, others only for output. More
formats are accepted for input, because the typical use of this
program is to convert a PSSM obtained from a database (e.g. TRANSFAC)
or a pattern-discovery program (e.g. consensus, gibbs, meme,
MotifSampler, ...) into a matrix suitable either for scanning (with
matrix-scan) or for computing statistical parameters (see the return
fields below). We generally use the TRANSFAC (tf) format, in which we can specify identifiers and names for the matrix.
 TRANSFAC (input/output)
Format used in the TRANSFAC database
AC MA0001.1
XX
ID AGL3
XX
DE MA0001.1 AGL3; from JASPAR
PO A C G T
1 0 94 1 2
2 3 75 0 19
3 79 4 3 11
4 40 3 4 50
5 66 1 1 29
6 48 2 0 47
7 65 5 5 22
8 11 2 3 81
9 65 3 28 1
10 0 3 88 6
XX
CC program: jaspar
XX
//
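As an illustration of how such a record is organised, here is a minimal Python sketch that reads TRANSFAC-style text. It is not the program's own parser: the function name `parse_transfac` and the returned dictionary layout are hypothetical, and accepting both `PO` and `P0` as header labels is an assumption. Note that the file lists one row per position, so the result is stored position by position.

```python
def parse_transfac(text):
    """Parse TRANSFAC-style records; returns one dict per matrix (sketch)."""
    matrices = []
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "XX":
            continue  # blank line or field separator
        key = fields[0]
        if key == "//":  # record delimiter
            if current is not None:
                matrices.append(current)
            current = None
            continue
        if current is None:
            current = {"AC": None, "ID": None, "residues": [], "rows": []}
        if key in ("AC", "ID") and len(fields) > 1:
            current[key] = fields[1]
        elif key in ("PO", "P0"):  # header row: residue order (assumed labels)
            current["residues"] = fields[1:]
        elif key.isdigit():  # one row per position
            current["rows"].append([float(x) for x in fields[1:]])
    if current is not None:  # tolerate a missing final delimiter
        matrices.append(current)
    return matrices


example = """\
AC MA0001.1
XX
ID AGL3
XX
PO A C G T
1 0 94 1 2
2 3 75 0 19
XX
//
"""
matrices = parse_transfac(example)
```

Other fields (DE, CC, ...) are simply ignored by this sketch.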
tab (input/output)
Tab-delimited file. One row per residue, one column per position. The first
column of each row indicates the residue, the following columns
give the frequency of that residue at the corresponding position
of the matrix.
The tab format accepts a user-specified set of return fields (option
-return), providing different statistics on the matrix (counts,
frequencies, weights, information, other parameters: see description
below).
; MET4 matrix, from Gonze et al. (2005). Bioinformatics 21, 3490-500.
A  7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C  5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G  4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T  0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
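The layout above (one row per residue, comment lines starting with a semicolon, a double slash closing the matrix) could be read with a few lines of Python. This is only a sketch under those assumptions; `parse_tab_matrix` is a hypothetical name, and only a single count matrix is handled.

```python
def parse_tab_matrix(text):
    """Read a single tab-format count matrix into {residue: [counts]}."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("//"):
            break  # end of the matrix
        fields = line.split()
        counts[fields[0]] = [float(x) for x in fields[1:]]
    return counts


example = """\
; MET4 matrix
A  7 9 0 0 16 0 1 0 0 11 6 9 6 1 8
C  5 1 4 16 0 15 0 0 0 3 5 5 0 2 0
G  4 4 1 0 0 0 15 0 16 0 3 0 0 2 0
T  0 2 11 0 0 1 0 16 0 2 2 2 10 11 8
//
"""
met4 = parse_tab_matrix(example)
```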
JASPAR (input/output)
http://jaspar.genereg.net/html/TEMPLATES/help.html
> Mycn
A [ 0 29 0 2 0 0 ]
C [31 0 30 1 3 0 ]
G [ 0 0 0 28 0 31]
T [ 0 2 1 0 28 0 ]
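A possible way to read this bracketed layout is sketched below. The function name `parse_jaspar` is hypothetical; the regular expression is used because a closing bracket may be glued to the last number (as in the G row above).

```python
import re


def parse_jaspar(text):
    """Read bracketed JASPAR-style matrices into a list of dicts (sketch)."""
    matrices = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # header line: start a new matrix
            matrices.append({"name": line[1:].strip(), "counts": {}})
        elif matrices:
            residue, rest = line.split(None, 1)
            values = re.findall(r"\d+(?:\.\d+)?", rest)
            matrices[-1]["counts"][residue] = [float(v) for v in values]
    return matrices


example = """\
> Mycn
A [ 0 29 0 2 0 0 ]
C [31 0 30 1 3 0 ]
G [ 0 0 0 28 0 31]
T [ 0 2 1 0 28 0 ]
"""
mycn = parse_jaspar(example)[0]
```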
MSCAN (input)
http://www.cisreg.ca/cgibin/mscan/MSCAN
>mef2
10 0 0 0 22 0 6 2 3 4 22 10
0 2 12 0 0 0 0 0 0 0 0 0
9 20 2 0 0 0 0 0 0 0 0 10
3 0 8 22 0 22 16 20 19 18 0 2
>myf
7 9 4 0 16 7 0 6 0 0 6 0
8 0 2 15 0 0 15 0 0 10 0 0
1 7 10 1 0 9 1 0 16 6 0 16
0 0 0 0 0 0 0 10 0 0 10 0
meme (input)
Output file from MEME, the pattern-discovery program developed by
Tim Bailey. This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
http://meme.nbcr.net/meme/doc/memeformat.html
Background letter frequencies
A 0.303 C 0.183 G 0.209 T 0.306
MOTIF crp alternative name
letter-probability matrix: alength= 4 w= 19 nsites= 17 E= 4.1e-009
0.000000 0.176471 0.000000 0.823529
0.000000 0.058824 0.647059 0.294118
0.000000 0.058824 0.000000 0.941176
0.176471 0.000000 0.764706 0.058824
0.823529 0.058824 0.000000 0.117647
0.294118 0.176471 0.176471 0.352941
0.294118 0.352941 0.235294 0.117647
0.117647 0.235294 0.352941 0.294118
0.529412 0.000000 0.176471 0.294118
0.058824 0.235294 0.588235 0.117647
0.176471 0.235294 0.294118 0.294118
0.000000 0.058824 0.117647 0.823529
0.058824 0.882353 0.000000 0.058824
0.764706 0.000000 0.176471 0.058824
0.058824 0.882353 0.000000 0.058824
0.823529 0.058824 0.058824 0.058824
0.176471 0.411765 0.058824 0.352941
0.411765 0.000000 0.000000 0.588235
0.352941 0.058824 0.000000 0.588235
meme_block (input) older format from MEME
CISBP (input)
Format used in the CIS-BP database.
Similar to TRANSFAC, but without the AC/ID lines, and with the position
line labeled Pos instead of PO.
ClusterBuster (cb) (input/output)
ClusterBuster output file (usual extension .cb), which can be
used as input by various other programs (clover, trap). The
header line starts with a > (as in the fasta format). The matrix
is then printed "vertically" on the following lines: each
column corresponds to one residue, and each row to a position
in the alignment. For TRAP (Roider et al., Bioinformatics,
2007), the "/name=" field is necessary for the program to work.
>element1 /name=element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
....
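Because this format is "vertical" (one row per position), reading it amounts to transposing the rows into per-residue lists. The sketch below illustrates this; `parse_cluster_buster` is a hypothetical name, and the column order A, C, G, T is an assumption that should be checked against the file at hand.

```python
def parse_cluster_buster(text, residues="ACGT"):
    """Read vertical matrices into per-residue lists (sketch).

    The column order given by `residues` is an assumption.
    """
    matrices = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # header line, as in fasta
            matrices.append({"header": line[1:].strip(),
                             "counts": {r: [] for r in residues}})
        elif matrices:
            values = [float(x) for x in line.split()]
            if len(values) != len(residues):
                raise ValueError("unexpected number of columns: " + line)
            for residue, value in zip(residues, values):
                matrices[-1]["counts"][residue].append(value)
    return matrices


example = """\
>element1 /name=element1
0 4 2 14
12 0 0 8
8 0 1 11
20 0 0 0
"""
element1 = parse_cluster_buster(example)[0]
```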
STAMP and STAMP-transfac (input/output)
Converts the matrix from/to a string in STAMP format
(http://www.benoslab.pitt.edu/stamp/help.html).
STAMP is a dialect of the TRANSFAC format, with important differences:
  the fields ID and AC are absent, and the matrix ID comes in the field DE
  the header row (PO) is not supported
  the positions start at 0 instead of 1
  there is no matrix delimiter (the double slash)
In addition, STAMP admits two variants:
sequences (input)
Create a matrix from a FASTA sequence file containing the pre-aligned sites.
The method just reads the sequences and counts the residue frequencies
at each position.
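The counting step described here can be sketched as follows. This is an illustration rather than the program's implementation: `count_matrix_from_fasta` is a hypothetical name, the alphabet defaults to DNA, and characters outside the alphabet are silently ignored (an assumption).

```python
def count_matrix_from_fasta(text, alphabet="ACGT"):
    """Count residues per position in pre-aligned FASTA sequences (sketch)."""
    sequences, chunks = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # a new header closes the previous sequence
            if chunks:
                sequences.append("".join(chunks))
                chunks = []
        else:
            chunks.append(line.upper())
    if chunks:
        sequences.append("".join(chunks))

    width = len(sequences[0]) if sequences else 0
    if any(len(s) != width for s in sequences):
        raise ValueError("pre-aligned sites must all have the same length")

    counts = {residue: [0] * width for residue in alphabet}
    for seq in sequences:
        for position, residue in enumerate(seq):
            if residue in counts:  # other characters are ignored (assumption)
                counts[residue][position] += 1
    return counts


sites = ">s1\nACGT\n>s2\nACGA\n>s3\nTCGT\n"
site_counts = count_matrix_from_fasta(sites)
```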
patser (output)
This format can be used as input to scan sequences with
patser, the pattern-matching program developed by Jerry Hertz.
This is actually the same format as tab (described above), but
the only return field is the count matrix.
consensus (input/output)
Output file from consensus, the pattern-discovery program
developed by Jerry Hertz (Hertz et al., Comput Appl Biosci,
1990: 6, 81-92). This file contains one or several matrices, plus
additional information on the parameters used for pattern
discovery (e.g. prior residue frequencies).
gibbs (input)
Output file from gibbs, the pattern-discovery program
developed by Andrew Neuwald (Lawrence et al. Science, 1993:
262, 208-214; Neuwald et al. Protein Sci, 1995: 4, 1618-1632).
MotifSampler (input/output)
Output file from MotifSampler, the pattern-discovery program
developed by Gert Thijs (Thijs et al. Bioinformatics, 2001: 17,
1113-1122).
alignAce (input)
Uniprobe (input)
http://the_brain.bwh.harvard.edu/uniprobe/downloads.php
Protein: Cbf1 Seed kmer: ATCACGTG Enrichment Score: 0.499010437669239
A: 0.251714422716682 0.231020715440932 0.371175995676819 0.343515826416987 0.189181911178663 0.373249743142318 0.159425685466501 0.387398837326962 0.160370450851774 0.00579566973382471 0.984310428811586 0.000520578518462409 0.0512168242470759 0.00554791387069823 0.00108871328362558 0.436684281349379 0.106429865986653 0.0872652424535894 0.2779359708333 0.222894293683715 0.366796870220836 0.226022414885529
C: 0.10226033847082 0.315992694980937 0.148489261324769 0.182792315701972 0.406736016253256 0.213860951366744 0.324485360588445 0.0418650553618826 0.045745403962552 0.99055073718171 0.0105040001038691 0.975244090256611 0.0064124125195024 0.00433861322483728 0.0016092288172844 0.262472975122313 0.184027817720692 0.549338793818378 0.127202171464537 0.198102864294932 0.306135553163069 0.321957177096839
G: 0.130212757211399 0.266959960914667 0.154092799416608 0.241158374156534 0.15046928890062 0.0930127274890034 0.354968645980097 0.554463521652852 0.0956960762910429 0.00242503126912309 0.0010685902059978 0.00922202192948052 0.94180586404824 0.00940766558113401 0.973242502288466 0.0857562483507934 0.0600857360004131 0.129208087201377 0.253172180381345 0.437904236128241 0.1940215817188 0.116641876697499
T: 0.515812481601099 0.186026628663464 0.326241943581804 0.232533483724507 0.253612783667461 0.319876578001935 0.161120307964957 0.0162725856583034 0.698188068894632 0.00122856181534235 0.00411698087854669 0.0150133092954459 0.000564899185181348 0.98070580732333 0.0240595556106241 0.215086495177514 0.649456580292242 0.234187876526655 0.341689677320817 0.141098605893113 0.133045994897295 0.335378531320133
Encode
http://compbio.mit.edu/encodemotifs/
>SIX5_disc1 SIX5_GM12878_encodeMyers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234
infogibbs (input/output)
Output file from RSAT info-gibbs.
info-gibbs is a gibbs sampler based on the optimization of the
information content of the matrix (rather than the weight of
the sampled segments). info-gibbs was developed by Matthieu De France.
assembly (input)
Output file from the RSAT program pattern-assembly. One assembly
file can contain zero, one or several assemblies. Each
assembly is converted to a position-specific scoring matrix by
taking, for each residue at each position, the score of the
most significant pattern (oligonucleotide) containing that
residue in this position of the assembly.
feature (input)
Output file from RSAT convert-features.
This format makes it possible to obtain a PSSM from a list of (supposedly
pre-aligned) sites. These sites can themselves have been
collected by scanning sequences with a matrix (matrix-scan) or
by searching string-based patterns in a sequence
(dna-pattern).
Converting features to matrices can for example be useful for the
iterative refinement of a matrix (collecting sites with a
matrix, and building a new matrix from those sites).
Another application is to detect oligomers or dyads in a
sequence set, and build a matrix from these.
clustal (input)
Alignment file from the popular multiple alignment program clustalw.
RETURN FIELDS FOR THE TAB-DELIMITED OUTPUT FORMAT
 counts

Each cell of the matrix indicates the number of occurrences of the
residue at a given position of the alignment.
 profile

The matrix is printed vertically (each matrix column becomes a row in
the output text). Additional parameters (consensus, information) are
indicated beside each position, and a histogram is drawn.
 crude frequencies

Relative frequencies are calculated as the counts of residues divided
by the total count of the column.

Fij=Cij/SUMi(Cij)

where
 Cij

is the absolute frequency (counts) of residue i at position j of the alignment
 Fij

is the relative frequency of residue i at position j of the alignment
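The formula Fij=Cij/SUMi(Cij) can be sketched in Python as below. The function name `crude_frequencies` and the dictionary layout (residue mapped to a list of counts, one per position) are illustrative choices; returning 0 for an empty column is an assumption.

```python
def crude_frequencies(counts):
    """Fij = Cij / SUMi(Cij), computed column by column (sketch)."""
    residues = list(counts)
    width = len(counts[residues[0]])
    freqs = {r: [0.0] * width for r in residues}
    for j in range(width):
        column_total = sum(counts[r][j] for r in residues)
        for r in residues:
            # An empty column yields 0 rather than a division error (assumption).
            freqs[r][j] = counts[r][j] / column_total if column_total else 0.0
    return freqs


# First two columns of the Mycn matrix shown in the JASPAR example.
mycn_counts = {"A": [0, 29], "C": [31, 0], "G": [0, 0], "T": [0, 2]}
mycn_freqs = crude_frequencies(mycn_counts)
```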
 frequencies corrected with pseudo-weights

Relative frequencies can be corrected by a pseudo-weight (b) to reduce
the bias due to the small number of observations.

F''ij=(Cij+b*Pi)/[SUMi(Cij)+b]

where
 Pi

is the prior frequency for residue i
 b

is the pseudo-weight, which is "shared" between residues according to
their prior frequencies.
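A small numeric sketch of the pseudo-weight correction follows. With counts 3, 1, 0, 0 in a column, equiprobable priors and b=1, the corrected frequencies are 0.65, 0.25, 0.05 and 0.05, so no cell remains at zero. The function name and the default b=1 are assumptions made for the illustration.

```python
def corrected_frequencies(counts, priors, pseudo=1.0):
    """F''ij = (Cij + b*Pi) / (SUMi(Cij) + b), column by column (sketch)."""
    residues = list(counts)
    width = len(counts[residues[0]])
    freqs = {r: [0.0] * width for r in residues}
    for j in range(width):
        column_total = sum(counts[r][j] for r in residues)
        for r in residues:
            freqs[r][j] = (counts[r][j] + pseudo * priors[r]) / (column_total + pseudo)
    return freqs


column = {"A": [3], "C": [1], "G": [0], "T": [0]}
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
corrected = corrected_frequencies(column, equiprobable, pseudo=1.0)
```

Because the pseudo-weight is distributed according to the priors, each column still sums to 1.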
 weights

Weights are calculated according to the formula from Hertz (1999), as
the natural logarithm of the ratio between the relative frequency
(corrected for pseudo-weights) and the prior residue probability.

Wij=ln(F''ij/Pi)
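The weight formula can be sketched as below, starting from counts so that the block is self-contained. The name `weight_matrix` and the default pseudo-weight of 1 are assumptions; priors are supposed to be strictly positive, so that the logarithm is always defined once a positive pseudo-weight is applied.

```python
from math import log


def weight_matrix(counts, priors, pseudo=1.0):
    """Wij = ln(F''ij / Pi), with F''ij the pseudo-weight corrected frequency."""
    residues = list(counts)
    width = len(counts[residues[0]])
    weights = {r: [] for r in residues}
    for j in range(width):
        column_total = sum(counts[r][j] for r in residues)
        for r in residues:
            corrected = (counts[r][j] + pseudo * priors[r]) / (column_total + pseudo)
            weights[r].append(log(corrected / priors[r]))
    return weights


column = {"A": [3], "C": [1], "G": [0], "T": [0]}
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
weights = weight_matrix(column, equiprobable)
```

A residue observed more often than expected gets a positive weight, one observed exactly as expected gets zero, and a rarer one gets a negative weight.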
 information

The crude information content is calculated according to the formula
from Hertz (1999).

Iij = Fij*ln(Fij/Pi)

In addition, we calculate a "corrected" information content which
takes pseudo-weights into account.

I''ij = F''ij*ln(F''ij/Pi)
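The same formula applies to both variants, depending on whether crude or corrected frequencies are supplied. In the sketch below, the function name is hypothetical and a cell with zero frequency is given an information of 0 (the usual limit of f*ln(f) when f tends to 0); that convention is an assumption, not something stated above.

```python
from math import log


def information_matrix(freqs, priors):
    """Iij = Fij * ln(Fij / Pi) for each cell (sketch).

    Cells with Fij = 0 are set to 0 (limit convention, an assumption).
    """
    return {
        residue: [f * log(f / priors[residue]) if f > 0 else 0.0 for f in row]
        for residue, row in freqs.items()
    }


freqs = {"A": [1.0, 0.5], "C": [0.0, 0.5], "G": [0.0, 0.0], "T": [0.0, 0.0]}
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
info = information_matrix(freqs, equiprobable)
# Information per column, obtained by summing over residues.
column_info = [sum(info[r][j] for r in info) for j in range(2)]
```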
 P-value

The P-value indicates the probability of observing at least Cij
occurrences of a residue at a given position of the matrix. It is
calculated with the binomial formula:

Pij = SUM(k=Cij..C.j) [ C.j! / (k!(C.j-k)!) ] * Pi^k * (1-Pi)^(C.j-k)

where
 Cij

is the number of occurrences of residue i at position j of
the matrix.
 C.j

is the sum of all residue occurrences at position j of the
matrix.
 Pi

is the prior probability of residue i.
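The binomial tail defined by Cij, C.j and Pi can be sketched directly in Python (`math.comb` requires Python 3.8 or later). The function name `binomial_pvalue` is hypothetical, and counts are assumed to be integers.

```python
from math import comb


def binomial_pvalue(c_ij, c_j, p_i):
    """Probability of at least c_ij successes among c_j trials of probability p_i."""
    return sum(
        comb(c_j, k) * p_i ** k * (1 - p_i) ** (c_j - k)
        for k in range(c_ij, c_j + 1)
    )


# A residue seen twice in a column of two sites, with prior 0.25.
p_two_of_two = binomial_pvalue(2, 2, 0.25)
# At least one occurrence among two sites, with prior 0.5.
p_one_of_two = binomial_pvalue(1, 2, 0.5)
```

For large columns, a direct sum of this kind may lose precision in the far tail; a dedicated statistics library would be a safer choice there.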
 parameters

Returns a series of parameters associated with the matrix. The list of
parameters to be exported depends on the input format (each pattern
discovery program returns specific parameters, which are more or less
related to each other but not identical).

Some additional parameters are optionally calculated.
 consensus

The degenerate consensus is calculated by collecting, at each
position, the list of residues with a positive weight. Contrary to
most applications, this consensus is thus weighted by prior residue
frequencies: a residue with a high frequency might not be represented
in the consensus if this frequency does not significantly exceed the
expected frequency. Uppercase letters are used to highlight weights >= 1.

The consensus is exported as a regular expression, and with the IUPAC
code for ambiguous nucleotides (http://www.chem.qmw.ac.uk/iupac/misc/naseq.html).

A (Adenine)
C (Cytosine)
G (Guanine)
T (Thymine)
R = A or G (puRines)
Y = C or T (pYrimidines)
W = A or T (Weak hydrogen bonding)
S = G or C (Strong hydrogen bonding)
M = A or C (aMino group at common position)
K = G or T (Keto group at common position)
H = A, C or T (not G)
B = G, C or T (not A)
V = G, A, C (not T)
D = G, A or T (not C)
N = G, A, C or T (aNy)

The strict consensus indicates, at each position, the residue with the
highest positive weight.
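The two consensus rules can be sketched from a weight matrix as follows. Several details are assumptions made for this illustration only: the function name, the use of "n" when no residue has a positive weight, and the choice to print a position in uppercase when the highest weight at that position is >= 1. The program's own output may differ on these points.

```python
# IUPAC codes indexed by the set of nucleotides they stand for.
IUPAC = {frozenset(k): v for k, v in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "AT": "W", "CG": "S", "AC": "M", "GT": "K",
    "ACT": "H", "CGT": "B", "ACG": "V", "AGT": "D", "ACGT": "N",
}.items()}


def consensus_from_weights(weights):
    """Return (degenerate, strict) consensus strings from a weight matrix."""
    residues = list(weights)
    width = len(weights[residues[0]])
    degenerate, strict = [], []
    for j in range(width):
        positive = {r: weights[r][j] for r in residues if weights[r][j] > 0}
        if not positive:  # no residue exceeds its expectation (assumption: "n")
            degenerate.append("n")
            strict.append("n")
            continue
        best = max(positive, key=positive.get)
        code = IUPAC[frozenset(positive)]
        if positive[best] >= 1:  # uppercase rule as assumed in the lead-in
            degenerate.append(code)
            strict.append(best)
        else:
            degenerate.append(code.lower())
            strict.append(best.lower())
    return "".join(degenerate), "".join(strict)


toy_weights = {
    "A": [1.2, 0.3, -1.0],
    "C": [-2.0, -1.0, -1.0],
    "G": [-2.0, 0.5, -0.2],
    "T": [-2.0, -1.0, -3.0],
}
degenerate, strict = consensus_from_weights(toy_weights)
```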
 information

The total information is calculated by summing the information content
of all the cells of the matrix. This parameter is already returned by
the program consensus (Hertz), but not by other programs.
 logo

Sequence logo, a visual representation of the motif, where each column
of the matrix is represented as a stack of letters whose size is
proportional to the corresponding residue frequency. The total height
of each column is proportional to its information content.
Sequence logos are generated using the freeware
program Weblogo.