RSAT - Help about compare-classes
Description
Compare two class files (the query file and the reference file). Each
class of the query file is compared to each class of the reference
file. The number of common elements is reported, as well as the
probability to observe at least this number of common elements by
chance alone.
Authors
Jacques van Helden, with a contribution of Joseph Tran for a prototype
version.
Interpretation of the results
Options
Warning: the following text is the help message from the
on-line program, we still need to format it for the html form. There
re possibly some differencs in option names, but the help already
provides information about the statistics used by the program, and the
meaning of the different options.
-r ref_classes
A tab-delimited text file containing the dscription of
reference classes (see format description below).
-q query_classes
A tab-delimited text file containing the dscription of
query classes (see format description below).
-i input_file
This file will be used as both reference and query.
This is equivalent to
-q input_file -r input_file
-sc score column
Specify a column of the input file containing a score associated to
each member. The score is used for some metrics like the dot product.
-o outputfile
if not specified, the standard output is used.
This allows to place the command within a pipe.
-rnames class_name_file
File containing names for the reference classes.
Associate a name to each class, when the classes are
specified by an ID (ex Gene Ontology term IDs).
The class name file contains two columns
1) class ID
2) class name
-qnames class_name_file
File containing names for the query classes.
Same format as for -rnames.
-cat catalog
Compare the query file to pre-defined catalogs
(e.g. GO, MIPS functional classes, ...). These
catalogs associate each gene of a genome to one or
several classes. The organism must be specified
(option -org).
Supported catalogs:
GO,MIPS,TF_targets,ms_complexes,regulons
This option is currently supported for Saccharomyces
cerevisiae.
-org organism (for pre-defined catalogs)
-return return_fields
List of fields to return. Supported field :
dotprod,entropy,freq,jac_sim,members,occ,proba,rank
Return fields are grouped, so that each request will return
several fields. For example, the group "proba" returns the
P-value, the E-value and the significance.
group field description
rank rank Rank of the comparison
occ Q Number of elements in class Q
occ QR Number of elements found in the intersecion between classes R and Q
occ QvR Number of elements found in the union of classes R and Q. This is R or Q.
occ R Number of elements in class R
freq E(QR) Expected number of elements in the intersection
freq F(!Q!R) Frequency of !Q!R elements relative to population size. F(!Q!R)=!Q!R/P
freq F(Q!R) Frequency of Q!R elements relative to population size. F(Q!R)=Q!R/P
freq F(Q) Frequency of Q elements relative to population size. F(Q)=Q/P
freq F(QR) Frequency of QR elements relative to population size. F(QR)=QR/P
freq F(R!Q) Frequency of R!Q elements relative to population size. F(R!Q)=R!Q/P
freq F(R) Frequency of R elements relative to population size. F(R)=R/P
freq P(QR) Probability of Q and R (Q^R), assuming independence. P(QR) = F(Q)*F(R)
freq P(Q|R) Probability of Q given R. P(Q|R) = F(QR)/F(R)
freq P(R|Q) Probability of R given Q. P(R|Q) = F(QR)/F(Q)
proba E_val E-value of the intersection. E_val = P_val * nb_tests
proba P_val P-value of the intersection, calculated witht he hypergeometric function. Pval = P(X >= QR).
proba sig Significance of the intersection. sig = -log10(E_val)
jac_sim jac_sim Jaccard' similarity. jac_sim = intersection/union = (Q and R)/(Q or R)
dotprod dotprod Dot product (using the score column)
entropy H(Q) Entropy of class Q. H(Q) = - F(Q)*log[F(Q)] - F(!Q)*log[F(!Q)]
entropy H(Q,R) Join entropy for classes Q and R. H(Q,R) = - F(QR)*log[F(QR)] - F(Q!R)*log[F(Q!R)] - F(R!Q)*log[F(R!Q)] - F(!Q!R)*log[F(!Q!R)]
entropy H(Q|R) Conditional entropy of Q given R. H(Q|R) = H(Q,R) - H(R)
entropy H(R) Entropy of class R. H(R) = - F(R)*log[F(R)] - F(!R)*log[F(!R)]
entropy H(R|Q) Conditional entropy of R given Q. H(R|Q) = H(Q,R) - H(Q)
entropy I(Q,R) Mutual information of classs Q and R. I(Q,R) = H(Q) + H(R) - H(Q,R)
entropy IC Information content (as defined by Schneider, 1986). IC = F(QR) log[F(QR)/F(Q)F(R)]
entropy U(Q|R)
entropy U(R|Q)
entropy dH(Q,R) Entropy distance between classes Q and R. dH(Q,R) = H(Q,R) - H(Q)/2 - H(R)/2
-uth field #
upper threshold value for a given field
Supported_fields: E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,rank,sig
-lth field #
lower threshold value for a given field
(same fields as -uth)
-pop #
Population size. If not specified, the population size
is estimated as the number of distinct elemenst in the
whole set of referenc classes.
-sort key
sort on the basis of the specified key.
Supported keys: E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,names,sig
-rep replacement
Sampling was performed with replacement, i.e. a given
element can appear several times in the same class.
In this case, the binomial distribution is used
instead of the hypergeometric.
-sym symmetric comparison
(only useful when -rep is activated, because th
hypergeometric is by definition symmetric)
-distinct
Prevent to compare each class with itself (when the
reference and query files contain the same classes).
-triangle
(ony valid if query file and reference file are the same)
Do not perform the reciprocal comparisons: if
reference A has already been compared to query B, then
reference B does not need to be compared to query A. .
With matrix output, this returns only the lower
triangle fo the matrix.
-matrix [key]
Return a pairwise matrix, where each row corresponds
to a reference class, each column to a query class,
and each cell contains a comparison between the two
classes. The next argument indicates which statistics
has to be returne in the matrix (default = sig).
Supported : E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,rank,sig
The argument -matrix can be used iteratively to export several
matrices in the same output file.
Example:
-matrix QR -matrix sig -matrix 'I(Q,R)'
-margins
Print the marginal values (total, sum, average) for 'return'
table and the matricces.
-null
null string (default NA) displayed for undefined values.
-base
logarithm base (Default: 2.71828182845905)
-dot dot_file
Export a graph with the associations in a dot
file. Dot files can be visualized and modified with
the GraphViz package (http://www.graphviz.com/), which
contains several methods of automatic layout.
-gml gml_file
Export a graph with the associations in a gml
file, which can be visualized and modified with
various visualization packages, including
GraphViz (http://www.graphviz.com/)
yed (http://www.yworks.com/en/products_yed_about.htm)
-multi_cor
Factor used for the multi-testing correction.
Supported values:
nt number of significance tests (default)
nq number of query classes
nr number of reference classes
nc number of comparisons (nc = nq * nr)
The differences between these four options are
explained below (section E-value).
PROBABILITIES
P-VALUE
The P-value is the probability for one comparison to return a
false positive. In other words, it is the probability to
observe at least c common elements between a given query class
and a given reference class. It can be calculated with
different formulae, depending on the underlying random model.
Let us assume that we have :
q size of the query class
r size of the reference class
c number of common elements
n population size
HYPERGEOMETRIC
q i q-i q
P_value = P(X >= c) = SUM ( C C / C )
i=c r n-r n
BINOMIAL
When the option -rep (replacement) is active,
probabilities are calculated on the basis of the
binomial distribution instead of the hypergeometric.
The binomial formula is applied with
p_r = r/n probability of success at each trial
nb of trials = q
nb of successes = c
q
P_value = P(X >= c) = SUM (binom(i,q,p_r))
i=c
Beware: the binomial gives an assymmetric result,
i.e. the fact to swap query and reference classes
changes the probability. This can be circumvented by
using the option -sym, described below.
SYMMETRICAL COMPARISON WITH THE BINOMIAL
When the comparison is assumed to be symmetrical, the
program calculates the joint probability fo at least c
elements to belong to both the query set and the
reference set.
In this case, the binomial is applied with :
p_qr = q/n * r/n
= proba of success at each trial
nb of trials = n
nb of successes = c
q
P_value = P(X >= c) = SUM (binom(i,n,p_qr))
i=c
E-VALUE
Assuming that there are x query classes and y
reference classes, each analysis consists in x*y
comparisons. Thus, the P-value can be misleading,
because even low P-values are expected to emerge by
chance alone when the number of query and/or reference
classes is very high. The E-value reflects better the
degree of exceptionality.
The option -multi_cor allows to choose among 4
possible multi-testing correction factors. The choice
is left to the user, depending on the question he/she
wants to answer.
-multi_cor nr
How many false positives do we expect per query
class ?
E-value = P-value * nr
Where nr is the number of reference classes
(e.g. the number of classes in the GO
classification).
-multi_cor nq
How many false positives do we expect per
reference class ?
E-value = P-value * nq
Where nq is the number of query classes
(e.g. the number of co-expression cluster).
-multi_cor nc
How many false positives do we expect for the
whole set of comparisons ?
nc = nq * nr
E-value = P-value * nc
Where nc is the number of comparisons between a
query class and a reference class.
-multi_cor nt
How many false positives do we expect for the
whole set of significance tests actually
performed ?
E-value = P-value * nt
where nt is the number of significance tests
actually performed (the number of times the
hypergeometric or binomial formula was used)
If you do not use any "pre-filtering" threshold, the
options nc and nt give the same result (nt = nq * nr),
since a significance test is performed for each pair
of query class - reference class.
If you use pre-filtering thresholds (for example -lth occ 1,
to select only the pairs with at least one common member), the
actual number of tests can in some cases be much smaller than
the number of comparisons (nt <= nc = nq*nr).
SIG
The significance index is the minus log of the
E-value. It is calculated in base 10.
sig = -log10(E_val)
This index gives an intuitive perception of the
exceptionality of the common elements : a negative sig
indicates that the common matches are likely to come by
chanc alone, a positive value that they are
significant. Higher sig values indicate a higher
significance.
OUTPUT FORMAT
The program returns a tabdelimited file with one row per
combination of reference-query class, and one column per
return field.
Default return fields:
1) ref reference class
2) query query class
3) ref# number of members in the ref class
4) query# number of members in the query class
5) common number of elements in common between the query
and reference class
Additional return values are optionally returned, and can be
specified with the -return option.
FILE FORMATS
CLASS FILE
The class file specifies the composition of several gene
families. It is a text file containing 2 columns separated by
a tab character.
col 1: class member
col 2: class name
Additional columns are ignored.
Lines starting with a semicolumn (;) are ignored, allowing to
document the class files with comments.
A given element (e.g. gene) can belong simultaneously to
several families. In such a case, the element will appear on
several rows (one per class),
Example
; genes responding to Phosphate stress
pho5 PHO
pho8 PHO
; genes responding to nitrogen starvation
DAL5 NIT
GAP1 NIT
...