compare-classes help

RSAT - Help about compare-classes

Description

Authors

Jacques van Helden, with a contribution of Joseph Tran for a prototype version.
Interpretation of the results

Options

Warning
        -r ref_classes
        		A tab-delimited text file containing the dscription of
        		reference classes (see format description below).
        		
        -q query_classes
        		A tab-delimited text file containing the dscription of
        		query classes (see format description below).

    	-i input_file
      		  	This file will be used as both reference and query. 
     		  	This is equivalent to 
         	   	-q input_file -r input_file

   	 	-sc score column
        		Specify a column of the input file containing a score associated to
       			each member. The score is used for some metrics like the dot product.

    	-o outputfile
        		if not specified, the standard output is used.
        		This allows to place the command within a pipe.

    	-rnames class_name_file
        		File containing names for the reference classes.
        		Associate a name to each class, when the classes are
        		specified by an ID (ex Gene Ontology term IDs).  
        		The class name file contains two columns

            	1) class ID
            	2) class name


    	-qnames class_name_file
        		File containing names for the query classes.
        		Same format as for -rnames.
        		
       	-cat catalog
        		Compare the query file to pre-defined catalogs
        		(e.g. GO, MIPS functional classes, ...). These
        		catalogs associate each gene of a genome to one or
        		several classes.  The organism must be specified
        		(option -org).
        
        		Supported catalogs:
        		GO,MIPS,TF_targets,ms_complexes,regulons

        		This option is currently supported for Saccharomyces
        		cerevisiae.

    	-org organism (for pre-defined catalogs)        

    	-return return_fields
        		List of fields to return. Supported field :
        		dotprod,entropy,freq,jac_sim,members,occ,proba,rank

        		Return fields are grouped, so that each request will return
        		several fields.  For example, the group "proba" returns the
        		P-value, the E-value and the significance.

        		group   field   description
        		rank    rank    Rank of the comparison
        		occ     Q       Number of elements in class Q
        		occ     QR      Number of elements found in the intersecion between classes R and Q
        		occ     QvR     Number of elements found in the union of classes R and Q. This is R or Q.
        		occ     R       Number of elements in class R
        		freq    E(QR)   Expected number of elements in the intersection
        		freq    F(!Q!R) Frequency of !Q!R elements relative to population size. F(!Q!R)=!Q!R/P
        		freq    F(Q!R)  Frequency of Q!R elements relative to population size. F(Q!R)=Q!R/P
        	    freq    F(Q)    Frequency of Q elements relative to population size. F(Q)=Q/P
        		freq    F(QR)   Frequency of QR elements relative to population size. F(QR)=QR/P
        		freq    F(R!Q)  Frequency of R!Q elements relative to population size. F(R!Q)=R!Q/P
        		freq    F(R)    Frequency of R elements relative to population size. F(R)=R/P
        		freq    P(QR)   Probability of Q and R (Q^R), assuming independence. P(QR) = F(Q)*F(R)
        		freq    P(Q|R)  Probability of Q given R. P(Q|R) = F(QR)/F(R)
        		freq    P(R|Q)  Probability of R given Q. P(R|Q) = F(QR)/F(Q)
        		proba   E_val   E-value of the intersection. E_val = P_val * nb_tests
        		proba   P_val   P-value of the intersection, calculated witht he hypergeometric function. Pval = P(X >= QR). 
        		proba   sig     Significance of the intersection. sig = -log10(E_val)
        		jac_sim jac_sim Jaccard' similarity. jac_sim = intersection/union = (Q and R)/(Q or R)
        		dotprod dotprod Dot product (using the score column)
        		entropy H(Q)    Entropy of class Q. H(Q) = - F(Q)*log[F(Q)] - F(!Q)*log[F(!Q)]
        		entropy H(Q,R)  Join entropy for classes Q and R. H(Q,R) = - F(QR)*log[F(QR)] - F(Q!R)*log[F(Q!R)] - F(R!Q)*log[F(R!Q)] - F(!Q!R)*log[F(!Q!R)]
        		entropy H(Q|R)  Conditional entropy of Q given R. H(Q|R) = H(Q,R) - H(R)
        		entropy H(R)    Entropy of class R. H(R) = - F(R)*log[F(R)] - F(!R)*log[F(!R)]
        		entropy H(R|Q)  Conditional entropy of R given Q. H(R|Q) = H(Q,R) - H(Q)
        		entropy I(Q,R)  Mutual information of classs Q and R. I(Q,R) = H(Q) + H(R) - H(Q,R)
        		entropy IC      Information content (as defined by Schneider, 1986). IC = F(QR) log[F(QR)/F(Q)F(R)]
        		entropy U(Q|R)  
        		entropy U(R|Q)  
        		entropy dH(Q,R) Entropy distance between classes Q and R. dH(Q,R) = H(Q,R) - H(Q)/2 - H(R)/2


    	-uth field #
        		upper threshold value for a given field
        		Supported_fields: E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,rank,sig

    	-lth field #
        		lower threshold value for a given field
                (same fields as -uth)

    	-pop #
        		Population size. If not specified, the population size
        		is estimated as the number of distinct elemenst in the
        		whole set of referenc classes.

    	-sort key
        		sort on the basis of the specified key.
        		Supported keys: E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,names,sig

    	-rep    replacement
        		Sampling was performed with replacement, i.e. a given
        		element can appear several times in the same class.

        		In this case, the binomial distribution is used
        		instead of the hypergeometric.

    	-sym    symmetric comparison
        		(only useful when -rep is activated, because th
        		hypergeometric is by definition symmetric)

    	-distinct
        		Prevent to compare each class with itself (when the
        		reference and query files contain the same classes).

    	-triangle
        		(ony valid if query file and reference file are the same)
        		Do not perform the reciprocal comparisons: if
        		reference A has already been compared to query B, then
        		reference B does not need to be compared to query A. .
        	    With matrix output, this returns only the lower
        		triangle fo the matrix.

    	-matrix [key]
        		Return a pairwise matrix, where each row corresponds
        		to a reference class, each column to a query class,
        		and each cell contains a comparison between the two
        		classes. The next argument indicates which statistics
        		has to be returne in the matrix (default = sig).
        		Supported : E(QR),E_val,F(!Q!R),F(Q!R),F(Q),F(QR),F(R!Q),F(R),H(Q),H(Q,R),H(Q|R),H(R),H(R|Q),I(Q,R),IC,P(QR),P(Q|R),P(R|Q),P_val,Q,QR,QvR,R,U(Q|R),U(R|Q),dH(Q,R),dotprod,jac_sim,rank,sig

        		The argument -matrix can be used iteratively to export several
        		matrices in the same output file.
        		Example: 
                	-matrix QR -matrix sig -matrix 'I(Q,R)' 

    	-margins
        		Print the marginal values (total, sum, average) for 'return'
        		table and the matricces.

    	-null 
        		null string (default NA) displayed for undefined values.

    	-base
        		logarithm base (Default: 2.71828182845905)

    	-dot dot_file
        		Export a graph with the associations in a dot
        		file. Dot files can be visualized and modified with
        		the GraphViz package (http://www.graphviz.com/), which
        		contains several methods of automatic layout.
        
    	-gml gml_file
        		Export a graph with the associations in a gml
        		file, which can be visualized and modified with
        		various visualization packages, including
         		GraphViz (http://www.graphviz.com/)
        		yed (http://www.yworks.com/en/products_yed_about.htm)

    	-multi_cor
        		Factor used for the multi-testing correction. 
        		Supported values: 
           			nt  number of significance tests (default)
           			nq  number of query classes
           			nr  number of reference classes
           			nc  number of comparisons (nc = nq * nr)
        		The differences between these four options are
        		explained below (section E-value).

PROBABILITIES

    P-VALUE

        The P-value is the probability for one comparison to return a
        false positive. In other words, it is the probability to
        observe at least c common elements between a given query class
        and a given reference class. It can be calculated with
        different formulae, depending on the underlying random model.

        Let us assume that we have :
                q       size of the query class
                r       size of the reference class
                c       number of common elements
                n       population size

    HYPERGEOMETRIC
                       q     i  q-i     q
        P_value = P(X >= c) = SUM ( C  C     / C  )
                      i=c    r  n-r     n

    BINOMIAL

        When the option -rep (replacement) is active,
        probabilities are calculated on the basis of the
        binomial distribution instead of the hypergeometric.


        The binomial formula is applied with
                p_r = r/n probability of success at each trial
                nb of trials = q
                nb of successes = c

                       q
        P_value = P(X >= c) = SUM (binom(i,q,p_r))
                      i=c

        Beware: the binomial gives an assymmetric result,
        i.e. the fact to swap query and reference classes
        changes the probability. This can be circumvented by
        using the option -sym, described below.

    SYMMETRICAL COMPARISON WITH THE BINOMIAL

        When the comparison is assumed to be symmetrical, the
        program calculates the joint probability fo at least c
        elements to belong to both the query set and the
        reference set.

        In this case, the binomial is applied with :
                p_qr = q/n * r/n
                  = proba of success at each trial
                nb of trials = n
                nb of successes = c

                       q
        P_value = P(X >= c) = SUM (binom(i,n,p_qr))
                      i=c

    E-VALUE

        Assuming that there are x query classes and y
        reference classes, each analysis consists in x*y
        comparisons. Thus, the P-value can be misleading,
        because even low P-values are expected to emerge by
        chance alone when the number of query and/or reference
        classes is very high. The E-value reflects better the
        degree of exceptionality.

        The option -multi_cor allows to choose among 4
        possible multi-testing correction factors. The choice
        is left to the user, depending on the question he/she
        wants to answer.

        -multi_cor nr

            How many false positives do we expect per query
            class ?

                E-value = P-value * nr

            Where nr is the number of reference classes
            (e.g. the number of classes in the GO
            classification).

        -multi_cor nq

            How many false positives do we expect per
            reference class ?

                E-value = P-value * nq

            Where nq is the number of query classes
            (e.g. the number of co-expression cluster).


        -multi_cor nc

            How many false positives do we expect for the
            whole set of comparisons ?

                nc = nq * nr
                E-value = P-value * nc

            Where nc is the number of comparisons between a
            query class and a reference class.


        -multi_cor nt
            How many false positives do we expect for the
            whole set of significance tests actually
            performed ?

                  E-value = P-value * nt

            where nt is the number of significance tests
            actually performed (the number of times the
            hypergeometric or binomial formula was used)

        If you do not use any "pre-filtering" threshold, the
        options nc and nt give the same result (nt = nq * nr),
        since a significance test is performed for each pair
        of query class - reference class.

        If you use pre-filtering thresholds (for example -lth occ 1,
        to select only the pairs with at least one common member), the
        actual number of tests can in some cases be much smaller than
        the number of comparisons (nt <= nc = nq*nr).

    SIG

        The significance index is the minus log of the
        E-value. It is calculated in base 10.

        sig = -log10(E_val)

        This index gives an intuitive perception of the
        exceptionality of the common elements : a negative sig
        indicates that the common matches are likely to come by
        chanc alone, a positive value that they are
        significant. Higher sig values indicate a higher
        significance.

OUTPUT FORMAT

        The program returns a tabdelimited file with one row per
        combination of reference-query class, and one column per
        return field.

        Default return fields:
        1) ref      reference class
        2) query    query class
        3) ref#     number of members in the ref class
        4) query#   number of members in the query class
        5) common   number of elements in common between the query 
                    and reference class

        Additional return values are optionally returned, and can be
        specified with the -return option.

FILE FORMATS
    CLASS FILE
        The class file specifies the composition of several gene
        families.  It is a text file containing 2 columns separated by
        a tab character.

            col 1:   class member
            col 2:   class name

        Additional columns are ignored. 

        Lines starting with a semicolumn (;) are ignored, allowing to
        document the class files with comments.

        A given element (e.g. gene) can belong simultaneously to
        several families. In such a case, the element will appear on
        several rows (one per class),

        Example
                ; genes responding to Phosphate stress
                pho5    PHO
                pho8    PHO
                ; genes responding to nitrogen starvation
                DAL5    NIT
                GAP1    NIT
                ...