next up previous contents
Next: Path finding Up: Network Analysis Tools (NeAT) Previous: Graph clustering   Contents

Subsections

Influence of graph alteration and randomization on clustering

Introduction

Although negative controls and method evaluation are crucial points to the experimental biologist, this is far from being the same in bioinformatics where, too often, no negative control is associated to the predictions, so that one cannot estimate the probability of these predictions to biogically valid.

For this reason, in NeAT we developped programs allowing to randomize and to add some specified levels of noise to networks. This allows the user to apply the techniques used to find relevant results on networks where there is less or no signal and thus were no interesting result should emerge.

NeAT programs are able to generate randomized networks according to three methods.

  1. Node degree conservation : this approach consists in shuffling the edges, each node keeping the same number of neighbors as in the original graph.
  2. Node degree distribution conservation : in which the global distribution of the node degree is conserved but each node presents a different degree than in the original graph.
  3. Erdos-Renyi randomization : where edges are distributed between pairs of nodes with equal probability.

Quantitative assessment of a clustering algorithm

Study case

In this demonstration, we will use the approach developped in [2] where we evaluated the performances of different graph clustering algorithms. Graph clustering algorithms allow to retrieve in a graph the groups of nodes that contain more connections between them than with the rest of the nodes of the graph. Clustering algorithms are often used in biology in order to extract coherent groups of nodes from networks (complexes detection (e.g. see [29,19,2,24]), protein families detection [7], co-expressed genes detection in co-expression networks (e.g see [20]), ...). The NeAT web server proposes the MCL (Markov Cluster algorithm) clustering algorithm developped by Stijn van Dongen [33,7]. To follow the command-line tools instructions, you should have MCL installed on your computer (available at http://micans.org/mcl/).

MCL simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. At each iteration, an inflation step is applied to enhance the contrast between regions of strong or weak flow in the graph. The process converges towards a partition of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. The value of the inflation parameter strongly influences the number of clusters. According to [2], the optimal inflation value for clustering protein interaction networks is 1.8.

We will use an artificial interaction network created from the complexes annotated in the MIPS database by creating an edge between all the nodes belonging to the same complex [23]. This network contains 12262 edges between 1095 nodes. We will then use the MCL clustering algorithm on this network, on a little altered network, on a highly altered network and finally on a randomized network.

We will then compare these clusters to the MIPS complexes and estimate how well MCL can retrieve protein complexes from a protein-protein interaction and the influence of the noise on the results.

In this example, we will only use random alteration, i.e., the edges that are removed are randomly chosen. This is done to mimick what happens really in biological experiments where some inter-relationships between the nodes (genes, proteins, metabolites, ...) may not be discovered (false negatives) or are erroneously discovered (false positives). However the alter-graph program also allows to alterate the network with targeted attack on user-selected nodes. In their study, Spirin and Mirny [31] showed the affect of node targeted attacks on clustering results.

Protocol for the web server

Dataset download

Go on the demo dataset web page.http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/ and download the MIPS complex network file (complexes_rm_00_ad_00.tab) and the complexes (mips_complexes.tab).

Network alteration

  1. In the NeATmenu, select the command network alteration.
  2. In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab containing the MIPS complexes network that you just downloaded.
  3. In the edges to add text area, enter 10%.
  4. In the edges to remove text area, enter 10%.
  5. Click on the button GO.
  6. Right click on the resulting file and save it with name complexes_rm_10_ad_10.tab.

    Re-do the this alteration procedure using 50% of edges removal and 100% of edges addition. Save the resulting file with name complexes_rm_50_ad_100.tab.

Network randomization

  1. In the NeATmenu, select the command network randomization.
  2. In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab.
  3. Select the Node degree conservation randomization type.
  4. Click on the button GO.
  5. Right click on the resulting file and save with name complexes_rm_00_ad_00_random.tab.

Networks clustering and clustering assessment

  1. In the NeATmenu, select the command graph-based clustering MCL.
  2. In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab.
  3. Click on the button GO. You should now obtain a link to the clustering results and the distribution of the sizes of the different clusters.
  4. In the Next step pannel, click on the button Compare these clusters to other clusters.
  5. In the Upload reference classes from file text area, load the mips_complexes.tab file.
  6. Choose the matrix file output format
  7. Click on the button GO. You now obtain a contingency table, i.e, a table with $N$ rows and $M$ columns ($N$ being the number of MIPS complexes and $M$, the number of clusters). Each cell contains the number of protein common to one complex and one cluster.
  8. To calculate some statistics on this contingency table, click on the contingency-table statistics button in the Next step pannel.
  9. The contingency-stats form appears. As the contingency table is already uploaded, just lick on the GO button.
  10. Save the resulting file under name contigency_stats_rm_00_ad_00.tab

Repeat these steps for complexes_rm_10_ad_10.tab, complexes_rm_50_ad_100.tab and complexes_rm_00_ad_00_random.tab and save the resulting files under the names contigency_stats_ad_10_rm_10.tab, contigency_stats_ad_50_rm_100.tab, contigency_stats_ad_00_rm_00_random.tab, respectively.

Protocol for the command-line tools

If you have installed a stand-alone version of the NeAT distribution, you can use the programs random-graph and alter-graph on the command-line. This requires to be familiar with the Unix shell interface. If you don't have the stand-alone tools, you can skip this section and read the next section (Interpretation of the results).

We will now describe the use of random-graph, alter-graph, compare-classes and contingency-stats as command line tools. For this tutorial, you need to have the MCL program installed.

Start by going on the demo dataset download web page.http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/ and downloading the MIPS complex network file (complexes_rm_00_ad_00.tab) and the complexes (mips_complexes.tab).

Network alteration

  1. Go in the directory where you downloaded the file.
  2. Use the following commands to alter the graph. Note that MCL is not an RSAT / NeAT program and thus cannot treat RSAT comments lines (starting with ``#'' or with ``;''). We thus have to suppress them in the command.
    			alter-graph	-v 1 -i complexes_rm_00_ad_00.tab \
    					-rm_edges 10% -add_edges 10% \
    					| cut -f 1,2 | grep -v ';' > complexes_rm_10_ad_10.tab
    
    Re-use this command, but modify the percentage of removed (-rm_edges 50%) and added edges (-add_edges 100%). Save the resulting file with name complexes_rm_50_ad_100.tab.

Network randomization

  1. Use the following commands to randomize the graph by shuffling the edges. The node degrees will be conserved.
    		random-graph 	-v 1 -i complexes_rm_00_ad_00.tab \
    				-random_type node_degree \
    				| cut -f 1,2 | grep -v ';' > complexes_rm_00_ad_00_random.tab
    

Networks clustering and clustering assessment

  1. Use the following commands to apply MCL on the network
      		mcl 	complexes_rm_00_ad_00.tab \
      			--abc -I 1.8 -o complexes_rm_00_ad_00_clusters.mcl
    
  2. Convert the cluster file obtained with MCL with the program convert-classes into a file that is readable by NeAT / RSAT (two column cluster file).
      		convert-classes	-i complexes_rm_00_ad_00_clusters.mcl 
      				-from mcl -to tab -o complexes_rm_00_ad_00_clusters.tab
    
  3. Compare the obtained clusters to the MIPS complexes with the program compare-classes
      		compare-classes	-q complexes_rm_00_ad_00_clusters.tab \
      				-r mips_complexes.tab \
      				-matrix QR \
      				-o complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab
    
  4. Study the obtained matrix with the contingency-stats program
    		contingency-stats -i complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab \
    				  -o contigency_stats_ad_00_rm_00.tab
    

Repeat these steps for complexes_rm_10_ad_10.tab, complexes_rm_50_ad_100.tab and complexes_rm_00_ad_00_random.tab and save the resulting files under the names contigency_stats_ad_10_rm_10.tab, contigency_stats_ad_50_rm_100.tab, contigency_stats_ad_00_rm_00_random.tab, respectively.

Interpretation of the results

We will now compare the performances of MCL when applied to networks containing an increasing proportion of noise or no signal at all.

Files description

Randomized network

As the real MIPS complexes network, this randomized network contains 12262 edges between 1095 nodes. With our parameter choice, no edge should be duplicated. However, as in random-graph the iterative process designed to avoid duplicated edges may not be totally efficient, some duplicated edges may subsist in the randomized network.

Altered networks

This file is a classical NeAT tab-delimited edge list. However, there is a fifth column that indicates whether the edge comes from the original graph (original) or was added randomly (random).

Contingency table

See the previous chapter (Graph clustering) for a complete description of a contigency table.

Contingency table metrics

A list of metrics and their value. These will be described in the next section.

Metrics description

Sensitivity, Positive predictive value and geometric accuracy

See the previous chapter (Graph clustering) for a complete description of a contigency table.

Score comparaison

The table summarizes the kind of values that should be obtained for the metrics described in the previous section. As the alteration and the randomization procedure are random processes, you should not obtain exactly the same results.

# true ad10 / rm10 ad100 / rm50 random
ncol 125 114 713 361
nrow 220 220 220 220
mean 0.0569 0.0624 0.00998 0.0197
Sn 0.998 0.985 0.418 0.291
PPV 0.884 0.836 0.867 0.459
acc 0.941 0.91 0.642 0.375
acc_g 0.939 0.907 0.602 0.365
Sn_w 0.997 0.992 0.502 0.157
PPV_w 0.621 0.62 0.688 0.244
acc_g_w 0.787 0.785 0.588 0.196
sep_r 0.567 0.507 0.676 0.192
sep_c 0.998 0.979 0.208 0.117
sep 0.752 0.704 0.375 0.15

As expected, the value of the global parameters, the geometric accuracy (row acc_g), the weighted geometric accuracy (row acc_g_w) and the separation (row sep) decrease drastically as the network contain less and less relevant information.

We can observe that the sensitivity is more affected than the PPV and that the complex wise separation (sep_r) is more affected than the cluster wise separation. This is due to the fact that by increasing the noise, MCL increases the number of small sized clusters (ncol) too and, as we saw in previous section, this has an impact on the sensitivity.

Note that with a random graph, we would have a separation of 0.15 but an unweighted geometric accuracy of 0.365 which is far from being 0. The relatively good performances of MCL on the highly altered graph must thus be taken with caution as the gain in performances is only of 23%. This illustrates the interest of using negative controls.


next up previous contents
Next: Path finding Up: Network Analysis Tools (NeAT) Previous: Graph clustering   Contents
RSAT 2009-09-04