Next: Path finding Up: Network Analysis Tools (NeAT) Previous: Graph clustering Contents

Subsections

Influence of graph alteration and randomization on clustering

Introduction

Although negative controls and method evaluation are crucial points to the experimental biologist, this is far from being the same in bioinformatics where, too often, no negative control is associated to the predictions, so that one cannot estimate the probability of these predictions to biogically valid.

For this reason, in NeAT we developped programs allowing to randomize and to add some specified levels of noise to networks. This allows the user to apply the techniques used to find relevant results on networks where there is less or no signal and thus were no interesting result should emerge.

NeAT programs are able to generate randomized networks according to three methods.

Node degree conservation : this approach consists in shuffling the edges, each node keeping the same number of neighbors as in the original graph.
Node degree distribution conservation : in which the global distribution of the node degree is conserved but each node presents a different degree than in the original graph.
Erdos-Renyi randomization : where edges are distributed between pairs of nodes with equal probability.

Quantitative assessment of a clustering algorithm

Study case

In this demonstration, we will use the approach developped in [2] where we evaluated the performances of different graph clustering algorithms. Graph clustering algorithms allow to retrieve in a graph the groups of nodes that contain more connections between them than with the rest of the nodes of the graph. Clustering algorithms are often used in biology in order to extract coherent groups of nodes from networks (complexes detection (e.g. see [29,19,2,24]), protein families detection [7], co-expressed genes detection in co-expression networks (e.g see [20]), ...). The NeAT web server proposes the MCL (Markov Cluster algorithm) clustering algorithm developped by Stijn van Dongen [33,7]. To follow the command-line tools instructions, you should have MCL installed on your computer (available at http://micans.org/mcl/).

MCL simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. At each iteration, an inflation step is applied to enhance the contrast between regions of strong or weak flow in the graph. The process converges towards a partition of the graph, with a set of high-flow regions (the clusters) separated by boundaries with no flow. The value of the inflation parameter strongly influences the number of clusters. According to [2], the optimal inflation value for clustering protein interaction networks is 1.8.

We will use an artificial interaction network created from the complexes annotated in the MIPS database by creating an edge between all the nodes belonging to the same complex [23]. This network contains 12262 edges between 1095 nodes. We will then use the MCL clustering algorithm on this network, on a little altered network, on a highly altered network and finally on a randomized network.

We will then compare these clusters to the MIPS complexes and estimate how well MCL can retrieve protein complexes from a protein-protein interaction and the influence of the noise on the results.

In this example, we will only use random alteration, i.e., the edges that are removed are randomly chosen. This is done to mimick what happens really in biological experiments where some inter-relationships between the nodes (genes, proteins, metabolites, ...) may not be discovered (false negatives) or are erroneously discovered (false positives). However the alter-graph program also allows to alterate the network with targeted attack on user-selected nodes. In their study, Spirin and Mirny [31] showed the affect of node targeted attacks on clustering results.

Protocol for the web server

Dataset download

Go on the demo dataset web page.http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/ and download the MIPS complex network file (complexes_rm_00_ad_00.tab) and the complexes (mips_complexes.tab).

Network alteration

In the NeATmenu, select the command network alteration.
In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab containing the MIPS complexes network that you just downloaded.
In the edges to add text area, enter 10%.
In the edges to remove text area, enter 10%.
Click on the button GO.
Right click on the resulting file and save it with name complexes_rm_10_ad_10.tab.
Re-do the this alteration procedure using 50% of edges removal and 100% of edges addition. Save the resulting file with name complexes_rm_50_ad_100.tab.

Network randomization

In the NeATmenu, select the command network randomization.
In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab.
Select the Node degree conservation randomization type.
Click on the button GO.
Right click on the resulting file and save with name complexes_rm_00_ad_00_random.tab.

Networks clustering and clustering assessment

In the NeATmenu, select the command graph-based clustering MCL.
In the Upload graph from file text area, load the file complexes_rm_00_ad_00.tab.
Click on the button GO. You should now obtain a link to the clustering results and the distribution of the sizes of the different clusters.
In the Next step pannel, click on the button Compare these clusters to other clusters.
In the Upload reference classes from file text area, load the mips_complexes.tab file.
Choose the matrix file output format
Click on the button GO. You now obtain a contingency table, i.e, a table with rows and columns ( being the number of MIPS complexes and , the number of clusters). Each cell contains the number of protein common to one complex and one cluster.
To calculate some statistics on this contingency table, click on the contingency-table statistics button in the Next step pannel.
The contingency-stats form appears. As the contingency table is already uploaded, just lick on the GO button.
Save the resulting file under name contigency_stats_rm_00_ad_00.tab

Repeat these steps for complexes_rm_10_ad_10.tab, complexes_rm_50_ad_100.tab and complexes_rm_00_ad_00_random.tab and save the resulting files under the names contigency_stats_ad_10_rm_10.tab, contigency_stats_ad_50_rm_100.tab, contigency_stats_ad_00_rm_00_random.tab, respectively.

Protocol for the command-line tools

If you have installed a stand-alone version of the NeAT distribution, you can use the programs random-graph and alter-graph on the command-line. This requires to be familiar with the Unix shell interface. If you don't have the stand-alone tools, you can skip this section and read the next section (Interpretation of the results).

We will now describe the use of random-graph, alter-graph, compare-classes and contingency-stats as command line tools. For this tutorial, you need to have the MCL program installed.

Start by going on the demo dataset download web page.http://rsat.scmbb.ulb.ac.be/rsat/data/neat_tuto_data/ and downloading the MIPS complex network file (complexes_rm_00_ad_00.tab) and the complexes (mips_complexes.tab).

Network alteration

Go in the directory where you downloaded the file.
Use the following commands to alter the graph. Note that MCL is not an RSAT / NeAT program and thus cannot treat RSAT comments lines (starting with ``#'' or with ``;''). We thus have to suppress them in the command.
```
			alter-graph	-v 1 -i complexes_rm_00_ad_00.tab \
					-rm_edges 10% -add_edges 10% \
					| cut -f 1,2 | grep -v ';' > complexes_rm_10_ad_10.tab
```
Re-use this command, but modify the percentage of removed (-rm_edges 50%) and added edges (-add_edges 100%). Save the resulting file with name complexes_rm_50_ad_100.tab.

Network randomization

Use the following commands to randomize the graph by shuffling the edges. The node degrees will be conserved.

		random-graph 	-v 1 -i complexes_rm_00_ad_00.tab \
				-random_type node_degree \
				| cut -f 1,2 | grep -v ';' > complexes_rm_00_ad_00_random.tab

Networks clustering and clustering assessment

Use the following commands to apply MCL on the network

  		mcl 	complexes_rm_00_ad_00.tab \
  			--abc -I 1.8 -o complexes_rm_00_ad_00_clusters.mcl

Convert the cluster file obtained with MCL with the program convert-classes into a file that is readable by NeAT / RSAT (two column cluster file).
```
  		convert-classes	-i complexes_rm_00_ad_00_clusters.mcl 
  				-from mcl -to tab -o complexes_rm_00_ad_00_clusters.tab
```

Compare the obtained clusters to the MIPS complexes with the program compare-classes

  		compare-classes	-q complexes_rm_00_ad_00_clusters.tab \
  				-r mips_complexes.tab \
  				-matrix QR \
  				-o complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab

Study the obtained matrix with the contingency-stats program

		contingency-stats -i complexes_rm_00_ad_00_clusters_cc_complexes_matrix.tab \
				  -o contigency_stats_ad_00_rm_00.tab

Interpretation of the results

We will now compare the performances of MCL when applied to networks containing an increasing proportion of noise or no signal at all.

Files description

Randomized network

As the real MIPS complexes network, this randomized network contains 12262 edges between 1095 nodes. With our parameter choice, no edge should be duplicated. However, as in random-graph the iterative process designed to avoid duplicated edges may not be totally efficient, some duplicated edges may subsist in the randomized network.

Altered networks

This file is a classical NeAT tab-delimited edge list. However, there is a fifth column that indicates whether the edge comes from the original graph (original) or was added randomly (random).

As the MIPS complex newtwork, the network with 10% of added and removed edges contains 12262 edges between 1095 nodes, which is logical as we removed and added the same number of edge (in this case 1226).
The network with 100% of added edges (+ 12262 edges) and 50% of removed edges (-6131 edges) contains 18393 edges between 1095 nodes. This graph contains thus more noisy than relevant edges.

Contingency table

See the previous chapter (Graph clustering) for a complete description of a contigency table.

Contingency table metrics

A list of metrics and their value. These will be described in the next section.

Metrics description

Sensitivity, Positive predictive value and geometric accuracy

See the previous chapter (Graph clustering) for a complete description of a contigency table.

Score comparaison

The table summarizes the kind of values that should be obtained for the metrics described in the previous section. As the alteration and the randomization procedure are random processes, you should not obtain exactly the same results.

# true ad10 / rm10 ad100 / rm50 random

ncol 125 114 713 361

nrow 220 220 220 220

mean 0.0569 0.0624 0.00998 0.0197

Sn 0.998 0.985 0.418 0.291

PPV 0.884 0.836 0.867 0.459

acc 0.941 0.91 0.642 0.375

acc_g 0.939 0.907 0.602 0.365

Sn_w 0.997 0.992 0.502 0.157

PPV_w 0.621 0.62 0.688 0.244

acc_g_w 0.787 0.785 0.588 0.196

sep_r 0.567 0.507 0.676 0.192

sep_c 0.998 0.979 0.208 0.117

sep 0.752 0.704 0.375 0.15

#	true	ad10 / rm10	ad100 / rm50	random
ncol	125	114	713	361
nrow	220	220	220	220
mean	0.0569	0.0624	0.00998	0.0197
Sn	0.998	0.985	0.418	0.291
PPV	0.884	0.836	0.867	0.459
acc	0.941	0.91	0.642	0.375
acc_g	0.939	0.907	0.602	0.365
Sn_w	0.997	0.992	0.502	0.157
PPV_w	0.621	0.62	0.688	0.244
acc_g_w	0.787	0.785	0.588	0.196
sep_r	0.567	0.507	0.676	0.192
sep_c	0.998	0.979	0.208	0.117
sep	0.752	0.704	0.375	0.15

As expected, the value of the global parameters, the geometric accuracy (row acc_g), the weighted geometric accuracy (row acc_g_w) and the separation (row sep) decrease drastically as the network contain less and less relevant information.

We can observe that the sensitivity is more affected than the PPV and that the complex wise separation (sep_r) is more affected than the cluster wise separation. This is due to the fact that by increasing the noise, MCL increases the number of small sized clusters (ncol) too and, as we saw in previous section, this has an impact on the sensitivity.

Note that with a random graph, we would have a separation of 0.15 but an unweighted geometric accuracy of 0.365 which is far from being 0. The relatively good performances of MCL on the highly altered graph must thus be taken with caution as the gain in performances is only of 23%. This illustrates the interest of using negative controls.

Next: Path finding Up: Network Analysis Tools (NeAT) Previous: Graph clustering Contents

RSAT 2009-09-04