RSA-tools - Tutorials - Combining RSAT and NeAT to predict metabolic pathways and their regulation

Prerequisite
Introduction
General goal and specific questions
Browsing annotated pathways with EcoCyc
Two-ends path finding
Operon prediction
Footprint discovery
Extracting a pathway from an operon
Bibliography

Introduction

Complete genome sequences are now available for more than 1,000 bacteria, and this number is expected to grow even faster with the advent of Next Generation Sequencing (NGS) methods. For most of these bacteria, almost nothing is known about metabolism or regulation. They were sequenced because they are of interest for medicine, biotechnology or agriculture, but they have not yet been characterized by geneticists or biochemists, and most likely never will be, simply because there are not enough people and financial resources to perform all the required experiments.

Bioinformatics is thus the most reasonable way to at least try to understand the composition of these genomes, the function of their genes, and the way they interact to allow an organism of interest to survive in its environment. The first annotation step consists of comparing the sequence of each gene product to a database of protein functions (e.g. UniProt), and inferring individual gene functions by sequence similarity.

Gene-wise analysis is, however, restricted to one-by-one predictions, and does not tell us how genes and proteins are integrated into complex processes. Several predictive methods have been developed to address this question in various ways.

Operon prediction. In Bacteria, genes involved in the same process are often grouped in operons (defined here as polycistronic transcription units). Genes belonging to the same operon are generally separated by a very short intergenic space (less than 50bp) or can even overlap, since the stop codon of a gene (TAA, TAG, TGA) can overlap with the start codon (ATG) of the next gene in two ways (for example ATGA or TAATG). Operons can be predicted with reasonable accuracy (~80%) by grouping genes separated by a distance shorter than a given limit (e.g. 55bp), using a simplified approach derived from Salgado et al. (2000) and Moreno-Hagelsieb and Collado-Vides (2002).
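The distance-based grouping described above can be sketched in a few lines of Python. This is only an illustration, not the code of the RSAT tool: the Gene record, the predict_operons function and the gene coordinates are hypothetical, and the requirement that grouped genes lie on the same strand is an assumption added here.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # leftmost coordinate (1-based, inclusive)
    end: int     # rightmost coordinate (inclusive)
    strand: str  # "+" or "-"


def predict_operons(genes, max_distance=55):
    """Group neighbouring genes into putative operons.

    Two consecutive genes are joined when they lie on the same strand
    (assumption of this sketch) and their intergenic distance is shorter
    than max_distance bp. Overlapping genes have a negative distance.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        previous = operons[-1][-1] if operons else None
        if (previous is not None
                and previous.strand == gene.strand
                and gene.start - previous.end - 1 < max_distance):
            operons[-1].append(gene)
        else:
            operons.append([gene])
    return [[g.name for g in operon] for operon in operons]


# Toy coordinates invented for illustration (not real annotations).
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1010, 1900, "+"),  # 9 bp downstream of geneA
    Gene("geneC", 1897, 2500, "+"),  # overlaps geneB by 4 bp
    Gene("geneD", 2900, 3500, "+"),  # 399 bp downstream of geneC
    Gene("geneE", 3520, 4000, "-"),  # close to geneD, opposite strand
]
print(predict_operons(genes))
```

Lowering or raising max_distance shows how sensitive the predicted groups are to the threshold, which is the parameter explored later in the operon prediction section.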

Inference of co-regulation networks and prediction of regulons. Transcription units (single genes or operons) involved in the same function are generally co-regulated at the transcriptional level by a specific transcription factor. For example, in Escherichia coli, the enzymes involved in methionine biosynthesis are regulated by the MetJ repressor, the genes involved in arginine biosynthesis by the ArgR repressor, and so on. This regulation is relatively well conserved across species, up to a given taxonomic level (Gammaproteobacteria). We can predict the cis-regulatory elements of a gene by detecting phylogenetic footprints, i.e. sequence motifs that are conserved in the upstream sequences of its orthologs (Janky and van Helden, 2008). We can further link genes having similar footprints in order to infer a co-regulation network (Brohée et al., 2011), and extract from this network the modules of highly connected genes, which are likely to reveal regulons.

Prediction of metabolic pathways. Starting from a set of enzymes supposed to be involved in the same process (e.g. found in a predicted operon or regulon), we can identify the reactions they catalyze, and apply path finding methods (Croes et al., 2005, 2006; Faust et al., 2009a, 2009b) to find the links between pairs of reactions. More interestingly, we can use this set of reactions as seeds to extract a subgraph from the reaction/compound network, in order to predict the metabolic pathway that can be catalyzed by the products of a given operon or regulon (Faust et al., 200.

General goal and specific questions

The general goal of the tutorial is to learn how to use a series of bioinformatics tools to predict metabolic pathways and their regulation. More specifically, we would like to address the following questions.

Browsing EcoCyc. The first step of the tutorial consists of selecting a pathway of interest and analyzing the corresponding annotation in the E. coli pathway database EcoCyc. This will serve as the reference pathway for the subsequent steps.

Two-ends path finding. Can we recover our pathway of interest by finding the shortest path between its start and end compounds?

Operon prediction. Are the genes of the pathway grouped in one or several operons, or are they scattered over the genome?

Footprint discovery. Using comparative genomics, can we predict the cis-acting elements of the genes coding for the enzymes involved in a pathway of interest? How widely are these elements conserved across the taxonomy? Do we detect similar conserved motifs (phylogenetic footprints) in the promoters of different genes of our pathway of interest?

Pathway extraction. The distance-based method provides a quick and easy way to perform a rough prediction of all the operons of a given genome. Some operons will encode a set of enzymes, supposedly involved in a common metabolic pathway. If an operon contains several enzyme-coding genes, can we predict the pathway catalyzed by these enzymes?

In this tutorial, we will combine the Regulatory Sequence Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/) and Network Analysis Tools (NeAT, http://rsat.ulb.ac.be/neat/) to analyze metabolic pathways and their regulation.

Each student will choose a different metabolic pathway as a case study, and run various bioinformatics tools to analyze this pathway and its regulation.

Browsing annotated pathways with EcoCyc

Protocol

Open a connection to EcoCyc (http://www.ecocyc.org/)
Enter the name of a pathway of interest (for example "lysine biosynthesis") and follow the link to that pathway. You should try to find a pathway with a sufficient number of reaction steps (at least 4).
By clicking on the button More detail, you can obtain a progressively more refined view of the pathway, with details about compound structure etc.
Read the record carefully, paying particular attention to the location of mapped genes and the genetic regulation schematic.
Note: For this tutorial, it is best to choose a pathway that is regulated at the transcriptional level (e.g. methionine biosynthesis), or whose genes are grouped in operons (e.g. histidine biosynthesis).

Two-ends path finding

Protocol

Open a connection to the Network Analysis Tools (NeAT, http://rsat.ulb.ac.be/neat/) and select the tool Metabolic path finding.
Try to find the shortest path between the initial substrate and the end product of your case study pathway. For this first run, leave all options at their default values.
Beware: the first step of the analysis returns the list of compounds that match the queries for the start and end compounds. Make sure you select the right compounds (e.g. L-aspartate and L-lysine for the lysine biosynthesis pathway).
Compare the predicted pathways with the reference pathway (the one annotated in EcoCyc).
Redo the analysis by changing the option Graph type to Reaction network (compounds and reactions)
Reset the option Graph type to RPAIR network (compounds and reactant pairs), and test the impact of the Weighting scheme. What happens when you disable compound weighting?
For each parameter setting, compare the predicted pathway with the annotated one by computing the numbers of TP, FP and FN, as well as the Sensitivity and Positive Predictive Value.
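To get an intuition for the effect of compound weighting before running the tool, the following Python sketch may help. It is not the NeAT implementation: it uses an undirected toy network with invented compound and reaction names, and it assumes, in the spirit of Croes et al. (2006), that each compound is weighted by its degree (number of reactions it takes part in), so that paths through highly connected pool metabolites become expensive. The function names build_graph and lightest_path are hypothetical.

```python
import heapq
from collections import defaultdict


def build_graph(reactions):
    """Build an undirected bipartite graph linking reactions and compounds."""
    graph = defaultdict(set)
    for reaction, compounds in reactions.items():
        for compound in compounds:
            graph[reaction].add(compound)
            graph[compound].add(reaction)
    return graph


def lightest_path(graph, reactions, start, end, weight_compounds=True):
    """Dijkstra search where the cost of a path is the sum of its node weights.

    Reactions always weigh 1. Compounds weigh their degree when
    weight_compounds is True (assumption of this sketch), else 1.
    """
    def node_weight(node):
        if node in reactions or not weight_compounds:
            return 1
        return len(graph[node])

    queue = [(node_weight(start), [start])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == end:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in sorted(graph[node]):
            if neighbour not in visited:
                heapq.heappush(
                    queue, (cost + node_weight(neighbour), path + [neighbour]))
    return None


# Invented toy network: "hub" plays the role of a pool metabolite.
reactions = {
    "R1": ["A", "B", "hub"],
    "R2": ["B", "C"],
    "R3": ["C", "D"],
    "R4": ["hub", "D"],
    "R5": ["hub", "E"],
    "R6": ["hub", "F"],
    "R7": ["hub", "G"],
    "R8": ["hub", "E"],
}
graph = build_graph(reactions)
unweighted = lightest_path(graph, reactions, "A", "D", weight_compounds=False)
weighted = lightest_path(graph, reactions, "A", "D", weight_compounds=True)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In this toy case the unweighted search takes the shortcut through the hub compound, whereas the weighted search follows the longer chain of specific intermediates.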

Tips

To compute the accuracy statistics, you can use the following definitions.

Number of true positives (TP): reactions that are both found by path finding and annotated in the EcoCyc pathway.
Number of false positives (FP): reactions found in the predicted path but not in the annotated path.
Number of false negatives (FN): reactions present in the annotated pathway but not detected in the predicted pathway.

You can then derive the statistics.

Sensitivity: Sn = TP / (TP + FN). Fraction of the annotated reactions that were correctly recovered by path finding.
Positive Predictive Value: PPV = TP / (TP + FP). Fraction of the predicted reactions that were part of the annotated pathway.
Accuracy: Acc_geom = sqrt(Sn * PPV). The geometric mean of sensitivity and positive predictive value.
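As a minimal sketch, the definitions above can be applied to two sets of reaction identifiers. The function name pathway_accuracy and the reaction labels are hypothetical; when a denominator is zero, the sketch arbitrarily returns 0.

```python
from math import sqrt


def pathway_accuracy(predicted, annotated):
    """Compare predicted and annotated reaction sets (TP, FP, FN, Sn, PPV, Acc_geom)."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)  # found and annotated
    fp = len(predicted - annotated)  # found but not annotated
    fn = len(annotated - predicted)  # annotated but not found
    sn = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "Sn": sn, "PPV": ppv, "Acc_geom": sqrt(sn * ppv)}


# Invented reaction labels, for illustration only.
annotated = ["R1", "R2", "R3", "R4"]
predicted = ["R1", "R2", "R5"]
stats = pathway_accuracy(predicted, annotated)
for name, value in stats.items():
    print(name, round(value, 3))
```

Here two of the four annotated reactions are recovered (Sn = 0.5) and two of the three predicted reactions are correct (PPV = 2/3).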

Questions

What is the impact of the following parameters on Sn, PPV and accuracy?
- Network type: RPAIR versus reaction network.
- Weighting scheme: activating or not the compound weighting.
Are there interactions between the parameters? More precisely, does the weighting scheme (unweighted versus weighted compounds) have the same impact on the search in the RPAIR network as in the reaction network?

Operon prediction

Protocol

Open a connection to the Regulatory Sequence Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/), open the toolbox Genomes and genes and select the tool Infer operon.
Enter the names of all the genes involved in your pathway of interest, leave all other parameters at their default values and click GO.

Questions

Compare the predicted operons with the annotations in EcoCyc. How accurate was the simple distance-based method? Can you optimize the distance threshold to recover the annotated operons for your genes?

Footprint discovery

Protocol

Open a connection to the Regulatory Sequence Analysis Tools (RSAT, http://rsat.ulb.ac.be/rsat/).
Open the toolbox Comparative genomics, and select the tool footprint-discovery.
Enter the name of one of the enzyme-coding genes of your pathway of interest. Set the Organism to Escherichia coli K12 and the Taxon option to Enterobacteriales.
Activate the option predict operon leader genes.
Enter your email address and click GO.
Once you have received the task completion email, open the result page and click on the link "Final matrices (transfac format)". Copy the whole content of the file, and open a new connection to RSAT. In the toolbox Matrix tools, open compare-matrices. Paste your matrices and make sure that the right matrix format is selected (transfac). For the option Reference matrices, select RegulonDB. Leave all other parameters unchanged and click GO.

Questions

Did the footprint-discovery program report significant dyads ?
Analyze the feature map: do you see conserved sites? Do they occupy fixed positions or are they spread over the sequences?
Are there known factors that are likely to bind the discovered motifs? How do the discovered matrices compare with the known motifs in RegulonDB? If you found matches, are they consistent with the pathway of interest?
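The dyads mentioned in the first question are spaced pairs of short words, typically two trinucleotides separated by a spacer of fixed width. The Python sketch below only counts dyad occurrences in a set of sequences; it does not reproduce the statistics used by footprint-discovery to decide which dyads are significant. The function name count_dyads, the spacing range of 0 to 20 and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def count_dyads(sequences, word_length=3, min_spacing=0, max_spacing=20):
    """Count every (left word, spacing, right word) triple in the sequences."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for spacing in range(min_spacing, max_spacing + 1):
            span = 2 * word_length + spacing
            for i in range(len(sequence) - span + 1):
                left = sequence[i:i + word_length]
                right = sequence[i + word_length + spacing:i + span]
                counts[(left, spacing, right)] += 1
    return counts


# Invented toy "upstream sequences", each containing CTG, 4 spacer bases, CAG.
sequences = [
    "TTCTGAAAACAGTT",
    "GGCTGCCCCCAGGG",
    "ACTGTGTGCAGA",
]
dyad_counts = count_dyads(sequences)
print(dyad_counts[("CTG", 4, "CAG")])
for dyad, count in dyad_counts.most_common(3):
    print(dyad, count)
```

A dyad that recurs across the promoters of many orthologs, as CTG and CAG separated by 4 bases do in this toy set, is the kind of signal that a proper significance test would then evaluate against background expectations.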

Extracting a pathway from an operon

Questions

Bibliography

Pathway prediction by subgraph extraction from metabolic networks

Faust, K. and van Helden, J. (2012). Predicting Metabolic Pathways by Sub-network Extraction. Methods Mol Biol 804, 107-30. [PMID 22144151]
Faust, K., Croes, D. and van Helden, J. (2011). Prediction of metabolic pathways from genome-scale metabolic networks. Biosystems 105, 109-21. [PMID 21645586] [doi:10.1016/j.biosystems.2011.05.004]
Faust, K., Dupont, P., Callut, J. and van Helden, J. (2010). Pathway discovery in metabolic networks by subgraph extraction. Bioinformatics 26:1211-8. [PMID 20228128] [Open access]

Two-ends path finding

Faust, K., Croes, D. and van Helden, J. (2009b). In response to "Can sugars be produced from fatty acids? A test case for pathway analysis tools". Bioinformatics 2009 Sept 23. [PMID 19776213]
Faust, K., Croes, D. and van Helden, J. (2009a). Metabolic Pathfinding Using RPAIR Annotation. J Mol Biol. [PMID 19281817]
Croes, D., F. Couche, S.J. Wodak, J. van Helden (2006). Inferring Meaningful Pathways in Weighted Metabolic Networks. J. Mol. Biol. 356:222-36. [PMID 16337962].
Croes, D., F. Couche, S.J. Wodak, J. van Helden (2005). Metabolic PathFinding: inferring relevant pathways in biochemical networks. Nucleic Acids Res 33:W326-330. [PMID 15980483].

Analysis of conserved cis-regulatory elements (phylogenetic footprints)

Brohee, S., Janky, R., Abdel-Sater, F., Vanderstocken, G., Andre, B. and van Helden, J. (2011). Unraveling networks of co-regulated genes on the sole basis of genome sequences. Nucleic Acids Res 39, 6340-58. [PMID 21572103] [Open access]
Janky, R. and van Helden, J. (2008). Evaluation of phylogenetic footprint discovery for the prediction of bacterial cis-regulatory elements. BMC Bioinformatics 9:37. doi:10.1186/1471-2105-9-37. [PMID 18215291]. [Open access].

Prediction of operons

Moreno-Hagelsieb, G. & Collado-Vides, J. (2002). A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18 Suppl 1, S329-36.
Salgado, H., Moreno-Hagelsieb, G., Smith, T. F. & Collado-Vides, J. (2000). Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci U S A 97, 6652-7.

Next steps

You can now come back to the tutorial main page and follow the next tutorials.

Last update 15 Jan 2012 - by
