We now have access to the complete genome sequence of more than
1,000 Bacteria, and this number is expected to grow even faster
with the advent of Next Generation Sequencing (NGS) methods. For
most of these bacteria, almost nothing is known about their
metabolism or regulation. These bacteria were sequenced because
they are of interest for medicine, biotechnology or agriculture,
but they have not yet been characterized by geneticists or
biochemists, and it is very likely that they never will be,
simply because there are not enough people and financial
resources to perform all the required experiments.
Bioinformatics is thus the most reasonable way to at least try
to understand the composition of these genomes, the function of
their genes, and the way they interact to allow an organism of
interest to survive in its environment. The first annotation
step consists in comparing the sequence of each gene product to
a database of protein functions (e.g. UniProt), and inferring
individual gene function by sequence similarity.
Gene-wise analysis is however restricted to one-by-one
predictions, and does not tell us how genes and proteins are
integrated into complex processes. Several predictive
methods have been developed to address this question.
Operon prediction. In Bacteria, genes involved
in the same process are often grouped in operons
(defined here as poly-cistronic transcription
units). Genes belonging to the same operon are generally
separated by a very short intergenic space (less than
50 bp) or can even overlap, since the stop codon of a gene
(TAA, TAG, TGA) can overlap
with the start codon (ATG) of the next gene in
two ways. Operons can be predicted with reasonable
accuracy (~80%) by grouping genes separated by a distance
shorter than a given limit (e.g. 55 bp), a simplified
approach derived from Salgado et al. (2000) and
Moreno-Hagelsieb and Collado-Vides (2002).
Inference of co-regulation networks and prediction
of regulons. Transcription units (single genes or operons)
involved in the same function are generally co-regulated
at the transcriptional level by a specific
transcription factor. For example, in Escherichia
coli, the enzymes involved in methionine biosynthesis are
regulated by the MetJ repressor, the genes involved in
arginine biosynthesis by the ArgR repressor, and so
on. This regulation is relatively well conserved across
species, up to a given taxonomic level
(Gammaproteobacteria). We can predict the cis-regulatory
elements of a gene by detecting phylogenetic footprints,
i.e. sequence motifs that are conserved in the upstream
sequences of its orthologs (Janky and van Helden,
2008). We can further link genes having similar
footprints, in order to infer a co-regulation network
(Brohée et al., 2011), and extract from this
network the modules of highly connected genes, which are
likely to reveal regulons.
Prediction of metabolic pathways. Starting from
a set of enzymes presumed to be involved in the same
process (e.g. found in a predicted operon or regulon), we
can identify the reactions they can catalyze, and apply
path finding methods (Croes et al., 2005, 2006; Faust et
al., 2009a, 2009b) to find the links between pairs of
reactions. More interestingly, we can use this set of
reactions as seeds to extract a subgraph from the
reaction/compound network, in order to predict the
metabolic pathway that can be catalyzed by the products of
a given operon or regulon (Faust et al., 200.
General goal and specific questions
The general goal of the tutorial is to learn how to use a
series of bioinformatics tools to predict metabolic pathways and
their regulation. More specifically, we would like to address
the following questions.
Browsing EcoCyc. The first step of the tutorial
consists in selecting a pathway of interest, and
analyzing the corresponding annotation in the E. coli pathway
database EcoCyc. This will be considered as the reference
pathway for the subsequent steps.
Two-ends path finding. Can we recover our pathway
of interest by finding the shortest path between its start
and end compounds?
Operon prediction. Are the genes of the pathway
grouped in one or several operons, or are they scattered over
the genome?
Footprint discovery. Using comparative genomics,
can we predict the cis-acting elements of the genes coding
for the enzymes involved in a pathway of interest? How
widely are these elements conserved across the taxonomy? Do
we detect similar conserved motifs (phylogenetic footprints)
in the promoters of different genes of our pathway of
interest?
Pathway extraction. The distance-based method
provides a quick and easy way to perform a rough prediction
of all the operons of a given genome. Some operons will
contain a set of enzymes, presumably involved in a common
metabolic pathway. If an operon contains several
enzyme-coding genes, can we predict the pathway catalyzed by
these enzymes?
Browsing EcoCyc
Enter the name of a pathway of interest (for example
"Lysine biosynthesis") and follow the link to that
pathway. You should try to find a pathway with a sufficient
number of reaction steps (at least 4).
By clicking on the button More detail, you can
obtain a progressively more refined view of the pathway,
with details about compound structure etc.
Read the record carefully, and pay particular attention
to the location of the mapped genes and to the genetic regulation.
Note: for the purposes of this tutorial, it is preferable
to choose a pathway that is regulated at the
transcriptional level (e.g. methionine biosynthesis), or
whose genes are grouped in operons (e.g. histidine
biosynthesis).
Two-ends path finding
Try to find the shortest path between the initial substrate
and the end product of your study case pathway. For this, first
leave all options at their default values.
The first step of the analysis returns the list of compounds
that match the queries for the start and end compounds. Make
sure you select the right compounds (e.g. L-aspartate and
L-lysine for the lysine biosynthesis pathway).
Compare the predicted pathways with the reference pathway
(the one annotated in EcoCyc).
Redo the analysis by changing the option Graph type
to Reaction network (compounds and reactions).
Reset the option Graph type to RPAIR network
(compounds and reactant pairs), and test the impact
of the Weighting scheme. What happens when you
disable compound weighting?
For each parameter setting, compare the predicted
pathway with the annotated one by computing the numbers of
TP, FP and FN, as well as the Sensitivity and Positive
Predictive Value.
To compute the accuracy statistics, you can use the following
definitions.
Number of true positives (TP): reactions that are both
found by path finding and annotated in the EcoCyc
pathway.
Number of false positives (FP): reactions found in the
predicted path but not in the annotated pathway.
Number of false negatives (FN): reactions present in
the annotated pathway but not detected in the predicted
path.
You can then derive the statistics.
Sensitivity: Sn = TP / (TP + FN). Fraction of the
annotated reactions that were correctly recovered by
path finding.
Positive Predictive Value: PPV = TP / (TP + FP).
Fraction of the predicted reactions that are part of
the annotated pathway.
Accuracy: Acc_geom = sqrt(Sn * PPV). The geometric mean
of sensitivity and positive predictive value.
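The statistics defined above can be computed with a few lines of Python. The sketch below is only an illustration: the function name pathway_accuracy and the reaction identifiers (R1, R2, ...) are hypothetical, and each pathway is assumed to be represented as a plain set of reaction identifiers.

```python
from math import sqrt


def pathway_accuracy(predicted, annotated):
    """Compare a predicted and an annotated pathway (sets of reaction IDs)."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # reactions found and annotated
    fp = len(predicted - annotated)   # reactions found but not annotated
    fn = len(annotated - predicted)   # annotated reactions that were missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "Sn": sn, "PPV": ppv, "Acc_geom": sqrt(sn * ppv)}


# Hypothetical reaction identifiers, for illustration only.
annotated = {"R1", "R2", "R3", "R4", "R5"}
predicted = {"R1", "R2", "R3", "R9"}

stats = pathway_accuracy(predicted, annotated)
for name, value in stats.items():
    print(name, round(value, 3))
```

With these made-up sets, three reactions are shared, one is predicted only and two are annotated only, so Sn = 3/5 and PPV = 3/4.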
What is the impact of the following parameters on Sn, PPV
and accuracy?
Network type: RPAIR versus reaction network.
Weighting scheme: activating or not the compound weighting.
Are there interactions between the parameters? More
precisely, does the weighting scheme (unweighted versus
weighted compounds) have the same impact on the search in the
RPAIR network as in the reaction network?
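To build an intuition of why compound weighting can matter, here is a toy sketch in Python. It is not the implementation of the path finding tool used in this tutorial. It assumes a small made-up network in which nodes whose name starts with "R" are reactions and all other nodes are compounds, and it assumes, for illustration, that the weight of a compound is its number of connections (its degree), whereas reactions always have a weight of 1. The function lightest_path and all node names are hypothetical.

```python
import heapq


def lightest_path(edges, start, end, weighted=True):
    """Return (cost, path) of the path with the smallest sum of node weights."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    def weight(node):
        # Toy convention: reactions start with "R", everything else is a compound.
        if weighted and not node.startswith("R"):
            return len(graph[node])   # assumed weight: degree of the compound
        return 1

    queue = [(weight(start), [start])]
    visited = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == end:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour in sorted(graph[node]):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight(neighbour), path + [neighbour]))
    return None


# Made-up network: A -> B -> C -> D is the "true" pathway; H is a highly
# connected side compound that offers a shortcut between R1 and R3.
edges = [("A", "R1"), ("R1", "B"), ("B", "R2"), ("R2", "C"),
         ("C", "R3"), ("R3", "D"),
         ("H", "R1"), ("H", "R3"), ("H", "R4"), ("H", "R5"),
         ("H", "R6"), ("H", "R7")]

unweighted = lightest_path(edges, "A", "D", weighted=False)
weighted = lightest_path(edges, "A", "D", weighted=True)
print("unweighted:", unweighted)
print("weighted:  ", weighted)
```

In this toy network the unweighted search takes the shortcut through the hub compound H, whereas the weighted search avoids it and goes through the intermediate compounds B and C.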
Operon prediction
Open a connection to the Regulatory Sequence Analysis Tools (RSAT),
open the toolbox Genomes and genes and select the
tool Infer operon.
Enter the names of all the genes involved in your pathway
of interest, leave all other parameters at their default
values and click GO.
Compare the predicted operons with the annotations in
EcoCyc. How accurate was the simple distance-based method?
Can you optimize the distance threshold to recover the
annotated operons for your genes?
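The principle of the distance-based method can be sketched in a few lines of Python. This is a simplified illustration, not the code of the RSAT tool: the function infer_operons, the gene names and the coordinates are hypothetical, and the sketch additionally assumes that two genes can only belong to the same operon if they lie on the same strand.

```python
def infer_operons(genes, max_distance=55):
    """Group adjacent genes into putative operons.

    genes: list of (name, start, end, strand) tuples.
    Two consecutive genes are grouped if they are on the same strand and
    their intergenic distance is shorter than max_distance (a negative
    distance means that the genes overlap).
    """
    operons = []
    previous = None
    for gene in sorted(genes, key=lambda g: g[1]):
        name, start, end, strand = gene
        if (previous is not None
                and strand == previous[3]
                and start - previous[2] - 1 < max_distance):
            operons[-1].append(name)   # extend the current operon
        else:
            operons.append([name])     # start a new operon
        previous = gene
    return operons


# Made-up coordinates, for illustration only.
genes = [
    ("geneA", 100, 400, "+"),
    ("geneB", 397, 900, "+"),     # overlaps geneA by 4 bp
    ("geneC", 930, 1500, "+"),    # 29 bp after geneB
    ("geneD", 1700, 2300, "+"),   # 199 bp after geneC
    ("geneE", 2310, 2900, "-"),   # close to geneD, but on the other strand
]

print(infer_operons(genes))
print(infer_operons(genes, max_distance=250))
```

Raising the threshold from 55 to 250 bp merges geneD into the first group, which shows how the choice of the distance limit affects the predicted operons.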
Footprint discovery
Open the toolbox Comparative genomics, and select
the tool footprint-discovery.
Enter the name of one of the enzyme-coding genes of your
pathway of interest. Set the Organism
to Escherichia coli K12 and the Taxon option
Activate the option predict operon leader genes.
Enter your email address and click GO.
Beware, this analysis can take some time (1-2
minutes). While it is running, you can already proceed with the next
section, and come back to the result after having received the
email notification of task completion.
Once you have received the task completion email, open the
result page and click on the link "Final matrices (transfac
format)". Copy the whole content of the file, and open a new
connection to RSAT. In the
toolbox Matrix tools,
open compare-matrices. Paste your matrices, and make sure
that the right matrix format is selected
(transfac). For the option Reference matrices,
select RegulonDB. Leave all other parameters
unchanged and click GO.
Did the footprint-discovery program report significant dyads?
Analyze the feature map: do you see conserved sites? Do
they occupy fixed positions or are they spread over the
Are there known factors that are likely to bind the
discovered motifs? How do the discovered matrices compare
with the known motifs in RegulonDB? If you found
matches, are they consistent with the pathway of interest?
Extracting a pathway from an operon
References
Pathway prediction by subgraph extraction from metabolic networks
Faust, K. and van Helden, J. (2012). Predicting Metabolic Pathways
by Sub-network Extraction. Methods Mol Biol 804, 107-30.
Faust, K., Dupont, P., Callut, J. and van Helden,
J. (2010). Pathway discovery in metabolic networks by subgraph
extraction. Bioinformatics 26:1211-8.
Two-ends path finding
Faust, K., Croes, D. and van Helden, J. (2009b). In response to
"Can sugars be produced from fatty acids? A test case for pathway
analysis tools". Bioinformatics 2009 Sept 23.
Faust, K., Croes, D. and van Helden, J. (2009a). Metabolic Pathfinding Using
RPAIR Annotation. J Mol
Croes, D., Couche, F., Wodak, S.J. and van Helden, J. (2006). Inferring
Meaningful Pathways in Weighted Metabolic Networks. J. Mol.
Biol. 356:222-36. [PMID 16337962].
Croes, D., Couche, F., Wodak, S.J. and van
Helden, J. (2005). Metabolic PathFinding: inferring relevant pathways in
biochemical networks. Nucleic Acids Res 33:
Analysis of conserved cis-regulatory elements (phylogenetic footprints)
Brohée, S., Janky, R., Abdel-Sater, F., Vanderstocken, G., André,
B. and van Helden, J. (2011). Unraveling networks of co-regulated
genes on the sole basis of genome sequences. Nucleic Acids Res 39,
Janky, R. and van Helden, J. (2008). Evaluation of phylogenetic
footprint discovery for the prediction of bacterial cis-regulatory
elements. BMC Bioinformatics 9:37. doi:10.1186/1471-2105-9-37.
[PMID 18215291].
Prediction of operons
Moreno-Hagelsieb, G. & Collado-Vides, J. (2002). A powerful
non-homology method for the prediction of operons in
prokaryotes. Bioinformatics 18 Suppl 1, S329-36.
Salgado, H., Moreno-Hagelsieb, G., Smith, T. F. & Collado-Vides,
J. (2000). Operons in Escherichia coli: genomic analyses and
predictions. Proc Natl Acad Sci U S A 97, 6652-7.