RSA-tools - tutorials

RSA-tools - Tutorials - patser

Contents

Prerequisite
Introduction
Reference
Example of utilization
Interpreting the results
Additional exercises

Prerequisite

This tutorial assumes that you have already read the introduction to Position-Specific Scoring Matrices (PSSM).

Introduction

The program patser scans a sequence with a PSSM and returns the matching positions.
patser was developed by Jerry Hertz; Jacques van Helden wrote the web interface in order to integrate patser into RSAT.
We will use a PSSM in order to predict putative binding sites for the yeast transcription factor Pho4p.

Reference

Although patser has been under development since 1989, the appropriate reference is a more recent article by Jerry Hertz.

Hertz and Stormo, 1999, Bioinformatics, 15:563-577

Example of utilization
Retrieve upstream sequences from -800 to -1 for the following yeast genes (as you have seen in the tutorial on sequence retrieval).
Since we are working with a eukaryote, make sure that the option Prevent overlaps with neighbour genes is inactivated before retrieving the sequence.
PHO5
PHO8
PHO11
PHO12
PHO80
PHO84
PHO86
PHO87
PHO89
Click on the button 'patser' at the bottom of the result page, in order to send the sequences to the patser form.
In the Matrix box, paste the PSSM for Pho4p (this matrix was obtained from SCPD).
A |  3   2   0  12   0   0   0   0   1   3
C |  5   2  12   0  12   0   1   0   2   1
G |  3   7   0   0   0  12   0   7   5   4
T |  1   1   0   0   0   0  11   5   4   4
For the lower threshold estimation:

select the option adjusted information content;
write auto in the box for threshold value.

Leave all other parameters unchanged and click GO.
The program returns one row per matching position. Only a few matches are returned.
Remarks

On the web interface, positions are by default returned as negative coordinates, which indicate the location of the binding site relative to the start codon. This differs from the original program, which always returns positive positions (i.e. counted from the left end of the sequence).
The web interface also returns the matching sequences (in uppercase) together with a few flanking residues (in lowercase).
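The relation between the two coordinate systems can be sketched in a few lines of Python. This is only an illustration of the arithmetic implied by the remark above, not code taken from patser or RSAT; the function name and the assumption that the sequence ends at position -1 (immediately upstream of the start codon) are ours.

```python
def to_start_codon_coordinate(left_position, sequence_length):
    """Convert a 1-based position counted from the left end of an upstream
    sequence into a negative coordinate relative to the start codon.

    Assumption: the last residue of the sequence is position -1, as for the
    -800 to -1 upstream sequences retrieved in this tutorial.
    """
    if not 1 <= left_position <= sequence_length:
        raise ValueError("position lies outside the sequence")
    return left_position - sequence_length - 1


# First and last residues of an 800 bp upstream sequence.
print(to_start_codon_coordinate(1, 800))
print(to_start_codon_coordinate(800, 800))
```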

You can now transfer the results of patser to feature-map, in order to obtain a drawing of the matching positions. In the feature-map form, select the display limits from -800 to 0.
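To make the scanning step more concrete, here is a minimal Python sketch of matrix-based pattern matching with the Pho4p matrix given above. It is not patser's implementation: the log-odds formula, the pseudo-count of 1, the equiprobable residue priors (0.25 each) and the function names are all assumptions chosen for illustration, and only the direct strand is scanned.

```python
import math

# Pho4p count matrix copied from this tutorial (one list per residue).
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def weight_matrix(counts, priors=None, pseudo=1.0):
    """Convert counts into log-odds weights.

    Assumed formula: ln(((n + pseudo * p) / (N + pseudo)) / p), where n is
    the count of a residue in a column, N the column total and p the prior.
    """
    priors = priors or {base: 0.25 for base in counts}
    width = len(next(iter(counts.values())))
    weights = {base: [] for base in counts}
    for j in range(width):
        total = sum(counts[base][j] for base in counts)
        for base in counts:
            freq = (counts[base][j] + pseudo * priors[base]) / (total + pseudo)
            weights[base].append(math.log(freq / priors[base]))
    return weights


def scan(sequence, weights, threshold):
    """Return (1-based position, site, score) for each window of the direct
    strand whose score is greater than or equal to the threshold."""
    width = len(next(iter(weights.values())))
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(base not in weights for base in site):
            continue  # skip windows containing N or other ambiguous letters
        score = sum(weights[base][j] for j, base in enumerate(site))
        if score >= threshold:
            hits.append((start + 1, site, score))
    return hits


# Toy sequence (not a real promoter) containing one strong site.
WEIGHTS = weight_matrix(COUNTS)
for position, site, score in scan("ttttCGCACGTGGGtttt", WEIGHTS, 0.0):
    print(position, site, round(score, 2))
```

Each window of the sequence receives the sum of the weights of its residues, and only the windows scoring at or above the threshold are reported, which is why the choice of the threshold determines how many matches are returned.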
Interpreting the results

In this first trial, the threshold on the score was calculated automatically, because we chose the option adjusted information content. This method takes into account the information content of the matrix and the size of the sequence set, in order to choose a good compromise between sensitivity and specificity.
The matching positions probably contain several false positives. In particular, sites with a relatively small score (e.g. 6) are likely to be false predictions. The whole problem with matrix-based pattern matching is precisely to choose an appropriate threshold.
One approach to select the threshold is to collect a set of experimentally proven binding sites (e.g. from TRANSFAC), and to scan them with the matrix. This will provide some information about the scores assigned by patser for bona fide binding sites. These scores can then be used to select an appropriate threshold for predictions in new sequences.
Notice that it is not always a good idea to take the minimal score of proven binding sites as the lower threshold for patser. Indeed, the literature and databases may also contain errors, so that some of the annotated binding sites are not correct. It is preferable to check the scores assigned to each experimentally proven binding site, and to see whether the collection contains outliers, i.e. some sites with a much lower score than the other ones. These outliers should be ignored for the selection of the threshold.
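The strategy above can be sketched as follows in Python. The tutorial does not say how outliers should be detected, so this sketch assumes a common rule of thumb (a score is an outlier if it lies below the first quartile minus 1.5 times the interquartile range); the function name is ours and the scores in the usage example are invented for illustration, not patser output.

```python
import statistics


def threshold_from_known_sites(scores, k=1.5):
    """Return the lowest score among known sites after discarding low outliers.

    Assumption: a score is a low outlier when it falls below
    Q1 - k * (Q3 - Q1). With fewer than 4 scores, quartiles are not
    meaningful and the plain minimum is returned.
    """
    if len(scores) < 4:
        return min(scores)
    q1, _, q3 = statistics.quantiles(scores, n=4)
    low_fence = q1 - k * (q3 - q1)
    kept = [score for score in scores if score >= low_fence]
    return min(kept)


# Invented scores: five consistent sites and one suspiciously weak one.
example_scores = [9.1, 8.7, 9.5, 8.9, 9.3, 2.0]
print(threshold_from_known_sites(example_scores))
```

An automatic rule of this kind is only a starting point; inspecting the individual scores by eye, as suggested above, remains the safer way to decide which annotated sites to ignore.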

Additional exercises

Collect all known binding sites for Pho4p from SCPD. These are available via the link Regulatory elements and transcriptional factors.
Warning: For some reason, the site sequences in SCPD do not comply with the fasta format.

Normally, the row starting with a ">" character contains the ID and optional comments, and the sequence only starts at the next row.
In SCPD, the sequence is on the same row as the ID. If you copy it as it is, the programs will consider the sequence as a comment.
To avoid this problem, you need to insert a newline character between the sequence name and the sequence itself, before performing the next steps.
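If there are many sites, the newline can be inserted with a small script rather than by hand. The Python sketch below assumes that, on each SCPD header row, the sequence is the last whitespace-separated word and consists only of the letters A, C, G, T or N; this layout is an assumption, so check it against the rows you actually copied. The function name is ours.

```python
import re

# A word is treated as a sequence only if it contains nothing but these letters.
NUCLEOTIDES = re.compile(r"^[ACGTNacgtn]+$")


def split_header_and_sequence(text):
    """Move a sequence found at the end of a '>' row onto its own row.

    Assumption: the sequence is the last whitespace-separated word of the
    header row. Rows that do not match this layout are left unchanged.
    """
    fixed = []
    for line in text.splitlines():
        if line.startswith(">"):
            parts = line.rsplit(None, 1)
            if len(parts) == 2 and NUCLEOTIDES.match(parts[1]):
                fixed.append(parts[0])
                fixed.append(parts[1])
                continue
        fixed.append(line)
    return "\n".join(fixed) + "\n"


# Invented rows in the assumed layout (ID, optional comment, sequence).
print(split_header_and_sequence(">site1 PHO5 CACGTG\n>site2 CACGTT"), end="")
```

A header whose last comment word happens to contain only the letters A, C, G, T or N (for example "tag") would be misread as a sequence, so the result should be checked before it is pasted into the patser form.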

Apply the strategy described above to define an appropriate threshold:

Collect all binding sites for Pho4p. Make sure they are in fasta format.
Open the patser form, and paste the binding site sequences.
Paste the Pho4p matrix in the matrix box.
For the return option, select 1 top value for each sequence.
As lower threshold estimation, set the minimum weight to 0. This is quite a permissive value.
Click GO.
Analyze the result: which scores are assigned to the experimentally proven binding sites for Pho4p?
Come back to the beginning of this tutorial, retrieve the upstream regions for the PHO genes, and scan them with the adapted threshold on the weight.

You can now come back to the tutorial main page and follow the next tutorials.
