Open the toolset *Build control sets* of the RSAT
toolbox, and click **random sequence**. Adapt
the options to get 1,000 sequences of 5,000 bp each.

**Choice of the background model:** select the
organism *Saccharomyces cerevisiae*, check the
option *"DNA sequences calibrated on non-coding upstream
sequences"* with an *oligonucleotide size*
of *6* (this corresponds to a Markov model of
order *m = k-1 = 5*), and click *GO*.
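The relation *m = k-1* can be illustrated with a minimal sketch: in a Markov model of order 5, the probability of each nucleotide depends on the 5 preceding ones, and these transition probabilities can be estimated from hexanucleotide (k = 6) counts. The function name `markov_transitions` and the toy counts below are hypothetical, chosen for illustration only; they are not real yeast frequencies and this is not necessarily how RSAT computes its background models.

```python
from collections import defaultdict


def markov_transitions(oligo_counts):
    """Estimate P(last letter | first k-1 letters) from k-mer counts."""
    # Total count of all k-mers sharing the same (k-1)-letter prefix.
    prefix_totals = defaultdict(float)
    for oligo, count in oligo_counts.items():
        prefix_totals[oligo[:-1]] += count

    # Each k-mer count divided by its prefix total gives one transition probability.
    transitions = defaultdict(dict)
    for oligo, count in oligo_counts.items():
        prefix, suffix = oligo[:-1], oligo[-1]
        transitions[prefix][suffix] = count / prefix_totals[prefix]
    return dict(transitions)


# Toy hexamer counts (made up for illustration).
toy_counts = {"GATACA": 30, "GATACC": 10, "GATACG": 20, "GATACT": 40}
probs = markov_transitions(toy_counts)
print(probs["GATAC"])
```

With oligonucleotides of size 6, each prefix has 5 letters, hence an order of 5.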

After a few seconds, the result page appears. The sequences
are not displayed, to avoid a massive transfer (you don't
really need to transfer 5 Mb before submitting the
resulting sequences to the next analysis steps). If you
want, you can check the sequences by clicking on the link
under *Result files(s)*.

In the result page, click the
button *dna-pattern*.

- Enter
`GATACA` in
the box *Query pattern(s)*.
- For the
*Search strands* option,
select *direct only*, and deactivate the
option *prevent overlapping matches*.
- **Deactivate** the default return
options *match positions* and *sequence
limits*.
- **Activate** the options *match count
table*, *match rank* and *sort*.

#### Justification for the chosen options

- The option
*match count table* will return a
table with one row per sequence, and one column per
pattern (in this case we submitted a single pattern,
but the tool would also allow you to analyze different
patterns in a single run).
- The option
*sort* will sort the sequences by
decreasing occurrences, so you will immediately see the
maximum number of occurrences in the random trial.
- The option
*match rank* will indicate the rank
of each sequence according to the number of pattern
occurrences, and thus allow us to check the number of
sequences having 0, 1, 2, ... occurrences.
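The options discussed above can be mimicked with a short sketch: count the occurrences on the direct strand only, allow overlapping matches, then sort the sequences by decreasing count and attach a rank. The function names are hypothetical, the sequences are toy data, and the simple ordinal rank used here is an assumption (the web tool may handle ties differently).

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern on the direct strand, overlaps allowed."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Restart one position further, so overlapping matches are kept.
        start = sequence.find(pattern, start + 1)
    return count


def match_count_table(sequences, pattern):
    """Return (rank, sequence id, count) rows sorted by decreasing count."""
    rows = [(seq_id, count_overlapping(seq, pattern))
            for seq_id, seq in sequences.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(rank, seq_id, count)
            for rank, (seq_id, count) in enumerate(rows, start=1)]


# Toy sequences (made up for illustration).
toy_sequences = {
    "seq1": "TTGATACATT",
    "seq2": "GATACAGATACA",
    "seq3": "CCCCCCCC",
}
table = match_count_table(toy_sequences, "GATACA")
for row in table:
    print(row)
```

Note that `GATACA` cannot overlap with itself, so the overlap option only matters for self-overlapping patterns such as `AAA`.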

At this stage, you can already count the number of
sequences having 6, 5, 4, ... 0 occurrences of the
pattern by browsing the result table from top to
bottom. We will however use another program to compute
it automatically, and display the result
graphically.

At the bottom of the *dna-pattern* result page,
click the button **Frequency distribution**. Set
the *class interval* to 1, the *Data column* to
3 (this column contains the counts of `GATACA`) and
click *GO*. The result table shows you the number of
sequences containing 0, 1, 2, ... occurrences,
respectively. Read the header to understand the content of
the columns. At the bottom of the frequency table, you have
some statistics, including the mean number of occurrences
per sequence (when I ran this test, I obtained 1.331, but
this number is expected to fluctuate between
trials). Compare this observed mean to the expected number
of occurrences.
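For this comparison, a rough expected value can be computed as the number of possible positions multiplied by the probability of the pattern at one position. The sketch below assumes independent, equiprobable nucleotides, which is a simplification: the sequences above were generated with an order-5 Markov model calibrated on yeast upstream sequences, so the expectation under that model will generally differ. The function name `expected_occurrences` is hypothetical.

```python
def expected_occurrences(seq_length, pattern, residue_probs):
    """Expected count of pattern in one sequence, direct strand only,
    assuming independently drawn nucleotides (Bernoulli model)."""
    # A pattern of length k can start at L - k + 1 positions.
    positions = seq_length - len(pattern) + 1
    pattern_prob = 1.0
    for letter in pattern:
        pattern_prob *= residue_probs[letter]
    return positions * pattern_prob


# Assumption: equiprobable nucleotides, not the yeast-calibrated model.
equiprobable = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
expected = expected_occurrences(5000, "GATACA", equiprobable)
print(round(expected, 3))  # 1.219
```

Replacing `equiprobable` with nucleotide frequencies estimated from the background sequences would give a closer, though still approximate, expectation.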

We will now generate a graphical representation of the
frequency distribution. For this,
click **XYgraph** at the bottom of the frequency
distribution result page. Set the *Data column for X
axis* to 1, leave all other parameters unchanged and
click *GO*.

On the resulting graph,

- the blue curve displays the number
*n* of sequences (ordinate) presenting X
occurrences (0, 1, 2, ...) of `GATACA`;
- the green curve
(*n_cum*) indicates the cumulative distribution,
i.e. the number of sequences containing *at most* X
occurrences;
- the pink curve
(*n_dcum*) indicates the decreasing cumulative
distribution, i.e. the number of sequences
containing *at least* X occurrences.
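The three curves described above can be reproduced from a list of per-sequence counts with a minimal sketch. The names `n`, `n_cum` and `n_dcum` follow the text; the function name `frequency_distribution` and the toy counts are hypothetical, and a class interval of 1 is assumed.

```python
from collections import Counter


def frequency_distribution(counts):
    """Return rows (x, n, n_cum, n_dcum) for x = 0 .. max(counts)."""
    histogram = Counter(counts)
    total = len(counts)
    rows = []
    n_cum = 0
    for x in range(max(counts) + 1):
        n = histogram.get(x, 0)   # sequences with exactly x occurrences
        n_cum += n                # sequences with at most x occurrences
        n_dcum = total - n_cum + n  # sequences with at least x occurrences
        rows.append((x, n, n_cum, n_dcum))
    return rows


# Toy per-sequence occurrence counts (made up for illustration).
toy_counts = [0, 0, 1, 1, 1, 2, 3]
distribution = frequency_distribution(toy_counts)
for row in distribution:
    print(row)
```

The first row always has `n_dcum` equal to the total number of sequences, and the last row has `n_cum` equal to that same total.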

The distribution graph above indicates the absolute
frequencies (i.e. the number of sequences). In order to
display the distributions of relative frequencies, come
back to the XYgraph form, and type *7,8,9* in the
box *Data columns for Y axis*. Optionally, you can
also choose to specify a *log base* of 10 for the Y
axis. This will better highlight the low frequencies
associated with large occurrence numbers.
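As a rough illustration of why the logarithmic axis helps, the sketch below converts absolute frequencies into relative frequencies (each count divided by the total number of sequences) and takes their base-10 logarithm. The function names and the class counts are hypothetical; they are not results from the tool.

```python
import math


def relative_frequencies(class_counts):
    """Divide each absolute frequency by the total number of sequences."""
    total = sum(class_counts)
    return [n / total for n in class_counts]


def log10_or_none(values):
    """Base-10 logarithm; zero values cannot be shown on a log axis."""
    return [math.log10(v) if v > 0 else None for v in values]


# Toy numbers of sequences with 0, 1, 2, 3, 4 occurrences (made up).
toy_n = [500, 300, 150, 40, 10]
freqs = relative_frequencies(toy_n)
log_freqs = log10_or_none(freqs)
for x, (f, lf) in enumerate(zip(freqs, log_freqs)):
    print(x, f, lf)
```

On a linear axis the last classes are barely distinguishable from zero, whereas their logarithms remain well separated.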