RSAT - random-genome-fragments manual



NAME

random-genome-fragments


DESCRIPTION

Select a set of fragments with random positions in a given genome, and return their coordinates and/or sequences. The supported organisms are either installed in RSAT or available from Ensembl. For Ensembl genomes, the program uses the Ensembl API (www.ensembl.org).


AUTHORS

mthomas@biologie.ens.fr


CATEGORY

sequences


USAGE


random-genome-fragments -org organism -l length -r repetitions [-o outputfile] [-v # -rm -lf length_file] [..]


OUTPUT FORMATS

The program outputs a file containing the genomic coordinates or the sequences.


OPTIONS

-v #

Level of verbosity (detail in the warning messages during execution)

-h

Display full help message

-help

Same as -h

Output formats

-o outputfile

If no output file is specified, the standard output is used. This allows the command to be used within a pipe.

-return returned_type

Type of data to return. Supported values: seq | coord
By default, coordinates (coord) are returned. For RSAT organisms, the return type can be set to 'seq' to retrieve sequences, which are returned in fasta format. For Ensembl organisms, use the coordinate file (in ft format) as input to retrieve-ensembl-seq.pl with the options -ftfile YourCoordFile -ftfileformat ft. You can also use the tools of sequence providers (UCSC, Galaxy, Ensembl) to efficiently extract the sequences from the coordinates.

-coord_format coordinates_format

Supported values: ft | bed
Default is ft. To convert to another supported feature format, type the following command: convert-features -h
For very big files, consider using the BED output format, which is the format used by the UCSC database. You can then use the tools of sequence providers (UCSC, Galaxy, Ensembl) to efficiently extract the sequences.
The genomic intervals in this BED file are 0-based, as specified by UCSC: chromosomes thus start at position 0 (not 1). The BED file is compatible with UCSC, Galaxy and Ensembl (on the Ensembl website, the BED file is automatically converted from 0-based to 1-based).
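
The 0-based convention described above can be illustrated with a short Python sketch. It assumes the UCSC BED definition (0-based start, end not included in the interval) and a 1-based, end-inclusive convention on the other side; whether a given ft file follows that 1-based convention is an assumption to check against your data. The function names are hypothetical and are not part of RSAT.

```python
def one_based_to_bed(start, end):
    """Convert a 1-based, end-inclusive interval to BED coordinates.

    BED (UCSC definition) uses a 0-based start and an exclusive end,
    so only the start shifts by one.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start - 1, end


def bed_to_one_based(chrom_start, chrom_end):
    """Convert a BED interval back to 1-based, end-inclusive coordinates."""
    if chrom_start < 0 or chrom_end <= chrom_start:
        raise ValueError("expected 0 <= chrom_start < chrom_end")
    return chrom_start + 1, chrom_end


if __name__ == "__main__":
    # The first 100 bases of a chromosome in both conventions.
    print(one_based_to_bed(1, 100))
    print(bed_to_one_based(0, 100))
```

In both conventions the fragment length is unchanged: end - start + 1 for the 1-based interval, and chrom_end - chrom_start for the BED interval.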

-rm

Use the repeat-masked version of the genome.

Organisms options

-org organism_name

Specifies an organism installed in RSAT. To obtain the list of organisms supported in RSAT, type the following command: supported-organisms

-org_ens ensembl_organism_name

Specifies an organism from the EnsEMBL database. The name must be in lowercase, with underscores between words (e.g. 'homo_sapiens').

-ensemblhost mysql_server_name

Uses a local EnsEMBL server instead of the default one (for advanced users).

Fragments options

-r repetitions

Number of fragments to generate. The program returns a set of r fragments, each of length l.

-l sequence_length

Sequence length of random genomic fragments.
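
To make the roles of -r and -l concrete, here is a minimal Python sketch of drawing r fragments of length l at random positions. It is not the program's actual implementation: this manual does not describe how positions are drawn or whether fragments may overlap. The sketch assumes 1-based inclusive coordinates, allows overlaps, and weights each chromosome by its number of valid start positions. The function name and the chromosome sizes are hypothetical.

```python
import random


def random_fragments(chrom_lengths, length, repetitions, seed=None):
    """Draw `repetitions` fragments of size `length` (assumed scheme).

    chrom_lengths maps a chromosome name to its size in bp.
    Returns (chromosome, start, end) tuples, 1-based and end-inclusive.
    """
    rng = random.Random(seed)
    # Number of valid start positions per chromosome; chromosomes
    # shorter than the requested fragment length are skipped.
    valid_starts = {
        name: size - length + 1
        for name, size in chrom_lengths.items()
        if size >= length
    }
    if not valid_starts:
        raise ValueError("no chromosome is long enough for this length")
    names = list(valid_starts)
    weights = [valid_starts[name] for name in names]
    fragments = []
    for _ in range(repetitions):
        # Pick a chromosome in proportion to its valid start positions,
        # then a start position uniformly on that chromosome.
        chrom = rng.choices(names, weights=weights, k=1)[0]
        start = rng.randint(1, valid_starts[chrom])
        fragments.append((chrom, start, start + length - 1))
    return fragments


if __name__ == "__main__":
    # Hypothetical chromosome sizes, for illustration only.
    sizes = {"chrA": 230000, "chrB": 813000}
    for chrom, start, end in random_fragments(sizes, length=100, repetitions=3):
        print(chrom, start, end)
```

Weighting by the number of valid start positions makes every possible fragment equally likely across the genome, which is one reasonable reading of "random positions".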

-iseq reference_sequences

Generates random fragments with the same lengths as a set of reference sequences. Unlike the -lf option, the sequence lengths are computed automatically from the reference sequences.

-lf length_file

Generates random fragments with the same lengths as a set of reference sequences, read from a length file. The sequence length file can be obtained with the command sequence-lengths

On the website, it is possible to directly use the reference (=template) sequence set. The website automatically uses the program sequence-lengths to generate a length file with two columns:

sequence ID (ignored)
sequence length
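
As an illustration of the two-column layout, the following Python sketch reads such a length file and keeps only the second column. It assumes tab-separated columns and that lines starting with ';' or '#' are comments or headers; both are assumptions about the file, not statements from this manual. The function name is hypothetical.

```python
import io


def read_lengths(handle):
    """Return the sequence lengths found in a two-column length file.

    Column 1 (sequence ID) is ignored, column 2 is the length.
    Assumes tab-separated columns; blank lines and lines starting
    with ';' or '#' are skipped.
    """
    lengths = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip() or line.startswith((";", "#")):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected two columns")
        lengths.append(int(fields[1]))
    return lengths


if __name__ == "__main__":
    # Inline example data standing in for a real length file.
    example = io.StringIO("; comment line\nseq1\t120\nseq2\t45\n")
    print(read_lengths(example))
```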


SEE ALSO

random-genes