Workflow: Generate genome indices for STAR & bowtie
Creates indices for:

* [STAR](https://github.com/alexdobin/STAR) v2.5.3a (03/17/2017), PMID: [23104886](https://www.ncbi.nlm.nih.gov/pubmed/23104886)
* [bowtie](http://bowtie-bio.sourceforge.net/tutorial.shtml) v1.2.0 (12/30/2016)

It performs the following steps:

1. `STAR --runMode genomeGenerate` generates indices from [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) and [GTF](http://mblab.wustl.edu/GTF2.html) input files and returns the results as an array of files
2. Outputs indices as [Directory](http://www.commonwl.org/v1.0/CommandLineTool.html#Directory) data type
3. Separates the *chrNameLength.txt* file from the Directory output
4. `bowtie-build` generates indices from the genome [FASTA](http://zhanglab.ccmb.med.umich.edu/FASTA/) file and returns the results as a group of main and secondary files
Inputs
ID | Type | Title | Doc |
---|---|---|---|
genome | String | Genome type | Genome type, such as mm10, hg19, hg38, etc. |
threads | Integer (Optional) | Number of threads to run tools | Number of threads for those steps that support multithreading |
cytoband | File [TSV] | Compressed cytoBand file for IGV browser | Compressed tab-separated cytoBand file for IGV browser |
genome_file | File [2bit] | Reference genome file (*.2bit, *.fasta, *.fa, *.fa.gz, *.fasta.gz) | Reference genome file (*.2bit, *.fasta, *.fa, *.fa.gz, *.fasta.gz). All chromosomes are included |
genome_label | String (Optional) | Genome label | |
annotation_tab | File [TSV] | Compressed tsv.gz annotation file | Compressed tab-separated annotation file. Doesn't include chrM |
genome_details | String (Optional) | Genome details | |
chromosome_list | String[] (Optional) | Chromosome list to be included into the reference genome FASTA file | Filter chromosomes while extracting FASTA from 2bit |
fasta_ribosomal | File (Optional) [FASTA] | Ribosomal DNA file (*.fasta, *.fa) | Ribosomal DNA file (*.fasta, *.fa). Default: hg19 |
genome_description | String (Optional) | Genome description | |
genome_sa_sparse_d | Integer (Optional) | Suffix array sparsity for reference genome and mitochondrial DNA indices | Suffix array sparsity, i.e. distance between indices: use bigger numbers to decrease needed RAM at the cost of mapping speed reduction |
effective_genome_size | String | Effective genome size | MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9 |
genome_chr_bin_n_bits | Integer (Optional) | Number of bins allocated for each chromosome of reference genome | If you are using a genome with a large (>5,000) number of references (chromosomes/scaffolds), you may need to reduce `--genomeChrBinNbits` to reduce RAM consumption. For a genome with a large number of contigs, it is recommended to scale this parameter as min(18, log2[max(GenomeLength/NumberOfReferences,ReadLength)]). Default: 18 |
genome_sa_index_n_bases | Integer (Optional) | Length of SA pre-indexing string for reference genome indices | Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter `--genomeSAindexNbases` must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome it is equal to 7. Default: 14 |
limit_genome_generate_ram | Long (Optional) | Limit maximum available RAM (bytes) for reference genome indices generation | Maximum available RAM (bytes) for genome generation. Default: 31000000000 |
mitochondrial_annotation_tab | File [TSV] | Compressed tsv.gz mitochondrial DNA annotation file | Compressed mitochondrial DNA tab-separated annotation file. Includes only chrM |
genome_sa_index_n_bases_mitochondrial | Integer (Optional) | Length of SA pre-indexing string for mitochondrial DNA indices | Length (bases) of the SA pre-indexing string. Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter `--genomeSAindexNbases` must be scaled down to min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome it is equal to 7. Default: 14 |
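
The two scaling rules quoted in the input descriptions can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the rounding behaviour is an assumption (nearest integer for the SA pre-indexing length, because that reproduces the documented values of 9 and 7, and floor for the chromosome bin bits, which the descriptions do not specify).

```python
import math


def sa_index_n_bases(genome_length, cap=14):
    """Scale --genomeSAindexNbases as min(cap, log2(GenomeLength)/2 - 1)."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    # Assumption: round to the nearest integer. This matches the documented
    # examples (1 megabase -> 9, 100 kilobases -> 7).
    scaled = round(math.log2(genome_length) / 2 - 1)
    return min(cap, scaled)


def chr_bin_n_bits(genome_length, n_references, read_length, cap=18):
    """Scale --genomeChrBinNbits as
    min(cap, log2(max(GenomeLength/NumberOfReferences, ReadLength)))."""
    if genome_length < 1 or n_references < 1 or read_length < 1:
        raise ValueError("all arguments must be positive")
    per_reference = max(genome_length / n_references, read_length)
    # Assumption: round down, since the parameter is an integer number of bits.
    return min(cap, math.floor(math.log2(per_reference)))
```

For instance, `sa_index_n_bases(1_000_000)` evaluates to 9, in line with the 1 megabase example above, and any human-sized genome is capped at the default of 14.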
Steps
ID | Runs | Label | Doc |
---|---|---|---|
index_fasta | ../tools/samtools-faidx.cwl (CommandLineTool) | Generates FAI index file for the input FASTA file. The output file has the same basename as the input file, but with an updated `.fai` extension. `samtools faidx` writes the output file alongside the input file. To prevent the tool from failing, `input_file` should be staged into the output directory using `"writable": true`. Setting `writable: true` makes cwl-runner copy the input file and mount it into the docker container in `rw` mode as part of `--workdir` (if set to false, the file staged into the output directory is mounted into the docker container separately in `ro` mode) | |
extract_fasta | ../tools/ucsc-twobit-to-fa.cwl (CommandLineTool) | twoBitToFa - Convert all or part of .2bit file to fasta. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done | |
extract_cytoband | genome-indices.cwl#extract_cytoband/0897e575-8a23-4a6c-8928-1c29451a1d59 (CommandLineTool) | | |
prepare_annotation | genome-indices.cwl#prepare_annotation/881c8fff-6b2f-468f-b297-f36ab9f71602 (CommandLineTool) | | |
sort_annotation_bed | ../tools/linux-sort.cwl (CommandLineTool) | Tool sorts data from `unsorted_file` by key | |
star_generate_indices | ../tools/star-genomegenerate.cwl (CommandLineTool) | Tool returns directory with indices generated by STAR. If the genome_dir input is not provided, the default output directory name star_indices is used. Output chr_name_length should not be moved outside the indices folder | |
bowtie_generate_indices | ../tools/bowtie-build.cwl (CommandLineTool) | Tool runs bowtie-build. Unsupported parameters: `-c` (reference sequences given on cmd line as `<seq_in>`) | |
annotation_bed_to_bigbed | ../tools/ucsc-bedtobigbed.cwl (CommandLineTool) | Tool converts bed file to bigBed | |
convert_annotation_to_bed | genome-indices.cwl#convert_annotation_to_bed/d890ffab-ac59-41f7-982e-eb92029f3673 (CommandLineTool) | | |
ribosomal_generate_indices | ../tools/bowtie-build.cwl (CommandLineTool) | Tool runs bowtie-build. Unsupported parameters: `-c` (reference sequences given on cmd line as `<seq_in>`) | |
extract_mitochondrial_fasta | ../tools/ucsc-twobit-to-fa.cwl (CommandLineTool) | twoBitToFa - Convert all or part of .2bit file to fasta. Outputs only those chromosomes that are set in the chr_list input. The tool will fail if chr_list includes chromosomes that are absent from the 2bit file. If a gz file is provided, gunzip is used instead of twoBitToFa. If a FASTA file is provided, nothing is done | |
mitochondrial_generate_indices | ../tools/star-genomegenerate.cwl (CommandLineTool) | Tool returns directory with indices generated by STAR. If the genome_dir input is not provided, the default output directory name star_indices is used. Output chr_name_length should not be moved outside the indices folder | |
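
To make the STAR index-generation steps more concrete, the sketch below assembles the kind of command line that `star-genomegenerate.cwl` is expected to run. The helper name, its argument names, and the way workflow inputs map onto STAR flags are assumptions for illustration, not taken from the tool definition; the flags themselves are standard STAR options, and the defaults follow the input descriptions where those state one.

```python
def star_genome_generate_cmd(genome_dir, fasta, gtf=None, threads=1,
                             sa_index_n_bases=14, chr_bin_n_bits=18,
                             sa_sparse_d=1, limit_ram=31000000000):
    """Build an argument list for `STAR --runMode genomeGenerate`.

    Hypothetical helper: the real CWL tool may order or name things
    differently. sa_sparse_d=1 is STAR's own default and is an assumption
    here, since the workflow documentation does not state it.
    """
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--runThreadN", str(threads),
        "--genomeSAindexNbases", str(sa_index_n_bases),
        "--genomeChrBinNbits", str(chr_bin_n_bits),
        "--genomeSAsparseD", str(sa_sparse_d),
        "--limitGenomeGenerateRAM", str(limit_ram),
    ]
    # The annotation is optional: an index built without GTF input simply
    # omits the flag.
    if gtf is not None:
        cmd += ["--sjdbGTFfile", gtf]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Using `star_indices` as `genome_dir` mirrors the default output directory name mentioned in the step description.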
Outputs
ID | Type | Label | Doc |
---|---|---|---|
annotation | File [TSV] | TSV annotation file | Tab-separated annotation file. Includes reference genome and mitochondrial DNA annotations |
genome_size | String | Effective genome size | MACS2 effective genome sizes: hs, mm, ce, dm or number, for example 2.7e9 |
chrom_length | File [Textual format] | Genome chromosome length file | Genome chromosome length file |
fasta_output | File [FASTA] | Reference genome FASTA file | Reference genome FASTA file. Includes only selected chromosomes |
star_indices | Directory | STAR genome indices | STAR generated genome indices folder |
annotation_bed | File [BED] | Sorted BED annotation file | Sorted BED annotation file |
annotation_gtf | File [GTF] | GTF annotation file | GTF annotation file. Includes reference genome and mitochondrial DNA annotations |
bowtie_indices | Directory | Bowtie genome indices | Bowtie generated genome indices folder |
cytoband_output | File [TSV] | CytoBand file for IGV browser | Tab-separated cytoBand file for IGV browser |
fasta_fai_output | File [TSV] | FAI index for genome FASTA file | Tab-separated FAI index file |
ribosomal_indices | Directory | Bowtie ribosomal DNA indices | Bowtie generated ribosomal DNA indices folder |
annotation_bed_tbi | File [bigBed] | Sorted bigBed annotation file | Sorted bigBed annotation file |
mitochondrial_indices | Directory | STAR mitochondrial DNA indices | STAR generated mitochondrial DNA indices folder |
star_indices_stderr_log | File | STAR stderr log for genome indices | STAR generated stderr log for genome indices |
star_indices_stdout_log | File | STAR stdout log for genome indices | STAR generated stdout log for genome indices |
bowtie_indices_stderr_log | File | Bowtie stderr log for genome indices | Bowtie generated stderr log for genome indices |
bowtie_indices_stdout_log | File | Bowtie stdout log for genome indices | Bowtie generated stdout log for genome indices |
ribosomal_indices_stderr_log | File | Bowtie stderr log for ribosomal DNA indices | Bowtie generated stderr log for ribosomal DNA indices |
ribosomal_indices_stdout_log | File | Bowtie stdout log for ribosomal DNA indices | Bowtie generated stdout log for ribosomal DNA indices |
mitochondrial_indices_stderr_log | File | STAR stderr log for mitochondrial DNA indices | STAR generated stderr log for mitochondrial DNA indices |
mitochondrial_indices_stdout_log | File | STAR stdout log for mitochondrial DNA indices | STAR generated stdout log for mitochondrial DNA indices |
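
The `chrom_length` output is the *chrNameLength.txt* file that the workflow separates from the STAR index directory. A minimal sketch for reading it follows, assuming the usual two-column, tab-separated layout (chromosome name, then length in bases). The function name and the sample values in the usage note are made up for illustration.

```python
def parse_chrom_lengths(text):
    """Parse chrNameLength.txt content into a {chromosome: length} dict.

    Assumes one "name<TAB>length" record per line; blank lines are skipped.
    """
    lengths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        lengths[fields[0]] = int(fields[1])
    return lengths
```

With the illustrative input `"chr1\t1000\nchr2\t500\n"`, the function returns `{"chr1": 1000, "chr2": 500}`. Summing the values gives the total genome length, which can serve as GenomeLength in the scaling formulas quoted in the input descriptions.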
https://w3id.org/cwl/view/git/a8eaf61c809d76f55780b14f2febeb363cf6373f/workflows/genome-indices.cwl