Workflow: 03-map-pe-umis.cwl

Fetched 2023-01-14 23:02:45 GMT

STARR-seq 03 mapping - reads: PE


Inputs

ID | Type | Title | Doc
nthreads | Integer | |
fgbio_jar_path | String | | fgbio Java jar file
picard_jar_path | String | | Picard Java jar file
picard_java_opts | String (Optional) | | JVM arguments should be a quoted, space-separated list (e.g. "-Xms128m -Xmx512m")
regions_bed_file | File | | Regions BED file used to filter in reads (used in samtools)
genome_sizes_file | File | | Genome sizes tab-delimited file (used in samtools)
input_fastq_umi_files | File[] | | Input FASTQ files containing the UMIs
input_fastq_read1_files | File[] | | Input FASTQ paired-end read 1 files
input_fastq_read2_files | File[] | | Input FASTQ paired-end read 2 files
ENCODE_blacklist_bedfile | File | | BED file containing ENCODE consensus blacklist regions to be excluded.
genome_ref_first_index_file | File | | First Bowtie 2 index file for the reference genome (e.g. *1.bt2). The rest of the index files should be in the same folder.
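
A job file supplying the inputs above could be assembled along these lines. This is a minimal sketch: the input IDs come from the table, but every path, the thread count, the file-name suffixes, and the helper names `cwl_file` and `build_job` are hypothetical placeholders. CWL runners accept job inputs as JSON or YAML, with files given as objects of class `File`.

```python
import json


def cwl_file(path):
    """Return a CWL File object for a path (the path is a placeholder)."""
    return {"class": "File", "path": path}


def build_job(sample_prefixes, nthreads=4):
    """Assemble a job-input object keyed by the workflow's input IDs.

    All paths and suffixes below are hypothetical, not values from the workflow.
    """
    return {
        "nthreads": nthreads,
        "fgbio_jar_path": "/opt/fgbio/fgbio.jar",
        "picard_jar_path": "/opt/picard/picard.jar",
        "picard_java_opts": "-Xms128m -Xmx512m",  # optional input
        "regions_bed_file": cwl_file("ref/regions.bed"),
        "genome_sizes_file": cwl_file("ref/genome.sizes"),
        "ENCODE_blacklist_bedfile": cwl_file("ref/blacklist.bed"),
        "genome_ref_first_index_file": cwl_file("ref/genome.1.bt2"),
        # The three FASTQ arrays are parallel: one entry per sample in each.
        "input_fastq_umi_files": [cwl_file(p + ".umi.fastq.gz") for p in sample_prefixes],
        "input_fastq_read1_files": [cwl_file(p + ".R1.fastq.gz") for p in sample_prefixes],
        "input_fastq_read2_files": [cwl_file(p + ".R2.fastq.gz") for p in sample_prefixes],
    }


def job_as_json(sample_prefixes):
    """Serialize the job object so it can be written to a job file."""
    return json.dumps(build_job(sample_prefixes), indent=2)
```

The serialized text can be saved to a file and passed to a CWL runner together with the workflow document.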

Steps

ID Runs Label Doc
bowtie2
../map/bowtie2.cwl (CommandLineTool)

Bowtie 2 version 2.2.8 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)

Usage: bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2). NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
  <m1>       Files with #1 mates, paired with files in <m2>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads. Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times. E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  --qseq             query input files are in Illumina's qseq format
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 <m1>, <m2>, <r> are sequences themselves, not files
  -s/--skip <int>    skip the first <int> reads/pairs in the input (none)
  -u/--upto <int>    stop after first <int> reads/pairs (no limit)
  -5/--trim5 <int>   trim <int> bases from 5'/left end of reads (0)
  -3/--trim3 <int>   trim <int> bases from 3'/right end of reads (0)
  --phred33          qualities are Phred+33 (default)
  --phred64          qualities are Phred+64
  --int-quals        qualities encoded as space-delimited integers

Presets:                  Same as:
  For --end-to-end:
    --very-fast             -D 5 -R 1 -N 0 -L 22 -i S,0,2.50
    --fast                  -D 10 -R 2 -N 0 -L 22 -i S,0,2.50
    --sensitive             -D 15 -R 2 -N 0 -L 22 -i S,1,1.15 (default)
    --very-sensitive        -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

  For --local:
    --very-fast-local       -D 5 -R 1 -N 0 -L 25 -i S,1,2.00
    --fast-local            -D 10 -R 2 -N 0 -L 22 -i S,1,1.75
    --sensitive-local       -D 15 -R 2 -N 0 -L 20 -i S,1,0.75 (default)
    --very-sensitive-local  -D 20 -R 3 -N 0 -L 20 -i S,1,0.50

Alignment:
  -N <int>           max # mismatches in seed alignment; can be 0 or 1 (0)
  -L <int>           length of seed substrings; must be >3, <32 (22)
  -i <func>          interval between seed substrings w/r/t read len (S,1,1.15)
  --n-ceil <func>    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
  --dpad <int>       include <int> extra ref chars on sides of DP table (15)
  --gbar <int>       disallow gaps within <int> nucs of read extremes (4)
  --ignore-quals     treat all quality values as 30 on Phred scale (off)
  --nofw             do not align forward (original) version of read (off)
  --norc             do not align reverse-complement version of read (off)
  --no-1mm-upfront   do not allow 1 mismatch alignments before attempting to scan for the optimal seeded alignments
  --end-to-end       entire read must align; no clipping (on)
   OR
  --local            local alignment; ends might be soft clipped (off)

Scoring:
  --ma <int>         match bonus (0 for --end-to-end, 2 for --local)
  --mp <int>         max penalty for mismatch; lower qual = lower penalty (6)
  --np <int>         penalty for non-A/C/G/Ts in read/ref (1)
  --rdg <int>,<int>  read gap open, extend penalties (5,3)
  --rfg <int>,<int>  reference gap open, extend penalties (5,3)
  --score-min <func> min acceptable alignment score w/r/t read length (G,20,8 for local, L,-0.6,-0.6 for end-to-end)

Reporting:
  (default)          look for multiple alignments, report best, with MAPQ
   OR
  -k <int>           report up to <int> alns per read; MAPQ not meaningful
   OR
  -a/--all           report all alignments; very slow, MAPQ not meaningful

Effort:
  -D <int>           give up extending after <int> failed extends in a row (15)
  -R <int>           for reads w/ repetitive seeds, try <int> sets of seeds (2)

Paired-end:
  -I/--minins <int>  minimum fragment length (0)
  -X/--maxins <int>  maximum fragment length (500)
  --fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
  --no-mixed         suppress unpaired alignments for paired reads
  --no-discordant    suppress discordant alignments for paired reads
  --no-dovetail      not concordant when mates extend past each other
  --no-contain       not concordant when one mate alignment contains other
  --no-overlap       not concordant when mates overlap at all

Output:
  -t/--time          print wall-clock time taken by search phases
  --un <path>        write unpaired reads that didn't align to <path>
  --al <path>        write unpaired reads that aligned at least once to <path>
  --un-conc <path>   write pairs that didn't align concordantly to <path>
  --al-conc <path>   write pairs that aligned concordantly at least once to <path>
  (Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g. --ungz <path>, to gzip compress output, or add '-bz2' to bzip2 compress output.)
  --quiet            print nothing to stderr except serious errors
  --met-file <path>  send metrics to file at <path> (off)
  --met-stderr       send metrics to stderr (off)
  --met <int>        report internal counters & metrics every <int> secs (1)
  --no-unal          suppress SAM records for unaligned reads
  --no-head          suppress header lines, i.e. lines starting with @
  --no-sq            suppress @SQ header lines
  --rg-id <text>     set read group id, reflected in @RG line and RG:Z: opt field
  --rg <text>        add <text> ("lab:value") to @RG line of SAM header. Note: @RG line only printed when --rg-id is set.
  --omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.

Performance:
  -p/--threads <int> number of alignment threads to launch (1)
  --reorder          force SAM output order to match order of input reads
  --mm               use memory-mapped I/O for index; many 'bowtie's can share

Other:
  --qc-filter        filter out reads that are bad according to QSEQ filter
  --seed <int>       seed for random number generator (0)
  --non-deterministic seed rand. gen. arbitrarily instead of using read attributes
  --version          print version information and quit
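
The usage line and the `<bt2-idx>` note above imply that the index prefix can be derived from the first index file by dropping its trailing `.1.bt2`. The sketch below shows that derivation and a paired-end argument list built only from flags documented in the help text. It is an illustration under assumptions: the function names are hypothetical, large indexes with a different suffix are not handled, and the arguments the workflow actually passes to bowtie2 are not shown on this page.

```python
def bt2_index_prefix(first_index_path):
    """Derive the -x prefix by removing the trailing '.1.bt2' suffix."""
    suffix = ".1.bt2"
    if not first_index_path.endswith(suffix):
        raise ValueError("expected a path ending in " + suffix)
    return first_index_path[: -len(suffix)]


def bowtie2_pe_command(first_index_path, read1, read2, sam_out,
                       threads=1, max_insert=500):
    """Build an argument list for a paired-end bowtie2 run.

    Defaults for threads and max_insert mirror the defaults in the help text.
    """
    return [
        "bowtie2",
        "-p", str(threads),        # number of alignment threads
        "-X", str(max_insert),     # maximum fragment length
        "-x", bt2_index_prefix(first_index_path),
        "-1", read1,               # files with #1 mates
        "-2", read2,               # files with #2 mates
        "-S", sam_out,             # SAM output file
    ]
```

The returned list can be handed to `subprocess.run` without a shell, which avoids quoting problems with unusual file names.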

basename
../utils/basename.cwl (ExpressionTool)
sort_bams
../map/samtools-sort.cwl (CommandLineTool)
index_bams
../map/samtools-index.cwl (CommandLineTool)
bam_to_bepe
../map/bedtools-bamtobed.cwl (CommandLineTool)
cut_to_bepe
../utils/cut.cwl (CommandLineTool)

Cut columns from input file.

sort_to_bepe
../utils/sort.cwl (CommandLineTool)

Usage: sort [OPTION]... [FILE]...
Write sorted concatenation of all FILE(s) to standard output.

Mandatory arguments to long options are mandatory for short options too.

Ordering options:

  -b, --ignore-leading-blanks  ignore leading blanks
  -d, --dictionary-order       consider only blanks and alphanumeric characters
  -f, --ignore-case            fold lower case to upper case characters
  -g, --general-numeric-sort   compare according to general numerical value
  -i, --ignore-nonprinting     consider only printable characters
  -M, --month-sort             compare (unknown) < `JAN' < ... < `DEC'
  -n, --numeric-sort           compare according to string numerical value
  -r, --reverse                reverse the result of comparisons

Other options:

  -c, --check                  check whether input is sorted; do not sort
  -k, --key=POS1[,POS2]        start a key at POS1, end it at POS2 (origin 1)
  -m, --merge                  merge already sorted files; do not sort
  -o, --output=FILE            write result to FILE instead of standard output
  -s, --stable                 stabilize sort by disabling last-resort comparison
  -S, --buffer-size=SIZE       use SIZE for main memory buffer
  -t, --field-separator=SEP    use SEP instead of non-blank to blank transition
  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories
  -u, --unique                 with -c, check for strict ordering; without -c, output only the first of an equal run
  -z, --zero-terminated        end lines with 0 byte, not newline
      --help                   display this help and exit
      --version                output version information and exit

POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.

SIZE may be followed by the following multiplicative suffixes: % 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.

With no FILE, or when FILE is -, read standard input.

*** WARNING *** The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.

Report bugs to <bug-coreutils@gnu.org>.
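
To make the key syntax concrete, the following sketch mimics what `sort -k1,1 -k2,2g` would do to coordinate lines under `LC_ALL=C`: first field compared as raw bytes, second field compared as a general number, with the whole line as the last-resort comparison. It assumes tab-separated fields and a numeric second column; the function name is hypothetical and this is not how GNU sort is implemented.

```python
def sort_by_chrom_then_start(lines):
    """Order lines like `sort -k1,1 -k2,2g` under LC_ALL=C (approximation).

    Assumes tab-separated fields with a numeric second field.
    """
    def key(line):
        fields = line.rstrip("\n").split("\t")
        return (
            fields[0].encode(),   # -k1,1: byte-wise comparison of field 1
            float(fields[1]),     # -k2,2g: general numeric value of field 2
            line.encode(),        # last-resort comparison of the whole line
        )
    return sorted(lines, key=key)
```

Note that byte order places `chr10` before `chr2`, which is the "traditional sort order" the warning above refers to.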

preseq-c-curve
../map/preseq-c_curve.cwl (CommandLineTool)

Usage: c_curve [OPTIONS] <sorted-bed-file>

Options:
  -o, -output   yield output file (default: stdout)
  -s, -step     step size in extrapolations (default: 1e+06)
  -v, -verbose  print more information
  -P, -pe       input is paired end read file
  -H, -hist     input is a text file containing the observed histogram
  -V, -vals     input is a text file containing only the observed counts
  -B, -bam      input is in BAM format
  -l, -seg_len  maximum segment length when merging paired end bam reads (default: 5000)

Help options:
  -?, -help     print this help message
      -about    print about message
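
For intuition about what a complexity curve records, the sketch below counts distinct reads seen after every `step` reads in one random ordering of the input. This is a naive empirical stand-in chosen for illustration, not preseq's algorithm; the function name, the `seed` parameter, and the representation of reads as hashable keys are all assumptions.

```python
import random


def naive_complexity_curve(read_keys, step, seed=0):
    """Return (reads_sampled, distinct_reads) pairs from one random ordering.

    read_keys: hashable identifiers, equal for duplicate reads.
    A point is recorded every `step` reads and at the final read.
    """
    rng = random.Random(seed)
    shuffled = list(read_keys)
    rng.shuffle(shuffled)

    curve = []
    seen = set()
    for count, key in enumerate(shuffled, start=1):
        seen.add(key)
        if count % step == 0 or count == len(shuffled):
            curve.append((count, len(seen)))
    return curve
```

A curve that flattens early suggests that additional sequencing would mostly yield duplicates.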

mark_duplicates
../map/picard-MarkDuplicates.cwl (CommandLineTool)
sort_dedup_bams
../map/samtools-sort.cwl (CommandLineTool)
index_dedup_bams
../map/samtools-index.cwl (CommandLineTool)
sort_masked_bams
../map/samtools-sort.cwl (CommandLineTool)
index_masked_bams
../map/samtools-index.cwl (CommandLineTool)
remove_duplicates
../map/samtools-view.cwl (CommandLineTool)
extract_basename_1
../utils/extract-basename.cwl (CommandLineTool)

Extracts the base name of a file

extract_basename_2
../utils/remove-extension.cwl (CommandLineTool)

Extracts the base name of a file

sort_bams_by_tag_name
../map/samtools-sort.cwl (CommandLineTool)
index_dups_marked_bams
../map/samtools-index.cwl (CommandLineTool)
annotate_bams_with_umis
../map/fgbio-AnnotateBamWithUmis.cwl (CommandLineTool)

AnnotateBamWithUmis

Annotates existing BAM files with UMIs (Unique Molecular Indices, aka Molecular IDs, Molecular barcodes) from a separate FASTQ file. Takes an existing BAM file and a FASTQ file consisting of UMI reads, matches the reads between the files based on read names, and produces an output BAM file where each record is annotated with an optional tag (specified by 'attribute') that contains the read sequence of the UMI. Trailing read numbers ('/1' or '/2') are removed from FASTQ read names, as is any text after whitespace, before matching.

At the end of execution, reports how many records were processed and how many were missing UMIs. If any read from the BAM file did not have a matching UMI read in the FASTQ file, the program will exit with a non-zero exit status. The '--fail-fast' option may be specified to cause the program to terminate the first time it finds a record without a matching UMI.

In order to avoid sorting the input files, the entire UMI FASTQ file is read into memory. As a result, the program needs to be run with memory proportional to the size of the (uncompressed) FASTQ.
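
The matching rule described above can be sketched in a few lines: normalize each FASTQ read name, load the UMI sequences into an in-memory dictionary, then look up each BAM read name. This is an illustrative approximation rather than fgbio's implementation; the function names are hypothetical, the FASTQ is assumed to be plain four-line records, and BAM records are reduced to their read names.

```python
import re


def normalize_read_name(header):
    """Drop the leading '@', any text after whitespace, and a trailing /1 or /2."""
    name = header[1:] if header.startswith("@") else header
    parts = name.split()
    name = parts[0] if parts else ""
    return re.sub(r"/[12]$", "", name)


def load_umis(fastq_lines):
    """Read four-line FASTQ records into {read_name: umi_sequence}."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    umis = {}
    for i in range(0, len(lines) - 3, 4):
        umis[normalize_read_name(lines[i])] = lines[i + 1]
    return umis


def annotate(bam_read_names, umis):
    """Pair each BAM read name with its UMI and count names lacking one."""
    annotated = []
    missing = 0
    for name in bam_read_names:
        if name in umis:
            annotated.append((name, umis[name]))
        else:
            missing += 1
    return annotated, missing
```

Holding the whole dictionary in memory is what makes the lookup independent of file order, at the cost of memory that grows with the FASTQ size.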

remove_encode_blacklist
../map/bedtools-pairtobed.cwl (CommandLineTool)

Tool: bedtools pairtobed (aka pairToBed)
Version: v2.25.0
Summary: Report overlaps between a BEDPE file and a BED/GFF/VCF file.

Usage: bedtools pairtobed [OPTIONS] -a <bedpe> -b <bed/gff/vcf>

Options:
  -abam   The A input file is in BAM format. Output will be BAM as well. Replaces -a.
          - Requires BAM to be grouped or sorted by query.

  -ubam   Write uncompressed BAM output. Default writes compressed BAM.

  -bedpe  When using BAM input (-abam), write output as BEDPE. The default is to write output in BAM when using -abam.

  -ed     Use BAM total edit distance (NM tag) for BEDPE score.
          - Default for BEDPE is to use the minimum of the two mapping qualities for the pair.
          - When -ed is used the total edit distance from the two mates is reported as the score.

  -f      Minimum overlap required as fraction of A (e.g. 0.05). Default is 1E-9 (effectively 1bp).

  -s      Require same strandedness when finding overlaps. Default is to ignore strand. Not applicable with -type ispan or -type ospan.

  -S      Require different strandedness when finding overlaps. Default is to ignore strand. Not applicable with -type ispan or -type ospan.

  -type   Approach to reporting overlaps between BEDPE and BED.

          either    Report overlaps if either end of A overlaps B.
                    - Default.
          neither   Report A if neither end of A overlaps B.
          both      Report overlaps if both ends of A overlap B.
          xor       Report overlaps if one and only one end of A overlaps B.
          notboth   Report overlaps if neither end or one and only one end of A overlap B. That is, xor + neither.

          ispan     Report overlaps between [end1, start2] of A and B.
                    - Note: If chrom1 <> chrom2, entry is ignored.

          ospan     Report overlaps between [start1, end2] of A and B.
                    - Note: If chrom1 <> chrom2, entry is ignored.

          notispan  Report A if ispan of A doesn't overlap B.
                    - Note: If chrom1 <> chrom2, entry is ignored.

          notospan  Report A if ospan of A doesn't overlap B.
                    - Note: If chrom1 <> chrom2, entry is ignored.

Refer to the BEDTools manual for BEDPE format.
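
The end-based `-type` modes above reduce to simple boolean combinations of "end 1 overlaps B" and "end 2 overlaps B". The sketch below expresses them as a keep/drop decision per pair. It is a simplification under assumptions: intervals are half-open, any overlap of at least one base counts (approximating the default -f), strand is ignored, the span modes are omitted, and the function names are hypothetical. Which mode the workflow's remove_encode_blacklist step uses is not stated on this page; `neither` is shown as the default here only because it matches the idea of excluding blacklisted regions.

```python
def intervals_overlap(chrom_a, start_a, end_a, chrom_b, start_b, end_b):
    """True if two half-open intervals on the same chromosome share a base."""
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


def keep_pair(pair, regions, mode="neither"):
    """Decide whether a BEDPE pair passes the given -type style mode.

    pair: (chrom1, start1, end1, chrom2, start2, end2)
    regions: iterable of (chrom, start, end) BED intervals
    """
    chrom1, start1, end1, chrom2, start2, end2 = pair[:6]
    regions = list(regions)
    hit1 = any(intervals_overlap(chrom1, start1, end1, *r) for r in regions)
    hit2 = any(intervals_overlap(chrom2, start2, end2, *r) for r in regions)

    if mode == "either":
        return hit1 or hit2
    if mode == "neither":
        return not (hit1 or hit2)
    if mode == "both":
        return hit1 and hit2
    if mode == "xor":
        return hit1 != hit2
    if mode == "notboth":
        return not (hit1 and hit2)
    raise ValueError("unsupported mode: " + mode)
```

Unlike this sketch, bedtools reports one output line per overlap for the `either`, `both`, and `xor` modes rather than a single yes/no per pair.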

filter_quality_alignments
../map/samtools-view.cwl (CommandLineTool)

Outputs

ID | Type | Label | Doc
output_bowtie_log | File[] | | Bowtie log file.
output_data_bam_files | File[] | | BAM files with aligned reads.
output_templates_files | File[] | | Tags/templates coordinates, sorted by chromosome and position (sort -k1,1 -k2,2g).
output_data_dedup_bam_files | File[] | | Deduplicated BAM files with aligned reads.
output_preseq_c_curve_files | File[] | | Preseq c_curve output files.
output_data_unmapped_fastq_files | File[] | | Gzipped FASTQ files with unmapped reads.
output_picard_mark_duplicates_files | File[] | | Picard MarkDuplicates metrics files.

Permalink: https://w3id.org/cwl/view/git/4e568335133405d28f4b73ae11e7f51f2900dfa3/v1.0/STARR-seq_pipeline/03-map-pe-umis.cwl