rnalysis.fastq.shortstack_align_smallrna
- rnalysis.fastq.shortstack_align_smallrna(fastq_folder: str | Path, output_folder: str | Path, genome_fasta: str | Path, shortstack_installation_folder: str | Path | Literal['auto'] = 'auto', new_sample_names: List[str] | Literal['auto'] = 'auto', known_rnas: str | Path | None = None, trim_adapter: str | Literal['autotrim'] | None = None, autotrim_key: str = 'TCGGACCAGGCTTCATTCCCC', multimap_mode: Literal['fractional', 'unique', 'random'] = 'fractional', align_only: bool = False, show_secondary_alignments: bool = False, dicer_min_length: PositiveInt = 21, dicer_max_length: PositiveInt = 24, loci_file: str | Path | None = None, locus: str | None = None, search_microrna: None | Literal['de-novo', 'known-rnas'] = 'known-rnas', strand_cutoff: Fraction = 0.8, min_coverage: float = 2, pad: PositiveInt = 75, threads: PositiveInt = 1)
Align small RNA single-end reads from FASTQ files to a reference sequence using the ShortStack aligner (version 4). ShortStack is currently not supported on computers running Windows.
- Parameters:
fastq_folder (str or Path) – Path to the folder containing the FASTQ files you want to quantify
output_folder (str/Path to an existing folder) – Path to a folder in which the aligned reads, as well as the log files, will be saved.
genome_fasta (str or Path) – Path to the FASTA file which contain the reference sequences to be aligned to.
shortstack_installation_folder (str, Path, or 'auto' (default='auto')) – Path to the installation folder of ShortStack. For example: ‘/home/myuser/anaconda3/envs/myenv/bin’. if installation folder is set to ‘auto’, RNAlysis will attempt to find it automatically.
new_sample_names (list of str or 'auto' (default='auto')) – Give a new name to each quantified sample (optional). If sample_names=’auto’, sample names will be given automatically. Otherwise, sample_names should be a list of new names, with the order of the names matching the order of the file pairs.
known_rnas (str, Path, or None (default=None)) – Path to FASTA-formatted file of known small RNAs. FASTA must be formatted such that a single RNA sequence is on one line only. ATCGUatcgu characters are acceptable. These RNAs are typically the sequences of known microRNAs. For instance, a FASTA file of mature miRNAs pulled from https://www.mirbase.org. Providing these data increases the accuracy of MIRNA locus identification.
trim_adapter (str, 'autotrim', or None (default=None)) – Determines whether ShortStack will attempt to trim the supplied reads. If trim_adapter is not provided (default), no trimming will be run. If trim_adapter is set to ‘autotrim’, ShortStack will automatically infer the 3’ adapter sequence of the untrimmed reads, and the uses that to coordinate read trimming. If trim_adapter is a DNA sequence, ShortStack will trim the reads using the given DNA sequence as the 3’ adapter.
autotrim_key (str (default="TCGGACCAGGCTTCATTCCCC" (miR166))) – A DNA sequence to use as a known suffix during the autotrim procedure. This parameter is used only if trim_adapter is set to ‘autotrim’. ShortStack’s autotrim discovers the 3’ adapter by scanning for reads that begin with the sequence given by autotrim_key. This should be the sequence of a small RNA that is known to be highly abundant in all the libraries. The default sequence is for miR166, a microRNA that is present in nearly all plants at high levels. For non-plant experiments, or if the default is not working well, consider providing an alternative to the default.
multimap_mode ('fractional', 'unique', or 'random' (default='fractional')) – Sets the mode by which multi-mapped reads are handled. These modes are described in Johnson et al. (2016). The default mode (‘fractional’) has the best performance. In ‘fractional’ mode, ShortStack will use a fractional weighting scheme for placement of multi-mapped reads. In ‘unique’ mode, only uniquely-aligned reads are used as weights for placement of multi-mapped reads. In ‘random’ mode, multi-mapped read placement is random.
align_only (bool (default=False)) – if True, ShortStack will terminate after the alignment phase; no additional analysis will occur.
show_secondary_alignments (bool (default=False)) – if True, ShortStack will retain secondary alignments for multimapped reads. This will increase bam file size, possibly by a lot.
dicer_min_length (positive int (default=21)) – the minimum size (in nucleotides) of a valid small RNA. Together with dicer_max_length, this option sets the bounds to discriminate Dicer-derived small RNA loci from other loci. At least 80% of the reads in a given cluster must be in the range indicated by dicer_min_length and dicer_max_length.
dicer_max_length (positive int (default=24)) – the maximun size (in nucleotides) of a valid small RNA. Together with dicer_min_length, this option sets the bounds to discriminate Dicer-derived small RNA loci from other loci. At least 80% of the reads in a given cluster must be in the range indicated by dicer_min_length and dicer_max_length.
loci_file (str, Path, or None (default=None)) – Path to a file of pre-determined loci to analyze. This will prevent de-novo discovery of small RNA loci. The file may be in gff3, bed, or simple tab-delimited format (Chr:Start-Stop[tab]Name). Mutually exclusive with locus.
locus (str or None (default=None)) – A single locus to analyze, given as a string in the format Chr:Start-Stop (using one-based, inclusive numbering). This will prevent de novo discovery of small RNA loci. Mutually exclusive with loci_file.
search_microrna ('de-novo', 'known-rnas', or None (default='known-rnas')) – determines whether and how search for microRNAs will be performed. if search_microrna is None, ShortStack will not search for microRNAs. This saves computational time, but MIRNA loci will not be differentiated from other types of small RNA clusters. if search_microrna is ‘known-rnas’, t ShortStack will confine MIRNA analysis to loci where one or more queries from the known_rnas file are aligned to the genome. if search_microrna is ‘de-novo’, ShortStack will run a more comprehensive genome-wide scan for MIRNA loci. Discovered loci that do not overlap already known microRNAs should be treated with caution.
strand_cutoff (float between 0.5 and 1 (default=0.8)) – Floating point number that sets the cutoff for standedness. By default (strand_cutoff=0.8), loci with >80% reads on the top genomic strand are ‘+’ stranded, loci with <20% reads on the top genomic strand are ‘-’ stranded, and all others are unstranded ‘.’.
min_coverage (float > 0 (default=2)) – Minimum alignment depth, in units of ‘reads per million’, required to nucleate a small RNA cluster during de-novo cluster search.
pad (positive int (default=75)) – initial peaks (continuous regions with depth exceeding the argument min_coverage) are merged if they are this distance or less from each other.
threads (int > 0 (default=1)) – number of threads to run ShortStack on. More threads will generally make index building faster.