rnalysis.fastq.shortstack_align_smallrna

rnalysis.fastq.shortstack_align_smallrna(fastq_folder: Union[str, Path], output_folder: Union[str, Path], genome_fasta: Union[str, Path], shortstack_installation_folder: Union[str, Path, Literal['auto']] = 'auto', new_sample_names: Union[List[str], Literal['auto']] = 'auto', known_rnas: Optional[Union[str, Path]] = None, trim_adapter: Optional[Union[str, Literal['autotrim']]] = None, autotrim_key: str = 'TCGGACCAGGCTTCATTCCCC', multimap_mode: Literal['fractional', 'unique', 'random'] = 'fractional', align_only: bool = False, show_secondary_alignments: bool = False, dicer_min_length: PositiveInt = 21, dicer_max_length: PositiveInt = 24, loci_file: Optional[Union[str, Path]] = None, locus: Optional[str] = None, search_microrna: Union[None, Literal['de-novo', 'known-rnas']] = 'known-rnas', strand_cutoff: Fraction = 0.8, min_coverage: float = 2, pad: PositiveInt = 75, threads: PositiveInt = 1)

Align small RNA single-end reads from FASTQ files to a reference sequence using the ShortStack aligner (version 4). ShortStack is currently not supported on computers running Windows.

Parameters
  • fastq_folder (str or Path) – Path to the folder containing the FASTQ files you want to quantify

  • output_folder (str/Path to an existing folder) – Path to a folder in which the aligned reads, as well as the log files, will be saved.

  • genome_fasta (str or Path) – Path to the FASTA file which contain the reference sequences to be aligned to.

  • shortstack_installation_folder (str, Path, or 'auto' (default='auto')) – Path to the installation folder of ShortStack. For example: ‘/home/myuser/anaconda3/envs/myenv/bin’. if installation folder is set to ‘auto’, RNAlysis will attempt to find it automatically.

  • new_sample_names (list of str or 'auto' (default='auto')) – Give a new name to each quantified sample (optional). If sample_names=’auto’, sample names will be given automatically. Otherwise, sample_names should be a list of new names, with the order of the names matching the order of the file pairs.

  • known_rnas (str, Path, or None (default=None)) – Path to FASTA-formatted file of known small RNAs. FASTA must be formatted such that a single RNA sequence is on one line only. ATCGUatcgu characters are acceptable. These RNAs are typically the sequences of known microRNAs. For instance, a FASTA file of mature miRNAs pulled from https://www.mirbase.org. Providing these data increases the accuracy of MIRNA locus identification.

  • trim_adapter (str, 'autotrim', or None (default=None)) – Determines whether ShortStack will attempt to trim the supplied reads. If trim_adapter is not provided (default), no trimming will be run. If trim_adapter is set to ‘autotrim’, ShortStack will automatically infer the 3’ adapter sequence of the untrimmed reads, and the uses that to coordinate read trimming. If trim_adapter is a DNA sequence, ShortStack will trim the reads using the given DNA sequence as the 3’ adapter.

  • autotrim_key (str (default="TCGGACCAGGCTTCATTCCCC" (miR166))) – A DNA sequence to use as a known suffix during the autotrim procedure. This parameter is used only if trim_adapter is set to ‘autotrim’. ShortStack’s autotrim discovers the 3’ adapter by scanning for reads that begin with the sequence given by autotrim_key. This should be the sequence of a small RNA that is known to be highly abundant in all the libraries. The default sequence is for miR166, a microRNA that is present in nearly all plants at high levels. For non-plant experiments, or if the default is not working well, consider providing an alternative to the default.

  • multimap_mode ('fractional', 'unique', or 'random' (default='fractional')) – Sets the mode by which multi-mapped reads are handled. These modes are described in Johnson et al. (2016). The default mode (‘fractional’) has the best performance. In ‘fractional’ mode, ShortStack will use a fractional weighting scheme for placement of multi-mapped reads. In ‘unique’ mode, only uniquely-aligned reads are used as weights for placement of multi-mapped reads. In ‘random’ mode, multi-mapped read placement is random.

  • align_only (bool (default=False)) – if True, ShortStack will terminate after the alignment phase; no additional analysis will occur.

  • show_secondary_alignments (bool (default=False)) – if True, ShortStack will retain secondary alignments for multimapped reads. This will increase bam file size, possibly by a lot.

  • dicer_min_length (positive int (default=21)) – the minimum size (in nucleotides) of a valid small RNA. Together with dicer_max_length, this option sets the bounds to discriminate Dicer-derived small RNA loci from other loci. At least 80% of the reads in a given cluster must be in the range indicated by dicer_min_length and dicer_max_length.

  • dicer_max_length (positive int (default=24)) – the maximun size (in nucleotides) of a valid small RNA. Together with dicer_min_length, this option sets the bounds to discriminate Dicer-derived small RNA loci from other loci. At least 80% of the reads in a given cluster must be in the range indicated by dicer_min_length and dicer_max_length.

  • loci_file (str, Path, or None (default=None)) – Path to a file of pre-determined loci to analyze. This will prevent de-novo discovery of small RNA loci. The file may be in gff3, bed, or simple tab-delimited format (Chr:Start-Stop[tab]Name). Mutually exclusive with locus.

  • locus (str or None (default=None)) – A single locus to analyze, given as a string in the format Chr:Start-Stop (using one-based, inclusive numbering). This will prevent de novo discovery of small RNA loci. Mutually exclusive with loci_file.

  • search_microrna ('de-novo', 'known-rnas', or None (default='known-rnas')) – determines whether and how search for microRNAs will be performed. if search_microrna is None, ShortStack will not search for microRNAs. This saves computational time, but MIRNA loci will not be differentiated from other types of small RNA clusters. if search_microrna is ‘known-rnas’, t ShortStack will confine MIRNA analysis to loci where one or more queries from the known_rnas file are aligned to the genome. if search_microrna is ‘de-novo’, ShortStack will run a more comprehensive genome-wide scan for MIRNA loci. Discovered loci that do not overlap already known microRNAs should be treated with caution.

  • strand_cutoff (float between 0.5 and 1 (default=0.8)) – Floating point number that sets the cutoff for standedness. By default (strand_cutoff=0.8), loci with >80% reads on the top genomic strand are ‘+’ stranded, loci with <20% reads on the top genomic strand are ‘-’ stranded, and all others are unstranded ‘.’.

  • min_coverage (float > 0 (default=2)) – Minimum alignment depth, in units of ‘reads per million’, required to nucleate a small RNA cluster during de-novo cluster search.

  • pad (positive int (default=75)) – initial peaks (continuous regions with depth exceeding the argument min_coverage) are merged if they are this distance or less from each other.

  • threads (int > 0 (default=1)) – number of threads to run ShortStack on. More threads will generally make index building faster.