rnalysis.fastq.kallisto_quantify_paired_end
- rnalysis.fastq.kallisto_quantify_paired_end(r1_files: List[str], r2_files: List[str], output_folder: str | Path, index_file: str | Path, gtf_file: str | Path, kallisto_installation_folder: str | Path | Literal['auto'] = 'auto', new_sample_names: List[str] | Literal['auto', 'smart'] = 'smart', stranded: Literal['no', 'forward', 'reverse'] = 'no', summation_method: Literal['scaled_tpm', 'raw'] = 'scaled_tpm', bootstrap_samples: PositiveInt | None = None, **legacy_args) → CountFilter
Quantify transcript abundance in paired-end mRNA sequencing data using kallisto. Each pair of FASTQ files is quantified individually, and its results are saved in a separate sub-folder of the output folder. Alongside these sub-folders, three .csv files will be saved: a per-transcript count estimate table, a per-transcript TPM estimate table, and a per-gene scaled output table. The per-gene scaled output table is generated using the scaledTPM method (scaling the TPM estimates up to the library size) as described by Soneson et al. 2015 and used in the tximport R package. This table format is considered un-normalized for library size, and can therefore be used directly by count-based statistical inference tools such as DESeq2. RNAlysis will return this table once the analysis is finished.
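The scaledTPM idea described above can be illustrated with a minimal sketch. This is not the RNAlysis or tximport implementation; the function name `scaled_tpm_per_gene`, the dictionary-based inputs, and the toy transcript and gene names are all assumptions made for illustration. The sketch sums transcript TPM values to the gene level and then rescales them so that they add up to the library size (taken here as the sum of the estimated counts).

```python
from collections import defaultdict
from typing import Dict


def scaled_tpm_per_gene(
    transcript_tpm: Dict[str, float],
    transcript_counts: Dict[str, float],
    transcript_to_gene: Dict[str, str],
) -> Dict[str, float]:
    """Illustrative sketch: gene-level TPM rescaled to the library size."""
    # Step 1: sum the transcript-level TPM estimates to the gene level.
    gene_tpm: Dict[str, float] = defaultdict(float)
    for transcript, tpm in transcript_tpm.items():
        gene_tpm[transcript_to_gene[transcript]] += tpm

    # Step 2: use the sum of estimated counts as the library size (assumption).
    library_size = sum(transcript_counts.values())
    total_tpm = sum(gene_tpm.values())
    if total_tpm == 0:
        return {gene: 0.0 for gene in gene_tpm}

    # Step 3: rescale so that the gene-level values sum to the library size.
    return {gene: tpm * library_size / total_tpm for gene, tpm in gene_tpm.items()}


# Toy data with hypothetical transcript and gene identifiers.
toy_tpm = {"tx1": 200000.0, "tx2": 300000.0, "tx3": 500000.0}
toy_counts = {"tx1": 10.0, "tx2": 30.0, "tx3": 60.0}
toy_map = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

toy_result = scaled_tpm_per_gene(toy_tpm, toy_counts, toy_map)
```

In this toy case the rescaled gene-level values add up to the total of the estimated counts, which is why such a table can be treated as count-like by downstream tools.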
- Parameters:
r1_files (list of str/Path to existing FASTQ files) – a list of paths to your Read#1 files. The files should be sorted in tandem with r2_files, so that they line up to form pairs of R1 and R2 files.
r2_files (list of str/Path to existing FASTQ files) – a list of paths to your Read#2 files. The files should be sorted in tandem with r1_files, so that they line up to form pairs of R1 and R2 files.
output_folder (str/Path to an existing folder) – Path to a folder in which the quantified results, as well as the log files, will be saved. The individual output of each pair of FASTQ files will reside in a different sub-folder within the output folder, and a summarized results table will be saved in the output folder itself.
index_file (str or Path) – Path to a pre-built kallisto index of the target transcriptome. Can either be downloaded from the kallisto transcriptome indices site, or generated manually from a FASTA file using the function kallisto_create_index.
gtf_file (str or Path) – Path to a GTF annotation file. This file will be used to map per-transcript abundances to per-gene estimated counts. The transcript names in the GTF file should match the ones in the index file - we recommend downloading cDNA FASTA/index files and GTF files from the same data source.
kallisto_installation_folder (str, Path, or 'auto' (default='auto')) – Path to the installation folder of kallisto. For example: ‘C:/Program Files/kallisto’. If the installation folder is set to ‘auto’, RNAlysis will attempt to find it automatically.
new_sample_names (list of str, 'auto', or 'smart' (default='smart')) – Give a new name to each quantified sample (optional). If new_sample_names=’auto’, sample names will be given automatically. To set the names manually, new_sample_names should be a list of new names, with the order of the names matching the order of the file pairs.
stranded ('no', 'forward', 'reverse' (default='no')) – Indicates the strandedness of the data. ‘no’ indicates the data is not stranded. ‘forward’ indicates the data is stranded, where the first read in the pair pseudoaligns to the forward strand of a transcript. ‘reverse’ indicates the data is stranded, where the first read in the pair pseudoaligns to the reverse strand of a transcript.
summation_method ('scaled_tpm' or 'raw' (default='scaled_tpm')) – Determines the method used to sum the transcript-level abundances to gene-level abundances. ‘scaled_tpm’ sums the transcript TPM estimates to the gene level, and then scales them to the library size. ‘raw’ sums the transcript estimated counts to the gene level without scaling.
learn_bias (bool (default=False)) – if True, kallisto learns parameters for a model of sequence-specific bias and corrects the abundances accordingly. Note that this feature is not supported by kallisto versions beyond 0.48.0.
seek_fusion_genes (bool (default=False)) – if True, does normal quantification, but additionally looks for reads that do not pseudoalign because they are potentially from fusion genes. All output is written to the file fusion.txt in the output folder. Note that this feature is not supported by kallisto versions beyond 0.48.0.
bootstrap_samples (int >0 or None (default=None)) – Number of bootstrap samples to be generated. Bootstrap samples do not affect the estimated count values, but generate an additional .hdf5 output file which contains uncertainty estimates for the expression levels. This is required if you use tools such as Sleuth for downstream differential expression analysis, but not for more traditional tools such as DESeq2 and edgeR.
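The parameters above can be assembled into a call along the following lines. This is a hedged sketch rather than an official example: the helper names `build_quantify_kwargs` and `run_quantification` and every file path are hypothetical, and the pairing checks are an illustration of the "sorted in tandem" requirement, not something the source says RNAlysis performs. The import of RNAlysis is deferred into `run_quantification` so that the argument-building part can be read and tested without RNAlysis or kallisto installed.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence, Union

PathLike = Union[str, Path]


def build_quantify_kwargs(
    r1_files: Sequence[PathLike],
    r2_files: Sequence[PathLike],
    output_folder: PathLike,
    index_file: PathLike,
    gtf_file: PathLike,
    new_sample_names: Union[List[str], str] = "smart",
    stranded: str = "no",
    summation_method: str = "scaled_tpm",
    bootstrap_samples: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: check that the inputs line up and collect them."""
    # R1 and R2 lists must be the same length so that they form pairs.
    if len(r1_files) != len(r2_files):
        raise ValueError("r1_files and r2_files must contain the same number of files")
    # A manual list of names needs exactly one name per file pair.
    if not isinstance(new_sample_names, str) and len(new_sample_names) != len(r1_files):
        raise ValueError("new_sample_names must contain one name per file pair")
    return {
        "r1_files": [str(path) for path in r1_files],
        "r2_files": [str(path) for path in r2_files],
        "output_folder": Path(output_folder),
        "index_file": Path(index_file),
        "gtf_file": Path(gtf_file),
        "new_sample_names": new_sample_names,
        "stranded": stranded,
        "summation_method": summation_method,
        "bootstrap_samples": bootstrap_samples,
    }


def run_quantification(**kwargs: Any):
    """Pass the collected arguments to RNAlysis (requires RNAlysis and kallisto)."""
    from rnalysis import fastq  # deferred import; not needed to build the arguments

    return fastq.kallisto_quantify_paired_end(**kwargs)


# Hypothetical file names; replace them with your own paths.
example_kwargs = build_quantify_kwargs(
    r1_files=["reads/ctrl_R1.fastq.gz", "reads/treated_R1.fastq.gz"],
    r2_files=["reads/ctrl_R2.fastq.gz", "reads/treated_R2.fastq.gz"],
    output_folder="kallisto_output",
    index_file="transcriptome.idx",
    gtf_file="annotation.gtf",
    new_sample_names=["ctrl", "treated"],
)
```

Calling `run_quantification(**example_kwargs)` would then return the per-gene table as a CountFilter, provided the listed files exist and kallisto can be found.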