rnalysis.fastq.find_duplicates

rnalysis.fastq.find_duplicates(input_folder: Union[str, Path], output_folder: Union[str, Path], picard_installation_folder: Union[str, Path, Literal['auto']] = 'auto', new_sample_names: Union[List[str], Literal['auto']] = 'auto', output_format: Literal['sam', 'bam'] = 'bam', duplicate_handling: Literal['mark', 'remove_optical', 'remove_all'] = 'remove_all', duplicate_scoring_strategy: Literal['reference_length', 'sum_of_base_qualities', 'random'] = 'sum_of_base_qualities', optical_duplicate_pixel_distance: int = 100)

Find duplicate reads in SAM/BAM files using Picard MarkDuplicates.

Parameters
  • input_folder (str or Path) – Path to the folder containing the SAM/BAM files in which you want to find duplicates.

  • output_folder (str or Path) – Path to a folder in which the processed SAM/BAM files will be saved.

  • picard_installation_folder (str, Path, or 'auto' (default='auto')) – Path to the installation folder of Picard. For example: ‘C:/Program Files/Picard’

  • new_sample_names (list of str or 'auto' (default='auto')) – Give a new name to each processed sample (optional). If new_sample_names=’auto’, sample names will be given automatically. Otherwise, new_sample_names should be a list of new names, with the order of the names matching the order of the files in the directory.

  • output_format ('sam' or 'bam' (default='bam')) – Format of the output file.

  • duplicate_handling ('mark', 'remove_optical', or 'remove_all' (default='remove_all')) – How to handle detected duplicate reads. If ‘mark’, duplicate reads will be marked with a 1024 flag. If ‘remove_optical’, ‘optical’ duplicates and other duplicates that appear to have arisen from the sequencing process instead of the library preparation process will be removed. If ‘remove_all’, all duplicate reads will be removed.

  • duplicate_scoring_strategy ('reference_length', 'sum_of_base_qualities', or 'random' (default='sum_of_base_qualities')) – How to score duplicate reads. The score determines which read in each set of duplicates is kept as the non-duplicate. If ‘reference_length’, the length of the reference sequence will be used. If ‘sum_of_base_qualities’, the sum of the base qualities will be used. If ‘random’, the read to keep is chosen at random.

  • optical_duplicate_pixel_distance (int (default=100)) – The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default (100) is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best.
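
Example

The parameters above might be combined as in the following minimal sketch. The helper names (pixel_distance_for, is_marked_duplicate, mark_duplicates) are hypothetical and not part of RNAlysis. The sketch assumes RNAlysis and Picard are installed, and that the pixel distances of 100 and 2500 described above suit your sequencer.

```python
from pathlib import Path

# SAM flag bit that marks a read as a duplicate (decimal 1024)
DUPLICATE_FLAG = 0x400


def pixel_distance_for(patterned_flowcell: bool) -> int:
    """Hypothetical helper: pick the optical duplicate pixel distance
    suggested in the parameter description above."""
    return 2500 if patterned_flowcell else 100


def is_marked_duplicate(sam_flag: int) -> bool:
    """Hypothetical helper: check whether a SAM flag value has the
    duplicate bit (1024) set, as it would with duplicate_handling='mark'."""
    return bool(sam_flag & DUPLICATE_FLAG)


def mark_duplicates(input_folder, output_folder, patterned_flowcell=False):
    """Hypothetical wrapper: mark duplicates (instead of removing them)
    in every SAM/BAM file found in input_folder."""
    # Imported here so the helpers above work without RNAlysis installed
    from rnalysis import fastq

    fastq.find_duplicates(
        input_folder=Path(input_folder),
        output_folder=Path(output_folder),
        output_format='bam',
        duplicate_handling='mark',
        duplicate_scoring_strategy='sum_of_base_qualities',
        optical_duplicate_pixel_distance=pixel_distance_for(patterned_flowcell),
    )
```

Calling mark_duplicates('alignments', 'alignments_marked', patterned_flowcell=True) would then write BAM files in which duplicates are flagged but not removed. The folder names here are placeholders.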