rnalysis.filtering.CountFilter

class rnalysis.filtering.CountFilter(fname: Union[str, Path, tuple], drop_columns: Union[str, List[str]] = None, is_normalized: bool = False)

A class that receives a count matrix and can filter it according to various characteristics.

Attributes

df: pandas DataFrame

A DataFrame that contains the count matrix contents. The DataFrame is modified upon usage of filter operations.

shape: tuple (rows, columns)

The dimensions of df.

columns: list

The columns of df.

fname: pathlib.Path

The path and filename for the purpose of saving df as a csv file. Updates automatically when filter operations are applied.

index_set: set

All of the indices in the current DataFrame (which were not removed by previously used filter methods) as a set.

index_string: string

A string of all feature indices in the current DataFrame separated by newline.

triplicates: list

A nested list of the column names in the CountFilter, grouped in alphabetical order into triplicates. For example, if counts.columns is ['A_rep1', 'A_rep2', 'A_rep3', 'B_rep1', 'B_rep2', 'B_rep3'], then counts.triplicates will be [['A_rep1', 'A_rep2', 'A_rep3'], ['B_rep1', 'B_rep2', 'B_rep3']].
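The grouping behind the triplicates attribute can be sketched in plain Python. This is only an illustration of the described behaviour, not the RNAlysis implementation: the helper name group_into_triplicates is hypothetical, and the sketch assumes that a column count that is not a multiple of three simply leaves a shorter final group.

```python
def group_into_triplicates(columns):
    """Sort column names alphabetically, then chunk them into groups of three."""
    ordered = sorted(columns)
    # Step through the sorted names three at a time (assumption: a remainder
    # becomes a shorter last group).
    return [ordered[i:i + 3] for i in range(0, len(ordered), 3)]


columns = ["B_rep2", "A_rep1", "B_rep1", "A_rep3", "A_rep2", "B_rep3"]
triplicates = group_into_triplicates(columns)
print(triplicates)
```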

__init__(fname: Union[str, Path, tuple], drop_columns: Union[str, List[str]] = None, is_normalized: bool = False)

Load a count matrix. A valid count matrix should have one row per gene/genomic feature and one column per condition/RNA library. The contents of the count matrix can be raw or pre-normalized.

Parameters
  • fname (Union[str, Path]) – full path/filename of the .csv file to be loaded into the Filter object

  • drop_columns (str, list of str, or None (default=None)) – if a string or list of strings is specified, the columns with the matching name/s will be dropped from the loaded table.

  • is_normalized (bool (default=False)) – indicates whether this count table is pre-normalized. RNAlysis issues a warning when a function meant for normalized tables is applied to a table that was not already normalized.

CountFilter.average_replicate_samples(...[, ...])

Average the gene expression values within each group of replicate samples.
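As a rough illustration of averaging replicates, the sketch below collapses each group of replicate columns into a single value per feature. The names average_replicates and sample_grouping are hypothetical, the table is a plain dictionary rather than a CountFilter, and the arithmetic mean is assumed as the averaging function.

```python
def average_replicates(table, sample_grouping):
    """Average each feature's values within every group of replicate samples.

    table: dict mapping feature -> dict of sample name -> expression value
    sample_grouping: dict mapping group name -> list of replicate sample names
    """
    averaged = {}
    for feature, values in table.items():
        averaged[feature] = {
            # Arithmetic mean of the replicates in this group (assumption).
            group: sum(values[sample] for sample in samples) / len(samples)
            for group, samples in sample_grouping.items()
        }
    return averaged


table = {
    "gene1": {"A_rep1": 10, "A_rep2": 20, "B_rep1": 4, "B_rep2": 6},
    "gene2": {"A_rep1": 0, "A_rep2": 2, "B_rep1": 100, "B_rep2": 300},
}
sample_grouping = {"A": ["A_rep1", "A_rep2"], "B": ["B_rep1", "B_rep2"]}
averaged = average_replicates(table, sample_grouping)
print(averaged)
```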

CountFilter.biotypes_from_gtf(gtf_path[, ...])

Returns a DataFrame describing the biotypes in the table and their count.

CountFilter.biotypes_from_ref_table([...])

Returns a DataFrame describing the biotypes in the table and their count.

CountFilter.box_plot([samples, notch, ...])

Generates a box plot of the specified samples in the CountFilter object in log10 scale.

CountFilter.clustergram([sample_names, ...])

Performs hierarchical clustering and plots a clustergram on the base-2 log of a given set of samples.

CountFilter.describe([percentiles])

Generate descriptive statistics that summarize the central tendency, dispersion and shape of the dataset’s distribution, excluding NaN values.

CountFilter.difference(*others[, ...])

Keep only the features that exist in the first Filter object/set but NOT in the others.

CountFilter.differential_expression_deseq2(...)

Run differential expression analysis on the count matrix using the DESeq2 algorithm.

CountFilter.differential_expression_limma_voom(...)

Run differential expression analysis on the count matrix using the Limma-Voom pipeline.

CountFilter.drop_columns(columns[, inplace])

Drop specific columns from the table.

CountFilter.enhanced_box_plot([samples, ...])

Generates an enhanced box-plot of the specified samples in the CountFilter object in log10 scale.

CountFilter.filter_biotype_from_gtf(gtf_path)

Filters out all features that do not match the indicated biotype/biotypes (for example: 'protein_coding', 'ncRNA', etc).

CountFilter.filter_biotype_from_ref_table([...])

Filters out all features that do not match the indicated biotype/biotypes (for example: 'protein_coding', 'ncRNA', etc).

CountFilter.filter_by_attribute([...])

Filters features according to user-defined attributes from an Attribute Reference Table.

CountFilter.filter_by_go_annotations(go_ids)

Filters genes according to GO annotations, keeping only genes that are annotated with a specific GO term.

CountFilter.filter_by_kegg_annotations(kegg_ids)

Filters genes according to KEGG pathways, keeping only genes that belong to a specific KEGG pathway.

CountFilter.filter_by_row_name(row_names[, ...])

Filter out specific rows from the table by their name (index).

CountFilter.filter_by_row_sum([threshold, ...])

Removes features/rows whose sum is below 'threshold'.

CountFilter.filter_duplicate_ids([keep, ...])

Filter out rows with duplicate names/IDs (index).

CountFilter.filter_low_reads([threshold, ...])

Filter out features which are lowly-expressed in all columns, keeping only features with at least 'threshold' reads in at least one column.
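The rule "at least 'threshold' reads in at least one column" amounts to comparing each row's maximum with the threshold. A minimal sketch on a plain dictionary follows; the function is a stand-in written for illustration, not the library's code.

```python
def filter_low_reads(counts, threshold):
    """Keep features with at least `threshold` reads in at least one column."""
    return {
        feature: reads
        for feature, reads in counts.items()
        # A feature survives if its highest column reaches the threshold.
        if max(reads) >= threshold
    }


counts = {
    "gene1": [0, 2, 1],    # lowly expressed in every column
    "gene2": [3, 50, 7],   # one column is well above the threshold
    "gene3": [5, 5, 5],    # exactly at the threshold
}
kept = filter_low_reads(counts, threshold=5)
print(sorted(kept))
```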

CountFilter.filter_missing_values([columns, ...])

Remove all rows whose values in the specified columns are missing (NaN).

CountFilter.filter_percentile(percentile, column)

Removes all entries above the specified percentile in the specified column.

CountFilter.filter_top_n(by[, n, ascending, ...])

Sort the rows by the values of specified column or columns, then keep only the top 'n' rows.
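Sorting and then truncating can be sketched as follows. The function below is a hypothetical stand-in that works on a dictionary of rows and a single sort column; the default value chosen for ascending is an assumption made for this example.

```python
def filter_top_n(table, by, n, ascending=True):
    """Sort rows by the column `by`, then keep only the first `n` rows."""
    ranked = sorted(
        table.items(),
        key=lambda item: item[1][by],
        reverse=not ascending,
    )
    # dict preserves insertion order, so the result stays sorted.
    return dict(ranked[:n])


table = {
    "gene1": {"cond1": 5, "cond2": 1},
    "gene2": {"cond1": 20, "cond2": 3},
    "gene3": {"cond1": 12, "cond2": 8},
}
top = filter_top_n(table, by="cond1", n=2, ascending=False)
print(list(top))
```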

CountFilter.find_paralogs_ensembl([...])

Find paralogs within the same species using the Ensembl database.

CountFilter.find_paralogs_panther([...])

Find paralogs within the same species using the PantherDB database.

CountFilter.fold_change(numerator, denominator)

Calculate the fold change between the numerator condition and the denominator condition, and return it as a FoldChangeFilter object.

CountFilter.from_dataframe(df, name[, ...])

CountFilter.from_folder(folder_path[, ...])

Iterates over count .txt files in a given folder and combines them into a single CountFilter table.

CountFilter.from_folder_htseqcount(folder_path)

Iterates over HTSeq count .txt files in a given folder and combines them into a single CountFilter table.

CountFilter.head([n])

Return the first n rows of the Filter object.

CountFilter.intersection(*others[, ...])

Keep only the features that exist in ALL of the given Filter objects/sets.

CountFilter.ma_plot([ref_column, columns, ...])

Generates M-A (log-ratio vs. log-intensity) plots.

CountFilter.majority_vote_intersection(*others)

Returns a set/string of the features that appear in at least (majority_threshold * 100)% of the given Filter objects/sets.
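One way to picture the majority vote is to count, for every feature, how many of the input sets contain it. The sketch below operates on plain Python sets; the default of 0.5 for majority_threshold is an assumption made for this example.

```python
from collections import Counter


def majority_vote_intersection(*feature_sets, majority_threshold=0.5):
    """Return features present in at least majority_threshold of the sets."""
    # Each set casts at most one vote per feature.
    votes = Counter(
        feature for features in feature_sets for feature in set(features)
    )
    needed = majority_threshold * len(feature_sets)
    return {feature for feature, count in votes.items() if count >= needed}


first = {"gene1", "gene2", "gene3"}
second = {"gene2", "gene3", "gene4"}
third = {"gene3", "gene4", "gene5"}
majority = majority_vote_intersection(first, second, third)
print(sorted(majority))
```

With three sets and a threshold of 0.5, a feature needs at least 1.5 votes, so anything found in two or more sets is kept.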

CountFilter.map_orthologs_ensembl(...[, ...])

Map genes to their nearest orthologs in a different species using the Ensembl database.

CountFilter.map_orthologs_orthoinspector(...)

Map genes to their nearest orthologs in a different species using the OrthoInspector database.

CountFilter.map_orthologs_panther(...[, ...])

Map genes to their nearest orthologs in a different species using the PantherDB database.

CountFilter.map_orthologs_phylomedb(...[, ...])

Map genes to their nearest orthologs in a different species using the PhylomeDB database. This function generates a table describing all matching discovered ortholog pairs (both unique and non-unique) and returns it, and can also translate the genes in this data table into their nearest ortholog, as well as remove unmapped genes.

CountFilter.normalize_median_of_ratios(...)

Normalizes the count matrix using the 'Median of Ratios Normalization' (MRN) method (Maza et al 2013).

CountFilter.normalize_rle([inplace, ...])

Normalizes the count matrix using the 'Relative Log Expression' (RLE) method (Anders and Huber 2010).

CountFilter.normalize_tmm([log_ratio_trim, ...])

Normalizes the count matrix using the 'trimmed mean of M values' (TMM) method (Robinson and Oshlack 2010).

CountFilter.normalize_to_quantile([...])

Normalizes the count matrix using the quantile method, generalized from Bullard et al 2010.

CountFilter.normalize_to_rpkm(gtf_file[, ...])

Normalizes the count matrix to Reads Per Kilobase Million (RPKM). Divides each column in the count matrix by (total reads)*(gene length / 1000)*10^-6.
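The RPKM arithmetic stated above can be checked with a small stand-alone sketch for a single column. The function name and the hard-coded gene lengths are hypothetical, and the total is assumed to be the sum of the listed counts only.

```python
def to_rpkm(column, gene_lengths):
    """Divide each count by (total reads) * (gene length / 1000) * 10^-6."""
    total_reads = sum(column)
    return [
        reads / (total_reads * (length / 1000) * 1e-6)
        for reads, length in zip(column, gene_lengths)
    ]


column = [100, 300, 600]            # raw reads for three genes in one sample
gene_lengths = [1000, 2000, 3000]   # gene lengths in base pairs
rpkm = to_rpkm(column, gene_lengths)
print(rpkm)
```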

CountFilter.normalize_to_rpm([inplace, ...])

Normalizes the count matrix to Reads Per Million (RPM).
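As a sketch of the RPM calculation for a single column: every count is scaled so that the column total becomes one million. The function name is hypothetical, and the sketch assumes the total is the sum of the counts shown; whether other reads (for example unaligned or ambiguous ones) enter the total is not covered here.

```python
def to_rpm(column):
    """Scale the counts of one sample so that they sum to one million."""
    total_reads = sum(column)
    return [reads * 1_000_000 / total_reads for reads in column]


column = [100, 300, 600]
rpm = to_rpm(column)
print(rpm)
```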

CountFilter.normalize_to_rpm_htseqcount(...)

Normalizes the count matrix to Reads Per Million (RPM).

CountFilter.normalize_to_tpm(gtf_file[, ...])

Normalizes the count matrix to Transcripts Per Million (TPM). First, normalizes each gene to Reads Per Kilobase (RPK) by dividing each gene in the count matrix by its length in Kbp (gene length / 1000). Then, divides each column in the RPK matrix by (total RPK in column)*10^-6. This calculation is similar to that of Reads Per Kilobase Million (RPKM), but in the opposite order: the "per million" normalization factors are calculated after normalizing to gene lengths, not before.
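The two TPM steps described above (length normalization first, then the "per million" scaling) can be sketched for one column as follows. The function name and the example gene lengths are hypothetical.

```python
def to_tpm(column, gene_lengths):
    """Normalize one sample to TPM: RPK first, then scale to one million."""
    # Step 1: reads per kilobase (RPK) for every gene.
    rpk = [reads / (length / 1000) for reads, length in zip(column, gene_lengths)]
    # Step 2: divide by (total RPK in the column) * 10^-6.
    scale = sum(rpk) * 1e-6
    return [value / scale for value in rpk]


column = [100, 300, 600]
gene_lengths = [1000, 2000, 3000]
tpm = to_tpm(column, gene_lengths)
print(tpm)
print(sum(tpm))
```

Because the scaling factor is computed after the length correction, the TPM values of every sample add up to one million, which is not generally true of RPKM.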

CountFilter.normalize_with_scaling_factors(...)

Normalizes the reads in the CountFilter using pre-calculated scaling factors.

CountFilter.number_filters(column, operator, ...)

Apply a number filter (greater than, equal, lesser than) on a particular column in the Filter object.
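A number filter of this kind can be pictured as a lookup from an operator name to a comparison function. In the sketch below the operator names, the function name, and the choice to keep the rows that satisfy the comparison are all assumptions made for illustration; the names RNAlysis accepts may differ.

```python
import operator

# Hypothetical operator names mapped to the standard comparison functions.
OPERATORS = {
    "greater than": operator.gt,
    "equals": operator.eq,
    "lesser than": operator.lt,
}


def number_filter(table, column, operator_name, value):
    """Keep rows whose value in `column` satisfies the chosen comparison."""
    compare = OPERATORS[operator_name]
    return {
        feature: row
        for feature, row in table.items()
        if compare(row[column], value)
    }


table = {
    "gene1": {"cond1": 5},
    "gene2": {"cond1": 20},
    "gene3": {"cond1": 12},
}
high = number_filter(table, "cond1", "greater than", 10)
print(sorted(high))
```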

CountFilter.pairplot([samples, log2, ...])

Plot pairwise relationships in the dataset.

CountFilter.pca([samples, n_components, ...])

Performs Principal Component Analysis (PCA), visualizing the principal components that explain the most variance between the different samples.

CountFilter.plot_expression(features[, ...])

Plot the average expression and standard error of the specified features under the specified conditions.

CountFilter.print_features()

Print the feature indices in the Filter object, sorted by their current order in the Filter object, and separated by newline.

CountFilter.save_csv([alt_filename])

Saves the current filtered data to a .csv file.

CountFilter.save_parquet([alt_filename])

Saves the current filtered data to a .parquet file.

CountFilter.save_table([suffix, alt_filename])

Save the current filtered data table.

CountFilter.scatter_sample_vs_sample(...[, ...])

Generate a scatter plot where every dot is a feature, the x value is log10 of reads (counts, RPM, RPKM, TPM, etc) in sample1, and the y value is log10 of reads in sample2.

CountFilter.sort(by[, ascending, ...])

Sort the rows by the values of specified column or columns.

CountFilter.sort_by_principal_component(...)

Performs Principal Component Analysis (PCA), and sorts the table based on the contribution (loadings) of genes to a specific Principal Component. This type of analysis can help you understand which genes contribute the most to each principal component, particularly using single-list enrichment analysis.

CountFilter.split_by_attribute(attributes[, ref])

Splits the features in the Filter object into multiple Filter objects, each corresponding to one of the specified Attribute Reference Table attributes.

CountFilter.split_by_percentile(percentile, ...)

Splits the features in the Filter object into two non-overlapping Filter objects: one containing features below the specified percentile in the specified column, and the other containing features above the specified percentile in the specified column.
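A percentile split can be sketched in two steps: compute the cutoff value, then partition the features around it. The sketch below makes several assumptions that are not stated on this page: the percentile is given as a fraction between 0 and 1, the cutoff uses linear interpolation between neighbouring values, and features equal to the cutoff go to the lower group. Both function names are hypothetical.

```python
def percentile_cutoff(values, percentile):
    """Linear-interpolation percentile; `percentile` is between 0 and 1."""
    ordered = sorted(values)
    position = percentile * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def split_by_percentile(column, percentile):
    """Split feature -> value pairs into (at or below cutoff, above cutoff)."""
    cutoff = percentile_cutoff(column.values(), percentile)
    below = {f: v for f, v in column.items() if v <= cutoff}
    above = {f: v for f, v in column.items() if v > cutoff}
    return below, above


column = {"gene1": 1, "gene2": 2, "gene3": 3, "gene4": 4, "gene5": 5}
below, above = split_by_percentile(column, 0.5)
print(sorted(below), sorted(above))
```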

CountFilter.split_by_principal_components(...)

Performs Principal Component Analysis (PCA), and splits the table based on the contribution (loadings) of genes to specific Principal Components. For each Principal Component specified, RNAlysis will find the X% most influential genes on the Principal Component based on their loadings (where X is gene_fraction): (X/2)% from the top and (X/2)% from the bottom. This type of analysis can help you understand which genes contribute the most to each principal component.

CountFilter.split_by_reads([threshold])

Splits the features in the CountFilter object into two non-overlapping CountFilter objects, based on their maximum expression level.

CountFilter.split_clicom(*parameter_dicts[, ...])

Clusters the features in the CountFilter object using the modified CLICOM ensemble clustering algorithm (Mimaroglu and Yagci 2012), and then splits those features into multiple non-overlapping CountFilter objects, based on the clustering result.

CountFilter.split_hdbscan(min_cluster_size)

Clusters the features in the CountFilter object using the HDBSCAN clustering algorithm, and then splits those features into multiple non-overlapping CountFilter objects, based on the clustering result.

CountFilter.split_hierarchical(n_clusters[, ...])

Clusters the features in the CountFilter object using the Hierarchical clustering algorithm, and then splits those features into multiple non-overlapping CountFilter objects, based on the clustering result.

CountFilter.split_kmeans(n_clusters[, ...])

Clusters the features in the CountFilter object using the K-means clustering algorithm, and then splits those features into multiple non-overlapping CountFilter objects, based on the clustering result.

CountFilter.split_kmedoids(n_clusters[, ...])

Clusters the features in the CountFilter object using the K-medoids clustering algorithm, and then splits those features into multiple non-overlapping CountFilter objects, based on the clustering result.

CountFilter.symmetric_difference(other[, ...])

Returns a set/string of the WBGene indices that exist either in the first Filter object/set OR the second, but NOT in both (set symmetric difference).

CountFilter.tail([n])

Return the last n rows of the Filter object.

CountFilter.text_filters(column, operator, value)

Apply a text filter (equals, contains, starts with, ends with) on a particular column in the Filter object.

CountFilter.transform(function[, columns, ...])

Transform the values in the Filter object with the specified function.

CountFilter.translate_gene_ids(translate_to)

Translates gene names/IDs from one type to another.

CountFilter.union(*others[, return_type])

Returns a set/string of the union of features between multiple Filter objects/sets (the features that exist in at least one of the Filter objects/sets).
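The set operations on this page (difference, intersection, symmetric_difference, union) follow ordinary set semantics. The sketch below shows those semantics on built-in Python sets of hypothetical feature names; it does not call RNAlysis itself.

```python
first = {"gene1", "gene2", "gene3"}
second = {"gene2", "gene3", "gene4"}
third = {"gene3", "gene5"}

# Features in the first set but NOT in any of the others.
difference = first - second - third

# Features that exist in ALL of the sets.
intersection = first & second & third

# Features in either the first or the second set, but not in both.
symmetric_difference = first ^ second

# Features that exist in at least one of the sets.
union = first | second | third

print(difference, intersection, symmetric_difference, union)
```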

CountFilter.violin_plot([samples, ylabel, ...])

Generates a violin plot of the specified samples in the CountFilter object in log10 scale.