User guide

pairwisedist module

pairwisedist calculates the pairwise-distance matrix for an array of n samples by p features. The distance metrics currently supported by pairwisedist are the Jackknife-correlation dissimilarity, the Son and Baek dissimilarities YS1 and YR1, the Pearson correlation dissimilarity, the Spearman correlation dissimilarity and the sharpened cosine distance.

All functions in pairwisedist receive an n-by-p matrix as input. If the rows of the matrix are the variables to compare (parameter ‘rowvar’ set to True, the default), pairwisedist returns an n-by-n pairwise distance matrix. Otherwise, the pairwise distances are calculated between the columns, and pairwisedist returns a p-by-p pairwise distance matrix.
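The effect of ‘rowvar’ on the output shape can be illustrated with a small standard-library sketch. This is not the pairwisedist implementation: the helper name `pairwise_sketch` is hypothetical, and the dissimilarity used here (1 minus the Pearson coefficient) is an assumption made only so the example has something to compute.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_sketch(matrix, rowvar=True):
    """Hypothetical helper: pairwise (1 - Pearson r) between rows or columns.

    ASSUMPTION: "1 - r" is used as a stand-in dissimilarity for illustration.
    """
    # With rowvar=False the columns are compared, so transpose first.
    vectors = matrix if rowvar else [list(col) for col in zip(*matrix)]
    return [[1 - pearson(u, v) for v in vectors] for u in vectors]

# n = 3 samples (rows), p = 4 features (columns)
data = [
    [1.0, 2.0, 3.0, 5.0],
    [2.0, 1.0, 4.0, 3.0],
    [5.0, 3.0, 2.0, 1.0],
]

by_rows = pairwise_sketch(data)                # n-by-n, here 3-by-3
by_cols = pairwise_sketch(data, rowvar=False)  # p-by-p, here 4-by-4
print(len(by_rows), len(by_rows[0]))
print(len(by_cols), len(by_cols[0]))
```

The only difference between the two calls is which axis supplies the vectors being compared; the distance computation itself is unchanged.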

The Jackknife-correlation dissimilarity metric

Jackknife-correlation distance (referred to from here on as JK) was developed as an improved correlation metric for gene expression data. In particular, JK is meant to reduce the number of false positives produced by Pearson correlation. This is achieved by using Jackknife resampling: we calculate the Pearson correlation coefficient p times, leaving out one of the p features each time. We then take the smallest of these p Pearson coefficients as the Jackknife coefficient for the pair.

The Jackknife correlation coefficient for X,Y is formally defined as:

min(Pearson(X[idx != i],Y[idx != i]) for i in range(p))

You can read more about this metric in the original publication.
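As a minimal sketch of the definition above (plain Python, not the pairwisedist implementation; the function names are illustrative), the Jackknife coefficient of two samples can be computed by dropping one feature at a time and keeping the smallest Pearson coefficient. The example data are made up to show how a single extreme feature inflates the ordinary Pearson coefficient.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def jackknife_correlation(x, y):
    """Minimum Pearson coefficient over all leave-one-feature-out subsets."""
    p = len(x)
    return min(
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])  # drop feature i
        for i in range(p)
    )

# Two samples that agree only loosely, apart from one shared extreme feature.
x = [1.0, 2.0, 3.0, 4.0, 20.0]
y = [2.0, 1.0, 4.0, 3.0, 25.0]

full = pearson(x, y)              # dominated by the last feature
jk = jackknife_correlation(x, y)  # smallest value is reached without it
print(round(full, 3), round(jk, 3))
```

In this example the ordinary coefficient is above 0.99, while the Jackknife coefficient is 0.6, the value obtained once the extreme feature is left out.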

The YS1 and YR1 dissimilarity metrics

The Son and Baek dissimilarity metrics, also known as YS1 and YR1, were developed as improved correlation metrics for time-series gene expression data generated by microarray or RNA sequencing experiments. The goal of these distance metrics is to capture the shape of each gene's expression pattern more accurately, and to rank genes with similar expression patterns as more similar to one another. You can read more about these metrics in the original publication.

In principle, the YS1 and YR1 scores integrate three components: the correlation between the samples (Spearman correlation (S) for the YS1 metric, Pearson correlation (R) for the YR1 metric), the position of the minimal and maximal values of each sample (M), and the agreement of their slopes (A). The weight of each component can be set by the user through the parameters ‘omega1’, ‘omega2’ and ‘omega3’ respectively. The default weights recommended by the authors are 0.5, 0.25 and 0.25.

The YS1 correlation coefficient is formally defined as:

YS1(X,Y) = omega1 * S(X,Y) + omega2 * M(X,Y) + omega3 * A(X,Y)

The YR1 correlation coefficient is formally defined as:

YR1(X,Y) = omega1 * R(X,Y) + omega2 * M(X,Y) + omega3 * A(X,Y)

Due to the characteristics of the YS1 and YR1 metrics, the order of the features (columns) in the input matrix affects the values of the pairwise distance.
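The order sensitivity can be illustrated with a short sketch. The guide does not spell out how A is computed, so the `slope_agreement` function below is an assumed form (the fraction of consecutive steps whose direction matches in both samples), and the value passed for M is a placeholder. Only the weighted sum itself follows the formulas above.

```python
def weighted_score(corr, m, a, omega1=0.5, omega2=0.25, omega3=0.25):
    """Weighted sum of the three components, as in the YS1/YR1 formulas."""
    return omega1 * corr + omega2 * m + omega3 * a

def slope_agreement(x, y):
    """ASSUMED form of A: share of consecutive steps with the same direction."""
    def sign(v):
        return (v > 0) - (v < 0)
    steps = len(x) - 1
    matches = sum(
        sign(x[k + 1] - x[k]) == sign(y[k + 1] - y[k]) for k in range(steps)
    )
    return matches / steps

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

# Apply the same reordering of features (columns) to both samples.
order = [1, 0, 2, 3]
x_perm = [x[i] for i in order]
y_perm = [y[i] for i in order]

a_original = slope_agreement(x, y)            # 2 of 3 steps agree
a_permuted = slope_agreement(x_perm, y_perm)  # all 3 steps agree

# corr = 0.8 is the Pearson coefficient of x and y, which is unaffected by
# reordering; m = 1.0 is a placeholder value, not a computed M.
score_original = weighted_score(0.8, 1.0, a_original)
score_permuted = weighted_score(0.8, 1.0, a_permuted)
print(round(score_original, 4), round(score_permuted, 4))
```

A plain correlation coefficient is unchanged when both samples are reordered in the same way, but a component that compares consecutive features is not, which is why the column order matters for these metrics.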

The sharpened cosine distance metric

Sharpened cosine distance is a distance metric that was suggested in a since-deleted tweet by Brandon Rohrer. It was suggested as an alternative to convolution for the purpose of detecting specific features, offering highest sensitivity and specificity for the shape of the feature instead of its absolute magnitude. Rohrer has since implemented a version of this distance metric for neural networks (Jax, Keras, and PyTorch), which you can find here.

The pairwisedist implementation of this distance metric serves a different purpose: comparing the similarity (or difference) between the gene expression distributions of two genes over different time points or conditions. This purpose is supported by the key property of sharpened cosine distance, which is specific to the shape of the distribution rather than its absolute value.