# Data processing pipeline

GEO2Enrichr analyzes your data by performing the following operations:

1. Downloads the associated SOFT file from GEO or uploads a custom SOFT file via this website.
2. Discards data with missing values or one-to-many probe-to-gene mappings.
3. Checks if the data's maximum range is greater than 100 to determine if the original signal, which can range over several orders of magnitude, was already transformed to log-scale. As an equation where $M$ is a matrix of expression values: $$max(M) - min(M) > 100$$ If this equation is true, the assumption is that the data was not logged transformed and thus GEO2Enrichr $log_2$ transforms the data.
4. Checks if the data was normalized by checking if any row's median, or standard deviation, deviates from the average median or standard deviation by a magnitude greater than 4: $$\forall d_i \in D, | \frac{(d_i - \bar{D})}{\sigma(D)} | > 4$$ $$\forall s_i \in S, | \frac{(s_i - \bar{S})}{\sigma(S)} | > 4$$ where $D$ is the vector of medians of every row of data, and $S$ is the vector of standard deviations of every row of data. If either equation is true, GEO2Enrichr decides that the data was likely not normalized and thus the program quantile normalizes the data.

Broadly, quantile normalization is a four step algorithm to make two distributions statistically similar. First, the algorithm ranks the values in each column. Next it averages the rows. Finally, it reorders the columns, shifting each value back to its original location. The end result is that each distributions' statistical properties are identical. Below is a visualization:

         Original    Ranked    Averaged    Reordered
A   B       A   B     A   B       A   B
gene1    2 4 8 6     2 4 3 3   3 3 3 3     3 3 6 6
gene2    6 4 3 3     6 4 8 6   6 6 6 6     6 6 3 3
5. Averages multiple probe IDs to single gene symbols using a curated platform-to-probe-to-gene table.
6. Identifies differentially expressed genes with the selected method. The default method is the characteristic direction method, developed by the Ma'ayan Laboratory and shown to outperform all other popular methods.
7. Writes the resultant gene lists (up, down, combined) into plain text files for download.
8. Pipes the gene lists to Enrichr for further analysis.

# Characteristic Direction

#### Introduction

GEO2Enrichr uses, as a default, the Characteristic Direction for differential expression analysis. What follows is an introduction to the method.

#### Overview

The Characteristic Direction is a representation of the differential expression between two conditions as a unit vector in gene expression space. The square of each component is interpreted as a measure of the contribution of the corresponding gene to the overall differential expression in relation to all the other genes. In the current version, L2 regularization is used to limit any effects of over-fitting by penalizing the coefficients of this vector. Because L2 regularization does not shrink any set of coefficients to zero, the characteristic direction vector typically has all non-zero coefficients. As such, the characteristic direction can be used as a gene prioritization scheme, as opposed to a gene subset selection scheme. We have shown that it is possible to generate a null-distribution of characteristic direction vectors based on the hypothesis of no differential expression, and that this can be used to derive significance p values for each component which, after multiple-hypothesis testing, defines a subset of differentially expressed genes. However this process is computationally demanding, so the current version for GEO2Enrichr adopts the gene-prioritization approach. While the full characteristic direction vector is useful for geometric approaches and multiple condition comparisons, most naturally by the cosine distance for example, GEO2Enrichr also returns a short-list of the most highly prioritized genes for further analysis. The length of this list is determined by practical considerations, such as the capacity for experimental validation for example, rather than statistical significance.

#### Benchmarking

There are many approaches to the analysis of differential gene expression; in GEO2Enrichr we employ the characteristic direction as this method clearly out-performed a selection of the most commonly used approaches in a number of our benchmarking tests on GEO data. In the first such test we analyzed a large number (73) of controlled experiments in which an individual transcription factor (TF) was perturbed by knockout, knockdown or overexpression for example. The resulting differential gene expression was analyzed with five different methods; the characteristic direction, Welch’s t test, Significance Analysis of Microarays (SAM), fold-change, and Limma. The scores for each of these tests was used to prioritize the genes. We then used ChIP-Seq data for each of the perturbed TFs to identify candidate lists of differentially expressed genes based on the proximity of the binding sites to the promoter region. Though these genes are not gold standard, they are taken to be more likely to be differentially expressed than the complementary genes. This benchmarking method has the very strong advantage of using real data, as opposed to synthetic data, thereby testing the performance of the analysis method on data with all the rich and poorly-understood structure for real biological data.

Fig. 5 Results of the benchmarking test of Transcription Factor perturbations in GEO. (a) The distribution of the perturbed TFs (b) The distribution of genes interacting with the TF in the Human PPI (c-d) The distribution of ChIP-Seq candidate genes for two libraries, ChEA and Encode. The final two figures show equivalent plots for drug perturbations with (e) showing the distribution of known drug targets and (f) showing the distribution of the Hunan PPI integrators of those targets.

The degree to which each analysis method prioritized the ChIP-Seq candidate genes is taken as a measure of the performance of the method. The cumulative distribution of the scaled rank, r, of the candidate genes D(r), in the cases that the candidate genes are uniformly randomly distributed (corresponding to an uninformed guess at the priorities of each gene) takes on the linear form such that D(r) − r ≈ 0. However, if an analysis method prioritizes the candidate genes then then they will tend to have a relatively small scaled rank such that D(r) − r ≫ 0, where the natural scale for the difference from zero, φ, can be calculated based on a random walk with a step size equal to the number of candidate genes. The degree to which D(r) − r deviates from zero is taken as a quantitative measure of the performance of the analysis method.

In Fig 5 we show the results of the GEO TF benchmarking test for each for the five analysis methods. Fig 5(a) shows the distribution of the prioritization of the genes corresponding to the perturbed TF; we see that all the methods prioritize these genes. Fig 5(b) shows the distribution of the genes corresponding to the proteins interacting with the perturbed TF in the Human Protein Protein Interaction (PPI) network; here we see that the characteristic direction prioritizes the PPI interactors significantly more than the other methods. Fig 5(c-d) show the distribution for the ChIP-Seq candidate genes, and here we see the clearest signal, and again the characteristic direction prioritizes the candidate genes significantly higher based on both the ChEA and the Encode ChIP-Seq libraries.

Finally, we performed a similar benchmarking approach using 130 drug perturbation experiments from GEO, with candidate gene sets composed of the known drug targets. We see the same order of performance in this setting.

These results lead us to conclude that the characteristic direction approach to differential expression is the highest performing method of those we tested across a large number of GEO data sets.

#### A short description of the characteristic direction

The Characteristic Direction uses the Linear Discriminant Analysis (LDA) classifier, in which the probability that a microarray sample $x$ derives from each of the classes $G$ is modeled with the Bayes rule: $$Pr(G = k|X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}$$ Where $f_k(x)$ is the class-conditional density, and $\pi_k$ is the prior probability of class $k$. By making the assumption that the class-conditional density is a multivariate Gaussian: $$f_k(x) = \frac{1}{(2\pi)^{\frac{p}{2}}|\sum_{k}|\frac{1}{2}} e^{-\frac{1}{2}(x-u_k)^{T}\sum_{k}^{-1}(x-u_k)}$$ Making the additional assumption that the covariance matrix for each class is the same $\sum_{k}=\sum \forall k$, the class boundary is a hyperplane with normal vector, $$b = \sum_{}^{-1}(u_k-u_l)$$

This vector defines the orientation of the classification boundary hyperplane and the characteristic direction approach uses this vector to characterize the differential expression between the two classes.

Additionally, this can be regularized by shrinking the covariance matrix to the scalar variance $\sigma^2$ $$\hat{\sum}(\gamma) = \gamma\hat{\sum}+(1+\gamma)\sigma^2I_{p'}$$ with $\gamma \in [0,1]$ and where $\gamma$ is the regularization parameter.

$$b = \sum_{}^{-1}(u_k-u_l)$$