ARCHS4 Help


Getting Started with ARCHS4

Data visualization

Gene pages

Data handling

Search tools

Terms of use


If you would like to receive updates on the ARCHS4 data and stay informed about new data releases, consider signing up for the newsletter.


First Steps

About ARCHS4

ARCHS4 provides access to gene counts from HiSeq 2000 and HiSeq 2500 platforms for human and mouse experiments. The website allows download of the data in H5 format for programmatic access and offers a 3-dimensional view of the sample space. Search features allow browsing of the data by metadata annotation, signature similarity, and functional enrichment. Selected sample sets can be downloaded into a tab-separated file through auto-generated R scripts for further analysis. Reads are aligned with Kallisto using a custom cloud computing platform. Human samples are aligned against the GRCh38 cDNA reference and mouse samples against the GRCm38 cDNA reference. All processable files from the GEO/SRA database since June 2018 are processed and available for download.


Download files

Raw read counts can be downloaded by selecting the Download option at the top of the ARCHS4 website. The raw counts are separated into human and mouse samples and are provided in H5 format, which compresses the gene count data efficiently. Tab-separated files can be extracted with the provided code or with the auto-generated scripts from the web interface.

Selected sample and gene sets are displayed in the Search Results section. Under Downloads, scripts can be downloaded for the selected sample sets. For gene sets, a text file containing the gene symbols is provided.

Gene sets can be exported to Enrichr for further analysis.

Chrome Extension

The ARCHS4 Chrome extension is a browser extension that adds content to the landing pages of RNA-seq datasets available on the Gene Expression Omnibus (GEO) when samples have been processed by ARCHS4. The extension adds links to download files that contain the aligned reads mapped to genes with counts, as well as a heatmap visualization summary of the expression data from the processed samples using Clustergrammer. The ARCHS4 Chrome extension can be installed from the Chrome web store. An example of an enhanced GEO landing page for series GSE77243 is shown.


Gene landing pages

Gene landing pages are accessible through the search fields on the top right of the ARCHS4 interface. Genes can be searched by Entrez gene symbol. The gene search returns the expression distribution for major tissues and cell lines, along with predictions of biological function and regulatory properties of the target gene. If a gene is already known to be part of a predicted gene set, the terms are marked in green. If a gene has a sufficient number of prior knowledge annotations, a ROC curve shows how well the prior knowledge about the gene can be recovered.


Data visualization

WebGL data viewer

The data view port displays samples or genes relative to the other samples and genes in the dataset. Samples/Genes with similar expression are clustered. The layout is computed with t-SNE.

The colors of samples and gene sets can be modified in the result section.


Viewer options

The data view can display 4 precalculated data visualizations: human samples, human genes, mouse samples and mouse genes. To toggle between species, select either the human or mouse button on the left. To change from sample view to gene view select one of the yellow buttons. A new view will automatically be loaded on selection. Prior selections are saved and will be loaded again if the former mode is selected.

Viewer user interaction

The visualization is dynamic and allows rotation and zooming. The two icons on the top right are the manual selection tool and a toggle button that moves the view port into a smaller window to the left of the webpage.


Data handling

About the H5 file format

Hierarchical Data Format (HDF) is an open-source file format for large data storage. It allows programmatic access to matrix entries by row and column indices while supporting efficient data compression. The H5 files provided by ARCHS4 contain raw read counts as well as detailed metadata extracted from GEO.



    group  name                            otype        dclass   dim
 0  /      data                            H5I_GROUP
 1  /data  expression                      H5I_DATASET  INTEGER  35238 x 65429
 2  /      info                            H5I_GROUP
 3  /info  author                          H5I_DATASET  STRING   1
 4  /info  contact                         H5I_DATASET  STRING   1
 5  /info  creation-date                   H5I_DATASET  STRING   1
 6  /info  lab                             H5I_DATASET  STRING   1
 7  /info  version                         H5I_DATASET  STRING   1
 8  /      meta                            H5I_GROUP
 9  /meta  Sample_channel_count            H5I_DATASET  STRING   65429
10  /meta  Sample_characteristics_ch1      H5I_DATASET  STRING   65429
11  /meta  Sample_contact_address          H5I_DATASET  STRING   65429
12  /meta  Sample_contact_city             H5I_DATASET  STRING   65429
13  /meta  Sample_contact_country          H5I_DATASET  STRING   65429
14  /meta  Sample_contact_department       H5I_DATASET  STRING   65429
15  /meta  Sample_contact_email            H5I_DATASET  STRING   65429
16  /meta  Sample_contact_institute        H5I_DATASET  STRING   65429
17  /meta  Sample_contact_laboratory       H5I_DATASET  STRING   65429
18  /meta  Sample_contact_name             H5I_DATASET  STRING   65429
19  /meta  Sample_contact_phone            H5I_DATASET  STRING   65429
20  /meta  Sample_contact_zip-postal_code  H5I_DATASET  STRING   65429
21  /meta  Sample_data_processing          H5I_DATASET  STRING   65429
22  /meta  Sample_data_row_count           H5I_DATASET  STRING   65429
23  /meta  Sample_description              H5I_DATASET  STRING   65429
24  /meta  Sample_extract_protocol_ch1     H5I_DATASET  STRING   65429
25  /meta  Sample_geo_accession            H5I_DATASET  STRING   65429
26  /meta  Sample_instrument_model         H5I_DATASET  STRING   65429
27  /meta  Sample_last_update_date         H5I_DATASET  STRING   65429
28  /meta  Sample_library_selection        H5I_DATASET  STRING   65429
29  /meta  Sample_library_source           H5I_DATASET  STRING   65429
30  /meta  Sample_library_strategy         H5I_DATASET  STRING   65429
31  /meta  Sample_molecule_ch1             H5I_DATASET  STRING   65429
32  /meta  Sample_organism_ch1             H5I_DATASET  STRING   65429
33  /meta  Sample_platform_id              H5I_DATASET  STRING   65429
34  /meta  Sample_relation                 H5I_DATASET  STRING   65429
35  /meta  Sample_series_id                H5I_DATASET  STRING   65429
36  /meta  Sample_source_name_ch1          H5I_DATASET  STRING   65429
37  /meta  Sample_status                   H5I_DATASET  STRING   65429
38  /meta  Sample_submission_date          H5I_DATASET  STRING   65429
39  /meta  Sample_supplementary_file_1     H5I_DATASET  STRING   65429
40  /meta  Sample_supplementary_file_2     H5I_DATASET  STRING   65429
41  /meta  Sample_taxid_ch1                H5I_DATASET  STRING   65429
42  /meta  Sample_title                    H5I_DATASET  STRING   65429
43  /meta  Sample_type                     H5I_DATASET  STRING   65429
44  /meta  genes                           H5I_DATASET  STRING   35238


Using the auto-generated download script

Scripts to extract tab-separated gene expression files can be created through the graphical user interface of ARCHS4. The script has to be executed as an R script. R is free software, and the RStudio development environment for R can be downloaded at www.rstudio.com. Upon execution, the script should install all required dependencies and download the full gene expression file before extracting the selected samples.


# R script to download selected samples
# Copy code and run on a local machine to initiate download
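# Check for dependencies and install any that are missing
packages <- c("rhdf5")
missing_packages <- setdiff(packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
    print("Install required packages")
    # biocLite() has been retired; BiocManager is the current Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    BiocManager::install(missing_packages)
}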
library("rhdf5")

destination_file = "human_matrix.h5"
extracted_expression_file = "example_expression_matrix.tsv"

# Check if the gene expression file was already downloaded; if it is not in the current directory, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix.h5"
    download.file(url, destination_file, quiet = FALSE)
} else{
    print("Local file already exists.")
}

# Selected samples to be extracted
samp = c("GSM1224927","GSM1066120","GSM1224923","GSM1224929","GSM1224924","GSM1066118","GSM1066119","GSM1224925","GSM1224930","GSM1872071","GSM2282084","GSM1872064","GSM1872067","GSM1704845")

# Retrieve information from compressed data
samples = h5read(destination_file, "meta/Sample_geo_accession")
# Identify columns to be extracted
sample_locations = which(samples %in% samp)

tissue = h5read(destination_file, "meta/Sample_source_name_ch1")
genes = h5read(destination_file, "meta/genes")
series = h5read(destination_file, "meta/Sample_series_id")

# extract gene expression from compressed data
expression = h5read(destination_file, "data/expression", index=list(1:length(genes), sample_locations))
H5close()
rownames(expression) = genes
colnames(expression) = samples[sample_locations]

# Print file
write.table(expression, file=extracted_expression_file, sep="\t", quote=FALSE)
print(paste0("Expression file was created at ", getwd(), "/", extracted_expression_file))
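
The tab-separated file written by the script can be read back into R for further analysis. The following is a minimal sketch that uses a small made-up matrix and a temporary file in place of the real output; the counts and the names `toy_expression`, `toy_file`, and `reloaded` are illustrative assumptions, not part of the generated script.

```r
# Toy stand-in for the extracted expression matrix (values are made up)
toy_expression <- matrix(c(10L, 0L, 5L, 3L, 7L, 12L), nrow = 3,
                         dimnames = list(c("A1BG", "TP53", "MYC"),
                                         c("GSM1224927", "GSM1066120")))

# Write it the same way the download script does
toy_file <- tempfile(fileext = ".tsv")
write.table(toy_expression, file = toy_file, sep = "\t", quote = FALSE)

# The header row has one field fewer than the data rows, so read.table
# takes the first column as row names automatically
reloaded <- as.matrix(read.table(toy_file, sep = "\t", header = TRUE,
                                 check.names = FALSE))
print(dim(reloaded))
```

Setting `check.names = FALSE` keeps the sample accessions unchanged as column names.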


Parsing H5 file

The script below extends the auto-generated download script. After installing the required dependencies, downloading the full gene expression file, and extracting the selected samples, it log2-transforms the counts and quantile-normalizes them with the preprocessCore package. As before, the script has to be executed as an R script.


# R script to download selected samples
# Copy code and run on a local machine to initiate download
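# Check for dependencies and install any that are missing
packages <- c("rhdf5", "preprocessCore")
missing_packages <- setdiff(packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
    print("Install required packages")
    # biocLite() has been retired; BiocManager is the current Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    BiocManager::install(missing_packages)
}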
library("rhdf5")
library("preprocessCore")

destination_file = "human_matrix.h5"
extracted_expression_file = "example_expression_matrix.tsv"

# Check if the gene expression file was already downloaded; if it is not in the current directory, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix.h5"
    download.file(url, destination_file, quiet = FALSE)
} else{
    print("Local file already exists.")
}

# Selected samples to be extracted
samp = c("GSM1224927","GSM1066120","GSM1224923","GSM1224929","GSM1224924","GSM1066118","GSM1066119","GSM1224925","GSM1224930","GSM1872071","GSM2282084","GSM1872064","GSM1872067","GSM1704845")

# Retrieve information from compressed data
samples = h5read(destination_file, "meta/Sample_geo_accession")

# Identify columns to be extracted
sample_locations = which(samples %in% samp)

tissue = h5read(destination_file, "meta/Sample_source_name_ch1")
genes = h5read(destination_file, "meta/genes")
series = h5read(destination_file, "meta/Sample_series_id")

# extract gene expression from compressed data
expression = h5read(destination_file, "data/expression", index=list(1:length(genes), sample_locations))
H5close()

# normalize samples and correct for differences in gene count distribution
expression = log2(expression+1)
expression = normalize.quantiles(expression)

rownames(expression) = genes
colnames(expression) = samples[sample_locations]
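
To illustrate what the two normalization steps do, here is a base R sketch on a made-up count matrix. The helper `quantile_normalize_sketch` is a hypothetical, simplified stand-in for `normalize.quantiles`: it breaks ties by order of appearance and does not handle missing values, so its output may differ from preprocessCore on real data.

```r
# Made-up counts: 4 genes (rows) x 3 samples (columns)
toy_counts <- matrix(c(5, 2, 3, 4,
                       4, 1, 6, 2,
                       3, 4, 6, 8), nrow = 4,
                     dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

# Step 1: log2 transform with a pseudocount of 1, as in the script above
log_counts <- log2(toy_counts + 1)

# Step 2: simplified quantile normalization (illustrative helper, not preprocessCore)
quantile_normalize_sketch <- function(x) {
  # Rank the values within each sample
  ranks <- apply(x, 2, rank, ties.method = "first")
  # The mean of the sorted values across samples is the reference distribution
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value with the reference value of its rank
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

normalized <- quantile_normalize_sketch(log_counts)

# After normalization every sample has the same sorted values
print(apply(normalized, 2, sort))
```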


Batch effect correction

Samples extracted for a specified tissue can originate from multiple series with slightly different experimental conditions. If desired, batch effects can be removed from the gene expression data with the ComBat function from the sva package.



# R script to download selected samples
# Copy code and run on a local machine to initiate download
# Check for dependencies and install any that are missing
packages <- c("rhdf5", "preprocessCore", "sva")
missing_packages <- setdiff(packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
    print("Install required packages")
    # biocLite() has been retired; BiocManager is the current Bioconductor installer
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        install.packages("BiocManager")
    }
    BiocManager::install(missing_packages)
}
library("rhdf5")
library("preprocessCore")
library("sva")

destination_file = "human_matrix.h5"
extracted_expression_file = "example_expression_matrix.tsv"

# Check if the gene expression file was already downloaded; if it is not in the current directory, download it from the repository
if(!file.exists(destination_file)){
    print("Downloading compressed gene expression matrix.")
    url = "https://s3.amazonaws.com/mssm-seq-matrix/human_matrix.h5"
    download.file(url, destination_file, quiet = FALSE)
} else{
    print("Local file already exists.")
}

# Selected samples to be extracted
samp = c("GSM1224927","GSM1066120","GSM1224923","GSM1224929","GSM1224924","GSM1066118","GSM1066119","GSM1224925","GSM1224930","GSM1872071","GSM2282084","GSM1872064","GSM1872067","GSM1704845")

# Retrieve information from compressed data
samples = h5read(destination_file, "meta/Sample_geo_accession")
# Identify columns to be extracted
sample_locations = which(samples %in% samp)

tissue = h5read(destination_file, "meta/Sample_source_name_ch1")
genes = h5read(destination_file, "meta/genes")
series = h5read(destination_file, "meta/Sample_series_id")
series = series[sample_locations]

# extract gene expression from compressed data
expression = h5read(destination_file, "data/expression", index=list(1:length(genes), sample_locations))
H5close()

# normalize samples and correct for differences in gene count distribution
expression = log2(expression+1)
expression = normalize.quantiles(expression)

rownames(expression) = genes
colnames(expression) = samples[sample_locations]

# correct batch effects in gene expression
batchid = match(series, unique(series))
correctedExpression <- ComBat(dat=expression, batch=batchid, par.prior=TRUE, prior.plots=FALSE)
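
Because the script treats each series as one batch, it can help to check how the selected samples are distributed over series before running the correction. The sketch below uses a made-up `toy_series` vector with placeholder series names; batches that contain a single sample give a batch correction method little to estimate from and may need to be handled separately.

```r
# Placeholder series labels for six samples (not real GEO accessions)
toy_series <- c("GSE_A", "GSE_B", "GSE_A", "GSE_A", "GSE_B", "GSE_C")

# Same mapping as in the script above: one integer batch id per series
toy_batchid <- match(toy_series, unique(toy_series))

# Count samples per batch and flag batches with fewer than two samples
batch_sizes <- table(toy_batchid)
single_sample_batches <- names(batch_sizes)[batch_sizes < 2]

print(toy_batchid)
print(batch_sizes)
print(single_sample_batches)
```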


Search tools

Meta data search

Metadata search parses the tissue description field from GEO to find matches with the entered search term. The search ignores spaces and is case insensitive. Results are highlighted in the data viewer and an entry is added to the result list.
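
The matching behavior described above (spaces ignored, case insensitive) can be sketched in base R as follows. This is only an illustration of the idea and not the ARCHS4 implementation; `normalize_text`, `search_metadata_sketch`, and the example descriptions are hypothetical.

```r
# Made-up tissue descriptions standing in for the GEO field
toy_descriptions <- c("Human Hepatocyte", "HEK 293 cells",
                      "primary hepatocytes", "Whole Blood")

# Lower-case the text and drop all spaces
normalize_text <- function(x) gsub(" ", "", tolower(x), fixed = TRUE)

# Return the indices of descriptions that contain the search term
search_metadata_sketch <- function(term, descriptions) {
  which(grepl(normalize_text(term), normalize_text(descriptions), fixed = TRUE))
}

print(search_metadata_sketch("HEPATO cyte", toy_descriptions))
```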

We preselected a set of tissues grouped by cellular system. This allows simple browsing of the data for tissues of interest. Some tissue selections can return no results for either mouse or human samples.


Signature search

Signature search takes a list of highly expressed genes and a list of lowly expressed genes and identifies samples that match the given input. The gene expression is z-score normalized across samples to identify the relative gene expression.
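
One way to picture this is the base R sketch below, which z-scores each gene across samples and ranks samples by the difference between the mean z-score of the up genes and the mean z-score of the down genes. The scoring rule, the gene and sample names, and the values are assumptions made for illustration; the text above does not specify how ARCHS4 scores the match.

```r
# Made-up expression matrix: 4 genes (rows) x 4 samples (columns)
toy_expr <- rbind(geneA = c(10, 1, 2, 1),
                  geneB = c(8, 2, 1, 2),
                  geneC = c(1, 9, 8, 5),
                  geneD = c(5, 5, 4, 6))
colnames(toy_expr) <- c("s1", "s2", "s3", "s4")

# z-score each gene across samples (genes with zero variance would give NaN)
z <- t(scale(t(toy_expr)))

# Hypothetical input signature
up_genes <- c("geneA", "geneB")
down_genes <- c("geneC")

# Assumed score: high for samples where up genes are high and down genes are low
score <- colMeans(z[up_genes, , drop = FALSE]) - colMeans(z[down_genes, , drop = FALSE])
ranked <- names(sort(score, decreasing = TRUE))

print(ranked)
```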



Enrichment search

Enrichment search highlights samples that are enriched in gene sets from nine gene set libraries: ChEA 2016, KEA 2016, ENCODE TF ChIP-seq 2016, KEGG 2016, MGI Mammalian Phenotype, Human Phenotype, GO Biological Process, GO Cellular Component, and GO Molecular Function.

Gene search

Searching by gene symbol opens the corresponding gene landing page. A set of genes can be highlighted by selecting a gene set library and a corresponding gene set.


Gene pages

Functional prediction

The top 100 predictions of gene set membership across multiple domains are shown in the tables on the gene page. Gene set membership is predicted by association: if a gene is highly correlated with known members of a gene set, it receives a high z-score in the membership prediction. Functions and gene set memberships that are already known for the gene are highlighted in green. If a gene is extensively annotated, a ROC curve shows how well the known annotations could be recovered by the algorithm. If no image is shown, the gene does not have enough prior gene set memberships to build a reliable ROC curve.
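
The idea can be sketched in base R as follows: score each gene set by the mean correlation between the query gene and the set's members, convert the scores to z-scores across gene sets, and summarize how well known memberships are recovered with the area under the ROC curve. The data, gene set names, and exact scoring are assumptions for illustration and may differ from what ARCHS4 computes.

```r
# Made-up expression profiles: 6 genes x 6 samples
p1 <- c(1, 2, 3, 4, 5, 6)
p2 <- c(6, 1, 5, 2, 4, 3)
noise <- c(0.5, -0.5, 0.5, -0.5, 0.5, -0.5)
toy_expr <- rbind(g1 = p1,
                  g2 = p1 + noise,
                  g3 = 2 * p1 + c(0, 1, 0, -1, 0, 1),
                  g4 = p2,
                  g5 = p2 + noise,
                  g6 = 2 * p2 + c(1, 0, -1, 0, 1, 0))

# Hypothetical gene sets and the gene to annotate
gene_sets <- list(setX = c("g2", "g3"), setY = c("g4", "g5"), setZ = c("g5", "g6"))
query <- "g1"

# Mean Pearson correlation of the query gene with each set's other members
gene_cor <- cor(t(toy_expr))
mean_cor <- sapply(gene_sets, function(members) {
  mean(gene_cor[query, setdiff(members, query)])
})

# z-score across gene sets: a high value suggests membership
z_scores <- (mean_cor - mean(mean_cor)) / sd(mean_cor)

# Assume setX is the only known membership and compute the rank-based AUC
known <- c(setX = TRUE, setY = FALSE, setZ = FALSE)
r <- rank(z_scores)
auc <- (sum(r[known]) - sum(known) * (sum(known) + 1) / 2) /
  (sum(known) * sum(!known))

print(round(z_scores, 2))
print(auc)
```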

Gene correlation

The gene correlation table contains the 100 genes most correlated with the gene of interest. The Pearson correlation is calculated over all samples and tissues. The gene list can be uploaded to Enrichr for further investigation.
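
A minimal base R sketch of such a ranking is shown below on a made-up matrix. The function `top_correlated_sketch` and the toy data are hypothetical; only the use of Pearson correlation across samples is taken from the description above.

```r
# Made-up expression matrix: 4 genes x 6 samples
toy_expr <- rbind(g1 = c(1, 2, 3, 4, 5, 6),
                  g2 = c(1.5, 1.5, 3.5, 3.5, 5.5, 5.5),
                  g3 = c(6, 5, 4, 3, 2, 1),
                  g4 = c(6, 1, 5, 2, 4, 3))

# Return the n genes with the highest Pearson correlation to `gene`
top_correlated_sketch <- function(expr, gene, n = 100) {
  # Correlate every gene (columns of t(expr)) with the query profile
  cors <- cor(t(expr), expr[gene, ])[, 1]
  cors <- cors[names(cors) != gene]  # drop the query gene itself
  head(sort(cors, decreasing = TRUE), n)
}

top <- top_correlated_sketch(toy_expr, "g1", n = 2)
print(names(top))
```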

Tissue expression atlas

The tissue and cell line expression atlases are calculated from samples in ARCHS4. The tissues are grouped at multiple levels and cover a wide range of cellular contexts. Because samples of any given tissue can come from many distinct laboratories, the conditions under which samples were created are not identical, and various subtypes of a tissue can be mixed. In contrast to GTEx, the atlas can therefore report the variability observed in non-homogeneous sample groups.
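
As a rough illustration of how per-tissue variability can be summarized, the base R sketch below computes quartiles of one gene's expression for each tissue label. The values and labels are made up, and the choice of quartiles is an assumption rather than a description of how the atlas itself is built.

```r
# Made-up expression values of one gene and the tissue label of each sample
toy_values <- c(5.1, 6.3, 5.8, 2.0, 2.4, 9.5, 8.7, 9.1)
toy_tissue <- c("liver", "liver", "liver", "blood", "blood",
                "brain", "brain", "brain")

# Quartiles per tissue: one row per tissue, one column per quantile
summary_by_tissue <- t(sapply(split(toy_values, toy_tissue),
                              quantile, probs = c(0.25, 0.5, 0.75)))

print(summary_by_tissue)
```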


Terms of use

Use

Source code is available under the Apache License 2.0. The provided gene expression files are available under the Creative Commons Attribution 4.0 International License.
All data is free to use for non-commercial purposes. For commercial use please contact MSIP.

Citation

Please acknowledge ARCHS4 in your publications by citing the following reference:
Lachmann A, Torre D, Keenan AB, Jagodnik KM, Lee HJ, Wang L, Silverstein MC, Ma’ayan A. Massive mining of publicly available RNA-seq data from human and mouse. Nature Communications 9. Article number: 1366 (2018), doi:10.1038/s41467-018-03751-6

Disclaimer

ARCHS4 is not to be used for treating or diagnosing human subjects. ARCHS4 or any documents available from this server are provided as is without any warranty of any kind, either express, implied, or statutory, including, but not limited to, any implied warranties of merchantability, fitness for particular purpose and freedom from infringement, or that ARCHS4 or any documents available from this server will be error free. The Ma'ayan lab makes no representations that the use of ARCHS4 or any documents available from this server will not infringe any patent or proprietary rights of third parties. In no event will the Ma'ayan lab or any of its members be liable for any damages, including but not limited to direct, indirect, special or consequential damages, arising out of, resulting from, or in any way connected with the use of ARCHS4 or documents available from this server.




© Ma'ayan Lab.