Content from Introduction to Omics Translation in AD
Last updated on 2024-10-25 | Edit this page
Introduction
This workshop will deliver hands-on training using the most recent computational techniques and omics data resources to improve awareness and utility of integrating multi-scale data from model systems and humans to uncover the mechanistic drivers of Alzheimer’s Disease (AD).
Despite considerable effort in drug development for AD, high rate of failure in clinical trials emphasize the need for more effective preclinical strategies, including targeted use of mouse models to study heterogenous disease phenotypes. In this era of omics-driven research, interspecies alignment of disease relevant molecular signatures relies heavily on bioinformatics tools and field-specific data infrastructures.
The goal of this workshop is to accelerate translational research in AD by training interested scientists on how to develop robust computational workflows for both hypothesis-driven and data-driven research strategies. The AD-specific, bespoke lessons will use resources from the AD Knowledge Portal, a NIA designated FAIR data repository that shares data from human and non-human studies generated by multiple collaborative research programs focused on aging, dementia, and AD.
Lesson material
Here are the R scripts and data files used in this workshop.
Content from Synapse and AD Knowledge Portal
Last updated on 2024-10-25 | Edit this page
Overview
Questions
- How to work with Synapse R client?
- How to work with data in AD Knowledge Portal?
Objectives
- Explain how to use Synapser Package.
- Demonstrate how to locate data and metadata in the Portal.
- Demonstrate how to download data from the Portal programmatically.
Working with AD Portal metadata
Metadata basics
We have now downloaded several metadata files and an RNAseq counts file from the portal. For our next exercises, we want to read those files in as R data so we can work with them.
We can see from the download_table
we got during the
bulk download step that we have five metadata files. Two of these should
be the individual and biospecimen files, and three of them are assay
metadata files.
R
download_table %>%
`dplyr`::select(name, metadataType, assay)
We are only interested in RNAseq data, so we will only read in the individual, biospecimen, and RNAseq assay metadata files.
R
# counts matrix
counts <- read_tsv("data/htseqcounts_5XFAD.txt",
show_col_types = FALSE)
# individual metadata
ind_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_individual_metadata.csv",
show_col_types = FALSE)
# biospecimen metadata
bio_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_biospecimen_metadata.csv",
show_col_types = FALSE)
#assay metadata
rna_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_assay_RNAseq_metadata.csv",
show_col_types = FALSE)
Let’s examine the data and metadata files a bit before we begin our analyses.
Counts data
R
# Calling a tibble object will print the first ten rows in a nice tidy output;
# doing the same for a base R dataframe will print the whole thing until it runs
# out of memory. If you want to inspect a large dataframe, use `head(df)`
counts
OUTPUT
# A tibble: 55,489 × 73
gene_id `32043rh` `32044rh` `32046rh` `32047rh` `32048rh` `32049rh` `32050rh`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00… 22554 0 0 0 16700 0 0
2 ENSG00… 344489 4 0 1 260935 6 8
3 ENSMUS… 5061 3483 3941 3088 2756 3067 2711
4 ENSMUS… 0 0 0 0 0 0 0
5 ENSMUS… 208 162 138 127 95 154 165
6 ENSMUS… 44 17 14 28 23 24 14
7 ENSMUS… 143 88 121 117 115 109 75
8 ENSMUS… 22 6 10 11 11 19 24
9 ENSMUS… 7165 5013 5581 4011 4104 5254 4345
10 ENSMUS… 3728 2316 2238 1965 1822 1999 1809
# ℹ 55,479 more rows
# ℹ 65 more variables: `32052rh` <dbl>, `32053rh` <dbl>, `32057rh` <dbl>,
# `32059rh` <dbl>, `32061rh` <dbl>, `32062rh` <dbl>, `32065rh` <dbl>,
# `32067rh` <dbl>, `32068rh` <dbl>, `32070rh` <dbl>, `32073rh` <dbl>,
# `32074rh` <dbl>, `32075rh` <dbl>, `32078rh` <dbl>, `32081rh` <dbl>,
# `32088rh` <dbl>, `32640rh` <dbl>, `46105rh` <dbl>, `46106rh` <dbl>,
# `46107rh` <dbl>, `46108rh` <dbl>, `46109rh` <dbl>, `46110rh` <dbl>, …
The data file has a column of ENSEMBL gene ids and then a bunch of
columns with count data, where the column headers correspond to the
specimenID
s. These specimenID
s should all be
in the RNAseq assay metadata file, so let’s check.
R
# what does the RNAseq assay metadata look like?
rna_meta
OUTPUT
# A tibble: 72 × 12
specimenID platform RIN rnaBatch libraryBatch sequencingBatch libraryPrep
<chr> <chr> <lgl> <dbl> <dbl> <dbl> <chr>
1 32043rh IlluminaN… NA 1 1 1 polyAselec…
2 32044rh IlluminaN… NA 1 1 1 polyAselec…
3 32046rh IlluminaN… NA 1 1 1 polyAselec…
4 32047rh IlluminaN… NA 1 1 1 polyAselec…
5 32049rh IlluminaN… NA 1 1 1 polyAselec…
6 32057rh IlluminaN… NA 1 1 1 polyAselec…
7 32061rh IlluminaN… NA 1 1 1 polyAselec…
8 32065rh IlluminaN… NA 1 1 1 polyAselec…
9 32067rh IlluminaN… NA 1 1 1 polyAselec…
10 32070rh IlluminaN… NA 1 1 1 polyAselec…
# ℹ 62 more rows
# ℹ 5 more variables: libraryPreparationMethod <lgl>, isStranded <lgl>,
# readStrandOrigin <lgl>, runType <chr>, readLength <dbl>
R
# are all the column headers from the counts matrix
# (except the first "gene_id" column) in the assay metadata?
all(colnames(counts[-1]) %in% rna_meta$`specimenID`)
OUTPUT
[1] TRUE
Assay metadata
The assay metadata contains information about how data was generated
on each sample in the assay. Each specimenID
represents a
unique sample. We can use some tools from dplyr
to explore
the metadata.
R
# how many unique specimens were sequenced?
n_distinct(rna_meta$`specimenID`)
OUTPUT
[1] 72
R
# were the samples all sequenced on the same platform?
distinct(rna_meta, platform)
OUTPUT
# A tibble: 1 × 1
platform
<chr>
1 IlluminaNovaseq6000
R
# were there multiple sequencing batches reported?
distinct(rna_meta, sequencingBatch)
OUTPUT
# A tibble: 1 × 1
sequencingBatch
<dbl>
1 1
Biospecimen metadata
The biospecimen metadata contains specimen-level information,
including organ and tissue the specimen was taken from, how it was
prepared, etc. Each specimenID
is mapped to an
individualID
.
R
# all specimens from the RNAseq assay metadata file should be in the biospecimen file
all(rna_meta$`specimenID` %in% bio_meta$`specimenID`)
OUTPUT
[1] TRUE
R
# but the biospecimen file also contains specimens from different assays
all(bio_meta$`specimenID` %in% rna_meta$`specimenID`)
OUTPUT
[1] FALSE
Individual metadata
The individual metadata contains information about all the
individuals in the study, represented by unique
individualID
s. For humans, this includes information on
age, sex, race, diagnosis, etc. For MODEL-AD mouse models, the
individual metadata has information on model genotypes, stock numbers,
diet, and more.
R
# all `individualID`s in the biospecimen file should be in the individual file
all(bio_meta$`individualID` %in% ind_meta$`individualID`)
OUTPUT
[1] TRUE
R
# which model genotypes are in this study?
distinct(ind_meta, genotype)
OUTPUT
# A tibble: 2 × 1
genotype
<chr>
1 5XFAD_carrier
2 5XFAD_noncarrier
Joining metadata
We use the three-file structure for our metadata because it allows us
to store metadata for each study in a tidy format. Every line in the
assay and biospecimen files represents a unique specimen, and every line
in the individual file represents a unique individual. This means the
files can be easily joined by specimenID
and
individualID
to get all levels of metadata that apply to a
particular data file. We will use the left_join()
function
from the dplyr
package, and the %\>%
operator from the magrittr
package. If you are unfamiliar
with the pipe, think of it as a shorthand for “take this (the preceding
object) and do that (the subsequent command)”. See magrittr for more info on
piping in R.
R
# join all the rows in the assay metadata
# that have a match in the biospecimen metadata
joined_meta <- rna_meta %>% # start with the rnaseq assay metadata
left_join(bio_meta, by = "specimenID") %>% # join rows from biospecimen
# that match specimenID
left_join(ind_meta, by = "individualID") # join rows from individual
# that match individualID
joined_meta
OUTPUT
# A tibble: 72 × 53
specimenID platform RIN rnaBatch libraryBatch sequencingBatch libraryPrep
<chr> <chr> <lgl> <dbl> <dbl> <dbl> <chr>
1 32043rh IlluminaN… NA 1 1 1 polyAselec…
2 32044rh IlluminaN… NA 1 1 1 polyAselec…
3 32046rh IlluminaN… NA 1 1 1 polyAselec…
4 32047rh IlluminaN… NA 1 1 1 polyAselec…
5 32049rh IlluminaN… NA 1 1 1 polyAselec…
6 32057rh IlluminaN… NA 1 1 1 polyAselec…
7 32061rh IlluminaN… NA 1 1 1 polyAselec…
8 32065rh IlluminaN… NA 1 1 1 polyAselec…
9 32067rh IlluminaN… NA 1 1 1 polyAselec…
10 32070rh IlluminaN… NA 1 1 1 polyAselec…
# ℹ 62 more rows
# ℹ 46 more variables: libraryPreparationMethod <lgl>, isStranded <lgl>,
# readStrandOrigin <lgl>, runType <chr>, readLength <dbl>,
# individualID <dbl>, specimenIdSource <chr>, organ <chr>, tissue <chr>,
# BrodmannArea <lgl>, sampleStatus <chr>, tissueWeight <lgl>,
# tissueVolume <lgl>, nucleicAcidSource <lgl>, cellType <lgl>,
# fastingState <lgl>, isPostMortem <lgl>, samplingAge <lgl>, …
We now have a very wide dataframe that contains all the available metadata on each specimen in the RNAseq data from this study. This procedure can be used to join the three types of metadata files for every study in the AD Knowledge Portal, allowing you to filter individuals and specimens as needed based on your analysis criteria!
R
library(lubridate)
# convert columns of strings to month-date-year format
joined_meta_time <- joined_meta %>%
mutate(dateBirth = mdy(dateBirth), dateDeath = mdy(dateDeath)) %>%
# create a new column that subtracts dateBirth from dateDeath in days,
# then divide by 30 to get months
mutate(timepoint = as.numeric(difftime(dateDeath, dateBirth, units ="days"))/30) %>%
# convert numeric ages to timepoint categories
mutate(timepoint = case_when(timepoint > 10 ~ "12 mo",
timepoint < 10 & timepoint > 5 ~ "6 mo",
timepoint < 5 ~ "4 mo"))
covars_5XFAD <- joined_meta_time %>%
dplyr::select(individualID, specimenID, sex, genotype, timepoint) %>%
distinct() %>%
as.data.frame()
rownames(covars_5XFAD) <- covars_5XFAD$specimenID
head(covars_5XFAD)
OUTPUT
individualID specimenID sex genotype timepoint
32043rh 32043 32043rh female 5XFAD_carrier 12 mo
32044rh 32044 32044rh male 5XFAD_noncarrier 12 mo
32046rh 32046 32046rh male 5XFAD_noncarrier 12 mo
32047rh 32047 32047rh male 5XFAD_noncarrier 12 mo
32049rh 32049 32049rh female 5XFAD_noncarrier 12 mo
32057rh 32057 32057rh female 5XFAD_noncarrier 12 mo
We will save joined_meta
for the next lesson.
R
saveRDS(covars_5XFAD, file = "data/covars_5XFAD.rds")
Single Specimen files
For files that contain data from a single specimen (e.g. raw sequencing files, raw mass spectra, etc.), we can use the Synapse annotations to associate these files with the appropriate metadata.
Exercise
Use Explore Data to find all RNAseq files from the
Jax.IU.Pitt_5XFAD
study. If we filter for data where
Study = "Jax.IU.Pitt_5XFAD"
and
Assay = "rnaSeq"
we will get a list of 148 files, including
raw fastqs and processed counts data.
We can use the function synGetAnnotations
to view the
annotations associated with any file without downloading the file.
R
# the synID of a random fastq file from this list
random_fastq <- "syn22108503"
# extract the annotations as a nested list
fastq_annotations <- synGetAnnotations(random_fastq)
fastq_annotations
The file annotations let us see which study the file is associated
with (Jax.IU.Pitt.5XFAD
), which species it is from (mouse),
which assay generated the file (rnaSeq), and a whole bunch of other
properties. Most importantly, single-specimen files are annotated with
the specimenID
of the specimen in the file, and the
individualID
of the individual that specimen was taken
from. We can use these annotations to link files to the rest of the
metadata, including metadata that is not in annotations. This is
especially helpful for human studies, as potentially identifying
information like age, race, and diagnosis is not included in file
annotations.
R
# find records belonging to the individual this file maps
# to in our joined metadata
joined_meta %>%
filter(`individualID` == fastq_annotations$individualID[[1]])
Annotations during bulk download
When bulk downloading many files, the best practice is to preserve
the download manifest that is generated which lists all the files, their
synIDs
, and all their annotations. If using the Synapse R
client, follow the instructions in the Bulk download files section
above.
If we use the “Programmatic Options” tab in the AD Portal download menu to download all 148 rnaSeq files from the 5XFAD study, we would get a table query that looks like this:
R
query <- synTableQuery("SELECT * FROM syn11346063.37
WHERE ( ( \"study\" HAS ( 'Jax.IU.Pitt_5XFAD' ) )
AND ( \"assay\" HAS ( 'rnaSeq' ) ) )")
As we saw previously, this downloads a csv file with the results of our AD Portal query. Opening that file lets us see which specimens are associated with which files:
R
annotations_table <- read_csv(query$filepath, show_col_types = FALSE)
annotations_table
You could then use
purrr::walk(download_table$id, ~synGet(.x, downloadLocation = ))
to walk through the column of synIDs
and download all 148
files. However, because these are large files, it might be preferable to
use the Python client or command line client for increased speed.
Once you’ve downloaded all the files in the id
column,
you can link those files to their annotations by the name
column.
R
# We'll use the "random fastq" that we got annotations for earlier
# To avoid downloading the whole 3GB file, we'll use synGet with
# "downloadFile = FALSE" to get only the Synapse entity information, rather than
# the file. If we downloaded the actual file, we could find it in the directory
# and search using the filename. Since we're just downloading the Synapse entity
# wrapper object, we'll use the file name listed in the object properties.
fastq <- synGet(random_fastq, downloadFile = FALSE)
# filter the annotations table to rows that match the fastq filename
annotations_table %>%
filter(name == fastq$properties$name)
Multispecimen files
Multispecimen files in the AD Knowledge Portal are files that contain
data or information from more than one specimen. They are not annotated
with individualID
s or specimenID
s, since these
files may contain numbers of specimens that exceed the annotation
limits. These files are usually processed or summary data (gene counts,
peptide quantifications, etc), and are always annotated with
isMultiSpecimen = TRUE
.
If we look at the processed data files in the table of 5XFAD RNAseq
file annotations we just downloaded, we will see that it
isMultiSpecimen = TRUE
, but individualID
and
specimenID
are blank:
R
annotations_table %>%
filter(fileFormat == "txt") %>%
dplyr::select(name, individualID, specimenID, isMultiSpecimen)
The multispecimen file should contain a row or column of
specimenID
s that correspond to the specimenID
s
used in a study’s metadata, as we have seen with the 5XFAD counts
file.
R
# In this example, we take a slice of the counts data to reduce computation,
# transpose it so that each row represents a single specimen, and then join it
# to the joined metadata by the `specimenID`
counts %>%
slice_head(n = 5) %>%
t() %>%
as_tibble(rownames = "specimenID") %>%
left_join(joined_meta, by = "specimenID")
OUTPUT
# A tibble: 73 × 58
specimenID V1 V2 V3 V4 V5 platform RIN rnaBatch libraryBatch
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl>
1 gene_id "ENS… "ENS… "ENS… "ENS… "ENS… <NA> NA NA NA
2 32043rh " 22… "344… " 5… " … " … Illumin… NA 1 1
3 32044rh " … " … "348… " … " 16… Illumin… NA 1 1
4 32046rh " … " … "394… " … " 13… Illumin… NA 1 1
5 32047rh " … " … "308… " … " 12… Illumin… NA 1 1
6 32048rh " 16… "260… " 2… " … " … Illumin… NA 1 1
7 32049rh " … " … "306… " … " 15… Illumin… NA 1 1
8 32050rh " … " … "271… " … " 16… Illumin… NA 1 1
9 32052rh " 19… "337… " 3… " … " … Illumin… NA 1 1
10 32053rh " 14… "206… " 3… " … " … Illumin… NA 1 1
# ℹ 63 more rows
# ℹ 48 more variables: sequencingBatch <dbl>, libraryPrep <chr>,
# libraryPreparationMethod <lgl>, isStranded <lgl>, readStrandOrigin <lgl>,
# runType <chr>, readLength <dbl>, individualID <dbl>,
# specimenIdSource <chr>, organ <chr>, tissue <chr>, BrodmannArea <lgl>,
# sampleStatus <chr>, tissueWeight <lgl>, tissueVolume <lgl>,
# nucleicAcidSource <lgl>, cellType <lgl>, fastingState <lgl>, …
Session Info
R
sessionInfo()
OUTPUT
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 purrr_1.0.2
[5] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1
[9] tidyverse_2.0.0 dplyr_1.1.4
loaded via a namespace (and not attached):
[1] bit_4.5.0 gtable_0.3.6 compiler_4.4.1 renv_1.0.11
[5] crayon_1.5.3 tidyselect_1.2.1 parallel_4.4.1 scales_1.3.0
[9] yaml_2.3.10 R6_2.5.1 generics_0.1.3 knitr_1.48
[13] munsell_0.5.1 pillar_1.9.0 tzdb_0.4.0 rlang_1.1.4
[17] utf8_1.2.4 stringi_1.8.4 xfun_0.48 bit64_4.5.2
[21] timechange_0.3.0 cli_3.6.3 withr_3.0.1 magrittr_2.0.3
[25] grid_4.4.1 vroom_1.6.5 hms_1.1.3 lifecycle_1.0.4
[29] vctrs_0.6.5 evaluate_1.0.1 glue_1.8.0 fansi_1.0.6
[33] colorspace_2.1-1 tools_4.4.1 pkgconfig_2.0.3
Key Points
- Use your Synapse login credentials to access the Portal.
- Use Synapser package to download data from the Portal.
Content from Differential expression analysis
Last updated on 2024-10-25 | Edit this page
Overview
Questions
- What transcriptomic changes do we observe in mouse models carrying AD-related mutations?
Objectives
- Read in a count matrix and metadata.
- Understand the data from AD mouse models
- Format the data for differential analysis
- Perform differential analysis using DESeq2.
- Pathway enrichment of differentially expressed genes
- Save data for next lessons
Differential Expression Analysis
Reading Gene Expression Count matrix from Previous Lesson
In this lesson, we will use the raw counts matrix and metadata downloaded in the previous lesson and will perform differential expression analysis.
R
counts <- read.delim("data/htseqcounts_5XFAD.txt", check.names = FALSE)
Reading Sample Metadata from Previous Lesson
R
covars <- readRDS("data/covars_5XFAD.rds")
Let’s explore the data:
Let’s look at the top of the metadata.
R
head(covars)
OUTPUT
individualID specimenID sex genotype timepoint
32043rh 32043 32043rh female 5XFAD_carrier 12 mo
32044rh 32044 32044rh male 5XFAD_noncarrier 12 mo
32046rh 32046 32046rh male 5XFAD_noncarrier 12 mo
32047rh 32047 32047rh male 5XFAD_noncarrier 12 mo
32049rh 32049 32049rh female 5XFAD_noncarrier 12 mo
32057rh 32057 32057rh female 5XFAD_noncarrier 12 mo
identify distinct groups using sample metadata
R
distinct(covars, sex, genotype, timepoint)
OUTPUT
sex genotype timepoint
32043rh female 5XFAD_carrier 12 mo
32044rh male 5XFAD_noncarrier 12 mo
32049rh female 5XFAD_noncarrier 12 mo
46105rh female 5XFAD_noncarrier 6 mo
46108rh male 5XFAD_noncarrier 6 mo
46131rh female 5XFAD_noncarrier 4 mo
46877rh male 5XFAD_noncarrier 4 mo
46887rh female 5XFAD_carrier 4 mo
32053rh male 5XFAD_carrier 12 mo
46111rh female 5XFAD_carrier 6 mo
46865rh male 5XFAD_carrier 6 mo
46866rh male 5XFAD_carrier 4 mo
How many mice were used to produce this data?
R
covars %>% group_by(sex,genotype,timepoint) %>%
dplyr::count()
OUTPUT
# A tibble: 12 × 4
# Groups: sex, genotype, timepoint [12]
sex genotype timepoint n
<chr> <chr> <chr> <int>
1 female 5XFAD_carrier 12 mo 6
2 female 5XFAD_carrier 4 mo 6
3 female 5XFAD_carrier 6 mo 6
4 female 5XFAD_noncarrier 12 mo 6
5 female 5XFAD_noncarrier 4 mo 6
6 female 5XFAD_noncarrier 6 mo 6
7 male 5XFAD_carrier 12 mo 6
8 male 5XFAD_carrier 4 mo 6
9 male 5XFAD_carrier 6 mo 6
10 male 5XFAD_noncarrier 12 mo 6
11 male 5XFAD_noncarrier 4 mo 6
12 male 5XFAD_noncarrier 6 mo 6
How many rows and columns are there in counts?
R
dim(counts)
OUTPUT
[1] 55489 73
In the counts matrix, genes are in rows and samples are in columns. Let’s look at the first few rows.
R
head(counts,n=5)
OUTPUT
gene_id 32043rh 32044rh 32046rh 32047rh 32048rh 32049rh 32050rh
1 ENSG00000080815 22554 0 0 0 16700 0 0
2 ENSG00000142192 344489 4 0 1 260935 6 8
3 ENSMUSG00000000001 5061 3483 3941 3088 2756 3067 2711
4 ENSMUSG00000000003 0 0 0 0 0 0 0
5 ENSMUSG00000000028 208 162 138 127 95 154 165
32052rh 32053rh 32057rh 32059rh 32061rh 32062rh 32065rh 32067rh 32068rh
1 19748 14023 0 17062 0 15986 10 0 18584
2 337456 206851 1 264748 0 252248 172 4 300398
3 3334 3841 4068 3306 4076 3732 3940 4238 3257
4 0 0 0 0 0 0 0 0 0
5 124 103 164 116 108 134 204 239 148
32070rh 32073rh 32074rh 32075rh 32078rh 32081rh 32088rh 32640rh 46105rh
1 1 0 0 22783 17029 16626 15573 12721 4
2 4 2 9 342655 280968 258597 243373 188818 19
3 3351 3449 4654 4844 3132 3334 3639 3355 4191
4 0 0 0 0 0 0 0 0 0
5 159 167 157 211 162 149 160 103 158
46106rh 46107rh 46108rh 46109rh 46110rh 46111rh 46112rh 46113rh 46115rh
1 0 0 0 0 0 17931 0 19087 0
2 0 0 1 5 1 293409 8 273704 1
3 3058 4265 3248 3638 3747 3971 3192 3805 3753
4 0 0 0 0 0 0 0 0 0
5 167 199 113 168 175 203 158 108 110
46121rh 46131rh 46132rh 46134rh 46138rh 46141rh 46142rh 46862rh 46863rh
1 0 0 12703 18833 0 18702 17666 0 14834
2 0 1 187975 285048 0 284499 250600 0 218494
3 4134 3059 3116 3853 3682 2844 3466 3442 3300
4 0 0 0 0 0 0 0 0 0
5 179 137 145 183 171 138 88 154 157
46865rh 46866rh 46867rh 46868rh 46871rh 46872rh 46873rh 46874rh 46875rh
1 10546 10830 10316 10638 15248 0 0 11608 11561
2 169516 152769 151732 190150 229063 6 1 165941 171303
3 3242 3872 3656 3739 3473 3154 5510 3657 4121
4 0 0 0 0 0 0 0 0 0
5 131 152 152 155 140 80 240 148 112
46876rh 46877rh 46878rh 46879rh 46881rh 46882rh 46883rh 46884rh 46885rh
1 0 0 12683 15613 0 14084 20753 0 0
2 0 2 183058 216122 0 199448 306081 0 5
3 3422 3829 3996 4324 2592 2606 4600 2913 3614
4 0 0 0 0 0 0 0 0 0
5 147 166 169 215 115 101 174 127 151
46886rh 46887rh 46888rh 46889rh 46890rh 46891rh 46892rh 46893rh 46895rh
1 16639 16072 0 16680 13367 0 25119 92 0
2 242543 258061 0 235530 196721 0 371037 1116 0
3 3294 3719 3899 4173 4008 3037 5967 3459 4262
4 0 0 0 0 0 0 0 0 0
5 139 128 210 127 156 116 260 161 189
46896rh 46897rh
1 15934 0
2 235343 6
3 3923 3486
4 0 0
5 179 117
As you can see, the gene ids are ENSEMBL IDs. There is some risk that these may not be unique. Let’s check whether any of the gene symbols are duplicated. We will sum the number of duplicated gene symbols.
R
sum(duplicated(rownames(counts)))
OUTPUT
[1] 0
The sum equals zero, so there are no duplicated gene symbols, which is good. Similarly, samples should be unique. Once again, let’s verify this:
R
sum(duplicated(colnames(counts)))
OUTPUT
[1] 0
Formatting the count matrix
Now, as we see that gene_id is in first column of count matrix, but we will need only count data in matrix, so we will change the gene_id column to rownames. Converting the gene_id as rownames of count matrix
R
counts <- counts %>%
column_to_rownames(.,var="gene_id") %>%
as.data.frame()
let’s confirm if change is done correctly
R
head(counts, n=5)
OUTPUT
32043rh 32044rh 32046rh 32047rh 32048rh 32049rh 32050rh
ENSG00000080815 22554 0 0 0 16700 0 0
ENSG00000142192 344489 4 0 1 260935 6 8
ENSMUSG00000000001 5061 3483 3941 3088 2756 3067 2711
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 208 162 138 127 95 154 165
32052rh 32053rh 32057rh 32059rh 32061rh 32062rh 32065rh
ENSG00000080815 19748 14023 0 17062 0 15986 10
ENSG00000142192 337456 206851 1 264748 0 252248 172
ENSMUSG00000000001 3334 3841 4068 3306 4076 3732 3940
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 124 103 164 116 108 134 204
32067rh 32068rh 32070rh 32073rh 32074rh 32075rh 32078rh
ENSG00000080815 0 18584 1 0 0 22783 17029
ENSG00000142192 4 300398 4 2 9 342655 280968
ENSMUSG00000000001 4238 3257 3351 3449 4654 4844 3132
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 239 148 159 167 157 211 162
32081rh 32088rh 32640rh 46105rh 46106rh 46107rh 46108rh
ENSG00000080815 16626 15573 12721 4 0 0 0
ENSG00000142192 258597 243373 188818 19 0 0 1
ENSMUSG00000000001 3334 3639 3355 4191 3058 4265 3248
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 149 160 103 158 167 199 113
46109rh 46110rh 46111rh 46112rh 46113rh 46115rh 46121rh
ENSG00000080815 0 0 17931 0 19087 0 0
ENSG00000142192 5 1 293409 8 273704 1 0
ENSMUSG00000000001 3638 3747 3971 3192 3805 3753 4134
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 168 175 203 158 108 110 179
46131rh 46132rh 46134rh 46138rh 46141rh 46142rh 46862rh
ENSG00000080815 0 12703 18833 0 18702 17666 0
ENSG00000142192 1 187975 285048 0 284499 250600 0
ENSMUSG00000000001 3059 3116 3853 3682 2844 3466 3442
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 137 145 183 171 138 88 154
46863rh 46865rh 46866rh 46867rh 46868rh 46871rh 46872rh
ENSG00000080815 14834 10546 10830 10316 10638 15248 0
ENSG00000142192 218494 169516 152769 151732 190150 229063 6
ENSMUSG00000000001 3300 3242 3872 3656 3739 3473 3154
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 157 131 152 152 155 140 80
46873rh 46874rh 46875rh 46876rh 46877rh 46878rh 46879rh
ENSG00000080815 0 11608 11561 0 0 12683 15613
ENSG00000142192 1 165941 171303 0 2 183058 216122
ENSMUSG00000000001 5510 3657 4121 3422 3829 3996 4324
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 240 148 112 147 166 169 215
46881rh 46882rh 46883rh 46884rh 46885rh 46886rh 46887rh
ENSG00000080815 0 14084 20753 0 0 16639 16072
ENSG00000142192 0 199448 306081 0 5 242543 258061
ENSMUSG00000000001 2592 2606 4600 2913 3614 3294 3719
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 115 101 174 127 151 139 128
46888rh 46889rh 46890rh 46891rh 46892rh 46893rh 46895rh
ENSG00000080815 0 16680 13367 0 25119 92 0
ENSG00000142192 0 235530 196721 0 371037 1116 0
ENSMUSG00000000001 3899 4173 4008 3037 5967 3459 4262
ENSMUSG00000000003 0 0 0 0 0 0 0
ENSMUSG00000000028 210 127 156 116 260 161 189
46896rh 46897rh
ENSG00000080815 15934 0
ENSG00000142192 235343 6
ENSMUSG00000000001 3923 3486
ENSMUSG00000000003 0 0
ENSMUSG00000000028 179 117
As you can see from count table there are some genes that start with “ENSG” and others start with “ENSMUSG”. “ENSG” referes to human gene ENSEMBL id and “ENSMUSG” refer to mouse ENSEMBL id. Let’s check how many gene_ids are NOT from the mouse genome by searching for the string “MUS” (as in Mus musculus) in the rownames of count matrix
R
counts[,1:6] %>%
filter(!str_detect(rownames(.), "MUS"))
OUTPUT
32043rh 32044rh 32046rh 32047rh 32048rh 32049rh
ENSG00000080815 22554 0 0 0 16700 0
ENSG00000142192 344489 4 0 1 260935 6
Ok, so we see there are two human genes in out count matrix. Why? What genes are they?
Briefly, 5xFAD mouse strain harbors two human transgenes APP (“ENSG00000142192”) and PSEN1 (“ENSG00000080815”) and inserted into exon 2 of the mouse Thy1 gene. To validate 5XFAD strain and capture expression of human transgene APP and PS1, a custom mouse genomic sequence was created and we quantified expression of human as well as mouse App (“ENSMUSG00000022892”) and Psen1 (“ENSMUSG00000019969”) genes by our MODEL-AD RNA-Seq pipeline.
Validation of 5xFAD mouse strain
First we convert the dataframe to longer format and join our covariates by MouseID
R
count_tpose <- counts %>%
rownames_to_column(.,var="gene_id") %>%
filter(gene_id %in% c("ENSG00000080815","ENSMUSG00000019969","ENSG00000142192","ENSMUSG00000022892")) %>%
pivot_longer(.,cols = -"gene_id",names_to = "specimenID",values_to="counts") %>% as.data.frame() %>%
left_join(covars, by="specimenID") %>% as.data.frame()
head(count_tpose)
OUTPUT
gene_id specimenID counts individualID sex genotype
1 ENSG00000080815 32043rh 22554 32043 female 5XFAD_carrier
2 ENSG00000080815 32044rh 0 32044 male 5XFAD_noncarrier
3 ENSG00000080815 32046rh 0 32046 male 5XFAD_noncarrier
4 ENSG00000080815 32047rh 0 32047 male 5XFAD_noncarrier
5 ENSG00000080815 32048rh 16700 32048 female 5XFAD_carrier
6 ENSG00000080815 32049rh 0 32049 female 5XFAD_noncarrier
timepoint
1 12 mo
2 12 mo
3 12 mo
4 12 mo
5 12 mo
6 12 mo
Rename the APP and PSEN1 genes to specify whether mouse or human.
R
#make the age column a factor and re-order the levels
count_tpose$timepoint <- factor(count_tpose$timepoint,levels=c("4 mo","6 mo","12 mo"))
# rename the gene id to gene symbol
count_tpose$gene_id[count_tpose$gene_id %in% "ENSG00000142192"] <- "Human APP"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSG00000080815"] <- "Human PSEN1"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSMUSG00000022892"] <- "Mouse App"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSMUSG00000019969"] <- "Mouse Psen1"
Visualize orthologous genes.
R
#Create simple box plots showing normalized counts by genotype and time point, faceted by sex.
count_tpose %>%
ggplot(aes(x=timepoint, y=counts, color=genotype)) +
geom_boxplot() +
geom_point(position=position_jitterdodge()) +
facet_wrap(~sex+gene_id) +theme_bw()
You will notice expression of Human APP is higher in 5XFAD carriers but lower in non-carriers. However mouse App expressed in both 5XFAD carrier and non-carrier.
We are going to sum the counts from both ortholgous genes (human APP and mouse App; human PSEN1 and mouse Psen1) and save summed expression as expression of mouse genes, respectively to match with gene names in control mice.
R
#merge mouse and human APP gene raw count
counts[rownames(counts) %in% "ENSMUSG00000022892",] <- counts[rownames(counts) %in% "ENSMUSG00000022892",] + counts[rownames(counts) %in% "ENSG00000142192",]
counts <- counts[!rownames(counts) %in% c("ENSG00000142192"),]
#merge mouse and human PS1 gene raw count
counts[rownames(counts) %in% "ENSMUSG00000019969",] <- counts[rownames(counts) %in% "ENSMUSG00000019969",] + counts[rownames(counts) %in% "ENSG00000080815",]
counts <- counts[!rownames(counts) %in% c("ENSG00000080815"),]
Let’s verify if expression of both human genes have been merged or not:
R
counts[,1:6] %>%
filter(!str_detect(rownames(.), "MUS"))
OUTPUT
[1] 32043rh 32044rh 32046rh 32047rh 32048rh 32049rh
<0 rows> (or 0-length row.names)
What proportion of genes have zero counts in all samples?
R
gene_sums <- data.frame(gene_id = rownames(counts),
sums = Matrix::rowSums(counts))
sum(gene_sums$sums == 0)
OUTPUT
[1] 9691
We can see that 9691 (17%) genes have no reads at all associated with them. In the next lesson, we will remove genes that have no counts in any samples.
Differential Analysis using DESeq2
Now, after exploring and formatting the data, We will look for differential expression between the control and 5xFAD mice at different ages for both sexes. The differentially expressed genes (DEGs) can inform our understanding of how the 5XFAD mutation affect the biological processes.
DESeq2 analysis consist of multiple steps. We are going to briefly understand some of the important steps using subset of data and then we will perform differential analysis on whole dataset.
First, order the data (so counts and metadata specimenID orders match) and save as another variable name
R
rawdata <- counts[,sort(colnames(counts))]
metadata <- covars[sort(rownames(covars)),]
subset the counts matrix and sample metadata to include only 12 month old male mice. You can amend the code to compare wild type and 5XFAD mice from either sex, at any time point.
R
meta.12M.Male <- metadata[(metadata$sex=="male" & metadata$timepoint=='12 mo'),]
meta.12M.Male
OUTPUT
individualID specimenID sex genotype timepoint
32044rh 32044 32044rh male 5XFAD_noncarrier 12 mo
32046rh 32046 32046rh male 5XFAD_noncarrier 12 mo
32047rh 32047 32047rh male 5XFAD_noncarrier 12 mo
32053rh 32053 32053rh male 5XFAD_carrier 12 mo
32059rh 32059 32059rh male 5XFAD_carrier 12 mo
32061rh 32061 32061rh male 5XFAD_noncarrier 12 mo
32062rh 32062 32062rh male 5XFAD_carrier 12 mo
32073rh 32073 32073rh male 5XFAD_noncarrier 12 mo
32074rh 32074 32074rh male 5XFAD_noncarrier 12 mo
32075rh 32075 32075rh male 5XFAD_carrier 12 mo
32088rh 32088 32088rh male 5XFAD_carrier 12 mo
32640rh 32640 32640rh male 5XFAD_carrier 12 mo
R
dat <- as.matrix(rawdata[,colnames(rawdata) %in% rownames(meta.12M.Male)])
colnames(dat)
OUTPUT
[1] "32044rh" "32046rh" "32047rh" "32053rh" "32059rh" "32061rh" "32062rh"
[8] "32073rh" "32074rh" "32075rh" "32088rh" "32640rh"
R
rownames(meta.12M.Male)
OUTPUT
[1] "32044rh" "32046rh" "32047rh" "32053rh" "32059rh" "32061rh" "32062rh"
[8] "32073rh" "32074rh" "32075rh" "32088rh" "32640rh"
R
match(colnames(dat),rownames(meta.12M.Male))
OUTPUT
[1] 1 2 3 4 5 6 7 8 9 10 11 12
Next, we build the DESeqDataSet using the following function:
R
ddsHTSeq <- DESeqDataSetFromMatrix(countData=dat,
colData=meta.12M.Male,
design = ~genotype)
R
ddsHTSeq
Pre-filtering
While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.
R
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) >= 10,]
ddsHTSeq
Reference level
By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels.
specifying the reference-level to 5XFAD_noncarrier:
R
ddsHTSeq$genotype <- relevel(ddsHTSeq$genotype,ref="5XFAD_noncarrier")
Run the standard differential expression analysis steps that is wrapped into a single function, DESeq.
R
dds <- DESeq(ddsHTSeq,parallel = TRUE)
Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. By default the argument alpha is set to 0.1. If the adjusted p value cutoff will be a value other than 0.1, alpha should be set to that value:
R
res <- results(dds,alpha=0.05) # setting 0.05 as significant threshold
res
We can order our results table by the smallest p value:
R
resOrdered <- res[order(res$pvalue),]
head(resOrdered,n=10)
we can summarize some basic tallies using the summary function.
R
summary(res)
How many adjusted p-values were less than 0.05?
R
sum(res$padj < 0.05, na.rm=TRUE)
How many adjusted p-values were less than 0.1?
R
sum(res$padj < 0.1, na.rm=TRUE)
Function to convert ensembleIDs to common gene names
We’ll use a package to translate mouse ENSEMBL IDS to gene names. Run this function and they will be called up when assembling results from the differential expression analysis
R
map_function.df <- function(x, inputtype, outputtype) {
mapIds(
org.Mm.eg.db,
keys = row.names(x),
column = outputtype,
keytype = inputtype,
multiVals = "first"
)
}
Generating Results table
Here we will call the function to get the ‘symbol’ names of the genes incorporated into the results table, along with the columns we are most interested in
R
All_res <- as.data.frame(res) %>%
mutate(symbol = map_function.df(res,"ENSEMBL","SYMBOL")) %>% ##run map_function to add symbol of gene corresponding to ENSEBL ID
mutate(EntrezGene = map_function.df(res,"ENSEMBL","ENTREZID")) %>% ##run map_function to add Entrez ID of gene corresponding to ENSEBL ID
dplyr::select("symbol", "EntrezGene","baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")
head(All_res)
Extracting genes that are significantly expressed
Let’s subset all the genes with a pvalue < 0.05
R
dseq_res <- subset(All_res[order(All_res$padj), ], padj < 0.05)
Wow! We have a lot of genes with apparently very strong statistically significant differences between the control and 5xFAD carrier.
R
dim(dseq_res)
R
head(dseq_res)
Exploring and exporting results
Exporting results to CSV files
we can save results file into a csv file like this:
R
write.csv(All_res,file="results/All_5xFAD_12months_male.csv")
write.csv(dseq_res,file="results/DEG_5xFAD_12months_male.csv")
Volcano plot
We can visualize the differential expression results using Volcano plot function from EnhancedVolcano package. For the most basic volcano plot, only a single data-frame, data-matrix, or tibble of test results is required, containing point labels, log2FC, and adjusted or unadjusted P values. The default cut-off for log2FC is >|2|; the default cut-off for P value is 10e-6.
R
EnhancedVolcano(All_res,
lab = (All_res$symbol),
x = 'log2FoldChange',
y = 'padj',legendPosition = 'none',
title = 'Volcano plot:Differential Expression Results',
subtitle = '',
FCcutoff = 0.1,
pCutoff = 0.05,
xlim = c(-3, 6))
You can see that some top significantly expressed are immune/inflammation-related genes such as Ctsd, C4b, Csf1 etc. These genes are upregulated in the 5XFAD strain.
Principal component plot of the samples
Principal component analysis is a dimension reduction technique that reduces the dimensionality of these large matrixes into a linear coordinate system, so that we can more easily visualize what factors are contributing the most to variation in the dataset by graphing the principal components.
R
ddsHTSeq <- DESeqDataSetFromMatrix(countData=as.matrix(rawdata), colData=metadata, design= ~ genotype)
R
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)>1) >= 10, ]
dds <- DESeq(ddsHTSeq,parallel = TRUE)
R
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)
plotPCA(vsd, intgroup=c("genotype", "sex","timepoint"))
We can see that clustering is occurring, though it’s kind of hard to see exactly how they are clustering in this visualization.
It is also possible to customize the PCA plot using the ggplot function.
R
pcaData <- plotPCA(vsd, intgroup=c("genotype", "sex","timepoint"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2,color=genotype, shape=sex)) +
geom_point(size=3) +
geom_text(aes(label=timepoint),hjust=0.5, vjust=2,size =3.5) +
labs(x= paste0("PC1: ",percentVar[1],"% variance"), y= paste0("PC2: ",percentVar[2],"% variance"))
PCA identified genotype and sex being a major source of variation in between 5XFAD and WT mice. Female and male samples from the 5XFAD carriers clustered distinctly at all ages, suggesting the presence of sex-biased molecular changes in animals.
Function for Differential analysis using DESeq2
Finally, we can build a function for differential analysis that consists of all above discussed steps. It will require to input sorted raw count matrix, sample metadata and define the reference group.
R
DEG <- function(rawdata, meta,
include.batch = FALSE,
ref = ref) {
dseq_res <- data.frame()
All_res <- data.frame()
if (include.batch) {
cat("Including batch as covariate\n")
design_formula <- ~ Batch + genotype
}
else{
design_formula <- ~ genotype
}
dat2 <- as.matrix(rawdata[, colnames(rawdata) %in% rownames(meta)])
ddsHTSeq <-
DESeqDataSetFromMatrix(countData = dat2,
colData = meta,
design = design_formula)
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) >= 10, ]
ddsHTSeq$genotype <- relevel(ddsHTSeq$genotype, ref = ref)
dds <- DESeq(ddsHTSeq, parallel = TRUE)
res <- results(dds, alpha = 0.05)
#summary(res)
res$symbol <- map_function.df(res,"ENSEMBL","SYMBOL")
res$EntrezGene <- map_function.df(res,"ENSEMBL","ENTREZID")
All_res <<- as.data.frame(res[, c("symbol", "EntrezGene","baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")])
}
Let’s use this function to analyze all groups present in our data.
Differential Analysis of all groups
First, we add a Group column to our metadata table that will combine all variables of interest (genotype, sex, and timepoint) for each sample.
R
metadata$Group <- paste0(metadata$genotype,"-",metadata$sex,"-",metadata$timepoint)
unique(metadata$Group)
OUTPUT
[1] "5XFAD_carrier-female-12 mo" "5XFAD_noncarrier-male-12 mo"
[3] "5XFAD_noncarrier-female-12 mo" "5XFAD_carrier-male-12 mo"
[5] "5XFAD_noncarrier-female-6 mo" "5XFAD_noncarrier-male-6 mo"
[7] "5XFAD_carrier-female-6 mo" "5XFAD_noncarrier-female-4 mo"
[9] "5XFAD_carrier-female-4 mo" "5XFAD_carrier-male-6 mo"
[11] "5XFAD_carrier-male-4 mo" "5XFAD_noncarrier-male-4 mo"
Next, we create a comparison table that has all cases and controls that we would like to compare with each other. Here I have made comparison groups for age and sex-matched 5xFAD carriers vs 5xFAD_noncarriers, with carriers as the cases and noncarriers as the controls:
R
comparisons <- data.frame(control=c("5XFAD_noncarrier-male-4 mo", "5XFAD_noncarrier-female-4 mo", "5XFAD_noncarrier-male-6 mo",
"5XFAD_noncarrier-female-6 mo","5XFAD_noncarrier-male-12 mo", "5XFAD_noncarrier-female-12 mo"),
case=c("5XFAD_carrier-male-4 mo", "5XFAD_carrier-female-4 mo", "5XFAD_carrier-male-6 mo",
"5XFAD_carrier-female-6 mo","5XFAD_carrier-male-12 mo", "5XFAD_carrier-female-12 mo")
)
R
comparisons
OUTPUT
control case
1 5XFAD_noncarrier-male-4 mo 5XFAD_carrier-male-4 mo
2 5XFAD_noncarrier-female-4 mo 5XFAD_carrier-female-4 mo
3 5XFAD_noncarrier-male-6 mo 5XFAD_carrier-male-6 mo
4 5XFAD_noncarrier-female-6 mo 5XFAD_carrier-female-6 mo
5 5XFAD_noncarrier-male-12 mo 5XFAD_carrier-male-12 mo
6 5XFAD_noncarrier-female-12 mo 5XFAD_carrier-female-12 mo
Finally, we implement our DEG function on each case/control comparison of interest and store the result table in a list and data frame:
R
# initiate an empty list and data frame to save results
DE_5xFAD.list <- list()
DE_5xFAD.df <- data.frame()
for (i in 1:nrow(comparisons))
{
meta <- metadata[metadata$Group %in% comparisons[i,],]
DEG(rawdata,meta,ref = "5XFAD_noncarrier")
#append results in data frame
DE_5xFAD.df <- rbind(DE_5xFAD.df,All_res %>% mutate(model="5xFAD",sex=unique(meta$sex),age=unique(meta$timepoint)))
#append results in list
DE_5xFAD.list[[i]] <- All_res
names(DE_5xFAD.list)[i] <- paste0(comparisons[i,2])
}
Let’s explore the result stored in our list:
R
names(DE_5xFAD.list)
We can easily extract the result table for any group of interest by using $ and name of group. Let’s check top few rows from 5XFAD_carrier-male-4 mo group:
R
head(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`)
Let’s check the result stored as dataframe:
R
head(DE_5xFAD.df)
Check if result is present for all ages:
R
unique((DE_5xFAD.df$age))
Check if result is present for both sexes:
R
unique((DE_5xFAD.df$sex))
Check number of genes in each group:
R
dplyr::count(DE_5xFAD.df,model,sex,age)
Check number of genes significantly differentially expressed in all cases compared to age and sex-matched controls:
R
degs.up <- map(DE_5xFAD.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange>0)))
degs.down <- map(DE_5xFAD.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange<0)))
deg <- data.frame(Cases=names(degs.up), Up_DEGs.pval.05=as.vector(unlist(degs.up)),Down_DEGs.pval.05=as.vector(unlist(degs.down)))
knitr::kable(deg)
Interestingly, in females more genes are differentially expressed at all age groups, and more genes are differentially expressed the older the mice get in both sexes.
Pathway Enrichment
We may wish to look for enrichment of biological pathways in a list of differentially expressed genes. Here we will test for enrichment of KEGG pathways using using enrichKEGG function in clusterProfiler package.
R
dat <- list(FAD_M_4=subset(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`[order(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`$padj), ], padj < 0.05) %>% pull(EntrezGene),
FAD_F_4=subset(DE_5xFAD.list$`5XFAD_carrier-female-4 mo`[order(DE_5xFAD.list$`5XFAD_carrier-female-4 mo`$padj), ], padj < 0.05) %>% pull(EntrezGene))
## perform enrichment analysis
enrich_pathway <- compareCluster(dat,
fun = "enrichKEGG",
pvalueCutoff = 0.05,
organism = "mmu"
)
enrich_pathway@compareClusterResult$Description <- gsub(" - Mus musculus \\(house mouse)","",enrich_pathway@compareClusterResult$Description)
Let’s plot top enriched functions using dotplot function of clusterProfiler package.
R
clusterProfiler::dotplot(enrich_pathway,showCategory=10,font.size = 14,title="Enriched KEGG Pathways")
What does this plot infer?
Save Data for Next Lesson
We will use the results data in the next lesson. Save it now and we will load it at the beginning of the next lesson. We will use R’s save command to save the objects in compressed, binary format. The save command is useful when you want to save multiple objects in one file.
R
save(DE_5xFAD.df,DE_5xFAD.list,file="results/DEAnalysis_5XFAD.Rdata")
Session Info
R
sessionInfo()
OUTPUT
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] Matrix_1.6-5 gtable_0.3.6 compiler_4.4.1 renv_1.0.11
[5] highr_0.11 tidyselect_1.2.1 scales_1.3.0 yaml_2.3.10
[9] lattice_0.22-5 R6_2.5.1 labeling_0.4.3 generics_0.1.3
[13] knitr_1.48 munsell_0.5.1 pillar_1.9.0 tzdb_0.4.0
[17] rlang_1.1.4 utf8_1.2.4 stringi_1.8.4 xfun_0.48
[21] timechange_0.3.0 cli_3.6.3 withr_3.0.1 magrittr_2.0.3
[25] grid_4.4.1 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5
[29] evaluate_1.0.1 glue_1.8.0 farver_2.1.2 fansi_1.0.6
[33] colorspace_2.1-1 tools_4.4.1 pkgconfig_2.0.3
Key Points
- Use your Synapse login credentials to access the Portal.
- Use Synapser package to download data from the Portal.
Content from Mouse human alignment of transcriptomic signatures
Last updated on 2024-10-25 | Edit this page
Overview
Questions
- How do we perform a cross-species comparison?
- Which aspects of disease does a model capture?
Objectives
- Approaches to align mouse data to human data
- Understand the human AD co-expression modules
- Perform correlation analysis
- Understand the biological domains of Alzheimer’s disease
- Use biological domain annotations to compare between species
Author Gregory Cary, Jackson Laboratory
Aligning Human and Mouse Phenotype
Alzheimer’s Disease (AD) is complex, and we can not expect any animal model to fully recapitulate all aspects of late onset AD (LOAD) pathology. To study AD with animal models we must find dimensions through which we can align phenotypes between the models and human cohorts. In MODEL-AD we use the following data modalities to identify commonalities between mouse models and human cohorts:
- Imaging (i.e. MRI and PET) to correspond with human imaging studies
(e.g. ADNI)
- Neuropathological phenotypes
- Lots of ’omics — genomics, proteomics, and metabolomics
The ’omics comparisons allow for a much richer contrast, since a significant portion of genes are shared between these two species. Homology at the anatomical and neuropathological levels is less clear.
Overview of Human cohort data
The Accelerating Medicines Partnership-Alzheimer’s Disease Consortium (AMP-AD) has generated extensive sets of ’omics data from a variety of human Alzheimer’s disease cohorts. AMP-AD researchers are applying systems biology approaches toward the goal of elucidating AD mechanisms and highlighting potential therapeutic targets.
There are three large, independent human cohorts that are part of AMP-AD:
- The Religious Orders Study and the Memory and Aging Project (ROSMAP, syn3219045)
- Mount Sinai Brain Bank (MSBB,
syn3159438)
- Mayo Clinic (Mayo, syn5550404)
These studies have collected postmortem RNA-seq profiles from over
1,200 individuals spanning seven distinct brain regions: - dorsolateral
prefrontal cortex (DLPFC)
- temporal cortex (TCX)
- inferior frontal gyrus (IFG)
- superior temporal gyrus (STG)
- frontal pole (FP)
- parahippocampal gyrus (PHG)
- cerebellum (CBE)
These samples are generally balanced for AD, MCI, and non-affected controls. The data provide a broad assessment on how AD affects multiple brain regions in 3 different populations in the US.
Today we’ll consider two different approaches to compare gene expression between human cohorts and model systems:
- Correlation within co-expression module genes
- correlations will be generally weak for all expression, but animal models may recapitulate specific aspects of the disease
- we can use subsets of genes from co-expression modules, which represent genes expressed in similar patterns in AD, and look for correlations within these subsets
- Correlation of functional enrichment results
- another approach is to consider the functional annotation enriched among differentially expressed genes in human and mouse
- we can similarly sub-divide these groups of co-functional genes into biological domains to aid our interpretation
Overview of Human Consensus RNA-Seq Coexpression Modules
Wan, et al. performed meta analysis including all available AMP-AD RNA-seq datasets and systematically defined correspondences between gene expression changes associated with AD in human brains. Briefly, the RNA-seq read libraries were normalized and covariates were adjusted for each human study separately before testing for differential expression using fixed/mixed effects modeling to account for batch effects. The expression data from each brain region was used to perform co-expression analysis using a variety of different algorithms, generating in total 2,978 co-expression modules across all tissues. Of these, 660 modules showed enrichment for at least one AD-specific differentially expressed gene from the meta-analysis of all cases compared to controls.
Wan et al clustered these modules together using network analyses and found 30 co-expression modules related to LOAD pathology. Among the 30 aggregate co-expression modules, five consensus clusters were described that span brain region and study cohorts. These consensus clusters consist of subsets of modules which are associated with similar AD related changes across brain regions and cohorts. Wan et al looked for enrichment of cell -type signatures within the modules using expression-weighted cell type enrichment analysis (Skene and Grant, 2016) and examined enrichment of functional annotations within the modules.
This figure shows a matrix view of gene content overlap between the 30 co-expression modules. You’ll note a few strongly overlapping group of modules, implicating similar expression profiles in different studies and brain regions, which are the consensus clusters (A-E).
The first module block (consensus cluster A) is enriched for signatures of astrocytes, while the next block (consensus cluster B) is enriched for signatures of other cell types, including endothelial and microglial expressed genes, suggesting an inflammation component. The third module block (consensus cluster C) is strongly enriched for signatures of neuronal gene expression, linking the modules within this cluster to neurodegenerative processes. The fourth module block (consensus cluster D) is enriched for oligodendrocyte and other glial genes, indicating myelination and other neuronal support functions associated with these modules. Finally, consensus cluster E contains mixed modules that don’t have clear cell type enrichments, but do have enrichments for processes involved in response to stress or unfolded proteins. Stress response is not cell specific, so the expression represented by these modules may be throughout many cells in the brain.
Accessing AMP-AD module data
These AMP-AD co-expression modules are very useful for making comparisons between animal models and the human cohorts. In order to use these modules to make the comparisons, we’ll need to download data pertaining to the 30 co-expression modules. These data are available from the Synapse data repository; let’s download and take a closer look at the data.
R
query <- synTableQuery("SELECT * FROM syn11932957")
module_table <- read.table(query$filepath, sep = ",",header = TRUE)
Let’s look at module table
R
head(module_table)
Here you see 9 columns in this table. Columns we’re interested in are:
- column 3:
GeneID
contains Ensembl gene IDs
- column 4:
Module
is the module name in which gene is clustered
- column 7:
brainRegion
is the tissue of the corresponding module
- column 9:
external_gene_name
are gene symbols
How many unique modules are in the table?
R
length(unique(module_table$Module))
What are the names of the modules?
R
unique(module_table$Module)
How many genes are in each module?
R
table(module_table$Module)
We can visualize this as bar plot using ggplot2 package.
R
ggplot(module_table,aes(y=Module)) +
geom_bar() +
theme_bw()
Challenge 1
What are other ways to count genes in each module?
R
dplyr::count(module_table, Module)
We can also check the total number of unique genes in the table
R
length(unique(module_table$GeneID))
Mouse orthologs of Human module genes
In the module table we’ve downloaded we have human ENSEMBL ids and human gene symbols. In order to compare between human and mouse models, we will need to identify the corresponding mouse gene IDs. We are going to add the gene IDs of orthologous genes in mouse to the corresponding human genes in the module table.
Orthology mapping can be tricky, but thankfully Wan et al have already identified mouse orthologs for each of the human genes using the HGNC Comparison of Orthology Predictions (HCOP) tool. While there are a variety of different ways to get data about gene orthology, for the sake of simplicity we are going to read that table from Synapse.
R
mouse.human.ortho <- fread(synapser::synGet("syn17010253")$path,check.names = F,header=T)
Let’s take a look at the first few rows of this orthology table:
R
head(mouse.human.ortho)
There are 15 columns with various gene identifiers for each species. We’ll add mouse gene symbols from the ortholog table to the module table by matching the human ENSEMBL IDs from both tables.
R
module_table$Mouse_gene_name <-
mouse.human.ortho$mouse_symbol[
match(module_table$GeneID,mouse.human.ortho$human_ensembl_gene)
]
Taking a look at the module table, we can see the new column of mouse orthologs we just added.
R
head(module_table)
Some genes don’t have identified orthologs. Also there’s some redundant information. Let’s only keep the columns of interest and rows that contain a mouse ortholog mapping:
R
ampad_modules <- module_table %>%
distinct(tissue = brainRegion, module = Module, gene = GeneID, Mouse_gene_name) %>%
filter(Mouse_gene_name != "")
Take a look at this new data table:
R
head(ampad_modules)
Challenge 2
How many human genes are we removing that don’t have identified orthologs?
R
dplyr::filter(module_table, is.na(Mouse_gene_name)) %>%
dplyr::select(external_gene_name) %>%
dplyr::distinct() %>%
nrow()
Reading differential expression result of human data from meta-analysis
Now we know the genes that are in each AMP-AD co-expression cluster, along with the ID of the corresponding orthologous gene in mouse. We’ll also need to know how these genes change expression in AD.
We’ll download the results from differential expression meta-analysis of reprocessed AMP-AD RNA-seq data (all 7 brain regions). Log fold change values for human transcripts can be obtained from Synapse.
R
ampad_modules_raw <- fread(synapser::synGet("syn14237651")$path,check.names = F,header=T)
Let’s take a look at these data
R
head(ampad_modules_raw)
Data from which tissues are in this table?
R
unique(ampad_modules_raw$Tissue)
All 7 brain regions are represented.
The AMP-AD data has been processed many ways and using different models and comparisons. Let’s take a look at how many ways the data have been analyzed:
R
ampad_modules_raw %>% select(Model, Comparison) %>% distinct()
For our analyses we’ll use data from the “Diagnosis” model and
comparisons between AD cases vs controls. We’ll filter the table for
these conditions and only keep the three columns we’ll need:
Tissue
, Gene
and logFC
.
R
ampad_fc <- ampad_modules_raw %>%
as_tibble() %>%
filter(Model == "Diagnosis", Comparison == "AD-CONTROL") %>%
dplyr::select(tissue = Tissue, gene = ensembl_gene_id, ampad_fc = logFC) %>%
distinct()
Combine with modules so correlation can be done per module
Next, we will combine the fold change table we just downloaded
(ampad_fc
) and module table from before
(ampad_modules
). First, let’s look at both tables to check
how can we merge them together?
R
head(ampad_fc)
head(ampad_modules)
The columns common to both tables are gene
(the human
Ensembl gene IDs) and tissue
(the brain region
corresponding to the module/measurement). So we will merge the data sets
using these two columns.
Reminder: Each gene can be present in multiple brain regions, but should only be in one module per brain region. Let’s double check using the first gene in the table:
R
ampad_modules[ampad_modules$gene %in% "ENSG00000168439",]
This gene is present in six different co-expression modules all from different brain regions. You can do this for any other gene as well.
We’ll merge the two tables using the dplyr::inner_join
function:
R
ampad_modules_fc <- inner_join(ampad_modules, ampad_fc, by = c("gene", "tissue")) %>%
dplyr::select(symbol = Mouse_gene_name, module, ampad_fc)
Take a look at the new table we just made:
R
head(ampad_modules_fc)
We will use the data in ampad_modules_fc
to compare with
log fold change data measured in mouse models.
Preparing module information for correlation plots
Let’s package up these data and save this progress so far. This is
some manual book-keeping to arrange the modules into consensus clusters
for plotting later. You can just copy-paste this code into your
Rstudio
session.
R
cluster_a <- tibble(
module = c("TCXblue", "PHGyellow", "IFGyellow"),
cluster = "Consensus Cluster A (ECM organization)",
cluster_label = "Consensus Cluster A\n(ECM organization)"
)
cluster_b <- tibble(
module = c("DLPFCblue", "CBEturquoise", "STGblue", "PHGturquoise", "IFGturquoise", "TCXturquoise", "FPturquoise"),
cluster = "Consensus Cluster B (Immune system)",
cluster_label = "Consensus Cluster B\n(Immune system)"
)
cluster_c <- tibble(
module = c("IFGbrown", "STGbrown", "DLPFCyellow", "TCXgreen", "FPyellow", "CBEyellow", "PHGbrown"),
cluster = "Consensus Cluster C (Neuronal system)",
cluster_label = "Consensus Cluster C\n(Neuronal system)"
)
cluster_d <- tibble(
module = c("DLPFCbrown", "STGyellow", "PHGgreen", "CBEbrown", "TCXyellow", "IFGblue", "FPblue"),
cluster = "Consensus Cluster D (Cell Cycle, NMD)",
cluster_label = "Consensus Cluster D\n(Cell Cycle, NMD)"
)
cluster_e <- tibble(
module = c("FPbrown", "CBEblue", "DLPFCturquoise", "TCXbrown", "STGturquoise", "PHGblue"),
cluster = "Consensus Cluster E (Organelle Biogensis, Cellular stress response)",
cluster_label = "Consensus Cluster E\n(Organelle Biogenesis,\nCellular stress response)"
)
module_clusters <- cluster_a %>%
bind_rows(cluster_b) %>%
bind_rows(cluster_c) %>%
bind_rows(cluster_d) %>%
bind_rows(cluster_e) %>%
mutate(cluster_label = fct_inorder(cluster_label))
head(module_clusters)
mod <- module_clusters$module
save(ampad_modules_fc,module_clusters,mod,file="../data/AMPAD_Module_Data.RData")
Correlation between mouse models and human AD modules
As demonstrated in the plots below, we’ll aim to identify modules where the gene expression is correlated between human and mouse orthologs (left) as well as modules where there is no correlation (right).
There are two approaches that we commonly use to compute correlation between mouse data and human AD data:
- Compare change in expression in Human AD cases vs controls with
change in expression in mouse models for each gene in a given
module:
- LogFC(h) = log fold change in expression of human AD patients compared to control patients.
- LogFC(m) = log fold change in expression of mouse AD models compared to control mouse models.
\[cor.test(LogFC(h), LogFC(m))\]
- Compare Human AD expression changes to mouse genetic effects for
each gene in a given module:
- h = human gene expression (Log2 RNA-seq Fold Change AD/control)
- β = mouse gene expression effect from linear regression model (Log2 RNA-seq TPM)
\[cor.test(LogFC(h), β)\]
Both approaches allow us to assess directional coherence between gene expression for genes in AMP-AD modules and the effects of genetic manipulations in mice. For this lesson we are going to use the first approach.
Step 0: Reading Gene Expression Count matrix from Previous Lesson
We’ll first read the result table saved after differential analysis in last lesson. We’ll start with the data from the 5xFAD mouse model to understand the required steps to perform the correlation with human AD modules.
R
load("results/DEAnalysis_5XFAD.Rdata")
We can also load AMP-AD module data.
R
load("data/AMPAD_Module_Data.RData")
Step 1: Measure the correlation between mouse models for each sex at each age and AMP-AD modules using common genes from both datasets
We compute a Pearson correlation between changes in expression for each gene within a given module (log fold change for cases vs controls) with each mouse model (log fold change of the 5xFAD mice vs sex- and age-matched B6 mice).
First, we’ll combine both mouse DE_5xFAD.df
and human
ampad_modules_fc
log fold change data sets for all
genes.
R
model_vs_ampad <- inner_join(DE_5xFAD.df, ampad_modules_fc, by = c("symbol"), multiple = "all")
Note: for this join we specify
multiple = "all"
so that the same gene can be matched
across multiple human tissues.
R
head(model_vs_ampad)
Now we’ll create a list column containing data frames using tidyr::nest function. Nesting is implicitly a summarizing operation: you get one row for each group defined by the non-nested columns.
R
df <- model_vs_ampad %>%
dplyr::select(module, model, sex, age, symbol, log2FoldChange, ampad_fc) %>%
group_by(module, model, sex, age) %>%
nest(data = c(symbol, log2FoldChange, ampad_fc))
Let’s take a look:
R
head(df)
And we can also look at some of the nested data:
R
head(df$data[[1]])
These are the mouse and human log fold-change values for all genes in the TCXblue module; the mouse data corresponds to 4 month old male 5xFAD mice.
The total number of groups in the data table:
R
dim(df)
Next, we’ll compute correlation coefficients using the
cor.test
function:
R
cor.df <- df %>%
mutate(
cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]],
.x[["ampad_fc"]], method = "pearson")),
estimate = map_dbl(cor_test, "estimate"),
p_value = map_dbl(cor_test, "p.value")
) %>%
ungroup() %>%
dplyr::select(-cor_test)
Here we’re using purrr::map
based functions to apply the
correlation test to every entry in the data
column. We can
pull out specific features from the cor_test
list column,
including the computed correlation coefficient (estimate
)
and the significance (p.value
).
In the end we should have correlation coefficients and significance values for every comparison in our data table:
R
head(cor.df)
Step 2: Annotate correlation table to prepare for visualization
These steps will make it easier to make a nice looking plot during the next step. We’ll add a column that flags whether the correlation is significant or not, and we’ll add in the information about which consensus cluster each module belongs to:
R
model_module <- cor.df %>%
mutate(significant = p_value < 0.05) %>%
left_join(module_clusters, by = "module") %>%
dplyr::select(cluster, cluster_label, module, model, sex, age,
correlation = estimate, p_value, significant)
R
head(model_module)
Step 3: Create a dataframe to use as input for plotting the results
More preparations for plotting, here we’ll get all of the values in the right order so that they are grouped together nicely on the plot.
R
order.model <- c("5xFAD (Male)", "5xFAD (Female)")
correlation_for_plot <- model_module %>%
arrange(cluster) %>%
mutate(
module = factor(module,levels=mod),
model_sex = glue::glue("{model} ({str_to_title(sex)})"),
) %>%
arrange(model_sex) %>%
mutate(
age = factor(age, levels = c('4 mo','6 mo', '12 mo')),
model_sex = factor(model_sex,levels=order.model),
model_sex = fct_rev(model_sex),
)
head(correlation_for_plot)
Visualizing the Correlation plot
Now, we will use the above matrix and visualize the correlation
results using ggplot2
package functions.
R
data <- correlation_for_plot
range(correlation_for_plot$correlation)
ggplot2::ggplot() +
# the AMP-AD modules will be along the x-axis, the mouse models will be along the y-axis
ggplot2::geom_tile(data = data, ggplot2::aes(x = .data$module, y = .data$model_sex), colour = "black", fill = "white") +
# each tile of the grid will be filled with a circle where the fill and size correspond to the correlation coefficient
ggplot2::geom_point(data = dplyr::filter(data), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation, size = abs(.data$correlation))) +
# we'll draw a box arround significant correlations
ggplot2::geom_point(data = dplyr::filter(data, .data$significant), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation),color="black",shape=0,size=9) +
# plot the x-axis on the top of the plot, set the parameters of the scales
ggplot2::scale_x_discrete(position = "top") +
ggplot2::scale_size(guide = "none", limits = c(0, 0.4)) +
ggplot2::scale_color_gradient2(limits = c(-0.5, 0.5), low = "#85070C", high = "#164B6E", name = "Correlation", guide = ggplot2::guide_colorbar(ticks = FALSE)) +
# remove axis labels
ggplot2::labs(x = NULL, y = NULL) +
# facet the plot based on age range for the mice (rows) and consensus cluster (columns)
ggplot2::facet_grid( rows = dplyr::vars(.data$age),cols = dplyr::vars(.data$cluster_label), scales = "free", space = "free",switch = 'y') +
# specify how different aspects of the plot will look
ggplot2::theme(
strip.text.x = ggplot2::element_text(size = 10,colour = c("black")),
strip.text.y.left = ggplot2::element_text(angle = 0,size = 12),
axis.ticks = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 90, hjust = 0, size = 12),
axis.text.y = ggplot2::element_text(angle = 0, size = 12),
panel.background = ggplot2::element_blank(),
plot.title = ggplot2::element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
plot.title.position = "plot",
panel.grid = ggplot2::element_blank(),
legend.position = "right"
)
In the above plot, categories along the x-axis are the 30 AMP-AD co-expression modules grouped into 5 consensus clusters, while the categories along the y-axis show the different groupings of mouse models tested (split by age and sex). Positive correlations are shown in blue and negative correlations in red. Color intensity and size of the circles are proportional to the correlation coefficient. Black squares around dots represent significant correlation at p-value=0.05 and non-significant correlations are left blank.
Male and female 5xFAD mice display gene expression alterations across all five consensus clusters, with the strongest correlations observed among modules/genes in Consensus Cluster B, which generally consists of immune system pathways and functions.
Examining individual correlation results
Let’s say we want to know more about a single comparison in the plot
above and which genes are contributing to the correlation result. Maybe
we’re really interested in the correlations to the FPbrown module
between female 5xFAD mice at 4 months and 6 months. We can plot the
individual correlations for each comparison in the plot above with the
data we have. We’ll label genes with large fold change in the mouse
using the ggrepel::geom_label_repel
function.
R
# specify which comparisons to consider
m <- 'FPbrown'
s <- 'female'
a <- c('4 mo','6 mo')
# filter the correlation data frame to these comparisons
indiv.corr <- cor.df %>% filter(module == m, sex == s, age %in% a) %>% unnest(data) %>%
mutate( facet = str_c(age, '\n', 'r = ',signif(estimate,3),' ; p = ',signif(p_value,3) ))
# plot
ggplot(indiv.corr, aes( log2FoldChange , ampad_fc ))+
geom_vline(xintercept = 0, lwd = .1)+
geom_hline(yintercept = 0, lwd = .1)+
geom_point( size = .5, color = 'darkred')+
geom_smooth(method = 'lm', lwd = .5)+
ggrepel::geom_label_repel( data = arrange(indiv.corr, desc(abs(log2FoldChange))),
aes(label = symbol), size = 2 ) +
labs(x = '5xFAD logFC', y = 'AMP-AD logFC',
title = unique(indiv.corr$module))+
facet_wrap(~facet)+
theme_bw()
Here we can see that the 5xFAD logFC are pretty small in general. The correlations are relatively weak and driven by individual genes that have relatively large changes (e.g. Heatr4, Nirp1a, Rpl39l). If we compare a different module, say STGblue, we can see a stronger relationship between the mouse and human expression.
R
m <- 'STGblue'
s <- 'female'
a <- c('4 mo','12 mo')
indiv.corr <- cor.df %>% filter(module == m, sex == s, age %in% a) %>% unnest(data) %>%
mutate( v = str_c(age, '\n', 'r = ',signif(estimate,3),' ; p = ',signif(p_value,3) ))
ggplot(indiv.corr, aes( log2FoldChange , ampad_fc ))+
geom_vline(xintercept = 0, lwd = .1)+
geom_hline(yintercept = 0, lwd = .1)+
geom_point( size = .5, color = 'darkred')+
geom_smooth(method = 'lm', lwd = .5)+
ggrepel::geom_label_repel( data = arrange(indiv.corr, desc(abs(log2FoldChange))),
aes(label = symbol), size = 2 ) +
labs(x = '5xFAD logFC', y = 'AMP-AD logFC',
title = unique(indiv.corr$module))+
facet_wrap(~v)+
theme_bw()
These correlations are much stronger (r ~ 0.3 vs 0.1 in the previous example), and there is a consistent pattern between young mice and old mice, with similar genes being expressed in similar ways (e.g. Itgax and Clec7a are up-regulated at both ages). We also saw in the Volcano plot that immune-related genes were significantly up-regulated in the 5xFAD model. These significant positive correlation suggests that the 5xFAD model captures inflammatory changes observed in human AD patients.
Challenge 3
Considering the LOAD1 (i.e. APOE4.Trem2) model results, which modules show correlation and how does it compare with 5xFAD?
R
load("fig/DEAnalysis_LOAD1.Rdata")
model_vs_ampad <- inner_join(DE_LOAD1.df, ampad_modules_fc, by = c("symbol"), multiple = "all")
cor.df <- model_vs_ampad %>%
dplyr::select(module, model, sex, age, symbol, log2FoldChange, ampad_fc) %>%
group_by(module, model, sex, age) %>%
nest(data = c(symbol, log2FoldChange, ampad_fc)) %>%
mutate(
cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]],
.x[["ampad_fc"]], method = "pearson")),
estimate = map_dbl(cor_test, "estimate"),
p_value = map_dbl(cor_test, "p.value")
) %>%
ungroup() %>%
dplyr::select(-cor_test)
model_module <- cor.df %>%
mutate(significant = p_value < 0.05) %>%
left_join(module_clusters, by = "module") %>%
dplyr::select(cluster, cluster_label, module, model, sex, age,
correlation = estimate, p_value, significant)
order.model <- c("APOE4 (Male)", "APOE4 (Female)","Trem2 (Male)", "Trem2 (Female)","APOE4Trem2 (Male)", "APOE4Trem2 (Female)")
correlation_for_plot <- model_module %>%
arrange(cluster) %>%
mutate(
module = factor(module,levels=mod),
model_sex = glue::glue("{model} ({str_to_title(sex)})"),
) %>%
arrange(model_sex) %>%
mutate(
age = factor(age, levels = c(4,8)),
model_sex = factor(model_sex,levels=order.model),
model_sex = fct_rev(model_sex),
)
data <- correlation_for_plot
ggplot2::ggplot() +
ggplot2::geom_tile(data = data, ggplot2::aes(x = .data$module, y = .data$model_sex), colour = "black", fill = "white") +
ggplot2::geom_point(data = dplyr::filter(data), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation, size = abs(.data$correlation))) +
ggplot2::geom_point(data = dplyr::filter(data, .data$significant), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation),color="black",shape=0,size=9) +
ggplot2::scale_x_discrete(position = "top") +
ggplot2::scale_size(guide = "none", limits = c(0, 0.4)) +
ggplot2::scale_color_gradient2(limits = c(-0.5, 0.5), low = "#85070C", high = "#164B6E", name = "Correlation", guide = ggplot2::guide_colorbar(ticks = FALSE)) +
ggplot2::labs(x = NULL, y = NULL) +
ggplot2::facet_grid( rows = dplyr::vars(.data$age),cols = dplyr::vars(.data$cluster_label), scales = "free", space = "free",switch = 'y') +
ggplot2::theme(
strip.text.x = ggplot2::element_text(size = 10,colour = c("black")),
strip.text.y.left = ggplot2::element_text(angle = 0,size = 12),
axis.ticks = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(angle = 90, hjust = 0, size = 12),
axis.text.y = ggplot2::element_text(angle = 0, size = 12),
panel.background = ggplot2::element_blank(),
plot.title = ggplot2::element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
plot.title.position = "plot",
panel.grid = ggplot2::element_blank(),
legend.position = "right"
)
The LOAD1 models show weaker correlation to AMP-AD module gene expression overall, and in particular anti-correlation for Consensus Cluster B (immune) modules.
Detecting functional coherence of gene sets from omics data
Most omics analyses produce data on many thousands of genomic features (i.e. transcripts, proteins, etc.) for each condition tested. Simply looking at these lists of genes and associated statistics can be daunting and uninformative. We need approaches to identify which biological functions are being impacted by a given experiment from these systems-level measurements.
Gene functional enrichment analysis describes a variety of statistical methods that identify groups of genes that share a particular biological function or process and show differential association with experimental conditions. Most approaches compare some statistically valid set of differentially expressed features to sets of functional annotations for those features. There are many different functional annotation sets available, some of the more commonly used include:
- gene function resources, such as the Gene Ontology
(i.e. GO)
- pathway databases, such as Reactome or KEGG
- disease and phenotype ontologies, such as the Human Phenotype Ontology, the Mammalian Phenotype Ontology, and the Disease Ontology
These are the resources that are the foundation for many genomics knowledge bases, such as MGI, monarch initiative, and the Alliance of Genome Resources. The precise nature of each of these resources varies, but the core information contained within each is the relationship of sets of genes to biologically meaningful annotations. These annotations are typically expertly curated from the published literature.
There are a variety of statistical approaches that can be employed to test for functional enrichment among genes identified from an omics dataset. Two of the most common broad classes of tests are over-representation analysis (ORA) and gene set enrichment analysis (GSEA). For example, consider the figure below from Zhao & Rhee, Trends in Genetics (2023). Let’s consider each in a bit more detail.
Over-representation analysis
ORA
involves statistical tests of overlap between two
lists of genes: one derived from the experiment and one from the
functional annotations. For example, one might ask what is the overlap
between the genes in an annotation class, such as “Lysosome”, and the
genes that are significantly up-regulated in a given experimental
condition. These tests usually rely on some form of Fisher’s exact test
(e.g. fisher.test()
) or hypergeometric test (e.g.
phyper()
). If the gene lists consist of a larger number of
overlapping genes than would be expected at random - given the sample
sizes involved - then there is said to be a statistical
over-representation of the annotation class within the experimental
condition.
Of course, these overlap tests are generally considered for all
annotation classes, which can number in the hundreds to thousands.
Performing this many statistical tests ensures that many will be
identified as significant by chance. Therefore, there is typically a
requirement to correct for multiple testing errors
(e.g. p.adjust()
).
There are many R packages available to handle the statistical tests
and corrections involved in ORA. Today we’ll use
clusterProfiler::enrichGO()
, which wraps statistical
testing for overlap with GO terms and multiple test correction in one
function.
Let’s start by considering the enrichments from the mouse data analyzed previously.
R
theme_set(theme_bw())
load('results/DEAnalysis_5XFAD.RData')
We’ll start by considering the genes that are significantly DE in 12 month old male animals
R
gene.list.up <- DE_5xFAD.df %>%
filter(sex == 'male',
age == '12 mo',
padj <= 0.05,
log2FoldChange > 0) %>%
arrange(desc(log2FoldChange)) %>%
filter(!duplicated(symbol), !is.na(symbol)) %>%
pull(symbol) %>%
unique()
gene.list.dn <- DE_5xFAD.df %>%
filter(sex == 'male',
age == '12 mo',
padj <= 0.05,
log2FoldChange < 0) %>%
arrange(desc(log2FoldChange)) %>%
filter(!duplicated(symbol), !is.na(symbol)) %>%
pull(symbol) %>%
unique()
length(gene.list.up)
length(gene.list.dn)
There are a total of significantly up-regulated genes and significantly down-regulated genes in this cohort. Now test for over representation of GO terms among the DEGs. First, we need to identify the universe of all possible genes which includes the genes that were both measured by the experiment and contained within the annotation set.
R
univ <- as.data.frame(org.Mm.egGO) %>%
pull(gene_id) %>%
unique() %>%
bitr(., fromType = "ENTREZID", toType = 'SYMBOL', OrgDb = org.Mm.eg.db, drop = T) %>%
pull('SYMBOL') %>%
intersect(., DE_5xFAD.df$symbol)
Now let’s test for enriched GO terms (this can take 3-4 minutes)
R
enr.up <- enrichGO(gene.list.up,
ont = 'all',
OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
universe = univ
)
enr.dn <- enrichGO(gene.list.dn,
ont = 'all',
OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
universe = univ
)
How many significant terms are identified:
R
enr.up@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
enr.dn@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
Challenge 4
How many significant terms are identified from the up-regulated gene list if you do not specify the “universe”?
R
enrichGO(gene.list.up, ont = 'all', OrgDb = org.Mm.eg.db, keyType = 'SYMBOL') %>%
.@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
Plot the top 10 significant terms:
R
cowplot::plot_grid(
dotplot(enr.dn, showCategory = 10) + ggtitle('down'),
dotplot(enr.up, showCategory = 10) + ggtitle('up')
)
From this we can see that nervous system related terms (e.g. “learning or memory”, “neurotransmitter transport”) are down in 5xFAD males at 12 months, while immune functions (e.g. “cytokine-mediated signaling pathway” and “leukocyte mediated immunity”) are up in 5xFAD males at 12 months.
Gene set enrichment analysis
GSEA
is an alternative approach that uses a statistical
measure from the omics data (e.g. the log fold change or significance)
to rank the genes. An “enrichment score” is calculated for each
annotation set based on where the genes annotated to the term sit in the
overall distribution.
Let’s analyze those same data with the
clusterProfiler::gseGO()
function. First we’ll specify the
gene list and use the log2FoldChange
value to rank the
genes in the list.
R
gene.list <- DE_5xFAD.df %>%
filter(sex == 'male',
age == '12 mo',
padj <= 0.05
) %>%
arrange(desc(log2FoldChange)) %>%
filter(!duplicated(symbol), !is.na(symbol)) %>%
pull(log2FoldChange, name = symbol)
Now we’ll test for enrichment:
R
enr <- gseGO(gene.list, ont = 'all', OrgDb = org.Mm.eg.db, keyType = 'SYMBOL')
How many significant terms are identified:
R
enr@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
Challenge 5
The tally above represents all genes, both up- and down-regulatd. To
compare between GSEA and ORA, can you identify how many GSEA enriched
terms are associated with up-regulated genes and how many are associated
with down-regulated genes? (Hint: the NES
value within the
results relates to the directionality of enrichment).
R
enr@result %>% filter(p.adjust <= 0.05, NES < 0) %>% pull(ID) %>% unique() %>% length()
enr@result %>% filter(p.adjust <= 0.05, NES > 0) %>% pull(ID) %>% unique() %>% length()
Plot the term enrichments
R
ridgeplot(enr, core_enrichment = T, showCategory = 15)
This plot shows the top 15 GSEA enriched terms. The displayed terms
are all from the up-regulated class (i.e. NES > 0
), and
consist of many immune-relevant terms (e.g. “immune response” and
“cytokine production”).
Common pitfalls & best practices
These kinds of functional enrichment analyses are very common, but not all results reported are equal! A recent paper describes an “Urgent need for consistent standards in functional enrichment analysis”. They examine how the results of functional enrichment analyses are reported in the literature, and identify several common shortcomings. Watch out for these common mistakes when performing and reporting your own analyses!
Alzheimers’s Disease Biological Domains
While these results are informative and help to provide essential biological context to the results of the omics experiment, it is difficult to understand all of the enriched terms and what that means for the biology of disease. It would be useful to group each of the enriched terms within broader areas of disease biology. There are several common molecular endophenotypes that are detected within human Alzheimer’s cohorts across omics data types (transcriptomics, proteomics, and genetics). Several of these endophenotypic areas are shown in the figure below (source: Jesse Wiley).
These endophenotypes describe molecular processes that are dysregulated by or during the disease process. Very similar sets of endophenotypes have been cataloged among the targets of therapeutics in clinical trials for AD (see below, Cummings et al, Alzheimer’s disease drug development pipeline: 2023, Alz Dem TRCI.
In order to formalize and operationalize these endophenotypic areas, the Emory-Sage-SGC center of the TREAT-AD consortium has developed the Biological Domains of AD. In the enumeration of these domains, several criteria were established; the biological domains defined should be:
-
objective: leverage existing well-annotated resources
-
automatable: in CADRO, therapeutics and targets are
manually assigned
-
intelligible: groupings should be easy to understand
- modifiable: definitions should be (and are!) continuously be updated.
In all, 19 distinct biological domains (aka biodomains) have been
identified. These biological domains are defined using sets of GO terms
that align with the endophenotype. For example, terms like “cytokine
receptor binding” and “leukocyte migration” are annotated to the
Immune Response
biodomain, while terms like “postsynaptic
density” and “synaptic vesicle cycle” are annotated to the
Synapse
biodomain. Of all terms in the GO, 7,127 terms
(16.4%) are annotated to one of the biological domains. Because the GO
terms have genes annotated, genes can be associated with specific
endophenotypes via the biological domain groupings of terms. In all,
18,289 genes are annotated to at least 1 biodomain term. While the
biodomains exhibit very few overlapping GO terms (panel A), due to gene
pleiotropy (etc) the number of overlapping genes between biological
domains is quite a bit higher (panel B). The development of the
biological domains is described in more detail in the published paper Genetic
and Multi-omic Risk Assessment of Alzheimer’s Disease Implicates Core
Associated Biological Domains.
Download and explore the biological domain annotations
First let’s download the biodomain definition file from synapse.
R
#synLogin()
# biodomain definitions
biodom <- readRDS(synGet('syn25428992')$path)
dom.lab <- read_csv(synGet('syn26856828')$path)
What is in the file?
R
head(biodom)
You can see the file is a list of GO term accessions
(GO_ID
) and names (GOterm_Name
) as well as the
corresponding endophenotype (Biodomain
) to which the term
is annotated. There are also gene annotations for each term. These
annotation sets come from two sources: (1) the symbol
and
uniprot
annotations are drawn directly from the Gene
Ontology via the provided API, (2) ensembl_id
,
entrez_id
, and hgnc_symbol
are from BioMart
annotations (e.g. biomaRt::getLDS()
).
We can see how many GO terms are annotated to each biodomain:
R
biodom %>% group_by(Biodomain) %>% summarise(n_term = length(unique(GO_ID))) %>% arrange(n_term)
The biodomains range in size from Tau Homeostasis
(10
terms) up to Synapse
(1,379 terms).
We can also investigate the individual genes annotated to each biodomain GO term.
R
biodom %>% filter(GOterm_Name == 'amyloid-beta formation') %>% pull(symbol) %>% unlist()
So the genes associated with amyloid-beta formation
within the APP Metabolism
biodomain include PSEN1, PSEN2,
BACE1, ABCA7, NCSTN, and others.
Annotate enriched terms with biological domain
Let’s re-characterize the 5xFAD functional enrichments we computed previously with the biological domain annotations and see if we can get more context about what is changing in that model. We’ll focus on the GSEA results and start by annotating the results with biodomain groupings.
R
enr.bd = enr@result %>%
left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%
relocate(Biodomain)
head(enr.bd[,1:4])
Not all of the enriched terms are annotated to a biological domain.
Some terms are too broad and not specific (e.g. ‘defense response’),
while others may not have been captured by a biological domain
annotation yet (e.g. ‘regulation of immune system process’). Remember
that the conception of the biodomains involved a requirement that they
be modifiable, and these terms may be added to the biodomain
definintions in the future. Let’s modify the Biodomain
value for terms that aren’t annotated to a domain to ‘none’.
R
enr.bd$Biodomain[is.na(enr.bd$Biodomain)] <- 'none'
head(enr.bd[,1:4])
How many terms are enriched from each domain?
R
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
)
arrange(bd.tally, desc(n_sig_term))
Many enriched terms don’t map to a domain (), but most do (). Of
those that do, the vast majority map into the
Immune Response
biodomain.
We can plot the enrichment results, stratified by biodomain:
R
enr.bd_plot <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>%
arrange(Biodomain, p.adjust)
ggplot(enr.bd_plot, aes(NES, Biodomain)) +
geom_violin(data = subset(enr.bd_plot, NES > 0),
aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0),
aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(p.adjust), fill = color),
color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
Challenge 6
How could you plot the results from the ORA to show biodomain enrichements?
R
enr.ora = bind_rows(
enr.up@result %>% mutate(dir = 'up'),
enr.dn@result %>% mutate(dir = 'dn')
) %>%
left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%
relocate(Biodomain)
enr.ora$Biodomain[is.na(enr.ora$Biodomain)] <- 'none'
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
n_sig_term = enr.ora$ID[ enr.ora$Biodomain == domain ] %>% unique() %>% length()
)
enr.ora <- full_join(enr.ora, dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>%
arrange(Biodomain, p.adjust) %>%
mutate(signed_logP = -log10(p.adjust),
signed_logP = if_else(dir == 'dn', -1*signed_logP, signed_logP))
ggplot(enr.ora, aes(signed_logP, Biodomain)) +
geom_violin(data = subset(enr.ora, dir == 'up'), aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.ora, dir == 'dn'), aes(color = color), scale = 'width')+
geom_jitter(aes(size = Count, fill = color), color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
Based on the gene list (up or down) we can add a sign to the
significance. When we plot this we can see there are many more
significantly enriched terms from the ORA. The dominant signal is still
the up-regulation of terms from the Immune Response
biodomain. There is also nearly exclusive up-regulation of terms from
the Autophagy
, Metal Binding and Homeostasis
,
Oxidative Stress
, and APP Metabolism
domains.
The most down-regulated terms are from the Synapse
biodomain.
Comparing Human and Mouse functional enrichments
Finally let’s consider how we can compare the functional enrichments in the mouse models with functional enrichments from the AMP-AD cohorts. We can first take a look at the functional enrichments from the different human cohorts. To save some time, let’s load some pre-computed enrichment results for mouse & human data.
R
mm.gsea <- bind_rows(
read_tsv('data/5xFAD_gsea_results.tsv', col_types = cols()) %>%
mutate(age = str_remove_all(age, ' mo') %>% as.double()),
read_tsv('data/LOAD1_gsea_results.tsv', col_types = cols())
)
hs.gsea <- read_tsv('data/Hsap_gsea_results.tsv', col_types = cols())
The cohorts and tissues available are:
R
hs.gsea %>% select(Study, Tissue) %>% distinct()
OUTPUT
# A tibble: 6 × 2
Study Tissue
<chr> <chr>
1 MAYO CBE
2 MAYO TCX
3 MSSM STG
4 MSSM PHG
5 MSSM IFG
6 ROSMAP DLPFC
We can focus on one tissue per cohort, let’s look at TCX for Mayo, PHG for Mt Sinai, and DLPFC for ROSMAP.
R
enr.bd = hs.gsea %>%
filter(Tissue %in% c('TCX','PHG','DLPFC')) %>%
left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%
relocate(Biodomain) %>%
mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain))
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
)
enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>%
filter(!is.na(Study)) %>%
mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>%
arrange(Biodomain, p.adjust)
ggplot(enr.bd, aes(NES, Biodomain)) +
geom_violin(data = subset(enr.bd, NES > 0), aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd, NES < 0), aes(color = color), scale = 'width')+
geom_point(aes(size = -log10(p.adjust), fill = color, group = ID),
color = 'grey20', shape = 21, alpha = .5, position = position_jitter(width = .25, seed = 2) )+
geom_vline(xintercept = 0, lwd = .1)+
facet_wrap(~paste0(Study, ': ', Tissue))+
theme(legend.position = 'top')+
scale_fill_identity()+scale_color_identity()
Huh, the Mayo and MSSM look largely similar, but the ROSMAP DLPFC doesn’t have very many enriched terms. Let’s ignore the ROSMAP result for now and compare the terms enriched between Mayo TCX and the 5xFAD males. First we can join up the enriched terms between the sets into one data frame.
R
comb_5x = bind_rows(
hs.gsea %>%
filter(Study == 'MAYO', Tissue == 'TCX') %>%
mutate(species = 'hs') %>%
select(ID, Description, species, NES, p.adjust),
mm.gsea %>%
filter(model == '5xFAD', sex == 'male', age == 12) %>%
mutate(species = 'mm') %>%
select(ID, Description, species, NES, p.adjust)) %>%
pivot_wider(id_cols = c(ID, Description),
names_from = species,
values_from = NES
) %>%
unnest(cols = c(hs,mm)) %>%
mutate(across(c(hs,mm), ~ if_else(is.na(.x), 0, .x)))
head(comb_5x)
OUTPUT
# A tibble: 6 × 4
ID Description hs mm
<chr> <chr> <dbl> <dbl>
1 GO:0001568 blood vessel development 3.20 0
2 GO:0048514 blood vessel morphogenesis 3.18 0
3 GO:0001944 vasculature development 3.17 0
4 GO:0001525 angiogenesis 3.16 0
5 GO:0062023 collagen-containing extracellular matrix 3.15 0
6 GO:0042060 wound healing 3.06 0
Great! Now let’s annotate the biodomains and plot it up:
R
enr.bd <- comb_5x %>%
left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%
relocate(Biodomain) %>%
mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain))
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
)
enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(Biodomain = factor(Biodomain,
levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>%
arrange(Biodomain)
ggplot(enr.bd, aes(hs, mm)) +
geom_point(aes(fill = color),color = 'grey20', shape = 21, alpha = .5)+
geom_vline(xintercept = 0, lwd = .1)+
geom_hline(yintercept = 0, lwd = .1)+
labs(x = 'Mayo TCX\nGSEA NES', y = '12mo male 5xFAD\nGSEA NES')+
facet_wrap(~Biodomain, scales = 'free')+
scale_fill_identity()
There are several terms from multiple biodomains that are enriched in similar directions (i.e. from the up-regulated genes in both human AD and 5xFAD). There are fewer that are enriched in opposite directions (i.e. up in human AD and down in 5xFAD, e.g. see Vasculature and Mitochondrial Metabolism), but there are many terms that are unique to either the human or the mouse data set.
Overall, by aligning human and mouse omics signatures through the lens of domains affected in each context, we can get a better understanding of the relationships between the biological processes affected in each context.
Session Info
R
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.16.2 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[9] tibble_3.2.1 tidyverse_2.0.0 ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] bit_4.5.0 gtable_0.3.6 crayon_1.5.3 compiler_4.4.1
[5] renv_1.0.11 tidyselect_1.2.1 parallel_4.4.1 scales_1.3.0
[9] yaml_2.3.10 R6_2.5.1 generics_0.1.3 knitr_1.48
[13] munsell_0.5.1 pillar_1.9.0 tzdb_0.4.0 rlang_1.1.4
[17] utf8_1.2.4 stringi_1.8.4 xfun_0.48 bit64_4.5.2
[21] timechange_0.3.0 cli_3.6.3 withr_3.0.1 magrittr_2.0.3
[25] grid_4.4.1 vroom_1.6.5 hms_1.1.3 lifecycle_1.0.4
[29] vctrs_0.6.5 evaluate_1.0.1 glue_1.8.0 fansi_1.0.6
[33] colorspace_2.1-1 tools_4.4.1 pkgconfig_2.0.3
Key Points
- AMP-AD gene modules represent major transcriptomic heterogeneity in AD
- Correlation of logFC is a practical approach for human-mouse alignment of AD-associated transcriptomic signatures
- Functional gene set signatures are also a useful point of comparison between species
- Subsetting functions into associated Biological Domains allows for more granular comparisons.