Cross Species Functional Alignment
Last updated on 2026-02-12 | Edit this page
Estimated time: 180 minutes
Overview
Questions
- How do we perfrom a cross-species comparison?
- What transcriptomic changes do we observe in mouse models?
- Which aspects of disease does a model capture?
Objectives
- Approaches to align mouse data to human data
- Understand the human AD co-expression modules
- Understand the data from AD mouse models
- Perform differential analysis using DESeq2
- Perform correlation analysis between mouse models and human modules
- Understand the biological domains and subdomains of AD
- Use domain annotations to compare between species
Author: Gregory Cary, Jackson Laboratory
Setup
First, let’s install the necessary packages into the workspace, starting with the annotation packages:
R
# Mouse and Human annotation databases
# BiocManager::install(c('org.Mm.eg.db','org.Hs.eg.db', 'GO.db'))
Next we’ll install the fgsea and
clusterProfiler packages, which will be used for GO term
enrichment.
R
# fgsea package
# BiocManager::install(c('fgsea', 'clusterProfiler'))
Finally we can instally the synapseclient python package
using the reticulate R package, this will enable us
comand-line access to objects in the Synapse data repository.
R
# synapseclient python package
# reticulate::py_install('synapseclient')
Finally we’ll need to generate a personal access token on Synapse.
Login to Synapse, go to
Your Account > Account Settings > Personal Access Tokens > Manage Personal Access Tokens.
Click on Create New Token and make sure to enable both View
and Download permissions. Paste the resulting PAT in the quotes below
and save it to your workspace.
R
# synToken <-
NOTE: This is for learning purposes only. Storing
your PAT in an environmental variable in your workspace is not secure
and you could easily loose access to the PAT. The preferred method is to
store it in a separate file called .synapseConfig under
your home directory.
Now we should be able to login to Synapse through our R session using the following commands:
R
# import the synapseclient python package
syn.client <- reticulate::import('synapseclient')
syn <- syn.client$Synapse()
# log in to Synapse
# syn$login('', authToken = synToken)
Finally, let’s load the R packages we’ll need for today’s lesson:
R
# load necessary libraries for the analysis
suppressPackageStartupMessages( library(DESeq2) )
suppressPackageStartupMessages( library(org.Mm.eg.db) )
suppressPackageStartupMessages( library(org.Hs.eg.db) )
suppressPackageStartupMessages( library(clusterProfiler) )
suppressPackageStartupMessages( library(fgsea) )
suppressPackageStartupMessages( library(tidyverse) )
# set ggplot plotting theme (personal preference)
theme_set( theme_bw() )
[1] Aligning Human and Mouse Phenotypes
Alzheimer’s Disease (AD) is complex, and we can not expect any single animal model to fully recapitulate all aspects of late onset AD (LOAD) pathology. To study AD with animal models we must find dimensions through which we can align phenotypes between the models and human cohorts. In MODEL-AD we use the following data modalities to identify commonalities between mouse models and human cohorts:
- Imaging (i.e. MRI and PET) to correspond with human imaging studies
(e.g. ADNI)
- Neuropathological and biomarker phenotypes
- Lots of ’omics — genomics, proteomics, and metabolomics
The ’omics comparisons allow for very rich comparisons because a significant proportion of genes are shared between these two species. Furthermore, homology at the anatomical and neuropathological levels is less clear.

In this session we will explore several ways to compare ’omics signatures between human AD patients and mouse models. We’ll focus on transcriptomic alignment for this session, but we’ll consider other modalities in later sessions. We’ll consider several different approaches to compare gene expression between human cohorts and model systems:
- Correlation of genes within human co-expression modules
- correlations will be generally weak for all expression, but animal
models may recapitulate specific aspects of the disease
- we can use subsets of genes from co-expression modules, which
represent genes expressed in similar patterns in AD, and look for
correlations within these subsets
- Correlation of functional enrichment results
- another approach is to consider the functional annotation enriched among differentially expressed genes in human and mouse.
- we can similarly sub-divide these groups of co-functional genes into biological domains to aid our interpretation
Let’s start by briefly reviewing how to assess differential
expression in our mouse RNA-seq datasets using the DESeq2
package. Then we’ll move on to discussing the background of the human
cohorts and co-expression modules.
[2] Differential Expression Analysis in AD Mouse Models
Let’s analyze the 5xFAD RNA-seq expression data we explored yesterday. Specifically, we want to know which genes are differentially expressed at each age as a result of the transgenes that constitute the 5xFAD model.
The 5xFAD mouse model is a 5x transgenic model consisting of mutatnt human transgenes of the amyloid precursor protein (APP) and presenilin 1 (PSEN1) genes. The specific variants are all causal variants for Familial Alzheimer’s Disease (FAD) and include three variants in the APP gene - Swedish (K670N, M671L), Florida (I716V), and London (V717I) - and two in the PSEN1 gene - M146L and L286V. The expression of both transgenes is under control of the neural-specific elements of the mouse Thy1 promoter, which drives overexpression of the transgenes in the brain. More information about this generation and maintenance of this strain can be obtained from the JAX strain catalog.
This model has been extensively characterized by the MODEL-AD consortium, and others, including the study that we explored yesterday (i.e. syn21983020). For more information about this specific MODEL-AD study, see the publication by Oblak et al (2021). The primary patho-physiological phenotypes are summarized in the figure below from AlzForum, and include (1) early deposition of amyloid plaques, (2) gliosis and neuroinflammation, (3) synaptic changes and cognitive impairment, and (4) neuronal loss, specifically in cortical layer V and the subiculum. Importantly, Tau tangles are absent from this model.

But what else can the RNA-seq data tell us about the
transcriptomic response to the 5xFAD model in the brain? To know more,
we need to assess the differentially expressed transcripts. We’ll use
the DESeq2 package to perform differential expression
analysis of the 5xFAD RNA-seq data.
read 5xFAD RNA-seq count data
First, let’s re-access the RNA-seq data and metadata from Synapse
R
# RNA-seq counts
counts <- syn$get('syn22108847') %>% .$path %>% read_tsv()
# biospecimen metadata
meta <- syn$get('syn22103213') %>% .$path %>% read_csv() %>%
select(individualID, specimenID)
counts <- read_tsv("data/htseqcounts_5XFAD.txt",
show_col_types = FALSE)
# individual metadata
ind_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_individual_metadata.csv",
show_col_types = FALSE)
# biospecimen metadata
bio_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_biospecimen_metadata.csv",
show_col_types = FALSE)
# assay metadata
rna_meta <- read_csv("data/Jax.IU.Pitt_5XFAD_assay_RNAseq_metadata.csv",
show_col_types = FALSE)
# individual metadata, joined to the above
meta <- syn$get('syn22103212') %>% .$path %>% read_csv() %>%
left_join(., meta, by = 'individualID') %>%
filter(!is.na(specimenID))
We can modify the metadata to only include covariates we’ll need for this analysis
R
# order rows that have corresponding IDs in the counts table
covars <- meta %>% slice( match(colnames(counts[,-1]), specimenID) )
# compute the age of animals in months
covars <- covars %>%
mutate(
dateBirth = mdy(dateBirth),
dateDeath = mdy(dateDeath),
age = interval(dateBirth,dateDeath) %/% months(1))
# change the group variable based on the animal genotype
covars <- covars %>%
mutate( group = if_else(genotype == '5XFAD_carrier', '5xFAD', 'WT') )
# finally, only keep the columns we'll need
covars <- covars %>% select(specimenID, group, sex, age)
head(covars)
First, let’s make sure we have all relevant metadata
R
all(colnames(counts[,-1])==covars$specimenID)
How many animals do we have in each group?
R
covars %>% group_by(group, age, sex) %>% summarise(n = n())
Looks like we have 6 samples each from two genotypes (5xFAD or WT), three ages (4 months, 6 months, and 10 months), and both sexes, for a total of 72 samples.
accounting for transgenes
The 5xFAD model has two copies each of the APP and PSEN1 genes - one endogenous mouse gene, and the orthologous human transgene. The RNA-seq data was assessed using a custom transcriptome definition that included the sequences of both the mouse and human versions of each gene.
Ultimately we are going to sum the counts from both ortholgous genes (human APP and mouse App; human PSEN1 and mouse Psen1). But first, let’s look at the expression of each of these genes in the different groups. To start we’ll filter the counts down to just those four relevant gene IDs and join the counts up with the covariates to explore the expression of these genes.
R
tg.counts <- counts %>%
filter(gene_id %in% c("ENSG00000080815","ENSMUSG00000019969",
"ENSG00000142192","ENSMUSG00000022892")) %>%
pivot_longer(.,cols = -"gene_id",names_to = "specimenID",values_to="counts") %>%
left_join(covars ,by="specimenID")
head(tg.counts)
Let’s do a little data housekeeping:
R
# make an age column that is a factor and re-order the levels
tg.counts <- tg.counts %>%
mutate(
age.m = str_c(age, 'm'),
age.m = factor(age.m, levels = c('4m','6m','10m'))
)
# add gene symbols
tg.counts <- tg.counts %>%
mutate(
symbol = case_when(
gene_id == "ENSG00000142192" ~ "Human APP",
gene_id == "ENSG00000080815" ~ "Human PSEN1",
gene_id == "ENSMUSG00000022892" ~ "Mouse App",
gene_id == "ENSMUSG00000019969" ~ "Mouse Psen1"
)
)
Okay, now let’s plot the counts for each gene across the samples:
R
ggplot(tg.counts, aes(x=age.m, y=counts, color=group, shape = sex)) +
geom_boxplot() +
geom_point(position=position_jitterdodge())+
facet_wrap(~symbol, scales = 'free')+
theme_bw()
The human transgenes all have a counts of zero in the WT animals (where the transgenes are absent), while the endogenous mouse genes are expressed relatively consistently across both groups.
Let’s combine the expression of corresponding human and mouse genes by summing the expression and saving the summed expression as expression of mouse genes, respectively to match with gene names in control mice.
R
# move the gene_id column to rownames, to enable summing across rows
counts <- counts %>% column_to_rownames("gene_id")
#merge mouse and human APP gene raw count
counts[rownames(counts) %in% "ENSMUSG00000022892",] <-
counts[rownames(counts) %in% "ENSMUSG00000022892",] +
counts[rownames(counts) %in% "ENSG00000142192",]
counts <- counts[!rownames(counts) %in% c("ENSG00000142192"),]
#merge mouse and human PS1 gene raw count
counts[rownames(counts) %in% "ENSMUSG00000019969",] <-
counts[rownames(counts) %in% "ENSMUSG00000019969",] +
counts[rownames(counts) %in% "ENSG00000080815",]
counts <- counts[!rownames(counts) %in% c("ENSG00000080815"),]
We can confirm that the human genes are now absent from the counts table:
R
counts[,1:6] %>% filter(!str_detect(rownames(.), "MUS"))
prepare data and run DESeq analysis
Next we’ll prepare the data for differential expression analysis. We’ll use DESeq2 today, though there are other approaches. Another disclaimer: there are multiple steps to a DESeq2 analysis and we’re not going to get into nitty-gritty details here. We’ll briefly cover some of the basics, but for more information, please refer to the DESeq2 vignette.
Let’s perform this analysis stratified by age group while controlling for the sex of the animals as a covariate. We can start with the youngest animals (4 months old). Let’s sub-set the data and covariates to these data:
R
covars.4m <- covars %>% filter(age == 4)
counts.4m <- counts[,colnames(counts) %in% covars.4m$specimenID]
Next we’ll build the DESeq object
R
ddsHTSeq <- DESeqDataSetFromMatrix(countData=counts.4m,
colData=covars.4m,
design = ~group+sex)
R
ddsHTSeq
Now we have a DESeqDataSet object covering counts data
for 55k genes across 24 mice.
Let’s take a closer look at the counts that go into this object
R
counts.4m[1:5,1:5]
You can see that ENSMUSG00000000003 has 0 reads across
the samples listed here. Let’s find out how many genes are
0 counts across all samples.
R
gene_sums <- data.frame(gene_id = rownames(counts),
sums = Matrix::rowSums(counts))
sum(gene_sums$sums == 0)
We can see that 9691 (17.5%) genes have no reads at all. Let’s filter these out. While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.
Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads in at least 6 separate samples.
R
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq) >= 10) >= 6,]
Challenge 1
What proportion of the 55k genes we started with remain after this filter?
R
ddsHTSeq
There are 24765 genes, or 44.6% (24765/55487).
Let’s also make sure DESeq knows which group is our control or
“reference” group. By default this is arbitrarily assigned to the first
group in the factor. We can use the relevel function to set
the reference group to “WT” (wild type).
R
ddsHTSeq$group <- relevel(ddsHTSeq$group,ref="WT")
Now we’re ready to run the differential expression analysis; it’ll take a few seconds to process this step:
R
dds <- DESeq(ddsHTSeq, parallel = TRUE)
What are the results that have been computed:
R
resultsNames(dds)
Because we specified design = ~ group + sex when setting
up the DESeq object, we now have the results from these two contrasts.
We can pull the results specifically for the 5xFAD vs WT comparison
using the results function. The results table contains the
log2 fold change, p-value, and adjusted p-value for each gene in the
analysis.
R
res <- results(dds, contrast = c('group','5xFAD','WT'), alpha=0.05)
fad.res.4m <- as.data.frame(res)
We can get a summary of the DE results using the summary
function:
R
summary(res)
Ok, there are a total of 425 significantly differentially expressed genes at 4 months of age when comparing 5xFAD to WT brain tissue. The vast majority of these (392) are expressed at higher levels in 5xFAD mice relative to WT. Let’s take a look at the most significantly DE genes by ordering the results table by the adjusted p-value:
R
fad.res.4m %>% arrange(padj) %>% select(log2FoldChange, padj) %>% head()
This is a little difficult to interpret given all genes are
identified by their ENSEMBL IDs. Let’s map in the gene symbols using the
org.Mm.eg.db package. This package contains a mapping of
ENSEMBL IDs to gene symbols, and we can use the
AnnotationDbi::mapIds function to get the gene symbols for
our results table.
R
fad.res.4m$symbol <- AnnotationDbi::mapIds(
org.Mm.eg.db::org.Mm.eg.db,
keys = rownames(fad.res.4m),
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first"
)
fad.res.4m %>% arrange(padj) %>% select(symbol, log2FoldChange, padj) %>% head()
Ok! Now we can see that among the most significantly up-regulated genes in 5xFAD mice brains at 4 months of age are Psen1 and App (which we saw previously), along with Thy1, Cst7, Ccl6, and Clec7a. Let’s plot all DE genes:
R
ggplot(fad.res.4m, aes(log2FoldChange, -log10(padj)))+
geom_vline(xintercept = 0, lwd = .1)+
geom_point(alpha = .3, aes(color = (padj < 0.05 & abs(log2FoldChange) > log2(1.2)) ), show.legend = F)+
scale_color_manual(values = c('grey','red'))+
ggrepel::geom_text_repel(
data = subset(fad.res.4m, padj < 0.05 & abs(log2FoldChange) > log2(1.2)),
aes(label = symbol), min.segment.length = 0)+
theme_bw()
Let’s put it all together and compute results for each age group.
This code does all of the steps outlined above for each age cohort. The
input data, DESeq objects, and result tables are stored in
columns of a data frame. Feel free to copy and paste this code, but
investigate it and be sure you understand what each part is doing. It
may take about a minute or two to complete.
R
st = Sys.time()
fad.deg = tibble(
age = c(4,6,10),
meta = map(age, ~ covars %>% filter(age == .x)),
counts = map(meta, ~ counts %>% select(.x %>% pull(specimenID)))
) %>%
rowwise() %>%
mutate(
dds = DESeq2::DESeqDataSetFromMatrix(
countData = counts,
colData = meta,
design = ~ sex + group) %>% list()
) %>%
ungroup() %>%
mutate(
dds = map(dds, ~ .x[rowSums(counts(.x) >= 10) >= 6,]),
dds = map(dds, ~ {.x$group = relevel(.x$group, ref = 'WT'); .x}),
dds = map(dds, ~ DESeq2::DESeq(.x)),
res = map(dds, ~ results(.x, contrast = c('group','5xFAD','WT'), alpha=0.05)),
res.t = map(res, ~ as.data.frame(.x) %>%
rownames_to_column('Ensembl_gene_id') %>%
mutate(symbol = AnnotationDbi::mapIds(
org.Mm.eg.db::org.Mm.eg.db,
keys = Ensembl_gene_id,
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")))
)
ed = Sys.time() - st
print(ed)
What information does the fad.r tibble contain? Let’s
take a look:
R
glimpse(fad.deg)
The first column is the age brackets in integers. The second and third columns are the metadata and count data for each age cohort (24 samples per age bracket). The fourth column is the DESeq object and the fifth and sixth columns are the results from the DE analysis.
So by running:
R
summary(fad.deg$res[[2]])
We can see that there are more significantly DE genes at 6 months than we saw at four months.
Challenge 2
How many up- and down-regulated genes are found for each age?
R
map_dbl(fad.deg$res.t,
~.x %>% filter(padj <= 0.05, log2FoldChange > 0) %>%
pull(Ensembl_gene_id) %>% length)
map_dbl(fad.deg$res.t,
~.x %>% filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id) %>% length)
There are between 392 and 1855 up-regulated genes, and between 33 and 1827 down-regulated genes.
Challenge 3
How many down-regulated genes overlap between each timepoint?
The comparisons to check are 4m+6m, 4m+10m, and 6m+10m; these are in rows 1+2, 1+3, and 2+3, respectively.
R
length(intersect(
fad.deg$res.t[[1]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id),
fad.deg$res.t[[2]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id)
))
length(intersect(
fad.deg$res.t[[1]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id),
fad.deg$res.t[[3]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id)
))
length(intersect(
fad.deg$res.t[[2]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id),
fad.deg$res.t[[3]] %>%
filter(padj <= 0.05, log2FoldChange < 0) %>%
pull(Ensembl_gene_id)
))
It looks like there are 22, 22, and 424 overlapping down-regulated genes, respectively.
This is a good point to save these data before we move on.
R
saveRDS(fad.deg, here::here("5xFAD_DESeq_analysis.rds"))
[3] Overview of Human cohort data
Now that we have our mouse transcriptomes analyzed, let’s switch gears and think about the human datasets. The Accelerating Medicines Partnership-Alzheimer’s Disease Consortium (AMP-AD) has generated extensive sets of ’omics data from a variety of human Alzheimer’s Disease cohorts. AMP-AD researchers are applying systems biology approaches toward the goal of elucidating AD mechanisms and highlighting potential therapeutic targets.
There are three large, independent human cohorts that are part of AMP-AD:
- The Religious Orders Study and the Memory and Aging Project (ROSMAP, syn3219045)
- Mount Sinai Brain Bank (MSBB,
syn3159438)
- Mayo Clinic (Mayo, syn5550404)
These studies have collected postmortem RNA-seq profiles from over
1,200 individuals spanning seven distinct brain regions:
- dorsolateral prefrontal cortex (DLPFC)
- temporal cortex (TCX)
- inferior frontal gyrus (IFG)
- superior temporal gyrus (STG)
- frontal pole (FP)
- parahippocampal gyrus (PHG)
- cerebellum (CBE)
These samples are generally balanced for AD, MCI, and non-affected controls. The data provide a broad assessment on how AD affects multiple brain regions in 3 different populations in the US.

Overview of Human Consensus RNA-Seq Coexpression Modules
Wan, et al. (2020) performed meta analysis including all available AMP-AD RNA-seq datasets and systematically defined correspondences between gene expression changes associated with AD in human brains. Briefly, the RNA-seq read libraries were normalized and covariates were adjusted for each human study separately before testing for differential expression using fixed/mixed effects modeling to account for batch effects. The expression data from each brain region was used to perform co-expression analysis using a variety of different algorithms, generating in total 2,978 co-expression modules across all tissues. Of these, 660 modules showed enrichment for at least one AD-specific differentially expressed gene from the meta-analysis of all cases compared to controls.
Wan et al clustered these modules together using network analyses and found 30 co-expression modules related to LOAD pathology. Among the 30 aggregate co-expression modules, five consensus clusters were described that span brain region and study cohorts. These consensus clusters consist of subsets of modules which are associated with similar AD related changes across brain regions and cohorts. Wan et al looked for enrichment of cell -type signatures within the modules using expression-weighted cell type enrichment analysis (Skene and Grant (2016)) and examined enrichment of functional annotations within the modules.

This figure shows a matrix view of gene content overlap between the 30 co-expression modules. You’ll note a few strongly overlapping group of modules, implicating similar expression profiles in different studies and brain regions, which are the consensus clusters (A-E).
The first module block (consensus cluster A) is enriched for signatures of astrocytes, while the next block (consensus cluster B) is enriched for signatures of other cell types, including endothelial and microglial expressed genes, suggesting an inflammation component. The third module block (consensus cluster C) is strongly enriched for signatures of neuronal gene expression, linking the modules within this cluster to neurodegenerative processes. The fourth module block (consensus cluster D) is enriched for oligodendrocyte and other glial genes, indicating myelination and other neuronal support functions associated with these modules. Finally, consensus cluster E contains mixed modules that don’t have clear cell type enrichments, but do have enrichments for processes involved in response to stress or unfolded proteins. Stress response is not cell specific, so the expression represented by these modules may be throughout many cells in the brain.
Accessing AMP-AD module data
These AMP-AD co-expression modules are very useful for making comparisons between animal models and the human cohorts. In order to use these modules to make the comparisons, we’ll need to download data pertaining to the 30 co-expression modules. These data are available from the Synapse data repository (syn11932957); let’s download and take a closer look at the data.
R
query <- syn$tableQuery("SELECT * FROM syn11932957")
module_table <- read_csv(query$filepath)
head(module_table)
Here you see 9 columns in this table. Columns we’re interested in are:
- column 3:
GeneIDcontains Ensembl gene IDs - column 4:
Moduleis the module name in which gene is clustered - column 7:
brainRegionis the tissue of the corresponding module - column 9:
external_gene_nameare gene symbols
How many unique modules are in the table?
R
length(unique(module_table$Module))
What are the names of the modules?
R
unique(module_table$Module)
How many genes are in each module?
R
table(module_table$Module)
We can visualize this as bar plot using ggplot2 package.
R
ggplot(module_table,aes(y=Module)) +
geom_bar() +
theme_bw()
Challenge 4
What are other ways to count genes in each module?
You could also try:
R
dplyr::count(module_table ,Module)
We can also check the total number of unique genes in the table
R
length(unique(module_table$GeneID))
Mouse orthologs of Human module genes
In the module table we’ve downloaded we have human ENSEMBL ids and human gene symbols. In order to compare between human and mouse models, we will need to identify the corresponding (i.e. orthologous) mouse gene IDs. We are going to add the gene IDs of orthologous genes in mouse to the corresponding human genes in the module table.
Orthology mapping can be tricky, but thankfully Wan et al have already identified mouse orthologs for each of the human genes using the HGNC Comparison of Orthology Predictions (HCOP) tool. While there are a variety of different ways to get data about gene orthology, for the sake of simplicity we are going to read that table from Synapse (syn17010253).
R
mouse.human.ortho <- syn$get("syn17010253")$path %>% read_tsv()
head(mouse.human.ortho)
There are 15 columns with various gene identifiers for each species.
We’ll add mouse gene symbols from the ortholog table to the module table
by matching the human ENSEMBL IDs from both tables
(i.e. GeneID from the module table and
human_ensembl_gene from the orthology table).
R
module_table$Mouse_gene_symbol <-
mouse.human.ortho$mouse_symbol[
match(module_table$GeneID,mouse.human.ortho$human_ensembl_gene)
]
Taking a look at the module table, we can see the new column of mouse orthologs we just added.
R
head(module_table)
Some genes don’t have identified orthologs. Also there’s some redundant information. Let’s only keep the columns of interest and rows that contain a mouse ortholog mapping:
R
ampad_modules <- module_table %>%
distinct(tissue = brainRegion, module = Module, gene = GeneID, Mouse_gene_symbol) %>%
filter(Mouse_gene_symbol != "")
Take a look at this new data table:
R
head(ampad_modules)
Challenge 5
How many human genes are we removing that don’t have identified orthologs?
R
dplyr::filter(module_table, is.na(Mouse_gene_symbol)) %>%
dplyr::select(external_gene_name) %>%
dplyr::distinct() %>%
nrow()
2998 genes
Reading differential expression result of human data from meta-analysis
Now we know the genes that are in each AMP-AD co-expression cluster, along with the ID of the corresponding orthologous genes in mouse. We’ll also need to know how the expression of these genes change in AD.
We’ll download the results from differential expression analysis of reprocessed AMP-AD RNA-seq data (all 7 brain regions). Log fold change values for human transcripts can be obtained from Synapse (syn14237651).
R
ampad_modules_raw <- read_tsv(syn$get("syn14237651")$path)
head(ampad_modules_raw)
Data from which tissues are in this table?
R
unique(ampad_modules_raw$Tissue)
All 7 brain regions are represented.
The AMP-AD data has been processed many ways and using different models and comparisons. Let’s take a look at how many ways the data have been analyzed:
R
ampad_modules_raw %>% select(Model, Comparison) %>% distinct()
For our analyses we’ll use data from the “Diagnosis” model and
comparisons between AD cases vs controls. We’ll filter the table for
these conditions and only keep the three columns we’ll need:
Tissue, Gene and logFC.
R
ampad_fc <- ampad_modules_raw %>%
as_tibble() %>%
filter(Model == "Diagnosis", Comparison == "AD-CONTROL") %>%
dplyr::select(tissue = Tissue, gene = ensembl_gene_id, ampad_fc = logFC) %>%
distinct()
Combine with modules so correlation can be done per module
Next, we will combine the fold change table we just downloaded
(ampad_fc) and module table from before
(ampad_modules). First, let’s look at both tables to check
how can we merge them together?
R
head(ampad_fc)
head(ampad_modules)
The columns common to both tables are gene (the human
Ensembl gene IDs) and tissue (the brain region
corresponding to the module/measurement). So we will merge the data sets
using these two columns.
Reminder: Each gene can be present in multiple brain regions, but should only be in one module per brain region. Let’s double check using the first gene in the table:
R
ampad_modules[ampad_modules$gene %in% "ENSG00000168439",]
This gene is present in six different co-expression modules all from different brain regions. You can do this for any other gene as well.
We’ll merge the two tables using the dplyr::inner_join
function:
R
ampad_modules_fc <- inner_join(ampad_modules, ampad_fc, by = c("gene", "tissue")) %>%
dplyr::select(symbol = Mouse_gene_symbol, module, ampad_fc)
Take a look at the new table we just made:
R
head(ampad_modules_fc)
We will use the data in ampad_modules_fc to compare with
log fold change data measured in mouse models.
Preparing module information for correlation plots
Let’s package up these data and save this progress so far. This is
some manual book-keeping to arrange the modules into consensus clusters
for plotting later. You can just copy-paste this code into your
Rstudio session.
R
cluster_a <- tibble(
module = c("TCXblue", "PHGyellow", "IFGyellow"),
cluster = "Consensus Cluster A (ECM organization)",
cluster_label = "Consensus Cluster A\n(ECM organization)"
)
cluster_b <- tibble(
module = c("DLPFCblue", "CBEturquoise", "STGblue", "PHGturquoise", "IFGturquoise", "TCXturquoise", "FPturquoise"),
cluster = "Consensus Cluster B (Immune system)",
cluster_label = "Consensus Cluster B\n(Immune system)"
)
cluster_c <- tibble(
module = c("IFGbrown", "STGbrown", "DLPFCyellow", "TCXgreen", "FPyellow", "CBEyellow", "PHGbrown"),
cluster = "Consensus Cluster C (Neuronal system)",
cluster_label = "Consensus Cluster C\n(Neuronal system)"
)
cluster_d <- tibble(
module = c("DLPFCbrown", "STGyellow", "PHGgreen", "CBEbrown", "TCXyellow", "IFGblue", "FPblue"),
cluster = "Consensus Cluster D (Cell Cycle, NMD)",
cluster_label = "Consensus Cluster D\n(Cell Cycle, NMD)"
)
cluster_e <- tibble(
module = c("FPbrown", "CBEblue", "DLPFCturquoise", "TCXbrown", "STGturquoise", "PHGblue"),
cluster = "Consensus Cluster E (Organelle Biogensis, Cellular stress response)",
cluster_label = "Consensus Cluster E\n(Organelle Biogenesis,\nCellular stress response)"
)
module_clusters <- cluster_a %>%
bind_rows(cluster_b) %>%
bind_rows(cluster_c) %>%
bind_rows(cluster_d) %>%
bind_rows(cluster_e) %>%
mutate(cluster_label = fct_inorder(cluster_label))
head(module_clusters)
mod <- module_clusters$module
save(ampad_modules_fc,module_clusters,mod, file= here::here("AMPAD_Module_Data.RData"))
[4] Correlation between mouse models and human AD modules
Now we’ll get to the cross-species alignment. Our goal, as demonstrated in the plots below, is to identify modules where the gene expression is correlated between human and mouse orthologs (left) as well as modules where there is no correlation (right).

There are two approaches that we commonly use to compute correlation between mouse data and human AD data:
- Compare change in expression in Human AD cases vs controls with
change in expression in mouse models for each gene in a given
module:
- LogFC(h) = log fold change in expression of human AD patients compared to control patients.
- LogFC(m) = log fold change in expression of mouse AD models compared to control mouse models.
\[cor.test(LogFC(h), LogFC(m))\]
- Compare Human AD expression changes to mouse genetic effects for
each gene in a given module:
- h = human gene expression (Log2 RNA-seq Fold Change AD/control)
- β = mouse gene expression effect from linear regression model (Log2 RNA-seq TPM)
\[cor.test(LogFC(h), β)\]
Both approaches allow us to assess directional coherence between gene expression for genes in AMP-AD modules and the effects of genetic manipulations in mice. For this session we are going to use the first approach; we’ll return to approach #2 later in the week.
Challenge 6
What are the relative advantages of each approach?
\[cor.test(LogFC(h), LogFC(m))\]
• direct comparison of effect sizes and direction • intuitive interpretation
\[cor.test(LogFC(h), β)\]
• identify the genetic contributions • human AD data often has age, sex, and other covariates regressed out, to derive the AD specific effect • controlling for analogous variables by computing the genetic effect (β) is often advantageous
Others?
Step 0: Reading Gene Expression Count matrix from Previous Lesson
First we’ll read the results saved after differential expression analysis (above). We’ll only keep the information about which age cohort and the differential expression results.
R
fad.deg <- load("results/DEAnalysis_5XFAD.Rdata")
WARNING
Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file
'results/DEAnalysis_5XFAD.Rdata', probable reason 'No such file or directory'
ERROR
Error in `readChar()`:
! cannot open the connection
R
head(fad.deg)
ERROR
Error:
! object 'fad.deg' not found
Let’s also load the AMP-AD module data.
R
load("data/AMPAD_Module_Data.RData")
Step 1: Measure the correlation between mouse models for each sex at each age and AMP-AD modules using common genes from both datasets
We compute a Pearson correlation between changes in expression for each gene within a given module (log fold change for cases vs controls) with each mouse model (log fold change of the 5xFAD mice vs sex- and age-matched B6 mice).
First, we’ll combine both mouse fad.deg and human
ampad_modules_fc log fold change data sets for all
genes.
R
model_vs_ampad <- inner_join(fad.deg,
ampad_modules_fc,
by = c("symbol"),
multiple = "all")
Note: for this join we specify
multiple = "all" so that the same gene can be matched
across multiple human tissues and multiple mouse ages.
R
head(model_vs_ampad)
Now we’ll create a list column containing data frames using the tidyr::nest function. Nesting is implicitly a summarizing operation: you get one row for each group defined by the non-nested columns.
R
df <- model_vs_ampad %>%
dplyr::select(module, age, symbol, log2FoldChange, ampad_fc) %>%
group_by(module, age) %>%
nest(data = c(symbol, log2FoldChange, ampad_fc))
head(df)
And we can also look at some of the nested data:
R
head(df$data[[1]])
These are the mouse and human log fold-change values for all genes in the TCXblue module; the mouse data corresponds to 4 month old 5xFAD mice.
The total number of groups in the data table:
R
dim(df)
Next, we’ll compute correlation coefficients using the
cor.test function:
R
cor.df <- df %>%
mutate(
cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]],
.x[["ampad_fc"]], method = "pearson")),
estimate = map_dbl(cor_test, "estimate"),
p_value = map_dbl(cor_test, "p.value")
) %>%
ungroup() %>%
dplyr::select(-cor_test)
Here we’re using purrr::map based functions to apply the
correlation test to every entry in the data column. We can
pull out specific features from the cor_test list column,
including the computed correlation coefficient (estimate)
and the significance (p.value).
In the end we should have correlation coefficients and significance values for every comparison in our data table:
R
head(cor.df)
Step 2: Annotate correlation table to prepare for visualization
These steps will make it easier to make a nice looking plot during the next step. We’ll add a column that flags whether the correlation is significant or not, and we’ll add in the information about which consensus cluster each module belongs to:
R
model_module <- cor.df %>%
mutate(significant = p_value < 0.05) %>%
left_join(module_clusters, by = "module") %>%
dplyr::select(cluster, cluster_label, module, age,
correlation = estimate, p_value, significant)
head(model_module)
Step 3: Create a dataframe to use as input for plotting the results
More preparations for plotting, here we’ll get all of the values in the right order so that they are grouped together nicely on the plot.
R
correlation_for_plot <- model_module %>%
arrange(cluster) %>%
mutate(
module = factor(module,levels=mod),
age = factor(age, levels = c('10m','6m','4m'))
)
head(correlation_for_plot)
Visualizing the Correlation plot
Now, we will use the above matrix and visualize the correlation
results using ggplot2 package functions.
R
data <- correlation_for_plot
range(correlation_for_plot$correlation)
ggplot() +
# the AMP-AD modules will be along the x-axis, the mouse models will be along the y-axis
geom_tile(data = data, aes(x = .data$module, y = .data$age), colour = "black", fill = "white") +
# each tile of the grid will be filled with a circle where the fill and size correspond to the correlation coefficient
geom_point(data = data, aes(x = .data$module, y = .data$age,
colour = .data$correlation, size = abs(.data$correlation))) +
# we'll draw a box arround significant correlations
geom_point(data = dplyr::filter(data, .data$significant),
aes(x = .data$module, y = .data$age, colour = .data$correlation),
color="black",shape=0,size=9) +
# plot the x-axis on the top of the plot, set the parameters of the scales
scale_x_discrete(position = "top") +
scale_size(guide = "none", limits = c(0, 0.4)) +
scale_color_gradient2(limits = c(-0.5, 0.5), low = "#85070C", high = "#164B6E",
name = "Correlation", guide = guide_colorbar(ticks = FALSE)) +
# remove axis labels
labs(x = NULL, y = NULL) +
# facet the plot based on age range for the mice (rows) and consensus cluster (columns)
facet_grid( rows = vars('5xFAD'), cols = dplyr::vars(.data$cluster_label),
scales = "free", space = "free",switch = 'y') +
# specify how different aspects of the plot will look
theme(
strip.text.x = element_text(size = 10,colour = c("black")),
strip.text.y.left = element_text(angle = 0,size = 12),
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 0, size = 12),
axis.text.y = element_text(angle = 0, size = 12),
panel.background = element_blank(),
plot.title = element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
plot.title.position = "plot",
panel.grid = element_blank(),
legend.position = "right"
)
In the above plot, categories along the x-axis are the 30 AMP-AD co-expression modules grouped into 5 consensus clusters, while the categories along the y-axis show the different groupings of mouse models tested (split by age). Positive correlations are shown in blue and negative correlations in red. Color intensity and size of the circles are proportional to the correlation coefficient. Black squares around dots represent significant correlation at p-value=0.05 and non-significant correlations are left blank.
5xFAD mice display gene expression alterations that are correlated with human disease across all five consensus clusters, with the strongest correlations observed among modules/genes in Consensus Cluster B, which generally consists of immune system pathways and functions.
Examining individual correlation results
Let’s say we want to know more about a single comparison in the plot
above and which genes are contributing to the correlation result. Maybe
we’re really interested in the correlations to the FPbrown module
between 5xFAD mice at 4 months and 6 months. We can plot the individual
correlations for each comparison in the plot above with the data we
have. We’ll label genes with large fold change in the mouse using the
ggrepel::geom_label_repel function.
R
# specify which comparisons to consider
m <- 'FPbrown'
a <- c('4m','6m')
# filter the correlation data frame to these comparisons
indiv.corr <- cor.df %>% filter(module == m, age %in% a) %>% unnest(data) %>%
mutate( facet = str_c(age, '\n', 'r = ',signif(estimate,3),' ; p = ',signif(p_value,3) ))
# plot
ggplot(indiv.corr, aes( log2FoldChange , ampad_fc ))+
geom_vline(xintercept = 0, lwd = .1)+
geom_hline(yintercept = 0, lwd = .1)+
geom_point( size = .5, color = 'darkred')+
geom_smooth(method = 'lm', lwd = .5)+
ggrepel::geom_text_repel( data = arrange(indiv.corr, desc(abs(log2FoldChange))),
aes(label = symbol), size = 3, min.segment.length = 0 ) +
labs(x = '5xFAD logFC', y = 'AMP-AD logFC',
title = unique(indiv.corr$module))+
facet_wrap(~facet)+
theme_bw()
Here we can see that the 5xFAD logFC are pretty small in general. The correlations are relatively weak and driven by individual genes that have relatively large changes (e.g. Heatr4, Nirp1a, Rpl39l). If we compare a different module, say STGblue, we can see a stronger relationship between mouse and human expression changes.
R
m <- 'STGblue'
a <- c('4m','6m')
# filter the correlation data frame to these comparisons
indiv.corr <- cor.df %>% filter(module == m, age %in% a) %>% unnest(data) %>%
mutate( facet = str_c(age, '\n', 'r = ',signif(estimate,3),' ; p = ',signif(p_value,3) ))
# plot
ggplot(indiv.corr, aes( log2FoldChange , ampad_fc ))+
geom_vline(xintercept = 0, lwd = .1)+
geom_hline(yintercept = 0, lwd = .1)+
geom_point( size = .5, color = 'darkred')+
geom_smooth(method = 'lm', lwd = .5)+
ggrepel::geom_text_repel( data = arrange(indiv.corr, desc(abs(log2FoldChange))),
aes(label = symbol), size = 3, min.segment.length = 0 ) +
labs(x = '5xFAD logFC', y = 'AMP-AD logFC',
title = unique(indiv.corr$module))+
facet_wrap(~facet)+
theme_bw()
These correlations are much stronger (r is approximately 0.3 vs 0.1 for the previous module), and there is a consistent pattern between young mice and old mice, with similar genes being expressed in similar ways (e.g. Itgax and Clec7a are up-regulated at both ages). These significant positive correlations suggest that the 5xFAD model captures inflammatory changes observed in human AD patients.
[5] Detecting functional coherence of gene sets from omics data
Most omics analyses produce data on many thousands of genomic features (i.e. transcripts, proteins, etc.) for each condition tested. Simply looking at these lists of genes and associated statistics can be daunting and uninformative. We need approaches to identify which biological functions are being impacted by a given experiment from these systems-level measurements.
Gene functional enrichment analysis describes a variety of statistical methods that identify groups of genes that share a particular biological function or process and show differential association with experimental conditions. Most approaches compare some statistically valid set of differentially expressed features to sets of functional annotations for those features. There are many different functional annotation sets available, some of the more commonly used include:
- gene function resources, such as the Gene Ontology (i.e. GO)
- pathway databases, such as Reactome or KEGG
- disease and phenotype ontologies, such as the Human Phenotype Ontology, the Mammalian Phenotype Ontology, and the Disease Ontology
These are the resources that are the foundation for many genomics knowledge bases, such as MGI, monarch initiative, and the Alliance of Genome Resources. The precise nature of each of these resources varies, but the core information contained within each is the relationship of sets of genes to biologically meaningful annotations. These annotations are typically expertly curated from the published literature.
There are a variety of statistical approaches that can be employed to test for functional enrichment among genes identified from an omics dataset. Two of the most common broad classes of tests are over-representation analysis (ORA) and gene set enrichment analysis (GSEA). For example, consider the figure below from Zhao & Rhee, Trends in Genetics (2023). Let’s consider each in a bit more detail.

Over-representation analysis
ORA involves statistical tests of overlap between two
lists of genes: one derived from the experiment and one from the
functional annotations. For example, one might ask what is the overlap
between the genes in an annotation class, such as “Lysosome”, and the
genes that are significantly up-regulated in a given experimental
condition. These tests usually rely on some form of Fisher’s exact test
(e.g. fisher.test()) or hypergeometric test
(e.g. phyper()). If the gene lists consist of a larger
number of overlapping genes than would be expected at random - given the
sample sizes involved - then there is said to be a statistical
over-representation of the annotation class within the experimental
condition.
Of course, these overlap tests are generally considered for all
annotation classes, which can number in the hundreds to thousands.
Performing this many statistical tests ensures that many will be
identified as significant by chance. Therefore, there is typically a
requirement to correct for multiple testing errors
(e.g. p.adjust()).
There are many R packages available to handle the statistical tests
and corrections involved in ORA. Today we’ll use
clusterProfiler::enrichGO(), which wraps statistical
testing for overlap with GO terms and multiple test correction in one
function.
Let’s start by considering the enrichments from the mouse data analyzed previously. We’ll start by considering the genes that are significantly DE in 6 month old animals
R
gene.list.up <- fad.deg %>%
filter(age == '6m',
padj <= 0.05,
log2FoldChange > 0) %>%
arrange(desc(log2FoldChange)) %>%
filter(!duplicated(symbol), !is.na(symbol)) %>%
pull(symbol) %>%
unique()
gene.list.dn <- fad.deg %>%
filter(age == '6m',
padj <= 0.05,
log2FoldChange < 0) %>%
arrange(desc(log2FoldChange)) %>%
filter(!duplicated(symbol), !is.na(symbol)) %>%
pull(symbol) %>%
unique()
length(gene.list.up)
length(gene.list.dn)
There are a total of # r length(gene.list.up) significantly up-regulated genes and r length(gene.list.dn) significantly down-regulated genes in this cohort. Now test for over representation of GO terms among the DEGs. First, we need to identify the universe of all possible genes which includes the genes that were both measured by the experiment and contained within the annotation set.
R
univ <- as.data.frame(org.Mm.egGO) %>%
pull(gene_id) %>%
unique() %>%
bitr(., fromType = "ENTREZID", toType = 'SYMBOL', OrgDb = org.Mm.eg.db, drop = T) %>%
pull('SYMBOL') %>%
intersect(., fad.deg$symbol)
Now let’s test for enriched GO terms (this can take 3-4 minutes)
R
enr.up <- enrichGO(gene.list.up,
ont = 'all',
OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
universe = univ
)
enr.dn <- enrichGO(gene.list.dn,
ont = 'all',
OrgDb = org.Mm.eg.db,
keyType = 'SYMBOL',
universe = univ
)
How many significant terms are identified:
R
enr.up@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
enr.dn@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
Challenge 7
How many significant terms are identified from the up-regulated gene list if you do not specify the “universe”?
R
enrichGO(gene.list.up, ont = 'all', OrgDb = org.Mm.eg.db, keyType = 'SYMBOL') %>%
.@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
Plot the top 10 significant terms:
R
cowplot::plot_grid(
dotplot(enr.dn, showCategory = 10) + ggtitle('down'),
dotplot(enr.up, showCategory = 10) + ggtitle('up')
)
From this we can see that nervous system related terms (e.g. “dendrite development” and “protein localization to synapse”) are down in 5xFAD mice at 6 months, while immune functions (e.g. “regulation of innate immune response” and “leukocyte mediated immunity”) are up in 5xFAD mice at 6 months.
Gene set enrichment analysis
GSEA is an alternative approach that uses a statistical
measure from the omics data (e.g. the log fold change or significance)
to rank the genes. An “enrichment score” is calculated for each
annotation set based on where the genes annotated to the term sit in the
overall distribution.
Let’s analyze those same data with the
fgsea::fgseaMultilevel() function. First we’ll specify the
gene list and use the log2FoldChange value to rank the
genes in the list.
R
gene.list <- fad.deg %>%
filter(age == '6m', padj <= 0.05) %>%
arrange(desc(log2FoldChange)) %>%
filter(!is.na(symbol),!duplicated(symbol)) %>%
pull(log2FoldChange, name = symbol)
We’ll also need a list of GO terms and the genes annotated. We can
get this from the org.Mm.eg.db annotation package using the
AnnotationDbi::select function
R
# Get GO annotations
go.terms <- AnnotationDbi::select(
org.Mm.eg.db,
keys = keys( org.Mm.eg.db , keytype = "SYMBOL"),
columns = c("GO", "SYMBOL"),
keytype = "SYMBOL"
)
# Create the list of gene sets
go.gene.sets <- split(go.terms$SYMBOL, go.terms$GO)
# Filter gene sets by size (minSize and maxSize are arguments in fgsea)
min_size <- 15
max_size <- 500
go.sets.filtered <- go.gene.sets[sapply(go.gene.sets, function(x) length(x) >= min_size && length(x) <= max_size)]
Now we’ll test for enrichment:
R
gse.enr <- fgsea::fgseaMultilevel(
pathways = go.sets.filtered,
stats = gene.list,
minSize = min_size,
maxSize = max_size,
nproc = 1 )
Let’s take a look
R
gse.enr %>% arrange(padj) %>% head(n = 10)
This isn’t especially informative without the GO term names. Let’s pull those in
R
# get term names from the GO.db package
go_term_map <- AnnotationDbi::select(
GO.db::GO.db,
keys = gse.enr$pathway,
columns = c("GOID", "TERM"),
keytype = "GOID"
)
# join with results table
gse.enr <- inner_join(gse.enr, go_term_map %>% select(pathway = GOID, TERM)) %>% relocate(TERM, .after = pathway)
# now take another look
gse.enr %>% arrange(padj) %>% head(n = 10)
This is more helpful. We can see that the GO terms associated with up-regulated genes include “inflammatory response” and “chemotaxis”, while there’s one term among the top 10 that is associated with down-regulated genes, “chemical synaptic transmission”.
How many significant terms are identified:
R
gse.enr %>% filter(padj <= 0.05) %>% pull(pathway) %>% unique() %>% length()
Challenge 8
The tally above represents all genes, both up- and down-regulatd. To
compare between GSEA and ORA, can you identify how many GSEA enriched
terms are associated with up-regulated genes and how many are associated
with down-regulated genes? (Hint: the NES value within the
results relates to the directionality of enrichment).
R
gse.enr %>% filter(padj <= 0.05, NES < 0) %>% pull(pathway) %>% unique() %>% length()
gse.enr %>% filter(padj <= 0.05, NES > 0) %>% pull(pathway) %>% unique() %>% length()
40 terms are up-regulated, while only 2 are associated with down-regulated genes
Challenge 9
You may notice that your numbers of significantly enriched terms are slightly different. Why would this be the case?
fgsea uses a permutation-based approach to estimate the
significance of the gene set enrichments. Because this involves random
sampling, running the analysis multiple times with the same settings can
result in subtle, minor differences in the output, particularly in the
p-values.
Increasing the nPermSimple parameter increases the
number of permutations performed. This leads to a more thorough sampling
of the null distribution, thereby improving the precision of the
estimated p-values, especially for highly significant pathways. However,
it’s a trade-off, as a higher nPermSimple value will also
increase the computational time required to run the analysis.
Let’s compute the enriched terms for all age groups:
R
fad.enr <- fad.deg %>%
filter(!is.na(symbol)
, padj <= 0.05
) %>%
group_by(age) %>%
summarise(gl = log2FoldChange %>% setNames(., symbol) %>% sort(decreasing = T) %>% list()) %>%
ungroup() %>%
mutate(
gse = map(
gl,
~ fgsea::fgseaMultilevel(
pathways = go.sets.filtered,
stats = .x,
minSize = min_size,
maxSize = max_size,
nproc = 1
)
),
res = map(gse, ~ {
go_term_map <- AnnotationDbi::select(
GO.db::GO.db,
keys = .x$pathway,
columns = c("GOID", "TERM"),
keytype = "GOID"
)
inner_join(.x, go_term_map %>% select(pathway = GOID, TERM)) %>%
relocate(TERM, .after = pathway)
}))
saveRDS(fad.enr, here::here('results', '5xFAD_fgsea_results.rds'))
Common pitfalls & best practices
These kinds of functional enrichment analyses are very common, but not all results reported are equal! A recent paper describes an “Urgent need for consistent standards in functional enrichment analysis”. They examine how the results of functional enrichment analyses are reported in the literature, and identify several common shortcomings. Watch out for these common mistakes when performing and reporting your own analyses! We’ll have more opportunities to discuss issues with reproducibility in computational biology research in future sessions.

[6] Alzheimers’s Disease Biological Domains
While these results are informative and help to provide essential biological context to the results of the omics experiment, it is difficult to understand all of the enriched terms and what that means for the biology of disease. It would be useful to group each of the enriched terms within broader areas of disease biology. There are several common molecular endophenotypes that are detected within human Alzheimer’s cohorts across omics data types (transcriptomics, proteomics, and genetics). Several of these endophenotypic areas are shown in the figure below (source: Jesse Wiley).

These endophenotypes describe molecular processes that are dysregulated by or during the disease process. Very similar sets of endophenotypes have been identified among the targets of therapeutics in clinical trials for AD (see below, Cummings et al, Alzheimer’s disease drug development pipeline: 2024, Alz Dem TRCI.


In order to formalize and operationalize these endophenotypic areas, the Emory-Sage-SGC-JAX (ESSJ) TREAT-AD center has developed the Biological Domains of AD. In the enumeration of these domains, several criteria were established; the defined biological domains should be:
-
objective: leverage existing well-annotated resources
-
automatable: in CADRO, therapeutics and targets are
manually assigned
-
intelligible: groupings should be easy to understand
- modifiable: definitions should be (and are!) continuously be updated.
In all, 19 distinct biological domains (aka biodomains) have been
identified. These biological domains are defined using sets of GO terms
that align with the endophenotype. For example, terms like “cytokine
receptor binding” and “leukocyte migration” are annotated to the
Immune Response biodomain, while terms like “postsynaptic
density” and “synaptic vesicle cycle” are annotated to the
Synapse biodomain. Of all terms in the GO, 6,837 terms
(15.7%) are annotated to one of the biological domains. Because the GO
terms have genes annotated, genes can be associated with specific
endophenotypes via the biological domain groupings of terms. In all,
16,275 genes are annotated to at least 1 biodomain term. While the
biodomains exhibit very few overlapping GO terms (panel A), due to gene
pleiotropy (etc) the number of overlapping genes between biological
domains is quite a bit higher (panel B). The development of the
biological domains is described in more detail in the published paper Genetic and Multi-omic Risk Assessment
of Alzheimer’s Disease Implicates Core Associated Biological
Domains.

The ESSJ TREAT-AD center has developed approaches to group terms within each biodomain into functionally coherent sub-domains. The sub-domains are driven by gene co-annotation and disease risk score enrichment.

Download and explore the biological domain annotations
First let’s download the biodomain definition file from synapse.
R
# biodomain definitions
biodom <- readRDS(syn$get('syn25428992')$path)
# biodomain labels and colors
dom.lab <- read_csv(syn$get('syn26856828')$path)
What is in the dom.lab file?
R
glimpse(dom.lab)
This file contains some useful standardized abbreviations and colors that will be useful as we work with and plot domain information later.
What is in the biodom file?
R
glimpse(biodom)
You can see the file is a list of GO term accessions
(GO_ID) and names (GOterm_Name) as well as the
corresponding endophenotype areas (e.g. Biodomain and
Subdomain) to which the term is annotated. There are also
gene annotations for each term. These annotation sets come from two
sources: (1) the symbol and uniprot
annotations are drawn directly from the Gene Ontology via the provided
API, (2) ensembl_id, entrez_id, and
hgnc_symbol are from BioMart annotations
(e.g. biomaRt::getLDS()).
We can see how many GO terms are annotated to each biodomain:
R
biodom %>%
group_by(Biodomain) %>%
summarise(n_term = length(unique(GO_ID))) %>%
arrange(n_term)
The biodomains range in size from Tau Homeostasis (10
terms) up to Synapse (1,379 terms).
What about the size of the subdomains? For simplicity let’s focus on
the Subdomains within the Synapse domain.
R
biodom %>%
filter(Biodomain == 'Synapse') %>%
group_by(Subdomain) %>%
summarise(n_term = length(unique(GO_ID))) %>%
arrange(n_term)
There are 8 subdomains within the Synapse biodomain,
plus a set of terms that aren’t assigned to any subdomain (i.e. where
Subdomain is NA). The subdomains range in size
from axon regeneration (20 terms) up to
trans-synaptic signaling (162 terms). There are 822 terms
within the Synapse domain that aren’t assigned to any
sub-domains.
We can also investigate the individual genes annotated to each biodomain GO term.
R
biodom %>% filter(GOterm_Name == 'amyloid-beta formation') %>% pull(symbol) %>% unlist()
So the genes associated with amyloid-beta formation
within the APP Metabolism biodomain include PSEN1, PSEN2,
BACE1, ABCA7, NCSTN, and others.
Annotate enriched terms with biological domain
Let’s re-characterize the 5xFAD functional enrichments we computed previously with the biological domain annotations and see if we can get more context about what is changing in that model. We’ll focus on the GSEA results and start by annotating the results with biodomain groupings.
R
gse.enr.bd = gse.enr %>%
left_join(., biodom %>% select(Biodomain, pathway = GO_ID), by = 'pathway') %>%
relocate(Biodomain)
head(gse.enr.bd %>% select(pathway, TERM, Biodomain, padj, NES), n = 10)
Not all of the enriched terms are annotated to a biological domain.
Some terms are too broad and not specific (e.g. ‘cell mophogenesis’ or
‘MAPK cascade’), while others may not have been captured by a biological
domain annotation yet (e.g. ‘transcription cis-regulatory region
binding’). Remember that the conception of the biodomains
involved a requirement that they be modifiable, and these terms may be
added to the biodomain definintions in the future. Let’s modify the
Biodomain value for terms that aren’t annotated to a domain
to ‘none’.
R
gse.enr.bd$Biodomain[is.na(gse.enr.bd$Biodomain)] <- 'no domain'
head(gse.enr.bd %>% select(pathway, TERM, Biodomain, padj, NES), n = 10)
How many terms are enriched from each domain?
R
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'no domain')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
n_sig_term = gse.enr.bd$pathway[ gse.enr.bd$Biodomain == domain ] %>% unique() %>% length()
)
arrange(bd.tally, desc(n_sig_term))
Many enriched terms don’t map to a domain r bd.tally %>%
filter(domain == ‘no domain’) %>% pull(n_sig_term) %>% sum()), but
nearly half do (r bd.tally %>% filter(domain != ‘no domain’) %>%
pull(n_sig_term) %>% sum()). Of those that do, the vast majority map
into the Immune Response biodomain.
We can plot the enrichment results, stratified by biodomain:
R
enr.bd_plot <- full_join(gse.enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>%
filter(!is.na(Biodomain)) %>%
mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>%
arrange(Biodomain, padj)
ggplot(enr.bd_plot, aes(NES, Biodomain)) +
geom_violin(data = subset(enr.bd_plot, NES > 0),
aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0),
aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(padj), fill = color),
color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
This makes it really clear that in the 6 month old 5xFAD mice, the
enriched Immune Response domain terms are all associated
with up-regulated genes, while the enriched Synapse domain
terms are associated with a mix of up- and down-regulated genes. It also
highlights the several other domains with significantly enriched terms
(e.g. Proteostasis, Structural Stabilization,
Lipid Metabolism, Endolysosome, etc; even
APP Metabolism).
Now let’s look across age groups to see how domain enrichments change:
R
enr.bd_plot <- fad.enr %>% select(age, res) %>% unnest(res) %>%
full_join(., biodom %>% select(pathway = GO_ID, Biodomain, Subdomain)) %>%
full_join(., dom.lab, by = c('Biodomain' = 'domain')) %>%
filter(!is.na(age)) %>%
mutate(
Biodomain = if_else(is.na(Biodomain), 'none', Biodomain),
Subdomain = if_else(is.na(Subdomain), 'none', Subdomain),
age = factor(age, levels = c('4m','6m','10m'))
) %>%
mutate(n_sig = length(unique(pathway)), .by = Biodomain) %>%
mutate(Biodomain = fct_reorder(Biodomain, n_sig)) %>%
arrange(Biodomain, padj)
ggplot(enr.bd_plot, aes(NES, Biodomain)) +
facet_grid(cols = vars(age), scales = 'free')+
geom_violin(data = subset(enr.bd_plot, NES > 0),
aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0),
aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(padj), fill = color),
color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
The earliest transcriptomic changes in the 5xFAD animals are
associated with Immune Response and
Lipid Metabolism domain terms. By 6 months, there are
enrichments for several other domains (as above), and by 10 months the
term enrichments are even more exaggerated (i.e. more
significance, larger NES values, etc).
We can also break down these results by sub-domain to get a more clear idea of the processes affected in each case
R
enr.bd_plot <- fad.enr %>% select(age, res) %>% unnest(res) %>%
full_join(., biodom %>% select(pathway = GO_ID, Biodomain, Subdomain)) %>%
full_join(., dom.lab, by = c('Biodomain' = 'domain')) %>%
filter(!is.na(age)) %>%
mutate(
Biodomain = if_else(is.na(Biodomain), 'none', Biodomain),
Subdomain = if_else(is.na(Subdomain), 'none', Subdomain),
sd_lab = str_c(abbr,'_',Subdomain),
age = factor(age, levels = c('4m','6m','10m'))
)
ggplot(enr.bd_plot, aes(NES, sd_lab)) +
facet_grid(cols = vars(age), rows = vars(abbr), scales = 'free', space = 'free_y')+
geom_violin(data = subset(enr.bd_plot, NES > 0), aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0), aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(padj), fill = color), color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(label = ~ str_remove_all(.x, '^[A-Za-z]{2}_'), drop = F)+
scale_fill_identity()+scale_color_identity()
What trends do you notice from these results? Which sub-domain processes are affected at the earliest ages? Are there any sub-domains that change across the age groups?
Challenge 10
How could you plot the results from the ORA to show biodomain enrichements?
R
enr.ora = bind_rows(enr.up@result %>% mutate(dir = 'up'),
enr.dn@result %>% mutate(dir = 'dn')) %>%
left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%
relocate(Biodomain)
enr.ora$Biodomain[is.na(enr.ora$Biodomain)] <- 'none'
bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>%
rowwise() %>%
mutate(
n_term = biodom$GO_ID[biodom$Biodomain == domain] %>% unique() %>% length(),
n_sig_term = enr.ora$ID[enr.ora$Biodomain == domain] %>% unique() %>% length()
)
enr.ora <- full_join(enr.ora, dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>%
arrange(Biodomain, p.adjust) %>%
mutate(
signed_logP = -log10(p.adjust),
signed_logP = if_else(dir == 'dn', -1 * signed_logP, signed_logP)
)
ggplot(enr.ora, aes(signed_logP, Biodomain)) +
geom_violin(data = subset(enr.ora, dir == 'up'),
aes(color = color),
scale = 'width') +
geom_violin(data = subset(enr.ora, dir == 'dn'),
aes(color = color),
scale = 'width') +
geom_jitter(
aes(size = Count, fill = color),
color = 'grey20',
shape = 21,
alpha = .3
) +
geom_vline(xintercept = 0, lwd = .1) +
scale_y_discrete(drop = F) +
scale_fill_identity() + scale_color_identity()
Based on the gene list (up or down) we can add a sign to the
significance. When we plot this we can see there are many more
significantly enriched terms from the ORA. The dominant signal is still
the up-regulation of terms from the Immune Response
biodomain. There is also nearly exclusive up-regulation of terms from
the Autophagy, Oxidative Stress, and
APP Metabolism domains. The most down-regulated terms are
from the Synapse biodomain.
Challenge 11
Which biodomain terms are over-represented among the gene
co-expression modules? Choose one module and test for over-represented
terms at a p.adjust value less than 0.05.
What if you look at the most significantly enriched terms,
with p.adjust values less than 1e-5?
R
stg_blue.ora <- ampad_modules %>%
filter(module == 'STGblue') %>%
pull(Mouse_gene_symbol) %>%
enrichGO(., org.Mm.eg.db, keyType = 'SYMBOL')
inner_join(stg_blue.ora@result,
biodom %>% select(ID = GO_ID, Biodomain:Subdomain)) %>%
group_by(Biodomain) %>%
summarise(n = length(unique(ID))) %>%
arrange(desc(n))
inner_join(stg_blue.ora@result,
biodom %>% select(ID = GO_ID, Biodomain:Subdomain)) %>%
filter(p.adjust <= 1e-5) %>%
group_by(Biodomain) %>%
summarise(n = length(unique(ID))) %>%
arrange(desc(n))
At the default p-value the domains with the most terms enriched
include Synapse, Lipid Metabolism, and
Immune Response. If we filter to the most
significantly enriched terms, terms from the
Immune Response and Structural Stabilization
domains are represented.
Comparing enrichments
Now we know which domain processes are affected in the mouse model, let’s consider how we can compare the functional enrichments in these models with functional enrichments from the AMP-AD cohorts. We’ll need to compute and examine the functional enrichments from the human cohort data.
First let’s get the human GO set gene annotations
R
# Get GO annotations
hs.go.terms <- AnnotationDbi::select(
org.Hs.eg.db,
keys = keys( org.Hs.eg.db , keytype = "SYMBOL"),
columns = c("GO", "SYMBOL"),
keytype = "SYMBOL"
)
# Create the list of gene sets
hs.go.gene.sets <- split(hs.go.terms$SYMBOL, hs.go.terms$GO)
# Filter gene sets by size (minSize and maxSize are arguments in fgsea)
min_size <- 15
max_size <- 500
hs.go.sets.filtered <- hs.go.gene.sets[sapply(hs.go.gene.sets, function(x) length(x) >= min_size && length(x) <= max_size)]
Now we can run the enrichments
R
hs.gsea <- ampad_modules_raw %>%
filter(
Model == "Diagnosis",
Comparison == "AD-CONTROL",
Tissue %in% c('DLPFC', 'PHG','TCX'),
!is.na(hgnc_symbol)
, adj.P.Val <= 0.05
) %>%
group_by(Study, Tissue) %>%
summarise(gl = logFC %>% setNames(., hgnc_symbol) %>% sort(decreasing = T) %>% list()) %>%
ungroup() %>%
mutate(
gse = map(gl,
~fgsea::fgseaMultilevel(
pathways = hs.go.sets.filtered,
stats = .x,
minSize = min_size,
maxSize = max_size,
nproc = 1
)),
res = map(gse, ~ {
go_term_map <- AnnotationDbi::select(
GO.db::GO.db,
keys = .x$pathway,
columns = c("GOID", "TERM"),
keytype = "GOID"
)
inner_join(.x, go_term_map %>% select(pathway = GOID, TERM)) %>%
relocate(TERM, .after = pathway)
})
)
Let’s map the biodomains onto the enriched GO terms, and plot the results
R
enr.bd_plot <- hs.gsea %>% select(Study, res) %>% unnest(res) %>%
left_join(., biodom %>% select(Biodomain, Subdomain, pathway = GO_ID), by = 'pathway') %>%
full_join(., dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(
Biodomain = if_else(is.na(Biodomain), 'none', Biodomain),
Subdomain = if_else(is.na(Subdomain), 'none', Subdomain)
) %>%
mutate(n_sig = length(unique(pathway)), .by = Biodomain) %>%
mutate(Biodomain = fct_reorder(Biodomain, n_sig)) %>%
arrange(Biodomain, padj)
ggplot(enr.bd_plot, aes(NES, Biodomain)) +
facet_grid(cols = vars(Study), space = 'free', scales = 'free')+
geom_violin(data = subset(enr.bd_plot, NES > 0),
aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0),
aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(padj), fill = color),
color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
We can see fairly similar domain term enrichments from the
transcriptomic data across AMP-AD cohorts. There is strong evidence for
up-regulation among transcripts annotated to GO terms in the
Immune Response, Structural Stabilization, and
Vasculature domains, along with similarly strong evidence
of down-regulation among transcripts annotated to GO terms in the
Synapse domain.
Now lets compare enrichments between species. First we’ll combine the enrichment results for both species and plot them side-by-side:
R
enr.bd_plot <-
bind_rows(
fad.enr %>% mutate(model = str_c('5xFAD, ',age)) %>% select(model, res) %>% unnest(res),
hs.gsea %>% mutate(model = str_c(Study,', ', Tissue)) %>% select(model, res) %>% unnest(res)) %>%
left_join(., biodom %>% select(Biodomain, Subdomain, pathway = GO_ID), by = 'pathway') %>%
full_join(., dom.lab, by = c('Biodomain' = 'domain')) %>%
filter(!is.na(model)) %>%
mutate(
Biodomain = if_else(is.na(Biodomain), 'none', Biodomain),
Subdomain = if_else(is.na(Subdomain), 'none', Subdomain)
) %>%
mutate(n_sig = length(unique(pathway)), .by = Biodomain) %>%
mutate(Biodomain = fct_reorder(Biodomain, n_sig)) %>%
arrange(Biodomain, padj)
ggplot(enr.bd_plot, aes(NES, Biodomain)) +
facet_wrap(~model, nrow = 1)+
geom_violin(data = subset(enr.bd_plot, NES > 0),
aes(color = color), scale = 'width')+
geom_violin(data = subset(enr.bd_plot, NES < 0),
aes(color = color), scale = 'width')+
geom_jitter(aes(size = -log10(padj), fill = color),
color = 'grey20', shape = 21, alpha = .3)+
geom_vline(xintercept = 0, lwd = .1)+
scale_y_discrete(drop = F)+
scale_fill_identity()+scale_color_identity()
Ok, there’s definitely similarity in terms of the direction of
enrichment for domain terms between the human and mouse data. But how
can we assess if the same processes are affected in both mouse and
human? Well, let’s take a look at Immune Response terms
more specifically.
R
comb.gsea <- fad.enr %>% mutate(model = str_c('5xFAD, ',age)) %>% select(model, res) %>% unnest(res) %>%
inner_join(., hs.gsea$res[[1]] %>% select(pathway, hs.NES = NES, hs.padj = padj)) %>%
left_join(., biodom %>% select(Biodomain, Subdomain, pathway = GO_ID), by = 'pathway') %>%
mutate(across(contains('NES'), ~if_else(is.na(.x), 0, .x)))
ggplot(
data = subset(comb.gsea, Biodomain == "Immune Response"),
aes(NES, hs.NES))+
facet_wrap(~model)+
geom_point()
In this case, there are several Immune Response terms
that are significantly enriched in both datasets and all overlapping
terms are enriched with a positive NES value, indicating they are
up-regulated processes in both AD vs Control and the 5xFAD vs WT. We can
more systematically compare enriched terms within each Biodomain. First,
lets generate a list of GO terms that is specific to each Biodomain and
each Subdomain.
R
# curate biodomain and sub-domain gene lists
bd.terms <- tibble( set = unique(biodom$Biodomain) ) %>%
mutate( terms = map(set, ~ biodom %>% filter(Biodomain == .x) %>% pull(GO_ID)) )
sd.terms <- full_join( biodom, dom.lab, by = c('Biodomain'='domain') ) %>%
mutate( Subdomain = if_else(is.na(Subdomain), 'none', Subdomain) ) %>%
select(Biodomain, abbr, Subdomain, subdomain_idx) %>% distinct() %>%
rowwise() %>%
mutate(
set = paste0(abbr,'_',Subdomain),
bd = Biodomain, sdi = subdomain_idx,
terms = biodom %>% filter(Biodomain == bd, subdomain_idx == sdi) %>% pull(GO_ID) %>% list()
)
# combine
bd.term.list = bind_rows(bd.terms, sd.terms %>% select(set, terms)) %>%
filter(set != 'NA_none')
rm(bd.terms, sd.terms)
Next let’s combine all of the human GSEA results with each of the 5xFAD age-stratified GSEA results. Any term that is enriched with an adjusted p-value > 0 we will set the NES value to 0 and the p-value to 1.
R
fad.vs.amp <- crossing(fad = str_c('5xFAD, ', fad.enr$age),
amp = str_c(hs.gsea$Study,', ', hs.gsea$Tissue) ) %>%
mutate(data = map2(fad, amp, ~ {
mm = fad.enr %>% mutate(model = str_c('5xFAD, ', age)) %>% filter(model == .x) %>% pull(res) %>% .[[1]] %>%
select(GOID = pathway, term = TERM, mm.nes = NES, mm.padj = padj)
hs = hs.gsea %>% mutate(model = str_c(Study,', ', Tissue)) %>% filter(model == .y) %>% pull(res) %>% .[[1]] %>%
select(GOID = pathway, term = TERM, hs.nes = NES, hs.padj = padj)
full_join(mm, hs, by = c('GOID','term'), na_matches = 'never') %>%
mutate(
across(contains('padj'), ~ if_else(is.na(.x), 1, .x)),
across(contains('nes'), ~ if_else(is.na(.x), 0, .x)),
hs.nes = if_else(hs.padj > 0.05, 0, hs.nes),
mm.nes = if_else(mm.padj > 0.05, 0, mm.nes)
)
}))
Then we can go through each of the GO term lists and perform the
correlation analysis. First we’ll make a list combining all mouse
comparisons and each GO term list to test using the
crossing function:
R
comp <- crossing(fad.vs.amp, set = bd.term.list$set)
head(comp)
Now we can add a column for the term sub-set to consider for each correlation:
R
comp <- comp %>%
mutate( data = map2(data, set, ~ .x %>% filter(GOID %in% bd.term.list$terms[[which(bd.term.list$set == .y)]])),
n.terms = map_dbl(data, ~nrow(.x)),
n.sig.both = map_dbl(data, ~.x %>% filter(mm.padj <= 0.05 & hs.padj <= 0.05) %>% nrow()),
jaccard = n.sig.both/n.terms )
head(comp)
For several term sets there aren’t many terms enriched. We won’t be able to compute correlations with these, so we’ll filter them out.
For the correlation we’ll use a nonparametric, rank-based Kendall correlation. Given that we don’t expect NES values to be normally distributed, and that all we’re really care to know is whether the terms are enriched in the same direction or opposite directions between mouse and human, this will do the job.
R
comp <- comp %>%
filter(n.terms > 2) %>%
mutate(
cor = map(data, ~ cor.test(.x[['hs.nes']], .x[['mm.nes']], method = 'kendall')),
correlation = map_dbl(cor, ~broom::tidy(.x) %>% pull(estimate)),
p.value = map_dbl(cor, ~broom::tidy(.x) %>% pull(p.value))
)
head(comp)
Ok, now let’s take a look at the correlation among enriched Biodomain
terms (without an _ character in the set
name):
R
tmp <- comp %>%
filter(grepl('_',set)==F) %>%
mutate(fad = factor(fad, c('5xFAD, 10m','5xFAD, 6m','5xFAD, 4m')))
ggplot(tmp, aes(set, fad)) +
facet_grid(rows = vars(amp))+
geom_tile(colour = "black", fill = "white") +
geom_point(aes(fill = correlation, size = jaccard ), color = 'black', shape = 21, stroke = .3) +
geom_point(data = subset(tmp, p.value <= 5e-2), color="black", shape=0, size= 5, stroke = .8) +
scale_size( "Jaccard index", range = c(.5, 4), limits = c(1e-90,NA)) +
scale_fill_gradient2(
"Kendall's \u03C4 coefficient",
low = "#85070C", high = "#164B6E", mid = 'grey95'
, guide = guide_colorbar(ticks = T) ) +
scale_x_discrete(position = "top", drop = T) +
labs(x = NULL, y = NULL) +
theme(
plot.margin = margin(5,50,5,5),
strip.text.x = element_text(size = 10,colour = c("black")),
strip.text.y.left = element_text(angle = 0,size = 12),
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 0, size = 12),
axis.text.y = element_text(angle = 0, size = 12),
panel.background = element_blank(),
plot.title = element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
plot.title.position = "plot",
panel.grid = element_blank(),
legend.position = "bottom",
legend.box = 'vertical'
)
The largest overlap is with the Immune Response domain
terms, which are positively correlated between older 5xFAD and the Mayo
and MSSM cohorts. There are also positively correlated terms for the
Apoptosis, Autophagy,
Lipid Metabolism, Structural Stabilization,
and Vasculature domains.
We can also look at the subdomains to get a more specific picture. Let’s filter to only include significant correlations or any correlation with a tau statistic with an absolute value > 0.2:
R
l <- comp %>%
filter(
grepl('_',set)==T,
(p.value < 5e-2 | abs(correlation) > 0.3)) %>%
pull(set)
tmp <- comp %>%
ungroup() %>%
filter(set %in% l) %>%
mutate(max.n = max(n.terms), .by = set) %>%
mutate(
fad = factor(fad, c('5xFAD, 10m','5xFAD, 6m','5xFAD, 4m') %>% rev),
set1 = str_split_fixed(set, '_', 2)[,2] %>% str_c('[',max.n,'] ',.),
abbr = str_split_fixed(set, '_', 2)[,1],
)
ggplot(tmp, aes(fad,set1)) +
facet_grid(rows = vars(abbr), cols = vars(amp), scales='free', space='free')+
geom_tile(colour = "black", fill = "white") +
geom_point(aes(fill = correlation, size = jaccard ), color = 'black', shape = 21, stroke = .3) +
geom_point(data = subset(tmp, p.value <= 5e-2), color="black", shape=0, size= 5, stroke = .8) +
scale_size( "Jaccard index", range = c(.5, 4), limits = c(1e-90,NA)) +
scale_fill_gradient2(
"Kendall's \u03C4 coefficient",
low = "#85070C", high = "#164B6E", mid = 'grey95'
, guide = guide_colorbar(ticks = T) ) +
scale_x_discrete(position = "top", drop = T) +
# scale_y_discrete(position = 'right', limits = rev)+
labs(x = NULL, y = NULL) +
theme(
plot.margin = margin(5,50,5,5),
strip.text.x = element_text(size = 10,colour = c("black")),
strip.text.y.left = element_text(angle = 0,size = 12),
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 0, size = 12),
axis.text.y = element_text(angle = 0, size = 12),
panel.background = element_blank(),
plot.title = element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
plot.title.position = "plot",
panel.grid = element_blank(),
legend.position = "bottom",
legend.box = 'vertical'
)
This tells us that the positive correlations in
Immune Response are primarily due to terms within “cytokine
production”, “neuroinflammatory response”, and “behavioral defense
response”, while the correlations in the Apoptosis domain
are primarily related to “NF-kappaB signaling”, and the late
Vasculature correlations have to do with
“angiogenesis”.
Are the phagocytosis subdomain terms really negatively correlated?
R
comp %>% filter(set == 'IR_phagocytosis', amp == 'MAYO, TCX', fad == '5xFAD, 10m') %>% pull(data) %>% .[[1]]
Not really – it is just that different terms from the subdomain are enriched in each species. All terms from the subdomain are enriched among up-regulated genes in each species.
Challenge 12
One could also perform a correlation analysis on a gene-by-gene basis like we did with the co-expression modules, but instead using the genes within each of the biodomains and subdomains. Compute these results and compare the the correlations of enriched terms and modules.
first set up the lists of genes associated with biodomain and subdomain term lists
R
bd.genes <- tibble(set = unique(biodom$Biodomain)) %>%
mutate(genes = map(
set,
~ biodom %>% filter(Biodomain == .x) %>% pull(symbol) %>% unlist
))
sd.genes <- full_join(biodom, dom.lab, by = c('Biodomain' = 'domain')) %>%
mutate(Subdomain = if_else(is.na(Subdomain), 'none', Subdomain)) %>%
select(Biodomain, abbr, Subdomain, subdomain_idx) %>% distinct() %>%
rowwise() %>%
mutate(
set = paste0(abbr, '_', Subdomain),
bd = Biodomain,
sdi = subdomain_idx,
genes = biodom %>% filter(Biodomain == bd, subdomain_idx == sdi) %>% pull(symbol) %>% unlist %>% list()
)
# combine
bd.gene.lists = bind_rows(bd.genes, sd.genes %>% select(set, genes)) %>%
filter(set != 'NA_none')
rm(bd.genes, sd.genes)
next add expression data for human genes
R
bd.genes <- ampad_modules_raw %>%
filter(Model == "Diagnosis",
Comparison == "AD-CONTROL",
Tissue == 'TCX',!is.na(hgnc_symbol)) %>%
select(hgnc_symbol, logFC, adj.P.Val) %>%
left_join(.,
mouse.human.ortho %>% select(hgnc_symbol = human_symbol, symbol = mouse_symbol)) %>%
filter(!is.na(symbol)) %>% distinct() %>%
left_join(bd.gene.lists %>% unnest(genes) %>% rename(hgnc_symbol = genes),
.) %>%
filter(!is.na(logFC))
now join the data and perform the correlation analysis
R
model_vs_ampad <- inner_join(fad.deg,
bd.genes,
by = c("symbol"),
multiple = "all") %>%
mutate(model = str_c('5xFAD, ', age)) %>%
select(model,
set,
symbol,
log2FoldChange,
padj,
hgnc_symbol,
logFC,
adj.P.Val) %>%
filter(!is.na(set)) %>%
nest(
data = c(symbol, log2FoldChange, padj, hgnc_symbol, logFC, adj.P.Val),
.by = c(model, set)
) %>%
mutate(data = map(data, ~ distinct(.x)), n = map_dbl(data, ~ nrow(.x))) %>%
filter(n > 3)
cor.df <- model_vs_ampad %>%
mutate(
cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]], .x[["logFC"]], method = "pearson")),
correlation = map_dbl(cor_test, "estimate"),
p_value = map_dbl(cor_test, "p.value")
) %>%
ungroup() %>%
dplyr::select(-cor_test) %>%
mutate(significant = p_value < 0.05)
finally, plot the biodomain correlations
R
tmp <- cor.df %>% filter(grepl('_', set) == F) %>%
mutate(model = factor(model, c('5xFAD, 10m', '5xFAD, 6m', '5xFAD, 4m') %>% rev))
ggplot(data = tmp, aes(set, model)) +
geom_tile(color = 'black', fill = 'white') +
geom_point(aes(color = correlation, size = abs(correlation))) +
geom_point(
data = subset(tmp, significant),
stroke = 1.2,
shape = 0,
size = 6
) +
scale_x_discrete(position = 'top') +
scale_y_discrete(limits = rev, position = 'right') +
scale_size() + #guide = 'none'
scale_color_gradient2(
name = "Correlation",
low = "#85070C",
high = "#164B6E",
guide = guide_colorbar(ticks = FALSE)
) +
labs(x = NULL, y = NULL) +
theme(
# strip.text.x = element_text(angle = 90, size = 10,colour = c("black")),
strip.text.y.left = element_text(angle = 0, size = 12),
axis.ticks = element_blank(),
axis.text.x = element_text(
angle = 45,
hjust = 0,
vjust = 0,
size = 12
),
axis.text.y = element_text(angle = 0, size = 12),
panel.background = element_blank(),
plot.title = element_text(
angle = 0,
vjust = -54,
hjust = 0.03,
size = 12,
face = "bold"
),
plot.title.position = "plot",
panel.grid = element_blank(),
legend.position = "bottom",
plot.margin = margin(2, 2, 2, 20)
)
You can also plot the subdomain correlations, though it is best to filter these to the strongest and most significant correlations.
R
l <- cor.df %>%
filter(grepl('_', set) == T, (p_value < 5e-2 &
abs(correlation) > 0.25)) %>%
pull(set)
tmp <- cor.df %>%
ungroup() %>%
filter(set %in% l) %>%
mutate(max.n = max(n), .by = set) %>%
mutate(
model = factor(model, c('5xFAD, 10m', '5xFAD, 6m', '5xFAD, 4m') %>% rev),
set1 = str_split_fixed(set, '_', 2)[, 2] %>% str_c('[', max.n, '] ', .),
abbr = str_split_fixed(set, '_', 2)[, 1]
)
ggplot(data = tmp, aes(model, set1)) +
facet_grid(
rows = vars(abbr),
scales = 'free',
space = 'free',
switch = 'y'
) +
geom_tile(color = 'black', fill = 'white') +
geom_point(aes(color = correlation, size = abs(correlation))) +
geom_point(
data = subset(tmp, significant),
stroke = 1.2,
shape = 0,
size = 6
) +
scale_x_discrete(position = 'bottom') +
scale_y_discrete(limits = rev, position = 'right') +
scale_size() + #guide = 'none'
scale_color_gradient2(
name = "Correlation",
low = "#85070C",
high = "#164B6E",
guide = guide_colorbar(ticks = FALSE)
) +
labs(x = NULL, y = NULL) +
theme(
# strip.text.x = element_text(angle = 90, size = 10,colour = c("black")),
strip.text.y.left = element_text(angle = 0),
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(angle = 0),
panel.background = element_blank(),
plot.title = element_text(
angle = 0,
vjust = -54,
hjust = 0.03,
size = 12,
face = "bold"
),
plot.title.position = "plot",
panel.grid = element_blank(),
legend.position = "left",
plot.margin = margin(2, 2, 2, 20)
)
Most biodomains have a significant positive correlation with AD transcriptomes, some get stronger at older ages of 5xFAD. The most significant correlations among subdomains are for the APP Metabolism, Apoptosis, Autophagy, Mitochondrial Metabolism, and Myelination domains. There is also positive correlation for “tau-protein kinase activity.
[7] Conclusion:
Overall, by aligning human and mouse omics signatures through the lens of domains affected in each context, we can get a better understanding of the relationships between the biological processes affected in each context.
- AMP-AD gene modules represent major transcriptomic heterogeneity in AD
- Correlation of logFC is a practical approach for human-mouse alignment of AD-associated transcriptomic signatures
- Functional gene set signatures are also a useful point of comparison between species
- Subsetting functions into associated Biological Domains and Subdomains allows for more granular comparisons
Session Info
R
sessionInfo()
R version 4.5.2 (2025-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] BiocManager_1.30.27 compiler_4.5.2 cli_3.6.5
[4] tools_4.5.2 pillar_1.11.1 otel_0.2.0
[7] glue_1.8.0 yaml_2.3.12 vctrs_0.7.1
[10] knitr_1.51 xfun_0.56 lifecycle_1.0.5
[13] rlang_1.1.7 renv_1.1.7 evaluate_1.0.5