This lesson is still being designed and assembled (Pre-Alpha version)

Omics translation in AD

Introduction

Overview

Teaching: 0 min
Exercises: 0 min
Questions
Objectives

Key Points


Synapse and AD Knowledge Portal

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • How to work with Synapse R client?

  • How to work with data in AD Knowledge Portal?

Objectives
  • Explain how to use Synapser Package.

  • Demonstrate how to locate data and metadata in the Portal.

  • Demonstrate how to download data from the Portal programmatically.

Author: Sage Bionetworks

Setup

Install and load packages

If you haven’t already, install synapser (the Synapse R client), as well as the tidyverse family of packages.

# install synapser
install.packages("synapser", repos = c("http://ran.synapse.org", "http://cran.fhcrc.org"))

# install tidyverse if you don't already have it
install.packages("tidyverse")

We will also use the BioconductoR package manager to install biomaRt, which will help us with gene count data later.

#install.packages("XML")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomaRt")

Load libraries

library(synapser)
library(tidyverse)
library(biomaRt)

Login to Synapse

Next, you will need to log in to your Synapse account.

Login option 1: Synapser takes credentials from your Synapse web session

If you are logged into the Synapse web browser, synapser will automatically use your login credentials to log you in during your R session! All you have to do is:

synLogin()

If for whatever reason that didn’t work, try one of these options:

Login option 2: Synapse username and password

In the code below, replace the <> with your Synapse username and password.

synLogin("<username>", "<password>")

Login option 3: Synapse PAT

If you usually log in to Synapse with your Google account, you will need to use a Synapser Personal Access Token (PAT) to log in with the R client. Follow these instructions to generate a personal access token, then paste the PAT into the code below. Make sure you scope your access token to allow you to View, Download, and Modify.

synLogin(authToken = "<paste your personal access token here>")

For more information on managing Synapse credentials with synapser, see the documentation here.

Download data

While you can always download data from the AD Portal website via your web browser, it’s usually faster and often more convenient to download data programmatically.

Download a single file

To download a single file from the AD Knowledge Portal, you can click the linked file name to go to a page in the Synapse platform where that file is stored. Using the synID on that page, you can call the synGet() function from synapser to download the file.

Exercise 1: Use Explore Data to find processed RNAseq data from the Jax.IU.Pitt_5XFAD Study

This filters the table to a single file. In the “Id” column for this htseqcounts_5XFAD.txt file, there is a unique Synapse ID (synID).

We can then use that synID to download the file.

counts_id <- "syn22108847"
synGet(counts_id, downloadLocation = "../data/")

Bulk download files

Exercise 2: Use Explore Studies to find all metadata files from the Jax.IU.Pitt_5XFAD study

Use the facets and search bar to look for data you want to download from the AD Knowledge Portal. Once you’ve identified the files you want, click on the download arrow icon on the top right of the Explore Data table and select “Programmatic Options” from the drop-down menu.

In the window that pops up, select the “R” tab from the top menu bar. This will display some R code that constructs a SQL query of the Synapse data table that drives the AD Knowledge Portal. This query will allow us to download only the files that meet our search criteria.

The function synTableQuery() returns a Synapse object wrapper around a CSV file that is automatically downloaded to a Synapse cache directory .synapseCache in your home directory. You can use query$filepath to see the path to the file in the Synapse cache.

# download the results of the filtered table query
query <- synTableQuery("SELECT * FROM syn11346063.37 WHERE ( ( `study` HAS ( 'Jax.IU.Pitt_5XFAD' ) ) AND ( `resourceType` = 'metadata' ) )")

# view the file path of the resulting csv
query$filepath
[1] "/Users/auyar/.synapseCache/628/125732628/SYNAPSE_TABLE_QUERY_125732628.csv"

We’ll use read.csv to read the CSV file into R (although the provided read.table or any other base R version is also fine!). We can explore the download_table object and see that it contains information on all of the AD Portal data files we want to download. Some columns like the “id” and “parentId” columns contain info about where the file is in Synapse, and some columns contain AD Portal annotations for each file, like “dataType”, “specimenID”, and “assay”. This annotation table will later allow us to link downloaded files to additional metadata variables!

# read in the table query csv file
download_table <- read_csv(query$filepath, show_col_types = FALSE)

download_table
# A tibble: 5 × 45
    ROW_ID ROW_VERSION ROW_ETAG    id    name  study dataType assay organ tissue
     <dbl>       <dbl> <chr>       <chr> <chr> <chr> <chr>    <chr> <lgl> <lgl> 
1 22094731           2 84e4bc38-9… syn2… Jax.… "[\"… "[\"ima… "[\"… NA    NA    
2 22094732           2 266ec572-1… syn2… Jax.… "[\"… "[\"ima… "[\"… NA    NA    
3 22103212           3 7229ee91-4… syn2… Jax.… "[\"…  <NA>     <NA> NA    NA    
4 22103213           3 ce970c4c-d… syn2… Jax.… "[\"…  <NA>     <NA> NA    NA    
5 22110328           4 20cf3097-4… syn2… Jax.… "[\"… "[\"gen… "[\"… NA    NA    
# ℹ 35 more variables: species <chr>, sex <lgl>, consortium <chr>, grant <chr>,
#   modelSystemName <lgl>, treatmentType <lgl>, specimenID <lgl>,
#   individualID <lgl>, individualIdSource <lgl>, specimenIdSource <lgl>,
#   resourceType <chr>, dataSubtype <lgl>, metadataType <chr>,
#   assayTarget <lgl>, analysisType <lgl>, cellType <lgl>,
#   nucleicAcidSource <lgl>, fileFormat <chr>, group <lgl>,
#   isModelSystem <lgl>, isConsortiumAnalysis <lgl>, isMultiSpecimen <lgl>, …

Finally, we use a mapping function from the purrr package to loop through the “id” column and apply the synGet() function to each file’s synID. In this case, we use purrr::walk() because it lets us call synGet() for its side effect (downloading files to a location we specify), and returns nothing.

# loop through the column of synIDs and download each file
purrr::walk(download_table$id, ~synGet(.x, downloadLocation = "../data/"))

Congratulations, you have bulk downloaded files from the AD Knowledge Portal!

An important note: for situations where you are downloading many large files, the R client performs substantially slower than the command line client or the Python client. In these cases, you can use the instructions and code snippets for the command line or Python client provided in the “Programmatic Options” menu.


Working with AD Portal metadata

Metadata basics

We have now downloaded several metadata files and an RNAseq counts file from the portal. For our next exercises, we want to read those files in as R data so we can work with them.

We can see from the download_table we got during the bulk download step that we have five metadata files. Two of these should be the individual and biospecimen files, and three of them are assay metadata files.

#
download_table %>% 
  dplyr::select(name, metadataType, assay)
# A tibble: 5 × 3
  name                                         metadataType assay               
  <chr>                                        <chr>        <chr>               
1 Jax.IU.Pitt_5XFAD_assay_autorad_metadata.csv assay        "[\"autoradiography…
2 Jax.IU.Pitt_5XFAD_assay_PET_metadata.csv     assay        "[\"Positron Emissi…
3 Jax.IU.Pitt_5XFAD_individual_metadata.csv    individual    <NA>               
4 Jax.IU.Pitt_5XFAD_biospecimen_metadata.csv   biospecimen   <NA>               
5 Jax.IU.Pitt_5XFAD_assay_RNAseq_metadata.csv  assay        "[\"rnaSeq\"]"      

We are only interested in RNAseq data, so we will only read in the individual, biospecimen, and RNAseq assay metadata files.

# counts matrix
counts <- read_tsv("../data/htseqcounts_5XFAD.txt", show_col_types = FALSE)

# individual metadata
ind_meta <- read_csv("../data/Jax.IU.Pitt_5XFAD_individual_metadata.csv", show_col_types = FALSE)

# biospecimen metadata
bio_meta <- read_csv("../data/Jax.IU.Pitt_5XFAD_biospecimen_metadata.csv", show_col_types = FALSE)

#assay metadata
rna_meta <- read_csv("../data/Jax.IU.Pitt_5XFAD_assay_RNAseq_metadata.csv", show_col_types = FALSE)

Let’s examine the data and metadata files a bit before we begin our analyses.

Counts data

# Calling a tibble object will print the first ten rows in a nice tidy output; doing the same for a base R dataframe will print the whole thing until it runs out of memory. If you want to inspect a large dataframe, use `head(df)`
counts
# A tibble: 55,489 × 73
   gene_id `32043rh` `32044rh` `32046rh` `32047rh` `32048rh` `32049rh` `32050rh`
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 ENSG00…     22554         0         0         0     16700         0         0
 2 ENSG00…    344489         4         0         1    260935         6         8
 3 ENSMUS…      5061      3483      3941      3088      2756      3067      2711
 4 ENSMUS…         0         0         0         0         0         0         0
 5 ENSMUS…       208       162       138       127        95       154       165
 6 ENSMUS…        44        17        14        28        23        24        14
 7 ENSMUS…       143        88       121       117       115       109        75
 8 ENSMUS…        22         6        10        11        11        19        24
 9 ENSMUS…      7165      5013      5581      4011      4104      5254      4345
10 ENSMUS…      3728      2316      2238      1965      1822      1999      1809
# ℹ 55,479 more rows
# ℹ 65 more variables: `32052rh` <dbl>, `32053rh` <dbl>, `32057rh` <dbl>,
#   `32059rh` <dbl>, `32061rh` <dbl>, `32062rh` <dbl>, `32065rh` <dbl>,
#   `32067rh` <dbl>, `32068rh` <dbl>, `32070rh` <dbl>, `32073rh` <dbl>,
#   `32074rh` <dbl>, `32075rh` <dbl>, `32078rh` <dbl>, `32081rh` <dbl>,
#   `32088rh` <dbl>, `32640rh` <dbl>, `46105rh` <dbl>, `46106rh` <dbl>,
#   `46107rh` <dbl>, `46108rh` <dbl>, `46109rh` <dbl>, `46110rh` <dbl>, …

The data file has a column of ENSEMBL gene ids and then a bunch of columns with count data, where the column headers correspond to the specimenIDs. These specimenIDs should all be in the RNAseq assay metadata file, so let’s check.

# what does the RNAseq assay metadata look like?
rna_meta
# A tibble: 72 × 12
   specimenID platform   RIN   rnaBatch libraryBatch sequencingBatch libraryPrep
   <chr>      <chr>      <lgl>    <dbl>        <dbl>           <dbl> <chr>      
 1 32043rh    IlluminaN… NA           1            1               1 polyAselec…
 2 32044rh    IlluminaN… NA           1            1               1 polyAselec…
 3 32046rh    IlluminaN… NA           1            1               1 polyAselec…
 4 32047rh    IlluminaN… NA           1            1               1 polyAselec…
 5 32049rh    IlluminaN… NA           1            1               1 polyAselec…
 6 32057rh    IlluminaN… NA           1            1               1 polyAselec…
 7 32061rh    IlluminaN… NA           1            1               1 polyAselec…
 8 32065rh    IlluminaN… NA           1            1               1 polyAselec…
 9 32067rh    IlluminaN… NA           1            1               1 polyAselec…
10 32070rh    IlluminaN… NA           1            1               1 polyAselec…
# ℹ 62 more rows
# ℹ 5 more variables: libraryPreparationMethod <lgl>, isStranded <lgl>,
#   readStrandOrigin <lgl>, runType <chr>, readLength <dbl>
# are all the column headers from the counts matrix (except the first "gene_id" column) in the assay metadata?
all(colnames(counts[-1]) %in% rna_meta$specimenID)
[1] TRUE

Assay metadata

The assay metadata contains information about how data was generated on each sample in the assay. Each specimenID represents a unique sample. We can use some tools from dplyr to explore the metadata.

# how many unique specimens were sequenced?
n_distinct(rna_meta$specimenID)
[1] 72
# were the samples all sequenced on the same platform?
distinct(rna_meta, platform)
# A tibble: 1 × 1
  platform           
  <chr>              
1 IlluminaNovaseq6000
# were there multiple sequencing batches reported?
distinct(rna_meta, sequencingBatch) 
# A tibble: 1 × 1
  sequencingBatch
            <dbl>
1               1

Biospecimen metadata

The biospecimen metadata contains specimen-level information, including organ and tissue the specimen was taken from, how it was prepared, etc. Each specimenID is mapped to an individualID.

# all specimens from the RNAseq assay metadata file should be in the biospecimen file
all(rna_meta$specimenID %in% bio_meta$specimenID)
[1] TRUE
# but the biospecimen file also contains specimens from different assays
all(bio_meta$specimenID %in% rna_meta$specimenID)
[1] FALSE

Individual metadata

The individual metadata contains information about all the individuals in the study, represented by unique individualIDs. For humans, this includes information on age, sex, race, diagnosis, etc. For MODEL-AD mouse models, the individual metadata has information on model genotypes, stock numbers, diet, and more.

# all individualIDs in the biospecimen file should be in the individual file
all(bio_meta$individualID %in% ind_meta$individualID)
[1] FALSE
# which model genotypes are in this study?
distinct(ind_meta, genotype)
# A tibble: 3 × 1
  genotype        
  <chr>           
1 <NA>            
2 5XFAD_carrier   
3 5XFAD_noncarrier

Joining metadata

We use the three-file structure for our metadata because it allows us to store metadata for each study in a tidy format. Every line in the assay and biospecimen files represents a unique specimen, and every line in the individual file represents a unique individual. This means the files can be easily joined by specimenID and individualID to get all levels of metadata that apply to a particular data file. We will use the left_join() function from the dplyr package, and the %>% operator from the magrittr package. If you are unfamiliar with the pipe, think of it as a shorthand for “take this (the preceding object) and do that (the subsequent command)”. See here for more info on piping in R.

# join all the rows in the assay metadata that have a match in the biospecimen metadata
joined_meta <- rna_meta %>% #start with the rnaseq assay metadata
  left_join(bio_meta, by = "specimenID") %>%  #join rows from biospecimen that match specimenID 
  left_join(ind_meta, by = "individualID") # join rows from individual that match individualID

joined_meta
# A tibble: 144 × 50
   specimenID platform   RIN   rnaBatch libraryBatch sequencingBatch libraryPrep
   <chr>      <chr>      <lgl>    <dbl>        <dbl>           <dbl> <chr>      
 1 32043rh    IlluminaN… NA           1            1               1 polyAselec…
 2 32043rh    IlluminaN… NA           1            1               1 polyAselec…
 3 32044rh    IlluminaN… NA           1            1               1 polyAselec…
 4 32044rh    IlluminaN… NA           1            1               1 polyAselec…
 5 32046rh    IlluminaN… NA           1            1               1 polyAselec…
 6 32046rh    IlluminaN… NA           1            1               1 polyAselec…
 7 32047rh    IlluminaN… NA           1            1               1 polyAselec…
 8 32047rh    IlluminaN… NA           1            1               1 polyAselec…
 9 32049rh    IlluminaN… NA           1            1               1 polyAselec…
10 32049rh    IlluminaN… NA           1            1               1 polyAselec…
# ℹ 134 more rows
# ℹ 43 more variables: libraryPreparationMethod <lgl>, isStranded <lgl>,
#   readStrandOrigin <lgl>, runType <chr>, readLength <dbl>,
#   individualID <dbl>, specimenIdSource <chr>, organ <chr>, tissue <chr>,
#   BrodmannArea <lgl>, sampleStatus <chr>, tissueWeight <lgl>,
#   tissueVolume <lgl>, nucleicAcidSource <lgl>, cellType <lgl>,
#   fastingState <lgl>, isPostMortem <lgl>, samplingAge <lgl>, ...1 <dbl>, …

We now have a very wide dataframe that contains all the available metadata on each specimen in the RNAseq data from this study. This procedure can be used to join the three types of metadata files for every study in the AD Knowledge Portal, allowing you to filter individuals and specimens as needed based on your analysis criteria!

# convert columns of strings to month-date-year format
joined_meta_time <- joined_meta %>%
  # convert numeric ages to timepoint categories
  mutate(timepoint = case_when(ageDeath > 10 ~ "12 mo",
                               ageDeath < 10 & ageDeath > 5 ~ "6 mo",
                               ageDeath < 5 ~ "4 mo"))
covars_5XFAD <- joined_meta_time %>%
  dplyr::select(individualID, specimenID, sex, genotype, timepoint) %>% distinct() %>% as.data.frame()
rownames(covars_5XFAD) <- covars_5XFAD$specimenID

We will save joined_meta for the next lesson.

saveRDS(covars_5XFAD, file = "../data/covars_5XFAD.rds")

Single-specimen files

For files that contain data from a single specimen (e.g. raw sequencing files, raw mass spectra, etc.), we can use the Synapse annotations to associate these files with the appropriate metadata.

Excercise 3: Use Explore Data to find all RNAseq files from the Jax.IU.Pitt_5XFAD study.

If we filter for data where Study = “Jax.IU.Pitt_5XFAD” and Assay = “rnaSeq” we will get a list of 148 files, including raw fastqs and processed counts data.

Synapse entity annotations

We can use the function synGetAnnotations to view the annotations associated with any file without downloading the file.

# the synID of a random fastq file from this list
random_fastq <- "syn22108503"

# extract the annotations as a nested list
fastq_annotations <- synGetAnnotations(random_fastq)

fastq_annotations
$sex
[1] "female"

$room
[1] "JAX_MGL373"

$assay
[1] "rnaSeq"

$grant
[1] "U54AG054345"

$organ
[1] "brain"

$study
[1] "Jax.IU.Pitt_5XFAD"

$tissue
[1] "right cerebral hemisphere"

$bedding
[1] "aspen"

$birthID
[1] "RMO1223"

$climbID
[1] "298456"

$species
[1] "Mouse"

$waterpH
[1] 2.85

$ageDeath
[1] 10.81967

$dataType
[1] "geneExpression"

$genotype
[1] "5XFAD_carrier"

$matingID
[1] "M-108-17"

$dateBirth
[1] "1521417600000"

$consortium
[1] "MODEL-AD"

$fileFormat
[1] "fastq"

$generation
[1] "N1F3"

$rodentDiet
[1] "0.06"

$specimenID
[1] "32043rh"

$brainWeight
[1] 0.503

$dataSubtype
[1] "raw"

$microchipID
[1] "288646853"

$stockNumber
[1] "8730"

$individualID
[1] "32043"

$officialName
[1] "B6.Cg-Tg(APPSwFlLon,PSEN1*M146L*L286V)6799Vas/Mmjax"

$resourceType
[1] "experimentalData"

$rodentWeight
[1] 28.76

$ageDeathUnits
[1] "months"

$isModelSystem
[1] FALSE

$materialOrigin
[1] "JAX"

$isMultiSpecimen
[1] FALSE

$modelSystemName
[1] "5XFAD"

$modelSystemType
[1] "animal"

$nucleicAcidSource
[1] "bulk cell"

$genotypeBackground
[1] "C57BL6J"

$individualIdSource
[1] "JAX"

$individualCommonGenotype
[1] "5XFAD"

The file annotations let us see which study the file is associated with (Jax.IU.Pitt.5XFAD), which species it’s from (Mouse), which assay generated the file (rnaSeq), and a whole bunch of other properties. Most importantly, single-specimen files are annotated with with the specimenID of the specimen in the file, and the individualID of the individual that specimen was taken from. We can use these annotations to link files to the rest of the metadata, including metadata that is not in annotations. This is especially helpful for human studies, as potentially identifying information like age, race, and diagnosis is not included in file annotations.

# find records belonging to the individual this file maps to in our joined metadata
joined_meta %>% 
  filter(individualID == fastq_annotations$individualID[[1]])
# A tibble: 2 × 50
  specimenID platform    RIN   rnaBatch libraryBatch sequencingBatch libraryPrep
  <chr>      <chr>       <lgl>    <dbl>        <dbl>           <dbl> <chr>      
1 32043rh    IlluminaNo… NA           1            1               1 polyAselec…
2 32043rh    IlluminaNo… NA           1            1               1 polyAselec…
# ℹ 43 more variables: libraryPreparationMethod <lgl>, isStranded <lgl>,
#   readStrandOrigin <lgl>, runType <chr>, readLength <dbl>,
#   individualID <dbl>, specimenIdSource <chr>, organ <chr>, tissue <chr>,
#   BrodmannArea <lgl>, sampleStatus <chr>, tissueWeight <lgl>,
#   tissueVolume <lgl>, nucleicAcidSource <lgl>, cellType <lgl>,
#   fastingState <lgl>, isPostMortem <lgl>, samplingAge <lgl>, ...1 <dbl>,
#   climbID <dbl>, microchipID <dbl>, birthID <chr>, matingID <chr>, …

Annotations during bulk download

When bulk downloading many files, the best practice is to preserve the download manifest that is generated which lists all the files, their synIDs, and all their annotations. If using the Synapse R client, follow the instructions in the Bulk download files section above.

If we use the “Programmatic Options” tab in the AD Portal download menu to download all 148 rnaSeq files from the 5XFAD study, we would get a table query that looks like this:

query <- synTableQuery("SELECT * FROM syn11346063.37 WHERE ( ( \"study\" HAS ( 'Jax.IU.Pitt_5XFAD' ) ) AND ( \"assay\" HAS ( 'rnaSeq' ) ) )")

As we saw previously, this downloads a csv file with the results of our AD Portal query. Opening that file lets us see which specimens are associated with which files:

#
annotations_table <- read_csv(query$filepath, show_col_types = FALSE)

annotations_table
# A tibble: 148 × 45
     ROW_ID ROW_VERSION ROW_ETAG   id    name  study dataType assay organ tissue
      <dbl>       <dbl> <chr>      <chr> <chr> <chr> <chr>    <chr> <chr> <chr> 
 1 22108503           1 458ff182-… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
 2 22108508           1 163d5067-… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
 3 22108512           1 d02e16c5-… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
 4 22108519           1 59eba082-… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
 5 22108525           1 9fed0677-… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
 6 22108530           1 47cff701-… syn2… 3205… "[\"… "[\"gen… "[\"… brain "[\"r…
 7 22108534           1 03af7c0a-… syn2… 3206… "[\"… "[\"gen… "[\"… brain "[\"r…
 8 22108539           1 502d1468-… syn2… 3206… "[\"… "[\"gen… "[\"… brain "[\"r…
 9 22108543           1 47b8ffe9-… syn2… 3206… "[\"… "[\"gen… "[\"… brain "[\"r…
10 22108550           1 471ec9a1-… syn2… 3207… "[\"… "[\"gen… "[\"… brain "[\"r…
# ℹ 138 more rows
# ℹ 35 more variables: species <chr>, sex <chr>, consortium <chr>, grant <chr>,
#   modelSystemName <chr>, treatmentType <lgl>, specimenID <chr>,
#   individualID <dbl>, individualIdSource <chr>, specimenIdSource <lgl>,
#   resourceType <chr>, dataSubtype <chr>, metadataType <chr>,
#   assayTarget <lgl>, analysisType <lgl>, cellType <lgl>,
#   nucleicAcidSource <chr>, fileFormat <chr>, group <lgl>, …

You could then use purrr::walk(download_table$id, ~synGet(.x, downloadLocation = <your-download-directory>)) to walk through the column of synIDs and download all 148 files. However, because these are large files, it might be preferable to use the Python client or command line client for increased speed.

Once you’ve downloaded all the files in the id column, you can link those files to their annotations by the name column.

# We'll use the "random fastq" that we got annotations for earlier
# To avoid downloading the whole 3GB file, we'll use synGet with "downloadFile = FALSE" to get only the Synapse entity information, rather than the file. 
# If we downloaded the actual file, we could find it in the directory and search using the filename. Since we're just downloading the Synapse entity wrapper object, we'll use the file name listed in the object properties.

fastq <- synGet(random_fastq, downloadFile = FALSE)

# filter the annotations table to rows that match the fastq filename
annotations_table %>% 
  filter(name == fastq$properties$name)
# A tibble: 1 × 45
    ROW_ID ROW_VERSION ROW_ETAG    id    name  study dataType assay organ tissue
     <dbl>       <dbl> <chr>       <chr> <chr> <chr> <chr>    <chr> <chr> <chr> 
1 22108503           1 458ff182-d… syn2… 3204… "[\"… "[\"gen… "[\"… brain "[\"r…
# ℹ 35 more variables: species <chr>, sex <chr>, consortium <chr>, grant <chr>,
#   modelSystemName <chr>, treatmentType <lgl>, specimenID <chr>,
#   individualID <dbl>, individualIdSource <chr>, specimenIdSource <lgl>,
#   resourceType <chr>, dataSubtype <chr>, metadataType <chr>,
#   assayTarget <lgl>, analysisType <lgl>, cellType <lgl>,
#   nucleicAcidSource <chr>, fileFormat <chr>, group <lgl>,
#   isModelSystem <lgl>, isConsortiumAnalysis <lgl>, isMultiSpecimen <lgl>, …

Multispecimen files

Multispecimen files in the AD Knowledge Portal are files that contain data or information from more than one specimen. They are not annotated with individualIDs or specimenIDs, since these files may contain numbers of specimens that exceed the annotation limits. These files are usually processed or summary data (gene counts, peptide quantifications, etc), and are always annotated with isMultiSpecimen = TRUE.

If we look at the processed data files in the table of 5XFAD RNAseq file annotations we just downloaded, we will see that it isMultiSpecimen = TRUE, but individualID and specimenID are blank:

#
annotations_table %>% 
  filter(fileFormat == "txt") %>% 
  dplyr::select(name, individualID, specimenID, isMultiSpecimen)
# A tibble: 3 × 4
  name                  individualID specimenID isMultiSpecimen
  <chr>                        <dbl> <chr>      <lgl>          
1 htseqcounts_5XFAD.txt           NA <NA>       TRUE           
2 tpm_gene_5XFAD.txt              NA <NA>       TRUE           
3 tpm_isoform_5XFAD.txt           NA <NA>       TRUE           

The multispecimen file should contain a row or column of specimenIDs that correspond to the specimenIDs used in a study’s metadata, as we have seen with the 5XFAD counts file.

# In this example, we take a slice of the counts data to reduce computation, transpose it so that each row represents a single specimen, and then join it to the joined metadata by the specimenID
counts %>% 
  slice_head(n = 5) %>% 
  t() %>% 
  as_tibble(rownames = "specimenID") %>% 
  left_join(joined_meta, by = "specimenID")
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is
omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# A tibble: 145 × 55
   specimenID V1    V2    V3    V4    V5    platform RIN   rnaBatch libraryBatch
   <chr>      <chr> <chr> <chr> <chr> <chr> <chr>    <lgl>    <dbl>        <dbl>
 1 gene_id    "ENS… "ENS… "ENS… "ENS… "ENS… <NA>     NA          NA           NA
 2 32043rh    " 22… "344… "  5… "   … "   … Illumin… NA           1            1
 3 32043rh    " 22… "344… "  5… "   … "   … Illumin… NA           1            1
 4 32044rh    "   … "   … "348… "   … " 16… Illumin… NA           1            1
 5 32044rh    "   … "   … "348… "   … " 16… Illumin… NA           1            1
 6 32046rh    "   … "   … "394… "   … " 13… Illumin… NA           1            1
 7 32046rh    "   … "   … "394… "   … " 13… Illumin… NA           1            1
 8 32047rh    "   … "   … "308… "   … " 12… Illumin… NA           1            1
 9 32047rh    "   … "   … "308… "   … " 12… Illumin… NA           1            1
10 32048rh    " 16… "260… "  2… "   … "   … Illumin… NA           1            1
# ℹ 135 more rows
# ℹ 45 more variables: sequencingBatch <dbl>, libraryPrep <chr>,
#   libraryPreparationMethod <lgl>, isStranded <lgl>, readStrandOrigin <lgl>,
#   runType <chr>, readLength <dbl>, individualID <dbl>,
#   specimenIdSource <chr>, organ <chr>, tissue <chr>, BrodmannArea <lgl>,
#   sampleStatus <chr>, tissueWeight <lgl>, tissueVolume <lgl>,
#   nucleicAcidSource <lgl>, cellType <lgl>, fastingState <lgl>, …

Challenge 1

Download the raw read counts and the metadata from the Jax.IU.Pitt_APOE4.TREM2.R47H study (syn22107627, syn23613784).

Solution to Challenge 1

counts_id <- "syn22107627"
synGet(counts_id, downloadLocation = "../data/")
metadata_id <- 'syn23613784'
synGet(metadata_id, downloadLocation = "../data/")

Key Points

  • Use your Synapse login credentials to access the Portal.

  • Use Synapser package to download data from the Portal.


Differential Expression Analysis

Overview

Teaching: 40 min
Exercises: 10 min
Questions
  • What transcriptomic changes we observe in mouse models carrying AD-related mutations?

Objectives
  • Read in a count matrix and metadata.

  • Understand the data from AD mouse models

  • Format the data for differential analysis

  • Perform differential analysis using DESeq2.

  • Pathway enrichment of differentially expressed genes

  • Save data for next lessons

Author: Ravi Pandey, Jackson Laboratory

LOAD libraries

suppressPackageStartupMessages(library("DESeq2"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("AnnotationDbi"))
suppressPackageStartupMessages(library("org.Mm.eg.db"))
suppressPackageStartupMessages(library("GO.db"))
suppressPackageStartupMessages(library("EnhancedVolcano"))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(clusterProfiler))

Reading Gene Expression Count matrix from Previous Lesson

In this lesson, we will use the raw counts matrix and metadata downloaded in the previous lesson and will perform differential expression analysis.

RNA-Seq data from 5xFAD mouse models

counts <- read.delim("../data/htseqcounts_5XFAD.txt", check.names = FALSE)

Reading Sample Metadata from Previous Lesson

covars <- readRDS("../data/covars_5XFAD.rds")

Let’s explore the data:

Let’s look at the top of the metadata.

head(covars)
        individualID specimenID    sex         genotype timepoint
32043rh        32043    32043rh female    5XFAD_carrier     12 mo
32044rh        32044    32044rh   male 5XFAD_noncarrier     12 mo
32046rh        32046    32046rh   male 5XFAD_noncarrier     12 mo
32047rh        32047    32047rh   male 5XFAD_noncarrier     12 mo
32049rh        32049    32049rh female 5XFAD_noncarrier     12 mo
32057rh        32057    32057rh female 5XFAD_noncarrier     12 mo

identify distinct groups using sample metadata

distinct(covars, sex, genotype, timepoint)
           sex         genotype timepoint
32043rh female    5XFAD_carrier     12 mo
32044rh   male 5XFAD_noncarrier     12 mo
32049rh female 5XFAD_noncarrier     12 mo
46105rh female 5XFAD_noncarrier      6 mo
46108rh   male 5XFAD_noncarrier      6 mo
46131rh female 5XFAD_noncarrier      4 mo
46877rh   male 5XFAD_noncarrier      4 mo
46887rh female    5XFAD_carrier      4 mo
32053rh   male    5XFAD_carrier     12 mo
46111rh female    5XFAD_carrier      6 mo
46865rh   male    5XFAD_carrier      6 mo
46866rh   male    5XFAD_carrier      4 mo

We’re going to explore the data further using a series of Challenges. You will be asked to look at the contents of some of the columns to see how the data are distributed.

Challenge 1

How many mice were used to produce this data?

Solution to Challenge 1

covars %>% group_by(sex,genotype,timepoint) %>% count()
dplyr::count(metadata, sex, genotype,timepoint)

How many rows and columns are there in counts?

dim(counts)
[1] 55489    73

In the counts matrix, genes are in rows and samples are in columns. Let’s look at the first few rows.

head(counts,n=5)
             gene_id 32043rh 32044rh 32046rh 32047rh 32048rh 32049rh 32050rh
1    ENSG00000080815   22554       0       0       0   16700       0       0
2    ENSG00000142192  344489       4       0       1  260935       6       8
3 ENSMUSG00000000001    5061    3483    3941    3088    2756    3067    2711
4 ENSMUSG00000000003       0       0       0       0       0       0       0
5 ENSMUSG00000000028     208     162     138     127      95     154     165
  32052rh 32053rh 32057rh 32059rh 32061rh 32062rh 32065rh 32067rh 32068rh
1   19748   14023       0   17062       0   15986      10       0   18584
2  337456  206851       1  264748       0  252248     172       4  300398
3    3334    3841    4068    3306    4076    3732    3940    4238    3257
4       0       0       0       0       0       0       0       0       0
5     124     103     164     116     108     134     204     239     148
  32070rh 32073rh 32074rh 32075rh 32078rh 32081rh 32088rh 32640rh 46105rh
1       1       0       0   22783   17029   16626   15573   12721       4
2       4       2       9  342655  280968  258597  243373  188818      19
3    3351    3449    4654    4844    3132    3334    3639    3355    4191
4       0       0       0       0       0       0       0       0       0
5     159     167     157     211     162     149     160     103     158
  46106rh 46107rh 46108rh 46109rh 46110rh 46111rh 46112rh 46113rh 46115rh
1       0       0       0       0       0   17931       0   19087       0
2       0       0       1       5       1  293409       8  273704       1
3    3058    4265    3248    3638    3747    3971    3192    3805    3753
4       0       0       0       0       0       0       0       0       0
5     167     199     113     168     175     203     158     108     110
  46121rh 46131rh 46132rh 46134rh 46138rh 46141rh 46142rh 46862rh 46863rh
1       0       0   12703   18833       0   18702   17666       0   14834
2       0       1  187975  285048       0  284499  250600       0  218494
3    4134    3059    3116    3853    3682    2844    3466    3442    3300
4       0       0       0       0       0       0       0       0       0
5     179     137     145     183     171     138      88     154     157
  46865rh 46866rh 46867rh 46868rh 46871rh 46872rh 46873rh 46874rh 46875rh
1   10546   10830   10316   10638   15248       0       0   11608   11561
2  169516  152769  151732  190150  229063       6       1  165941  171303
3    3242    3872    3656    3739    3473    3154    5510    3657    4121
4       0       0       0       0       0       0       0       0       0
5     131     152     152     155     140      80     240     148     112
  46876rh 46877rh 46878rh 46879rh 46881rh 46882rh 46883rh 46884rh 46885rh
1       0       0   12683   15613       0   14084   20753       0       0
2       0       2  183058  216122       0  199448  306081       0       5
3    3422    3829    3996    4324    2592    2606    4600    2913    3614
4       0       0       0       0       0       0       0       0       0
5     147     166     169     215     115     101     174     127     151
  46886rh 46887rh 46888rh 46889rh 46890rh 46891rh 46892rh 46893rh 46895rh
1   16639   16072       0   16680   13367       0   25119      92       0
2  242543  258061       0  235530  196721       0  371037    1116       0
3    3294    3719    3899    4173    4008    3037    5967    3459    4262
4       0       0       0       0       0       0       0       0       0
5     139     128     210     127     156     116     260     161     189
  46896rh 46897rh
1   15934       0
2  235343       6
3    3923    3486
4       0       0
5     179     117

As you can see, the gene ids are ENSEBL IDs. There is some risk that these may not be unique. Let’s check whether any of the gene symbols are duplicated. We will sum the number of duplicated gene symbols.

sum(duplicated(rownames(counts)))
[1] 0

The sum equals zero, so there are no duplicated gene symbols, which is good. Similarly, samples should be unique. Once again, let’s verify this:

sum(duplicated(colnames(counts)))
[1] 0

Formatting the count matrix

Now, as we see that gene_id is in first column of count matrix, but we will need only count data in matrix, so we will change the gene_id column to rownames.
Converting the gene_id as rownames of count matrix

counts <- counts %>% column_to_rownames(.,var="gene_id") %>% as.data.frame()

let’s confirm if change is done correctly

head(counts,n=5)
                   32043rh 32044rh 32046rh 32047rh 32048rh 32049rh 32050rh
ENSG00000080815      22554       0       0       0   16700       0       0
ENSG00000142192     344489       4       0       1  260935       6       8
ENSMUSG00000000001    5061    3483    3941    3088    2756    3067    2711
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     208     162     138     127      95     154     165
                   32052rh 32053rh 32057rh 32059rh 32061rh 32062rh 32065rh
ENSG00000080815      19748   14023       0   17062       0   15986      10
ENSG00000142192     337456  206851       1  264748       0  252248     172
ENSMUSG00000000001    3334    3841    4068    3306    4076    3732    3940
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     124     103     164     116     108     134     204
                   32067rh 32068rh 32070rh 32073rh 32074rh 32075rh 32078rh
ENSG00000080815          0   18584       1       0       0   22783   17029
ENSG00000142192          4  300398       4       2       9  342655  280968
ENSMUSG00000000001    4238    3257    3351    3449    4654    4844    3132
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     239     148     159     167     157     211     162
                   32081rh 32088rh 32640rh 46105rh 46106rh 46107rh 46108rh
ENSG00000080815      16626   15573   12721       4       0       0       0
ENSG00000142192     258597  243373  188818      19       0       0       1
ENSMUSG00000000001    3334    3639    3355    4191    3058    4265    3248
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     149     160     103     158     167     199     113
                   46109rh 46110rh 46111rh 46112rh 46113rh 46115rh 46121rh
ENSG00000080815          0       0   17931       0   19087       0       0
ENSG00000142192          5       1  293409       8  273704       1       0
ENSMUSG00000000001    3638    3747    3971    3192    3805    3753    4134
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     168     175     203     158     108     110     179
                   46131rh 46132rh 46134rh 46138rh 46141rh 46142rh 46862rh
ENSG00000080815          0   12703   18833       0   18702   17666       0
ENSG00000142192          1  187975  285048       0  284499  250600       0
ENSMUSG00000000001    3059    3116    3853    3682    2844    3466    3442
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     137     145     183     171     138      88     154
                   46863rh 46865rh 46866rh 46867rh 46868rh 46871rh 46872rh
ENSG00000080815      14834   10546   10830   10316   10638   15248       0
ENSG00000142192     218494  169516  152769  151732  190150  229063       6
ENSMUSG00000000001    3300    3242    3872    3656    3739    3473    3154
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     157     131     152     152     155     140      80
                   46873rh 46874rh 46875rh 46876rh 46877rh 46878rh 46879rh
ENSG00000080815          0   11608   11561       0       0   12683   15613
ENSG00000142192          1  165941  171303       0       2  183058  216122
ENSMUSG00000000001    5510    3657    4121    3422    3829    3996    4324
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     240     148     112     147     166     169     215
                   46881rh 46882rh 46883rh 46884rh 46885rh 46886rh 46887rh
ENSG00000080815          0   14084   20753       0       0   16639   16072
ENSG00000142192          0  199448  306081       0       5  242543  258061
ENSMUSG00000000001    2592    2606    4600    2913    3614    3294    3719
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     115     101     174     127     151     139     128
                   46888rh 46889rh 46890rh 46891rh 46892rh 46893rh 46895rh
ENSG00000080815          0   16680   13367       0   25119      92       0
ENSG00000142192          0  235530  196721       0  371037    1116       0
ENSMUSG00000000001    3899    4173    4008    3037    5967    3459    4262
ENSMUSG00000000003       0       0       0       0       0       0       0
ENSMUSG00000000028     210     127     156     116     260     161     189
                   46896rh 46897rh
ENSG00000080815      15934       0
ENSG00000142192     235343       6
ENSMUSG00000000001    3923    3486
ENSMUSG00000000003       0       0
ENSMUSG00000000028     179     117

As you can see from count table there are some genes that start with “ENSG” and others start with “ENSMUSG”. “ENSG” referes to human gene ENSEMBL id and “ENSMUSG” refer to mouse ENSEMBL id. Let’s check how many gene_ids are NOT from the mouse genome by searching for the string “MUS” (as in Mus musculus) in the rownames of count matrix

counts[,1:6] %>% 
  filter(!str_detect(rownames(.), "MUS"))
                32043rh 32044rh 32046rh 32047rh 32048rh 32049rh
ENSG00000080815   22554       0       0       0   16700       0
ENSG00000142192  344489       4       0       1  260935       6

Ok, so we see there are two human genes in out count matrix. Why? What genes are they?

Briefly, 5xFAD mouse strain harbors two human transgenes APP (“ENSG00000142192”) and PSEN1 (“ENSG00000080815”) and inserted into exon 2 of the mouse Thy1 gene. To validate 5XFAD strain and capture expression of human transgene APP and PS1, a custom mouse genomic sequences was created and we quantified expression of human as well as mouse App (“ENSMUSG00000022892”) and Psen1 (“ENSMUSG00000019969”) genes by our MODEL-AD RNA-Seq pipeline.

Validation of 5xFAD mouse strain

#First we convert the dataframe to longer format and join our covariates by MouseID

count_tpose <- counts  %>%
                rownames_to_column(.,var="gene_id") %>% 
                filter(gene_id %in% c("ENSG00000080815","ENSMUSG00000019969","ENSG00000142192","ENSMUSG00000022892")) %>% 
                pivot_longer(.,cols = -"gene_id",names_to = "specimenID",values_to="counts") %>% as.data.frame() %>%
                left_join(covars ,by="specimenID") %>% as.data.frame()
head(count_tpose) 
          gene_id specimenID counts individualID    sex         genotype
1 ENSG00000080815    32043rh  22554        32043 female    5XFAD_carrier
2 ENSG00000080815    32044rh      0        32044   male 5XFAD_noncarrier
3 ENSG00000080815    32046rh      0        32046   male 5XFAD_noncarrier
4 ENSG00000080815    32047rh      0        32047   male 5XFAD_noncarrier
5 ENSG00000080815    32048rh  16700        32048 female    5XFAD_carrier
6 ENSG00000080815    32049rh      0        32049 female 5XFAD_noncarrier
  timepoint
1     12 mo
2     12 mo
3     12 mo
4     12 mo
5     12 mo
6     12 mo
#make the age column a factor and re-order the levels
count_tpose$timepoint <- factor(count_tpose$timepoint,levels=c("4 mo","6 mo","12 mo"))

# rename the gene id to gene symbol
count_tpose$gene_id[count_tpose$gene_id %in% "ENSG00000142192"] <- "Human APP"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSG00000080815"] <- "Human PSEN1"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSMUSG00000022892"] <- "Mouse App"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSMUSG00000019969"] <- "Mouse Psen1"
#Create simple box plots showing normalized counts by genotype and time point, faceted by sex.
count_tpose %>% 
  ggplot(aes(x=timepoint, y=counts, color=genotype)) +
  geom_boxplot() + 
  geom_point(position=position_jitterdodge()) +
  facet_wrap(~sex+gene_id) +theme_bw()
plot of chunk counts_boxplot

plot of chunk counts_boxplot

You will notice expression of Human APP is higher in 5XFAD carriers but lower in non-carriers. However mouse App expressed in both 5XFAD carrier and non-carrier.

We are going to sum the counts from both ortholgous genes (human APP and mouse App; human PSEN1 and mouse Psen1) and save summed expression as expression of mouse genes, respectively to match with gene names in control mice.

#merge mouse and human APP gene raw count
counts[rownames(counts) %in% "ENSMUSG00000022892",] <- counts[rownames(counts) %in% "ENSMUSG00000022892",] + counts[rownames(counts) %in% "ENSG00000142192",]
counts <- counts[!rownames(counts) %in% c("ENSG00000142192"),]

#merge mouse and human PS1 gene raw count
counts[rownames(counts) %in% "ENSMUSG00000019969",] <- counts[rownames(counts) %in% "ENSMUSG00000019969",] + counts[rownames(counts) %in% "ENSG00000080815",]
counts <- counts[!rownames(counts) %in% c("ENSG00000080815"),]

Let’s verify if expression of both human genes have been merged or not:

counts[,1:6] %>% 
  filter(!str_detect(rownames(.), "MUS"))
[1] 32043rh 32044rh 32046rh 32047rh 32048rh 32049rh
<0 rows> (or 0-length row.names)

What proportion of genes have zero counts in all samples?

gene_sums <- data.frame(gene_id = rownames(counts),
                        sums    = Matrix::rowSums(counts))
sum(gene_sums$sums == 0)
[1] 9691

We can see that 9691 (17%) genes have no reads at all associated with them. In the next lesson, we will remove genes that have no counts in any samples.

Differential Analysis using DESeq2

Now, after exploring and formatting the data, We will look for differential expression between the control and 5xFAD mice at different ages for both sexes. The differentially expressed genes (DEGs) can inform our understanding of how the 5XFAD mutation affect the biological processes.

DESeq2 analysis consist of multiple steps. We are going to briefly understand some of the important steps using subset of data and then we will perform differential analysis on whole dataset. For detailed analysis see the DESeq2 tutorial.

First, order the data (so counts and metadata match) and save in another variable

rawdata <- counts[,sort(colnames(counts))]
metadata <- covars[sort(rownames(covars)),]

subset the counts matrix and sample metadata to include only 12 month old male mice. You can amend the code to compare wild type and 5XFAD mice from either sex, at any time point.

meta.12M.Male <- metadata[(metadata$sex=="male" & metadata$timepoint=='12 mo'),]

meta.12M.Male
        individualID specimenID  sex         genotype timepoint
32044rh        32044    32044rh male 5XFAD_noncarrier     12 mo
32046rh        32046    32046rh male 5XFAD_noncarrier     12 mo
32047rh        32047    32047rh male 5XFAD_noncarrier     12 mo
32053rh        32053    32053rh male    5XFAD_carrier     12 mo
32059rh        32059    32059rh male    5XFAD_carrier     12 mo
32061rh        32061    32061rh male 5XFAD_noncarrier     12 mo
32062rh        32062    32062rh male    5XFAD_carrier     12 mo
32073rh        32073    32073rh male 5XFAD_noncarrier     12 mo
32074rh        32074    32074rh male 5XFAD_noncarrier     12 mo
32075rh        32075    32075rh male    5XFAD_carrier     12 mo
32088rh        32088    32088rh male    5XFAD_carrier     12 mo
32640rh        32640    32640rh male    5XFAD_carrier     12 mo
dat <- as.matrix(rawdata[,colnames(rawdata) %in% rownames(meta.12M.Male)])
colnames(dat)
 [1] "32044rh" "32046rh" "32047rh" "32053rh" "32059rh" "32061rh" "32062rh"
 [8] "32073rh" "32074rh" "32075rh" "32088rh" "32640rh"
rownames(meta.12M.Male)
 [1] "32044rh" "32046rh" "32047rh" "32053rh" "32059rh" "32061rh" "32062rh"
 [8] "32073rh" "32074rh" "32075rh" "32088rh" "32640rh"
match(colnames(dat),rownames(meta.12M.Male))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Next, we build the DESeqDataSet using the following function:

ddsHTSeq <- DESeqDataSetFromMatrix(countData=dat, 
                                   colData=meta.12M.Male, 
                                   design = ~genotype)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
ddsHTSeq
class: DESeqDataSet 
dim: 55487 12 
metadata(1): version
assays(1): counts
rownames(55487): ENSMUSG00000000001 ENSMUSG00000000003 ...
  ENSMUSG00000118487 ENSMUSG00000118488
rowData names(0):
colnames(12): 32044rh 32046rh ... 32088rh 32640rh
colData names(5): individualID specimenID sex genotype timepoint

Pre-filtering

While it is not necessary to pre-filter low count genes before running the DESeq2 functions, there are two reasons which make pre-filtering useful: by removing rows in which there are very few reads, we reduce the memory size of the dds data object, and we increase the speed of the transformation and testing functions within DESeq2. It can also improve visualizations, as features with no information for differential expression are not plotted.

Here we perform a minimal pre-filtering to keep only rows that have at least 10 reads total.

ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) >= 10,]

ddsHTSeq
class: DESeqDataSet 
dim: 33059 12 
metadata(1): version
assays(1): counts
rownames(33059): ENSMUSG00000000001 ENSMUSG00000000028 ...
  ENSMUSG00000118486 ENSMUSG00000118487
rowData names(0):
colnames(12): 32044rh 32046rh ... 32088rh 32640rh
colData names(5): individualID specimenID sex genotype timepoint

Reference level

By default, R will choose a reference level for factors based on alphabetical order. Then, if you never tell the DESeq2 functions which level you want to compare against (e.g. which level represents the control group), the comparisons will be based on the alphabetical order of the levels.

specifying the reference-level to 5XFAD_noncarrier:

ddsHTSeq$genotype <- relevel(ddsHTSeq$genotype,ref="5XFAD_noncarrier")  

Run the standard differential expression analysis steps that is wrapped into a single function, DESeq.

dds <- DESeq(ddsHTSeq,parallel = TRUE)

Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. By default the argument alpha is set to 0.1. If the adjusted p value cutoff will be a value other than 0.1, alpha should be set to that value:

res <- results(dds,alpha=0.05)  # setting 0.05 as significant threshold
res
log2 fold change (MLE): genotype 5XFAD carrier vs 5XFAD noncarrier 
Wald test p-value: genotype 5XFAD carrier vs 5XFAD noncarrier 
DataFrame with 33059 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat    pvalue
                   <numeric>      <numeric> <numeric> <numeric> <numeric>
ENSMUSG00000000001 3737.9009      0.0148125 0.0466948  0.317219 0.7510777
ENSMUSG00000000028  138.5635     -0.0712500 0.1550130 -0.459639 0.6457754
ENSMUSG00000000031   29.2983      0.6705922 0.3563441  1.881867 0.0598541
ENSMUSG00000000037  123.6482     -0.2184054 0.1554362 -1.405113 0.1599876
ENSMUSG00000000049   15.1733      0.3657555 0.3924376  0.932009 0.3513317
...                      ...            ...       ...       ...       ...
ENSMUSG00000118473   1.18647      -0.377971  1.531586 -0.246784  0.805075
ENSMUSG00000118477  59.10359      -0.144081  0.226690 -0.635586  0.525046
ENSMUSG00000118479  24.64566      -0.181992  0.341445 -0.533006  0.594029
ENSMUSG00000118486   1.92048       0.199838  1.253875  0.159376  0.873372
ENSMUSG00000118487  65.78311      -0.191362  0.218593 -0.875427  0.381342
                        padj
                   <numeric>
ENSMUSG00000000001  0.943421
ENSMUSG00000000028  0.913991
ENSMUSG00000000031  0.352346
ENSMUSG00000000037  0.566360
ENSMUSG00000000049  0.765640
...                      ...
ENSMUSG00000118473        NA
ENSMUSG00000118477  0.863565
ENSMUSG00000118479  0.893356
ENSMUSG00000118486        NA
ENSMUSG00000118487  0.785845

We can order our results table by the smallest p value:

resOrdered <- res[order(res$pvalue),]

head(resOrdered,n=10)
log2 fold change (MLE): genotype 5XFAD carrier vs 5XFAD noncarrier 
Wald test p-value: genotype 5XFAD carrier vs 5XFAD noncarrier 
DataFrame with 10 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat       pvalue
                    <numeric>      <numeric> <numeric> <numeric>    <numeric>
ENSMUSG00000019969  13860.942        1.90740 0.0432685   44.0828  0.00000e+00
ENSMUSG00000030579   2367.096        2.61215 0.0749325   34.8600 2.99877e-266
ENSMUSG00000046805   7073.296        2.12247 0.0635035   33.4229 6.38260e-245
ENSMUSG00000032011  80423.476        1.36195 0.0424006   32.1210 2.24185e-226
ENSMUSG00000022892 271265.838        1.36140 0.0434167   31.3567 7.88666e-216
ENSMUSG00000038642  10323.969        1.69717 0.0549488   30.8864 1.81838e-209
ENSMUSG00000023992   2333.227        2.62290 0.0882818   29.7105 5.61657e-194
ENSMUSG00000079293    761.313        5.12514 0.1738382   29.4822 4.86644e-191
ENSMUSG00000040552    617.149        2.22726 0.0781799   28.4889 1.60622e-178
ENSMUSG00000069516   2604.926        2.34471 0.0847390   27.6697 1.61585e-168
                           padj
                      <numeric>
ENSMUSG00000019969  0.00000e+00
ENSMUSG00000030579 3.60828e-262
ENSMUSG00000046805 5.11991e-241
ENSMUSG00000032011 1.34876e-222
ENSMUSG00000022892 3.79585e-212
ENSMUSG00000038642 7.29321e-206
ENSMUSG00000023992 1.93090e-190
ENSMUSG00000079293 1.46389e-187
ENSMUSG00000040552 4.29486e-175
ENSMUSG00000069516 3.88853e-165

we can summarize some basic tallies using the summary function.

summary(res)

out of 33059 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 1098, 3.3%
LFC < 0 (down)     : 505, 1.5%
outliers [1]       : 33, 0.1%
low counts [2]     : 8961, 27%
(mean count < 8)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

How many adjusted p-values were less than 0.05?

sum(res$padj < 0.05, na.rm=TRUE)
[1] 1603

Challenge 2

How many adjusted p-values were less than 0.1?

Solution to Challenge 2

sum(res$padj < 0.1, na.rm=TRUE) 

Function to convert ensembleIDs to common gene names

We’ll use a package to translate mouse ENSEMBL IDS to gene names. Run this function and they will be called up when assembling results from the differential expression analysis

map_function.df <- function(x, inputtype, outputtype) {
  mapIds(
    org.Mm.eg.db,
    keys = row.names(x),
    column = outputtype,
    keytype = inputtype,
    multiVals = "first"
  )
}

Generating Result table

All_res <- as.data.frame(res) %>% 
  mutate(symbol = map_function.df(res,"ENSEMBL","SYMBOL")) %>%    ##run map_function to add symbol of gene corresponding to ENSEBL ID
  mutate(EntrezGene = map_function.df(res,"ENSEMBL","ENTREZID")) %>%  ##run map_function to add Entrez ID of gene corresponding to ENSEBL ID
  dplyr::select("symbol", "EntrezGene","baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")

head(All_res)
                   symbol EntrezGene   baseMean log2FoldChange      lfcSE
ENSMUSG00000000001  Gnai3      14679 3737.90089     0.01481247 0.04669482
ENSMUSG00000000028  Cdc45      12544  138.56354    -0.07125004 0.15501305
ENSMUSG00000000031    H19      14955   29.29832     0.67059217 0.35634414
ENSMUSG00000000037  Scml2     107815  123.64823    -0.21840544 0.15543617
ENSMUSG00000000049   Apoh      11818   15.17325     0.36575554 0.39243763
ENSMUSG00000000056   Narf      67608 5017.30216    -0.06713961 0.04466809
                         stat     pvalue      padj
ENSMUSG00000000001  0.3172187 0.75107767 0.9434210
ENSMUSG00000000028 -0.4596390 0.64577536 0.9139907
ENSMUSG00000000031  1.8818667 0.05985412 0.3523457
ENSMUSG00000000037 -1.4051134 0.15998757 0.5663600
ENSMUSG00000000049  0.9320093 0.35133170 0.7656399
ENSMUSG00000000056 -1.5030778 0.13281898 0.5203330

Extracting genes that are significantly expressed

dseq_res <- subset(All_res[order(All_res$padj), ], padj < 0.05)

Wow! We have a lot of genes with apparently very strong statistically significant differences between the control and 5xFAD carrier.

dim(dseq_res)
[1] 1603    8
head(dseq_res)
                   symbol EntrezGene   baseMean log2FoldChange      lfcSE
ENSMUSG00000019969  Psen1      19164  13860.942       1.907397 0.04326854
ENSMUSG00000030579 Tyrobp      22177   2367.096       2.612152 0.07493255
ENSMUSG00000046805  Mpeg1      17476   7073.296       2.122467 0.06350346
ENSMUSG00000032011   Thy1      21838  80423.476       1.361953 0.04240065
ENSMUSG00000022892    App      11820 271265.838       1.361405 0.04341673
ENSMUSG00000038642   Ctss      13040  10323.969       1.697172 0.05494882
                       stat        pvalue          padj
ENSMUSG00000019969 44.08278  0.000000e+00  0.000000e+00
ENSMUSG00000030579 34.86005 2.998775e-266 3.608276e-262
ENSMUSG00000046805 33.42286 6.382596e-245 5.119906e-241
ENSMUSG00000032011 32.12104 2.241854e-226 1.348756e-222
ENSMUSG00000022892 31.35669 7.886662e-216 3.795850e-212
ENSMUSG00000038642 30.88641 1.818378e-209 7.293211e-206

Exploring and exporting results

Exporting results to CSV files

we can save results file into a csv file like this:

write.csv(All_res,file="../results/All_5xFAD_12months_male.csv")
write.csv(dseq_res,file="../results/DEG_5xFAD_12months_male.csv")

Volcano plot

We can visualize the differential expression results using Volcano plot function from EnhancedVolcano package. For the most basic volcano plot, only a single data-frame, data-matrix, or tibble of test results is required, containing point labels, log2FC, and adjusted or unadjusted P values. The default cut-off for log2FC is >|2|; the default cut-off for P value is 10e-6.

EnhancedVolcano(All_res,
                                   lab = (All_res$symbol),
                                   x = 'log2FoldChange',
                                   y = 'padj',legendPosition = 'none',
                                   title = 'Volcano plot:Differential Expression Results',
                                   subtitle = '',
                                   FCcutoff = 0.1,
                                   pCutoff = 0.05,
                                   xlim = c(-3, 6))
Warning: One or more p-values is 0. Converting to 10^-1 * current lowest
non-zero p-value...
plot of chunk volcanoplot

plot of chunk volcanoplot

You can see that some top significantly expressed are immune/inflammation-related genes such as Ctsd, C4b, Csf1 etc. These genes are upregulated in the 5XFAD strain.

Principal component plot of the samples

ddsHTSeq <- DESeqDataSetFromMatrix(countData=as.matrix(rawdata), colData=metadata, design= ~ genotype)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors
ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)>1) >= 10, ]
dds <- DESeq(ddsHTSeq,parallel = TRUE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 8 workers
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 8 workers
-- replacing outliers and refitting for 42 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
vsd <- varianceStabilizingTransformation(dds, blind=FALSE)

plotPCA(vsd, intgroup=c("genotype", "sex","timepoint"))
plot of chunk PCA

plot of chunk PCA

It is also possible to customize the PCA plot using the ggplot function.

pcaData <- plotPCA(vsd, intgroup=c("genotype", "sex","timepoint"), returnData=TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2,color=genotype, shape=sex)) + 
  geom_point(size=3) +
  geom_text(aes(label=timepoint),hjust=0.5, vjust=2,size =3.5) +
  labs(x= paste0("PC1: ",percentVar[1],"% variance"), y= paste0("PC2: ",percentVar[2],"% variance"))
plot of chunk PCA2

plot of chunk PCA2

PCA identified genotype and sex being a major source of variation in between 5XFAD and WT mice. Female and male samples clustered distinctly at all ages, suggesting the presence of sex-biased molecular changes in animals.

Function for Differential analysis using DESeq2

Finally, we can built a function for differential analysis that consist of all above discussed steps. It will require to input sorted raw count matrix, sample metadata and define the reference group.

DEG <- function(rawdata,meta,
                   include.batch = FALSE,
                   ref = ref) {
  dseq_res <- data.frame()
  All_res <- data.frame()
  
  if (include.batch) {
    cat("Including batch as covariate\n")
    design_formula <- ~ Batch + genotype
  }
  else{
    design_formula <- ~ genotype
  }
  
  dat2 <- as.matrix(rawdata[, colnames(rawdata) %in% rownames(meta)])
  ddsHTSeq <-
    DESeqDataSetFromMatrix(countData = dat2,
                           colData = meta,
                           design = design_formula)
  ddsHTSeq <- ddsHTSeq[rowSums(counts(ddsHTSeq)) >= 10, ]
  ddsHTSeq$genotype <- relevel(ddsHTSeq$genotype, ref = ref)
  dds <- DESeq(ddsHTSeq, parallel = TRUE)
  
  res <- results(dds, alpha = 0.05)
  #summary(res)
  
  res$symbol <- map_function.df(res,"ENSEMBL","SYMBOL")
  
  res$EntrezGene <- map_function.df(res,"ENSEMBL","ENTREZID")
  
  All_res <<- as.data.frame(res[, c("symbol", "EntrezGene","baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj")])

}

Let’s use this function to analyze all groups present in our data.

Differential Analysis of all groups

First, we add a Group column to our metadata table that will combine all variable of interest for that sample.

metadata$Group <- paste0(metadata$genotype,"-",metadata$sex,"-",metadata$timepoint)

unique(metadata$Group)
 [1] "5XFAD_carrier-female-12 mo"    "5XFAD_noncarrier-male-12 mo"  
 [3] "5XFAD_noncarrier-female-12 mo" "5XFAD_carrier-male-12 mo"     
 [5] "5XFAD_noncarrier-female-6 mo"  "5XFAD_noncarrier-male-6 mo"   
 [7] "5XFAD_carrier-female-6 mo"     "5XFAD_noncarrier-female-4 mo" 
 [9] "5XFAD_carrier-female-4 mo"     "5XFAD_carrier-male-6 mo"      
[11] "5XFAD_carrier-male-4 mo"       "5XFAD_noncarrier-male-4 mo"   

Next, we create a comparison table that has all cases and controls that we would like to compare with each other. Here I have made comparison group for age and sex-matched 5xFAD carriers vs 5xFAD_noncarriers:

comparisons <-  data.frame(control=c("5XFAD_noncarrier-male-4 mo", "5XFAD_noncarrier-female-4 mo", "5XFAD_noncarrier-male-6 mo", 
                                     "5XFAD_noncarrier-female-6 mo","5XFAD_noncarrier-male-12 mo", "5XFAD_noncarrier-female-12 mo"), 
                           case=c("5XFAD_carrier-male-4 mo", "5XFAD_carrier-female-4 mo", "5XFAD_carrier-male-6 mo", 
                                  "5XFAD_carrier-female-6 mo","5XFAD_carrier-male-12 mo", "5XFAD_carrier-female-12 mo")
                           )
comparisons
                        control                       case
1    5XFAD_noncarrier-male-4 mo    5XFAD_carrier-male-4 mo
2  5XFAD_noncarrier-female-4 mo  5XFAD_carrier-female-4 mo
3    5XFAD_noncarrier-male-6 mo    5XFAD_carrier-male-6 mo
4  5XFAD_noncarrier-female-6 mo  5XFAD_carrier-female-6 mo
5   5XFAD_noncarrier-male-12 mo   5XFAD_carrier-male-12 mo
6 5XFAD_noncarrier-female-12 mo 5XFAD_carrier-female-12 mo

Finally, we implement our DEG function on each comparison and store the result table in a list and data frame:

# initiate an empty list and data frame to save results
DE_5xFAD.list <- list()
DE_5xFAD.df <- data.frame()

for (i in 1:nrow(comparisons))
  {
    meta <- metadata[metadata$Group %in% comparisons[i,],]
    DEG(rawdata,meta,ref = "5XFAD_noncarrier")
    
    #append results in data frame
    DE_5xFAD.df <- rbind(DE_5xFAD.df,All_res %>% mutate(model="5xFAD",sex=unique(meta$sex),age=unique(meta$timepoint)))
    
    #append results in list
    DE_5xFAD.list[[i]] <- All_res
    names(DE_5xFAD.list)[i] <- paste0(comparisons[i,2])
}

Let’s explore the result stored in our list:

names(DE_5xFAD.list)
[1] "5XFAD_carrier-male-4 mo"    "5XFAD_carrier-female-4 mo" 
[3] "5XFAD_carrier-male-6 mo"    "5XFAD_carrier-female-6 mo" 
[5] "5XFAD_carrier-male-12 mo"   "5XFAD_carrier-female-12 mo"

We can easily extract result table for any group of interest by using $ and name of group. Let’s check top few rows from 5XFAD_carrier-male-4 mo group:

head(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`)
                   symbol EntrezGene   baseMean log2FoldChange     lfcSE
ENSMUSG00000000001  Gnai3      14679 3707.53159   -0.023085865 0.0381646
ENSMUSG00000000028  Cdc45      12544  159.76225   -0.009444949 0.1322609
ENSMUSG00000000031    H19      14955   35.96987    0.453401555 0.2785245
ENSMUSG00000000037  Scml2     107815  126.82414    0.089394561 0.1377406
ENSMUSG00000000049   Apoh      11818   19.99721    0.115325837 0.3154875
ENSMUSG00000000056   Narf      67608 5344.21741   -0.100413295 0.0381182
                          stat      pvalue      padj
ENSMUSG00000000001 -0.60490264 0.545243691 0.9999518
ENSMUSG00000000028 -0.07141148 0.943070275 0.9999518
ENSMUSG00000000031  1.62786960 0.103552539 0.9999518
ENSMUSG00000000037  0.64900659 0.516334116 0.9999518
ENSMUSG00000000049  0.36554806 0.714702337 0.9999518
ENSMUSG00000000056 -2.63426085 0.008432068 0.5696800

Let’s check the result stored as dataframe:

head(DE_5xFAD.df)
                   symbol EntrezGene   baseMean log2FoldChange     lfcSE
ENSMUSG00000000001  Gnai3      14679 3707.53159   -0.023085865 0.0381646
ENSMUSG00000000028  Cdc45      12544  159.76225   -0.009444949 0.1322609
ENSMUSG00000000031    H19      14955   35.96987    0.453401555 0.2785245
ENSMUSG00000000037  Scml2     107815  126.82414    0.089394561 0.1377406
ENSMUSG00000000049   Apoh      11818   19.99721    0.115325837 0.3154875
ENSMUSG00000000056   Narf      67608 5344.21741   -0.100413295 0.0381182
                          stat      pvalue      padj model  sex  age
ENSMUSG00000000001 -0.60490264 0.545243691 0.9999518 5xFAD male 4 mo
ENSMUSG00000000028 -0.07141148 0.943070275 0.9999518 5xFAD male 4 mo
ENSMUSG00000000031  1.62786960 0.103552539 0.9999518 5xFAD male 4 mo
ENSMUSG00000000037  0.64900659 0.516334116 0.9999518 5xFAD male 4 mo
ENSMUSG00000000049  0.36554806 0.714702337 0.9999518 5xFAD male 4 mo
ENSMUSG00000000056 -2.63426085 0.008432068 0.5696800 5xFAD male 4 mo

Check if result is present for all ages:

unique((DE_5xFAD.df$age))
[1] "4 mo"  "6 mo"  "12 mo"

Check if result is present for both sexes:

unique((DE_5xFAD.df$sex))
[1] "male"   "female"

Check number of genes in each group:

count(DE_5xFAD.df,model,sex,age)
  model    sex   age     n
1 5xFAD female 12 mo 33120
2 5xFAD female  4 mo 32930
3 5xFAD female  6 mo 33249
4 5xFAD   male 12 mo 33059
5 5xFAD   male  4 mo 33119
6 5xFAD   male  6 mo 33375

Check number of genes significantly differentially expressed in all cases compared to age and sex-matched controls:

degs.up <- map(DE_5xFAD.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange>0)))
degs.down <- map(DE_5xFAD.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange<0)))
deg <- data.frame(Cases=names(degs.up), Up_DEGs.pval.05=as.vector(unlist(degs.up)),Down_DEGs.pval.05=as.vector(unlist(degs.down)))
knitr::kable(deg)
Cases Up_DEGs.pval.05 Down_DEGs.pval.05
5XFAD_carrier-male-4 mo 86 11
5XFAD_carrier-female-4 mo 522 90
5XFAD_carrier-male-6 mo 714 488
5XFAD_carrier-female-6 mo 1081 409
5XFAD_carrier-male-12 mo 1098 505
5XFAD_carrier-female-12 mo 1494 1023

Interestingly, in females more genes are differentially expressed at 4 months and with age more genes are differentially expressed. Male mice is catching up female mice at later time-point.

Pathway Enrichment

We may wish to look for enrichment of biological pathways in a list of differentially expressed genes. Here we will test for enrichment of KEGG pathways using using enrichKEGG function in clusterProfiler package.

dat <- list(FAD_M_4=subset(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`[order(DE_5xFAD.list$`5XFAD_carrier-male-4 mo`$padj), ], padj < 0.05) %>% pull(EntrezGene),
            FAD_F_4=subset(DE_5xFAD.list$`5XFAD_carrier-female-4 mo`[order(DE_5xFAD.list$`5XFAD_carrier-female-4 mo`$padj), ], padj < 0.05) %>% pull(EntrezGene))

## perform enrichment analysis
enrich_pathway <- compareCluster(dat,
                                 fun = "enrichKEGG",
                                  pvalueCutoff = 0.05,
                                  organism = "mmu"
                                 )

enrich_pathway@compareClusterResult$Description <- gsub(" - Mus musculus \\(house mouse)","",enrich_pathway@compareClusterResult$Description)

Let’s plot top enriched functions using dotplot function of clusterProfiler package.

clusterProfiler::dotplot(enrich_pathway,showCategory=10,font.size = 14,title="Enriched KEGG Pathways")
plot of chunk pathways

plot of chunk pathways

What does this plot infer?

Save Data for Next Lesson

We will use the results data in the next lesson. Save it now and we will load it at the beginning of the next lesson. We will use R’s save command to save the objects in compressed, binary format. The save command is useful when you want to save multiple objects in one file.

save(DE_5xFAD.df,DE_5xFAD.list,file="../results/DEAnalysis_5XFAD.Rdata")

Challenge 3

Draw volcano plot for 6 months old female 5xFAD_carrier ?

Solution to Challenge 3

EnhancedVolcano(DE_5xFAD.list$`5XFAD_carrier-female-6 mo,
                                  lab = (DE_5xFAD.list$`5XFAD_carrier-female-6 mo`$symbol),
                                  x = 'log2FoldChange',
                                  y = 'padj',legendPosition = 'none',
                                  title = 'Volcano plot:Differential Expression Results',
                                  subtitle = '',
                                  FCcutoff = 0.1,
                                  pCutoff = 0.05,
                                  xlim = c(-5, 5))

Loading RNA-Seq data from LOAD1 (APOE4*Trem2 Mouse Model)

There is another RNA-Seq data from LOAD1 mouse cohort. This cohort has three distinct mice strains: *APOE4: humanized Apoe knock-in strain, mouse Apoe allele was replaced by human APOE gene sequence; *Trem2.R47H:knock out mutant of the Trem2, R47H point mutation with two silent mutation and *APOE4.Trem2.R47H (LOAD1): double mutant strain carries humanized APOE (isoform 4) knock in mutation and R47H point mutation of the Trem2 gene.

Let’s read the count data and sample metadata as we did for 5XFAD samples.

counts <- read.delim("../data/htseqcounts_APTR.txt", check.names = FALSE)
ind_meta <- read.csv("../data/metadata_RNASEQ_ALLJAX_402Samples.csv") 

Let’s look at the first few rows.

head(ind_meta)
  Mouse.ID Sex Genotype Age Batch
1      123   M C57BL/6J   4     1
2      181   F C57BL/6J   4     1
3      191   M C57BL/6J   4     1
4      193   M C57BL/6J   4     1
5      221   F C57BL/6J   4     1
6      225   F C57BL/6J   4     1

Let’s quickly explore the sample metadata:

dim(ind_meta)
[1] 402   5

How many strains are in this sample metadata?

table(ind_meta$Genotype)

     5XFAD      Apoe4      APOE4 APOE4Trem2     ApoeKO        APP       Bin1 
        36          6         57         59          6         26          6 
  C57BL/6J      Cd2ap        Clu      Trem2 
       137          6          6         57 

How many samples are in each genotype for each sex and age group?

ind_meta %>% group_by(Sex,Genotype,Age) %>% count()
# A tibble: 51 × 4
# Groups:   Sex, Genotype, Age [51]
   Sex   Genotype     Age     n
   <chr> <chr>      <int> <int>
 1 F     5XFAD          4     6
 2 F     5XFAD          6     6
 3 F     5XFAD         12     6
 4 F     APOE4          4    12
 5 F     APOE4          8     6
 6 F     APOE4         12     5
 7 F     APOE4         24     6
 8 F     APOE4Trem2     4    12
 9 F     APOE4Trem2     8     6
10 F     APOE4Trem2    12     5
# ℹ 41 more rows

We notice that there is a column Batch in sample metadata.

Let’s see how many samples are in each batch.

table(ind_meta$Batch)

  1   2   3   4   5   6 
140  94  72  36  48  12 

So, total 5 batches. In bulk RNA-Seq experiments, it is usually vital that we apply a correction for samples profiled in different batches. Due to tim-constraint we do not want to do it in this lesson. So, we will use samples from only one batch in this lesson.

Let’s check which genotype are in which batch?

table(ind_meta$Batch,ind_meta$Genotype)
   
    5XFAD Apoe4 APOE4 APOE4Trem2 ApoeKO APP Bin1 C57BL/6J Cd2ap Clu Trem2
  1     0     0    33         35      0   0    0       36     0   0    36
  2     0     0    24         24      0   0    0       25     0   0    21
  3    36     0     0          0      0   0    0       36     0   0     0
  4     0     6     0          0      6   0    6        6     6   6     0
  5     0     0     0          0      0  16    0       32     0   0     0
  6     0     0     0          0      0  10    0        2     0   0     0

We are going to use samples from Batch 1 as it has more samples and contain most of our LOAD1 cohort samples that we are interested.

Subset sample metadata for batch 1 only and Convert Mouse ID column to rownames.

covar <- ind_meta %>% filter(Batch==1) %>% remove_rownames() %>% column_to_rownames(.,var="Mouse.ID")  

How many rows and columns are there in covar?

dim(covar)
[1] 140   4

Let’s look at the first few rows.

head(covar)
    Sex Genotype Age Batch
123   M C57BL/6J   4     1
181   F C57BL/6J   4     1
191   M C57BL/6J   4     1
193   M C57BL/6J   4     1
221   F C57BL/6J   4     1
225   F C57BL/6J   4     1

Check mouse strains and are there numbers in covar?

table(covar$Genotype)

     APOE4 APOE4Trem2   C57BL/6J      Trem2 
        33         35         36         36 

Challenge 4

How many samples are in each genotype for each sex and age group?

Solution to Challenge 4

covar %>% group_by(Sex,Genotype,Age) %>% count()

Let’s explore count data.

head(counts,n=5)
             gene_ID  116  123   12  181 18421 18422 18424 18425 18427 18428
1    ENSG00000130203   54   37  112   90    16    35    29    36     6     6
2 ENSMUSG00000000001 2347 2648 2302 2585  3291  2256  2791  3029  1880  2298
3 ENSMUSG00000000003    0    0    0    0     0     0     0     0     0     0
4 ENSMUSG00000000028   89  114   60  113   109    72    76    86    73    93
5 ENSMUSG00000000031   47   20    7   18    20    17    12    16     6    12
  18429 18430 18431 18432 18465  18469 18472  191  193 19682 19696 19700 20246
1     6     8    21    10 75147 106325    14   47   82 21898 46442 45987     0
2  2677  1884  2023  1745  3554   4040  4438 2494 2996  2545  2736  2851  4255
3     0     0     0     0     0      0     0    0    0     0     0     0     0
4    73    88    56    67   123    181   106  107  102    76    96   117   186
5    12     4    19    13    63     18    36   15   67    14    24    41    33
  20248 20322 20357 20440 20443 20465 20466 20467 20495  20801  20804  20817
1     0 34116 46491 75376 77544    52    52    30 25511 117185 108849 108754
2  3247  2185  3238  3482  3903  2282  3075  2597  2124   4801   3994   4299
3     0     0     0     0     0     0     0     0     0      0      0      0
4   121    72    84   106   194   111    88    95    60    197    167    163
5    32    22    14    51    30    29    34    11    22     44     32     36
  20822 20837 20875 20877 20879 20882 20887 20892 20893 20894 20895 20898 20899
1 70324 78966 78796 74305 80772 94969     4    48    62    32    39    76    12
2  3236  4267  3871  2534  3601  3543  2405  2282  2279  2600  2553  2971  1984
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4   201   131   162   118   156    96    64    88   124   122    94    76    57
5    19    15    43    16    13    43    17    17     6     8    13    44    11
  22001 22005 22078 22079 22083 22085 22087 22088 22090 22094 22095 22096 22097
1 51479 40252 52870 25029 31474 37812 36563 26525 48147 31663 26470 45566 55277
2  3161  2867  3400  2356  3121  3433  2569  1983  2933  2303  2190  2876  3571
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4    89   127   112    89   116   102   127    86   108    84   105    67   124
5    16    10    16    12    24    21     9     9     8    14    20    65    30
  22098 22144 22149 22168 22171 22172 22174  221 22514 22523  225 22705 22711
1 34337 36095 42308 34796 41583 30591 47146   41 27832 43621   64 30363 26211
2  2205  2732  2336  2094  2309  2214  2863 2605  2384  2536 3133  2091  1859
3     0     0     0     0     0     0     0    0     0     0    0     0     0
4    94   110    70    56   127   104   121  103    89   114  117   108    48
5    10    16     4     9    20    16    39   16    38    36   17     9    12
  22712 22723 22746 22754 22755 22770 22869 23110 23156 23161 23168 23170 23184
1 30339 53178 37198 34847 48921 30110 43811 83574 74727 62068 85174 63851 73676
2  2487  2574  2763  2607  2740  3087  3298  4001  3578  3430  3365  2862  3198
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4    91   111    95   123    73   105   130   135   134   187   149   126   142
5    24    39    25     3    29    14    45    82    14    12    29     8    33
  23409 23410 23412 23414 23415 23450 23451 23454  239  242  251  252   25
1 75553 59847 71130 67167 91016     5    11    14  234   35   43  133   47
2  3537  3043  3514  3189  3944  2472  2226  2884 4137 2394 2945 2800 1989
3     0     0     0     0     0     0     0     0    0    0    0    0    0
4   115   134   131   102   194    61    94   137  125  156   80   87   66
5    36    30    34    35    44    20    22    26    7   21   35  100    5
  26305 26313 26318 26330 26332 26334 26339 26346 26347 26617 26623 26624 27104
1     0     0     0     1     7     0     0     0     6 36805 38549 42588    39
2  3491  4493  3175  3577  3536  4017  3868  3685  2881  2483  2513  2992  2459
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4   132   213   136   149   141   162   121   129   110    69    90   110    88
5    34    69    35    29    21    28    54    43     9    32    13    45    32
  27168 27192 27194 27208 27209  272 27354 27355 27356 27357 27385 27386 27392
1    47 33724 40868    55    26   74    85    79    49    61    53   124    43
2  2358  2153  3043  2383  2339 2933  3605  2466  2813  3246  2996  2508  2517
3     0     0     0     0     0    0     0     0     0     0     0     0     0
4    87    98   112   107    55  100   124    94    68   150   123    88    97
5    10    12    30    26    20   13   233    29    24    35    20    32    18
  27394 27395 27396 27397 27398 27399 27488 27495 27499 27504 27507 27509 27510
1    77    69    37    40    18    52 42890 53940 40501 21691 27115 32449 18986
2  3436  3214  3143  1757  2670  2282  2992  3289  2878  2306  2665  2912  2181
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4   146    87   101    52    94   103    92   121    67    79    78   109    56
5    28    13    31     3    10    27    24    42    28     4    14    28    11
  27585 27586 27590 27593 27594 27596 27598 27599 27600 27707 27708   27 28129
1 34478 40459 32038 20594 46469 27813 28542 25804 35189 31781 28269   32   104
2  2372  3032  2757  1967  3634  2270  2784  2398  2986  2281  2777 2324  2679
3     0     0     0     0     0     0     0     0     0     0     0    0     0
4    59    98   118    97   122    55    93    65   114   106    97   54   138
5    23    48    22     6    39     5     8     5    28    36    32   10   706
  288810708 289457705 289461928 289470196 289478142 289482201 289494346
1        35         0        55        23        36        58         0
2      2091      3096      2148      2507      2615      2514      4320
3         0         0         0         0         0         0         0
4        63       116        87       131        55        99       214
5        14        28        16        20        28         8        19
  289535121 289576914 289666353 289674340  299 30269 30270 30271   30  344  347
1        64        65        83        10   26    62    26    49   76   41    6
2      2567      3353      2985      2079 2381  3073  2374  2823 4084 3185 3295
3         0         0         0         0    0     0     0     0    0    0    0
4        95       134       109        79  126   102    89    94  128  170  123
5        20        31        30        12    8    29    10    21   49   26   19
     35  364  367  393  394  399    3  400 40407 40408 40416 40423 40425 40434
1 35251   20    7   58   19   17   27   10 63779 91374    19     1     0     0
2  2796 3799 3344 5148 3413 2649 2290 3987  3191  4044  3726  3041  2857  3724
3     0    0    0    0    0    0    0    0     0     0     0     0     0     0
4   122  173  186  205  156  103  121  179   100   186   180   161   103   172
5    12   36   32   71   24   17   11   30    17    54    54    45    37    36
  40438 40439  40571 40573 40575 40576 40579 40580 40581 40585 40587 40590
1     0     0 128540 78818 55469    59     0     1     0     0 60212 58952
2  3685  2737   6846  4165  3458  2748  4149  3596  4379  3948  3197  3512
3     0     0      0     0     0     0     0     0     0     0     0     0
4   151   104    271   197   139    90   175   156   186   123   101   114
5    53    28     74    67    20    46    54    44    56    44    50    38
  40591 40593 40595 40808 40823 40834 40837 40839 40845 40877 40880    42    44
1 78712     0 74833 69807     1 72482 84527 78285 77004 62765 94765 40950 22335
2  4050  3129  3799  3056  4161  3497  3606  3510  4246  3958  4091  2675  2471
3     0     0     0     0     0     0     0     0     0     0     0     0     0
4   169   120   139    75   175   148   138   160   146   139   201   131    86
5    68    33    35    19    45    29    39    21    21    62    37    13    34
  46611 46618 46621 46623 46630 46642  46647 46648 46649 46650 46651 46652
1 61971 57335 66163 60289     0 68907 107855     0     0     0     0     0
2  3396  2746  2999  3303  3264  3407   4788  3566  3046  3240  4054  3681
3     0     0     0     0     0     0      0     0     0     0     0     0
4   158   149   149   120   111   167    193   140   160    97   135   172
5    20    26    56    29    34    11     41    22    37     6    53    51
  46653 46654 46662 46686 46687 46688 46698    46    54    56    57    58    8
1     0     0 59014     0     9     1 58161 53318 41514 42211 37472 45864   77
2  3628  4260  2984  5382  3955  3028  2780  3244  2533  3570  2421  2677 3300
3     0     0     0     0     0     0     0     0     0     0     0     0    0
4   102   173   132   139   233   108   118   162    94   150    71    93  166
5    12    38    11    53    66    33    46    14    11    20     3    19   43

Converting the gene_id as rownames of count matrix

counts <- counts %>% column_to_rownames(.,var="gene_ID") %>% as.data.frame()

How many rows and columns are there in counts?

dim(counts)
[1] 55488   234

In the counts matrix, genes are in rows and samples are in columns.

There are 140 samples in batch 1 as we saw in covar and samples in counts matrix is 234. So, we need to subset counts matrix for samples in covar.

counts <- counts[,colnames(counts) %in% rownames(covar)]
dim(counts)
[1] 55488   140

Now there are count data from 140 samples present in metadata.

Check how many gene_ids are NOT from the mouse genome by searching for the string “MUS” (as in Mus musculus) in the rownames of count matrix

counts[,1:6] %>% 
  filter(!str_detect(rownames(.), "MUS"))
                116 123  12 181 18421 18422
ENSG00000130203  54  37 112  90    16    35

We see entry of gene id “ENSG00000130203”. This is human APOE gene that has been quantified by our MODEL-AD RNA-Seq pipeline.

Here you can also check expression of mouse Apoe.

counts[rownames(counts) %in% "ENSMUSG00000002985",1:10]
                     116   123    12   181 18421 18422 18424 18425 18427 18428
ENSMUSG00000002985 44173 51114 45209 58427 72939 47959 37884 43401 49936 54232

Let’s visualize the expression of mouse and human APOE gene across all samples using ggplot2 package.

Validation of mouse strains

#First we convert the dataframe to longer format and join our covariates by MouseID

count_tpose <- counts  %>%
                rownames_to_column(.,var="gene_id") %>% 
                filter(gene_id %in% c("ENSG00000130203","ENSMUSG00000002985")) %>% 
                pivot_longer(.,cols = -"gene_id",names_to = "Mouse.ID",values_to="counts") %>% as.data.frame() %>%
                left_join(ind_meta %>% mutate("Mouse.ID"=as.character(Mouse.ID)),by="Mouse.ID") %>% as.data.frame()
head(count_tpose) 
          gene_id Mouse.ID counts Sex Genotype Age Batch
1 ENSG00000130203      116     54   F    Trem2   8     1
2 ENSG00000130203      123     37   M C57BL/6J   4     1
3 ENSG00000130203       12    112   M C57BL/6J   8     1
4 ENSG00000130203      181     90   F C57BL/6J   4     1
5 ENSG00000130203    18421     16   F    Trem2  12     1
6 ENSG00000130203    18422     35   M    Trem2   8     1
#make the age column a factor and re-order the levels
count_tpose$Age <- factor(count_tpose$Age,levels=c(4,8,12))

# rename the gene id to gene symbol
count_tpose$gene_id[count_tpose$gene_id %in% "ENSG00000130203"] <- "Human APOE"
count_tpose$gene_id[count_tpose$gene_id %in% "ENSMUSG00000002985"] <- "Mouse Apoe"
#Create simple box plots showing normalized counts by genotype and time point, faceted by sex.
count_tpose %>% 
  ggplot(aes(x=Age, y=counts, color=Genotype)) +
  geom_boxplot() + 
  geom_point(position=position_jitterdodge()) +
  facet_wrap(~Sex+gene_id) +theme_bw()
plot of chunk boxplot_APOE4_cohort

plot of chunk boxplot_APOE4_cohort

You will notice expression of mouse Apoe is higher in samples carrying humanized APOE gene sequences but lower in other samples and vice-versa for human APOE. This also suggest that mouse Apoe gene has been correctly replaced by human APOE gene sequences in APOE4 and APOE4Trem strains. This way we can validated the insertion of human transgene.

As before, we need to merge expression of human APOE and mouse Apoe:

counts[rownames(counts) %in% "ENSMUSG00000002985",] <- counts[rownames(counts) %in% "ENSMUSG00000002985",] + counts[rownames(counts) %in% "ENSG00000130203",]
counts <- counts[!rownames(counts) %in% c("ENSG00000130203"),]

Let’s verify

counts[,1:6] %>% 
  filter(!str_detect(rownames(.), "MUS"))
[1] 116   123   12    181   18421 18422
<0 rows> (or 0-length row.names)

Formatting the count and metadata for differential analysis

rawdata <- counts[,sort(colnames(counts))]
metadata <- covar[sort(rownames(covar)),] 

## renaming column names
colnames(metadata) <- c("sex","genotype","timepoint","Batch")

metadata$sex[metadata$sex %in% "F"] <- "female"
metadata$sex[metadata$sex %in% "M"] <- "male"

Differential Analysis of all groups in LOAD1 cohort

Next, we create a comparison table that has all cases and controls that we would like to compare with each other. Here I have made comparison group for age and sex-matched distinct mouse strains vs C57BL/6J (control mice):

metadata$Group <- paste0(metadata$genotype,"-",metadata$sex,"-",metadata$timepoint)
comparisons <-  data.frame(
                      control=rep(c("C57BL/6J-male-4", "C57BL/6J-female-4", "C57BL/6J-male-8", "C57BL/6J-female-8", "C57BL/6J-male-12", "C57BL/6J-female-12"),3), 
                              case=c("APOE4-male-4", "APOE4-female-4", "APOE4-male-8", "APOE4-female-8", "APOE4-male-12", "APOE4-female-12",
                                  "Trem2-male-4", "Trem2-female-4", "Trem2-male-8", "Trem2-female-8", "Trem2-male-12", "Trem2-female-12",
                                  "APOE4Trem2-male-4", "APOE4Trem2-female-4", "APOE4Trem2-male-8", "APOE4Trem2-female-8", "APOE4Trem2-male-12", "APOE4Trem2-female-12")
                      )
head(comparisons)
             control            case
1    C57BL/6J-male-4    APOE4-male-4
2  C57BL/6J-female-4  APOE4-female-4
3    C57BL/6J-male-8    APOE4-male-8
4  C57BL/6J-female-8  APOE4-female-8
5   C57BL/6J-male-12   APOE4-male-12
6 C57BL/6J-female-12 APOE4-female-12

Finally, we implement our DEG function on each comparison and store the result table in a list and data frame:

DE_LOAD1.list <- list()
DE_LOAD1.df <- data.frame()

for (i in 1:nrow(comparisons)){
  meta <- metadata[metadata$Group %in% comparisons[i,],]
  DEG(rawdata,meta,ref = "C57BL/6J")
  
  #append results in data frame
  DE_LOAD1.df <- rbind(DE_LOAD1.df,All_res %>% mutate(model=gsub("-.*$","",comparisons[i,2])[1],sex=unique(meta$sex),age=unique(meta$timepoint)))
  
  #append results in list
  DE_LOAD1.list[[i]] <- All_res
  names(DE_LOAD1.list)[i] <- paste0(comparisons[i,2])
}

save(DE_LOAD1.df,DE_LOAD1.list,file="../results/DEAnalysis_LOAD1.Rdata")

check number of genes significantly expressed (adjusted p =0.05) in all cases compared to age and sex-matched controls:

degs.up <- map(DE_LOAD1.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange>0)))
degs.down <- map(DE_LOAD1.list, ~length(which(.x$padj<0.05 & .x$log2FoldChange<0)))
deg <- data.frame(comparison=names(degs.up), Up_DEGs.pval.05=as.vector(unlist(degs.up)),Down_DEGs.pval.05=as.vector(unlist(degs.down)))
knitr::kable(deg)
comparison Up_DEGs.pval.05 Down_DEGs.pval.05
APOE4-male-4 0 3
APOE4-female-4 0 1
APOE4-male-8 42 103
APOE4-female-8 11 14
APOE4-male-12 0 1
APOE4-female-12 0 3
Trem2-male-4 5 59
Trem2-female-4 5 13
Trem2-male-8 3 29
Trem2-female-8 2 9
Trem2-male-12 118 88
Trem2-female-12 113 172
APOE4Trem2-male-4 15 8
APOE4Trem2-female-4 1 2
APOE4Trem2-male-8 7 32
APOE4Trem2-female-8 2 1
APOE4Trem2-male-12 0 2
APOE4Trem2-female-12 1 4

Challenge 5

Save the data for next lesson?

Solution to Challenge 5

save(DE_LOAD1.df,DE_LOAD1.list,file="../result/DEAnalysis_LOAD1.Rdata")

Session Info

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] clusterProfiler_4.8.1       lubridate_1.9.2            
 [3] forcats_1.0.0               stringr_1.5.0              
 [5] dplyr_1.1.2                 purrr_1.0.1                
 [7] readr_2.1.4                 tidyr_1.3.0                
 [9] tibble_3.2.1                tidyverse_2.0.0            
[11] EnhancedVolcano_1.18.0      ggrepel_0.9.3              
[13] GO.db_3.17.0                org.Mm.eg.db_3.17.0        
[15] AnnotationDbi_1.62.1        ggplot2_3.4.2              
[17] DESeq2_1.40.1               SummarizedExperiment_1.30.2
[19] Biobase_2.60.0              MatrixGenerics_1.12.0      
[21] matrixStats_1.0.0           GenomicRanges_1.52.0       
[23] GenomeInfoDb_1.36.0         IRanges_2.34.0             
[25] S4Vectors_0.38.1            BiocGenerics_0.46.0        
[27] knitr_1.43                 

loaded via a namespace (and not attached):
 [1] DBI_1.1.3               bitops_1.0-7            gson_0.1.0             
 [4] shadowtext_0.1.2        gridExtra_2.3           rlang_1.1.1            
 [7] magrittr_2.0.3          DOSE_3.26.1             compiler_4.3.0         
[10] RSQLite_2.3.1           png_0.1-8               vctrs_0.6.2            
[13] reshape2_1.4.4          pkgconfig_2.0.3         crayon_1.5.2           
[16] fastmap_1.1.1           XVector_0.40.0          labeling_0.4.2         
[19] ggraph_2.1.0            utf8_1.2.3              HDO.db_0.99.1          
[22] tzdb_0.4.0              enrichplot_1.20.0       bit_4.0.5              
[25] xfun_0.39               zlibbioc_1.46.0         cachem_1.0.8           
[28] aplot_0.1.10            jsonlite_1.8.5          blob_1.2.4             
[31] highr_0.10              DelayedArray_0.26.3     BiocParallel_1.34.2    
[34] tweenr_2.0.2            parallel_4.3.0          R6_2.5.1               
[37] RColorBrewer_1.1-3      stringi_1.7.12          GOSemSim_2.26.0        
[40] Rcpp_1.0.10             downloader_0.4          Matrix_1.5-4           
[43] splines_4.3.0           igraph_1.4.3            timechange_0.2.0       
[46] tidyselect_1.2.0        viridis_0.6.3           qvalue_2.32.0          
[49] codetools_0.2-19        lattice_0.21-8          plyr_1.8.8             
[52] treeio_1.24.1           withr_2.5.0             KEGGREST_1.40.0        
[55] evaluate_0.21           gridGraphics_0.5-1      scatterpie_0.2.1       
[58] polyclip_1.10-4         Biostrings_2.68.1       ggtree_3.8.0           
[61] pillar_1.9.0            ggfun_0.0.9             generics_0.1.3         
[64] RCurl_1.98-1.12         hms_1.1.3               tidytree_0.4.2         
[67] munsell_0.5.0           scales_1.2.1            glue_1.6.2             
[70] lazyeval_0.2.2          tools_4.3.0             data.table_1.14.8      
[73] fgsea_1.26.0            locfit_1.5-9.8          graphlayouts_1.0.0     
[76] fastmatch_1.1-3         tidygraph_1.2.3         cowplot_1.1.1          
[79] grid_4.3.0              ape_5.7-1               colorspace_2.1-0       
[82] nlme_3.1-162            patchwork_1.1.2         GenomeInfoDbData_1.2.10
[85] ggforce_0.4.1           cli_3.6.1               fansi_1.0.4            
[88] viridisLite_0.4.2       S4Arrays_1.0.4          gtable_0.3.3           
[91] yulab.utils_0.0.6       digest_0.6.31           ggplotify_0.1.0        
[94] farver_2.1.1            memoise_2.0.1           lifecycle_1.0.3        
[97] httr_1.4.6              bit64_4.0.5             MASS_7.3-58.4          

Key Points

  • Validate metadata prior to data analysis.

  • Use function for repeated differential expression analysis in multiple comparisons.


Mouse-human alignment of transcriptomic signatures

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How well transcriptomic changes we observe in mouse models carrying AD-related mutations align with human AD data?

  • How do we perfrom corss-species comparison

Objectives
  • Understand the human AD modules.

  • Approach to align mouse data to human data

  • Perform correlation analysis.

  • visualize the results

Author: Ravi Pandey, Jackson Laboratory

Aligning Human and Mouse Phenotype

ALZHEIMER’S DISEASE (AD) is complex disease, we do not expect these mouse models have complete LOAD (late-onset AD) pathology. We can do MRI as well PET imaging to match back to human imaging study ADNI, we can do neuropathology, finally we can do lots of Genomics, proteomics, and metabolomics. These omics study allows us to do real direct homology comparison between human and mouse as genes are overwhemly shared between these two species.

Aligning Human and Mouse Phenotype

Overview of Human transcriptomic data

Three independent human brain transcriptome studies ROSMAP [Religious Orders Study and the Memory and Aging Project], MSSM [Mount Sinai School of Medicine], and Mayo collected human postmortem brain RNA-seq data from seven distinct regions: dorsolateral prefrontal cortex (DLPFC), temporal cortex (TCX), inferior frontal gyrus (IFG), superior temporal gyrus (STG), frontal pole (FP), parahippocampal gyrus (PHG), and cerebellum (CBE),

These postmortem samples are generally balanced for AD, MCI, and non-effected controls. This really give us broad assessment how AD as affecetd multiple brain region in 3 different population around the US.

Transcriptomic of Alzheimer's Disease

Overview of Human Consensus RNA-Seq Coexpression Modules

The Accelerating Medicines Partnership-Alzheimer’s Disease (AMP-AD) Consortium has generated RNA-seq profiles from more than 1,200 human brains and is applying systems biology approaches toward the goal of elucidating AD mechanisms and potential therapeutic targets.

Wan, et al. performed meta analysis including all available AMP-AD RNA-seq datasets and systematically define correspondences between gene expression changes associated with AD in human brains. Briefly, Wan, et al. performed library normalization and covariate adjustments for each human study separately using fixed/mixed effects modeling to account for batch effects. Among the 2978 AMP-AD modules identified across all tissues, 660 modules were selected which showed an enrichment for at least one AD-specific differential expressed gene set from the meta-analysis in cases compared to controls.

Next, they performed multi method co-expression network analysis followed by differential analysis and found 30 co-expression modules related LOAD pathology from human cohort study. Among the 30 aggregate co-expression modules, five consensus clusters have been described by Wan, et al. These consensus clusters consist of a subset of modules which are associated with similar AD related changes across the multiple studies and brain regions. Further, they looked for enrichment of cell type signature in these modules using expression-weighted cell type enrichment analysis Skene and Grant, 2016 as well as applied functional annotation to these modules.

First module block enriched in astrocytes, next block is enriched in endothelial and microglial genes suggesting strong inflammation component, next block in strongly enriched in neuron suggesting neurodegeneration, next is enriched in oligodendrocytes and glial genes suggesting myelination and finally mixed modules that have things to do like stress response and response to unfolded proteins. Stress response not cell specific,so they may be throughout many cells in brain.

AMP-AD GENE Modules

Here we are showing matrix view of gene content overlap between these module, and you can see few strongly overlapping group of modules, implicating similar pathology in different studies in different brain regions.

Reading AMP-AD modules data

You can download data on the 30 human AMP-AD co-expression modules was obtained from the Synapse data repository (https://www.synapse.org/#!Synapse:syn11932957/tables/; SynapseID: syn11932957).

query <- synTableQuery("SELECT * FROM syn11932957")
module_table <- read.table(query$filepath, sep = ",",header = TRUE)

Let’s look at module table

head(module_table)
  ROW_ID ROW_VERSION          GeneID         Module    method
1      0           0 ENSG00000168439 DLPFCturquoise aggregate
2      1           0 ENSG00000086061 DLPFCturquoise aggregate
3      2           0 ENSG00000204389 DLPFCturquoise aggregate
4      3           0 ENSG00000114416 DLPFCturquoise aggregate
5      4           0 ENSG00000110172 DLPFCturquoise aggregate
6      5           0 ENSG00000099622 DLPFCturquoise aggregate
               ModuleName brainRegion               ModuleNameFull
1 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
2 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
3 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
4 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
5 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
6 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
  external_gene_name
1              STIP1
2             DNAJA1
3             HSPA1A
4               FXR1
5            CHORDC1
6              CIRBP

Here you see total 9 columns in this table. Column of our interest are: *Colum 2: human ensembl gene ID, *column 3: module name in which gene is clustered and *column 7: is brain tissue. *column 9: is gene names.

How many distinct modules are in table?

length(unique(module_table$Module))
[1] 30

What are the name of modules?

unique(module_table$Module)
 [1] "DLPFCturquoise" "DLPFCblue"      "DLPFCbrown"     "DLPFCyellow"   
 [5] "CBEturquoise"   "CBEblue"        "CBEbrown"       "CBEyellow"     
 [9] "TCXturquoise"   "TCXblue"        "TCXbrown"       "TCXyellow"     
[13] "TCXgreen"       "IFGturquoise"   "IFGblue"        "IFGbrown"      
[17] "IFGyellow"      "STGturquoise"   "STGblue"        "STGbrown"      
[21] "STGyellow"      "PHGturquoise"   "PHGblue"        "PHGbrown"      
[25] "PHGyellow"      "PHGgreen"       "FPturquoise"    "FPblue"        
[29] "FPbrown"        "FPyellow"      

and how many genes are in each module?

table(module_table$Module)

       CBEblue       CBEbrown   CBEturquoise      CBEyellow      DLPFCblue 
          4509            504           1977           1739           1751 
    DLPFCbrown DLPFCturquoise    DLPFCyellow         FPblue        FPbrown 
           882           2489           3019           1991           1289 
   FPturquoise       FPyellow        IFGblue       IFGbrown   IFGturquoise 
          1001           4426           2885           4673           1456 
     IFGyellow        PHGblue       PHGbrown       PHGgreen   PHGturquoise 
           743           3733           2123           1151           1195 
     PHGyellow        STGblue       STGbrown   STGturquoise      STGyellow 
           910           1171           3414           2404           1799 
       TCXblue       TCXbrown       TCXgreen   TCXturquoise      TCXyellow 
          1713           1851           2766           1131           2013 

You can also visualize this as bar plot using ggplot2 package.

ggplot(module_table,aes(x=Module)) + 
  geom_bar() + 
  theme_bw() + 
  theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 0, size = 12))
plot of chunk module_nGenes

plot of chunk module_nGenes

Challenge 1

What are other ways to count genes in each module?

Solution to Challenge 1

dplyr::count(module_table ,Module)

We can also check total unique genes in table

length(unique((module_table$GeneID)))
[1] 17033

You can also check which modules come from which brain region.

unique(module_table[c("Module", "brainRegion")])
              Module brainRegion
1     DLPFCturquoise       DLPFC
2490       DLPFCblue       DLPFC
4241      DLPFCbrown       DLPFC
5123     DLPFCyellow       DLPFC
8142    CBEturquoise         CBE
10119        CBEblue         CBE
14628       CBEbrown         CBE
15132      CBEyellow         CBE
16871   TCXturquoise         TCX
18002        TCXblue         TCX
19715       TCXbrown         TCX
21566      TCXyellow         TCX
23579       TCXgreen         TCX
26345   IFGturquoise         IFG
27801        IFGblue         IFG
30686       IFGbrown         IFG
35359      IFGyellow         IFG
36102   STGturquoise         STG
38506        STGblue         STG
39677       STGbrown         STG
43091      STGyellow         STG
44890   PHGturquoise         PHG
46085        PHGblue         PHG
49818       PHGbrown         PHG
51941      PHGyellow         PHG
52851       PHGgreen         PHG
54002    FPturquoise          FP
55003         FPblue          FP
56994        FPbrown          FP
58283       FPyellow          FP

Mouse-Human orthologous gene conversion

In module table, we have human ENSEMBL ids and gene names. But we will need corresponding mouse gene name, so that we can compare with our results from mouse models. To do this, we are going to add mouse orthologous gene names corresponding to human ENSEMBL id. Mouse orthologs for human genes were extracted using the HCOP tool (The HGNC Comparison of Orthology Predictions) by Wan, et al.. We are going to read that table from the Synapse data repository (https://doi.org/10.7303/syn17010253.1,synapse id:syn17010253)

mouse.human.ortho <- fread(synapser::synGet("syn17010253")$path,check.names = F,header=T)

Let’s see top rows of this ortholog table:

head(mouse.human.ortho)
   human_entrez_gene human_ensembl_gene hgnc_id
1:                 -    ENSG00000274059       -
2:                 -    ENSG00000212595       -
3:                 -    ENSG00000277418       -
4:                 -    ENSG00000274759       -
5:                 -    ENSG00000274663       -
6:                 -    ENSG00000277488       -
                                   human_name human_symbol human_chr
1: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA         1
2: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA         X
3: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA         8
4: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA         4
5: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA        17
6: 5S ribosomal RNA [Source:RFAM;Acc:RF00001]      5S_rRNA        17
   human_assert_ids mouse_entrez_gene mouse_ensembl_gene      mgi_id
1:  ENSG00000274059                 - ENSMUSG00000089601 MGI:5451871
2:  ENSG00000212595                 - ENSMUSG00000088132 MGI:5452363
3:  ENSG00000277418                 - ENSMUSG00000088814 MGI:5452162
4:  ENSG00000274759                 - ENSMUSG00000084431 MGI:4421893
5:  ENSG00000274663                 - ENSMUSG00000084588 MGI:4421913
6:  ENSG00000277488                 - ENSMUSG00000084588 MGI:4421913
                   mouse_name mouse_symbol mouse_chr   mouse_assert_ids support
1:      predicted gene, 22094      Gm22094         - ENSMUSG00000089601 Ensembl
2:      predicted gene, 22586      Gm22586         - ENSMUSG00000088132 Ensembl
3:      predicted gene, 22385      Gm22385         - ENSMUSG00000088814 Ensembl
4: nuclear encoded rRNA 5S 48      n-R5s48         - ENSMUSG00000084431 Ensembl
5: nuclear encoded rRNA 5S 68      n-R5s68         - ENSMUSG00000084588 Ensembl
6: nuclear encoded rRNA 5S 68      n-R5s68         - ENSMUSG00000084588 Ensembl

Add mouse gene names from ortholog table to module table by matching human ENSEMBL ids from both tables .

module_table$Mouse_gene_name <- mouse.human.ortho$mouse_symbol[match(module_table$GeneID,mouse.human.ortho$human_ensembl_gene)]
head(module_table)
  ROW_ID ROW_VERSION          GeneID         Module    method
1      0           0 ENSG00000168439 DLPFCturquoise aggregate
2      1           0 ENSG00000086061 DLPFCturquoise aggregate
3      2           0 ENSG00000204389 DLPFCturquoise aggregate
4      3           0 ENSG00000114416 DLPFCturquoise aggregate
5      4           0 ENSG00000110172 DLPFCturquoise aggregate
6      5           0 ENSG00000099622 DLPFCturquoise aggregate
               ModuleName brainRegion               ModuleNameFull
1 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
2 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
3 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
4 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
5 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
6 aggregateDLPFCturquoise       DLPFC aggregateDLPFCturquoiseDLPFC
  external_gene_name Mouse_gene_name
1              STIP1           Stip1
2             DNAJA1          Dnaja1
3             HSPA1A          Hspa1a
4               FXR1            Fxr1
5            CHORDC1         Chordc1
6              CIRBP           Cirbp

We will only keep column of our interest and non-empty entries:

ampad_modules <- module_table %>%
  distinct(tissue = brainRegion, module = Module, gene = GeneID, Mouse_gene_name) %>%
  filter(Mouse_gene_name != "")
head(ampad_modules)
  tissue         module            gene Mouse_gene_name
1  DLPFC DLPFCturquoise ENSG00000168439           Stip1
2  DLPFC DLPFCturquoise ENSG00000086061          Dnaja1
3  DLPFC DLPFCturquoise ENSG00000204389          Hspa1a
4  DLPFC DLPFCturquoise ENSG00000114416            Fxr1
5  DLPFC DLPFCturquoise ENSG00000110172         Chordc1
6  DLPFC DLPFCturquoise ENSG00000099622           Cirbp

Reading differential expression result of human data from meta-analysis

Differential expression meta-analysis of reprocessed RNASeq data from AMP-AD (all 7 brain regions). LogFC values for human transcripts were obtained via the AMP-AD knowledge portal(https://www.synapse.org/#!Synapse:syn11180450; SynapseID: syn11180450).

ampad_modules_raw <- fread(synapser::synGet("syn11180450")$path,check.names = F,header=T)

Let’s check the data

head(ampad_modules_raw)
             Model Tissue Comparison ensembl_gene_id      logFC       CI.L
1: SourceDiagnosis    CBE AD-CONTROL ENSG00000078043 -0.4455482 -0.5474320
2: SourceDiagnosis    CBE AD-CONTROL ENSG00000205302  0.4534512  0.3452585
3: SourceDiagnosis    CBE AD-CONTROL ENSG00000134982  0.5778206  0.4383669
4: SourceDiagnosis    CBE AD-CONTROL ENSG00000173230  0.5924321  0.4493164
5: SourceDiagnosis    CBE AD-CONTROL ENSG00000115204 -0.3426223 -0.4257147
6: SourceDiagnosis    CBE AD-CONTROL ENSG00000163867 -0.3176858 -0.3965217
         CI.R   AveExpr         t      P.Value    adj.P.Val        B Direction
1: -0.3436645 1.3320969 -8.592843 1.206969e-16 2.053175e-12 27.12967      DOWN
2:  0.5616439 2.9361993  8.235175 1.699316e-15 1.445353e-11 24.54943        UP
3:  0.7172742 3.5498628  8.141480 3.365041e-15 1.530863e-11 23.85060        UP
4:  0.7355477 3.0759381  8.133894 3.599700e-15 1.530863e-11 23.81776        UP
5: -0.2595299 2.2634337 -8.102128 4.512223e-15 1.535148e-11 23.62730      DOWN
6: -0.2388499 0.4404036 -7.918097 1.695179e-14 4.806115e-11 22.34035      DOWN
   hgnc_symbol percentage_gc_content gene.length Sex Study
1:       PIAS2                 38.59       17527 ALL  MAYO
2:        SNX2                 36.08        4254 ALL  MAYO
3:         APC                 37.63       12440 ALL  MAYO
4:      GOLGB1                 38.07       11865 ALL  MAYO
5:       MPV17                 50.33        5625 ALL  MAYO
6:       ZMYM6                 38.80        9591 ALL  MAYO

Data came from how many tissues?

unique(ampad_modules_raw$Tissue)
[1] "CBE"   "TCX"   "FP"    "IFG"   "PHG"   "STG"   "DLPFC"

We see that data is from all 7 brain regions.

AMP-AD data has been processed many ways and using different models and comparisons. Let’s see how many ways data has been analyzed,

unique(ampad_modules_raw$Comparison)
 [1] "AD-CONTROL"                "PATH_AGE-CONTROL"         
 [3] "PSP-CONTROL"               "AD-PATH_AGE"              
 [5] "AD-OTHER"                  "OTHER-CONTROL"            
 [7] "2-0"                       "2-1"                      
 [9] "1-0"                       "CDR"                      
[11] "4-1"                       "4-2"                      
[13] "FEMALE-MALE"               "AD-CONTROL.IN.FEMALE-MALE"
[15] "AOD"                      

For our analysis, we need data for “Diagnosis” model and comparison between cases vs controls. So, we subset the logFC data for these conditions. Also, we need only three columns: “Tissue”,”Gene” and “logFC”. So, we will filter and subset the data.

ampad_fc <- ampad_modules_raw %>%
  as_tibble() %>%
  filter(Model == "Diagnosis", Comparison == "AD-CONTROL") %>%
  dplyr::select(tissue = Tissue, gene = ensembl_gene_id, ampad_fc = logFC)

Combine with modules so correlation can be done per module

Next, we will combine fold change table ampad_fc and module table ampad_modules. First, look at both tables to check how can we merge them together?

head(ampad_fc)
# A tibble: 6 × 3
  tissue gene            ampad_fc
  <chr>  <chr>              <dbl>
1 CBE    ENSG00000085224    0.363
2 CBE    ENSG00000131016    0.481
3 CBE    ENSG00000078043   -0.361
4 CBE    ENSG00000243943   -0.257
5 CBE    ENSG00000093167   -0.477
6 CBE    ENSG00000058272    0.330
head(ampad_modules)
  tissue         module            gene Mouse_gene_name
1  DLPFC DLPFCturquoise ENSG00000168439           Stip1
2  DLPFC DLPFCturquoise ENSG00000086061          Dnaja1
3  DLPFC DLPFCturquoise ENSG00000204389          Hspa1a
4  DLPFC DLPFCturquoise ENSG00000114416            Fxr1
5  DLPFC DLPFCturquoise ENSG00000110172         Chordc1
6  DLPFC DLPFCturquoise ENSG00000099622           Cirbp

In both tables, common columns are “gene” and “tissue”. So we will merge both datasets using these two columns. Reminder: Every gene can be present in multiple brain regions but only one module form any brain region. Let’s check that:

ampad_modules[ampad_modules$gene %in% "ENSG00000168439",]
      tissue         module            gene Mouse_gene_name
1      DLPFC DLPFCturquoise ENSG00000168439           Stip1
9376     CBE        CBEblue ENSG00000168439           Stip1
17627    TCX       TCXbrown ENSG00000168439           Stip1
31448    STG   STGturquoise ENSG00000168439           Stip1
40130    PHG        PHGblue ENSG00000168439           Stip1
48317     FP         FPblue ENSG00000168439           Stip1

We can see that this gene is present in six distinct modules and all modules are from different brain regions. You can do for any other gene as well.

Let’s merge using inner_join function from tidyverse:

ampad_modules_fc <- ampad_modules %>%
  inner_join(ampad_fc, by = c("gene", "tissue")) %>% 
  dplyr::select(symbol = Mouse_gene_name, module, ampad_fc) 
head(ampad_modules_fc)
   symbol         module    ampad_fc
1   Stip1 DLPFCturquoise  0.03342379
2  Dnaja1 DLPFCturquoise  0.02350154
3  Hspa1a DLPFCturquoise  0.10220020
4    Fxr1 DLPFCturquoise  0.03842087
5 Chordc1 DLPFCturquoise  0.06350503
6   Cirbp DLPFCturquoise -0.24214640

We will use ampad_modules_fc dataset to compare with log fold change data from mouse models.

Preparing module information for correlation plot

mod <- c("TCXblue","PHGyellow","IFGyellow","DLPFCblue","CBEturquoise","STGblue","PHGturquoise","IFGturquoise","TCXturquoise","FPturquoise","IFGbrown","STGbrown","DLPFCyellow","TCXgreen","FPyellow","CBEyellow","PHGbrown","DLPFCbrown","STGyellow","PHGgreen","CBEbrown","TCXyellow","IFGblue","FPblue","FPbrown","CBEblue","DLPFCturquoise","TCXbrown","STGturquoise","PHGblue")

cluster_a <- tibble(
  module = c("TCXblue", "PHGyellow", "IFGyellow"),
  cluster = "Consensus Cluster A (ECM organization)",
  cluster_label = "Consensus Cluster A\n(ECM organization)"
)

cluster_b <- tibble(
  module = c("DLPFCblue", "CBEturquoise", "STGblue", "PHGturquoise", "IFGturquoise", "TCXturquoise", "FPturquoise"),
  cluster = "Consensus Cluster B (Immune system)",
  cluster_label = "Consensus Cluster B\n(Immune system)"
)

cluster_c <- tibble(
  module = c("IFGbrown", "STGbrown", "DLPFCyellow", "TCXgreen", "FPyellow", "CBEyellow", "PHGbrown"),
  cluster = "Consensus Cluster C (Neuronal system)",
  cluster_label = "Consensus Cluster C\n(Neuronal system)"
)

cluster_d <- tibble(
  module = c("DLPFCbrown", "STGyellow", "PHGgreen", "CBEbrown", "TCXyellow", "IFGblue", "FPblue"),
  cluster = "Consensus Cluster D (Cell Cycle, NMD)",
  cluster_label = "Consensus Cluster D\n(Cell Cycle, NMD)"
)

cluster_e <- tibble(
  module = c("FPbrown", "CBEblue", "DLPFCturquoise", "TCXbrown", "STGturquoise", "PHGblue"),
  cluster = "Consensus Cluster E (Organelle Biogensis, Cellular stress response)",
  cluster_label = "Consensus Cluster E\n(Organelle Biogenesis,\nCellular stress response)"
)

module_clusters <- cluster_a %>%
  bind_rows(cluster_b) %>%
  bind_rows(cluster_c) %>%
  bind_rows(cluster_d) %>%
  bind_rows(cluster_e) %>%
  mutate(cluster_label = fct_inorder(cluster_label))

head(module_clusters)
# A tibble: 6 × 3
  module       cluster                                cluster_label             
  <chr>        <chr>                                  <fct>                     
1 TCXblue      Consensus Cluster A (ECM organization) "Consensus Cluster A\n(EC…
2 PHGyellow    Consensus Cluster A (ECM organization) "Consensus Cluster A\n(EC…
3 IFGyellow    Consensus Cluster A (ECM organization) "Consensus Cluster A\n(EC…
4 DLPFCblue    Consensus Cluster B (Immune system)    "Consensus Cluster B\n(Im…
5 CBEturquoise Consensus Cluster B (Immune system)    "Consensus Cluster B\n(Im…
6 STGblue      Consensus Cluster B (Immune system)    "Consensus Cluster B\n(Im…
save(ampad_modules_fc,module_clusters,mod,file="../data/AMPADModuleData_Correlation.RData")

Correlation between mouse models and human AD modules

There are two approaches that we adopted to compute correlation between mouse data with human AD modules:

Human-mouse logFCcorrelation

These appraoches allow us to assess directional coherence between AMP-AD modules and the effects of genetic perturbations in mice. In this lesson, we are going to use first approach. Let’s start!

Step0: Reading Gene Expression Count matrix from Previous Lesson

We first read the result table saved after differential analysis in last lesson. We start with 5XFAD mouse model to understand all required steps to perform correlation with human AD modules.

load("../results/DEAnalysis_5XFAD.Rdata")

We can also load AMP-AD module data.

load("../data/AMPADModuleData_Correlation.RData")

Step1: Measure the correlation between mouse models for each sex at each age and AMP-AD modules using common genes from both datasets

We compute pearson correlation between changes in expression for each gene in a given module (log fold change for cases minus controls) with each mouse model (log fold change of the 5XFAD mice minus sex and age-matched B6 mice).

First, we add both mouse DE_5xFAD.df and human ampad_modules_fc log fold change datasets for all genes.

model_vs_ampad <- DE_5xFAD.df %>%
    inner_join(ampad_modules_fc, by = c("symbol"),multiple = "all") 
head(model_vs_ampad)
  symbol EntrezGene  baseMean log2FoldChange     lfcSE       stat    pvalue
1  Gnai3      14679 3707.5316    -0.02308587 0.0381646 -0.6049026 0.5452437
2  Gnai3      14679 3707.5316    -0.02308587 0.0381646 -0.6049026 0.5452437
3  Gnai3      14679 3707.5316    -0.02308587 0.0381646 -0.6049026 0.5452437
4  Gnai3      14679 3707.5316    -0.02308587 0.0381646 -0.6049026 0.5452437
5  Scml2     107815  126.8241     0.08939456 0.1377406  0.6490066 0.5163341
6  Scml2     107815  126.8241     0.08939456 0.1377406  0.6490066 0.5163341
       padj model  sex  age       module    ampad_fc
1 0.9999518 5xFAD male 4 mo      TCXblue  0.18017910
2 0.9999518 5xFAD male 4 mo IFGturquoise -0.01064322
3 0.9999518 5xFAD male 4 mo      STGblue -0.01063594
4 0.9999518 5xFAD male 4 mo  FPturquoise -0.03142185
5 0.9999518 5xFAD male 4 mo STGturquoise  0.07891395
6 0.9999518 5xFAD male 4 mo      PHGblue  0.09451007

Next, we create a list-columns of data frame using nest function of tidyverse package. Nesting is implicitly a summarising operation: you get one row for each group defined by the non-nested columns.

df <- model_vs_ampad %>%
        dplyr::select(module, model, sex, age, symbol, log2FoldChange, ampad_fc) %>%
        group_by(module, model, sex,age) %>%
        nest(data = c(symbol, log2FoldChange, ampad_fc))
head(df)
# A tibble: 6 × 5
# Groups:   module, model, sex, age [6]
  module       model sex   age   data                
  <chr>        <chr> <chr> <chr> <list>              
1 TCXblue      5xFAD male  4 mo  <tibble [1,462 × 3]>
2 IFGturquoise 5xFAD male  4 mo  <tibble [1,310 × 3]>
3 STGblue      5xFAD male  4 mo  <tibble [1,092 × 3]>
4 FPturquoise  5xFAD male  4 mo  <tibble [927 × 3]>  
5 STGturquoise 5xFAD male  4 mo  <tibble [1,748 × 3]>
6 PHGblue      5xFAD male  4 mo  <tibble [2,841 × 3]>

total number of groups in data table

dim(df)
[1] 180   5

Let’s check first row:

head(df[1,]$data)
[[1]]
# A tibble: 1,462 × 3
   symbol  log2FoldChange ampad_fc
   <chr>            <dbl>    <dbl>
 1 Gnai3         -0.0231    0.180 
 2 Gna12         -0.00445   0.444 
 3 Sdhd          -0.00152   0.0631
 4 Lhx2          -0.0959    0.252 
 5 Gmpr           0.0105    0.801 
 6 Tpd52l1       -0.103     0.714 
 7 Cdh4          -0.0111    0.109 
 8 Dbt           -0.0460   -0.0103
 9 Tbrg4          0.00746  -0.129 
10 Galnt1        -0.0105    0.191 
# ℹ 1,452 more rows

Next, we compute correlation coefficients using cor.test function built in R as following:

cor.df <- df  %>%
          mutate(
              cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]], .x[["ampad_fc"]], method = "pearson")),
              estimate = map_dbl(cor_test, "estimate"),
              p_value = map_dbl(cor_test, "p.value")
              ) %>%
          ungroup() %>%
          dplyr::select(-cor_test)

head(cor.df)
# A tibble: 6 × 7
  module       model sex   age   data                 estimate  p_value
  <chr>        <chr> <chr> <chr> <list>                  <dbl>    <dbl>
1 TCXblue      5xFAD male  4 mo  <tibble [1,462 × 3]>  0.0681  9.19e- 3
2 IFGturquoise 5xFAD male  4 mo  <tibble [1,310 × 3]>  0.227   9.10e-17
3 STGblue      5xFAD male  4 mo  <tibble [1,092 × 3]>  0.284   9.55e-22
4 FPturquoise  5xFAD male  4 mo  <tibble [927 × 3]>    0.103   1.72e- 3
5 STGturquoise 5xFAD male  4 mo  <tibble [1,748 × 3]> -0.0538  2.44e- 2
6 PHGblue      5xFAD male  4 mo  <tibble [2,841 × 3]> -0.00934 6.19e- 1

Step2: add signifcant correlations and join module cluster information to correlation table

model_module <- cor.df %>%
    mutate(significant = p_value < 0.05) %>%
    left_join(module_clusters, by = "module") %>%
    dplyr::select(cluster, cluster_label, module, model, sex,age,correlation = estimate, p_value, significant)
head(model_module)
# A tibble: 6 × 9
  cluster            cluster_label module model sex   age   correlation  p_value
  <chr>              <fct>         <chr>  <chr> <chr> <chr>       <dbl>    <dbl>
1 Consensus Cluster… "Consensus C… TCXbl… 5xFAD male  4 mo      0.0681  9.19e- 3
2 Consensus Cluster… "Consensus C… IFGtu… 5xFAD male  4 mo      0.227   9.10e-17
3 Consensus Cluster… "Consensus C… STGbl… 5xFAD male  4 mo      0.284   9.55e-22
4 Consensus Cluster… "Consensus C… FPtur… 5xFAD male  4 mo      0.103   1.72e- 3
5 Consensus Cluster… "Consensus C… STGtu… 5xFAD male  4 mo     -0.0538  2.44e- 2
6 Consensus Cluster… "Consensus C… PHGbl… 5xFAD male  4 mo     -0.00934 6.19e- 1
# ℹ 1 more variable: significant <lgl>

Step3: Create a dataframe to use as input for plotting the results

order.model <- c("5xFAD (Male)", "5xFAD (Female)")

correlation_for_plot <- model_module %>%
    arrange(cluster) %>%
    mutate(
      module = factor(module,levels=mod),
      model_sex = glue::glue("{model} ({str_to_title(sex)})"),
    ) %>%
    arrange(model_sex) %>%
    mutate(
      model_sex = factor(model_sex,levels=order.model),
      model_sex = fct_rev(model_sex),
    )

head(correlation_for_plot)
# A tibble: 6 × 10
  cluster cluster_label module model sex   age   correlation p_value significant
  <chr>   <fct>         <fct>  <chr> <chr> <chr>       <dbl>   <dbl> <lgl>      
1 Consen… "Consensus C… TCXbl… 5xFAD fema… 4 mo       0.114  1.31e-5 TRUE       
2 Consen… "Consensus C… PHGye… 5xFAD fema… 4 mo       0.201  9.14e-9 TRUE       
3 Consen… "Consensus C… IFGye… 5xFAD fema… 4 mo       0.0757 5.12e-2 FALSE      
4 Consen… "Consensus C… TCXbl… 5xFAD fema… 6 mo       0.115  9.69e-6 TRUE       
5 Consen… "Consensus C… PHGye… 5xFAD fema… 6 mo       0.0668 5.87e-2 FALSE      
6 Consen… "Consensus C… IFGye… 5xFAD fema… 6 mo       0.0707 6.86e-2 FALSE      
# ℹ 1 more variable: model_sex <fct>

Visualizing the Correlation plot

Now, we will use above matrix and visualize the correlation results using ggplot2 package.

data <- correlation_for_plot
range(correlation_for_plot$correlation)
[1] -0.1749441  0.3684976
ggplot2::ggplot() +
    ggplot2::geom_tile(data = data, ggplot2::aes(x = .data$module, y = .data$model_sex), colour = "black", fill = "white") +
    ggplot2::geom_point(data = dplyr::filter(data), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation, size = abs(.data$correlation))) +
    ggplot2::geom_point(data = dplyr::filter(data, .data$significant), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation),color="black",shape=0,size=9) +
    ggplot2::scale_x_discrete(position = "top") +
    ggplot2::scale_size(guide = "none", limits = c(0, 0.4)) +
    ggplot2::scale_color_gradient2(limits = c(-0.4, 0.4), breaks = c(-0.4, 0, 0.4), low = "#85070C", high = "#164B6E", name = "Correlation", guide = ggplot2::guide_colorbar(ticks = FALSE)) +
    ggplot2::labs(x = NULL, y = NULL) +
    ggplot2::facet_grid( rows = dplyr::vars(.data$age),cols = dplyr::vars(.data$cluster_label), scales = "free", space = "free",switch = 'y') +
    ggplot2::theme(
      strip.text.x = ggplot2::element_text(size = 10,colour = c("black")),
      strip.text.y.left = ggplot2::element_text(angle = 0,size = 12),
      axis.ticks = ggplot2::element_blank(),
      axis.text.x = ggplot2::element_text(angle = 90, hjust = 0, size = 12),
      axis.text.y = ggplot2::element_text(angle = 0, size = 12),
      panel.background = ggplot2::element_blank(),
      plot.title = ggplot2::element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
      plot.title.position = "plot",
      panel.grid = ggplot2::element_blank(),
      legend.position = "right"
    )

AMPAD - 5XFAD correlation plot

In above plot, top row represent 30 AMP-AD modules grouped into 5 consensus clusters describing the major functional groups of AD-related alterations and left column represent mouse models. Positive correlations are shown in blue and negative correlations in red. Color intensity and size of the circles are proportional to the correlation coefficient. Black square around dots represent significant correlation at p-value=0.05 and non-significant correlations are left blank.

Male and female 5XFAD mice display gene expression alterations across all five consensus clusters, with the most pronounced alterations observed in Consensus Cluster B, which consists of immune system pathways.

We also saw in Volcano plot that immune-related genes were signifcantly overexpressed in the 5XFAD model. This significant positive correlation suggest that 5XFAD model capture inflammatory changes observed in human AD patients.

Combing all steps into a function to perform correlation analysis

We can combine all above steps to perform correlation analysis into one function to use for other data by just inputting log fod change matrix obtained from differential analysis.

corr_function <- function(mydat)
{
  model_vs_ampad <- mydat %>%
    inner_join(ampad_modules_fc, by = c("symbol"),multiple = "all") %>%
    dplyr::select(module, model, sex, age, symbol, log2FoldChange, ampad_fc) %>%
    group_by(module, model, sex,age) %>%
    nest(data = c(symbol, log2FoldChange, ampad_fc)) %>%
    mutate(
      cor_test = map(data, ~ cor.test(.x[["log2FoldChange"]], .x[["ampad_fc"]], method = "pearson")),
      estimate = map_dbl(cor_test, "estimate"),
      p_value = map_dbl(cor_test, "p.value")
    ) %>%
    ungroup() %>%
    dplyr::select(-cor_test)
  
  model_module <- model_vs_ampad %>%
    mutate(significant = p_value < 0.05) %>%
    left_join(module_clusters, by = "module") %>%
    dplyr::select(cluster, cluster_label, module, model, sex,age,correlation = estimate, p_value, significant)

  correlation_for_plot <- model_module %>%
    arrange(cluster) %>%
    mutate(
      module = factor(module,levels=mod),
      model_sex = glue::glue("{model} ({str_to_title(sex)})"),
    ) %>%
    arrange(model_sex) %>%
    mutate(
      model_sex = factor(model_sex,levels=order.model),
      model_sex = fct_rev(model_sex),
    )
}
magora_corrplot <- function(data,ran) 
  {
  
  ggplot2::ggplot() +
    ggplot2::geom_tile(data = data, ggplot2::aes(x = .data$module, y = .data$model_sex), colour = "black", fill = "white") +
    ggplot2::geom_point(data = dplyr::filter(data), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation, size = abs(.data$correlation))) +
    ggplot2::geom_point(data = dplyr::filter(data, .data$significant), ggplot2::aes(x = .data$module, y = .data$model_sex, colour = .data$correlation),color="black",shape=0,size=9) +
    ggplot2::scale_x_discrete(position = "top") +
    ggplot2::scale_size(guide = "none", limits = c(0, ran)) +
    ggplot2::scale_color_gradient2(limits = c(-ran, ran), breaks = c(-ran, 0, ran), low = "#85070C", high = "#164B6E", name = "Correlation", guide = ggplot2::guide_colorbar(ticks = FALSE)) +
    ggplot2::labs(x = NULL, y = NULL) +
    #ggplot2::ggtitle("Model(Sex)| Age/Control") +
    ggplot2::facet_grid( rows = dplyr::vars(.data$age),cols = dplyr::vars(.data$cluster_label), scales = "free", space = "free",switch = 'y') +
    ggplot2::theme(
      strip.text.x = ggplot2::element_text(size = 10,colour = c("black")),
      strip.text.y.left = ggplot2::element_text(angle = 0,size = 12),
      axis.ticks = ggplot2::element_blank(),
      axis.text.x = ggplot2::element_text(angle = 90, hjust = 0, size = 12),
      axis.text.y = ggplot2::element_text(angle = 0, size = 12),
      panel.background = ggplot2::element_blank(),
      plot.title = ggplot2::element_text(angle = 0, vjust = -54, hjust = 0.03,size=12,face="bold"),
      plot.title.position = "plot",
      panel.grid = ggplot2::element_blank(),
      legend.position = "right"
    )
}

We can use these functions directly onto our other dataset from LOAD1 cohort.

Correlation between LOAD1 models and human AD modules

Reading Gene Expression Count matrix of LOAD1 cohort from Previous Lesson

load("../results/DEAnalysis_LOAD1.Rdata")

First, check the data:

head(DE_LOAD1.df)
                   symbol EntrezGene   baseMean log2FoldChange      lfcSE
ENSMUSG00000000001  Gnai3      14679 2587.33246   -0.027251874 0.05341582
ENSMUSG00000000028  Cdc45      12544   96.38913   -0.086316094 0.19815501
ENSMUSG00000000031    H19      14955   23.15992   -0.402670365 0.47978365
ENSMUSG00000000037  Scml2     107815   94.24871   -0.093235144 0.23666227
ENSMUSG00000000049   Apoh      11818   11.94380   -0.220259366 0.59500274
ENSMUSG00000000056   Narf      67608 3464.49361    0.009537733 0.05167193
                         stat    pvalue      padj model  sex age
ENSMUSG00000000001 -0.5101836 0.6099228 0.9968456 APOE4 male   4
ENSMUSG00000000028 -0.4355989 0.6631278 0.9968456 APOE4 male   4
ENSMUSG00000000031 -0.8392749 0.4013151 0.9932419 APOE4 male   4
ENSMUSG00000000037 -0.3939586 0.6936116 0.9968456 APOE4 male   4
ENSMUSG00000000049 -0.3701821 0.7112468 0.9968456 APOE4 male   4
ENSMUSG00000000056  0.1845825 0.8535565 0.9968456 APOE4 male   4
unique(DE_LOAD1.df$model)
[1] "APOE4"      "Trem2"      "APOE4Trem2"
unique(DE_LOAD1.df$sex)
[1] "male"   "female"
unique(DE_LOAD1.df$age)
[1]  4  8 12

Perform the correlation analysis

We directly input logFC table DE_LOAD1.df to corr_function:

order.model <- c("APOE4 (Male)", "APOE4 (Female)","Trem2 (Male)", "Trem2 (Female)","APOE4Trem2 (Male)", "APOE4Trem2 (Female)")

correlation_for_plot <- corr_function(DE_LOAD1.df)
head(correlation_for_plot)
# A tibble: 6 × 10
  cluster cluster_label module model sex     age correlation p_value significant
  <chr>   <fct>         <fct>  <chr> <chr> <int>       <dbl>   <dbl> <lgl>      
1 Consen… "Consensus C… TCXbl… APOE4 fema…     4     -0.0251 3.37e-1 FALSE      
2 Consen… "Consensus C… PHGye… APOE4 fema…     4     -0.148  2.34e-5 TRUE       
3 Consen… "Consensus C… IFGye… APOE4 fema…     4      0.0911 1.86e-2 TRUE       
4 Consen… "Consensus C… TCXbl… APOE4 fema…     8     -0.0230 3.79e-1 FALSE      
5 Consen… "Consensus C… PHGye… APOE4 fema…     8     -0.110  1.73e-3 TRUE       
6 Consen… "Consensus C… IFGye… APOE4 fema…     8     -0.0566 1.45e-1 FALSE      
# ℹ 1 more variable: model_sex <fct>

Visualize the correlation plot

#range of correlation
range(correlation_for_plot$correlation)
[1] -0.2256401  0.1073370
magora_corrplot(correlation_for_plot,0.25)

AMPAD - LOAD1 correlation plot

What do you conclude from this plot?

Key Points

  • AMP-AD gene modules represent major transcriptomic heterogeneity in AD.

  • Correlation of logFC is a practical approach for human-mouse alignment of AD-associated transcriptomic signatures.


Functional Alignment using AD Biological Domains

Overview

Teaching: 45 min
Exercises: 15 min
Questions
  • How to compare functional signatures from omics data between species?

Objectives
  • Understand how to detect changes in groups of related genes within a dataset.

  • Identify the common pitfalls of performing functional analyses.

  • Understand the source and utility of the biological domains of Alzheimer’s disease.

  • Use biological domain annotations to summarise and compare functional enrichments.

Author: Greg Cary, The Jackson Laboratory

Load libraries

suppressPackageStartupMessages(library("tidyverse"))
suppressPackageStartupMessages(library("org.Mm.eg.db"))
suppressPackageStartupMessages(library("org.Hs.eg.db"))
suppressPackageStartupMessages(library("clusterProfiler"))
suppressPackageStartupMessages(library("cowplot"))
suppressPackageStartupMessages(library("synapser"))
suppressPackageStartupMessages(library("ggridges"))

Detecting functional coherence of gene sets from omics data

Most omics analysis approaches produce data on many thousands of genomic features (i.e. transcripts, proteins, etc.) for each condition tested. Simply looking at these lists of genes and associated statistics can be daunting and uninformative. We need approaches to identify which biological functions are being impacted by a given experiment from these systems-level measurements.

Gene functional enrichment analysis describes a variety of statistical methods that identify groups of genes that share a particular biological function or process and show differential association with experimental conditions. Most approaches compare some statistically valid set of differentially expressed features to sets of functional annotations for those features. There are many different functional annotation sets available, some of the more commonly used include:

These are the resources that are the foundation for many genomics knowledge bases, such as MGI, monarch initiative, and the Alliance of Genome Resources. The the precise nature of each of these resources varies, but the core information contained within each is the relationship of sets of genes to biologically meaningful annotations. These annotations are typically expertly curated from the published literature.

There are a variety of statistical approaches that can be employed to test for functional enrichment among genes identified from an omics dataset. Two of the most common broad classes of tests are over-representation analysis (ORA) and gene set enrichment analysis (GSEA). For example, consider the figure below from Zhao & Rhee, Trends in Genetics (2023). Let’s consider each in a bit more detail.

Functional enrichment approaches

Over-representation analysis

ORA involves statistical tests of overlap between two lists of genes: one derived from the experiment and one from the functional annotations. For example, one might ask what is the overlap between the genes in an annotation class, such as “Lysosome”, and the genes that are significantly up-regulated in a given experimental condition. These tests usually rely on some form of Fisher’s exact test (e.g. fisher.test()) or hypergeometric test (e.g. phyper()). If the gene lists consist of a larger number of overlapping genes than would be expected at random - given the sample sizes involved - then there is said to be a statistical over-representation of the annotation class within the experimental condition.

Of course, these overlap tests are generally considered for all annotation classes, which can number in the hundreds to thousands. Performing this many statistical tests ensures that many will be identified as significant by chance. Therefore, there is typically a requirement to correct for multiple testing errors (e.g. p.adjust()).

There are many R packages available to handle the statistical tests and corrections involved in ORA. Today we’ll use clusterProfiler::enrichGO(), which wraps statistical testing for overlap with GO terms and multiple test correction in one function. Let’s start by considering the enrichments from the mouse data analyzed previously.

theme_set(theme_bw())
load('../results/DEAnalysis_5XFAD.RData')
#load('../data/DEAnalysis_5XFAD.RData')

We’ll start by considering the genes that are significantly DE in 12 month old male animals

gene.list.up <- DE_5xFAD.df %>% 
  filter(sex == 'male', 
         age == '12 mo',
         padj <= 0.05, 
         log2FoldChange > 0) %>%
  arrange(desc(log2FoldChange)) %>% 
  filter(!duplicated(symbol), !is.na(symbol)) %>% 
  pull(symbol) %>% 
  unique()

gene.list.dn <- DE_5xFAD.df %>% 
  filter(sex == 'male', 
         age == '12 mo',
         padj <= 0.05, 
         log2FoldChange < 0) %>%
  arrange(desc(log2FoldChange)) %>% 
  filter(!duplicated(symbol), !is.na(symbol)) %>% 
  pull(symbol) %>% 
  unique()

length(gene.list.up)
[1] 1054
length(gene.list.dn)
[1] 491

There are a total of 1,055 significantly up-regulated genes and 491 significantly down-regulated genes in this cohort. Now test for over representation of GO terms among the DEGs. First, we need to identify the universe of all possible genes which includes the genes that were both measured by the experiment and contained within the annotation set.

univ <- as.data.frame(org.Mm.egGO) %>% 
  pull(gene_id) %>% 
  unique() %>% 
  bitr(., fromType = "ENTREZID", toType = 'SYMBOL', OrgDb = org.Mm.eg.db, drop = T) %>% 
  pull('SYMBOL') %>% 
  intersect(., DE_5xFAD.df$symbol)
'select()' returned 1:1 mapping between keys and columns

Now let’s test for enriched GO terms (this can take 3-4 minutes)

enr.up <- enrichGO(gene.list.up, 
                   ont = 'all', 
                   OrgDb = org.Mm.eg.db, 
                   keyType = 'SYMBOL',
                   universe = univ
                   )
enr.dn <- enrichGO(gene.list.dn, 
                   ont = 'all', 
                   OrgDb = org.Mm.eg.db, 
                   keyType = 'SYMBOL',
                   universe = univ
                   )

How many significant terms are identified:

enr.up@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
[1] 1597
enr.dn@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
[1] 531

Challenge 1

1) How many significant terms are identified from the up-regulated gene list if you do not specify the “universe”?

Solution to Challenge 1

enrichGO(gene.list.up, ont = 'all', OrgDb = org.Mm.eg.db, keyType = 'SYMBOL') %>% .@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()  

Plot the top 10 significant terms:

cowplot::plot_grid(
  dotplot(enr.dn, showCategory = 10) + ggtitle('down'),
  dotplot(enr.up, showCategory = 10) + ggtitle('up')
)

significantTerms

From this we can see that nervous system related terms (e.g. “learning or memory”, “neurotransmitter transport”) are down in 5xFAD males at 12 months, while immune functions (e.g. “cytokine-mediated signaling pathway” and “leukocyte mediated immunity”) are up in 5xFAD males at 12 months.

Gene set enrichment analysis

GSEA is an alternative approach that uses a statistical measure from the omics data (e.g. the log fold change or significance) to rank the genes. An “enrichment score” is calculated for each annotation set based on where the genes annotated to the term sit in the overall distribution.

Let’s analyze those same data with the clusterProfiler::gseGO() function. First we’ll specify the gene list and use the log2FoldChange value to rank the genes in the list.

gene.list <- DE_5xFAD.df %>% 
  filter(sex == 'male', 
         age == '12 mo',
         padj <= 0.05
         ) %>%
  arrange(desc(log2FoldChange)) %>% 
  filter(!duplicated(symbol), !is.na(symbol)) %>% 
  pull(log2FoldChange, name = symbol)

Now we’ll test for enrichment:

enr <- gseGO(gene.list, ont = 'all', OrgDb = org.Mm.eg.db, keyType = 'SYMBOL')

How many significant terms are identified:

enr@result %>% filter(p.adjust <= 0.05) %>% pull(ID) %>% unique() %>% length()
[1] 295

Challenge 2

1) The tally above represents all genes, both up- and down-regulatd. To compare between GSEA and ORA, can you identify how many GSEA enriched terms are associated with up-regulated genes and how many are associated with down-regulated genes? (Hint: the NES value within the results relates to the directionality of enrichment).

Solution to Challenge 2

enr@result %>% filter(p.adjust <= 0.05, NES < 0) %>% pull(ID) %>% unique() %>% length()  
enr@result %>% filter(p.adjust <= 0.05, NES > 0) %>% pull(ID) %>% unique() %>% length()

Plot the term enrichments

ridgeplot(enr, core_enrichment = T, showCategory = 20)
Picking joint bandwidth of 0.224

ridgeplot

This plot shows the top 20 GSEA enriched terms. The displayed terms are all from the up-regulated class (i.e. NES > 0), and consist of many immune-relevant terms (e.g. “immune response” and “cytokine production”).

Common pitfalls & best practices

These kinds of functional enrichment analyses are very common, but not all results reported are equal! A recent paper describes an “Urgent need for consistent standards in functional enrichment analysis”{target=’_blank’}. They examine how the results of functional enrichment analyses are reported in the literature, and identify several common shortcomings. Watch out for these common mistakes when performing and reporting your own analyses!

Common pitfalls

Alzheimers’s Disease Biological Domains

While these results are informative and help to provide essential biological context to the results of the omics experiment, it is difficult to understand all of the enriched terms and what that means for the biology of disease. It would be useful to group each of the enriched terms within broader areas of disease biology. There are several common molecular endophenotypes that are detected within human Alzheimer’s cohorts across omics data types (transcriptomics, proteomics, and genetics). Several of these endophenotypic areas are shown in the figure below (source: Jesse Wiley).

Endophenotypes of AD

These endophenotypes describe molecular processes that are dysregulated by or during the disease process. Very similar sets of endophenotypes have been cataloged among the targets of therapeutics in clinical trials for AD (see below, Cummings et al, Alzheimer’s disease drug development pipeline: 2023, Alz Dem TRCI{target=’_blank’}).

Common AD Research Ontology

Drugs in Pipeline by CADRO

In order to formalize and operationalize these endophenotypic areas, the Emory-Sage-SGC center of the TREAT-AD consortium has developed the biological domains of AD. In the enumeration of these domains, several criteria were established; the biological domains defined should be:

In all 19 distinct biological domains (aka biodoms) have been identified. These biological domains are defined using sets of GO terms that align with the endophenotype. For example, terms like “cytokine receptor binding” and “leukocyte migration” are annotated to the Immune Response biodomain, while terms like “postsynaptic density” and “synaptic vesicle cycle” are annotated to the Synapse biodomain. Of all terms in the GO, 7,127 terms (16.4%) are annotated to one of the biological domains. Because the GO terms have genes annotated, genes can be associate with specific endophenotypes via the biological domain groupings of terms. In all, 18,289 genes annotated to at least 1 biodom term. While the biodomains exhibit very few overlapping GO terms (panel A), due to gene pleiotropy (etc) the number of overlapping genes between biological domains is quite a bit higher (panel B). The development of the biological domains is described in more detail in the medRxiv preprint Genetic and Multi-omic Risk Assessment of Alzheimer’s Disease Implicates Core Associated Biological Domains.

AD Biodomain Demographics

Download and explore the biological domain annotations

First let’s download the biodomain definition file from synapse.

synLogin()
NULL
# biodomain definitions
biodom <- readRDS(synGet('syn25428992')$path)
dom.lab <- read_csv(synGet('syn26856828')$path)
Rows: 20 Columns: 4
── Column specification ───────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): domain, abbr, label, color

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

What is in the file?

head(biodom)
# A tibble: 6 × 12
# Rowwise: 
  GO_ID      GOterm_Name  Biodomain n_symbol symbol uniprot n_ensGene ensembl_id
  <chr>      <chr>        <chr>        <int> <list> <list>      <int> <list>    
1 GO:0006955 immune resp… Immune R…     1544 <chr>  <chr>         785 <chr>     
2 GO:0019882 antigen pro… Immune R…      101 <chr>  <chr>         151 <chr>     
3 GO:0002376 immune syst… Immune R…     2165 <chr>  <chr>        1215 <chr>     
4 GO:0045087 innate immu… Immune R…      733 <chr>  <chr>         799 <chr>     
5 GO:0006954 inflammator… Immune R…      537 <chr>  <chr>         490 <chr>     
6 GO:0002250 adaptive im… Immune R…      585 <chr>  <chr>         733 <chr>     
# ℹ 4 more variables: n_entrezGene <int>, entrez_id <list>, n_hgncSymbol <int>,
#   hgnc_symbol <list>

You can see the file is a list of GO term accessions (GO_ID) and names (GOterm_Name) as well as the corresponding endophenotype (Biodomain) to which the term is annotated. There are also gene annotations for each term. These annotation sets come from two sources: (1) the symbol and uniprot annotations are drawn directly from the Gene Ontology via the provided API, (2) ensembl_id, entrez_id, and hgnc_symbol are from BioMart annotations (e.g. biomaRt::getLDS()).

We can see how many GO terms are annotated to each biodomain:

biodom %>% group_by(Biodomain) %>% summarise(n_term = length(unique(GO_ID))) %>% arrange(n_term) 
# A tibble: 19 × 2
   Biodomain                     n_term
   <chr>                          <int>
 1 Tau Homeostasis                   10
 2 APP Metabolism                    37
 3 RNA Spliceosome                   51
 4 Myelination                       66
 5 Oxidative Stress                  98
 6 Autophagy                        112
 7 DNA Repair                       122
 8 Apoptosis                        218
 9 Endolysosome                     236
10 Metal Binding and Homeostasis    301
11 Cell Cycle                       320
12 Vasculature                      374
13 Epigenetic                       432
14 Structural Stabilization         498
15 Mitochondrial Metabolism         532
16 Proteostasis                     758
17 Lipid Metabolism                 875
18 Immune Response                  979
19 Synapse                         1379

The biodomains range in size from Tau Homeostasis (10 terms) up to Synapse (1,379 terms).

We can also investigate the individual genes annotated to each biodomain term.

biodom %>% filter(GOterm_Name == 'amyloid-beta formation') %>% pull(symbol) %>% unlist()
[1] "PSEN1"  "APH1B"  "ABCA7"  "BACE1"  "DYRK1A" "NCSTN"  "PSEN2"  "PSENEN"
[9] "APH1A" 

So the genes associated with amyloid-beta formation within the APP Metabolism biodomain include PSEN1, PSEN2, BACE1, ABCA7, NCSTN, and others.

Annotate enriched terms with biological domain

Let’s re-characterize the 5xFAD functional enrichments we computed previously with the biological domain annotations and see if we can get more context about what is changing in that model. We’ll focus first on the GSEA results and start by annotating the results with biodomain groupings.

enr.bd = enr@result %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain)

head(enr.bd[,1:4])
        Biodomain ONTOLOGY         ID                          Description
1 Immune Response       BP GO:0002376                immune system process
2 Immune Response       BP GO:0006955                      immune response
3            <NA>       BP GO:0006952                     defense response
4            <NA>       BP GO:0002682  regulation of immune system process
5            <NA>       BP GO:0009607          response to biotic stimulus
6            <NA>       BP GO:0043207 response to external biotic stimulus

Not all of the enriched terms are annotated to a biological domain. Some terms are too broad and not specific (e.g. ‘defense response’), while others may not have been captured by a biological domain annotation yet (e.g. ‘regulation of immune system process’). Remember that the conception of the biodomains involved a requirement that they be modifiable, and these terms may be added to the biodomain in the future. Let’s modify the Biodomain value for terms that aren’t annotated to a domain to ‘none’.

enr.bd$Biodomain[is.na(enr.bd$Biodomain)] <- 'none'
head(enr.bd[,1:4])
        Biodomain ONTOLOGY         ID                          Description
1 Immune Response       BP GO:0002376                immune system process
2 Immune Response       BP GO:0006955                      immune response
3            none       BP GO:0006952                     defense response
4            none       BP GO:0002682  regulation of immune system process
5            none       BP GO:0009607          response to biotic stimulus
6            none       BP GO:0043207 response to external biotic stimulus

How many terms are enriched from each domain?

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )
arrange(bd.tally, desc(n_sig_term))
# A tibble: 20 × 3
# Rowwise: 
   domain                        n_term n_sig_term
   <chr>                          <int>      <int>
 1 none                               0        141
 2 Immune Response                  979        122
 3 Structural Stabilization         498          8
 4 Apoptosis                        218          5
 5 Synapse                         1379          5
 6 Autophagy                        112          4
 7 Proteostasis                     758          4
 8 Lipid Metabolism                 875          3
 9 Vasculature                      374          1
10 Myelination                       66          1
11 Mitochondrial Metabolism         532          1
12 Endolysosome                     236          0
13 Epigenetic                       432          0
14 Oxidative Stress                  98          0
15 APP Metabolism                    37          0
16 Tau Homeostasis                   10          0
17 RNA Spliceosome                   51          0
18 Cell Cycle                       320          0
19 Metal Binding and Homeostasis    301          0
20 DNA Repair                       122          0

Many enriched terms don’t map to a domain (134), but most do (154). Of those that do, the vast majority map into the Immune Response biodomain.

We can plot the enrichment results, stratified by biodomain:

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>% 
  arrange(Biodomain, p.adjust) 

ggplot(enr.bd, aes(NES, Biodomain)) +
  geom_violin(data = subset(enr.bd, NES > 0), 
              aes(color = color), scale = 'width')+
  geom_violin(data = subset(enr.bd, NES < 0), 
              aes(color = color), scale = 'width')+
  geom_jitter(aes(size = -log10(p.adjust), fill = color), 
              color = 'grey20', shape = 21, alpha = .3)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  scale_y_discrete(drop = F)+
  scale_fill_identity()+scale_color_identity()
Warning: Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Groups with fewer than two data points have been dropped.
Warning: Removed 9 rows containing missing values (`geom_point()`).
plot of chunk enrichment_results

plot of chunk enrichment_results

We can produce a similar plot for the ORA results. Let’s combine the up and down analyses into one, and make the adjusted p-value signed so that we can look at terms that are up and down separately:

enr.ora = bind_rows(
  enr.up@result %>% mutate(dir = 'up'),
  enr.dn@result %>% mutate(dir = 'dn')
  ) %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain)

enr.ora$Biodomain[is.na(enr.ora$Biodomain)] <- 'none'

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.ora$ID[ enr.ora$Biodomain == domain ] %>% unique() %>% length()
    )

enr.ora <- full_join(enr.ora, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, n_sig_term) %>% pull(domain))) %>% 
  arrange(Biodomain, p.adjust) %>% 
  mutate(signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'dn', -1*signed_logP, signed_logP))
  
ggplot(enr.ora, aes(signed_logP, Biodomain)) +
  geom_violin(data = subset(enr.ora, dir == 'up'), aes(color = color), scale = 'width')+
  geom_violin(data = subset(enr.ora, dir == 'dn'), aes(color = color), scale = 'width')+
  geom_jitter(aes(size = Count, fill = color), color = 'grey20', shape = 21, alpha = .3)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  scale_y_discrete(drop = F)+
  scale_fill_identity()+scale_color_identity()
plot of chunk ORA_results

plot of chunk ORA_results

Because there are many more significantly enriched terms from the ORA, there is more detail to consider. The dominant signal is stil the up-regulation of terms from the Immune Response biodomain. There is also nearly exclusive up-regulation of terms from the Autophagy, Metal Binding and Homeostasis, Oxidative Stress, and APP Metabolism domains. The most down-regulated terms are from the Synapse biodomain.

Challenge 3

1) While there are many more terms identified using ORA compared with GSEA, how many Biodomain terms are identified in both analyses vs unique to each? 2) Which biodomains and terms are uniquely identified in the GSEA analysis?

Solution to Challenge 3

1).

intersect(enr.bd$Description[enr.bd$Biodomain!='none'], enr.ora$Description[enr.ora$Biodomain!='none']) %>% length()  

133 in common

setdiff(enr.bd$Description[enr.bd$Biodomain!='none'], enr.ora$Description[enr.ora$Biodomain!='none']) %>% length()

20 unique to GSEA

setdiff(enr.ora$Description[enr.ora$Biodomain!='none'], enr.bd$Description[enr.bd$Biodomain!='none']) %>% length()

908 unique to ORA

2).

enr.bd %>% filter( Description %in% setdiff(enr.bd$Description[enr.bd$Biodomain!='none'], enr.ora$Description[enr.ora$Biodomain!='none']) ) %>% group_by(Biodomain) %>% summarise(terms = paste0(Description, collapse = ', '))

Most terms unique to GSEA are from the ‘Immune Response’ domain (13), but there are also terms from ‘Structural Stabilization’ (4), ‘Apoptosis’ (2), and ‘Synapse’ (1) that are unique to the GSEA results.

Biological domains over model age

Let’s take a look at how biodomain enrichments from the MODEL-AD strains change with respect to the age of the model. To save some time, let’s load some pre-computed enrichment results for mouse & human data.

mm.ora <- bind_rows(
  read_tsv('../data/5xFAD_ora_results.tsv', col_types = cols()) %>%
    mutate(age = str_remove_all(age, ' mo') %>% as.double()),
  read_tsv('../data/LOAD1_ora_results.tsv', col_types = cols())
)
mm.gsea <- bind_rows(
  read_tsv('../data/5xFAD_gsea_results.tsv', col_types = cols()) %>%
    mutate(age = str_remove_all(age, ' mo') %>% as.double()),
  read_tsv('../data/LOAD1_gsea_results.tsv', col_types = cols())
)

hs.ora <- read_tsv('../data/Hsap_ora_results.tsv', col_types = cols())
hs.gsea <- read_tsv('../data/Hsap_gsea_results.tsv', col_types = cols())

Annotate and plot the 5xFAD Male results, this time faceting by biodomain to compare across the age groups:

enr.bd = mm.ora %>% 
  filter(model == '5xFAD', sex == 'male') %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain) %>% 
  mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain),
         signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'ora_dn', -1*signed_logP, signed_logP))

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  mutate(age = factor(age, enr.bd$age %>% unique() %>% sort(decreasing = T) ),
         Biodomain = factor(Biodomain, levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>% 
  arrange(Biodomain, p.adjust)

ggplot(enr.bd, aes(signed_logP, age)) +
  geom_violin(data = subset(enr.bd, signed_logP > 0), aes(color = color), scale = 'width')+
  geom_violin(data = subset(enr.bd, signed_logP < 0), aes(color = color), scale = 'width')+
  geom_point(aes(size = Count, fill = color, group = ID), 
              color = 'grey20', shape = 21, alpha = .5, position = position_jitter(width = .25, seed = 2)  )+
  facet_wrap(~Biodomain)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  scale_fill_identity()+scale_color_identity()
plot of chunk Male_5xFAD_biodomain

plot of chunk Male_5xFAD_biodomain

The Immune Response terms are over-represented among the up-regulated genes at all ages, but the overall significance of the associations increase with age. Similar age-dependent trends are apparent in other domains as well. Of note is the increasing number of terms from the Synapse domain over-represented among the down-regulated genes as the animals age.

Let’s consider the APOE4Trem2 results. Annotate the biodomains and plot the results of ORA from the DE genes in APOE4Trem2 Males:

enr.bd = mm.ora %>% 
  filter(model == 'APOE4Trem2', sex == 'male') %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain) %>% 
  mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain),
         signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'ora_dn', -1*signed_logP, signed_logP))

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>%  
  filter(!is.na(age)) %>% 
  mutate(age = factor(age, levels = enr.bd$age %>% unique() %>% sort(decreasing = T)),
         Biodomain = factor(Biodomain, 
                            levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>% 
  arrange(age, desc(Biodomain),  p.adjust) 

ggplot(enr.bd, aes(signed_logP, age)) +
  geom_violin(data = subset(enr.bd, signed_logP > 0), aes(color = color), scale = 'width')+
  geom_violin(data = subset(enr.bd, signed_logP < 0), aes(color = color), scale = 'width')+
  geom_jitter(aes(size = Count, fill = color), color = 'grey20', shape = 21, alpha = .5)+
  facet_wrap(~Biodomain)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  scale_fill_identity()+scale_color_identity()
plot of chunk APOE4Trem2_biodomain

plot of chunk APOE4Trem2_biodomain

There aren’t any terms enriched at the 4 month time point. Unlike the 5xFAD mice, APOE4Trem2 mice show a significant down-regulation of Immune Response terms, particularly at 8 months, as well as down-regulation in terms from APP Metabolism domain. There is an enrichment in terms from the Synapse domain for down-regulated genes, despite enrichments among down-regulated genes in cell death related domains Autophagy and Apoptosis.

Challenge 4

Terms from Immune Response are enriched among the genes up in 5xFAD, but down in APOE4Trem2. Are these similar or different terms in each set? Focus on the 12 month old animals from each model.

Solution to Challenge 4

First, get the list of terms associated with up- and down-regulated genes in each model:

xf = mm.ora %>% filter(model == '5xFAD', sex == 'male', age == 12, dir == 'ora_up') %>% left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% filter(Biodomain == 'Immune Response') %>% pull(Description)
at = mm.ora %>% filter(model == 'APOE4Trem2', sex == 'male', age == 12, dir == 'ora_dn') %>% left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% filter(Biodomain == 'Immune Response') %>% pull(Description)

Next how many terms are in each list and how many terms overlap?

length(xf)
length(at)
length(intersect(at,xf))

Almost all terms (95.5%) enriched in the down-regulated analysis from APOE4Trem2 are also found in the enrichments from the up-regulated genes in 5xFAD.

Comparing Human and Mouse functional enrichments

Finally let’s consider how we can compare the functional enrichments in the mouse models with functional enrichments from the AMP-AD cohorts. We can first take a look at the functional enrichments from the different human cohorts.

The cohorts ans tissues available are:

hs.ora %>% select(Study, Tissue) %>% distinct()
# A tibble: 6 × 2
  Study  Tissue
  <chr>  <chr> 
1 MAYO   CBE   
2 MAYO   TCX   
3 MSSM   STG   
4 MSSM   PHG   
5 MSSM   IFG   
6 ROSMAP DLPFC 

We can focus on one tissue per cohort, let’s look at TCX for Mayo, PHG for Mt Sinai, and DLPFC for ROSMAP.

enr.bd = hs.ora %>% 
  filter(Tissue %in% c('TCX','PHG','DLPFC')) %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain) %>% 
  mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain),
         signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'ora_dn', -1*signed_logP, signed_logP))

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  filter(!is.na(Study)) %>% 
  mutate(Biodomain = factor(Biodomain, levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>% 
  arrange(Biodomain, p.adjust)

ggplot(enr.bd, aes(signed_logP, Biodomain)) +
  geom_violin(data = subset(enr.bd, signed_logP > 0), aes(color = color), scale = 'width')+
  geom_violin(data = subset(enr.bd, signed_logP < 0), aes(color = color), scale = 'width')+
  geom_point(aes(size = Count, fill = color, group = ID), 
              color = 'grey20', shape = 21, alpha = .5, position = position_jitter(width = .25, seed = 2)  )+
  facet_wrap(~paste0(Study, ': ', Tissue))+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  scale_fill_identity()+scale_color_identity()
plot of chunk biodomain_one_tissue_per_cohort

plot of chunk biodomain_one_tissue_per_cohort

Huh, the Mayo and MSSM look largely similar, but the ROSMAP DLPFC doesn’t have very many enriched terms. Let’s ignore the ROSMAP result for now and compare the terms enriched between Mayo TCX and the 5xFAD males. First we can join up the enriched terms between the sets into one data frame.

comb_5x = bind_rows(
  hs.ora %>% 
    filter(Study == 'MAYO', Tissue == 'TCX') %>% 
    mutate(species = 'hs') %>% 
    select(ID, Description, species, dir, p.adjust, Count),
  mm.ora %>% 
    filter(model == '5xFAD', sex == 'male', age == 12) %>% 
    mutate(species = 'mm') %>% 
    select(ID, Description, species, dir, p.adjust, Count)) %>% 
  mutate(signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'ora_dn', -1*signed_logP, signed_logP)) %>% 
  pivot_wider(id_cols = c(ID, Description), 
              names_from = species, 
              values_from = signed_logP
  ) %>% 
  unnest(cols = c(hs,mm)) %>% 
  mutate(across(c(hs,mm), ~ if_else(is.na(.x), 0, .x)))
Warning: Values from `signed_logP` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(ID, Description, species) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)
head(comb_5x)
# A tibble: 6 × 4
  ID         Description                              hs    mm
  <chr>      <chr>                                 <dbl> <dbl>
1 GO:0048514 blood vessel morphogenesis             34.3  0   
2 GO:0001525 angiogenesis                           28.0  0   
3 GO:0009611 response to wounding                   16.8  0   
4 GO:0001503 ossification                           16.2  2.00
5 GO:1901342 regulation of vasculature development  16.2 10.1 
6 GO:0045765 regulation of angiogenesis             16.1 10.3 

Great! Now let’s annotate the biodomains and plot it up:

enr.bd <- comb_5x %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain) %>% 
  mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain))

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  mutate(Biodomain = factor(Biodomain, 
                            levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>% 
  arrange(Biodomain)

ggplot(enr.bd, aes(hs, mm)) +
  geom_point(aes(fill = color),color = 'grey20', shape = 21, alpha = .5)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  geom_hline(yintercept = 0, lwd = .5, lty = 2)+
  facet_wrap(~Biodomain, scales = 'free')+
  scale_fill_identity()
plot of chunk annotated_biodomains

plot of chunk annotated_biodomains

There are several terms from multiple biodomains that are enriched in similar directions (i.e. from the up-regulated genes in both human AD and 5xFAD). There are fewer that are enriched in opposite directions (i.e. up in human AD and down in 5xFAD), but there are many terms that are unique to either the human or the mouse data set.

What if we compare the APOE4Trem2 males to Mayo TCX?

comb_at = bind_rows(
  hs.ora %>% 
    filter(Study == 'MAYO', Tissue == 'TCX') %>% 
    mutate(species = 'hs') %>% 
    select(ID, Description, species, dir, p.adjust, Count),
  mm.ora %>% 
    filter(model == 'APOE4Trem2', sex == 'male', age == 12) %>% 
    mutate(species = 'mm') %>% 
    select(ID, Description, species, dir, p.adjust, Count)) %>% 
  mutate(signed_logP = -log10(p.adjust),
         signed_logP = if_else(dir == 'ora_dn', -1*signed_logP, signed_logP)) %>% 
  pivot_wider(id_cols = c(ID, Description), 
              names_from = species, 
              values_from = signed_logP
  ) %>% 
  unnest(cols = c(hs,mm)) %>% 
  mutate(across(c(hs,mm), ~ if_else(is.na(.x), 0, .x)))

enr.bd <- comb_at %>% 
  left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% 
  relocate(Biodomain) %>% 
  mutate(Biodomain = if_else( is.na(Biodomain), 'none', Biodomain))

bd.tally = tibble(domain = c(unique(biodom$Biodomain), 'none')) %>% 
  rowwise() %>% 
  mutate(
    n_term = biodom$GO_ID[ biodom$Biodomain == domain ] %>% unique() %>% length(),
    n_sig_term = enr.bd$ID[ enr.bd$Biodomain == domain ] %>% unique() %>% length()
    )

enr.bd <- full_join(enr.bd, dom.lab, by = c('Biodomain' = 'domain')) %>% 
  mutate(Biodomain = factor(Biodomain, 
                            levels = arrange(bd.tally, desc(n_sig_term)) %>% pull(domain))) %>% 
  arrange(Biodomain)

ggplot(enr.bd, aes(hs, mm)) +
  geom_point(aes(fill = color),color = 'grey20', shape = 21, alpha = .5)+
  geom_vline(xintercept = 0, lwd = .5, lty = 2)+
  geom_hline(yintercept = 0, lwd = .5, lty = 2)+
  facet_wrap(~Biodomain, scales = 'free')+
  scale_fill_identity()
Warning: Removed 1 rows containing missing values (`geom_point()`).
plot of chunk APOE4Trem2_MayoTCX

plot of chunk APOE4Trem2_MayoTCX

Many of the terms that are enriched among the down-regulated genes from APOE4Trem2 are enriched from genes up-regulated in AD, especially from Immune Response and Lipid Metabolism domains.

Overall, by aligning human and mouse omics signatures through the lens of domains affected in each context, we can get a better understanding of the relationships between the biological processes affected in each context.

Challenge 5

We haven’t focused on it much, but there are terms that aren’t in a biodomain and are enriched in opposite direction between the human and mouse model tests. Are there any terms that aren’t annotated to a biodomain, but are up in humand and down in both animal model tests?

Solution to Challenge 5

First, get the list of terms associated with up- and down-regulated genes in each model:

intersect( comb_5x %>% left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>%  filter(is.na(Biodomain), hs > 0, mm < 0) %>% pull(Description), comb_at %>% left_join(., biodom %>% select(Biodomain, ID = GO_ID), by = 'ID') %>% filter(is.na(Biodomain), hs > 0, mm < 0) %>% pull(Description) )

Intriguingly there are some apparently relevant terms, such as “negative regulation of gliogenesis”, that are up in human cohort and down in both mouse models. The biological domain annotations are a helpful tool, but certainly not the whole picture.

Session Info

sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.6.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggridges_0.5.4        synapser_1.0.59       cowplot_1.1.1        
 [4] clusterProfiler_4.8.1 org.Hs.eg.db_3.17.0   org.Mm.eg.db_3.17.0  
 [7] AnnotationDbi_1.62.1  IRanges_2.34.0        S4Vectors_0.38.1     
[10] Biobase_2.60.0        BiocGenerics_0.46.0   lubridate_1.9.2      
[13] forcats_1.0.0         stringr_1.5.0         dplyr_1.1.2          
[16] purrr_1.0.1           readr_2.1.4           tidyr_1.3.0          
[19] tibble_3.2.1          ggplot2_3.4.2         tidyverse_2.0.0      
[22] knitr_1.43           

loaded via a namespace (and not attached):
  [1] DBI_1.1.3               bitops_1.0-7            gson_0.1.0             
  [4] shadowtext_0.1.2        gridExtra_2.3           rlang_1.1.1            
  [7] magrittr_2.0.3          DOSE_3.26.1             compiler_4.3.0         
 [10] RSQLite_2.3.1           png_0.1-8               vctrs_0.6.2            
 [13] reshape2_1.4.4          pkgconfig_2.0.3         crayon_1.5.2           
 [16] fastmap_1.1.1           XVector_0.40.0          labeling_0.4.2         
 [19] ggraph_2.1.0            utf8_1.2.3              HDO.db_0.99.1          
 [22] tzdb_0.4.0              enrichplot_1.20.0       bit_4.0.5              
 [25] xfun_0.39               zlibbioc_1.46.0         cachem_1.0.8           
 [28] aplot_0.1.10            jsonlite_1.8.5          GenomeInfoDb_1.36.0    
 [31] blob_1.2.4              highr_0.10              BiocParallel_1.34.2    
 [34] tweenr_2.0.2            parallel_4.3.0          R6_2.5.1               
 [37] stringi_1.7.12          RColorBrewer_1.1-3      reticulate_1.28        
 [40] GOSemSim_2.26.0         Rcpp_1.0.10             downloader_0.4         
 [43] Matrix_1.5-4            splines_4.3.0           igraph_1.4.3           
 [46] timechange_0.2.0        tidyselect_1.2.0        qvalue_2.32.0          
 [49] viridis_0.6.3           codetools_0.2-19        lattice_0.21-8         
 [52] plyr_1.8.8              treeio_1.24.1           withr_2.5.0            
 [55] KEGGREST_1.40.0         evaluate_0.21           gridGraphics_0.5-1     
 [58] scatterpie_0.2.1        polyclip_1.10-4         Biostrings_2.68.1      
 [61] ggtree_3.8.0            pillar_1.9.0            ggfun_0.0.9            
 [64] generics_0.1.3          vroom_1.6.3             rprojroot_2.0.3        
 [67] RCurl_1.98-1.12         hms_1.1.3               tidytree_0.4.2         
 [70] munsell_0.5.0           scales_1.2.1            glue_1.6.2             
 [73] lazyeval_0.2.2          tools_4.3.0             data.table_1.14.8      
 [76] fgsea_1.26.0            graphlayouts_1.0.0      fastmatch_1.1-3        
 [79] tidygraph_1.2.3         grid_4.3.0              ape_5.7-1              
 [82] colorspace_2.1-0        nlme_3.1-162            patchwork_1.1.2        
 [85] GenomeInfoDbData_1.2.10 ggforce_0.4.1           cli_3.6.1              
 [88] fansi_1.0.4             viridisLite_0.4.2       gtable_0.3.3           
 [91] yulab.utils_0.0.6       digest_0.6.31           ggrepel_0.9.3          
 [94] ggplotify_0.1.0         rjson_0.2.21            farver_2.1.1           
 [97] memoise_2.0.1           lifecycle_1.0.3         here_1.0.1             
[100] httr_1.4.6              GO.db_3.17.0            bit64_4.0.5            
[103] MASS_7.3-58.4          

Key Points

  • Functional signatures are a useful point of comparison between species.

  • Subsetting functions into associated Biological Domains allows for more granular comparisons.