Last updated: 2024-01-15
Checks: 7 0
Knit directory: pdx_rnaseq/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20231121)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 3963225. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .RData
Ignored: .Rhistory
Ignored: analysis/.DS_Store
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/MDS_plot.jpeg
Untracked: analysis/MDS_plot.tiff
Untracked: analysis/heatmap.jpeg
Untracked: data/counts/
Untracked: data/david/
Untracked: limma_treat.csv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/limma_analysis.Rmd
) and
HTML (docs/limma_analysis.html
) files. If you’ve configured
a remote Git repository (see ?wflow_git_remote
), click on
the hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 3963225 | Sally Beard | 2024-01-15 | Correct wording a spelling |
html | 0beffae | Sally Beard | 2024-01-15 | Build site. |
Rmd | 2bd560e | Sally Beard | 2024-01-15 | Add pathway analysis and correct grammar |
html | b745edd | Sally Beard | 2023-11-22 | Build site. |
Rmd | 99eab40 | Sally Beard | 2023-11-22 | Add analysis |
The sequencing data used in the following study was generated at AGRF (Australian Genome Research Facility) using an Illumina stranded mRNA library prep. RNA was collected from PDX tumours from three mice each for eight different PDX tumours, resulting in 24 samples. Samples were sequenced on the Novaseq 6000 platform to generate 150bp paired-end reads. All samples were sequenced across three lanes.
Sequence quality was assessed by fastqc
and was found to
be high for all samples. Reads were mapped to GRch38 Ensembl release 110
with hisat2
(v2.2.1) and reads aligning with genes were
quantified using HTseq
(v3.0.3).
# Import required libraries
library(limma)
library(edgeR)
library(tidyverse)
library(RColorBrewer)
library(Homo.sapiens)
library(patchwork)
library(knitr)
library(kableExtra)
library(ggrepel)
library(gplots)
library(Glimma)
library(ggplot2)
library(tibble)
library(glue)
library(cowplot)
library(ggforce)
Set up DGElist
object for downstream analysis and add
sample-specific information and gene annotations.
# List count files from working directory
files <- list.files("./data/counts/", pattern = "_sorted\\.reverse_counts$")
# edgeR has a function readDGE for combining separate count files into one matrix of counts in one step
counts <- readDGE(files, path = "./data/counts/", columns = c(1,2), header = FALSE)
# Remove meta tags as library size includes counts from the meta tags
MetaTags <- grep("^__", rownames(counts))
counts <- counts[-MetaTags, ]
# Update library size to reflect mapped reads
counts$samples$lib.size <- colSums(counts$counts)
# Obtain sample names from file names and add sample-specific information
sample <- strsplit2(files, "_")[,1:3]
sample_names <- apply(sample, 1, function(row) paste(row, collapse = "_"))
counts$samples$mouse <- sample[,2]
counts$samples$pdx <- as.factor(sample[,1])
replicates <- rep(c(1,1,1,2,2,2,3,3,3), times = 8)
counts$samples$replicates <- replicates
sample_name <- paste(counts$samples$pdx, counts$samples$replicates, sep = "_")
counts$samples$sample <- sample_name
counts$samples$lane <- as.factor(strsplit2(files, "_")[,6])
# Import information for each gene
geneid <- rownames(counts)
ensembl_ids <- sub("\\..*", "", geneid)
genes <- select(Homo.sapiens, keys=ensembl_ids, columns=c("SYMBOL", "TXCHROM", "ENTREZID"),
keytype="ENSEMBL")
genes <- genes[!duplicated(genes$ENSEMBL),]
counts$genes <- genes
# Remove genes without ENTREZ IDs
keep <- !is.na(counts$genes$ENTREZID) & !is.null(counts$genes$ENTREZID)
counts <- counts[keep, ]
The MDS plots below show that there is no batch effect from sequencing lane, so technical replicates run across the three lanes were summed together.
# Make MDS plots to assess technical replicate
lcpm <- cpm(counts, log=TRUE)
col.pdx <- counts$samples$pdx
levels(col.pdx) <- brewer.pal(nlevels(col.pdx), "Set1")
col.pdx <- as.character(col.pdx)
col.lane <- counts$samples$lane
levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=counts$samples$sample, col=col.pdx)
title(main="A. Sample groups")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
plotMDS(lcpm, labels=counts$samples$lane, col=col.lane)
title(main="B. Sequencing lanes")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
# Sum counts from technical replicates
samplenames <- counts$samples$sample
counts <- sumTechReps(counts, ID=samplenames)
# Add additional sample level information
elements <- c("response", "poor_response", "response", "poor_response", "response", "poor_response")
repeated_elements <- rep(elements, times = c(9,3,3,3,3,3))
group <- list(elements = repeated_elements)
counts$samples$group <- group$elements
elements <- c("1", "2")
repeated_elements <- rep(elements, times = c(18, 6))
site <- list(elements = repeated_elements)
counts$samples$site <- site$elements
elements <- c("HGSOC", "OCS","HGSOC")
repeated_elements <- rep(elements, times = c(3,3,18))
pathology <- list(elements = repeated_elements)
counts$samples$pathology <- pathology$elements
elements <- c("ovarian_tumour", "omental_tumour", "biopsy", "ovarian_tumour", "omental_tumour", "ovarian_tumour", "omental_tumour")
repeated_elements <- rep(elements, times = c(6,3,3,3,3,3,3))
specimen <- list(elements = repeated_elements)
counts$samples$specimen <- specimen$elements
elements <- c("refractory", "sensitive", "refractory", "resistant", "sensitive")
repeated_elements <- rep(elements, times = c(9,3,6,3,3))
platinum <- list(elements = repeated_elements)
counts$samples$platinum <- platinum$elements
replicates <- rep(1:3, times = 8)
counts$samples$replicates <- replicates
sample <- paste(counts$samples$pdx, counts$samples$replicates, sep = "_")
counts$samples$sample <- sample
# Generate library statistics
cpm <- cpm(counts)
lcpm <- cpm(counts, log=TRUE)
L <- mean(counts$samples$lib.size) * 1e-6
M <- median(counts$samples$lib.size) * 1e-6
Genes with no or low counts are unlikely to be biologically
important, they provide little evidence for differential expression, and
interfere with statistical approximations, so they were filtered out
before performing the analysis. Genes with low or no counts (counts per
million reads that corresponded to less than ~10 reads in any 9 samples
(smallest group size)) were filtered out of the analysis, based on the
recommendations in the edgeR
documentation. Approximately
half the genes were removed at this step, leaving 18314 genes in the
analysis.
# Filter out genes with low expression
keep.exprs <- filterByExpr(counts, group=counts$samples$group)
table(keep.exprs)
keep.exprs
FALSE TRUE
18144 18314
counts_filtered <- counts[keep.exprs,, keep.lib.sizes=FALSE]
samplenames <- colnames(counts_filtered)
Plotting the distribution of log-CPM values (below) showed that before filtering (A) many genes in each sample had low or no expression, with negative log-CPM values. These genes were removed from the analysis after filtering (B).
# Make density plots for counts before and after filtering out genes with low expression
par(mfrow = c(1,1))
lcpmz <- lcpm
lcpm.cutoff <- log2(10/M + 2/L)
nsamples <- ncol(counts)
col <- scales::hue_pal()(nsamples)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.4), las=2, main="", xlab="")
title(main="A. Raw data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
lcpm2 <- cpm(counts_filtered, log=TRUE)
plot(density(lcpm2[,1]), col=col[1], lwd=2, ylim=c(0,0.4), las=2, main="", xlab="")
title(main="B. Filtered data", xlab="Log-cpm")
abline(v=lcpm.cutoff, lty=3)
for (i in 2:nsamples){
den <- density(lcpm2[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
TMM normalization was applied to account for any highly expressed genes monopolizing the sequencing and to account for the underlying differences in the distributions of expressed genes between samples.
# Apply TMM normalization
counts <- calcNormFactors(counts_filtered, method = "TMM")
Effective library sizes varied from around 50 million reads to more than 250 million reads, with the median just under 150 million reads (A). The range and distribution of expression values looked fairly uniform between samples, and TMM normalization further improved this.
# Create plots to explore library size and normalization
dat <- data.frame(lib = counts$samples$lib.size,
status = counts$samples$group,
sample = colnames(counts))
p1 <- ggplot(dat, aes(x = sample, y = lib, fill = status)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(size = 8)) +
labs(x = "Sample", y = "Library size",
fill = "CHK1i", title = "A. Library size after filtering") +
geom_hline(yintercept = median(dat$lib), linetype = "dashed") +
scale_x_discrete(limits = dat$sample)
dat <- reshape2::melt(cpm(counts, normalized.lib.sizes = FALSE, log = TRUE),
value.name = "cpm")
dat$status <- rep(counts$samples$group, each = nrow(counts))
colnames(dat)[2] <- "sample"
p2 <- ggplot(dat, aes(x = sample, y = cpm, fill = status)) +
geom_boxplot(show.legend = FALSE, outlier.size = 0.75) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7), plot.title = element_text(size = 8)) +
labs(x = "Sample", y = "Log-CPM",
fill = "CHK1i treatment", title = "B. Expression values after filtering") +
geom_hline(yintercept = median(dat$lib), linetype = "dashed")
dat <- reshape2::melt(cpm(counts, normalized.lib.sizes = TRUE, log = TRUE),
value.name = "cpm")
dat$status <- rep(counts$samples$group, each = nrow(counts))
colnames(dat)[2] <- "sample"
p3 <- ggplot(dat, aes(x = sample, y = cpm, fill = status)) +
geom_boxplot(show.legend = FALSE, outlier.size = 0.75) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7), plot.title = element_text(size = 8)) +
labs(x = "Sample", y = "Log-CPM",
fill = "CHK1i treatment", title = "C. Expression values after normalization") +
geom_hline(yintercept = median(dat$lib), linetype = "dashed")
p1 / (p2 + p3) + plot_layout(guides = "collect")
Multi-dimensional scaling (MDS) plots show the largest sources of variation in the data. They are a good way of identifying structure in the data and exploring relationships between samples. The following plots examine the first dimensions colored for known features of the data. The samples do not cluster in any of the first dimensions based on their CHK1i responder status, suggesting that this is not the largest source of variation in the data. Sample 1177 is an outlier in the first dimension, which is confounded with it being the only OCS sample, for this reason, sample 1177 is not included the differential expression analysis. Sample 206 also appears more different from the other samples in dimension 2. The remaining samples cluster reasonably closely together, except for WO-2, which is a bit closer to 206 in the first and second dimensions, which may reflect that these are the only two platinum sensitive PDX models. Overall, the large amount of variation in the first dimension suggests there are likely to be tumor specific differences in expression driving the variation.
# Make MDS plots to explore sources of variation in the data
dims <- list(c(1,2), c(1,3), c(2,3), c(3,4))
vars <- c("group", "pdx", "site", "pathology", "specimen", "platinum")
patches <- vector("list", length(vars))
for(i in 1:length(vars)){
p <- vector("list", length(dims))
for(j in 1:length(dims)){
mds <- plotMDS(cpm(counts, log = TRUE), top = 1000, gene.selection="common",
plot = FALSE, dim.plot = dims[[j]])
dat <- tibble::tibble(x = mds$x, y = mds$y,
sample = counts$samples$sample,
variable = pull(counts$samples, vars[i]))
p[[j]] <- ggplot(dat, aes(x = x, y = y, colour = variable)) +
geom_text(aes(label = sample), size = 2.5) +
labs(x = glue::glue("Leading logFC dim {dims[[j]][1]}"),
y = glue::glue("Leading logFC dim {dims[[j]][2]}"),
colour = vars[i])
}
patches[[i]] <- wrap_elements(wrap_plots(p, ncol = 2, guides = "collect") +
plot_annotation(title = glue::glue("Coloured by: {vars[i]}")) &
theme(legend.position = "bottom"))
}
patches
[[1]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
[[2]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
[[3]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
[[4]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
[[5]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
[[6]]
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
# Create MDS plot for supplementary figure
dims <- list(c(1, 2), c(1, 3), c(2, 3), c(3, 4))
# Create a list to store the MDS plots
mds_plots <- list()
for (i in 1:length(dims)) {
mds <- plotMDS(cpm(counts, log = TRUE), top = 1000, gene.selection = "common",
plot = FALSE, dim.plot = dims[[i]])
dat <- tibble::tibble(x = mds$x, y = mds$y,
pdx = pull(counts$samples, "pdx"),
group = pull(counts$samples, "group"),
pathology = pull(counts$samples, "pathology"))
p <- ggplot(dat, aes(x = x, y = y)) +
geom_point(aes(color = pdx, shape = group, fill = pathology), size = 3) +
scale_shape_manual(values = c("response" = 16, "poor_response" = 17)) +
labs(x = glue::glue("Leading logFC dim {dims[[i]][1]}"),
y = glue::glue("Leading logFC dim {dims[[i]][2]}")) +
theme(legend.position = "none") # Remove individual legends
# Add a circle around points with pathology "OCS" using ggforce
p <- p + geom_circle(data = filter(dat, pathology == "OCS"),
aes(x0 = x, y0 = y, r = 0.2),
color = "grey", fill = NA, size = 1)
mds_plots[[i]] <- p
}
# Combine the MDS plots into a single plot
mds_arranged <- plot_grid(
plot_grid(mds_plots[[1]], mds_plots[[2]], ncol = 2) + theme(legend.position = "none"),
plot_grid(mds_plots[[3]], mds_plots[[4]], ncol = 2) + theme(legend.position = "none"),
nrow = 2
)
# Display the arranged MDS plots
print(mds_arranged)
Version | Author | Date |
---|---|---|
b745edd | Sally Beard | 2023-11-22 |
In the following analysis, a test is set up to compare responders with poor-responders, excluding outlier sample 1177. A means model was fitted where the model parameters represent the means of each pdx model, and responsive models were compared to non-responsive models by testing the difference between the parameter estimates in each group.
# Set up design matrix with parameter for each pdx model
pdx <- as.factor(counts$samples$pdx)
design1 <- model.matrix(~0 + pdx)
colnames(design1) <- gsub("group", "", colnames(design1))
new_names <- c("pdxWO_19", "pdxWO_2")
old_names <- c("pdxWO-19", "pdxWO-2")
for (i in 1:length(old_names)) {
colnames(design1)[colnames(design1) == old_names[i]] <- new_names[i]
}
rownames(design1) <- samplenames
# Set up contrast to test responsive versus non-responsive models
contrast <- makeContrasts(
resp.v.nonresp = (pdx29+pdx111+pdx201+pdxWO_19)/4-(pdx206+pdx931+pdxWO_2)/3,
levels=colnames(design1))
Apply linear modelling using the limma
function, which
uses the log-CPM values which are assumed to be normally distributed,
with precision-weights calculated by the voom
function for
the mean-variance relationship. Number of significant genes listed below
in table 1.
# Apply voom function to the normalised counts
par(mfrow=c(1,2))
v <- voom(counts, design1, plot=TRUE)
# Fit linear model and apply contrast for test
vfit <- lmFit(v, design1)
vfit.cont <- contrasts.fit(vfit, contrasts=contrast)
efit <- eBayes(vfit.cont)
plotSA(efit, main="Final model: Mean-variance trend")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
# Find number of differentially expressed genes
kable_styling(kable(summary(decideTests(efit)), caption="Table 1: Number of differentially expressed genes in responsive PDX versus poor-responders"))
resp.v.nonresp | |
---|---|
Down | 4175 |
NotSig | 9173 |
Up | 4966 |
Since the number of differentially expressed genes was high, the
treat
method was applied to set a fold change cutoff of
0.5, which reduced the number of significant genes to a more manageable
number. Table 2 shows the number of significantly differentially
expressed genes after applying the fold change cut off and running the
test. Table 3 shows the top 20 DE genes for the responder vs poor
responder test.
# Apply log fold change cutoff to find significant genes
tfit <- treat(vfit.cont, lfc=0.5)
# Make table with number of DEGs for each test
dt <- decideTests(tfit)
kable_styling(kable(summary(dt), caption="Table 2: Number of differentially expressed genes from treat method in responsive PDX versus poor-responders"))
resp.v.nonresp | |
---|---|
Down | 1505 |
NotSig | 14898 |
Up | 1911 |
# Make table with top DEGs for each test
top_treat_resp.v.nonresp <- data.frame(topTreat(tfit, n=20, sort.by = "P"))
kable_styling(kable(top_treat_resp.v.nonresp[,c(3,5,6,9)], caption="Table 3: Top 20 DE genes in responsive PDX versus poor-responders", digits = 20))
SYMBOL | logFC | AveExpr | adj.P.Val | |
---|---|---|---|---|
ENSG00000157954.15 | WIPI2 | 4.486325 | 4.7979026 | 3.098845e-12 |
ENSG00000227827.3 | PKD1P2 | -2.971579 | 2.4138203 | 5.377192e-11 |
ENSG00000247516.8 | MIR4458HG | 6.310135 | 0.9026952 | 6.914640e-11 |
ENSG00000242802.9 | AP5Z1 | 4.221873 | 3.4185210 | 4.670573e-10 |
ENSG00000265933.5 | LINC00668 | 7.824832 | -1.6367033 | 4.670573e-10 |
ENSG00000256229.8 | ZNF486 | -6.957183 | -1.4707623 | 4.939533e-10 |
ENSG00000146587.18 | RBAK | 4.084453 | 3.1692840 | 4.970848e-10 |
ENSG00000172465.14 | TCEAL1 | 3.514373 | 1.2064940 | 4.970848e-10 |
ENSG00000147246.10 | HTR2C | 6.420638 | -1.3074195 | 1.200918e-09 |
ENSG00000266729.5 | DSG1-AS1 | 6.235451 | -4.5682199 | 1.200918e-09 |
ENSG00000214652.6 | ZNF727 | 7.269333 | -0.1039279 | 1.200918e-09 |
ENSG00000254681.7 | PKD1P5 | -2.126198 | 3.2688327 | 1.423881e-09 |
ENSG00000105171.10 | POP4 | 2.084162 | 6.7024602 | 1.499018e-09 |
ENSG00000107165.13 | TYRP1 | 7.279802 | -0.8700701 | 1.935851e-09 |
ENSG00000104870.13 | FCGRT | 4.816698 | 4.7487712 | 2.221262e-09 |
ENSG00000171889.7 | MIR31HG | -5.954429 | -2.2989996 | 2.221262e-09 |
ENSG00000147475.17 | ERLIN2 | 2.028277 | 5.9374255 | 2.221262e-09 |
ENSG00000189067.14 | LITAF | -2.516439 | 6.4708330 | 2.356192e-09 |
ENSG00000106536.21 | POU6F2 | 6.185343 | -1.9000557 | 2.711042e-09 |
ENSG00000131943.20 | C19orf12 | 1.801995 | 6.2736091 | 2.711042e-09 |
To look at the levels of gene expression in individual samples the
GlMDPlot
function was used. This creates a two-panel
interactive MD (mean-difference) plot in an html. The left plot shows
the log-fold-change vs average expression with up regulated genes
colored red and down regulated genes colored blue. The right plot shows
the expression levels of a particular gene in each sample, grouped by
PDX model. Hovering over points on left plot will plot expression level
for the corresponding gene, clicking on points will fix the expression
plot to gene. Clicking on a row in the table has the same effect as
clicking on the corresponding gene in the plot. The values in the table
correspond to the test of responders versus poor-responders excluding
outlier pdx 1177 samples.
# Make Glimma MD plot
lcpm <- cpm(counts, log=TRUE)
glMDPlot(tfit, coef = 1, counts = lcpm, status=dt, main="MD plot: Responders versus poor responders", groups=pdx)
Click here to see interactive MD plot
Investigating the top genes shows that many of the genes called as differentially expressed are actually being driven by just one PDX model, and not reflective of the difference between responders and poor responders.
Below is a heatmap made using the expression of the top 200 genes. As expected from the MDS plots and visualizations, PDX 206 is behaving very differently to the other samples for a subset of genes. Also sample 931 clusters with the responders rather than the poor-responders.
# Specify color palette from RColorBrewer
my_palette <- colorRampPalette(brewer.pal(9, "RdBu"))(100)
group_colours <- c("response" = brewer.pal(8, "Set2")[1], # Green
"poor_response" = brewer.pal(8, "Set2")[4]) # Pink
# Create a legend
legend_labels <- unique(unlist(group))
legend_colours <- group_colours[legend_labels]
legend_text <- c("Responders", "Poor responders")
# Make heatmap using heatmap.2
dge <- data.frame(topTreat(tfit, n=Inf, sort.by = "P"))
lcpm2 <- lcpm[, -c(4:6)]
responsive.v.non <- dge$ENSEMBL[1:200]
i <- which(counts$genes$ENSEMBL %in% responsive.v.non)
heatmap.2(lcpm2[i,], scale="row",
labRow=counts$genes$SYMBOL[i], labCol=colnames(lcpm2),
col=my_palette, trace="none", density.info="none",
margin=c(8,6), lhei=c(2,10), dendrogram="column", key.title = "Expression Levels",
ColSideColors = group_colours[unlist(group)[4:24]],
cexCol = 1.8)
# Add the legend to the plot
legend("left", legend = legend_text, fill = legend_colours, title = "Sample Groups")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
To explore molecular pathways that are altered in CHK1i responsive PDX models compared to poor-responders, significant genes with FDR<0.05 were imported into The Database for Annotation, Visualization and Integrated Discovery, DAVID (v2023q3), run by the NIH, which is a web-based functional annotation tool. Up regulated genes and down regulated genes were tested separately to find terms that were over-represented among statistically significant genes than would be expected by chance. Gene Ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes (KEGG) biological pathways were interrogated. GO terms represent specific biological processes, molecular functions, or cellular components associated with the genes. The background list of genes was compiled from the full list of genes that were detected in any of the samples.
In the following outputs, fold enrichment indicates that the genes in the list are that many times more likely to be associated with a particular term compared to what would be expected by chance. The FDR (false-discovery rate) is calculated from p-values generated by a modified Fisher’s exact test adapted to measure the gene enrichment in annotation terms. The gene count represents how many genes from the input list of significant genes is present in that gene set or pathway.
# Input kegg data from DAVID output for upregulated genes
data <- read.csv("./data/david/david_kegg_up_treat.csv", header = TRUE)
# Filter data for FDR < 0.1
filtered_data <- data %>%
filter(FDR < 0.1)
# Create a dot plot using ggplot2
ggplot(filtered_data, aes(x = Fold.Enrichment, y = reorder(Term, Fold.Enrichment))) +
geom_point(aes(size = Count, color = FDR), alpha = 0.7) +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "Pathway Enrichment Analysis",
x = "Fold Enrichment",
y = "KEGG Pathways",
size = "Gene Count",
color = "FDR",
subtitle = "Genes upregulated in responsive pdx compared to poor-responders"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5)
)
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
# Input GO data from DAVID output for upregulated genes
data <- read.csv("./data/david/david_go_up_treat.csv", header = TRUE)
# Filter data for FDR < 0.1
filtered_data <- data %>%
filter(FDR < 0.1)
# Change gene counts to numerical values
filtered_data$Count <- as.numeric(filtered_data$Count)
# Create a faceted dot plot using ggplot2 with equal spacing on the y-axis
ggplot(filtered_data, aes(x = Fold.Enrichment, y = reorder(Term, Fold.Enrichment))) +
geom_point(aes(size = Count, color = FDR), alpha = 0.7) +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "GO Term Enrichment Analysis",
x = "Fold Enrichment",
y = "GO Terms",
size = "Gene Count",
color = "FDR",
subtitle = "Genes upregulated in responsive pdx compared to poor-responders"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
strip.text = element_text(size = 7)
) +
facet_grid(Category ~ ., scales = "free_y", space = "free_y")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
# Input kegg data from DAVID output for down regulated genes
data <- read.csv("./data/david/david_kegg_limma_down.csv", header = TRUE)
# Filter data for FDR < 0.1
filtered_data <- data %>%
filter(FDR < 0.1)
# Create a dot plot using ggplot2
ggplot(filtered_data, aes(x = Fold.Enrichment, y = reorder(Term, Fold.Enrichment))) +
geom_point(aes(size = Count, color = FDR), alpha = 0.7) +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "Pathway Enrichment Analysis",
x = "Fold Enrichment",
y = "KEGG Pathways",
size = "Gene Count",
color = "FDR",
subtitle = "Genes down regulated in responsive pdx compared to poor-responders"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5)
)
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
# Input GO data from DAVID output for downregulated genes
data <- read.csv("./data/david/david_go_down_treat.csv", header = TRUE)
# Filter data for FDR < 0.1
filtered_data <- data %>%
filter(FDR < 0.1)
# Create a faceted dot plot using ggplot2 with equal spacing on the y-axis
ggplot(filtered_data, aes(x = Fold.Enrichment, y = reorder(Term, Fold.Enrichment))) +
geom_point(aes(size = Count, color = FDR), alpha = 0.7) +
scale_color_gradient(low = "blue", high = "red") +
labs(
title = "GO Term Enrichment Analysis",
x = "Fold Enrichment",
y = "GO Terms",
size = "Gene Count",
color = "FDR",
subtitle = "Genes down regulated in responsive pdx compared to poor-responders"
) +
theme_minimal() +
theme(
plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
plot.subtitle = element_text(size = 10, hjust = 0.5),
strip.text = element_text(size = 7) # Adjust the angle as needed
) +
facet_grid(Category ~ ., scales = "free_y", space = "free_y")
Version | Author | Date |
---|---|---|
0beffae | Sally Beard | 2024-01-15 |
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggforce_0.4.1
[2] cowplot_1.1.1
[3] glue_1.6.2
[4] Glimma_2.8.0
[5] gplots_3.1.3
[6] ggrepel_0.9.4
[7] kableExtra_1.3.4
[8] knitr_1.44
[9] patchwork_1.1.3
[10] Homo.sapiens_1.3.1
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[12] org.Hs.eg.db_3.16.0
[13] GO.db_3.16.0
[14] OrganismDbi_1.40.0
[15] GenomicFeatures_1.50.4
[16] GenomicRanges_1.50.2
[17] GenomeInfoDb_1.34.9
[18] AnnotationDbi_1.60.2
[19] IRanges_2.32.0
[20] S4Vectors_0.36.2
[21] Biobase_2.58.0
[22] BiocGenerics_0.44.0
[23] RColorBrewer_1.1-3
[24] lubridate_1.9.3
[25] forcats_1.0.0
[26] stringr_1.5.0
[27] dplyr_1.1.3
[28] purrr_1.0.2
[29] readr_2.1.4
[30] tidyr_1.3.0
[31] tibble_3.2.1
[32] ggplot2_3.4.4
[33] tidyverse_2.0.0
[34] edgeR_3.40.2
[35] limma_3.54.2
[36] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] BiocFileCache_2.6.1 systemfonts_1.0.5
[3] plyr_1.8.9 BiocParallel_1.32.6
[5] digest_0.6.33 htmltools_0.5.6.1
[7] fansi_1.0.5 magrittr_2.0.3
[9] memoise_2.0.1 tzdb_0.4.0
[11] Biostrings_2.66.0 annotate_1.76.0
[13] matrixStats_1.0.0 svglite_2.1.2
[15] timechange_0.2.0 prettyunits_1.2.0
[17] colorspace_2.1-0 blob_1.2.4
[19] rvest_1.0.3 rappdirs_0.3.3
[21] xfun_0.40 callr_3.7.3
[23] crayon_1.5.2 RCurl_1.98-1.12
[25] jsonlite_1.8.7 graph_1.76.0
[27] polyclip_1.10-6 gtable_0.3.4
[29] zlibbioc_1.44.0 XVector_0.38.0
[31] webshot_0.5.5 DelayedArray_0.24.0
[33] scales_1.2.1 DBI_1.1.3
[35] Rcpp_1.0.11 viridisLite_0.4.2
[37] xtable_1.8-4 progress_1.2.2
[39] bit_4.0.5 htmlwidgets_1.6.2
[41] httr_1.4.7 farver_2.1.1
[43] pkgconfig_2.0.3 XML_3.99-0.14
[45] sass_0.4.7 dbplyr_2.4.0
[47] locfit_1.5-9.8 utf8_1.2.4
[49] labeling_0.4.3 reshape2_1.4.4
[51] tidyselect_1.2.0 rlang_1.1.1
[53] later_1.3.1 munsell_0.5.0
[55] tools_4.2.2 cachem_1.0.8
[57] cli_3.6.1 generics_0.1.3
[59] RSQLite_2.3.2 evaluate_0.22
[61] fastmap_1.1.1 yaml_2.3.7
[63] processx_3.8.2 bit64_4.0.5
[65] fs_1.6.3 caTools_1.18.2
[67] KEGGREST_1.38.0 RBGL_1.74.0
[69] whisker_0.4.1 xml2_1.3.5
[71] biomaRt_2.54.1 compiler_4.2.2
[73] rstudioapi_0.15.0 filelock_1.0.2
[75] curl_5.1.0 png_0.1-8
[77] tweenr_2.0.2 geneplotter_1.76.0
[79] bslib_0.5.1 stringi_1.7.12
[81] highr_0.10 ps_1.7.5
[83] lattice_0.22-5 Matrix_1.6-1.1
[85] vctrs_0.6.4 pillar_1.9.0
[87] lifecycle_1.0.3 BiocManager_1.30.22
[89] jquerylib_0.1.4 bitops_1.0-7
[91] httpuv_1.6.12 rtracklayer_1.58.0
[93] R6_2.5.1 BiocIO_1.8.0
[95] promises_1.2.1 KernSmooth_2.23-22
[97] codetools_0.2-19 MASS_7.3-60
[99] gtools_3.9.4 SummarizedExperiment_1.28.0
[101] DESeq2_1.38.3 rprojroot_2.0.3
[103] rjson_0.2.21 withr_2.5.1
[105] GenomicAlignments_1.34.1 Rsamtools_2.14.0
[107] GenomeInfoDbData_1.2.9 parallel_4.2.2
[109] hms_1.1.3 grid_4.2.2
[111] rmarkdown_2.25 MatrixGenerics_1.10.0
[113] git2r_0.32.0 getPass_0.2-2
[115] restfulr_0.0.15
sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.5.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggforce_0.4.1
[2] cowplot_1.1.1
[3] glue_1.6.2
[4] Glimma_2.8.0
[5] gplots_3.1.3
[6] ggrepel_0.9.4
[7] kableExtra_1.3.4
[8] knitr_1.44
[9] patchwork_1.1.3
[10] Homo.sapiens_1.3.1
[11] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[12] org.Hs.eg.db_3.16.0
[13] GO.db_3.16.0
[14] OrganismDbi_1.40.0
[15] GenomicFeatures_1.50.4
[16] GenomicRanges_1.50.2
[17] GenomeInfoDb_1.34.9
[18] AnnotationDbi_1.60.2
[19] IRanges_2.32.0
[20] S4Vectors_0.36.2
[21] Biobase_2.58.0
[22] BiocGenerics_0.44.0
[23] RColorBrewer_1.1-3
[24] lubridate_1.9.3
[25] forcats_1.0.0
[26] stringr_1.5.0
[27] dplyr_1.1.3
[28] purrr_1.0.2
[29] readr_2.1.4
[30] tidyr_1.3.0
[31] tibble_3.2.1
[32] ggplot2_3.4.4
[33] tidyverse_2.0.0
[34] edgeR_3.40.2
[35] limma_3.54.2
[36] workflowr_1.7.1
loaded via a namespace (and not attached):
[1] BiocFileCache_2.6.1 systemfonts_1.0.5
[3] plyr_1.8.9 BiocParallel_1.32.6
[5] digest_0.6.33 htmltools_0.5.6.1
[7] fansi_1.0.5 magrittr_2.0.3
[9] memoise_2.0.1 tzdb_0.4.0
[11] Biostrings_2.66.0 annotate_1.76.0
[13] matrixStats_1.0.0 svglite_2.1.2
[15] timechange_0.2.0 prettyunits_1.2.0
[17] colorspace_2.1-0 blob_1.2.4
[19] rvest_1.0.3 rappdirs_0.3.3
[21] xfun_0.40 callr_3.7.3
[23] crayon_1.5.2 RCurl_1.98-1.12
[25] jsonlite_1.8.7 graph_1.76.0
[27] polyclip_1.10-6 gtable_0.3.4
[29] zlibbioc_1.44.0 XVector_0.38.0
[31] webshot_0.5.5 DelayedArray_0.24.0
[33] scales_1.2.1 DBI_1.1.3
[35] Rcpp_1.0.11 viridisLite_0.4.2
[37] xtable_1.8-4 progress_1.2.2
[39] bit_4.0.5 htmlwidgets_1.6.2
[41] httr_1.4.7 farver_2.1.1
[43] pkgconfig_2.0.3 XML_3.99-0.14
[45] sass_0.4.7 dbplyr_2.4.0
[47] locfit_1.5-9.8 utf8_1.2.4
[49] labeling_0.4.3 reshape2_1.4.4
[51] tidyselect_1.2.0 rlang_1.1.1
[53] later_1.3.1 munsell_0.5.0
[55] tools_4.2.2 cachem_1.0.8
[57] cli_3.6.1 generics_0.1.3
[59] RSQLite_2.3.2 evaluate_0.22
[61] fastmap_1.1.1 yaml_2.3.7
[63] processx_3.8.2 bit64_4.0.5
[65] fs_1.6.3 caTools_1.18.2
[67] KEGGREST_1.38.0 RBGL_1.74.0
[69] whisker_0.4.1 xml2_1.3.5
[71] biomaRt_2.54.1 compiler_4.2.2
[73] rstudioapi_0.15.0 filelock_1.0.2
[75] curl_5.1.0 png_0.1-8
[77] tweenr_2.0.2 geneplotter_1.76.0
[79] bslib_0.5.1 stringi_1.7.12
[81] highr_0.10 ps_1.7.5
[83] lattice_0.22-5 Matrix_1.6-1.1
[85] vctrs_0.6.4 pillar_1.9.0
[87] lifecycle_1.0.3 BiocManager_1.30.22
[89] jquerylib_0.1.4 bitops_1.0-7
[91] httpuv_1.6.12 rtracklayer_1.58.0
[93] R6_2.5.1 BiocIO_1.8.0
[95] promises_1.2.1 KernSmooth_2.23-22
[97] codetools_0.2-19 MASS_7.3-60
[99] gtools_3.9.4 SummarizedExperiment_1.28.0
[101] DESeq2_1.38.3 rprojroot_2.0.3
[103] rjson_0.2.21 withr_2.5.1
[105] GenomicAlignments_1.34.1 Rsamtools_2.14.0
[107] GenomeInfoDbData_1.2.9 parallel_4.2.2
[109] hms_1.1.3 grid_4.2.2
[111] rmarkdown_2.25 MatrixGenerics_1.10.0
[113] git2r_0.32.0 getPass_0.2-2
[115] restfulr_0.0.15