How remove batch effects of RNA sequencing without control genes in R
Introduction
RUVseq can conduct a differential expression (DE) analysis that controls for “unwanted variation”, e.g., batch, library preparation, and other nuisance effects, using the between-sample normalization methods proposed. They call this approach RUVSeq for remove unwanted variation from RNA-Seq data
If no genes are known a priori not to be influenced by the covariates of interest, one can obtain a set of “in-silico empirical” negative controls, e.g., least significantly DE genes based on a first-pass DE analysis performed prior to RUVg normalization.
Code exmaple
Installation of RUVseq and edgeR from bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("RUVSeq")
BiocManager::install("edgeR")
Load needed packge and data
library(RUVSeq)
library(zebrafishRNASeq)
data(zfGenes)
zfGenes data: one gene in a row, one sample in a column
Filter out non-expressed genes, by requiring more than 5 reads in at least two samples for each gene.
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
Defined group and saved data in SeqExpressionSet
x <- as.factor(rep(c("Ctl", "Trt"), each=3))
set <- newSeqExpressionSet(as.matrix(filtered),
phenoData = data.frame(x, row.names=colnames(filtered)))
Firstly we run DEG, then we get empirical control gene by considering all but the top 5000 genes as ranked by edgeR p-values.
design <- model.matrix(~x, data=pData(set))
y <- DGEList(counts=counts(set), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
top <- topTags(lrt, n=nrow(set))$table
empirical <- rownames(set)[which(!(rownames(set) %in% rownames(top)[1:5000]))]
Now we can estimate the factors of unwanted variation
set2 <- RUVg(set, empirical, k=1)
pData(set2)
## x W_1
## Ctl1 Ctl -0.10879677
## Ctl3 Ctl 0.23066424
## Ctl5 Ctl 0.19926266
## Trt9 Trt 0.07672121
## Trt11 Trt -0.83540924
## Trt13 Trt 0.43755790
Now , We can put W_1 as the factors of unwanted variation combining with group factor into the negative binomial GLM model of edgeR. Then the DE genes without batch effect came out.
design <- model.matrix(~x + W_1, data=pData(set1))
y <- DGEList(counts=counts(set1), group=x)
y <- calcNormFactors(y, method="upperquartile")
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef=2)
topTags(lrt) # DE gene order by pvalue
If someone wants to use DESeq2 instead of edgeR, the codes are as follows.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts(set1),
colData = pData(set1),
design = ~ W_1 + x)
dds <- DESeq(dds)
res <- results(dds)
res
References
http://www.bioconductor.org/packages/devel/bioc/vignettes/RUVSeq/inst/doc/RUVSeq.pdf
Donation
[paypal-donation]
Hello, I enjoy reading all of your post. I like to write a little comment to support you.