RESTORBIOME2: Exploration and normalisation

Experiment with ER

Authors

Affiliation

Olivier Rué

Migale bioinformatics facility

Christelle Hennequet-Antier

Migale bioinformatics facility

Published

May 2, 2023

Modified

February 19, 2025

Show the code

#Use packages and functions
library(tidyverse)
library(data.table)
library(SummarizedExperiment) # manipulation RNASeq assays
library(rtracklayer) ## for annotation
#library("DEFormats")
library(edgeR)
library(kableExtra)
library(dplyr)
library(purrr)
library(stringr)
library(ggplot2)
library(reshape2)
library(mixOmics) #PCA

Read counts

The matrix of raw counts looks like that.

Show the code

#cat("EXP = ", params$my.interest)

### Get Raw Count data in edgeR object
# At gene level
#inputFile <-  "count_allSamples_BA.csv"
inputFile <- list.files("../data/counts") |> str_subset(params$my.interest)

#cat("File raw = ",inputFile,"\n")
rawdata <- fread(file.path("../data/counts",inputFile))
# rename first column gene ID to be homogenous between assays
colnames(rawdata)[1] <- "Gene_ID"
nameEXP <- params$my.interest

# genecount: count matrix
geneCount <- rawdata[,-c(1)] |> as.matrix()
row.names(geneCount) <- rawdata %>% dplyr::select(1) %>% unlist() %>% as.character()
#row.names(geneCount) %>% head()
#print(dim(geneCount))
head(geneCount) %>% kbl() %>% kable_styling() %>% 
  scroll_box(width = "100%", height = "200px")

	ER_G_R1	ER_G_R2	ER_G_R3	ER_R_R1	ER_R_R2	ER_R_R3
gene-EUBREC_RS00010	2954	5651	6885	4663	4618	7345
gene-EUBREC_RS00015	1247	2482	3366	2390	2453	3605
gene-EUBREC_RS00020	147	305	378	325	374	445
gene-EUBREC_RS00025	1115	2183	2871	1876	2041	3159
gene-EUBREC_RS00030	2570	4589	5907	4570	4973	7323
gene-EUBREC_RS00035	3380	6119	8719	7076	7463	10522

This matrix contains raw counts from ER experiment with 3240 genes and 6 samples.

This is the experimental design.

Show the code

# create Design from sample's names in file
sampleInfo.all <- colnames(rawdata)[-c(1)] %>%
  stringr::str_split(., "[_]", simplify = FALSE) %>%
  transpose() %>%
  purrr::simplify_all() 
names(sampleInfo.all) <- c("Bacteria","Treatment","Replicate")
#sampleInfo.all$sample <- paste(sampleInfo.all$Treatment, sampleInfo.all$Rep, sep="_")
sampleInfo.all$sample <- colnames(rawdata)[-c(1)]

Design <- as.data.frame(sampleInfo.all)
Design <- Design %>% mutate( group = paste(Bacteria, Treatment, sep="_"))
Design %>% kbl() %>% kable_styling() %>% 
  scroll_box(width = "100%", height = "200px")

Bacteria	Treatment	Replicate	sample	group
ER	G	R1	ER_G_R1	ER_G
ER	G	R2	ER_G_R2	ER_G
ER	G	R3	ER_G_R3	ER_G
ER	R	R1	ER_R_R1	ER_R
ER	R	R2	ER_R_R2	ER_R
ER	R	R3	ER_R_R3	ER_R

Show the code

# Design_file <- paste("../output/Design", nameEXP, ".csv", sep="")
# write.table(Design, Design_file, row.names=FALSE, sep=",")

We used gene annotation produced using DBCAN .

Show the code

# GFF annotation file
filegff <- case_when(
  nameEXP == 'BA' ~ "../data/GFF/NC_008618_bifido_adolescentis_spikes_cazymes_dbcan.gff3",
  nameEXP == 'BU' ~ "../data/GFF/CP102263_1_Bacteroides_uniformis_spikes_cazymes_dbcan_pulpred_cc.gff3",
  nameEXP == 'BC' ~ "../data/GFF/NZ_AP012325_bifido_catenulatum_spikes_cazymes_dbcan.gff3",
  nameEXP == 'ER' ~ "../data/GFF/NC_012781_eubacterium_rectale_spikes_cazymes_dbcan.gff3",
)

# use gff produced by DBCAN
gff <- readGFF(filegff)


df_annot <- gff %>% as_tibble() %>% unnest_longer(Parent)  


# remove comulmn with all NA
df_annot <- df_annot %>% select_if(~sum(!is.na(.)) > 0) 
# Remark: duplicated gene annotation
# which(duplicated(df_annot[,"Parent"]))
# keeep the first line when multiple annotation
df_annot <- df_annot%>%
  group_by(seqid, source, type, ID, Parent) %>%
  filter(row_number() <= 1)
#glimpse(df_annot)


rawdata_annot <- left_join(rawdata[, 1], df_annot, by=c("Gene_ID" = "Parent"), keep=FALSE) %>%
  as.data.frame()

#glimpse(rawdata_annot)

In this ER experiment, the effect of each treatment (G, R) on gene expression will be explore. The number of biological repetitions is 3, 3 for G, R treatment, respectively.

Show the code

# create a SummarizedExperiment object
se <- SummarizedExperiment(assays=list(counts=geneCount),
                     rowData=rawdata_annot, colData=Design)
#print(se)

#counts matrix = assay(se)
#size of library
#libsize <- assay(se) %>% colSums()

#exemple to remove sample BT_G_Rep1
#se <- se[, se$sample !="BT_G_Rep1"]
#colData(se)
#dim(se)
#print(se)

#counts matrix = assay(se)
libsize <- assay(se) %>% colSums()
libsize %>% kbl(col.names = c("libsize")) %>% kable_styling(full_width = FALSE)

	libsize
ER_G_R1	6659725
ER_G_R2	11293829
ER_G_R3	10632841
ER_R_R1	10015263
ER_R_R2	10757161
ER_R_R3	11736919

Genes with very low counts across all libraries may be filtered.

Show the code

# Filter gene with low count, at least number of biological replicates with cpm > valfilt 
#summary(libsize)
#valfilt <- round(10*1e06 / min(libsize), 1)
valfilt <- 1
nbrepbio <- min(table(colData(se)$group))
keep <- rowSums(edgeR::cpm(se) > valfilt) >= nbrepbio
#summary(keep)
se <- se[keep,]
#dim(se)

The filtering on genes is based on count per million (cpm) greater than 1 in at least 3 samples corresponding to the minimum number of biological replicates. We kept 3082 expressed genes for further analyses from 6 samples.

Normalisation

A normalization factor is calculated to take into account the different sizes of the sequencing banks (i.e. the total read count) and the distribution of reads per sample on sequencing run, as discussed [1]. Normalization by trimmed mean of M values (TMM) [2] is performed by using the calcNormFactors function from edgeR R package. It calculates a set of normalization factors, one for each sample, to eliminate composition biases between libraries.

Here, the table contains library size and normalisation factors using TMM method ordered by library size. These graphs are another way of verifying the quality control carried out during the sequencing and bioinformatics analysis steps.

Show the code

dge <- calcNormFactors(se)
#cat("Ordered library size and normalisation factors TMM's method \n")
# dge$samples
dge$samples[order(dge$samples$lib.size), ]

        group lib.size norm.factors Bacteria Treatment Replicate  sample
ER_G_R1  ER_G  6659374    0.9042937       ER         G        R1 ER_G_R1
ER_R_R1  ER_R 10014864    0.9417411       ER         R        R1 ER_R_R1
ER_G_R3  ER_G 10632182    1.0974065       ER         G        R3 ER_G_R3
ER_R_R2  ER_R 10756755    0.9262899       ER         R        R2 ER_R_R2
ER_G_R2  ER_G 11293072    1.0385752       ER         G        R2 ER_G_R2
ER_R_R3  ER_R 11735721    1.1122607       ER         R        R3 ER_R_R3

Show the code

# barplot
p<-ggplot(dge$samples, aes(x=dge$samples$sample, y=dge$samples$norm.factors, fill=dge$samples$group)) +
    geom_col() + xlab("Samples") + ylab("TMM factors") + labs(fill = "group") + theme_bw() +
    theme(axis.text.x = element_text(angle=90)) + ggtitle("TMM size factors \n")
p

Show the code

# Library size with group
p<-ggplot(dge$samples, aes(x=dge$samples$sample, y=dge$samples$lib.size, fill=dge$samples$group)) +
    geom_col() + xlab("Samples") + ylab("Library size") + labs(fill = "group") + theme_bw() +
    theme(axis.text.x = element_text(angle=90)) + ggtitle("Library size \n")
p

Show the code

# bp <- ggplot(dge$samples, aes(x=dge$samples$group, y=dge$samples$lib.size, fill=dge$samples$group)) + geom_boxplot() + 
#       geom_jitter(position=position_jitter(0.1)) + theme_bw() + 
#     theme(axis.text.x = element_text(angle=90)) +
#       xlab("group") + ylab("Library size") + labs(fill = "group")
# bp

Multidimensional scaling plot

Multidimensional scaling (MDS) plot shows the relationships between the samples. The top (500 genes) are used to calculate the distance between expression profiles of samples. The distance approximate the log2 fold change between the samples.

As expected, samples are grouped by treatment.

Show the code

limma::plotMDS(dge, col = as.numeric(dge$samples$group), labels = dge$samples$sample, cex=0.6, top = 500, main = "MDS plot top 500 genes")

Reproducibility token

Show the code

sessioninfo::session_info(pkgs = "attached")

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       Ubuntu 24.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  fr_FR.UTF-8
 ctype    fr_FR.UTF-8
 tz       Europe/Paris
 date     2025-02-16
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version date (UTC) lib source
 Biobase              * 2.64.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 BiocGenerics         * 0.50.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 data.table           * 1.16.4  2024-12-06 [1] CRAN (R 4.4.2)
 dplyr                * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
 edgeR                * 4.2.2   2024-10-13 [1] Bioconductor 3.19 (R 4.4.1)
 forcats              * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
 GenomeInfoDb         * 1.40.1  2024-05-24 [1] Bioconductor 3.19 (R 4.4.0)
 GenomicRanges        * 1.56.2  2024-10-09 [1] Bioconductor 3.19 (R 4.4.1)
 ggplot2              * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
 IRanges              * 2.38.1  2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
 kableExtra           * 1.4.0   2024-01-24 [1] CRAN (R 4.4.0)
 lattice              * 0.22-5  2023-10-24 [4] CRAN (R 4.3.1)
 limma                * 3.60.6  2024-10-02 [1] Bioconductor 3.19 (R 4.4.1)
 lubridate            * 1.9.4   2024-12-08 [1] CRAN (R 4.4.2)
 MASS                 * 7.3-61  2024-06-13 [4] CRAN (R 4.4.1)
 MatrixGenerics       * 1.16.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 matrixStats          * 1.5.0   2025-01-07 [1] CRAN (R 4.4.2)
 mixOmics             * 6.28.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.2)
 purrr                * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
 readr                * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
 reshape2             * 1.4.4   2020-04-09 [1] CRAN (R 4.4.0)
 rtracklayer          * 1.64.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 S4Vectors            * 0.42.1  2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
 stringr              * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
 SummarizedExperiment * 1.34.0  2024-05-01 [1] Bioconductor 3.19 (R 4.4.0)
 tibble               * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
 tidyr                * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
 tidyverse            * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)

 [1] /home/orue/R/x86_64-pc-linux-gnu-library/4.4
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────

References

1. Dillies M-A, Rau A, Aubert J, Hennequet-Antier C, Jeanmougin M, Servant N, et al. A comprehensive evaluation of normalization methods for illumina high-throughput RNA sequencing data analysis. Brief Bioinform. 2012;14:671–83.

2. Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology. 2010;11:R25. doi:10.1186/gb-2010-11-3-r25.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France