Microvarior3

Authors
Affiliation

Olivier Rué

Migale bioinformatics facility

Mahendra Mariadassou

Migale bioinformatics facility

Published

January 3, 2023

Modified

December 10, 2024

Aim of the project

The aim of this project is :

  • to compare the microbiota of grape bays of different varieties (resistant to cryptogamic diseases versus conventional) and according to different agricultural practices (organic, low-input and no-treatment)
  • to determine the impact of the variation of microbial communities on spontaneous fermentation kinetics, and the final quality of the wines (analytical and sensory).

Partners

  • Cécile Neuvéglise - INRAE - SPO - Montpellier
  • Olivier Rué - INRAE - Migale - Jouy-en-Josas
  • Mahendra Mariadassou - INRAE - Migale - Jouy-en-Josas

Raw data

Raw data were deposited to the front server, renamed according to metadata informations and a copy was sent to the abaca server:

Code
ssh orue@front.migale.inrae.fr

cd /home/orue/work/MICROVARIOR3/
mkdir RAW_DATA
unzip 221213_M03493_0461_000000000-KHT5G.zip
for i in 221213_M03493_0461_000000000-KHT5G/*/*/*/*_R1_001.fastq.gz  ; do id=$(echo $(basename $i | cut -f 1 -d '_')) ; cp $i RAW_DATA/${id}_R1.fastq.gz ; done
for i in 221213_M03493_0461_000000000-KHT5G/*/*/*/*_R2_001.fastq.gz  ; do id=$(echo $(basename $i | cut -f 1 -d '_')) ; cp $i RAW_DATA/${id}_R2.fastq.gz ; done
cd RAW_DATA
rm -f Undeterm*
tar zcvf Microvarior3.tar.gz *.fastq.gz
scp Microvarior3.tar.gz orue@abaca.maiage.inrae.fr:/backup/partage/migale/MICROVARIOR3/RAW_DATA/
  • /backup/partage/migale/MICROVARIOR3/RAW_DATA/: Illumina MiSeq 2x301 bp, 195 samples

Reads were renamed according to metadata informations.

Associated metadata

Here are the informations about the samples:

Quality control

Reads

First, the number of reads and their lengths was computed for each sample with seqkit [1]. We expect at least 10-20,000 reads by sample and all reads with the same length.

Code
cd /home/orue/work/MICROVARIOR3/RAW_DATA/
conda activate seqkit-2.0.0
seqkit stat *.fastq.gz > seqkit.txt
conda deactivate

Let see what we have:

Code
raw_data <- read.csv2("data/seqkit.txt", stringsAsFactors = F, sep="", strip.white=T, dec=".") %>% as_tibble() %>% mutate(sample = basename(gsub('.{12}$', '', file))) %>% distinct(sample, .keep_all = TRUE) %>% select(-file)
raw_data %>% datatable()
Code
raw_reads_data <- raw_data %>% mutate(num_seqs = as.numeric(gsub(",", "", num_seqs))) %>% rename(Sample = sample)

p<-ggplot(data=raw_reads_data, aes(x=raw_reads_data$Sample, y=raw_reads_data$num_seqs)) +
      xlab("Sample")  + ylab("Reads") +
      scale_y_continuous() +
     geom_bar(stat="identity",fill="#b1b1ff") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_text(aes(y = raw_reads_data$num_seqs, label = ""),position = position_stack(vjust = .5)) 

All samples have more than 30,000 reads, even blanks and controls :(
We observe that reads are not all 301 bp long.

Bases quality

Sequencing quality was assessed with FastQC [2] and individual reports were agglomerated with MultiQC [3] to obtain a global overview of metrics (base quality, adapter content, reads length distribution…).

Code
mkdir LOGS
mkdir FASTQC
for i in RAW_DATA/*.fastq.gz ; do echo "conda activate fastqc-0.11.8 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done
qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
qsub -cwd -V -N multiqc -o LOGS -e LOGS -b y "conda activate multiqc-1.12 && multiqc FASTQC -o MULTIQC && conda deactivate"

The HTML report shows that the sequencing quality was medium at the 3’ extremities for some samples. It is often seen for 2x300 sequencing protocols and was observed in the previous runs.

Bioinformatics

A combination of FROGS [4] and dada2 package [5] was used to obtain Amplicon Sequence Variants (ASVs) from raw reads. Briefly, primers were checked and removed with cutadapt [6], sequences shorter than 20 bp or containing ambigous bases (Ns) were removed. Then, reads were denoised with dada2 (dada function) and remaining pairs were merged with pear [7]. Reads that were not merged were kept to keep potential long ITS sequences. Chimera were removed with vsearch [8], low-abundant OTUs were removed (abundance < 0.005% of total abundance). ITSx was applied to keep only ITS sequences. Remaining OTUs were affiliated with Unite v. XX [^unite]. Taxonomies were checked and curated manually to obtain the final dataset.

This procedure will be available in the next release of FROGS
Code
mkdir /home/orue/work/MICROVARIOR3/DADA2_FROGS
cd /home/orue/work/MICROVARIOR3/DADA2_FROGS
R -e "rmarkdown::render('/work_home/orue/MICROVARIOR2/SCRIPTS/dada2_FROGS.Rmd', params=list(
  references=c('/work_home/orue/MICROVARIOR/Unite_Fungi8.2_and_METABARFOOD.fasta'), 
  region='ITS', 
  expname='MICROVARIOR3',
  input_directory='/home/orue/work/MICROVARIOR3/RAW_DATA/',
  forward_primer='CTTGGTCATTTAGAGGAAGTAA',
  reverse_primer='GCTGCGTTCTTCATCGATGC',
  output_directory='/home/orue/work/MICROVARIOR3/DADA2_FROGS',
  metadata_file='/work_home/orue/MICROVARIOR3/metadata.tsv',
  min_reads=1000,
  min_abundance=0.00005,
  its=TRUE,
  its_region='ITS1',
  threads=16), 
output_file = '/home/orue/work/MICROVARIOR3/DADA2_FROGS/report.html')"

cd FROGS
conda activate frogs-3.2.3
remove_chimera.py --input-biom ../dada2/dada2.biom --input-fasta ../dada2/dada2.fasta --nb-cpus 16
otu_filters.py --input-biom remove_chimera_abundance.biom --input-fasta remove_chimera.fasta --min-abundance 0.00005 --contaminant /db/frogs_databanks/contaminants/phi.fa 
itsx.py --input-biom otu_filters_abundance.biom --input-fasta otu_filters.fasta --check-its-only --region ITS1
affiliation_OTU.py --input-biom itsx_abundance.biom --input-fasta itsx.fasta --reference /work_home/orue/MICROVARIOR/Unite_Fungi8.2_and_METABARFOOD.fasta --nb-cpus 16
affiliations_stat.py --input-biom affiliation_abundance.biom --multiple-tag blast_affiliations --tax-consensus-tag blast_taxonomy --identity-tag perc_identity --coverage-tag perc_query_coverage
conda deactivate
Code
biomfile <- "html/affiliation_abundance.biom"
physeq <- import_frogs(biomfile, taxMethod = "blast")
metadata <- read.table("data/metadata.tsv", row.names = 1, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
sample_data(physeq) <- metadata
saveRDS(physeq,"html/dada2_frogs_data.rds")
Code
microvarior2 <- readRDS("data/microvarior2_ITS_final.rds")
microvarior3 <- readRDS("html/dada2_frogs_data.rds")

asv_microvarior2 <- taxa_names(microvarior2)
asv_microvarior3 <- taxa_names(microvarior3)

x <- list(
  asv_microvarior2 = asv_microvarior2, 
  asv_microvarior3 = asv_microvarior3
  )
ggVennDiagram(x,label_alpha = 0) + 
  scale_fill_gradient(low = "#F4FAFE", high = "#4981BF") 

We now use the affiliations manually curated for the Microvarior2 data to update the current data.

Code
change_complete_taxo <- function(t, taxo, sequence){
    if(sequence %in% rownames(t)){
      taxolist <- unlist(strsplit(taxo, ";"))
      t[sequence,"Kingdom"] <- taxolist[1]
      t[sequence,"Phylum"] <- taxolist[2]
      t[sequence,"Class"] <- taxolist[3]
      t[sequence,"Order"] <- taxolist[4]
      t[sequence,"Family"] <- taxolist[5]
      t[sequence,"Genus"] <- taxolist[6]
      t[sequence,"Species"] <- taxolist[7]  
    }
    return(t)
  }

t3 <- tax_table(microvarior3)
t2 <- tax_table(microvarior2)
l2 <- rownames(t2)
l3 <- rownames(t3)
t <- tax_table(microvarior3)
t_save <- tax_table(microvarior3)

tib2 <- psmelt(microvarior2) %>% mutate(Taxo = paste(Kingdom,Phylum,Class,Order,Family,Genus,Species,sep=";")) %>% group_by(OTU) %>% ungroup() %>% select(OTU,Taxo) %>% unique()
tib3 <- psmelt(microvarior3) %>% mutate(Taxo = paste(Kingdom,Phylum,Class,Order,Family,Genus,Species,sep=";")) %>% group_by(OTU) %>% ungroup() %>% select(OTU,Taxo) %>% unique()

for (asv in l3) {
  if(asv %in% l2){
    write.table(asv, "html/common_asvs.txt", append=TRUE, quote = FALSE, row.names = FALSE, col.names = FALSE)
    taxo3 <- tib3 %>% filter(OTU == asv) %>% select(Taxo) %>% as.character()
    taxo2 <- tib2 %>% filter(OTU == asv) %>% select(Taxo) %>% as.character()
    if(taxo2 != taxo3){
      t <- change_complete_taxo(t,taxo2,asv)  
    }
    
  }
}
tax_table(microvarior3) <- t
saveRDS(microvarior3,"html/physeq_curated_with_microvarior2.rds")
write.table(tax_table(microvarior3) %>% as.matrix() ,"html/affiliations_curated_with_microvarior2.tsv",sep="\t",row.names=TRUE,col.names=NA)

ab <- tibble(OTU=names(taxa_sums(microvarior3)),Abundance=taxa_sums(microvarior3) %>% unname())
tib3ab <- left_join(tib3, ab, by=c("OTU"))
tib4 <- read_csv("html/common_asvs.txt", col_names = c("OTU")) %>% mutate(toto = OTU)
ll <- left_join(tib3ab, tib4, by=c("OTU"), na_matches = c("na")) %>% mutate(Curated = case_when(is.na(toto) ~ "todo", !is.na(toto) ~ "done")) %>% filter(Curated == "todo") %>% mutate(ASV = OTU) %>% select(ASV,Abundance,Taxo)
write.table(ll, "html/to_check.txt", append=FALSE, quote = FALSE, row.names = FALSE, col.names = TRUE)
Code
datatable(tax_table(microvarior3))

We nom merge Microvarior2 and Microvarior3 phyloseq objects after adding the Year metadata.

Code
sample_data(microvarior2)$Year <- "2021"
sample_names(microvarior2) <- glue::glue("{sample_names(microvarior2)}_2021")
sample_data(microvarior3)$Year <- "2022"
sample_names(microvarior3) <- glue::glue("{sample_names(microvarior3)}_2022")
all <- merge_phyloseq(microvarior2, microvarior3)

sample_data(all)$treatment[sample_data(all)$treatment == "bas-intrant"] <- "bas-intrants"
sample_data(all)$Volume[sample_data(all)$Volume == "901 g"] <- "900 g"
sample_data(all)$Volume[sample_data(all)$Volume == "902 g"] <- "900 g"
sample_data(all)$Volume[sample_data(all)$Volume == "903 g"] <- "900 g"
sample_data(all)$Volume[sample_data(all)$Volume == "904 g"] <- "900 g"
sample_data(all)$Volume[sample_data(all)$Volume == "905 g"] <- "900 g"
sample_data(all)$Volume[sample_data(all)$Volume == "906 g"] <- "900 g"
sample_names(all)[sample_names(all) == "v2-6-T0_2022"] <- "V2-6-T0_2021"

pos <- match("V2-6-T0_2021", sample_names(all))
years <- sample_data(all)$Year
years[pos] <- "2021"
sample_data(all)$Year <- years


saveRDS(all,"html/physeq_all.rds")

Finally we apply the last modifications on ASVs (manual chimera detection, updates on taxonomy…)

Code
curated_its_table <- read.csv2("html/to_check_checked.txt", sep = "\t", colClasses = c("character","character","character","character")) %>% as_tibble()
  
to_remove <- curated_its_table %>% filter(Cécile == "chimère à supprimer" | Cécile == "à supprimer" | Cécile == "contamination à supprimer") %>% select(ASV) %>% pull()

keptTaxa <- setdiff(taxa_names(all), to_remove)
physeq_no_chimera <- prune_taxa(keptTaxa, all)
  
to_change <- curated_its_table %>% filter(Cécile != "chimère à supprimer") %>% filter(Cécile != "à supprimer") %>%  filter(Cécile != "contamination à supprimer") %>% mutate(Cécile = ifelse(Cécile == "ok",Taxo,Cécile)) %>% select(ASV,Cécile)
t <- tax_table(physeq_no_chimera)
physeq_final <- physeq_no_chimera
taxa_names(physeq_final) <- str_replace(taxa_names(physeq_final), "_FROGS_combined","")
saveRDS(physeq_final,"html/physeq_final.rds")

Let see the final ASVs:

Code
datatable(tax_table(physeq_final))

I finally correct some errors in localities (Bordeaux vs. bordeaux…)

Code
sample_data(physeq_final)$locality[sample_data(physeq_final)$locality == "bordeaux"] <- "Bordeaux"
sample_data(physeq_final)$locality[sample_data(physeq_final)$locality == "Coupe-rose"] <- "Coupe-Roses"
sample_data(physeq_final)$locality[sample_data(physeq_final)$locality == "Coupe Roses"] <- "Coupe-Roses"
sample_data(physeq_final)$Type[sample_data(physeq_final)$Type == "eau de rinçage"] <- "rinse water"

sample_data(physeq_final)$locality %>% unique()
[1] "Control"     "Pech Rouge"  "Coupe-Roses" "Bordeaux"    NA           
[6] "Vassal"     
Code
meta <- sample_data(physeq_final) |> as.data.frame()
meta$rowname <- rownames(meta)
sample_data(physeq_final) <- meta |> as_tibble() |>
  mutate(across(everything(), as_factor)) |>
  column_to_rownames()

saveRDS(physeq_final, "html/physeq_factor.rds")

Downloads

Changes in taxonomic assignations can be made with AffiliationExplorer

References

1. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PloS one. 2016;11:e0163962.
2. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
3. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.
4. Escudié F, Auer L, Bernard M, Mariadassou M, Cauquil L, Vidal K, et al. FROGS: Find, Rapidly, OTUs with Galaxy Solution. Bioinformatics. 2018;34:1287–94. doi:10.1093/bioinformatics/btx791.
5. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from illumina amplicon data. Nature methods. 2016;13:581.
6. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet journal. 2011;17:10–2.
7. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate illumina paired-end reAd mergeR. Bioinformatics. 2013;30:614–20.
8. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France