Application of stable isotope probing to identify temperature responsive bacterial taxa at the level of 3-hydroxy fatty acid synthesis

closed
collaboration
Authors
Affiliation

Olivier Rué

Migale bioinformatics facility

Christelle Hennequet-Antier

Migale bioinformatics facility

Published

November 22, 2023

Modified

November 30, 2024

Note

This document is a report of the analyses performed. You will find all the code used to analyze these data. The version of the tools (maybe in code chunks) and their references are indicated, for questions of reproducibility.

Aim of the project

The aim of the project is to provide a better understanding of how soil bacteria respond to temperature variations at the membrane level. Bacterial lipids can be used in paleoclimatology to reconstruct temperature of past climates in marine and terrestrial archives. This involves the use of proxies and calculations that are based on the relative amounts of these lipids. In soil, only one organic proxy is available, mainly due to the difficulty to work on soil, which is much more heterogeneous in structure than aquatic environments. A drawback of this sungle proxy is the error which is associated with it as well as the lack of confirmation of the result by a different approach. A new proxy based on 3-hydroxy fatty acids has been proposed in 2016. To better understand and validate this proxy, as well as to investigate new more precise proxies, we aim at understanding what soil bacterial species contribute to the proposed proxy by investigating their response te temperature at the taxonomic and lipid levels. We have used a DNA- and lipid-SIP approach during a time-course experiment and are now at the stage of analyzing what species have used the 13C labelled substrate in a temperature dependent manner. The profile of the enriched taxa in the 13C fraction will be compared in the different samples (different temperature, different incubation time) and will be compared with the 13C-enriched lipid profiles in the same samples.

Partners

  • Olivier Rué - Migale bioinformatics facility - BioInfomics - INRAE
  • Christelle Hennequet-Antier - Migale bioinformatics facility - BioInfomics - INRAE
  • Sylvie Collin - Sorbonne Université

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables
  Definition
1 HTML report

Data management

Important

All data is managed by the migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata that will be deposited on public repositories before the results are used. We can guide you in the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be replayed if needed.

Raw data

Raw data were sent on November 15 and deposited on the front server and a copy was sent to the abaca server.

Show the code
cd /home/orue/work/PROJECTS/TEMPO/
mkdir RAW_DATA
# Rename FASTQ files
for i in DiversiteTEMPO/donneesbrutes/*_R1.fastq.gz ; do id=$(echo basename $i |cut -d '.' -f 5 ) ;cp $i RAW_DATA/${id}.fastq.gz ; done
for i in DiversiteTEMPO/donneesbrutes/*_R2.fastq.gz ; do id=$(echo basename $i |cut -d '.' -f 5 ) ;cp $i RAW_DATA/${id}.fastq.gz ; done
scp -r RAW_DATA/ orue@abaca.maiage.inrae.fr:/backup/partage/migale/TEMPO/

Quality control

seqkit [1] was used to get informations from FASTQ files.

Show the code
# seqkit
cd /home/orue/work/PROJECTS/TEMPO/
qsub -cwd -V -N seqkit -pe thread 4 -R y -b y "conda activate seqkit-2.0.0 && seqkit stats /home/orue/work/PROJECTS/TEMPO/RAW_DATA/*.fastq.gz -j 4 > raw_data.infos && conda deactivate"

We can plot and display the number of reads to see if enough reads are present and if samples are homogeneous.

FastQC [2] is a program designed to spot potential problems in high througput sequencing datasets. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report which summarises the results. MultiQC [3] aggregates results from bioinformatics analyses across many samples into a single report.

Show the code
cd /home/orue/work/PROJECTS/TEMPO/
mkdir FASTQC LOGS
for i in /home/orue/work/PROJECTS/TEMPO/RAW_DATA/*.fastq.gz ; do echo "conda activate fastqc-0.11.9 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done
qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
qsub -cwd -V -N multiqc -o LOGS -e LOGS -b y "conda activate multiqc-1.11 && multiqc FASTQC -o MULTIQC && conda deactivate"

The MultiQC report shows expected metrics for Illumina Miseq sequencing data.

Note

The quality control is good enough to go further. Only the PCR blank sample has too few reads. We will remove it later after having checked its composition.

Bioinformatics

We used FROGS [4], [5] to build amplicon sequence variants (ASVs) from raw reads.

The first tool, called preprocess allows to clean reads. From FASTQ files, we overlap R1 and R2 reads, then look at primers and remove them, filter on length and finally dereplicate remaining sequences.

Show the code
mkdir FROGS4
cd FROGS4
conda activate frogs-4.1.0
preprocess.py illumina --merge-software pear --min-amplicon-size 250 --max-amplicon-size 580 --five-prim-primer CCTACGGGNGGCWGCAG --three-prim-primer GGATTAGATACCCBDGTAGTC --R1-size 300 --R2-size 300 --nb-cpus 16 --input-archive ../RAW_DATA/Tempo.tar.gz

The preprocess report shows that almost all reads are kept (7,591,113 from 7,776,650), indicating that there were no problems during sequencing.

Now we cluster sequences with swarm [6].

Show the code
clustering.py --input-fasta preprocess.fasta --input-count preprocess_counts.tsv --nb-cpus 16 --distance 1 --fastidious

Then, we removed chimera with vsearch [7].

Show the code
remove_chimera.py --input-fasta clustering.fasta --input-biom clustering.biom --non-chimera remove_chimera.fasta --nb-cpus 8 --log-file remove_chimera.log --out-abundance remove_chimera.biom --summary remove_chimera.html

The remove chimera report shows that 77.5% of the clusters (but 30% of the sequences) are considered as chimera. It is expected for 16S data.

Finally, we filter ASVs to remove low-abundant ASVs (composed of few sequences). We use the threshold of 0.005% of total abundance. We check also if phiX sequences are remaining.

Show the code
cluster_filters.py --input-fasta remove_chimera.fasta --input-biom remove_chimera_abundance.biom --output-fasta filters.fasta --nb-cpus 8 --log-file filters.log --output-biom filters.biom --summary filters.html --excluded filters_excluded.tsv --contaminant /db/outils/FROGS/contaminants/phi.fa --min-sample-presence 1 --min-abundance 0.00005

The filters report is very informative. The filters have removed 99.6% of clusters. 1,901 ASVs are remaining. However, only 24.8% of the sequences were deleted. The smallest ASV is composed of 267 sequences. No phiX contamination is observed.

Now, it’s time to affiliate our ASVs to know what they are. We used blast [8] with Silva v.138.1 [9].

Show the code
taxonomic_affiliation.py --reference /db/outils/FROGS/assignation/silva_138.1_16S_pintail100/silva_138.1_16S_pintail100.fasta --input-biom filters.biom --input-fasta filters.fasta --output-biom affiliation.biom --nb-cpus 8 --taxonomy-ranks 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'

The affiliation report shows that all ASVs have been affiliated. 17.6% of the ASVs are multi-affiliated at Species level. It means that the ASV sequences are shared with several species. It is expected for 16S sequences.

The affiliations stats report allows us to see the taxonomic compositon for each sample. An other interesting information is the alignment distribution. We show that the alignments are good (% identity and coverage). It indicates that our ASVs sequences were present in our databank.

Here are the informations of the ASVs:

Finally, we can build a phylogenetic tree from the sequences. It will allow us to use some diversity indices that need a phylogenetic tree.

Show the code
ree.py --biom-file affiliation.biom --input-sequences filters.fasta --out-tree tree.nwk --log-file tree.log --html tree.html

Here is the tree report.

Biostatistics

Diversity Analyses with all samples

Diversity Analyses with Time T0-T14-T28 samples

Diversity Analyses with Time T3-T7 samples

Reports on diversity analyses for each phyloseq object

Reproducibility token

Show the code
sessioninfo::session_info(pkgs = "attached")
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       Ubuntu 24.04.1 LTS
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  fr_FR.UTF-8
 ctype    fr_FR.UTF-8
 tz       Europe/Paris
 date     2025-02-18
 pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package                  * version date (UTC) lib source
 Biobase                  * 2.64.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 BiocGenerics             * 0.50.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 Biostrings               * 2.72.1  2024-06-02 [1] Bioconductor 3.19 (R 4.4.0)
 ComplexHeatmap           * 2.20.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 data.table               * 1.16.4  2024-12-06 [1] CRAN (R 4.4.2)
 dplyr                    * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
 DT                       * 0.33    2024-04-04 [1] CRAN (R 4.4.0)
 emmeans                  * 1.10.7  2025-01-31 [1] CRAN (R 4.4.2)
 forcats                  * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
 GenomeInfoDb             * 1.40.1  2024-05-24 [1] Bioconductor 3.19 (R 4.4.0)
 GenomicRanges            * 1.56.2  2024-10-09 [1] Bioconductor 3.19 (R 4.4.1)
 ggplot2                  * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
 ggtree                   * 3.12.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 ggtreeExtra              * 1.14.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 InraeThemes              * 3.0.0   2024-05-21 [1] Github (davidcarayon/InraeThemes@9ee5259)
 IRanges                  * 2.38.1  2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
 kableExtra               * 1.4.0   2024-01-24 [1] CRAN (R 4.4.0)
 lubridate                * 1.9.4   2024-12-08 [1] CRAN (R 4.4.2)
 MatrixGenerics           * 1.16.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 matrixStats              * 1.5.0   2025-01-07 [1] CRAN (R 4.4.2)
 metacoder                * 0.3.7   2024-10-03 [1] Github (grunwaldlab/metacoder@54ef8ea)
 mia                      * 1.15.6  2025-01-07 [1] Github (microbiome/mia@2dc3430)
 MicrobiotaProcess        * 1.16.1  2024-07-28 [1] Bioconductor 3.19 (R 4.4.1)
 MultiAssayExperiment     * 1.30.3  2024-07-10 [1] Bioconductor 3.19 (R 4.4.2)
 openxlsx                 * 4.2.8   2025-01-25 [1] CRAN (R 4.4.2)
 phyloseq                 * 1.48.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)
 phyloseq.extended        * 0.1.4.1 2024-05-21 [1] Github (mahendra-mariadassou/phyloseq-extended@150a871)
 purrr                    * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
 readr                    * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
 S4Vectors                * 0.42.1  2024-07-03 [1] Bioconductor 3.19 (R 4.4.1)
 scales                   * 1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
 SingleCellExperiment     * 1.26.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.2)
 stringr                  * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
 SummarizedExperiment     * 1.34.0  2024-05-01 [1] Bioconductor 3.19 (R 4.4.0)
 tibble                   * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
 tidyr                    * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
 tidyverse                * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)
 TreeSummarizedExperiment * 2.12.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.2)
 viridis                  * 0.6.5   2024-01-29 [1] CRAN (R 4.4.0)
 viridisLite              * 0.4.2   2023-05-02 [1] CRAN (R 4.4.0)
 XVector                  * 0.44.0  2024-04-30 [1] Bioconductor 3.19 (R 4.4.0)

 [1] /home/orue/R/x86_64-pc-linux-gnu-library/4.4
 [2] /usr/local/lib/R/site-library
 [3] /usr/lib/R/site-library
 [4] /usr/lib/R/library

──────────────────────────────────────────────────────────────────────────────

References

1. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PloS one. 2016;11:e0163962.
2. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
3. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.
4. Escudié F, Auer L, Bernard M, Mariadassou M, Cauquil L, Vidal K, et al. FROGS: Find, Rapidly, OTUs with Galaxy Solution. Bioinformatics. 2018;34:1287–94. doi:10.1093/bioinformatics/btx791.
5. Bernard M, Rué O, Mariadassou M, Pascal G. FROGS: a powerful tool to analyse the diversity of fungi with special management of internal transcribed spacers. Briefings in Bioinformatics. 2021;22. doi:10.1093/bib/bbab318.
6. Mahé F, Rognes T, Quince C, Vargas C de, Dunthorn M. Swarm v2: Highly-scalable and high-resolution amplicon clustering. PeerJ. 2015;3:e1420.
7. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: A versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.
8. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215:403–10.
9. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic acids research. 2012;41:D590–6.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France