CAVIAR

Whole genome analysis of Enterococcus cecorum

closed
collaboration
Authors and affiliations

Cédric Midoux, Migale bioinformatics facility
Valentin Loux, Migale bioinformatics facility

Published

April 24, 2023

Modified

April 30, 2023

Note

This document reports the analyses performed. It includes all the code used to analyze the data. Tool versions (possibly given in code chunks) and their references are indicated for reproducibility.

Aim of the project

The aim of this project is to develop a workflow to analyze Enterococcus cecorum strains.

Partners

  • Cédric Midoux - Migale bioinformatics facility - BioinfOmics - INRAE
  • Valentin Loux - Migale bioinformatics facility - BioinfOmics - INRAE
  • Pascale Serror - MICALIS - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables
#  Definition
1  Snakemake workflow
2  HTML reports

Data management

Important

All data are managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which will be deposited in public repositories before the results are used. We can guide you through the submission process. We will then decide which files to keep, bearing in mind that this report will also be provided to you and that the analyses can be rerun if needed.

Raw data

The raw data are provided by the partner in the directory front.migale.inrae.fr:/work_projet/cecotype/2022-CAVIAR-SEQUENCING/RAW. It contains 56 shotgun sequencing datasets of Enterococcus cecorum strains. The project focuses on developing the workflow, not only on analyzing the data as such.

Important

Some samples were sequenced twice:

  • comA_004_2 & ceco_comA_004_2
  • comA_122_1 & ceco_comA_122_1

Some samples are missing:

  • comA_244_1
  • comI_262_1
Tip

I have just solved the mystery of the missing sequences. The wrong DNA was sent, twice!
This explains the duplicates!
I have no memory of how this error, which is a glaring one, came about!
It is too silly. I can send these 2 DNAs if needed.

Pascale Serror - 2023-05-03

Workflow

We used a Snakemake [1] workflow versioned on ForgeMIA.

This Snakemake workflow assembles Enterococcus cecorum genomes from raw reads and published reference data.

Config

The main parameters are specified in the config/config.yaml file.

They include:

Variable   Definition                                             Default
samples    Table of samples with two columns, sample and library  /work_home/cmidoux/GIT/wf_caviar/config/samples.tsv
raw_data   Data path                                              /work_projet/cecotype/2022-CAVIAR-SEQUENCING/RAW
workdir    Results path                                           /work_home/cmidoux/caviar
subsample  Size of subsamples, to ease assembly                   750000
k_spades   Size of k-mers used by SPAdes                          21,33,55,77
reference  Reference genome path, used by riboSeed                /work_projet/cecotype/REF/NCTC12421/NCTC12421.fasta
genus      Genus used by Prokka for annotation                    Enterococcus
species    Species used by Prokka for annotation                  cecorum
proteins   Gene catalogue path used by Prokka for annotation      /work_projet/cecotype/NANOPORE_ASSEMBLY-2020/Ref/Refseq/Enterococcus/proteins.faa
eggnog_db  eggNOG database path                                   /db/outils/eggnog-mapper/
kaiju_db   Kaiju database path                                    /db/outils/kaiju-2021-03/nr_euk/

Example row of the samples table: ceco_comA_004_2 lib593974
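
The samples table has two columns, sample and library. As a minimal sketch in Python (the language of Snakemake), and not the workflow's actual code, such a table could be read and checked for names that differ only by a leading ceco_ prefix, as is the case for the two samples sequenced twice. The function names, the assumption of a header-less tab-separated file, and every library identifier other than lib593974 are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_samples(handle):
    """Read a two-column, tab-separated samples table (sample, library).

    Assumption: the table has no header line.
    """
    return [(row[0], row[1]) for row in csv.reader(handle, delimiter="\t") if row]


def find_prefixed_duplicates(samples, prefix="ceco_"):
    """Group sample names that are identical once `prefix` is stripped."""
    groups = defaultdict(list)
    for name, _library in samples:
        key = name[len(prefix):] if name.startswith(prefix) else name
        groups[key].append(name)
    # Keep only the keys that are shared by more than one sample name
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical table content; only lib593974 comes from the report.
example = io.StringIO(
    "ceco_comA_004_2\tlib593974\n"
    "comA_004_2\tlib000001\n"
    "comA_177_1\tlib000002\n"
)
duplicates = find_prefixed_duplicates(read_samples(example))
print(duplicates)
```

Running such a check before launching the workflow would make re-sequenced samples visible early, rather than at the reporting stage.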

Outputs

  • results/qc/multiqc.html : MULTIQC report with :
    • FASTQC raw data quality report
    • FASTP trimming report
    • QUAST assembly report
    • PROKKA annotation report
  • results/kaiju/krona.html : Raw data taxonomic annotation.
  • results/assembly/{sample}/contigs.fasta : Sample assembly after fastp, seqtk_subsample, riboSeed and spades.
  • results/annot/prokka/{sample}/{sample}.gbk : Contig annotation by prokka.
  • results/annot/eggnog/{sample}.emapper.hits : Functional annotation by eggNOG.
  • results/checkm/results.tsv : Assessment of genome quality (completeness and contamination) by CheckM
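
The per-sample paths above use a {sample} placeholder. The following sketch shows how the expected files for a list of samples can be enumerated with plain Python string formatting; in a Snakefile, Snakemake's expand helper plays a similar role. The names SAMPLE_OUTPUTS and expected_outputs are illustrative and do not come from the workflow.

```python
# Per-sample output templates, copied from the list of outputs above
SAMPLE_OUTPUTS = [
    "results/assembly/{sample}/contigs.fasta",
    "results/annot/prokka/{sample}/{sample}.gbk",
    "results/annot/eggnog/{sample}.emapper.hits",
]


def expected_outputs(samples):
    """Return every per-sample output path, sample by sample."""
    return [
        template.format(sample=sample)
        for sample in samples
        for template in SAMPLE_OUTPUTS
    ]


for path in expected_outputs(["comA_177_1"]):
    print(path)
```

Comparing this list with the files actually present in the results directory is one simple way to spot samples for which a step produced nothing.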

Results & Notes

FASTQC

FASTQC [2] results are included in the MULTIQC report.

Note

The quality control metrics (Phred score, length, %GC) are good enough to proceed.

Warning

A few adapter sequences are detected, but only at a low level.

Kaiju

Raw reads are taxonomically classified with Kaiju [3] against the nr_euk database.

Results are available in the HTML report.

The classification is plotted at the “species” level.

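library(tidyverse)  # dplyr, forcats and ggplot2 are used below

# Read the Kaiju summary table (the sample is encoded in the file column)
vroom::vroom("html/kaiju.tsv", delim = "\t", col_types = "fddif") |>
  # The sample name is the third component of the file path
  mutate(sample = as_factor(stringr::str_split_i(file, pattern = "/", 3)), .before = 1, .keep = "unused") |>
  # Order taxa by decreasing percentage
  mutate(taxon_name = fct_reorder(taxon_name, percent, .desc = TRUE, .na_rm = FALSE)) |>
  # Stacked bars scaled to 1: proportion of reads per taxon in each sample
  ggplot(aes(fill = taxon_name, y = reads, x = sample)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_brewer(palette = "Set1", labels = ~ stringr::str_wrap(.x, width = 50)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(ncol = 2))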

Taxonomic affiliation distribution of raw reads at species level with kaiju/nr_euk

Important

Eight samples show a high contamination rate and are not classified as Enterococcus cecorum:

  • comA_171_1
  • comA_243_1
  • comA_83_2
  • comI_154_2
  • comI_156_1
  • comI_183_1
  • comI_191_2
  • comI_244_3
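
The samples listed above were identified from the Kaiju results. A possible way to automate that screening is sketched below in Python: it sums, for each sample, the percentage of reads assigned to Enterococcus cecorum and flags samples under a threshold. The column names follow the table read in the R chunk above (file, percent, reads, taxon_id, taxon_name); the 50 % threshold, the file paths, the function names and all values in the example are assumptions made for illustration, not results of this project.

```python
import csv
import io


def cecorum_percent(handle, target="Enterococcus cecorum"):
    """Return {sample: percent of reads assigned to `target`}.

    Assumption: the sample name is the third component of the `file` path.
    """
    percents = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        sample = row["file"].split("/")[2]
        percents.setdefault(sample, 0.0)
        if row["taxon_name"] == target:
            percents[sample] += float(row["percent"])
    return percents


def flag_contaminated(percents, min_percent=50.0):
    """Return the samples whose target percentage is below `min_percent`."""
    return sorted(s for s, p in percents.items() if p < min_percent)


# Made-up table for two hypothetical samples (all numbers are illustrative).
example = io.StringIO(
    "file\tpercent\treads\ttaxon_id\ttaxon_name\n"
    "results/kaiju/sample_a/kaiju.out\t95.0\t950\t44008\tEnterococcus cecorum\n"
    "results/kaiju/sample_a/kaiju.out\t5.0\t50\t1351\tEnterococcus faecalis\n"
    "results/kaiju/sample_b/kaiju.out\t3.0\t30\t44008\tEnterococcus cecorum\n"
    "results/kaiju/sample_b/kaiju.out\t90.0\t900\t1351\tEnterococcus faecalis\n"
)
flagged = flag_contaminated(cecorum_percent(example))
print(flagged)
```

The threshold would need to be tuned on real data; the stacked bar plot remains the reference for the samples reported here.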

riboSeed

riboSeed [4] was used to improve the assembly across the multiple ribosomal regions of each genome.

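library(tidyverse)  # readr and dplyr are used below
# riboSeed summary; the sample name is the third component of the file path
read_tsv("html/ribo.tsv", col_types = "cffiiidi") |> 
  mutate(sample = stringr::str_split_i(file, pattern = "/", 3), .before = 1, .keep = "unused") |> 
  DT::datatable()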
Important

riboSeed failed to produce contigs for 9 samples: all 8 contaminated samples and comA_267_1.

SPAdes

We used SPAdes [5] to assemble the trimmed, subsampled reads together with the riboSeed contigs.

Assembly results are available in the QUAST report.

Important

Here again, several samples are problematic:

  • Contaminated samples:
    • comA_171_1
    • comA_243_1
    • comA_83_2
    • comI_156_1
    • comI_183_1
    • comI_191_2
    • comI_244_3
  • A new one : comA_177_1
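
The lists of problematic samples differ slightly between steps. The short Python sketch below only restates the sample names given in this report as sets and compares them, which makes the differences explicit. The variable names are illustrative.

```python
# Samples flagged as contaminated in the Kaiju section
kaiju_flagged = {
    "comA_171_1", "comA_243_1", "comA_83_2", "comI_154_2",
    "comI_156_1", "comI_183_1", "comI_191_2", "comI_244_3",
}
# riboSeed: all 8 contaminated samples plus comA_267_1
riboseed_failed = kaiju_flagged | {"comA_267_1"}
# SPAdes: the contaminated samples listed in this section plus comA_177_1
spades_problems = {
    "comA_171_1", "comA_243_1", "comA_83_2", "comI_156_1",
    "comI_183_1", "comI_191_2", "comI_244_3", "comA_177_1",
}

# Flagged by Kaiju but absent from the SPAdes list
print(sorted(kaiju_flagged - spades_problems))    # ['comI_154_2']
# Listed for SPAdes but not flagged by Kaiju
print(sorted(spades_problems - kaiju_flagged))    # ['comA_177_1']
# Failed in riboSeed but absent from the SPAdes list
print(sorted(riboseed_failed - spades_problems))  # ['comA_267_1', 'comI_154_2']
```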

CheckM

We evaluated the quality of the assemblies with CheckM [6], using the Enterococcus genus as reference (Enterococcus cecorum is not available).

# Read the CheckM summary; column names are converted to snake_case (e.g. bin_id)
checkm <- vroom::vroom("html/checkm.tsv", col_types = "ffiiiiiiiiiddd", .name_repair = snakecase::to_snake_case)

DT::datatable(checkm)

Robustness estimation of genome

library(ggplot2)

# Completeness vs contamination for each assembly; hovering a point shows its bin_id
p <- ggplot(checkm, aes(x = completeness, y = contamination , col = strain_heterogeneity)) +
  ggiraph::geom_point_interactive(aes(tooltip = bin_id)) +
  expand_limits(x = c(0, 100), y = c(0, 100)) +
  theme(legend.position = "bottom")

ggiraph::girafe(ggobj = p)

Robustness estimation of genome
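
The completeness and contamination estimates plotted above can also be used to filter assemblies. The Python sketch below keeps the assemblies that satisfy two thresholds. The field names follow the snake_case columns used in the R chunks (bin_id, completeness, contamination); the thresholds of 90 % completeness and 5 % contamination are assumptions, not values chosen in this project, and the rows are made up.

```python
def passes_quality(row, min_completeness=90.0, max_contamination=5.0):
    """Return True when a CheckM row satisfies both thresholds (in percent)."""
    return (
        row["completeness"] >= min_completeness
        and row["contamination"] <= max_contamination
    )


# Hypothetical CheckM rows; the identifiers and numbers are illustrative only.
rows = [
    {"bin_id": "genome_a", "completeness": 99.2, "contamination": 0.4},
    {"bin_id": "genome_b", "completeness": 97.5, "contamination": 38.0},
    {"bin_id": "genome_c", "completeness": 61.0, "contamination": 1.1},
]
kept = [row["bin_id"] for row in rows if passes_quality(row)]
print(kept)
```

Stricter or looser thresholds can be passed as arguments, depending on how the assemblies are to be used downstream.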

We also used dRep [7] to compare and cluster assembled genomes.

# dRep cluster table: one row per genome with its cluster assignment
readr::read_csv("html/dRep_Cdb.csv", col_types = "cfdfff") |> 
  DT::datatable()

The primary clustering dendrogram is available.

Note

MASH clustering produced 3 clusters:

  • Cluster 3_1 : Enterococcus durans-like with comI_244_3, comI_183_1, comI_156_1 and comA_243_1.
  • Cluster 2_1 : Enterococcus faecalis-like with comI_191_2 and comI_154_2.
  • Cluster 1 with 11 sub-clusters.

Details can be found in the table and figures.
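
Cluster membership can be summarized from (genome, cluster) pairs such as those of the dRep table. The Python sketch below groups the samples named in the note above by cluster and derives the top-level cluster from the part of the label before the underscore. The function names and the assumption that genomes are identified by their sample name are illustrative.

```python
from collections import defaultdict


def group_by_cluster(pairs):
    """Group genome names by cluster label, with sorted members."""
    clusters = defaultdict(list)
    for genome, cluster in pairs:
        clusters[cluster].append(genome)
    return {cluster: sorted(genomes) for cluster, genomes in clusters.items()}


def top_level_cluster(cluster):
    """Return the part of a label such as '3_1' that precedes the underscore."""
    return cluster.split("_")[0]


# Memberships taken from the note above (clusters 3_1 and 2_1 only)
pairs = [
    ("comI_244_3", "3_1"), ("comI_183_1", "3_1"),
    ("comI_156_1", "3_1"), ("comA_243_1", "3_1"),
    ("comI_191_2", "2_1"), ("comI_154_2", "2_1"),
]
clusters = group_by_cluster(pairs)
for cluster, genomes in sorted(clusters.items()):
    print(top_level_cluster(cluster), cluster, genomes)
```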

PROKKA & eggNOG

Finally, we annotated the contigs with Prokka [8] and eggNOG [9].

The annotations are available in results/annot/prokka/{sample}/{sample}.gbk for subsequent analysis.

References

1. Köster J, Rahmann S. Snakemake—a scalable bioinformatics workflow engine. Bioinformatics. 2012;28:2520–2.
2. Andrews S. FastQC: A quality control tool for high throughput sequence data. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
3. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7:11257.
4. Waters NR, Abram F, Brennan F, Holmes A, Pritchard L. riboSeed: Leveraging prokaryotic genomic architecture to assemble across ribosomal regions. Nucleic Acids Research. 2018;46:e68. doi:10.1093/nar/gky212.
5. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.
6. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.
7. Olm MR, Brown CT, Brooks B, Banfield JF. dRep: A tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. The ISME Journal. 2017;11:2864–8. doi:10.1038/ismej.2017.126.
8. Seemann T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–9. doi:10.1093/bioinformatics/btu153.
9. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular Biology and Evolution. 2021;38:5825–9. doi:10.1093/molbev/msab293.

Reuse

This document will not be made accessible without the prior agreement of the partners.

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France