read_tsv("html/ribo.tsv", col_types = "cffiiidi") |>
  mutate(sample = stringr::str_split_i(file, pattern = "/", 3), .before = 1, .keep = "unused") |>
  DT::datatable()
CAVIAR
Whole genome analysis of Enterococcus cecorum
This document reports the analyses performed. It contains all the code used to analyze the data. Tool versions (sometimes given in the code chunks) and references are indicated for reproducibility.
Aim of the project
The aim of this project is to develop a workflow to analyze Enterococcus cecorum strains.
Partners
- Cédric Midoux - Migale bioinformatics facility - BioInfomics - INRAE
- Valentin Loux - Migale bioinformatics facility - BioInfomics - INRAE
- Pascale Serror - MICALIS - INRAE
Deliverables
Deliverables agreed at the preliminary meeting (Table 1).
# | Definition |
---|---|
1 | Snakemake workflow |
2 | HTML reports |
Data management
All data are managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which are to be deposited on public repositories before the results are used. We can guide you through the submission process. We will then decide together which files to keep, knowing that this report will also be provided to you and that the analyses can be rerun if needed.
Raw data
The raw data are provided by the partner in the directory front.migale.inrae.fr:/work_projet/cecotype/2022-CAVIAR-SEQUENCING/RAW. It contains 56 shotgun sequencing runs of Enterococcus cecorum strains. The project focuses on the development of the workflow, not only on the data analysis as such.
Some samples were sequenced twice:
- comA_004_2 and ceco_comA_004_2
- comA_122_1 and ceco_comA_122_1
Some samples are missing:
- comA_244_1
- comI_262_1
I have just solved the mystery of the missing sequences. The wrong DNA was sent twice!
This explains the duplicates!
I have no recollection of how this error, which is a glaring one, came about!
It is too silly. I can possibly send these 2 DNAs.
Pascale Serror - 2023-05-03
Workflow
We used a snakemake workflow. It aims to assemble Enterococcus cecorum genomes from raw reads and published reference data.
Config
The main parameters are specified in the config/config.yaml file.
They include:
Variable | Definition | Default |
---|---|---|
samples | Table of samples with two columns, sample and library (example: ceco_comA_004_2 lib593974) | /work_home/cmidoux/GIT/wf_caviar/config/samples.tsv |
raw_data | Data path | /work_projet/cecotype/2022-CAVIAR-SEQUENCING/RAW |
workdir | Results path | /work_home/cmidoux/caviar |
subsample | Size of sub-samples for easy assembly | 750000 |
k_spades | k-mer sizes used by spades | 21,33,55,77 |
reference | Reference genome path, used by riboSeed | /work_projet/cecotype/REF/NCTC12421/NCTC12421.fasta |
genus | Genus used by prokka for annotation | Enterococcus |
species | Species used by prokka for annotation | cecorum |
proteins | Gene catalogue path used by prokka for annotation | /work_projet/cecotype/NANOPORE_ASSEMBLY-2020/Ref/Refseq/Enterococcus/proteins.faa |
eggnog_db | eggNOG database path | /db/outils/eggnog-mapper/ |
kaiju_db | Kaiju database path | /db/outils/kaiju-2021-03/nr_euk/ |
Outputs
- results/qc/multiqc.html: MULTIQC report with:
  - FASTQC raw data quality report
  - FASTP trimming report
  - QUAST assembly report
  - PROKKA annotation report
- results/kaiju/krona.html: Raw data taxonomic annotation.
- results/assembly/{sample}/contigs.fasta: Sample assembly after fastp, seqtk_subsample, riboSeed and spades.
- results/annot/prokka/{sample}/{sample}.gbk: Contig annotation by prokka.
- results/annot/eggnog/{sample}.emapper.hits: Functional annotation by eggNOG.
- results/checkm/results.tsv: Assessment of genome quality (completeness and contamination) by CheckM.
Results & Notes
FASTQC
The quality metrics (Phred score, read length, %GC) are good enough to proceed.
A few adapter sequences can be found, but not too many.
Kaiju
Raw data are taxonomically annotated with kaiju against the nr_euk database.
Results are available in the HTML report.
We plotted the taxonomic composition at the “species” level.
vroom::vroom("html/kaiju.tsv", delim = "\t", col_types = "fddif") |>
  mutate(sample = as_factor(stringr::str_split_i(file, pattern = "/", 3)), .before = 1, .keep = "unused") |>
  mutate(taxon_name = fct_reorder(taxon_name, percent, .desc = TRUE, .na_rm = FALSE)) |>
  ggplot(aes(fill = taxon_name, y = reads, x = sample)) +
  geom_bar(position = "fill", stat = "identity") +
  scale_fill_brewer(palette = "Set1", labels = ~ stringr::str_wrap(.x, width = 50)) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  theme(legend.position = "bottom") +
  guides(fill = guide_legend(ncol = 2))
Eight samples have a high contamination rate and are not annotated as Enterococcus cecorum:
- comA_171_1
- comA_243_1
- comA_83_2
- comI_154_2
- comI_156_1
- comI_183_1
- comI_191_2
- comI_244_3
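The screening described above can be sketched in R as follows. This is a minimal illustration, not the code used in the workflow: the toy table, its file paths, sample names and read counts are hypothetical, and only the column names (file, percent, reads, taxon_name) and the path-splitting step follow the plotting code.

```r
library(dplyr)
library(stringr)

# Hypothetical toy table in the shape of html/kaiju.tsv (paths and counts are invented)
kaiju <- tibble::tribble(
  ~file,                        ~percent, ~reads, ~taxon_name,
  "results/kaiju/sample_A/out", 92.0,     9200,   "Enterococcus cecorum",
  "results/kaiju/sample_A/out",  3.0,      300,   "Enterococcus faecalis",
  "results/kaiju/sample_B/out", 15.0,     1500,   "Enterococcus cecorum",
  "results/kaiju/sample_B/out", 70.0,     7000,   "Enterococcus durans"
)

expected_species <- "Enterococcus cecorum"

# Keep the dominant taxon of each sample; the sample name is the
# third "/"-separated field of the file path, as in the plotting code
dominant <- kaiju |>
  mutate(sample = str_split_i(file, pattern = "/", 3), .before = 1, .keep = "unused") |>
  group_by(sample) |>
  slice_max(percent, n = 1, with_ties = FALSE) |>
  ungroup()

# Samples whose dominant taxon is not the expected species
suspect <- filter(dominant, taxon_name != expected_species)
print(suspect)
```

Comparing only the dominant taxon is a deliberate simplification; a threshold on the percentage of reads assigned to the expected species would be an equally reasonable criterion.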
riboSeed
riboSeed failed to produce contigs for 9 samples: all 8 contaminated samples and comA_267_1.
SPAdes
We used SPAdes for the assembly.
Assembly results are available in the QUAST report.
Here again, several samples are problematic:
- Contaminated samples: comA_171_1, comA_243_1, comA_83_2, comI_156_1, comI_183_1, comI_191_2 and comI_244_3
- A new one: comA_177_1
CheckM
We evaluated the robustness of the assemblies with CheckM.
checkm <- vroom::vroom("html/checkm.tsv", col_types = "ffiiiiiiiiiddd", .name_repair = snakecase::to_snake_case)
DT::datatable(checkm)
Robustness estimation of genome
p <- ggplot(checkm, aes(x = completeness, y = contamination, col = strain_heterogeneity)) +
  ggiraph::geom_point_interactive(aes(tooltip = bin_id)) +
  expand_limits(x = c(0, 100), y = c(0, 100)) +
  theme(legend.position = "bottom")
ggiraph::girafe(ggobj = p)
Robustness estimation of genome
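Beyond visual inspection, assemblies can be flagged from the completeness and contamination columns. The sketch below is only an illustration: the toy values are invented, and the two cut-offs are assumptions to be adapted, not thresholds used in this report.

```r
library(dplyr)

# Hypothetical CheckM-like table; names and values are invented for illustration
checkm_toy <- tibble::tribble(
  ~bin_id,    ~completeness, ~contamination,
  "sample_A", 99.6,           0.4,
  "sample_B", 98.9,          35.2,
  "sample_C", 61.3,           1.1
)

# Assumed cut-offs (not taken from the report)
min_completeness  <- 95
max_contamination <- 5

# Assemblies that fail at least one of the two criteria
flagged <- checkm_toy |>
  mutate(pass = completeness >= min_completeness & contamination <= max_contamination) |>
  filter(!pass)
print(flagged)
```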
We also used dRep to cluster the genomes.
read_csv("html/dRep_Cdb.csv", col_types = "cfdfff") |>
  DT::datatable()
The primary clustering dendrogram is available.
MASH clustering builds 3 clusters:
- Cluster 3_1: Enterococcus durans-like, with comI_244_3, comI_183_1, comI_156_1 and comA_243_1.
- Cluster 2_1: Enterococcus faecalis-like, with comI_191_2 and comI_154_2.
- Cluster 1, with 11 sub-clusters.
Details can be found in the table and figures.
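The cluster counts above can be derived from the dRep table with a short summary. This is a minimal sketch under stated assumptions: the genome names are invented, the column names genome and secondary_cluster are assumed to match dRep's Cdb.csv, and the primary cluster is taken to be the part of the secondary cluster label before the underscore.

```r
library(dplyr)
library(stringr)

# Hypothetical excerpt in the shape of html/dRep_Cdb.csv (genome names are invented)
cdb <- tibble::tribble(
  ~genome,          ~secondary_cluster,
  "sample_A.fasta", "1_1",
  "sample_B.fasta", "1_1",
  "sample_C.fasta", "1_2",
  "sample_D.fasta", "2_1",
  "sample_E.fasta", "3_1"
)

# Number of genomes and of sub-clusters per primary cluster
cluster_summary <- cdb |>
  mutate(primary_cluster = str_split_i(secondary_cluster, "_", 1)) |>
  group_by(primary_cluster) |>
  summarise(n_genomes = n(), n_sub_clusters = n_distinct(secondary_cluster))
print(cluster_summary)
```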
PROKKA & eggNOG
Finally, we annotated the contigs with prokka.
Annotated data are available in results/annot/prokka/{sample}/{sample}.gbk for subsequent analyses.