CAREMBOUCHE

Genome analysis strategy to predict the proteolytic capacity of the strain Lacticaseibacillus casei CNCM I-5663

closed

collaboration

Author

Affiliation

Olivier Rué

Migale bioinformatics facility

Published

March 20, 2024

Modified

June 30, 2024

Note

This document is a report of the analyses performed. You will find all the code used to analyze these data. The version of the tools (maybe in code chunks) and their references are indicated, for questions of reproducibility.

Aim of the project

At Carembouche, we are developing a lactobacillus with probiotic properties by adding it to an oral nutritional supplement for malnourished people. As part of a collaboration with the STLO unit (INRAE, Rennes), we are seeking to evaluate the proteolytic capacities of our strain. We would also like to evaluate fibrolytic capacities. The genome of Lacticaseibacillus casei CNCM I-5663 has been sequenced by illumina.

Partners

Olivier Rué - Migale bioinformatics facility - BioInfomics - INRAE
Paul Biscarrat - Carembouche - Private

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables

	Definition	Status
1	HTML report	✔️
2	Assembly (FASTA)	✔️
3	Annotation (GFF)	✔️

Data management

Important

All data is managed by the migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata that will be deposited on public repositories before the results are used. We can guide you in the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be replayed if needed.

Raw data

Genome assembly and prokka annotation were deposited on the front server and a copy was sent to the abaca server.

Galaxy13-fna.fasta

Show the code

conda activate seqkit-2.0.0
seqkit stat Galaxy13-fna.fasta

file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
Galaxy13-fna.fasta  FASTA   DNA        449  3,116,159      227  6,940.2  805,683

The assembly consists of 449 contigs, with a length of 3.1 Mb. Assembly length is close to that of other strains, including ASM82905v1

Evaluation of the provided assembly

Quast [1] stands for QUality ASsessment Tool. It evaluates genome/metagenome assemblies by computing various metrics.

Show the code

cd ~/work/PROJECTS/CAREMBOUCHE/
mkdir QUAST
conda activate quast-5.0.2 && HOME=/projet/tmp/ && quast.py Galaxy13-fna.fasta -o QUAST --min-contig 10 -r REF/GCF_000829055.1_ASM82905v1_genomic.fna && conda deactivate

Quast report

Busco [2] provides a quantitative assessment of the completeness in terms of expected gene content of a genome assembly, transcriptome, or annotated gene set.

Show the code

mkdir BUSCO
cd BUSCO
conda activate busco-5.3.2
wget https://busco-data.ezlab.org/v5/data/lineages/lactobacillales_odb10.2024-01-08.tar.gz
tar zxvf lactobacillales_odb10.2024-01-08.tar.gz
2024-06-19 11:42:26 INFO:       ***** Start a BUSCO v5.3.2 analysis, current time: 06/19/2024 11:42:26 *****
2024-06-19 11:42:26 INFO:       Configuring BUSCO with local environment
2024-06-19 11:42:26 INFO:       Mode is genome
2024-06-19 11:42:26 INFO:       Downloading information on latest versions of BUSCO data...
2024-06-19 11:42:26 INFO:       Input file is /work_home/orue/PROJECTS/CAREMBOUCHE/Galaxy13-fna.fasta
2024-06-19 11:42:26 INFO:       Downloading file 'https://busco-data.ezlab.org/v5/data/lineages/lactobacillales_odb10.2024-01-08.tar.gz'
2024-06-19 11:42:28 INFO:       Decompressing file '/work_home/orue/PROJECTS/CAREMBOUCHE/BUSCO/busco_downloads/lineages/lactobacillales_odb10.tar.gz'
2024-06-19 11:42:31 INFO:       Running BUSCO using lineage dataset lactobacillales_odb10 (prokaryota, 2024-01-08)
2024-06-19 11:42:31 INFO:       ***** Run Prodigal on input to predict and extract genes *****
2024-06-19 11:42:31 INFO:       Running Prodigal with genetic code 11 in single mode
2024-06-19 11:42:31 INFO:       Running 1 job(s) on prodigal, starting at 06/19/2024 11:42:31
2024-06-19 11:42:37 INFO:       [prodigal]      1 of 1 task(s) completed
2024-06-19 11:42:37 INFO:       Genetic code 11 selected as optimal
2024-06-19 11:42:37 INFO:       ***** Run HMMER on gene sequences *****
2024-06-19 11:42:37 INFO:       Running 402 job(s) on hmmsearch, starting at 06/19/2024 11:42:37
2024-06-19 11:42:40 INFO:       [hmmsearch]     41 of 402 task(s) completed
2024-06-19 11:42:40 INFO:       [hmmsearch]     81 of 402 task(s) completed
2024-06-19 11:42:40 INFO:       [hmmsearch]     121 of 402 task(s) completed
2024-06-19 11:42:41 INFO:       [hmmsearch]     161 of 402 task(s) completed
2024-06-19 11:42:41 INFO:       [hmmsearch]     202 of 402 task(s) completed
2024-06-19 11:42:41 INFO:       [hmmsearch]     242 of 402 task(s) completed
2024-06-19 11:42:42 INFO:       [hmmsearch]     282 of 402 task(s) completed
2024-06-19 11:42:42 INFO:       [hmmsearch]     322 of 402 task(s) completed
2024-06-19 11:42:42 INFO:       [hmmsearch]     362 of 402 task(s) completed
2024-06-19 11:42:43 INFO:       [hmmsearch]     402 of 402 task(s) completed
2024-06-19 11:42:43 INFO:       Results:        C:99.3%[S:98.8%,D:0.5%],F:0.7%,M:0.0%,n:402        

2024-06-19 11:42:46 INFO:

        --------------------------------------------------
        |Results from dataset lactobacillales_odb10       |
        --------------------------------------------------
        |C:99.3%[S:98.8%,D:0.5%],F:0.7%,M:0.0%,n:402      |
        |399    Complete BUSCOs (C)                       |
        |397    Complete and single-copy BUSCOs (S)       |
        |2      Complete and duplicated BUSCOs (D)        |
        |3      Fragmented BUSCOs (F)                     |
        |0      Missing BUSCOs (M)                        |
        |402    Total BUSCO groups searched               |
        --------------------------------------------------
2024-06-19 11:42:46 INFO:       BUSCO analysis done. Total running time: 19 seconds
2024-06-19 11:42:46 INFO:       Results written in /work_home/orue/PROJECTS/CAREMBOUCHE/BUSCO/busco_output
2024-06-19 11:42:46 INFO:       For assistance with interpreting the results, please consult the userguide: https://busco.ezlab.org/busco_userguide.html

2024-06-19 11:42:46 INFO:       Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO

generate_plot.py -wd busco_output/

CheckM [3] provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

Show the code

cd /home/orue/work/PROJECTS/CAREMBOUCHE/CHECKM
conda activate checkm2-0.1.3
checkm2 predict --threads 16 --input ../SPADES/BUSCO/busco_output/prodigal_output/predicted_genes/predicted.faa --output-directory . --genes 
more quality_report.tsv 
Name    Completeness    Contamination   Completeness_Model_Used Additional_Notes
predicted       99.76   1.13    Neural Network (Specific Model) None

Paul sent me a sequence (https://www.ncbi.nlm.nih.gov/protein/AFJ15093.1) found in a publication where the authors are looking for putative Prt genes in the genome of L. casei by blasting this sequence.

Important

This assembly is very fragmented (449 contigs). Less a genome is fragmented, better is theorically the gene annotation.

I suggested to Paul that I test other assemblers myself to try to improve the assembly.

New assembly

I got the FASTQ files in the Galaxy of Gladys. They contain about 4 M reads (150 bp). I tested different assemblers.

UNICYCLER [4] is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser.

Show the code

cd /home/orue/work/PROJECTS/CAREMBOUCHE/UNICYCLER
conda activate unicycler-0.5.0
unicycler -1 ../FASTQ/data_R1.fastq.gz -2 ../FASTQ/data_R2.fastq.gz -o . --keep 0 -t 16
grep -c '>' /work_home/orue/PROJECTS/CAREMBOUCHE/UNICYCLER/assembly.fasta
38

SPAdes [5] is a versatile toolkit designed for assembly and analysis of sequencing data. SPAdes is primarily developed for Illumina sequencing data, but can be used for IonTorrent as well. Most of SPAdes pipelines support hybrid mode, i.e. allow using long reads (PacBio and Oxford Nanopore) as a supplementary data.

SPAdes package contains assembly pipelines for isolated and single-cell bacterial, as well as metagenomic and transcriptomic data. Additional modes allow to discover bacterial plasmids and RNA viruses, as well as perform HMM-guided assembly. Besides, SPAdes package includes supplementary tools for efficient k-mer counting and k-mer-based read filtering, assembly graph construction and simplification, sequence-to-graph alignment and metagenomic binning refinement.

Show the code

conda activate spades-3.15.3
spades.py -1 ../FASTQ/data_R1.fastq.gz -2 ../FASTQ/data_R2.fastq.gz -k 11,17,25,31,43,55,77,81 -t 16 -o . --isolate
conda deactivate

MEGAHIT [6] is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on generic single genome assembly (small or mammalian size) and single-cell assembly.

Assembly metrics
tool	contigs	total length	N50	largest contig	CDS
Original (Spades single-end)	449	3116159	234751	805683	2778
Spades	475	3124564	257129	848256	2777
Megahit	85	3026066	234879	805707	2764
Unicycler	38	3005286	403998	847367	2756

Important

The assembly produced by Unicycler is less fragmented than others, for a similar assembly length and gene content. This assembly will be used as reference for now.

Annotation with Bakta

Bakta [7] is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis.

Show the code

conda activate bakta-1.9.1
bakta --db /db/outils/bakta-1.9.1/db/ --output BAKTA/ assembly.fasta --threads 8

Annotation with DBCAN

run_dbcan [8] is the standalone version of the dbCAN3 annotation tool for automated CAZyme annotation. This tool, known as run_dbcan, incorporates HMMER, Diamond, and dbCAN_sub for annotating CAZyme families, and integrates Cazyme Gene Clusters (CGCs) and substrate predictions.

Show the code

mkdir /home/orue/work/PROJECTS/CAREMBOUCHE/UNICYCLER/DBCAN
cd /home/orue/work/PROJECTS/CAREMBOUCHE/UNICYCLER/

conda activate run_dbcan-3.0.2
#run_dbcan assembly.fasta prok --out_dir DBCAN --db_dir /db/outils/dbcan/ --tools all --hmm_cpu 16 --eCAMI_jobs 16 --cluster ../BAKTA/assembly.gff3
run_dbcan BAKTA/assembly.fna prok --out_dir DBCAN2 --db_dir /db/outils/dbcan/ --tools all --hmm_cpu 16 --eCAMI_jobs 16 --cluster BAKTA/assembly.gff3

egrep -v -e "^[ATCGN]" -e "^##FASTA" -e "^>contig"  BAKTA/assembly.gff3 > BAKTA/assembly_without_seqs.gff3 

python ~/save/SCRIPTS/add_dbcan_to_gff3.py --gff BAKTA/assembly.gff3 --overview DBCAN2/overview.txt --gff-prodigal DBCAN2/prodigal.gff --output dbcan2.gff3 --cgc-file DBCAN2/cgc_standard.out

Blast lactocepin, partial sequence against genes

A lactocepin sequence, see ref was then blasted against the genes with blastp [9]. This sequence was previously used to find putative Prt genes in L. casei genomes.

Show the code

cd /home/orue/work/PROJECTS/CAREMBOUCHE/UNICYCLER
mkdir BLAST
conda activate blast-2.15.0
makeblastdb -in BAKTA/assembly.faa -dbtype prot
cd BLAST
blastp -query ../../paul_sequence_putative_prt_lcassei.prot.fasta -db ../BAKTA/assembly.faa -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore stitle" -out alignment.out

Here is the location of the best hit: contig_2 9923 11812

Note

The sequence is located in a CDS annotated Peptidase S8.

Annotation with Eggnog-mapper

EggNOG-mapper [10] is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database (http://eggnog5.embl.de) to transfer functional information from fine-grained orthologs only.

Show the code

cd /home/orue/work/PROJECTS/CAREMBOUCHE/UNICYCLER
mkdir EGGNOG
cd EGGNOG
conda activate eggnog-mapper-2.1.5
emapper.py -i ../BAKTA/assembly.faa --output eggnog -m diamond -d bact --cpu 16 --data_dir /db/outils/eggnog-mapper-2.1.5/ --output_dir . --decorate_gff ../dbcan2.gff3
conda deactivate

Annotation with MEROPS

Paul pointed me the MEROPS [11] database for peptidases annotation. The MEROPS database is a resource for information on peptidases (sometimes also termed proteases, proteinases and proteolytic enzymes).

Its use is complicated because it required to build a MySQL instance. Nevertheless, MEROPS gives the possibility to download some flat files:

The file “pepunit.lib” is a non-redundant library of protein sequences in FastA format the peptidase units and inhibitor units of all the peptidases and peptidase inhibitors that are included in MEROPS. The library can be searched by use of FastA without further modification, but must be converted and indexed for BLAST searches. The restriction of the library to peptidase and inhibitor units should decrease the risk of false positive matches to other domains.

Show the code

cd /home/orue/work/PROJECTS/CAREMBOUCHE
wget https://ftp.ebi.ac.uk/pub/databases/merops/current_release/pepunit.lib # Download file
grep -c '>' pepunit.lib 
1227939

The file contains 1,227,939 proteic sequences. Information is found in the sequence header.

Show the code

head -n 2 pepunit.lib
>MER0000001 - chymotrypsin B (Homo sapiens) [S01.152]#S01A#{peptidase unit: 34-263}~source CTRB_HUMAN~
IVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWVVTAAHCGVRTSDVVVAGEFDQGS

I used seqkit [12] to find sequences with some patterns: PrtP, PrtS, Lactobacillus, paracasei and casei.

Show the code

conda activate seqkit-2.0.0
seqkit grep -r -p Lactobacillus -p paracasei -p casei -p PrtP -p Prtp -p prtp -n pepunit.lib > pepunit_lactocasei_and_prtp.lib
grep -c '>' pepunit_lactocasei_and_prtp.lib
8058

and then used fasta36 [13] to align proteic sequences predicted with Bakta against the MEROPS reduced databank.

Show the code

mkdir MEROPS
conda activate fasta-36.3.8e
fasta36 BAKTA/assembly.faa ../pepunit_lactocasei_and_prtp.lib -m 8 -E 0.05 -T 16 > MEROPS/pepunit_lactocasei_and_prtp_fasta36.out
wc -l pepunit_lactocasei_and_prts_fasta36.out
13072

The output file contains information about alignments:

Query ID : Identifiant de la séquence de requête.
Subject ID : Identifiant de la séquence de la base de données (sujet).
% Identity : Pourcentage d’identité entre la séquence de requête et la séquence de la base de données.
Alignment Length : Longueur de l’alignement.
Mismatches : Nombre de mismatches dans l’alignement.
Gap Openings : Nombre de fois où des gaps ont été ouverts dans l’alignement.
Query Start : Position de départ de l’alignement dans la séquence de requête.
Query End : Position de fin de l’alignement dans la séquence de requête.
Subject Start : Position de départ de l’alignement dans la séquence du sujet.
Subject End : Position de fin de l’alignement dans la séquence du sujet.
E-value : Valeur E de l’alignement.
Bit Score : Score de bits de l’alignement.

Show the code

grep ">" pepunit_lactocasei_and_prtp.lib | sed "s/\ -\ /;/g" | sed "s/>//g" > pepunit_lactocasei_and_prtp.lib.infos

I join now the information about subjects

Show the code

alignments <- vroom::vroom("data/pepunit_lactocasei_and_prtp_fasta36.out", col_names=F)
colnames(alignments) <- c("QueryID","SubjectID","Percidentity","AlignmentLength","Mismatches","GapOpenings","QueryStart","QueryEnd","SubjectStart","SubjectEnd","E-value","BitScore")

# on ajoute les infos trouvées dans les headers
infos <- vroom::vroom("data/pepunit_lactocasei_and_prtp.lib.infos", col_names=F, delim = ";")
colnames(infos) <- c("ID","Details")

mytable <- left_join(alignments,infos,by=join_by(SubjectID==ID))
library(reactable)
reactable::reactable(mytable, columns = list(
    Details = colDef(width = 200)
  ))

Tip

You can look in the GFF3 file where hits fall by searching for the query ID. For example the annotation of JCKJDL_02300 is ID=JCKJDL_02300;Name=putative metalloendopeptidase;locus_tag=JCKJDL_02300;product=putative metalloendopeptidase;Dbxref=COG:COG3590,COG:O,RefSeq:WP_049169840.1,SO:0001217,UniParc:UPI0006674A97,UniRef:UniRef100_UPI0006674A97,UniRef:UniRef50_K0N8Q0,UniRef:UniRef90_A0A179YD63;gene=pepO

Show the code

library(rtracklayer)
gff <- readGFF("html/eggnog.emapper.gff")

Downloads

References

1. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. doi:10.1093/bioinformatics/btt086.

2. Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes. Molecular Biology and Evolution. 2021;38:4647–54. doi:10.1093/molbev/msab199.

3. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research. 2015;25:1043–55. doi:10.1101/gr.186072.114.

4. Wick LMAG Ryan R. AND Judd. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology. 2017;13:1–22. doi:10.1371/journal.pcbi.1005595.

5. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.

6. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de bruijn graph. Bioinformatics. 2015;31:1674–6.

7. Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: Rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics. 2021;7. doi:https://doi.org/10.1099/mgen.0.000685.

8. Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Research. 2023;51:W115–21. doi:10.1093/nar/gkad328.

9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215:403–10.

10. Cantalapiedra CP, Hernández-Plaza A, Letunic I, Bork P, Huerta-Cepas J. eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Molecular Biology and Evolution. 2021;38:5825–9. doi:10.1093/molbev/msab293.

11. Rawlings ND, Waller M, Barrett AJ, Bateman A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Research. 2013;42:D503–9. doi:10.1093/nar/gkt953.

12. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PloS one. 2016;11:e0163962.

13. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences. 1988;85:2444–8.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France