RABATGA

The aim of this project is to characterise a Moroccan bacterial strain

closed
support
Authors
Cédric Midoux (Migale bioinformatics facility)
Valentin Loux (Migale bioinformatics facility)

Published

May 16, 2024

Modified

December 20, 2024

Note

This document reports the analyses performed. It contains all the code used to analyze the data. For reproducibility, the versions of the tools (shown in the code chunks where available) and their references are indicated.

Aim of the project

The aim of this project is to characterise a Moroccan bacterial strain

Partners

  • Cédric Midoux - Migale bioinformatics facility - BioinfOmics - INRAE
  • Valentin Loux - Migale bioinformatics facility - BioinfOmics - INRAE
  • Bahia Rached - CNRST Rabat (Maroc)
  • Christel Maillet - MICALIS - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables
#  Definition
1  HTML report
2  FASTA sequences

Data management

Important

All data are managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which are to be deposited in public repositories before the results are used; we can guide you through the submission process. We will then decide which files to keep, bearing in mind that this report will also be provided to you and that the analyses can be rerun if needed.

Raw data

Sequencing (Illumina and IonTorrent) was performed by CNRST. The raw data files were sent to us and deposited on the front server.

cd /work_home/cmidoux/BAHIA

The IonTorrent file was incomplete, so its data were cut off. We keep only the complete reads.

conda activate seqkit-2.0.0
seqkit seq -m 1 DATA/B599_iontorrent.fastq -o DATA/B599_iontorrent_complet.fastq 
conda deactivate
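
A truncated FASTQ file can also be detected before any filtering. The following is a minimal sketch, not the check used in this project: it assumes unwrapped four-line FASTQ records, and the function name `fastq_check` is hypothetical.

```shell
# Hypothetical helper: count complete records in a FASTQ file and flag
# a file that stops in the middle of a record.
# Assumes four lines per record (no wrapped sequences).
fastq_check() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header must start with @
        NR % 4 == 2 { seqlen = length($0) }                  # remember the sequence length
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator must start with +
        NR % 4 == 0 && length($0) != seqlen { bad = 1 }      # quality string must match the sequence
        END {
            status = (bad || NR % 4) ? "truncated_or_malformed" : "ok"
            printf "records=%d trailing_lines=%d status=%s\n", int(NR / 4), NR % 4, status
        }
    ' "$1"
}
```

For example, `fastq_check DATA/B599_iontorrent.fastq` would print one summary line; a non-zero `trailing_lines` value means the last record is incomplete.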

Analyses

Quality control

We now run fastp [1], a tool designed to provide fast all-in-one preprocessing for FASTQ files. Here, we allow correction in overlapping regions (an overlap of at least 30 bases between R1 and R2), remove reads with more than 5 ambiguous bases (N), and move a sliding window from the tail (3’) to the front, dropping the bases in the window while its mean quality is below 20 and stopping otherwise. Reads shorter than 50 nucleotides are also removed.

mkdir 0_FASTP/
conda activate fastp-0.23.4
fastp --in1 DATA/B599_subsampled_R1.fastq --in2 DATA/B599_subsampled_R2.fastq --out1 0_FASTP/B599_subsampled_R1.fastq.gz --out2 0_FASTP/B599_subsampled_R2.fastq.gz --length_required 50 --html 0_FASTP/B599_subsampled_fastp.html --json 0_FASTP/B599_subsampled_fastp.json --thread 4
fastp --in1 DATA/B599_iontorrent_complet.fastq --out1 0_FASTP/B599_iontorrent.fastq.gz --length_required 50 --html 0_FASTP/B599_iontorrent.fastp.html --json 0_FASTP/B599_iontorrent.json --thread 4
conda deactivate
conda activate multiqc-1.21
multiqc --outdir 00_MULTIQC 0_FASTP
conda deactivate
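
The filters described above correspond to explicit fastp options. The sketch below spells them out for the paired-end library; it is an illustration and not the command run for this report. To our understanding of fastp 0.23, an overlap length of 30, an N limit of 5 and a mean quality of 20 are default values, whereas `--correction` and `--cut_tail` have to be requested explicitly. The variable names are hypothetical, and the command is only printed, not executed.

```shell
# Hypothetical variable names; paths follow the layout used in this report.
sample=B599_subsampled

# Each option below is an existing fastp flag.
opts="--correction --overlap_len_require 30"    # correct mismatches in R1/R2 overlaps of at least 30 bases
opts="$opts --n_base_limit 5"                   # discard reads with more than 5 N bases
opts="$opts --cut_tail --cut_mean_quality 20"   # trim from the 3' end while the window mean quality is below 20
opts="$opts --length_required 50"               # discard reads shorter than 50 nt

cmd="fastp --in1 DATA/${sample}_R1.fastq --in2 DATA/${sample}_R2.fastq"
cmd="$cmd --out1 0_FASTP/${sample}_R1.fastq.gz --out2 0_FASTP/${sample}_R2.fastq.gz"
cmd="$cmd $opts --thread 4"

# Print the command for review instead of running it.
echo "$cmd"
```

Printing the command first makes it easy to compare the requested options with the filtering summary in the fastp HTML report.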

The MultiQC report shows metrics before and after filtering.

Overall read quality is acceptable, but %GC differs between the two files and the insert size distribution is irregular (although this concerns only a small portion of the data).
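
The %GC difference can be checked independently of the MultiQC report. The following is a minimal sketch, assuming unwrapped four-line FASTQ records read from standard input; the function name `fastq_gc` is hypothetical, and N bases are counted in the total length.

```shell
# Hypothetical helper: overall GC percentage of the reads in a FASTQ stream.
# Reads unwrapped four-line FASTQ records from standard input.
fastq_gc() {
    awk '
        NR % 4 == 2 {
            seq = toupper($0)
            total += length(seq)
            gc += gsub(/[GC]/, "", seq)   # gsub returns the number of G and C removed
        }
        END { if (total > 0) printf "%.2f\n", 100 * gc / total }
    '
}
```

A possible use is `gzip -dc 0_FASTP/B599_subsampled_R1.fastq.gz | fastq_gc`, repeated for each file to compare the values.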

Taxonomic affiliation

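We now want to assess quickly, without assembly, the taxonomic composition of the reads. This is an efficient way to detect contamination and to check that we observe what we expect. We use kaiju [2], which translates reads into amino-acid sequences and searches for exact matches in protein reference databanks. Illumina and IonTorrent reads are assigned against the refseq databank, and IonTorrent reads are also assigned against the nr_euk databank.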

mkdir 1_KAIJU
qsub -cwd -V -N kaiju_illumina -pe thread 16 -e LOGS -o LOGS -b y "conda activate kaiju-1.9.2 && kaiju -t /db/outils/kaiju-2023-05/refseq/nodes.dmp -f /db/outils/kaiju-2023-05/refseq/kaiju_db_refseq.fmi -i 0_FASTP/B599_subsampled_R1.fastq.gz -j 0_FASTP/B599_subsampled_R2.fastq.gz -o 1_KAIJU/B599_illumina.kaiju -z 16 && kaiju2krona -t /db/outils/kaiju-2023-05/refseq/nodes.dmp -n /db/outils/kaiju-2023-05/refseq/names.dmp -i 1_KAIJU/B599_illumina.kaiju -o 1_KAIJU/B599_illumina.krona -u && conda deactivate"
qsub -cwd -V -N kaiju_iontorrent -pe thread 16 -e LOGS -o LOGS -b y "conda activate kaiju-1.9.2 && kaiju -t /db/outils/kaiju-2023-05/refseq/nodes.dmp -f /db/outils/kaiju-2023-05/refseq/kaiju_db_refseq.fmi -i 0_FASTP/B599_iontorrent.fastq.gz -o 1_KAIJU/B599_iontorrent.kaiju -z 16 && kaiju2krona -t /db/outils/kaiju-2023-05/refseq/nodes.dmp -n /db/outils/kaiju-2023-05/refseq/names.dmp -i 1_KAIJU/B599_iontorrent.kaiju -o 1_KAIJU/B599_iontorrent.krona -u && conda deactivate"
qsub -cwd -V -N kaiju_iontorrent_nreuk -pe thread 16 -e LOGS -o LOGS -b y "conda activate kaiju-1.9.2 && kaiju -t /db/outils/kaiju-2023-05/nr_euk/nodes.dmp -f /db/outils/kaiju-2023-05/nr_euk/kaiju_db_nr_euk.fmi -i 0_FASTP/B599_iontorrent.fastq.gz -o 1_KAIJU/B599_iontorrent_nreuk.kaiju -z 16 && kaiju2krona -t /db/outils/kaiju-2023-05/nr_euk/nodes.dmp -n /db/outils/kaiju-2023-05/nr_euk/names.dmp -i 1_KAIJU/B599_iontorrent_nreuk.kaiju -o 1_KAIJU/B599_iontorrent_nreuk.krona -u && conda deactivate"

conda activate krona-2.8
ktImportText -o 1_KAIJU/B599-krona.html 1_KAIJU/B599_illumina.krona 1_KAIJU/B599_iontorrent.krona 1_KAIJU/B599_iontorrent_nreuk.krona
conda deactivate

The Krona report built from the kaiju results shows the taxonomic distribution of reads.

Half of the IonTorrent data is unclassified. This may be due to contamination.
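
The proportion of unclassified reads can be read directly from the kaiju output, whose first column is `C` (classified) or `U` (unclassified). The following is a minimal sketch; the function name `kaiju_summary` is hypothetical.

```shell
# Hypothetical helper: summarise a tab-separated kaiju output file,
# where column 1 is C (classified) or U (unclassified).
kaiju_summary() {
    awk -F '\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            n = c + u
            if (n > 0)
                printf "classified=%d unclassified=%d unclassified_pct=%.1f\n", c, u, 100 * u / n
        }
    ' "$1"
}
```

For example, `kaiju_summary 1_KAIJU/B599_iontorrent.kaiju` and `kaiju_summary 1_KAIJU/B599_illumina.kaiju` allow the two libraries to be compared.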

Assembly

We now run SPAdes [3], a tool designed for read correction and assembly. The SPAdes authors advise against assembling Illumina and IonTorrent libraries together.

qsub -cwd -V -N spades -q maiage.q -pe thread 16 -e LOGS -o LOGS -b y "conda activate spades-3.15.3 && spades.py --isolate -t 16 -m 500 --tmp-dir /projet/tmp/ -1 0_FASTP/B599_subsampled_R1.fastq.gz -2 0_FASTP/B599_subsampled_R2.fastq.gz -s 0_FASTP/B599_iontorrent.fastq.gz -o 2_SPADES && conda deactivate"

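We also used Unicycler [4], which is based on SPAdes, to assemble the reads and curate the assembly. Three runs were performed: both libraries together, Illumina only, and IonTorrent only.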

mkdir 3_UNICYCLER
qsub -cwd -V -N unicycler -q maiage.q -pe thread 16 -e LOGS -o LOGS -b y "conda activate unicycler-0.5.0 && unicycler -1 0_FASTP/B599_subsampled_R1.fastq.gz -2 0_FASTP/B599_subsampled_R2.fastq.gz -s 0_FASTP/B599_iontorrent.fastq.gz -o 3_UNICYCLER -t 16 && conda deactivate"

qsub -cwd -V -N unicycler -q maiage.q -pe thread 16 -e LOGS -o LOGS -b y "conda activate unicycler-0.5.0 && unicycler -1 0_FASTP/B599_subsampled_R1.fastq.gz -2 0_FASTP/B599_subsampled_R2.fastq.gz -o 3_UNICYCLER_illumina -t 16 && conda deactivate"
qsub -cwd -V -N unicycler -q maiage.q -pe thread 16 -e LOGS -o LOGS -b y "conda activate unicycler-0.5.0 && unicycler -s 0_FASTP/B599_iontorrent.fastq.gz -o 3_UNICYCLER_iontorrent -t 16 && conda deactivate"

QUAST [5] is used to assess the quality of the assemblies.

mkdir 4_QUAST
conda activate quast-5.2.0
quast --gene-finding -o 4_QUAST -1 0_FASTP/B599_subsampled_R1.fastq.gz -2 0_FASTP/B599_subsampled_R2.fastq.gz -l "spades, unicycler, unicycler_illumina, unicycler_iontorrent" --threads 4 2_SPADES/contigs.fasta 3_UNICYCLER/assembly.fasta 3_UNICYCLER_illumina/assembly.fasta 3_UNICYCLER_iontorrent/assembly.fasta
conda deactivate

Because of the heterogeneity of the data, the best assembly is the one obtained with Unicycler using only the Illumina data.
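
Basic contiguity metrics can be recomputed by hand to cross-check an assembly. The following is a minimal sketch that reports the number of contigs, the total length and the N50 of a FASTA file; the function name `fasta_n50` is hypothetical and the sketch does not replace the QUAST report.

```shell
# Hypothetical helper: number of contigs, total length and N50 of a FASTA file.
fasta_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit the previous length
        { len += length($0) }                            # a sequence may span several lines
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            # N50: length of the contig at which the cumulative sum of the
            # lengths, sorted in decreasing order, reaches half of the total.
            for (i = 1; i <= NR; i++) {
                cum += lens[i]
                if (cum >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%d N50=%d\n", NR, total, n50
        }
    '
}
```

For example, `fasta_n50 3_UNICYCLER_illumina/assembly.fasta` prints one summary line that can be compared with the values in the QUAST report.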

Outputs

References

1. Chen S, Zhou Y, Chen Y, Gu J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884–90. doi:10.1093/bioinformatics/bty560.
2. Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7:11257. doi:10.1038/ncomms11257.
3. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–77. doi:10.1089/cmb.2012.0021.
4. Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLOS Computational Biology. 2017;13:1–22. doi:10.1371/journal.pcbi.1005595.
5. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–5. doi:10.1093/bioinformatics/btt086.

Reuse

This document will not be accessible without the prior agreement of the partners.

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France