PY79

Mapping and variant calling of two samples of Bacillus subtilis

closed

collaboration

Author

Affiliation

Olivier Rué

Migale bioinformatics facility

Published

August 6, 2024

Modified

December 2, 2024

Note

This document is a report of the analyses performed. You will find all the code used to analyze these data. The version of the tools (maybe in code chunks) and their references are indicated, for questions of reproducibility.

Aim of the project

The aim of this project is to perform a mapping of reads of 2 samples of Bacillus subtilis and then carry out a variant calling.

Partners

Olivier Rué - Migale bioinformatics facility - BioInfomics - INRAE
Arnaud Chastanet - MICALIS - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables

	Definition	Status
1	HTML report	✔️
2	Mapping files (BAM)	✔️
3	Variant calling files (gVCF)	✔️
4	Find insertions localization in RCL368 sample	✔️

Data management

Important

All data is managed by the migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata that will be deposited on public repositories before the results are used. We can guide you in the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be replayed if needed.

Raw data

Reads were deposited on the front server and a copy was sent to the abaca server.

Show the code

cd ~/work/PROJECTS/PY79
tar xvf transfer_8031793_files_70efce5b.tar
mkdir FASTQC
mkdir RAW_DATA
for i in *_1.fastq.gz ; do id=$(echo $(basename $i |cut -d '_' -f 2)) ; mv $i RAW_DATA/${id}_R1.fastq.gz ;done
for i in *_2.fastq.gz ; do id=$(echo $(basename $i |cut -d '_' -f 2)) ; mv $i RAW_DATA/${id}_R2.fastq.gz ;done

Quality control

seqkit [1] was used to get informations from FASTQ files.

Show the code

# seqkit
cd ~/work/PROJECTS/PY79
qsub -cwd -V -o LOGS -e LOGS -N seqkit -pe thread 4 -R y -b y "conda activate seqkit-2.0.0 && seqkit stats RAW_DATA/*.fastq.gz -j 4 > raw_data.infos && conda deactivate"

FastQC [2] is a program designed to spot potential problems in high througput sequencing datasets. It runs a set of analyses on one or more raw sequence files in fastq or bam format and produces a report which summarises the results. MultiQC [3] aggregates results from bioinformatics analyses across many samples into a single report.

Show the code

cd ~/work/PROJECTS/PY79
mkdir FASTQC
for i in RAW_DATA/*.fastq.gz ; do echo "conda activate fastqc-0.11.9 && fastqc $i -o FASTQC && conda deactivate" >> fastqc.sh ; done
qarray -cwd -V -N fastqc -o LOGS -e LOGS fastqc.sh
conda activate multiqc-1.11 && multiqc FASTQC -o MULTIQC && conda deactivate

The MultiQC report shows expected metrics for Illumina Hiseq sequencing data.

Note

The quality control is expected for Illumina sequencing reads. No reads cleaning is necessary.

Mapping

bwa-mem2 [4] is the next version of the bwa-mem algorithm in bwa [5]. It produces alignment identical to bwa and is ~1.3-3.1x faster depending on the use-case, dataset and the running machine.

Show the code

conda activate bwa-mem2-2.2.1
bwa-mem2 index PY79.fasta 
bwa-mem2 mem -t 4 PY79.fasta RAW_DATA/RCL368_R1.fastq.gz RAW_DATA/RCL368_R2.fastq.gz |samtools view -hbS - |samtools sort -@ 4 - > RCL368.bam
bwa-mem2 mem -t 4 PY79.fasta RAW_DATA/RCL415_R1.fastq.gz RAW_DATA/RCL415_R2.fastq.gz |samtools view -hbS - |samtools sort -@ 4 - > RCL415.bam
samtools flagstat RCL368.bam > RCL368.flagstat
samtools flagstat RCL415.bam > RCL415.flagstat

Qualimap [6] is a platform-independent application written in Java and R that provides both a Graphical User Interface (GUI) and a command-line interface to facilitate the quality control of alignment sequencing data. Shortly, Qualimap:

Examines sequencing alignment data according to the features of the mapped reads and their genomic properties
Povides an overall view of the data that helps to to the detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.

Show the code

conda activate qualimap-2.2.2d
qualimap bamqc -bam RCL368.bam -outdir QM_RCL368
qualimap bamqc -bam RCL415.bam -outdir QM_RCL415

Note

98.04% of mapped reads for RCL368
97.03% of mapped reads for RCL415

The genome seems to be completely covered, at homogeneous depths.

Variant calling

freebayes [7] is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.

Show the code

mkdir FREEBAYES
conda activate freebayes-1.3.5
freebayes -f PY79.fasta RCL368.bam > FREEBAYES/RCL368.vcf
freebayes -f PY79.fasta RCL415.bam > FREEBAYES/RCL415.vcf
freebayes -f PY79.fasta RCL368.bam --gvcf > FREEBAYES/RCL368.gvcf
freebayes -f PY79.fasta RCL415.bam --gvcf > FREEBAYES/RCL415.gvcf

Note

67 variants are detected in RCL368
79 variants are detected in RCL415

VCFttols [8] is a set of tools written in Perl and C++ for working with VCF files

Show the code

conda activate vcftools-0.1.16
vcftools --vcf FREEBAYES/RCL368.vcf --diff FREEBAYES/RCL415.vcf --diff-site

Note

34 sites are common to both files.
33 sites only in RCL368.
45 sites only in RCL415.

Show the code

library(reactable)
vcfdata <- read.csv2("data/out.diff.sites_in_files", stringsAsFactors = F, sep="", strip.white=T, dec=".") %>% as_tibble()

reactable(
  vcfdata, 
  rowStyle = function(index, value) {
    if (vcfdata[index,"IN_FILE"] == "B") list(color = "gray")
    else if (vcfdata[index,"IN_FILE"] == "1") list(color = "green")
    else if (vcfdata[index,"IN_FILE"] == "2") list(color = "blue")
  },
  columns = list(
    CHROM = colDef(width = 150)
  ),
  fullWidth = FALSE
)

Positions in blue are only found in RCL368, positions in green are only found in RCL415, positions in gray are common.

SNPs visualisation

Show the code

conda activate vcftools-0.1.16
vcftools --vcf RCL368.vcf --remove-indels --out RCL368.snps --recode --recode-INFO-all
vcftools --vcf RCL415.vcf --remove-indels --out RCL415.snps --recode --recode-INFO-all
bgzip RCL415.snps.recode.vcf 
bgzip RCL368.snps.recode.vcf 
tabix -p vcf RCL415.snps.recode.vcf.gz 
tabix -p vcf RCL368.snps.recode.vcf.gz 
cat ../PY79.fasta |vcf-consensus RCL368.snps.recode.vcf.gz > RCL368.fasta
cat ../PY79.fasta |vcf-consensus RCL415.snps.recode.vcf.gz > RCL415.fasta
cat ../PY79.fasta RCL368.fasta RCL415.fasta >> ALL.fasta
conda activate snipit-1.2
snipit ALL.fasta

Find insertions in RCL368

Use dedicated tools

Delly [9] is an integrated structural variant (SV) prediction method that can discover, genotype and visualize deletions, tandem duplications, inversions and translocations at single-nucleotide resolution in short-read and long-read massively parallel sequencing data. It uses paired-ends, split-reads and read-depth to sensitively and accurately delineate genomic rearrangements throughout the genome.

Show the code

cd /home/orue/work/PROJECTS/PY79
mkdir DELLY
conda activate delly-0.8.7
delly call --svtype ALL -g PY79.fasta RCL368.bam RCL415.bam -o DELLY/delly.bcf

conda activate bcftools-1.13
bcftools view DELLY/delly.bcf |grep -v "^#" |cut -f 5 |sort |uniq -c   
   2079 <DEL>
   1593 <DUP>
   3450 <INV>

No structural variation of typeINSERTION is found by Delly.

Use mapping information

I tried then to use the mapping information available in the BAM files. At the point of insertion, reads should not map along their entire length. So, for each read, I look at the CIGAR information.

The frequency of soft clipping should be higher in regions where insertion has occurred. I cut the genome into 100 bp parts and look at those containing the most soft-clipping reads.

Show the code

samtools view RCL368.bam | awk '$6 ~ /S/ {print $3, int($4/100)*100}' | sort | uniq -c | sort -k1,1nr | awk '{print $2 "\t" $3 "\t" $3+100 "\t" $1}' |head

CP006881.1      3566100 3566200 1280
CP006881.1      2678500 2678600 711
CP006881.1      3566000 3566100 420
CP006881.1      4033300 4033400 358
CP006881.1      0       100     334
CP006881.1      3566200 3566300 270
CP006881.1      917900  918000  168
CP006881.1      3674000 3674100 133
CP006881.1      3674400 3674500 133
CP006881.1      2678400 2678500 121

2 genomic regions appear in the top results before the reference extremities.

The inserted gene is sfGFP, the same in the 2 locations indicated (in theory at least), in translational fusion with the mreB and mbl genes.

In these regions, we find the expected genes.

CP006881.1    Genbank    CDS    3565108    3566109    .    -    0    ID=cds-AHA79589.1;Parent=gene-U712_18295;Dbxref=NCBI_GP:AHA79589.1;Name=AHA79589.1;gbkey=CDS;locus_tag=U712_18295;product=MreB-like protein;protein_id=AHA79589.1;transl_table=11
CP006881.1    Genbank    CDS    2678589    2679602    .    -    0    ID=cds-AHA78701.1;Parent=gene-U712_13815;Dbxref=NCBI_GP:AHA78701.1;Name=AHA78701.1;gbkey=CDS;locus_tag=U712_13815;product=Rod shape-determining protein mreB;protein_id=AHA78701.1;transl_table=11

Important

This solution would make it possible to find insertion zones without any preconceived ideas.

To reproduce

MYREADS is the sample name, REFERENCE is the reference name to use.

Connection with MobaXterm (cf https://documents.migale.inrae.fr/posts/tutorials/migale-infra/)
Mapping of the FASTQ

Show the code

cd ~/work/
mkdir MY_ANALYSIS
cd MY_ANALYSIS

conda activate bwa-mem2-2.2.1
bwa-mem2 index REFERENCE.fasta
samtools faidx REFERENCE.fasta
bwa-mem2 mem -t 4 REFERENCE.fasta MYREADS_R1.fastq.gz MYREADS_R2.fastq.gz |samtools view -hbS - |samtools sort -@ 4 - > MYREADS.bam
samtools index MYREADS.bam

Find zones with a high frequency of soft clipping inside reads

Show the code

samtools view MYREADS.bam | awk '$6 ~ /S/ {print $3, int($4/100)*100}' | sort | uniq -c | sort -k1,1nr | awk '{print $2 "\t" $3 "\t" $3+100 "\t" $1}' |head

Visualization with IGV

Download IGV
Download BAM and BAI files
Download FASTA and FASTA.fai files
Load FASTA as Genome
Load BAM as Tracks

Downloads

An archive containing:

BAM files and indexed
Reference FASTA file and index
VCF files

was available via Filesender

References

1. Shen W, Le S, Li Y, Hu F. SeqKit: A cross-platform and ultrafast toolkit for FASTA/q file manipulation. PloS one. 2016;11:e0163962.

2. Andrews S. FastQC a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. 2010. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

3. Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016;32:3047–8.

4. Vasimuddin M, Misra S, Li H, Aluru S. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In: 2019 IEEE international parallel and distributed processing symposium (IPDPS). IEEE; 2019. p. 314–24.

5. Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997. 2013.

6. Okonechnikov K, Conesa A, Garcı́a-Alcalde F. Qualimap 2: Advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2015;32:292–4.

7. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:12073907. 2012.

8. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27:2156–8. doi:10.1093/bioinformatics/btr330.

9. Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28:i333–9. doi:10.1093/bioinformatics/bts378.

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France