CECO2

Comparative analysis of Enterococcus cecorum genomes

closed
collaboration
Authors
Affiliation

Cédric Midoux

Migale bioinformatics facility

Valentin Loux

Migale bioinformatics facility

Published

July 23, 2024

Modified

November 4, 2024

Note

This document reports the analyses performed and contains all the code used to analyze the data. For reproducibility, tool versions (visible in the code chunks, for example in the conda environment names) and references are indicated.
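As a minimal sketch of how tool versions can be read off the commands in this report, the snippet below splits a conda environment name into a tool name and a version. It assumes the `tool-version` naming pattern seen in the environment names used here; the function name `env_version` is hypothetical.

```shell
# Split a conda environment name such as "bakta-1.9.1" into "tool<TAB>version".
# Assumption: the version is everything after the last dash.
env_version() {
    local env=$1
    printf '%s\t%s\n' "${env%-*}" "${env##*-}"
}

# Environment names copied from the commands in this report.
for env in ncbi-datasets-cli-16.17.3 bakta-1.9.1 quast-5.2.0 gtdbtk-2.4.0 drep-3.2.2 ppanggolin-2.0.4; do
    env_version "$env"
done
```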

Aim of the project

The aim of this project is to compare public, in-house and private partner genomes.

Partners

  • Cédric Midoux - Migale bioinformatics facility - BioInfomics - INRAE
  • Valentin Loux - Migale bioinformatics facility - BioInfomics - INRAE
  • Pascale Serror - MICALIS - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables
#  Definition
1  dRep dendrogram
2  PPanGGOLiN report

Data management

Important

All data are managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which will be deposited in public repositories before the results are used. We can guide you through the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be rerun if needed.

Raw data

The partner provides the table of strains to analyze.

vroom::vroom("Genomes-Ceco2-GCF.tsv") |> 
  DT::datatable()

cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS

mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS
mkdir -p ToDoGBFF

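# Read the strain table (header skipped): column 1 = name, column 2 = FASTA source, column 3 = GBFF source
tail -n +2 Genomes-Ceco2-GCF.tsv | while read -r line; do
    NAME=$(echo "$line" | cut -f1)
    SOURCE=$(echo "$line" | cut -f2)
    GBFF=$(echo "$line" | cut -f3)

    # Public genomes are queued for download; local genomes are symlinked into DATA
    if [ "$SOURCE" = "ncbi_datasets" ]; then
        echo -n "$NAME " >> ncbi.list
    else
        ln -s "$SOURCE" "DATA/$NAME.fna"
    fi

    # Local genomes either come with a GBFF or are flagged "ToDo" for annotation with Bakta
    if [ "$GBFF" != "ncbi_datasets" ]; then
        if [ "$GBFF" = "ToDo" ]; then
            ln -s "$SOURCE" "ToDoGBFF/$NAME.fna"
        else
            ln -s "$GBFF" "GBFF/$NAME.gbff"
        fi
    fi
done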

conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome,gbff --dehydrated `cat ncbi.list`
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 474 of 474 files for rehydration #474=2*237

mv ncbi_dataset/data/*/*.fna DATA/
for i in ncbi_dataset/data/*/genomic.gbff ; do id=$(dirname $i |cut -d/ -f3) ; mv $i GBFF/$id.gbff ; done
rm -fr ncbi_dataset/ ncbi_dataset.zip README.md

mkdir -p DoneGBFF

for i in ToDoGBFF/*fna; do
    id=$(basename $i .fna)
    qsub -V -cwd -N bakta_$id -pe thread 12 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.1 && bakta --threads 12 --db /db/outils/bakta-1.9.1/db/ --output DoneGBFF/$id --prefix $id --compliant --genus Enterococcus --species cecorum --strain $id $i && conda deactivate"
done

for i in DoneGBFF/*/*.gbff ; do id=$(basename $i .gbff) ; ln -s ../$i GBFF/ ; done

ls -1 DATA/*.fna | wc -l #392
ls -1 GBFF/*.gbff | wc -l #392

Genomes

We use QUAST for assembly metrics and GTDB-Tk for taxonomic classification.

qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS && quast -t 16 -o QUAST DATA/* && conda deactivate"

qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"

We also extract statistics on the number of pseudogenes (a proxy for assembly quality) from the public genomes from RefSeq (124) and the genomes annotated by PGAP (8). Accessions with a high number of pseudogenes will be removed from the analysis. We also extract genome coverage statistics, when available.

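cd GBFF/
# sed -E is required for \s+ to mean "one or more whitespace characters"
grep "Pseudo Genes (total)" * | sed 's/\.gbff://' | sed 's/:://' | sed -E 's/\s+/\t/g' | sort -u | sort -nr -k 5 > ../QUAST/stats_pseudos.tsv
# the coverage value is the 4th field ("Genome Coverage" is two words, not three)
grep "Genome Coverage" * | sed 's/\.gbff://' | sed 's/:://' | sed -E 's/\s+/\t/g' | sort -u | sort -nr -k 4 > ../QUAST/stats_coverage.tsv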
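To turn the pseudogene statistics into a list of candidate accessions to remove, a threshold can be applied to the last column. This is a minimal sketch: the function name `flag_pseudogenes` and the threshold in the usage line are hypothetical, and it assumes the accession is the first field and the count the last one.

```shell
# Print accessions whose pseudogene count exceeds a threshold.
# Assumptions: first field = accession, last field = pseudogene count
# (thousands separators, if present, are stripped before comparing).
flag_pseudogenes() {
    local stats_file=$1 threshold=$2
    awk -v max="$threshold" '{
        n = $NF; gsub(/,/, "", n)
        if (n + 0 > max) print $1 "\t" n
    }' "$stats_file"
}

# Usage (the threshold of 200 is purely illustrative):
# flag_pseudogenes QUAST/stats_pseudos.tsv 200
```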

dRep dendrogram

We use dRep to compute a dendrogram of the strains, first with the default secondary clustering algorithm and then with --S_algorithm fastANI.

qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA/* -p 16 DREP && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA/* -p 16 DREP_fastANI --S_algorithm fastANI && conda deactivate"

PPanGGOLiN

Finally, we use PPanGGOLiN for the pangenome analysis and generate rarefaction curves.

for i in GBFF/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #392

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.gbff.list --output PANGO --cpu 16 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin rarefaction -c 16 -p pangenome.h5 && conda deactivate"

Core gene alignment (DNA and protein):

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin msa -c 16 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin msa -c 16 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
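The annotation list passed to `--anno` is built with `>>`, so running the loop twice would duplicate every genome. As a hedged alternative, the sketch below rewrites the list from scratch on each call; the function name `make_anno_list` is hypothetical.

```shell
# Write a PPanGGOLiN annotation list (genome id <TAB> GBFF path), one line per file.
# The list is overwritten rather than appended to, so re-running does not duplicate entries.
make_anno_list() {
    local gbff_dir=$1 out=$2 gbff
    for gbff in "$gbff_dir"/*.gbff; do
        [ -e "$gbff" ] || continue
        printf '%s\t%s\n' "$(basename "$gbff" .gbff)" "$gbff"
    done > "$out"
}

# Usage:
# make_anno_list GBFF genomes.gbff.list
```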

Second step: genome filtering

Based on the results of the first analysis, 9 genomes were found to be aberrant (ANI, number of pseudogenes) and were removed from the analysis.

First filtered dataset (383 genomes, Genomes-Ceco2-GCF-filtered.tsv), obtained by removing:

  • GCA_034090505.1
  • GCA_039954935.1
  • GCA_039954955.1
  • GCA_039954965.1
  • GCA_039954975.1
  • GCA_039955015.1
  • GCA_039955055.1
  • GCF_023221725.1
  • GCF_023221775.1

Other filtered datasets have been produced:

  • 382 genomes:
    • 383 (dataset-1) minus the genome of abnormal size (GCA_039906345.1): Genomes-Ceco2-GCF-382.tsv
  • 378 genomes:
    • 382 minus 4 genomes from Cluster 1_1 (GCF_022806925.1, GCF_022806955.1, GCF_022807045.1, GCF_022807175.1): Genomes-Ceco2-GCF-378.tsv
  • 374 genomes:
    • 378 minus 4 genomes with assembly coverage <72% (as stated by RefSeq) (GCF_039906335.1, GCF_013103375.1, GCF_039955025.1, GCF_039954945.1): Genomes-Ceco2-GCF-374.tsv
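The dataset sizes follow from successive subtraction, starting from the 392 genomes of the first analysis. A small sketch of that arithmetic (the function name `dataset_sizes` is hypothetical):

```shell
# Genome counts after each filtering step.
# First argument: starting count; remaining arguments: genomes removed at each step.
dataset_sizes() {
    local total=$1 removed
    shift
    for removed in "$@"; do
        total=$(( total - removed ))
        echo "$total"
    done
}

# 9 aberrant genomes, 1 abnormal-size genome, 4 from Cluster 1_1, 4 with low coverage
dataset_sizes 392 9 1 4 4
```

With these inputs the successive sizes are 383, 382, 378 and 374.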

dRep and PPanGGOLiN have to be rerun on each filtered dataset.
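Each filtered dataset is a directory of symlinks from which the excluded accessions are removed. One possible way to build such a directory from an exclusion list is sketched below; the function name `make_filtered` and the one-accession-per-line exclusion file are assumptions, and the sketch creates absolute links whereas the commands in this report use relative ones.

```shell
# Create a directory of symlinks to every file of a source directory, skipping files
# whose name starts with an excluded accession.
# Arguments: source dir, destination dir, file listing one accession per line.
make_filtered() {
    local src=$1 dst=$2 exclude=$3 abs_src f name acc skip
    abs_src=$(cd "$src" && pwd)
    mkdir -p "$dst"
    for f in "$src"/*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        skip=0
        while IFS= read -r acc; do
            [ -n "$acc" ] || continue
            case $name in "$acc"*) skip=1; break ;; esac
        done < "$exclude"
        [ "$skip" -eq 1 ] || ln -sf "$abs_src/$name" "$dst/$name"
    done
}

# Usage (excluded.txt is a hypothetical file holding the accessions to drop):
# make_filtered DATA DATA_FILTERED_1 excluded.txt
# make_filtered GBFF GBFF_FILTERED_1 excluded.txt
```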

Filtered Dataset - 1

Filtered strain table

vroom::vroom("Genomes-Ceco2-GCF-filtered.tsv") |> 
  DT::datatable()

Genome FASTA and GenBank files are in the DATA_FILTERED_1 and GBFF_FILTERED_1 directories, respectively.


mkdir DATA_FILTERED_1 GBFF_FILTERED_1/
cd DATA_FILTERED_1
 ln -s ../DATA/* .

cd ../GBFF_FILTERED_1/
 ln -s ../GBFF/* .
cd ..

rm -f  *_FILTERED_1/GCA_034090505.1*
rm -f  *_FILTERED_1/GCA_039954935.1*
rm -f  *_FILTERED_1/GCA_039954955.1*
rm -f  *_FILTERED_1/GCA_039954965.1*
rm -f  *_FILTERED_1/GCA_039954975.1*
rm -f  *_FILTERED_1/GCA_039955015.1*
rm -f  *_FILTERED_1/GCA_039955055.1*
rm -f  *_FILTERED_1/GCF_023221725.1*
rm -f  *_FILTERED_1/GCF_023221775.1*




ls -1 DATA_FILTERED_1/*.fna | wc -l # 383
ls -1 GBFF_FILTERED_1/*.gbff | wc -l # 383
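The expected counts are recorded as comments and compared by eye. A hedged sketch of an automated check that fails when a count is off (the function name `check_count` is hypothetical):

```shell
# Fail loudly if a directory does not contain the expected number of files with a given extension.
check_count() {
    local dir=$1 ext=$2 expected=$3 n
    n=$(find "$dir" -maxdepth 1 -name "*.$ext" | wc -l | tr -d ' ')
    if [ "$n" -ne "$expected" ]; then
        echo "ERROR: $dir has $n .$ext files, expected $expected" >&2
        return 1
    fi
    echo "OK: $dir has $n .$ext files"
}

# Usage:
# check_count DATA_FILTERED_1 fna 383 && check_count GBFF_FILTERED_1 gbff 383
```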

dRep dendrogram of filtered dataset-1

We use dRep --S_algorithm fastANI to compute a dendrogram of strains.

On Migale, results are in the DREP_FILTERED_1_fastANI directory.

qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_1/* -p 16 DREP_FILTERED_1 && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_1/* -p 16 DREP_FILTERED_1_fastANI --S_algorithm fastANI && conda deactivate"

PPanGGOLiN on filtered dataset-1

Finally, we use PPanGGOLiN for the pangenome analysis and generate rarefaction curves.

On Migale, results are in the PANGO_FILTERED_1 directory.

for i in GBFF_FILTERED_1/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_1.gbff.list ; done
wc -l genomes.filtered_1.gbff.list #383

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_1.gbff.list --output PANGO_FILTERED_1 --cpu 16 --tmp /projet/tmp && conda deactivate"


qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1  && ppanggolin rarefaction -c 16 -p pangenome.h5  && conda deactivate"

Core gene alignment (DNA and protein):

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1 && ppanggolin msa -c 16 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1 && ppanggolin msa -c 16 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"

PPanGGOLiN statistics on filtered dataset-1:

Content:
    Genes: 870376
    Genomes: 383
    Families: 9647
    Edges: 16697
    Persistent:
        Family_count: 1491
        min_genomes_frequency: 0.96
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 3003
        min_genomes_frequency: 0.03
        max_genomes_frequency: 0.98
        sd_genomes_frequency: 0.27
        mean_genomes_frequency: 0.23
    Cloud:
        Family_count: 5153
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.02
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01
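To read these counts as proportions of the pangenome, the family counts can be divided by the total number of families. A minimal sketch using the numbers reported for filtered dataset-1 (the function name `partition_shares` is hypothetical):

```shell
# Share of gene families in each partition for filtered dataset-1
# (counts copied from the PPanGGOLiN statistics of this dataset).
partition_shares() {
    awk 'BEGIN {
        total = 9647
        n["Persistent"] = 1491; n["Shell"] = 3003; n["Cloud"] = 5153
        split("Persistent Shell Cloud", order, " ")
        for (i = 1; i <= 3; i++)
            printf "%s\t%d\t%.1f%%\n", order[i], n[order[i]], 100 * n[order[i]] / total
    }'
}

partition_shares
```

For these counts the shares come out at about 15.5% persistent, 31.1% shell and 53.4% cloud.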

Filtered Dataset - 382

Filtered strain table

vroom::vroom("Genomes-Ceco2-GCF-382.tsv") |> 
  DT::datatable()

Genome FASTA and GenBank files are in the DATA_FILTERED_382 and GBFF_FILTERED_382 directories, respectively.


mkdir DATA_FILTERED_382 GBFF_FILTERED_382/
cd DATA_FILTERED_382
 ln -s ../DATA/* .

cd ../GBFF_FILTERED_382/
 ln -s ../GBFF/* .
cd ..

rm -f  *_FILTERED_382/GCA_034090505.1*
rm -f  *_FILTERED_382/GCA_039954935.1*
rm -f  *_FILTERED_382/GCA_039954955.1*
rm -f  *_FILTERED_382/GCA_039954965.1*
rm -f  *_FILTERED_382/GCA_039954975.1*
rm -f  *_FILTERED_382/GCA_039955015.1*
rm -f  *_FILTERED_382/GCA_039955055.1*
rm -f  *_FILTERED_382/GCF_023221725.1*
rm -f  *_FILTERED_382/GCF_023221775.1*
rm -f  *_FILTERED_382/GCA_039906345.1*


ls -1 DATA_FILTERED_382/*.fna | wc -l # 382
ls -1 GBFF_FILTERED_382/*.gbff | wc -l # 382

dRep dendrogram of filtered dataset-382

We use dRep --S_algorithm fastANI to compute a dendrogram of strains.

On Migale, results are in the DREP_FILTERED_382_fastANI directory.

qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_382/* -p 64 DREP_FILTERED_382 && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_382/* -p 64 DREP_FILTERED_382_fastANI --S_algorithm fastANI && conda deactivate"

PPanGGOLiN on filtered dataset-382

Finally, we use PPanGGOLiN for the pangenome analysis and generate rarefaction curves.

On Migale, results are in the PANGO_FILTERED_382 directory.

for i in GBFF_FILTERED_382/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_382.gbff.list ; done
wc -l genomes.filtered_382.gbff.list #382

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_382.gbff.list --output PANGO_FILTERED_382 --cpu 64 --tmp /projet/tmp && conda deactivate"


qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_382  && ppanggolin rarefaction -c 64 -p pangenome.h5  && conda deactivate"

Core gene alignment (DNA and protein):

qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_382  && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA  && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_382 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT  && conda deactivate"

PPanGGOLiN statistics on filtered dataset-382:

Content:
    Genes: 867520
    Genomes: 382
    Families: 9643
    Edges: 16581
    Persistent:
        Family_count: 1492
        min_genomes_frequency: 0.96
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 3000
        min_genomes_frequency: 0.03
        max_genomes_frequency: 0.98
        sd_genomes_frequency: 0.27
        mean_genomes_frequency: 0.23
    Cloud:
        Family_count: 5151
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.02
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Filtered Dataset - 378

Filtered strain table

vroom::vroom("Genomes-Ceco2-GCF-378.tsv") |> 
  DT::datatable()

Genome FASTA and GenBank files are in the DATA_FILTERED_378 and GBFF_FILTERED_378 directories, respectively.


mkdir DATA_FILTERED_378 GBFF_FILTERED_378/
cd DATA_FILTERED_378
 ln -s ../DATA/* .

cd ../GBFF_FILTERED_378/
 ln -s ../GBFF/* .
cd ..

rm -f  *_FILTERED_378/GCA_034090505.1*
rm -f  *_FILTERED_378/GCA_039954935.1*
rm -f  *_FILTERED_378/GCA_039954955.1*
rm -f  *_FILTERED_378/GCA_039954965.1*
rm -f  *_FILTERED_378/GCA_039954975.1*
rm -f  *_FILTERED_378/GCA_039955015.1*
rm -f  *_FILTERED_378/GCA_039955055.1*
rm -f  *_FILTERED_378/GCF_023221725.1*
rm -f  *_FILTERED_378/GCF_023221775.1*
rm -f  *_FILTERED_378/GCA_039906345.1*
rm -f  *_FILTERED_378/GCF_022806925.1*
rm -f  *_FILTERED_378/GCF_022806955.1*
rm -f  *_FILTERED_378/GCF_022807045.1*
rm -f  *_FILTERED_378/GCF_022807175.1*



ls -1 DATA_FILTERED_378/*.fna | wc -l # 378
ls -1 GBFF_FILTERED_378/*.gbff | wc -l # 378

dRep dendrogram of filtered dataset-378

We use dRep --S_algorithm fastANI to compute a dendrogram of strains.

On Migale, results are in the DREP_FILTERED_378_fastANI directory.

qsub -cwd -V -N drep-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_378/* -p 64 DREP_FILTERED_378 && conda deactivate"
qsub -cwd -V -N drep-378_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_378/* -p 64 DREP_FILTERED_378_fastANI --S_algorithm fastANI && conda deactivate"

PPanGGOLiN on filtered dataset-378

Finally, we use PPanGGOLiN for the pangenome analysis and generate rarefaction curves.

On Migale, results are in the PANGO_FILTERED_378 directory.

for i in GBFF_FILTERED_378/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_378.gbff.list ; done
wc -l genomes.filtered_378.gbff.list #378

qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_378.gbff.list --output PANGO_FILTERED_378 --cpu 64 --tmp /projet/tmp && conda deactivate"


qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_378  && ppanggolin rarefaction -c 64 -p pangenome.h5  && conda deactivate"

Core gene alignment (DNA and protein):

qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_378  && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA  && conda deactivate"
qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_378 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT  && conda deactivate"

PPanGGOLiN statistics on filtered dataset-378:

Content:
    Genes: 857846
    Genomes: 378
    Families: 9490
    Edges: 16293
    Persistent:
        Family_count: 1493
        min_genomes_frequency: 0.96
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 2968
        min_genomes_frequency: 0.03
        max_genomes_frequency: 0.98
        sd_genomes_frequency: 0.27
        mean_genomes_frequency: 0.24
    Cloud:
        Family_count: 5029
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.02
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Filtered Dataset - 374

Filtered strain table

vroom::vroom("Genomes-Ceco2-GCF-374.tsv") |> 
  DT::datatable()

Genome FASTA and GenBank files are in the DATA_FILTERED_374 and GBFF_FILTERED_374 directories, respectively.


mkdir DATA_FILTERED_374 GBFF_FILTERED_374/
cd DATA_FILTERED_374
 ln -s ../DATA/* .

cd ../GBFF_FILTERED_374/
 ln -s ../GBFF/* .
cd ..

rm -f  *_FILTERED_374/GCA_034090505.1*
rm -f  *_FILTERED_374/GCA_039954935.1*
rm -f  *_FILTERED_374/GCA_039954955.1*
rm -f  *_FILTERED_374/GCA_039954965.1*
rm -f  *_FILTERED_374/GCA_039954975.1*
rm -f  *_FILTERED_374/GCA_039955015.1*
rm -f  *_FILTERED_374/GCA_039955055.1*
rm -f  *_FILTERED_374/GCF_023221725.1*
rm -f  *_FILTERED_374/GCF_023221775.1*
rm -f  *_FILTERED_374/GCA_039906345.1*
rm -f  *_FILTERED_374/GCF_022806925.1*
rm -f  *_FILTERED_374/GCF_022806955.1*
rm -f  *_FILTERED_374/GCF_022807045.1*
rm -f  *_FILTERED_374/GCF_022807175.1*
rm -f  *_FILTERED_374/GCF_039906335.1*
rm -f  *_FILTERED_374/GCF_013103375.1*
rm -f  *_FILTERED_374/GCF_039955025.1*
rm -f  *_FILTERED_374/GCF_039954945.1*




ls -1 DATA_FILTERED_374/*.fna | wc -l # 374
ls -1 GBFF_FILTERED_374/*.gbff | wc -l # 374

dRep dendrogram of filtered dataset-374

We use dRep --S_algorithm fastANI to compute a dendrogram of strains.

On Migale, results are in the DREP_FILTERED_374_fastANI directory.

qsub -cwd -V -N drep-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_374/* -p 64 DREP_FILTERED_374 && conda deactivate"
qsub -cwd -V -N drep-374_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_374/* -p 64 DREP_FILTERED_374_fastANI --S_algorithm fastANI && conda deactivate"
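Because the same dRep command is repeated for every filtered dataset, generating it from a single template keeps the input and output directories in step. The sketch below only prints the commands; the function name `drep_commands` is hypothetical and the thread count of 64 is taken from the later runs.

```shell
# Print the fastANI dRep command for each filtered dataset from one template,
# so that input and output directories always carry the same suffix.
drep_commands() {
    local suffix
    for suffix in "$@"; do
        printf 'dRep dereplicate -g DATA_FILTERED_%s/* -p 64 DREP_FILTERED_%s_fastANI --S_algorithm fastANI\n' \
            "$suffix" "$suffix"
    done
}

drep_commands 1 382 378 374
```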

PPanGGOLiN on filtered dataset-374

Finally, we use PPanGGOLiN for the pangenome analysis and generate rarefaction curves.

On Migale, results are in the PANGO_FILTERED_374 directory.

for i in GBFF_FILTERED_374/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_374.gbff.list ; done
wc -l genomes.filtered_374.gbff.list #374

qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_374.gbff.list --output PANGO_FILTERED_374 --cpu 64 --tmp /projet/tmp && conda deactivate"


qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_374  && ppanggolin rarefaction -c 64 -p pangenome.h5  && conda deactivate"

Core gene alignment (DNA and protein):

qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_374  && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA  && conda deactivate"
qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 &&   cd PANGO_FILTERED_374 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT  && conda deactivate"

PPanGGOLiN statistics on filtered dataset-374:

Content:
    Genes: 847688
    Genomes: 374
    Families: 9206
    Edges: 15544
    Persistent:
        Family_count: 1634
        min_genomes_frequency: 0.88
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.02
        mean_genomes_frequency: 0.99
    Shell:
        Family_count: 2447
        min_genomes_frequency: 0.03
        max_genomes_frequency: 0.93
        sd_genomes_frequency: 0.23
        mean_genomes_frequency: 0.23
    Cloud:
        Family_count: 5125
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.03
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01
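To compare partitions across the filtered datasets, the family counts can be pulled out of each statistics block. This is a rough sketch that assumes the "key: value" layout shown in this report; the function name `family_counts` is hypothetical.

```shell
# Extract "partition<TAB>Family_count" pairs from PPanGGOLiN statistics read on stdin.
# Assumption: each partition name sits alone on its line, followed by its Family_count line.
family_counts() {
    awk '
        /^[ \t]*(Persistent|Shell|Cloud):[ \t]*$/ { gsub(/[ \t:]/, "", $0); part = $0; next }
        /Family_count:/ && part != ""             { print part "\t" $2; part = "" }
    '
}

# Example with the values of filtered dataset-374
family_counts <<'EOF'
Content:
    Genes: 847688
    Genomes: 374
    Families: 9206
    Persistent:
        Family_count: 1634
    Shell:
        Family_count: 2447
    Cloud:
        Family_count: 5125
EOF
```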

Labofarm update (October 2024)

After a new curation of the genome list, all analyses were rerun.

qlogin -pe thread 8
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_update

mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS

tail -n +2 Genomes-Ceco2-GCT-382.tsv | while read -r line; do
    NAME=$(echo "$line" | cut -f1)
    SOURCE=$(echo "$line" | cut -f2)
    STRAIN=$(echo "$line" | cut -f3)
    
    if [ "$SOURCE" = "ncbi_datasets" ]; then
        echo -n $NAME " " >> ncbi.list
    else
        ln -s $SOURCE DATA/$NAME.fna
    fi
done

ls -1  DATA/*.fna | wc -l #155

conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome --dehydrated `cat ncbi.list`  
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 227 of 227 files for rehydration

mv ncbi_dataset/data/*/*.fna DATA/
rm -fr ncbi_dataset/ ncbi_dataset.zip README.md md5sum.txt ncbi.list
conda deactivate

ls -1  DATA/*.fna | wc -l #382

for i in DATA/*fna; do
    id=$(basename $i .fna)
    qsub -V -cwd -N bakta_$id -pe thread 20 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.4 && bakta --threads 20 --db /db/outils/bakta-1.9.4/db/ --output GBFF/$id --prefix $id --compliant --genus Enterococcus --species cecorum --strain $id $i && conda deactivate"
done

ls -1 DATA/*.fna  | wc -l #382
ls -1 GBFF/*/*.gbff | wc -l #382

# for i in GBFF/*/*.gbff ; do id=$(basename $i .gbff) ; ln -s ../$i DATA/ ; done

##########################
## QUAST
qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_update && quast -t 16 -o QUAST DATA/* && conda deactivate"

## GTDB
qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"

## dRep
# ls -1  `tail -n +2  Genomes-Ceco2-GCT-375.tsv | cut -f1 | sed 's/^/DATA\//;s/$/*.fna/' | tr '\n' ' '` | wc -l #375
echo "conda activate drep-3.2.2 && dRep dereplicate -g `tail -n +2  Genomes-Ceco2-GCT-375.tsv | cut -f1 | sed 's/^/DATA\//;s/$/*.fna/' | tr '\n' ' '` -p 16 DREP --S_algorithm fastANI && conda deactivate" > drep.sh
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 drep.sh

##########################
## PPANGGO

conda activate ppanggolin-2.1.1
ppanggolin utils --default_config all
# INFO  Writting default config in default_config.yaml
sed "s/use_pseudo: False/use_pseudo: True/g" default_config.yaml > use_pseudo_config.yaml
grep use_pseudo use_pseudo_config.yaml
# use_pseudo: True
conda deactivate

for i in GBFF/*/*.gbff; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #382

# grep $'\u2019' GBFF/*/*.gbff
sed -i 's/’/_/g' GBFF/GCA_947055075.1_CIRMBP-1230_assembly_genomic/GCA_947055075.1_CIRMBP-1230_assembly_genomic.gbff
sed -i 's/’/_/g' GBFF/GCA_947055225.1_CIRMBP-1229_assembly_genomic/GCA_947055225.1_CIRMBP-1229_assembly_genomic.gbff
# grep $'\u2019' GBFF/*/*.gbff | wc -l # 0


qsub -cwd -V -N ppanggolin382 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes.gbff.list --output PANGO382 --cpu 64 --tmp /projet/tmp --config use_pseudo_config.yaml && conda deactivate"

grep -Ff <(tail -n +2 Genomes-Ceco2-GCT-375.tsv | cut -f1) genomes.gbff.list > genomes375.gbff.list
wc -l genomes375.gbff.list #375


qsub -cwd -V -N ppanggolin375 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375 --cpu 64 --tmp /projet/tmp --identity 0.94 -f --config use_pseudo_config.yaml && conda deactivate"

ppanggolin fasta -p PANGO375/pangenome.h5 -f -o PANGO375/FASTA/ --genes all --proteins all --prot_families all --gene_families all --cpu 32

ppanggolin fasta -p PANGO382/pangenome.h5 -f -o PANGO382/FASTA/ --genes all --proteins all --prot_families all --gene_families all --cpu 32
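The typographic apostrophe (U+2019) was replaced by hand in the two affected GBFF files. A more general sketch that scans any set of files and fixes those containing the character could look as follows; the function name `fix_apostrophes` is hypothetical.

```shell
# Replace the typographic apostrophe (U+2019, UTF-8 bytes E2 80 99) with "_" in every
# file that contains it, and print the names of the files that were changed.
fix_apostrophes() {
    local apostrophe f
    apostrophe=$(printf '\342\200\231')
    for f in "$@"; do
        if grep -q "$apostrophe" "$f"; then
            sed -i.bak "s/$apostrophe/_/g" "$f" && rm -f "$f.bak"
            echo "$f"
        fi
    done
}

# Usage:
# fix_apostrophes GBFF/*/*.gbff
```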
 

Results

Results: 382 genomes, 94% similarity

Genes: 881322
Genomes: 382
Families: 13440
Edges: 21626
Persistent:
    Family_count: 1493
    min_genomes_frequency: 0.95
    max_genomes_frequency: 1.0
    sd_genomes_frequency: 0.01
    mean_genomes_frequency: 1.0
Shell:
    Family_count: 3727
    min_genomes_frequency: 0.02
    max_genomes_frequency: 0.97
    sd_genomes_frequency: 0.25
    mean_genomes_frequency: 0.19
Cloud:
    Family_count: 8220
    min_genomes_frequency: 0.0
    max_genomes_frequency: 0.02
    sd_genomes_frequency: 0.0
    mean_genomes_frequency: 0.01
Number_of_partitions: 13
Shell_S10: 549
Shell_S6: 129
Shell_S11: 1849
Shell_S9: 209
Shell_S5: 65
Shell_S1: 156
Shell_S8: 425
Shell_S4: 100
Shell_S7: 123
Shell_S2: 86
Shell_S3: 36
RGP: 18413
Spots: 270
Modules:
    Number_of_modules: 519
    Families_in_Modules: 3345
    Partition_composition:
        Persistent: 0.0
        Shell: 52.53
        Cloud: 47.47
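In the statistics above, the Shell_S* values appear to add up to the Shell family count. The sketch below checks this on the 382-genome figures; it assumes the "key: value" layout of this report, and the function name `check_shell_sum` is hypothetical.

```shell
# Sum the Shell_S* sub-partition counts read on stdin and compare with the Shell Family_count.
# Exit status is 0 when both agree, 1 otherwise.
check_shell_sum() {
    awk '
        /^[ \t]*Shell:[ \t]*$/       { in_shell = 1; next }
        in_shell && /Family_count:/  { shell = $2; in_shell = 0 }
        /^[ \t]*Shell_S[0-9_]*:/     { sum += $2 }
        END { printf "Shell=%d sum_of_subpartitions=%d\n", shell, sum; exit (shell == sum ? 0 : 1) }
    '
}

# Values copied from the 382-genome results
check_shell_sum <<'EOF'
Shell:
    Family_count: 3727
Shell_S10: 549
Shell_S6: 129
Shell_S11: 1849
Shell_S9: 209
Shell_S5: 65
Shell_S1: 156
Shell_S8: 425
Shell_S4: 100
Shell_S7: 123
Shell_S2: 86
Shell_S3: 36
EOF
```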

Results: 375 genomes, 94% similarity

  Content:
      Genes: 864019
      Genomes: 375
      Families: 13398
      Edges: 21534
      Persistent:
          Family_count: 1497
          min_genomes_frequency: 0.94
          max_genomes_frequency: 1.0
          sd_genomes_frequency: 0.01
          mean_genomes_frequency: 1.0
      Shell:
          Family_count: 3704
          min_genomes_frequency: 0.02
          max_genomes_frequency: 0.97
          sd_genomes_frequency: 0.24
          mean_genomes_frequency: 0.2
      Cloud:
          Family_count: 8197
          min_genomes_frequency: 0.0
          max_genomes_frequency: 0.02
          sd_genomes_frequency: 0.0
          mean_genomes_frequency: 0.01
      Number_of_partitions: 15
      Shell_S13: 2136
      Shell_S4: 107
      Shell_S9: 76
      Shell_S1: 153
      Shell_S11: 61
      Shell_S12: 675
      Shell_S7: 92
      Shell_S2: 85
      Shell_S6: 57
      Shell_S8: 71
      Shell_S10: 129
      Shell_S3: 36
      Shell_S5: 26
      RGP: 17770
      Spots: 271
      Modules:
          Number_of_modules: 517
          Families_in_Modules: 3330
          Partition_composition:
              Persistent: 0.0
              Shell: 52.76
              Cloud: 47.24

Labofarm update (October 15, 2024)

  • 375 genomes with the metadata from Genomes-Ceco2-382_curated.tsv
vroom::vroom("www/Labofarm_update/PANGO375_CURATED/Genomes-Ceco2-375_curated.tsv") |>
  DT::datatable()
  • Public genomes retrieved with their original annotations

  • Private genomes annotated with Bakta 1.9.4

  • Pseudogenes included in the pangenome analysis (use_pseudo=true)

  • QUAST

  • PPanGGOLiN with the similarity parameter (--identity) set to 94% and use_pseudo=true
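The use_pseudo setting is switched on by editing the default PPanGGOLiN config with `sed`, which exits successfully even when nothing was substituted. A minimal sketch that makes the edit and then verifies it (the function name `enable_use_pseudo` is hypothetical):

```shell
# Derive a config with use_pseudo enabled and fail if the substitution did not take effect.
# Arguments: input config, output config.
enable_use_pseudo() {
    local src=$1 dst=$2
    sed 's/use_pseudo: False/use_pseudo: True/g' "$src" > "$dst"
    if ! grep -q 'use_pseudo: True' "$dst"; then
        echo "ERROR: use_pseudo was not switched on in $dst" >&2
        return 1
    fi
}

# Usage:
# enable_use_pseudo default_config.yaml use_pseudo_config.yaml
```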

qlogin -pe thread 8
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated

mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS

tail -n +2 Genomes-Ceco2-382_curated.tsv | while read -r line; do
    NAME=$(echo "$line" | cut -f1)
    SOURCE=$(echo "$line" | cut -f2)
    STRAIN=$(echo "$line" | cut -f3)

    if [ "$SOURCE" = "ncbi_datasets" ]; then
        echo -n $NAME " " >> ncbi.list
    else
        ln -s $SOURCE DATA/${STRAIN}.fna
    fi
done

ls -1  DATA/*.fna | wc -l #155

conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome,gbff --dehydrated `cat ncbi.list`
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 454 of 454 files for rehydration (454=2*227)

tail -n +2 Genomes-Ceco2-382_curated.tsv | while read -r line; do
    NAME=$(echo "$line" | cut -f1)
    SOURCE=$(echo "$line" | cut -f2)
    STRAIN=$(echo "$line" | cut -f3)

    if [ "$SOURCE" = "ncbi_datasets" ]; then
        mv ncbi_dataset/data/${NAME}/*.fna DATA/${STRAIN}.fna
        mkdir GBFF/${STRAIN}
        mv ncbi_dataset/data/${NAME}/genomic.gbff GBFF/${STRAIN}/${STRAIN}.gbff
    else
        qsub -V -cwd -N bakta_${STRAIN} -pe thread 20 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.4 && bakta --threads 20 --db /db/outils/bakta-1.9.4/db/ --output GBFF/${STRAIN} --prefix ${STRAIN} --compliant --genus Enterococcus --species cecorum --strain ${STRAIN} DATA/${STRAIN}.fna && conda deactivate"
    fi
done

rm -fr ncbi_dataset/ ncbi_dataset.zip README.md md5sum.txt
conda deactivate

ls -1 DATA/*.fna  | wc -l #382
ls -1 GBFF/*/*.gbff | wc -l #382

##########################
## QUAST
qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 64 -q maiage.q -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/ && quast -t 64 -o QUAST DATA/* && conda deactivate"

## GTDB
qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -q maiage.q -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"

## dRep
# ls -1  `tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/DATA\//;s/$/.fna/' | tr '\n' ' '` | wc -l #375
echo "conda activate drep-3.2.2 && dRep dereplicate -g `tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/DATA\//;s/$/.fna/' | tr '\n' ' '` -p 64 DREP --S_algorithm fastANI && conda deactivate" > drep.sh
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 64 drep.sh

##########################
## PPANGGO

conda activate ppanggolin-2.1.1
ppanggolin utils --default_config all
# INFO  Writting default config in default_config.yaml
sed "s/use_pseudo: False/use_pseudo: True/g" default_config.yaml > use_pseudo_config.yaml
grep use_pseudo use_pseudo_config.yaml
# use_pseudo: True
conda deactivate

for i in GBFF/*/*.gbff; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #382



qsub -cwd -V -N ppanggolin382 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes.gbff.list --output PANGO382 --cpu 64 --tmp /projet/tmp --identity 0.94 --config use_pseudo_config.yaml && conda deactivate"

grep -Ff <(tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/\//;s/$/.gbff/') genomes.gbff.list > genomes375.gbff.list
wc -l genomes375.gbff.list #375
qsub -cwd -V -N ppanggolin375 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375 --cpu 64 --tmp /projet/tmp --identity 0.94 --config use_pseudo_config.yaml && conda deactivate"
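`grep -Ff` matches fixed substrings, which is why the identifiers above are padded with `/` and `.gbff` before matching. An alternative sketch selects lines by exact match on the first column of the annotation list; the function name `subset_list` is hypothetical.

```shell
# Keep the annotation-list lines whose first column is exactly one of the wanted identifiers.
# Arguments: file with one identifier per line, annotation list (id <TAB> path).
subset_list() {
    awk -F '\t' 'NR == FNR { keep[$1] = 1; next } ($1 in keep)' "$1" "$2"
}

# Usage (ids375.txt is a hypothetical file holding column 3 of the curated table):
# tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 > ids375.txt
# subset_list ids375.txt genomes.gbff.list > genomes375.gbff.list
```

The identifier file must not be empty, otherwise awk treats the annotation list as the identifier file.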

Pangenome statistics:

Status:
    Genomes_Annotated: true
    Genes_Clustered: true
    Genes_with_Sequences: true
    Gene_Families_with_Sequences: true
    Neighbors_Graph: true
    Pangenome_Partitioned: true
    RGP_Predicted: true
    Spots_Predicted: true
    Modules_Predicted: true
    PPanGGOLiN_Version: 2.1.1

Content:
    Genes: 868207
    Genomes: 375
    Families: 14964
    Edges: 24212
    Persistent:
        Family_count: 1491
        min_genomes_frequency: 0.94
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 4050
        min_genomes_frequency: 0.02
        max_genomes_frequency: 0.98
        sd_genomes_frequency: 0.24
        mean_genomes_frequency: 0.18
    Cloud:
        Family_count: 9423
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.02
        sd_genomes_frequency: 0.0
        mean_genomes_frequency: 0.01
    Number_of_partitions: 13
    Shell_S11: 1872
    Shell_S1: 166
    Shell_S7: 437
    Shell_S4: 125
    Shell_S10: 782
    Shell_S2: 96
    Shell_S8: 125
    Shell_S9: 186
    Shell_S6: 111
    Shell_S5: 112
    Shell_S3: 35
    Shell_S_: 3
    RGP: 18285
    Spots: 279
    Modules:
        Number_of_modules: 494
        Families_in_Modules: 3201
        Partition_composition:
            Persistent: 0.0
            Shell: 55.67
            Cloud: 44.33

Parameters:
    annotate:
        # used_local_identifiers: True
        use_pseudo: True
        # read_annotations_from_file: True
    cluster:
        coverage: 0.8
        identity: 0.94
        mode: 1
        # defragmentation: True
        no_defrag: False
        translation_table: 11
        # read_clustering_from_file: False
    graph:
    partition:
        beta: 2.5
        max_degree_smoothing: 10.0
        free_dispersion: False
        ICL_margin: 0.05
        seed: 42
        # computed nb of partitions: True
        nb_of_partitions: -1
        # final nb of partitions: 13
        krange: [3, 20]
    rgp:
        persistent_penalty: 3
        variable_gain: 1
        min_length: 3000
        min_score: 4
        dup_margin: 0.05
    spot:
        set_size: 3
        overlapping_match: 2
        exact_match_size: 1
    module:
        size: 3
        min_presence: 2
        transitive: 4
        jaccard: 0.85
        dup_margin: 0.05

Metadata:
    contigs: annotation_file
    genomes: annotation_file

375 genomes without pseudogenes

  • 375 genomes with the metadata from Genomes-Ceco2-382_curated.tsv
vroom::vroom("www/Labofarm_update/PANGO375_CURATED/Genomes-Ceco2-375_curated.tsv") |> 
  DT::datatable()
  • Public genomes retrieved with their original annotations

  • Private genomes annotated with Bakta 1.9.4

  • Pseudogenes excluded from the pangenome analysis (use_pseudo: False)

  • Quast

  • PPanGGOLiN run with the identity threshold set to 94% and use_pseudo: False

##########################
## PPANGGO

conda activate ppanggolin-2.1.1

qsub -cwd -V -N ppanggolin375woPseudo -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375woPseudo --cpu 64 --tmp /projet/tmp --identity 0.94 --config default_config.yaml && conda deactivate"
Pangenome statistics (375 genomes, pseudogenes excluded):
Status:
    Genomes_Annotated: true
    Genes_Clustered: true
    Genes_with_Sequences: true
    Gene_Families_with_Sequences: true
    Neighbors_Graph: true
    Pangenome_Partitioned: true
    RGP_Predicted: true
    Spots_Predicted: true
    Modules_Predicted: true
    PPanGGOLiN_Version: 2.1.1

Content:
    Genes: 851351
    Genomes: 375
    Families: 13014
    Edges: 21796
    Persistent:
        Family_count: 1467
        min_genomes_frequency: 0.94
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.99
    Shell:
        Family_count: 3847
        min_genomes_frequency: 0.02
        max_genomes_frequency: 0.98
        sd_genomes_frequency: 0.25
        mean_genomes_frequency: 0.19
    Cloud:
        Family_count: 7700
        min_genomes_frequency: 0.0
        max_genomes_frequency: 0.02
        sd_genomes_frequency: 0.0
        mean_genomes_frequency: 0.01
    Number_of_partitions: 13
    Shell_S11: 1851
    Shell_S1: 177
    Shell_S9: 244
    Shell_S6: 98
    Shell_S10: 912
    Shell_S5: 100
    Shell_S8: 130
    Shell_S7: 104
    Shell_S3: 33
    Shell_S4: 110
    Shell_S2: 88
    RGP: 18412
    Spots: 254
    Modules:
        Number_of_modules: 480
        Families_in_Modules: 3111
        Partition_composition:
            Persistent: 0.0
            Shell: 55.64
            Cloud: 44.36

Parameters:
    annotate:
        # used_local_identifiers: True
        use_pseudo: False
        # read_annotations_from_file: True
    cluster:
        coverage: 0.8
        identity: 0.94
        mode: 1
        # defragmentation: True
        no_defrag: False
        translation_table: 11
        # read_clustering_from_file: False
    graph:
    partition:
        beta: 2.5
        max_degree_smoothing: 10.0
        free_dispersion: False
        ICL_margin: 0.05
        seed: 42
        # computed nb of partitions: True
        nb_of_partitions: -1
        # final nb of partitions: 13
        krange: [3, 20]
    rgp:
        persistent_penalty: 3
        variable_gain: 1
        min_length: 3000
        min_score: 4
        dup_margin: 0.05
    spot:
        set_size: 3
        overlapping_match: 2
        exact_match_size: 1
    module:
        size: 3
        min_presence: 2
        transitive: 4
        jaccard: 0.85
        dup_margin: 0.05
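The two runs use the same 375-genome list and the same identity threshold, and their parameter dumps differ only in `use_pseudo`. The short sketch below does the arithmetic on the counts reported in the two Content blocks. It only restates the numbers above; the variable names are made up.

```shell
# Counts copied from the two Content blocks (with / without pseudogenes)
genes_with=868207; genes_without=851351
fam_with=14964;    fam_without=13014
pers_with=1491;    pers_without=1467
shell_with=4050;   shell_without=3847
cloud_with=9423;   cloud_without=7700

# Sanity check: the three partitions should add up to the family total
sum_with=$((pers_with + shell_with + cloud_with))
sum_without=$((pers_without + shell_without + cloud_without))
echo "partition sums: $sum_with (reported $fam_with), $sum_without (reported $fam_without)"

# Differences between the run with pseudogenes and the run without
d_genes=$((genes_with - genes_without))
d_fam=$((fam_with - fam_without))
d_cloud=$((cloud_with - cloud_without))
echo "genes: -$d_genes, families: -$d_fam, of which cloud: -$d_cloud"
```

On these numbers, 1723 of the 1950 families that disappear when pseudogenes are excluded belong to the cloud partition.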


  • PPanGGOLiN, 375 curated genomes without pseudogenes
      ◦ PPanGGOLiN genomes_statistics (www/Labofarm_update/PANGO375_CURATEDwoP/genomes_statistics.tsv)
      ◦ U-shaped plot (www/Labofarm_update/PANGO375_CURATEDwoP/Ushaped_plot.html)
      ◦ Tile plot (www/Labofarm_update/PANGO375_CURATEDwoP/tile_plot.html)

Metadata

Construct metadata files for iTOL (https://itol.embl.de).

Directory: 2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/METADATA

The metadata file (metadata-375.xlsx) was sent by e-mail by Pascale on 20.12.2024.

It was converted to a CSV file (metadata-375.csv).

vroom::vroom("www/Labofarm_update/PANGO375_CURATEDwoP/metadata-375.csv") |> 
  DT::datatable()

Generate iTOL dataset files for:

  • Clinic
  • Country
  • Hatchery
  • Tissue

cat dataset_binary_template.txt <(echo "FIELD_SHAPES,3") <(echo "FIELD_LABELS,LABOFARM") <(echo "DATA") <(awk -F';' '{print $2,$8}' metadata-375.csv | sed 's/, /-/' | sed 's/Labofarm/LABOFARM/' | sed 's/RESALAB/LABOFARM/' | sed 's/LABOFARM/0/')




cat dataset_binary_template.txt <(echo "DATA") <(awk -F';' '{print $2,$7}' metadata-375.csv | sed -r 's/ C/,1/' | sed -r 's/ NC/,0/' | sed -r 's/ n\/a/,-1/') > itol_clinic.txt
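The clinic dataset maps the status in column 7 to the three values iTOL accepts for a binary field: 1, 0 and -1. As an alternative to chained `sed` substitutions, the mapping can be done in one `awk` pass that compares the whole field. This is a sketch on a made-up sample, assuming a `;`-separated file with a header row, the strain id in column 2 and the status in column 7; `metadata-demo.csv` and `itol_clinic_rows.txt` are hypothetical names.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Made-up sample standing in for metadata-375.csv
cat > metadata-demo.csv <<'EOF'
n;id;c3;c4;c5;c6;clinic
1;CE1;x;x;x;x;C
2;CE2;x;x;x;x;NC
3;CE3;x;x;x;x;n/a
EOF

# Exact match on the field; anything other than C or NC is written as -1
awk -F';' 'NR > 1 {
    if ($7 == "C")       v = 1
    else if ($7 == "NC") v = 0
    else                 v = -1
    print $2 "," v
}' metadata-demo.csv > itol_clinic_rows.txt
cat itol_clinic_rows.txt
```

Unlike the `sed` version, this variant skips the header row and turns every unexpected status into -1, which may or may not be wanted.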

 cat dataset_color_strip_template.txt <(  awk -F';' '{print $2,$11}' metadata-375.csv| sed 's/  GOASDUFF/ GOASDUFF/'|sed 's/Josset/JOSSET/'| sed 's/AVIAGEN/#E1A624 AVIAGEN/'|sed 's/AVILOIRE/#317AC1 AVILOIRE/'|sed 's/BOYE/#384454 BOYE/'|sed 's/GALLINA 22/#D4D3DC GALLINA 22/'|sed 's/GALLINA 85/#AD956B GALLINA 85/'|sed 's/GOASDUFF/#18534F GOASDUFF/'|sed 's/JOSSET/#226D68 JOSSET/'|sed 's/ORVIA/#FEEAA1 ORVIA/'|sed 's/PERROT/#D6955B PERROT/'|sed 's/SAS AMILLY ACCOUVAGE/#B39188 SAS AMILLY ACCOUVAGE/'| sed 's/ND//'|sed 's/NR//') > itol_hatchery.txt
 
 
cat dataset_color_strip_template.txt <(  awk -F';' '{print $2,$16}' metadata-375.csv | sed 's/Air sacculitis/#1abc9c Air sacculitis/'|sed 's/Blood\/Ascite/#3498db Blood\/Ascite/'|sed 's/ Blood/ #2ecc71 Blood/'|sed 's/Bone-marrow/#9b59b6 Bone-marrow/'|sed 's/Breast meat/#34495e Breast meat/'|sed 's/ Caecum/ #16a085 Caecum/'|sed 's/Carcass/#27ae60 Carcass/'|sed 's/Cloacal content/#2980b9 Cloacal content/'|sed 's/Cull eggs/#8e44ad Cull eggs/'|sed 's/Egg transfer residue/#2c3e50 Egg transfer residue/'|sed 's/Feces/#f1c40f Feces /'|sed 's/Femoral Head Necrosis/#e67e22 Femoral Head Necrosis/'|sed 's/GIT/#e74c3c GIT/'|sed 's/Joint, Spine, Heart/#bdc3c7 Joint-Spine-Heart/'|sed 's/Heart, Liver/#95a5a6 Heart-Liver/'|sed 's/Heart, liver/#95a5a6 Heart-Liver/'|sed 's/Heart, Spine/#f39c12 Heart-Spine/'|sed 's/ Joint, Heart/ #c0392b Joint-Heart/'|sed 's/ Heart$/ #ecf0f1 Heart/'|sed 's/ Joint$/ #d35400 Joint/'|sed 's/ Leg/ #7f8c8d Leg/'|sed 's/ Liver/ #40407a Liver/'|sed 's/n\/a//'|sed 's/ND//'|sed 's/Pericardium/#ff5252 Pericardium/'|sed 's/Peritoin/#ff793f Peritoin/'|sed 's/ Spine$/ #cd6133 Spine/'|sed 's/ Spleen$/ #cc8e35 Spleen/'|sed 's/Tibial pus/#227093 Tibial pus/'|sed 's/Tissue sample/#33d9b2 Tissue sample/'|sed 's/Vertebral osteomyelitis/#ffda79 Vertebral osteomyelitis/'|sed 's/Vertebras/#ccae62 Vertebras/'|sed 's/Yolk/#474787 Yolk/') > itol_tissue.txt
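The hatchery and tissue commands attach one colour to each label through a long chain of `sed` substitutions, where the order of the substitutions matters (for example `Blood/Ascite` before ` Blood`). A possible alternative is a two-column lookup file read by `awk`. The sketch below uses three colours copied from the tissue command and made-up strain rows; the file names and the `;` separator of the demo input are assumptions, and the output keeps the space-separated `id colour label` layout of the original.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Lookup table (label<TAB>colour); colours copied from the tissue command
printf 'Caecum\t#16a085\nCarcass\t#27ae60\nYolk\t#474787\n' > tissue_colors.tsv

# Made-up rows: strain id and tissue separated by ';' so multi-word labels stay intact
printf 'CE1;Caecum\nCE2;Yolk\nCE3;ND\n' > id_tissue.txt

# First file fills the lookup; rows of the second file get a colour when the label is known
awk -F'\t' 'NR == FNR { color[$1] = $2; next }
    { split($0, f, ";")
      if (f[2] in color) print f[1], color[f[2]], f[2]
      else               print f[1] }' tissue_colors.tsv id_tissue.txt > itol_tissue_rows.txt
cat itol_tissue_rows.txt
```

Because the lookup matches the whole label, the result no longer depends on the order in which labels are listed.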

AMR gene search:

NCBI AMRFinderPlus

cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/PANGO375woPseudo/
conda activate ppanggolin-2.1.2
ppanggolin fasta   -p pangenome.h5 --protein all -o PROTEOME
amrfinder -p PROTEOME/all_protein_genes.fna --threads 80  --plus -o ncb_amrfinderplus_all_protein_genes.csv
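Once the AMRFinderPlus report is written, a first summary is the number of hits per resistance class. The sketch below assumes the report is tab-separated with a header row that contains a `Class` column, which should be checked against the actual file. The three rows in `amr_demo.tsv` are made up and are not results from this project.

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Made-up table: only the layout (tab-separated, header row, a Class column) is assumed
printf 'Protein identifier\tGene symbol\tClass\n' > amr_demo.tsv
printf 'p1\ttet(M)\tTETRACYCLINE\np2\term(B)\tMACROLIDE\np3\ttet(L)\tTETRACYCLINE\n' >> amr_demo.tsv

col="Class"
# Locate the column by its header name, then count the rows per value
awk -F'\t' -v want="$col" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) c = i
              if (!c) { print "column not found: " want > "/dev/stderr"; exit 1 }
              next }
    { n[$c]++ }
    END { for (k in n) print n[k] "\t" k }' amr_demo.tsv | sort -k1,1nr -k2,2 > class_counts.tsv
cat class_counts.tsv
```

Looking the column up by name keeps the sketch independent of the column order, which can differ between AMRFinderPlus versions.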

Reuse

This document will not be accessible without prior agreement of the partners

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France