vroom::vroom("Genomes-Ceco2-GCF.tsv") |>
DT::datatable()
CECO2
Comparative analysis of Enterococcus cecorum genomes
This document reports the analyses performed. It contains all the code used to analyze the data. Tool versions (where given, in the code chunks) and references are indicated for reproducibility.
Aim of the project
The aim of this project is to compare public, in-house and private partner genomes.
Partners
- Cédric Midoux - Migale bioinformatics facility - BioinfOmics - INRAE
- Valentin Loux - Migale bioinformatics facility - BioinfOmics - INRAE
- Pascale Serror - MICALIS - INRAE
Deliverables
Deliverables agreed at the preliminary meeting (Table 1).
No. | Definition |
---|---|
1 | dRep dendrogram |
2 | PPanGGOLiN report |
Data management
All data is managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which will be deposited in public repositories before the results are used. We can guide you through the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be rerun if needed.
Raw data
The table of strains to analyze is provided by the partner.
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS
mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS
mkdir -p ToDoGBFF
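tail -n +2 Genomes-Ceco2-GCF.tsv | while read -r line; do
  NAME=$(echo "$line" | cut -f1)
  SOURCE=$(echo "$line" | cut -f2)
  GBFF=$(echo "$line" | cut -f3)
  # Variables are quoted so that empty fields or paths with spaces do not break the tests
  if [ "$SOURCE" = "ncbi_datasets" ]; then
    # Public genome: queue the accession for download
    echo -n "$NAME " >> ncbi.list
  else
    # Local genome: link the assembly
    ln -s "$SOURCE" "DATA/$NAME.fna"
  fi
  if [ "$GBFF" != "ncbi_datasets" ]; then
    if [ "$GBFF" = "ToDo" ]; then
      # No annotation yet: queue the assembly for Bakta
      ln -s "$SOURCE" "ToDoGBFF/$NAME.fna"
    else
      ln -s "$GBFF" "GBFF/$NAME.gbff"
    fi
  fi
done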
conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome,gbff --dehydrated `cat ncbi.list`
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 474 of 474 files for rehydration #474=2*237
mv ncbi_dataset/data/*/*.fna DATA/
for i in ncbi_dataset/data/*/genomic.gbff ; do id=$(dirname $i |cut -d/ -f3) ; mv $i GBFF/$id.gbff ; done
rm -fr ncbi_dataset/ ncbi_dataset.zip README.md
mkdir -p DoneGBFF
for i in ToDoGBFF/*fna; do
id=$(basename $i .fna)
qsub -V -cwd -N bakta_$id -pe thread 12 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.1 && bakta --threads 12 --db /db/outils/bakta-1.9.1/db/ --output DoneGBFF/$id --prefix $id --compliant --genus Enterococcus --species cecorum --strain $id $i && conda deactivate"
done
for i in DoneGBFF/*/*.gbff ; do id=$(basename $i .gbff) ; ln -s ../$i GBFF/ ; done
ls -1 DATA/*.fna | wc -l #392
ls -1 GBFF/*.gbff | wc -l #392
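The strain-table loop in this section calls `cut` once per column for every row. A minimal sketch of the same routing logic, assuming a tab-separated table whose first three columns are name, source and GBFF as above. The function names `route_genome` and `route_table` are hypothetical, and the sketch only prints the action it would take instead of creating links or download lists.

```bash
# Hypothetical helper: decide what to do with one row of the strain table.
# It prints the action instead of running ln or datasets, so it is safe to test.
route_genome() {
    name=$1; src=$2; gbff=$3
    if [ "$src" = "ncbi_datasets" ]; then
        echo "download $name"
    else
        echo "link $src DATA/$name.fna"
    fi
    if [ "$gbff" = "ToDo" ]; then
        echo "annotate $name"
    elif [ "$gbff" != "ncbi_datasets" ]; then
        echo "link $gbff GBFF/$name.gbff"
    fi
}

# Read a tab-separated table, skipping the header line.
# IFS is a tab only for the read, so paths containing spaces survive.
# Caveat: consecutive tabs are merged, so empty columns are not supported.
route_table() {
    tab=$(printf '\t')
    tail -n +2 "$1" | while IFS="$tab" read -r name src gbff _rest; do
        route_genome "$name" "$src" "$gbff"
    done
}
```

Called as `route_table Genomes-Ceco2-GCF.tsv`, this gives a dry-run view of what the loop would download, link or send to annotation.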
Genomes
We use QUAST for genome metrics and GTDB-Tk for taxonomic classification.
qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS && quast -t 16 -o QUAST DATA/* && conda deactivate"
qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"
We also extract some statistics on the number of pseudogenes (a proxy for assembly quality) for the public genomes from RefSeq (124) and those annotated by PGAP (8). Accessions with a high number of pseudogenes will be removed from the analysis. We also extract genome coverage statistics, when available.
cd GBFF/
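# -E is needed for \s+ to match runs of whitespace; the count is the 5th field
grep "Pseudo Genes (total)" * |sed 's/\.gbff://' |sed 's/:://'|sed -E 's/\s+/\t/g'|sort -u |sort -nr -k 5 > ../QUAST/stats_pseudos.tsv
# "Genome Coverage" has one word less, so the value is the 4th field
grep "Genome Coverage" * |sed 's/\.gbff://' |sed 's/:://'|sed -E 's/\s+/\t/g'|sort -u |sort -nr -k 4 > ../QUAST/stats_coverage.tsv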
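The `sed` pipeline above depends on the exact spacing of the GenBank comment block. As an alternative sketch, assuming the `Pseudo Genes (total) :: N` line format that the `grep` above targets, the value can be taken by splitting each line on `::` with awk. The function name `pseudo_table` is hypothetical.

```bash
# Hypothetical helper: print "<file basename><TAB><pseudogene count>" for each
# GBFF file given as argument, sorted by decreasing count.
pseudo_table() {
    for f in "$@"; do
        id=$(basename "$f" .gbff)
        # Split on "::", use the first matching line only,
        # and strip blanks from the value before printing it.
        awk -F'::' -v id="$id" '
            /Pseudo Genes \(total\)/ {
                gsub(/[ \t]/, "", $2)
                printf "%s\t%s\n", id, $2
                exit
            }
        ' "$f"
    done | sort -t "$(printf '\t')" -k2,2nr
}
# Usage: pseudo_table GBFF/*.gbff > QUAST/stats_pseudos.tsv
```

Files without a `Pseudo Genes (total)` line produce no row, so the number of output lines also shows how many annotations report this statistic.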
dRep dendrogram
We use dRep with the --S_algorithm fastANI option to compute a dendrogram of strains.
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA/* -p 16 DREP && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA/* -p 16 DREP_fastANI --S_algorithm fastANI && conda deactivate"
PPanGGOLiN
Finally, we use PPanGGOLiN for the pangenome analysis and to generate rarefaction curves.
for i in GBFF/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #392
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.gbff.list --output PANGO --cpu 16 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin rarefaction -c 16 -p pangenome.h5 && conda deactivate"
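Because the loop that builds `genomes.gbff.list` appends with `>>`, running it a second time would duplicate every entry. A small sketch of an idempotent variant, assuming the two-column `id<TAB>path` layout used above. The function name `write_anno_list` is hypothetical. It rewrites the list from scratch and returns a non-zero status if an identifier appears more than once.

```bash
# Hypothetical helper: write "<id><TAB><path>" lines for the given .gbff files.
# The list is truncated first, so reruns do not append duplicates.
write_anno_list() {
    list=$1; shift
    : > "$list"
    for f in "$@"; do
        [ -e "$f" ] || continue        # skip an unmatched glob pattern
        printf '%s\t%s\n' "$(basename "$f" .gbff)" "$f" >> "$list"
    done
    # Identifiers (column 1) must be unique: fail if any is repeated.
    dup=$(cut -f1 "$list" | sort | uniq -d)
    [ -z "$dup" ]
}
# Usage: write_anno_list genomes.gbff.list GBFF/*.gbff
```

The duplicate check matters mostly when the files come from nested directories, where two paths can share the same basename.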
Core gene alignment (DNA and protein):
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin msa -c 16 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO && ppanggolin msa -c 16 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
Second step: genome filtering
Based on the results of the first analysis, 9 genomes were found to be aberrant (ANI, number of pseudogenes) and were removed from the analysis.
First filtered dataset (383 genomes, Genomes-Ceco2-GCF-filtered.tsv); removed genomes:
GCA_034090505.1
GCA_039954935.1
GCA_039954955.1
GCA_039954965.1
GCA_039954975.1
GCA_039955015.1
GCA_039955055.1
GCF_023221725.1
GCF_023221775.1
Other filtered datasets have been produced.
- 382 genomes (Genomes-Ceco2-GCF-382.tsv):
  - 383 (dataset-1) minus 1 genome of abnormal size (GCA_039906345.1)
- 378 genomes (Genomes-Ceco2-GCF-378.tsv):
  - 382 minus 4 genomes from Cluster 1_1 (GCF_022806925.1, GCF_022806955.1, GCF_022807045.1, GCF_022807175.1)
- 374 genomes (Genomes-Ceco2-GCF-374.tsv):
  - 378 minus 4 genomes with assembly coverage <72% (as stated by RefSeq) (GCF_039906335.1, GCF_013103375.1, GCF_039955025.1, GCF_039954945.1)
dRep and PPanGGOLiN have to be rerun on each filtered dataset.
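Since the same commands are repeated for every filtered dataset, one option is to derive every path from a single dataset suffix, which limits copy-and-paste mistakes. A sketch, assuming the `DATA_FILTERED_<suffix>`, `DREP_FILTERED_<suffix>_fastANI` and `PANGO_FILTERED_<suffix>` naming used in this report. The function `dataset_jobs` is hypothetical and only prints the commands (a dry run); it does not submit anything.

```bash
# Hypothetical helper: print the dRep and PPanGGOLiN commands for one dataset.
# $1 = dataset suffix (1, 382, 378, 374), $2 = number of threads (default 64).
dataset_jobs() {
    s=$1; threads=${2:-64}
    echo "dRep dereplicate -g DATA_FILTERED_$s/* -p $threads DREP_FILTERED_${s}_fastANI --S_algorithm fastANI"
    echo "ppanggolin all --anno genomes.filtered_$s.gbff.list --output PANGO_FILTERED_$s --cpu $threads --tmp /projet/tmp"
}

# Dry run over the four filtered datasets of this report.
for s in 1 382 378 374; do
    dataset_jobs "$s"
done
```

Each printed line could then be wrapped in the `qsub ... -b y "conda activate ... && ... && conda deactivate"` pattern used throughout this report.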
Filtered Dataset - 1
Filtered strain table
vroom::vroom("Genomes-Ceco2-GCF-filtered.tsv") |>
DT::datatable()
Genome FASTA and GenBank files are in the DATA_FILTERED_1 and GBFF_FILTERED_1 directories, respectively.
mkdir DATA_FILTERED_1 GBFF_FILTERED_1/
cd DATA_FILTERED_1
ln -s ../DATA/* .
cd ../GBFF_FILTERED_1/
ln -s ../GBFF/* .
cd ..
rm -f *_FILTERED_1/GCA_034090505.1*
rm -f *_FILTERED_1/GCA_039954935.1*
rm -f *_FILTERED_1/GCA_039954955.1*
rm -f *_FILTERED_1/GCA_039954965.1*
rm -f *_FILTERED_1/GCA_039954975.1*
rm -f *_FILTERED_1/GCA_039955015.1*
rm -f *_FILTERED_1/GCA_039955055.1*
rm -f *_FILTERED_1/GCF_023221725.1*
rm -f *_FILTERED_1/GCF_023221775.1*
ls -1 DATA_FILTERED_1/*.fna | wc -l # 383
ls -1 GBFF_FILTERED_1/*.gbff | wc -l # 383
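The repeated `rm -f` lines above can also be driven by a list of accessions, so that each filtered dataset only differs by its exclusion file. A minimal sketch, assuming one accession per line in a plain-text file. The file name `excluded.txt` and the function `drop_excluded` are hypothetical.

```bash
# Hypothetical helper: remove every entry whose name starts with an excluded
# accession from each of the given directories.
# $1 = file with one accession per line, remaining arguments = directories.
drop_excluded() {
    list=$1; shift
    # "|| [ -n "$acc" ]" also handles a last line without trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue      # skip blank lines
        for dir in "$@"; do
            # With -f, a pattern that matches nothing is not an error.
            rm -f "$dir"/"$acc"*
        done
    done < "$list"
}
# Usage: drop_excluded excluded.txt DATA_FILTERED_1 GBFF_FILTERED_1
```

As in the commands above, only the symbolic links inside the filtered directories are removed; the files they point to in DATA and GBFF are left in place.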
dRep dendrogram of filtered dataset-1
We use dRep with the --S_algorithm fastANI option to compute a dendrogram of strains.
On Migale, the results are in the DREP_FILTERED_1_fastANI directory.
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_1/* -p 16 DREP_FILTERED_1 && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_1/* -p 16 DREP_FILTERED_1_fastANI --S_algorithm fastANI && conda deactivate"
PPanGGOLiN on filtered dataset-1
Finally, we use PPanGGOLiN for the pangenome analysis and to generate rarefaction curves.
On Migale, the results are in the PANGO_FILTERED_1 directory.
for i in GBFF_FILTERED_1/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_1.gbff.list ; done
wc -l genomes.filtered_1.gbff.list #383
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_1.gbff.list --output PANGO_FILTERED_1 --cpu 16 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1 && ppanggolin rarefaction -c 16 -p pangenome.h5 && conda deactivate"
Core gene alignment (DNA and protein):
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1 && ppanggolin msa -c 16 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_1 && ppanggolin msa -c 16 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
PPanGGOLiN statistics on filtered dataset-1:
Content:
Genes: 870376
Genomes: 383
Families: 9647
Edges: 16697
Persistent:
Family_count: 1491
min_genomes_frequency: 0.96
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 3003
min_genomes_frequency: 0.03
max_genomes_frequency: 0.98
sd_genomes_frequency: 0.27
mean_genomes_frequency: 0.23
Cloud:
Family_count: 5153
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.01
mean_genomes_frequency: 0.01
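As a quick consistency check on the statistics above, the persistent, shell and cloud family counts should add up to the total number of families, and the gene count divided by the genome count gives the mean number of genes per genome. A sketch in shell integer arithmetic, using the values reported above for filtered dataset-1 (the variable names are only illustrative):

```bash
# Values copied from the PPanGGOLiN statistics of filtered dataset-1.
persistent=1491; shell=3003; cloud=5153; families=9647
genes=870376; genomes=383

# The three partitions should cover every gene family exactly once.
sum=$((persistent + shell + cloud))
echo "partition sum: $sum (families: $families)"

# Integer division: mean number of genes per genome, rounded down.
mean_genes=$((genes / genomes))
echo "mean genes per genome: $mean_genes"

# Share of families in the persistent partition, in percent (rounded down).
pct_persistent=$((100 * persistent / families))
echo "persistent families: ${pct_persistent}%"
```

The same check can be repeated for the other datasets by replacing the six values.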
Filtered Dataset - 382
Filtered strain table
vroom::vroom("Genomes-Ceco2-GCF-382.tsv") |>
DT::datatable()
Genome FASTA and GenBank files are in the DATA_FILTERED_382 and GBFF_FILTERED_382 directories, respectively.
mkdir DATA_FILTERED_382 GBFF_FILTERED_382/
cd DATA_FILTERED_382
ln -s ../DATA/* .
cd ../GBFF_FILTERED_382/
ln -s ../GBFF/* .
cd ..
rm -f *_FILTERED_382/GCA_034090505.1*
rm -f *_FILTERED_382/GCA_039954935.1*
rm -f *_FILTERED_382/GCA_039954955.1*
rm -f *_FILTERED_382/GCA_039954965.1*
rm -f *_FILTERED_382/GCA_039954975.1*
rm -f *_FILTERED_382/GCA_039955015.1*
rm -f *_FILTERED_382/GCA_039955055.1*
rm -f *_FILTERED_382/GCF_023221725.1*
rm -f *_FILTERED_382/GCF_023221775.1*
rm -f *_FILTERED_382/GCA_039906345.1*
ls -1 DATA_FILTERED_382/*.fna | wc -l # 382
ls -1 GBFF_FILTERED_382/*.gbff | wc -l # 382
dRep dendrogram of filtered dataset-382
We use dRep with the --S_algorithm fastANI option to compute a dendrogram of strains.
On Migale, the results are in the DREP_FILTERED_382_fastANI directory.
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_382/* -p 64 DREP_FILTERED_382 && conda deactivate"
qsub -cwd -V -N drep_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_382/* -p 64 DREP_FILTERED_382_fastANI --S_algorithm fastANI && conda deactivate"
PPanGGOLiN on filtered dataset-382
Finally, we use PPanGGOLiN for the pangenome analysis and to generate rarefaction curves.
On Migale, the results are in the PANGO_FILTERED_382 directory.
for i in GBFF_FILTERED_382/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_382.gbff.list ; done
wc -l genomes.filtered_382.gbff.list #382
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_382.gbff.list --output PANGO_FILTERED_382 --cpu 64 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_382 && ppanggolin rarefaction -c 64 -p pangenome.h5 && conda deactivate"
Core gene alignment (DNA and protein):
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_382 && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_382 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
PPanGGOLiN statistics on filtered dataset-382:
Content:
Genes: 867520
Genomes: 382
Families: 9643
Edges: 16581
Persistent:
Family_count: 1492
min_genomes_frequency: 0.96
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 3000
min_genomes_frequency: 0.03
max_genomes_frequency: 0.98
sd_genomes_frequency: 0.27
mean_genomes_frequency: 0.23
Cloud:
Family_count: 5151
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.01
mean_genomes_frequency: 0.01
Filtered Dataset - 378
Filtered strain table
vroom::vroom("Genomes-Ceco2-GCF-378.tsv") |>
DT::datatable()
Genome FASTA and GenBank files are in the DATA_FILTERED_378 and GBFF_FILTERED_378 directories, respectively.
mkdir DATA_FILTERED_378 GBFF_FILTERED_378/
cd DATA_FILTERED_378
ln -s ../DATA/* .
cd ../GBFF_FILTERED_378/
ln -s ../GBFF/* .
cd ..
rm -f *_FILTERED_378/GCA_034090505.1*
rm -f *_FILTERED_378/GCA_039954935.1*
rm -f *_FILTERED_378/GCA_039954955.1*
rm -f *_FILTERED_378/GCA_039954965.1*
rm -f *_FILTERED_378/GCA_039954975.1*
rm -f *_FILTERED_378/GCA_039955015.1*
rm -f *_FILTERED_378/GCA_039955055.1*
rm -f *_FILTERED_378/GCF_023221725.1*
rm -f *_FILTERED_378/GCF_023221775.1*
rm -f *_FILTERED_378/GCA_039906345.1*
rm -f *_FILTERED_378/GCF_022806925.1*
rm -f *_FILTERED_378/GCF_022806955.1*
rm -f *_FILTERED_378/GCF_022807045.1*
rm -f *_FILTERED_378/GCF_022807175.1*
ls -1 DATA_FILTERED_378/*.fna | wc -l # 378
ls -1 GBFF_FILTERED_378/*.gbff | wc -l # 378
dRep dendrogram of filtered dataset-378
We use dRep with the --S_algorithm fastANI option to compute a dendrogram of strains.
On Migale, the results are in the DREP_FILTERED_378_fastANI directory.
qsub -cwd -V -N drep-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_378/* -p 64 DREP_FILTERED_378 && conda deactivate"
qsub -cwd -V -N drep-378_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_378/* -p 64 DREP_FILTERED_378_fastANI --S_algorithm fastANI && conda deactivate"
PPanGGOLiN on filtered dataset-378
Finally, we use PPanGGOLiN for the pangenome analysis and to generate rarefaction curves.
On Migale, the results are in the PANGO_FILTERED_378 directory.
for i in GBFF_FILTERED_378/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_378.gbff.list ; done
wc -l genomes.filtered_378.gbff.list #378
qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_378.gbff.list --output PANGO_FILTERED_378 --cpu 64 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_378 && ppanggolin rarefaction -c 64 -p pangenome.h5 && conda deactivate"
Core gene alignment (DNA and protein):
qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_378 && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin-378 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_378 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
PPanGGOLiN statistics on filtered dataset-378:
Content:
Genes: 857846
Genomes: 378
Families: 9490
Edges: 16293
Persistent:
Family_count: 1493
min_genomes_frequency: 0.96
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 2968
min_genomes_frequency: 0.03
max_genomes_frequency: 0.98
sd_genomes_frequency: 0.27
mean_genomes_frequency: 0.24
Cloud:
Family_count: 5029
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.01
mean_genomes_frequency: 0.01
Filtered Dataset - 374
Filtered strain table
vroom::vroom("Genomes-Ceco2-GCF-374.tsv") |>
DT::datatable()
Genome FASTA and GenBank files are in the DATA_FILTERED_374 and GBFF_FILTERED_374 directories, respectively.
mkdir DATA_FILTERED_374 GBFF_FILTERED_374/
cd DATA_FILTERED_374
ln -s ../DATA/* .
cd ../GBFF_FILTERED_374/
ln -s ../GBFF/* .
cd ..
rm -f *_FILTERED_374/GCA_034090505.1*
rm -f *_FILTERED_374/GCA_039954935.1*
rm -f *_FILTERED_374/GCA_039954955.1*
rm -f *_FILTERED_374/GCA_039954965.1*
rm -f *_FILTERED_374/GCA_039954975.1*
rm -f *_FILTERED_374/GCA_039955015.1*
rm -f *_FILTERED_374/GCA_039955055.1*
rm -f *_FILTERED_374/GCF_023221725.1*
rm -f *_FILTERED_374/GCF_023221775.1*
rm -f *_FILTERED_374/GCA_039906345.1*
rm -f *_FILTERED_374/GCF_022806925.1*
rm -f *_FILTERED_374/GCF_022806955.1*
rm -f *_FILTERED_374/GCF_022807045.1*
rm -f *_FILTERED_374/GCF_022807175.1*
rm -f *_FILTERED_374/GCF_039906335.1*
rm -f *_FILTERED_374/GCF_013103375.1*
rm -f *_FILTERED_374/GCF_039955025.1*
rm -f *_FILTERED_374/GCF_039954945.1*
ls -1 DATA_FILTERED_374/*.fna | wc -l # 374
ls -1 GBFF_FILTERED_374/*.gbff | wc -l # 374
dRep dendrogram of filtered dataset-374
We use dRep with the --S_algorithm fastANI option to compute a dendrogram of strains.
On Migale, the results are in the DREP_FILTERED_374_fastANI directory.
qsub -cwd -V -N drep-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_374/* -p 64 DREP_FILTERED_374 && conda deactivate"
qsub -cwd -V -N drep-374_fastani -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate drep-3.2.2 && dRep dereplicate -g DATA_FILTERED_374/* -p 64 DREP_FILTERED_374_fastANI --S_algorithm fastANI && conda deactivate"
PPanGGOLiN on filtered dataset-374
Finally, we use PPanGGOLiN for the pangenome analysis and to generate rarefaction curves.
On Migale, the results are in the PANGO_FILTERED_374 directory.
for i in GBFF_FILTERED_374/*.gbff ; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.filtered_374.gbff.list ; done
wc -l genomes.filtered_374.gbff.list #374
qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && ppanggolin all --anno genomes.filtered_374.gbff.list --output PANGO_FILTERED_374 --cpu 64 --tmp /projet/tmp && conda deactivate"
qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_374 && ppanggolin rarefaction -c 64 -p pangenome.h5 && conda deactivate"
Core gene alignment (DNA and protein):
qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_374 && ppanggolin msa -c 64 -p pangenome.h5 --source dna --phylo -o MSA_DNA && conda deactivate"
qsub -cwd -V -N ppanggolin-374 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.0.4 && cd PANGO_FILTERED_374 && ppanggolin msa -c 64 -p pangenome.h5 --source protein --phylo -o MSA_PROT && conda deactivate"
PPanGGOLiN statistics on filtered dataset-374:
Content:
Genes: 847688
Genomes: 374
Families: 9206
Edges: 15544
Persistent:
Family_count: 1634
min_genomes_frequency: 0.88
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.02
mean_genomes_frequency: 0.99
Shell:
Family_count: 2447
min_genomes_frequency: 0.03
max_genomes_frequency: 0.93
sd_genomes_frequency: 0.23
mean_genomes_frequency: 0.23
Cloud:
Family_count: 5125
min_genomes_frequency: 0.0
max_genomes_frequency: 0.03
sd_genomes_frequency: 0.01
mean_genomes_frequency: 0.01
Labofarm update (October 2024)
After a new curation of the genome list, all analyses were rerun.
qlogin -pe thread 8
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_update
mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS
tail -n +2 Genomes-Ceco2-GCT-382.tsv | while read -r line; do
  NAME=$(echo "$line" | cut -f1)
  SOURCE=$(echo "$line" | cut -f2)
  STRAIN=$(echo "$line" | cut -f3)
  # Variables are quoted so that empty fields or paths with spaces do not break the test
  if [ "$SOURCE" = "ncbi_datasets" ]; then
    echo -n "$NAME " >> ncbi.list
  else
    ln -s "$SOURCE" "DATA/$NAME.fna"
  fi
done
ls -1 DATA/*.fna | wc -l #155
conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome --dehydrated `cat ncbi.list`
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 227 of 227 files for rehydration
mv ncbi_dataset/data/*/*.fna DATA/
rm -fr ncbi_dataset/ ncbi_dataset.zip README.md md5sum.txt ncbi.list
conda deactivate
ls -1 DATA/*.fna | wc -l #382
for i in DATA/*fna; do
id=$(basename $i .fna)
qsub -V -cwd -N bakta_$id -pe thread 20 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.4 && bakta --threads 20 --db /db/outils/bakta-1.9.4/db/ --output GBFF/$id --prefix $id --compliant --genus Enterococcus --species cecorum --strain $id $i && conda deactivate"
done
ls -1 DATA/*.fna | wc -l #382
ls -1 GBFF/*/*.gbff | wc -l #382
# for i in GBFF/*/*.gbff ; do id=$(basename $i .gbff) ; ln -s ../$i DATA/ ; done
##########################
## QUAST
qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 16 -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_update && quast -t 16 -o QUAST DATA/* && conda deactivate"
## GTDB
qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"
## dRep
# ls -1 `tail -n +2 Genomes-Ceco2-GCT-375.tsv | cut -f1 | sed 's/^/DATA\//;s/$/*.fna/' | tr '\n' ' '` | wc -l #375
echo "conda activate drep-3.2.2 && dRep dereplicate -g `tail -n +2 Genomes-Ceco2-GCT-375.tsv | cut -f1 | sed 's/^/DATA\//;s/$/*.fna/' | tr '\n' ' '` -p 16 DREP --S_algorithm fastANI && conda deactivate" > drep.sh
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 16 drep.sh
##########################
## PPANGGO
conda activate ppanggolin-2.1.1
ppanggolin utils --default_config all
# INFO Writting default config in default_config.yaml
sed "s/use_pseudo: False/use_pseudo: True/g" default_config.yaml > use_pseudo_config.yaml
grep use_pseudo use_pseudo_config.yaml
# use_pseudo: True
conda deactivate
for i in GBFF/*/*.gbff; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #382
# grep $'\u2019' GBFF/*/*.gbff
sed -i 's/’/_/g' GBFF/GCA_947055075.1_CIRMBP-1230_assembly_genomic/GCA_947055075.1_CIRMBP-1230_assembly_genomic.gbff
sed -i 's/’/_/g' GBFF/GCA_947055225.1_CIRMBP-1229_assembly_genomic/GCA_947055225.1_CIRMBP-1229_assembly_genomic.gbff
# grep $'\u2019' GBFF/*/*.gbff | wc -l # 0
qsub -cwd -V -N ppanggolin382 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes.gbff.list --output PANGO382 --cpu 64 --tmp /projet/tmp --config use_pseudo_config.yaml && conda deactivate"
grep -Ff <(tail -n +2 Genomes-Ceco2-GCT-375.tsv | cut -f1) genomes.gbff.list > genomes375.gbff.list
wc -l genomes375.gbff.list #375
qsub -cwd -V -N ppanggolin375 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375 --cpu 64 --tmp /projet/tmp --identity 0.94 -f --config use_pseudo_config.yaml && conda deactivate"
ppanggolin fasta -p PANGO375/pangenome.h5 -f -o PANGO375/FASTA/ --genes all --proteins all --prot_families all --gene_families all --cpu 32
ppanggolin fasta -p PANGO382/pangenome.h5 -f -o PANGO382/FASTA/ --genes all --proteins all --prot_families all --gene_families all --cpu 32
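`grep -Ff` treats each accession as a fixed string that may match anywhere in the line, so an accession that is a prefix of another one (for example a `.1` version against a `.10` version) would select both. A sketch of a stricter selection with awk, assuming the two-column `id<TAB>path` list used above and identifiers that either equal the accession or start with the accession followed by an underscore. The function name `subset_list` is hypothetical.

```bash
# Hypothetical helper: keep the lines of an "<id><TAB><path>" list whose id
# equals a wanted accession or starts with "<accession>_".
# $1 = file with one accession per line (must not be empty), $2 = list file.
subset_list() {
    awk -F'\t' '
        # First file: remember the wanted accessions.
        NR == FNR { if ($0 != "") wanted[$0] = 1; next }
        # Second file: compare column 1 against each wanted accession.
        {
            for (acc in wanted)
                if ($1 == acc || index($1, acc "_") == 1) { print; break }
        }
    ' "$1" "$2"
}
# Usage:
#   tail -n +2 Genomes-Ceco2-GCT-375.tsv | cut -f1 > wanted.txt
#   subset_list wanted.txt genomes.gbff.list > genomes375.gbff.list
```

The `wc -l` check on the resulting list remains useful, since a count below the expected number of genomes points to identifiers that do not follow the assumed naming.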
Results
- Quast results
- GTDB results
- dRep results
Results: 382 genomes, 94% similarity
Genes: 881322
Genomes: 382
Families: 13440
Edges: 21626
Persistent:
Family_count: 1493
min_genomes_frequency: 0.95
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 3727
min_genomes_frequency: 0.02
max_genomes_frequency: 0.97
sd_genomes_frequency: 0.25
mean_genomes_frequency: 0.19
Cloud:
Family_count: 8220
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.0
mean_genomes_frequency: 0.01
Number_of_partitions: 13
Shell_S10: 549
Shell_S6: 129
Shell_S11: 1849
Shell_S9: 209
Shell_S5: 65
Shell_S1: 156
Shell_S8: 425
Shell_S4: 100
Shell_S7: 123
Shell_S2: 86
Shell_S3: 36
RGP: 18413
Spots: 270
Modules:
Number_of_modules: 519
Families_in_Modules: 3345
Partition_composition:
Persistent: 0.0
Shell: 52.53
Cloud: 47.47
- PPanGGOLiN 382 genomes
Results: 375 genomes, 94% similarity
Content:
Genes: 864019
Genomes: 375
Families: 13398
Edges: 21534
Persistent:
Family_count: 1497
min_genomes_frequency: 0.94
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 3704
min_genomes_frequency: 0.02
max_genomes_frequency: 0.97
sd_genomes_frequency: 0.24
mean_genomes_frequency: 0.2
Cloud:
Family_count: 8197
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.0
mean_genomes_frequency: 0.01
Number_of_partitions: 15
Shell_S13: 2136
Shell_S4: 107
Shell_S9: 76
Shell_S1: 153
Shell_S11: 61
Shell_S12: 675
Shell_S7: 92
Shell_S2: 85
Shell_S6: 57
Shell_S8: 71
Shell_S10: 129
Shell_S3: 36
Shell_S5: 26
RGP: 17770
Spots: 271
Modules:
Number_of_modules: 517
Families_in_Modules: 3330
Partition_composition:
Persistent: 0.0
Shell: 52.76
Cloud: 47.24
- PPanGGOLiN 375 genomes
Labofarm update (15 October 2024)
- 375 genomes with metadata
Genomes-Ceco2-382_curated.tsv
vroom::vroom("www/Labofarm_update/PANGO375_CURATED/Genomes-Ceco2-375_curated.tsv") |>
DT::datatable()
Public genomes retrieved with their original annotations
Private genomes annotated with Bakta 1.9.4
Pseudogenes included in the pangenome analysis (use_pseudo=true)
Quast
PPanGGOLiN with the similarity parameter set to 94% and use_pseudo=true
qlogin -pe thread 8
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated
mkdir -p DATA
mkdir -p GBFF
mkdir -p LOGS
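tail -n +2 Genomes-Ceco2-382_curated.tsv | while read -r line; do
  NAME=$(echo "$line" | cut -f1)
  SOURCE=$(echo "$line" | cut -f2)
  STRAIN=$(echo "$line" | cut -f3)
  # Variables are quoted so that empty fields or paths with spaces do not break the test
  if [ "$SOURCE" = "ncbi_datasets" ]; then
    echo -n "$NAME " >> ncbi.list
  else
    # Local genomes are named after the strain in this run
    ln -s "$SOURCE" "DATA/${STRAIN}.fna"
  fi
done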
ls -1 DATA/*.fna | wc -l #155
conda activate ncbi-datasets-cli-16.17.3
datasets download genome accession --include genome,gbff --dehydrated `cat ncbi.list`
unzip ncbi_dataset.zip
datasets rehydrate --directory . #Found 454 of 454 files for rehydration (454=2*227)
tail -n +2 Genomes-Ceco2-382_curated.tsv | while read -r line; do
NAME=$(echo "$line" | cut -f1)
SOURCE=$(echo "$line" | cut -f2)
STRAIN=$(echo "$line" | cut -f3)
if [ "$SOURCE" = "ncbi_datasets" ]; then
mv "ncbi_dataset/data/${NAME}"/*.fna "DATA/${STRAIN}.fna"
mkdir -p "GBFF/${STRAIN}"
mv ncbi_dataset/data/${NAME}/genomic.gbff GBFF/${STRAIN}/${STRAIN}.gbff
else
qsub -V -cwd -N bakta_${STRAIN} -pe thread 20 -e LOGS/ -o LOGS -b y "conda activate bakta-1.9.4 && bakta --threads 20 --db /db/outils/bakta-1.9.4/db/ --output GBFF/${STRAIN} --prefix ${STRAIN} --compliant --genus Enterococcus --species cecorum --strain ${STRAIN} DATA/${STRAIN}.fna && conda deactivate"
fi
done
rm -fr ncbi_dataset/ ncbi_dataset.zip README.md md5sum.txt
conda deactivate
ls -1 DATA/*.fna | wc -l #382
ls -1 GBFF/*/*.gbff | wc -l #382
##########################
## QUAST
qsub -cwd -V -N quast -o LOGS/ -e LOGS/ -pe thread 64 -q maiage.q -b y "conda activate quast-5.2.0 && cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/ && quast -t 64 -o QUAST DATA/* && conda deactivate"
## GTDB
qsub -cwd -V -N gtdb -o LOGS/ -e LOGS/ -pe thread 64 -q maiage.q -b y "conda activate gtdbtk-2.4.0 && export GTDBTK_DATA_PATH=/db/outils/gtdbtk-2.4.0/release220/ && gtdbtk classify_wf --genome_dir DATA/ --out_dir GTDB --mash_db /db/outils/gtdbtk-2.4.0/release220/ --cpus 64 && conda deactivate"
## dRep
# ls -1 `tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/DATA\//;s/$/.fna/' | tr '\n' ' '` | wc -l #375
echo "conda activate drep-3.2.2 && dRep dereplicate -g `tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/DATA\//;s/$/.fna/' | tr '\n' ' '` -p 64 DREP --S_algorithm fastANI && conda deactivate" > drep.sh
qsub -cwd -V -N drep -o LOGS/ -e LOGS/ -pe thread 64 drep.sh
##########################
## PPANGGO
conda activate ppanggolin-2.1.1
ppanggolin utils --default_config all
# INFO Writting default config in default_config.yaml
sed "s/use_pseudo: False/use_pseudo: True/g" default_config.yaml > use_pseudo_config.yaml
grep use_pseudo use_pseudo_config.yaml
# use_pseudo: True
conda deactivate
for i in GBFF/*/*.gbff; do id=$(basename $i .gbff) ; echo -e "$id\t$i" >> genomes.gbff.list ; done
wc -l genomes.gbff.list #382
qsub -cwd -V -N ppanggolin382 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes.gbff.list --output PANGO382 --cpu 64 --tmp /projet/tmp --identity 0.94 --config use_pseudo_config.yaml && conda deactivate"
grep -Ff <(tail -n +2 Genomes-Ceco2-375_curated.tsv | cut -f3 | sed 's/^/\//;s/$/.gbff/') genomes.gbff.list > genomes375.gbff.list
wc -l genomes375.gbff.list #375
qsub -cwd -V -N ppanggolin375 -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375 --cpu 64 --tmp /projet/tmp --identity 0.94 --config use_pseudo_config.yaml && conda deactivate"
Pangenome statistics:
Status:
Genomes_Annotated: true
Genes_Clustered: true
Genes_with_Sequences: true
Gene_Families_with_Sequences: true
Neighbors_Graph: true
Pangenome_Partitioned: true
RGP_Predicted: true
Spots_Predicted: true
Modules_Predicted: true
PPanGGOLiN_Version: 2.1.1
Content:
Genes: 868207
Genomes: 375
Families: 14964
Edges: 24212
Persistent:
Family_count: 1491
min_genomes_frequency: 0.94
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 1.0
Shell:
Family_count: 4050
min_genomes_frequency: 0.02
max_genomes_frequency: 0.98
sd_genomes_frequency: 0.24
mean_genomes_frequency: 0.18
Cloud:
Family_count: 9423
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.0
mean_genomes_frequency: 0.01
Number_of_partitions: 13
Shell_S11: 1872
Shell_S1: 166
Shell_S7: 437
Shell_S4: 125
Shell_S10: 782
Shell_S2: 96
Shell_S8: 125
Shell_S9: 186
Shell_S6: 111
Shell_S5: 112
Shell_S3: 35
Shell_S_: 3
RGP: 18285
Spots: 279
Modules:
Number_of_modules: 494
Families_in_Modules: 3201
Partition_composition:
Persistent: 0.0
Shell: 55.67
Cloud: 44.33
Parameters:
annotate:
# used_local_identifiers: True
use_pseudo: True
# read_annotations_from_file: True
cluster:
coverage: 0.8
identity: 0.94
mode: 1
# defragmentation: True
no_defrag: False
translation_table: 11
# read_clustering_from_file: False
graph:
partition:
beta: 2.5
max_degree_smoothing: 10.0
free_dispersion: False
ICL_margin: 0.05
seed: 42
# computed nb of partitions: True
nb_of_partitions: -1
# final nb of partitions: 13
krange: [3, 20]
rgp:
persistent_penalty: 3
variable_gain: 1
min_length: 3000
min_score: 4
dup_margin: 0.05
spot:
set_size: 3
overlapping_match: 2
exact_match_size: 1
module:
size: 3
min_presence: 2
transitive: 4
jaccard: 0.85
dup_margin: 0.05
Metadata:
contigs: annotation_file
genomes: annotation_file
- PPanGGOLiN 375 genomes curated
375 genomes without pseudogenes
- 375 genomes with metadata
Genomes-Ceco2-382_curated.tsv
vroom::vroom("www/Labofarm_update/PANGO375_CURATED/Genomes-Ceco2-375_curated.tsv") |>
DT::datatable()
Public genomes retrieved with their original annotations
Private genomes annotated with Bakta 1.9.4
Pseudogenes excluded from the pangenome analysis (use_pseudo=false)
Quast
PPanGGOLiN with the similarity parameter set to 94% and use_pseudo=false
##########################
## PPANGGO
conda activate ppanggolin-2.1.1
qsub -cwd -V -N ppanggolin375woPseudo -o LOGS/ -e LOGS/ -pe thread 64 -b y "conda activate ppanggolin-2.1.1 && ppanggolin all --anno genomes375.gbff.list --output PANGO375woPseudo --cpu 64 --tmp /projet/tmp --identity 0.94 --config default_config.yaml && conda deactivate"
Pangenome statistics:
Status:
Genomes_Annotated: true
Genes_Clustered: true
Genes_with_Sequences: true
Gene_Families_with_Sequences: true
Neighbors_Graph: true
Pangenome_Partitioned: true
RGP_Predicted: true
Spots_Predicted: true
Modules_Predicted: true
PPanGGOLiN_Version: 2.1.1
Content:
Genes: 851351
Genomes: 375
Families: 13014
Edges: 21796
Persistent:
Family_count: 1467
min_genomes_frequency: 0.94
max_genomes_frequency: 1.0
sd_genomes_frequency: 0.01
mean_genomes_frequency: 0.99
Shell:
Family_count: 3847
min_genomes_frequency: 0.02
max_genomes_frequency: 0.98
sd_genomes_frequency: 0.25
mean_genomes_frequency: 0.19
Cloud:
Family_count: 7700
min_genomes_frequency: 0.0
max_genomes_frequency: 0.02
sd_genomes_frequency: 0.0
mean_genomes_frequency: 0.01
Number_of_partitions: 13
Shell_S11: 1851
Shell_S1: 177
Shell_S9: 244
Shell_S6: 98
Shell_S10: 912
Shell_S5: 100
Shell_S8: 130
Shell_S7: 104
Shell_S3: 33
Shell_S4: 110
Shell_S2: 88
RGP: 18412
Spots: 254
Modules:
Number_of_modules: 480
Families_in_Modules: 3111
Partition_composition:
Persistent: 0.0
Shell: 55.64
Cloud: 44.36
Parameters:
annotate:
# used_local_identifiers: True
use_pseudo: False
# read_annotations_from_file: True
cluster:
coverage: 0.8
identity: 0.94
mode: 1
# defragmentation: True
no_defrag: False
translation_table: 11
# read_clustering_from_file: False
graph:
partition:
beta: 2.5
max_degree_smoothing: 10.0
free_dispersion: False
ICL_margin: 0.05
seed: 42
# computed nb of partitions: True
nb_of_partitions: -1
# final nb of partitions: 13
krange: [3, 20]
rgp:
persistent_penalty: 3
variable_gain: 1
min_length: 3000
min_score: 4
dup_margin: 0.05
spot:
set_size: 3
overlapping_match: 2
exact_match_size: 1
module:
size: 3
min_presence: 2
transitive: 4
jaccard: 0.85
dup_margin: 0.05
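As a quick consistency check on the statistics above, the persistent, shell and cloud family counts should add up to the total number of families (1467 + 3847 + 7700 = 13014). Below is a minimal sketch of that check in shell, with the four values copied from the report into a here-document. The flattened `key: value` layout is an assumption matching how the statistics are displayed here.

```bash
set -euo pipefail

# Values copied from the statistics above (flattened "key: value" layout).
stats=$(cat <<'EOF'
Families: 13014
Persistent:
Family_count: 1467
Shell:
Family_count: 3847
Cloud:
Family_count: 7700
EOF
)

# Total number of gene families reported for the pangenome.
total=$(printf '%s\n' "$stats" | awk -F': ' '$1 == "Families" {print $2}')

# Sum of the three partition-level Family_count entries.
partition_sum=$(printf '%s\n' "$stats" | awk -F': ' '$1 == "Family_count" {s += $2} END {print s}')

if [ "$total" -eq "$partition_sum" ]; then
    echo "OK: partitions sum to $total families"
else
    echo "Mismatch: partitions sum to $partition_sum, total is $total" >&2
fi
```

The same idea applies to the shell sub-partitions: the eleven `Shell_S*` counts listed above add up to the shell `Family_count` of 3847.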
- PPanGGOLiN 375 genomes curated without pseudogenes
- [PPanGGOLiN genomes_statistics](www/Labofarm_update/PANGO375_CURATEDwoP/genomes_statistics.tsv)
- [U-shaped plot](www/Labofarm_update/PANGO375_CURATEDwoP/Ushaped_plot.html)
- [Tile plot](www/Labofarm_update/PANGO375_CURATEDwoP/tile_plot.html)
#### Metadata
Construct metadata files for [iTOL](https://itol.embl.de)
Directory `2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/METADATA`
Metadata file (`metadata-375.xlsx`) sent by e-mail by Pascale on 20.12.2024.
Converted to a CSV file (`metadata-375.csv`)
vroom::vroom("www/Labofarm_update/PANGO375_CURATEDwoP/metadata-375.csv") |>
DT::datatable()
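The iTOL dataset commands in this section select metadata columns by position (`$2`, `$7`, `$8`, `$11`, `$16`), so it can help to list the header with its field numbers before running them. This is a minimal sketch on a placeholder header. The column names used here (`id`, `strain`, `origin`, `clinic`) are hypothetical and do not reflect the real `metadata-375.csv`.

```bash
set -euo pipefail

# Placeholder file; the real metadata file has its own column names.
csv=$(mktemp)
printf 'id;strain;origin;clinic\nrow1;S1;farm;C\n' > "$csv"

# Print each column name with its 1-based index,
# i.e. the field number awk uses with -F';'.
head -n 1 "$csv" | tr ';' '\n' | nl -w1 -s': '

# Look up the index of one column by its exact name (hypothetical "clinic").
clinic_col=$(head -n 1 "$csv" | tr ';' '\n' | grep -n -x 'clinic' | cut -d: -f1)
echo "clinic is column $clinic_col"
```

If the spreadsheet is re-exported with an added or reordered column, rerunning this listing shows which `awk` field numbers need to be updated.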
Generate dataset files for iTOL for:
- Clinic
- Country
- hatchery
- Tissue
cat dataset_binary_template.txt <(echo "FIELD_SHAPES,3") <(echo "FIELD_LABELS,LABOFARM") <(echo "DATA" )<( awk -F';' '{print $2,$8}' metadata-375.csv | sed 's/, /-/' |sed 's/Labofarm/LABOFARM/'| sed 's/RESALAB/LABOFARM/'| sed 's/LABOFARM/0/')
cat dataset_binary_template.txt <(echo "DATA" ) <( awk -F';' '{print $2,$7}' metadata-375.csv | sed -r 's/ C/,1/'|sed -r 's/ NC/,0/'|sed -r 's/ n\/a/-1/') > itol_clinic.txt
cat dataset_color_strip_template.txt <( awk -F';' '{print $2,$11}' metadata-375.csv| sed 's/ GOASDUFF/ GOASDUFF/'|sed 's/Josset/JOSSET/'| sed 's/AVIAGEN/#E1A624 AVIAGEN/'|sed 's/AVILOIRE/#317AC1 AVILOIRE/'|sed 's/BOYE/#384454 BOYE/'|sed 's/GALLINA 22/#D4D3DC GALLINA 22/'|sed 's/GALLINA 85/#AD956B GALLINA 85/'|sed 's/GOASDUFF/#18534F GOASDUFF/'|sed 's/JOSSET/#226D68 JOSSET/'|sed 's/ORVIA/#FEEAA1 ORVIA/'|sed 's/PERROT/#D6955B PERROT/'|sed 's/SAS AMILLY ACCOUVAGE/#B39188 SAS AMILLY ACCOUVAGE/'| sed 's/ND//'|sed 's/NR//') > itol_hatchery.txt
cat dataset_color_strip_template.txt <( awk -F';' '{print $2,$16}' metadata-375.csv | sed 's/Air sacculitis/#1abc9c Air sacculitis/'|sed 's/Blood\/Ascite/#3498db Blood\/Ascite/'|sed 's/ Blood/ #2ecc71 Blood/'|sed 's/Bone-marrow/#9b59b6 Bone-marrow/'|sed 's/Breast meat/#34495e Breast meat/'|sed 's/ Caecum/ #16a085 Caecum/'|sed 's/Carcass/#27ae60 Carcass/'|sed 's/Cloacal content/#2980b9 Cloacal content/'|sed 's/Cull eggs/#8e44ad Cull eggs/'|sed 's/Egg transfer residue/#2c3e50 Egg transfer residue/'|sed 's/Feces/#f1c40f Feces /'|sed 's/Femoral Head Necrosis/#e67e22 Femoral Head Necrosis/'|sed 's/GIT/#e74c3c GIT/'|sed 's/Joint, Spine, Heart/#bdc3c7 Joint-Spine-Heart/'|sed 's/Heart, Liver/#95a5a6 Heart-Liver/'|sed 's/Heart, liver/#95a5a6 Heart-Liver/'|sed 's/Heart, Spine/#f39c12 Heart-Spine/'|sed 's/ Joint, Heart/ #c0392b Joint-Heart/'|sed 's/ Heart$/ #ecf0f1 Heart/'|sed 's/ Joint$/ #d35400 Joint/'|sed 's/ Leg/ #7f8c8d Leg/'|sed 's/ Liver/ #40407a Liver/'|sed 's/n\/a//'|sed 's/ND//'|sed 's/Pericardium/#ff5252 Pericardium/'|sed 's/Peritoin/#ff793f Peritoin/'|sed 's/ Spine$/ #cd6133 Spine/'|sed 's/ Spleen$/ #cc8e35 Spleen/'|sed 's/Tibial pus/#227093 Tibial pus/'|sed 's/Tissue sample/#33d9b2 Tissue sample/'|sed 's/Vertebral osteomyelitis/#ffda79 Vertebral osteomyelitis/'|sed 's/Vertebras/#ccae62 Vertebras/'|sed 's/Yolk/#474787 Yolk/') > itol_tissue.txt
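The chained `sed` substitutions above attach a color to each label. An alternative, sketched here under assumptions, keeps the label-to-color pairs in a small lookup table and lets `awk` do the join. The sample rows, the column positions and the tab-separated output (which would need `SEPARATOR TAB` in the iTOL template) are illustrative choices, not the settings used for this project. The two colors are the ones used for AVIAGEN and ORVIA in the hatchery command.

```bash
set -euo pipefail
tmp=$(mktemp -d)

# Lookup table: label;color (colors reused from the hatchery command).
printf 'AVIAGEN;#E1A624\nORVIA;#FEEAA1\n' > "$tmp/colors.csv"

# Hypothetical metadata extract: genome id in column 1, hatchery in column 2.
printf 'G001;AVIAGEN\nG002;ORVIA\nG003;ND\n' > "$tmp/meta.csv"

# First file: load the table into an array.
# Second file: print id, color, label for rows whose label has a color;
# rows with an unknown label (e.g. ND) are skipped.
awk -F';' -v OFS='\t' '
    NR == FNR { color[$1] = $2; next }
    $2 in color { print $1, color[$2], $2 }
' "$tmp/colors.csv" "$tmp/meta.csv" > "$tmp/itol_rows.txt"

cat "$tmp/itol_rows.txt"
```

Unlike the `sed` chain, this variant drops unlabelled genomes instead of emitting a row with an empty label, and labels containing spaces stay in a single tab-delimited field.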
AMR gene search:
NCBI AMRFinderPlus
cd /work_projet/cecotype/2024_LABOFARM/COMPARATIVE_GENOMICS/Labofarm_curated/PANGO375woPseudo/
conda activate ppanggolin-2.1.2
ppanggolin fasta -p pangenome.h5 --protein all -o PROTEOME
amrfinder -p PROTEOME/all_protein_genes.fna --threads 80 --plus -o ncb_amrfinderplus_all_protein_genes.csv
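Once the AMRFinderPlus table is written, a first overview is the number of hits per resistance class. The sketch below shows one way to get it with `awk`, locating the `Class` column from the header rather than by position. The three-column mock table and its placeholder values are assumptions for illustration. The real output has more columns and its column names can differ between AMRFinderPlus versions.

```bash
set -euo pipefail

# Tiny stand-in for an AMRFinderPlus result table (tab-separated, with header).
# Only three columns are mocked, with placeholder identifiers.
report=$(mktemp)
printf 'Protein identifier\tGene symbol\tClass\n' > "$report"
printf 'p1\tgeneA\tTETRACYCLINE\n' >> "$report"
printf 'p2\tgeneB\tMACROLIDE\n' >> "$report"
printf 'p3\tgeneA\tTETRACYCLINE\n' >> "$report"

# Find the "Class" column in the header, then count hits per class.
class_counts=$(awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "Class") col = i; next }
    col { n[$col]++ }
    END { for (c in n) print c "\t" n[c] }
' "$report" | sort)

printf '%s\n' "$class_counts"
```

Because the proteome comes from the whole pangenome, these counts are per protein sequence, not per genome. Mapping hits back to genomes would need the gene-to-genome information from the pangenome itself.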