FAMAT

Find functional links between a list of genes and metabolites using data-mining

closed

collaboration

Author

Affiliation

Mouhamadou Ba

Migale bioinformatics facility

Published

January 10, 2024

Modified

February 15, 2024

Note

This document reports the analyses performed. It contains all the code used to analyze the data. Tool versions (possibly given in code chunks) and their references are indicated for reproducibility.

Aim of the project

The aim of this project is to experiment with automatic sentence classification in order to detect sentences that contain functional links between genes and metabolites. We will train a classifier on an annotated dataset to separate pertinent sentences from non-pertinent ones.

Partners

Mouhamadou Ba - Migale bioinformatics facility - BioinfOmics - INRAE
Mathieu Charles - GABI - INRAE

Deliverables

Deliverables agreed at the preliminary meeting (Table 1).

Table 1: Deliverables

	Definition
1	HTML report
2	Classified file

Data management

Important

All data is managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which are to be deposited on public repositories before the results are used. We can guide you through the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be rerun if needed.

Raw data

The raw data, accessible from here, consist of 9600 sentences, of which 2237 are manually annotated. Each annotation indicates whether or not the sentence contains a functional link.

73 sentences are annotated as pertinent (they contain functional links)
2164 sentences are annotated as non-pertinent (they do not contain functional links)
7363 sentences are not annotated
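To make the class imbalance concrete, the counts above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are chosen here and are not part of the project code.

```python
# Counts taken from the list above.
n_pertinent = 73
n_non_pertinent = 2164
n_unannotated = 7363

n_annotated = n_pertinent + n_non_pertinent
n_total = n_annotated + n_unannotated

# Share of pertinent sentences among the annotated ones,
# and number of non-pertinent sentences per pertinent one.
positive_share = n_pertinent / n_annotated
negatives_per_positive = n_non_pertinent / n_pertinent

print(f"annotated: {n_annotated}, total: {n_total}")
print(f"pertinent share: {positive_share:.1%}")
print(f"non-pertinent per pertinent: {negatives_per_positive:.1f}")
```

The pertinent class makes up roughly 3% of the annotated sentences, which is why precision, recall and F1 on that class are more informative than overall accuracy.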

Dataset

Our dataset is small and unbalanced: it contains only 2237 annotated sentences, 73 positive examples and 2164 negative examples. We also consider the unannotated dataset of 7363 sentences.

The training dataset is here
The unannotated dataset is here

We will use the annotated sentences as the training dataset to train a classifier. We will then use the classifier to automatically classify the 7363 unannotated sentences, so that the quality of the automatic classification can be checked manually.

Classification

Run_20240119

We train a classifier on the training dataset composed of the 73 positive examples and 2164 negative examples. A 5-fold cross-validation gives the following scores, with averages of P: 0.631724, R: 0.446970, F1: 0.500415.

Fold	Precision	Recall	F1
0	0.777778	0.291667	0.424242
1	0.727273	0.363636	0.484848
2	0.428571	0.375000	0.400000
3	0.625000	0.454545	0.526316
4	0.600000	0.750000	0.666667
average	0.631724	0.446970	0.500415
stdev	0.134892	0.178975	0.105402
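As a consistency check, the average and stdev rows can be recomputed from the per-fold scores. The sketch below uses only the Python standard library and assumes that stdev is the sample standard deviation (n - 1 denominator), which is what reproduces the values in the table. The helper `f1_score` is defined here for illustration and is not the project's code.

```python
from statistics import mean, stdev

# Per-fold scores copied from the table above: (precision, recall, F1).
folds = [
    (0.777778, 0.291667, 0.424242),
    (0.727273, 0.363636, 0.484848),
    (0.428571, 0.375000, 0.400000),
    (0.625000, 0.454545, 0.526316),
    (0.600000, 0.750000, 0.666667),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precisions, recalls, f1s = zip(*folds)

for name, values in (("Precision", precisions), ("Recall", recalls), ("F1", f1s)):
    # statistics.stdev computes the sample standard deviation.
    print(f"{name}: average={mean(values):.6f} stdev={stdev(values):.6f}")

# Each fold's F1 follows from that fold's own precision and recall.
recomputed_f1s = [f1_score(p, r) for p, r, _ in folds]

# The F1 of the averaged precision and recall is a different quantity
# from the average of the per-fold F1 scores reported in the table.
f1_of_averages = f1_score(mean(precisions), mean(recalls))
print(f"F1 of averaged P and R: {f1_of_averages:.6f}")
```

Note that the reported average F1 is the mean of the five per-fold F1 scores, not the F1 computed from the average precision and average recall.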

We classified the unannotated dataset in order to manually evaluate the predictions of the classifier. The results are available in the following file:

Open the file containing the predicted results
Go to sheet prediction_classifier_v20240119
Column PREDICTED_CLASS contains the classifier predictions (where 1 is set when the sentence is predicted as pertinent and 0 when it is predicted as non-pertinent)
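For manual review, it can help to pull out the sentences predicted as pertinent. The sketch below assumes the sheet has been exported to CSV; only the PREDICTED_CLASS column name comes from this report, while the SENTENCE column name and the rows are hypothetical placeholders.

```python
import csv
import io

# Hypothetical CSV export of the prediction sheet. Only PREDICTED_CLASS is
# named in the report; the SENTENCE column and the rows are made up.
sample_csv = io.StringIO(
    "SENTENCE,PREDICTED_CLASS\n"
    "Toy sentence one.,1\n"
    "Toy sentence two.,0\n"
    "Toy sentence three.,1\n"
)

rows = list(csv.DictReader(sample_csv))

# Sentences predicted as pertinent (1) are the ones to check first by hand.
predicted_pertinent = [row for row in rows if row["PREDICTED_CLASS"] == "1"]

print(f"{len(predicted_pertinent)} of {len(rows)} sentences predicted as pertinent")
```

With a real export, `sample_csv` would be replaced by an opened file handle on the exported sheet.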

Run_20240214

In this new run, we train the classifier after filtering out long sentences (sentences containing more than 400 characters are ignored). We also tried other settings (adding mentions of the genes and metabolites, weighting the positive examples in the dataset), but they did not improve the results. The 5-fold cross-validation scores for this run are:

Fold	Precision	Recall	F1
0	0.800000	0.190476	0.307692
1	0.666667	0.307692	0.421053
2	0.500000	0.375000	0.428571
3	0.857143	0.333333	0.480000
4	0.400000	0.500000	0.444444
average	0.644762	0.341300	0.416352
stdev	0.194003	0.112096	0.064843
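The length filter used in this run can be sketched as follows. This is a minimal illustration, assuming the training data is held as (sentence, label) pairs; the function name and the toy examples are hypothetical and do not come from the project code or data.

```python
MAX_CHARS = 400  # threshold stated for this run


def filter_long_sentences(examples, max_chars=MAX_CHARS):
    """Keep (sentence, label) pairs whose sentence has at most max_chars characters."""
    return [(sentence, label) for sentence, label in examples if len(sentence) <= max_chars]


# Hypothetical toy examples, not taken from the project data.
examples = [
    ("Gene A regulates the synthesis of metabolite B.", 1),
    ("x" * 401, 0),  # more than 400 characters: ignored
    ("y" * 400, 0),  # exactly 400 characters: kept
]

kept = filter_long_sentences(examples)
print(f"kept {len(kept)} of {len(examples)} examples")
```

Sentences of exactly 400 characters are kept here, since the text only excludes sentences with more than 400 characters.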

Reuse

This document will not be accessible without the prior agreement of the partners.

A work by Migale Bioinformatics Facility
Université Paris-Saclay, INRAE, MaIAGE, 78350, Jouy-en-Josas, France
Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, 78350, Jouy-en-Josas, France