FAMAT
Find functional links between a list of genes and metabolites using data-mining
This document is a report of the analyses performed. It contains all the code used to analyze the data. Tool versions (possibly given in code chunks) and their references are indicated for reproducibility.
Aim of the project
The aim of this project is to experiment with automatic sentence classification in order to detect sentences containing functional links between genes and metabolites. We will train a classifier on an annotated dataset to distinguish pertinent from non-pertinent sentences.
Partners
- Mouhamadou Ba - Migale bioinformatics facility - BioInfomics - INRAE
- Mathieu Charles - GABI - INRAE
Deliverables
Deliverables agreed at the preliminary meeting (Table 1).
# | Definition |
---|---|
1 | HTML report |
2 | Classified file |
Data management
All data are managed by the Migale facility for the duration of the project. Once the project is over, the Migale facility does not keep your data. We will provide you with the raw data and associated metadata, which will be deposited in public repositories before the results are used. We can guide you through the submission process. We will then decide which files to keep, knowing that this report will also be provided to you and that the analyses can be rerun if needed.
Raw data
The raw data, accessible from here, consist of 9600 sentences, of which 2237 are manually annotated. Each annotated sentence is labeled according to whether or not it contains a functional link.
- 73 sentences are annotated as pertinent (they contain functional links)
- 2164 sentences are annotated as non-pertinent (they do not contain functional links)
- 7363 sentences are not annotated
Dataset
Our annotated dataset is small and unbalanced, with only 2237 sentences: 73 positive examples and 2164 negative examples. We also consider the unannotated dataset, composed of 7363 sentences.
We will use the annotated sentences as the training dataset to train a classifier. Then, we will use the classifier to automatically classify the 7363 unannotated sentences, so that the quality of the automatic classification can be checked manually.
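To make the imbalance concrete, the short sketch below recomputes the class proportions from the counts reported above. The variable names are illustrative and are not taken from the project code.

```python
# Counts reported in the "Raw data" section.
n_positive = 73        # sentences annotated as pertinent
n_negative = 2164      # sentences annotated as non-pertinent
n_unannotated = 7363   # sentences without annotation

n_annotated = n_positive + n_negative
n_total = n_annotated + n_unannotated

# Share of positives among annotated sentences, and the negative/positive ratio.
positive_rate = n_positive / n_annotated
negatives_per_positive = n_negative / n_positive

# Accuracy of a trivial baseline that labels every sentence as non-pertinent.
all_negative_accuracy = n_negative / n_annotated

print(f"annotated sentences: {n_annotated}")
print(f"total sentences: {n_total}")
print(f"positive rate: {positive_rate:.2%}")
print(f"negatives per positive: {negatives_per_positive:.1f}")
print(f"all-negative baseline accuracy: {all_negative_accuracy:.2%}")
```

With roughly 3% of positives, a classifier that never predicts a functional link would still be right on about 97% of the annotated sentences, so accuracy alone says little here; precision, recall and F1 on the positive class are more informative.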
Classification
Run 20240119
We train classifiers on the training dataset composed of the 73 positive examples and 2164 negative examples. A 5-fold cross-validation gives the following scores, with average scores of P: 0.631724, R: 0.446970, F1: 0.500415.
Fold | Precision | Recall | F1 |
---|---|---|---|
0 | 0.777778 | 0.291667 | 0.424242 |
1 | 0.727273 | 0.363636 | 0.484848 |
2 | 0.428571 | 0.375000 | 0.400000 |
3 | 0.625000 | 0.454545 | 0.526316 |
4 | 0.600000 | 0.750000 | 0.666667 |
average | 0.631724 | 0.446970 | 0.500415 |
stdev | 0.134892 | 0.178975 | 0.105402 |
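As a sanity check on the table above, the following sketch recomputes each fold's F1 from its precision and recall, then the average and standard deviation rows. It uses only the Python standard library; the assumption that the reported standard deviation is the sample standard deviation (n - 1 denominator) is ours, and it is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Per-fold scores copied from the table above (run 20240119).
folds = [
    {"precision": 0.777778, "recall": 0.291667, "f1": 0.424242},
    {"precision": 0.727273, "recall": 0.363636, "f1": 0.484848},
    {"precision": 0.428571, "recall": 0.375000, "f1": 0.400000},
    {"precision": 0.625000, "recall": 0.454545, "f1": 0.526316},
    {"precision": 0.600000, "recall": 0.750000, "f1": 0.666667},
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Each reported F1 should match the value recomputed from P and R.
for fold in folds:
    assert abs(f1_score(fold["precision"], fold["recall"]) - fold["f1"]) < 1e-5

# Average and sample standard deviation of each metric across folds.
summary = {}
for metric in ("precision", "recall", "f1"):
    values = [fold[metric] for fold in folds]
    summary[metric] = (mean(values), stdev(values))

for metric, (avg, sd) in summary.items():
    print(f"{metric}: average={avg:.6f} stdev={sd:.6f}")
```

Note that the average F1 is the mean of the per-fold F1 values; it is not the F1 computed from the average precision and average recall, which would be about 0.524 here.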
We classified the unannotated dataset in order to manually evaluate the predictions of the classifier. The results are available in the following file:
- Open the file containing the predicted results
- Go to sheet `prediction_classifier_v20240119`
- Column `PREDICTED_CLASS` contains the classifier predictions (1 when the sentence is predicted as pertinent, 0 when it is predicted as non-pertinent)
Run 20240214
In this new run, we train the classifier after filtering out long sentences (sentences containing more than 400 characters are ignored). We also tried other settings (adding mentions of the genes and metabolites, weighting the positive examples in the dataset), but these did not improve the results.
Fold | Precision | Recall | F1 |
---|---|---|---|
0 | 0.800000 | 0.190476 | 0.307692 |
1 | 0.666667 | 0.307692 | 0.421053 |
2 | 0.500000 | 0.375000 | 0.428571 |
3 | 0.857143 | 0.333333 | 0.480000 |
4 | 0.400000 | 0.500000 | 0.444444 |
average | 0.644762 | 0.341300 | 0.416352 |
stdev | 0.194003 | 0.112096 | 0.064843 |
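The preprocessing described for this run can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the toy sentences and the "balanced" weighting formula (n_samples / (n_classes * class_count)) are assumptions, since the report does not state how the positive examples were weighted.

```python
from collections import Counter

MAX_CHARS = 400  # length threshold stated for run 20240214


def filter_long_sentences(examples, max_chars=MAX_CHARS):
    """Keep (sentence, label) pairs whose sentence has at most max_chars characters."""
    return [(text, label) for text, label in examples if len(text) <= max_chars]


def balanced_class_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy examples (hypothetical sentences; 1 = pertinent, 0 = non-pertinent).
examples = [
    ("GeneA regulates the synthesis of metaboliteB.", 1),
    ("x" * 401, 0),  # 401 characters: dropped by the filter
    ("The samples were stored at low temperature.", 0),
    ("y" * 400, 0),  # exactly 400 characters: kept
]

kept = filter_long_sentences(examples)
weights = balanced_class_weights([label for _, label in kept])

print(f"kept {len(kept)} of {len(examples)} sentences")
print(f"class weights: {weights}")
```

Filtering changes the number of positive and negative examples available in each fold, so the scores of this run are not computed on exactly the same sentences as those of the previous run.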