Publications by Cirad staff

Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification

Menya E., Roche M., Interdonato R., Owuor D. 2022. In: Calzolari N. (ed.), Béchet F. (ed.), Blache P. (ed.), Choukri K. (ed.), Cieri C. (ed.), Declerck T. (ed.), Goggi S. (ed.), Isahara H. (ed.), Maegaard B. (ed.), Mariani J. (ed.), Mazo H. (ed.), Odijk J. (ed.), Piperidis S. (ed.). Proceedings of the 13th Language Resources and Evaluation Conference. Marseille: European Language Resources Association, p. 3741-3750. Language Resources and Evaluation Conference (LREC 2022). 13, 2022-06-20/2022-06-25, Marseille (France).

DOI: 10.18167/DVN1/MSLEFC

We present EpidBioBERT, a biosurveillance epidemiological document tagger for disease surveillance over the PADI-Web system. Our model is trained on the PADI-Web corpus, which contains news articles on animal disease outbreaks extracted from the web. We train a classifier to discriminate between relevant and irrelevant documents based on their epidemiological thematic feature content, in preparation for further epidemiological information extraction. Our approach proposes a new way to perform epidemiological document classification by enriching epidemiological thematic features, namely disease, host, location and date, which are used as inputs to our epidemiological document classifier. We adopt a pre-trained biomedical language model with a novel fine-tuning approach that enriches these epidemiological thematic features. We find these thematic features rich enough to improve epidemiological document classification using a smaller data set than the one initially used for the PADI-Web classifier. This improves the classifier's ability to avoid false positive alerts in disease surveillance systems. To further understand the information encoded in EpidBioBERT, we examine the impact of each epidemiological thematic feature on the classifier through ablation studies. We compare our biomedical pre-trained approach with a model based on a general language model, finding that thematic feature embeddings pre-trained on general English documents are not rich enough for the epidemiological classification task. Our model achieves an F1-score of 95.5% on an unseen test set, an improvement of 5.5 points in F1-score over the PADI-Web classifier, using nearly half the training data.

Related documents

Conference paper

Cirad staff who authored this publication: