Explainable epidemiological thematic features for event based disease surveillance
Menya E., Interdonato R., Owuor D., Roche M. 2024. Expert Systems with Applications, 250: 21 p.
DOI: 10.18167/DVN1/WD1UC2
Event-based disease surveillance (EBS) systems are biosurveillance systems that detect and alert on (re-)emerging infectious diseases by monitoring acute public or animal health event patterns from sources such as blogs, online news reports and curated expert accounts. These information-rich sources, however, consist largely of unstructured text, which requires novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique for enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed-domain language model named EpidBioELECTRA to first enrich thematic features, which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset that we created, which consists of 70,700 annotated documents and includes 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general-purpose and domain-specific language models on the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained on both epidemiological and biomedical corpora with a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 score points compared to its vanilla implementation, BioELECTRA. We observe this by comparing BioELECTRA with EpidBioELECTRA on our most challenging dataset, PADI-Web, where our approach records a precision of 92.33, a recall of 94.62 and an F1 score of 93.46. We also experiment with increasing the context length of training documents in epidemiological document classification and find that this improves the classification task by 7.79 score points, as recorded by EpidBioELECTRA's performance. We also compute Almost Stochastic Order (ASO) scores to track E
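As a quick arithmetic check on the PADI-Web figures quoted in the abstract, the harmonic mean of the reported precision (92.33) and recall (94.62) can be computed as below. This is a minimal sketch and not the authors' evaluation code; the helper name `f1_score` is a hypothetical choice made for illustration, and the inputs are assumed to be on a percent scale.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        # Degenerate case: no true positives at all.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported in the abstract for PADI-Web (percent scale).
padi_web_f1 = f1_score(92.33, 94.62)
print(round(padi_web_f1, 2))  # 93.46
```

The result rounds to 93.46, which matches the third PADI-Web figure given in the abstract.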
Keywords: text mining; epidemiology; epidemiological surveillance; animal health; data analysis; infectious disease; data mining; public health
Associated documents
Article (a-journal with impact factor)
Cirad staff, authors of this publication:
- Interdonato Roberto — Es / UMR TETIS
- Menya Edmond — Es / UMR TETIS
- Roche Mathieu — Es / UMR TETIS