Publications by Cirad staff


EpidGPT: A combined strategy to discriminate between redundant and new information for epidemiological surveillance systems

Menya E., Roche M., Interdonato R., Owuor D. 2024. In: Rapp Amon (ed.), Di Caro Luigi (ed.), Meziane Farid (ed.), Sugumaran Vijayan (ed.). Natural language processing and information systems: 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Turin, Italy, June 25–27, 2024, Proceedings, Part I. Cham: Springer, p. 439-454. (Lecture Notes in Computer Science, 14762). Natural Language Processing and Information Systems (NLDB 2024), 2024-06-25/2024-06-27, Turin (Italy).

DOI: 10.18167/DVN1/WD1UC2

DOI: 10.1007/978-3-031-70239-6_30

Textual documents such as online news articles have become a key source for epidemiological surveillance, for example in detecting new and re-emerging diseases. However, such sources suffer from redundancy, which creates a need to automate the identification of novel information. In this paper, we propose a framework for learning novel thematic information in epidemiological news documents. Our approach involves both the extraction and the classification of new, duplicate, additional and/or missing pieces of relevant information in these documents. First, we propose an initial step to address the limited-data problem, in which few gold-labelled datasets exist for training text-based epidemiological surveillance systems. This step uses an extractive question answering technique to automate the extraction of relevant thematic features, including disease and host names, the location and date of reported events, and the reported number of cases, in order to create a large silver-labelled dataset. We then propose a main step in which we build a novelty classification model trained on this silver-labelled dataset. We test our novelty classifier alongside competitive models on the task of detecting whether a target epidemiological news article contains novel, redundant and/or missing information. Finally, we carry out ablation studies on the most informative document segments in epidemiological news articles.
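
To make the four categories named in the abstract concrete, the following is a minimal sketch of how thematic features from a reference article and a target article could be compared. It is not the authors' method: the paper extracts features with extractive question answering and trains a classifier, whereas this sketch assumes the features are already extracted and swaps in a simple rule-based comparison. The names `EventFeatures`, `compare_features` and `has_novelty`, and the exact meaning given to each label, are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass(frozen=True)
class EventFeatures:
    """Hypothetical container for the thematic features listed in the abstract."""
    disease: Optional[str] = None
    host: Optional[str] = None
    location: Optional[str] = None
    date: Optional[str] = None
    cases: Optional[int] = None


def _norm(value):
    # Case- and whitespace-insensitive comparison for strings; other types unchanged.
    return value.strip().lower() if isinstance(value, str) else value


def compare_features(reference: EventFeatures, target: EventFeatures) -> Dict[str, str]:
    """Label each feature of the target article relative to the reference.

    Assumed label semantics (an interpretation, not taken from the paper):
      missing    - the target does not report the feature
      additional - the target reports a feature the reference lacks
      new        - both report the feature, with different values
      duplicate  - both report the same value
    """
    labels = {}
    for f in fields(EventFeatures):
        ref = _norm(getattr(reference, f.name))
        tgt = _norm(getattr(target, f.name))
        if tgt is None:
            labels[f.name] = "missing"
        elif ref is None:
            labels[f.name] = "additional"
        elif ref != tgt:
            labels[f.name] = "new"
        else:
            labels[f.name] = "duplicate"
    return labels


def has_novelty(labels: Dict[str, str]) -> bool:
    # A target article counts as novel if any feature is new or additional.
    return any(label in ("new", "additional") for label in labels.values())
```

For instance, comparing a reference article reporting avian influenza in poultry in Kenya with a target article reporting the same disease and host in Uganda, plus a case count, would mark the location as new and the case count as additional. A trained classifier, as described in the abstract, would replace these hand-written rules.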

Associated documents

Conference paper
