
Adapting BERT and AgriBERT for agroecology: A small-corpus pretraining approach

Mechhour O., Auzoux S., Jonquet C., Roche M. 2025. In: Idrissi Najlae (ed.), Hair Abdellatif (ed.), Lazaar Mohamed (ed.), Saadi Youssef (ed.), Chakib Houda (ed.), Erritali Mohammed (ed.), El Kafhali Said (ed.). Artificial intelligence and green computing: Proceedings of the 2nd International Conference on Artificial Intelligence and Green Computing ICAIGC 2025. Cham: Springer, 15 p. (Lecture Notes in Networks and Systems, 1589). International Conference on Artificial Intelligence and Green Computing (ICAIGC 2025). 2, 2025-05-14/2025-05-16, Beni Mellal (Morocco).

DOI: 10.18167/DVN1/M3W53D

Source variables, or observable properties, used to describe agroecological experiments are often heterogeneous, non-standardized, and multilingual, which makes them difficult to understand, explain, and use in cropping-system modeling and in multicriteria evaluations of agroecological system performance. A potential solution is to annotate the data with a controlled vocabulary of candidate variables from the Agroecological Global Information System (AEGIS). However, matching source and candidate variables through their textual descriptions remains a challenging task in agroecology. Domain-general language models, such as BERT, often struggle with domain-specific tasks because of their general-purpose training data. In the literature, these models are adapted to specialized domains through further pretraining, pretraining from scratch, and/or fine-tuning on downstream tasks. However, pretraining a domain-general model on a domain-specific corpus is resource-intensive, requiring substantial time, energy, and computational resources. To the best of our knowledge, no study has further pretrained a domain-general model on a small corpus (less than 100 MB) to adapt it to a domain-specific task and then evaluated it on downstream tasks without fine-tuning. To address these shortcomings, this paper proposes further pretraining BERT and AgriBERT on a small agroecology-related corpus. The approach is designed to be both time- and resource-efficient while improving domain adaptation. We evaluate the pretrained models on the task of matching source and candidate variable descriptions without fine-tuning. Our results show that our further pretrained AgriBERT (+ Experts + Core) model outperforms all others by more than 8% from P@1 to P@10. These findings show that small-scale pretraining can significantly improve performance on domain-specific tasks without requiring fine-tuning.
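
The abstract describes "further pretraining" on a small corpus only at a high level. The sketch below shows one common way such continued masked-language-model pretraining could be set up with the Hugging Face Transformers and Datasets libraries; the checkpoint name, corpus file, and hyperparameters are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of continued masked-language-model (MLM) pretraining
# on a small (<100 MB) plain-text corpus, using Hugging Face Transformers/Datasets.
# Model name, file path, and hyperparameters are assumptions for illustration.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"   # or an AgriBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Small agroecology corpus: one text passage per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "agroecology_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to recover them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="bert-agroecology",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

Trainer(model=model, args=args,
        train_dataset=tokenized["train"],
        data_collator=collator).train()
```

Likewise, matching source and candidate variable descriptions "without fine-tuning" suggests ranking candidates by the similarity of the pretrained model's raw sentence embeddings. The sketch below assumes mean-pooled token embeddings, cosine similarity, and a reading of P@k as the fraction of source variables whose correct candidate appears in the top k; the pooling choice, metric reading, and toy data are all assumptions for illustration, not the paper's exact protocol.

```python
# Hypothetical sketch of the matching step: rank candidate-variable descriptions
# for each source-variable description by cosine similarity of mean-pooled
# embeddings from a further-pretrained BERT-style model, then score P@k.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-agroecology")
model = AutoModel.from_pretrained("bert-agroecology").eval()

def embed(texts):
    """Mean-pool the last hidden states over non-padding tokens, then L2-normalize."""
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state          # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    emb = (out * mask).sum(1) / mask.sum(1)           # mean over real tokens
    return torch.nn.functional.normalize(emb, dim=-1)

# Invented toy descriptions; 'gold' holds the correct candidate index per source variable.
source_texts = ["leaf area index measured weekly", "soil nitrogen content at sowing"]
candidate_texts = ["Leaf Area Index (LAI)", "Soil mineral nitrogen", "Grain yield"]
gold = [0, 1]

sims = embed(source_texts) @ embed(candidate_texts).T     # cosine similarities
ranking = sims.argsort(dim=1, descending=True)            # candidates, best first

def precision_at_k(k):
    # With a single gold match per query, this is the fraction of source
    # variables whose correct candidate appears in the top k.
    hits = sum(int(gold[i] in ranking[i, :k]) for i in range(len(gold)))
    return hits / len(gold)

for k in (1, 5, 10):
    print(f"P@{k} = {precision_at_k(min(k, len(candidate_texts))):.2f}")
```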
