Publications by Cirad staff


Non-standard texts: from theoretical positions to Natural Language Processing normalisation

Lopez C., Roche M., Panckhurst R. 2016. Louvain: Université Catholique de Louvain, 1 p. PLIN 2016: Language and the new (instant) media, 2016-05-12, Louvain-la-Neuve (Belgium).

A finalised digital resource of 88,000 anonymised French text messages (the 88milSMS corpus), two extracts (1,000 SMS transcoded into standardised French and 100 linguistically annotated SMS) and sociolinguistic questionnaire data were released in June 2014 for all to download from the Huma-Num web service, under a free-of-charge user licence agreement (Panckhurst et al., 2014). The sud4science project (Panckhurst et al., 2013), in which a group of academics collected authentic text messages from the general public, is part of a vast international initiative (Fairon et al., 2006; Cougnon and Fairon, 2014; Cougnon, 2015) to build a worldwide database and analyse authentic text messages in different languages.

We decided to exclude full transcoding and annotation tagging from the final corpus. This is a theoretical position: annotation is far from neutral and is invariably linked to an interpretative framework. Given the variety of theoretical, disciplinary and scientific stances, there appears to be no true consensus on how to standardise transcoding and linguistic annotation tagging (Panckhurst, 2015). Other researchers may disagree and prefer to provide both 'raw' and fully tagged corpora (Chanier et al., 2014).

This theoretical position does not exclude exploring Natural Language Processing (NLP) investigation techniques, which could then be implemented in real-life applications. Examples of such techniques are as follows:
1) Our corpus can be used to analyse current mediated electronic discourse and to help build knowledge on different SMS writing forms (Roche et al., 2015).
2) Algorithms may be used to learn from the corpus: alignment methods for facilitating automatic transcoding have been explored (Aw et al., 2006; Beaufort et al., 2008; Guimier de Neef and Fessard, 2007; Kobus et al., 2008; Lopez et al., 2014).
3) We have devised a method for classifying 'unknown' items within text messages, which may help to automatically identify lexical 'creativity' within 88milSMS and improve electronic dictionary approaches (Lopez et al., 2015).

In order to refine automatic normalisation techniques for initially non-standard texts in French, the next logical step is to compare our resource with different types of instant media (i.e. SMS, forums, tweets). Firstly, a new typology of the detected 'mistakes', based on existing typologies, will be elaborated. Secondly, automatic normalisation techniques, focusing on the most frequent errors, will be proposed. These will then be compared with the principles of traditional automatic translation (Vilariño et al., 2012), speech recognition (Kobus et al., 2008) and spelling/grammar checkers (Beaufort et al., 2010). Finally, the approach should enable comparison between different types of instant media. (Author's abstract)
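As a purely illustrative aid, the idea of spotting 'unknown' items and normalising the most frequent ones can be sketched with a dictionary lookup. This is a minimal sketch under stated assumptions, not the method of Lopez et al.: the lexicon, the substitution table and the function names (`tokenize`, `unknown_items`, `normalise`) are all hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Hypothetical toy lexicon and substitution table (assumptions for this
# sketch); a real system would use a full French dictionary and a
# corpus-derived mapping of non-standard forms.
LEXICON = {"je", "suis", "en", "retard", "demain", "bisous", "tu", "es"}
SUBSTITUTIONS = {"2m1": "demain", "bizz": "bisous"}

# \w+ matches runs of Unicode letters and digits, so "2m1" is one token.
TOKEN_RE = re.compile(r"\w+")


def tokenize(message):
    """Lower-case a message and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(message.lower())


def unknown_items(messages):
    """Count tokens that are absent from the lexicon."""
    counts = Counter()
    for message in messages:
        for token in tokenize(message):
            if token not in LEXICON:
                counts[token] += 1
    return counts


def normalise(message):
    """Replace known non-standard forms; leave all other tokens as-is."""
    return " ".join(SUBSTITUTIONS.get(tok, tok) for tok in tokenize(message))


if __name__ == "__main__":
    sample = ["Je suis en retard 2m1 bizz", "bizz"]
    print(unknown_items(sample).most_common())
    print(normalise(sample[0]))
```

Counting unknown tokens first gives a frequency-ranked list of candidates, which matches the stated aim of focusing normalisation on the most frequent errors; a lookup table is only the simplest possible stand-in for the alignment and classification methods cited above.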

Associated documents

Conference paper

Cirad staff who authored this publication: