Publications by Cirad staff


Weighted PLSR-DA with unbalanced data – Study case of blood BHB predicted from milk MIR spectra for ketosis detection in dairy herds

Lesnoff M. 2025. In: Bastianelli Denis (ed.), Gilles Chaix (ed.). Résumés des communications présentées aux 26èmes rencontres HélioSPIR, Montpellier (France), 24-25 juin 2025. Montpellier: Association HélioSPIR, 3 p. Rencontres HélioSPIR 2025. 26, 2025-06-24/2025-06-25, Montpellier (France).

PLSR-DA is one of the most popular of the “PLSDA” methods. It predicts a discrete variable y (class membership) from variables X. A dummy table (Ydummy) is built from y, consisting of one binary column per class (1: the observation belongs to the class, 0: the observation does not belong to the class), and PLS2 scores (T) are computed from the data {X, Ydummy}. An MLR is then run on the data {T, Ydummy}, giving predictions Ŷdummy. For a given observation, the final predicted class is the one whose Ŷdummy column shows the maximal value. PLSR-DA has the advantage of being simple and fast even on large datasets (it is in effect a PLSR2 algorithm). Nevertheless, one drawback is that it can lose performance when the number of classes is large (e.g. > 5 classes). This is due to possible masking effects between classes in the PLS2 score space (Hastie et al. 2009). Another drawback is its high sensitivity to unbalanced classes, which generates biased predictions. For instance, when y contains a predominant class, the method will systematically favor the prediction of this class. This presentation tackles this last point. A common approach to managing unbalanced classes is to subsample the training data {X, y} to equalize the class importance. For instance, if a two-class training dataset (classes A and B) consists of nA = 1000 obs. and nB = 200 obs. (n = nA + nB = 1200), the approach is to sample mA = 200 obs. of class A from the 1000 observations available, and to calibrate the model on the mA + nB = 400 obs. This approach can be too aggressive for the training set, especially in the presence of rare classes (e.g. for classes A and B above, if nB = 10 obs., the training set would decrease from 1010 obs. to 20 observations!). More generally, the limitation is that the model calibration does not use all the available information.
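The bookkeeping described in the abstract (dummy coding, the maximal-value decision rule, and subsampling to equalize classes) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the author's implementation: the function names dummy_table, predicted_classes and balanced_subsample are hypothetical, the PLS2 and MLR steps are deliberately left out, and the y_hat values are invented stand-ins for the predicted dummy table.

```python
import random
from collections import Counter, defaultdict


def dummy_table(y, classes):
    """One binary column per class: 1 if the observation belongs to it, else 0."""
    return [[1 if label == c else 0 for c in classes] for label in y]


def predicted_classes(y_hat, classes):
    """Assign each observation to the class whose column has the maximal value."""
    return [classes[max(range(len(classes)), key=row.__getitem__)] for row in y_hat]


def balanced_subsample(y, seed=0):
    """Return sorted row indices keeping, per class, as many rows as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    m = min(len(idx) for idx in by_class.values())
    kept = []
    for label in sorted(by_class):
        kept.extend(rng.sample(by_class[label], m))
    return sorted(kept)


# Toy labels matching the abstract's example: nA = 1000, nB = 200.
y = ["A"] * 1000 + ["B"] * 200
classes = ["A", "B"]

Ydummy = dummy_table(y, classes)
kept = balanced_subsample(y)
counts = Counter(y[i] for i in kept)

# Hypothetical predicted dummy values for three observations (not model output).
y_hat = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45]]
print(predicted_classes(y_hat, classes))
print(len(kept), dict(counts))
```

With these toy labels the subsample keeps 200 observations per class, so 800 of the 1000 class A observations are never seen by the model, which is the loss of information the abstract points out.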

Associated documents

Conference paper
