IID-2 Natural Language Processing

Description of the theme

Natural Language Processing (NLP) develops algorithms for accessing knowledge conveyed by unstructured, heterogeneous text (e.g. text mining, machine reading) and for language-mediated human-machine communication (e.g. chatbots). Building on the very active community in this field, we aim to create NLP algorithms that process and produce language in context, in all its modalities (spoken, written, and signed), take advantage of both linguistic and domain knowledge, draw inferences when extracting knowledge from language input, and explain themselves, all in a robust, scalable way. NLP combines challenges from data, knowledge, machine learning, and human interaction. The main difficulty is that NLP must estimate a large number of parameters with limited prior knowledge and training data, across many different languages and specialized domains. This results in strong data dependencies that affect experiment replicability.

In 2022, the « hot topics » in Natural Language Processing are all linked in some way to neural approaches, not only for machine learning across all aspects of language modeling and processing but also for investigating how language is handled in the human brain.

Language in the human brain

« Model-based analysis of brain activity reveals the hierarchy of language in 305 subjects », Charlotte Caucheteux, Alexandre Gramfort, Jean-Rémi King, Findings of EMNLP 2021,
https://aclanthology.org/2021.findings-emnlp.308.pdf

– transfer learning across tasks (multilinguality, multimodality); learning medium-sized language models that are easier to handle and deploy for specific tasks and domains, « Re-train or Train from Scratch? Comparing Pre-training Strategies of BERT in the Medical Domain », Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Pierre Zweigenbaum

Bias in language models

« French CrowS-Pairs: Extension à une langue autre que l’anglais d’un corpus de mesure des biais sociétaux dans les modèles de langue masqués » [French CrowS-Pairs: extending to a language other than English a corpus for measuring societal bias in masked language models], Aurélie Névéol, Yoann Dupont, Julien Bezançon, Karën Fort,
https://hal.inria.fr/hal-03680574/document
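The CrowS-Pairs methodology compares a language model's scores on minimal sentence pairs that differ only in the social group mentioned; a model that systematically prefers the stereotyped variant is considered biased. A minimal sketch of this pair-comparison metric, using a toy add-one-smoothed unigram scorer as a stand-in for a real masked language model (the corpus and pairs are illustrative, not from the cited dataset):

```python
import math
from collections import Counter

def unigram_scorer(corpus_tokens):
    """Build a toy log-probability scorer; a stand-in for a real masked LM."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts)
    def score(sentence):
        # Add-one smoothed unigram log-likelihood of the sentence.
        return sum(math.log((counts[w] + 1) / (total + vocab))
                   for w in sentence.split())
    return score

def bias_preference(score, pairs):
    """Fraction of minimal (stereotyped, anti-stereotyped) pairs where the
    model scores the stereotyped variant higher; ~0.5 suggests no preference."""
    prefer = sum(score(s) > score(a) for s, a in pairs)
    return prefer / len(pairs)

# Tiny skewed corpus: "he" co-occurs with "doctor" more often than "she".
corpus = "he is a doctor he is a doctor she is a nurse".split()
pairs = [("he is a doctor", "she is a doctor")]
score = unigram_scorer(corpus)
print(bias_preference(score, pairs))  # → 1.0: the stereotyped variant wins
```

The real metric uses pseudo-log-likelihoods from a masked language model over the full CrowS-Pairs sentence pairs; the comparison logic is the same.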

Dialogue as a universal task for NLP, and large-sized language resources for French

« Benchmarking Transformers-based models on French Spoken Language Understanding tasks », Oralie Cattan, Sahar Ghannay, Christophe Servan, Sophie Rosset, INTERSPEECH 2022.
https://hal.archives-ouvertes.fr/hal-03715340v2/document ;
« État de l’art des technologies linguistiques pour la langue française » [State of the art of language technologies for the French language], Gilles Adda, Annelies Braffort, Ioana Vasilescu, François Yvon and Jean-François Nominé, « European Language Equality »,
https://hal.archives-ouvertes.fr/hal-03637784 ;
« Livraison du plus grand modèle de langue multilingue « open science » jamais entraîné » [Delivery of the largest « open science » multilingual language model ever trained], François Yvon, Pierre François Lavallé, Véronique Etienne, 12/07/2022,
https://www.cnrs.fr/fr/livraison-du-plus-grand-modele-de-langue-multilingue-open-science-jamais-entraine


– optimizing training by reducing the amount of training data with paradigms such as priming or prompting, « Decorate the Examples: A Simple Method of Prompt Design for Biomedical Relation Extraction », Hui-Syuan Yeh, Thomas Lavergne, Pierre Zweigenbaum, https://aclanthology.org/2022.lrec-1.403.pdf
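Prompt-based relation extraction of this kind marks the candidate entities in the input and casts classification as a cloze question that a masked language model completes. A minimal sketch of the idea (the marker tokens, template, and label verbalizer below are illustrative assumptions, not the paper's exact design):

```python
def decorate(sentence, head, tail):
    """Wrap the candidate entities with marker tokens so the model
    can locate them (marker strings are illustrative)."""
    return (sentence.replace(head, f"[E1] {head} [/E1]")
                    .replace(tail, f"[E2] {tail} [/E2]"))

def build_prompt(sentence, head, tail):
    """Append a cloze question; a masked LM would fill [MASK] with a
    label word that a verbalizer maps back to a relation type."""
    decorated = decorate(sentence, head, tail)
    return f"{decorated} The relation between {head} and {tail} is [MASK]."

# Hypothetical verbalizer: label word predicted at [MASK] -> relation type.
VERBALIZER = {"treats": "TREATMENT", "causes": "CAUSE"}

prompt = build_prompt("Aspirin treats headache in most patients.",
                      "Aspirin", "headache")
print(prompt)
# → [E1] Aspirin [/E1] treats [E2] headache [/E2] in most patients.
#   The relation between Aspirin and headache is [MASK].
```

Because the model reuses its pretraining objective to fill the mask, far fewer labeled examples are needed than for training a classification head from scratch.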

– privacy-preserving machine learning: participation of the ILES team (LISN) in the European project MAPA (2020-2021), which developed a corpus anonymization application for 24 European languages (https://mapa-project.eu/)
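Corpus anonymization of this kind detects personal identifiers and replaces them with category placeholders so the text remains usable for downstream NLP. A schematic sketch of that replacement step (real systems such as MAPA rely on trained named-entity recognizers; the regex patterns here are purely illustrative):

```python
import re

# Illustrative patterns only; production anonymizers use trained NER models.
PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d .-]{7,}\d\b"), "[PHONE]"),
    (re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+\b"), "[PERSON]"),
]

def pseudonymize(text):
    """Replace detected identifiers with category placeholders,
    preserving the surrounding text."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(pseudonymize("Contact Dr. Smith at smith@example.org or +33 1 23 45 67 89."))
# → Contact [PERSON] at [EMAIL] or [PHONE].
```

Keeping category placeholders (rather than deleting the spans) preserves sentence structure, which matters when the anonymized corpus is later used to train or evaluate NLP models.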

– Multimodality (image, video, audio, text; oro-facial movements, co-verbal gestures, etc.), « Multi-Track Bottom-Up Synthesis from Non-Flattened AZee Scores », Paritosh Sharma, Michael Filhol, LREC 2022,
http://lrec-conf.org/proceedings/lrec2022/workshops/sltat/pdf/2022.sltat-1.16.pdf