IID-3 Machine learning

Antoine Cornuéjols (Link, Agro Paris)
Emmanuel Vazquez (L2S, CentraleSupélec)

Recent advances in machine learning have made it the core of modern data and knowledge management. Current challenges concern the development of richer, more expressive models (aka deep learning) that leverage the complexity of large-scale data. Ensuring efficient learning on the available computational architectures and a seamless composition of such models remain open questions. Online learning is key to scaling up machine learning and to handling data structured as streams. In data-rich settings, abstractions can be learned from data, leading to end-to-end learning frameworks in which learning procedures handle the full data processing with little to no problem-specific tuning. Real-life learning systems are operated in open-loop mode, a setting also called continuous or lifelong learning.

The counterpart of enhanced expressivity is the instability of learning systems, which are subject to catastrophic failures and sensitive to biases; ensuring robustness in learning is thus a major endeavour. More generally, the characterization of the performance of such systems is a fundamental question, which DigiCosme will address with concepts borrowed from information theory, in collaboration with Comex and Scilex. Beyond reliability, learning systems should provide interpretable and explainable models that enable users to understand the reasons for an outcome; a pending challenge here is to analyse the causal structure inherent in complex data. The solutions to these problems are expected at the crossroads of machine learning and symbolic artificial intelligence. Computer vision and its counterpart, machine listening, although already actively studied in DigiCosme, will provide favorable ground for combining high-level, structural and symbolic representations and reasoning with machine learning and uncertainty modeling. E-health, life sciences and, more generally, e-science will also be targeted as fields where interpretability matters as much as performance.

Data also come with limitations: missing data, heterogeneity and, more generally, dirty data are not handled by most standard procedures, and data imbalance compromises many real-life learning schemes. When data are scarce, generalization to unseen categories (aka zero-shot learning) is needed. In such cases, transfer learning and semi-supervised learning provide essential tools, as do attempts to learn universal representations. Finally, relying on existing knowledge bases is a means to deal with data shortage.