Thematic School 2021 : Graph as models in life sciences: Machine learning and integrative approaches

« Graph as models in life sciences: Machine learning and integrative approaches » is a fall school (mid-October 2021) on bioinformatics and statistical/machine learning, supported by the Labex DigiCosme, with graphs as a central theme.

This thematic school is interdisciplinary and dedicated to applying state-of-the-art machine learning methodologies to ongoing challenges in bioinformatics research. Recently developed statistical methods that integrate a priori domain knowledge have already proven effective in major subfields of bioinformatics (for example, graph neural networks for protein analysis and variational autoencoders for transcriptomic data).

The objective of the school is to strengthen the links and exchanges between laboratories and researchers in this interdisciplinary community, and also to expose PhD students and young researchers working in one of the two disciplines (bioinformatics or machine learning) to the other. Having this opportunity early in a career is essential to bolster interdisciplinary research.

In particular, the school aims to contribute to the dissemination of good practices in the use of machine learning as a modelling tool in biology. Speakers will detail methodologies and biological outcomes, as well as risks, biases and their active mitigation in the context of predictive bioinformatics. Data privacy and leakage, for example in the medical field, are also among the challenges faced by researchers, companies and civil society. They raise technical questions specific to the data being handled.

Hence, we believe it is stimulating to address these methodologies and issues in a contextualized way, while providing sufficient theoretical grounding to allow participants to transfer the acquired knowledge to other biological problems.

Access to the videos of the event

Date

25th – 29th of October

Place

Online

Registration

Registration is free but mandatory*

* Limited number of places for the tutorials

Registration has been closed

Program* (Paris time)

Day 1 – Monday 25th October	8 h 45 – 9 h 00	Introduction	Flora Jay / Yann Ponty
	9 h 00 – 12 h 45	Lecture 1	Laurent Jacob
		Lunch
	14 h 00 – 17 h 30	Lecture 2	Simona Cocco

Day 2 – Tuesday 26th October	9 h 00 – 12 h 30	Tutorial topic 1, 2	Laurent Jacob (1) / Simona Cocco (2)
		Lunch
	14 h 00 – 17 h 30	Lecture 3	Sergei Grudinin

Day 3 – Wednesday 27th October	9 h 00 – 12 h 30	Lecture 4	Chloé-Agathe Azencott
		Lunch
	14 h 00 – 17 h 30	Tutorial topic 3, 4	Sergei Grudinin (3) / Chloé-Agathe Azencott (4)

Day 4 – Thursday 28th October	9 h 00 – 12 h 30	Lecture 5	Andrei Zinovyev
		Lunch
	14 h 00 – 17 h 30	Participants’ talks/poster	Everyone

Day 5 – Friday 29th October	9 h 30 – 12 h 45	Lecture 6	Jean Louis Raisaro
		Lunch
	14 h 00 – 17 h 30	Tutorial topic 5, 6	Andrei Zinovyev (5) / Jean Louis Raisaro (6)

*Exact times may change

Speakers

**Chloé-Agathe Azencott**, associate professor at CBIO (MINES ParisTech, Institut Curie, INSERM), Springboard Chair PRAIRIE

**Jean Louis Raisaro**, PhD, Data Science Team Lead, ICT Department, Lausanne University Hospital

Themes

Lecture/Tutorial 1 – Laurent Jacob: Learning with biological sequences, from neural networks to de Bruijn graphs
The tutorial will introduce elementary tools to make predictions from unaligned biological sequences and to perform genome-wide association studies over such sequences.
Keywords: convolutional neural networks, sequence motifs, k-mers, compacted de Bruijn graphs, bacterial GWAS.
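As a rough illustration of two of the keywords above (k-mers and de Bruijn graphs), here is a minimal sketch in plain Python. It is not taken from the tutorial material: the function names `kmers` and `de_bruijn_graph` and the toy reads are assumptions made for this example, and the graph is not compacted (unbranched paths are not merged into single nodes).

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return every overlapping substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def de_bruijn_graph(sequences, k):
    """Build an (uncompacted) de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer contributes one edge from its
    prefix (first k-1 letters) to its suffix (last k-1 letters).
    """
    graph = defaultdict(set)
    for sequence in sequences:
        for kmer in kmers(sequence, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads, chosen only to make the example easy to follow.
reads = ["ACGTAC", "CGTACG"]
graph = de_bruijn_graph(reads, k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

Because the two reads overlap, they share the same k-mers, so the second read adds no new edges: this is how a de Bruijn graph represents unaligned sequences without any alignment step.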

Lecture/Tutorial 2 – Simona Cocco, Andrea Di Gioacchino: Direct Coupling Analysis and Restricted Boltzmann Machines to infer generative models of RNA and proteins from sequence data
Lecture: Introduction to inference methods that extract information on the structure and function of RNA and proteins from sequence data. We will introduce two simple network architectures to infer generative models from sequence data: a direct interaction graph between the input variables (e.g. the sequence), used in the so-called Direct Coupling Analysis (DCA), and a bipartite graph on two layers of variables, called a Restricted Boltzmann Machine (RBM). For both models we will introduce algorithms to efficiently infer their parameters from the sequence data.
Applications to the prediction of structure and of the effects of mutations on the fitness of the protein/RNA, as well as to protein/RNA design, will be described.
Tutorial: Application of DCA and RBM to sequence data.
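For orientation, the two model families named above are commonly written as follows. These are the usual textbook forms, not formulas taken from the lecture; the notation (fields $h_i$, couplings $J_{ij}$, visible potentials $g_i$, hidden units $z_\mu$ with potentials $U_\mu$, weights $w_{i\mu}$) is assumed here for illustration.

```latex
% Pairwise (Potts) model used in Direct Coupling Analysis,
% for a sequence (a_1, ..., a_L) of length L
P(a_1, \dots, a_L) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{L} h_i(a_i)
    + \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j) \Big)

% Restricted Boltzmann Machine: visible units v_i (sequence sites)
% interact only with hidden units z_\mu, giving a bipartite graph
P(\mathbf{v}, \mathbf{z}) = \frac{1}{Z} \exp\Big( \sum_{i} g_i(v_i)
    - \sum_{\mu} U_\mu(z_\mu)
    + \sum_{i, \mu} z_\mu \, w_{i\mu}(v_i) \Big)
```

In the first model the graph edges are the couplings $J_{ij}$ between sequence sites; in the second, sites are coupled only indirectly, through the hidden layer.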

Lecture/Tutorial 3 – Sergei Grudinin, Ilia Igashov, Margot Selosse: Geometric deep learning in structural bioinformatics
Lecture: The potential of deep learning has been recognized in structural bioinformatics for some time and became indisputable after the CASP13 (Critical Assessment of Structure Prediction) community-wide experiment in 2018. In CASP14, held in 2020, deep learning boosted the field to unexpected levels, reaching near-experimental accuracy. Its results demonstrate dramatic improvement in computing the three-dimensional structure of proteins from the amino acid sequence, with many models rivaling experimental structures. This success comes from advances transferred from several machine-learning areas, including computer vision and natural language processing. Novel emerging approaches include, among others, geometric learning, i.e., learning on non-regular representations such as graphs, 3D Voronoi tessellations, and point clouds; equivariant architectures preserving the symmetry of 3D space; and truly end-to-end architectures, i.e., single differentiable models starting from a sequence and returning a 3D structure. This lecture will present this recent progress and also a range of related structural biology applications.
Tutorial: This tutorial will start with an introduction to the PyTorch Geometric library. It will then present a basic description of graph-learning architectures, including convolution and attention operations. The first examples will include binary classification of 3D protein structures. Afterwards, we will apply the presented architectures to a regression task: predicting the properties of small molecules in the QM9 dataset. Finally, we will introduce more advanced architectures, specifically constructed to be rotation and translation equivariant, for property prediction on 3D molecular graphs.
Keywords: geometric deep learning; graph convolutional networks; graph attention networks; rotation-equivariant architectures; 3D molecular graphs
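To make the idea of a graph convolution concrete, the sketch below implements one message-passing layer in plain Python: each node averages its own features with those of its neighbours, applies a linear map, then a ReLU. This is a simplified illustration under assumed names (`mean_graph_convolution`, the toy graph and weights); it deliberately does not use or reproduce the PyTorch Geometric API covered in the tutorial.

```python
def mean_graph_convolution(features, edges, weight):
    """One simplified graph-convolution layer.

    features: list of per-node feature vectors (lists of floats)
    edges:    list of undirected (i, j) node-index pairs
    weight:   matrix of shape (in_dim, out_dim), as a list of rows
    """
    n_nodes = len(features)
    # Every node is its own neighbour (self-loop), as in many GCN variants.
    neighbours = [{i} for i in range(n_nodes)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    in_dim, out_dim = len(weight), len(weight[0])
    output = []
    for i in range(n_nodes):
        # Aggregate: mean of the neighbourhood's features.
        aggregated = [
            sum(features[j][d] for j in neighbours[i]) / len(neighbours[i])
            for d in range(in_dim)
        ]
        # Transform: linear map followed by a ReLU non-linearity.
        transformed = [
            sum(aggregated[d] * weight[d][o] for d in range(in_dim))
            for o in range(out_dim)
        ]
        output.append([max(0.0, value) for value in transformed])
    return output


# Hypothetical 3-node path graph 0 - 1 - 2 with 2 features per node.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
weight = [[2.0], [-1.0]]  # maps 2 input features to 1 output feature
hidden = mean_graph_convolution(features, edges, weight)
print(hidden)
```

The result depends only on which nodes are connected, not on the order in which nodes or edges are listed, which is the property that makes such layers suitable for graphs.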

Lecture/Tutorial 4 – Chloé-Agathe Azencott: Boosting genome-wide association studies (GWAS) with biological networks
The lecture will be based on the following publication: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008819
Tutorial: GWAS and biological networks. The tutorial will be an opportunity for participants to dig deeper into both the motivation for incorporating biological networks into GWAS and several of the existing models for this purpose. The tutorial will be organized as a discussion around several published models. Participants are welcome to join with their own questions for discussion. This is not a hands-on session.
Keywords: networks, GWAS, graph regularization, diffusion on graphs, omnigenic model
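One common form of the graph regularization mentioned in the keywords penalizes effect sizes that differ between connected nodes of a network, using the graph Laplacian. The sketch below is an illustration under assumptions, not a model from the cited publication: the function names, the toy 4-node network and the effect vectors are all hypothetical.

```python
def graph_laplacian(n_nodes, edges):
    """Unweighted graph Laplacian L = D - A, as a list of rows."""
    laplacian = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        laplacian[i][i] += 1.0
        laplacian[j][j] += 1.0
        laplacian[i][j] -= 1.0
        laplacian[j][i] -= 1.0
    return laplacian


def laplacian_penalty(effects, edges):
    """Smoothness penalty beta^T L beta, computed directly from the matrix."""
    n_nodes = len(effects)
    laplacian = graph_laplacian(n_nodes, edges)
    return sum(
        effects[i] * laplacian[i][j] * effects[j]
        for i in range(n_nodes)
        for j in range(n_nodes)
    )


def edge_penalty(effects, edges):
    """The same quantity written as a sum of squared differences over edges."""
    return sum((effects[i] - effects[j]) ** 2 for i, j in edges)


# Hypothetical network: a path 0 - 1 - 2 - 3 over four genes or SNPs.
network = [(0, 1), (1, 2), (2, 3)]
smooth_effects = [1.0, 1.0, 1.0, 0.0]  # neighbours mostly agree
rough_effects = [1.0, 0.0, 1.0, 0.0]   # neighbours always disagree
print(laplacian_penalty(smooth_effects, network))
print(laplacian_penalty(rough_effects, network))
```

Adding such a penalty to a regression objective favours solutions in which connected features receive similar weights, which is one way of encoding network knowledge as a prior.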

Lecture/Tutorial 5 – Andrei Zinovyev: Structured learning for single-cell differentiation trajectories
Thanks to the emergence of single-cell assays, it is now possible to measure gene expression and other genome-scale molecular profiles across thousands to millions of single cells (scRNA-Seq, scATAC-Seq, single-cell proteomic data). Using these data, it is possible to look for paths in the data that may be associated with the level of cellular commitment with respect to a specific biological process, and to use the position (called pseudotime) of cells along these paths (called trajectories) to explore how gene expression reflects changes in cell states as the cells progressively commit to a given fate. This kind of analysis is a powerful tool that has been used, e.g., to explore the biological changes associated with development, cellular differentiation and cancer biology. Several graph-based approaches have been suggested to extract cellular trajectories from single-cell data, including minimum spanning trees and principal graphs. Principal graphs approximate the multivariate data by a graph injected into the data space, with some constraints imposed on the node mapping. In the lecture and tutorial, we will explore the basic concepts and tools for applying graph-based approaches to scRNA-Seq datasets.
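The minimum-spanning-tree idea described above can be sketched in a few lines: connect the cells with a minimum spanning tree, pick a root cell, and define pseudotime as the distance from the root along the tree. This is a deliberately simplified illustration, not the principal-graph method presented in the lecture; the function names and the 2D toy "cells" are assumptions, and real data would be high-dimensional and noisy.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (node_in_tree, new_node, length) edges.
    """
    n_points = len(points)
    in_tree = {0}
    tree_edges = []
    while len(in_tree) < n_points:
        # Shortest edge leaving the current tree.
        length, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n_points)
            if j not in in_tree
        )
        tree_edges.append((i, j, length))
        in_tree.add(j)
    return tree_edges


def pseudotime(tree_edges, n_points, root=0):
    """Distance of every point from the root, measured along the tree."""
    adjacency = {i: [] for i in range(n_points)}
    for i, j, length in tree_edges:
        adjacency[i].append((j, length))
        adjacency[j].append((i, length))
    times = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for neighbour, length in adjacency[node]:
            if neighbour not in times:
                times[neighbour] = times[node] + length
                stack.append(neighbour)
    return [times[i] for i in range(n_points)]


# Hypothetical cells in a 2D expression space: a trunk (cells 0, 1, 2)
# that splits into two branches (cells 3 and 4).
cells = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
tree = minimum_spanning_tree(cells)
print(pseudotime(tree, len(cells), root=0))
```

Cells 3 and 4 receive the same pseudotime although they lie on different branches, which illustrates why a trajectory is described by both a branch and a position along it.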

Lecture/Tutorial 6 – Jean Louis Raisaro, Jules Fasquelle: Privacy-Preserving Federated Analytics for Personalized Medicine
Using real-world evidence in biomedical research, an indispensable complement to clinical trials, requires access to large quantities of patient data that are typically held separately by multiple healthcare institutions. Centralizing those data for a study is often infeasible due to privacy and security concerns. In this lecture, we will start with an overview of emerging privacy-preserving techniques such as federated analytics, homomorphic encryption, secure multi-party computation and differential privacy. We will then introduce FAMHE, a novel federated analytics approach (https://www.biorxiv.org/content/10.1101/2021.02.24.432489v2) that, based on multiparty homomorphic encryption (MHE) and federated learning (FL), enables privacy-preserving analyses of distributed medical data across a group of institutions, without sharing patient-level data and while ensuring strong privacy guarantees. Finally, we will demonstrate the applicability of FAMHE to essential biomedical analysis tasks, including Kaplan-Meier survival analysis in oncology and genome-wide association studies in medical genetics. Using FAMHE, we accurately and efficiently reproduced two published centralized studies in a federated setting, obtaining results that no individual institution could have obtained alone.
Tutorial: Practical Introduction to Federated Learning. In this tutorial we will introduce the basic concepts underlying Federated Learning and focus on gaining hands-on experience. We will implement a simple FL system, showcase it on a toy dataset and give a quick introduction to some useful libraries and tools needed to train models on decentralized, sensitive data without accessing them directly.
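As a minimal illustration of the federated learning principle described above, the sketch below trains a one-parameter model y = w · x with federated averaging: each client runs a few gradient steps on its own data, and only the resulting weights, never the data points, are sent to the server for averaging. It is not FAMHE and includes no encryption or differential privacy; the function names, hyperparameters and toy datasets are assumptions made for this example.

```python
def local_update(weight, data, learning_rate=0.1, epochs=5):
    """Run a few gradient-descent steps on one client's private data.

    The model is y = weight * x with a mean-squared-error loss.
    """
    for _ in range(epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight


def federated_average(client_datasets, rounds=20, initial_weight=0.0):
    """Federated averaging: only model weights leave each client."""
    weight = initial_weight
    total_samples = sum(len(data) for data in client_datasets)
    for _ in range(rounds):
        # Each client starts from the current global weight.
        local_weights = [local_update(weight, data) for data in client_datasets]
        # The server averages the results, weighted by local dataset size.
        weight = sum(
            local_weight * len(data)
            for local_weight, data in zip(local_weights, client_datasets)
        ) / total_samples
    return weight


# Hypothetical toy data held by two institutions; both follow y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]
global_weight = federated_average(clients)
print(round(global_weight, 4))
```

In a real deployment the exchanged weights can still leak information about the underlying data, which is the motivation for combining federated learning with techniques such as the homomorphic encryption discussed in the lecture.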

Registration is still open for the lectures; the tutorials are full.
Tutorial allocations:

Tutorials – Final allocations (download)

Organizers

Flora Jay, CR CNRS, LISN
Yann Ponty, DR CNRS, LIX
Ariane Migault, Communications Officer, Labex DigiCosme

Computing resources
Thanks to Marco Leoni and Nicolas Thiery, who are helping to set up access and tutorial installation on the UPSaclay JupyterHub (informatique-scientifique@UPSaclay, in collaboration with IJCLAB & Cloud@VirtualData).

Some of the workshops were run on the JupyterHub@Paris-Saclay service, hosted by the DataCenter@UPSud mesocentre
and managed by informatique-scientifique@UPSaclay in collaboration with IJCLAB & Cloud@VirtualData.