2017 | HiDimStat

Keywords: statistical inference, estimation, high dimension, sparsity, ensembling, automatic clustering.


Funding start date: 01/10/2017
DigiCosme axis & task: DataSense
Subject:
The HiDimStat project aims at handling uncertainty in the challenging context of high-dimensional regression problems. Although sparse models have been popularized over the last twenty years in contexts where many features can explain a phenomenon, attributing confidence to the predictive models they produce remains a burning issue. This question is hard both from the statistical modeling point of view and from a computational perspective. Indeed, in practical settings, the number of features at stake (possibly up to several million in high-resolution brain imaging) limits the application of current methods and requires new algorithms to achieve computational efficiency. We plan to leverage recent developments in sparse convex solvers, as well as more efficient reformulations of tests and confidence interval estimates, to provide several communities with practical software for uncertainty quantification. Specific validation experiments will be performed in the field of brain imaging.
Thesis supervisors: Bertrand Thirion, Inria; Joseph Salmon, LTCI; and Yohann de Castro, Inria
Institutions: Inria Saclay, LTCI
Managing laboratory: Inria Saclay
PhD student: Jérôme-Alexis Chevalier
Scientific publications:

  • Jérôme-Alexis Chevalier, Joseph Salmon, Bertrand Thirion. Statistical Inference with Ensemble of Clustered Desparsified Lasso. Accepted at MICCAI 2018. https://hal.inria.fr/hal-01815255v1
  • Jérôme-Alexis Chevalier, Joseph Salmon, Bertrand Thirion. Statistical Inference on Structured Data. SFdS 2018.

Resources: Scikit-Learn


Context:
In many scientific fields, data-acquisition devices have benefited from hardware improvements that increase the resolution of the observed phenomena, leading to ever larger datasets. While the dimensionality of the resulting measurements has increased, the number of samples available is often limited, due to time, physical or financial constraints. Their signal-to-noise ratio is also often intrinsically limited by the physics of the measurement process, so that the resulting datasets display weak contrasts and require advanced statistical modeling. A key issue is then to perform reliable inference on these data: this difficulty has been acknowledged as a major roadblock in brain imaging and genomics.

Multivariate Statistical Models are used to explain some responses of interest through a combination of measurements. Finding such an explanation requires fitting an estimator, whose accuracy is assessed through a pre-defined score of merit, such as prediction accuracy. For instance, one might predict the likelihood for individuals to declare a certain type of disease based on genotyping information: such an analysis reveals i) to what extent the outcome variable is indeed predicted by the measurements and ii) which measurements carry useful information for the prediction. Such multivariate estimators are viewed as powerful tools because they leverage the distribution of information across measurements. Yet, they suffer from the curse of dimensionality: the number p of observed features is much larger than the number n of observations (p >> n), calling for some regularization to make the ensuing estimation problem well-posed. Thus, the estimators investigated reflect a trade-off between data fitting and an a priori regularization scheme, and need to be carefully calibrated.

Multivariate statistical inference. Let us denote the prediction target, a vector of size n, by y, and the signal, a set of n p-dimensional observations, by X. The estimation problem consists in estimating a p-dimensional weight vector w in order to fit y given X. The regularizing prior on w can be written R(w); assuming a regression setting, the model fit reads

min_w ‖y − Xw‖² + λ R(w),

where w can be understood as a discriminative pattern, i.e. a map on the features that measures their importance in the fit of y, and λ > 0 is a parameter controlling the aforementioned trade-off. The regularization R(w) is often a convex function, mostly for computational reasons. Moreover, the particular choice R(w) = ‖w‖_1, called the Lasso, yields a sparse solution, but more complex structures can also be enforced on the targeted patterns w. Popular instances of this model in imaging are the well-studied total variation penalty, which enforces piece-wise constant regularity, and R(w) = ‖w‖_{2,1}, which enforces group/block structures.

Statistical inference on multivariate models. An important question is then whether w can be characterized statistically, and not only qualitatively, e.g. whether one can make statements such as P(w_i > ŵ_i | i is not associated) < α (i.e., assessing the influence of feature i on the phenomenon investigated), where α is a prescribed threshold; a second type of statement is P(‖w_G‖ > ‖ŵ_G‖ | variables in G are not associated) < α, where G is a given group of features, i.e. rejecting the null hypothesis for a group of features. Rejecting the null hypothesis leads to the conclusion that the corresponding feature or group is a necessary component of the multivariate model. While testing is well-posed when the analysis deals with the different variables in isolation, it becomes challenging when many variables are included in X, and no proper solution is available when p becomes very large (typically of the order of 10^6 in our brain imaging applications).
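As a concrete, minimal sketch of this penalized model, the snippet below fits a Lasso (R(w) = ‖w‖_1) on synthetic data in the p >> n regime using scikit-learn (listed under Resources above). The data sizes and the regularization value alpha (playing the role of λ) are arbitrary choices for illustration, not recommendations.

```python
# Minimal sketch: fit min_w ||y - Xw||^2 + lambda * ||w||_1 on synthetic p >> n data.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 100, 10_000                  # far fewer samples than features
w_true = np.zeros(p)
w_true[:10] = 1.0                   # only 10 truly relevant features
X = rng.randn(n, p)
y = X @ w_true + 0.5 * rng.randn(n)

# alpha plays the role of lambda; in practice it is calibrated, e.g. by cross-validation.
lasso = Lasso(alpha=0.1).fit(X, y)
w_hat = lasso.coef_                 # estimated discriminative pattern
print("non-zero coefficients:", np.count_nonzero(w_hat))
```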

Computational costs associated with multivariate models. In this regime, where p is very large and n grows to thousands or tens of thousands, the computational cost of the fit becomes a major practical issue, especially since state-of-the-art hyper-parameter selection procedures typically rely on cross-validation, which entails the cost of many fits. In particular, non-parametric procedures for significance control, which are often taken as a gold standard, require repeating the estimation on permuted data several thousand times. The computational cost of these estimators is thus a major hurdle at the moment, and needs further improvements in view of the targeted applications.
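To make the cost of such non-parametric calibration concrete, here is a hedged sketch of a max-statistic permutation test: the whole sparse model is refit once per permutation, so the total cost is the number of permutations times the cost of a single fit. This is only an illustration; the statistic, estimator and parameter values are arbitrary and do not describe the project's actual inference procedure.

```python
# Sketch: permutation-based significance control; one full model refit per permutation.
import numpy as np
from sklearn.linear_model import Lasso

def permutation_pvalue(X, y, n_permutations=1000, alpha=0.1, seed=0):
    rng = np.random.RandomState(seed)
    # Observed statistic: largest absolute coefficient of the sparse fit.
    stat = np.abs(Lasso(alpha=alpha).fit(X, y).coef_).max()
    null_stats = np.empty(n_permutations)
    for b in range(n_permutations):
        y_perm = rng.permutation(y)        # break the association between X and y
        null_stats[b] = np.abs(Lasso(alpha=alpha).fit(X, y_perm).coef_).max()
    # p-value: fraction of permuted statistics at least as extreme as the observed one.
    return (1 + np.sum(null_stats >= stat)) / (1 + n_permutations)
```

With p of the order of 10^6 and thousands of permutations, the loop above makes the cost of a single fit the dominant factor, hence the need for faster solvers.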

Statistical inference on datasets with a huge number of features is an open problem, yet it is required in many fields of data science, where models call for rigorous statistical assessment. We aim at developing theoretically grounded and practical estimation procedures that make statistical inference feasible in such hard cases.

Scientific objective:
The project will rely on two main building blocks to develop novel strategies for inference:

  • First, high-dimensional data often display some statistical structure (short- or long-range correlations) that can be leveraged to make statistical tests consistent in high dimension, but also to reduce the amount of computation: for instance, data-driven feature clustering has been shown to improve accuracy and sensitivity, as well as computational efficiency, in statistical tests (see the sketch after this list).
  • The second idea is that of sparsity: in recent works, statistical tests have increasingly been addressed by analyzing the solution of sparse estimators. While the ensuing solutions are very elegant, they do not generalize well to n << p settings. Crucially, unless the solution is indeed very sparse, sparsity alone leads to poor and, in most cases, unstable estimators: it has to be combined with some form of randomization to yield useful estimators.
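As a hedged sketch of the clustering idea referenced in the first item, the snippet below uses scikit-learn's FeatureAgglomeration to group correlated features, fits a sparse model on the cluster averages, and broadcasts the cluster-level coefficients back to the original features. The sizes, the number of clusters and the regularization level are arbitrary; this shows only the dimension-reduction step, not the full inference procedure.

```python
# Sketch: data-driven feature clustering reduces the problem from p features
# to a few hundred clusters before the sparse fit.
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p, n_clusters = 100, 2000, 100                 # illustrative sizes only
X = rng.randn(n, p)
y = X[:, :10].sum(axis=1) + 0.5 * rng.randn(n)

agglo = FeatureAgglomeration(n_clusters=n_clusters)
X_reduced = agglo.fit_transform(X)                # n x n_clusters matrix of cluster averages
w_clusters = Lasso(alpha=0.1).fit(X_reduced, y).coef_
w_features = agglo.inverse_transform(w_clusters)  # map back to the p original features
print(w_features.shape)                           # (2000,)
```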

Sparse methods providing confidence bounds on the estimated coefficients often require the computation of the precision matrix, that is, the inverse of the feature correlation matrix. This is the case in particular for variants of the de-sparsified Lasso. Evaluating such estimators remains a challenge: so far, current solvers perform a brute-force evaluation of p sparse estimators, one for each column of the matrix (see also the empirical comparisons reported in the literature). Relying on sparse precision matrix estimation can improve tests and confidence bounds, but requires new algorithmic developments. In particular, efficient solvers must be proposed to leverage the sparsity of the targeted precision matrix. This can be done by revisiting the optimization problems to be solved and by adapting recent speed-ups for the Lasso and multi-task Lasso to this context.
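The brute-force scheme just described can be sketched as follows: each column of the precision-matrix surrogate used by de-sparsified-Lasso-type methods is obtained from a nodewise sparse regression of one feature on all the others, hence p Lasso fits in total. This is a simplified sketch under the usual nodewise-Lasso normalization, with an arbitrary regularization level; it is not the efficient solver the project aims to develop.

```python
# Brute-force nodewise regression: one sparse fit per column of the precision-matrix
# surrogate, i.e. p Lasso fits in total; this is the costly step discussed above.
import numpy as np
from sklearn.linear_model import Lasso

def nodewise_lasso(X, alpha=0.1):
    n, p = X.shape
    Gamma = np.zeros((p, p))              # off-diagonal regression coefficients
    tau2 = np.zeros(p)                    # per-column normalization factors
    for j in range(p):                    # one sparse regression per feature
        others = np.delete(np.arange(p), j)
        gamma_j = Lasso(alpha=alpha).fit(X[:, others], X[:, j]).coef_
        residual = X[:, j] - X[:, others] @ gamma_j
        Gamma[others, j] = gamma_j
        tau2[j] = residual @ residual / n + alpha * np.abs(gamma_j).sum()
    # Column j of the surrogate precision matrix: (e_j - Gamma[:, j]) / tau2[j].
    return (np.eye(p) - Gamma) / tau2
```

With p of the order of 10^6, such a loop is clearly prohibitive, which motivates sparse precision-matrix estimation and the adaptation of recent Lasso speed-ups.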

On a broader scale, an effective strategy is to combine sparse randomized estimators with (hierarchical) structures enforced on the features, typically by means of clustering algorithms. Indeed, recent contributions rely on hierarchical clustering and propose sequential tests to characterize the involvement of variables at different resolutions. From a practical point of view, efficient hierarchical clustering algorithms will be required to evaluate such approaches under the data constraints faced in neuro-imaging.
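To illustrate how randomization and clustering combine, the following simplified sketch averages feature-level coefficient maps obtained from several perturbed clusterings (the clustering is randomized here by subsampling the rows it is learned on). The published ensemble-of-clustered-desparsified-Lasso approach aggregates de-sparsified Lasso inferences rather than raw coefficients; this sketch, with arbitrary parameter values, only shows the randomization and ensembling mechanism.

```python
# Simplified sketch: ensemble sparse estimates obtained over randomized feature clusterings.
import numpy as np
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import Lasso

def clustered_ensemble(X, y, n_clusters=100, n_draws=25, alpha=0.1, seed=0):
    rng = np.random.RandomState(seed)
    n, p = X.shape
    maps = np.zeros((n_draws, p))
    for b in range(n_draws):
        # Randomize the clustering by learning it on a random subset of samples.
        rows = rng.choice(n, size=int(0.7 * n), replace=False)
        agglo = FeatureAgglomeration(n_clusters=n_clusters).fit(X[rows])
        w_clusters = Lasso(alpha=alpha).fit(agglo.transform(X), y).coef_
        maps[b] = agglo.inverse_transform(w_clusters)   # back to feature space
    return maps.mean(axis=0)                            # averaged feature-level map
```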

Perspectives:
We will thus develop methods that combine the above intuitions with algorithmic optimizations for tractability, and study the statistical guarantees that these approaches bring. Intensive validation experiments will be carried out, in particular in the field of brain imaging. An implementation of the inference tools will be produced and released as an open-source library compatible with existing machine-learning libraries.

HiDimStat is a potentially high-impact project, as it addresses deep and hard issues in the statistical analysis of high-dimensional data, a topic of central interest for the biological, economic and social sciences. While we will clearly not cover all facets of the problem, the project aims at designing high-quality software that provides reliable and efficient practical solutions to practitioners. The field is evolving fast, yet DigiCosme's strength in algorithmic and software development for data science is already substantial and will benefit from HiDimStat.

Some aspects of the project will also provide an opportunity to collaborate with the mathematics and life-science departments, which we think is important for the long-term perspective within DigiCosme and the STIC department.