
Journée de la statistique ENSAE-ENSAI-INSEE
Venue: Salle Closon-Malinvaud (RC-B-224/228), INSEE
88, avenue Verdier
92120 Montrouge
Program
Wednesday 21/05
1:30 pm: arrival
2:00 pm: welcome address by Corinne Prost
2:15 pm – 3:45 pm: Olivier Meslin (INSEE), Vincent Divol (ENSAE), Mohammadreza Mousavi-Kalan (ENSAI)
3:45 pm – 4:45 pm: Poster session
4:45 pm – 6:15 pm: Ludovic Stephan (ENSAI), Khaled Larbi (INSEE), Cristina Butucea (ENSAE)
7:00 pm: dinner
Thursday 22/05
9:15 am: arrival
9:30 am – 11:00 am: Arnak Dalalyan (ENSAE), Vincent Loonis (INSEE), Sébastien Herbreteau (ENSAI)
11:00 am – 11:20 am: coffee break
11:20 am – 12:20 pm: Anna Simoni (ENSAE), Guillaume Maillard (ENSAI)
12:20 pm – 1:30 pm: lunch
1:30 pm – 3:00 pm: Daniel Bonnery (INSEE), Marie-Pierre Etienne (ENSAI), Nicolas Chopin (ENSAE)
3:00 pm – 3:20 pm: coffee break
3:20 pm – 4:50 pm: Guillaume Chauvet (ENSAI), Félix Pasquier (ENSAE), Simon Quantin (INSEE)
Talk titles and abstracts
Daniel Bonnery (INSEE)
Survey sampling and differential privacy
Differential privacy at level epsilon is a mathematical criterion that controls the discrepancy between the distributions of a noise-perturbed estimator for the different possible values of the characteristics of an observation unit, while (epsilon, delta)-differential privacy allows a small probability of failure. Both criteria are widely adopted in industry as standards for measuring data protection. We present results on the protection offered by survey sampling itself, in terms of epsilon- and (epsilon, delta)-differential privacy. More precisely, we derive the relationship between the level of protection offered by random selection under an arbitrary sampling design, its first- and second-order inclusion probabilities, and the dispersion of the variable of interest, when the published statistic is the Horvitz-Thompson estimate of a total. We provide a tool for computing the variance of the additional privacy-protection mechanism to apply, if necessary, in order to reach a desired level of protection.
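As an illustrative sketch (not the authors' result), the interplay between Horvitz-Thompson estimation and an additional noise mechanism can be mimicked as follows; the population values, inclusion probabilities, and sensitivity bound below are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: y-values and first-order inclusion probabilities.
N = 1000
y = rng.uniform(0.0, 10.0, size=N)
pi = np.full(N, 0.1)  # equal inclusion probabilities

# Draw a Poisson sample: each unit included independently with probability pi_i.
sampled = rng.random(N) < pi

# Horvitz-Thompson estimate of the population total: sum of y_i / pi_i over the sample.
ht_total = np.sum(y[sampled] / pi[sampled])

# Additional Laplace mechanism for epsilon-differential privacy; the sensitivity
# of the HT total to one unit is crudely bounded by max_i y_i / pi_i here
# (an illustrative bound, not the paper's sharp analysis).
epsilon = 1.0
sensitivity = np.max(y / pi)
noisy_total = ht_total + rng.laplace(scale=sensitivity / epsilon)
```

The random selection already perturbs the estimator; the Laplace noise is the "supplementary mechanism" whose scale the abstract proposes to calibrate.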
Cristina Butucea (ENSAE)
Gentle measurements of quantum states
Gentle measurements of quantum states result in both a random variable and a non-collapsed post-measurement state which is at a prescribed trace-distance from the initial state. Unlike collapsed states, this can be further used in quantum computing. We introduce here locally gentle measurements and prove a quantum data processing inequality for such measurements. We introduce a physically feasible gentle measurement called quantum Label Switch and show optimal rates for learning and testing of qubits.
Guillaume Chauvet (ENSAI)
Inference from spatial samples in Forest Inventories
Surveying natural populations presents challenges due to their dispersed distribution across a territory. National forest inventories (NFIs) are based on probabilistic sampling designs. A sample of points is randomly selected in the continuous territory under study, and fixed-shape supports (e.g., plots or polygons) are built from these points to select and survey the trees; see for example Vidal et al. (2016) for a worldwide overview of forest inventories. To produce spatially balanced samples, i.e., well spread over the territory, these surveys generally perform a tessellation of the territory based on a spatial grid, and either use the nodes of a grid to constitute the sample, or draw points within the cells of the grid. Sampling the cells creates an additional stage, and is currently used in the French NFI for producing annual estimates.
Although the sampling design may be formalized in several manners (e.g., Eriksson, 1995), the infinite population approach (Stevens and Urquhart, 2000; Barabesi, 2003; Mandallaz, 2007; Gregoire and Valentine, 2007) is arguably the simplest device for inference. Inference may be performed directly from the sampled population, which is straightforward using the theory of continuous Horvitz-Thompson (HT) estimation (Cordy, 1993), both in terms of point estimation and variance estimation.
In this talk, I present sampling and estimation methods used for the French NFI, and discuss past and current research. I explain how an extended version of the weight-share method (Deville and Lavallée, 2006) makes it possible to link the sampled population and the surveyed population (Chauvet, Bouriaud and Brion, 2023; Bouriaud, Brion, Chauvet, Duong and Pulkkinen, 2024). I also present an inference framework proposed for the French NFI (Duong, Bouriaud and Chauvet, 2024). I will discuss work in progress on the behavior of the HT estimator (weak convergence and asymptotic normality), either when the sample is selected directly from the territory, or by means of a multi-stage design as in the French NFI.
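A minimal numerical sketch of continuous Horvitz-Thompson estimation in the sense of Cordy (1993), on a hypothetical unit-square "territory" with a made-up intensity surface standing in for the forest variable of interest:

```python
import numpy as np

rng = np.random.default_rng(7)

# Variable of interest as a surface over the unit square; the target is the
# territory total, i.e. the integral of y over the square: 1 + 1/2 + 2/3.
def y(s):
    return 1.0 + s[:, 0] + 2.0 * s[:, 1] ** 2

# Uniform point sampling: design density f(s) = 1 on the unit square,
# so the inclusion "intensity" of the continuous HT estimator is n * f(s).
n = 100_000
s = rng.uniform(0.0, 1.0, size=(n, 2))
intensity = n * 1.0

# Continuous HT estimate of the total: sum of y(s_i) / (n f(s_i)).
ht = np.sum(y(s) / intensity)
```

With a uniform design this reduces to a Monte Carlo average, but the same formula handles non-uniform point densities by dividing by the actual f(s_i).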
Nicolas Chopin (ENSAE)
Saddlepoint Monte Carlo, and its application to vote carryover in French elections
Assuming X is a random vector and A a non-invertible matrix, one sometimes needs to perform inference while only having access to samples of Y = AX. The corresponding likelihood is typically intractable. One may still be able to perform exact Bayesian inference using a pseudo-marginal sampler, but this requires an unbiased estimator of the intractable likelihood.
We propose saddlepoint Monte Carlo, a method for obtaining an unbiased estimate of the density of Y with very low variance, for any model belonging to an exponential family. Our method relies on importance sampling of the characteristic function, with insights brought by the standard saddlepoint approximation scheme with exponential tilting. We show that saddlepoint Monte Carlo makes it possible to perform exact inference on particularly challenging problems and datasets. We focus on the ecological inference problem, where one observes only aggregates at a fine level. We present in particular a study of the carryover of votes between the two rounds of various French elections, using the finest available data (number of votes for each candidate in about 60,000 polling stations over most of the French territory).
We show that existing, popular approximate methods for ecological inference can lead to substantial bias, which saddlepoint Monte Carlo is immune from. We also present original results for the 2024 legislative elections on political centre-to-left and left-to-centre conversion rates when the far-right is present in the second round. Finally, we discuss other exciting applications for saddlepoint Monte Carlo, such as dealing with aggregate data in privacy or inverse problems.
Arnak Dalalyan (ENSAE)
Parallelized Midpoint Randomization for Langevin Monte Carlo
We study the problem of sampling from a target probability density function in frameworks where parallel evaluations of the log-density gradient are feasible. Focusing on smooth and strongly log-concave densities, we revisit the parallelized randomized midpoint method and investigate its properties using recently developed techniques for analyzing its sequential version. Through these techniques, we derive upper bounds on the Wasserstein distance between sampling and target densities. These bounds quantify the substantial runtime improvements achieved through parallel processing. Joint work with Yu Lu.
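A sequential sketch of the randomized midpoint discretization for a standard Gaussian target (smooth and strongly log-concave). The talk's parallel version evaluates several gradients concurrently, and the exact scheme couples the Brownian increments of the midpoint and of the full step; here they are drawn independently for brevity, so this is only a simplified illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_log_pi(x):
    # Standard Gaussian target: grad log pi(x) = -x.
    return -x

def rmp_step(x, h):
    """One simplified randomized-midpoint Langevin step of size h."""
    a = rng.random()  # uniform random midpoint location in [0, 1]
    # Predictor: Euler move (plus noise) to the random midpoint a*h.
    x_mid = (x + a * h * grad_log_pi(x)
             + np.sqrt(2.0 * a * h) * rng.standard_normal(x.shape))
    # Corrector: full step using the gradient evaluated at the midpoint.
    return x + h * grad_log_pi(x_mid) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

h, n_iter = 0.1, 5000
x = np.zeros(2)
samples = []
for _ in range(n_iter):
    x = rmp_step(x, h)
    samples.append(x.copy())
samples = np.array(samples)
```

The midpoint gradient evaluation is what reduces the discretization bias relative to the plain unadjusted Langevin algorithm, and it is this extra evaluation that the parallelized version distributes across processors.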
Vincent Divol (ENSAE)
Spectral estimation of Laplace operators
Diffusion-based methods are widely used to construct spectral representations of data for tasks such as dimension reduction (e.g., LLE, Laplacian Eigenmaps, diffusion maps), regression, and spectral clustering. These methods typically rely on the top eigenvectors of a graph Laplacian, derived from a random walk over the dataset. As the number of observations increases, the spectrum of this discrete operator converges to that of a continuous Laplace operator. While much prior work has focused on giving rates of convergence for this problem, we adopt a different perspective: we study the estimation of the spectrum from a minimax standpoint. We provide sharp minimax rates for estimating both eigenvalues and eigenfunctions of the limiting operator. Our approach also yields a general framework for estimating regular real-valued functionals on function spaces, with potential implications for broader statistical estimation problems. (Joint work with Yann Chaubet, Université de Nantes.)
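A toy illustration of the discrete-to-continuous convergence the abstract refers to, with hand-picked assumptions (points on the unit circle, Gaussian kernel, arbitrary bandwidth):

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample n points on the unit circle, where the limiting Laplace operator has a
# known spectrum: eigenvalue 0 (constants), then nearly degenerate pairs
# coming from the cos/sin eigenfunctions.
n = 400
theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
pts = np.column_stack([np.cos(theta), np.sin(theta)])

# Gaussian-kernel affinities and the random-walk graph Laplacian L = I - D^{-1} W.
bandwidth = 0.1
d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
W = np.exp(-d2 / bandwidth)
L = np.eye(n) - W / W.sum(axis=1, keepdims=True)

# Bottom of the spectrum of the discrete operator, sorted increasingly.
eigvals = np.sort(np.linalg.eigvals(L).real)
```

The minimax question studied in the talk is how well such discrete eigenvalues and eigenvectors can estimate those of the limiting operator; the code only shows the estimator being formed, not the rates.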
Marie-Pierre Etienne (ENSAI)
Improving spatio-temporal species distribution monitoring by leveraging multiple heterogeneous data sources
In the context of sustainable fisheries management, it is crucial to accurately characterize the spatio-temporal dynamics of exploited species. Monitoring surveys conducted by IFREMER, for instance, follow standardized protocols but provide data with limited spatial and temporal resolution. We propose a latent variable spatio-temporal model that enables the integration of information from additional data sources, which may suffer from preferential sampling or be available only in aggregated form. Inference is performed using numerical approximation techniques and automatic differentiation.
Khaled Larbi (INSEE)
Nonresponse treatment and sample combination in a mixed-mode survey design
With the introduction of the Internet as a new collection mode and the growing difficulty of contacting households, more and more official-statistics surveys are moving to mixed-mode collection protocols. Moreover, the existence of a low-cost collection mode – here the Internet – makes it possible to draw, in addition to the main mixed-mode sample, a larger sample collected through this inexpensive mode only. This larger sample could thus allow more precise analyses over finer domains (regional breakdowns, for example). However, the response mechanism in this single-mode sample is likely to differ from the one affecting the mixed-mode sample: in particular, because response rates are (markedly) lower under the Internet-only protocol than under the mixed-mode approach, the risk of nonignorable nonresponse is (markedly) higher in the single-mode sample. We present several strategies for combining the samples and correcting for nonresponse, and compare their efficiency through simulations. The methods are compared in several scenarios: ignorable nonresponse in both samples; selection on unobservables affecting Internet responses only; and selection on unobservables affecting both the Internet and the overall response.
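A small simulation in the spirit of the comparisons described above (the population and response model are invented for illustration), showing how inverse-probability reweighting removes the bias of the naive respondent mean under ignorable nonresponse:

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated population: y correlated with an auxiliary variable x.
N = 50_000
x = rng.standard_normal(N)
y = 10.0 + 2.0 * x + rng.standard_normal(N)

# Ignorable nonresponse: response probability depends on the observed x only.
p_resp = 1.0 / (1.0 + np.exp(-(0.5 + x)))  # higher-x units respond more often
resp = rng.random(N) < p_resp

# The naive respondent mean is biased upward; inverse-probability weighting
# (here with the true propensities, which would be estimated in practice)
# corrects the selection.
naive = y[resp].mean()
ipw = np.sum(y[resp] / p_resp[resp]) / np.sum(1.0 / p_resp[resp])
```

Under the nonignorable scenarios studied in the talk, the response probability would also depend on unobservables, and this simple reweighting would no longer suffice.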
Sébastien Herbreteau (ENSAI)
A unified framework of nonlocal parametric methods for image denoising
We propose a unified view of nonlocal methods for single-image denoising, of which BM3D is the most popular representative, that operate by gathering noisy patches together according to their similarities in order to process them collaboratively. Our general estimation framework is based on the minimization of the quadratic risk, which is approximated in two steps, and adapts to photon and electronic noise. Relying on an unbiased risk estimate (URE) for the first step and on “internal adaptation,” a concept borrowed from deep learning theory, for the second, we show that our approach enables one to reinterpret and reconcile previous state-of-the-art nonlocal methods. Within this framework, we propose a novel denoiser called NL-Ridge that exploits linear combinations of patches. While conceptually simpler, we show that NL-Ridge can outperform well-established state-of-the-art single-image denoisers.
Vincent Loonis (INSEE)
Potential contribution of determinantal point processes to INSEE survey sample selection procedures
Following up on the previous presentation at the ENSAE-ENSAI-INSEE statistics day, this talk discusses the potential contribution of determinantal point processes to the sample selection procedures used for INSEE surveys. We focus in particular on a concrete case that arose in the management of a network of face-to-face interviewers, involving joint inclusion probabilities of order 2 and higher, for which determinantal point processes are particularly well suited. We then consider the future opportunities opened by recent academic work on determinantal point processes with non-symmetric kernels.
Guillaume Maillard (ENSAI)
Concentration properties of bootstrapped empirical process suprema and an application to two-sample hypothesis testing
Much recent work has been devoted to validating the bootstrap heuristic for empirical process suprema on a finite index set through distributional approximation. In another direction, inequalities comparing the bootstrapped and original supremum in expectation were developed by Han and Wellner (2018). In contrast, the concentration properties of these variables have not been investigated, to the best of our knowledge. In this talk, we start filling this gap with new concentration inequalities for bootstrapped empirical process suprema. Like Bousquet’s inequality and unlike results obtained through distributional approximation, they are completely dimension-free. As an application, we provide new theoretical guarantees on the power and separation rate of two-sample MMD tests in a non-asymptotic setting.
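A minimal sketch of a two-sample MMD test being calibrated by resampling (here a permutation scheme, a close cousin of the bootstrap discussed in the talk; the Gaussian kernel and its bandwidth are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    n, m = len(x), len(y)
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    return ((kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            + (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            - 2.0 * kxy.mean())

def permutation_test(x, y, n_perm=200):
    """Calibrate the MMD statistic by randomly re-splitting the pooled sample."""
    z = np.vstack([x, y])
    stat = mmd2_unbiased(x, y)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(z))
        null.append(mmd2_unbiased(z[idx[:len(x)]], z[idx[len(x):]]))
    return stat, np.mean(np.array(null) >= stat)  # statistic and p-value

x = rng.standard_normal((100, 2))
y = rng.standard_normal((100, 2)) + np.array([1.0, 0.0])  # shifted mean
stat, pval = permutation_test(x, y)
```

The non-asymptotic guarantees in the talk concern exactly the concentration of such resampled suprema, which is what makes the resulting test thresholds trustworthy at finite sample sizes.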
Olivier Meslin (INSEE)
Machine Learning for Housing Wealth Estimation: Methodological Challenges and Potential Solutions
This project aims to predict the market value of all privately held dwellings in France for each year over the 2010-2024 period. A machine learning model is trained on a large sample of observed real estate transactions, and is then used to predict market values for all dwellings. The project addresses three distinct methodological challenges. First, how to account accurately for the spatial and temporal structure of housing prices? Second, how to correct for selection in the training data (some dwellings are more likely to be sold than others)? Third, how to measure the uncertainty of the predictions? The first challenge is addressed by introducing time and geographical coordinates into a gradient boosting algorithm, along with other dwelling features. Sample selection is addressed by reweighting the training data, based on an inverse-probability approach. Finally, split conformal prediction is applied to complement the model with valid prediction intervals.
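The split conformal step can be sketched as follows; the data are entirely synthetic and a linear fit stands in for the gradient-boosting model, so only the calibration logic mirrors the project's approach:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic "hedonic" regression: price = f(feature) + noise (illustrative only).
n = 2000
x = rng.uniform(-2.0, 2.0, size=(n, 1))
y = 2.0 * x[:, 0] + rng.standard_normal(n)

# Split the data: one half fits the model, the other calibrates the intervals.
fit_idx, cal_idx = np.arange(0, n // 2), np.arange(n // 2, n)

# Deliberately simple model: least squares (a stand-in for gradient boosting).
beta = np.polyfit(x[fit_idx, 0], y[fit_idx], deg=1)
predict = lambda xx: np.polyval(beta, xx)

# Calibration: a quantile of the absolute residuals gives the half-width that
# guarantees marginal coverage of at least 1 - alpha on exchangeable data.
alpha = 0.1
scores = np.abs(y[cal_idx] - predict(x[cal_idx, 0]))
q = np.quantile(scores, np.ceil((1 - alpha) * (len(cal_idx) + 1)) / len(cal_idx))

# Prediction interval for a new dwelling's features.
x_new = np.array([0.5])
interval = (predict(x_new) - q, predict(x_new) + q)
```

The appeal of split conformal here is that the coverage guarantee holds regardless of how well the underlying model is specified, which matters when the model is a black-box boosting algorithm.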
Mohammadreza Mousavi-Kalan (ENSAI)
Transfer Learning in Outlier Detection
A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to similar but imperfect outlier data, from which they might transfer information to the target outlier detection task. While transfer learning has been extensively considered in traditional classification, the problem of transfer in outlier detection, and more generally in imbalanced classification settings, has received less attention. In this work, we adopt the traditional framework of Neyman-Pearson classification, which formalizes supervised outlier detection, and determine the information-theoretic limits of the problem under a measure of discrepancy. We also propose a transfer learning algorithm for outlier detection that is practical and data-adaptive: it leverages information from the source data when it is relevant and disregards it when it is not, without requiring prior knowledge of the relatedness. As a result, the algorithm is resilient to negative transfer.
Félix Pasquier (ENSAE)
Difference-in-Differences for Continuous Treatments and Instruments with Stayers
We propose difference-in-differences (DID) estimators in designs where the treatment is continuously distributed in every period, as is often the case when one studies the effects of taxes, tariffs, or prices. We assume that between consecutive periods, the treatment of some units, the switchers, changes, while the treatment of other units, the stayers, remains constant. We show that under a parallel-trends assumption, weighted averages of the slopes of switchers’ potential outcomes are nonparametrically identified by difference-in-differences estimands comparing the outcome evolutions of switchers and stayers with the same baseline treatment. Controlling for the baseline treatment ensures that our estimands remain valid if the treatment’s effect changes over time. We highlight two possible ways of weighting switchers’ slopes, and discuss their respective advantages. For each weighted average of slopes, we propose a doubly robust, nonparametric, root-n consistent, and asymptotically normal estimator. We generalize our results to the instrumental-variable case. Finally, we apply our method to estimate the price elasticity of gasoline consumption. Joint work with Clément de Chaisemartin, Xavier D’Haultfoeuille, Doulo Sow and Gonzalo Vasquez-Bare.
Simon Quantin (INSEE)
Capturing the effects of mixed-mode data collection with an econometric approach
The idea is (i) to treat nonresponse as a missing-data problem (incidental truncation – see Wooldridge, Econometric Analysis of Cross Section and Panel Data, chapter 19, section 19.6) and (ii) to treat the measurement effect associated with the collection mode as the estimation of the parameter of an endogenous (binary) explanatory variable. The goal is to use the estimated model parameters to obtain (i) an estimate of the response probability, so as to correct for nonresponse by reweighting while accounting for endogenous selection (Missing Not At Random data), and (ii) a causal identification of the measurement effect associated with an alternative collection mode. Missing survey responses can also be imputed. While both effects may coexist, we also consider the case where the absence of a measurement effect can (reasonably) be assumed; in that case, only an estimation of the Heckman model is needed. To the best of our knowledge, for a continuous outcome variable (such as dwelling surface area), the extension of the Heckman model to include an endogenous explanatory variable is well established (cf. section 19.6.2, ibid.). For a binary outcome, estimation of the Heckman model without an endogenous variable is standard (section 19.6.3). Simulations have yielded an algorithmically consistent estimate of the measurement effect, i.e., of the model parameter associated with the endogenous explanatory variable. Survey questions, however, are often of a different nature: some correspond to count variables, either unbounded (number of moves) or bounded (number of positive answers to a fixed series of questions), and multinomial or ordinal responses are also common.
We would welcome exchanges on the estimation of Heckman models for these various variable types and/or on incorporating an endogenous explanatory variable into these models.
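A self-contained simulation in the spirit of the Heckman-model discussion above, for a continuous outcome under nonignorable selection. All coefficients are invented, and for brevity the true selection index replaces the first-stage probit that would be estimated in practice:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal cdf
phi_pdf = lambda t: np.exp(-t ** 2 / 2.0) / np.sqrt(2.0 * np.pi)  # its density

# Outcome y = 1 + 2x + eps, observed only when 0.5 + z + u > 0, with
# corr(u, eps) > 0: the selection is nonignorable (MNAR).
n = 20_000
x, z, u = rng.standard_normal((3, n))
eps = 0.8 * u + 0.6 * rng.standard_normal(n)
y = 1.0 + 2.0 * x + eps
sel = 0.5 + z + u > 0

# E[eps | selected, z] = 0.8 * lambda(0.5 + z), where lambda is the inverse
# Mills ratio; adding it as a regressor (Heckman's control-function step)
# restores consistency of least squares on the selected subsample.
lam = phi_pdf(0.5 + z) / Phi(0.5 + z)

ones = np.ones(sel.sum())
b_naive, *_ = np.linalg.lstsq(np.column_stack([ones, x[sel]]), y[sel], rcond=None)
b_heck, *_ = np.linalg.lstsq(np.column_stack([ones, x[sel], lam[sel]]), y[sel], rcond=None)
```

Here x is independent of the selection, so the naive bias concentrates in the intercept; the coefficient on the Mills ratio recovers the covariance between the selection and outcome errors.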
Anna Simoni (ENSAE)
Panel data models with randomly generated groups: Bayesian inference and density forecasts
We consider a dynamic panel data model that accounts for a latent group structure across individuals which is constant over time. Unlike the previous literature, we adopt a structural model that assumes the individual effects are generated from a finite mixture with an unknown number of components and unknown parameters for each component. We first establish identification of this model. Then, we specify a prior for the number of components and the parameters of the mixture, as well as for the coefficients of the dynamic and exogenous covariates. This extends the mixture of finite mixtures model to panel data settings. We establish asymptotic frequentist properties of the posterior for the parameters of interest as well as for the number of components. A Monte Carlo exercise illustrates finite-sample properties.
Ludovic Stéphan (ENSAI)
Tensor completion with fewer and fewer samples
Tensor completion refers to the task of inferring a low-rank tensor from as few samples as possible. When the samples are randomly drawn, this problem is known to require at least n^{k/2} samples (for an order-k tensor) to solve efficiently, but current algorithms miss this bound by at least polylog factors. I will present a novel algorithm that succeeds at recovering at least some information about the tensor without needing additional samples. Building on this insight, I will also present how this n^{k/2} bound can be reduced to n samples if we are allowed to choose which entries are seen, using a so-called wedge sampling scheme. Based on joint works with Yizhe Zhu, Hengrui Luo and Anna Ma.