2002 seminars

Europe/Lisbon
Online

Dennis Prangle, University of Bristol, England

Distilling importance sampling for likelihood-free inference

Likelihood-free inference involves inferring parameter values given observed data and a simulator model. The simulator is computer code taking the parameters, performing stochastic calculations, and outputting simulated data. In this work, we view the simulator as a function whose inputs are (1) the parameters and (2) a vector of pseudo-random draws, and attempt to infer all these inputs. This is challenging as the resulting posterior can be high dimensional and involve strong dependence.

We approximate the posterior using normalizing flows, a flexible parametric family of densities. Training data are generated by ABC importance sampling with a large bandwidth parameter. These data are "distilled" by using them to train the normalizing flow parameters. The process is iterated, using the updated flow as the importance sampling proposal and slowly reducing the ABC bandwidth, until the flow provides a good approximation to the posterior. Unlike most other likelihood-free methods, we avoid the need to reduce the data to low-dimensional summary statistics, and hence can achieve more accurate results.
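To make the loop concrete, here is a minimal Python sketch of the distill-and-reweight idea on a toy one-dimensional problem; it is an illustration rather than the authors' method, and in particular a fitted Gaussian stands in for the normalizing flow, while the prior, ABC kernel and bandwidth schedule are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, u):
    # toy simulator: parameter plus one pseudo-random draw
    return theta + u

y_obs = 1.5                                # observed data
mu, sd = 0.0, 3.0                          # Gaussian proposal standing in for the flow
eps_schedule = [3.0, 1.5, 0.8, 0.4, 0.2]   # slowly shrinking ABC bandwidth
n = 5000

for eps in eps_schedule:
    theta = rng.normal(mu, sd, n)          # proposed parameter values
    u = rng.normal(size=n)                 # pseudo-random simulator inputs, drawn from their prior
    dist = np.abs(simulator(theta, u) - y_obs)
    log_prior = -0.5 * (theta / 5.0) ** 2          # N(0, 5^2) prior on theta
    log_kernel = -0.5 * (dist / eps) ** 2          # Gaussian ABC kernel
    log_prop = -0.5 * ((theta - mu) / sd) ** 2
    w = np.exp(log_prior + log_kernel - log_prop)  # ABC importance weights
    w /= w.sum()
    # "distill" the weighted sample into the proposal for the next round
    mu = float(np.sum(w * theta))
    sd = float(np.sqrt(np.sum(w * (theta - mu) ** 2)))

print(mu, sd)  # approaches the exact posterior mean (about 1.44) and sd (about 1)
```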

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Sebastian Engelke, Research Center for Statistics, University of Geneva

Machine learning beyond the data range: extreme quantile regression

Machine learning methods perform well in prediction tasks within the range of the training data. When interest is in quantiles of the response that go beyond the observed records, these methods typically break down. Extreme value theory provides the mathematical foundation for the estimation of such extreme quantiles. A common approach is to approximate the exceedances over a high threshold by the generalized Pareto distribution. For conditional extreme quantiles, one may model the parameters of this distribution as functions of the predictors. So far, existing methods are either not flexible enough or do not generalize well to higher dimensions. We develop new approaches for extreme quantile regression that estimate the parameters of the generalized Pareto distribution with tree-based methods and recurrent neural networks. Our estimators outperform classical machine learning methods and methods from extreme value theory in simulation studies. We illustrate how the recurrent neural network model can be used for effective forecasting of flood risk.
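For context, the extrapolation device referred to above is standard extreme value theory rather than a result specific to this talk: exceedances of the response $Y$ over a high threshold $u(x)$ are approximated by a generalized Pareto distribution whose scale $\sigma(x)$ and shape $\xi(x)$ may depend on the predictors, and an extreme conditional quantile is then obtained by extrapolating beyond the threshold level $\tau_0$:

$$P\bigl(Y - u(x) > z \mid Y > u(x),\, X = x\bigr) \approx \Bigl(1 + \xi(x)\,\tfrac{z}{\sigma(x)}\Bigr)^{-1/\xi(x)}, \qquad Q_x(\tau) \approx u(x) + \frac{\sigma(x)}{\xi(x)}\Bigl[\Bigl(\frac{1-\tau_0}{1-\tau}\Bigr)^{\xi(x)} - 1\Bigr], \quad \tau > \tau_0.$$

The tree-based and recurrent neural network estimators mentioned in the abstract model $\sigma(x)$ and $\xi(x)$ as functions of the predictors.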

Additional file: Slides

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

João Paulo Martins, Escola Superior de Saúde, P.Porto and CEAUL

Meta-analysis in Health

A systematic review is a literature review that follows a protocol for searching and selecting studies on a given topic of interest. When the review covers quantitative data, an analysis of the results that goes beyond a mere summary becomes possible: meta-analysis. This branch of Statistics became popular after the publication of several books on the subject in the 1980s.

In this seminar, we will discuss some of the aspects to take into account in a systematic review, the effect measures most commonly used in the health sciences, and the most widely used meta-analysis models, namely fixed-effects and random-effects models. Subgroup analysis as a tool to explain between-study heterogeneity will also be explored.
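For concreteness, the two model classes mentioned above can be written in their textbook form (generic notation, not specific to the talk): with $\hat\theta_i$ the observed effect of study $i$ and $v_i$ its within-study variance,

$$\text{fixed effect: } \hat\theta_i = \theta + \varepsilon_i, \qquad \text{random effects: } \hat\theta_i = \theta + u_i + \varepsilon_i, \quad u_i \sim N(0, \tau^2), \quad \varepsilon_i \sim N(0, v_i),$$

where the between-study variance $\tau^2$ quantifies the heterogeneity that subgroup analysis attempts to explain.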

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Cláudia Neves, King’s College London, UK

Extreme value statistics born out of domains of attraction

Extreme value statistics is essentially concerned with the modelling of rare events which are hard to predict and occur with little warning. In this talk, I will address a number of challenges highlighted in the literature and how these align with the domain of attraction characterisation for extremes. Such a characterisation stems from a suite of mildly restrictive conditions, qualitative in nature, which not only provide computational convenience but also furnish sharp approximations to asymptotically justified models for extreme values, a key aspect for statistical testing procedures as well as for interval estimation methodology in a nonparametric setting.
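For reference, the domain-of-attraction characterisation alluded to here is the classical extreme value condition: a distribution $F$ belongs to the max-domain of attraction of the generalized extreme value distribution $G_\gamma$ if there exist sequences $a_n > 0$ and $b_n$ such that

$$\lim_{n\to\infty} F^n(a_n x + b_n) = G_\gamma(x) = \exp\bigl\{-(1+\gamma x)^{-1/\gamma}\bigr\}, \qquad 1 + \gamma x > 0,$$

with the extreme value index $\gamma$ governing the tail heaviness.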

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Rafael Medeiros Cabral, KAUST, Saudi Arabia

Latent non-Gaussian models and efficient estimation using variational Bayes

Latent Gaussian models (LGMs) are perhaps the most commonly used class of models in statistical applications. Nevertheless, in areas ranging from longitudinal studies in biostatistics to geostatistics, it is easy to find datasets that contain inherently non-Gaussian features, such as sudden jumps or spikes, that adversely affect the inferences and predictions made from an LGM. These datasets require more general latent non-Gaussian models (LnGMs) that can handle these non-Gaussian features automatically. However, fast implementations and easy-to-use software are lacking, which prevents LnGMs from becoming widely used. In this seminar, I will present the generic class of LnGMs and variational Bayes algorithms for fast and scalable inference of LnGMs. The methods can be applied to a wide range of models, such as autoregressive processes for time series, simultaneous autoregressive models for areal data, and spatial Matérn models. To facilitate Bayesian inference, we have built the ngvb package, where LGMs implemented in R-INLA can be easily extended to LnGMs by adding a single line of code.
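Schematically, and only as my reading of the linked paper (notation illustrative), the extension replaces the Gaussian driving noise of an LGM by a normal variance mixture: writing the latent field as $D x = \Lambda$ for a structure matrix $D$,

$$\Lambda_i \mid V_i \sim N(0, \sigma^2 V_i), \qquad V_i \stackrel{iid}{\sim} \pi(V),$$

so that heavy-tailed mixing distributions $\pi$ produce the sudden jumps and spikes mentioned above, $V_i \equiv 1$ recovers the Gaussian model, and $x$ stays conditionally Gaussian given $V$, which is what the variational Bayes algorithms exploit.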

Paper: https://arxiv.org/abs/2211.11050

Package: https://github.com/rafaelcabral96/ngvb

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Jorge Tendeiro, Hiroshima University, Japan

Perspectives on the Bayes Factor

In this talk I will discuss the Bayes factor: What it is, why (or why not) it should be used, and how to use it. My emphasis will be more on conceptual understanding and less on technicalities, as much as possible. My talk will include both theoretical and practical features, hopefully catering for an informed use of the Bayes factor.
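For reference, the standard definition: the Bayes factor comparing hypotheses $H_1$ and $H_0$ is the ratio of their marginal likelihoods, i.e. the factor by which the data turn prior odds into posterior odds,

$$BF_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{\int p(D \mid \theta_1, H_1)\,\pi(\theta_1 \mid H_1)\,d\theta_1}{\int p(D \mid \theta_0, H_0)\,\pi(\theta_0 \mid H_0)\,d\theta_0}, \qquad \frac{P(H_1 \mid D)}{P(H_0 \mid D)} = BF_{10}\,\frac{P(H_1)}{P(H_0)}.$$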

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room P3.10, Mathematics Building — Online

Paula Moraga, KAUST, Saudi Arabia

Bayesian spatial modeling of misaligned data using INLA and SPDE

Spatially misaligned data are becoming increasingly common due to advances in data collection and management. We present a Bayesian geostatistical model for the combination of data obtained at different spatial resolutions. The model assumes that, underlying all observations, there is a spatially continuous variable that can be modeled using a Gaussian random field process. The model is fitted using the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approaches. In order to allow the combination of spatially misaligned data, a new SPDE projection matrix for mapping the Gaussian Markov random field from the observations to the triangulation nodes is proposed. We show the performance of the new approach by means of simulation and an application to PM2.5 prediction in the USA. The approach presented provides a useful tool in a wide range of situations where information at different spatial scales needs to be combined.
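In schematic form (a sketch of the usual INLA-SPDE construction; the new ingredient in the talk is the projection matrix for spatially aggregated observations),

$$y_i = A_i u + \varepsilon_i, \qquad u \sim N\bigl(0, Q(\theta)^{-1}\bigr),$$

where $u$ is the Gaussian Markov random field at the triangulation nodes and the row $A_i$ either evaluates the field at a point location or, for areal data, averages the basis functions over the corresponding region, so that point-level and area-level observations can be combined in one model.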

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room P3.10, Mathematics Building — Online

Peter Rousseeuw, KU Leuven

New graphical displays for classification

Classification is a major tool of statistics and machine learning. Several classifiers have interesting visualizations of their inner workings. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight into the data. One of these data features is how far each case lies from its given class, yielding so-called class maps. The proposed displays are constructed for discriminant analysis, k-nearest neighbors, support vector machines, CART, random forests, and neural networks. The graphical displays are illustrated and interpreted on data sets containing images, mixed features, and texts.
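As a pointer to the key quantity (paraphrasing the definitions from the associated papers, possibly in simplified form): for a case $i$ with given class $g_i$, let $\hat p_{g_i}(i)$ be the estimated probability of the given class and $\hat p_{b(i)}(i)$ that of the best alternative class; then

$$\mathrm{PAC}(i) = \frac{\hat p_{b(i)}(i)}{\hat p_{g_i}(i) + \hat p_{b(i)}(i)}, \qquad s(i) = 1 - 2\,\mathrm{PAC}(i) \in [-1, 1],$$

and $s(i)$ is the silhouette width shown in the silhouette plot.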

Joint work with: Jakob Raymaekers, Mia Hubert

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Katiane S. Conceição, Universidade de São Paulo, Brazil

Regression Model for Zero-Modified Count Data

In this work, we present a family of distributions for count data, named Zero-Modified Power Series (ZMPS), an extension of the Power Series family of distributions whose support starts at zero. This extension consists of modifying the probability of observing zero of each Power Series distribution, allowing the new zero-modified distribution to appropriately accommodate datasets that have any amount of zero observations (for instance, zero-inflated or zero-deflated datasets). Power Series distributions included in the Zero-Modified Power Series family are the Poisson, Generalized Poisson, Geometric, Binomial, Negative Binomial, and Generalized Negative Binomial. In addition, we introduce the Zero-Modified Power Series regression models and propose a Bayesian approach. Two real datasets are analyzed: the first corresponds to leptospirosis notifications in cities of Bahia State in Brazil; the second corresponds to the number of goals scored by a team in a sports competition.
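In generic form (a sketch of the zero-modification device, with notation of my own rather than the authors'): if $f$ is a Power Series probability mass function, its zero-modified version reallocates probability between zero and the positive support,

$$P(Y = 0) = \pi_0, \qquad P(Y = y) = \frac{1 - \pi_0}{1 - f(0)}\, f(y), \quad y = 1, 2, \dots,$$

so that $\pi_0 > f(0)$ gives zero inflation, $\pi_0 < f(0)$ zero deflation, and $\pi_0 = f(0)$ recovers the original distribution.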

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room P3.10, Mathematics Building — Online

Rui Pires da Silva Castro, Eindhoven University of Technology, The Netherlands

Detecting a (late) changepoint in the preferential attachment model

Motivated by the problem of detecting a change in the evolution of a network, we consider the preferential attachment random graph model with a time-dependent attachment function. We frame this as a hypothesis testing problem where the null hypothesis is a preferential attachment model with $n$ vertices and a constant affine attachment with parameter $\delta_0$, and the alternative hypothesis is a preferential attachment model where the affine attachment parameter changes from $\delta_0$ to $\delta_1$ at an unknown changepoint time $\tau_n$. For our analysis we focus on a scenario where one only sees the final network realization (and not its evolution), and the changepoint occurs “late”, namely $\tau_n = n - cn^\gamma$ with $c \geq 0$ and $\gamma\in(0,1)$. This corresponds to the relevant scenario where we aim to detect the changepoint shortly after it has happened. We present two asymptotically powerful tests that are able to distinguish between the null and alternative hypothesis when $\gamma > 1/2$. The first test requires knowledge of $\delta_0$, while the second test is significantly more involved, and does not require the knowledge of $\delta_0$ while still achieving the same performance guarantees. Furthermore, we determine the asymptotic distribution of the test statistics, which allows us to easily calibrate the tests in practice. Finally, we conjecture that in the setting considered there are no powerful tests when $\gamma < 1/2$. Our theoretical results are complemented with numerical evidence that illustrates the finite sample characteristics of the proposed procedures.
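To fix ideas, the affine attachment rule in question is the standard one (notation illustrative): a vertex arriving at time $t$ attaches to an existing vertex $v$ with probability proportional to its degree shifted by the affine parameter, which under the alternative switches at the changepoint,

$$P(\text{attach to } v) \propto \deg(v) + \delta, \qquad \delta = \begin{cases}\delta_0, & t \le \tau_n,\\ \delta_1, & t > \tau_n.\end{cases}$$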

Joint work with Gianmarco Bet, Kay Bogerd, and Remco van der Hofstad.

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Andreas Mayr, Department for Medical Biometry, Informatics and Epidemiology, University of Bonn, Germany

Statistical Boosting, Advanced Statistical Modeling And Clinical Reality

Biostatisticians nowadays can choose from a huge toolbox of advanced methods and algorithms for prediction purposes. Some of these tools are based on concepts from machine learning; other methods rely on more classical statistical modeling approaches. In clinical settings, doctors are sometimes reluctant to consider risk scores that are constructed by black-box algorithms without clinically meaningful interpretation. Furthermore, even a model that is both accurate and interpretable will often not be used in practice when it is based on variables that are difficult to obtain in clinical routine or when its calculation is too complex.

In this talk, I will give a non-technical introduction to statistical boosting algorithms which can be interpreted as the methodological intersection between machine learning and statistical modeling. Boosting is able to perform variable selection while estimating statistical models from potentially high-dimensional data. It is mainly suitable for exploratory data analysis or prediction purposes. I will give an overview of some current methodological developments (including the development of polygenic scores) and provide an example of the construction of a clinical risk score with surprisingly simple solutions.
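To make the variable selection property concrete, here is a minimal numpy sketch of component-wise L2 boosting, a simple special case of statistical boosting (an illustration of the idea, not code from the talk): in every iteration each covariate is fitted separately to the current residuals and only the best-fitting one receives a small update, so covariates that are never selected keep a zero coefficient.

```python
import numpy as np

def componentwise_l2_boost(X, y, n_steps=200, nu=0.1):
    """Component-wise L2 boosting with simple linear base learners.

    Assumes the columns of X and the response y are (approximately) centered.
    Returns the coefficient vector; never-selected covariates stay at zero.
    """
    n, p = X.shape
    beta = np.zeros(p)
    fit = np.zeros(n)
    for _ in range(n_steps):
        r = y - fit                                   # negative gradient of the squared loss
        coefs = X.T @ r / np.sum(X ** 2, axis=0)      # univariate least-squares fit per covariate
        rss = np.sum((r[:, None] - X * coefs) ** 2, axis=0)
        j = int(np.argmin(rss))                       # best-fitting component in this step
        beta[j] += nu * coefs[j]                      # weak update (learning rate nu)
        fit += nu * coefs[j] * X[:, j]
    return beta

# Toy usage: only the first two of ten covariates matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
print(np.round(componentwise_l2_boost(X, y), 2))
```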

Joint seminar CEMAT and CEAUL


Europe/Lisbon
Room 6.4.30, Faculty of Sciences of the Universidade de Lisboa

Danilo Alvares, University of Cambridge (UK)

A two-stage approach for Bayesian joint models: reducing complexity while maintaining accuracy

Several joint models for longitudinal and survival data have been proposed in recent years. In particular, many authors have preferred to employ the Bayesian approach to model more complex structures, make dynamic predictions, or use model averaging. However, Markov chain Monte Carlo methods are computationally very demanding and may suffer convergence problems, especially for complex models with random effects, which is the case for most joint models. These issues can be overcome by estimating the parameters of each submodel separately, leading to a natural reduction in the complexity of the joint modeling, but often producing biased estimates. Hence, we propose a novel two-stage approach that uses the estimates from the longitudinal submodel to specify an informative prior distribution for the random effects when estimating them within the survival submodel. In addition, as a bias correction mechanism, we incorporate the longitudinal likelihood function in the second stage, with its fixed effects held at the estimates obtained using only the longitudinal submodel. Based on simulation studies and real applications, we empirically compare our proposal with joint specification and standard two-stage approaches considering different types of longitudinal responses (continuous, count, and binary) that share information with a Weibull proportional hazards model. The results show that our estimator is more accurate than its two-stage competitor and as good as jointly estimating all parameters. Moreover, the novel two-stage approach significantly reduces the computational time compared to the joint specification.
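Schematically, and only as a summary of the two stages described above (generic notation of my own): writing $b_i$ for the subject-specific random effects, stage one fits the longitudinal submodel alone, giving fixed-effect estimates $\hat\beta_L$ and an approximate posterior $\tilde p(b_i)$ for each subject; stage two then estimates the survival parameters $\beta_S$ and re-estimates $b_i$ from

$$p(\beta_S, b \mid \text{data}) \;\propto\; \prod_i p(T_i, \delta_i \mid \beta_S, b_i)\; p(y_i \mid \hat\beta_L, b_i)\; \tilde p(b_i)\; \times\; p(\beta_S),$$

where the first factor is the survival likelihood, the second is the longitudinal likelihood included as the bias correction with its fixed effects held at $\hat\beta_L$, and $\tilde p(b_i)$ acts as the informative prior on the random effects.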

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room 6.4.30, Faculty of Sciences of the Universidade de Lisboa — Online

Rosina Savisaar, Mondego Science

What On Earth Is Bayesian Statistics And Why Do I Keep Hearing About It?

Rosina Savisaar is a statistics educator and consultant, and the founder of Mondego Science, a teaching and consultancy company specialized in statistics and data analysis. In this talk, Rosina will share her thoughts on Bayesian statistics.

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Carlo Giovanni Camarda, Institut National d’Études Démographiques, France

Coherent Cause-Specific Mortality Forecasting Via Constrained Penalized Regression Models

In this seminar, Carlo Giovanni Camarda presents work co-authored with Maria Durbán that proposes a clear-cut and fast method to obtain coherent cause-specific mortality trajectories based on Lagrange multipliers. The authors apply the proposed method to fit and forecast the mortality of males in the USA for the five leading causes of death.
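In generic form (a sketch of how Lagrange multipliers enforce coherence in a penalized fit, not necessarily the authors' exact formulation): if $\theta$ collects the cause-specific parameters, $B$ the basis, $P$ a smoothing penalty and $H\theta = c$ the linear coherence constraint linking cause-specific and all-cause mortality, then

$$\min_{\theta} \|y - B\theta\|^2 + \theta^{\top} P \theta \;\; \text{s.t. } H\theta = c
\quad\Longleftrightarrow\quad
\begin{pmatrix} B^{\top}B + P & H^{\top} \\ H & 0 \end{pmatrix}
\begin{pmatrix} \hat\theta \\ \hat\omega \end{pmatrix}
=
\begin{pmatrix} B^{\top}y \\ c \end{pmatrix},$$

with $\hat\omega$ the Lagrange multipliers, so the constrained fit is obtained from a single linear system.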

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Eliana Duarte, Departamento de Matemática, Universidade do Porto

Representation of context-specific graphical causal models with observational and interventional data

Graphical models are multivariate statistical models where conditional independence relations among random variables are represented by the missing edges of a graph whose nodes are the random variables. When the graph used to represent the model is a directed acyclic graph (DAG), these models are also useful to represent causal relations. This causal interpretation makes these models useful in areas such as genomics, psychology, and epidemiology. When the random variables under consideration are all discrete, it is useful, for modelling purposes, to consider a more general form of conditional independence called context-specific independence. Encoding context-specific independence using graphical models is an interesting challenge which has been considered previously by Heckerman (1990), Geiger and Heckerman (1996), Boutilier et al. (1996), Smith and Anderson (2008), and Pensar et al. (2015). The goal of this talk is to present a new way of representing context-specific causal models. We prove that these models generalize several important properties of graphical models and present a way to model interventions in these models. This is joint work with Liam Solus (KTH, Sweden).
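For a minimal illustration of context-specific independence (a generic textbook-style example, not taken from the talk): with binary variables one may have

$$X \perp\!\!\!\perp Y \mid Z = 0 \qquad \text{but} \qquad X \not\perp\!\!\!\perp Y \mid Z = 1,$$

so the independence holds only in the context $Z = 0$, whereas ordinary conditional independence $X \perp\!\!\!\perp Y \mid Z$ would require it in both contexts; this is the finer structure that a DAG alone cannot encode.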

Additional file: Slides

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Andreas Bender, Department of Statistics, LMU Munich

Reduction Techniques For Survival Analysis

Reduction techniques for survival analysis have become popular in recent years. These transform a survival task into a more standard regression task based on a suitable data transformation. In this talk, we will introduce one such technique, the Piece-wise exponential Additive Mixed Model (PAMM). The talk will illustrate how the model can be used for flexible modeling of covariate effects and to accommodate non-proportional hazards. In addition to single-event, right-censored data, the talk will cover left-truncated data, recurrent events, and competing risks. The talk will provide hands-on examples of how to fit and interpret the model using R code, based on an implementation in the R package ‘pammtools’.
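The data transformation underlying the PAMM, in its standard form for right-censored data, cuts the follow-up into intervals $(\kappa_{j-1}, \kappa_j]$ with a piecewise-constant hazard, after which the survival likelihood coincides with that of a Poisson model with an offset:

$$\lambda(t \mid x_i) = \exp\bigl(f_0(t_j) + f(x_i, t_j)\bigr) \;\text{ for } t \in (\kappa_{j-1}, \kappa_j], \qquad \delta_{ij} \sim \text{Poisson}(\mu_{ij}), \quad \log \mu_{ij} = \log t_{ij} + f_0(t_j) + f(x_i, t_j),$$

where $\delta_{ij}$ indicates an event of subject $i$ in interval $j$, $t_{ij}$ is the time at risk in that interval, and smooth terms involving both $x_i$ and $t_j$ yield non-proportional hazards.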

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Gilles Stupfler, University of Angers, France

Extreme risk assessment and expectiles

Expectiles were originally introduced by Newey and Powell (Econometrica, 1987) in order to test for symmetry in heteroskedastic regression models. They have recently received renewed interest due to the attractive properties they enjoy in the context of risk assessment in insurance and finance. I will discuss the definition of expectiles, some of their most important properties, and recent work around their estimation and inference at extreme levels. I will illustrate the results on actuarial and financial data.
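For reference, the standard definition: the $\tau$-expectile is the minimizer of an asymmetrically weighted squared loss, in the same way that the $\tau$-quantile minimizes an asymmetrically weighted absolute loss,

$$\xi_\tau = \operatorname*{arg\,min}_{\theta \in \mathbb{R}} \; \mathbb{E}\bigl[\eta_\tau(Y - \theta) - \eta_\tau(Y)\bigr], \qquad \eta_\tau(u) = \bigl|\tau - \mathbb{1}\{u \le 0\}\bigr|\, u^2,$$

so that $\tau = 1/2$ gives the mean, and levels $\tau \to 1$ correspond to the extreme regime studied in the talk.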

Joint seminar CEMAT and CEAUL