Recent seminars


Room 6.4.30, Faculty of Sciences of the Universidade de Lisboa

Danilo Alvares, University of Cambridge (UK)
A two-stage approach for Bayesian joint models: reducing complexity while maintaining accuracy

Several joint models for longitudinal and survival data have been proposed in recent years. In particular, many authors have preferred to employ the Bayesian approach to model more complex structures, make dynamic predictions, or use model averaging. However, Markov chain Monte Carlo methods are computationally very demanding and may suffer convergence problems, especially for complex models with random effects, which is the case for most joint models. These issues can be overcome by estimating the parameters of each submodel separately, leading to a natural reduction in the complexity of the joint modeling, but often producing biased estimates. Hence, we propose a novel two-stage approach that uses the estimates from the longitudinal submodel to specify an informative prior distribution for the random effects when estimating them within the survival submodel. In addition, as a bias correction mechanism, we incorporate the longitudinal likelihood function in the second stage, where its fixed effects are set to the estimates obtained using only the longitudinal submodel. Based on simulation studies and real applications, we empirically compare our proposal with joint specification and standard two-stage approaches considering different types of longitudinal responses (continuous, count, and binary) that share information with a Weibull proportional hazards model. The results show that our estimator is more accurate than its two-stage competitor and as good as jointly estimating all parameters. Moreover, the novel two-stage approach significantly reduces the computational time compared to the joint specification.

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Andreas Mayr, Department for Medical Biometry, Informatics, and Epidemiology, University of Bonn, Germany
Statistical Boosting, Advanced Statistical Modeling And Clinical Reality

Biostatisticians nowadays can choose from a huge toolbox of advanced methods and algorithms for prediction purposes. Some of these tools are based on concepts from machine learning; other methods rely on more classical statistical modeling approaches. In clinical settings, doctors are sometimes reluctant to consider risk scores that are constructed by black-box algorithms without clinically meaningful interpretation. Furthermore, even a model that is both accurate and interpretable will often not be used in practice when it is based on variables that are difficult to obtain in clinical routine or when its calculation is too complex.

In this talk, I will give a non-technical introduction to statistical boosting algorithms which can be interpreted as the methodological intersection between machine learning and statistical modeling. Boosting is able to perform variable selection while estimating statistical models from potentially high-dimensional data. It is mainly suitable for exploratory data analysis or prediction purposes. I will give an overview of some current methodological developments (including the development of polygenic scores) and provide an example of the construction of a clinical risk score with surprisingly simple solutions.
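To illustrate how boosting can select variables while fitting a model, here is a minimal sketch of component-wise L2 boosting with simple linear base learners. It is not taken from the talk: the function name, the step length `nu`, the number of steps, and the assumption that covariates are centered are all illustrative choices.

```python
def componentwise_l2_boost(X, y, steps=50, nu=0.1):
    """Component-wise L2 boosting sketch (assumes centered covariates).

    At each step, every covariate is fitted separately to the current
    residuals by least squares; only the best-fitting one is updated,
    and only by a fraction nu of its fitted coefficient.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coef = [0.0] * p
    resid = [yi - intercept for yi in y]
    cols = [[row[j] for row in X] for j in range(p)]
    for _ in range(steps):
        best_j, best_beta, best_rss = None, 0.0, float("inf")
        for j, col in enumerate(cols):
            sxx = sum(x * x for x in col)
            if sxx == 0.0:
                continue  # a constant-zero column cannot explain anything
            beta = sum(x * r for x, r in zip(col, resid)) / sxx
            rss = sum((r - beta * x) ** 2 for x, r in zip(col, resid))
            if rss < best_rss:
                best_j, best_beta, best_rss = j, beta, rss
        if best_j is None:
            break
        coef[best_j] += nu * best_beta
        resid = [r - nu * best_beta * x for x, r in zip(cols[best_j], resid)]
    return intercept, coef


# Toy data: the response depends on the first covariate only.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [2.0 * row[0] + 3.0 for row in X]
intercept, coef = componentwise_l2_boost(X, y)
print(intercept, coef)
```

Covariates that are never selected keep a coefficient of exactly zero, which is the sense in which stopping the algorithm early performs variable selection.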

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room P3.10, Mathematics Building — Online

Rui Pires da Silva Castro, Eindhoven University of Technology, The Netherlands
Detecting a (late) changepoint in the preferential attachment model

Motivated by the problem of detecting a change in the evolution of a network, we consider the preferential attachment random graph model with a time-dependent attachment function. We frame this as a hypothesis testing problem where the null hypothesis is a preferential attachment model with $n$ vertices and a constant affine attachment with parameter $\delta_0$, and the alternative hypothesis is a preferential attachment model where the affine attachment parameter changes from $\delta_0$ to $\delta_1$ at an unknown changepoint time $\tau_n$. For our analysis we focus on a scenario where one only sees the final network realization (and not its evolution), and the changepoint occurs “late”, namely $\tau_n = n - cn^\gamma$ with $c \geq 0$ and $\gamma \in (0,1)$. This corresponds to the relevant scenario where we aim to detect the changepoint shortly after it has happened. We present two asymptotically powerful tests that are able to distinguish between the null and alternative hypothesis when $\gamma > 1/2$. The first test requires knowledge of $\delta_0$, while the second test is significantly more involved, and does not require the knowledge of $\delta_0$ while still achieving the same performance guarantees. Furthermore, we determine the asymptotic distribution of the test statistics, which allows us to easily calibrate the tests in practice. Finally, we conjecture that in the setting considered there are no powerful tests when $\gamma < 1/2$. Our theoretical results are complemented with numerical evidence that illustrates the finite sample characteristics of the proposed procedures.

Joint work with Gianmarco Bet, Kay Bogerd, and Remco van der Hofstad.
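The model in the abstract can be simulated with a short sketch such as the one below. It only generates the network and does not implement the proposed tests. Several details are assumptions, since the abstract does not fix them: each new vertex brings exactly one edge, the graph starts from two vertices joined by an edge, self-loops are not allowed, and the new parameter applies to vertices arriving after $\tau_n$. The function names are hypothetical.

```python
import random


def late_changepoint(n, c, gamma):
    """Changepoint time tau_n = n - c * n**gamma, rounded down."""
    return int(n - c * n ** gamma)


def simulate_pa_degrees(n, delta0, delta1=None, tau=None, seed=0):
    """Grow a preferential attachment tree on n vertices; return its degrees.

    Vertex t attaches one edge to an existing vertex i chosen with
    probability proportional to degree(i) + delta, where delta is delta0
    while t <= tau and delta1 afterwards (assumed convention).
    Requires delta > -1 so that all weights stay positive.
    """
    rng = random.Random(seed)
    degrees = [1, 1]  # assumed start: two vertices joined by one edge
    for t in range(3, n + 1):
        delta = delta0 if (tau is None or t <= tau) else delta1
        weights = [d + delta for d in degrees]
        target = rng.choices(range(len(degrees)), weights=weights)[0]
        degrees[target] += 1
        degrees.append(1)  # the new vertex enters with degree one
    return degrees


n = 500
tau = late_changepoint(n, c=1.0, gamma=0.75)
null_degrees = simulate_pa_degrees(n, delta0=0.0)
alt_degrees = simulate_pa_degrees(n, delta0=0.0, delta1=5.0, tau=tau)
print(tau, max(null_degrees), max(alt_degrees))
```

Only the final degree sequence is returned, matching the setting in which one observes the final network but not its evolution. The loop recomputes all weights at each step, so it is meant for small $n$ only.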

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Online

Katiane S. Conceição, Universidade de São Paulo, Brazil
Regression Model for Zero-Modified Count Data

In this work, we present a family of distributions for count data, named Zero-Modified Power Series (ZMPS), an extension of the Power Series distributions family whose support starts at zero. This extension consists of modifying the probability of observing zero of each Power Series distribution, allowing the new zero-modified distribution to appropriately accommodate datasets that have any amount of zero observations (for instance, zero-inflated or zero-deflated datasets). Power Series distributions included in the Zero-Modified Power Series family are Poisson, Generalized Poisson, Geometric, Binomial, Negative Binomial, and Generalized Negative Binomial. In addition, we introduce the Zero-Modified Power Series regression models and propose a Bayesian approach. Two real datasets are analyzed: the first corresponds to leptospirosis notifications in cities of Bahia State in Brazil; the second corresponds to the number of goals scored by a team in a sports competition.
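As a concrete instance of zero modification, the sketch below evaluates a zero-modified Poisson probability function. The abstract does not state a parameterization; the one used here, $P(Y=y) = (1-p)\,\mathbb{1}\{y=0\} + p\,\pi_P(y;\mu)$ with $0 \le p \le 1/(1-\pi_P(0;\mu))$, is a commonly used form and should be read as an assumption, as should the function names.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability mass at a non-negative integer y."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def zmp_pmf(y, mu, p):
    """Zero-modified Poisson pmf under the assumed parameterization.

    p < 1 inflates the zero probability, p = 1 recovers the Poisson,
    1 < p < p_max deflates it, and p = p_max removes zeros entirely.
    """
    p_max = 1.0 / (1.0 - poisson_pmf(0, mu))
    if not 0.0 <= p <= p_max:
        raise ValueError("p must lie in [0, 1 / (1 - P_Poisson(0))]")
    mass = p * poisson_pmf(y, mu)
    if y == 0:
        mass += 1.0 - p
    return max(mass, 0.0)  # guard against tiny negative rounding at p_max


mu = 2.0
inflated_zero = zmp_pmf(0, mu, 0.6)
deflated_zero = zmp_pmf(0, mu, 1.1)
total = sum(zmp_pmf(y, mu, 0.6) for y in range(60))
print(poisson_pmf(0, mu), inflated_zero, deflated_zero, total)
```

A single extra parameter therefore covers zero-inflated and zero-deflated data, without needing to know in advance which case applies.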

Joint seminar CEMAT and CEAUL

Europe/Lisbon
Room P3.10, Mathematics Building — Online

Peter Rousseeuw, KU Leuven
New graphical displays for classification

Classification is a major tool of statistics and machine learning. Several classifiers have interesting visualizations of their inner workings. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight into the data. One of these data features is how far each case lies from its given class, yielding so-called class maps. The proposed displays are constructed for discriminant analysis, k-nearest neighbors, support vector machines, CART, random forests, and neural networks. The graphical displays are illustrated and interpreted on data sets containing images, mixed features, and texts.

Joint work with: Jakob Raymaekers, Mia Hubert
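The PAC and the silhouette width can be made concrete with a small sketch. The abstract gives no formulas, so the definitions below are assumptions: the PAC of a case is taken as the posterior probability of its best alternative class divided by the sum of that probability and the probability of its given class, and the silhouette width as $1 - 2\,\mathrm{PAC}$. The function names and the toy probabilities are hypothetical.

```python
def pac(posteriors, given):
    """Probability of the alternative class for one case (assumed definition).

    posteriors: class probabilities produced by some classifier.
    given: index of the class the case is labeled with.
    """
    p_given = posteriors[given]
    p_alt = max(p for k, p in enumerate(posteriors) if k != given)
    if p_given + p_alt == 0.0:
        return 0.5  # no evidence either way
    return p_alt / (p_given + p_alt)


def silhouette_width(posteriors, given):
    """Assumed silhouette width: +1 fits the given class, -1 the alternative."""
    return 1.0 - 2.0 * pac(posteriors, given)


# Toy cases: (posterior probabilities, given label).
cases = [
    ([0.7, 0.2, 0.1], 0),  # classifier agrees with the label
    ([0.1, 0.8, 0.1], 0),  # classifier prefers class 1: possible mislabel
]
pacs = [pac(post, g) for post, g in cases]
widths = [silhouette_width(post, g) for post, g in cases]
average_width = sum(widths) / len(widths)
print(pacs, widths, average_width)
```

Under these assumptions, a PAC above 0.5 corresponds to a negative silhouette width, flagging cases whose given label the classifier disputes.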

Joint seminar CEMAT and CEAUL