Shared frailty models are particularly useful in recurrent events analysis to account for the within-subject dependence among event times. Usually, such models rely on the assumption that frailty acts multiplicatively on the hazard/rate function. However, in certain scenarios, it may be more realistic for frailty to be included in an additive way. Furthermore, the unobserved heterogeneity may be due to the presence of some subjects who are non-susceptible to the event of interest, and others with a varying degree of susceptibility.
This talk introduces a new additive shared frailty model for recurrent gap time data, characterized by a Weibull rate function derived from a non-homogeneous Poisson process and by a mixed frailty following a non-central chi-squared distribution with zero degrees of freedom. It will be shown that the resulting model admits a competing risks interpretation. Additionally, the Weibull rate model and the classical homogeneous Poisson process arise as special cases when the frailty distribution degenerates. A frequentist approach to parameter estimation via maximum likelihood will be discussed, and an application to a well-known data set is provided for illustrative purposes.
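As background on the mixed frailty: a non-central chi-squared variable with zero degrees of freedom has a point mass at zero, which accommodates non-susceptible subjects, and can be drawn via its Poisson mixture representation (K ~ Poisson(nc/2), then a central chi-squared with 2K degrees of freedom, i.e. a Gamma(K, scale 2), taken as 0 when K = 0). A minimal simulation sketch, with a hypothetical non-centrality value:

```python
import numpy as np

rng = np.random.default_rng(42)

def rnc_chisq_df0(n, noncentrality, rng):
    """Draw from the non-central chi-squared distribution with 0 degrees of
    freedom via the Poisson mixture: K ~ Poisson(nc / 2), then chi2(2K),
    which is a Gamma(K, scale=2) draw, equal to 0 whenever K == 0."""
    k = rng.poisson(noncentrality / 2.0, size=n)
    draws = rng.gamma(shape=np.maximum(k, 1), scale=2.0, size=n)
    return np.where(k > 0, draws, 0.0)

z = rnc_chisq_df0(100_000, noncentrality=2.0, rng=rng)
prop_zero = (z == 0.0).mean()   # theory: P(Z = 0) = exp(-nc / 2) = exp(-1)
```

The point mass exp(-nc/2) is the fraction of non-susceptible subjects, while the continuous part captures the varying susceptibility of the rest.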
The seminar will be taught in Portuguese, but the presentation slides will be in English.
Air pollution is a global challenge with deep implications for public health and the environment. We examine air quality data from a monitoring station in Entrecampos, Lisbon, using Symbolic Data Analysis. The dataset consists of hourly concentrations of nine pollutants over three years, which are logarithmically transformed and aggregated into daily intervals given by the minimum and maximum values. The symbolic mean and variance are estimated for each variable through the method of moments, and the pairwise dependencies are captured using a bivariate copula. Symbolic principal component scores are obtained from the estimated covariance matrix and used to fit generalized extreme value distributions. Control charts based on the quantiles of these distributions are used to identify outlying observations. A comparative analysis with daily average-based outlier detection methods is conducted. The results show the relevance of Symbolic Data Analysis in revealing new insights into air quality.
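For interval-valued variables, the method-of-moments symbolic mean and variance mentioned above are commonly computed by treating each interval as uniform within its bounds (the classical Bertrand-Goupil formulas). A small illustrative sketch, with toy values standing in for the daily [min, max] intervals:

```python
import numpy as np

def symbolic_mean_var(lower, upper):
    """Symbolic sample mean and variance of an interval-valued variable,
    assuming each observed interval [a_i, b_i] is uniform on its bounds:
    mean = (1/n) sum (a_i + b_i)/2,
    var  = (1/(3n)) sum (a_i^2 + a_i*b_i + b_i^2) - mean^2."""
    a = np.asarray(lower, dtype=float)
    b = np.asarray(upper, dtype=float)
    mean = np.mean((a + b) / 2.0)
    var = np.mean((a**2 + a * b + b**2) / 3.0) - mean**2
    return mean, var

# Toy daily [min, max] values of one log-transformed pollutant.
m, v = symbolic_mean_var([1.2, 0.8, 1.5], [2.0, 1.9, 2.4])
```

When every interval degenerates to a point, these formulas reduce to the ordinary sample mean and (population) variance, which is a convenient sanity check.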
The Receiver Operating Characteristic (ROC) curve is a graphical tool that assesses the accuracy of a classification method based on a continuous random variable, usually known as the marker. Nowadays it is a well-accepted technique that reflects how well the classifier discriminates between two different groups or classes.
In this talk, our focus will be on situations where covariates with an impact on the performance of the ROC curve are recorded, so it is advisable to incorporate this additional information into the study. We take the covariate effect into account through regression models. More precisely, for each population, the marker's distribution is modelled separately in terms of the covariates, and the induced ROC curve is then computed.
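To make the induced approach concrete, here is a non-robust sketch (ordinary location-scale fits and empirical residual cdfs; all model choices below are illustrative assumptions, not the speakers' estimators): each group's marker is modelled as Y = mu_g(x) + sigma_g * eps_g, and the covariate-specific ROC curve follows from the fitted regression functions and the residual distributions.

```python
import numpy as np

def induced_roc(x, p_grid, fit0, fit1):
    """Covariate-specific ROC via the induced methodology.
    fit_g = (mu_g, sigma_g, residuals_g): regression function (callable),
    scale, and standardized residuals for the healthy (0) / diseased (1)
    group. For each false-positive rate p, the classification threshold is
    c(p) = mu0(x) + sigma0 * q_{1-p}(eps0), and the true-positive rate is
    P(Y1 > c(p) | x), read off the diseased residual cdf."""
    mu0, s0, r0 = fit0
    mu1, s1, r1 = fit1
    c = mu0(x) + s0 * np.quantile(r0, 1.0 - p_grid)
    return 1.0 - np.mean(r1[None, :] <= ((c - mu1(x)) / s1)[:, None], axis=1)

rng = np.random.default_rng(0)
mu0 = lambda x: 1.0 + 0.5 * x        # hypothetical fitted regressions
mu1 = lambda x: 2.0 + 0.5 * x
r0, r1 = rng.standard_normal(5000), rng.standard_normal(5000)
p = np.linspace(0.01, 0.99, 99)
roc = induced_roc(1.0, p, (mu0, 1.0, r0), (mu1, 1.0, r1))
```

With standard normal errors and a unit mean shift this reduces to the binormal curve ROC(p) = Phi(1 + Phi^{-1}(p)); the robust versions discussed in the talk replace the fits and residual cdfs with estimators that down-weight outliers.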
This talk is motivated by the widespread belief that ROC curves are robust. We tackle the concept of robustness in the sense of protection against anomalous data in the sample. Aware of the impact that outlying values may have on the accuracy of a diagnostic test, we centre our attention on the robust aspects of the estimation procedures for the conditional ROC curve. Moreover, since regression models are involved in both the direct and induced approaches, atypical data among the responses or the covariates may severely affect the estimation methods. Achieving robustness is even more complex when dealing with functional data since, in such a situation, different types of atypical data may arise.
Due to the lack of stability of the classical ROC curve estimators when there are outliers among the observations, we will introduce a procedure to obtain robust estimators within the framework of the induced methodology. The proposal is based on a semi-parametric approach in which, for each sample, a regression model is robustly fitted to the marker, and estimators of the distribution functions of the errors are adaptively constructed to down-weight large residuals. Robust procedures will be introduced for both real and functional covariates.
We will present results regarding the uniform consistency of the estimators. A finite-sample numerical study illustrates the robustness of the proposal, and a real data set is also analysed.
The methods to be described are based on the following papers:
Bianco, A. M. and Boente, G. (2022). Addressing robust estimation in covariate specific ROC curves. Econometrics and Statistics.
Bianco, A. M., Boente, G. and Gonzalez Manteiga, W. (2022). Robust consistent estimators for ROC curves with covariates. Electronic Journal of Statistics, 16, 4133-4161.
Accurate population estimates are always a challenge, and obtaining them can be further burdened in scenarios with high migrant mobility. In these cases, bias emanating from over-coverage, i.e., resident individuals whose death or emigration is not registered, may have significant ramifications for policymaking and research. In this talk, we will consider different approaches to obtaining over-coverage estimates, using Swedish population registers for the period 2003-2016. We will discuss difficulties arising from the high dimensionality of the data and show current developments and ideas for the future.
Understanding co-infection systems with multiple interacting strains remains difficult. High dimensionality and complex nonlinear feedbacks make the analytical study of such systems very challenging. When strains are similar, we can model trait variation as perturbations in parameters, which simplifies analysis. Applying singular perturbation theory to such a multi-strain system, we have obtained the explicit collective dynamics in terms of a fast (neutral) dynamics and a slow (non-neutral) dynamics. The slow dynamics are given by the replicator equation for strain frequencies, a key equation in evolutionary game theory, which in our case governs selection among N strains. In this talk, I will highlight some key features of this derivation and of the use of the replicator equation to better understand such a multi-strain system, and discuss links with diversity data in both epidemiology and ecology.
Multistate models describe complex processes in which individuals may move between a finite number of states over time. In biomedical applications, such models make it possible to analyse the progression of a disease; to investigate the effect of predictors on the risk of transitions between states; or to predict transition probabilities to future states given the history of events. In any of these cases, a prior assessment of the Markov assumption is essential to avoid, for example, inconsistencies in the resulting estimates. The seminar will introduce the fundamental concepts of multistate models, as well as different methods of inference and validation of the Markov assumption (drawn from the literature, together with others published by the speaker). Finally, practical examples of applying these methods to real healthcare data will be presented using the R package markovMSM.
Dynamic event prediction, using joint modeling of survival time and longitudinal variables, is extremely useful in personalized medicine. However, estimating joint models that include multiple longitudinal markers remains a computational challenge due to the large number of random effects and parameters to be estimated. We propose a model-averaging strategy to combine predictions from several joint models for the event, including models with only one longitudinal marker or pairwise longitudinal markers. The prediction is computed as the weighted mean of the predictions from the one-marker or two-marker models, with the time-dependent weights estimated by minimizing the time-dependent Brier score. This method enables us to combine a large number of predictions issued from joint models to achieve a reliable and accurate individual prediction. The advantages and limitations of the proposed methods are highlighted by comparing them with the predictions from well-specified and misspecified all-marker joint models, as well as one-marker and two-marker joint models, using the available PBC2 dataset. The method is used to predict the risk of death in patients with primary biliary cirrhosis, and is also applied to the French 3C cohort study, in which seventeen longitudinal markers are considered to predict the risk of death.
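The weight-estimation step can be sketched in simplified form. The sketch below ignores censoring (the talk's version minimises a time-dependent, censoring-adjusted Brier score) and simply finds simplex weights minimising the mean squared error between the averaged event probability and the observed outcomes:

```python
import numpy as np
from scipy.optimize import minimize

def brier_weights(pred, outcome):
    """Simplex weights minimising the (uncensored) Brier score of the
    model-averaged prediction. pred: (n, M) matrix whose columns are the
    event probabilities issued by the M candidate joint models."""
    n_models = pred.shape[1]
    res = minimize(
        lambda w: np.mean((pred @ w - outcome) ** 2),
        x0=np.full(n_models, 1.0 / n_models),
        bounds=[(0.0, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
        method="SLSQP",
    )
    return res.x

# Toy check: a perfectly calibrated model vs. a non-informative one.
y = np.array([0.0, 1.0] * 50)
preds = np.column_stack([y, np.full(100, 0.5)])
w = brier_weights(preds, y)
```

In this toy setting essentially all the weight lands on the calibrated model; in the talk's setting the optimisation is repeated over prediction times, yielding time-dependent weights.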
The maximum likelihood problem for Hidden Markov Models is usually numerically solved by the Baum-Welch algorithm, which uses the Expectation-Maximization algorithm to find the estimates of the parameters. This algorithm has a recursion depth equal to the data sample size and cannot be computed in parallel, which limits the use of modern GPUs to speed up computation time. A new algorithm is proposed that provides the same estimates as the Baum-Welch algorithm, requiring about the same number of iterations, but is designed in such a way that it can be parallelized. As a consequence, it leads to a significant reduction in the computation time. We illustrate this by means of numerical examples, where we consider simulated data as well as real datasets.
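For context, the sequential bottleneck is visible in the standard (scaled) forward recursion, sketched below for a discrete HMM: each step depends on the previous one, so the loop of length T cannot be parallelized directly.

```python
import numpy as np

def hmm_loglik(pi, A, B, obs):
    """Log-likelihood of an observation sequence under a discrete HMM,
    computed with the scaled forward algorithm. pi: initial distribution,
    A: transition matrix, B[i, y]: emission probability of symbol y in
    state i. The recursion alpha_t = (alpha_{t-1} @ A) * B[:, y_t] must be
    evaluated once per observation, in order."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()          # rescale to avoid underflow
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik
```

The same chain of dependencies appears inside each E-step of Baum-Welch, which is precisely the recursion depth the proposed algorithm is designed to break.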
National statistical agencies worldwide have experienced a growing need to provide reliable estimates of economic and social indices, such as proportions or rates, at the level of small areas or small domains from sample survey data. However, due to the small sample size in these areas, it is not feasible to obtain estimates with an acceptable level of precision without resorting to model-based approaches. This work proposes jointly modelling the direct estimator of indices in the interval (0,1) and their respective precisions using the Beta and Beta prime distributions. The novelty lies in also modelling the sampling precision estimator with a Beta prime distribution. An evaluation study with real data shows that jointly modelling the direct estimator and its precision estimator yields an extra gain over the Beta model that does not use sample information on the precision of the estimates. An application to estimating the food insecurity index in small areas of the State of Minas Gerais, using data from the Pesquisa Nacional de Orçamentos Familiares (POF) for 2018, is also presented.
Joint work with Soraia Pereira (CEAUL/FCUL) and Giovani Silva (CEAUL/IST).
The development of statistical models for detecting fraud in tests has gained relevance in recent years, particularly models based on Item Response Theory (IRT). Exams and assessments may raise suspicions of fraud when the results are tied to financial advantages or places in educational institutions. The main models, their associated statistical behaviour, their computational performance, and an application to real data will be presented. An R package was built, which will be presented and made available to the public.
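One classical ingredient of IRT-based fraud screening is a person-fit statistic; the sketch below (illustrative only, not the package's implementation) computes the standardized log-likelihood statistic l_z under a 2PL model, where strongly negative values flag improbable response patterns such as easy items missed and hard items answered correctly:

```python
import numpy as np

def lz_person_fit(theta, a, b, u):
    """Standardized log-likelihood person-fit statistic l_z for a 2PL IRT
    model: theta is the ability, a / b the item discriminations and
    difficulties, u the 0-1 response vector. l_z standardizes the response
    pattern's log-likelihood by its expectation and variance under the
    model, so large negative values indicate aberrant responding."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    l0 = np.sum(u * np.log(p) + (1 - u) * np.log(1 - p))
    expect = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - expect) / np.sqrt(var)

a = np.ones(10)
b = np.linspace(-2.0, 2.0, 10)                  # easy to hard items
consistent = (b < 0).astype(float)              # gets the easy items right
aberrant = (b > 0).astype(float)                # gets only the hard items right
lz_ok = lz_person_fit(0.0, a, b, consistent)
lz_bad = lz_person_fit(0.0, a, b, aberrant)
```

The aberrant pattern (correct answers concentrated on the hardest items) yields a much lower l_z than the consistent one, which is the kind of signal exploited when screening for copying or answer-key leaks.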
Advances in information technology and computing have made it possible to store large and multiple databases, and these data are often unstructured, with variables defined by multiple values or multiple units. Examples include daily temperatures recorded as minimum and maximum values, and user preferences for analysing phenomena by region rather than by inhabitant. To reduce size and improve the efficiency of models associated with such data, one solution is to derive new statistical units that describe the phenomena via multi-valued data. In Symbolic Data Analysis (SDA), the entries of the databases are new units described by variables that are not restricted to real values, since they can be chosen from a broader list: sets, intervals, histograms, trees, graphs, functions, fuzzy sets, etc. The aim of SDA is to extend statistical and machine learning techniques (decision trees, classification rules, neural networks, factor analysis) to more complex data, called symbolic data. Over the last decade, different regression and clustering methods for multi-valued data have been proposed in the SDA literature. Various applications illustrate the use of these methods.
Within the general aim of extreme value statistics lies the estimation of an event so rare that it may never have been witnessed in the past. Whilst the parametric estimation of an extreme quantile has found its way into the lore of many applied sciences, in terms of evaluating return levels, analogous non-parametric methodology is far less explored. This is an interesting topic because there are different, albeit equivalent, ways to define an (extreme) out-of-sample quantile, underpinned by different constructs arising from the same foundational extreme value theorem.
In this talk, I will address two of these definitions through the domains of attraction framework and explain how we succeeded in generalising one of them to allow for either a finite or an infinite upper bound to the distribution underlying the sampled data.
This seminar will begin with an introduction to the multidimensional construct of healthcare access, providing a well-established definition and common objectives in access measurement and inference. Different approaches will be presented, focusing on rigorous mathematical models to estimate access, including optimization and simulation under uncertainty in the model inputs. Important aspects will be covered, including spatial dependence in the decision parameters of the optimization models used to estimate healthcare access, and Bayesian hierarchical models used to specify the sampling distributions of the model inputs. The models will be illustrated with an application to mental healthcare access in Georgia, United States.
In this seminar, Dr. Luis Gimeno-Sotelo will provide an overview of his most recent advances in the extreme value analysis of the main hydrological extreme events (heavy rainfall and droughts) in terms of their main drivers. The most relevant statistical methods for non-stationary extreme value modelling will be presented, as well as a variety of methods from copula theory to study bivariate extremes and conditional probabilities. He will explain the main applications of these statistical methodologies in the aforementioned environmental context, allowing for the identification of hotspot regions of high statistical dependence between the drivers and the hydrological extremes, as well as the analysis of the projected changes in the probabilities of occurrence of these extreme events in a global warming context.
Misdiagnosis can occur when different case definitions are used by clinicians (relative misdiagnosis) or when a genuine diagnosis of another disease is missed (misdiagnosis in a strict sense). In complex diseases, such as myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS), this problem translates into a recurrent difficulty in reproducing research findings. To explore these effects, we simulated data from case-control studies under the assumption of misdiagnosis in a strict sense. We estimated the power to detect a genuine association between a potential causal factor and ME/CFS and demonstrated how current research studies may have suboptimal power. To address the implications of these findings, suggestions for how power can be improved are given and explained within the context of the disease.
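The dilution effect can be reproduced with a small Monte Carlo sketch (all rates and sample sizes below are hypothetical, chosen only to illustrate): a fraction of labelled cases are actually misdiagnosed controls, which shrinks the observed exposure difference and hence the power of a standard 2x2 chi-squared association test.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)

def power_with_misdiagnosis(p_case, p_ctrl, misdiag, n=500, reps=300, alpha=0.05):
    """Monte Carlo power of a chi-squared test of exposure vs. case status
    when a fraction `misdiag` of labelled cases are truly controls."""
    hits = 0
    for _ in range(reps):
        is_true_case = rng.random(n) > misdiag     # labelled-case sample
        p = np.where(is_true_case, p_case, p_ctrl)
        exp_case = (rng.random(n) < p).sum()       # exposed among "cases"
        exp_ctrl = (rng.random(n) < p_ctrl).sum()  # exposed among controls
        table = [[exp_case, n - exp_case], [exp_ctrl, n - exp_ctrl]]
        hits += chi2_contingency(table)[1] < alpha
    return hits / reps

power_clean = power_with_misdiagnosis(0.30, 0.20, misdiag=0.0)
power_diluted = power_with_misdiagnosis(0.30, 0.20, misdiag=0.5)
```

Halving the proportion of true cases roughly halves the observable exposure difference, and the power drops sharply, which is the phenomenon the talk quantifies for ME/CFS studies.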
Bayesian statistics has been increasingly used in clinical trials, offering greater flexibility and efficiency in the development of new drugs.
In this seminar we will address this topic using as a base example a large, very well-known clinical trial that few people know relied on Bayesian methods. We will explore in detail the methodology used in the trial and how it can be applied to other trials. We will also discuss the choice of the type of prior distribution and how to choose the parameters of a distribution.
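As a toy illustration of the prior-choice discussion (the numbers below are invented, not those of the trial in the talk): with a conjugate Beta prior on a response rate, the posterior after observing the data is available in closed form, and quantities such as the posterior probability that the rate exceeds a clinically relevant threshold follow directly.

```python
from scipy.stats import beta

# Hypothetical example: Beta(2, 2) prior on the response rate (weakly
# informative, centred at 0.5); 27 responses are observed in 60 patients.
a, b = 2.0, 2.0
successes, n = 27, 60
posterior = beta(a + successes, b + n - successes)   # Beta(29, 35) by conjugacy
prob_above_threshold = 1.0 - posterior.cdf(0.40)     # P(rate > 0.40 | data)
```

Changing the prior parameters (a, b) shifts the posterior and this probability, which is exactly the sensitivity that must be examined when choosing a prior for a trial.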
Fitting spatial models with a Gaussian random field as the spatial random effect poses computational challenges for Markov chain Monte Carlo (MCMC) methods, primarily due to two factors: computational speed and convergence of the chains for the hyperparameters. To deal with this, a Gaussian random field can be approximated by a Gaussian Markov random field using stochastic partial differential equations. This methodology is commonly used in “latent Gaussian models”, where inference is done by Integrated Nested Laplace Approximations, but rarely used in an MCMC method. In this contribution, we evaluated different parameterizations of the approximated Gaussian random field, specifically using the Hamiltonian Monte Carlo algorithm of the Stan software. A simulation study demonstrated that models using the hyperparameters ρ and σu were better able to estimate the values used to simulate the spatial random field, and their computation was faster than that of models parameterized with κ and τ. In the real data application, the index of relative abundance estimated for pollock indicates similar trends for the six proposed models. However, models incorporating ρ and σu again demonstrated faster computation than those utilizing κ and τ, corroborating the simulation results. More importantly, none of these models encountered convergence issues, as indicated by the Rhat statistic.
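For reference, the two parameterizations are linked in closed form in the SPDE approach: ρ = sqrt(8ν)/κ and σu² = Γ(ν) / (Γ(ν + d/2) (4π)^{d/2} κ^{2ν} τ²), with Matérn smoothness ν = 1 in d = 2 dimensions being the common default. A small conversion helper:

```python
import math

def kappa_tau_to_rho_sigma(kappa, tau, nu=1.0, d=2):
    """Convert the SPDE parameters (kappa, tau) of an approximated Gaussian
    random field to the interpretable range rho and marginal standard
    deviation sigma_u (defaults: Matern smoothness nu = 1, dimension 2)."""
    rho = math.sqrt(8.0 * nu) / kappa
    sigma2 = math.gamma(nu) / (
        math.gamma(nu + d / 2) * (4.0 * math.pi) ** (d / 2)
        * kappa ** (2.0 * nu) * tau**2
    )
    return rho, math.sqrt(sigma2)

rho, sigma_u = kappa_tau_to_rho_sigma(1.0, 1.0)
```

Because ρ and σu are interpretable (distance at which correlation falls to about 0.1, and marginal standard deviation), priors are easier to elicit on them than on κ and τ, which may partly explain the better HMC behaviour reported above.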