We discuss a novel approach for modeling multivariate binary transaction data and inferring co-purchase patterns in market basket data. To this end, we exploit a latent graph capturing these purchase associations, in which each transaction forms a clique, and we set meaningful priors based on expected transaction sizes and frequencies. We present an MCMC sampling procedure that handles large datasets and conclude that this model provides sparser representations of inferred associations than traditional frequent itemset mining (FIM) approaches, without sacrificing predictive accuracy. This is joint work with David Reynolds.
In this talk, I will present four bi-parametric extensions of the Zipf distribution. The first two belong to the class of Random Stopped Extreme distributions. The third extension results from applying the concept of Poisson-Stopped-Sum to the Zipf distribution, and the last one is obtained by including an additional parameter in the probability generating function of the Zipf. An interesting characteristic of three of the models presented is that they allow for a parameter interpretation that gives some insight into the mechanism that generates the data. I also analyze the performance of these models when used to fit the degree sequences of real networks from different areas, such as social networks, protein interaction networks, and collaboration networks. The fits obtained are compared with those of other bi-parametric models, such as the Zipf-Mandelbrot, the discrete Weibull, and the negative binomial.
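As a point of reference (not part of the talk's new material), the baseline one-parameter Zipf distribution from which all four extensions start has probability mass function

\[ P(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha)}, \qquad x = 1, 2, \ldots, \quad \alpha > 1, \]

where \zeta denotes the Riemann zeta function; each extension adds a second parameter to this one-parameter family.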
This talk presents a concise review of integer-valued GARCH (INGARCH) modeling for time series of counts. Attention is paid to some commonly used specifications, to the main approaches for studying their ergodic properties, and to their estimation methods. In particular, the focus is on the class of INGARCH processes with equal conditional stochastic and mean orders. Some recent mixture INGARCH extensions, in particular Markov-switching INGARCH models, are also presented.
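To fix ideas, the simplest member of this class is the standard Poisson INGARCH(1,1) model (given here for orientation only), in which the count X_t given the past \mathcal{F}_{t-1} satisfies

\[ X_t \mid \mathcal{F}_{t-1} \sim \mathrm{Poisson}(\lambda_t), \qquad \lambda_t = \omega + \alpha X_{t-1} + \beta \lambda_{t-1}, \]

with \omega > 0 and \alpha, \beta \ge 0; stationarity and ergodicity then require \alpha + \beta < 1.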
Accurate spatiotemporal modeling of conditions leading to moderate and large wildfires provides a better understanding of the mechanisms driving fire-prone ecosystems and improves risk management. Here, we develop a joint model for the occurrence intensity and the wildfire size distribution by combining extreme-value theory and point processes within a novel Bayesian hierarchical model, and use it to study daily summer wildfire data for the French Mediterranean basin during 1995-2018. The occurrence component models wildfire ignitions as a spatiotemporal log-Gaussian Cox process. Burnt areas are numerical marks attached to points and are considered extreme if they exceed a high threshold. The size component is a two-component mixture varying in space and time that jointly models moderate and extreme fires. We capture the non-linear influence of covariates (Fire Weather Index, forest cover) through component-specific smooth functions, which may vary with season. We propose estimating shared random effects between model components to reveal and interpret common drivers of different aspects of wildfire activity. This leads to increased parsimony and reduced estimation uncertainty, with better predictions. Fast approximate (but accurate) Bayesian estimation is carried out in the framework of the integrated nested Laplace approximation. Our methodology provides a holistic approach to explaining and predicting the drivers of wildfire activity and associated uncertainties.
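In simplified schematic form (our notation, not the full hierarchical specification), the two components can be written as

\[ \log \Lambda(s,t) = \beta_0 + f_{\mathrm{occ}}(\mathrm{FWI}_{s,t}) + g_{\mathrm{occ}}(\mathrm{FC}_s) + W(s,t), \qquad \Pr(Y_{s,t} > y \mid Y_{s,t} > u) = \left(1 + \xi \, \frac{y - u}{\sigma_{s,t}}\right)^{-1/\xi}, \]

where \Lambda is the log-Gaussian Cox process intensity for ignitions, the generalized Pareto tail handles burnt areas exceeding the threshold u, f and g are smooth covariate effects, and random effects such as W(s,t) may be shared between the occurrence and size components.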
Epilepsy is a chronic neurological disorder affecting more than 50 million people globally. An epileptic seizure acts like a temporary shock to the neuronal system, disrupting normal electrical activity in the brain. Epilepsy is frequently diagnosed with electroencephalograms (EEGs). Current methods study only the time-varying spectra and coherence and do not directly model changes in extreme behavior, neglecting the fact that neuronal oscillations exhibit non-Gaussian heavy-tailed probability distributions. To overcome this limitation, we propose a new approach to characterize brain connectivity based on the joint tail (i.e., extreme) behavior of the EEGs. Our proposed method, the conditional extremal dependence for brain connectivity (Conex-Connect), is a pioneering approach that links extreme values of higher-frequency oscillations at a reference channel with those at the other brain network channels. Using the Conex-Connect method, we discover changes in the extremal dependence driven by the activity at the foci of the epileptic seizure. Our model-based approach reveals that, pre-seizure, the dependence is notably stable for all channels when conditioning on extreme values of the focal seizure area. By contrast, the dependence between channels is weaker during the seizure, and dependence patterns are more "chaotic." Moreover, we identify the high-frequency oscillations as the most relevant features explaining the conditional extremal dependence of brain connectivity.
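As background for readers unfamiliar with conditional extremal dependence, a generic formulation in the spirit of the Heffernan-Tawn conditional extremes model (shown here for orientation; the talk's exact specification may differ) is

\[ \{ \mathbf{Y}_{-j} \mid Y_j = y \} \stackrel{d}{\approx} \boldsymbol{\alpha} y + y^{\boldsymbol{\beta}} \mathbf{Z}, \qquad y \to \infty, \]

where Y_j is the (suitably transformed) signal at the reference channel, \mathbf{Y}_{-j} collects the other channels, the parameters \boldsymbol{\alpha} and \boldsymbol{\beta} quantify the strength and form of the extremal dependence, and \mathbf{Z} is a vector of residuals.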
This workshop aims to introduce extreme value analysis. We will start by motivating the need for modelling extreme observations and present the most common methodologies for doing so: the block maxima and the peaks-over-threshold approaches. We will apply these methods to real data using the most common R packages for extremes, and inference will be carried out in a frequentist framework.
Participants are advised to install RStudio, together with the packages "ismev" and "evd", prior to the session.
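As a flavour of the hands-on part, a minimal R sketch of the two approaches using the "evd" package could look as follows (simulated data and variable names are purely illustrative):

    library(evd)

    # Simulated "daily" observations standing in for a real series
    set.seed(1)
    x <- rexp(365 * 30)

    # Block maxima: annual maxima fitted with a GEV distribution
    annual_max <- tapply(x, rep(1:30, each = 365), max)
    fit_bm <- fgev(annual_max)

    # Peaks over threshold: excesses over a high threshold fitted with a GPD
    u <- quantile(x, 0.95)
    fit_pot <- fpot(x, threshold = u)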
In statistics of extremes, upper tail inference is usually based on the sample values above a high threshold. In a semiparametric framework, we consider the probability-weighted moment estimator of a positive extreme value index. Owing to the specific properties of this estimator, a direct estimation of an "optimal" threshold is not straightforward. In this talk, we consider two adaptive procedures for choosing such a threshold. The performance of the methods is analysed in a simulation study, and an illustration with a real dataset from the field of insurance is also provided.
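For orientation, one standard probability-weighted moment construction, the Hosking-Wallis PWM estimator for a generalized Pareto distribution fitted to the excesses over a threshold u (a generic sketch, not necessarily the exact estimator considered in the talk), can be written in R as:

    # PWM estimates of the GPD shape and scale from the excesses over u
    gpd_pwm <- function(x, u) {
      y <- sort(x[x > u] - u)                      # ascending excesses
      n <- length(y)
      a0 <- mean(y)                                # estimates E[Y]
      a1 <- mean(y * (n - seq_len(n)) / (n - 1))   # estimates E[Y (1 - F(Y))]
      c(shape = 2 - a0 / (a0 - 2 * a1),
        scale = 2 * a0 * a1 / (a0 - 2 * a1))
    }

The choice of u (equivalently, of the number of top order statistics retained) is precisely the tuning problem that the two adaptive procedures address.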
Fires continue to be a leading cause of property damage, psychological and physical harm, and death in modern society. Since 43% of the Portuguese population lives in urban areas, these numbers potentially carry severe consequences. In the literature, several approaches are used to predict and model fire occurrences. Regardless of the approach, most studies emphasize the need to use spatial techniques to model urban fire occurrences. Spatial econometric models may offer benefits, as they allow spatial autocorrelation to be considered in the response variable, the explanatory variables and/or the random error terms. Hence, this research aims at modelling urban fire occurrences while making a comparative analysis of different strategies to account for spatial autocorrelation. In addition, we intend to identify factors that explain the relationship between fire events and the urban pattern. Ultimately, we seek to map the probability of urban fire occurrence in Portugal.
To the best of our knowledge, this is the first study to model urban fire incidence using spatial modelling techniques in relation to socio-economic characteristics at a national scale in Portugal. We conclude by suggesting that spatial analytical techniques should be further applied in the main districts, to explore local dynamics and to model the relationship with socio-economic and socio-demographic features using micro-level fire incident data.
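For concreteness, the three channels of spatial autocorrelation mentioned above correspond, in standard spatial econometric notation, to the general nesting specification

\[ \mathbf{y} = \rho \mathbf{W}\mathbf{y} + \mathbf{X}\boldsymbol{\beta} + \mathbf{W}\mathbf{X}\boldsymbol{\theta} + \mathbf{u}, \qquad \mathbf{u} = \lambda \mathbf{W}\mathbf{u} + \boldsymbol{\varepsilon}, \]

where \mathbf{W} is a spatial weights matrix; setting \rho, \boldsymbol{\theta} or \lambda to zero recovers the familiar spatial lag, spatially lagged covariates and spatial error models compared in such analyses (notation ours, given for orientation).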
Accurate diagnosis of disease is of fundamental importance in clinical practice and medical research. Before a medical diagnostic test is routinely used in practice, its ability to distinguish between diseased and nondiseased states must be rigorously assessed. The receiver operating characteristic (ROC) curve is the most popular tool for evaluating the diagnostic accuracy of continuous-outcome tests. It has been acknowledged that several factors (e.g., subject-specific characteristics such as age and/or gender) can affect test outcomes and accuracy beyond disease status. Recently, the covariate-adjusted ROC curve has been proposed and successfully applied as a global summary measure of diagnostic accuracy that takes covariate information into account. In this talk I will motivate the importance of including covariate information, whenever available, in ROC analysis and, in particular, show how the covariate-adjusted ROC curve is an important tool in this context. I will also detail the development of a highly flexible Bayesian method, based on the combination of a Dirichlet process mixture of additive normal models and the Bayesian bootstrap, for conducting inference about the covariate-adjusted ROC curve. Illustrations with simulated and real data will be provided.
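As background, a common definition of the covariate-adjusted ROC curve (following the literature; notation ours) is

\[ \mathrm{AROC}(p) = \Pr\{ 1 - F_{\bar{D}}(Y_D \mid \mathbf{X}_D) \le p \}, \qquad p \in (0,1), \]

where Y_D and \mathbf{X}_D denote the test outcome and covariates of a diseased subject and F_{\bar{D}}(\cdot \mid \mathbf{x}) is the conditional distribution of test outcomes in the nondiseased population, so that \mathrm{AROC}(p) is the probability that a diseased subject's outcome exceeds the covariate-specific threshold yielding a false positive rate of p.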
Extreme U-statistics arise when the kernel of a U-statistic has a high degree but depends on its arguments only through a small number of top order statistics. As the kernel degree grows to infinity with the sample size, estimators built from such statistics form an intermediate family between those constructed in the block maxima and peaks-over-threshold frameworks of extreme value analysis. The asymptotic normality of extreme U-statistics based on location-scale invariant kernels is established. Although the asymptotic variance coincides with that of the Hájek projection, the proof goes beyond considering the first term in Hoeffding's variance decomposition; instead, a growing number of terms needs to be incorporated.
To show the usefulness of extreme U-statistics, we propose a kernel depending on the three highest order statistics leading to an unbiased estimator of the shape parameter of the generalized Pareto distribution. When applied to samples in the max-domain of attraction of an extreme value distribution, the extreme U-statistic based on this kernel produces a location-scale invariant estimator of the extreme value index which is asymptotically normal and whose finite-sample performance is competitive with that of the pseudo-maximum likelihood estimator.
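Formally (our notation), an extreme U-statistic with kernel h of degree m = m_n takes the form

\[ U_n = \binom{n}{m}^{-1} \sum_{1 \le i_1 < \cdots < i_m \le n} h(X_{i_1}, \ldots, X_{i_m}), \]

where h depends on its m arguments only through a fixed small number of their top order statistics (three, in the application above) and m_n \to \infty as n \to \infty.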
In this talk we briefly introduce models accounting for correlation over the space-time domain. We consider recent results for a class of non-separable space-time models and outline a computational implementation, including a simple way to introduce non-stationarity. We then discuss some practical details through a working example based on a new R package dedicated to this work.
In this talk, new types of multivariate EWMA control charts are presented. They are based on the Euclidean distance and on the distance induced by the inverse of the diagonal matrix of the component variances. The design of the proposed control schemes does not involve the computation of the inverse covariance matrix and can therefore be used in the high-dimensional setting. The distributional properties of the control statistics are derived and used to calibrate the new control procedures. In an extensive simulation study, the new approaches are compared with multivariate EWMA control charts based on the Mahalanobis distance.
These results are based on joint work with Rostyslav Bodnar and Taras Bodnar.
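A minimal R sketch of the two inverse-covariance-free control statistics (illustrative notation; in practice the variances and the control limits would come from a Phase I analysis and from the distributional results mentioned above):

    # Multivariate EWMA recursion and the two proposed distances
    # X: n x p matrix of observations, lambda: smoothing constant in (0, 1]
    ewma_stats <- function(X, lambda = 0.1) {
      n <- nrow(X); p <- ncol(X)
      d <- apply(X, 2, var)            # marginal variances (diagonal matrix)
      Z <- matrix(0, n, p)
      z <- rep(0, p)                   # start at the assumed in-control mean
      for (t in seq_len(n)) {
        z <- (1 - lambda) * z + lambda * X[t, ]
        Z[t, ] <- z
      }
      list(euclid = rowSums(Z^2),                       # ||Z_t||^2
           diag   = rowSums(sweep(Z^2, 2, d, "/")))     # Z_t' D^{-1} Z_t
    }

An alarm is raised when the monitored statistic exceeds a control limit calibrated to its in-control distribution.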
Statistical methods play an important role in infectious disease epidemiology. They provide the main set of tools to compute estimates of key epidemiological parameters and to shed light on the transmission dynamics of a pathogen. Markov chain Monte Carlo (MCMC) methods are powerful simulation techniques used to explore the posterior parameter space and carry out inference under the Bayesian paradigm. As MCMC samplers are iterative by design, drawing samples from the target posterior distribution often requires huge computational resources. This computational bottleneck is particularly unwelcome when analysis of epidemic data and estimation of model parameters are required in (near) real time, as is often the case during epidemic outbreaks where massive datasets are updated on a daily basis. We explore the synergy between the Laplace approximation and Bayesian P-splines in epidemic models to deliver a flexible inference methodology with fast and nimble algorithms that outperform MCMC-based approaches from a computational perspective. The so-called “Laplacian-P-splines” method is illustrated in the context of nowcasting (i.e., the real-time assessment of the current epidemic situation, corrected for imperfect data caused by reporting delays) and in the recently proposed EpiLPS framework for estimating the time-varying reproduction number, with applications to SARS-CoV-2 data.
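Schematically (our notation), the approach models a latent curve, e.g. the log of the mean epidemic incidence, with a penalized B-spline expansion,

\[ \log \mu(t) = \sum_{k=1}^{K} \theta_k b_k(t), \qquad \boldsymbol{\theta} \mid \tau \sim \mathcal{N}\big(\mathbf{0}, (\tau \mathbf{Q})^{-}\big), \]

where \mathbf{Q} is a roughness penalty matrix built from finite differences of adjacent spline coefficients, and replaces MCMC exploration of the conditional posterior of \boldsymbol{\theta} by a Laplace approximation: a Gaussian centred at the posterior mode with covariance given by the inverse of the negative Hessian at that mode.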
The growth of the world population and the development of new economies have increased the demand for energy resources on a large scale.
Wind energy is becoming an important source of electricity generation. In an onshore wind farm, the electricity produced by the wind turbines is collected at a substation through electric cables laid in trenches in the ground. In this work we consider the problem of optimizing the design of wind farms, assuming that the locations of the substation and of the wind turbines are known and that a set of electric cable types is available. The problem is formulated as an Integer Linear Programming model, which is then strengthened with different sets of valid inequalities.
In addition, renewable generation sources and storage batteries emerge as options for the development of Smart Grids. Using a Mixed Integer Linear Programming (MILP) optimization model, we study the effect of renewable generation sources and storage systems on an electricity network, with the aim of maximizing the profit of the Virtual Power Plant (VPP).
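Schematically, and purely as an illustration (the sets, symbols and constraints below are our assumptions, not the talk's exact formulation), the cable-layout problem admits flow-based integer formulations of the type

\[ \min \sum_{(i,j) \in A} \sum_{c \in C} k_c\, \ell_{ij}\, x_{ij}^{c} \quad \text{s.t.} \quad \sum_{j} f_{ij} - \sum_{j} f_{ji} = 1 \;\; \forall i \in T, \qquad f_{ij} \le \sum_{c \in C} u_c\, x_{ij}^{c}, \qquad \sum_{c \in C} x_{ij}^{c} \le 1, \]

where the binary variable x_{ij}^c indicates that a cable of type c (cost k_c per unit length, capacity u_c) is installed on arc (i,j) of length \ell_{ij}, and the flow variables f_{ij} route the energy of each turbine in T to the substation; valid inequalities are then added to strengthen the linear relaxation.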
In breeding programmes, the observed genetic change is a sum of the contributions of different groups of individuals. Quantifying these sources of genetic change is essential for identifying the key breeding actions and optimizing breeding programmes. However, it is difficult to disentangle the contributions of individual groups due to the inherent complexity of breeding programmes. Here we extend the previously developed method for partitioning the genetic mean by paths of selection to work with both the mean and the variance of breeding values. We first extended the partitioning method to quantify the contribution of different groups to genetic variance, assuming breeding values are known. Second, we combined the partitioning method with a Markov chain Monte Carlo approach to draw samples from the posterior distribution of breeding values, and used these samples to compute point and interval estimates of the partitions of the genetic mean and variance. We implemented the method in the R package AlphaPart. We demonstrated the method with a simulated cattle breeding programme, showing how to quantify the contribution of different groups of individuals to the genetic mean and variance, and that the contributions of different selection paths to genetic variance are not necessarily independent. Finally, we observed some limitations of the partitioning method under a misspecified model, suggesting the need for a genomic partitioning method. Overall, the method can help breeders and researchers understand the dynamics of the genetic mean and variance in a breeding programme, how different paths of selection interact within it, and how they can be optimised.
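In compact form (our notation), the partitioning rests on the decomposition of the vector of breeding values into Mendelian sampling terms, \mathbf{a} = \mathbf{T}\mathbf{w}, where \mathbf{T} is the lower-triangular matrix that accumulates the Mendelian sampling terms \mathbf{w} of ancestors into descendants. Writing \mathbf{w} = \sum_p \mathbf{D}_p \mathbf{w}, with \mathbf{D}_p a diagonal indicator matrix selecting the terms attributed to path p, gives

\[ \mathbf{a} = \sum_{p} \mathbf{T} \mathbf{D}_p \mathbf{w} = \sum_{p} \mathbf{a}^{(p)}, \]

from which path-specific contributions to the genetic mean and, as developed here, to the genetic variance, including covariances between paths, can be computed.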
We extend extreme value statistics to independent data with possibly very different distributions. In particular, we present novel asymptotic normality results for the Hill estimator, which now estimates the positive extreme value index of the average distribution. Due to the heterogeneity, the asymptotic variance can be substantially smaller than that in the i.i.d. case. As a special case, we consider a heterogeneous scales model where the asymptotic variance can be calculated explicitly. The primary tool for the proofs is the functional central limit theorem for a weighted tail empirical process. A simulation study shows the good finite-sample behavior of our limit theorems. We present an application to assess the tail heaviness of earthquake energies. This is joint work with Yi He (Univ. of Amsterdam).
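For reference, the Hill estimator at an intermediate level k, which in this heterogeneous setting targets the positive extreme value index of the average distribution, is

\[ \hat{\gamma}_H(k) = \frac{1}{k} \sum_{i=1}^{k} \log \frac{X_{(n-i+1)}}{X_{(n-k)}}, \]

or, as a minimal R sketch,

    # Hill estimator based on the k largest observations
    hill <- function(x, k) {
      xs <- sort(x, decreasing = TRUE)
      mean(log(xs[1:k])) - log(xs[k + 1])
    }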