DOCUMENT DE TREBALL XREAP2010-4

Modelling dependence in a ratemaking procedure with multivariate Poisson regression models∗

Lluís Bermúdez a,† (RFA-IREA) & Dimitris Karlis b

April 6, 2010

a University of Barcelona, Spain
b Athens University of Economics and Business, Greece

Abstract

When actuaries face the problem of pricing an insurance contract that contains different types of coverage, such as a motor insurance or homeowner's insurance policy, they usually assume that the types of claim are independent. However, this assumption may not be realistic: several studies have shown that there is a positive correlation between types of claim. Here we introduce different multivariate Poisson regression models in order to relax the independence assumption, including zero-inflated models to account for excess of zeros and overdispersion. These models have been largely ignored to date, mainly because of their computational difficulties. Bayesian inference based on MCMC helps to solve this problem (and also lets us derive, for several quantities of interest, posterior summaries to account for uncertainty). Finally, these models are applied to an automobile insurance claims database with three different types of claims. We analyse the consequences for pure and loaded premiums when the independence assumption is relaxed by using different multivariate Poisson regression models and their zero-inflated versions.

JEL classification: C51; IM classification: IM11; IB classification: IB40.

Keywords: Multivariate Poisson regression models, Zero-inflated models, Automobile insurance, MCMC inference, Gibbs sampling.

∗ Acknowledgements. The first author wishes to acknowledge discussions with researchers at RFA-IREA at the University of Barcelona and the support of the Spanish Ministry of Education and FEDER grant SEJ 2007-63298.
† Corresponding Author.
Departament de Matemàtica Econòmica, Financera i Actuarial, Universitat de Barcelona, Diagonal 690, 08034-Barcelona, Spain. Tel.: +34-93-4034854; fax: +34-93-4034892; e-mail: lbermudez@ub.edu

1 Introduction

Automobile insurance aims to cover the different types of claims incurred as a result of traffic accidents. In most developed countries motor insurance is compulsory for driving a motor vehicle on public roads. The extent of the obligation varies greatly between jurisdictions, but essentially the aim of compulsory motor insurance for all vehicle owners is to cover damage to third parties. This coverage is usually termed third-party liability coverage and provides financial compensation to cover any injuries caused to other people or their property.

Apart from this liability coverage, motor insurance can also cover the insured party (vehicle damage and personal injury). Property coverage or first-party coverage provides different levels of protection depending on the policy the insured purchases. Car owners may take out comprehensive coverage (damage to the vehicle caused by any unknown party, for example, damage resulting from theft, flood or fire), collision coverage (damage resulting from a collision with another vehicle or object when the policyholder is at fault), or a set of basic guarantees such as emergency roadside assistance, legal assistance or insurance covering medical costs.

Pricing is especially complicated in motor insurance because portfolios are heterogeneous and policies cover different risks. One way to handle this heterogeneity is to segment the portfolio into homogeneous classes so that all policyholders belonging to the same class pay the same premium. To achieve this, an a priori ratemaking based on generalized linear models (GLM) is usually accepted. A thorough review of ratemaking systems for motor insurance, when modelling claim count data, can be found in Denuit et al. (2007).
In the usual ratemaking procedure, the number of claims incurred is modelled with Poisson regression models, and the expected number of claims (the pure premium, assuming the amount of the expected claim equals one monetary unit) is obtained for each class of guarantee as a function of different factors. Then, assuming independence between types of claims, the total motor insurance premium is obtained as the sum of the expected number of claims of each guarantee. This procedure presents at least three important limitations.

First, not all factors influencing risk can be identified, measured and introduced in the a priori tariff system, and hence the tariff classes may be quite heterogeneous. To correct for this unobserved heterogeneity an a posteriori tariff (or bonus-malus system) can be used, by fitting an individual premium based on the claims experience of each insured party. There is a large literature on bonus-malus systems (see Denuit et al., 2007). Another way to handle unobserved heterogeneity is to introduce a random effect into the model (Cameron and Trivedi, 1998 and Boucher and Denuit, 2006).

Second, unobserved heterogeneity and serial dependence (when the data consist of repeated observations regarding the same policyholder) will often lead to overdispersion (variance greater than mean), which cannot be fully remedied by Poisson regression models. Failing to account for overdispersion may increase the number of factors considered significant by artificially increasing their level of significance. To account for overdispersion, some generalizations of the model have been considered (see e.g. zero-inflated models as in Boucher et al., 2007).

Finally, it remains to be established whether the independence assumption between types of claims is realistic. This question is not widely discussed in the actuarial literature. When this assumption is relaxed, it is interesting to see how the tariff system is affected.
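As a minimal illustration of the usual procedure under independence described above, the following Python sketch computes the expected number of claims for each guarantee from a log-linear Poisson regression and adds them up to obtain the pure premium. The covariates, the coefficient values and the guarantee labels are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math

def expected_claims(x, beta):
    """Poisson regression mean with log link: exp(x . beta)."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

# Hypothetical policyholder: intercept, young-driver flag, urban-area flag.
x = [1.0, 1.0, 0.0]

# Hypothetical regression coefficients, one vector per type of guarantee.
betas = {
    "third_party": [-2.5, 0.4, 0.2],
    "collision": [-3.0, 0.3, 0.1],
    "other": [-2.0, 0.1, 0.3],
}

# Expected number of claims per guarantee (claim amount = one monetary unit).
rates = {name: expected_claims(x, beta) for name, beta in betas.items()}

# Under independence the total pure premium is the sum of the three means.
pure_premium = sum(rates.values())
print(rates, pure_premium)
```

Note that the sum of the means does not depend on the dependence structure; what changes when independence is relaxed is the variance of the total number of claims, and hence any loading based on it.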
In Frees and Valdez (2008) and Frees et al. (2009) a hierarchical statistical model is fitted using microlevel data. In Bermúdez (2009), the interpretation of a number of bivariate Poisson models was illustrated in the context of motor insurance claims, and the conclusion was that using a bivariate Poisson model leads to an a priori ratemaking that presents larger variances and, hence, larger loadings than those obtained under the independence assumption. In that study, only two types of claim were considered: claims for third-party liability and claims for the rest of guarantees. Obviously, this is a limitation that other multivariate count data models can overcome: for instance, we could divide claims for third-party liability into vehicle damage and personal injury claims, or distinguish between motor collision coverage and the rest of guarantees. In the present paper we deal with this kind of extension.

Here we introduce different multivariate Poisson regression models in order to relax the independence assumption when pricing several guarantees simultaneously in automobile insurance. Creating multivariate Poisson models is not easy, as many different models can be obtained. In the present paper we use two such models and their zero-inflated variants (to account for the excess of zeros observed in automobile databases, see e.g. Boucher et al., 2007 and Bermúdez, 2009). The first one, which we call the "common covariance model", has been defined in Tsionas (2001) and the second one, the "full covariance model", in Karlis and Meligkotsidou (2005). In addition, here we extend these models with their zero-inflated variants. It is important to realize that zero inflation also introduces overdispersion in the marginal distributions. Hence, zero-inflated models can introduce improvements in several aspects of the data. Multivariate zero-inflated models are well known for claim count data; see for example Boucher and Denuit (2008) for a credibility application.
Our approach differs from that paper in that we model dependence between different types of claims rather than panel data, i.e. one type of claim observed over different time periods. Moreover, they focus on a posteriori premiums, whereas we use these models in an a priori ratemaking procedure.

Finally, we use a Bayesian approach for fitting the models, which offers some advantages. It facilitates estimation for such complicated models and, at the same time, allows us to derive posterior quantities of interest not as simple point estimates but together with their posterior distributions, providing more insight and a better understanding for correct ratemaking. To our knowledge, the derived MCMC scheme for multivariate zero-inflated Poisson models is novel.

The article is organized as follows. First, in Section 2 we introduce several multivariate Poisson regression models. In Section 3 we discuss the Bayesian methodology used to fit the statistical model to the data. In Section 4 the database from a Spanish insurance company is described. In Section 5 the results are summarized. Finally, we provide concluding remarks in Section 6.

2 Multivariate Poisson regression models

Let us consider a policyholder with $N_1$ the number of claims for motor third-party liability coverage, $N_2$ the number of claims for motor collision coverage, $N_3$ the number of claims for the rest of motor guarantees and $N = N_1 + N_2 + N_3$ the total number of claims during one year.

Our aim is to analyze different multivariate Poisson models as a way to relax the independence assumption between types of claims when a ratemaking procedure is developed. First, we analyze a simple multivariate Poisson model with a common covariance parameter (Johnson et al., 1997, Tsionas, 2001). Second, we study a multivariate Poisson model with full covariance following the model introduced by Karlis and Meligkotsidou (2005).
Finally, we consider zero-inflated versions of these models to account for the excess of zero claims and the overdispersion typically observed in such datasets.

2.1 A model with common covariance

The first model is based on a simple multivariate reduction. Namely, we assume that

\[
N_1 = Y_1 + Y_0, \qquad N_2 = Y_2 + Y_0, \qquad N_3 = Y_3 + Y_0,
\]

where $Y_i \sim Po(\theta_i)$, $i \in \{0, 1, 2, 3\}$, $\theta_i > 0$. Then each $N_i$, $i \in \{1, 2, 3\}$, marginally follows a Poisson distribution with parameter $\theta_i + \theta_0$. The parameter $\theta_0$ is a common covariance parameter which measures the covariance of each pair. The covariance matrix is

\[
\mathrm{Cov}(N) =
\begin{pmatrix}
\theta_1 + \theta_0 & \theta_0 & \theta_0 \\
\theta_0 & \theta_2 + \theta_0 & \theta_0 \\
\theta_0 & \theta_0 & \theta_3 + \theta_0
\end{pmatrix}.
\]

The joint probability function of the vector $N$ is given by

\[
P(n_1, n_2, n_3) = \exp(-\theta) \sum_{k=0}^{s}
\frac{\theta_0^{k}\, \theta_1^{n_1-k}\, \theta_2^{n_2-k}\, \theta_3^{n_3-k}}
{k!\, (n_1-k)!\, (n_2-k)!\, (n_3-k)!}, \tag{1}
\]

where $s = \min\{n_1, n_2, n_3\}$ and $\theta = \theta_1 + \theta_2 + \theta_3 + \theta_0$. We will denote the above distribution as $MP_1(\theta_1, \theta_2, \theta_3, \theta_0)$.

Let us assume that $N_{1q}$, $N_{2q}$ and $N_{3q}$ denote respectively the random variables indicating the number of claims of each type of guarantee for the $q$th policyholder. We may allow for covariates by considering that $\log(\theta_{iq}) = x_{iq}\beta_i$, where $x_{iq}$ is a vector of explanatory variables and $\beta_i$ denotes the corresponding vector of regression coefficients. Note that different covariates can be used to model each parameter $\theta_i$, $i = 1, 2, 3$. In general we may also use covariates for $\theta_0$, but this would make the interpretation much more difficult. If covariates are introduced to model $\theta_1$, $\theta_2$ and $\theta_3$, a multivariate Poisson regression model can be defined with the following scheme (for more details see Tsionas, 2001):

\[
\begin{aligned}
(N_{1q}, N_{2q}, N_{3q}) &\sim MP_1(\theta_{1q}, \theta_{2q}, \theta_{3q}, \theta_{0q}), \\
\log(\theta_{1q}) &= x_{1q}\beta_1, \\
\log(\theta_{2q}) &= x_{2q}\beta_2, \\
\log(\theta_{3q}) &= x_{3q}\beta_3.
\end{aligned} \tag{2}
\]

The limitations of this model are that it assumes a common covariance for each pair; it allows only for positive covariance (correlation); and the marginal distributions are Poisson, so we cannot model over(under)dispersion.
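The joint probability function of the common covariance model can be checked numerically. The following is a minimal Python sketch, assuming hypothetical parameter values, that evaluates the finite sum over $k$ and verifies on a truncated grid that the probabilities add up to one, that the marginal mean of $N_1$ is $\theta_1 + \theta_0$ and that the covariance of $N_1$ and $N_2$ is $\theta_0$. The function name `mp1_pmf` and the grid size are illustrative choices, not part of the paper.

```python
import math

def mp1_pmf(n1, n2, n3, t1, t2, t3, t0):
    """Joint pmf of (N1, N2, N3) with N_i = Y_i + Y_0, Y_i ~ Po(t_i)."""
    total = 0.0
    for k in range(min(n1, n2, n3) + 1):  # k is the value taken by Y_0
        term = t0 ** k / math.factorial(k)
        for n, t in ((n1, t1), (n2, t2), (n3, t3)):
            term *= t ** (n - k) / math.factorial(n - k)
        total += term
    return math.exp(-(t1 + t2 + t3 + t0)) * total

# Hypothetical values for (theta1, theta2, theta3, theta0).
theta = (0.3, 0.2, 0.4, 0.1)

# Truncated grid; the neglected tail mass is negligible for rates this small.
grid = range(16)
probs = {(a, b, c): mp1_pmf(a, b, c, *theta)
         for a in grid for b in grid for c in grid}

total_mass = sum(probs.values())
mean1 = sum(a * p for (a, _, _), p in probs.items())
mean2 = sum(b * p for (_, b, _), p in probs.items())
cov12 = sum(a * b * p for (a, b, _), p in probs.items()) - mean1 * mean2

print(total_mass, mean1, cov12)
```

Direct evaluation of the sum is adequate for the small claim counts typical of motor insurance; for large counts a recursive evaluation would be preferable.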
There are some other models that allow for negative correlation (see van Ophem, 1999; Chib and Winkelmann, 2001; Berkhout and Plug, 2004; Karlis and Meligkotsidou, 2007; Nikoloulopoulos and Karlis, 2009), but they are much more complicated and require a special effort for parameter estimation. In the context of automobile insurance, it is not necessary to consider negative correlation for these types of claims. However, in the next sections we consider a more complex model that allows a different covariance for each pair of variables, and zero-inflated models to deal with overdispersion, which has often been observed when modelling claim counts in automobile insurance data (Dean, 1992).

2.2 A model with full covariance

In order to extend the previous model and allow the covariance structure of the data to be modelled in a flexible way, we consider the trivariate Poisson model with full two-way covariance structure:

\[
\begin{aligned}
N_1 &= Y_1 + Y_{12} + Y_{13}, \\
N_2 &= Y_2 + Y_{12} + Y_{23}, \\
N_3 &= Y_3 + Y_{13} + Y_{23},
\end{aligned} \tag{3}
\]

where $Y_i \sim Po(\mu_i)$, $i \in \{1, 2, 3\}$, and $Y_{ij} \sim Po(\theta_{ij})$, $i, j \in \{1, 2, 3\}$, $i < j$, with $\mu_i, \theta_{ij} > 0$. Then each $N_i$, $i \in \{1, 2, 3\}$, marginally follows a Poisson distribution with parameter $\mu_i + \theta_{ij} + \theta_{ik}$, $i, j, k \in \{1, 2, 3\}$, $i \neq j \neq k$. Now, the random variables $N_1, N_2, N_3$ jointly follow a trivariate Poisson distribution with parameter $\theta = (\mu_1, \mu_2, \mu_3, \theta_{12}, \theta_{13}, \theta_{23})$. The means of the random variables are $\mu_1 + \theta_{12} + \theta_{13}$, $\mu_2 + \theta_{12} + \theta_{23}$ and $\mu_3 + \theta_{13} + \theta_{23}$ respectively, and their variance-covariance matrix is given by

\[
\mathrm{Cov}(N) =
\begin{pmatrix}
\mu_1 + \theta_{12} + \theta_{13} & \theta_{12} & \theta_{13} \\
\theta_{12} & \mu_2 + \theta_{12} + \theta_{23} & \theta_{23} \\
\theta_{13} & \theta_{23} & \mu_3 + \theta_{13} + \theta_{23}
\end{pmatrix}.
\]

The parameters $\theta_{ij}$, $i, j = 1, 2, 3$, $i \neq j$, have a straightforward interpretation as the covariances between the variables $N_i$ and $N_j$ and, thus, we refer to them as the covariance parameters. The parameters $\mu_i$, $i = 1, 2, 3$, appear only in the marginal means and variances and we refer to them as the mean parameters.
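The multivariate reduction behind the full covariance model lends itself to simulation. The following Python sketch, under hypothetical parameter values, draws the six independent Poisson components, builds $(N_1, N_2, N_3)$ from them and compares the empirical variance and covariances with the entries of the variance-covariance matrix above. The sampler `poisson` uses Knuth's multiplication method, which is an implementation choice for this illustration and is only suitable for small rates.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_full_cov(rng, mu, t12, t13, t23):
    """One draw of (N1, N2, N3) from the trivariate reduction."""
    y1, y2, y3 = (poisson(rng, m) for m in mu)
    y12, y13, y23 = poisson(rng, t12), poisson(rng, t13), poisson(rng, t23)
    return (y1 + y12 + y13, y2 + y12 + y23, y3 + y13 + y23)

def covariance(xs, ys):
    """Empirical covariance (divisor n) of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical mean and covariance parameters.
mu, t12, t13, t23 = (0.3, 0.2, 0.4), 0.15, 0.05, 0.10

rng = random.Random(2010)  # fixed seed so the run is reproducible
draws = [sample_full_cov(rng, mu, t12, t13, t23) for _ in range(50000)]
n1, n2, n3 = zip(*draws)

empirical = {
    "var1": covariance(n1, n1),   # theory: mu1 + t12 + t13 = 0.50
    "cov12": covariance(n1, n2),  # theory: t12 = 0.15
    "cov13": covariance(n1, n3),  # theory: t13 = 0.05
    "cov23": covariance(n2, n3),  # theory: t23 = 0.10
}
print(empirical)
```

Each pair of claim types shares exactly one latent component, which is why each pairwise covariance is governed by its own parameter.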
It is clear that this model is more flexible for real applications than the one with common covariance. For example, if the data refer to the number of claims for different coverages of an automobile insurance policy, it is natural to assume that each pair of coverages has a different covariance, owing to the intrinsic nature of these coverages, instead of assuming that all pairs have the same covariance.

Again, in order to extend the applicability of the model, we may assume that the parameters $\theta_i$ (including both the mean and the covariance parameters) are functions of explanatory variables. Therefore we may add covariates by assuming that $\log(\theta_{iq}) = x_{iq}\beta_i$, where $x_{iq}$ is a vector of explanatory variables and $\beta_i$ denotes the corresponding vector of regression coefficients. To make the model easier to interpret, we consider covariates only for the mean parameters $\mu_i$, $i = 1, 2, 3$. While covariates can also be added to the covariance parameters, this again would make the interpretation of the model very difficult and so we do not consider them here. Finally, note that the covariates associated with each parameter may be different.

The joint probability function (jpf) is given by

\[
P(n_1, n_2, n_3) = \exp(-\theta)
\sum_{k_1=0}^{s_1} \sum_{k_2=0}^{s_2} \sum_{k_3=0}^{s_3}
\frac{\theta_{12}^{k_1}\, \theta_{13}^{k_2}\, \theta_{23}^{k_3}\,
\mu_1^{(n_1-k_1-k_2)}\, \mu_2^{(n_2-k_1-k_3)}\, \mu_3^{(n_3-k_2-k_3)}}
{k_1!\, k_2!\, k_3!\, (n_1-k_1-k_2)!\, (n_2-k_1-k_3)!\, (n_3-k_2-k_3)!},
\]

where $s_1 = \min\{n_1, n_2\}$, $s_2 = \min\{n_1 - k_1, n_3\}$, $s_3 = \min\{n_2 - k_1, n_3 - k_2\}$ and $\theta = \mu_1 + \mu_2 + \mu_3 + \theta_{12} + \theta_{13} + \theta_{23}$. Here $k_1$, $k_2$ and $k_3$ are the values taken by $Y_{12}$, $Y_{13}$ and $Y_{23}$ respectively. We will denote the above distribution as $MP_2(\mu_1, \mu_2, \mu_3, \theta_{12}, \theta_{13}, \theta_{23})$. Note that this model allows for different covariances between different pairs, making the model more realistic at the cost of having two additional parameters to estimate.

The jpf is quite complicated as it involves successive summations. One may improve it by deriving a recurrence relationship between the probabilities, i.e.
by calculating probabilities based on ones that have already been calculated. This reduces the computational burden by avoiding excessive summation and reducing error accumulation. On the other hand, the data augmentation offered by the multivariate reduction makes Bayesian methods appealing. More details on the model can be found in Karlis and Meligkotsidou (2005).

2.3 Zero-inflated models

The multivariate Poisson models treated above have Poisson marginal distributions and thus they cannot model overdispersion. A certain amount of overdispersion can be introduced by considering inflated versions of multivariate Poisson regression models, such as the models described in Karlis and Ntzoufras (2003, 2005) and those used in Bermúdez (2009) in the automobile insurance context for the bivariate case.

In the univariate case, zero-inflated models are well understood as models that account for the excess of zeros observed in certain circumstances. In the multivariate case, inflation can occur in different patterns. A particularly interesting case in practice is when the $(0, 0, \ldots, 0)$ cell occurs more often than the assumed model would predict. Multivariate zero-inflated models have attracted much less interest than univariate and bivariate inflated models (see, e.g. Li et al., 1999). For an actuarial application see Boucher and Denuit (2008).

We propose zero-inflated versions of the previous models with the following form:

\[
P_{ZI}(n_1, n_2, n_3) =
\begin{cases}
p + (1 - p)\, P(0, 0, 0) & \text{if } n_1 = n_2 = n_3 = 0, \\
(1 - p)\, P(n_1, n_2, n_3) & \text{otherwise,}
\end{cases}
\]

i.e. the model moves probability from the other cells to the $(0, 0, 0)$ cell. A natural interpretation is that most clients never report an accident, and thus the number of zeros is larger than would be expected under a Poisson model. Note that one may define more complicated models by assuming other kinds of inflation. Moreover, one may add covariates to $p$, implying that inflation depends on external factors. We will not pursue this here.
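The zero-inflated construction can be sketched in Python on top of the full covariance model. The block below is a minimal illustration under stated assumptions: the parameter values and the inflation probability are hypothetical, the function names are illustrative, and in the triple sum the indices `k1`, `k2`, `k3` are taken to be the values of $Y_{12}$, $Y_{13}$, $Y_{23}$, an indexing convention assumed here so that the summation limits match the multivariate reduction. It checks that the inflated probabilities still add up to one and that the marginal variance of $N_1$ exceeds its mean.

```python
import math

def term(lam, k):
    """Unnormalised Poisson weight lam^k / k!."""
    return lam ** k / math.factorial(k)

def mp2_pmf(n1, n2, n3, mu1, mu2, mu3, t12, t13, t23):
    """Joint pmf of the trivariate Poisson model with full covariance."""
    total = 0.0
    for k1 in range(min(n1, n2) + 1):                    # value of Y12
        for k2 in range(min(n1 - k1, n3) + 1):           # value of Y13
            for k3 in range(min(n2 - k1, n3 - k2) + 1):  # value of Y23
                total += (term(t12, k1) * term(t13, k2) * term(t23, k3)
                          * term(mu1, n1 - k1 - k2)
                          * term(mu2, n2 - k1 - k3)
                          * term(mu3, n3 - k2 - k3))
    return math.exp(-(mu1 + mu2 + mu3 + t12 + t13 + t23)) * total

def zi_pmf(n, p, base_pmf):
    """Zero-inflated pmf: extra mass p on the (0, 0, 0) cell."""
    extra = p if n == (0, 0, 0) else 0.0
    return extra + (1.0 - p) * base_pmf(*n)

# Hypothetical parameters (mu1, mu2, mu3, t12, t13, t23) and inflation p.
params = (0.3, 0.2, 0.4, 0.15, 0.05, 0.10)
p = 0.2

def base(a, b, c):
    return mp2_pmf(a, b, c, *params)

# Truncated grid; the neglected tail mass is negligible for these rates.
grid = range(13)
zi = {(a, b, c): zi_pmf((a, b, c), p, base)
      for a in grid for b in grid for c in grid}

total_mass = sum(zi.values())
mean1 = sum(a * q for (a, _, _), q in zi.items())
var1 = sum(a * a * q for (a, _, _), q in zi.items()) - mean1 ** 2
print(total_mass, zi[(0, 0, 0)], mean1, var1)
```

With these values the marginal mean of $N_1$ is 0.4 while its variance is 0.44, so the inflated marginal is overdispersed even though the underlying model has Poisson marginals.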
It is important to note that zero inflation introduces overdispersion in the marginal distributions. One can easily see that the marginal distributions are no longer simple Poisson distributions but zero-inflated versions. It is well known (see, e.g. Böhning et al., 1999) that zero-inflated Poisson models are overdispersed relative to simple Poisson models. In the bivariate (and multivariate) case it has been shown that the covariance also increases (see Wang et al., 2003 and Karlis and Ntzoufras, 2005). Hence, inflated models can introduce improvements in several aspects of the data.

2.4 Moments

For the analysis presented in the following sections, some moments and covariances of the four models presented here need to be calculated. Tables 1 and 2 contain the values for the marginal expectations and variances, as well as the covariances (for ease of exposition we present the general form for $N_i$ for the common covariance models, but for the full covariance model we present it with specific variates $N_1$ and $N_2$ in order to diminish the notational burden; of course

\[
\begin{array}{l}
\textbf{Common Covariance} \\
E(N_i) = V(N_i) = \theta_i + \theta_0 \\
Cov(N_i, N_j) = \theta_0 \\
E(N) = \sum_{i=1}^{3} \theta_i + 3\theta_0 \\
V(N) = \sum_{i=1}^{3} \theta_i + 9\theta_0 \\[6pt]
\textbf{Full Covariance} \\
E(N_1) = V(N_1) = \mu_1 + \theta_{12} + \theta_{13} \\
Cov(N_i, N_j) = \theta_{ij} \\
E(N) = \sum_{i=1}^{3} \mu_i + 2(\theta_{12} + \theta_{13} + \theta_{23}) \\
V(N) = \sum_{i=1}^{3} \mu_i + 4(\theta_{12} + \theta_{13} + \theta_{23})
\end{array}
\]

Table 1: Expectations and variances for CC and FC models.

\[
\begin{array}{l}
\textbf{Z-I Common Covariance} \\
E(N_i) = (1 - p)(\theta_i + \theta_0) \\
V(N_i) = (1 - p)\left\{ (\theta_i + \theta_0) + p(\theta_i + \theta_0)^2 \right\} \\
Cov(N_i, N_j) = (1 - p)\left\{ \theta_0 + (\theta_i + \theta_0)(\theta_j + \theta_0) \right\} - (1 - p)^2 (\theta_i + \theta_0)(\theta_j + \theta_0) \\
E(N) = (1 - p)\left( \sum_{i=1}^{3} \theta_i + 3\theta_0 \right) \\
V(N) = \sum_{i=1}^{3} V(N_i) + \sum_{i,j,\, i \neq j} Cov(N_i, N_j) \\[6pt]
\textbf{Z-I Full Covariance} \\
E(N_1) = (1 - p)(\mu_1 + \theta_{12} + \theta_{13}) \\
V(N_1) = (1 - p)\left\{ (\mu_1 + \theta_{12} + \theta_{13}) + p(\mu_1 + \theta_{12} + \theta_{13})^2 \right\} \\
Cov(N_1, N_2) = (1 - p)\left\{ \theta_{12} + (\mu_1 + \theta_{12} + \theta_{13})(\mu_2 + \theta_{12} + \theta_{23}) \right\} - (1 - p)^2 (\mu_1 + \theta_{12} + \theta_{13})(\mu_2 + \theta_{12} + \theta_{23}) \\
E(N) = (1 - p)\left( \sum_{i=1}^{3} \mu_i + 2(\theta_{12} + \theta_{13} + \theta_{23}) \right) \\
V(N) = \sum_{i=1}^{3} V(N_i) + \sum_{i,j,\, i \neq j} Cov(N_i, N_j)
\end{array}
\]