 Research
 Open access
 Published:
Analysis of longitudinal semicontinuous data using marginalized twopart model
Journal of Translational Medicine volume 16, Article number: 301 (2018)
Abstract
Background
Connective tissue growth factor (CTGF), is a secreted matricellular factor that has been linked to increased risk of cardiovascular disease in diabetic subjects. Despite the biological role of CTGF in diabetes, it still remains unclear how CTGF expression is regulated. In this study, we aim to identify the clinical parameters that modulate plasma CTGF levels measured longitudinally in type 1 diabetic patients over a period of 10 years. A number of patients had negligible measured values of plasma CTGF that formed a point mass at zero, whereas others had high positive values of CTGF that were measured on a continuous scale. The observed combination of excessive zero and continuous positively distributed nonzero values in the CTGF outcome is referred to as semicontinuous data.
Methods
We propose a novel application of a marginalized twopart model (mTP) extended to accommodate longitudinal semicontinuous data in which the marginal mean is expressed in terms of the covariates and estimates of their effect on the mean responses are generated. The continuous component is assumed to follow distributions that stem from the generalized gamma family whereas the binary measure is analyzed using logistic model and both have correlated random effects. Other approaches including the one and twopart with uncorrelated and correlated random effects models were also applied and their estimates were all compared.
Results
Our results using the mTP model identified intensive glucose control treatment and smoking as clinical factors that were associated with decreased and increased odds of observing nonzero CTGF values respectively. In addition, hemoglobin A1c, systolic blood pressure, and high density lipoprotein were all shown to be significant risk factors that contribute to increasing CTGF levels. These findings were consistently observed under the mTP model but varied with the distributions for the other models. Accuracy and precision of the mTP model was further validated using simulation studies.
Conclusion
The mTP model identified new clinical determinants that modulate the levels of CTGF in diabetic subjects. Applicability of this approach can be extended to other biomarkers measured in patient populations that display a combination of negligible zero and nonzero values.
Background
Diabetes mellitus is a progressive disease of the vasculature, leading to increased risk of both microvascular complications such as diabetic nephropathy (DN) and retinopathy (DR), and cardiovascular disease (CVD), including myocardial infarction and stroke [1, 2]. Emerging evidence points to a mechanistic link between microvascular complications such as DN and DR and increased risk of cardiovascular disease [3,4,5]. Since early pathologic events are similar within small and large vessels, it is postulated that common risk markers and mechanisms that initiate and promote vascular damage are involved. One such factor that has been identified as a pathogenic risk determinant for the development of microvascular and cardiovascular complications is connective tissue growth factor (CTGF). CTGF is a secreted matricellular potent chemotactic and extracellular matrixinducing factor that has been implicated in progression of inflammatory and fibroproliferative disorders [6]. Plasma CTGF levels were independently associated with hypertension, increased albumin excretion rate, increased carotid intimamedia thickness, hemoglobin A1c (HbA1c) and circulating levels of lipoproteins [7]. Plasma CTGF was also linked to increased risk of cardiovascular events and mortality in patients with atherosclerotic disease and was associated with plaque stabilization following stroke [8, 9]. Moreover, plasma CTGF levels were shown to predict myocardial infraction in type 2 diabetic subjects [10]. Taken together, these studies suggest that CTGF may have substantial value both as a pathogenic risk marker of inflammationinduced tissue injury and as a therapeutic target.
Despite that the divergent biological effects of CTGF on the vasculature was established, it still remains unclear how CTGF expression is regulated. To gain insights into the factors that modulate plasma CTGF levels, circulating levels of CTGF were measured longitudinally in type 1 diabetic patients over a period of 10 years. Our results indicated that a number of patients had negligible measured values of plasma CTGF that formed a point mass at zero, and other patients had high values of CTGF that were measured on a continuous scale. The combination of excessive zero and continuous positively distributed nonzero values in the CTGF outcome observed in our study is referred to as semicontinuous data [11]. The cause behind the semicontinuous data of plasma CTGF may be attributed to factors or clinical covariates that either promote expression and release of CTGF and/or inhibition of CTGF in diabetic subjects. Hence, it is important to identify the clinical factors that associate with the odds of having detected nonzero CTGF values as well as determining the factors that correlate with CTGF levels. This clinical problem motivated our research work in which we present different models for analysis of the semicontinuous CTGF data considered in this manuscript.
Semicontinuous data is given special attention in the literature due to its widespread occurrence under different settings, and the importance of its appropriate analysis in order to obtain accurate estimates and inferences [12]. Given the mixture of zero and nonzero values, it was intuitive to view the semicontinuous outcome as arising from two different stochastic processes. One process, referred to as the binary part, indicates if the outcome is zero or not, and the second referred to as the continuous part, determines the positive values conditional on the outcome being nonzero. Semicontinuous data are typically analyzed using twopart models wherein the zero process and the continuous values are modeled separately using logistic regression for the binary part and lognormal for the continuous part to ensure prediction of positive values [12]. To analyze longitudinal semicontinuous data, two frameworks were proposed, the twopart mixed models with either correlated or noncorrelated random effects in both parts [11,12,13,14,15] and the other is based on the twopart marginal models [16], in addition to Smith et al. [17] who proposed a Bayesian inferential approach for a marginalized Twopart model with correlated random effects. Interpretation of the covariate estimates depends on the model’s specification. Estimates of the continuous component of the Twopart mixed models are interpreted as having conditional effect on the population average given that the outcome values are positive and nonzeros. However, parameter estimate of a covariate in the continuous part of twopart marginal models, is interpreted as having subjectspecific and population average multiplicative effect on the population marginal mean if the corresponding covariate is not a random effect. If intercept is the only parameters included as a random effect in the specification of the overall mean, then all covariates will have a multiplicative effect on the population mean. Traditional approaches such as zero inflated Poisson and zero inflated negative binomial models are mainly implemented to address data with zero mass discrete count outcome and crosssectional data. Available approaches for semicontinuous data are known to be computationally intensive and sometimes not feasible to implement [15]. Some of the available models involve complex and intractable integration of high dimensional integration over the stochastic processes in the marginal likelihood function rendering them difficult to implement [12]. In this manuscript we present a novel application of different approaches to analyze semicontinuous data with the aim of assessing the effect of clinical parameters on the processes of the zero and nonzero values of CTGF. The approaches considered in this manuscript included marginalized twopart that we extend to accommodate longitudinal repeated measures data, twopart correlated and uncorrelated random effects, and onepart models. These models are advantageous in terms of feasibility of implementation by using available statistical procedures such as SAS proc NLMIXED used for mTP and TP with correlated random effects, and SAS Proc Glimmix for TP with uncorrelated random effects compared to the other available approaches which are complex and computationally intensive [12, 15] and require EM algorithms or Bayesian approaches to be implemented. Another advantage of the mTP when modelled for cross sectional data is that its estimates were consistent and unbiased [18].
To our knowledge, all studies that centered on CTGF as an outcome applied conventional statistical methods that ignore the zero part and just analyze the nonzero continuous part. Ignoring the zeros will not allow for determining the factors that affect the zero values of CTGF, and information from the zero component will not contribute to the likelihood function, which introduces bias to the estimation process.
The Marginalized twopart model implemented here is an extension of a previous work for cross sectional studies by Voronca et al. [18] that we developed to analyze longitudinal semicontinuous data. This extension of the mTP model to longitudinal data with repeated measures imposes an added level of complexity due to the inclusion of random effects that account for the withinsubject correlation. In addition, this extended model incorporates higher dimensional variance–covariance matrix that accounts for the correlations between the random effects of the zero and nonzero processes. In specific, the withinsubject correlation is accounted for by including correlated random effects in the binary and continuous parts. Random intercepts in both parts are jointly modeled in a marginalized likelihood function integrated over the random effects. Intercepts are the only random effect included in the overall specification of the mean. The marginal mean is parameterized directly in terms of regression coefficients using both zero and nonzero values and direct interpretation of the covariate effects on the marginal mean can be drawn for the entire population and not conditional on the positive values. The generalized gamma family of distributions known for its flexibility to account for different types of data was incorporated in our longitudinal model. In addition, we considered the special cases of the generalized gamma that include standard gamma, Weibull, and lognormal. Generalized gamma distribution is defined by parameters for the shape and scale that give it flexibility and appropriateness to fit datasets with different skewness and asymmetry.
In the model’s section, we start first by describing the generalized gamma family of distributions, and the twopart models for longitudinal data with uncorrelated and correlated random effects. We then describe the marginalized twopart model with generalized gamma and other distributions that stem from this family, and the onepart model that analyzes the whole data with zeros and nonzeros. These models were applied to a cohort of diabetic patients with CTGF measures as the outcome of interest that motivated this study.
Study population
Plasma CTGF levels were measured on 693 subjects from the Diabetes Control and Complications Trial (DCCT)cohort of type 1 diabetes [19]. The patients enrolled in the DCCT study between 1983 and 1989 and half of the subject population was randomly assigned to conventional diabetes treatment and the other half was assigned to intensive diabetes treatment. In 1993, the DCCT study was stopped when intensive treatment was clearly shown to reduce the risks of microvascular complications [20]. The DCCT study was approved by the Institutional Review Boards of all participating DCCT centers and all participants provided written informed consent. Clinical factors such as blood pressure, HbA1c, lipoprotein, duration of diabetes, and demographic factors such as age, gender, and smoking were all collected on these patients and were used as covariates in our analysis to assess their effects on CTGF levels.
CTGF measurement
Plasma CTGF levels were measured longitudinally at baseline [study entry (1983–1989)], midpoint of DCCT (1988–1991) and end of DCCT (1993) with a sandwich ELISA that detects both intact CTGF, and cleaved CTGF to release the Nfragment of CTGF (N + WCTGF assay). The capture antibody is human antihuman CTGFdomain 1 and the detection antibody is mouse antihuman CTGFdomain 2 (FibroGen, San Francisco, CA, USA). Standard curve was prepared with rhCTGF (CTGF expressed in CHO cells and affinity purified with an antiCTGF antibody column, FibroGen, San Francisco, CA USA). Absorbance at 405 nm was acquired on a SpectraMax 340PC spectrophotometer and analyzed with SoftMax Pro 4.8 software (Molecular Devices, Sunnyvale, CA, USA). The total CTGF values of the repeated measures (n = 1985) in all subjects throughout the study are plotted in Fig. 1. The data shows that about 62% (n = 1231) of CTGF levels measured were negligible and close to zero, suggesting that the production and/or release of CTGF into the plasma is inhibited in subjects with zero measured values of CTGF.
The models
Four models were explored and illustrated: (1) twopart model for longitudinal data with uncorrelated random effects, (2) twopart model for longitudinal data with correlated random effects, (3) marginalized twopart model, and (4) onepart model. We first start by describing the generalized gamma and the distributions that stem from this family which are the gamma, Weibull and lognormal distributions. These different distributions were considered for the continuous part of the Twopart models and the marginalized twopart model. As for the onepart model, these distributions were applied on the entire sample that has both the zero values and the continuous part altogether.
Generalized gamma family of distributions
We describe here the modeling framework of the generalized gamma distribution determined by three parameters for the shape and scale. Specifications of these parameters result in certain distributions such as standard gamma, lognormal, and Weibull. Thus, this family of distributions is appropriate to help understand the dependent variable and the process behind generating its values by comparing the model fit for each of the distributions and to select the best estimates using maximum likelihood approach in a regression framework.
The generalized gamma probability density function is specified as such:
where \(\varGamma \left( . \right)\) is the standard gamma function, \(u = sign\left( k \right){{\left( {\log y  \mu } \right)} \mathord{\left/ {\vphantom {{\left( {\log y  \mu } \right)} \sigma }} \right. \kern0pt} \sigma }\), for shape parameter k, location parameter \(\mu > 0\) and scale parameter \(\sigma > 0\) and \(\eta = \left k \right^{  2} > 0\).
The sth moments of the GG distribution are specified as such:
And mean and variance are respectively
where \(C\left( {\sigma ,k} \right) = \frac{{\sigma \log \left( {k^{2} } \right)}}{k} + \log \left[ {\varGamma \left( {\frac{1}{{k^{2} }} + \frac{\sigma }{k}} \right)} \right]  \log \left[ {\varGamma \left( {\frac{1}{{k^{2} }}} \right)} \right]\)and
Specifications of σ and k result in different distributions. When σ = k the gamma distribution is obtained, when k = 1 the Weibull distribution is obtained, and when k → 0 the limiting distribution of the generalized gamma reduces to lognormal distribution.
Twopart model for longitudinal data with uncorrelated random effects
The longitudinal twopart model can be described as such:
where Y_{ij} represents the observation for the positive continuous outcome with a point mass at zero for the ith subject at the jth time point, \(\pi_{ij} = \Pr \left( {Y_{ij} > 0} \right)\) is the probability of being non zero for the ith subject at the jth time point, and \(f\left( {Y_{ij} ;X^{\prime}_{ij} \delta } \right)\) is the density function for the positive values of Y_{ij}. The parameterization of this model is done in two parts that are fit separately:
In part 1 the binary outcome is modeled as
where b_{1i} represents the random effect intercept that accounts for the within subject correlation pertaining to the repeated measures for the same subject in the zero part
Assuming that the log for the g link function, the location parameter μ_{ij} for the continuous component is modeled in the second part as
where b_{2i} represents the random effect intercept that accounts for the within subject correlation pertaining to the repeated measures for the same subject in the continuous part
The two random effect intercepts b_{1i} and b_{2i} in the two process of zero and nonzero are assumed to be independent and uncorrelated. \(Z^{\prime}_{ij}\) is the vector of covariates for the ith subject measured at the jth time point for the binary part and \(X^{\prime}_{ij}\) is the vector of covariates for the ith subject measured at the jth time point used for the continuous part. The two parts might have common covariates or completely different ones. α is the vector of model coefficients corresponding to the binary part and δ is the vector of coefficients corresponding to the continuous part conditional on the values being nonzero.
The marginal mean and variance of Y_{ij} from a TP model can be derived as such:
When GG is assumed in the continuous part, the marginal mean is
The variance of Y_{ij} corresponding to TP can be obtained using the variance formula in Eq. (10) and the sth moments for GG in Eq. (2). C is defined in Eq. (3) and its specification leads to the different distributions that belong to the GG family of distributions. For instance, when C = 0 and σ = k then the GG distribution reduces to the TPgamma distribution model for the continuous part; when \(C\left( \sigma \right) = \log \left[ {\varGamma \left( {1 + \sigma } \right)} \right]\) and k = 1 then the TPWeibull distribution is obtained, and when \(C\left( \sigma \right) = {{\sigma^{2} } \mathord{\left/ {\vphantom {{\sigma^{2} } 2}} \right. \kern0pt} 2}\) and k → 0 then the TPlognormal distribution is obtained.
In the binary part, the estimates of the vector of coefficients α represent population based averages for the whole population for the probability of positive values. When taken on an exponential scale, \(\exp \left( \alpha \right)\) can be interpreted as the odds ratio of having positive value for a one unit increase in the corresponding covariate. Meanwhile, in the continuous part the vector of coefficients δ are estimated for only those with positive nonzero values that represent a portion of the data and not the whole sample. When the log link is assumed in the continuous part, then conditional on the observation being nonzero, the exponential of the estimate of δ is the multiplicative change in the value of the outcome when the corresponding covariate increases by one unit. Hence, the binary part provides population estimates for the probability of nonzero, and the continuous part provides estimates for the effect on the population mean given that the value is nonzero.
Twopart model for longitudinal data with correlated random effects
So far it was assumed that the intercepts in the two processes are the only random effect specified in the twopart model and that these random variables are independent. This assumption of independence leads to biased estimates in the regression coefficients and the variance components in the continuous part [21]. To correct this assumption the two random effects are assumed to be dependent and their correlation is included in the model specification and likelihood function. In this case the random effects are assumed to have joint distribution which could be the bivariate normal distribution determined as such:
The binary part provides subjectspecific estimates for the probability of obtaining nonzero values, and the continuous part provides subjectspecific estimates of the conditional mean of log of the outcome provided the value is nonzero.
The estimation of Z_{i}, X_{i}, and Σ can be estimated by maximizing the marginal of the log likelihood function that is integrated over the random effects that can be described as such:
where f represents the distribution function of the continuous part of the outcome Y, and θ represents the bivariate normal distribution for the random intercepts.
Marginalized twopart models (MTP) extended to longitudinal data
The longitudinal form of the probability density function (pdf) for an MTP model \(\left( {g_{MTP} } \right)\) can be written as such:
where π_{ij} is the probability of nonzero value for the outcome Y_{ij} and is obtained from a logistic model, thus it will take the form of \(\pi_{ij} = \frac{{\exp \left( {Z^{\prime}_{ij} \alpha + u_{1i} } \right)}}{{1 + \exp \left( {Z^{\prime}_{ij} \alpha + u_{1i} } \right)}}\) and β representing the vector of marginal coefficients corresponding to the continuous part of an MTP model, u_{i} represents the correlated random effect intercepts in both parts of MTP
The marginal mean is of the form
Solving for the location parameter of the GG distribution in E(Y_{ij}) expressed in Eq. (11), we get the following parameterization:
C is defined in Eq. (3) and its specification leads to the different distributions that belong to the GG family of distributions. For instance, when C = 0, and σ = k the GG distribution reduces to the MTPgamma distribution model for the continuous part, when \(C\left( \sigma \right) = \log \left[ {\varGamma \left( {1 + \sigma } \right)} \right]\), and k = 1 the MTPWeibull distribution is obtained, and when \(C\left( \sigma \right) = {{\sigma^{2} } \mathord{\left/ {\vphantom {{\sigma^{2} } 2}} \right. \kern0pt} 2}\) and k → 0 the MTPlognormal distribution is obtained. The distribution used in the continuous nonzero part of MTP should have a finite closedform mean that can be parameterized as in Eq. (16).
The binary part provides subjectspecific estimates of the probability of having nonzero values for the outcome wherein the exponential of α is interpreted as the subjectspecific odds ratio for having a nonzero response attributed to a one unit increase in the respective covariate. The continuous part provides effects of the estimates on subjectspecific and population mean for parameters corresponding to covariates that are not included as random effects in the model’s specification. This specification was assumed in the correlated mTP model described earlier. Parameter estimates in the continuous part will only have subjectspecific interpretation if the corresponding covariates are included as random effects. The exponential of the parameter β in the continuous part represents the multiplicative effect on the overall mean for the whole population attributed to a one unit increase in the corresponding covariate X. The continuous component of the correlated marginalized Twopart model provides effects of the estimates on the entire sample, while that of the correlated Twopart model provides estimates of the effect on portion of the sample pertaining to the positive nonzero values.
Statistical estimation and inference for MTP longitudinal models
The general format of the likelihood function can be described as such:
where f represents the pdf of the GG distribution or any other distribution from its family and q is the bivariate normal distribution for the random intercepts. Expressing μ_{ij} in terms of Eq. (16), and π_{ij} as denoted earlier, the marginal likelihood function for the GG distribution can be described as such
where
In this likelihood function (Eq. 18), q(u_{i}) represents the bivariate normal distribution for the random effects with mean vector of zeros and variance–covariance matrix G. The random effects u_{i} are integrated out to get the marginal likelihood function. The log of the marginal likelihood function is then maximized by taking the first derivative with respect to each parameter and setting the equation to zero to obtain the maximum likelihood estimate for each of the fixed effects, α, β, k, σ, G respectively. Empirical Bayes estimators using the adaptive Gaussian quadrature approach [22] was used to obtain predicted values of the random effects u_{i}. The likelihood function for the standard Gamma, lognormal, and Weibull distributions are obtained in a similar manner by just replacing the distribution of the continuous nonzero part by the corresponding probability density function. The asymptotic standard errors are computed using Fisher information after substituting the maximum likelihood estimates for α, β, k, σ, G corresponding to the MTPGG model:
The marginal likelihood function is maximized using dual QuasiNewton optimization [23].
Onepart model for longitudinal data
The onepart model does not distinguish between zero and nonzero values in the sense that it assumes that all values are generated from the same process and the concept of having a zero and nonzeros processes as in the Twopart models does not apply here. Hence, the onepart model analyzes both the zeros and nonzeros as one sample and produces parameter estimates for the whole data (for both the zero and nonzero values altogether).
The onepart model can be described as:
where \(b_{0i} \sim N\left( {0,\sigma_{{b_{0i} }}^{2} } \right)\), γ is the vector of parameters for the fixed effect covariates W_{ij}. The parameter estimates are generated for the entire sample using approaches such as quasilikelihood generalized linear models that allow fitting of zero values, or by adding a small constant to the zero values [24]. The onepart model provides estimates of the populationbased effects of parameters γ on the overall marginal mean, with the exponential of γ representing the multiplicative change on E(Y_{ij}) corresponding to a one unit increase in the respective covariate when the log link is assumed. It was established that this model results in less efficient, imprecise and biased estimates with inflated type 1 error [24, 25].
Results
The different models, mTP, TP with correlated random intercepts, TP with uncorrelated random intercepts wherein the zero and continuous parts are fit separately, and the onepart model that fits the zero and nonzero values together in one model and assumes that a single process generates these values, were all applied on the CTGF levels measured longitudinally presenting the outcome of interest. The objective was to identify factors that associate with the zero and nonzero processes that generate the CTGF values in order to gain insight on how these levels are regulated.
Three different distributions were assumed for the continuous part of CTGF; gamma, lognormal and Weibull and their corresponding respective results are shown in Tables 1, 2, 3 wherein we included the slope estimates, its standard errors and Pvalues. The distribution that fits the data the most was the one that had the least measures of fit, Akaike information criterion (AIC), Bayesian information criterion (BIC) and log Likelihood values. Gamma distribution had the lowest AIC, BIC and log likelihood values indicating that it fits best the data for mTP and TP with correlated random intercepts models (Table 4). Our results discussed in detail below indicated that mTP model gives parameter estimates that are consistent across all distributions while the other models had discrepancy in the hypothesis testing and inferences that were dependent on the distribution of the continuous measures. The models that appeared to have increased inconsistent, inaccurate and biased estimates are the onepart and twopart uncorrelated random intercepts models. The low coverage in the confidence interval and the inflated type 1 error in the onepart model and the attributed bias in the estimates under this model and the uncorrelated twopart especially when zero values are prevailing in the data, explain some of the contradictory results observed in this study.
Our results showed that smoking status was significantly associated with an increase in the probability of nonzero values for CTGF. Specifically smokers had higher odds of 1.7–1.96 of getting nonzero levels of CTGF than nonsmokers with Pvalues ranging between < 0.0001 and 0.04 depending on the model. This result was consistently demonstrated for mTP, TP with uncorrelated and correlated random effects models, and for all 3 distributions. This result suggests that smoking is associated with increased plasma CTGF levels in type 1 diabetic patients. This lends support to previous findings that CTGF expression levels in pulmonary vessels isolated from smokers was higher than those from nonsmokers [26].
With respect to the impact of intensive glycemic treatment, its effect on the probability of nonzero values of CTGF varied between models and distributions. The mTP model consistently demonstrated its significant effects on the nonzero probability across all 3 distributions. Patients that were on intensive glycemic treatment had 1.34 times lower odds of getting nonzero CTGF values compared to patients on standard treatment (Pvalues were < 0.0001, 0.0113, 0.0125). Hence, mTP model showed that intensive glycemic treatment is associated with increased probability of having negligible CTGF values. However, the effect of intensive glycemic treatment was not consistently observed in the TP model with correlated random effects in all 3 distributions. In Table 1, when the gamma distribution was used for the continuous measure and in Table 2 with the lognormal distribution, intensive glycemic treatment was significantly associated with decreased odds of having nonzero values for CTGF by about 1.6 times compared to patients on conventional glycemic treatment (Pvalues = 0.01 and 0.04). However, using the Weibull distribution for the continuous part (Table 3), intensive glycemic treatment had a borderline significant effect with Pvalue of 0.0509. Given that the gamma distribution was the best fit for this data (AIC = 4279 Table 4), one can deduce that the odds of observing zero CTGF values is exp (0.6384) = 1.89 times higher in intensively treated patients compared to those on the conventional arm. On the other hand, the TP model with uncorrelated random effects failed to capture this significant association between the intensive glycemic treatment group and the probability of nonzero values (Pvalue = 0.0944). The fact that this model ignores the correlation between the two components and fits the zero part separately from the continuous part treating them as two independent entities might have lowered the power of hypothesis testing and introduced bias in the parameter estimates and inaccuracy in the results.
Unlike mTP and TP, onepart model fits the entire sample and assumes that all values are obtained from a single process instead of two different zero and nonzero processes. Hence, parameter estimates under this model for treatment group and smoking represent the effect of these covariates on CTGF levels and not on the probability of nonzero values. Our results showed that under the onepart model treatment group had no effect on CTGF levels and this was consistently demonstrated for all 3 distributions (Tables 1, 2, 3, onepart model). This result is not in agreement with mTP and clinical findings which supported the hypothesis that intensive glucose control regulates and lowers CTGF levels [7, 27]. As for smoking, its effect on CTGF levels were captured under lognormal (Pvalue = 0.0289) and Weibull (Pvalue = 0.0471) distributions but not with Gamma distribution. This detected association showed that smoking contributes to decreased levels of CTGF which is not in line with the results from mTP and TP models as well as clinical findings [26]. This inaccuracy in the estimates could be due to the bias and lack of consistency in the parameter estimates attributed to onepart model.
As for the continuous nonzero part, high density lipoprotein (HDL) was consistently associated with the nonzero values of CTGF, and this was demonstrated for the TP and mTP models and in all 3 distributions (all Pvalues < 0.05), except for the onepart model that failed to show this association and which could be attributed to the low coverage in the confidence interval of this model. When a significant association was captured, patients with higher plasma HDL levels appeared to have increased levels of CTGF. For example, under the gamma distribution and the mTP model, type 1 diabetic patients who have 1 mg/dl higher HDL had about 1.24% (ng/ml) increased levels of CTGF (Table 1, Pvalue < 0.0001).
Duration of diabetes and gender consistently demonstrated a nonsignificant effect on the observed CTGF levels across the TP and mTP models with different distributions. However, the onepart model showed that duration of diabetes, and gender were significantly associated with CTGF only under the gamma and Weibull distributions respectively. These significant associations could be attributed to the fact that the onepart model has increased biased in its estimates and inflated type one error that lead to inaccurate conclusion of significant association when in reality a correlation does not exist. The increased bias and type 1 errors in the onepart model are triggered by the fact that under this approach and unlike the twopart models estimates are generated for the entire sample without distinguishing between the zeros and nonzeros when in fact these values are generated by two different processes.
Systolic blood pressure (SBP) did not have consistently detected effects on the observed values of CTGF in all four models. In this regard, SBP showed a positive significant effect on CTGF under mTP model and TP model with correlated random intercepts for all 3 distributions (Pvalues < 0.05). A borderline significant effect of SBP on CTGF was detected under TP with uncorrelated random intercepts model with gamma distribution (Table 1, Pvalue = 0.0534), an insignificant effect with lognormal distribution under these same models (Table 2, Pvalue = 0.0801), and a significant effect for these models under the Weibull distribution (Table 3, Pvalue = 0.0303). Onepart model showed a nonsignificant association between SBP and CTGF for all 3 distributions which again could be attributed to the lower coverage in the confidence interval in this model. In addition, given that mTP model and TP model with correlated random effects have less bias in the estimates than onepart and TP with uncorrelated random intercepts models, then one can conclude that SBP has a significant positive effect on the observed nonzero CTGF values. If we were to interpret its marginal effect on the CTGF population mean under mTP model with gamma (Table 1), we can deduce that when SBP increases by 1 mmHg the observed values of CTGF increase by about (exp(0.0243) − 1)*100% = 3% (ng/ml) on average.
When the effect of age on CTGF values was assessed, it also had inconsistent relationship with CTGF observed values that varied with each model. Both mTP and TP models with correlated random effects indicated a nonsignificant association between age and CTGF under all distributions. TP with uncorrelated random effects and onepart models showed significant effect of age on CTGF values (Pvalue < 0.05) for all distributions under the TP uncorrelated model, and for lognormal and Weibull under the onepart model. These significant results could be attributed to the inflation of type 1 error in the onepart model and TP model with uncorrelated random effects.
Our results also showed that HbA1c, a marker of metabolic control, was significantly associated with CTGF under mTP model with all 3 distributions. In this respect, an increase of 8 ng/ml in the marginal mean of CTGF was attributed to a 1% increase in HbA1c (Pvalue = 0.0005) under the gamma distribution. Similar interpretation can be drawn for mTP with lognormal and Weibull distributions. The TP model with correlated random effects showed a significant effect of HbA1c only with gamma distribution but did not capture any significant effect with the other distributions. This could be attributed to the fact that mTP had better precision and accuracy in the parameter estimates compared to all other models.
The TP model with uncorrelated random effects did not show any significant association for HbA1c with CTGF. This could be triggered by the decreased power in the hypothesis testing when the zero values are not incorporated in the analysis but rather analyzed separately. Onepart model showed a significant negative association between HbA1c and CTGF under the lognormal and Weibull distributions which contradicts the hypothesized positive correlation between HbA1c and CTGF. Lack of precision and increased bias in the onepart model might have led to this inconsistent and inaccurate inference on the direction of the association between HbA1c and CTGF. Hence as previously indicated, our results suggested that mTP exhibited more accurate, precise and consistent estimates and inferences compared to the other models. This conclusion was further examined using a simulation study that intended to determine the performance of each of the models and which we discuss in the following section.
Simulation study
To assess the performance of each of the models: mTP, TP with correlated random intercepts, TP with uncorrelated random intercepts, and the onepart model, a simulation study was conducted wherein different proportions of zeros were included. We performed 1000 simulations with sample size of 200 and with 9 repeated measures and 30% proportions of zeros, and with 12 repeated measures with 50% proportion of zeros. The performance of each of the models was determined in terms of bias and mean square errors (MSE) and the smaller these performance indicators the more accurate and precise the model’s estimates are. Our simulation results included in Table 5 indicated that mTP had the smallest bias and MSE compared to the remaining models and under the different zero proportions. In this regard, mTP had 35% decrease in bias and 88.4% decrease in MSE compared to TP with correlated random effects in the simulation study that had 30% proportion of zeros, and 39% decrease in bias and 98% decrease in MSE under the simulation study with 50% proportion of zeros. Hence mTP performed better than the TP with correlated random effect and this was evident in both studies especially in the case of higher zero proportion of 50%. In the simulation study with 30% zero proportion, TP with correlated random effect had a decrease of 7% to 12% in bias, and 18% to 21% in MSE compared to TP with uncorrelated random intercepts and the onepart model respectively. Similarly, under the simulation study of 50% zero proportion, a decrease of 2% and 9% was denoted for bias and 8% to 16% in MSE compared to the TP with uncorrelated random intercepts and the onepart model respectively. Hence, our simulation study results suggested that TP with correlated random effects had better performance than the remaining two models (TP with uncorrelated random effects and onepart model), and that mTP had smaller attributed bias and MSE compared to the other 3 models indicating better accuracy and precision of its estimates.
Discussion
In this manuscript, we present a novel application of a likelihoodbased approach to analyze semicontinuous longitudinal data using a marginalized twopart model that we extended to incorporate longitudinal repeated measures. Various distributions were incorporated that included gamma, lognormal and Weibull. Random intercepts at an individual patient level were introduced in both the zero and nonzero components to account for the within subject correlation inherent due to the repeated measures on the same subject. We applied this model on a cohort of type 1 diabetic subjects with the aim of identifying clinical determinants that associate with CTGF, a pathogenic risk factor for diabetic complications. CTGF levels measured in this cohort displayed a mixture of negligible low values forming a point mass at zero and continuous observed positive values. The objective of this study is to determine what risk factors impact, these two components that ultimately result in CTGF levels. We also compared the estimates under different distributions using other models for analyzing semicontinuous longitudinal outcomes. The models explored here included in addition to marginalized twopart model, the twopart model with correlated, and uncorrelated random effects wherein, the continuous and zero components are fit separately, and the onepart model that provides estimates for the entire sample without distinguishing between the zero and nonzero processes. The marginalized twopart model allows for interpretation of the estimate in the continuous part as 1 unit increase in the covariate on the overall marginal mean comprised of zeros and nonzeros, while the effect of the estimates in the continuous part under the twopart model are interpreted conditional upon the values being observed.
When the mTP model was applied on a cohort of type 1 diabetic patients, it gave consistent results for the parameter estimates across all 3 distributions, demonstrating robustness for the underlying distribution compared to onepart and twopart models with uncorrelated random intercepts. The clinical determinants that displayed significant associations with the probability of nonzero values for CTGF under the mTP model were the glycemic treatment and smoking status. However, the clinical parameters that were significantly associated with the continuous observed positive values of CTGF were HDL, HbA1c and SBP.
In general, the TP model with correlated random effects resulted in estimates that are close to the parameter estimates under the mTP model but showed some discrepancy in the results of some clinical parameters that varied between the different distributions. Specifically, HbA1c was shown to be significantly associated with continuous observed values for CTGF under the TP correlated random intercepts with gamma distribution, but this association was not significant under the lognormal and Weibull distributions. Similarly, the intensive glycemic treatment group was shown to be significantly associated with the probability of nonzero under the gamma and lognormal distributions, but this association was not significant under the Weibull distribution. Gamma and lognormal distributions were better fit for this data given their lower AIB and BIC values and resulted in more stable results than TP with Weibull. This inconsistency in the inferences in the TP model with correlated random intercepts, could be attributed to its sensitivity to the underlying distribution and the true random effects structure, which is not the case with the mTP model [15]. It is worth noting here that from a clinical perspective, HbA1c was shown in some studies to be positively associated with CTGF levels, and treatment was shown to be correlated with the detection levels for CTGF [7, 27].
With respect to the TP model with uncorrelated random intercepts and separate fitting of the zero and continuous components, this approach was consistent with the mTP and the TP models with correlated random intercepts in the inference for smoking that was shown to be significantly associated with the probability of nonzero. However, this was not the case with the parameter estimate for intensive glycemic treatment in the zero part, wherein this model failed to capture the significant effect of intensive treatment on the probability of nonzero measures. Similar result was obtained with HbA1c where a nonsignificant association was reached under this model with all distributions. SBP was shown to be significantly associated with the continuous part under the Weibull distribution, but not with lognormal or gamma distributions. This result is not in line with findings from other clinical studies that showed a significant association between hypertension and CTGF [7, 9]. The discrepancy in the inferences between the mTP model and the TP model with uncorrelated random effect could be attributed to the increased lack of efficiency, and bias in the parameter estimates due to ignoring the correlation between the random effects in the two components and fitting the zero and continuous parts separately [21, 28].
The onepart model produced estimates and inferences that contradicted clinical findings. For instance this model suggested that increased HbA1c and smoking are protective factors that contribute to decreasing CTGF which is opposite of what has been clinically demonstrated [26, 27]. In this regard, CTGF levels were shown to be significantly associated with HbA1c in type 1 diabetic patients with nephropathy [27]. The expression of CTGF was also shown to be increased in the kidney and vasculature isolated from animal models of diabetes, implicating a role for hyperglycemia in modulating CTGF expression [29, 30]. Furthermore, hyperglycemia was shown to stimulate the expression of CTGF in mesangial cells, podocytes and vascular smooth muscle cells and this process involved activation of transforming growth factor beta, MAPK kinase pathway and protein kinase C [31,32,33,34]. In addition, the onepart and twopart with uncorrelated random effects were the only models that detected a significant association between age and CTGF levels. This inaccuracy in the results could be due to the inflated type 1 error and negative bias that pose major disadvantages of the onepart model [35].
A simulation study was conducted whereby two different proportions of zero values were considered (30% and 50%) to assess the performance of each of these models. Our simulation results showed that mTP had a superior performance in the sense that it had the smallest attributed bias and MSE compared to the other three models, which suggests better accuracy and precision of the estimates under mTP. This is in line with the results obtained from the clinical application previously discussed wherein we denoted that the mTP model generated consistent and robust estimates for the assumptions pertaining to the distributions of the continuous part and also accounts for the longitudinal measures and skewness in the data due to the point mass at zero. An advantage of this model resides in the consistency of the estimates and feasibility of its implementation, unlike most of the available approaches for fitting longitudinal semicontinuous data that are computationally intensive and difficult to implement [15]. Other approaches require high dimensional integrations of the stochastic processes in the marginal likelihood function which could be very complex and intractable [12]. As for the execution time, mTP and TP with correlated random effects needed more time to converge, which was about double the time needed for the TP with uncorrelated random effects and the onepart models. This increased execution time is not surprising given the complexity of the likelihood functions and its maximization under mTP and TP with correlated random effects compared to the simpler models of TP uncorrelated random effects that fits the zero and continuous components separately, and the onepart model that fits both components as one sample. However, the overall execution time is still short and it needed less than or approximately 1 min maximum time to converge successfully. Nevertheless, this additional execution time is outweighed by the gain in accuracy and precision of the mTP model.
Conclusion
In summary, our findings showed that mTP provided stable estimates that are less sensitive to the underlying distributions when compared to the twopart and onepart models. Our simulation results showed superiority of mTP over the other models in terms of minimum bias and mean square errors indicating better accuracy and precision of the parameters’ estimates. Incorporating the withinsubject correlation and the correlation between the zero and continuous nonzero processes and expressing the marginal mean directly in terms of parametrization of the regression coefficients using both the zero and nonzero values could all contribute to the precision and accuracy of this model.
Furthermore, in this manuscript we adopted a novel approach that analyzes for the first time CTGF from the perspective of having different processes that result in the zero and nonzero values. The mTP model presented here has identified new clinical determinants that modulate the levels of CTGF in diabetic subjects. In this regard, intensive glycemic treatment was shown to be associated with decreased odds of CTGF detection, and smoking was identified as a factor that associates with increased probability of nonzero which indicates its association with increased levels of CTGF. Moreover, HDL, SBP and HbA1c were associated with increased levels of CTGF. This finding is of clinical significance, since it provides insights into factors that affect the levels of CTGF, a pathogenic risk factor for diabetic complications. In addition, a key advantage is that the analytical approaches described herein are applicable to all inflammatory biomarkers and cytokine profiles measured in patient populations that display a combination of negligible zero and nonzero values to understand the factors that regulate their production. Moreover, the models illustrated in this study are not limited to only clinical outcome datasets but could be also applicable to a vast array of real life situations such as health services research whereby lack or absence of a service may lead to proliferation of zero values, which requires this type of analyses. Hence, this study could be utilized as a model approach for the analyses of similar settings wherein semicontinuous data is present.
Abbreviations
 CTGF:

connective tissue growth factor
 CVD:

cardiovascular disease
 mTP:

marginalized twopart model
 TP:

twopart model
 AIC:

Akaike information criterion
 BIC:

Bayesian information criterion
 SBP:

systolic blood pressure
 HDL:

high density lipoprotein
 HbA1c:

hemoglobin A1c
References
Cheung N, Wong TY. Diabetic retinopathy and systemic vascular complications. Prog Retin Eye Res. 2008;27:161–7.
Hojs R, Ekart R, Bevc S, Hojs N. Biomarkers of renal disease and progression in patients with diabetes. J Clin Med. 2015;4:1010–24.
De Zeeuw D, Parving HH, Henning RH. Microalbuminuria as an early marker for cardiovascular disease. J Am Soc Nephrol. 2006;17:2100–5.
Brown WV. Microvascular complications of diabetes mellitus: renal protection accompanies cardiovascular protection. Am J Cardiol. 2008;102:10L–3L.
Granger DN, Rodrigues SF, Yildirim A, Senchenkova EY. Microvascular responses to cardiovascular risk factors. Microcirculation. 2010;17(3):192–205.
Kular L, Pakradouni J, Kitabgi P, Laurent M, Martinerje C. The CCN family: a new class of inflammation modulators. Biochemie. 2011;93:377–88.
Jaffa AA, Usinger WR, McHenry MB, Jaffa MA, Lipsitz SR, Lackland D, et al. Connective tissue growth factor and susceptibility to renal and vascular disease risk in type 1 diabetes. J Clin Endocrinol Metab. 2008;93:1893–900.
Leeuwis JW, Nguyen TQ, Theunissen MGJ, Peeters W, Goldschmeding R, Pasterkamp G, et al. Connective tissue growth factor is associated with a stable atherosclerotic plaque phenotype and is involved in plaque stabilization after stroke. Stroke. 2010;2010(41):2979–81.
Gerritsen KG, Falke LL, van Vuuren SH, Leeuwis JW, Broekhuizen R, Nguyen TQ, et al. Plasma CTGF is independently related to an increased risk of cardiovascular events and mortality in patients with atherosclerotic disease: the SMART study. Growth Factors. 2016;4:149–58.
Hunt KJ, Jaffa MA, Garrett SM, Luttrell DK, Lipson KE, LopesVirella M, Luttrell LM, Jaffa AA, VADT Investigators. Plasma connective growth factor (CTGF/CCN2) levels predict myocardial infraction in the Veterans Affairs Diabetes Trial (VADT) cohort. Diabetes Care. 2018;41(4):840–6.
Olsen MK, Schafer JL. A twopart randomeffects model for semicontinuous longitudinal data. JASA. 2001;96:730–45.
Yiu S, Tom DMB. Twopart models with stochastic processes for modelling longitudinal semicontinuous data: Computationally efficient inference and modelling the overall marginal mean. SMMR. 2017. https://doi.org/10.1177/0962280217710573.
Berk KN, Lachenbruch PA. Repeated measures with zeros. SMMR. 2002;11:303–16.
Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data with clumping at zero. SMMR. 2002;11(4):341–55.
Su L, Tom BD, Farewell VT. A likelihoodbased twopart marginal model for longitudinal semicontinuous data. SMMR. 2015;24(2):194–205.
Hall DB, Zhang Z. “Marginal models for zero inflated clustered data. Stat Model. 2004;2004(4):161–80.
Smith VA, Preisser JS, Neelon B. A marginalized twopart model for semicontinuous data. Stat Med. 2014;2014(33):4891–930.
Voronca DC, Gebregziabher M, Durkalski VL, Liu L, Egede LE. Marginalized two part models for generalized gamma family of distributions. cornell university library. arXiv: 2015; 1511.05629[stat.ME].
The DCCT Research Group. The diabetes control and complications trial (DCCT): design and methodologic considerations for the feasibility phase. Diabetes. 1986;35:530–45.
The Diabetes Control and Complications Trial Research Group. The effect of intensive treatment of diabetes on the development and progression of long term complications in insulindependent diabetes mellitus. N Engl J Med. 1993;329:977–86.
Su L, Tom BD, Farewell VT. Bias in 2part mixed models for longitudinal semicontinuous data. Biostatistics. 2009;10:374–89.
Pinheiro JC, Bates DM. Approximations to the Loglikelihood function in the nonlinear mixedeffects model. J Comput Graph Stat. 1995;4:12–35.
Fletcher R. Practical methods of optimization. 2nd ed. New York: Willey; 1987.
Buntin MB, Zaslavsky AM. Too much ado about twopart models and transformations? Comparing methods of modeling Medicare expenditures. J Health Econ. 2004;23:525–42.
Smith VA, Neelon B, Preisser JS, Maciejewski ML. A marginalized twopart model for longitudinal semicontinuous data. SMMR. 2017;26:1949–68.
Zhou S, Li M, Zeng D, Hu X, Li Y, Wang R, Sun G. Expression variations of connective tissue growth factor in pulmonary arteries from smokers with and without chronic obstructive pulmonary disease. Sci Rep. 2015;5:8564.
Roestenberg P, van Nieuwenhoven FA, Wieten L, Boer P, Diekman T, Tiller AM, et al. Connective tissue growth factor is increased in plasma of type 1 diabetic patients with nephropathy. Diabetes Care. 2004;27(5):1164–70.
Albert PS, Shen J. Modelling longitudinal semicontinuous emesis volume data with serial correlation in an acupuncture clinical trial. JRSS Series C. 2005;54:707–20.
Tan Y, Keum JS, Wang B, McHenry MB, Lipsitz SR, Jaffa AA. Targeted deletion of B2kinin receptors protects against the development of diabetic nephropathy. Am J Physiol Renal Physiol. 2007;293:F1026–35.
Riser BL, Denichilo M, Cortes P, Baker C, Grondin JM, Yee J, Narins RG. Regulation of connective tissue growth factor activity in cultured rat mesangial cells and its expression in experimental diabetic glomerulosclerosis. J Am Soc Nephrol. 2000;11:25–38.
Furlong F, Crean J, Thornton L, O’leary B, Murphy M, Martin F. Dysregulated intracellular signaling impairs CTGFstimulated responses in human mesangial cells exposed to high extracellular glucose. AJP Renal Physiol. 2007;292:F1691–700.
Dai HY, Zheng M, Ly LL, Tang RN, Ma KL, Liu D, Wu M, Liu BC. The roles of connective tissue growth factor and integrinlinked kinase in high glucoseinduced phenotypic alterations of podocytes. J Cell Biochem. 2012;113:293–301.
Liu X, Luo F, Pan K, Wu W, Chen H. High glucose upregulates connective tissue growth factor expression in human vascular smooth muscle cells. BMC Cell Biol. 2007;16(8):1.
Wang B, Carter RE, Jaffa MA, Nakerakanti S, Lackland D, LopesVirella M, Trojanowska M, Luttrell LM, Jaffa AA. Genetic variant in the promoter of connective tissue growth factor gene confers susceptibility to nephropathy in type 1 diabetes. J Med Genet. 2010;47:391–7.
Smith VA, Neelon B, Maciejewski ML, Preisser JS. Two parts are better than one: modeling marginal means of semicontinuous data. Health Serv Outcomes Res Methodol. 2017;17:198–218.
Authors’ contributions
MAJ and MG conceived the methodological approaches for the manuscript and performed all the analysis for the study. MAJ wrote, reviewed and edited the manuscript. MG reviewed and edited the manuscript. DKL and SMG oversaw the measurements of CTGF and reviewed the manuscript. KEL reviewed and edited the manuscript. LML reviewed the manuscript. AAJ conceived the study, amassed the resulting data and wrote reviewed and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgements
We would like to acknowledge the DCCT/EDIC as the source of data. A complete list of participants in the DCCT/EDIC Research Group is presented in the Additional Material published online for the article in N Engl J Med 2015;372:1722–33.
Competing interests
Kenneth E Lipson is employed by Fibrogen Inc., the company which makes the CTGF assay used to measure CTGF in the DCCT study participants. None of the other authors have any potential conflicts of interest or financial disclosure to report.
Availability of data and materials
Restrictions apply to the availability of the data that support the findings of the current study.
Consent for publications
Not applicable.
Ethics
The DCCT was approved by the Institutional Review Boards of all participating DCCT centers and all participants provided written informed consent.
Funding
This work was supported by the National Institutes of Health Grants HL077192 (AAJ) and 5 P01 HL055782.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Jaffa, M.A., Gebregziabher, M., Garrett, S.M. et al. Analysis of longitudinal semicontinuous data using marginalized twopart model. J Transl Med 16, 301 (2018). https://doi.org/10.1186/s1296701816745
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1296701816745