NTCP modeling and dose–volume correlations for acute xerostomia and dry eye after whole brain radiation

Background Whole brain radiation (WBRT) may lead to acute xerostomia and dry eye from incidental parotid and lacrimal exposure, respectively. We performed a prospective observational study to assess the incidence/severity of this toxicity. We herein perform a secondary analysis relating parotid and lacrimal dosimetric parameters to normal tissue complication probability (NTCP) rates and associated models. Methods Patients received WBRT to 25–40 Gy in 10–20 fractions using 3D-conformal radiation therapy without prospective delineation of the parotids or lacrimals. Patients completed questionnaires at baseline and 1 month post-WBRT. Xerostomia was assessed using the University of Michigan xerostomia score (scored 0–100, toxicity defined as ≥ 20 pt increase) and xerostomia bother score (scored from 0 to 3, toxicity defined as ≥ 2 pt increase). Dry eye was assessed using the Subjective Evaluation of Symptom of Dryness (SESoD, scored from 0 to 4, toxicity defined as ≥ 2 pt increase). The clinical data were fitted by the Lyman–Kutcher–Burman (LKB) and Relative Seriality (RS) NTCP models. Results Of 55 evaluable patients, 19 (35%) had ≥ 20 point increase in xerostomia score, 11 (20%) had ≥ 2 point increase in xerostomia bother score, and 13 (24%) had ≥ 2 point increase in SESoD score. For xerostomia, parotid V10Gy–V20Gy correlated best with toxicity, with AUC 0.68 for xerostomia score and 0.69–0.71 for bother score. The values for the D50, m and n parameters of the LKB model were 22.3 Gy, 0.84 and 1.0 for xerostomia score and 28.4 Gy, 0.55 and 1.0 for bother score, respectively. The corresponding values for the D50, γ and s parameters of the RS model were 23.5 Gy, 0.28 and 0.0001 for xerostomia score and 32.0 Gy, 0.45 and 0.0001 for bother score, respectively. For dry eye, lacrimal V10Gy–V15Gy were found to correlate best with toxicity, with AUC values from 0.67 to 0.68. The parameter values of the LKB model were 53.5 Gy, 0.74 and 1.0, whereas of the RS model were 54.0 Gy, 0.37 and 0.0001, respectively. Conclusions Xerostomia was most associated with parotid V10Gy–V20Gy, and dry eye with lacrimal V10Gy–V15Gy. NTCP models were successfully created for both toxicities and may help clinicians refine dosimetric goals and assess levels of risk in patients receiving palliative WBRT.

More-recently, in a prospective study, we reported that patients receiving standard WBRT (without prospective delineation of the parotid or lacrimal glands) resulted in clinically-significant acute xerostomia and dry eye in roughly 35% and 25% of cases, respectively with toxicity rates associated with glandular doses [1,10]. In those reports, a summary of the reported toxicities was provided together with a statistical analysis of their correlation against given dose volume metrics. However, as there are no prior studies reporting NTCP model parameter values for these acute toxicities, we performed additional dosimetric analyses including normal tissue complication probability (NTCP) modelling. Different groups in research and clinical domain are more familiar with and prone to use a given NTCP model. For this reason, in this study, parameter values were derived for two popular NTCP models in order to facilitate such groups incorporate NTCP metrics for the examined toxicities in their analyses. Also, dose thresholds were identified in this study, which seeks to generate risk assessment tools to aid in clinical decision making.

Patient selection, treatment and OAR delineation
Patients were treated with WBRT on a prospective, IRBapproved study (ClinicalTrials.gov #NCT02682199), with details of study procedures previously reported [1,10]. In brief, patients were eligible if they were planned to receive WBRT for any indication using 3D planning to a total dose of 25-40 Gy in 10-20 fractions, at 2-3 Gy per fraction. Patients provided written consent and were enrolled at one academic center and two affiliated community hospitals.
All patients were treated in the supine position using customized head-cast immobilization. Patients were treated using 3D planning without prospective delineation of the parotid glands or lacrimal glands, though some providers delineated the globe and/or lens for avoidance. Using digitally-reconstructed radiographs, WBRT fields were designed based on bony anatomy and covered the entire skull and extended to the inferior border of the C1 or C2 vertebrae [1].
The bilateral parotid and lacrimal glands were retrospectively delineated without knowledge of clinical outcomes. Parotids were delineated using planning CT alone, whereas MRI fusion was used for lacrimal gland delineation. The dose volume histograms (DVH) were calculated for the bilateral parotid and lacrimal glands. DVH-based metrics were correlated with patient reported outcomes and definitions of toxicity as described below.

Definition of toxicity for xerostomia and dry eye
Patients completed xerostomia and dry eye questionnaires at baseline pre-WBRT, at the conclusion of WBRT, and at 1, 3, and 6 months post-WBRT. All toxicity analyses refer to outcomes at 1 month (1M) post-WBRT, which was the prospectively-specified primary time point. To allow NTCP modelling, clinically significant toxicity was defined as a binary variable using threshold worsening of patient-reported symptom scores, as described below.
For xerostomia, patients completed the validated University of Michigan Xerostomia Questionnaire (xerostomia score), calculated using eight questions each scored from 0 to 10 and linearly converted to a 100-point scale, with higher scores representing worse symptoms [11][12][13]. Patients also completed a 2-question xerostomia bother score that assessed the degree to which xerostomia bothered patients while eating and while not eating [11]. Each bother question was answered on a 4-point Likert scale (0: Bothered not at all, 1: Bothered a little bit, 2: Bothered quite a bit, and 3: Bothered very much) adapted from the EORTC QLQ-H&N35 head and neck cancer-specific QOL questionnaire [11,14]. The higher score of the two bother questions was considered the overall bother score at that time point. Two definitions of clinically significant toxicity were used for the xerostomia analysis: (1) ≥ 20 point increase in xerostomia score and (2) ≥ 2 point increase in xerostomia bother score [11]. Regarding those two scores, as it has been reported in the literature and has been confirmed by our data, lower levels of increase from baseline were characterized by large variability. However, for the levels that we used in this study the responses were more stable over time constituting a clear worsening of the symptoms from baseline. The results and analysis of xerostomia using bother score are presented in the "Appendix".
Clinically significant toxicity was defined as a ≥ 2 point increase in SESoD score [1].

Radiobiological models
Physical dose distributions were converted to equivalent doses of 2 Gy per fraction (EDQ 2Gy ) using an α/β value of 3 Gy [18][19][20]. For each organ, the corresponding generalized equivalent uniform doses were calculated (gEUD 2Gy ) [21,22]. The clinical data relating to the observed rate of NTCP was used to calculate the model parameters for both the Lyman-Kutcher-Burman (LKB) model [23,24] and the relative seriality (RS) [25]. The basic parameters of each model are: D 50 (or TD 50 ), which is the dose for a complication rate of 50%, the slope (gradient) of the dose response curve (m for LKB and γ for RS), and the parameter that accounts for the volume dependence of the organ (n for LKB and s for RS). From the formulation of the RS model, the biologically effective uniform dose ( D ) is derived. D is the dose that causes exactly the same normal tissue complication probability as the real dose distribution [26]. The mathematical formulations of NTCP models and biological doses are provided in the "Appendix". In the figures and tables presenting results of the NTCP models, equivalent to 2 Gy per fraction doses are used otherwise the physical doses are shown.

Statistical methods
For the two NTCP models, the parameter values and their 95% confidence intervals were determined using the maximum likelihood method [18,27,28]. The fitting calculations were performed through the use of a minimization package (MINOS) [29]. The confidence intervals of the model parameters were determined using the profile likelihood method. The ability of the NTCP models to distinguish patients with and without the examined symptoms was evaluated using the area under the curve (AUC) measure, which is used as a summary of the ROC curve [18,30]. The goodness-of-fit of the different NTCP models was assessed through the Hosmer-Lemeshow test [31]. Additionally, the Odds Ratio (OR) method was applied to identify NTCP thresholds beyond which the risk of toxicity increases significantly [18,32]. Those thresholds were identified in three steps. First, we identified the thresholds for which the OR values are larger than 1 and sorted them by OR value (largest to lowest); second, we identified the thresholds for which the low limit of the 95% confidence interval (95% CI) is larger than one; and third, we identified the threshold with the smallest 95% CI.

Results
100 patients were enrolled and treated between 2015 and 2018, of whom 55 and 54 were eligible for the analyses of 1-month xerostomia and dry eye, respectively (45 patients were prospectively excluded from analysis due to lack of baseline score or baseline SESoD score ≥ 3, or did not complete WBRT, or did not complete any follow-up questionnaires at 1 month post-RT). Patient characteristics are shown in Table 1.
Most patients received 30 Gy in 10 fractions (58%) or 35 Gy in 14 fractions (31%). For the xerostomia analysis, clinically significant toxicity was observed in 19 patients (35%) who had a ≥ 20 point increase in xerostomia score. For the dry eye analysis, clinically significant toxicity was observed in 13 patients (24%) who had a ≥ 2 point increase in SESoD score. Figure 1 shows a coronal view of the spatial dose distribution for a representative patient in the plans of the parotid and lacrimal glands, respectively. It also illustrates the bilateral parotid and lacrimal dose volume histograms (DVHs) for patients with and without toxicity. Table 2 presents average OAR mean and volumetric dose in patients with and without toxicity. Figure 2 shows the areas under the ROC curves (AUC) for different parotid and lacrimal dose volume metrics for their corresponding toxicity endpoints.
For xerostomia, the AUC values for D mean and V 20Gy of the parotid glands were 0.64 and 0.68, respectively. Patients with parotid V 20Gy ≥ 49% had 8.6 (95% CI: 2.4-30.8) times higher risk for clinically significant toxicity as defined using xerostomia score (this cutoff was determined using the Odds Ratio method).
For dry eye, the AUC value for D mean to the lacrimal glands was 0.60, whereas the dose-volume indices in the range V 6Gy -V 15Gy were found to correlate best with toxicity with AUC values ranging between 0.65 and 0.68. Patients with lacrimal V 15Gy ≥ 80% had a 5.8 (95% CI 1.1-29.4) times higher risk for clinically significant   toxicity (this cutoff was determined using the Odds Ratio method). Results for NTCP modelling parameters using LKB and RS methods are shown in Table 3, for xerostomia and dry eye, respectively. Figure 3 shows the parotid and lacrimal dose response curves generated using these models, for a range of gEUD and D doses (for the LKB and RS models, respectively). Overall results for the NTCP modelling, AUC analysis, and assessment of statistical significance are summarized in Table 4 for xerostomia and dry eye, respectively. The goodness of fit of the NTCP models was evaluated by the Hosmer-Lemeshow test, which showed that the p values of the LKB and RS models were 0.35 for xerostomia and 0.52 for dry eye, respectively. Both values are larger than 0.05, so the null hypothesis that the observed and expected response rates are the same across all doses cannot be rejected.

Discussion
In this analysis, we expanded on our recent reports [1,10] by creating NTCP models and refining the dosimetric relationships linking parotid dose to xerostomia and lacrimal dose to dry eye in patients receiving WBRT. The dose-volume metrics V 13Gy -V 23Gy and V 6Gy -V 15Gy of the parotid and lacrimal glands were best correlated with the endpoints of xerostomia and dry eye, respectively (see Fig. 2). The AUC values of those dose metrics are close to the AUC values of D mean for both organs (parotid and lacrimal glands). This is because they demonstrate a parallel-like volume effect against the endpoints of xerostomia and dry eye, respectively. In both cases, the respective V 20 and V 15 metrics showed higher AUC values but not at a statistically significant level.
The OR values that are reported here are the highest values, which are also statistically significant (lower limit of the 95% CI should be larger than one) and have the have the smallest confidence interval. However, it has to be stated that the accompanied thresholds depend on the cohort characteristics (e.g. distribution of dose values among patients). Those results have an immediate clinical applicability since they can be used as dose constraints in treatment plan optimization and evaluation.
In WBRT, the vast majority of patients are treated with parallel-opposed lateral fields. IMRT/VMAT are not well-accepted approaches since the cost and planning/QA time for IMRT/VMAT are far higher than for  opposed laterals, and given the clinical setting, might not be justified. In order to establish a closer dose-response relation for dry eye, the response of the individual eyes could be recorded. However, in most cases the status of the patient is not good. Getting good QOL data is challenging and asking questions about each eye separately would add mental burden on the patients. For xerostomia, our findings are consistent with recent reports from radiotherapy of head and neck tumors finding that parotid V 20 correlates best with PRO-CTCAE scores [33]. For example, the difference in parotid mean dose between patients with vs. without xerostomia was only 2.4 Gy, whereas the difference in parotid V 20Gy was 8.4% (Table 2). A statistically significant threshold of V 20Gy ≤ 47% corresponded to an odds ratio of 8.6 (95% CI 2.4-30.8).
For dry eye, there are no good prior studies for comparison. More specifically, although there are a couple of  Table 4 Summary of the results from the fit of the two normal tissue complication probability models for xerostomia and dry eye. AUC = area under the curve. Response rate is the average of the response rates predicted by the models. OR = Odds ratio. Threshold doses refer to gEUD and D (for the LKB and RS models, respectively). The p value was calculated using the two sample t-test for the patient subgroups above and below the dose threshold (the null hypothesis is that the true mean difference is zero)  [34]. In our analysis, the same pattern was found as with xerostomia in that volumetric dose metrics were also better correlated with toxicity than mean doses. For example, in our group, the mean lacrimal dose difference between patients with vs. without dry eye was only 1.6 Gy, whereas the difference in lacrimal V 15Gy was 8%. A statistically significant threshold of V 15Gy ≤ 80% corresponded to an odds ratio of 4.4 (95% CI 1. 3-15.4).
There are many studies reporting NTCP model parameters for xerostomia after head and neck radiotherapy. The values of the NTCP model parameters we computed for the same endpoint but after WBRT are not very different than those reported in the literature. When the later ones were applied to the present dataset many of them were found compatible producing similar AUC and OR values with the fitted model parameters. The reason is that the previously published model parameter values fall within the confidence intervals of the derived values (see Tables 3, 5). For the parameter sets that were not found compatible, there are several potential reasons to explain this difference. First, most prior reports address patients with head and neck cancer, and the character of the 3D dose distribution is very different in the setting of WBRT. More specifically, dose fall off inside the volume of parotids is more pronounced in the case of head and neck radiotherapy.
Second, most prior studies have used physician-scored toxicity (CTCAE), while the current study uses PRO [35][36][37][38]. It is generally understood that providers might underestimate the rate/severity of toxicities. In this light, the differences between our results and the prior studies (see Table 5) are logical as lower values for TD 50 , and higher values for m, and higher values for n would be associated with predicting a higher rate of toxicity (as our use of PROs would be expected to yield higher risk rates). Interestingly, our computed parameters are close to parameters derived from our recent analysis of patientreported symptoms (PRO-CTCAE) in patients with head and neck cancer as presented in Table 5 [33]. This agreement suggests that the models might be reasonably applicable across a broad range of 3D-dose/volume characters.
Third, most prior studies have considered the dose/ volume metrics for the contralateral parotid alone (since there is usually/often no attempt made to spare the ipsilateral parotid gland), where we considered both parotids as a pooled single structure in the setting of WBRT.
We found that both the LKB and RS NTCP models were able to be fitted to the clinical xerostomia data (Table 4), with similar goodness-of-fit. Indeed, the threshold doses for xerostomia in both NTCP models were identical (18 Gy). This is similar to what has been found in other studies where both models provide a reasonable fit to the data [23,35].
The results of xerostomia based on bother score have very similar pattern with those based on Michigan Questionnaire. More specifically, both scoring systems have almost the same V 20Gy and Odds ratio statistics. However, the NTCP model values show some deviation (TD 50 /D 50 values are higher and m/γ values are lower for bother score).
For dry eye, there is minimal published dose/volume data/modelling results to which we can compare our findings. As with the parotids, the threshold doses for dry eye were similar for the two models (28 Gy for the gEUD in the LKB and 25 Gy for the D in the RS). Although the data show that there an increased risk for dry eye beyond those threshold doses, the OR results were not statistically significant. This means that a more conservative approach should be followed regarding the use of those threshold doses in the clinic.
In this study, serial observations were collected (from 1 to 6 months) aiming at analyzing the dependence of velocity and extent of symptom resolution on dose. Unfortunately, the follow-up data at 3 and 6 months are sparse to allow a valid statistical analysis. More specifically, the available follow-up data were 55, 33 and 28 at 1, 3 and 6 months, respectively.
Our study has several limitations. First, we did not consider provider-defined toxicity scoring. Nevertheless, we think our approach is reasonable as there is increasing recognition of the importance of patient-reported outcomes [35]. Second, the definition of a significant toxicity was somewhat subjective (e.g., a threshold symptom score increase from baseline). However, this a common Table 5 Summary of the parameter values that have been reported by different groups for the LKB model, regarding the endpoint of xerostomia. The difference between the physicianscored (CTCAE) patient-reported (PRO-CTCAE) scoring systems is also indicated. Those model parameter sets were applied on the current dataset and the corresponding AUC, OR (with 95% confidence interval and dose threshold) values were calculated approach that has been used by other groups using patient-reported outcomes, and it is reasonable since it inherently acknowledges the importance of considering baseline status [35]. This is particularly important in our setting since many patients with brain metastases requiring WBRT would have received prior systemic therapies that might impact these symptoms. Third, the total number of patients and events was modest. A rule of thumb in modelling is to have 10 events per parameter in the dataset. When this rule breaks, the results show large confidence intervals. However, in our case this is reasonable as the study was prospective, and to our knowledge this is one of the largest studies of its kind to address this issue in the setting of WBRT. Finally, it is known that all the existing NTCP models have inherent limitations. These models do not account for biological mechanisms (cell redistribution, re-oxygenation, etc.) that may have an impact on treatment outcome. Further, such models do not consider the spatial distribution of dose, and hence ignore the possibility that there may be particularly-important regions of such "parallel organs" in determining clinical response [39]. Despite these uncertainties, the correlations of the model-based NTCP values with clinical outcome data were fairly good.

Conclusions
In conclusion, we created NTCP models that reasonablywell fit acute parotid and lacrimal gland toxicity after WBRT. These models and their findings may help establish levels of risk for these acute toxicities and help clinicians determine reasonable dosimetric goals to maintain quality of life in patients receiving palliative WBRT.
Further investigations are needed to refine/confirm these model-based estimates.

Appendix
Analysis using bother score for xerostomia Figure 4 shows bilateral parotid dose volume histograms (DVHs) for patients with and without toxicity based on bother score. Table 6 presents average parotid mean and volumetric dose in patients with and without toxicity. Figure 5 shows the areas under the ROC curves (AUC) for different parotid dose volume metrics for xerostomia using bother score. Table 7 presents a summary of the parameter values and 95% confidence intervals for the LKB and RS models of the parotid glands for the endpoint of xerostomia using the bother score and Fig. 6 schematically illustrates the corresponding dose response curves. Table 8 presents a summary of the results from the fit of the two normal tissue complication probability models for xerostomia using the bother score. Clinically significant toxicity was observed in 11 patients (20%) who had Fig. 4 The DVHs of the combined parotid glands for patients with (red solid lines) versus without (green dotted lines) clinically significant xerostomia defined using the bother score (right, ≥ 2 point worsening in bother score)    Table 8 Summary of the results from the fit of the two normal tissue complication probability models for xerostomia using the bother score. AUC = area under the curve. Response rate is the average of the response rates predicted by the models. OR = Odds ratio. The p value was calculated using the two sample t-test for the patient subgroups above and below the dose threshold (the null hypothesis is that the true mean difference is zero)