Despite the astute observation by Bentzen and Tucker (1997) that the slope of the TCP dose-response curve is model-dependent even when fitting is performed to the same data, the dependence of model parameters on the choice of model is generally not appreciated. Limited attention has also been devoted to demonstrating conflicts between common models in plan ranking or in predicting the consequences of dose boosting in partial volumes [28–30]. In this report, the lingering question of the extent to which model predictions are model-dependent has been studied in a systematic manner. As expected, no model can be deemed preferable, and all four models agree well within the range of the clinical data. Dosimetric parameters of clinical relevance, for example NTCP at 55 and 60 Gy, doses typically used as constraints in IMRT planning [31], would therefore be model-independent as long as incidence data exist in this dose range. These NTCP differences were in fact < 1% for RION and < 3% for RIRP, see Figures 2 and 3. The same applied to *D*_{5} and *D*_{10}, the doses corresponding to 5 and 10% incidence of complications. Figures 2 and 3 show that the differences in these values predicted by different models were < 1.5 Gy for RION and < 4.5 Gy for RIRP.
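The agreement of the four models within the clinical dose range can be illustrated numerically. The sketch below uses commonly used two-parameter forms of the logistic, log-logistic, probit (Lyman) and Poisson-based models, each written in terms of *D*_{50} and the normalized slope *γ*; the parameter values are illustrative only, not the fitted values of Table 1, and for the Poisson form *γ* is only an approximation of the normalized slope at *D*_{50}:

```python
import math

def ntcp_logistic(d, d50, g):
    # logistic: P = 1 / (1 + exp(4*g*(1 - d/d50)))
    return 1.0 / (1.0 + math.exp(4.0 * g * (1.0 - d / d50)))

def ntcp_log_logistic(d, d50, g):
    # log-logistic: P = 1 / (1 + (d50/d)^(4*g))
    return 1.0 / (1.0 + (d50 / d) ** (4.0 * g))

def ntcp_probit(d, d50, g):
    # probit (Lyman): P = Phi(t), t = (d - d50)/(m*d50), with m = 1/(sqrt(2*pi)*g)
    m = 1.0 / (math.sqrt(2.0 * math.pi) * g)
    t = (d - d50) / (m * d50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ntcp_poisson(d, d50, g):
    # Poisson-based: P = 2^(-exp(e*g*(1 - d/d50))); g approximates the slope at d50
    return 2.0 ** (-math.exp(math.e * g * (1.0 - d / d50)))

# compare the models at doses inside a typical constraint range (illustrative values)
for d in (55.0, 60.0):
    print(d, [round(f(d, 70.0, 1.5), 3) for f in
              (ntcp_logistic, ntcp_log_logistic, ntcp_probit, ntcp_poisson)])
```

All four forms yield 0.5 exactly at *D*_{50} by construction, and for matched (*D*_{50}, *γ*) their predictions in the 20-50% incidence range stay within a few percent of one another.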

However, for the RION data set for patients treated twice daily, where the incidence data covered the smallest range of the four sets, predictions beyond the range of available data became quite model-dependent. Not only is this reflected in large discrepancies in *D*_{50} values; *D*_{20}, the dose corresponding to 20% incidence of RION, is 74.9 Gy for the logistic model, in contrast with 81.2 Gy calculated from the log-logistic model. This would be consequential for dose escalation protocols relying on extrapolated incidence of complications.
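The divergence of extrapolated doses follows directly from inverting the model formulas. A minimal sketch, assuming the common two-parameter logistic and log-logistic forms written in terms of *D*_{50} and normalized slope *γ* (the numeric values in the usage note are illustrative, not those of Table 1):

```python
import math

def d_at_p_logistic(p, d50, g):
    # invert P = 1/(1 + exp(4*g*(1 - d/d50))) for the dose giving incidence p
    return d50 * (1.0 - math.log((1.0 - p) / p) / (4.0 * g))

def d_at_p_log_logistic(p, d50, g):
    # invert P = 1/(1 + (d50/d)^(4*g)) for the dose giving incidence p
    return d50 * (p / (1.0 - p)) ** (1.0 / (4.0 * g))

# illustrative comparison: dose at 20% incidence for identical (d50, gamma)
print(d_at_p_logistic(0.2, 80.0, 1.0), d_at_p_log_logistic(0.2, 80.0, 1.0))
```

For any incidence p < 50% and identical (*D*_{50}, *γ*), the log-logistic dose exceeds the logistic one, since exp(−x) > 1 − x for x > 0; this reproduces the direction of the *D*_{20} discrepancy seen here.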

A trend was observed in which the log-logistic and Poisson-based models yielded larger *D*_{50} and smaller *γ* compared to the logistic and probit models. This is likely related to the shape of the dose-response curve characteristic of a specific model, as well as to the limited range of incidence of complications. While this ideally should be demonstrated mathematically, we can speculate that the trend is driven by differences in model predictions in the incidence range of concern for this study. Figure 1 shows that the log-logistic and Poisson-based models reach complication probabilities of the order of 10-20% at larger doses than the logistic and probit models. In Figure 1, models were matched according to *D*_{50} and *γ*. One can speculate that if the models were instead forced to overlap in the range of clinically observed incidences of complications, i.e., < 20%, larger *D*_{50} would be expected for the log-logistic and Poisson-based models.

This model dependence is typically not specifically addressed in literature reviews that present compilations of model parameters [32]. It is conceivable that the large *D*_{50} values reported for xerostomia by Munter *et al*. 2004 and Munter *et al*. 2007 were at least partly due to their choice of the log-logistic model. In this regard, generic statements based on a shallow dose-response of *γ* ≤ 1 should be made with caution as well. As shown in this study, a difference of a factor of two was observed for the RION data set for patients treated twice daily (*γ* = 0.8 and 1.56, Table 1), a data set limited in complication incidence. Even for the RIRP data set, which covers a broad range of incidence, substantial variations in *γ* were seen while variations in *D*_{50} were minor. Disagreement in model parameters therefore cannot be viewed solely as a reflection of differences in the underlying data. While this conclusion would be valid for data sets covering a broad range of incidences, human data are, for good reason, typically limited to low incidences of complications. It has to be stated that while the log-logistic model predicted a shallow dose-response, the only way to claim inferiority of this model is to demonstrate that its predictions contradict clinical data. The model cannot be disregarded based on how plausible its parameters and its predictions at larger doses may appear compared to other models. It is unfortunate that publications showing model predictions often do not also show clinical data in the same plot, as is done in Figures 2 and 3. This provides readers with a better understanding of the spread of clinical data in dose, incidence of toxicity and statistical uncertainty.

Variations in confidence intervals were substantial. This can, at least in part, be connected with the model parameters themselves. In particular, the log-logistic model yielded the largest *D*_{50} as well as the broadest upper confidence limit for *D*_{50}. Having said that, for the RIRP data sets, *D*_{50} values were consistent between the models, and still the upper confidence limit was by far the broadest for the log-logistic model. The reverse argument applies to *γ*, with the log-logistic model providing the broadest lower limit. Confidence intervals calculated for model parameters were broad, which relates to the small number of events. In particular, patients treated twice daily showed a low incidence of complications. Consequently, model parameters can only be estimated with substantial uncertainties. While this precludes being definitive in comparing model behavior, it is a common problem in testing model predictions. The presented analysis is therefore representative of the practical situation of dose-response analysis and use of model parameters.
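Such confidence intervals are often obtained from the profile likelihood. The sketch below computes an approximate 95% interval for *D*_{50} of a logistic model fitted to hypothetical grouped incidence data (not the data of this study): the interval is the set of *D*_{50} values whose profile negative log-likelihood, minimized over the nuisance parameter *γ*, lies within χ²(1, 0.95)/2 = 1.92 of the minimum.

```python
import math

# hypothetical grouped data: dose levels (Gy), patients, complications
doses  = [45, 50, 55, 60, 65, 70]
n_pat  = [20, 20, 20, 20, 20, 20]
events = [ 3,  4,  6,  8, 10, 12]

def neg_log_lik(d50, g):
    # binomial negative log-likelihood of the logistic model over dose levels
    s = 0.0
    for d, n, k in zip(doses, n_pat, events):
        p = 1.0 / (1.0 + math.exp(4.0 * g * (1.0 - d / d50)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        s -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return s

gammas = [0.1 * j for j in range(2, 41)]          # gamma grid, 0.2 .. 4.0

def profile(d50):
    # profile likelihood: minimize over the nuisance parameter gamma
    return min(neg_log_lik(d50, g) for g in gammas)

d50_grid = [40.0 + 0.25 * i for i in range(241)]  # d50 grid, 40 .. 100 Gy
nll = [profile(d) for d in d50_grid]
nll_min = min(nll)
# approximate 95% CI: deviance within chi-square(1, 0.95) = 3.84 of the minimum
inside = [d for d, v in zip(d50_grid, nll) if v <= nll_min + 3.84 / 2.0]
ci_low, ci_high = min(inside), max(inside)
print(ci_low, ci_high)
```

With small event counts the interval is broad and asymmetric, mirroring the behavior described above; the grid-based minimization is deliberately crude but dependency-free.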

The maximum likelihood method was used in this study to estimate model parameters. It should be noted that the choice of method may impact parameter values and confidence intervals. Bentzen and Tucker [16] analyzed dose-response for control of neck nodes and showed that the *D*_{50} value was not sensitive to whether the maximum-likelihood or least-squares method was used to estimate the parameters of the logistic model. Least squares, however, led to a substantially narrower confidence interval, and a significant difference in *γ* was seen. This potentially adds to the uncertainties associated with comparing model parameters reported by various authors.
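The sensitivity to the estimation method can be illustrated with a small numerical experiment. The sketch below fits the logistic model to hypothetical grouped incidence data (not the data of this study or of Bentzen and Tucker) by a coarse grid search, once by maximum likelihood and once by least squares on the observed incidence rates:

```python
import math

# hypothetical grouped data: dose levels (Gy), patients, complications
doses  = [45, 50, 55, 60, 65, 70]
n_pat  = [20, 20, 20, 20, 20, 20]
events = [ 3,  4,  6,  8, 10, 12]

def p_logistic(d, d50, g):
    return 1.0 / (1.0 + math.exp(4.0 * g * (1.0 - d / d50)))

def neg_log_lik(d50, g):
    # binomial negative log-likelihood over dose levels
    s = 0.0
    for d, n, k in zip(doses, n_pat, events):
        p = min(max(p_logistic(d, d50, g), 1e-12), 1.0 - 1e-12)
        s -= k * math.log(p) + (n - k) * math.log(1.0 - p)
    return s

def sum_sq(d50, g):
    # least squares on observed incidence rates
    return sum((k / n - p_logistic(d, d50, g)) ** 2
               for d, n, k in zip(doses, n_pat, events))

def grid_fit(objective):
    best = None
    for i in range(81):           # d50 from 50 to 90 Gy, 0.5 Gy steps
        d50 = 50.0 + 0.5 * i
        for j in range(57):       # gamma from 0.2 to 3.0, 0.05 steps
            g = 0.2 + 0.05 * j
            val = objective(d50, g)
            if best is None or val < best[0]:
                best = (val, d50, g)
    return best[1], best[2]

d50_ml, g_ml = grid_fit(neg_log_lik)
d50_ls, g_ls = grid_fit(sum_sq)
print(d50_ml, g_ml, d50_ls, g_ls)
```

For well-populated data such as this, the two point estimates land close together; the differences grow as the incidence range narrows, and, as noted above, the confidence intervals of the two methods can differ substantially even when the point estimates agree.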

In this study the analysis was restricted to dose-response rather than dose-volume response. The way the volume effect is handled by different models will have an impact on the obtained model parameters. Commonly, dose-volume-response models have a designated parameter describing the strength of the volume dependence; however, models designed to describe the incidence of complications in serial organs may not require this parameter [12]. Furthermore, the slope of the dose-response may or may not be volume-dependent. These choices lead to differences in model parameters; however, the preferred model often cannot be established because of the uncertainties in clinical data.

Venturing into dose ranges not covered by clinical data is unavoidable in biologically-guided IMRT optimization, which makes the choice of model critical. Presently the choice of NTCP model is driven by personal preference, availability of software and historical reasons. The practice of selecting a model and "calibrating" it to make it consistent with locally observed outcomes is encouraged [8]. When advanced biologically-driven treatment planning is used, e.g., to account for biological properties of tumors and normal tissues [5] or the effect of geometric errors [33], there has to be an understanding that the choice of model will dictate the penalty.

The results of IMRT optimization, including biologically-driven optimization, are of course subject to an assessment of the plan for its clinical suitability. If the plan is deemed clinically unsuitable, the optimization can be re-run and navigated towards the desired result by changing weighting factors. Therefore, differences in model predictions can be offset in biologically-based optimization unless absolute values are used. A similar argument applies to plan ranking: the model does not have to be quantitatively accurate as long as it ranks a radiobiologically more desirable plan higher than a less desirable one. Use of biological models for plan ranking cannot be separated from DVH handling. If NTCP is calculated following a DVH reduction by an independent method, e.g., the power-law-based EUD [13], then plan ranking based on EUD is sufficient and calculation of NTCP becomes redundant. If, however, NTCP is calculated directly from the DVH, or the popular effective-volume DVH reduction method is used [34], ranking would be based on the calculated NTCP. It has been shown, however, that plan ranking can be model-dependent [30]. Quantitative use of biological models to predict complication rates for a proposed clinical trial or treatment schedule may depend on the choice of model. Commonly, approaches based on changing fractionation to maintain the rate of complications while improving local control are used. RT protocols based on individualized prescription, with the intent of keeping NTCP below a pre-set level, have also been advocated and used clinically [35]. These approaches indirectly validate model predictions; however, their clinical implementation has to have clearly stated rules for what would be regarded as excess toxicity.
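The point about EUD-based ranking can be made concrete. A minimal sketch, assuming hypothetical differential DVHs and illustrative parameter values (not values from this study): the power-law EUD with volume parameter n is computed for two rival plans and reduced to NTCP via the probit/Lyman form; since NTCP is a monotone function of EUD, the two criteria always rank the plans identically.

```python
import math

def eud(dvh, n):
    # power-law (generalized) EUD from a differential DVH:
    # dvh = [(dose_Gy, relative_volume), ...] with volumes summing to 1
    return sum(v * d ** (1.0 / n) for d, v in dvh) ** n

def ntcp_lyman(eud_gy, td50, m):
    # Lyman (probit) NTCP evaluated at the EUD
    t = (eud_gy - td50) / (m * td50)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# two hypothetical rival plans for the same organ at risk
plan_a = [(20.0, 0.5), (40.0, 0.3), (60.0, 0.2)]
plan_b = [(20.0, 0.4), (40.0, 0.3), (60.0, 0.3)]

n, td50, m = 0.1, 65.0, 0.14   # illustrative parameters (small n: serial organ)
eud_a, eud_b = eud(plan_a, n), eud(plan_b, n)
# ranking by EUD matches ranking by NTCP because NTCP increases with EUD
print(eud_a, ntcp_lyman(eud_a, td50, m))
print(eud_b, ntcp_lyman(eud_b, td50, m))
```

The small volume parameter n makes the EUD weight the high-dose tail heavily, as expected for a serial organ; the quantitative NTCP values remain model-dependent, but the ordering of the two plans does not.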