Variability in prostate and seminal vesicle delineations defined on magnetic resonance images, a multi-observer, -center and -sequence study

Background The use of magnetic resonance (MR) imaging as a part of preparation for radiotherapy is increasing. For delineation of the prostate, several publications have shown decreased delineation variability using MR compared to computed tomography (CT). The purpose of the present work was to investigate the intra- and inter-physician delineation variability for prostate and seminal vesicles, and to investigate the influence of different MR sequence settings used clinically at the five centers participating in the study. Methods MR series from five centers, each providing five patients, were used. Two physicians from each center delineated the prostate and the seminal vesicles on each of the 25 image sets. The variability between the delineations was analyzed with respect to overall, intra- and inter-physician variability, and dependence between variability and origin of the MR images, i.e. the MR sequence used to acquire the data. Results The intra-physician variability in different directions was between 1.3-1.9 mm and 3-4 mm for the prostate and seminal vesicles respectively (1 std). The inter-physician variability for different directions was between 0.7-1.7 mm and approximately equal for the prostate and seminal vesicles. Large differences in variability were observed for individual patients, and also for individual imaging sequences used at the different centers. There was however no indication of decreased variability with higher field strength. Conclusion The overall delineation variability is larger for the seminal vesicles compared to the prostate, due to a larger intra-physician variability. The imaging sequence appears to have a large influence on the variability, even for different variants of the T2-weighted spin-echo based sequences, which were used by all centers in the study.


Introduction
Successful radiotherapy depends on high geometric and dosimetric accuracy and precision. The introduction of treatment planning and dose calculation in 3D, more than two decades ago, has provided the clinicians with very good control over the dosimetric aspects of the treatment, with typical relative errors on the order of a few percent. The more recent introduction of intensity modulated radiotherapy (IMRT) [1] has made it possible to shape the dose distribution to closely match the target volume, and the use of image guided radiotherapy (IGRT) [2] enables reproducible patient positioning at every treatment fraction. At present, we have come close to a point where we can "hit the target" with the right dose every time with minimal dose deposition outside the intended volume. Hence, treatment precision has dramatically improved. However, one problem remains to be solved, as described by Njeh [3]: the uncertainty in the definition of the target. Sharp dose gradients are more a hazard than a benefit if the geometric uncertainty in delineation is large.
The use of magnetic resonance (MR) imaging alone, or together with computed tomography (CT), improves the target delineation accuracy for many diagnoses [4,5] and MR imaging is today in routine clinical use at many centers as a part of the preparation for radiotherapy. The dedicated MR examination for radiotherapy treatment planning involves issues not present in the diagnostic setting. The patient should ideally be imaged in the same position as during treatment, including fixations [6], which influence both the coil setup and image quality [7,8]. The geometric accuracy of the images is crucial, which increases the demands on the choice of sequences and bandwidth [9], and the sequences and image planes should be optimized for determination of the precise geometrical extent of an already known pathology.
There are two alternative ways of incorporating the MR into the radiotherapy workflow; either the MR images are seen as a complement to the CT for target definition or the MR replaces the CT throughout the entire treatment process. The CT/MR workflow is already established in many centers, but suffers from drawbacks in terms of increased workload and potential introduction of geometric errors resulting from the image registration procedure [10,11]. Fully MR based workflows have been described in the literature [12][13][14][15] and are considered feasible.
For prostate cancer patients, the use of MR alone or in combination with CT has been shown to reduce interobserver variability in target definition and reduce the treatment volume [16][17][18][19]. The treatment of prostate cancer has been considered one of the most straightforward diagnoses for an MR only workflow, as the dose calculation accuracy in the pelvic region is adequate with bulk density assignments [20,21] and the commonly used gold markers are visible with reliable geometric accuracy [22]. In addition to the technical challenges with the MR based workflow, one must also consider that the physicians need to adapt to a target definition process without CT information, and that the MR sequences need to be optimized for target definition purposes.
The aim of the present multi-center study is to evaluate the intra- and inter-physician variability of prostate and seminal vesicle volume delineations based on MR sequences from five different radiotherapy centers in the clinical setting. All centers participating in the study were at the time investigating the use of an MR based workflow for the treatment of prostate cancer. As part of this process it was considered important to perform an inter-clinic comparison of both the standard clinical MR images and the interpretation of the images by the physicians. The observed variations can be assumed to reflect the clinical reality as the images were acquired with the standard clinical protocol and the physicians were instructed to perform the delineation as for an ordinary clinical case.

Methods and materials
Five centers were involved in the study: Umeå University Hospital (Umeå, Sweden), Karolinska Hospital (Stockholm, Sweden), Herlev Hospital (Copenhagen, Denmark), Newcastle Upon Tyne Hospitals NHS Trust (Newcastle, United Kingdom) and Beatson West of Scotland Cancer Centre (Glasgow, United Kingdom). All centers were, at the time of the study, routinely using MRI data in their clinical practice for target definition for prostate cancer patients, except Karolinska, which was in the startup process. Both participating physicians from Karolinska did however have extensive previous experience (>5 years) of prostate delineations on MR images from other hospitals. The different scanners and sequences used in the study are listed in Table 1. All centers had chosen to use spin-echo based T2 weighted images as the primary basis for target delineation.

Imaging and preparation of data
Five consecutive patients scheduled for radiotherapy of the prostate were selected from each site. All patients had MR examinations as part of their standard preparation for radiotherapy. The axial images which were typically used for target delineation were anonymized and sent to the study coordinator. The 5 image series from each of the 5 sites were tagged as CT studies in the DICOM files to enable delineations to be performed directly on the MR data in all oncology delineation software applications. The set of 25 image series were then sent to each site and imported into the clinically used treatment planning systems or dedicated delineation software.

Delineations
Two physicians from each site independently delineated the prostate volume and the seminal vesicles. The instruction was: "Both prostate and vesicles should be delineated as if a clinical case with high risk for vesicle involvement". The prostate and the vesicle delineations were stored as separate structure sets. After finalizing the delineations, the structure sets were returned to the study coordinator as DICOM RTstruct files for analysis.

Analysis
The total dataset consisted of 25 patients, with delineations from 10 physicians for each patient. All structures for an individual patient were defined in the same coordinate system and could be directly compared. The prostate and vesicle volumes were analyzed separately.
For the prostate, the first step was to calculate the joint center of mass for all delineations for each patient. The distance from the center was calculated for each delineation in the directions right, left, anterior, posterior, superior, inferior, right-posterior and left-posterior. To reduce the influence of small scale variations in the structure sets and create a representative measure for the distance, the average over a solid angle Ω = 0.49 sr was used (Figure 1). This procedure provides a single numerical measure for the distance in the different directions for each patient and delineation.
The joint center of mass for each patient was also used as a starting point for the analysis of the vesicle delineations. The shape of the vesicles does not, however, allow the same analysis approach due to the sometimes concave surface. Instead, the maximum distance in the right, left, anterior, posterior, superior and inferior directions from the center of mass was calculated for each delineation.
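The two distance measures described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each delineation is available as a list of 3D surface points; the function names, the point-based center of mass and the uniform surface sampling are assumptions and not necessarily how the study's analysis was implemented. The cone half-angle follows from the standard relation Ω = 2π(1 − cos θ), which gives roughly 23° for Ω = 0.49 sr.

```python
import math

def cone_half_angle(solid_angle_sr):
    """Half-angle (radians) of a circular cone subtending the given solid angle."""
    # Omega = 2*pi*(1 - cos(theta))  =>  theta = acos(1 - Omega / (2*pi))
    return math.acos(1.0 - solid_angle_sr / (2.0 * math.pi))

def joint_center_of_mass(delineations):
    """Mean position of all surface points pooled over all delineations.

    Assumption: pooling surface points approximates the joint center of mass.
    """
    points = [pt for contour in delineations for pt in contour]
    n = len(points)
    return tuple(sum(pt[i] for pt in points) / n for i in range(3))

def _unit(direction):
    norm = math.sqrt(sum(d * d for d in direction))
    return [d / norm for d in direction]

def mean_distance_in_cone(points, center, direction, solid_angle_sr=0.49):
    """Average center-to-surface distance over points inside the cone (prostate-style)."""
    unit = _unit(direction)
    cos_limit = math.cos(cone_half_angle(solid_angle_sr))
    distances = []
    for pt in points:
        vec = [pt[i] - center[i] for i in range(3)]
        dist = math.sqrt(sum(v * v for v in vec))
        if dist == 0.0:
            continue  # a point at the center has no direction
        cos_angle = sum(vec[i] * unit[i] for i in range(3)) / dist
        if cos_angle >= cos_limit:
            distances.append(dist)
    if not distances:
        raise ValueError("no surface points inside the cone")
    return sum(distances) / len(distances)

def max_extent(points, center, direction):
    """Largest projection of (point - center) onto the direction (vesicle-style)."""
    unit = _unit(direction)
    return max(sum((pt[i] - center[i]) * unit[i] for i in range(3)) for pt in points)
```

The cone average smooths out small-scale contour irregularities, while the maximum extent remains well defined for concave shapes such as the seminal vesicles.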

Nomenclature
A specific physician is denoted q and a specific patient p. The complete set of physicians is denoted Q and a set of patients is denoted P. The center from which a specific image originates is called imaging center (IC), and the center where a specific delineation is performed is called delineation center (DC).
$x^{var}_{p,q}$ is the observed variable var, which can be either the volume or the distance in a specific direction. The current work includes 25 patients ($N_P = 25$) and 10 physicians ($N_Q = 10$). The average measure of each delineation characteristic was used as a gold standard, and was calculated as

$$\bar{x}^{var}_{p,Q} = \frac{1}{N_Q} \sum_{q \in Q} x^{var}_{p,q} \qquad (1)$$

To remove the systematic variations connected to specific patients we formed the variable

$$y^{var}_{p,q} = x^{var}_{p,q} - \bar{x}^{var}_{p,Q} \qquad (2)$$

i.e. the difference between an individual physician (q) delineation on a specific patient (p) and the average delineation over all physicians, for the delineation characteristic var.
To refer to a specific subgroup of patients imaged at a specific IC, we use the notation $p_{IC}$. To refer to the subgroup of physicians belonging to a specific DC we [...] to the average measure of the delineation property var for all patients coming from imaging center IC.
Table 1 The sequence used at center C was a 3D sequence (Siemens, SPACE), while the other clinics used 2D sequences.
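The patient-wise centering described above (each physician's value minus the all-physician average for the same patient) can be sketched as follows. The nested-list layout `x[p][q]`, the function name and the example numbers are hypothetical and serve only as illustration.

```python
def centre_per_patient(x):
    """x[p][q]: measurement for patient p by physician q. Returns y with the same shape."""
    y = []
    for row in x:
        # Average over all physicians for this patient (the reference value).
        mean_over_physicians = sum(row) / len(row)
        # Deviation of each physician from that patient-specific average.
        y.append([value - mean_over_physicians for value in row])
    return y

# Hypothetical prostate volumes (cm^3): two patients, three physicians.
x = [[40.0, 44.0, 42.0],
     [30.0, 36.0, 33.0]]
y = centre_per_patient(x)
```

After centering, every row of `y` sums to zero, so differences between patients no longer contribute to the spread.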

Statistical analysis
The normality of the data was checked through visual inspection of Q-Q plots. Most reported significant differences use a Bonferroni corrected 0.01 significance level. The reason for the use of the strict significance levels was that the main purpose of the tests was to highlight the most pronounced effects in the dataset, where most factors can be expected to have influence. Two-sided F-tests were used to compare distributions and t-tests to compare averages, unless otherwise indicated.

Intra and inter-observer variation
It was assumed that the variable $y^{var}_{p,q}$ is dependent on physician (q) and patient (p), i.e.

$$y^{var}_{p,q} = w^{var}_{q} + z^{var}_{p,q} \qquad (3)$$

where $w^{var}_{q}$ is a factor that is only dependent on the physician, with standard deviation $\sigma^{var}_{w}$, and $z^{var}_{p,q}$ is a residual with mean $\equiv 0$ and standard deviation $\sigma^{var}_{z}$. Equation 3 is an ordinary one-way random effect ANOVA model [23], where $w^{var}_{q}$ is the effect of the physician, hence $\sigma^{var}_{w}$ is interpreted as the inter-physician variation, and $z^{var}_{p,q}$ accounts for the residual variation, hence $\sigma^{var}_{z}$ is interpreted as the intra-physician variation (Figure 2).
Estimates for the variances, i.e. $(s^{var}_{z})^2$ and $(s^{var}_{w})^2$, can be calculated using the equations [...] Confidence intervals for the true variabilities were found using simulation. The simulation was performed in a custom written Matlab™ Monte Carlo script. The script searched the $\sigma^{var}_{z}$ and $\sigma^{var}_{w}$ space to identify the area where the probability to get the observed $s^{var}_{z}$ and $s^{var}_{w}$ or more extreme values was below 5%.
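As an illustration of the separation of variances and the simulation idea, the following minimal sketch uses the textbook ANOVA estimators for a model of this type. The exact estimators and the Matlab script used in the study are not reproduced here; the function names, the degrees of freedom (N_Q − 1)(N_P − 1) for the residual term and the Gaussian simulation are assumptions.

```python
import random

def variance_components(y):
    """y[p][q]: centred observations (patient mean already removed).

    Returns (s_z, s_w): estimated intra- and inter-physician standard deviations.
    """
    n_p, n_q = len(y), len(y[0])
    phys_mean = [sum(y[p][q] for p in range(n_p)) / n_p for q in range(n_q)]
    # Residual sum of squares around each physician's own mean.
    ss_res = sum((y[p][q] - phys_mean[q]) ** 2
                 for p in range(n_p) for q in range(n_q))
    # Assumed degrees of freedom, since the patient means were removed beforehand.
    var_z = ss_res / ((n_q - 1) * (n_p - 1))
    grand = sum(phys_mean) / n_q
    var_means = sum((m - grand) ** 2 for m in phys_mean) / (n_q - 1)
    # Physician means still carry residual noise var_z / n_p; subtract it.
    var_w = max(var_means - var_z / n_p, 0.0)
    return var_z ** 0.5, var_w ** 0.5

def simulate_centred(n_p, n_q, sigma_z, sigma_w, rng):
    """Draw one synthetic study from y = w_q + z_pq and centre it per patient."""
    w = [rng.gauss(0.0, sigma_w) for _ in range(n_q)]
    y = []
    for _ in range(n_p):
        row = [w[q] + rng.gauss(0.0, sigma_z) for q in range(n_q)]
        mean = sum(row) / n_q
        y.append([v - mean for v in row])
    return y

def tail_probability(s_z_obs, n_p, n_q, sigma_z, sigma_w, n_sim=200, seed=0):
    """Fraction of simulated studies whose s_z is at least as large as observed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        s_z, _s_w = variance_components(
            simulate_centred(n_p, n_q, sigma_z, sigma_w, rng))
        if s_z >= s_z_obs:
            hits += 1
    return hits / n_sim
```

A confidence region could then be traced by evaluating such tail probabilities over a grid of candidate (σ_z, σ_w) pairs and keeping those where the observed estimates are not too extreme; the sketch shows only the s_z tail for one candidate pair.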

Normality
The differences between delineations from individual physicians and the average, i.e. $y^{var}_{p,q}$, were approximately normally distributed for all scored variables. There was however a tendency that the largest deviations were larger than predicted with a Gaussian model, especially pronounced for the posterior, right-posterior and left-posterior directions for the prostate and for the volume and anterior direction for the vesicles.

Delineation summary
As seen in Table 2 and Figure 3 there are apparent differences between the patient samples from the centers. [...] approximately the same for the different imaging centers (Table 2).

Variability for different patients
The variability among the physicians differed for different patients, as can be seen in Tables 3 and 4 giving the median, max and min of the standard deviation for individual patients, i.e. the max, min and median over patients p of $S(y^{var}_{p,Q})$. For the prostate, the highest frequency of large deviations ($y^{var}_{p,q} > 4$ mm) was found in the inferior and anterior directions (8% each), followed by the superior direction (6%). The lowest frequency was found in the right, left and posterior directions (below 3%) while the frequency was around 5% in the posterior-left and posterior-right directions.
For the vesicles the variability was larger, as can be seen by comparing Tables 3 and 4. The highest frequency of large deviations ($y^{var}_{p,q} > 8$ mm) was found in the right and left directions (6%), while the frequency was around 2% in the other directions.

Physician variability
The influence of the delineating physician was large and significant for all investigated variables (p < 0.01, Kruskal-Wallis test, SPSS). Tables 5 and 6 give the intra- and inter-physician variability (1 std) for the different delineation variables, for the prostate and the seminal vesicles, together with the 95% confidence interval for the true variability.

Variability for different Sequences
All centers participating in the study used T2 weighted images for target delineation. There were, however, noticeable differences in the image contrast, as can be appreciated in Figure 4. Tables 7 and 8 [...] This was the pattern for all scored variables for the prostate, and was significant for the posterior and right-posterior directions (2-sided F-test, p < 0.01, Bonferroni corrected). For the vesicles the pattern was similar. The variability was significantly lower for the right and left directions with images from the physicians' home center compared to a foreign center (2-sided F-test, p < 0.01, Bonferroni corrected).

Discussion
Delineation errors have a direct effect on the quality of the treatment. An excessive target volume entails unnecessary risk of complications, while an undersized target reduces the chance of cure. The relationship between the target definition variability and the extent of the optimal margin to compensate for geometrical uncertainties is not completely clear. From a local control perspective the target definition variability should be considered a systematic uncertainty affecting the entire treatment, and should therefore be reflected in the employed margins. However, the uncertainty in the delineation is heavily dependent on both physician (Tables 5 and 6) and patient (Tables 3 and 4), which makes it inadequate to employ generalized margins to account for the variability. The opinion of the authors of the present work is that the responsibility to account for the delineation uncertainty should be placed on the physicians. The target volume should be delineated to cover the volume that the physician wants to treat: actively including volumes that are of benefit for the patient, taking both local control probability and risk for side effects into account, and actively excluding volumes that, for example, are close to sensitive healthy tissues and where the probability of tumorous growth is considered small. When only one physician delineates the target, the delineation from that physician is the best available estimate for the correctly defined target volume and should therefore be used without any additional generic margin accounting for the variability. Hence, the primary effect of improved imaging leading to decreased variability will not be a general possibility to reduce the standard margins, but will rather be reflected in a more uniform and generally increased treatment quality. Improved consistency will, as a secondary effect, improve the statistical power when evaluating and optimizing treatment protocols in clinical trials.
A way to decrease variability is through training and experience [24]. In Tables 7 and 8 it can be seen that physicians' delineations on images from their home center generally were closer to the average compared to when delineating on images from foreign centers. In some directions the difference was up to 40% (for example the posterior direction for the prostate). This effect may be attributed to customization and experience of the local MR sequence. Another way to potentially decrease the variability is to optimize the MR sequence. It is however ambitious to optimize with respect to the delineation variability. Data from the present work do not indicate reduced variability when using a 3 T scanner (center B), but the observations for the single 3 T scanner and only one sequence may not be representative.
Intra-observer variation has often been reported based on repeated delineation on a single image set by one or several physicians. The approach used in this work is based on the analysis of variance theory [23] and has been described for the present application by Remeijer et al. [25]. It does not rely on multiple delineations by the same physician on individual images. To be able to make the separation we assumed that the intra-physician variability was equal for all physicians. In practice this was not the case. For all scored variables there was at least one physician with significantly different variability (F-test, p < 0.01, Bonferroni corrected). In addition, Tables 7 and 8 show that the variability depends on the origin of the images (home vs. foreign center) as mentioned above. This phenomenon was not accounted for in the separation model. The numbers in Tables 5 and 6 are representative values describing the observations in the present study, but should be interpreted with these reservations in mind. A concern when setting up the study was that the observed overall variability would primarily reflect the use of different clinical routines and traditions at the different centers. The separation of the variability into inter- and intra-physician components did however reveal that the intra-physician variability was dominating both for the prostate and especially for the seminal vesicles. There were significant differences between different delineation centers: the physicians from centers A and D did on average delineate 20-30% larger prostate volumes compared to the physicians from centers B, C and E, but the dominating source of variability in individual directions was still the intra-physician variability. The increase of overall variability for the seminal vesicle delineation compared to the prostate delineation could be fully attributed to the larger intra-physician variability.
The inter-physician variability observed in the present study, summarized in Table 5, is approximately in line with the observations described in the literature. Rasch et al. found an inter-physician variability in the inferior region (apex) and superior region of around 1 mm (1.7 mm and 1.6 mm in the present study) using axial MR images for 18 patients and with 3 observers [16]. The intra-observer variability was around 3 mm in both regions (2.0 mm and 1.5 mm in the present study). It should be noted that Rasch et al. used a similar separation of variance as utilized in the present work, but the low number of physicians makes the estimates for the inter-physician variability uncertain. Smith et al. reported inter-observer volume variability of 4.6 cm³ (6.1 cm³ in the present study), and intra-physician volume variability of 2.7 cm³ based on repeated observations on the same patient (5.1 cm³ in the present study), in a study with 10 patients and 7 observers [26]. The large difference between the intra-physician variability in the present study compared to the study by Smith et al. could be due to the use of repeated delineations on the same image to estimate the intra-physician variability compared to separation of variances.
The results from the present study are also approximately in line with variability reported in the literature for CT based delineations. Fiorino et al. [27] reported a study using 6 patients and 5 observers and found a short term intra-physician variation of 0.8, 1.1 and 1.5 mm for the posterior, anterior and right/left directions, to be compared with the intra-physician variability of 1.4, 1.7 and 1.9 mm, respectively, found in the present study. The inter-observer standard deviation was estimated by Fiorino et al. to 1.4, 1.5 and 2.0 mm in the anterior, posterior and left-right directions, which should be compared with the approximately equivalent numbers found in Table 5 [...] directions from the present work was added together assuming these are independent variables.
The numbers within brackets below each estimate indicate the 95% confidence interval.
The present work is based on a total of 250 delineations performed by 10 physicians on 25 patients. This is a large study and provides relatively tight 95% confidence intervals, especially for the intra-physician variability, $\sigma^{var}_{z}$. A common method for determination of confidence intervals for the true standard deviation σ is to make use of the relation between the ratio of the estimated, s, and true standard deviation and the χ² distribution

$$\frac{DF \cdot s^2}{\sigma^2} \sim \chi^2_{DF} \qquad (6)$$

where DF is the degrees of freedom. For the intra-physician variation in the present study $DF = (N_Q - 1)(N_P - 1)$. The confidence interval for $\sigma^{var}_{w}$ can also be estimated using equation 6 ($DF = N_Q - 1$), but especially in cases when $\sigma^{var}_{w} \ll \sigma^{var}_{z}$ this estimation will lead to an underestimation of the confidence interval. For the prostate, where the intra-physician and inter-physician variability was of the same magnitude, the use of equation 6 gives approximately the same results as simulation. For example, in the inferior direction for the prostate, the simulations and equation 6 gave equivalent results to within 0.1 mm.
For the vesicles, however, where the intra-physician variability was larger compared to the inter-physician variability, the use of equation 6 underestimates the confidence interval. For example, the confidence interval for the inter-physician variability in the posterior direction was simulated to 0.6-2.1 mm, while equation 6 gave 0.7-1.9 mm. This can be understood considering the scenario with a very large intra-observer variation creating random variations between physicians, and making the inter-observer variation difficult to quantify. The reporting of confidence intervals is very important, especially when using small sample sizes and/or separation of variances into components.
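The χ² based interval for a standard deviation discussed above can be sketched with the Python standard library alone. The χ² quantile is obtained here with the Wilson-Hilferty approximation, which is a substitution made for this sketch (it is not taken from the study) and is accurate to roughly a percent at DF = 9; the function names and the placeholder value s = 1.0 are assumptions.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the p-quantile of a chi-square distribution."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3

def sigma_confidence_interval(s, df, level=0.95):
    """Interval for the true sigma from DF * s^2 / sigma^2 ~ chi-square(DF)."""
    alpha = 1.0 - level
    # A large chi-square quantile gives the lower bound on sigma, and vice versa.
    lower = s * math.sqrt(df / chi2_quantile(1.0 - alpha / 2.0, df))
    upper = s * math.sqrt(df / chi2_quantile(alpha / 2.0, df))
    return lower, upper

n_q, n_p = 10, 25                      # physicians and patients in the study
s = 1.0                                # placeholder estimate, in mm
inter = sigma_confidence_interval(s, n_q - 1)
intra = sigma_confidence_interval(s, (n_q - 1) * (n_p - 1))
```

With these degrees of freedom the interval is wide for DF = 9 (roughly 0.7 to 1.8 times s) and much tighter for DF = 216, which illustrates why the intra-physician estimates are better constrained. As noted in the text, this interval ignores the contribution of the intra-physician variation to the inter-physician estimate.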
A clinical objective of the present work was to provide feedback to physicists optimizing the MR sequences at the different sites and to the physicians doing the delineations. For the physicians the feedback consisted of two parts. The image sets for the 25 patients were returned to each physician with their own delineations shown as a white structure set and the other 9 delineations shown as black structures. See Figure 5 for an example of the delineation variability. The purpose of this feedback was to give an indication of their performance in relation to the others. In addition, a one day workshop was organized where a selection of the patient cases were reviewed together with two radiologists specialized in MR examinations of the prostate and seminal vesicles. The outcome of this workshop was that the average delineation of the prostate from the 10 radiation oncologists was close to the opinion of the radiologists, while the delineations of the vesicles performed by the radiation oncologists tended to overestimate the extent of the seminal vesicles for some patients, especially in the anterior and right-left directions. The radiologists preferred the image quality provided by center B, followed by the image quality from center D.
Bold figures mean statistically significant larger and italic significantly smaller variation compared to other imaging centers (F-test, Bonferroni corrected, p < 0.01). The two last rows compare the average standard deviation for delineations performed on images from the physicians' home center compared to images from a foreign center.
It is interesting to notice that the images from these sites were associated with the largest delineation variability. Our interpretation is that increased amount of information increases the scope for interpretation and hence the importance of training and experience. It also highlights the complexity of the optimization procedure and the importance of a well defined objective for the optimization. If the objective is to reduce the delineation variability of the prostate or the seminal vesicles it could be counter-productive to use sequences optimized to visualize pathology. Our opinion is that recommendations on specific sequence settings are difficult to make because of the different needs and possibilities at different centers. For example, if high quality diagnostic images are already available for a patient there is less need to acquire images optimized for pathology.

Conclusion
The overall intra- and inter-physician variability for prostate and seminal vesicle delineations was determined for clinically used MR sequences optimized for target volume determination at 5 different radiotherapy centers in Europe. Large differences in variability were observed between different patients, but also between different MR sequences, even though all centers used T2-weighted spin-echo based sequences. The intra-physician variability was significantly larger for the seminal vesicles compared to the prostate, while the inter-physician variability was approximately the same.