3D Variation in delineation of head and neck organs at risk
- Charlotte L Brouwer1,
- Roel JHM Steenbakkers1,
- Edwin van den Heuvel2,
- Joop C Duppen3,
- Arash Navran3,
- Henk P Bijl1,
- Olga Chouvalova1,
- Fred R Burlage1,
- Harm Meertens1,
- Johannes A Langendijk1 and
- Aart A van ’t Veld1
© Brouwer et al; licensee BioMed Central Ltd. 2012
Received: 23 December 2011
Accepted: 13 March 2012
Published: 13 March 2012
Consistent delineation of patient anatomy becomes increasingly important with the growing use of highly conformal and adaptive radiotherapy techniques. This study investigates the magnitude and 3D localization of interobserver variability of organs at risk (OARs) in the head and neck area with application of delineation guidelines, to establish measures to reduce current redundant variability in delineation practice.
Interobserver variability among five experienced radiation oncologists was studied in a set of 12 head and neck patient CT scans for the spinal cord, parotid and submandibular glands, thyroid cartilage, and glottic larynx. For all OARs, three endpoints were calculated: the Intraclass Correlation Coefficient (ICC), the Concordance Index (CI) and a 3D measure of variation (3D SD).
All endpoints showed largest interobserver variability for the glottic larynx (ICC = 0.27, mean CI = 0.37 and 3D SD = 3.9 mm). Better agreement in delineations was observed for the other OARs (range, ICC = 0.32-0.83, mean CI = 0.64-0.71 and 3D SD = 0.9-2.6 mm). Cranial, caudal, and medial regions of the OARs showed largest variations. All endpoints provided support for improvement of delineation practice.
Variation in delineation is traced to several regional causes. Measures to reduce this variation can be: (1) guideline development, (2) joint delineation review sessions and (3) application of multimodality imaging. Improvement of delineation practice is needed to standardize patient treatments.
Keywords: Interobserver variability; Interobserver agreement; Head and neck cancer; Organs at risk; Delineation
Radiotherapy (RT) plays an important role in the treatment of head and neck cancer patients. Many new radiation delivery techniques such as intensity-modulated RT (IMRT) have been developed to allow improved dose conformation with steeper dose gradients compared with conventional three-dimensional conformal RT. Variation in contouring is an important obstacle to achieving high geometric accuracy in the clinical application of these new techniques. Reproducibility in delineation of tumour and normal structures is important for optimal patient treatment. As new radiation delivery techniques are increasingly controlled by OAR constraints for normal tissue sparing, variations in OAR delineation may unintentionally influence the treatment plan, including the dose to these OARs. In a number of publications (e.g. Bortfeld and Jeraj), uncertainties in the contouring of organs are also mentioned as one of the potential causes of uncertainties in historical dose and volume data and therefore of reduced performance of predictive models. Deasy et al. furthermore mentioned that differences in segmentation procedure could be one of the reasons explaining variations between existing models.
Target volume delineation variability in the head and neck area has been investigated in several studies (e.g. Rasch et al.), indicating the need to minimize observer variation for adequate irradiation. However, interobserver variability of OARs in the head and neck area has not been frequently studied. Nelms et al. found significant organ-specific interclinician variation for head and neck OARs. These variations resulted in large differences in dose distribution parameters, especially in high dose gradient regions. The authors stated that the major variations were in each observer's interpretation of the OARs' actual size and shape, suggesting the need for basic training (with unambiguous guidelines) on identifying OARs. Our department uses well-defined delineation guidelines to promote the consistency and accuracy of delineation, such as the recently described guidelines for the delineation of OARs related to salivary dysfunction and anatomical structures involved in swallowing [7, 8]. Interobserver variation in the contouring of OARs is therefore intended to be minimal, but there will still be regions in the OARs that are difficult for the observer to interpret. Accurate determination of variation in OAR delineation, expressed in volumetric, positional and local 3D measures, is therefore needed to establish the current level of accuracy, to bring actual weaknesses to light and to establish measures to reduce current redundant variability in delineation practice. More consistency in the delineation of OARs may contribute to more consistent dose volume data, and thus less uncertainty in the usage of dose volume characteristics. With unambiguous and consistent contouring of OARs we could furthermore generalize the application of normal tissue complication probability (NTCP) models. We might even be able to develop improved models when more consistent dose-volume data are correlated with clinical outcome.
This is particularly important for the rising application of particle therapy, in which the dose gradients are extremely steep. The obtainable level of accuracy of delineations also provides valuable information for the evaluation of tools for automatic (re-)contouring. Qazi et al., for instance, reported high accuracy for automatic segmentation within a clinically acceptable segmentation time, but also mentioned the need for multi-observer studies to give more insight into the robustness, reliability, and stability of the automated approach. Existing variations in expert delineations could serve as benchmark data.
The aim of the current study was to indicate OAR regions with high interobserver variability in the head and neck area, to subsequently establish possible solutions for this variability in delineation practice.
The study population was composed of 6 head and neck cancer patients. These patients underwent a planning CT scan (CTplan) which was acquired prior to radiation, and a repeat CT scan (CTrep) which was acquired during the course of radiation. CTrep scans were performed 11 to 35 days (range) after the start of radiotherapy. The CT images were made with the patient in supine position on a multidetector-row spiral CT scanner (Somatom Sensation Open, 24 slice configuration; Siemens Medical Solutions, Erlangen, Germany). The acquisition parameters were: gantry un-angled, spiral mode, rotation time 0.5 s, 24 detector rows at 1.2 mm intervals, table speed 18.7 mm/rotation, reconstruction interval 2 mm at Kernel B30 and 120 kVp/195 mA. The matrix size was 512 × 512, with a pixel spacing of 0.97 × 0.97 × 2.0 mm in the x, y and z directions, respectively.
Five specialized head and neck radiation oncologists (R.S., A.N., H.B., O.C. and F.B.), all treating more than 50 head and neck patients per year, delineated five OARs on axial CT slices in all CT images. The radiation oncologists had no clinical patient information beyond the CT scan. The OAR set included the spinal cord, the parotid and submandibular glands, the thyroid cartilage, and the glottic larynx. For one patient, the right parotid gland contained tumour infiltration and therefore the patient was excluded from analysis for this particular OAR beforehand. The total number of delineated structures was 410.
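The reported total can be checked with a short bookkeeping sketch. It assumes, which the text does not state explicitly, that the left and right parotid and submandibular glands were counted as separate structures, giving seven structures per scan, and that the excluded right parotid gland was dropped on both scans of that patient. The constant and function names are illustrative only.

```python
# Bookkeeping sketch. ASSUMPTION: 7 structures per scan (spinal cord, 2 parotid
# glands, 2 submandibular glands, thyroid cartilage, glottic larynx).
N_PATIENTS = 6
SCANS_PER_PATIENT = 2      # CTplan and CTrep
N_OBSERVERS = 5
STRUCTURES_PER_SCAN = 7    # assumed, see lead-in


def total_delineations(excluded_patient_structures: int = 1) -> int:
    """Count delineated structures after dropping whole patient-structure pairs."""
    full = N_PATIENTS * SCANS_PER_PATIENT * N_OBSERVERS * STRUCTURES_PER_SCAN
    # Each excluded patient-structure pair removes one contour per scan and observer.
    excluded = excluded_patient_structures * SCANS_PER_PATIENT * N_OBSERVERS
    return full - excluded


print(total_delineations())  # 420 - 10 = 410
```

Under these assumptions the count reproduces the 410 structures reported above.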
CTplan and CTrep were delineated under slightly different circumstances, since CTplan was made with contrast-enhancement (iodine containing contrast medium, intravenously applied) while CTrep was acquired without contrast enhancement. Furthermore, the CTplan scan was delineated from scratch and the CTrep scan was delineated using a template obtained from the delineated contours of the CTplan, which were propagated to CTrep after a rigid registration of CTrep to CTplan in each individual patient.
The radiation oncologists were instructed to delineate the parotid and submandibular glands according to the delineation guidelines of van de Water et al. .
Following these guidelines the parotid gland was demarcated in lateral direction by a hypodense area corresponding to subcutaneous fat and more caudally by the platysma. The medial border was defined by the posterior belly of the digastric muscle, the styloid process and the parapharyngeal space. The cranial aspect of the parotid gland was related to the external auditory canal and mastoid process. Caudally, the gland protruded into the posterior submandibular space inferior to the mandibular angle. The anterior border was defined by the masseter muscle, the posterior border of the mandibular bone and the medial and lateral part of the pterygoid muscle. The posterior border was delimited by the anterior belly of the sternocleidomastoid muscle and the lateral side of the posterior belly of the digastric muscle. The external carotid artery, the retromandibular vein and the extracranial facial nerve are prescribed to be enclosed in the parotid gland.
Cranial demarcation of the submandibular gland was defined by the medial pterygoid muscle and the mylohyoid muscle, the caudal demarcation by fatty tissue. The anterior border was the lateral surface of the mylohyoid muscle and the hyoglossus muscle, and the posterior border the parapharyngeal space and the sternocleidomastoid muscle. Lateral demarcation was described by the medial surface of the medial pterygoid muscle, the medial surface of the mandibular bone and the platysma. Finally, the medial border was described by the lateral surface of the mylohyoid muscle, the hyoglossus muscle, the superior and middle pharyngeal constrictor muscles and the anterior belly of the digastric muscle.
The spinal cord was delineated as the actual spinal cord instead of using bony structures as a surrogate for the spinal cord, starting at the tip of the dens and ending at the level of the third thoracic vertebra. The thyroid cartilage was delineated as the actual thyroid cartilage. The cranial border of the glottic larynx was defined as the arytenoid cartilages and the caudal border as the edge of the cricoid.
For all observers, mean volumes and standard errors (SEs), as well as coefficients of variation (CVs) per OAR were calculated. In addition, an 'OAR ratio' for each observer was determined, which was defined as the ratio of the mean OAR volume per observer divided by the mean volume of that OAR determined by all observers. Friedman's test was applied to the CTplan data per OAR separately to investigate a possible systematic effect in the determination of volumes by the observers.
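The two volume summaries described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the data layout (one list of volumes per observer), the use of the sample standard deviation for the CV, and all numbers in the example are assumptions.

```python
from statistics import mean, stdev


def oar_ratios(volumes):
    """volumes: dict observer -> list of volumes (cm^3) of one OAR, one per CT scan.

    Returns observer -> (mean volume of that observer) / (mean volume over all observers).
    """
    per_observer = {obs: mean(v) for obs, v in volumes.items()}
    overall = mean(list(per_observer.values()))
    return {obs: m / overall for obs, m in per_observer.items()}


def coefficient_of_variation(values):
    """Sample SD divided by the mean, as a percentage (sample SD is an assumption)."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical volumes for three observers on two scans.
example = {"A": [10.0, 12.0], "B": [11.0, 13.0], "C": [9.0, 11.0]}
ratios = oar_ratios(example)
cv_first_scan = coefficient_of_variation([v[0] for v in example.values()])
```

An OAR ratio above 1 marks an observer who, on average, draws a larger volume than the group; a ratio below 1 marks a smaller one.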
We used different endpoints to investigate interobserver variability. Variations in volume were indicated by the Intraclass Correlation Coefficient (ICC), and differences in combined volume and positional variations by the Concordance Index (CI). Local variations in delineation were finally described by the regional 3D SD. Integration of these three endpoints could help us to identify the type of variation in delineation.
Intraclass correlation coefficient
Analysis of variance (ANOVA) was conducted for estimation of the ICC per OAR. The ICC quantifies how well the observers defined the same size of volumes, without considering the position of the volume of one observer with respect to the other . To assess the ICC for each OAR separately, a three-way mixed effect analysis of variance model was applied to the volume data. All possible interaction terms were included, with patients and observers as random effects and time as a fixed effect. The time effect describes the mean difference in volume during the treatment (CTplan vs. CTrep). The patient and time-patient interaction effects were considered sources of variation that are unrelated to observer variation. Therefore, in line with Barnhart et al. , the ICC was calculated as the ratio of the sum of variance components for patient and time-patient interaction effects and the sum of all variance components. It represents the correlation coefficient of two arbitrary observers measuring the same patient at the same time (the same CT scan). We used a classification of the data as presented by Shrout et al. . Values of 0.00-0.10 represent virtually no agreement (reliability); 0.11-0.40 slight agreement; 0.41-0.60 fair agreement; 0.61-0.80 moderate agreement; and 0.81-1.00 substantial agreement.
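Once the variance components have been estimated, the ICC definition and the classification above reduce to a ratio and a lookup. The sketch below assumes the components are already available from a fitted mixed model; the component names and the example values are hypothetical, and rounding to two decimals before classification is an assumption made to match the two-decimal class limits.

```python
def icc_from_variance_components(components,
                                 unrelated_to_observer=("patient", "time:patient")):
    """Ratio of the patient-related variance components to the sum of all components."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return sum(components[name] for name in unrelated_to_observer) / total


def classify_icc(icc):
    """Agreement class following the limits listed in the text."""
    r = round(icc, 2)  # class limits are given to two decimals
    if r <= 0.10:
        return "virtually no agreement"
    if r <= 0.40:
        return "slight agreement"
    if r <= 0.60:
        return "fair agreement"
    if r <= 0.80:
        return "moderate agreement"
    return "substantial agreement"


# Hypothetical variance components (cm^6); the names are illustrative.
example_components = {
    "patient": 6.0, "time:patient": 2.0, "observer": 1.0,
    "patient:observer": 0.5, "time:observer": 0.0, "residual": 0.5,
}
example_icc = icc_from_variance_components(example_components)
```

With this classification, the reported ICC of 0.27 for the glottic larynx falls in the slight agreement class and the upper end of the reported range, 0.83, in the substantial agreement class.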
Another endpoint for interobserver variability used in this study was the ratio of the intersection (Volume1∩Volume2) and union (Volume1∪Volume2) volume of two delineated volumes. Terminology for this coefficient varies  but we adhered to the term concordance index (CI), as is also done in the overview of Hanna et al.  and in the review of Jameson et al. . The CI is both sensitive to positional differences and differences in volume size between observers.
We calculated a mean CI value per OAR by averaging all individual CIs over all ten observer pairs and all twelve CT scans, and we determined the range of CIs. A CI of 1.00 indicates perfect overlap (identical structures), whereas a CI of 0.00 indicates no overlap at all.
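As an illustration of the definition above, the CI of two delineations can be computed from voxel sets. The sketch assumes each delineation has been rasterized to a set of voxel indices on a common grid with equal voxel size, so that voxel counts are proportional to volume; the function names are illustrative and this is not the software used in the study.

```python
from itertools import combinations
from statistics import mean


def concordance_index(a, b):
    """Intersection volume divided by union volume for two voxel sets."""
    union = a | b
    if not union:
        raise ValueError("both delineations are empty")
    return len(a & b) / len(union)


def pairwise_ci_summary(delineations):
    """Mean, minimum and maximum CI over all observer pairs for one structure."""
    cis = [concordance_index(a, b) for a, b in combinations(delineations, 2)]
    return mean(cis), min(cis), max(cis)


# Two toy delineations sharing one of three voxels.
obs1 = {(0, 0, 0), (0, 0, 1)}
obs2 = {(0, 0, 1), (0, 0, 2)}
toy_ci = concordance_index(obs1, obs2)
```

With five observers, `combinations` yields the ten observer pairs mentioned in the text; the per-scan CIs would then be pooled over the twelve CT scans.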
Large discrepancies between the ICC and the CI indicate that observers are either more consistent in defining the volume size (ICC >> CI), or more consistent in positioning the volumes (CI >> ICC).
3D analysis of variation
The 3D analysis of variation allows quantification of local variation in delineated structures in 3D . For each OAR, a median contour surface of all 5 observer delineations was computed in 3D [16, 17]. The local variation in the five distances to the median (SD) was determined for each surface point, and was averaged over all surface points of the OAR to obtain the global 3D SD.
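The last step, from local SDs to the global 3D SD, can be sketched as follows. The sketch assumes the per-observer distances to the median surface have already been computed for every surface point; the use of signed distances, of the sample standard deviation, and of a plain arithmetic mean over surface points are assumptions, since the text does not specify them.

```python
from statistics import mean, stdev


def local_sd(distances_mm):
    """SD of the observers' distances (mm) to the median surface at one surface point."""
    return stdev(distances_mm)  # sample SD (n - 1); an assumption


def global_3d_sd(distances_per_point):
    """Average of the local SDs over all surface points of one OAR."""
    return mean([local_sd(d) for d in distances_per_point])


# Hypothetical distances (mm) of five observers at two surface points.
toy_points = [
    [-1.0, 0.0, 0.0, 0.0, 1.0],
    [-2.0, -1.0, 0.0, 1.0, 2.0],
]
toy_global_sd = global_3d_sd(toy_points)
```

Keeping the local SD per surface point, rather than only the global average, is what allows the variation to be mapped to specific regions of an OAR.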
Note that these 3D SD results provided additional information to the ICC and CI, because these results quantify in which region of the OAR the highest variability in volume sizes and position was observed.
Table 1: Mean volume [cm3] (SE), coefficient of variation, and mean OAR ratio per observer and organ at risk (rows include the left and right parotid and submandibular glands).
The CVs in Table 1, which indicate an observer relative standard deviation for observing a volume, clearly showed highest variability for the glottic larynx (56%), while the other CVs varied from 12 to 16%.
Intraclass correlation coefficient
Table 2: Interobserver variability of the organs at risk described by 3 different endpoints, including the 3D SD (mm) (rows include the left and right parotid and submandibular glands).
The mean CI values varied from 0.64 to 0.71, except for the glottic larynx for which the mean CI was 0.37 (Table 2). A large range in the CI of different observer pairs was seen (min-max, 0.11-0.86, Table 2).
3D analysis of variation
This study included an extensive 3D analysis of variation in delineation of a set of OARs in the head and neck area. All OARs, except for the glottic larynx, showed moderate interobserver variability with ICC values of 0.32-0.83, CI values of 0.64-0.71, and 3D SD values of 0.9-2.6 mm. Cranial, caudal, and medial regions of the OARs showed the largest variations. The glottic larynx showed larger variation in delineation (ICC = 0.27, mean CI = 0.37 and 3D SD = 3.9 mm). All endpoints provided support for improvement of delineation practice.
The poor consistency in delineation of the glottic larynx was mainly caused by poor compliance with the delineation guidelines. The guidelines prescribe that the glottic larynx ends at the caudal edge of the cricoid and that the arytenoid cartilages are included in the glottic larynx contour. As illustrated in Figure 3(d,e,f), this description was not consistently followed. Reduction of the interobserver variability might be accomplished by joint delineation review sessions in which all radiation oncologists who are involved in head and neck cancer participate. These sessions are now standard practice at our institutes (UMCG, NCI-AVL).
The salivary glands showed moderate interobserver variability. Visual inspection of the parotid gland contours showed that the guidelines for these organs were also not consistently followed. The protocol prescribed that superficial temporal vessels should be enclosed in the delineated parotid gland, because they are generally hard to distinguish from the parotid gland tissue on scans with no or poor contrast. Still, some observers did not include the vessels in their delineation of the parotid gland. Joint delineation review sessions and clarification of the guidelines could help here. Yi and colleagues, for instance, showed that clear stepwise delineation guidelines resulted in minimal variability. Our volume analysis of the parotid gland data showed CV values (12% and 15%) similar to those of Geets et al. (17%). Nelms et al. found larger CV values (34% and 29%), evaluated for 1 patient by 32 observers. Our 3D SD evaluation provides valuable information on specific regional variations. The largest discrepancies for the parotid glands were located at the cranial, caudal and medial subregions of the gland. For the submandibular glands, the cranial parts of the organ clearly showed the largest discrepancies. Poor discrimination between tissues at the medial borders of the parotid gland (e.g. distinction from the posterior belly of the digastric muscle) and the cranial parts of the submandibular gland (e.g. distinction from the medial pterygoid muscle and the mylohyoid muscle) could be a reason for these larger 3D SD values. The addition of MRI might improve the visibility of borders between tissues [19, 20]. The cranial and caudal variations for the parotid glands could partly be explained by the image resolution in the cranial-caudal direction of 2 mm (the slice thickness) and the fact that observers could only delineate on transverse CT slices, which limits the resolution in both the cranial and caudal part of the delineations.
These limitations do, obviously, apply to the delineations of all OARs. Possibly the availability of delineation on multiple orientations could help to diminish these variations, as is also suggested by Steenbakkers et al. . The use of a standardized delineation environment and tools for automatic contouring might further contribute to reduce interobserver variability [17, 21, 22].
Interobserver variability of the spinal cord was predominantly caused by variations at the cranial and caudal part of the structure, due to indistinctness of the guidelines and low compliance. This problem could be reduced by clearer delineation guidelines although it is unlikely that these variations will have major consequences in clinical practice as long as the spinal cord is accurately delineated in the vicinity of the irradiated volume and the maximum dose to the cord is considered as the leading parameter for treatment planning.
We analysed the interobserver variability in contouring on twelve CT sets, which consisted of six CTplan scans and six CTrep scans. The three endpoints of interobserver variability did not indicate a trend in the differences between CTplan and CTrep (for example see Figure 2). The correspondence of interobserver variability between the scans may suggest that the use of contrast (in CTplan) and the use of an observer-specific delineation template (in CTrep) have effects of comparable magnitude on the variation in delineation amongst observers, for the considered OARs. In addition, the guidelines were developed to be applicable to non-contrast as well as contrast-enhanced CT data, which will minimize possible variation in delineation due to (lack of) contrast. In our experience the added value of contrast in delineating OARs is limited, because the uptake of contrast by the selected OARs is negligible.
We used different endpoints to quantify interobserver variability in head and neck OAR delineation. Variations in volume were indicated by the ICC and differences in combined volume and positional variations by the CI. Local variation in delineation was finally described by the regional 3D SD. The results showed that the variation in the determination of the volume alone (ICC) can be rather large while the combined volume and positional variations (CI) did not point to such a large variability (e.g., the spinal cord showed ICC = 0.32, CI = 0.63). This implies that the variations are situated at the borders of the OAR rather than in positional mismatches of the centres of gravity. In another case the ICC indicated substantial agreement while the CI was relatively moderate (e.g., the thyroid cartilage showed ICC = 0.83 and CI = 0.66), which could indicate substantial consistency in defining volume size while the centres of gravity of the volumes are shifted in position with respect to each other. The ICC combined with the CI could therefore help to identify the type of interobserver variation (in volume and position). To study variations between delineations in detail, the 3D SD provides the most complete information.
Some of the endpoints to describe interobserver variability as used in the current study have also been applied in studies dealing with head and neck target volume interobserver variability. Geets et al. found CVs of 4% and 20% for oropharyngeal and laryngeal-hypopharyngeal GTVs, which are more or less similar to the CVs we found for OARs (12-16%), excluding the glottic larynx (56%). Rasch et al. described 3D SD variability for head and neck target volume delineation in the same range as our OAR results: 3.3-4.4 mm for the CTV and 4.9-5.9 mm for the elective nodal areas, while our global 3D SD results varied from 0.9 to 3.9 mm. Our results thus strengthen earlier findings (e.g. of Nelms et al.) that interobserver variability is not only an important issue in the delineation of target volumes but also plays a role in the delineation of OARs.
Cranial, caudal, and medial regions of the studied head and neck organs at risk showed the largest interobserver variability, due to indistinctness of the delineation guidelines, the coarser image resolution in the cranial-caudal direction, the limitation of delineation on transverse slices, and poor discrimination in contrast from adjacent tissues. Potential measures to reduce current redundant variability in delineation practice are: (1) guideline development, (2) joint delineation review sessions, and (3) application of multimodality imaging. Other aspects that could contribute to more consistency in delineation are a standardized delineation environment with standard delineation tools, the possibility to delineate on multiple orientations and automatic contouring tools. The latter should, however, be carefully validated using baseline data of contouring variability such as the results of this study. Reduced interobserver variability could ultimately benefit radiation oncology practice since it may contribute to more general applicability and improvement of TCP and NTCP models.
- Peters LJ, O'Sullivan B, Giralt J, Fitzgerald TJ, Trotti A, Bernier J, Bourhis J, Yuen K, Fisher R, Rischin D: Critical impact of radiotherapy protocol compliance and quality in the treatment of advanced head and neck cancer: results from TROG 02.02. J Clin Oncol 2010, 28:2996–3001.
- Dirix P, Nuyts S: Evidence-based organ-sparing radiotherapy in head and neck cancer. Lancet Oncol 2010, 11:85–91.
- Nelms BE, Tomé WA, Robinson G, Wheeler J: Variations in the contouring of organs at risk: test case from a patient with oropharyngeal cancer. Int J Radiat Oncol Biol Phys, in press.
- Bortfeld T, Jeraj R: The physical basis and future of radiation therapy. Br J Radiol 2011, 84:485–498.
- Deasy JO, Moiseenko V, Marks L, Chao KS, Nam J, Eisbruch A: Radiotherapy dose-volume effects on salivary gland function. Int J Radiat Oncol Biol Phys 2010, 76:S58-S63.
- Rasch C, Eisbruch A, Remeijer P, Bos L, Hoogeman M, van Herk M, Lebesque J: Irradiation of paranasal sinus tumors, a delineation and dose comparison study. Int J Radiat Oncol Biol Phys 2002, 52:120–127.
- van de Water TA, Bijl HP, Westerlaan HE, Langendijk JA: Delineation guidelines for organs at risk involved in radiation-induced salivary dysfunction and xerostomia. Radiother Oncol 2009, 93:545–552.
- Christianen MEMC, Langendijk JA, Westerlaan HE, van de Water TA, Bijl HP: Delineation of organs at risk involved in swallowing for radiotherapy treatment planning. Radiother Oncol 2011, 101:394–402.
- Qazi AA, Pekar V, Kim J, Xie J, Breen SL, Jaffray DA: Auto-segmentation of normal and target structures in head and neck CT images: a feature-driven model-based approach. Med Phys 2011, 38:6160–6170.
- Shrout PE, Fleiss JL: Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979, 86:420–428.
- Barnhart HX, Haber MJ, Lin LI: An overview on assessing agreement with continuous measurements. J Biopharm Stat 2007, 17:529–569.
- Shrout PE: Measurement reliability and agreement in psychiatry. Stat Methods Med Res 1998, 7:301–317.
- Kouwenhoven E: Measuring the similarity of target volume delineations independent of the number of observers. Phys Med Biol 2009, 54:2863.
- Hanna GG, Hounsell AR, O'Sullivan JM: Geometrical analysis of radiotherapy target volume delineation: a systematic review of reported comparison methods. Clin Oncol 2010, 22:515–525.
- Jameson MG, Holloway LC, Vial PJ, Vinod SK, Metcalfe PE: A review of methods of analysis in contouring studies for radiation oncology. J Med Imaging Radiat Oncol 2010, 54:401–410.
- Deurloo K, Steenbakkers R, Zijp L, de Bois J, Nowak P, Rasch C, van Herk M: Quantification of shape variation of prostate and seminal vesicles during external beam radiotherapy. Int J Radiat Oncol Biol Phys 2005, 61:228–238.
- Steenbakkers R, Duppen J, Fitton I, Deurloo K, Zijp L, Uitterhoeve A, Rodrigus P, Kramer G, Bussink J, Jaeger K, et al.: Observer variation in target volume delineation of lung cancer related to radiation oncologist-computer interaction: a 'Big Brother' evaluation. Radiother Oncol 2005, 77:182–190.
- Yi SK, Hall WH, Mathai M, Dublin AB, Gupta V, Purdy JA, Chen AM: Validating the RTOG-endorsed brachial plexus contouring atlas: an evaluation of reproducibility among patients treated by intensity-modulated radiotherapy for head-and-neck cancer. Int J Radiat Oncol Biol Phys 2012, 82:1060–1064.
- Geets X, Daisne J, Arcangeli S, Coche E, Poel M, Duprez T, Nardella G, Grégoire V: Inter-observer variability in the delineation of pharyngo-laryngeal tumor, parotid glands and cervical spinal cord: Comparison between CT-scan and MRI. Radiother Oncol 2005, 77:25–31.
- Rasch CR, Steenbakkers RJ, Fitton I, Duppen JC, Nowak PJ, Pameijer FA, Eisbruch A, Kaanders JH, Paulsen F, van Herk M: Decreased 3D observer variation with matched CT-MRI, for target delineation in Nasopharynx cancer. Radiat Oncol 2010, 5:21.
- Chao KSC, Bhide S, Chen H, Asper J, Bush S, Franklin G, Kavadi V, Liengswangwong V, Gordon W, Raben A, et al.: Reduce in variation and improve efficiency of target volume delineation by a computer-assisted system using a deformable image registration approach. Int J Radiat Oncol Biol Phys 2007, 68:1512–1521.
- Teguh D, Levendag P, Voet P, Al Mamgani A, Han X, Wolf T, Hibbard L, Nowak P, Akhiat H, Dirkx M, et al.: Clinical validation of atlas-based auto-segmentation of multiple target volumes and normal tissue (Swallowing/Mastication) structures in the head and neck. Int J Radiat Oncol Biol Phys 2011, 81:950–957.