This study included an extensive 3D analysis of the variation in delineation of a set of OARs in the head and neck area. All OARs, except for the glottic larynx, showed moderate interobserver variability, with ICC values of 0.32-0.83, CI values of 0.64-0.71, and 3D SD values of 0.9-2.6 mm. The cranial, caudal, and medial regions of the OARs showed the largest variations. The glottic larynx showed larger variation in delineation (ICC = 0.27, mean CI = 0.37, and 3D SD = 3.9 mm). All endpoints support the need to improve delineation practice.
The poor consistency in delineation of the glottic larynx was mainly caused by poor compliance with the delineation guidelines. The guidelines prescribe that the glottic larynx ends at the caudal edge of the cricoid and that the arytenoid cartilages are included in the glottic larynx contour. As illustrated in Figure 3(d,e,f), this description was not consistently followed. Interobserver variability might be reduced by joint delineation review sessions in which all radiation oncologists who are involved in head and neck cancer treatment participate. Such sessions are now standard practice at our institutes (UMCG, NCI-AVL).
The salivary glands showed moderate interobserver variability. Visual inspection of the parotid gland contours showed that the guidelines for these organs were also not consistently followed. The protocol prescribed that the superficial temporal vessels should be enclosed in the delineated parotid gland, because they are generally hard to distinguish from parotid gland tissue on scans with no or poor contrast. Still, some observers did not include the vessels in their delineation of the parotid gland. Joint delineation review sessions and clarification of the guidelines could help here. Yi and colleagues, for instance, showed that clear stepwise delineation guidelines resulted in minimal variability. Our volume analysis of the parotid gland data showed CV values (12% and 15%) similar to those of Geets et al. (17%). Nelms et al. found larger CV values (34% and 29%), evaluated for 1 patient by 32 observers. Our 3D SD evaluation adds valuable information on specific regional variations. The largest discrepancies for the parotid glands were located at the cranial, caudal, and medial subregions of the gland. For the submandibular glands, the cranial parts of the organ clearly showed the largest discrepancies. Poor discrimination between tissues at the medial borders of the parotid gland (e.g. distinction from the posterior belly of the digastric muscle) and at the cranial parts of the submandibular gland (e.g. distinction from the medial pterygoid muscle and the mylohyoid muscle) could explain these larger 3D SD values. The addition of MRI might improve the visibility of borders between tissues [19, 20]. The cranial and caudal variations for the parotid glands could partly be explained by the image resolution of 2 mm in the cranial-caudal direction (the slice thickness) and by the fact that observers could only delineate on transverse CT slices, which limits the resolution in both the cranial and the caudal part of the delineations.
These limitations obviously apply to the delineations of all OARs. Allowing delineation in multiple orientations could possibly help to diminish these variations, as is also suggested by Steenbakkers et al. The use of a standardized delineation environment and of tools for automatic contouring might further help to reduce interobserver variability [17, 21, 22].
Interobserver variability of the spinal cord was predominantly caused by variations at the cranial and caudal parts of the structure, due to unclear guidelines and low compliance. This problem could be reduced by clearer delineation guidelines. It is unlikely, however, that these variations will have major consequences in clinical practice, as long as the spinal cord is accurately delineated in the vicinity of the irradiated volume and the maximum dose to the cord is considered the leading parameter for treatment planning.
We analysed the interobserver variability in contouring on twelve CT sets, which consisted of six CTplan scans and six CTrep scans. The three endpoints of interobserver variability did not indicate a trend in the differences between CTplan and CTrep (for an example, see Figure 2). The correspondence of interobserver variability between the scans may suggest that the use of contrast (in CTplan) and the use of an observer-specific delineation template (in CTrep) have effects of comparable magnitude on the variation in delineation amongst observers, for the considered OARs. In addition, the guidelines were developed to be applicable to non-contrast as well as contrast-enhanced CT data, which will minimize possible variation in delineation due to the presence or lack of contrast. In our experience, the added value of contrast in delineating OARs is limited, because the uptake of contrast by the selected OARs is negligible.
We used different endpoints to quantify interobserver variability in head and neck OAR delineation. Variations in volume were indicated by the ICC, and combined volume and positional variations by the CI. Local variation in delineation was described by the regional 3D SD. The results showed that the variation in the determination of the volume alone (ICC) can be rather large while the combined volume and positional variations (CI) do not point to such a large variability (e.g., the spinal cord showed ICC = 0.32 and CI = 0.63). This implies that the variations are situated at the borders of the OAR rather than in positional mismatches of the centres of gravity. In another case the ICC indicated substantial agreement while the CI was only moderate (e.g., the thyroid cartilage showed ICC = 0.83 and CI = 0.66), which could indicate substantial consistency in defining volume size while the centres of gravity of the volumes are shifted in position with respect to each other. Information from the ICC combined with the CI could therefore help to identify the type of interobserver variation (in volume or in position). To study variations between delineations in detail, the 3D SD provides the most complete information.
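The distinction between volume agreement and overlap agreement can be illustrated with a minimal sketch. It assumes that the CI is defined as the intersection volume divided by the union volume of two delineations; the definition used in this study is not restated here and may be a multi-observer generalization. The helper names (`box`, `concordance_index`, `volume_ratio`) and the box-shaped contours are hypothetical, and the simple volume ratio is only a stand-in for volume agreement, not the ICC itself.

```python
from itertools import product


def box(x0, x1, y0, y1, z0, z1):
    """Return the set of voxel indices inside an axis-aligned box (half-open ranges)."""
    return set(product(range(x0, x1), range(y0, y1), range(z0, z1)))


def concordance_index(a, b):
    """Assumed definition: intersection volume divided by union volume."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def volume_ratio(a, b):
    """Smaller volume divided by larger volume (1.0 means equal size)."""
    return min(len(a), len(b)) / max(len(a), len(b))


# Case 1: two hypothetical contours of identical volume, shifted by 2 voxels.
shifted_a = box(0, 10, 0, 10, 0, 10)
shifted_b = box(2, 12, 0, 10, 0, 10)

# Case 2: same centre, but one contour extends further cranially and caudally.
nested_a = box(0, 10, 0, 10, 0, 10)
nested_b = box(0, 10, 0, 10, -3, 13)

for label, a, b in [("shifted", shifted_a, shifted_b), ("nested", nested_a, nested_b)]:
    print(label, round(volume_ratio(a, b), 3), round(concordance_index(a, b), 3))
```

In the shifted case the volumes agree perfectly while the overlap is reduced, which mirrors the thyroid cartilage pattern. In the nested case the centres coincide and the loss of overlap comes entirely from the border extent, which mirrors the spinal cord pattern.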
Some of the endpoints used in the current study to describe interobserver variability have also been applied in studies on interobserver variability in head and neck target volume delineation. Geets et al. found CVs of 4% and 20% for oropharyngeal and laryngeal-hypopharyngeal GTVs, which are broadly similar to the CVs we found for OARs (2-16%), excluding the glottic larynx (56%). Rasch et al. described 3D SD variability for head and neck target volume delineation in the same range as our OAR results: 3.3-4.4 mm for the CTV and 4.9-5.9 mm for the elective nodal areas, while our global 3D SD results varied from 0.9 to 3.9 mm. Our results thus strengthen earlier findings (e.g. of Nelms et al.) that interobserver variability is not only an important issue in the delineation of target volumes but also plays a role in the delineation of OARs.
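For readers who want to reproduce a CV comparison of this kind, the following is a minimal sketch. It assumes that the CV is the sample standard deviation of the delineated volumes across observers divided by their mean, expressed as a percentage; the exact estimator used in the cited studies may differ. The function name and the volumes are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def coefficient_of_variation(volumes):
    """Return the CV in percent: sample SD of the volumes divided by their mean."""
    return 100.0 * stdev(volumes) / mean(volumes)


# Hypothetical volumes (in cm3) of one OAR delineated by five observers.
example_volumes = [24.0, 26.0, 28.0, 30.0, 32.0]
cv_percent = coefficient_of_variation(example_volumes)
print(f"CV = {cv_percent:.1f}%")
```

Because the CV is normalized by the mean volume, it allows comparison between structures of very different size, such as a GTV and a small OAR.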