Interobserver variability in organ at risk delineation in head and neck cancer

Background In radiotherapy inaccuracy in organ at risk (OAR) delineation can impact treatment plan optimisation and treatment plan evaluation. Brouwer et al. showed significant interobserver variability (IOV) in OAR delineation in head and neck cancer (HNC) and published international consensus guidelines (ICG) for OAR delineation in 2015. The aim of our study was to evaluate IOV in the presence of these guidelines. Methods HNC radiation oncologists (RO) from each Belgian radiotherapy centre were invited to complete a survey and submit contours for 5 HNC cases. Reference contours (OARref) were obtained by a clinically validated artificial intelligence-tool trained using ICG. Dice similarity coefficients (DSC), mean surface distance (MSD) and 95% Hausdorff distances (HD95) were used for comparison. Results Fourteen of twenty-two RO (64%) completed the survey and submitted delineations. Thirteen (93%) confirmed the use of delineation guidelines, of which six (43%) used the ICG. The OARs whose delineations agreed best with the OARref were mandible [median DSC 0.9, range (0.8–0.9); median MSD 1.1 mm, range (0.8–8.3), median HD95 3.4 mm, range (1.5–38.7)], brainstem [median DSC 0.9 (0.6–0.9); median MSD 1.5 mm (1.1–4.0), median HD95 4.0 mm (2.3–15.0)], submandibular glands [median DSC 0.8 (0.5–0.9); median MSD 1.2 mm (0.9–2.5), median HD95 3.1 mm (1.8–12.2)] and parotids [median DSC 0.9 (0.6–0.9); median MSD 1.9 mm (1.2–4.2), median HD95 5.1 mm (3.1–19.2)]. Oral cavity, cochleas, PCMs, supraglottic larynx and glottic area showed more variation. RO who used the consensus guidelines showed significantly less IOV (p = 0.008). Conclusions Although ICG for delineation of OARs in HNC exist, they are only implemented by about half of RO participating in this study, which partly explains the delineation variability. 
However, this study highlights that guidelines alone do not suffice to eliminate IOV and that more effort is needed to achieve further treatment standardisation, for example with artificial intelligence. Supplementary information accompanies this paper at 10.1186/s13014-020-01677-2.

Radiotherapy (RT) is an important treatment modality in the fight against head and neck cancer (HNC), where efforts are continuously being made to improve disease outcome without increasing toxicity. Intensification of RT [1] and/or concomitant chemotherapy [2] have improved survival, but at the cost of more acute and late toxicity [3]. Unfortunately, loco-regional failure rates remain high, with approximately 30% loco-regional recurrences over 5 years, which impacts morbidity and mortality [4,5]. The ultimate aim is to deliver the highest possible dose to the target volumes (TVs) to achieve disease control whilst keeping the dose to normal surrounding tissue as low as possible to limit toxicity. The complex anatomy of the head and neck, however, makes this very challenging because of the close proximity between TVs and organs at risk (OARs) [6]. A huge step forward in realising this was the implementation of more conformal techniques such as intensity modulated radiotherapy (IMRT) and volumetric modulated arc therapy (VMAT), which allow better sparing of OARs, resulting in a decrease in toxicity and a better quality of life [7][8][9][10]. To fully utilise these benefits, accurate and consistent delineation of TVs and OARs is crucial, as it determines where the high dose should be delivered and is necessary to produce an optimal, patient-specific dose plan. Inaccuracies in this step can have a detrimental effect on treatment outcome, either by giving an unnecessarily high dose to normal tissue, which could result in more toxicity, or by inadequately treating the TVs, which could result in locoregional treatment failure [11]. Delineation accuracy is significantly limited by interobserver variability (IOV) in delineation of TVs [11][12][13][14][15][16] and OARs [11,17]. This variability should be minimised to improve treatment standardisation and to provide the best possible quality of care for patients.
Furthermore, IOV has an impact on the interpretation of radiation-induced toxicity and could therefore also have an impact on the outcome of multicentre trials [11]. International consensus guidelines (ICG) describing the delineation of 25 OARs in the head and neck were published in 2015 by Brouwer et al. [18] after IOV had been shown between 5 radiation oncologists (RO) [17].
An initiative was launched to map the RT landscape within Belgium for HNC, regarding delineation of TVs [16] and OARs, in the presence of ICG [18]. Since the publication of these ICG, this is the first study of its kind to identify (a) which guidelines are used, (b) which OARs are delineated in clinical practice and (c) the extent of IOV in organ at risk (OAR) delineation, with the cooperation of multiple RO from different RT centres.

Study design
In February 2017, all 25 RT centres in Belgium were invited to participate in this study. One experienced HNC RO from each participating centre was asked through an online survey which guidelines they used for delineation of OARs and whether these guidelines in their opinion needed a revision or clarification (survey in Additional file 1: survey questions and answers). The same RO was also invited to submit OAR delineations of five previously selected HNC cases (Additional file 2: Table 1 Patient characteristics). These cases were selected to represent different tumour sites and different tumour and nodal stages, excluding post-operative patients and patients with scatter artefacts on planning CT. We refer to our previous study for a full description of each case [16], which was also provided to each participating RO, including detailed information on clinical examination, diagnostic imaging (MRI, CT, PET-CT) and biopsy.
A planning CT scan was acquired in supine position after iodine-containing contrast medium (Visipaque 320®) was injected intravenously. For further details regarding the planning CT, we refer to our previous publication [16]. The anonymized planning CTs with delineated gross tumour volume of the primary tumour (GTVp) and pathological lymph nodes (GTVn) were provided and dedicated software (Aquilab Software, Lille, France) was used for secure data transfer to and from each participating centre.
A reference contour of each OAR (OARref) was created for comparison, with the help of an in-house developed auto-delineation tool to ensure consistent delineations [19]. This tool was created using deep learning based on a training set of HNC planning CTs delineated according to the ICG [18]. The tool has been validated and implemented in our clinical practice [20] and has been shown to decrease IOV in our centre. The auto-delineation contours were carefully reviewed and manually corrected if needed to remove minor mistakes.

Delineation agreement analysis
Pair-wise agreement between the 3D set of contours submitted by each RO and the corresponding reference contours made according to the ICG (OARref) was assessed for each OAR separately using the Dice similarity coefficient (DSC), mean surface distance (MSD) and the 95% Hausdorff distance (HD95). The DSC was calculated as twice the volume of overlap of both contour sets (A and B), divided by the sum of their volumes:
$$\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}$$
A perfect overlap between contours results in DSC = 1, while no overlap results in DSC = 0. Clinical interpretation of intermediate DSC values is complicated by the fact that DSC is biased with regard to volume (i.e. structures with a larger volume yield a higher DSC than smaller structures with a similar absolute volume difference) [21]. Hence, MSD and HD, which are distance measures, were also calculated. MSD is the mean distance between the surface of the contours of the RO and the OARref. HD is the maximum of the 3D distances from each point on one OAR contour to the closest point on the other, which is independent of their volume. Instead of the maximum distance, which is sensitive to outliers, we report HD95, i.e. the 95th percentile. For MSD and HD95, a smaller value corresponds to more delineation agreement than a larger value. Median DSC, MSD and HD95 were computed for each OAR separately to assess differences in IOV per OAR. To assess the impact on IOV of the guidelines the RO used, DSC, MSD and HD95 were computed separately for the two groups (RO who used the ICG and RO who did not). An independent, two-sided t-test was used to quantify significance; P < 0.05 was considered statistically significant.
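The three agreement measures described above can be illustrated with a minimal Python sketch. It is not the software used in this study. It assumes that both contours are available as binary voxel masks on the same grid, that surface voxels are those with at least one face neighbour outside the mask, that MSD is averaged symmetrically over both surfaces, and that HD95 is the larger of the two directed 95th percentiles. These are common conventions but are assumptions here, and the function names (`dice`, `surface_points`, `msd_hd95`) are hypothetical.

```python
import numpy as np


def dice(a, b):
    """Dice similarity coefficient of two non-empty binary masks."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())


def surface_points(mask, spacing=(1.0, 1.0, 1.0)):
    """Coordinates (in mm) of mask voxels with a face neighbour outside the mask."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = padded[1:-1, 1:-1, 1:-1].copy()
    for axis in range(3):
        for shift in (-1, 1):
            # A voxel stays "interior" only if this neighbour is also inside.
            interior &= np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    surface = mask & ~interior
    return np.argwhere(surface) * np.asarray(spacing, dtype=float)


def nearest_distances(p, q):
    """For each point in p, the distance to its closest point in q (brute force)."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=2)
    return d.min(axis=1)


def msd_hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric mean surface distance and 95% Hausdorff distance (assumed conventions)."""
    pa = surface_points(a, spacing)
    pb = surface_points(b, spacing)
    d_ab = nearest_distances(pa, pb)
    d_ba = nearest_distances(pb, pa)
    msd = (d_ab.sum() + d_ba.sum()) / (d_ab.size + d_ba.size)
    hd95 = max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
    return float(msd), float(hd95)


# Toy example: a 4 x 4 x 4 voxel cube and the same cube shifted by one voxel.
reference = np.zeros((10, 10, 10), dtype=bool)
reference[2:6, 2:6, 2:6] = True
observer = np.zeros((10, 10, 10), dtype=bool)
observer[3:7, 2:6, 2:6] = True

print(dice(reference, observer))      # 0.75
print(msd_hd95(reference, observer))  # (MSD, HD95) in mm for 1 mm voxels
```

The brute-force distance computation is only practical for small structures; for clinical contour sets a distance transform or a spatial index would normally be used instead.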

Results
Three RO encountered technical problems and could therefore not take part in this study. Fourteen of the remaining 22 RO (64%) responded to the questionnaire and submitted at least one delineation. Eleven RO delineated all 5 patients, 1 delineated 3 cases and 2 delineated 2 cases (62 cases in total). Of the fourteen RO, four worked in a university hospital and ten in a general hospital. Three hospitals were public hospitals; the remaining eleven were private.

Survey
Thirteen of fourteen participating RO confirmed using guidelines for OAR delineation, of whom six used the ICG of Brouwer et al. [18] and one also used the publication of Christianen et al. [22]. One RO used the publication of Genovesi et al. [23], while six did not specify which guidelines they used. Seven RO found an update or clarification of existing guidelines, or creation of new guidelines, necessary. Five of these did not use the ICG and two did (Additional file 1).
Figure 1 shows the overall difference in MSD between RO who use the ICG versus other RO and Fig. 2 shows the differences per OAR. They show that MSD is significantly smaller when the ICG are applied (p = 0.008). In Additional file 3: Fig. 1, DSC and corresponding MSD for each OAR are shown separately to show that some OARs show more IOV than others. Additional file 4: Fig. 2 shows the difference between the two RO groups for DSC and HD95. Additional file 5: Fig. 3 shows the range of volumes delineated per patient and per OAR compared to OARref.

Brainstem
The brainstem was delineated in 89% of cases (no difference between the two RO groups). Most RO in this study started delineation in the most cranial slice where the brainstem was visible. The caudal border differed by a few slices between RO but was mostly according to the guidelines (Table 2, Fig. 3a). The circumferential contour on the axial plane showed little variation (Additional file 6: Fig. 4a). On visual inspection of the contours, there was no clear difference between the two groups of RO.

Cochlea
Cochleas were delineated in 40% of cases (59% with ICG vs 26% without). Disagreement between contours was small, with two exceptions. Three RO, one of whom used the ICG, delineated the entire petrous part of the temporal bone (Additional file 6: Fig. 4b). In addition, two RO who did not use the ICG each delineated, in one patient, a region that did not contain the cochlea (Additional file 6: Fig. 4c).

Glottic area
The glottic area was delineated in 48% of cases by RO who used the ICG, compared to 29% for RO who did not. It was delineated more often in patients with oropharyngeal tumours (58%) than in patients with laryngeal, supraglottic or hypopharyngeal tumours (22%). Two RO delineated the entire larynx starting caudal of the hyoid bone and included the thyroid cartilage and arytenoids. One RO included part of the supraglottic larynx, another included the arytenoids and a third included both. Three RO delineated the glottic area according to the ICG, and all three confirmed using the guidelines in the survey (Additional file 6: Fig. 4d+e).

Mandible
The vast majority (89%) of the submissions included a delineation of the mandible (96% with ICG vs 83% without). There were minor differences on visual inspection compared to OARref, although sometimes the teeth were included as well (Additional file 6: Fig. 4f). One RO did not include the mandibular condyles and coronoid process.

Oral cavity
Two thirds (68%) of the submissions included the oral cavity (70% with ICG vs 66% without). Two RO included the teeth (one used the ICG), and one RO who used the ICG included the buccal mucosa (Fig. 3b). The cranial border was consistently selected as the mucosa of the hard palate, but the posterior and caudal border showed more variation (Additional file 6: Fig. 4g). One RO excluded the posterior part of the tongue, and another the base of tongue.

Parotid glands
The parotid glands (PGs) were the OARs delineated most often by all RO. Only one right parotid gland was not delineated by one RO for an unknown reason. At the anterior border the masseter and pterygoid muscles were sometimes included and at the medial border the digastric muscle (Fig. 3b + Additional file 6: Fig. 4h). The cranial and caudal borders varied up to a few slices.

Pharyngeal constrictor muscles
The three pharyngeal constrictor muscles (PCMsup, PCMmid, PCMinf) were delineated by 9 RO, but only 5 delineated them as separate structures. RO who used the ICG delineated the PCMs more often than other RO (44% vs. 20%). There was good agreement in the cranial border of PCMsup, although one RO delineated it up to the base of skull. PCMsup also showed variation in the anterior border (Additional file 6: Fig. 4i). Regarding PCMmid, only two RO delineated cranially enough; the others stopped at the caudal level of C3 (Additional file 6: Fig. 4j). There was good consensus regarding the cranial border of the PCMinf, but the caudal border differed by multiple slices between RO. There was good agreement in the lateral extension of the contours in all three muscles.

Spinal cord
The spinal cord was delineated in 82% of cases (62% with ICG vs 97% without) and the spinal canal in the other cases (by two RO who both used the ICG and once by a RO in the other group) (Fig. 3b). Besides this, the largest differences were seen in the cranial border (depending on the caudal border of the brainstem) and the caudal border (Fig. 3c). Some RO delineated the spinal cord all the way to the most caudal slice of the CT scan; others stopped several slices higher. Three RO stopped a few slices cranial to T3 in one patient each.

Supraglottic larynx
The supraglottic larynx was delineated by fewer than half of the RO in patients with an oropharyngeal tumour, and by fewer than a quarter of RO in patients with a laryngeal, supraglottic or hypopharyngeal tumour. In total it was delineated at least once by seven RO and more often when the ICG were used (41% vs 26%). Two RO systematically delineated 2-3 cm more caudally than the guidelines suggest (Additional file 6: Fig. 4m) and one RO more cranially (Additional file 6: Fig. 4n).

Discussion
The present study shows that even though there are ICG for OAR delineation, these are not consistently applied by all HNC RO in routine clinical practice. This results in variability in terms of which OARs are delineated and how these are delineated. Furthermore, we have shown that even when they are implemented, there is still room for improvement regarding IOV. This is in line with what RO in this study indicate, namely half of them found that new or updated guidelines are necessary.
Previous studies have also shown significant IOV in delineation of several OARs such as the spinal cord, brainstem, PGs, glottic larynx and thyroid cartilage [11,17,24]. Consequently, ICG for OAR delineation were published in 2015 to try to standardise delineation of OARs [18]. The current study is the first to investigate IOV between RO of different centres for a large set of OARs since these ICG were published. Our results were similar to those of Brouwer et al. [17], although DSC (or concordance index) was higher in our study, which could imply that IOV improved with the ICG, as 6 of 14 RO used them. In a study on the benefits of deep learning for OAR delineation [20], we also showed IOV in OAR delineation between two RO from the same centre who both used the ICG. That IOV, however, was smaller than in the current study, and improved even more with the use of the automated delineation tool.
There are several reasons that could explain the contour variation between RO and the reference contour in the present study. A reason that has already been mentioned is that different guidelines are used, either because the ICG [18] were not known to exist, or because other guidelines were used. The effect of using the ICG could clearly be seen on several OARs, namely the cochleas, glottic area, PCMs and supraglottic larynx, which were delineated more often and with better agreement. Figures 1 and 2 support this hypothesis because MSD is significantly smaller for the RO using the ICG compared to the other group (p = 0.008). However, even when the ICG were used, there was still IOV compared to the reference contours. A first possible reason is that the edges of the OARs may be unclear or blurry on CT (PCMs, anterior and medial borders of PGs), requiring interpretation by the delineating RO, which can result in IOV. Secondly, different CT windowing can also have an impact on OAR visualisation, resulting in different volumes. Thirdly, the guidelines might be misunderstood or misinterpreted. For example, the supraglottic larynx, which should start cranially at the tip of the epiglottis, was delineated by one RO including the air surrounding the tip (Additional file 6: Fig. 4n). The inclusion of air has a large impact on the volume delineated, which is also often seen in the case of the oral cavity. Another misinterpretation occurs at the cranial and caudal borders, which often differed by a few slices. This was seen, for example, at the caudal border of the brainstem, because the "tip of the dens of C2" can be prone to misinterpretation (Fig. 3a). The spinal cord also showed variation in the caudal border because some RO delineated it all the way to the most caudal slice of the CT, and others stopped more cranially.
Two RO who used the ICG delineated the spinal canal instead of the spinal cord, so these contours were excluded from the analysis, which resulted in fewer delineations (Table 1) and less agreement (Fig. 2). Not only did the delineated volumes differ, but whether the OAR was delineated at all also varied significantly. The mandible, brainstem, spinal cord, salivary glands and oral cavity were consistently delineated in all patients, irrespective of which RO delineated them. But several OARs seem less well known, especially to RO who did not use the ICG. As a result, fewer than half of them delineated the cochleas, glottic area, PCMs and supraglottic larynx. Even the RO using the ICG did not always delineate the OARs described in the guidelines, even though they did delineate them more often (Table 1). A reason for this could be that the RO may have deemed delineation of the OAR unnecessary for treatment planning because the tumour was situated far away, or too close to spare the OAR anyway.
Nelms et al. [25] showed the impact of OAR contouring variation on dose volume histograms (DVH) and concluded that differences in maximum dose (Dmax) and mean dose (Dmean) per OAR could be large, depending on the degree of IOV and the RT plan. On the one hand there are OARs where Dmax can be used for plan optimisation (mandible, brainstem, spinal cord and cochleas) and for these OARs, precision of the contour (especially in cranial and caudal direction) may be less important because volume does not affect Dmax significantly. Exceptions of course are sub-optimal delineations, for example when OARs (such as cochleas in 2 patients in this study) are delineated in the wrong position. Additionally, the caudal border of the spinal cord is important for caudally located tumours and the cranial border of the spinal cord should also be delineated carefully, as the spinal cord has a stricter dose constraint than the brainstem. Shifting the border between these two OARs more caudally means the spinal cord could receive a higher dose than anticipated. On the other hand, there are OARs (salivary glands, oral cavity, PCMs, glottic area and supraglottic larynx) where Dmean is used for treatment planning and evaluation. In that case, the volume delineated is important because a smaller volume would result in a higher Dmean than a larger volume. Additional file 4: Fig. 2 shows that for the glottic area, oral cavity and supraglottic larynx, the smallest/largest volume contoured by RO is sometimes half/double the size of the OARref volume. A summary of the impact of sub-optimal delineations on dosimetry is listed in Table 2.
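The different sensitivity of Dmax and Dmean to the delineated volume can be made concrete with a toy calculation. The sketch below uses invented dose values (not data from this study) and strong simplifications: one uniform dose value per axial slice, the same number of voxels in every slice, and a truncated contour that omits only the lower-dose part of the organ. Under these assumptions the truncated contour gives a higher Dmean while Dmax is unchanged. The helper name `dmean_dmax` is hypothetical.

```python
import numpy as np

# Hypothetical dose per axial slice in Gy, decreasing away from the target.
dose_per_slice = np.array([70.0, 60.0, 50.0, 40.0, 30.0, 20.0, 10.0])


def dmean_dmax(dose, first_slice, last_slice):
    """Mean and maximum dose over the slices first_slice..last_slice (inclusive)."""
    region = dose[first_slice:last_slice + 1]
    return float(region.mean()), float(region.max())


# Reference contour covers slices 0-5; the truncated contour stops at slice 2.
reference = dmean_dmax(dose_per_slice, 0, 5)
truncated = dmean_dmax(dose_per_slice, 0, 2)

print(reference)  # (45.0, 70.0)
print(truncated)  # (60.0, 70.0)
```

If the omitted part of the organ were instead located in the high-dose region, both Dmean and Dmax would decrease, so the direction of the effect depends on where the contours differ relative to the dose gradient.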
The consequences of inconsistent OAR delineation should not be underestimated, as consistent delineation is crucial for developing a treatment plan that represents reality. Incorrect or inaccurate delineation of OARs can impact DVH and could in turn impact normal-tissue complication probability (NTCP), affect evaluation of treatment plans and result in unexpected treatment-related morbidity. In turn, this could also affect the performance of predictive models and should be kept in mind in multicentre trials. Furthermore, care should be taken when using constraints from publications or other RO, as these may have been developed with different OAR volumes, which could result in more unexpected toxicity. Correct delineation of OARs is also important to fully utilise the benefits of highly conformal techniques such as IMRT, VMAT and proton therapy, as incorrect delineation will counteract this benefit. Besides unexpected toxicity resulting from incorrect delineation of OARs, there is also the possibility of geographical misses. When delineating the clinical target volume, it may be adapted to exclude overlapping OARs which it does not invade. However, if the OAR is incorrectly delineated and the region is excluded from the clinical target volume or planning target volume, this could result in a geographical miss. Lastly, RO should be aware that even when identical guidelines are used, delineations still differ from one another (Fig. 1). We therefore advise regular joint delineation review sessions as a form of continuous training. If the guidelines are updated, it would be useful to consider a general recommendation of mandatory and optional OARs to be delineated, depending on tumour location. In the future, it would also be useful if the preferred window level setting per OAR were added to the guidelines, for optimal delineation.
We also strongly believe there is a place for the automated delineation of OARs, as we have shown its benefits in reducing IOV and improving time efficiency in a previous study [20].
There are several limitations to the present study that should be addressed. Firstly, participation was voluntary, which could result in a response bias because not all invited clinical centres took part (64% participation). However, RO from both university hospitals and general hospitals took part in the study. A second potential limitation is that not all RO answered which guidelines they used for delineation of OARs. Although this has no impact on the observed IOV, it does affect the perceived impact of the implementation of guidelines. Thirdly, participants were asked to delineate as they would do in clinical practice to give a realistic indication of therapeutic variability. This, however, meant that not all OARs were delineated by all RO, although it reflects variation in how patients are treated in reality. Lastly, reference contours were delineated using the ICG [18] and although this was done with the utmost care and with the help of an automated delineation tool, we cannot deny that this in itself required interpretation of the guidelines, which could introduce bias.

Conclusions
Although ICG for delineation of OARs in HNC were published several years ago, they are only implemented by about half of RO participating in this study, which partly explains the delineation heterogeneity. Although there was less IOV between RO using the ICG, this study highlights that delineation guidelines alone do not suffice and that more effort is needed to achieve further treatment standardisation, for example with the implementation of artificial intelligence tools for automated delineation.