Evaluation of an automatic segmentation algorithm for definition of head and neck organs at risk

Background: The accurate definition of organs at risk (OARs) is required to fully exploit the benefits of intensity-modulated radiotherapy (IMRT) for head and neck cancer. However, manual delineation is time-consuming and there is considerable inter-observer variability. This is pertinent as function-sparing and adaptive IMRT have increased the number and frequency of delineation of OARs. We evaluated the accuracy and potential time-saving of Smart Probabilistic Image Contouring Engine (SPICE) automatic segmentation to define OARs for salivary-, swallowing- and cochlea-sparing IMRT. Methods: Five clinicians recorded the time to delineate five organs at risk (parotid glands, submandibular glands, larynx, pharyngeal constrictor muscles and cochleae) for each of 10 CT scans. SPICE was then used to define these structures. The acceptability of SPICE contours was initially determined by visual inspection and the total time to modify them recorded per scan. The Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm created a reference standard from all clinician contours. Clinician, SPICE and modified contours were compared against STAPLE by the Dice similarity coefficient (DSC) and mean/maximum distance to agreement (DTA). Results: For all investigated structures, SPICE contours were less accurate than manual contours. However, for parotid/submandibular glands they were acceptable (median DSC: 0.79/0.80; mean, maximum DTA: 1.5 mm, 14.8 mm/0.6 mm, 5.7 mm). Modified SPICE contours were also less accurate than manual contours. The utilisation of SPICE did not save time or improve efficiency. Conclusions: Improvements in accuracy of automatic segmentation for head and neck OARs would be worthwhile and are required before its routine clinical implementation.


Background
The accurate definition of organs at risk (OARs) is required to fully exploit the benefits of intensity-modulated radiotherapy (IMRT) for head and neck cancer [1]. However, manual delineation is time-consuming [2]. There is also considerable inter-observer variability [3][4][5][6], which can result in significant differences in radiation dose to OARs [4]. This has implications for the evaluation of radiotherapy plans, the interpretation of radiation effects and meaningful comparisons between treatments. Standardisation is improved by the use of contouring guidelines, multimodality imaging and consensus between experts, but variation in organ delineation remains [3,5,7]. This is of pressing importance with the introduction of both function-sparing and adaptive IMRT, where the number of OARs and the frequency of their delineation are increased.
Following head and neck radiotherapy, adverse late effects are highly prevalent and these impact on both organ function and more general domains of well-being, such as physical, mental and social health [8]. Radiation-induced xerostomia is the most commonly reported grade ≥2 late side effect, which can result in difficulties with speech and swallowing, and in dental caries [9][10][11]. Saliva is produced from the major (parotid, submandibular and sublingual) and minor (soft palate, lips, cheeks) salivary glands [12]. The parotid-sparing intensity-modulated versus conventional radiotherapy in head and neck cancer (PARSPORT) trial demonstrated that the incidence of grade ≥2 xerostomia one year after treatment was significantly reduced with parotid-sparing IMRT compared to 3D-conformal radiotherapy (38% versus 74%) [9]. One parotid gland should be spared to a mean dose of less than 20 Gy or both glands to less than 25 Gy [13]. For the submandibular gland, relatively modest reductions in dose (to less than 35 Gy) may be of benefit [13].
Swallowing dysfunction is seen in up to half of patients treated with definitive synchronous chemo-radiotherapy and is the most common late grade ≥3 toxicity; the incidence has increased with intensification of treatment, including the addition of chemotherapy or altered fractionation [14][15][16]. This adversely affects quality of life, probably to an even greater extent than xerostomia [8,17-20]. The mean radiation doses to the pharyngeal constrictor muscles and supraglottic larynx are significantly associated with late dysphagia [19,21-27]. The volume of the larynx and pharyngeal constrictor muscles that receives a radiation dose ≥60 Gy (and where possible ≥50 Gy) should be minimised [28].
Permanent and predominantly high frequency sensorineural hearing loss may occur in 40-60% of patients who receive radiotherapy to areas such as the nasopharynx, para-nasal sinuses and parotid bed [29][30][31]. This is associated with psychological and cognitive morbidity [32]. The mean dose to the cochlea should be limited to ≤45 Gy (or more conservatively ≤35 Gy); and when combined with cisplatin, strictly limited [33].
Significant anatomic changes and alteration in dose to target volumes and OARs may occur during a course of head and neck radiotherapy [34][35][36][37]. A standard way to detect inter-fraction variation is volumetric imaging using kilovoltage (kV) cone beam computed tomography (CT). Typically these images are superimposed on the planning CT scan using rigid co-registration. However, this only allows qualitative comparison of similarity in six degrees of freedom, which may not be adequate if the shapes or relative positions of target organs and OARs have changed. A potential solution for head and neck structures is the use of automatic segmentation, where the planning CT scan and manual contours serve as an atlas and are mapped to the replanning or cone beam CT scan using a process of deformable registration and voxel-matching [36,38-41]. This would facilitate calculation of changes in doses to the target volumes and OARs [42], information that could be used to determine whether adaptive replanning is required [34,43-45].
Smart Probabilistic Image Contouring Engine (SPICE) is an automated, commercially available algorithm, which combines an atlas-based and a model-based approach to segmentation of head and neck lymph node levels and OARs [46]. The atlas was initially derived from expert 'ground truth' contours. The automatic segmentation process employs multiple steps of deformable image registration. First, a low-dimensional non-rigid transformation maps the model landmarks (or mean organ positions) into the image, which accounts for any large displacements (atlas-based step). Second, there is density-based registration, where each voxel is included in or excluded from a structure depending on its intensity (grey-scale step); as a consequence, functionality is limited to CT scans. Third, a model-based segmentation approach is applied, where organ models ('meshes') that have been created from averaged manual expert segmentations are adapted to refine the structure (shape model-based step). This mesh evolution can be considered as being 'driven by the greyscale and constrained by the shape model' [47].
This study aims to evaluate the accuracy and time-saving of SPICE to define OARs for salivary-, swallowing- and cochlea-sparing IMRT.

Methods
Ten radiotherapy planning CT scans were selected where the OARs of interest were not distorted by tumour or artefact (treatment planning system, Pinnacle³ version 9.4). Five clinicians (four Consultants/Attending Physicians and one Fellow) recorded for each scan the time to manually delineate the parotid and submandibular glands, larynx (supraglottic and glottic larynx defined as one structure), pharyngeal constrictor muscles (superior, middle, inferior pharyngeal constrictor muscles and cricopharyngeus muscle defined as one structure) and cochleae according to a locally agreed protocol based on published guidelines ('manual' contours) [14,48,49]. SPICE was then used to define these structures ('SPICE' contours). Each clinician determined by visual inspection the acceptability of SPICE contours for each structure and the total time to modify these for each scan ('modified SPICE' contours). The modified SPICE contours represent the utilisation of SPICE in clinical practice (clinician review and modification). These also demonstrate introduction of bias by automatic segmentation (in the absence of bias, modified and manual contours should ideally match).
The Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm employs a probability map to create a 'best fit' from a collection of contours (Figure 1) [50]. The STAPLE algorithm created a reference standard from all clinician manual contours ('STAPLE' contours). The manual, SPICE and modified SPICE contours were compared to STAPLE by the Dice similarity coefficient (DSC) and the mean/maximum distance to agreement (DTA). The DSC is a statistical measure of spatial overlap between two structures. It is defined as twice the intersection volume divided by the sum of the two volumes, DSC = 2|A ∩ B| / (|A| + |B|), and normalises the degree of intersection from 0 (no overlap) to 1 (perfect overlap), with good agreement defined as >0.7-0.8 [41,51,52]. DTA is a geometrical parameter that measures the per voxel shortest distance from the surface of one structure to another, with an ideal value of 0 mm. Paired structures (parotid glands, submandibular glands and cochleae) were considered together. For the parotid and submandibular glands, SPICE generated three contours ('1', '2' or '3'), which were each based on different 'ground truth' data [53]. Comparisons between these and STAPLE for all 10 patients were made to determine the most accurate, for subsequent use and evaluation. The study was conducted with appropriate local R&D approval.
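The two comparison measures described above can be illustrated with a minimal Python sketch. This is not the software used in the study. It assumes that each structure is represented as a set of integer voxel coordinates, that a surface voxel is one with at least one face-neighbour outside the structure, and that DTA is measured in one direction only (from the first structure to the second). The function names (`dice`, `surface`, `dta`, `box`) and the toy volumes are hypothetical.

```python
from itertools import product
from math import dist

def dice(a, b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two sets of voxel coordinates."""
    if not a and not b:
        return 1.0  # two empty structures are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))

def surface(voxels):
    """Voxels with at least one face-neighbour outside the structure."""
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    return {
        v for v in voxels
        if any((v[0] + dx, v[1] + dy, v[2] + dz) not in voxels
               for dx, dy, dz in offsets)
    }

def dta(a, b, spacing=(1.0, 1.0, 1.0)):
    """Mean and maximum shortest distance (mm) from the surface of a to the surface of b.

    One-directional by assumption; spacing is the voxel size in mm.
    """
    def to_mm(v):
        return tuple(c * s for c, s in zip(v, spacing))

    surface_b = [to_mm(v) for v in surface(b)]
    distances = [min(dist(to_mm(v), q) for q in surface_b) for v in surface(a)]
    return sum(distances) / len(distances), max(distances)

def box(x0, x1, y0, y1, z0, z1):
    """Toy structure: all voxels in a rectangular block."""
    return set(product(range(x0, x1), range(y0, y1), range(z0, z1)))

# Toy example: a 4 x 4 x 4 reference and the same block shifted by one voxel.
reference = box(0, 4, 0, 4, 0, 4)
candidate = box(1, 5, 0, 4, 0, 4)

print(dice(reference, candidate))  # 0.75
mean_dta, max_dta = dta(candidate, reference)
print(mean_dta, max_dta)
```

The brute-force nearest-surface search is adequate for small toy volumes only; clinical-sized structures would normally need a distance transform or a spatial index.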
Statistical comparisons using multiple linear regression analysis (to control for possible individual patient/scan or clinician confounding factors) were made between mean values of all metrics for: SPICE against STAPLE versus manual against STAPLE (to determine the accuracy of SPICE); and modified SPICE against STAPLE versus manual against STAPLE (to determine the accuracy of modified SPICE, i.e., the utilisation of SPICE). As a further measure of accuracy, SPICE was compared with the most discordant clinician contours (determined against STAPLE and by ranking of clinicians) for each structure, measured by DSC and DTA, using the Wilcoxon signed rank test. The total times to manually delineate versus modify SPICE contours for all structures and clinicians were compared using Student's paired t-test (to determine efficiency in the utilisation of SPICE). Significance was assessed at the p < 0.05 level.
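As an illustration of the paired timing comparison, the following minimal Python sketch computes the paired t statistic and its degrees of freedom using only the standard library. The statistical software used in the study is not stated, so this is not the authors' implementation. The function name `paired_t` and the example times are hypothetical and are not study data.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t, degrees of freedom) for paired samples x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    d = [a - b for a, b in zip(x, y)]      # per-pair differences
    se = stdev(d) / sqrt(len(d))           # standard error of the mean difference
    return mean(d) / se, len(d) - 1

# Hypothetical per-scan times in minutes, for illustration only.
modified = [16.0, 18.5, 15.0, 17.0, 14.5]
manual = [14.0, 15.5, 13.5, 14.0, 14.0]

t, df = paired_t(modified, manual)
print(f"t = {t:.2f}, df = {df}")
```

The absolute value of t is then compared with the critical value of Student's t distribution for the given degrees of freedom (2.776 for df = 4 at the two-sided 0.05 level), or converted to a p-value with a statistics package.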

Results
Accuracy of SPICE
SPICE submandibular gland '1' and parotid gland '2' contours demonstrated best concordance with STAPLE (Table 1) and were used in subsequent comparisons.
The mean DSCs were significantly reduced for SPICE contours compared with manual contours for all structures (Figure 2). All SPICE contours were inferior to the most discordant manual contours (Figure 2). However, for SPICE contours of the parotid and submandibular glands, the respective medians and interquartile ranges for DSCs were 0.79 (0.74, 0.83) and 0.80 (0.70, 0.85), suggesting acceptability for these structures. The mean and maximum DTAs for SPICE and manual contours were similar for parotid glands and cochleae but statistically significantly worse for SPICE for submandibular glands, larynx and pharyngeal constrictor muscles (Figures 3 and 4). Similarly, except for the parotid glands and cochleae, the mean and maximum DTAs of SPICE contours were inferior to those of the most discordant clinician manual contours. However, for submandibular glands, the respective medians and interquartile ranges for mean and maximum DTAs were relatively small: 0.6 mm (0.4-1.0 mm) and 5.6 mm (4.7-8.2 mm).

Utilisation of SPICE
The total proportions of SPICE contours determined by visual inspection not to require alteration were: parotid glands (17%), submandibular glands (41%), larynx (8%), pharyngeal constrictor muscles (4%), and cochleae (28%). The mean DSCs were significantly reduced for modified SPICE contours compared with manual contours for all structures (Figure 5). However, the respective medians and interquartile ranges for modified SPICE DSCs for parotid glands, submandibular glands and larynx were 0.85 (0.83, 0.86), 0.85 (0.82, 0.87) and 0.76 (0.72, 0.82), which represented good agreement. The mean and maximum DTAs for modified SPICE contours compared with manual contours were similar for the pharyngeal constrictor muscles and cochleae but significantly worse for parotid glands, submandibular glands and larynx (Figures 6 and 7). For these three structures, the respective median and interquartile ranges for the mean/maximum DTAs were 1.

Efficiency in utilisation of SPICE
The respective per scan overall mean times for manual and modified SPICE contours were 14.0 and 16.2 minutes (difference, 15.7%) (Figure 8). Only one out of five clinicians showed a mean reduction in per scan overall time to modify SPICE contours compared with manual.
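The quoted percentage can be checked with a short calculation. The sketch below assumes that the difference is expressed relative to the manual time; the variable names are hypothetical.

```python
manual_min = 14.0    # mean manual time per scan (minutes), from the text
modified_min = 16.2  # mean time to modify SPICE contours per scan (minutes)

# Relative increase in time, taking manual delineation as the baseline.
relative_increase = (modified_min - manual_min) / manual_min * 100
print(f"{relative_increase:.1f}%")  # 15.7%
```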

Discussion
This study showed that for head and neck OARs: (i) SPICE contours were less accurate than manual contours, but acceptable for the definition of parotid and submandibular glands; (ii) modified SPICE contours remained inferior to manual contours; and (iii) the utilisation of SPICE compared with manual delineation did not save time or improve efficiency. Automatic segmentation to define selected head and neck OARs may reduce inter-observer variability [54,55]. Chao et al compared manual and modified automatic contours, for two CT scans and eight clinicians, for delineation of the clinical target volume as well as parotid glands, spinal cord, brainstem and (for one scan) the optic apparatus [54]. For the OARs, inter-observer variability was significantly reduced for modified compared with manual contours. This was associated with a mean time saving of 26%-47%, which depended on the experience of the oncologist. In a subsequent study, the ISOgray atlas-based auto-segmentation algorithm was evaluated for definition of the brainstem, parotid glands and mandible [55]. The study was conducted at 2 centres, where a total of 3 clinicians either manually delineated (2 clinicians, 3 scans each) or modified automated contours (1 clinician, 7 scans); for only one scan were both manual and modified contours defined. The mean DSCs for all organs were 0.68 and 0.82 for manual and modified contours, respectively; and the sensitivity and specificity for manual versus modified contours were 63%-91% and 60%-80% versus 63%-91% and 89%-98%, respectively. These results suggested reduced inter-observer variability for modified contours compared with manual. However, while demonstration of reduced inter-observer variability is important, it is not sufficient, because there is potential introduction of bias and systematic errors.
Figure 3 Mean Distance to Agreement (mm) - SPICE against STAPLE compared with: (i) all manual contours against STAPLE (left-side graphs); (ii) individual clinicians' manual contours against STAPLE (right-side graphs, statistical comparisons shown between most discordant clinician contours against STAPLE versus SPICE against STAPLE) for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or SPICE contours (for paired organs, two per scan).
The updated Brainlab automated segmentation algorithm, which employs atlas-based and deformable registration, was assessed for accuracy of definition of neck nodal regions and selected head and neck OARs [56]. In 10 'ideal' cases without neck nodes on at least one side, the ipsilateral parotid gland, spinal cord and brainstem were contoured; and in 10 cases with neck node involvement both parotid glands, submandibular glands, spinal cord, brainstem and mandible were defined. One clinician manually contoured and then modified the automatic contours for each scan/patient. The automatic and modified contours were compared with manual contours using the DSC as well as mean and maximum DTA. The spinal cord and mandible contours were not included in the analysis because the automatic contours did not require modification, except for mandible in one case. For the second group of 10 cases, the OARs were considered together. The authors found that, except for spinal cord, the automatic contours systematically required some modification, with resultant improvement in DSC and DTA measures. There was increased efficiency in definition of OARs, with a reduction in mean time from 11.2 minutes for manual contours to 4.5 minutes for modified contours (60%) and from 16.4 to 6.3 minutes (62%) in the respective groups. This time-saving is partly due to the automatic contours for spinal cord, brainstem and mandible requiring little or no modification.
Figure 4 Maximum Distance to Agreement (mm) - SPICE against STAPLE compared with: (i) all manual contours against STAPLE (left-side graphs); (ii) individual clinicians' manual contours against STAPLE (right-side graphs, statistical comparisons shown between most discordant clinician contours against STAPLE versus SPICE against STAPLE) for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or SPICE contours (for paired organs, two per scan).
Clinical validation of a multiple-subject atlas-based autosegmentation tool was performed by measuring the DSC and mean DTA for manual contours (outlined by one of 10 clinicians and agreed by an expert panel) and modified contours (outlined by one of two clinicians) for neck levels, parotid and submandibular glands in 12 patients [57]. For manual versus automatic contours, the respective DSC/mean DTA for parotid and submandibular glands were 0.80/2.3 mm and 0.72/1.6 mm. For manual versus modified automatic contours, the respective DSC/ mean DTA for parotid and submandibular glands were 0.81/2.1 mm and 0.77/1.2 mm.
We found that SPICE automatic contours were less accurate than manual contours for all investigated structures, but acceptable for the parotid and submandibular glands. For the parotid and submandibular glands, the DSCs were satisfactory [41,52]; for parotid glands, the mean and maximum DTAs were similar to manual contours and for submandibular glands, the differences were relatively minor. The modification of automatic contours improved accuracy, but the modified contours remained inferior to manual contours and modification did not result in time-saving. There are a number of possible reasons for these findings. First, the processes of automatic segmentation, both grey-scale and model-based, are limited by insensitivity to boundary or edge detection [47]. This is important because the differences in attenuation between soft tissues are often small and the shapes of organs divergent. The computer-based algorithms do not account for nuances in the honed technique of the expert manual contourer. Second, while there are published delineation guidelines for OARs, there is no agreed international consensus, especially for definition of the larynx and pharyngeal constrictor muscles [14,48]. The SPICE atlas may have been developed from dissimilar 'ground truth' contours. Where available, an alternative investigational strategy would be to adapt the local contouring protocol to that used to define the atlas contours [58]. Third, to produce tightly conformed volumes, relatively small alterations in automatic contours may be required, which are time-consuming. The modification process is then less efficient than manual delineation, where techniques such as interpolation between CT slice levels may be used.
Figure 5 Dice similarity coefficient - Modified SPICE against STAPLE compared with all manual contours against STAPLE for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or modified SPICE contours (for paired organs, two per scan).
Figure 6 Mean Distance to Agreement (mm) - Modified SPICE against STAPLE compared with all manual contours against STAPLE for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or modified SPICE contours (for paired organs, two per scan).
Whether differences between manual, automatic or modified contours result in clinically relevant alterations in measured doses to OARs is uncertain. This will partly depend on proximity of normal structures to the treatment volume and the dose gradient. In this study, the target volumes were not defined. This may have influenced the low percentage of OARs determined by visual inspection not to require alteration i.e., clinicians only considered the conformity of automatic contours to normal structures rather than clinical relevance or requirement for this.
This study represents an independent clinical evaluation of automatic segmentation using SPICE and its utilisation for head and neck OARs. It determined the accuracy of SPICE by comparison against a reference standard created using STAPLE, for five head and neck OARs important in function-sparing IMRT. Future work should evaluate automatic segmentation in the presence of distortion by tumour or artefact, e.g., dental amalgam, and determine the variation in measured dose to OARs between manual, automatic and modified contours.
Figure 7 Maximum Distance to Agreement (mm) - Modified SPICE against STAPLE compared with all manual contours against STAPLE for A. parotid glands, B. submandibular glands, C. larynx, D. pharyngeal constrictor muscles, E. cochleae. *p < 0.05, **p < 0.01, ***p < 0.001. Abbreviations: n, total number of manual or modified SPICE contours (for paired organs, two per scan).

Conclusion
For the investigated head and neck OARs, SPICE automatic segmentations were less accurate than manual contours. However, these were acceptable for the definition of parotid and submandibular glands. The modification of SPICE contours improved accuracy, but these remained inferior to manual contours and the process did not result in time-saving. Improvements in automatic segmentation of head and neck OARs would be worthwhile and are required before routine clinical implementation.

Competing interests
The authors declare that they have no competing interests.
Figure 8 Efficiency in utilisation of SPICE - A. Total time per scan for all clinicians to manually delineate and to modify SPICE contours; and B. Time differences per scan between modified SPICE contours compared with manual for each clinician (positive values: increase in time to modified versus manual contours); **p < 0.01. Abbreviations: n, total number of CT scans.
Authors' contributions
DT designed and coordinated the study, participated in contouring, analysed part of the data, interpreted data, drafted the manuscript. CB performed