Systematic evaluation of three different commercial software solutions for automatic segmentation for adaptive therapy in head-and-neck, prostate and pleural cancer

Purpose: To validate, in the context of adaptive radiotherapy, three commercial software solutions for atlas-based segmentation.
Methods and materials: Fifteen patients, five for each group, with cancer of the Head&Neck, pleura, and prostate were enrolled in the study. In addition to the treatment planning CT (pCT) images, one replanning CT (rCT) image set was acquired for each patient during the RT course. Three experienced physicians outlined all the volumes of interest (VOIs) on the pCT and rCT. We used three software solutions (VelocityAI 2.6.2 (V), MIM 5.1.1 (M) by MIMVista and ABAS 2.0 (A) by CMS-Elekta) to generate the automatic contouring on the repeated CT. All the VOIs obtained with automatic contouring (AC) were subsequently corrected manually. We recorded the time needed for: 1) ex novo ROI definition on rCT; 2) generation of AC by the three software solutions; 3) manual correction of AC. To compare the quality of the volumes obtained automatically by the software and manually corrected with those drawn from scratch on rCT, we used the following indexes: overlap coefficient (DICE), sensitivity, inclusiveness index, difference in volume, and displacement differences on three axes (x, y, z) from the isocenter.
Results: The time saved by the three software solutions for all the sites, compared to manual contouring from scratch, is statistically significant and similar for all three software solutions. The time saved for each site is as follows: about an hour for Head&Neck, about 40 minutes for prostate, and about 20 minutes for mesothelioma. The best DICE similarity coefficient index was obtained with the manual correction for: A (contours for prostate), A and M (contours for H&N), and M (contours for mesothelioma).
Conclusions: From a clinical point of view, the automated contouring workflow was shown to be significantly shorter than the manual contouring process, even though manual correction of the VOIs is always needed.


Introduction
Anatomical variations occurring during irradiation, including tumor shrinkage and shape deformation, can be significant and can result in suboptimal treatment of patients, especially when highly conformal treatment techniques, such as intensity modulated photon or proton therapy, are used [1,2].
Repeat imaging and replanning, even with a single mid-treatment CT scan only, can significantly improve tumor coverage and organ sparing in patients who experienced clinically apparent changes in anatomy [3,4]. However, in clinical practice adaptive planning is limited to a small number of patients because it requires multiple physician-drawn volumes. Wider adoption of adaptive radiotherapy therefore requires easing the onerous task of recontouring.
In this context, a fast, robust, and automatic region-of-interest (ROI) delineation method is needed, and this need has led to a growing array of automatic contouring (AC) software [5][6][7]. These algorithms generally deform one set of contours from an initial CT to fit the anatomy of a second CT, and they can also redraw the ROIs for each CT from scratch.
The purpose of the present study was to compare three different commercial algorithms by applying them to a number of clinical cases. The software solutions (SS) were evaluated quantitatively in terms of both speed and reliability.

Methods and materials
All volumes of interest (VOIs) were outlined manually by three oncologists, each specialized in the relevant site (e.g. H&N was contoured by a H&N oncologist, etc.), on the planning CT (pCT), which served as the atlas for AC on the replanning CT (rCT) in each of the three SS.
We used three commercial software solutions to generate the automatic contours; all these VOIs were subsequently corrected manually (ACMC) by the same experienced physicians. The VOIs on the rCT were also manually contoured from scratch, and these represented our reference volumes (Vref). The times required for AC, for ACMC (when needed), and for Vref contouring were recorded. To evaluate the quality of the contours, AC and ACMC were compared, using several parameters, with those manually delineated on rCT (Vref).

Patient data and manual contouring
A conventional helical CT scanner was used for image acquisition. A total of 15 patients (five with locally advanced head and neck (H&N) tumors, five with malignant pleural mesotheliomas (MPM) and five with high-risk prostate cancer (HRPCa)) were enrolled in the study. In addition to the treatment pCT images, a further set of CT images (rCT) was acquired for each patient during the RT course, usually in the middle of the treatment. For H&N, prostate and mesothelioma patients the images had a slice thickness of 3, 2.5, and 5 mm, respectively. A commercial treatment planning system (Focal 4D by Elekta, Sweden) was used for the manual contouring of ROIs from scratch (Figure 1).
All H&N patients (two oropharynx, one oral cavity, one nasopharynx, and one larynx) had pathologically confirmed Stage III-IV disease. The target volumes were determined according to the ICRU definition of GTV and CTV. For each patient, neck levels were delineated according to international consensus guidelines [8]. In addition, 16 organs at risk (OARs) were contoured (parotid, cochlea, esophagus, brainstem, spinal cord, mandible, thyroid, pharynx, masticatory spaces, larynx, oral cavity, temporal lobes, eyes, lenses, optic nerves and chiasm).
Five MPM patients had been previously treated with extraperitoneal pleuro-pneumonectomy and received adjuvant thoracic irradiation. The CTV included the entire hemithorax and thoracotomy incision and site of chest drains. Two patients had a lesion on the right side and three on the left. Contoured normal tissues were: contralateral lung, heart, esophagus, liver, bowel, spinal cord, spleen, kidneys.
Five HRPCa patients (PSA > 20 ng/mL; Gleason score 8-10 or c/pT3a/b) [9] had been previously treated, two with definitive radiotherapy and three with post-operative irradiation. The CTV encompassed the prostate and seminal vesicles (definitive irradiation) or prostatic bed (post-operative irradiation) and pelvic lymph nodes. The defined OARs were: rectum, bladder, femoral heads and bowel.
In ABAS an atlas patient consists of a CT scan with pre-defined ROIs, both target volumes and OARs. A detailed description of the method has been published by Han et al. [10]. First, non-rigid registration is used to transform the CT scan of an atlas patient (pCT) to the replanning CT scan. Specific models, e.g. for H&N and prostate, are available in the software and take structure-specific information, such as elasticity, into account. Then, using the obtained transformation, auto-contours are generated by mapping the atlas contours to the replanning CT scan.
In MIM we also used a single-atlas segmentation approach: the pCT of the patient was inserted into the atlas and, subsequently, the algorithm extracted information from one CT to generate the automated contours of the rCT. To do this, a rigid registration with rotations was applied first, followed by deformable registration. The previously validated intensity-based free-form deformable registration algorithm uses regularization to minimize the likelihood of folds or tears in the deformation fields when fitting one CT to another [11].
A single-patient-atlas segmentation approach was also used with VelocityAI. Between the planning CT and the replanning CT, we first applied a rigid registration with rotations and then a deformable multi-pass registration. Finally, we copied the contours from the planning CT to the replanning CT, and the software automatically applied the deformation matrix to them. VelocityAI uses the basis-spline (B-spline) method [12] for deformable registrations.

Time/speed evaluation
Focal 4D, ABAS and MIM were installed on a 3 GHz HP xw 8600 workstation running Windows with 8 GB RAM, whereas VelocityAI was installed on a 2.66 GHz HP xw 8600 workstation running Windows with 4 GB RAM. We recorded the time required for the ex novo ROI definition on rCT, the time for the three software solutions to generate the AC, and the time for manual correction of the AC.
The time to manually define the volumes on rCT was measured from the opening of the latest CT to the completion of the last ROI. The time needed by the software solutions to generate automatic contours was measured from when the CT was imported until the end of the entire generation process. The time needed to check the automatically obtained volumes was measured from the loading of the CT until the final volume correction (Figure 1). The usefulness of the automatic contouring procedure was evaluated by comparing 1) the time needed for the software + manual corrections vs. manual contouring from scratch, or 2) just the manual correction time vs. manual contouring from scratch (i.e. not considering the time needed by the computer to generate the deformed contours).

Quantitative evaluation of automated and manually corrected contours
The performance of the automatic segmentation software was assessed by quantitatively comparing manual Vref contours with AC or ACMC contours in terms of volume, position and shape. A sensitivity and specificity study was also conducted. Manual segmentation was used as the reference segmentation.
As an initial measure of the similarity between the automatic and manual contours, the volume of every structure was calculated, and the difference between the automatically generated volume ($V_{AA}$, $V_{MA}$, $V_{VA}$ for ABAS, MIM and VelocityAI, respectively) and the manually generated volume, or reference volume ($V_{ref}$), was calculated for each structure as a percentage of the reference volume:

$$\Delta V = \frac{V_{auto} - V_{ref}}{V_{ref}} \times 100\%$$

The difference between the manually corrected automatic volume ($V_{AM}$, $V_{MM}$, $V_{VM}$ for ABAS, MIM and VelocityAI, respectively) and the Vref was also calculated.
Since its introduction, the DICE similarity coefficient (DSC) index [13] has been widely used in the evaluation of deformable image registration results. The DSC index is defined as

$$DSC = \frac{2\,|V_{ref} \cap V_{auto}|}{|V_{ref}| + |V_{auto}|}$$

where Vref was compared to the automatically contoured ROIs (or to the ROIs manually corrected after automatic generation). DSC values range from 0 to 1, and equal 1 when the automatic and manual volumes are identical and intersect completely.
For all the software solutions evaluated, the sensitivity index (Se) of the contours was computed as:

$$Se = \frac{|V_{ref} \cap V_{auto}|}{|V_{ref}|}$$

The sensitivity reflects the probability that the automatic contours (before or after the manual corrections) match the reference contour, and some authors have renamed it the overlapping index (OI) [7].
We defined, as a surrogate of the specificity, the inclusiveness index (IncI):

$$IncI = \frac{|V_{ref} \cap V_{auto}|}{|V_{auto}|}$$

The inclusiveness index reflects the inclusion of $V_{auto}$ within $V_{ref}$, i.e. the probability that a voxel of $V_{auto}$ is really a voxel of $V_{ref}$.
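The volume and overlap indices described above can be illustrated with a minimal Python sketch. It assumes that each structure is available as a set of voxel indices on a common grid with equal voxel volumes, and that ΔV is expressed as a percentage of the reference volume; the function names and the toy structures are hypothetical and are not part of the analysis software used in the study.

```python
def overlap_indices(v_ref, v_auto):
    """Return (DSC, Se, IncI) for two structures given as sets of voxel indices."""
    if not v_ref or not v_auto:
        raise ValueError("both structures must contain at least one voxel")
    common = len(v_ref & v_auto)                   # voxels shared by both structures
    dsc = 2 * common / (len(v_ref) + len(v_auto))  # DICE similarity coefficient
    se = common / len(v_ref)                       # share of the reference that is covered
    inc_i = common / len(v_auto)                   # share of the automatic contour inside the reference
    return dsc, se, inc_i


def volume_difference_percent(v_ref, v_auto):
    """Relative volume difference in percent (assumed sign: positive when V_auto is larger)."""
    return 100.0 * (len(v_auto) - len(v_ref)) / len(v_ref)


# Toy example: a 4 x 4 reference square and an automatic contour shifted by one voxel in x.
v_ref = {(x, y, 0) for x in range(4) for y in range(4)}
v_auto = {(x, y, 0) for x in range(1, 5) for y in range(4)}

dsc, se, inc_i = overlap_indices(v_ref, v_auto)
delta_v = volume_difference_percent(v_ref, v_auto)
print(dsc, se, inc_i, delta_v)
```

In this toy case the two volumes are equal, so ΔV is zero even though the overlap indices are well below 1, which is why a volume comparison alone is only an initial measure of similarity.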
To help the reader get an idea of some parameter trends, a modified Receiver Operating Characteristic (mROC) analysis was done by plotting the sensitivity vs. (1 − IncI) for some delineated structures. The best possible result was expected to yield a point in the upper left corner, or coordinate (0, 1), of the ROC space, representing 100% sensitivity (all voxels are true positive) and 100% inclusion (surrogate of specificity, i.e. no false positive voxel is present).
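As a rough illustration of how a contour maps into the mROC space, the sketch below converts a pair (Se, IncI) into the plotted point (1 − IncI, Se). The Euclidean distance to the ideal corner (0, 1) is added here only as a convenient summary and is an assumption of this example, not a metric used in the study; the numerical values are hypothetical.

```python
import math


def mroc_point(se, inc_i):
    """Map (Se, IncI) to mROC coordinates: x = 1 - IncI, y = Se."""
    return (1.0 - inc_i, se)


def distance_to_ideal(se, inc_i):
    """Euclidean distance from the mROC point to the ideal corner (0, 1)."""
    x, y = mroc_point(se, inc_i)
    return math.hypot(x - 0.0, y - 1.0)


# Hypothetical values for one structure, before and after manual correction.
before = distance_to_ideal(se=0.80, inc_i=0.70)
after = distance_to_ideal(se=0.90, inc_i=0.88)
print(after < before)  # the corrected contour lies closer to the upper left corner
```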
As a general measure for the location of the structures, the mass centre was calculated for each patient and for each structure (manually defined from scratch, i.e. the reference structure; automatically generated; and manually corrected after automatic generation), and the distance along the three coordinates was evaluated:

$$\Delta x = x_{auto} - x_{ref}, \qquad \Delta y = y_{auto} - y_{ref}, \qquad \Delta z = z_{auto} - z_{ref}$$

As reported in Figure 1, in order to evaluate these parameters in a systematic and consistent way, all DICOM images and structures were exported to VOD-CA4rt (MSS GmbH, Hagendorn, Switzerland) version 4.4.1. We then used the analysis toolbox of VODCA for the automatic calculation of the variables described above. A non-parametric Wilcoxon signed rank test was used to determine whether or not the observed differences were statistically significant. The Holm-Bonferroni correction was considered as well.

Results
Table 1 shows the average time needed for manual contouring, the software time for AC and the correction time for the VOIs on rCT for each anatomical site. The differences in AC time between A, M and V are always statistically significant.

Time/speed evaluation
After comparing the sum of the average duration of automatic contouring + manual correction for each site and software with the total time needed to obtain the Vref, we can conclude the following: 1) for prostate patients, MIM was the software that gained the most time (55 min), while the average gain was 31 min and 41 min with ABAS and VelocityAI, respectively; 2) for the H&N site, ABAS was the most time-saving software, with an average gain of 1 hour and 23 min, while MIM saved 1 hour and 4 min and VelocityAI 45 min; 3) for the mesothelioma cases, the average gain was 22 min with both ABAS and MIM, and 15 min with VelocityAI. For all sites, the time gain for all three software solutions, compared to manual contouring from scratch, is statistically significant. In Table 1 we also report the time needed for manual correction separately from the time needed for the automatic contouring, as the reader may be interested in the 'physician time' saved, independently of the time the software takes to generate the automatic contours.

Contour evaluation
Regarding the DSC index for structures located in the pelvis, an important example is given by the rectum. Table 2 shows how similar this index is for all three software solutions with regard to the AC (AA = 0.77, MA = 0.75, VA = 0.75). However, it is quite far from the value 1; this means that none of the three automatically obtained VOIs closely matches the Vref. After manual correction of the automatic contours, the DSC improved for all the software solutions, but the values of this index still remained ≤ 0.9. For this organ we found a ΔV from about +30% to −15%, for AA and MA respectively, that was reduced to about 5% after the manual correction. The Δz values for the rectum (before and after manual correction) underlined the difficulty in contouring the cranio-caudal limits of this structure. The three software solutions tended to give CTV1 and bowel volumes smaller than the Vref. VelocityAI had a lower sensitivity index, also after manual correction, compared to the other two software solutions.
Regarding the H&N cases (Table 3), there were no statistically significant differences between the three software solutions for the CTV2, larynx, and superior part of the larynx. Generally, after manual correction, ABAS showed a higher inclusiveness index whereas MIM showed a higher sensitivity. From the ΔV parameter analysis, we found that ABAS tended to underestimate the volume of the VOI, while MIM and VelocityAI tended to overestimate it.
Finally, in the mesothelioma cases (Table 4), the sensitivity index was usually higher for ABAS before manual correction and for VelocityAI after manual correction. Regarding the IncI index, statistically significant differences are present in both automatic and manually corrected contours: MIM usually performed best in both cases. For the automatic contours, the ΔV ranged from about +30% for the esophagus to −10% for the intestine; these differences usually decreased after ACMC, but sometimes remained high (e.g. CTV, esophagus, and spinal cord).
Figures 2, 3 and 4 show the mROC analysis of the performance of automatic segmentation compared to manual correction after automatic segmentation. The mROC curves for the selected OARs exhibited the same behaviour: all the points improved with manual correction (i.e. they moved towards the upper left corner of the mROC space), but under our clinical conditions some discrepancies still remain compared to the reference structure.
Regarding the automatic recontouring of the tumor, we can summarize that for prostate patients the DSC index improved after manual correction, but it still remained below 0.9; in H&N cases, the DSC index improved after manual correction for CTV1 (but still remained below 0.8) and remained almost unchanged for CTV2. In the mesothelioma patients, the DSC index improved, for the three software solutions, from an average value of 0.85 before manual correction to 0.9, with MIM performing slightly better than the other two software solutions.
Furthermore, due to the low number of patients examined, none of the differences between the software solutions would have been statistically significant after applying the Holm-Bonferroni method. That is why we report in our Tables the p values obtained with the Wilcoxon test.
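To show why a set of individually significant p values can lose significance after correction, here is a minimal sketch of the Holm-Bonferroni step-down procedure. The function name, the significance level of 0.05 and the p values are assumptions chosen for illustration; they are not taken from the study tables.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank) for the rank-th smallest p value.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, all larger p values are retained as well
    return rejected


# Hypothetical p values, each below 0.05 when considered on its own.
p = [0.04, 0.03, 0.02, 0.045]
print(holm_bonferroni(p))  # the smallest p value is compared with 0.05 / 4 = 0.0125
```

With these hypothetical values the smallest p value already exceeds its threshold, so no comparison remains significant after correction.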

Discussion
The need to replan and adapt treatment for internal anatomy variations due to tumor shrinkage and shape deformation [1,3] has increased over the years in order to make better use of highly conformal treatment techniques. However, this modality is very time-consuming. In order to reduce the commitment of medical staff in target and ROI delineation and modification, systems for AC have been increasingly developed. The use of atlas-based tools to delineate OARs for cancer sites including H&N [11,14], breast [15], endometrium [16] and prostate [17] has been shown to reduce volume delineation variability and the total time required to contour. In this study, we compared three different commercial software solutions for atlas-based autocontouring through a comparison with manual delineation of target and OAR in three tumor sites. For the purpose of this study, contours of manually generated VOIs on pCT were taken as a reference atlas. These were then compared to the VOIs (contours) automatically generated by A, M and V, and subsequently corrected manually. This procedure has proved to be time saving although the AC must be re-checked and corrected manually by physicians: on average, about 40 minutes were saved for the HRPCa, one hour for the H&N patients, and 20 minutes for the MPM.
Table 3 Mean values and standard deviations of parameters that evaluate the contours generated by the three software solutions, before and after the manual correction, for each organ of the head and neck (Continued)
Regarding prostate cases, the auto-segmentation module faces the same problems as the clinicians when drawing the prostate: a) in the cranial direction there is poor or no contrast on the CT between the base of the prostate and the bladder, b) in the caudal direction, there is poor or no contrast between the apex of the prostate and the rectum. As in the correction/replanning of the prostate plans, the volumes that needed the most corrections were the rectum, the CTV and the bowel.
The volumes closer to the reference volumes were the femoral heads and the bladder. It was noted that the most cranial and caudal slices of all volumes underwent more changes, leading to greater intraobserver variability. This was especially true for certain organs, such as the rectum (Figure 5); we also found that, after manual correction, the ΔV and Δz parameters can remain significant for some organs. For the Vref, the rectum was contoured according to the guidelines [9], whereas for the AC these anatomical limits were not always respected. This may contribute to the intraobserver variability: the physician may end up correcting an automatic contour that was misleading from the start. The placement of a rectal balloon and a strict protocol for bladder filling could help the automatic recontouring process for the prostate patient.
With regard to the H&N cases, the first consideration is that the re-planning CT was performed without contrast media; contrast media would surely have helped the physician with target delineation, and probably the three software packages too. Nonetheless, the results are good and promising in terms of both time and contouring accuracy. All three software solutions significantly reduce the time needed to recontour the VOIs in comparison to the time needed to contour the same VOIs from scratch (Vref). Indeed, the automatic contouring time, the manual correction time and their sum are all statistically shorter than the Vref contouring time. Each software allows 1 hour to be saved, which is undoubtedly relevant in daily clinical activity. The significant differences found amongst the times provided by the three software packages can be explained in part by the fact that the referring contouring physician for the H&N area had been using one of them in his clinical practice during the months preceding the analysis. When quality is evaluated according to the established parameters (ΔV, Δx,y,z, DSC, sensitivity and inclusiveness indexes), the VOIs generated with the AC and the VOIs manually corrected from AC compare favourably with their corresponding Vref VOIs. Indeed, even the lower scores of the quality indexes are in an acceptable range. As for the prostate cases, we found a volume variation between the Vref VOIs and the automatically generated, manually corrected VOIs, in particular for CTVs, for organs with imprecisely defined boundaries such as the superior pharynx, and for organs of small volume such as the cochlea. This intra-observer variability is a well-known phenomenon in radiotherapy planning, and it is more evident when there are no anatomical points of reference. As for the rectum, another explanation could be that automatic contour propagation produces contours which somehow "suggest" an incorrect contour shape to the human observer.
Variation in the position of the center of mass is particularly evident along the z axis, for both automatically generated and manually corrected contours, but it is limited to the length of a couple of slices. As expected, the scores of the DSC, sensitivity and inclusiveness indexes for ACMC contours are better than those of the corresponding automatically generated ones, pointing out the need for a physician to correct the automatically generated VOIs.
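To make the link between a centre-of-mass shift and the slice thickness concrete, the following sketch computes the displacement of the geometric centre of a voxel set and expresses Δz in slices. It assumes uniformly weighted voxels, an automatic-minus-reference sign convention and a hypothetical 1 mm in-plane spacing; only the 3 mm slice thickness corresponds to the H&N scans described in the methods.

```python
def centre_of_mass(voxels, spacing):
    """Mean voxel position in mm for a set of (i, j, k) indices and (dx, dy, dz) spacing."""
    n = len(voxels)
    if n == 0:
        raise ValueError("empty structure")
    return tuple(spacing[a] * sum(v[a] for v in voxels) / n for a in range(3))


def displacement(v_ref, v_auto, spacing):
    """Centre-of-mass shift (assumed convention: automatic minus reference) in mm."""
    c_ref = centre_of_mass(v_ref, spacing)
    c_auto = centre_of_mass(v_auto, spacing)
    return tuple(a - r for a, r in zip(c_auto, c_ref))


spacing = (1.0, 1.0, 3.0)  # hypothetical in-plane spacing, 3 mm slices
# Toy structures: the automatic contour starts one slice higher and ends two slices higher.
v_ref = {(x, y, z) for x in range(4) for y in range(4) for z in range(5)}
v_auto = {(x, y, z) for x in range(4) for y in range(4) for z in range(1, 7)}

dx, dy, dz = displacement(v_ref, v_auto, spacing)
print(dx, dy, dz, dz / spacing[2])  # shift in mm and, for z, in number of slices
```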
In the mesothelioma cases the bowel required some work to be re-contoured manually, particularly in the most cranial and caudal slices. Moreover, the thoracic cavity showed some differences, probably due to a different content of air cavities, requiring some more manual intervention.
The accuracy evaluated with the sensitivity, inclusiveness and DSC indexes, and with the other volumetric parameters, shows that none of the three software solutions always performs best: depending on the VOI, parameter and type of cancer considered, one software solution can be better than another. More importantly, a statistically significant difference between the software solutions does not lead to a clinically relevant difference. Looking at the data reported in our tables (Tables 2, 3 and 4), it is up to the clinician to assess what might be the most suitable software for the specific patient/protocol. We also have to emphasize that the presence of artifacts or relevant anatomical changes (bowel shape or filling, nasal cavity empty or full, etc.) could seriously affect the quality of the automatic contour generation.
Figure 4 mROC analysis of the performances of three software solutions evaluated in mesothelioma patients: a) esophagus, b) heart, and c) liver. Automatic segmentation and automatic segmentation + correction were compared to the manual contours from scratch.
Moreover, the three commercial software solutions have other differences that were not evaluated in this study. We can observe that A is the only one that does not have its own contouring tools: another software package is needed to import and review the results of the deformation. On the other hand, A is a simple and straightforward software for automatic contour generation. Both V and M have tools for the deformable registration of CT images and for cumulative dose-volume histogram calculation, and V also manages the deformable registration of MRI with CT images (useful for treatment planning on brain, prostate and for paediatric patients), but we did not test the reliability of such tools. M required more time to learn how to operate the software.
In the context of on-line adaptive treatment, automatic delineation of both the CTV and the OARs is important. In general, we can say that the higher the sensitivity of the automatic OAR segmentation, the lower the risk of over-irradiation of the organs; the higher the sensitivity of the CTV segmentation, the lower the risk of under-irradiation of tumor tissue. On the other hand, as discussed by Tsuji et al. [7], it is difficult to determine a priori whether automatic contours have acceptable accuracy because the importance lies also in the dosimetry of their resultant plans. Tsuji et al. found differences in target coverage and conformality with a similar range of DSC, and Voet et al. [6] also found underdosages in the PTV of up to 11 Gy even for DSC coefficients of 0.8. Furthermore, in the statistical analysis of Tsuji et al., a significant correlation between the overlapping index, which we call the "sensitivity index", and the target coverage was shown. Tsuji et al. concluded that because of its stronger correlation with target coverage, the sensitivity index may be a better initial measure to predict contour utility, as opposed to DSC. We did not evaluate the dosimetric effects of our contour discrepancies, but we believe that each protocol for automatic recontouring should also be evaluated from this point of view. This will be the goal of our future research.
As underlined in the literature [18], mapping planning contours to daily diagnostic CT images, instead of daily MVCT or kV CT, would facilitate adaptive replanning. Deformable image registration relies on image quality. In the present study, we used CT images from a fan-beam CT scanner. Our current automatic ROI delineation method can be directly applied to IGRT by CT-on-rail positioned in the treatment room. The image quality of cone-beam CT (CBCT) is inferior to that of a fan-beam CT scanner. More importantly, its signal-to-noise ratio is dramatically lower than that of regular fan-beam CT images.
If the contours are available on daily CT images, dose-volume histograms can be calculated to evaluate the necessity of replanning, or the contours can be used directly for intensity-modulated RT optimization. In addition, the daily dose distribution can be transformed back to the planning CT scan by using the same deformable image registration method to compare it to the original plan and estimate the cumulative doses delivered to the patient.
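As a simple illustration of how a contour on a daily CT leads to a dose-volume histogram, the sketch below builds a cumulative DVH from the doses of the voxels inside one structure. It assumes equal-volume voxels and a fixed dose bin width; the function name and the dose values are hypothetical and do not come from any treatment plan in this study.

```python
def cumulative_dvh(doses_gy, bin_width_gy=1.0):
    """Return [(dose level in Gy, % of the volume receiving at least that dose), ...]."""
    if not doses_gy:
        raise ValueError("structure has no voxels")
    n = len(doses_gy)
    n_bins = int(max(doses_gy) // bin_width_gy) + 1  # integer bin count avoids float drift
    dvh = []
    for i in range(n_bins):
        level = i * bin_width_gy
        covered = sum(1 for d in doses_gy if d >= level)  # voxels at or above this dose
        dvh.append((level, 100.0 * covered / n))
    return dvh


# Hypothetical voxel doses (Gy) inside one contoured structure.
dvh = cumulative_dvh([10.0, 20.0, 20.0, 30.0], bin_width_gy=10.0)
print(dvh)
```

Comparing such a histogram for the daily contour with the one from the original plan is one way the need for replanning could be assessed.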

Conclusion
The AC workflow was shown to be significantly shorter than the manual contouring process from scratch, even though manual correction of the VOIs is always needed. For the H&N site, a clinician can save about one hour, for a prostate patient, the time saved is about 40 minutes, and for a mesothelioma patient about 20 minutes. The differences, both in time and quality, between the software packages were statistically significant in many cases, but the absolute values of such differences are often modest.