Clinical implementation of MRI-based organs-at-risk auto-segmentation with convolutional networks for prostate radiotherapy.

Background: Structure delineation is a necessary, yet time-consuming manual procedure in radiotherapy. Recently, convolutional neural networks have been proposed to speed up and automate this procedure, obtaining promising results. With the advent of magnetic resonance imaging (MRI)-guided radiotherapy, MR-based segmentation is becoming increasingly relevant. However, the majority of studies have investigated automatic contouring based on computed tomography (CT).
Purpose: In this study, we investigate the feasibility of clinical use of deep learning-based automatic delineation of organs-at-risk (OARs) on MRI.
Materials and methods: We included 150 patients diagnosed with prostate cancer who underwent MR-only radiotherapy. A three-dimensional (3D) T1-weighted dual spoiled gradient-recalled echo sequence was acquired with 3T MRI for the generation of the synthetic-CT. The first 48 patients were included in a feasibility study training two 3D convolutional networks called DeepMedic and dense V-net (dV-net) to segment bladder, rectum and femurs. A research version of an atlas-based software was considered for comparison. Dice similarity coefficient, 95% Hausdorff distances (HD95), and mean distances were calculated against clinical delineations. For eight patients, an expert RTT scored the quality of the contouring for all three methods. A choice among the three approaches was made, and the chosen approach was retrained on 97 patients and implemented for automatic use in the clinical workflow. For the successive 53 patients, Dice, HD95 and mean distances were calculated against the clinically used delineations.
Results: DeepMedic, dV-net and the atlas-based software generated contours in 60 s, 4 s and 10-15 min, respectively. Performances were higher for both networks compared to the atlas-based software. The qualitative analysis demonstrated that delineations from DeepMedic required fewer adaptations, followed by dV-net and the atlas-based software. DeepMedic was clinically implemented. After retraining DeepMedic and testing on the successive patients, the performances slightly improved.
Conclusion: High conformality for OARs delineation was achieved with two in-house trained networks, obtaining a significant speed-up of the delineation procedure. A comparison of different approaches led to the successful adoption of one of the neural networks, DeepMedic, in the clinical workflow. In the clinical setting, DeepMedic maintained the accuracy obtained in the feasibility study.


Background
Structure delineation is a necessary, yet time-consuming manual procedure in radiotherapy. Consistent and accurate delineation of organs-at-risk (OARs) and target structures for prostate patients is vital when performing dose escalation and treating patients with highly conformal plans [1]. Traditionally, computed tomography (CT) has been used for radiotherapy simulation and structure delineation [2]. In the last few decades, magnetic resonance imaging (MRI) has found its way for radiotherapy simulation as it provides superior soft-tissue contrast compared to CT [3,4], thus enabling more accurate delineation of target regions and critical structures compared to CT [5][6][7].
The manual segmentation of anatomical structures is a time-consuming process [8]. Besides, with the advent of MR-guided radiotherapy [9][10][11], the accuracy and speed of delineations become the weakest link [12] that hinders the possibilities of online adaptive radiotherapy by being responsible for longer fraction time [13].
To automatically perform delineations of target and OARs for patients affected by prostate cancer, various methods have been developed over the past years. For example, three-dimensional (3D) deformable model surface [14], organ-based modelling [15], and atlas-based solutions [16,17] have been demonstrated. For all these methods, the time required to perform segmentation is in the order of minutes, if not hours, which is excessive to enable online adaptive treatments. To obviate this limitation, currently in online treatments only the target delineations and the OARs in the vicinity of the target (e.g. within a ring of 3-5 cm) are adjusted due to the excessive time needed for OARs segmentation [18][19][20].
Recently, deep learning has been proposed to speed up and automate segmentation, obtaining promising results [8,21,22]. Deep learning is a branch of artificial intelligence and machine learning that involves the use of neural networks to generate a hierarchical representation of the input data to achieve a specific task without the need for hand-engineered features [23,24].
Many studies focused on target delineations [8], reaching mean dice similarity coefficients against manual delineations in the range 0.82-0.95 [25][26][27][28][29][30][31]. Automatic delineation of OARs is also crucial to achieve full online adaptive radiotherapy and to possibly save time compared with manual contouring.
In this study, we aim at investigating the feasibility of convolutional neural network-based automatic OARs delineation on MRI. A preliminary retrospective study was conducted to select a suitable network architecture and prepare for clinical implementation. After having chosen the most suitable convolutional network and performing clinical implementation, performances of automatic deep learning-based OARs delineation from our clinic are presented.

Patient data collection
Patients diagnosed with intermediate and high-risk prostate cancer undergoing MR-only radiotherapy [32] between June 2018 and January 2020 were included in the study. Further inclusion criteria were the presence of four gold fiducial markers for position verification and the absence of hip implants. The patients were also scanned with a specific radio-frequency spoiled gradient-recalled echo (SPGR) sequence that is described in more detail further on. The clinical exclusion criteria for MR-only radiotherapy were: patients with more than four positive lymph-nodes (N1, as on PET-CT or after pelvic lymph-nodes dissection), life expectancy <10 years (as from WHO >3), prior pelvic irradiation, IPSS >20, presence of prostatitis, active Crohn's disease, ulcerative colitis or diverticulitis, an anastomotic bowel in the high dose region and patients undergoing trans-rectal prostate resection less than three months before treatment. After applying these exclusion criteria, a total of 150 patients were included in this study and treated with external beam radiotherapy.
For all patients, 3T MRI (Ingenia MR-RT, v 5.3.1, Philips Healthcare, the Netherlands) was acquired after requesting the patients to empty their bladder and drink 200-300 ml of water one hour before the acquisition. Patients were positioned on a vendor-provided flat table using a knee support cushion (lower extremity positioning system, without adjustable FeetSupport, MacroMedics BV, the Netherlands). Patients were tattooed at the MRI with the aid of a laser system (Dorado3, LAP GmbH Laser Applikationen, Germany) to facilitate treatment positioning. Also, MR-visible markers (PinPoint for Image Registration 128, Beekley Medical, USA) were used to identify the set-up location on MRI. MR images were acquired using anterior and posterior phased array coils (dS Torso and Posterior coils, 28 channels, Philips Healthcare, the Netherlands). Two in-house-built bridges supported the anterior coil to avoid skin contour deformation.
OARs were contoured on Dixon images [33] obtained with a dual-echo three-dimensional (3D) Cartesian radiofrequency SPGR sequence. For each patient, in-phase (IP), water (W), and fat (F) images [34] (Fig. 1) were reconstructed as in [35]. Dixon images were generated as part of a proprietary solution (MRCAT, rev. 257, Philips Healthcare, Finland) that enabled MR-based dose calculation for patients with prostate cancer [36,37]. The imaging parameters, reported in Table 1, were locked by the vendor; therefore, they were stable through the whole study. Radiotherapy technicians (RTTs) with dedicated experience in contouring delineated bladder, rectum and femurs using IP, W and F Dixon images. The OARs delineations were approved or revised by a radiation oncologist. Besides, the radiation oncologist delineated the target structures. The delineation indications followed RTOG guidelines [38] requiring that the rectum was delineated from the outer part of the sphincter (anus) until the sigmoid fold (expected length of the rectum was 10-15 cm), as described in [39], with the sphincter delineated as a separate structure. The bladder was entirely delineated, while the femurs were delineated in the whole FOV of the image. In the case of regional radiotherapy, the bowel bag was also included.

Table 1 Imaging parameters (only the following entries are recoverable): geometry correction, 3D; acquisition time, 2 min 17 s. Table footnote (*): expressed in terms of anterior-posterior (AP), right-left (RL) and superior-inferior directions.

Study design
The first 48 patients (treated until January 2019) were included in a feasibility study training two state-of-the-art 3D convolutional networks called DeepMedic [40] and dense V-net (dV-net) [41] ("Networks architecture, image processing and training" section). Three-fold cross-validation was performed, splitting the patients 32/16 for training/validation. The network hyperparameters were optimised on the first fold and maintained for the other two folds. For example, the number of epochs was chosen by monitoring the loss function on the validation set and performing early stopping when the loss had not decreased for five consecutive epochs.
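The cross-validation split and the stopping rule described above can be sketched as follows. This is a minimal illustration and not the training code used in the study: the contiguous assignment of patients to folds, the helper names `three_fold_splits` and `early_stopping_epoch`, and the patient identifiers are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def three_fold_splits(patient_ids: Sequence[str]) -> List[Tuple[List[str], List[str]]]:
    """Return three (train, validation) splits; each third validates once.

    Assumption: folds are contiguous thirds of the patient list.
    """
    ids = list(patient_ids)
    fold_size = len(ids) // 3
    splits = []
    for k in range(3):
        val = ids[k * fold_size:(k + 1) * fold_size]
        train = ids[:k * fold_size] + ids[(k + 1) * fold_size:]
        splits.append((train, val))
    return splits


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 5) -> int:
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has not improved on its best
    value for `patience` consecutive epochs; otherwise all epochs are used.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical identifiers for the 48 patients of the feasibility study.
patients = [f"patient_{i:03d}" for i in range(48)]
folds = three_fold_splits(patients)
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.6])
```

With 48 patients, each fold holds 32 training and 16 validation cases, matching the split reported in the text.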
The performance of the networks was compared against a research version of commercial software based on multiatlases and deformable registration and against the clinically used delineations ("Evaluation" section).
This preliminary study enabled us to choose among the three automatic methods. The preferred approach was retrained on 97 patients that were imaged and treated until August 2019; it was implemented for automatic use in the clinical workflow. The performances of the implemented model were reported on the 53 successive consecutively treated patients. A schematic overview of the study design is presented in Fig. 2.

Networks architecture, image processing and training
Three-dimensional network architectures were chosen to investigate performance differences when considering as receptive field the whole volume or smaller patches. In particular, DeepMedic [40] was the network chosen to perform patch-based training, while dV-net [41] was chosen to perform training on whole volumes. The two architectures, which are described in detail in the next sections, required similar pre-processing. Three channels were used as input: IP, W and F images. The target OARs were the bladder, rectum, and right and left femurs; they were encoded as non-overlapping masks with values from 1 to 4. To increase the amount of contextual information, the CTV was also encoded with a value of 5, which means that the networks also output the CTV. Note that the CTV was not considered in our study given that CTV delineation is clinically based on a different MRI, i.e. T2-weighted turbo spin-echo sequences [42]. The networks were trained on a Tesla P100 graphical processing unit (GPU; NVIDIA Corporation, USA) with 16 GB of memory. To allow the whole volume to fit on the GPU, the IP, W and F images were initially cropped by 90 voxels at the borders of the anterior-posterior and lateral directions, obtaining matrices of 348x348x120 voxels. Note that an observer verified the presence of the femurs within the FOV. Also, the image intensities of IP, W and F were clipped at their respective 99.9th percentile for each patient volume. Images were subsequently divided by the standard deviation (σ), and then a fixed value of 1 was subtracted.
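The intensity normalisation described above (clipping at the 99.9th percentile, division by σ, subtraction of 1) may be sketched per channel as below. The function names are hypothetical, the random volumes only stand in for IP, W and F images, and computing σ on the clipped volume is an assumption, since the text does not state whether σ was taken before or after clipping.

```python
import numpy as np


def normalise_channel(volume: np.ndarray, percentile: float = 99.9) -> np.ndarray:
    """Clip at the per-volume percentile, divide by the std, then subtract 1."""
    volume = volume.astype(np.float32)
    upper = np.percentile(volume, percentile)
    clipped = np.minimum(volume, upper)
    sigma = clipped.std()  # assumption: sigma is computed after clipping
    if sigma == 0:
        raise ValueError("a constant volume cannot be normalised")
    return clipped / sigma - 1.0


def stack_dixon_channels(ip: np.ndarray, water: np.ndarray, fat: np.ndarray) -> np.ndarray:
    """Normalise each Dixon image separately and stack them as input channels."""
    return np.stack([normalise_channel(v) for v in (ip, water, fat)], axis=0)


# Small random volumes used only to exercise the functions.
rng = np.random.default_rng(0)
ip, water, fat = (rng.random((4, 8, 8)) * 1000.0 for _ in range(3))
channels = stack_dixon_channels(ip, water, fat)
```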
After training and inference of the networks, the delineations were post-processed, generating four binary volumes. Morphological closing and hole filling by one voxel were applied. The largest 3D connected region was selected for each delineated structure. These operations were performed to remove possible small-sized delineations that may have been found by the networks.
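A possible implementation of this post-processing with `scipy.ndimage` is sketched below. It is an illustration under assumptions rather than the code used in the study: the default 6-connected structuring element, the order of the operations and the function name `postprocess_structure` are choices made for the example.

```python
import numpy as np
from scipy import ndimage


def postprocess_structure(label_map: np.ndarray, label: int) -> np.ndarray:
    """Return a cleaned binary mask for one structure of a multi-label volume."""
    mask = label_map == label
    # Morphological closing by one voxel (default 6-connected element).
    mask = ndimage.binary_closing(mask, iterations=1)
    # Fill cavities fully enclosed by the structure.
    mask = ndimage.binary_fill_holes(mask)
    # Keep only the largest 3D connected region.
    components, n_components = ndimage.label(mask)
    if n_components == 0:
        return mask
    sizes = np.bincount(components.ravel())
    sizes[0] = 0  # ignore the background
    return components == sizes.argmax()


# Toy label map: a cube with a one-voxel hole plus an isolated voxel.
label_map = np.zeros((14, 14, 14), dtype=np.uint8)
label_map[2:8, 2:8, 2:8] = 1
label_map[4, 4, 4] = 0
label_map[10, 10, 10] = 1
cleaned = postprocess_structure(label_map, label=1)
```

In the toy volume the hole inside the cube is closed and the isolated voxel is discarded, which mirrors the stated purpose of removing small spurious delineations.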

DeepMedic
The DeepMedic [40] implementation employed was provided by Kamnitsas et al. in TensorFlow v1.7. The model employed a three-pathway architecture for multiresolution processing of 3D patches. Low-, medium- and high-resolution pathways with receptive fields of 85³, 51³ and 17³ voxels were employed, with each pathway consisting of 11 layers. A fully connected network (FCN) was used for combining the pathways and post-processing, as presented by Kamnitsas et al. [40]. Note that the size of the receptive fields was modified compared to the original implementation. The training configuration was kept as in the original, with learning rate = 0.001, Adam optimiser with momentum = 0.6, epochs = 35, batch size = 10 and L1 and L2 regularisations weighted with factors 0.000001 and 0.0001, respectively. The configuration file is reported in the Supplementary Material. All the OARs were equally sampled during training, enforcing that the patches considered in each epoch contained the four OARs the same number of times. Also, as in Kamnitsas et al. [40], the volumetric dice similarity coefficient was adopted as the loss function. Data augmentation was applied in terms of random shifts and rescaling perturbation of the intensity (I) by the following: I = (I + s) * m, where s and m were Gaussian distributed with μ = 0 and 1, and σ = 0.05 and 0.01, respectively. For training, DeepMedic made use of about 9 GB of GPU memory.
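The intensity perturbation I = (I + s) * m can be illustrated with a few lines of NumPy. This is a sketch and not the DeepMedic implementation: drawing a single pair (s, m) per patch and the function name `augment_intensity` are assumptions.

```python
import numpy as np


def augment_intensity(patch: np.ndarray, rng: np.random.Generator,
                      shift_sigma: float = 0.05, scale_sigma: float = 0.01) -> np.ndarray:
    """Apply I = (I + s) * m with s ~ N(0, shift_sigma) and m ~ N(1, scale_sigma)."""
    s = rng.normal(loc=0.0, scale=shift_sigma)  # random intensity shift
    m = rng.normal(loc=1.0, scale=scale_sigma)  # random intensity rescaling
    return (patch + s) * m


# A tiny patch with known values, used only to exercise the function.
patch = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
augmented = augment_intensity(patch, np.random.default_rng(0))
```

Because the same shift and scale are applied to every voxel, the perturbation changes the global intensity level and contrast of a patch while leaving its anatomy untouched.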

Dense V-net
The dV-net implementation provided in NiftyNet was employed. It consisted of a 3D U-Net with a sequence of three downsampling and dense upsampling feature strided stacks with skip connections to propagate higher resolution information to the final segmentation. Dilated convolutions were employed to reduce the number of features [41].
The training configuration was kept as in the original, with learning rate = 0.001, Adam optimiser with momentum = 0.6, batch size = 6, L2 regularisation (weight = 0.001) and epochs = 25. The configuration file is reported in the Supplementary Material. Dice was adopted as the loss function, and data augmentation was applied in terms of elastic deformation, as implemented within NiftyNet. For training, dV-net made use of about 16 GB of GPU memory.

Preliminary study
The first 48 patients, treated between June 2018 and January 2019, were included in a preliminary study to compare the performance of the two networks and the atlas-based approach against the delineations used during clinical treatment planning.
The advanced medical imaging registration engine (ADMIRE, research version 1.13.5, Elekta AB, Sweden) was the software considered; ADMIRE is based on multiatlases [43,44] and gradient-free dense mutual information deformable registration [45]. In particular, the rectum was delineated based on the F image, bladder and femurs were delineated based on IP images using an atlas of 9 patients that were previously acquired with the same sequence. ADMIRE took about 10 to 15 minutes to generate automatic contouring on a Tesla K20c GPU (NVIDIA Corporation, USA) with 6 GB of memory.
Performances of the three automatic approaches were evaluated against clinical delineations in terms of (volumetric) dice similarity coefficient (DSC), 95% boundary Hausdorff distance (HD95) [46] and mean surface distance (MSD). All the metrics were calculated using Plastimatch, except for the surface distance, which was calculated as from https://github.com/deepmind/surface-distance. In particular, violin plots [47] representing the mean, σ, 95th percentile and the probability distribution were obtained for the three metrics. Also, Wilcoxon signed-rank tests were conducted for the three evaluation metrics at a significance level of 0.05.
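To make the three metrics concrete, a brute-force NumPy sketch is given below. It is not the Plastimatch or surface-distance implementation: defining HD95 as the larger of the two directed 95th-percentile surface distances and MSD as the average of the two directed mean distances are assumptions (conventions differ between tools), and the pairwise-distance approach is only practical for small volumes.

```python
import numpy as np


def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Volumetric Dice similarity coefficient of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return float(2.0 * np.logical_and(a, b).sum() / denom) if denom else 1.0


def surface_points(mask: np.ndarray, spacing) -> np.ndarray:
    """Coordinates (in mm) of foreground voxels with a background 6-neighbour."""
    mask = mask.astype(bool)
    padded = np.pad(mask, 1, constant_values=False)
    core = (slice(1, -1),) * mask.ndim
    interior = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[core]
    return np.argwhere(mask & ~interior) * np.asarray(spacing, dtype=float)


def directed_distances(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Distance from every point in src to its nearest point in dst."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def hd95_and_msd(a: np.ndarray, b: np.ndarray, spacing=(1.0, 1.0, 1.0)):
    """Return (HD95, MSD) in mm for two non-empty binary masks."""
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    d_ab, d_ba = directed_distances(pa, pb), directed_distances(pb, pa)
    hd95 = max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
    msd = 0.5 * (d_ab.mean() + d_ba.mean())
    return float(hd95), float(msd)


# Two 4x4x4 cubes shifted by one voxel along the first axis.
a = np.zeros((10, 10, 10), dtype=bool)
b = np.zeros((10, 10, 10), dtype=bool)
a[2:6, 2:6, 2:6] = True
b[3:7, 2:6, 2:6] = True
dsc = dice(a, b)
hd95, msd = hd95_and_msd(a, b)
```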
For a subset of 8 patients, an RTT with five years of experience in contouring scored the quality of the delineations for all three methods. The delineations were scored from zero to three, corresponding to clinically acceptable, small modifications, large modifications, or clinically unacceptable contours, respectively. In total, the RTT scored 96 delineations. The percentage of each score over all the contours was reported for the three methods and visualised in a pie chart. Also, the most challenging structures (structures with an average score ≥2) were reported for each method.

Clinical implementation
After a choice was made among the three automatic approaches, the best performing network was retrained on the first 97 patients that were included up to August 2019. The hyperparameters were identical to those of the preliminary study. The network was implemented for clinical use complying with the medical device regulation (MDR 2017/745; see footnote 5). Quantitative evaluation was performed in terms of DSC, HD95 and MSD for the 53 consecutive patients undergoing MR-only radiotherapy from August 2019 to January 2020. The delineations adopted for clinical use, i.e. delineated by RTTs and approved or readjusted by a radiation oncologist, were considered as reference. Also, the surface dice similarity coefficient (SDSC) [48] was calculated (see footnote 6) to enable comparison with previous work [49]. Besides, the performance of the clinically implemented network was compared with the performance of the same network obtained during the preliminary study.
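The SDSC at a tolerance τ measures the fraction of the two contour surfaces lying within τ of each other. The sketch below approximates it by counting surface voxels rather than weighting by surface area, so it is a simplification of the surface-distance package cited in the text and not a substitute for it; the function names and the toy masks are assumptions.

```python
import numpy as np


def surface_points(mask: np.ndarray, spacing) -> np.ndarray:
    """Coordinates (in mm) of foreground voxels with a background 6-neighbour."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    core = (slice(1, -1),) * mask.ndim
    interior = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[core]
    return np.argwhere(mask & ~interior) * np.asarray(spacing, dtype=float)


def surface_dice(a: np.ndarray, b: np.ndarray, tolerance_mm: float,
                 spacing=(1.0, 1.0, 1.0)) -> float:
    """Voxel-count approximation of the surface Dice at a given tolerance."""
    pa, pb = surface_points(a, spacing), surface_points(b, spacing)
    if len(pa) == 0 or len(pb) == 0:
        raise ValueError("both masks need a non-empty surface")
    # Pairwise distances between the two surfaces (small volumes only).
    dist = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(axis=-1))
    a_close = (dist.min(axis=1) <= tolerance_mm).sum()
    b_close = (dist.min(axis=0) <= tolerance_mm).sum()
    return float(a_close + b_close) / (len(pa) + len(pb))


# Two 4x4x4 cubes shifted by one voxel along the first axis.
ref = np.zeros((10, 10, 10), dtype=bool)
auto = np.zeros((10, 10, 10), dtype=bool)
ref[2:6, 2:6, 2:6] = True
auto[3:7, 2:6, 2:6] = True
sdsc_tight = surface_dice(ref, auto, tolerance_mm=0.5)
sdsc_loose = surface_dice(ref, auto, tolerance_mm=1.0)
```

Raising τ from below to above the one-voxel shift makes the score jump to 1, which illustrates why the SDSC is reported as a function of τ.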

Timing performance
The inference time of the network was about 60 s for DeepMedic and approximately 4 s for dV-net using the full resolution images of 328x328x120 voxels on GPU. ADMIRE generated contours in approximately 14 min on GPU.
5 European regulation 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, https://eur-lex.europa.eu/eli/reg/2017/745/oj.
6 As available in https://github.com/deepmind/surface-distance.

Preliminary study
Figure 3 represents the violin plots for DSC, HD95 and the MSD. One can observe that performances were higher for both the networks compared to ADMIRE. For the bladder, no significant differences were observed between the networks, but significant differences were observed between the networks and ADMIRE. For the rectum, no significant differences were observed among the three automatic methods. When considering the femurs, DeepMedic outperformed both dV-net and ADMIRE. For example, for the right femur, the mean (±σ) HD95 was 2.2±1.4, 2.5±1.8 and 3.2±1.4 mm for DeepMedic, dV-net and ADMIRE, respectively.
The qualitative scoring by an RTT expert (Fig. 4) demonstrated that delineations from DeepMedic required fewer adaptations, followed by dV-net and then ADMIRE. Specifically, the expert RTT stated that, for all the structures, the number of delineations that were acceptable or needed small adjustment was 81%, 59% and 3% for DeepMedic, dV-net and ADMIRE, respectively. For both the networks, the rectum followed by bladder were indicated as the most challenging structures, while for ADMIRE, the bladder followed by rectum and femurs (same scoring) were the structures considered as the most challenging (score ≥ 2).

Clinical implementation
On the basis of the preliminary analysis, we decided to implement DeepMedic for our clinic. Clinical implementation was performed in August 2019.
The performances of DeepMedic in the preliminary study and after clinical implementation are presented in Table 2. After retraining DeepMedic and testing on the successive patients, the performances slightly improved. For example, on average, DSC, HD95 and MSD improved by 0.01-0.03, 1.2-1.4 mm and 0.1-0.4 mm, respectively, after retraining the network on a more extensive set. Delineations obtained with DeepMedic for a patient in the test set are presented in Fig. 5.

Discussion
The use of MRI for prostate radiotherapy delineation is becoming increasingly common among radiotherapy departments [50]. MR images are used to plan radiotherapy [32,51]. Besides, the use of MRI is also accelerated by the adoption of new advancements in linear accelerator technology, whereby daily MR imaging in treatment position is possible [9][10][11].
Fig. 6 Boxplots for each structure of surface Dice similarity coefficient (SDSC) as a function of threshold (τ) for the 53 patients after clinical implementation. The data are plotted for τ ranging from sub-pixel (0.5 mm) to above the voxel size (3 mm). Boxes span the inter-quartile range from 25 to 75%, with the horizontal line representing the mean value. Lower and upper whiskers represent the 2.5 and 97.5 percentiles.
In this study, we demonstrated that deep learning-based approaches can utilise MRI to automatically segment OARs achieving high conformality. Also, a convolutional network has been implemented for clinical use, demonstrating the capability to maintain the performances obtained in a preliminary study. Table 3 compiles previous work based on the use of convolutional networks and a selection of conventional approaches [16,17,52] for OARs delineation in the pelvic area. One can notice that CT-based segmentation [53][54][55] achieved mean DSC in the range 0.88-0.95 for prostate, rectum and bladder. Also, MRI-based segmentation [27,49,56] achieved mean DSC in the range 0.82-0.95. This study seems to outperform previous studies in almost all the metrics (in bold in the Table) except for the rectum, as obtained by Kazemifar et al. [54] and the HD95 and MSD as obtained by Kazemifar et al. and Dong et al. [56]. Comparing the results of automated contouring methods should be done with caution. For example, the guidelines used for clinical delineation may be different, and the impact of inter-observer variability on deep learning-based methods is not generally investigated [57]. In this sense, our study is novel given that a comparison of approaches based on CNNs to an atlas-based method is presented.
In this study, a qualitative assessment by a manual observer has been presented. Unfortunately, it has not been recorded whether the overall time for the delineation has been reduced. Previous studies investigated this aspect [58] when introducing deep learning-based techniques in their clinic. Also, it is unclear whether the performance of the network may further improve when a dataset larger than 97 patients is used for training. This may be an object of future research.
The time necessary for automatic delineation on the full FOV is within a minute. Such a time-scale can be of interest for conventional radiotherapy and for MR-guided treatments. On the one hand, for conventional radiotherapy, fast automatic OAR segmentation may help reduce delays in the start of treatment, which may otherwise hamper clinical outcomes [59]. On the other hand, for online adaptive MR-guided radiotherapy, fast OAR segmentation may relieve clinicians from dedicating effort to OARs segmentation while facilitating the delineation of the target [60]. Currently, it has been reported that about 5-10 min is necessary for delineation in an online setting [19]. The time frame reported in our work may facilitate online adaptive radiotherapy, especially with an integrated automatic workflow.
Table 3 Overview of the performance of automatic OARs delineations based on MRI and CT subdivided in convolutional network-based and conventional approaches. The number of patients included in the study (Pts), the imaging modality, a brief description of the method and metrics as dice similarity coefficient (DSC), 95% boundary Hausdorff distance (HD95) and mean surface distance (MSD) were reported for each study. HD95 [48] with τ = 3 or 2 mm, respectively

Conclusion
High conformality for OARs delineation was achieved with two in-house trained networks, obtaining a significant speed-up of the delineation procedure. One of the networks, DeepMedic, was successfully adopted in the clinical workflow maintaining in the clinical setting the accuracy obtained in the feasibility study conducted before clinical implementation.

Acknowledgements
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 GPU used for prototyping this research. Also, we are grateful that Elekta AB provided the research version of the ADMIRE software.

Authors' contributions
MS designed the study, collected the data, trained the network, supervised the study, supported clinical implementation and revised the manuscript. MM designed and supervised the study, analysed the data, supported clinical implementation and drafted/edited the manuscript. GGS participated in designing the study, collected the observer validation and revised the manuscript. JRNvdVvZ and ANTJK contributed to developing the study and revised the manuscript. GHB contributed to developing the study, performed its clinical implementation and revised the manuscript. CATvdB participated in designing the study and revising the manuscript. All authors read and approved the final manuscript.

Funding
This project was made possible with the support of Elekta AB, Stockholm, Sweden, who provided the research version of ADMIRE for this research.

Availability of data and materials
The datasets analysed during the current study are not publicly available due to the internal policy of the Medical Ethical Commission about data sharing. The configuration files of DeepMedic and dV-net are reported as Supplementary Materials.

Ethics approval and consent to participate
The study received the approval of the medical ethical commission (Medisch Ethische Toetsingscommissie) and was classified under the protocol number 15-444/C approved on 29th July 2015.

Consent for publication
Not applicable.