Impact of random outliers in auto-segmented targets on radiotherapy treatment plans for glioblastoma

Poel, Robert; Rüfenacht, Elias; Ermis, Ekin; Müller, Michael; Fix, Michael K.; Aebersold, Daniel M.; Manser, Peter; Reyes, Mauricio

doi:10.1186/s13014-022-02137-9

Published: 22 October 2022

Radiation Oncology volume 17, Article number: 170 (2022)

Abstract

Aims

To save time and obtain more consistent contours, fully automatic segmentation of targets and organs at risk (OAR) is a valuable asset in radiotherapy. Though current deep learning (DL) based models are on par with manual contouring, they are not perfect, and typical errors, such as false positives, occur frequently and unpredictably. While it is possible to solve this for OARs, it is far from straightforward for target structures. To tackle this problem, we analyzed the occurrence and the possible dose effects of automated delineation outliers in this study.

Methods

First, a set of controlled experiments on synthetically generated outliers on the CT of a glioblastoma (GBM) patient was performed. We analyzed the dosimetric impact of outliers with different location, shape, absolute size and size relative to the main target, resulting in 61 simulated scenarios. Second, multiple segmentation models were trained on a U-Net network based on 80 training sets consisting of GBM cases with annotated gross tumor volume (GTV) and edema structures. On 20 test cases, 5 different trained models and a majority voting method were used to predict the GTV and edema. The number of outliers in the predictions was determined, as well as their size and distance from the actual target.

Results

We found that plans containing outliers result in an increased dose to healthy brain tissue. The extent of the dose effect depends on the relative size, the location and the distance to the main targets and the involved OARs. Generally, the larger the absolute outlier volume and the distance to the target, the higher the potential dose effect. For 120 predicted GTV and edema structures, we found 1887 outliers. After construction of the planning target volume (PTV), 137 outliers remained with a mean distance to the target of 38.5 ± 5.0 mm and a mean size of 1010.8 ± 95.6 mm³. We also found that majority voting of DL results is capable of reducing outliers.

Conclusions

This study shows that there is a severe risk of false positive outliers in current DL predictions of target structures. Additionally, these errors will have an evident detrimental impact on the dose and therefore could affect treatment outcome.

Introduction

In terms of automation in healthcare, auto-segmentation is an important technique that can be useful in radiology, surgery, study purposes and in particular radiation therapy (RT). In RT, contouring of target volumes and organs at risk (OARs) is daily practice. Much of the work is performed manually, but to a certain extent segmentation software is also used to support the task by suggesting the contours of larger structures. Auto-segmentation and contouring support (e.g. semi-automatic segmentation) have been around for decades. However, the implementation of these techniques is not widespread. Often the auto-segmentation lacks the desired accuracy [1,2,3,4], which results in copious manual adjustments and a loss of confidence in such techniques.

The main argument for fully automatic segmentation is that the current practice of manual contouring is very time-consuming for radiation oncology professionals [3, 5,6,7,8]. Another advantage is that auto-segmentation contours, compared to manual contouring, will be more consistent and it is hypothesized that this can improve the overall quality of RT planning [3, 9,10,11].

For the RT treatment of glioblastoma (GBM), many critical structures, also called organs at risk (OAR), need to be spared from radiation [12]. Most of these structures are small and can only be distinguished on high quality magnetic resonance imaging (MRI) [13, 14]. Contouring in the brain is therefore a difficult and time-consuming process. Additionally, since most currently available auto-segmentation methods are based on CT imaging, they are incapable of distinguishing the different neural structures.

While there is often a clear definition of how to segment an OAR, there is much more debate on how the gross tumor volume (GTV) and clinical target volume (CTV) should be defined [15]. The main reason for this is the large variation in shape, size and location of a tumor in relation to the standard human anatomy. Additionally, the target often includes areas that are clinically suspected of being compromised by the tumor but are not morphologically visible on imaging. Consistent target definition is furthermore hampered by the quality of the imaging and by distortion of the anatomy caused by surgical resection, which often takes place in addition to RT [16].

The latest generation of auto-segmentation methods is based on deep learning [17]. The state-of-the-art methods yield contour results for OARs and targets that are on par with manual contouring [18]. This means that the contours reside within the range of contour variation based on multiple raters [19, 20]. Still, the results are not perfect in terms of geometric similarity to the “ground truth” and there is no consensus in the judgement of contours among radiation oncology experts [21,22,23,24,25,26,27]. Nor are there clear guidelines on the commissioning of auto-segmentation methods by medical physicists [28]. While most current errors in RT processes are human-made [29], the requirements for approval of software innovations in RT are high and not well suited to recent deep learning based methods [30]. In general, the community’s acceptance of artificial intelligence (AI) applications is poor [28, 31]. In healthcare, a machine is only accepted when it performs consistently better than a human [32].

A typical mistake of deep learning-based auto-segmentation is the random outlier, which can be defined as a small segmented island away from the region of the actual targeted structure. This type of error is best illustrated by the large number of outliers found in the summary of the Hausdorff distance results from the BraTS challenge [33] (e.g. Fig. 13 in the referred publication). Such errors are relatively easy to solve for auto-segmentation of OARs, since shape, location and size priors of these structures can be modeled and incorporated in post-processing routines.

Dealing with random outliers is more problematic for target definition. Since targets in the brain can appear in different locations, be of different sizes and shapes, infiltrate multiple tissues and even have satellite locations, their segmentation is more prone to inaccuracies than that of OARs [34]. In addition, it is not easy to detect random outliers. Common scriptable rules to remove outliers from OARs are typically not valid for tumors. In a metastasized situation, it is even more difficult to determine whether one is dealing with a random outlier (false positive) or with growing malignant tissue (true positive). Due to the described difficulties, the robustness of deep learning-based target definition lags behind that of OAR segmentation methods. This is reflected by the fact that there are not many commercial products that offer deep learning-based tumor segmentation.

There are two approaches to improve the implementation of auto-segmentation: (1) improving the accuracy and robustness of deep learning methods, and (2) introducing post-processing techniques and/or QA measures that enable accurate and efficient use of automatic tumor segmentation. In both cases, a first step is to characterize the specific errors that might occur. The question we would like to answer in this study is how much an outlier, when undetected, will affect the dosimetry. Furthermore, we want to characterize the influence of size, shape and location of the outliers on the dose effect. Additionally, we want to identify the occurrences of outliers for a state-of-the-art deep learning approach as well as their size and distance.

Materials and methods

In this study, the main goal is to determine the impact of random outliers in target definition on the dose distribution of GBM RT plans. The study consists of two parts: (1) Controlled experiments on synthetically generated outliers. (2) Occurrence and dose effect of actual deep learning outliers resulting from state of the art deep learning methods.

Controlled experiments with synthetically generated outliers

In this first part, we designed a set of controlled experiments to characterize how size, location and shape of outliers affect treatment plan quality.

From a local database containing the planning CT (3 mm slices) and MR images of de-identified GBM cases, a representative case was selected that does not have any intracranial deformation or extensive imaging artefacts. The images of the selected case were imported into the research environment of the treatment planning system (TPS) Eclipse version 15.5 (Varian Medical Systems, Inc.). For this case, a reference planning target volume (PTV) was generated according to the RTOG guidelines [35]. Additionally, 17 OAR volumes were defined according to Scoccianti et al. [13].

In four different experiments, the targets have been manually adjusted by adding an outlier target that was neither connected nor in the direct vicinity of the reference target but within realistic size and location boundaries. The four experiments represent: (I) different locations of the outlier, (II) different shapes of the outlier, (III) different sizes of the outlier and (IV) relative size to the main target, by changing the size of the target. In a series of planning studies, we determined the effect of random outliers on the dosimetry considering a volumetric modulated arc technique (VMAT) treatment approach (Fig. 1).

Experiment 1 – outlier location

We generated 20 small spherical outliers (0.13 cm³) that were added to the PTV, at different locations along the three main axes. The outliers have different distances and locations with respect to the PTV and the different OARs (Fig. 2). The goal of this experiment was to determine whether the location of an outlier, and its distance relative to the reference PTV, has a specific influence on the dosimetry.

Experiment 2 – shape and orientation

At a given contralateral location in the brain within the range of medium expected dose effects according to experiment 1, 4 different outliers were drawn manually over multiple slices to have different shapes and orientations while maintaining the same volume and center of mass. The 4 outliers were placed at location A, which is in the same axial plane as the reference PTV, and were additionally reproduced at location B, which is located above the axial plane of the reference PTV (Fig. 3).

Experiment 3 – outlier size

At two locations from experiment 1, here referred to as location C and location D, 12 different sizes of outliers were generated. The smallest outlier has a voxel volume of 4.2 mm³, with sizes increasing incrementally to 186.5 mm³. The first outliers, numbered 1 through 5, only cover a single CT slice, while the latter outliers, numbered 6 through 12, cover multiple CT slices. With this experiment we aimed to analyze the effect of outlier size on dosimetry.

The volumes of the different outliers were determined both as voxel volume (i.e., the number of voxels multiplied by the voxel size) and as mesh volume (i.e., computed geometrically from mesh points), the latter being the volume used in the TPS (Fig. 4).
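The voxel volume described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the TPS: the function name `voxel_volume_mm3` and the 1 × 1 mm in-plane resolution are assumptions, and only the 3 mm slice thickness comes from the text.

```python
import numpy as np

def voxel_volume_mm3(mask, spacing_mm):
    """Volume of a binary mask: number of voxels times the volume of one voxel."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return int(np.count_nonzero(mask)) * voxel_mm3

# Hypothetical grid (slice, row, column): 3 mm slices, assumed 1 x 1 mm in-plane.
mask = np.zeros((4, 10, 10), dtype=bool)
mask[1, 4:6, 4:6] = True                      # 4 voxels on a single CT slice
volume = voxel_volume_mm3(mask, spacing_mm=(3.0, 1.0, 1.0))
```

Because a voxel count can only change in steps of one voxel, very small structures that cover a single slice are where voxel and mesh volumes would be expected to differ most.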

Experiment 4 – outliers relative size to PTV

In experiment 3, the influence of the absolute size of the outlier was investigated. It is expected that the TPS optimizer is also influenced by the size of the outlier relative to the reference PTV. To determine this, we selected the two smallest outliers from location D, because this location is in proximity to the target and surrounded by multiple OARs. Additionally, we increased and decreased the reference PTV in increments of a 1 mm isotropic margin. This resulted in 9 differently sized reference PTVs, of which the original is depicted in green in Fig. 5.

For the analysis, we looked specifically at the dose received by the outlier volumes.

Planning

A reference plan was made based on the reference PTV and according to the institutional prescription protocol. A double arc coplanar VMAT plan with 6 MV flattening filter free beams was optimized (Varian photon optimizer version 15.6.05) to deliver 30 fractions of 2 Gy while maximally sparing the OARs. The dose, calculated with the AAA algorithm, was normalized so that 100% of the prescribed dose covers 50% of the PTV. For the experimental plans, which include outliers as part of the PTV, the corresponding reference plan was duplicated and only the PTV structure was substituted to account for the added outliers and the adjusted size. All planning setups and optimization criteria remained the same; the plan was re-optimized once with the reference plan's settings and the dose was recalculated.
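The normalization step can be illustrated with a short sketch. It assumes that "100% of the prescribed dose covers 50% of the PTV" means the median PTV dose (D50%) equals the prescription, and that the dose and PTV are available as arrays on the same grid. The function name and the random dose values are hypothetical; the TPS performs this step internally.

```python
import numpy as np

def normalize_to_ptv_median(dose_gy, ptv_mask, prescription_gy):
    """Rescale the dose so that the median PTV dose equals the prescription."""
    d50 = np.median(dose_gy[ptv_mask])        # dose received by at least 50% of the PTV
    return dose_gy * (prescription_gy / d50)

# Hypothetical dose grid and PTV; 60 Gy corresponds to 30 x 2 Gy.
rng = np.random.default_rng(0)
dose = rng.uniform(50.0, 62.0, size=(5, 5, 5))
ptv = np.zeros(dose.shape, dtype=bool)
ptv[1:4, 1:4, 1:4] = True
normalized = normalize_to_ptv_median(dose, ptv, prescription_gy=60.0)
covered_fraction = float(np.mean(normalized[ptv] >= 60.0 - 1e-9))
```

Since the rescaling is global, any change in the PTV (for example an added outlier) shifts the whole dose distribution through this factor as well as through the optimization itself.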

An additional plan was made without any OARs, to obtain better insight into the dosimetric effects of the outliers without disturbance from dose constraints due to nearby OARs. The objectives for the PTV were based on the prescription protocol. The only additional constraint was the normal tissue objective (NTO) of the planning system. This plan is called the PTV-only reference plan. Here too, the reference plan was duplicated and only the PTV structure was substituted to account for the added outliers and the adjusted size. All planning setups and optimization criteria remained the same while the plan was re-optimized and the dose recalculated.

Analysis

The dose distributions of the experimental plans were compared with those of the reference plan. To determine the differences in dose distributions, dose volume histograms (DVH) of the different structures of the experimental plans are plotted together with the corresponding DVH of the reference plan. This was performed for the PTV and all defined OARs. Additionally, the brain minus the PTV was defined to serve as a measure of the amount of dose to healthy brain tissue.

Besides the DVH curves, the following specific dose parameters were determined. For the PTV: the mean dose, the minimum dose, the 95% target coverage and the 98% target coverage. For the OARs: the mean dose, the maximum point dose, the maximum dose to 1% of the structure's volume, and the maximum dose to one cubic centimeter of the structure.
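As a rough illustration of how such parameters can be derived from a dose grid and a structure mask, consider the sketch below. It interprets the coverage values as D95% and D98% (the dose received by at least 95% and 98% of the volume), which is an assumption. All function names are hypothetical, and the TPS may interpolate differently.

```python
import numpy as np

def dose_at_volume_percent(dose_gy, mask, percent):
    """D_x%: lowest dose within the hottest x percent of the structure."""
    return float(np.percentile(dose_gy[mask], 100.0 - percent))

def dose_at_volume_cc(dose_gy, mask, volume_cc, voxel_cc):
    """D_xcc: lowest dose within the hottest x cm^3 of the structure."""
    doses = np.sort(dose_gy[mask])[::-1]      # hottest voxel first
    n = int(round(volume_cc / voxel_cc))
    n = min(max(n, 1), doses.size)            # at least one voxel, at most all of them
    return float(doses[n - 1])

def structure_metrics(dose_gy, mask, voxel_cc):
    d = dose_gy[mask]
    return {
        "mean": float(d.mean()),
        "min": float(d.min()),
        "max": float(d.max()),
        "D95%": dose_at_volume_percent(dose_gy, mask, 95.0),
        "D98%": dose_at_volume_percent(dose_gy, mask, 98.0),
        "D1%": dose_at_volume_percent(dose_gy, mask, 1.0),
        "D1cc": dose_at_volume_cc(dose_gy, mask, 1.0, voxel_cc),
    }

# Hypothetical structure: 100 voxels of 0.5 cm^3 with doses 1, 2, ..., 100 Gy.
dose = np.arange(1.0, 101.0).reshape(4, 5, 5)
mask = np.ones(dose.shape, dtype=bool)
metrics = structure_metrics(dose, mask, voxel_cc=0.5)
```

The same functions apply to the brain-minus-PTV structure by passing `brain_mask & ~ptv_mask` as the mask, where both masks are again hypothetical inputs.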

Furthermore, the dose distributions were compared to show specific details in the effects on the dose distribution under the different performed experiments.

Outlier target segmentations from deep learning data

Besides controlled experiments with manually constructed false positive outliers in the target volume, we constructed target data by means of a deep learning segmentation model. This data reflects outliers resulting from auto-segmentation predictions.

Deep learning data

For the deep learning experiments, 100 GBM cases were available from patients who received surgery and RT treatment at the Inselspital Bern, University Hospital, and who did not have any prior brain pathologies. For all cases, the GTV and the edema regions were annotated. Of the 100 cases, 80 randomly chosen cases were used for training and the remaining 20 cases were used as the test dataset. We performed a five-fold cross validation, resulting in five different models, and built one ensembling model [36] based on majority voting of these five models. We included this ensembling model to verify the advantages of ensembling, as reported in [37], and to determine whether it was able to improve GBM targets after construction of the PTV. Model training used the nnU-Net architecture of Isensee et al. [38]. A transfer learning approach was used, with pre-trained model weights taken from the HD-GLIO segmentation model trained on 3220 brain tumor MRI examinations [39]. Each model was then fine-tuned on the training dataset (i.e., 80 cases per fold). Technical details of the training procedure can be found in [40, 41].
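The idea behind majority voting can be sketched as follows. This is a simplified, per-structure binary vote. Whether the ensembling in [36] operates on binary masks, class labels or probabilities is not stated here, so the function `majority_vote` and the toy masks are assumptions for illustration only.

```python
import numpy as np

def majority_vote(masks):
    """Keep a voxel only if more than half of the models label it as foreground."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in masks])
    return stack.sum(axis=0) > stack.shape[0] / 2

# Hypothetical predictions: five models agree on the main target,
# two of them each add a different isolated island.
main = np.zeros((6, 6), dtype=bool)
main[1:4, 1:4] = True
predictions = [main.copy() for _ in range(5)]
predictions[0][5, 5] = True                   # island from model 1 only
predictions[1][5, 0] = True                   # island from model 2 only
ensemble = majority_vote(predictions)
```

In this sketch an island disappears only when fewer than three of the five models predict it; an island that most models agree on would survive the vote.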

Outlier analysis

To determine the number of outliers created by the deep learning models, we defined the main structure as the largest connected region of the calculated segmentation mask. Every other segmented region disconnected from the main structure was counted as an outlier. This assumption was valid as the dataset only includes single-lesion cases. For each case and trained model (i.e., five plus majority voting), the total number of outliers (per case and per model) as well as their size and closest distance to the main target was recorded. We analyzed the outliers’ size vs. their distance from the main structure, since these two parameters are expected to play a role in dosimetry metrics.
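A minimal sketch of this outlier definition, using connected-component labelling from SciPy, is given below. The connectivity (SciPy's default face connectivity), the distance convention (nearest voxel centers) and the function name `find_outliers` are assumptions; the text does not specify them.

```python
import numpy as np
from scipy import ndimage

def find_outliers(mask, spacing_mm):
    """Return volume and closest distance to the main structure for each outlier."""
    labels, n = ndimage.label(mask)           # default: face-connected components
    if n == 0:
        return []
    counts = np.bincount(labels.ravel())[1:]  # voxels per component, labels 1..n
    main = int(np.argmax(counts)) + 1         # largest component is the main structure
    # Distance of every voxel to the nearest voxel of the main structure, in mm.
    dist_to_main = ndimage.distance_transform_edt(labels != main, sampling=spacing_mm)
    voxel_mm3 = float(np.prod(spacing_mm))
    outliers = []
    for lab in range(1, n + 1):
        if lab == main:
            continue
        outliers.append({
            "volume_mm3": float(counts[lab - 1]) * voxel_mm3,
            "distance_mm": float(dist_to_main[labels == lab].min()),
        })
    return outliers

# Hypothetical prediction on a grid with 3 mm slices and assumed 1 x 1 mm pixels.
mask = np.zeros((5, 20, 20), dtype=bool)
mask[2, 2:8, 2:8] = True                      # main structure, 36 voxels
mask[2, 15, 15] = True                        # single-voxel island
outliers = find_outliers(mask, spacing_mm=(3.0, 1.0, 1.0))
```

The resulting list can be collected per case and per model to plot outlier size against distance.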

To assess the impact of deep learning-based outliers on dosimetry, for every automated segmentation, a CTV was created by combining the GTV and the edema structures and 3 mm margins were added to form the PTV according to the RTOG guidelines [35]. On the resulting PTV structures, we analyzed the distribution of outliers.
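The construction of the PTV from the predicted structures can be approximated on the voxel grid as shown below. This is a sketch under assumptions: the margin is applied as an isotropic Euclidean expansion via a distance transform, whereas the margin algorithm actually used is not described here. The function name and the toy geometry are hypothetical.

```python
import numpy as np
from scipy import ndimage

def build_ptv(gtv, edema, spacing_mm, margin_mm=3.0):
    """CTV = GTV united with edema; PTV = all voxels within margin_mm of the CTV."""
    ctv = np.logical_or(gtv, edema)
    if not ctv.any():
        return ctv.copy()
    # Distance in mm from every voxel outside the CTV to the nearest CTV voxel.
    dist = ndimage.distance_transform_edt(~ctv, sampling=spacing_mm)
    return dist <= margin_mm + 1e-6           # small tolerance for rounding

# Hypothetical structures on a grid with 3 mm slices and assumed 1 x 1 mm pixels.
gtv = np.zeros((5, 16, 16), dtype=bool)
edema = np.zeros((5, 16, 16), dtype=bool)
gtv[2, 5:8, 5:8] = True
edema[2, 4:9, 4:9] = True
ctv = np.logical_or(gtv, edema)
ptv = build_ptv(gtv, edema, spacing_mm=(3.0, 1.0, 1.0))
```

In this sketch, segmented islands that lie within the margin of the main structure become part of one connected PTV, so the number of disconnected regions can only stay the same or decrease after the expansion.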

Dose effect from deep learning segmentations

From the total of 120 constructed PTVs from deep learning models (6 models × 20 test cases), 5 cases with an outlier of significant size and distance from the PTV were randomly selected for dosimetric analysis. For these cases a reference PTV was available, which is a manually drawn target verified by a radiation oncology expert. Based on the reference PTV, a plan was constructed according to the current clinical department’s protocol, which included constraints for all OARs. To show the impact of the outlier on the predicted target in particular, we made a copy of the predicted target with and without the outlier. To ensure geometrical similarity with respect to the reference PTV, the copy of the predicted PTV was manipulated to obtain the same Dice Similarity Coefficient (DSC) with respect to the reference PTV as the original predicted PTV. On both the predicted PTV containing the outlier (referred to hereafter as “predicted outlier”) and the predicted PTV with the outlier removed (referred to hereafter as “removed outlier”), the reference plan was re-optimized and recalculated. These two plans were compared to the reference plan. An analysis of the DVHs of the different OARs was then performed. Since the dose effect is highly dependent on the location of both the target and the outlier, the dose to the healthy tissue, defined as the brain minus the reference PTV, was also analyzed. To perform a direct comparison of the total three-dimensional dose distribution, a gamma analysis of the predicted outlier plan and the removed outlier plan was performed with respect to the reference plan. For the gamma analysis we used a criterion of 3% of the prescribed dose and 3 mm, with a dose cutoff at 20 Gy to exclude the lower dose regions.
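For reference, the DSC between two binary masks can be computed as in the sketch below. The masks are hypothetical toy examples and the function name `dice` is an assumption. The example only shows how removing a small island shifts the DSC slightly, which is presumably why the copy had to be adjusted to match the original value.

```python
import numpy as np

def dice(a, b):
    """Dice Similarity Coefficient: 2 |A and B| / (|A| + |B|)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return 1.0                            # two empty masks are treated as identical
    return 2.0 * int(np.logical_and(a, b).sum()) / denom

# Hypothetical reference PTV and a prediction shifted by one row.
reference = np.zeros((20, 20), dtype=bool)
reference[5:15, 5:15] = True                  # 100 voxels
removed_outlier = np.zeros((20, 20), dtype=bool)
removed_outlier[6:16, 5:15] = True            # 100 voxels, 90 of them overlapping
predicted_outlier = removed_outlier.copy()
predicted_outlier[0, 0:2] = True              # small island far from the target

dsc_removed = dice(reference, removed_outlier)
dsc_outlier = dice(reference, predicted_outlier)
```

A small island changes the DSC only marginally in this toy case, which illustrates why an overlap metric alone gives little warning about outliers of this kind.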

Results