
Comparison of deep learning networks for fully automated head and neck tumor delineation on multi-centric PET/CT images



Deep learning-based auto-segmentation of head and neck cancer (HNC) tumors is expected to have better reproducibility than manual delineation. Positron emission tomography (PET) and computed tomography (CT) are commonly used in tumor segmentation. However, current methods still face challenges in handling whole-body scans where a manual selection of a bounding box may be required. Moreover, different institutions might still apply different guidelines for tumor delineation. This study aimed at exploring the auto-localization and segmentation of HNC tumors from entire PET/CT scans and investigating the transferability of trained baseline models to external real world cohorts.


We employed 2D Retina Unet to find HNC tumors from whole-body PET/CT and utilized a regular Unet to segment the union of the tumor and involved lymph nodes. In comparison, 2D/3D Retina Unets were also implemented to localize and segment the same target in an end-to-end manner. The segmentation performance was evaluated via Dice similarity coefficient (DSC) and Hausdorff distance 95th percentile (HD95). Delineated PET/CT scans from the HECKTOR challenge were used to train the baseline models by 5-fold cross-validation. Another 271 delineated PET/CTs from three different institutions (MAASTRO, CRO, BERLIN) were used for external testing. Finally, facility-specific transfer learning was applied to investigate the improvement of segmentation performance against baseline models.


Encouraging localization results were observed, achieving a maximum omnidirectional tumor center difference lower than 6.8 cm for external testing. The three baseline models yielded similar averaged cross-validation (CV) results with a DSC in a range of 0.71–0.75, while the averaged CV HD95 was 8.6, 10.7 and 9.8 mm for the regular Unet, 2D and 3D Retina Unets, respectively. More than a 10% drop in DSC and a 40% increase in HD95 were observed if the baseline models were tested on the three external cohorts directly. After the facility-specific training, an improvement in external testing was observed for all models. The regular Unet had the best DSC (0.70) for the MAASTRO cohort, and the best HD95 (7.8 and 7.9 mm) in the MAASTRO and CRO cohorts. The 2D Retina Unet had the best DSC (0.76 and 0.67) for the CRO and BERLIN cohorts, and the best HD95 (12.4 mm) for the BERLIN cohort.


The regular Unet outperformed the other two baseline models in CV and most external testing cohorts. Facility-specific transfer learning can potentially improve HNC segmentation performance for individual institutions, where the 2D Retina Unets could achieve comparable or even better results than the regular Unet.


Head and neck cancer (HNC), which is the sixth most frequently occurring cancer worldwide [1], is conventionally treated with radiotherapy, or radiotherapy-based combined modalities (chemotherapy and surgery) [2]. For radiotherapy, the delineation and segmentation of the gross tumor volume (GTV) from quantitative medical images, including the gross primary tumor volume (GTVp) and the associated lymph nodes (GTVn) [3, 4], is required, and any inaccuracy can cause undertreatment of tumors and unnecessary irradiation of normal tissues. Labor-intensive and time-consuming manual delineation of the GTV from medical images is still the most common practice in clinics. However, due to the complicated HNC anatomical environment and irregular tumoral morphologies, manual segmentation can be error-prone and may suffer from intra/inter-observer variabilities [5]. The accurate identification and segmentation of the GTV remain crucial and challenging for HNC treatment.

It is widely accepted that [18F]fluorodeoxyglucose (FDG) positron emission tomography (PET) and computed tomography (CT), providing both anatomical and metabolic information about the tumor, are two standard medical imaging modalities used for GTV segmentation during HNC diagnosis and radiotherapy treatment planning stages [6]. Compared with CT, PET can detect hypoxia levels [7] and reflect the physiological changes related to tumor cellular metabolism, thus serving as a relevant complementary source of information for tumor localization. However, PET can suffer from low spatial resolution and a limited signal-to-noise ratio [8]. The various FDG dosages and scanner settings from different vendors and institutions can also lead to large variations in PET image intensity. Thus, the more consistent and high-resolution anatomical information from CT is still indispensable for HNC GTV segmentation.

The recent development of deep learning methods has enabled auto-segmentation of the GTV in HNC, a competitive alternative to avoid time-consuming and error-prone manual delineation, validated by several studies [5, 10, 11] and challenges [12, 13]. Auto-segmentation of medical images is currently dominated by the Unet deep-learning architecture [9] and its variants [5, 10, 11, 12, 13]. By adaptively adjusting the network architecture, training scheme, data pre-processing, and data post-processing, the Unet-based approaches could achieve Dice similarity coefficient (DSC) [14] scores from 0.71 to 0.78 for GTVp [10, 11] and 0.70 to 0.74 for combined GTVp and GTVn [5] segmentation.

Although it is common to have whole-body PET/CT scans in clinics, a tumor bounding box was usually selected manually in previous studies due to memory limitations. Such a manual selection process increases the processing time, and inter- and intra-physician variability may still be present. A solution for auto-localization and segmentation of HNC GTVs from entire images is therefore desirable. Furthermore, previous studies were mainly trained and tested on datasets generated under the same guideline where the GTVp and GTVn were inspected and adapted beforehand [13]. Since variations in GTV delineation could still exist between institutions, the transferability of a trained model for external testing requires investigation.

The overall goal of this study was to explore the possibility of auto-localization and segmentation of HNC GTVs from entire PET/CT images and to investigate the transferability of the trained models for external testing with potential variations in GTV delineation style. We first used Retina Unet, a deep learning network for tumor localization [20], to find HNC from whole-body PET/CT scans, and subsequently utilized a regular Unet [25] to segment the GTV. Additionally, we employed two end-to-end models for direct tumor localization and segmentation with 2D/3D Retina Unets. The segmentation performance of the different models was compared via the DSC and the Hausdorff distance (HD). Furthermore, to investigate the transferability of trained models, the prediction performance was additionally tested with data from three independent facilities. Finally, transfer learning, which has been beneficial for prostate cancer segmentation [15, 26], was also performed to investigate whether the trained baseline models can be further adapted to the segmentation style of each external institution and thus improve the prediction performance.

Material and method


We trained and cross-validated baseline models with the dataset provided by HECKTOR 2022, where 524 histologically proven HNC patients were collected from 7 different cohorts [12, 13] and where segmentations were retrospectively harmonized. The ground truth segmentation was based on human annotations of GTVp and GTVn, which were manually delineated by an annotator and cross-checked by another. Precise contouring guidelines were elaborated to ensure the unification of all annotations. We then collected another 271 patients from three different cohorts for external testing. The MAASTRO (Maastricht Radiation Oncology clinic, Netherlands) cohort was publicly retrieved from the Cancer Imaging Archive (TCIA) [16, 17, 18], while the CRO (Centro di Riferimento Oncologico Aviano, Italy) and BERLIN (Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Radiation Oncology, Berlin, Germany) cohorts were obtained via collaboration. All clinical data were anonymously obtained and processed under relevant ethics approvals and regulations. Informed consent was obtained from all patients. The baseline characteristics of the patients from the different cohorts are summarized in Table 1, where the tumor stage was omitted for the HECKTOR cohort since it was not publicly provided.

All patients underwent radiotherapy and/or chemotherapy treatment, had FDG-PET as well as non-contrast-enhanced CT images, and had the GTV (including GTVp and GTVn) contoured as segmentation masks. The contours of the CRO cohort were adaptively checked and modified for a radiomics study, while the contours of BERLIN were directly exported from the clinical treatment planning systems (TPS). In the HECKTOR training dataset and the MAASTRO cohort, the GTVp and GTVn were contoured separately and assigned labels 1 and 2 (with background as label 0), while they were labelled together as 1 in the CRO cohort. For the BERLIN cohort, 12.9% of the patients had GTVp and GTVn contoured and labelled separately, while the rest (87.1%) had only a single label. Therefore, to unify the differences, we labelled GTVp and GTVn together as 1 and background as 0 in this study. The PET intensities had already been converted to standardized uptake values and the segmentation masks had been aligned with the corresponding CT images. The number of patients who had PET/CT scans exceeding 100 cm along the superior-inferior (SI) direction was 57 for the training dataset, and 53 for the external testing datasets.
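The label unification described above (GTVp and GTVn merged into a single foreground label) can be illustrated with a minimal sketch. The function name `merge_gtv_labels` and the toy mask are hypothetical and only serve the illustration; this is not the authors' code.

```python
import numpy as np

def merge_gtv_labels(mask):
    """Map any non-zero label (GTVp = 1, GTVn = 2) to 1; background stays 0."""
    return (np.asarray(mask) > 0).astype(np.uint8)

# Toy 2D mask with separately labelled GTVp (1) and GTVn (2)
mask = np.array([[0, 1, 1],
                 [0, 2, 0],
                 [2, 2, 0]])
merged = merge_gtv_labels(mask)
print(merged)
```

Masks that already carry a single foreground label, as in the CRO cohort, pass through this mapping unchanged.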

Table 1 Baseline characteristics of the patients from different cohorts in this study

Image preprocessing

For the HECKTOR training dataset, although the PET and CT were registered, the average size and spacing of the PET scans were 200 × 200 × 200 voxels and 4 × 4 × 3 mm³, while for CT they were 512 × 512 × 200 voxels and 0.98 × 0.98 × 3 mm³. Therefore, resampling for PET/CTs and their GTV segmentation masks to a 1 × 1 × 1 mm³ isotropic grid was implemented via linear and nearest-neighbor interpolations, respectively. For the other testing cohorts (MAASTRO, CRO, and BERLIN), the registration of PET and CT was first verified. If the PET and CT images were not properly registered, an in-house Python script was applied to replicate the clinical rigid registration with the provided DICOM files before resampling to 1 × 1 × 1 mm³. Finally, for all cohorts, the resampled PET and CT were re-scaled using z-scores (subtraction of the mean and division by the standard deviation).
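The z-score re-scaling at the end of the preprocessing can be sketched as follows. It is assumed here that the mean and standard deviation are taken over each whole volume, which the text does not specify; the handling of constant images and the synthetic input values are likewise assumptions made for the illustration.

```python
import numpy as np

def zscore(volume):
    """Rescale a volume to zero mean and unit standard deviation."""
    volume = np.asarray(volume, dtype=np.float64)
    std = volume.std()
    if std == 0:
        # Constant image: return zeros (an assumption, not stated in the text)
        return np.zeros_like(volume)
    return (volume - volume.mean()) / std

rng = np.random.default_rng(0)
pet = rng.gamma(shape=2.0, scale=1.5, size=(8, 8, 8))   # synthetic SUV-like values
ct = rng.normal(loc=40.0, scale=300.0, size=(8, 8, 8))  # synthetic HU-like values
pet_z, ct_z = zscore(pet), zscore(ct)
print(pet_z.mean(), pet_z.std(), ct_z.mean(), ct_z.std())
```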

Tumor localization

We adopted the Retina Unet [20] in its 2D version to localize the tumor (combination of GTVp and GTVn) center from PET/CTs (including whole body scans). The dual-channel (with PET and CT as inputs) six-layer Retina Unet with ResNet50 [21] as the backbone was trained in a slice-based manner to determine the center of the tumor in each slice. Specifically, the Retina Unet outputs both the coordinates of a tumor bounding box and the corresponding confidence score (ranging from 0 to 1), as shown in Fig. 1(B). Since the GTVp and GTVn can be distinct volumes, there could be several predictions for each slice. To determine the center of the combined GTVp and GTVn, we computed the true and predicted bounding box center differences as a function of confidence score thresholds ranging from 0.4 to 0.9, effectively treating the threshold as a hyper-parameter optimized based on the validation set. If a bounding box had a confidence score larger than the threshold times the maximum confidence score of that patient, then it would be collected to compute the tumor volume center. Finally, we took the median center coordinates in 3D of those collected bounding boxes as the center for the tumor volume.
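The center estimation described above (a relative confidence threshold followed by the median of the retained box centers) can be sketched as below. The detection layout `[x_min, y_min, x_max, y_max, slice_index, score]`, the function name, and the toy values are assumptions made for this illustration rather than the authors' implementation.

```python
import numpy as np

def estimate_tumor_center(detections, rel_threshold=0.6):
    """Median 3D center of the boxes whose score exceeds
    rel_threshold times the highest score of the patient."""
    det = np.asarray(detections, dtype=float)
    if det.size == 0:
        raise ValueError("no detections for this patient")
    scores = det[:, 5]
    keep = scores > rel_threshold * scores.max()
    cx = 0.5 * (det[keep, 0] + det[keep, 2])  # box center, first in-plane axis
    cy = 0.5 * (det[keep, 1] + det[keep, 3])  # box center, second in-plane axis
    cz = det[keep, 4]                         # slice index of the 2D detection
    return np.median(np.stack([cx, cy, cz], axis=1), axis=0)

# Hypothetical per-slice detections for one patient
detections = [
    [100, 100, 120, 120, 10, 0.95],
    [102,  98, 122, 118, 11, 0.90],
    [104, 100, 124, 120, 12, 0.80],
    [300, 300, 310, 310, 40, 0.30],  # below 0.6 * 0.95, so discarded
]
center = estimate_tumor_center(detections, rel_threshold=0.6)
print(center)
```

Using the median rather than the mean keeps the estimate stable when a few retained boxes lie far from the main tumor bulk.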

In our approach, the Retina Unet was trained with 80% of the HECKTOR dataset (419 patients) and validated on the remaining 20% (105 patients). A multi-task loss function was applied as

$$L_{retina\_unet} = L_c + L_b + L_s \tag{1}$$

where Lc was the class loss defined as Eq. (5) in [22], Lb was the bounding box loss defined as Eq. (2) in [23], and Ls was the segmentation loss defined as a combination of the soft Dice coefficient loss and the pixel-wise cross-entropy loss in Eq. (1) of [20]. To enhance the computational efficiency, the preprocessed PET/CTs and their segmentation masks were first cropped (from the axial image center) to a size of 512 × 512 mm in the axial plane with 2 mm pixel grid spacing and were resampled to 3 mm grid spacing along the superior-inferior direction, preserving the original superior-inferior length of the scans. For training, the input PET/CT images were randomly cropped into patches with a size of 128 × 128 pixels. Data augmentation was applied with a multithreaded augmentation pipeline [10], including scaling (from 0.8 to 1.1), rotation along the axial direction (from 0 to 360 degrees), and elastic deformation with parameter alpha in the range (0, 1500) and parameter sigma in the range (30, 50). The network was trained for 100 epochs using the Adam optimizer (learning rate 5e-4) on an NVIDIA Quadro RTX 8000 (48 GB) GPU with a batch size of 40. More details of the network architecture can be found in [20].

Tumor segmentation

We used a dual-channel 4-level Unet [25] to segment the GTV foreground (combination of GTVp and GTVn) as label 1. Based on the tumor center coordinates determined from the 2D Retina Unet, we first cropped and resampled the PET/CTs and their segmentation masks to a volume size of 256 × 256 × 256 mm on a 1 mm isotropic grid. With the default soft Dice loss function, the network was trained for 300 epochs using the Adam optimizer on an NVIDIA Quadro RTX A6000 (48 GB) GPU with a batch size of 4. The initial learning rate was 5e-4 and was decayed by a factor of 2 whenever the validation loss had not improved for 20 epochs. To avoid overfitting, the same pipeline [10] for data augmentation was applied, including a random central shift of at most 16 voxels from the original image, random mirroring with 50% probability, and elastic deformation up to 25% of the cropped images.
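The cropping step that links localization and segmentation can be sketched as a fixed-size crop around the predicted center. How regions outside the scan were filled is not stated in the text, so the constant padding used here, as well as the function name and the small toy volume, are assumptions.

```python
import numpy as np

def crop_around_center(volume, center, size=(256, 256, 256), pad_value=0.0):
    """Extract a fixed-size block centered on `center` (voxel coordinates),
    padding with `pad_value` where the block extends beyond the volume."""
    volume = np.asarray(volume)
    size = np.asarray(size)
    start = np.round(center).astype(int) - size // 2
    stop = start + size
    src_lo = np.maximum(start, 0)              # clip the block to the volume
    src_hi = np.minimum(stop, volume.shape)
    dst_lo = src_lo - start                    # where the clipped part lands
    dst_hi = dst_lo + (src_hi - src_lo)
    out = np.full(tuple(size), pad_value, dtype=volume.dtype)
    src = tuple(slice(lo, hi) for lo, hi in zip(src_lo, src_hi))
    dst = tuple(slice(lo, hi) for lo, hi in zip(dst_lo, dst_hi))
    out[dst] = volume[src]
    return out

# Toy example: 10^3 volume, 4^3 crop near the border so padding is needed
volume = np.arange(1000, dtype=float).reshape(10, 10, 10)
patch = crop_around_center(volume, center=(1, 5, 5), size=(4, 4, 4), pad_value=-1.0)
print(patch.shape)
```

On a 1 mm isotropic grid, the default `size` corresponds to the 256 × 256 × 256 mm volume mentioned above.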

In addition, since the Retina Unet [20] can localize the tumor center and simultaneously output the predicted foreground (as shown in Fig. 1(B)), we also implemented it in its 2D and 3D versions for tumor (combination of GTVp and GTVn) segmentation. For both Retina Unets, the same network structure and data augmentation process were applied. The loss function was first kept the same as Eq. (1) with equal weightings for the multi-task losses. Alternatively, to focus more on the segmentation task, the weightings of Lc, Lb, and Ls on the right side of Eq. (1) were set to 0.1, 0.1, and 1.0, respectively. These two networks were trained for 200 epochs using the Adam optimizer (learning rate 1e-4) on an NVIDIA Quadro RTX 8000 (48 GB) GPU with a batch size of 40 and 8 for the 2D and 3D cases, respectively. Similar to Sect. 2.3, the inputs of the 2D Retina Unet had a size of 256 × 256 pixels with a pixel spacing of 2 mm and were randomly cropped into patches with a size of 128 × 128 pixels during training. For the 3D Retina Unet, the whole-body PET/CTs and their segmentation masks were resampled to a 2 × 2 × 3 mm grid after the preprocessing in Sect. 2.2. They were then randomly cropped into patches with a size of 128 × 128 × 128 voxels during training.
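For reference, the re-weighted variant of the multi-task loss mentioned above can be written explicitly. The weight symbols w_c, w_b and w_s are notation introduced here for illustration only; equal weighting corresponds to all three set to 1.

$$L_{retina\_unet} = w_c L_c + w_b L_b + w_s L_s, \qquad (w_c, w_b, w_s) = (0.1, 0.1, 1.0)$$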

The whole HECKTOR dataset was used for training and cross-validation (CV). For all three networks (regular Unet, 2D/3D Retina Unets), we applied 5-fold CV, resulting in five trained models with the highest average Dice score for the foreground (a combination of GTVp and GTVn). To get the segmentation masks for the testing datasets, the mean value of the predicted probability maps from all the CV folds was computed and thresholded at 0.5 to get the respective label masks. Figure 1(A) illustrates the segmentation schemes in this study.
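The fold-averaging step for the testing datasets can be sketched in a few lines. Whether a mean probability of exactly 0.5 counts as foreground is not stated in the text; a strict comparison is assumed here, and the probability values are made up for the example.

```python
import numpy as np

def ensemble_mask(fold_probabilities, threshold=0.5):
    """Average per-fold probability maps and threshold the mean."""
    mean_prob = np.mean(np.stack(fold_probabilities, axis=0), axis=0)
    return (mean_prob > threshold).astype(np.uint8)

# Five hypothetical fold outputs for a tiny 1 x 2 image
folds = [
    np.array([[0.9, 0.6]]),
    np.array([[0.8, 0.6]]),
    np.array([[0.7, 0.1]]),
    np.array([[0.6, 0.1]]),
    np.array([[0.2, 0.1]]),
]
mask = ensemble_mask(folds)
print(mask)
```

In this toy case the second pixel is rejected even though two folds individually predict foreground, because the mean probability (0.3) stays below the threshold.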

Fig. 1

Illustration of the workflow for tumor localization and segmentation. (A) Segmentation schemes with regular Unet, 2D Retina Unet, and 3D Retina Unet. (B) Input and output of the Retina Unet. The green boxes denote the predicted bounding boxes and the confidence scores are shown in white. The predicted segmentations are shown in yellow

Adaptive filtering scheme for 2D/3D Retina Unets

To avoid segmentations from bounding boxes with low confidence scores, we implemented an adaptive filtering scheme. In this approach, thresholds with a range of 0.0 to 1.0 were evaluated for 2D/3D Retina Unets as hyper-parameters, which were optimized to improve the segmentation results during 5-fold cross-validation. If a bounding box had a confidence score larger than the threshold times the maximum confidence score of that patient, a binary mask corresponding to the box coordinates was constructed. This mask was then multiplied with the predicted segmentation of the patient, effectively removing predictions derived from bounding boxes with unsatisfactory confidence scores. Finally, all the multiplied segmentations were aggregated to build the filtered segmented predictions.

Facility-specific training for segmentation networks

Due to the variability in GTV segmentation from independent facilities, the trained segmentation models (2D Retina, 3D Retina, Unet) from the HECKTOR dataset might not be transferable to the external testing datasets. Besides, it has been demonstrated in previous work that transfer learning from a baseline model could improve segmentation accuracy for independent institutions [26]. Therefore, we also extended the baseline segmentation model to three facility-specific models using transfer learning for the three external cohorts. The purpose of this approach was to adaptively adjust the baseline models to the different institutions. It should also be noted that the 2D Retina Unet for tumor localization was not re-trained with transfer learning. We only focused on the segmentation networks here.

For each segmentation network, the weights and biases were initialized with the baseline model and further trained and tuned with part of the PET/CTs randomly selected for each external facility. We implemented the transfer learning for each external cohort separately, randomly selecting 30% of the dataset for training and 20% for validation. The remaining 50% of each external cohort was kept to test the models after facility-specific learning. Table 2 summarizes the facility-specific training for the segmentation networks. The same data augmentation was employed to prevent overfitting, and the learning rate was tuned in the range from 1e-6 to 1e-3. For the 2D/3D Retina Unets, transfer learning was carried out on an NVIDIA Quadro RTX 8000 (48 GB) GPU. The transfer learning of the Unet was carried out on an NVIDIA Quadro RTX A6000 (48 GB) GPU.
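The 30% / 20% / 50% random split per external cohort can be sketched as follows. The rounding rule, the seed, and the function name are assumptions; the patient counts actually used are those reported in Table 2 and may differ from what this sketch produces.

```python
import random

def split_cohort(patient_ids, train_frac=0.3, val_frac=0.2, seed=0):
    """Randomly split one cohort into training, validation and test subsets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_train = round(train_frac * len(ids))
    n_val = round(val_frac * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]              # remaining patients
    return train, val, test

# Example with a cohort of 70 patients (hypothetical integer IDs)
train, val, test = split_cohort(range(70))
print(len(train), len(val), len(test))
```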

Table 2 Number of training, validation, and testing patients for the facility-specific regular Unet, 2D Retina Unet and 3D Retina Unet

Evaluation metrics

To evaluate the segmentation performance, the predicted GTV contours were compared to their ground truth via Dice similarity coefficient (DSC) [14] and Hausdorff distance [27] at average (HDavg) and 95th percentile (HD95), respectively. In this study, we used Plastimatch [19] to compute DSC, HDavg and HD95.
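For intuition, the three metrics can be sketched with NumPy on small binary masks. This is a simplified illustration and not Plastimatch's implementation: distances are computed between all foreground voxels rather than boundary voxels, the symmetric values are taken as the maximum of the two directed statistics, and these conventions may differ from the ones used by Plastimatch.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient of two binary masks."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff_stats(a, b, spacing=(1.0, 1.0, 1.0)):
    """Return (HDavg, HD95) in physical units; both masks must be non-empty."""
    pa = np.argwhere(a) * np.asarray(spacing)
    pb = np.argwhere(b) * np.asarray(spacing)
    # Brute-force pairwise distances: fine for toy masks, too slow for full scans
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
    d_ab, d_ba = d.min(axis=1), d.min(axis=0)   # directed nearest distances
    hd_avg = max(d_ab.mean(), d_ba.mean())
    hd95 = max(np.percentile(d_ab, 95), np.percentile(d_ba, 95))
    return hd_avg, hd95

# Two 4 x 4 x 4 cubes offset by one voxel along the first axis
truth = np.zeros((10, 10, 10), dtype=bool)
pred = np.zeros((10, 10, 10), dtype=bool)
truth[2:6, 2:6, 2:6] = True
pred[3:7, 2:6, 2:6] = True
dsc = dice(truth, pred)
hd_avg, hd95 = hausdorff_stats(truth, pred)
print(dsc, hd_avg, hd95)
```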

To verify if facility-specific training can significantly improve DSC and HD from the baseline models, the Wilcoxon signed-rank test was performed for the regular Unet and the 2D/3D Retina Unets, respectively. In addition, to compare the segmentation performances between different networks, we also implemented the non-parametric Friedman tests [28]. If the Friedman test revealed a significant difference (p-value < 0.05), a post-hoc Nemenyi test [29] was implemented to identify which network obtained significantly better DSC and HD in a pair-wise fashion.
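The paired and multi-model comparisons described above can be sketched with SciPy, assuming it is available. All DSC values below are synthetic and only illustrate the calls. The post-hoc Nemenyi test is not part of SciPy and would need a separate package, so it is not shown.

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Synthetic per-patient DSC values (not results from this study)
baseline = [0.52, 0.61, 0.47, 0.58, 0.66, 0.40, 0.55, 0.63, 0.49, 0.57]
gains = [0.03, 0.05, 0.08, 0.02, 0.04, 0.10, 0.06, 0.01, 0.09, 0.07]
adapted = [b + g for b, g in zip(baseline, gains)]

# Paired test: baseline versus facility-specific model on the same patients
w_stat, w_p = wilcoxon(baseline, adapted)

# Three models evaluated on the same patients (synthetic offsets)
unet = adapted
retina_2d = [d - 0.02 for d in adapted]
retina_3d = [d - 0.05 for d in adapted]
f_stat, f_p = friedmanchisquare(unet, retina_2d, retina_3d)

print(f"Wilcoxon p = {w_p:.4f}, Friedman p = {f_p:.6f}")
```

Both tests are rank-based and require the same patients in every group, which matches the paired design of the baseline versus facility-specific comparison.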


Tumor localization

The maximum tumor center differences for the HECKTOR dataset are summarized in Table 3, where the confidence score threshold ranged from 0.4 to 0.9. For the training dataset, the differences were almost at the same level over different thresholds. For the validation dataset, the optimal threshold was 0.6, where the differences were always smaller than 4 cm in the superior-inferior, lateral, and anterior-posterior directions [24]. Figure 2 displays the histogram of tumor center differences for the validation dataset at confidence score thresholds of 0.6 and 0.9, showing that a higher threshold could lead to more precise localization for most patients but might suffer from outliers. Therefore, to avoid this drawback, we chose a confidence score threshold of 0.6 in this study, accepting less accurate but more robust localization.

Table 3 Maximum tumor center differences in superior-inferior, lateral and anterior-posterior directions for the HECKTOR dataset (training and validation)

The maximum tumor center difference for the external testing cohorts was also computed. With a confidence score threshold of 0.6, the maximum tumor center differences in the lateral, anterior-posterior and superior-inferior directions were (3.9, 3.0, 6.6), (3.8, 2.8, 6.2) and (3.6, 3.5, 6.8) cm for the MAASTRO, CRO, and BERLIN cohorts, respectively. The histogram of the tumor center differences is displayed in Fig. 3, with no outlier beyond 7 cm observed.

Fig. 2

Histogram of tumor center differences (in cm) with thresholds (A) 0.6 and (B) 0.9. The values in lateral, anterior-posterior and superior-inferior directions are shown in blue, orange and green, respectively. The x-axis denotes the bounding box center difference in cm, and the y-axis denotes the patient count for the validation cohort (105 patients)

Fig. 3

Histogram of tumor center differences (in cm) with a threshold of 0.6 for external testing cohorts. (A) MAASTRO (70 patients), (B) CRO (108 patients) and (C) BERLIN (93 patients)

Adaptive filtering for 2D/3D Retina Unets

For the 2D Retina Unet, various threshold values (0, 0.45, 0.65, 0.85, 0.90, 0.95, 0.99, 1.0) were examined. Compared with the model without filtering (threshold set to be 0), improvements in DSC were observed for all threshold values. As the threshold value increased from 0.45 to 0.90, the DSC consistently improved. The highest average cross-validation DSC of 0.71 was achieved and remained stable when the threshold values were set to be 0.90 and 0.95. When the threshold was 0.99, the DSC began to decline. Subsequently, the results were compared in terms of Hausdorff Distance at threshold values of 0.90 and 0.95. The averaged CV HDavg / HD95 were found to be 3.2 / 11.3 mm for threshold 0.90 and 3.1 / 10.9 mm for threshold 0.95. Therefore, the optimal filtering threshold for 2D Retina Unet was selected to be 0.95 throughout the rest of the study.

Table 2 Median (25% − 75% percentile) DSC and HD from 5-fold CV via baseline models.

For the 3D Retina Unet, various threshold values (0, 0.3, 0.5, 0.7, 0.9, 0.95) were also examined, and similar trends in DSC and HD performance were observed as in the 2D counterpart. The optimal threshold value turned out to be 0.70, yielding the best averaged CV DSC of 0.71 and HDavg/HD95 of 2.9/9.8 mm. Consequently, the threshold value for adaptive filtering of the 3D Retina Unet was chosen to be 0.70 in this study.

Segmentation by baseline models

The DSC was computed for the GTV foreground (combination of GTVp and GTVn). Compared with the equally weighted multi-task loss function in Eq. (1), the loss function with weightings (0.1, 0.1, 1.0) for (Lc, Lb, Ls) yielded higher averaged CV DSC values in both the 2D (0.69 vs. 0.71) and 3D Retina Unets (0.70 vs. 0.71); these latter weightings were therefore selected for this study. The CV results are summarized in Table 2. In general, the regular Unet achieved the best averaged DSC of 0.74 for 5-fold CV. In contrast, the 2D and 3D Retina Unets obtained slightly lower averaged DSCs of 0.71.

However, if these baseline models were directly applied to the external testing cohorts, rather low DSC scores were obtained as summarized in Tables 3, 4 and 5. Compared with the 2D/3D Retina Unets, the regular Unet produced the highest DSC scores of 0.60, 0.63, 0.52 for MAASTRO, CRO, and BERLIN cohorts, respectively. Besides, we also noted that the 2D Retina Unet output higher averaged DSC than its 3D counterpart in the MAASTRO and CRO testing cohorts (0.60 vs. 0.57, 0.64 vs. 0.56). For the BERLIN cohort, the two Retina Unets produced similar median DSC scores, which were both smaller than 0.50. Fig. 4 collects several exemplary slices showing cases with one of the best (DSC 0.87), average (DSC 0.63), and poor (DSC 0.40) predicted segmentation from the CRO cohort with the regular Unet. It was observed that the predicted GTV segmentation was highly related to regions with higher SUV values in the PET image. However, in several cases, the high SUV region could still be outside of the GTV, leading to false positive predictions or larger GTV segmentation, as shown in Fig. 4(B). Conversely, the low SUV region could also contain GTVn, thus leading to false negative predictions.

We also evaluated the HDavg/HD95 for each CV and external testing cohort. The results are summarized in Tables 2, 3, 4 and 5. For cross-validation, the regular Unet obtained the best averaged CV HDavg/HD95 at 2.6/8.5 mm, outperforming the 2D (3.1/10.9 mm) and 3D (2.9/9.8 mm) Retina Unets. For the external testing, the regular Unet outperformed the other networks in the MAASTRO and CRO cohorts, with HDavg/HD95 at 3.9/13.4 and 3.6/12.4 mm, respectively. In the BERLIN cohort, the 2D Retina Unet obtained the optimal result of HDavg at 6.8 mm, and the 3D Retina Unet obtained the optimal result of HD95 at 18.8 mm. A clear increase of HDavg/HD95 was observed when applying baseline models directly to the external testing cohorts. With the regular Unet, the exemplary patients in Fig. 4 had HDavg/HD95 at 1.9/5.3 mm, 9.9/30.5 mm, 6.0/14.8 mm for (A), (B) and (C), respectively.

Table 3 Median (25% − 75% percentile) DSC and HD for MAASTRO cohort via baseline and facility-specific models.
Table 4 Median (25% − 75% percentile) DSC and HD for CRO cohort via baseline and facility-specific models
Table 5 Median (25% − 75% percentile) DSC and HD for BERLIN cohort via baseline and facility-specific training models
Fig. 4

Image slices from the CRO cohort showing (A) one of the best, (B) average, and (C) poor baseline model performance. All the CTs are windowed with a width of 200 and a level of 20 in Hounsfield units (HU). The ground truth segmentation is contoured in red lines on both PET (left) and CT (right). The predicted GTV (combination of GTVp and GTVn) with the baseline regular Unet is displayed in green on CT. The predicted GTV from the facility-specific transfer learning for the regular Unet is displayed in pink on PET

Segmentation from facility-specific training models

Tables 3, 4 and 5 also summarize the DSC and HDavg/HD95 results after facility-specific training for the three networks. The regular Unet produced the best segmentation results in the MAASTRO cohort with a DSC of 0.70 and HDavg/HD95 of 2.8/7.8 mm. For the CRO cohort, the regular Unet also achieved the best HDavg/HD95 of 2.9/7.2 mm, while the 2D Retina Unet achieved the best DSC result of 0.76. For the BERLIN cohort, it was still the 2D Retina Unet that produced the best segmentation results over the other two models with a DSC of 0.67 and HDavg/HD95 of 3.8/12.4 mm. The DSC improvement after facility-specific training for each external cohort can also be seen in Fig. 5. Furthermore, according to the Wilcoxon signed-rank test between baseline models and facility-specific models in Table 6, significant improvements (p < 0.05) after the facility-specific transfer learning were observed for DSC, except for the regular Unet in the MAASTRO cohort. The HDavg and HD95 were also significantly improved after facility-specific training in some external cohorts. In general, the facility-specific transfer learning could achieve enhancements in segmentation accuracy. With the facility-specific regular Unet, the DSC (HDavg/HD95) for the exemplary patients in Fig. 4 were 0.79 (2.5/5.5 mm), 0.71 (4.6/17.1 mm) and 0.58 (4.4/12.3 mm).

The Friedman test yielded significant differences among all the models in terms of DSC and HDavg/HD95. Therefore, a post-hoc Nemenyi test was applied, with results summarized in Table 7, to check which model obtained significantly better metrics in a pair-wise manner. For the DSC, the regular Unet showed significantly improved results compared to the 2D and 3D Retina Unets, while the 2D Retina Unet also showed significantly improved results against its 3D counterpart. For the HDavg, both the regular Unet and the 2D Retina Unet showed significantly improved results compared to the 3D Retina Unet. For the HD95, only the regular Unet performed significantly better than the 3D Retina Unet. There was no significant difference between the other models.

Fig. 5

Box plots comparing the DSC of the baseline (b) and the facility-specific training (s) models, with solid orange lines and dashed blue lines representing the mean and median values of the DSC. Unet stands for regular Unet, 2D stands for 2D Retina Unet and 3D stands for 3D Retina Unet. (A) MAASTRO cohort. (B) CRO cohort. (C) BERLIN cohort

Table 6 P-values obtained from the Wilcoxon signed-rank test between baseline models and facility-specific training models. Significant results (p < 0.05) are denoted with an asterisk
Table 7 P-values obtained from the post-hoc Nemenyi test after facility-specific training for all possible pairwise model comparisons. Significant results (p < 0.05) are denoted with an asterisk


In this study, several deep learning-based models were implemented to automatically localize and segment HNC tumors from entire PET/CT images. To avoid the manual selection of a bounding box, the 2D Retina Unet was first used to localize the center of the GTV (combination of GTVp and GTVn), where the tumor center of each slice was selected using its confidence score. To find the optimal value of the confidence score threshold, we computed the tumor center differences with thresholds ranging from 0.4 to 0.9. According to the validation result in Table 3, the optimal threshold was 0.6, achieving a maximum difference of less than 6.8 cm in any direction in the three external testing cohorts. A higher value of 0.9 might give more precise localization results, but also suffered from outliers (the maximum difference in the superior-inferior direction was 18.0 cm). This might be caused by ignoring several lymph nodes with low confidence scores. Therefore, as a compromise between precision and robustness, we chose 0.6 in this study.

Based on the localized tumor center, the PET/CTs were cropped and input to a regular Unet for GTV segmentation. In comparison, we also implemented two fully automated end-to-end models with 2D/3D Retina Unets, aiming to segment the HNC tumor directly from the whole-body PET/CT. The three baseline models were trained and cross-validated on the HECKTOR challenge dataset and later tested externally on the MAASTRO, CRO, and BERLIN cohorts. For CV, the regular Unet outperformed the other models in terms of both DSC and HD, achieving an averaged median DSC of 0.74 and HDavg/HD95 of 2.6/8.5 mm, which were comparable with the published best result for combined GTVp and GTVn (DSC around 0.74 and HD95 around 10 mm), where 153 patients collected from one institution were involved and a 3D Unet was used for training [5]. In that study, the GTVp and GTVn were first delineated by one oncologist and later reviewed and adapted by another radiologist and a nuclear medicine physician [5]. Compared with the regular Unet, there was no clear difference in terms of DSC for the 2D/3D Retina Unets, while a maximum increase of 1.3/4.0 mm was observed in HDavg/HD95.

However, a drop of more than 10% in DSC and a 40% increase in HD95 were observed when the baseline models were tested directly on the three external cohorts, suggesting that models trained on the multicentric HECKTOR dataset might not be directly suitable for HNC tumor segmentation at other institutions. There might be variations in GTV delineation between the training and testing cohorts, which could hamper the generalizability of the GTV segmentation. Although it was not specified which guidelines were used for GTV delineation, the MAASTRO and CRO cohorts were collected for radiomics studies, and their ground truth GTV segmentations had been checked and modified. This may explain why DSCs above 0.62 were obtained for these two cohorts with the best baseline model. In contrast, the GTV contours for the BERLIN cohort were exported directly from the TPS. Potentially because no such inspection had been done, the poorest segmentation results were observed for this cohort, with a DSC of merely 0.52.

When facility-specific training was applied, improvements in DSC and HDavg/HD95 were obtained for all the baseline models. The regular Unet and the 2D Retina Unet were the best performing models, with the best DSC (HDavg/HD95) of 0.70 (2.8/7.8 mm) and 0.76 (2.9/7.2 mm) in the MAASTRO and CRO cohorts, respectively. Even for the BERLIN cohort, the best model (2D Retina Unet) still achieved a DSC (HDavg/HD95) of 0.67 (3.8/12.4 mm) after transfer learning, an increase of 42.5% in DSC and a decrease of 43.4% in HD95. The Wilcoxon signed-rank test again confirmed the improvement in segmentation accuracy from transfer learning, suggesting that the trained baseline models can still adapt to the individual segmentation style of the external institutions. In addition, according to the Nemenyi test, the regular Unet and the 2D Retina Unet yielded significantly higher DSC and lower HDavg than the 3D Retina Unet after transfer learning. The smaller improvement of the 3D Retina Unet suggests that it might need more data during transfer learning.
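A paired comparison of this kind can be sketched as follows, using `scipy.stats.wilcoxon` on per-patient DSC values before and after facility-specific training. The DSC values are synthetic placeholders rather than results from this study, the helper `relative_change` is a hypothetical name, and the two-sided alternative (SciPy's default) is an assumption, since the text does not state which alternative was tested.

```python
import numpy as np
from scipy.stats import wilcoxon


def relative_change(before, after):
    """Percentage change from `before` to `after` (positive means an increase)."""
    return 100.0 * (after - before) / before


# Synthetic per-patient DSC values for one hypothetical external cohort.
baseline = np.array([0.45, 0.50, 0.38, 0.61, 0.55, 0.47, 0.52, 0.40, 0.58, 0.49])
gain = np.array([0.15, 0.12, 0.20, 0.05, 0.10, 0.18, 0.11, 0.22, 0.07, 0.16])
adapted = baseline + gain  # same patients after transfer learning

# Paired, non-parametric test on the per-patient differences.
result = wilcoxon(adapted, baseline)
print("statistic:", result.statistic, "p-value:", result.pvalue)

# Relative change of the cohort median, analogous to the percentages in the text.
print("median DSC change (%):",
      relative_change(np.median(baseline), np.median(adapted)))
```

The test is paired because each patient is segmented by both the baseline and the adapted model, so only the within-patient differences enter the statistic.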

Additionally, the adaptive filtering scheme could enhance the segmentation accuracy of both 2D and 3D Retina Unets. Retina Unets automatically generated bounding boxes and confidence scores during the segmentation. By applying spatial thresholding to the predicted GTVs, low-confidence predictions were effectively removed, resulting in improved segmentation performance.
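The idea of removing low-confidence predictions can be sketched as below. This is only an illustration under stated assumptions: the box format `(z0, y0, x0, z1, y1, x1)`, the fixed `min_score` value, and the function name `filter_mask_by_boxes` are hypothetical, and the adaptive rule used in this study to choose the threshold is not reproduced here.

```python
import numpy as np


def filter_mask_by_boxes(mask, boxes, scores, min_score=0.5):
    """Keep only predicted voxels that lie inside a sufficiently confident box.

    mask:   binary predicted segmentation, shape (Z, Y, X)
    boxes:  iterable of (z0, y0, x0, z1, y1, x1) voxel indices, end-exclusive
    scores: confidence score for each box
    """
    mask = np.asarray(mask, dtype=bool)
    keep = np.zeros_like(mask)
    for (z0, y0, x0, z1, y1, x1), score in zip(boxes, scores):
        if score >= min_score:
            keep[z0:z1, y0:y1, x0:x1] = True
    return mask & keep


# Toy prediction with two blobs: one inside a confident box, one inside a weak box.
mask = np.zeros((8, 8, 8), dtype=bool)
mask[1:3, 1:3, 1:3] = True
mask[5:7, 5:7, 5:7] = True
boxes = [(0, 0, 0, 4, 4, 4), (4, 4, 4, 8, 8, 8)]
scores = [0.9, 0.2]

filtered = filter_mask_by_boxes(mask, boxes, scores, min_score=0.5)
print(mask.sum(), filtered.sum())
```

Restricting the mask to confident boxes trades a possible loss of true but weakly detected nodes for fewer spurious distant predictions, which mainly affects distance-based metrics such as HD95.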

Due to the retrospective nature of the study, there were uncertainties regarding the quality of the ground truth GTVs in the external testing cohorts. We used the original datasets for the external testing and did not perform any additional inspection or adaptation of the delineations. Although facility-specific training might partially adapt to the underlying differences between individual institutions, some pronounced variability could still impact performance. For datasets from multiple centers, the elaboration and adoption of a precise contouring guideline would be beneficial.


In this study, three baseline models were trained to automatically localize and segment HNC GTVs on PET/CT images. The model using 2D Retina Unet and regular Unet for tumor localization and segmentation outperformed the other two end-to-end models with 2D/3D Retina Unets in the CV. This was also observed for most external testing cohorts, albeit with a low overall performance. Finally, the transferability of the baseline models was tested for three independent institutions, and encouraging testing results were observed after facility-specific training, where the optimal DSC and HD achieved by the regular or 2D Retina Unets were comparable with state-of-the-art studies.

Data Availability

The HECKTOR and MAASTRO cohorts are publicly available datasets. The CRO and BERLIN datasets are the result of collaborations and can be obtained upon request.



Abbreviations

HNC:

Head and neck cancer

GTV:

Gross tumor volume

GTVp:

Gross primary tumor volume

GTVn:

The associated lymph nodes

CT:

Computed tomography

PET:

Positron emission tomography

DSC:

Dice similarity coefficient

HD:

Hausdorff distance

HD95:

The 95th percentile of Hausdorff distance

HDavg:

Average value of Hausdorff distance

MAASTRO:

Maastricht Radiation Oncology clinic, Netherlands

CRO:

Centro di Riferimento Oncologico Aviano, Italy

BERLIN:

Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Radiation Oncology, Berlin, Germany

Lc:

The class loss

Lb:

The bounding box loss

Ls:

The segmentation loss


  1. Economopoulou P, Psyrri A. Head and neck cancers: essentials for clinicians. Chap 1 (2017).

  2. Elkashty OA, Ashry R, Tran SD. Head and Neck cancer management and cancer stem cells implication. Saudi Dent J. 2019;31(4):395–416.

  3. Grégoire V, Grau C, Lapeyre M, Maingon P. Target volume selection and delineation (T and N) for primary radiation treatment of oral cavity, oropharyngeal, hypopharyngeal and laryngeal squamous cell carcinoma. Oral Oncol. 2018;87:131–7.

  4. Vinod SK, Jameson MG, Min M, Holloway LC. Uncertainties in volume delineation in radiation oncology: a systematic review and recommendations for future studies. Radiother Oncol. 2016;121(2):169–79.

  5. Ren J, Eriksen JG, Nijkamp J, Korreman SS. Comparing different CT, PET and MRI multi-modality image combinations for deep learning-based head and neck Tumor segmentation. Acta Oncol (Stockholm Sweden). 2021;60(11):1399–406.

  6. Jensen K, Friborg J, Hansen CR, Samsøe E, Johansen J, Andersen M, et al. The Danish Head and Neck Cancer Group (DAHANCA) 2020 radiotherapy guidelines. Radiother Oncol. 2020;151:149–51.

  7. Han MW, Lee HJ, Cho KJ, Kim JS, Roh JL, Choi SH, et al. Role of FDG-PET as a biological marker for predicting the hypoxic status of tongue cancer. 2012;34(10):1395–402.

  8. Rosenbaum SJ, Lind T, Antoch G, Bockisch A. False-positive FDG PET uptake–the role of PET/CT. Eur Radiol. 2006;16(5):1054–65.

  9. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015. Cham: Springer International Publishing; 2015.

  10. Isensee F, Jäger P, Wasserthal J, Zimmerer D, Petersen J, Kohl S, et al. batchgenerators - a python framework for data augmentation. 2020.

  11. Xie J, Peng Y. The head and neck tumor segmentation based on 3D U-Net. In: Head and Neck Tumor Segmentation and Outcome Prediction. Cham: Springer International Publishing; 2022.

  12. Andrearczyk V, Oreiller V, Boughdad S, Rest CCL, Elhalawani H, Jreige M, et al. Overview of the HECKTOR challenge at MICCAI 2021: automatic head and neck tumor segmentation and outcome prediction in PET/CT images. In: Head and Neck Tumor Segmentation and Outcome Prediction. Cham: Springer International Publishing; 2022.

  13. Oreiller V, Andrearczyk V, Jreige M, Boughdad S, Elhalawani H, Castelli J, et al. Head and neck Tumor segmentation in PET/CT: the HECKTOR challenge. Med Image Anal. 2022;77:102336.

  14. Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.

  15. Balagopal A, Morgan H, Dohopolski M, Timmerman R, Shan J, Heitjan DF, et al. PSA-Net: deep learning–based physician style–aware segmentation network for postoperative Prostate cancer clinical target volumes. Artif Intell Med. 2021;121:102195.

  16. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5(1):4006.

  17. Wee L, Dekker A. Data from head-neck-radiomics-hn1. The Cancer Imaging Archive. 2019.

  18. Clark K, Vendt B, Smith K, Freymann J, Kirby J, Koppel P, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a Public Information Repository. J Digit Imaging. 2013;26(6):1045–57.

  19. Zaffino P, Raudaschl P, Fritscher K, Sharp GC, Spadea MF. Technical note: plastimatch mabs, an open source tool for automatic image segmentation. Med Phys. 2016;43(9):5155.

  20. Jaeger PF, Kohl SA, Bickelhaupt S, Isensee F, Kuder TA, Schlemmer HP, Maier-Hein KH. Retina U-Net: embarrassingly simple exploitation of segmentation supervision for medical object detection. In Machine Learning for Health Workshop 2020 Apr 30 (pp. 171–83). PMLR.

  21. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

  22. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. arXiv e-prints. 2017:arXiv:1708.02002.

  23. Girshick R. Fast R-CNN. arXiv e-prints. 2015:arXiv:1504.08083.

  24. Castro E, Cardoso JS, Pereira JC. Elastic deformations for data augmentation in breast cancer mass detection. In: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI); 2018. p. 230–4.

  25. Isensee F, Kickingereder P, Wick W, Bendszus M, Maier-Hein KH. Brain tumor segmentation and radiomics survival prediction: contribution to the BraTS 2017 challenge. In: International MICCAI Brainlesion Workshop; 2017.

  26. Kawula M, Hadi I, Nierer L, Vagni M, Cusumano D, Boldrini L et al. Patient-specific transfer learning for auto-segmentation in adaptive 0.35 T MRgRT of Prostate cancer: a bi-centric evaluation. Med Phys. 2022.

  27. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015;15(1):29.

  28. Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32(200):675–701.

  29. Nemenyi PB. Distribution-free multiple comparisons. Ph.D. thesis. Princeton University; 1963.


This work was partly supported by Sichuan Science and Technology Program under Grant 2023YFH0079, China Postdoctoral Science Foundation under Grant 2019M663471, Fundamental Research Funds for the Central Universities under Grant ZYGX2021YGCX008, German Research Foundation (DFG), Research Training Group GRK 2274 ‘Advanced Medical Physics for Image-Guided Cancer Therapy’, and Förderprogramm für Forschung und Lehre, Medical Faculty, LMU Munich, reg. no. 1084.

Author information

Yiling Wang, Elia Lombardo, Marco Riboldi, Christopher Kurz and Guillaume Landry wrote the main manuscript text. Yiling Wang prepared Figs. 1, 2, 3, 4 and 5. All authors reviewed the manuscript.

Corresponding author

Correspondence to Guillaume Landry.

Ethics declarations

Ethical approval and consent to participate

The HECKTOR and MAASTRO cohorts are publicly available datasets. The CRO patient data are part of two studies approved by the Unique Regional Ethics Committee, with the following approval numbers: CRO-2017-50 and CRO-2019-66. Informed consent was obtained from all CRO patients. The studies involving human participants from the BERLIN cohort were reviewed and approved by Ethikkomission der Charité, Charité Universitätsmedizin Berlin, Berlin, Germany. The patients/participants provided their written informed consent to participate in this study, and their data were anonymised.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wang, Y., Lombardo, E., Huang, L. et al. Comparison of deep learning networks for fully automated head and neck tumor delineation on multi-centric PET/CT images. Radiat Oncol 19, 3 (2024).
