Multiple comparisons permutation test for image based data mining in radiotherapy
© Chen et al.; licensee BioMed Central Ltd. 2013
Received: 6 August 2013
Accepted: 24 November 2013
Published: 23 December 2013
Comparing incidental dose distributions (i.e. images) of patients with different outcomes is a straightforward way to explore dose-response hypotheses in radiotherapy. In this paper, we introduced a permutation test that compares images, such as dose distributions from radiotherapy, while tackling the multiple comparisons problem. A test statistic Tmax was proposed that summarizes the differences between the images into a single value and a permutation procedure was employed to compute the adjusted p-value. We demonstrated the method in two retrospective studies: a prostate study that relates 3D dose distributions to failure, and an esophagus study that relates 2D surface dose distributions of the esophagus to acute esophagus toxicity. As a result, we were able to identify suspicious regions that are significantly associated with failure (prostate study) or toxicity (esophagus study). Permutation testing allows direct comparison of images from different patient categories and is a useful tool for data mining in radiotherapy.
When planning a radiotherapy treatment, a compromise is made between coverage of the target and exposure of Organs At Risk (OAR). While the dose to the designated target is generally uniform and homogeneous between patients, the dose to surrounding structures can be highly variable, depending on patient geometries, tumor locations, and treatment techniques. Such heterogeneous incidental dose distributions in patients might “accidentally” lead to different treatment outcomes regarding tumor control (e.g. if subclinical disease is important) or normal tissue toxicity. Therefore, applying data mining techniques to incidental dose distributions gives the possibility to explore dose patterns that are associated with clinical outcomes.
The purpose of introducing data mining in radiotherapy is to explore hypotheses for dose-response relationships. In cancer radiotherapy, variations in stem cells, microscopic tumor disease and radiosensitivity distributions can be expected to affect dose-response relationships. Unfortunately, many of these factors are unknown. Data mining on incidental dose may yield suspicious anatomical features from which, based on biological or clinical considerations, hypotheses for dose-response relationships can be formulated. If validated, such dose-response relationships would eventually provide evidence for better treatment planning, such as refined knowledge of the clinical target volume (CTV), optimal dose painting inside the gross tumor volume (GTV) and more effective sparing of OARs.
Several studies have explored dose-response relationships from a different perspective than the conventional dose volume histogram (DVH) based method[1–3]. These methods either explore the characteristics of dose distributions (e.g. eccentricity, homogeneity) or apply an advanced classifier (e.g. a neural network). However, they are not easily applicable when the hypothesized region is not known a priori. Directly comparing dose distributions is then a straightforward method of exploring dose-response relationships. Since the dose at each voxel is compared without a prior anatomical or geometrical hypothesis, voxel-by-voxel testing is suitable for hypothesis generation, i.e., for localizing suspicious regions. In a prostate study, the 3D dose distributions of prostate patients were registered to an anatomy grid and tested voxel-by-voxel (t-test) for relations with failure. The results indicated that a cluster of voxels outside the prostate yielded a p-value of less than 0.05. However, a p-value at every voxel is not yet the complete result. Since a large number of voxels are tested simultaneously, it is likely that the null hypothesis is incorrectly rejected at some voxels (type I errors); this is the so-called multiple comparisons problem.
The aim of this paper is to introduce a multiple comparisons permutation test to compare images between patients in radiotherapy studies. We begin by describing the methodology. Afterwards, we demonstrate the validity of this method with simulations. Finally, we give two examples of applying the permutation test in radiotherapy: one study that relates dose to failure for prostate cancer patients and another study that relates dose to acute esophagus toxicity for non-small cell lung cancer (NSCLC) patients.
Materials and methods
A permutation test involves five steps: 1) register images from different patients, 2) form a null hypothesis, 3) define a scalar test statistic, 4) generate random samples by permuting the true labels of the patients and extract the test statistic from each random sample, 5) calculate the adjusted p-value from the distribution of the test statistic. Thus, instead of a p-value for every voxel, this test gives a single p-value to describe the difference between two imaging datasets.
Suppose we observe a sample of patients with two outcomes: non-event (N) and event (E). These patients are considered to be representative of the entire population. To compare the dose distributions between the two groups, the first step is to register the dose distributions of all patients onto the same grid using an image registration method[5, 6]. The null hypothesis then states that there is no difference in dose distributions between the N and E labeled groups. In the following part, we introduce a test statistic Tmax, and describe the permutation procedure to compute the adjusted p-value.
For each sample i, the dose difference di,k at voxel k is the difference between μE,i,k and μN,i,k, the average doses at voxel k for groups E and N in sample i. Dividing di,k by σk, the standard deviation of di,k over the Np samples, gives the normalized dose difference Ti,k. σk is computed over the random samples generated by the permutation procedure, as described in Appendix A. As a result, we obtain a normalized dose difference map (or Ti,k map) for sample i. The test statistic Tmax,i is then the maximum value of the Ti,k map. Unlike a voxel-by-voxel test, Tmax,i gives a single number that summarizes the discrepancy of the dose distributions between the two label groups, rather than the discrepancy at a particular voxel. Therefore, Tmax accounts for multiple comparisons. Tmax is not the only option for extracting a single-valued test statistic from the Tk map; other test statistics, such as the x-th percentile, are also eligible. However, Tmax is often chosen for its strong control over type I errors.
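The quantities defined in this paragraph can be written compactly as below. This is a reconstruction from the surrounding definitions rather than the published equations: the sign convention (E minus N) and the 1/N_p normalization of the standard deviation are assumptions, and \bar{d}_k is a symbol introduced here for the mean of d_{i,k} over the permuted samples.

```latex
\begin{align}
d_{i,k}    &= \mu_{E,i,k} - \mu_{N,i,k}, \\
\sigma_k   &= \sqrt{\frac{1}{N_p}\sum_{i=1}^{N_p}\left(d_{i,k}-\bar{d}_k\right)^2},
\qquad \bar{d}_k = \frac{1}{N_p}\sum_{i=1}^{N_p} d_{i,k}, \\
T_{i,k}    &= \frac{d_{i,k}}{\sigma_k}, \\
T_{\max,i} &= \max_{k}\, T_{i,k}.
\end{align}
```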
The 10 possible label combinations of 3 Ns and 2 Es
1. N N N E E
2. N N E N E
3. N N E E N
4. N E N N E
5. N E N E N
6. N E E N N
7. E N N N E
8. E N N E N
9. E N E N N
10. E E N N N
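The table can be reproduced by enumerating every way of placing the E labels among the patients. The sketch below is only an illustration: the function name `label_combinations` is hypothetical, and the order in which the labelings are produced differs from the numbering in the table. It also shows why exhaustive enumeration quickly becomes impractical, which is why the appendix works with Np random permutations instead.

```python
from itertools import combinations
from math import comb


def label_combinations(n_non_event, n_event):
    """Yield every distinct labeling of n_non_event Ns and n_event Es."""
    n_total = n_non_event + n_event
    # Each labeling is fixed by choosing which patient positions carry an E.
    for event_positions in combinations(range(n_total), n_event):
        yield ["E" if i in event_positions else "N" for i in range(n_total)]


labelings = list(label_combinations(3, 2))
for labels in labelings:
    print(" ".join(labels))

# The count equals the binomial coefficient "5 choose 2".
print(len(labelings), comb(5, 2))

# For group sizes such as 37 and 30 (the prostate study), the number of
# labelings is far too large to enumerate, so random sampling is used.
print(comb(67, 37))
```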
Study I: prostate
We applied the permutation test to data used in an earlier prostate study. The aim of this study was to relate dose distributions to failure in prostate cancer patients. We selected a group of 67 patients with a relatively high risk of extraprostatic disease, estimated according to T-stage, iPSA and Gleason score or differentiation grade. These patients were treated at Erasmus Medical Center, The Netherlands, and were included in the Dutch Phase III trial (CKVO 96-10) with dose randomized between 68 Gy and 78 Gy. The Ethical Committee of each institution approved the protocol. Patients mainly had tumors of stage T3b and were treated to the delineated prostate and the seminal vesicles. The extra boost of 10 Gy had a 5 mm margin to the CTV (except towards the rectum, where a 0 mm margin was applied). For the 68 Gy PTV, a 10 mm margin was applied. In this study, failure was biochemical (PSA nadir plus 2) or clinical (locoregional or distant progression or start of salvage hormone therapy), determined at a fixed 4 year endpoint. As a result, 37 failure patients and 30 non-failure patients were eligible for analysis. Delineations from the planning CT and the planned dose distributions were collected for each patient. First, the dose distributions of all patients were registered onto a common dose grid, as described previously. In short, voxels correspond if their direction with respect to the center of mass (CM) is the same, and their distance to the surface in this direction is the same. For voxels inside the prostate, corresponding voxels have the same fractional distance between the CM and the surface. The registration identifies anatomical points at locations relative to the delineated prostate surface. The choice of this registration procedure is an important part of the dose-effect hypothesis, and was based on the suspicion that extracapsular extension might have affected outcome. The resulting grid has dimensions of 31 × 35 × 34, giving Nv = 36890 dose voxels.
The null hypothesis is that there is no dose distribution difference between the failure and the non-failure patients. The multiple comparison permutation test was applied to the registered dose maps.
Study II: esophagus
We applied the permutation test to an esophagus toxicity study. The aim of this study was to relate dose distributions on the esophagus surface to acute esophagus toxicity (AET) in NSCLC patients. We selected 185 NSCLC patients treated at the Netherlands Cancer Institute (NKI) from 2008 to 2010 with concurrent chemotherapy combined with IMRT. The RT dose was 66 Gy in 24 fractions. The concurrent chemotherapy included daily low-dose cisplatin. AET was scored according to the Common Toxicity Criteria 3.0. Toxicity was scored weekly from baseline until 3 weeks after RT. Afterwards, patients were checked every 2 months or more frequently if necessary. Of the 185 patients, 76 had no or grade 1 AET, 67 developed grade 2 and 42 had grade 3; grade 4 or 5 AET did not occur. The delineated esophagus from the planning CT and the planned dose distributions were retrospectively collected for each patient, allowing a 2D esophagus surface dose map (ESDM) to be computed. For each patient, dose was sampled on every slice of the CT scan (3 mm thickness) at 36 fixed orientations from the center to the delineated esophagus contour, covering 0 to 360 degrees in 10 degree increments. The 0 degree angle was always chosen in the Right-Left direction. (In our experience the esophagus is always star shaped, i.e. the full contour can always be seen from a single centreline.) The same sampling procedure was repeated for all m slices on which the esophagus was delineated. As a result, the ESDM contains m × 36 pixels for every patient (m varies from patient to patient). The ESDMs of all patients were registered such that the pixel with the highest dose is in the center of the 2D dose map, allowing translations along and rotations around the length of the esophagus. The choice of this mapping was based on the assumption that the length and the circumference of the high dose region on the esophagus surface are associated with AET, irrespective of its anatomical location.
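The ESDM registration described above (highest-dose pixel moved to the center, using a translation along the esophagus and a rotation around it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `center_esdm`, the fixed output length `out_len`, and the NaN padding for slices that a patient does not cover are all assumptions, since the text does not say how maps of different lengths m are combined.

```python
import numpy as np


def center_esdm(esdm, out_len):
    """Return an (out_len, n_angles) map with the highest-dose pixel centered.

    esdm    : (m, n_angles) surface dose map, one row per CT slice
    out_len : number of rows of the common output grid (assumed parameter)
    """
    esdm = np.asarray(esdm, dtype=float)
    n_slices, n_angles = esdm.shape
    row, col = np.unravel_index(np.argmax(esdm), esdm.shape)

    # Rotation around the esophagus: circular shift of the angular axis so
    # that the maximum lands on the middle column.
    rotated = np.roll(esdm, n_angles // 2 - col, axis=1)

    # Translation along the esophagus: copy the map into a larger canvas so
    # that the maximum lands on the middle row. Uncovered rows stay NaN
    # (assumption: the source does not describe the padding).
    start = out_len // 2 - row
    if start < 0 or start + n_slices > out_len:
        raise ValueError("out_len is too small for this map")
    centered = np.full((out_len, n_angles), np.nan)
    centered[start:start + n_slices] = rotated
    return centered


# Toy example: 5 slices, 36 angles, hottest pixel at slice 1, angle index 30.
toy = np.zeros((5, 36))
toy[1, 30] = 60.0
centered_toy = center_esdm(toy, out_len=11)
print(np.unravel_index(np.nanargmax(centered_toy), centered_toy.shape))
```

Choosing `out_len` as at least twice the longest esophagus (in slices) minus one guarantees that every patient's map fits on the common grid.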
Permutation tests were applied to find differences between grade 0–1 and grade 2–3, and between grade 0–2 and grade 3 AET.
Results of study I
Results of study II
In this paper, we introduced multiple comparison permutation testing for voxel based data mining in radiotherapy and demonstrated the test in two studies. For both studies we were able to locate regions where dose is significantly associated with the outcome. In the prostate study, we were able to provide strong statistical evidence for a dose difference between non-failure and failure patients, confirming a difference located in the obturator region that could be suspicious for subclinical disease. In the esophagus study, the regions associated with grade ≥2 and with grade 3 AET are both consistent with the V50 dose volume histogram (DVH) parameter derived in an earlier study. Grade 2 seems to be caused by high dose (≥50 Gy) and the length/circumference coverage of low dose, while the length/circumference coverage of high dose (around 50 Gy) plays a role in severe AET of grade 3. This result suggests that using the length and circumference parameters may be a more sensitive method to predict AET than DVHs.
A broadly recognized method to address the problem of multiple comparisons is the Bonferroni correction: if n independent hypotheses are tested, each individual hypothesis is tested at 1/n times the significance level that would be used for a single hypothesis. However, this correction is not straightforwardly applicable to voxel maps, since there can be millions of voxels that are highly correlated in space. Hypothesis testing on images was first conducted through parametric random field methods: t-tests are conducted at every voxel and the distributional results for continuous random fields are used to identify regions that are significant. In contrast to these parametric methods, non-parametric permutation tests on voxel maps were introduced by[8, 16]. Two test statistics are often used: a single maxima threshold and the supra-threshold cluster. A single maxima threshold is the Tmax as we used in our study. In earlier neuroimaging work, Tmax was applied in a permutation test to localize the region of visual cortex sensitive to motion on 3D PET imaging and to analyze the order effects in working memory using fMRI. In contrast, a supra-threshold cluster test assesses the size of the connected supra-threshold regions for significance. As a result, the power to localize regions is reduced. Since the goal of data mining in radiotherapy is to localize suspicious regions, we recommend using Tmax as the test statistic.
The incidental dose essentially arises from variations in dose planning for untargeted organs, and this variation is what makes data mining possible. Whether or not we are able to detect a significant dose difference depends on two aspects: 1) the difference in average incidental dose between the two groups and 2) the variance of the incidental dose. Statistically, a larger average incidental dose difference and a lower variance within each group make a true positive result more likely. Registration error, on the other hand, is detrimental. Inaccuracies in the registration, or an inappropriate choice of registration method, could prevent the method from identifying an existing dose-effect relation, thus reducing the power of the statistical test. While it seems less likely that a particular registration procedure, or an inaccuracy therein, generates a false positive result from dose variations that consist only of noise, any dose-effect relation that is subsequently derived should be verified on an independent data set. The required registration accuracy varies with the specific anatomical properties and the expected dose-effect parameters. For instance, if the data mining is conducted in regions with small structures (e.g. head and neck), a sophisticated registration procedure may be required to find significant results. In contrast, if we want to explore a large volume of low gradient dose distributions (e.g. lung), a loose registration may suffice. If we aim to explore dose distributions surrounding one structure, the registration accuracy should be focused on regions close to this structure. Therefore, a registration strategy should be chosen in advance based on the type of hypothesis that we want to explore. Afterwards, significant regions can be anatomically identified and subjected to biological and clinical interpretation. Such considerations can then guide further efforts to derive dose-response relationships through conventional modeling methods.
Permutation testing is a useful tool to explore dose patterns from incidental dose distributions. Instead of analyzing the dose-response effect directly, we intend to use permutation testing as a preliminary step to identify suspicious regions for hypothesis generation. Permutation testing takes multiple comparisons into account by yielding an adjusted p-value and gives visually straightforward suspicious regions. Another advantage of this method is that it is non-parametric. Thus, the test does not depend on the assumption of a Normal distribution, which often does not hold for incidental dose in treatment plans. Permutation testing is practically useful in radiotherapy, especially in an era where adaptive radiotherapy is on the agenda but we still have only limited knowledge about tumor stem cells, microscopic disease, radiosensitivity, etc. Permutation testing helps us to make maximal use of the incidental dose in clinical data to explore dose-response relationships.
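To make the Bonferroni argument concrete, the short sketch below computes the per-test significance level for the Nv = 36890 voxels of the prostate grid, together with the family-wise error rate of uncorrected testing. The function names are hypothetical, and the family-wise formula assumes independent tests, which, as noted above, does not hold for spatially correlated voxels; it only indicates the order of magnitude of the problem.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent tests,
    each carried out at level alpha without any correction."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_voxels = 36890  # Nv of the prostate dose grid (31 x 35 x 34)
print(bonferroni_alpha(0.05, n_voxels))       # very strict per-voxel level
print(familywise_error_rate(0.05, 10))        # already large for 10 tests
print(familywise_error_rate(0.05, n_voxels))  # practically certain
```

Because neighbouring voxels are strongly correlated, the effective number of independent tests is much smaller than Nv, so the Bonferroni level is overly conservative; the permutation distribution of Tmax adapts to this correlation automatically.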
We introduced a permutation test that deals with hypothesis testing on images and illustrated this method in a synthetic dataset, and in clinical datasets from a prostate and an esophagus study. Compared to a voxel-by-voxel based test, the permutation method reduces the rate of false positives. Permutation testing is a useful tool to identify hypotheses for dose-response relationships and tackle the multiple comparisons problem.
Written informed consent was obtained from the patient for the publication of this report and any accompanying images.
Appendix A: the multiple comparisons permutation test
- (i) Compute the average dose difference between the E and N groups in the observed sample (A.1), where the two averages are the average dose values at voxel k for groups E and N, respectively.
- (ii) Permute the labelling of the observed sample and compute the average dose difference. Repeat this process Np times (A.2), where μE,i,k and μN,i,k are the average dose values at voxel k for groups E and N in the ith permuted random sample.
- (iii) Compute the standard deviation for every voxel k over all Np random samples (A.3).
- (iv) Compute the locally normalized dose difference for every voxel in every random sample as well as in the observed sample (A.4, A.5).
- (v) Compute the test statistic Tmax for every resampling as well as for the true labeling sample (A.6, A.7).
- (vi) Compute the adjusted p-value (A.8).
Compare the adjusted p-value with the significance level α. If p < α, reject the null hypothesis; otherwise the null hypothesis cannot be rejected.
Compute the threshold T∗ as the (1 - α) percentile of Tmax,i, i = 1,…,Np. Regions in the normalized dose difference map of the observed sample that are above T∗ show a significant difference between the E and N groups.
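Steps (i) to (vi) and the threshold T∗ can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name and arguments are hypothetical, the dose difference is taken as E minus N, voxels with zero spread are assigned T = 0, and the adjusted p-value is taken as the fraction of permuted Tmax values that reach the observed Tmax.

```python
import numpy as np


def permutation_test_tmax(doses, is_event, n_perm=1000, alpha=0.05, seed=0):
    """Tmax permutation test on registered dose maps.

    doses    : (n_patients, n_voxels) array of registered dose values
    is_event : boolean array, True for group E and False for group N
    Returns the adjusted p-value, the observed T map and the threshold T*.
    """
    rng = np.random.default_rng(seed)
    doses = np.asarray(doses, dtype=float)
    is_event = np.asarray(is_event, dtype=bool)

    def mean_diff(labels):
        # Average dose difference per voxel (E minus N: assumed sign).
        return doses[labels].mean(axis=0) - doses[~labels].mean(axis=0)

    # (i) observed sample, (ii) Np randomly permuted labelings.
    d_obs = mean_diff(is_event)
    d_perm = np.stack([mean_diff(rng.permutation(is_event))
                       for _ in range(n_perm)])

    # (iii) standard deviation per voxel over the permuted samples.
    sigma = d_perm.std(axis=0)
    nonzero = sigma > 0
    safe_sigma = np.where(nonzero, sigma, 1.0)

    # (iv) locally normalized dose difference; T = 0 where sigma is zero
    # (assumption, to avoid division by zero).
    t_obs = np.where(nonzero, d_obs / safe_sigma, 0.0)
    t_perm = np.where(nonzero, d_perm / safe_sigma, 0.0)

    # (v) maximum over voxels for the observed and each permuted sample.
    tmax_obs = t_obs.max()
    tmax_perm = t_perm.max(axis=1)

    # (vi) adjusted p-value (assumed form: fraction of permutations whose
    # Tmax is at least as large as the observed Tmax).
    p_adj = np.mean(tmax_perm >= tmax_obs)

    # Threshold T* as the (1 - alpha) quantile of the permuted Tmax values.
    t_star = np.quantile(tmax_perm, 1.0 - alpha)
    return p_adj, t_obs, t_star


# Synthetic check: 20 N and 20 E patients, 50 voxels, with an extra 5 Gy
# in voxels 10 to 14 for the E group only.
demo_rng = np.random.default_rng(1)
demo_doses = demo_rng.normal(40.0, 1.0, size=(40, 50))
demo_is_event = np.arange(40) >= 20
demo_doses[demo_is_event, 10:15] += 5.0

p_adj, t_obs, t_star = permutation_test_tmax(demo_doses, demo_is_event,
                                             n_perm=500)
significant = np.flatnonzero(t_obs > t_star)
print(p_adj, significant)
```

Taking the maximum of T, rather than of its absolute value, makes the test one-sided: it looks for voxels where group E received more dose than group N. Swapping the group labels, or using the minimum of T, tests the opposite direction.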
The authors would like to acknowledge the Dutch center for translational molecular medicine (CTMM) for supporting the project “Personalized chemo-radiation of lung and head-and-neck cancer” (AIRFORCE).
1. Buettner F, Gulliford SL, Webb S, Partridge M: Using dose-surface maps to predict radiation-induced rectal bleeding: a neural network approach. Phys Med Biol 2009, 54: 5139-5153. 10.1088/0031-9155/54/17/005
2. Buettner F, Gulliford SL, Webb S, Sydes MR, Dearnaley DP, Partridge M: Assessing correlations between the spatial distribution of the dose to the rectal wall and late rectal toxicity after prostate radiotherapy: an analysis of data from the MRC RT01 trial (ISRCTN 47772397). Phys Med Biol 2009, 54: 6535-6548. 10.1088/0031-9155/54/21/006
3. El Naqa I, et al.: Exploring feature-based approaches in PET images for predicting cancer treatment outcomes. Pattern Recognit 2009, 42: 1162-1171. 10.1016/j.patcog.2008.08.011
4. Witte MG, Heemsbergen WD, Bohoslavsky R, Pos FJ, Al-Mamgani A, Lebesque JV, van Herk M: Relating dose outside the prostate with freedom from failure in the Dutch trial 68 Gy vs. 78 Gy. Int J Radiat Oncol Biol Phys 2010, 77: 131-138. 10.1016/j.ijrobp.2009.04.040
5. Castadot P, Lee JA, Parraga A, Geets X, Macq B, Grégoire V: Comparison of 12 deformable registration strategies in adaptive radiation therapy for the treatment of head and neck tumors. Radiother Oncol 2008, 89: 1-12. 10.1016/j.radonc.2008.04.010
6. Goshtasby AA: 2-D and 3-D Image Registration: for Medical, Remote Sensing, and Industrial Applications. Hoboken: Wiley-Interscience; 2005. ISBN 0471649546
7. Chen C, Witte MG, Heemsbergen WD, van Herk M: Significance testing in dose difference maps. In Proc. of XVIth International Conference on the use of Computers in Radiation Therapy. Amsterdam; 2010.
8. Holmes AP, Blair RC, Watson JD, Ford I: Nonparametric analysis of statistic images from functional mapping experiments. J Cereb Blood Flow Metab 1996, 16: 7-22.
9. Edgington ES: Approximate randomization tests. J Psychol 1969, 72: 143-149. 10.1080/00223980.1969.10543491
10. Partin AW, Yoo J, Carter HB, Pearson JD, Chan DW, Epstein JI, Walsh PC: The use of prostate specific antigen, clinical stage and Gleason score to predict pathological stage in men with localized prostate cancer. J Urol 1993, 150: 110-114.
11. Peeters ST, Heemsbergen WD, van Putten WL, Slot A, Tabak H, Mens JW, Lebesque JV, Koper PC: Acute and late complications after radiotherapy for prostate cancer: results of a multicenter randomized trial comparing 68 Gy to 78 Gy. Int J Radiat Oncol Biol Phys 2005, 61: 1019-1034. 10.1016/j.ijrobp.2004.07.715
12. Roach M, Hanks G, Thames H, Schellhammer P, Shipley WU, Sokol GH, Sandler H: Defining biochemical failure following radiotherapy with or without hormonal therapy in men with clinically localized prostate cancer: recommendations of the RTOG-ASTRO Phoenix Consensus Conference. Int J Radiat Oncol Biol Phys 2006, 65: 965-974. 10.1016/j.ijrobp.2006.04.029
13. Kwint M, Uyterlinde W, Nijkamp J, Chen C, de Bois J, Sonke JJ, van den Heuvel M, Knegjens J, van Herk M, Belderbos J: Acute esophagus toxicity in lung cancer patients after intensity modulated radiation therapy and concurrent chemotherapy. Int J Radiat Oncol Biol Phys 2012, 84(2): e223-e228. 10.1016/j.ijrobp.2012.03.027
14. Abdi H: Bonferroni and Sidak corrections for multiple comparisons. In Encyclopedia of Measurement and Statistics. Edited by: Salkind NJ. Sage; 2007.
15. Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ: Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapp 1994, 2: 189-210. 10.1002/hbm.460020402
16. Nichols TE, Holmes AP: Nonparametric permutation tests for functional neuroimaging: a primer with examples. Human Brain Mapp 2002, 15: 1-25. 10.1002/hbm.1058
17. Watson JDG, Myers R, Frackowiak RSJ, Hajnal JV, Woods RP, Mazziotta JC, Shipp S, Zeki S: Area V5 of the human brain: evidence from a combined study using positron emission tomography and magnetic resonance imaging. Cereb Cortex 1993, 3: 79-94.
18. Marshuetz C, Smith EE, Jonides J, DeGutis J, Chenevert TL: Order information in working memory: fMRI evidence for parietal and prefrontal mechanisms. J Cogn Neurosci 2000, 12(Suppl 2): 130-144.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.