Predicting multi-class responses to preoperative chemoradiotherapy in rectal cancer patients

Background Preoperative chemoradiotherapy (CRT) has become a widely used treatment for improving local control of disease and increasing survival rates of rectal cancer patients. We aimed to identify a set of genes that can be used to predict responses to CRT in patients with rectal cancer. Methods Gene expression profiles of pre-therapeutic biopsy specimens obtained from 77 rectal cancer patients were analyzed using DNA microarrays. The response to CRT was determined using the Dworak tumor regression grade: grade 1 (minimal, MI), grade 2 (moderate, MO), grade 3 (near total, NT), or grade 4 (total, TO). Results Top ranked genes for three different feature scores such as a p-value (pval), a rank product (rank), and a normalized product (norm) were selected to distinguish pre-defined groups such as complete responders (TO) from the MI, MO, and NT groups. Among five different classification algorithms, supporting vector machine (SVM) with the top 65 norm features performed at the highest accuracy for predicting MI using a 5-fold cross validation strategy. On the other hand, 98 pval features were selected for predicting TO by elastic net (EN). Finally we combined TO- and MI-finder models to build a three-class classification model and validated it using an independent dataset of rectal cancer mRNA expression. Conclusions We identified MI- and TO-finders for predicting preoperative CRT responses, and validated these data using an independent public dataset. This stepwise prediction model requires further evaluation in clinical studies in order to develop personalized preoperative CRT in patients with rectal cancer. Electronic supplementary material The online version of this article (doi:10.1186/s13014-016-0623-9) contains supplementary material, which is available to authorized users.


Background
Treatment strategies for patients with rectal cancer have changed substantially in recent decades. Historically, postoperative chemoradiotherapy (CRT) was considered to be first-line therapy for stage II and III rectal cancers. However, preoperative CRT is now considered to be optimal therapy for locally advanced rectal cancer due to improved local control, reduced toxicity, and increased rates of sphincter preservation [1,2].
Preoperative chemoradiotherapy (CRT) has been widely used as the treatment of choice for locally advanced rectal carcinomas [3,4]. Radiotherapy works by inhibiting cell proliferation and inducing apoptosis in vitro, and inhibiting tumor growth in vivo [5,6]. However, responses to radiotherapy differ among individuals with similar histologic backgrounds, and the essential determinants of these responses have yet to be studied. Thus, identifying the key factors that predict responses to radiotherapy before treatment could be helpful in that those patients predicted to have poor responses can undergo the initial surgery without preoperative CRT.
To date, several studies have investigated the response of rectal cancer to radiotherapy using gene expression microarrays [7][8][9][10]. In one report, pre-therapeutic biopsies from 30 patients with locally advanced rectal carcinomas were profiled. By grouping the radiation responses of patients defined as T-level downsizing, a set of 54 genes was found to be differentially expressed between responders and nonresponders. Diagonal linear discriminant analysis (LDA) of the 54 genes was applied to predict responses to CRT. Using a leave-one-out cross validation (LOOCV) strategy, responses for 83 % of patients were correctly predicted.
Watanabe et al. [8] identified 33 genes that were differentially expressed in responders and non-responders as determined by histopathologic regression grading of surgically resected specimens from 52 rectal cancer patients. The prediction accuracy in this study was 88.6 % for the 17 test samples using the k-nearest-neighbor algorithm. Similarly, Kim et al. [7] and Rimkus et al. [10] reported 87 and 78.5 % prediction accuracies for 46 and 43 patients, respectively.
It is important to note, however, that there is very little overlap in the genes included in these studies, even when high prediction accuracies were reported [11]. This inconsistency may be due to use of different criteria in defining a response, small sample size, or different origins of responses. Additionally, all of the previous studies formulated models that predicted a two-class classification system by applying a specific algorithm with a simple feature selection procedure using a p-value threshold. Taken together, these issues indicate that no optimal system for gene identification has yet been developed for incorporation into clinical practice.
Our objective was to build a clinically feasible model that predicts multi-class responses to CRT. A schematic representation of whole analysis flow is shown in Table 1. We classified responses of patients according to Dworak grade using 77 patient samples, which is the largest sample size reported in the published literature. We then defined three different feature scores to identify a novel set of genes for predicting responses. Various classification algorithms were applied to this set of genes to build a complete regression (TO) prediction model and a minimal regression (MI) prediction model. From the most accurate models, we established a novel strategy for predicting multiple responses to CRT by applying TO and MI models sequentially. To date, this is the first trial to predict multi-class responses to CRT. Finally, we validated our model using a published dataset from an independent source.

Patient samples and response classification
This study was approved by the Institutional Review Boards (IRBs) at Samsung Medical Center (IRB No. SMC 2009-10-067). Written informed consent for participation in this study was obtained from the patient. A total of 77 rectal cancer patients who underwent preoperative CRT were included in this study. Various clinical information such as UICC stage and Dworak grade are summarized in Table 2 and more detailed information is available in  supplementary Additional file 1: Table S1. Responses to CRT were determined according to Dworak tumor regression grade. Tumor regression grades were classified into five groups: grade 0 (no regression), grade 1 (minimal regression, MI), grade 2 (moderate regression, MO), grade 3 (near total regression, NT), and grade 4 (total regression, TO). A total of 10, 36, 13, and 18 patients were classified as MI, MO, NT, and TO, respectively. Tumor stage was determined according to the guidelines set forth by the International Union Against Cancer (UICC).

RNA isolation and microarray procedures
Total RNA was extracted from tumor tissue using TRIzol reagent (Invitrogen, Carlsbad, CA), and the collected RNA was purified using RNeasy mini kits (Qiagen, Valencia, CA). The purity and concentration of RNA were determined using a Bioanalyzer (Agilent Technologies, Santa Clara, CA). The RNA was amplified and labelled according to the Affymetrix GeneChip Whole Transcript (WT) Sense Target Labelling protocol. The resultant labeled cDNA was hybridized to Affymetrix Human Gene 1.0 ST arrays and scanned. The R program from CEL file preprocessing was used for all statistical analyses. The expression data obtained from each microarray was normalized using a Robust Multislide Array (RMA) normalization algorithm. Raw and expression microarray datasets are available upon request.

Statistical analysis Feature scores
We first placed the patients into groups labeled TO or other, and MI or other. We then used a two-sample Welch's t-test with unequal variances to determine which genes were differentially expressed between groups. The feature score (FS) was defined as shown below. For genes i, p i , and d i the p-value and effect size were obtained from predefined group comparisons. P-value based ('pval'), rank product based ('rank'), and normalized value product based ('norm') feature scores were calculated.

Classification models
To build prediction models, we applied multiple classification algorithms using varying numbers of features based on 'pval' , 'rank' , and 'norm' scores. Samples from 77 patients were divided into a training set and a test set. We applied 5 different algorithms including support vector machine (SVM), random forest (RF), elastic-net (EN) [12], linear discriminant analysis (LDA) and k-nearest neighbor (kNN) [13] with k = 1, 3, and 5. The TO and the MI predictors were built by applying all algorithms using varying numbers of features with different feature scores to two different classification problems (i.e., TO vs. other and MI vs. other).

Model selection
We consider iid data ( and Y 1 denotes values of a specific class in some finite set Y. A classification rule is a function: h : X → Y. Our goal was to choose the optimal classification rule or classifier ĥ to minimize the training error. We adopted a 5-fold cross-validation (CV) scheme to estimate the prediction accuracy for each classification method and select the most accurate model. Note that the CV scheme used in this work is not a conventional CV approach, in which only training data is used for selecting features and training a classifier. Since the number of sample was not large enough, we evaluated the feature scores with whole dataset to reliably select the important features. This might cause biased estimate of the true error of the model prediction. Thus we applied the final model to an independent publicly available dataset.

Sequential multiclass prediction
Multiclass (minimal, moderate, or total response) prediction was conducted by applying the best MI and TO prediction models sequentially. With SVMs, reducing the single multiclass problem into several binary classification problems is likely a better approach [14]. We first applied the TO model to an individual to predict whether the individual was classified into that group. If not, the MI model was applied and the final conclusion was made according to the MI prediction result.

Application to independent published dataset
An independent test set was obtained from the previous study which classified the patient responses using Dworak criteria [7]. Thus, we can apply our model without a response classification criteria compatibility issue. Because of the platform difference, however, direct application of our model to data reported in [7] had two limitations: the limited number of overlapping features and the different expression scale. To resolve these issues, we first matched the features (genes) using "HGNC gene symbols" and subsequently applied quantile normalization of these genes using average values of the same ranked genes in our dataset. We identified 38 (of 65) and 32 (of 98) common genes for the MI and TO models. Since only 70 of 163 of the best model features were available, the best model found using 77 samples could not be directly applied to the dataset reported in [7]. Therefore, we rebuilt the model with common genes to better predict responses to CRT.

Results
Three feature score types and their characteristics To identify molecular features of CRT responses, we analysed whole gene expression profiles of 77 rectal cancer biopsy specimens. The overall scheme for a multi-class prediction model is summarized in Table 1.
We divided patients into TO versus other, or MI versus other. For each comparison, we obtained p-values and effect sizes for all genes. From these two values, we defined three different feature scores including p-value, normalized product, and rank product. Characteristics of the features with different FS criteria used for TO prediction are shown in Fig. 1. P-value based ('pval') features were selected using the lowest 'pval' criterion. Unlike the 'pval' score, the normalized product ('norm') and rank product ('rank') scores with higher p-values (less significant) were selected if they also had larger effect sizes. Thus, in selecting features with higher 'norm' and 'rank' scores, both the effect size and the p-value were considered. A similar pattern was observed for features predictive of MI.
To investigate the effect of prediction performance of three FS criteria, we evaluated average accuracy across the models with different number of features for each classification algorithm (Table 3). Among these three FS measures, 'pval' showed the superior to other measures. For MI prediction, 5 out of 7 different methods with 'norm' FS type were found to be the best performer. For TO prediction, however, all the highest values were observed with 'pval' FS. In addition, SVM with 'pval' FS showed the most stable and the highest average accuracy for both MI and TO prediction (Table 3).

TO and MI prediction models
To find the most accurate TO and MI prediction models, we conducted various classification analyses using multiple features. Rather than using the leave-one out cross validation (LOOCV) that many similar studies have used, we adopted a 5-fold cross-validation strategy to prevent overfitting (i.e., making the model less fit to the sample data but a better fit with new data). Classification accuracy (1error) was evaluated using five different classification algorithms: support vector machine (SVM), random forest (RF), elastic net (EN), linear discriminant analysis (LDA), and k nearest neighbor (kNN, with k=1, 3, and 5).
As can be seen in Fig. 2, in most cases, prediction performances were consistently around 85 % regardless of the classification algorithm, feature score type, or number of features. LDA, however, had poor prediction performance with an accuracy of approximately 50 % when more than 30 features were used (Fig. 2). Since some of genes may be highly correlated, collinearity among an increasing number of features may impede the performance of the LDA algorithm.
The best accuracy rates for MI and TO predictions were attained using SVM and EN, respectively. Using the SVM classification model, the prediction accuracy for MI was 100 % with 65 'norm' feature score genes (Fig. 2). For TO predictions, the EN algorithm achieved   a 94.8 % accuracy rate with 98 'pval' feature score genes (Fig. 2). Note that SVM showed the most stably high prediction rate performances regardless of the number of features used.

Gene function annotation of the best TO and MI predictors
To investigate the biological functions of the genes that are predictive of TO and MI, we conducted gene-set enrichment analysis using DAVID [15]. With the gene ontology as a gene set, we observed that the genes predictive of TO were enriched in the regulation of cell migration, cell maturation, and cell death, whereas the genes predictive of MI were involved in protein localization and protein transport (Table 4). With the pathway as a gene set, N-Glycan biosynthesis and Oocyte meiosis pathways were enriched by the genes predictive of TO and MI, respectively (Table 4). These two pathways are associated with responses to irradiation [16,17]. N-linked glycosylation in particular enhances the effects of radiation therapy and is advantageous for inhibition of tumor growth [17]. The whole list of genes and the results of gene set analysis can be found in Additional file 2: Table S2.

A sequential approach model for multi-class prediction and external validation
We next examined whether our two best models are practically useful in predicting multi-class responses to radiotherapy. To this end, we combined TO-and MIfinder models for a three-class prediction (Fig. 3). We used a sequential approach in applying the TO and MI models to both our dataset and validated the performance using a previously published dataset. Since not all of the genes used in our study were available in the dataset obtained in [7], we used overlapping genes included in both datasets. We found that the best TO prediction algorithm was EN. However, only a third of the previously described TO predictive genes were included in the dataset, and in cases in which there were fewer feature numbers, SVM was more accurate than EN (Fig. 2). Thus, our sequential approach model was designed to use the expression of 70 matched genes including 32 genes for TO prediction and 38 genes for MI prediction. The sensitivities for predicting TO, MO (including NT), and MI using the sequential approach model among the 77 samples were 94.4, 100, and 90 %, respectively (Table 5). We defined sensitivity for each class as a correct prediction. To further test the validity of our model, we also applied it to an independent dataset [7]. The sensitivites for TO, MO and MI prediction were 86.6, 100, and 64.3 %, respectively (Table 5). We also confirmed that a sequential approach in the reverse order (MI prediction first, then TO prediction) did not change the prediction results in either dataset.

Discussion
The objective of this study was to generate a clinically feasible model that predicts multi-class responses to CRT. In the past, one of the difficulties that impeded the development of such a model was the use of non-overlapping sets of genes across different studies, which may due to different definitions of response to CRT or the genetic diversity of rectal cancer [11,18]. We sought to eliminate this issue by increasing our sample size and using a Dworak regression grade for distinct classification according to our three FS criteria. To date, this is the largest study that examines the genetic diversity of rectal cancer. The lack of independent prospective validation has also been an issue in previous studies, as many have used a LOOCV strategy to validate prediction accuracies due to small sample sizes (less than 50).
However, LOOCV may strengthen prediction accuracy by over-training with n -1 samples from the total sample pool. In our study, we applied a 5-fold cross-validation strategy to the 77 samples to achieve a higher degree of independence between training and test data. It is important for the clinician to classify the patient into the appropriate category, such as total response, intermediate response, or no response. To achieve this, we established the multiclass prediction model by adopting a novel sequential application of two separate prediction models. Using Dworak regression grade, the results of our multi-class prediction  Table 5) was obtained using the previously published independent dataset that included 46 patients. While these data are encouraging, prospective clinical trials using homogenous cohorts of patients (11) are needed to confirm the feasibility of using this model in clinical practice. Some limitations are worth noting. Although our model was supported by independent dataset, conventionally cross-validation approach, in which features are selected on a training set, was not applied in our study due to a small sample size. Also we built our model based on univariate feature selection method without considering clinical predictors. Future work should therefore include follow-up work designed to evaluate whether our model and our approach are retained in more datasets and also whether the joint analysis of multiple genes, clinical variables, and interaction among them are necessary to be used to improve prediction performance.

Conclusions
Historically postoperative chemoradiotherapy (CRT) was considered to be the first-line therapy for stage II and III rectal cancers. However, the preoperative CRT is now considered to be the optimal therapy regimen for locally advanced rectal cancer due to its improved local control, reduced toxicity, and increased rate of sphincter preservation. Our study established a clinically practical multiclass prediction model by adopting a novel strategy that applies two separate prediction models (MI and TO predictor) sequentially to a patient to predict the response. For promising clinical practice, we validated our model in a published dataset, which is completely independent dataset from ours. This study suggests a clinically plausible prediction model that correctly infers the preoperative CRT response of patients with high accuracy based on 163 gene signatures we identified. Larger prospective trials will be needed to confirm harder the validity of the present scheme.   Fig. 3 Sequential multi-class prediction. To predict the preoperative CRT response of a patient, TO predictor is performed and answers whether the patient is total response group or not. If yes, CRT is conducted, or MI predictor is performed and predicts whether the patient will show minimal response. According to the result of this step, clinician can decide a proper treatment of the patient