Making External Validation Valid for Molecular Classifier Development

PURPOSE Accurate assessment of a molecular classifier that guides patient care is of paramount importance in precision oncology. Recent years have seen an increasing use of external validation for such assessment. However, little is known about how it is affected by ubiquitous unwanted variations in test data because of disparate experimental handling and by the use of data normalization for alleviating such variations. METHODS In this paper, we studied these issues using two microarray data sets for the same set of tumor samples and additional data simulated by resampling under various levels of signal-to-noise ratio and different designs for array-to-sample allocation. RESULTS We showed that (1) unwanted variations can lead to biased classifier assessment and (2) data normalization mitigates the bias to varying extents depending on the specific method used. In particular, frozen normalization methods for test data outperform their conventional forms in terms of both reducing the bias in accuracy estimation and increasing robustness to handling effects. We make available our benchmarking tool as an R package on GitHub for performing such evaluation on additional methods for normalization and classification. CONCLUSION Our findings thus highlight the importance of proper test-data normalization for valid assessment by external validation and call for caution on the choice of normalization method for molecular classifier development.


INTRODUCTION
Precision medicine needs effective quantitative tools for outcome prediction to tailor treatment choices and optimize patient care. 1,2 Molecular profiling technologies herald the promise for developing such tools. [3][4][5][6] However, few of the published molecular classifiers have been successfully translated into clinical use so far. 7-10 Often a classifier was reported to be effective based on cross-validation in the initial publication, but later failed external validation in an independent test data set. [11][12][13][14] We recently showed that some of these failures can be attributed to biased classifier assessment by crossvalidation when training data possess handling effects (namely systematic data variations because of disparate experimental handling of the specimens) and subsequently undergo data normalization. The reason is that normalization can lead to overcompressed data variability and hence overoptimistic assessment of the classification error rate. 15,16 It remains to be elucidated how these failed classifiers were influenced by handling effects and their normalization for test data in external validation.
Although external validation is increasingly used in recent studies of molecular classification, many of these studies failed to report the normalization method used for test data. [17][18][19][20][21][22] Among those that did report, a jumble of methods was used, including median normalization (MN) and quantile normalization (QN), either applied to test data alone or in combination with training data. [23][24][25] To date, there has been no systematic study on the relative performance of the normalization methods for test data.
In this paper, we studied the issues of test data handling effects and normalization in the context of microRNA (miRNA) microarrays. 6,26,27 Our study used two data sets for the same set of tumor samples, which were previously collected at Memorial Sloan Kettering Cancer Center (MSK). [28][29][30] Arrays in one data set were collected with uniform handling to minimize handling effects and balanced array-to-sample allocation to avoid any confounding, whereas arrays in the other data set were collected with nonuniform handling and unbalanced allocation. We then performed resamplingbased simulations using the paired data sets, dubbed virtual rehybridization, to create additional data under various levels of handling effects and biologic signals and different designs for allocating arrays to samples. In this paper, we report our findings from this simulation study. These findings provide critical insights in developing reproducible miRNA classifiers for clinical application.

Empirical Data Collection
A set of 192 untreated primary gynecologic tumor samples (96 endometrioid endometrial tumors and 96 serous ovarian tumors) were collected at MSK from 2000 to 2012. Human tumor tissues of the 192 samples were obtained from participants who provided informed consent and their use in our study was approved by the MSK Institutional Review Board. The samples were profiled using the Agilent Human miRNA Microarray (Release 16.0; Agilent Technologies, Santa Clara, CA), following the manufacturer's protocol. This array platform contains 3,523 markers (representing 1,205 human and 142 human viral miRNAs) and multiple replicates for each marker (ranging from 10 to 40). In addition, it has eight arrays on each glass slide (ie, an experimental block) arranged as two rows and four columns. Two data sets were obtained from the same set of samples using different methods of experimental handling. The first data set (hereafter referred to as the uniformly handled data set) was handled by one technician in one batch with the arrays assigned to tumor samples using blocked random assignment. By contrast, the second data set (hereafter referred to as the nonuniformly handled data set) was handled by two technicians over multiple batches in the order of sample collection; the first 80 arrays were handled by one technician in two batches and the last 112 by a second technician in three batches. More details on data collection can be found in the studies by Qin et al. 28,30 Resampling-Based Simulation As a proof of concept, we used tumor type (endometrial cancer v ovarian cancer) as the outcome variable for classification. The specific steps of the resampling-based simulation are as follows.
1. First, we used the uniformly handled data set to approximate the biologic effects for each sample. Among a total of 3,523 markers on the array, 351 (10%) were significantly differentially expressed (P , .01) between the two tumor types. To be consistent with the typical signal strength in a molecular classification study, [31][32][33] we halved the between-group differences of biologic effects for the 351 significant markers (by deducting a half of the ovarian-versus-endometrial between-group differences from their levels of expression in ovarian samples), reducing the number of significant markers to 63 (2%). The resulting biologic effects served as virtual samples. They were split randomly in a 2:1 ratio into a training set (n = 128) and a test set (n = 64), balanced by tumor type. 2. Second, we used the difference between the two arrays (one from the uniformly handled data set and the other from the nonuniformly handled data set, subtracting the former from the latter) for the same samples to approximate the handling effects for each array in the nonuniformly handled data set. These handling effects served as virtual arrays. They were split nonrandomly to a training set (n = 128, the first 64 arrays and the last 64 arrays) and a test set (n = 64, the middle 64 arrays). By definition, handling effects are systematic effects that are not reproducible in different data sets; therefore, virtual arrays are split nonrandomly so that they are not comparable between training data and test data. The magnitude of handling effects in training data and test data was then adjusted by adding a constant to training data and multiplying by a factor for test data. We used three settings for the constant and the multiplication factor: (1) 2 and 2, (2) 1 and 1.5, and (3) 0.5 and 1.25, mimicking the scenarios when handling effects in test data were (1) highly, (2) moderately, and (3) slightly different from those in training data, respectively. 3. Third, training data were simulated through virtual rehybridization by assigning virtual arrays to virtual samples following a partial confounding design or a stratification design, and then summing the biologic effects for a sample and the handling effects for its

CONTEXT Key Objective
External validation is increasingly used for assessing the accuracy of a molecular classifier, yet little is known about test-data normalization for removing ubiquitous unwanted variations because of disparate experimental handling.

Knowledge Generated
We showed that data normalization mitigates the negative impact of unwanted variations to varying extents depending on the specific method used. In particular, frozen normalization methods for test data outperform their conventional forms.

Relevance
Our findings highlight the importance of proper test-data normalization for valid assessment by external validation and call for caution on the choice of normalization method for molecular classifier development.
assigned array. A partially confounding design assigned 90% of the first 64 arrays and 10% of the last 64 arrays to ovarian samples and the rest of the arrays to endometrial samples. A stratification design assigned arrays in each batch (ie, each of the experimental batches in the collection of the nonuniformly handled data set) to the two tumor groups in equal proportion. 4. Finally, test data were simulated also through virtual rehybridization using a partial confounding design or a stratification design similar to training data. Note that, here, the partial confounding design assigned 90% of the first 32 arrays and 10% of the last 32 arrays were assigned to endometrial samples and the rest of the arrays to ovarian samples. As a reference, we also examined test data that comprised only biologic effects (without adding any handling effects).
One hundred simulation runs were generated for each scenario of handling-effect pattern and array-allocation design.

Preprocessing and Analysis of the Simulated Data
The analysis for each simulated training data set followed three main steps: (1) preprocessing training data and test data; (2) building a classifier using the preprocessed training data; and (3) assessing the error rate of the classifier using the preprocessed test data. Further details are provided as below.
Data preprocessing. Preprocessing of both training data and test data consisted of three steps: (1) log2 transformation, (2) across-sample normalization, and (3) markerreplicate summarization using the median. 34 Training data were normalized with QN as the primary approach and with MN as an alternative approach. Test data were normalized by one of the six methods: (1) no normalization (NN), (2) MN, (3) QN, (4) frozen MN (fMN), (5) frozen QN (fQN; ie, mapping the empirical distribution of each individual testset sample to the frozen empirical distribution of the normalized training data), and (6) pooled QN (pQN; ie, apply QN after pooling training data and test data). 24 Classifier building. We used Prediction Analysis for Microarrays as the primary approach for classification and Least Absolute Shrinkage and Selection Operator (LASSO) as an alternative approach. 35,36 R packages pamr and glmnet were used for applying these methods, with the tuning parameters chosen by five-fold cross-validation.
Classifier assessment. Classification accuracy was measured using the misclassification error rate (ie, the proportion of samples that were misclassified). The final model of each classifier was built using the entire training data and applied to predict the group label for each sample in test data. The predicted groups were compared with their true groups for assessing the misclassification error rate.

Performance Measure of Test-Data Normalization
We denote the error rate based on test data that possessed handling effects as Error_HE and that based on test data free of handling effects as Error_noHE. To gauge the performance of a normalization method for abating the impact of test-data handling effects, we compared Error_HE against Error_noHE. A normalization method is effective if it removes handling effects so well that (1) its Error_HE approximates the corresponding Error_noHE and (2) its Error_HE is small.
All analyses in this paper were performed using R 3.5.0.

RESULTS
We present here the simulation results using QN for training data and Prediction Analysis for Microarrays for classifier development, under each of the three levels of signal-tonoise ratio and the four combinations of array-to-sample allocation design. Additional results using MN for training data and LASSO for classification are provided in the Data Supplement.
Results When Handling Effects Were Highly Different Between Training Data and Test Data Figure 1A shows the simulation results when handling effects in test data were highly different from those in training data. Across all four array-allocation designs, the error rate based on test data with handling effects (ie, Error_HE) ranged from 0.283 to 0.495 after normalization, compared with that without normalization, 0.465. The exact level of error rate depended on the specific normalization method used: fQN (0.283) and fMN (0.284) were the best performers, QN (0.493) and MN (0.495) were the worst, and pQN (0.375) was in the middle. These error rates were in nearly perfect agreement with these for handling-effect-free test data (ie, Error_noHE) for QN and MN, in their regular and frozen forms; the agreement for pQN was slightly worse. These observations suggested that, with this pattern of handling effects and design for array allocation, fQN and fMN were the best methods for test-data normalization as they not only effectively removed the negative impact of handling effects but also made test data more comparable to training data, leading to smaller error rates; QN and MN were the worst performers as they led to an error rate even worse than NN.
The use of stratification for training data reduced the small difference in Error_HE between fQN and fMN and that between QN and MN (Figs 1B and 1D), and its use for test data alone brought Error_HE for pQN to closer agreement with Error_noHE (Fig 1C), indicating a marginal benefit for balanced design in this simulation scenario.    Also similar to the pervious scenario, the use of stratification design for training data again reduced the small difference in Error_HE between fQN and fMN (Figs 2B and 2D), and its use for test data alone led to better agreement with Error_noHE for pQN, QN, and MN ( Fig  2C).   Fig 3C).

Additional Simulation Results for Alternative Methods of Training-Data Normalization and Classification
We performed additional simulations using MN as an alternative method for training data normalization and using LASSO as an alternative method for classification. We observed similar results in terms of the relative performance of test-data normalization methods, the benefit of balanced study design, and the effect of various patterns of handling effects in training data and test data (Data Supplement).
Furthermore, we generated biologic effects parametrically using a normal distribution for each miRNA with its mean and standard deviation estimated from the empirical data. The findings remain the same qualitatively, whereas the error rates decreased across the board, possibly because of the lack of between-marker correlation when simulating the data (Data Supplement).

Software Development for Reproducing Our Study and Examining Additional Methods
We encourage interested researchers to replicate our study and explore additional methods for data normalization and classifier development. Toward this end, we developed an R package containing the paired data sets and another R package implementing the resampling-based simulation study. These two packages, named PRECISION.array.DATA and PRECISION.array, are deposited at GitHub. 37 The data can also be accessed at Gene Expression Omnibus via a SuperSeries record (GSE109059). The PRECISION.array package not only has implemented the methods for normalization and classification reported in this paper, but also allows additional methods specified by the user.

DISCUSSION
In this paper, we investigated the important yet understudied problem of test-data normalization for making external validation valid. Using paired data sets and resampling-based simulations, we showed that (1) handling effects in test data can lead to biased classifier assessment and (2) test-data normalization can mitigate the bias but to varying extents depending on the method. In particular, frozen versions of QN and MN outperformed the conventional versions, especially when the pattern of handling effects is highly different between training data and test data; conventional MN and QN of test data offer limited benefits compared with NN in our simulations and can even be worse under some scenarios of handling effects.
Our findings suggest that improper choice of normalization methods for test data in published studies may have undermined validation efforts for molecular classifiers and disproved some actually useful classifiers because of improper test-data normalization. For example, using the last 64 samples in the nonuniformly handled data as test data for assessing a classifier built on the first 128 samples, the error rate was 0.391 for conventional QN but 0.297 for fQN, whereas the error rate based on the uniformly handled data of the test samples was 0.281 and 0.266, respectively. For those classifiers that were successfully validated, inadequate description of the methodology used can hamper both efforts to replicate these studies and application of the classifiers to future samples in clinical practice. For the purpose of developing accurate and reproducible molecular classifiers, we recommend using (1) uniform experimental handling in data collection to mitigate handling effects, (2) frozen normalization of quantiles or medians for test data when either training data or test data possess handling effects, and (3) comprehensive description of the study design and analysis methods in publication, ideally accompanied by software code to allow faithful replication and application.
For proof of concept, we report the simulation results for a limited number of simulation scenarios and statistical approaches for normalization and classification. We have developed two R packages, PRECISION.array.DATA and PRECISION.array, for interested researchers to use for further exploring this topic with additional simulation scenarios and statistical methods. Our simulation approach makes a working assumption that handling effects are additive to biologic effects. This assumption has been considered reasonable for microarray data and adopted in publications on microarray data normalization and analysis. 38,39 To the best of our knowledge, the issue of data normalization for external validation has not been studied before. Our findings fill a critical knowledge gap in the advancement of developing reproducible classifiers for clinical use and speak to the importance of proper methodology and sufficient reporting. 40 AFFILIATION