Multiplex serum biomarker assessments: technical and biostatistical issues

Background Identification of predictive and prognostic biomarkers for patients with disease and undergoing different therapeutic options is a very active area of investigation. Many of these studies seek biomarkers among circulating proteins accessed in blood. Many levels of standardization in materials and procedures have been identified which can impact the resulting data. Methods Here, we have observed unexpected variability in levels of commonly tested analytes in serum which were processed and stored under standardized conditions. We have identified apparent changes in cytokine, chemokine and growth factor levels detected by multiplex Luminex assay in melanoma patient and healthy donor serum samples, over storage time at -80°C. Controls included Luminex kit standards, multiplexed cytokine standards and WHO cytokine controls. Data were analyzed by Wilcoxon rank-sum testing and Spearman's test for correlations. Results The interpretation of these changes is confounded by lot-to-lot kit standard curve reagent changes made by a single manufacturer of Luminex kits. Conclusions This study identifies previously unknown sources of variation in a commonly used biomarker assay, and suggests additional levels of controls needed for identification of true changes in circulating protein levels.


Background
To improve the clinical efficacy of immunotherapies and our ability to stratify patients rationally for therapeutic intervention, biomarkers are critical to progress. The FDA's Critical Path prioritizes development of biomarkers, including a focus on aspects of: Biospecimens, Analytical Performance, Standardization and Harmonization and Bioinformatics. Accurate biomarkers offer the prospect for earlier diagnosis, improved precision of application of expensive and toxic therapies on the optimal patient populations, monitoring disease progression and therapeutic benefits as well as accelerating drug development and discovery. Guidelines for incorporation of biomarker studies in early clinical trials of novel agents have been published [1].
There is a critical need for development and validation of biomarkers to identify patients who can benefit from a particular form of immunotherapy. Only a fraction of patients benefit from IFN-α treatment [2], only a fraction of patients can achieve durable regressions in response to antigen vaccination [3], or antibody therapies, and we do not yet know the mechanisms responsible for therapeutic benefit. Despite substantial efforts from many groups, we do not know which parameters of immune response (and which assays used to assess these parameters) yield optimal results for efficacy analysis [4][5][6][7]. A major reason for this has been that objective clinical response rates are often below 10%, confounding the measurement of significant correlations between biomarkers and clinical responses in studies of modest size. Another important issue is that assay results may depend on biological specimen handling before assessment, and on methodological differences in complex, high throughput assays.
A number of studies in melanoma have identified candidate biomarkers of response to therapy. These range from circulating cytokines and growth factors [8,9], gene expression profiles in tumors [10], circulating tumor cells [11], serum autoantibody profiling [12] and tumor specific T cell IFN-γ production [13] to molecular signaling pathways in tumors [14] and the nature of tumor infiltrating cells [15]. The vast majority of candidate biomarkers have not yet achieved routine clinical use due to lack of reproducibility, need for new technology and equipment, need for high quality tumor samples or high cost. The relative ease of collecting, processing, storing and shipping blood has made it a common resource for biomarker testing.
Several reports have identified phenotypic and functional changes in blood cells and serum components when the blood is held for hours or days and at different temperatures before processing [16][17][18]. These time-dependent and temperature-dependent effects should be controlled for to the extent possible before blood processing. Standardized processing procedures by trained and competency-tested personnel can also improve immunologic assay data consistency [19]. In addition, use of freezers for sample storage that are monitored for temperature stability and that have 24-hour alarm response eliminates concerns that samples might undergo freeze-thaw cycles or be otherwise compromised by temperature changes during storage. Many of these central laboratory procedures for processing, storage and equipment maintenance are mandated by accreditation groups such as CLIA and FACT, and are described in resources from CLSI [20][21][22].
During an investigation of biomarkers of prolonged survival after IFN-α treatment in banked melanoma patient serum samples, we discovered a number of both technical and biostatistical analysis issues [23]. Our preliminary results identified a large number of serum cytokines that appeared to correlate significantly with survival. However, further dissection of the data revealed a number of technical issues that made interpretation of the data impossible.
Here, we have performed a time course analysis of cytokines, chemokines and growth factors measured in the banked serum of healthy donors and melanoma patients stored for various intervals, and analyzed by multiplex Luminex assay. We find that a number of these analytes appear to be unstable during storage. We have also tested several aspects of the Luminex assay performance and identified a number of concerns with these multiplexed assays. Biostatistical tests indicate that despite several layers of procedural standardization and levels of controls, reliable multiplexed cytokine and chemokine determinations may be compromised by length of time in storage and/or by the changes regularly made by assay kit manufacturers to different lots and the analyte standards included. These results raise concerns about serum biomarker studies and suggest that additional controls may be required to confidently compare levels over time and between lots of reagents from the same manufacturer.

Study subjects
All serum samples were obtained after written informed consent, and under IRB approved protocols of investigation at the University of Pittsburgh. The samples received in 2005 were obtained from 23 patients at two clinical sites (Pennsylvania and Indiana). The UPCI #96-099 banking protocol was utilized for the five 2010 melanoma patient sera tested. The UPCI #04-001 healthy donor blood collection protocol was used for the blood obtained from 10 healthy donors in 2010.

Blood processing and banking
For serum collection, red top Vacutainer tubes (no anticoagulant; Becton Dickinson #6430), provided by our laboratory in kits, were used. Upon arrival in the lab, the samples were checked for proper identification, given accession numbers, and either processed immediately or (if received after 4 pm) refrigerated at 4°C for processing the next morning. All samples were processed within 24 hours, including those drawn at external sites and shipped at ambient temperature overnight in insulated shipping containers. All processing was performed by technologists who received the same training, and the laboratory SOP #0108 was followed. Technologists also undergo annual competency training. Samples were centrifuged for 10 min at 2,500 rpm in a refrigerated centrifuge at 4°C, then the serum was aliquoted into polypropylene freezer vials at 1.1 mL per vial and immediately placed in a -80°C freezer. All samples were stored in a monitored freezer until testing; freezer temperatures did not rise above -55°C (during brief periods of high use). Samples were thawed before testing, and repeated testing was performed on separate aliquots to eliminate variability from freeze-thaw cycles. The laboratory is certified under the Pennsylvania Department of Health, College of American Pathologists (CAP) and Clinical Laboratory Improvement Amendments (CLIA for Histocompatibility and General Immunology). The laboratory is registered with the FDA, and maintains a facilities master file (BB-MF-12244). The exploratory Luminex assay reported here is not used for clinical decision making, and is not a CLIA-certified assay.

Luminex assay and controls
The Luminex kits were obtained from the same manufacturer, which changed ownership during the period of the study (BioSource, Invitrogen, Life Technologies). Assays were performed only on serum samples that had been stored at -80°C. Serum samples were thawed in a refrigerator overnight (healthy donor controls, < 12 hours total time) or at room temperature the day of the assay (patient samples), clarified in a microfuge for 10 min at 1,000 g, then diluted with the assay diluent provided, per the assay manufacturer's instructions. Healthy donor and control samples were run in duplicate, but large numbers of patient sera were run in singlets. The same trained technologist performed all of the assays reported herein, according to the same laboratory SOP (#0037). The software used for all assays was the BioPlex System BioPlex Manager 4.0, which uses 5-parameter logistic regression. Each sample acquired ≥ 100 bead events, per the manufacturer's instructions. Analytical sensitivity was calculated based on two standard deviations from the background MFI of the standard curve. There were no changes in the antibodies used for the analytes of interest reported here, and the standards were benchmarked in the same way over the time period tested here. R&D QC controls (R&D Systems QC02) were reconstituted with assay diluent from the Hu Extracellular buffer kit LHB0001 (BioSource). Each lot provides expected values for several commonly tested cytokines (as measured by R&D Systems ELISA assays). Additional kit details are presented in Additional File 1, Table S1.
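The 5-parameter logistic (5PL) fit and the two-standard-deviation sensitivity rule described above can be sketched as follows. This is only an illustration: the parameterization shown is one common form of the 5PL curve and is not necessarily the one implemented in BioPlex Manager, the cutoff is assumed to be the mean background MFI plus two standard deviations, and all parameter values and background readings are hypothetical.

```python
import statistics

def five_pl(conc, a, b, c, d, g):
    """Predicted MFI at a given concentration (one common 5PL form).

    a: response at zero concentration, d: response at saturation,
    c: mid-range concentration, b: slope factor, g: asymmetry factor.
    """
    return d + (a - d) / (1.0 + (conc / c) ** b) ** g

def inverse_five_pl(mfi, a, b, c, d, g):
    """Concentration that yields the given MFI (MFI strictly between a and d)."""
    ratio = (a - d) / (mfi - d)
    return c * (ratio ** (1.0 / g) - 1.0) ** (1.0 / b)

def sensitivity_cutoff_mfi(background_mfi, n_sd=2.0):
    """Assumed rule: mean background MFI plus n_sd sample standard deviations."""
    return statistics.mean(background_mfi) + n_sd * statistics.stdev(background_mfi)

# Hypothetical fitted parameters and background (zero-standard) wells.
PARAMS = dict(a=20.0, b=1.2, c=500.0, d=25000.0, g=0.8)
background = [18.0, 20.0, 22.0]

cutoff = sensitivity_cutoff_mfi(background)
lowest_detectable = inverse_five_pl(cutoff, **PARAMS)
print(cutoff, lowest_detectable)
```

Reading concentrations off the fitted curve through its inverse is what ties every reported value to the kit-specific standards, which is why a change in the standards can shift results even when the antibodies are unchanged.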
To address potential inter-analysis variability, 770 data points from 2005 and 430 data points from 2010 were re-analyzed at the same time (2011) with version 6.0 software, on the original machine. There were 0/1,200 changes in the resulting absolute values obtained.

Biostatistical Methods
Analyte concentrations were compared at two time points with a one-sample Wilcoxon rank-sum test on the ratio of the two concentrations. Correlation was assessed with Spearman's test. All p-values are two-sided. Assay results below the lower limit of detection or above the upper limit of quantitation were not used in the analysis.
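A minimal sketch of this comparison is given below. It makes several assumptions: the one-sample test on paired ratios is implemented as the Wilcoxon signed-rank procedure applied to the log of the late-to-early ratio, the two-sided p-value uses a normal approximation without tie or continuity correction (rough for small samples), and the function names, detection limits and example values are hypothetical. The lower limit must be positive so that the ratio is defined.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def signed_rank_test(early, late, lower, upper):
    """Signed-rank test of log(late/early) against zero.

    Pairs with either value outside [lower, upper] are dropped, mirroring
    the exclusion of out-of-range assay results. Returns (W+, n, p).
    """
    logs = [math.log(l / e) for e, l in zip(early, late)
            if lower <= e <= upper and lower <= l <= upper]
    logs = [x for x in logs if x != 0.0]  # zero differences carry no sign
    n = len(logs)
    ranks = average_ranks([abs(x) for x in logs])
    w_plus = sum(r for r, x in zip(ranks, logs) if x > 0)
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean_w) / sd_w
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return w_plus, n, p_two_sided

# Hypothetical pg/mL values for one analyte in five sera.
early = [10.0, 20.0, 30.0, 40.0, 50.0]
late = [5.0, 8.0, 9.0, 10.0, 10.0]
print(signed_rank_test(early, late, lower=1.0, upper=10000.0))
```

Working on the log scale treats a two-fold increase and a two-fold decrease symmetrically, which is the natural scale for a ratio.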

Results and Discussion
During the analysis of a retrospective biomarker study conducted with a set of banked sera from melanoma patients [23], we discovered a potential correlation between the levels of analytes measured by Luminex and the time that the sera were stored at -80°C. Therefore, we examined several aspects of serum storage and the Luminex assay.

Repeat testing in 2010 of sera stored in 2005
Our first sample set consisted of sera from 23 melanoma patients (the "old patients") who had a blood sample drawn in 2005 and had a Luminex assay performed on their serum samples on either 10/31/2005, 11/01/2005 or 2/17/2006; we refer to these as the "early" assays. To determine any changes over storage time, we thawed aliquots (not previously thawed) and tested a subset of the analytes originally tested, again by Luminex (Table 1). Unexpectedly, we identified a number of apparent changes in analyte levels. We repeated these measurements up to three times (depending on the number of previously untouched aliquots remaining) for these 23 samples (2/02/2010, 5/13/2010 and 8/11/2010); we refer to these as the "late" assays. Seven of the 10 analytes we examined had highly significant changes during the approximately 5 years of storage at -80°C.
There were different patterns seen for different groups of analytes, some of which were relatively stable over time (IL-4, change over time: p = 0.28) while others were found to change (IL-10, p = 0.093; GM-CSF, p = 0.11). Levels of some of the analytes decreased over the storage time (IL-6, p = 0.00021, decreasing in 21/23 samples; TNFα, p = 0.0078, decreasing in 20/23). Surprisingly, the IL-8 levels were significantly increased from the initial test to the subsequent tests 5 years later (IL-8, p = 0.000030, approximately 5-fold increased in 23/23 patient samples). MCP-1 levels also increased in a majority of samples (MCP-1, p = 0.00012) (Table 1, Figure 1). Each p-value was computed with a one-sample Wilcoxon test on the ratio of the 5/13/2010 assay result (for which we had the most data) to the result of the early assay.

Healthy donor and melanoma patient serum time course in 2010
To determine whether we could detect similar changes over a period of months, we drew blood from 10 healthy donors (HD) [...] (Table S3, Table 3 data). HD samples were tested initially 2 months after processing and freezing, and then twice more, at 5 and 8 months of storage, on the same dates as the old patient samples described above.
The melanoma patient samples were tested 2 days after processing and cryopreservation, and again 3 months later.
As expected, HD samples had low circulating levels of many analytes tested. These HD control samples also showed changes in analyte levels, even after short-term storage. Again, some analytes were stable, while others were much less stable. When the data from the different storage times were plotted together (Figure 3), the trends in concentration changes observed were not significantly different between the serum sample data sets (old patients, HD, new patients) (Table 1, Table 2, Table 3).

Cytokine Controls used in assays
We purchased our Luminex kits from a single source; however, that source changed ownership between Oct. '05 and Aug. '10 (from BioSource to Invitrogen to Life Technologies). Each kit includes reagents to generate an 8-point standard curve from which all values are determined. For the custom kits we requested, to test a specific array of analytes of interest, the manufacturer pretests the specific antibodies together, to confirm lack of cross-reactivity. The manufacturer indicates that the kits are not released unless the following criterion is met: "< 10% cross-reactivity to related recombinant protein at the highest point of the standard curve" (Life Technologies). We requested the specific cross-reactivity testing data for the kits we used in this study, but were repeatedly informed that company policy prohibits QC data release to customers.
As an additional control, we included "Multiplex QC" controls, which are complex mixtures of recombinant cytokines, chemokines and growth factors prepared by the manufacturer at 3 concentrations (low, medium and high). We have established the reproducibility of this control (Additional File 4, Table S4) when tested via Luminex (% CV = 1%-52%, average % CV = 14% for 8 analytes). While the absolute values for each analyte do not exactly match the "expected" value from the QC control manufacturer (R&D Systems), they are similar, and we use a different platform and different antibody clones for detection via Luminex, which may account for those differences (as indicated in the package insert).
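The reproducibility figures quoted above are coefficients of variation (% CV). A minimal sketch of the calculation follows, assuming the sample standard deviation is used (the text does not say which estimator was applied); the analyte names are taken from the text but the replicate values are hypothetical.

```python
import statistics

def percent_cv(replicates):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    return 100.0 * statistics.stdev(replicates) / statistics.mean(replicates)

# Hypothetical repeated measurements (pg/mL) of one QC control level.
qc_runs = {
    "IL-6": [90.0, 100.0, 110.0],
    "IL-8": [480.0, 500.0, 520.0],
}

cv_by_analyte = {name: percent_cv(values) for name, values in qc_runs.items()}
average_cv = statistics.mean(cv_by_analyte.values())
print(cv_by_analyte, average_cv)
```

Because the average % CV summarizes several analytes, a single poorly reproducible analyte (such as one at the 52% end of the reported range) can be hidden by it, so the per-analyte values are worth inspecting as well.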
We also received WHO cytokine standards for IL-4, IL-8, IL-10 and GM-CSF. These lyophilized cytokine controls were resuspended (Materials and Methods) and individually tested at 1:10, 1:50 and 1:100 dilutions in two replicate Luminex assays for the same ten analytes described above. These data are presented in Table 4. As expected, the standard under study was almost always detected. However, there were some surprising results. MCP-1 was also almost always detected in addition to the standard, and MIG was always detected when the standard IL-10 was used. The apparent concentrations of these two analytes in some instances exceeded 10% of that of the standard. IL-6, IFN-γ and GM-CSF also showed evidence of minor cross-reactivity.
The apparent cross-reactivity seen for MCP-1 and MIG might be caused by a medium additive present in the AIM V medium (a serum-free lymphocyte culture medium) used in a dilution step for these proteins. We tested several commonly used culture media (AIM V, RPMI 1640, Iscove's and CellGenix DC media) in a 30-plex Luminex assay which also included a repeat test of the WHO standards. The results did identify low levels (3-62 pg/mL) of several analytes in the culture media (HGF, FGF basic, RANTES, IL-17 and IL-2R) but not MCP-1 or MIG (data not shown). MCP-1 was again detected in the IL-8 and GM-CSF WHO standards and MIG in the IL-10 standard (as well as HGF, FGF basic and RANTES). We are investigating other possible sources of low levels of other cytokines and growth factors in the WHO standards.
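The comparison of off-target signal against 10% of the standard's concentration, used above to judge the WHO standards, can be expressed as a short check. The sketch below is an assumption about how such a flag might be computed, not the procedure used in the study; the function names and concentrations are hypothetical.

```python
def cross_reactivity_percent(measured, target):
    """Apparent concentration of each off-target analyte as a percent of the target's."""
    target_conc = measured[target]
    return {analyte: 100.0 * conc / target_conc
            for analyte, conc in measured.items() if analyte != target}

def flag_cross_reactive(measured, target, threshold_percent=10.0):
    """Off-target analytes whose apparent level exceeds the threshold."""
    percents = cross_reactivity_percent(measured, target)
    return sorted(a for a, pct in percents.items() if pct > threshold_percent)

# Hypothetical apparent concentrations (pg/mL) in a well containing only an IL-10 standard.
well = {"IL-10": 1000.0, "MIG": 150.0, "IL-6": 20.0}
print(flag_cross_reactive(well, target="IL-10"))
```

A flag raised this way cannot by itself distinguish antibody cross-reactivity from a contaminating protein in the standard preparation or diluent, which is the ambiguity the media experiment above was meant to address.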
As a test of the day-to-day reproducibility of two cytokines of particular interest, IL-6 and IL-8, a set of samples and controls was run in two different custom kits one day apart (with samples kept thawed at 4°C overnight); both kits included IL-6 and IL-8. Notably, these two kits also had different standard curves and upper limits of detection. For IL-6, the 10-plex kit upper limit was 7,400 pg/mL, while in the 8-plex, it was 13,800 pg/mL (1.8 fold higher). For IL-8, the 10-plex upper limit was 24,800 pg/mL and in the 8-plex, 10,160 pg/mL (2.4 fold lower). When the values for the 38 samples were compared between the two kits, the ratio of the IL-6 values was 1.0 (median and mean), showing excellent concordance. For IL-8, where the upper limits were more disparate, the ratio of the values was 0.80, which was a small but significant difference (Figures 4A and 4B). These data indicate that the assay with the higher upper limit has larger measured values.

Upper limit problem
The Luminex kits that we used at the different time points were not identical. In particular, we noticed that the upper limits of quantitation for individual analytes changed over time for the different kits. In principle, this should not affect the measured concentrations, because the kits include kit-specific standards to generate 8-point standard curves matched to the expected [...] Figure 5 is a scatter plot of the late-to-early ratio of analyte concentrations versus the late-to-early ratio of assay upper limits, with a smooth curve superimposed. The late-to-early ratio of upper limits was different for each of the 10 analytes. Typically, 12 samples were assessed for each analyte. The correlation of the two ratios is highly significant (p < 10⁻¹⁵, Spearman's test). Therefore, we are concerned that assays performed at different times with different kits may not be comparable.
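The relationship examined in Figure 5 can be sketched as a Spearman rank correlation between two sets of late-to-early ratios. The code below is an illustration only: it computes the correlation coefficient but not the p-value (which would need a statistics library), and the function names and all numbers are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

def late_to_early(early, late):
    """Element-wise late/early ratios."""
    return [l / e for e, l in zip(early, late)]

# Hypothetical values: one (sample, analyte) pair per position.
conc_ratio = late_to_early([100.0, 50.0, 20.0, 10.0], [50.0, 40.0, 24.0, 30.0])
limit_ratio = late_to_early([10000.0, 8000.0, 5000.0, 4000.0],
                            [4000.0, 7200.0, 5500.0, 12000.0])
print(spearman_rho(conc_ratio, limit_ratio))
```

A rank-based measure suits these data because the ratios span several fold and the relationship need not be linear, only monotone.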
In this report, we detail reproducibility problems we encountered testing circulating cytokines, chemokines and growth factors by Luminex in serum samples which were stored over months to years under highly controlled conditions. Some of these changes were very dramatic: IL-8 increased 4-6 fold in old patient samples; MCP-1 decreased 4-6 fold in new patient samples, and up to 10-fold in healthy donor samples; IL-10 changed from negative to positive or positive to negative within the same old patient serum dataset (Figure 1). Our initial hypothesis was that the changes were entirely biological, and that despite standardized blood handling procedures and temperature-controlled freezer storage, some analytes became unstable over time or upon thaw. Two recent reports testing cytokine stability found most tested cytokines to be stable over 1-2 years at -80°C, and a subset (including IL-8 and IL-10) became unstable after 2-4 years [24,25]. Many of the proteins became unstable after repeated freeze-thaw cycles. If these were the only mechanisms, then the analytes we tested should have behaved consistently between our three datasets, because the change would be analyte-specific. This cannot be the only explanation because, for example, MCP-1 increased over time in the majority of old patient samples and decreased over time in both the HD and new patient sets.
Our study has a number of limitations. The more recently acquired HD and new patient data sets were tested within months of blood draw. A better analysis of the impact of storage time on analyte stability would require a large number of patient and HD samples stored for longer periods, with costly repeated multiplex testing. We also limited the diversity of analytes we examined. Another variable was the time from blood draw to serum separation and freezing. Some of our samples were drawn within the laboratory and at our nearby clinic and processed within a few hours, while other old patient samples were shipped overnight and processed the following morning. However, these blood handling procedures reflect the unavoidable limitations inherent in transferring patient blood from the clinic to a central laboratory capable of standardized processing, as well as in multi-institutional trials where large numbers of patients can be treated and tested, but overnight shipping is required. Lastly, some of our healthy donor and control samples were run in duplicate, but to reduce costs, large numbers of patient sera were run in singlets. Due to the small average % CVs determined for many duplicates (Additional File 1, Table S1), this may have minimal impact on the trends we observed.
The Luminex assay has been shown (by ourselves [26] and others [27]) to correspond well to ELISA platform assays. In addition, the Luminex assay has good reproducibility from well to well, and from day to day (Figure 4). Also, our use of the R&D QC controls (Additional File 4, Table S4) indicates good reproducibility of recombinant analytes when mixed together. This may indicate that the serum matrix impacts reproducibility, and/or that the biological impact of a tumor leads to systemic changes (including altered glycosylation) which impact the assay.
This study also suggests that the changes in the upper limits of detection, which can vary substantially from kit to kit, month to month, and analyte to analyte from a single manufacturer, may impact the ability to determine analyte concentration. This impacts kit-to-kit reproducibility, and greatly increases the importance of comparing samples with the identical lot of kits with identical standard curve ranges. We attempted to dissect this further by requesting access to manufacturer QC data, but we were repeatedly denied access to any additional information specific to the testing performed on the kits we used.
We do not understand why the assay kit upper limits seem to affect assay performance in the systematic way that is evident in Figure 5. However, we have to conclude that the results of assays done with different kits cannot be directly compared. Therefore, the apparent changes in analyte levels over time that we observe may arise from the kit-to-kit variability: we cannot claim to observe changes in analyte levels over storage time at -80°C.