Calculations of the operating characteristics of a biomarker for disease are subject to verification bias if the disease status is only verified for individuals with biomarkers within a specified range, such as values greater than what is considered the "upper limit of normal". Such data predominate in prospective studies that employ a biomarker to screen, such as the Prostate Cancer Prevention Trial (PCPT), so statistical methods that accommodate potential biomarker-based verification bias are needed in order to use samples from these studies.
The PCPT randomized 18,882 men aged 55 or older with a normal digital rectal examination (DRE) and prostate-specific antigen (PSA) level less than or equal to 3 ng per milliliter (ng/mL) to either finasteride or placebo for seven years. A PSA and DRE were performed annually. Whenever PSA exceeded 4 ng/mL or the DRE was positive, indicating suspicion of cancer, the participant was referred to biopsy. At the end of seven years all individuals not previously diagnosed with cancer were requested to have an end-of-study biopsy. The aim of our correlative study was to derive the operating characteristics of PSA for biopsy-detectable prostate cancer using the seven-year screening histories and outcomes from the PCPT placebo arm. We walk through this case study, illustrating a Markov chain Monte Carlo algorithm to adjust for verification bias, and ending with our conclusions concerning the operating characteristics of PSA and open questions for the design of future prospective screening studies.
The difficult issues for the statistician designing and analyzing proteomic studies are similar to the issues with genomic studies. I will discuss the following:
Development of HIV resistance mutations is a major cause of failure of antiretroviral treatment. This article proposes a method for jointly modeling the processes of viral genetic changes and treatment failure. Because the viral genome is measured with uncertainty, a hidden Markov model is used to fit the viral genetic process. The uncertain viral genotype is included as a time-dependent covariate in a Cox model for failure time, and an EM algorithm is used to estimate the model parameters. This model allows simultaneous evaluation of the sequencing uncertainty and the effect of resistance mutations on the risk of virological failure. The method is then applied to data collected in three phase II clinical trials testing antiretroviral treatments containing the drug efavirenz. Various model checking tests are provided to assess the appropriateness of the model.
Many patients treated with combination antiretroviral therapy fail to achieve complete viral suppression. Optimizing individual treatment strategies requires an understanding of the complex relationship between replication of drug-resistant virus and the host response. In particular, the distinction between persistent drug activity, alterations in replicative capacity ("fitness") and the ability of a newly emergent variant to cause disease ("virulence") may prove to be important in designing long-term therapeutic strategies. These issues will likely become even more relevant with entry inhibitors, where drug pressure may select for X4 variants that may be less fit but more virulent. To address these issues, we have performed a series of studies focusing on the determinants of disease outcome in patients with drug-resistant viremia, and have observed the following: (1) HIV is often constrained in its ability to develop high-level drug resistance while maintaining replicative capacity, (2) immune activation is reduced in patients with drug-resistant HIV (after controlling for the level of viremia) and (3) patients who durably control HIV replication despite the presence of drug resistance exhibit immunologic characteristics comparable to those observed in long-term non-progressors (e.g., low levels of T cell proliferation and activation and preserved HIV-specific IL-2 and gamma-interferon-high producing CD4+ T cells). We have initiated a number of interventional studies based on the hypothesis that drug-mediated alterations in HIV fitness/virulence may be clinically useful in patients with limited therapeutic options.
Supported by NIAID (AI052745,AI055273), the UCSF/Gladstone CFAR (P30 MH59037), the California AIDS Research Center (CC99-SF, ID01-SF-049) and the SFGH GCRC (5-MO1-RR00083-37).
The development of solid tumors is associated with acquisition of complex genetic alterations, indicating that failures in the mechanisms that maintain the integrity of the genome contribute to tumor evolution. Thus, one expects that the particular types of genomic derangement seen in tumors reflect underlying failures in maintenance of genetic stability, as well as selection for changes that provide growth advantage. In order to investigate genomic alterations we are using microarray-based comparative genomic hybridization (array CGH). The computational task is to map and characterize the number and types of copy number alterations present in the tumors, and so define copy number phenotypes, as well as to associate them with known biological markers and with gene expression data. We discuss general analytical and visualization approaches applicable to array CGH data. We also use an unsupervised hidden Markov model approach to exploit the spatial coherence between nearby clones. The clones are partitioned into states that represent the underlying copy number of each group of clones. The output of the algorithm is given as input to higher-level analyses such as testing and classification. We will also discuss some preliminary results on joint analysis of the copy number and gene expression data. The methods are demonstrated on simulated data as well as cell line and clinical tumor datasets.
With the advent of new high-throughput molecular technologies, consideration of high-dimensional data is becoming more common. A major role for statisticians to play in the future of this area of bioinformatics is combining genomic data from different sources. In this talk, we will discuss two examples of such analyses. The first is combining gene expression datasets from multiple cancer studies. The second is using gene expression data to infer chromosomal alterations.
In many vaccine studies, confirmatory diagnosis of a suspected case is made by doing a culture to confirm that the infectious agent of interest is present. However, often such cultures are too expensive or difficult to collect, so that an operational case definition, such as "any respiratory illness", is used. This leads to many misclassified cases and serious attenuation of efficacy and effectiveness estimates. A validation sample can be used to improve the attenuated estimates. We propose a new method of analysis for validation sets with time-to-event data in vaccine studies when the baseline hazards of both the illness of interest and similar, nonspecific illnesses are changing. We analyze data from an influenza vaccine field study with these methods.
Global proteomics measurements are rapidly being developed to identify biomarkers for drug development applications. A major challenge with this strategy is the analysis of the raw data generated by high throughput HPLC-MS/MS experiments of protein digests from complex biological samples. This presentation will focus on a computational pipeline to automatically process HPLC-MS/MS data including: estimation of peptide charge and mass, noise filtering of MS/MS spectra, and peptide identification. Following this pre-processing of individual study samples we describe methods for chromatographic alignment and label-free relative quantification using integrated ion current of peptides from all samples in a biomarker study. Results from a rat serum variability study will be used to demonstrate how the method can be applied to biomarker discovery.
Biomarkers can be used for several purposes, for example as surrogate markers of treatment effect or as inputs to a diagnostic algorithm. This talk will describe applications of causal modeling and inference for both settings, and highlight the role of potential outcomes for understanding properties of a biomarker.
First, we illustrate the use of instrumental variables and associated sensitivity analysis for estimating causal treatment effects of HAART from observational cohort studies. Our focus will be on transparent representation of underlying assumptions, and on the role of coherent sensitivity analyses to understand the effects of departures from those assumptions.
Second, we will describe the role of potential outcomes for assessing diagnostic utility of a continuous biomarker. An important measure of diagnostic utility is area under the ROC curve. The area represents P(X>Y), where X and Y are, respectively, randomly-drawn marker values from the 'case' and 'non-case' populations. In some observational studies, the 'case' and 'non-case' populations may be systematically different, and bias can be introduced by confounders. We propose a new definition for area under the ROC curve that is written in terms of potential outcomes, and appeals to a causal interpretation of diagnostic utility. Standard methods for causal inference can be used to estimate the area under the curve; the ideas are illustrated by examining the diagnostic utility of viral load and CD4 as markers for HIV-related mortality, using inverse probability weighting to adjust for potential confounders. We also make qualitative and quantitative comparisons to standard methods.
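The interpretation of the area as P(X>Y) suggests a direct pairwise estimate. The following is a minimal sketch, not the authors' estimator: it counts the weighted share of case/non-case pairs in which the case has the larger marker value. The function name, the weight arguments and the example numbers are all hypothetical, and the weights stand in for inverse probability weights whose estimation (for example from a model of case status given confounders) is not shown here.

```python
def weighted_auc(case_values, control_values, case_weights=None, control_weights=None):
    """Weighted estimate of P(X > Y) over all case/non-case pairs; ties count one half.

    With no weights supplied this reduces to the usual empirical
    (Mann-Whitney) estimate of the area under the ROC curve.
    """
    if case_weights is None:
        case_weights = [1.0] * len(case_values)
    if control_weights is None:
        control_weights = [1.0] * len(control_values)

    numerator = 0.0
    denominator = 0.0
    for x, wx in zip(case_values, case_weights):
        for y, wy in zip(control_values, control_weights):
            pair_weight = wx * wy  # assumed: pair weight is the product of subject weights
            denominator += pair_weight
            if x > y:
                numerator += pair_weight
            elif x == y:
                numerator += 0.5 * pair_weight
    return numerator / denominator


# Hypothetical marker values and weights, for illustration only.
cases = [3.0, 4.0]
controls = [1.0, 2.0, 3.5]
print(weighted_auc(cases, controls))
print(weighted_auc(cases, controls, control_weights=[1.0, 1.0, 2.0]))
```

Up-weighting a non-case that is under-represented in the sample changes the estimate, which is the mechanism by which weighting can offset systematic differences between the two populations.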
Molecular data are widely used to screen for biomarkers that have prognostic significance for clinical outcomes, e.g. gene expression data or immuno-histochemical staining data may be used to screen for biomarkers that could predict post-operative survival time. A challenge is that such candidate biomarkers sometimes cannot be validated in independent data sets. Here we will describe two different approaches that we have found to be useful for identifying biomarkers that have an increased chance of being validated.
The first approach is based on weighted gene co-expression network analysis. A clustering method is used to identify prognostic gene modules, i.e. sets of tightly co-expressed genes. Using brain cancer microarray data, we will show that highly connected prognostic "hub" genes in these modules have a substantially increased likelihood of being validated. The second approach seems to be quite different: first, it uses random forest clustering to identify high-risk patient clusters. Second, a biomarker-based threshold rule is derived for predicting cluster membership. Using prostate cancer data, we will provide empirical evidence that these rules can be validated while traditional approaches may lead to candidate biomarkers that cannot be validated.
There seems to be a mathematical and biological connection between these two approaches. Both rely on clustering as an essential pre-processing step to identify "prognostic" clusters. The clusters correspond to global patterns that are more likely to be found in independent data sets as well. We provide empirical evidence that biomarker screening procedures that are based on prognostic clusters have an increased chance of validation success.
Acknowledgement: The gene co-expression network part was done in collaboration with Bin Zhang, Paul Mischel, and Stan Nelson. The random forest part was done in collaboration with Tao Shi, Siavash Kurdistani, and David Seligson.
Advancements in mass spectrometry (MS) instrumentation, liquid chromatography (LC) and maturing protein databases are leading many advances in the field of proteomics. Among the potential uses of this technology is the identification of predictive protein biological markers, or biomarkers, that can differentiate two or more groups of complex biological samples. Despite its proteome-wide potential, few clinically relevant discoveries have come forth from these technologies when applied to complex protein mixtures, such as serum or tissue, characterized by a high complexity and dynamic range. Current approaches to profile proteins are dominated by the use of MALDI or LC-MS/MS mass spectrometry (MS/MS), and both approaches have difficulties in practice; MALDI can identify a large number of "peaks", but identification (sequence) of low-abundance features can be difficult, and MS/MS lacks sensitivity and has poor reproducibility and low protein coverage due to its data-dependent sampling. It has been our hypothesis that greater efficiency of protein/peptide profiling could be obtained by more efficient use of high resolution LC-MS instrumentation where, like MALDI approaches, differential peptides are first identified from the list of potential precursor ions (LC-MS) and then only those differential peptides are sequenced in subsequent LC-MS measurements. To evaluate this hypothesis, our group has developed a suite of software algorithms that produce a peptide array from a sequence of LC-MS measurements; the peptide array can be evaluated in much the same way as a transcript array with members identified by their accurate mass and time tags. Production of the peptide array requires substantial signal (image) processing, image alignment, and specialized normalization routines. We demonstrate that we can identify and compare hundreds or thousands of peptides and proteins across multiple replicates of biological samples.
The algorithms will be demonstrated using data from increasingly complex biological samples: bacteria, yeast, and human serum.
Cancer progression often involves alterations in DNA sequence copy number. Multiple microarray platforms now facilitate high-resolution copy number assessment of entire genomes in single experiments. This technology is generally referred to as array comparative genomic hybridization (array CGH). In my talk, I will discuss issues that have arisen in the analysis of array CGH data. Topics will include pre-processing and normalization, identification of regions of abnormal copy number, and determination as to whether copy number abnormalities can be seen in gene expression data. Our method of identifying abnormal copy number, which we call circular binary segmentation (CBS), will be introduced.
This is joint work with E.S. Venkatraman.
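The core scan behind a binary segmentation of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it only searches for the contiguous arc of clones whose mean differs most from the mean of the remaining clones, using an unstandardized two-sample statistic. Assessing whether that maximum is significant (for example by permutation) and recursing on the resulting segments are left out, and the function name and the example log-ratio values are hypothetical.

```python
import math
from itertools import accumulate


def best_arc(values):
    """Return (i, j, statistic) for the arc values[i:j] that maximizes
    |mean inside arc - mean outside arc| / sqrt(1/n_inside + 1/n_outside).

    Partial sums make each candidate arc an O(1) evaluation, so the
    whole scan is O(n^2) in the number of clones.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    partial = [0.0] + list(accumulate(values))
    total = partial[n]

    best_i, best_j, best_stat = 0, 1, -1.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            inside = j - i
            outside = n - inside
            if outside == 0:
                continue  # the arc must leave at least one clone outside
            arc_sum = partial[j] - partial[i]
            diff = arc_sum / inside - (total - arc_sum) / outside
            stat = abs(diff) / math.sqrt(1.0 / inside + 1.0 / outside)
            if stat > best_stat:
                best_i, best_j, best_stat = i, j, stat
    return best_i, best_j, best_stat


# Hypothetical log ratios with a gained region in the middle.
log_ratios = [0, 0, 0, 1, 1, 1, 0, 0, 0]
print(best_arc(log_ratios))
```

Scanning arcs rather than single split points is what lets one pass pick out an aberrant region lying in the interior of a chromosome.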
Simple models of HIV infection and the effects of antiretroviral therapy have typically assumed that drugs have a constant efficacy. Here I will summarize some new models that incorporate ideas from pharmacokinetics and pharmacodynamics such that drug efficacy depends on drug concentration, which in turn depends on drug dose and time at which drug is taken. These models allow estimation of the relative efficacy of different drug combinations and also allow one to explicitly incorporate the effects of missed drug doses or intentional stopping of therapy for short periods of time. Effects of drug resistance can also be incorporated.
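As a minimal sketch of the idea that efficacy follows concentration, which in turn follows dose and dosing time, the code below assumes the simplest possible ingredients: instantaneous absorption, first-order elimination, and an Emax-type efficacy curve C/(C + IC50). These modelling choices, the function names and every parameter value are assumptions for illustration and are not taken from the talk.

```python
import math


def concentration(t, dose_times, dose_amount, elimination_rate):
    """Concentration at time t: every dose taken at or before t decays exponentially."""
    return sum(
        dose_amount * math.exp(-elimination_rate * (t - dose_time))
        for dose_time in dose_times
        if dose_time <= t
    )


def efficacy(conc, ic50):
    """Emax-type efficacy between 0 and 1; equals 0.5 when conc == ic50."""
    return conc / (conc + ic50)


# Hypothetical regimen: a dose every 12 hours, compared with the same
# regimen when the hour-12 dose is missed. Efficacy is read off at hour 23.
full_schedule = [0.0, 12.0, 24.0]
missed_schedule = [0.0, 24.0]
for label, schedule in (("full", full_schedule), ("missed", missed_schedule)):
    c = concentration(23.0, schedule, dose_amount=10.0, elimination_rate=0.2)
    print(label, round(efficacy(c, ic50=1.0), 3))
```

Because efficacy is computed from the dosing history, a missed dose or a planned interruption is represented simply by removing entries from the list of dose times.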
Developing molecular tests to predict prostate cancer progression requires first defining a meaningful endpoint. There is controversy regarding the use of PSA or biochemical failure following prostatectomy or radiation therapy for clinically localized prostate cancer as a marker of progression. As a consequence, advances in prostate cancer biomarker development may require using population-based cohorts or cases from clinical trials to identify meaningful associations. Whereas the discovery of novel candidate biomarkers was slow 5-10 years ago and often resulted from serendipity, advances in high-throughput technologies have led to the identification of a large number of candidate genes. Strategies to identify candidate genes include the use of novel software for genomic analysis. This presentation will provide an approach to validation of these candidate genes using tissue microarrays and other high-throughput technologies. Since a critical factor in the evaluation of tissue markers is reproducibility, approaches to quantitative protein expression will be presented. The approaches presented here should be applicable to other tumor types and disease processes.
Following infection, HIV-1 proteins are digested into short peptides that bind to major histocompatibility complex (MHC) molecules. Subsequently, these bound complexes are displayed by antigen presenting cells. T cells with receptors that recognize the complexes are activated, triggering an immune response. Peptides with this ability to induce T cell response are called T cell epitopes -- prediction thereof is important for vaccine development. Sung and Simon (JCB, 2004) start with compilations of peptide sequences that {bind/don't bind} to specific MHC molecules and, using biophysical properties of the constituent amino acids, develop a classifier. Properties are used because of the inability of select classifiers to effectively handle amino acid sequence itself. Tree-structured methods are not so limited (Segal et al., Biometrics, 2001). Here, we apply these methods, along with their ensemble extensions (bagging, boosting, random forests), and show they provide improved accuracy. Both additional properties (QSAR derived) and classifiers (SVMs, ANNs) are also investigated. HIV-1 genomewide comparisons with respect to predicted / conserved epitopes are also presented.
Our group utilizes a variety of proteomic approaches to biomarker discovery for the early detection of cancer. Discussed will be the current mass spectrometry based studies and their application to solid tissue cancers such as prostate and head and neck. In addition recent studies examining serum from patients infected with the Human T-cell leukemia Virus type 1 will be presented with emphasis on the utility of the expressed biomarkers in the discrimination of Adult T-cell leukemia, HAM/TSP and asymptomatic infected individuals.
Detecting ovarian cancer in asymptomatic women through regular screening tests is an appealing approach to reducing mortality from this disease due to the large survival difference between early and late stage disease, and the high proportion of cases detected at late stage (80%) under usual care. However, due to the low incidence of the disease, ovarian cancer screening is a delicate balance between detecting as many cancers as possible and limiting the number of false positive results per true positive. As the bar is lowered for declaring a test positive, the proportion of cancers detected usually increases; however, the number of false positives per cancer detected also increases. The definitive diagnosis of ovarian cancer requires invasive pelvic surgery. To be considered acceptable, a screening method must find at least one ovarian cancer per ten screen-related surgeries and must screen-detect at least 70% of ovarian cancers.
Prospective clinical screening trials with the blood test CA125, followed by ultrasound for elevated CA125 above a fixed cutpoint, resulted in a positive predictive value (# cancers at surgery/# surgeries), or PPV, exceeding 20% and with 70% of ovarian cancers screen detected, demonstrating this screening method is acceptable. However only 40% of screen detected cancers were found in early stage. While this result doubled the percentage found in early stage under usual care, a greater increase was required before the impact on mortality would be substantial. A method was required for increasing the sensitivity while maintaining a sufficiently high PPV. Retrospective analysis of longitudinal CA125 values indicated that CA125 values rose exponentially above an individual's baseline level prior to diagnosis of ovarian cancer, while in most other women the CA125 fluctuated around a baseline level. Incorporating this differential CA125 behavior into the screening decision for referral to ultrasound would potentially allow greater sensitivity (rise above a baseline but prior to achieving a level exceeding the fixed cutpoint) while maintaining specificity (rule out subjects with elevated yet stable CA125 levels). Modeling the longitudinal CA125 values in cases with a hierarchical longitudinal change-point model, and the CA125 in other women with a hierarchical longitudinal model, provided the basis for assessing referral to ultrasound with the Bayes factor calculated for subjects with new CA125 values. This approach has been used in a prospective randomized ovarian cancer screening trial in the UK and will be discussed at the workshop.
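The tension between low incidence and an acceptable PPV can be made concrete with Bayes' rule. The sketch below treats screening as a single test applied to a population with a given disease prevalence, which is a simplification of a multi-stage programme (CA125 followed by ultrasound). The prevalence of 40 per 100,000 is a hypothetical value chosen only for illustration; the 70% sensitivity and the one-in-ten PPV threshold are the acceptability criteria stated above.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV by Bayes' rule: true positives divided by all positives."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


def required_specificity(sensitivity, prevalence, target_ppv):
    """Specificity at which the PPV exactly equals target_ppv (Bayes' rule solved for specificity)."""
    allowed_false_positive_rate = (
        sensitivity * prevalence * (1.0 - target_ppv)
        / (target_ppv * (1.0 - prevalence))
    )
    return 1.0 - allowed_false_positive_rate


# Hypothetical prevalence; sensitivity and target PPV follow the stated criteria.
prevalence = 40 / 100_000
spec = required_specificity(sensitivity=0.70, prevalence=prevalence, target_ppv=0.10)
print(f"specificity needed for PPV of 10%: {spec:.4%}")
```

Under these assumptions the specificity has to be well above 99%, which is why ruling out women with elevated but stable CA125 matters as much as detecting rising values.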
Cancer biomarkers can be used in many different ways in cancer research. They can be used as surrogate endpoints or auxiliary variables to help assess new therapies. They can be used for risk stratification prior to deciding on therapy. A biomarker might suggest responsiveness to a particular biological agent, and would thus assist in individualizing therapy. Modern technologies, such as those from genomics and proteomics, are producing high dimensional sets of biomarkers, which give rise to numerous complex statistical issues. A longitudinal series of a biomarker can be useful for early detection of disease or for monitoring disease progression after therapy. There is a general feeling that combinations of biomarkers that measure different aspects of the underlying biology may be more useful than any single biomarker. This raises the statistical challenge of how to combine biomarkers. When using combinations of biomarkers to detect disease it is frequently appropriate to assume that the probability of disease is a monotonic function of each biomarker. By incorporating this monotonicity into the analysis it may be possible to improve its efficiency. We consider the situation of two ordered categorical variables and a binary response. The probability of response is assumed to be monotonic in each of the biomarkers. Two approaches are considered, one Bayesian in which the monotonicity is built into the prior distributions and a second in which isotonic regression in two dimensions is used. When using a biomarker as a surrogate endpoint in a clinical trial it is well known that one requires more than a strong association between the biomarker and the true endpoint; one also needs the biomarker to explain the effect of the treatment on the true endpoint. Various measures of the proportion of treatment effect explained by the surrogate have been proposed.
An alternative approach is to view the biomarker as an auxiliary variable, and use it to predict the true endpoint, and then perform inference on the true endpoint. Thus the problem is converted into one of missing data, for which there are various approaches. We have developed an approach of multiple imputation, in which the true endpoint is imputed based on information in the auxiliary variable, the treatment group and possibly other prognostic factors. This approach generalizes to more complex situations such as multivariate biomarkers or longitudinally measured biomarkers. A more general approach is to formulate and estimate the joint distribution of the biomarker and the true endpoint; once this is achieved, measures such as the proportion explained and predictive distributions of true endpoint values are a natural consequence of the model.
Plasma HIV RNA and T lymphocytes CD4+ count are major biomarkers used to decide when to start, change or stop a treatment as well as to evaluate treatment efficacy in HIV-infected patients. Thus, repeated measurements of those biomarkers are common in HIV studies. Those data may be analysed by using models for longitudinal data such as mixed models. However, the statistical analysis is complicated by several methodological difficulties. Three of them are of particular importance: (i) left-censoring of HIV RNA due to a lower quantification limit; (ii) correlation between CD4+ T lymphocytes and plasma HIV RNA; (iii) missing data due to informative dropout or disease progression. I will present a unified approach to deal with those issues by jointly modelling longitudinal measurement data and event history data. Likelihood inference can be used to estimate the parameters of such model. I will illustrate it by studying HIV markers response to antiretroviral treatment in randomised clinical trials and observational cohort studies. This approach might help in studying the change in markers, their prognostic value and their surrogacy.
To establish efficacy of cancer chemoprevention agents using cancer incidence as the endpoint requires very large sample sizes (thousands) and long follow-up. Surrogate endpoint biomarkers (SEBs) are biomarkers of (presumably critical) intermediate steps in the carcinogenic pathway that may permit smaller and more rapid studies. If the chemopreventive agent modulates the SEB in a manner consistent with blocking or reducing progression to carcinogenesis it may be possible to infer the reduction in cancer risk attributable to the agent. However, if there is not a perfect one-to-one correspondence between the SEB and cancer then the SEB induces misclassification of the cancer outcome. The extent of bias in the SEB as a surrogate for cancer is measured by its sensitivity and specificity. This paper will show that the relative risk (RR) observed using the SEB as a surrogate for cancer can severely underestimate the true RR when specificity is less than perfect. Furthermore, if specificity in the group receiving the chemopreventive agent is less than that in the untreated group, the RR based on the SEB may even indicate that the agent increases cancer risk. The performance characteristics of SEBs as a function of sensitivity, specificity and cancer incidence will be explored, and criteria to determine if SEBs can realistically be used will be defined.
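The attenuation can be checked with a short calculation. If the SEB is treated as a misclassified cancer indicator, the probability of a positive SEB in a group is sensitivity × risk + (1 − specificity) × (1 − risk), and the observed RR is the ratio of these probabilities in the treated and untreated groups. The sketch below applies that formula; the cancer risks, sensitivity and specificities are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
def seb_positive_rate(cancer_risk, sensitivity, specificity):
    """Probability of a positive SEB: true positives plus false positives."""
    return sensitivity * cancer_risk + (1.0 - specificity) * (1.0 - cancer_risk)


def observed_relative_risk(risk_treated, risk_untreated, sensitivity,
                           specificity_treated, specificity_untreated):
    """Relative risk computed from SEB positivity instead of cancer itself."""
    treated = seb_positive_rate(risk_treated, sensitivity, specificity_treated)
    untreated = seb_positive_rate(risk_untreated, sensitivity, specificity_untreated)
    return treated / untreated


# Hypothetical scenario: the agent halves cancer risk (true RR = 0.5).
risk_treated, risk_untreated = 0.01, 0.02

perfect = observed_relative_risk(risk_treated, risk_untreated, 1.0, 1.0, 1.0)
imperfect = observed_relative_risk(risk_treated, risk_untreated, 0.90, 0.95, 0.95)
differential = observed_relative_risk(risk_treated, risk_untreated, 0.90, 0.93, 0.95)

print(f"perfect SEB:              RR = {perfect:.3f}")
print(f"specificity 0.95 in both: RR = {imperfect:.3f}")
print(f"specificity 0.93 vs 0.95: RR = {differential:.3f}")
```

With these illustrative inputs a true RR of 0.5 appears as roughly 0.87 when specificity is 0.95 in both groups, and rises above 1 (roughly 1.17) when specificity is lower in the treated group, because false positives swamp the true positives when cancer is rare.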
Our goal is to estimate the causal effect of mutations detected in the HIV strains infecting a patient on clinical virologic response to specific anti retroviral drugs and drug combinations. We consider the following data structure: 1) viral genotype, which we summarize as the presence or absence of each viral mutation considered by the Stanford HIV Database as likely to have some effect on virologic response to antiretroviral therapy; 2) drug regimen initiated following assessment of viral genotype (the regimen may involve changing some or all of the drugs in a patient's previous regimen); and, 3) change in plasma HIV RNA level (viral load) over baseline at twelve and twenty-four weeks after starting this regimen.
The effects of a set of mutations on virologic response are heavily confounded by past treatment. In addition, viral mutation profiles are often used by physicians to make treatment choices; we are interested in the direct causal effect of mutations on virologic outcome, not mediated by choice of other drugs in a patient's regimen. Finally, the need to consider multiple mutations and treatment history variables, as well as multi-way interactions between these variables, results in a high-dimensional modeling problem. This application thus requires data-adaptive estimation of the direct causal effect of a set of mutations on viral load under a particular drug, controlling for confounding and blocking the effect the mutations have on the assignment of other drugs. We developed such an algorithm based on a mix of the direct effect causal inference framework and the data adaptive regression deletion/substitution/addition (DSA) algorithm.
Although a single event endpoint such as time to virological failure is simple and easy to use in large AIDS clinical trials, the longitudinal biomarker data from close monitoring of viral load and CD4+ T cell counts can provide more detailed information regarding pathogenesis of HIV infection and characteristics of antiretroviral regimens. I will present a mechanistic HIV-1 dynamic model that incorporates information on pharmacokinetics, drug adherence and drug susceptibility to predict viral load trajectory. A Bayesian approach is proposed to fit this model to clinical data from ACTG A5055, a study of two dosage regimens of indinavir (IDV) with ritonavir (RTV) in subjects failing their first PI treatment. HIV RNA testing was completed at days 0, 7, 14, 28, 56, 84, 112, 140 and 168. An intensive PK evaluation was performed on day 14 and multiple trough concentrations were subsequently collected. Pill counts were used to monitor adherence. IC50 for IDV and RTV was determined at baseline and at virologic failure. Viral dynamic model fitting residuals were used to assess the significance of covariate effects on long-term virologic response. As univariate predictors, none of the four PK parameters C_trough, C_12h, C_max and AUC_0-12h was significantly related to virologic response (p>0.05). By including drug susceptibility (IC50), or IC50 and adherence together, C_trough, C_12h, C_max and AUC_0-12h were each significantly correlated with long-term virologic response (p=0.0055, 0.0002, 0.0136, 0.0002 with IC50 and adherence considered). IC50 and adherence alone were not related to the virologic response. Adherence did not provide any additional information to PK parameters (p=0.064), to drug susceptibility IC50 (p=0.086), and to their combination (p=0.22) in predicting virologic response. Simple regression approaches did not detect any significant PD relationships.
No single factor among PK, adherence and drug susceptibility could be shown to contribute significantly to long-term virologic response. However, an appropriate combination of these factors, obtained through the viral dynamic modeling approach, was a significant predictor of virologic response. Adherence measured by pill counts and multiple trough drug concentrations did not provide additional information for virologic response, presumably due to data quality and noise problems. HIV dynamic modeling is a powerful tool to establish a PD relationship and correlate other factors such as adherence and drug susceptibility with long-term virologic response, since it can appropriately capture the complicated nonlinear relationships and interactions among multiple covariates. Our findings may help clinicians better understand the roles of these clinical factors in antiviral activities and predict the virologic response of various antiretroviral regimens.
Background: The critical clinical challenge in prostate cancer research is to develop means of distinguishing aggressive from indolent prostate cancer. Expression array technology has led to the development of discrete molecular signatures, but the development of a robust signature to characterize aggressive prostate cancer has yet to be achieved. We describe a multi-stage approach to develop a model of prostate cancer progression.
Methods: A recent study from our group employed high-throughput immunoblotting using antibodies against 1383 distinct proteins or post-translational modifications in order to interrogate tissue extracts derived from benign prostate, clinically localized prostate cancer, and metastatic prostate cancer. An integrative analysis of this compendium of proteomic alterations and transcriptomic data derived from 8 prostate cancer profiling studies was used to select a smaller set of genes that demonstrated concordance between protein and transcript levels. 41 of these genes could be evaluated on archival tissue samples. Using a prostate cancer progression tissue microarray, the protein products of these genes were tested using quantitative analysis of immunohistochemistry. The best model was validated using prostate cancer expression array data with associated clinical outcomes data.
Clinicians and researchers collect a tremendous amount of data on cancer patients in the hopes of finding significant prognostic factors. Medical studies commonly involve thousands of clinical, epidemiological, and genomic measurements collected on each patient, along with a time to the clinical event of interest, such as disease recurrence or death. At the end of the study, some patients may have dropped out, been lost to follow-up, or not had the particular event. In this situation, the last date of follow-up is recorded and referred to as the censored time to event. These studies are intended to model time to event by the measured variables for the purposes of predicting time to event for future patients and identifying which of the variables are integral in affecting this outcome. We present a generalization of classification and regression trees (CART) (Breiman, et al., 1984) in the presence of censoring. This approach is based on a strategy to generate possible predictors of time to event, choose the best predictor, and assess its performance.
As this strategy is not limited to CART, a new more aggressive algorithm for generating possible predictors is introduced. To illustrate this approach, both CART and the new algorithm have been applied to simulation studies as well as example data from Comparative Genomic Hybridization array analysis. The proposed approach is applicable to numerous settings, including univariate and multivariate prediction and density estimation. Thus, this method provides a powerful predictive tool for linking complex data sets with censored (or non-censored) outcomes.
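The censored time-to-event encoding described in this abstract can be made concrete with a small sketch. Each patient is represented by a follow-up time and an indicator of whether the event was observed or the time was censored. The code below computes the standard Kaplan-Meier survival estimate from such pairs. It is only an illustration of how censoring enters a calculation, not the tree-based predictor proposed in the abstract, and the function name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)          # patients still under observation
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0             # events at time t
        removed = 0              # events plus censorings at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            # Censored patients count in the risk set but not as events.
            surv *= 1.0 - observed / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Hypothetical toy data: five patients, two of them censored.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 0, 1, 1, 0])
```

Censored patients contribute to the risk set up to their last follow-up and then leave it without being counted as events, which is the key difference from an analysis that simply discards them.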
In this work, we demonstrate how to analyze MALDI-TOF mass spectrometry data using the wavelet-based functional mixed model approach of Morris and Carroll (2004), which is a generalization of the linear mixed model to functional data. This approach models each spectrum as a function, and is very general, accommodating a wide class of experimental designs and allowing one to identify protein peaks related to various outcomes of interest, including dichotomous outcomes, categorical outcomes, continuous outcomes, and any interactions among factors. These factors can be conditions of interest (e.g. cancer/normal) or experimental factors for which we wish to account (blocking factors). Random effects make it possible to model correlation between spectra from the same individual or block. The MCMC output can be used to perform peak detection, to find which peaks are related to factors of interest while controlling the false discovery rate, and to classify future samples based on their proteomic spectra without having to search high-dimensional spaces. These analyses are all done while automatically adjusting for nonlinear block effects that are characteristic of these data. We apply this method to two MALDI-TOF data sets from experiments run at MD Anderson: a clinical study whose goal is diagnosis of pancreatic cancer from blood serum, and an animal study of the serum proteome of mice injected with one of two cell lines in one of two organs. This methodology appears promising for the analysis of mass spectrometry data.
GLNE 001 is a prospective study conducted by a Clinical Epidemiology Center of the Early Detection and Research Network that collects serum samples from patients presenting at colonoscopy clinics at several sites. We are using the samples to assess the utility of SELDI-TOF to distinguish normal patients from those with adenocarcinoma. One hundred unblinded samples are used as a training set, and 155 blinded samples are used for validation. Issues in the analysis include the identification of peaks and the construction of a useful classifier where there is a multiplicity of candidate markers. We will discuss the use of wavelets for de-trending and de-noising the spectra, issues in peak identification and alignment, and a comparison of several machine learning algorithms for constructing a classifier. SELDI-TOF is found to have limited capability to classify sera from normal patients versus those with adenocarcinomas.
We describe an approach to the survival analysis of longitudinally collected genomic data. We construct a measure of association between the survival endpoint and gene expressions collected over time and find significance levels using permutations. This nonparametric approach does not depend on any untestable assumptions about the unknown distributions of gene expressions. The high dimensionality and dependence present in the genomic data are addressed through a multiple testing procedure. We also address the missing data problem that arises from using permutations on possibly censored, longitudinal data. Our proposed method is illustrated on a dataset from a multi-center research study of inflammation and the host response to traumatic injury.
Keywords: gene microarrays, survival analysis, longitudinal data, permutation tests, false discovery rate.
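The two generic ingredients named in this abstract, permutation-based significance levels and false discovery rate control, can be sketched as follows. This is a minimal illustration under simplifying assumptions: it uses a plain covariance as the association statistic and the Benjamini-Hochberg step-up rule, and it ignores censoring and the longitudinal structure that the authors' method handles. The function names are hypothetical and are not taken from the authors' work.

```python
import random


def covariance(x, y):
    """Sample covariance, used here as a stand-in association statistic."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def permutation_pvalue(x, y, stat=covariance, n_perm=2000, seed=0):
    """Two-sided permutation p-value for association between x and y.

    The labels y are shuffled to break any association with x; the p-value
    is the share of shuffles whose statistic is at least as extreme as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(stat(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(stat(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (step-up rule)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank        # largest rank passing its threshold
    return sorted(order[:cutoff])
```

In a genomic setting one p-value would be computed per gene and the whole vector passed to `benjamini_hochberg`. The Benjamini-Hochberg rule is one common choice of multiple testing procedure; the abstract does not specify which procedure the authors use.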
Background: An increasing number of studies have profiled tumor specimens using distinct microarray platforms and analysis techniques. With the accumulating amount of microarray data, one of the most intriguing yet challenging tasks is to develop robust statistical models to integrate the findings.
Results: By applying a two-stage Bayesian mixture modeling strategy, we were able to assimilate and analyze four independent microarray studies to derive an inter-study validated ``meta-signature'' associated with breast cancer prognosis. Combining multiple studies ($n = 305$ samples) on a common probability scale, we developed a 90-gene meta-signature, which was strongly associated with survival in breast cancer patients. Across the set of independent studies, which used different microarray platforms including spotted cDNAs, Affymetrix GeneChip, and inkjet oligonucleotides, the individually identified classifiers yielded gene sets predictive of survival in each study cohort. The study-specific gene signatures, however, had minimal overlap with each other and performed poorly in pairwise cross-validation. The meta-signature, on the other hand, accommodated such heterogeneity and achieved comparable or better prognostic performance when compared with the individual signatures. Furthermore, compared with a global standardization method, the mixture-model-based data transformation demonstrated superior properties for data integration and provided a solid basis for building classifiers at the second stage. Functional annotation revealed that genes involved in cell cycle and signal transduction activities were over-represented in the meta-signature.
Conclusion: The mixture modeling approach unifies disparate gene expression data on a common probability scale allowing for robust, inter-study validated prognostic signatures to be obtained. With the emerging utility of microarrays for cancer prognosis, it will be important to establish paradigms to meta-analyze disparate gene expression data for prognostic signatures of potential clinical use.
Matrix-assisted laser desorption-ionization, time-of-flight (MALDI-TOF) mass spectrometry (MS) is a leading technology in proteomics. This technology allows direct measurement of the "expression signature" of tissue, serum, plasma, or other biological specimens. It has tremendous potential for disease screening, diagnosis and treatment. The goal of MS data processing is to effectively and correctly extract the true information from the raw MS data for further statistical analysis. Two general approaches have been studied recently: the functional data analysis approach (Morris and Carroll 2004, Billheimer 2004) and the feature extraction approach (Coombes et al. 2004, Chen, Hong and Shyr 2004). To provide a final peak list for further statistical analysis, the processing procedure under the feature extraction approach usually takes the following steps: de-noising (smoothing), baseline correction, normalization, peak detection, and alignment. In this talk, we will introduce some recent progress on MS data processing using mathematical tools and statistical methods. Some experimental results will be shown using the data processing software packages developed by the High Dimensional Data Core in the Vanderbilt-Ingram Cancer Center.
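The feature extraction steps listed in this abstract can be sketched for a single spectrum as follows. This is a deliberately simple illustration: it assumes a moving average for de-noising, a running minimum for the baseline, total-intensity scaling for normalization, and local maxima above a relative height for peak detection. These choices and all function names are assumptions made for the example; they are not the algorithms implemented in the Vanderbilt software, and alignment across spectra is omitted because it needs more than one spectrum.

```python
def _window(i, n, half_window):
    """Index bounds of a centered window, truncated at the spectrum edges."""
    return max(0, i - half_window), min(n, i + half_window + 1)


def smooth(intensity, half_window=2):
    """De-noise with a centered moving average."""
    n = len(intensity)
    out = []
    for i in range(n):
        lo, hi = _window(i, n, half_window)
        out.append(sum(intensity[lo:hi]) / (hi - lo))
    return out


def subtract_baseline(intensity, half_window=10):
    """Estimate the baseline as a running minimum and subtract it."""
    n = len(intensity)
    out = []
    for i in range(n):
        lo, hi = _window(i, n, half_window)
        out.append(intensity[i] - min(intensity[lo:hi]))
    return out


def normalize(intensity):
    """Scale the spectrum so that its intensities sum to one."""
    total = sum(intensity)
    return [v / total for v in intensity] if total > 0 else list(intensity)


def detect_peaks(intensity, min_rel_height=0.5):
    """Indices of local maxima at least min_rel_height times the maximum."""
    threshold = min_rel_height * max(intensity)
    return [i for i in range(1, len(intensity) - 1)
            if intensity[i] > intensity[i - 1]
            and intensity[i] >= intensity[i + 1]
            and intensity[i] >= threshold]


def process_spectrum(raw):
    """Run the four single-spectrum steps in order; return (spectrum, peaks)."""
    spectrum = normalize(subtract_baseline(smooth(raw)))
    return spectrum, detect_peaks(spectrum)
```

The window widths and the relative height threshold are tuning parameters that would need to be chosen for real data; the order of the steps follows the list given in the abstract.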
Carcinogens derived from cigarette smoke can bind to DNA to form DNA adducts, and this process is believed to initiate smoking-induced lung cancer. The goal of this work is to incorporate knowledge of this process to improve cancer risk estimates. We use data from a large case-control study of lung cancer conducted at Massachusetts General Hospital for our models, which also incorporate data on several DNA repair genes. We face several difficulties, including (a) adducts were only measured on a very small subset of the dataset; (b) for some individuals, the number of adducts was below the limit of detection; and (c) DNA adducts in lung tissue can be measured in lung cancer cases but never in controls. DNA adducts were also measured in blood mononuclear cells for a small number of cases and controls, and we consider blood adducts to be measured with error relative to lung adducts. By introducing a latent variable for true lung DNA adducts, we allow for measurement error in both types of observed adduct measurements, but assume greater measurement error in blood adducts. We compare the performance of models that incorporate DNA adducts versus those that do not in predicting the case status of individuals not used in fitting the models.