Both microarray and massively-parallel sequencing technologies are offering insights never previously possible into chromatin biology, cytosine methylation and transcriptional regulation. This new field, loosely referred to as "epigenomics", is being studied not only in terms of the means by which these processes influence normal cellular physiology, but also how they become dysregulated in human disease. Of the challenges faced in this field, by far the most significant is the computational analysis of the massive datasets generated when performing epigenome-wide assays. We describe how we have developed systematic approaches for computational analysis of cytosine methylation data, and how we have exploited these resources to gain insights into the normal physiology of the epigenome and its dysregulation in disease.
Weighted gene co-expression network analysis (WGCNA) facilitates a systems biologic view of gene expression data. The network framework makes it straightforward to integrate gene expression data with other types of data, e.g. clinical traits and genetic marker data. This talk covers several theoretical topics including network construction, module definition, network based gene screening, and differential network analysis. The methods are illustrated using several applications including i) screening for cancer genes, ii) comparing human and chimp brains, and iii) complex disease gene mapping. Related articles and material can be found at the following webpage http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
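As a rough illustration of the network construction step, the following minimal sketch builds an unsigned weighted adjacency by raising absolute pairwise correlations to a soft-thresholding power and then sums each row to get a connectivity. The function names, the simulated data, and the choice `beta=6` are assumptions made for this example; this is not the WGCNA R package itself.

```python
import numpy as np

def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency a_ij = |cor(x_i, x_j)|**beta.

    expr is a (samples x genes) array; beta is the soft-thresholding power
    (beta=6 is an illustrative choice, not a recommendation).
    """
    cor = np.corrcoef(expr, rowvar=False)   # gene-by-gene correlation matrix
    adj = np.abs(cor) ** beta
    np.fill_diagonal(adj, 0.0)              # ignore self-connections
    return adj

def connectivity(adj):
    """Whole-network connectivity k_i = sum_j a_ij."""
    return adj.sum(axis=1)

# Simulated data (hypothetical): genes 0-2 share a latent factor and should
# form a tightly connected module; genes 3-5 are independent noise.
rng = np.random.default_rng(0)
n_samples = 50
factor = rng.normal(size=n_samples)
expr = rng.normal(size=(n_samples, 6))
expr[:, :3] += 3.0 * factor[:, None]

adj = soft_threshold_adjacency(expr, beta=6)
k = connectivity(adj)
```

Raising correlations to a power keeps every gene pair in the network while shrinking weak correlations toward zero, so the co-regulated genes stand out by their connectivity without a hard cutoff.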
The tumor microenvironment (TME) is a significant contributor to the progression of cancer. The TME in breast cancer consists of a multitude of cell types such as endothelial cells, fibroblasts, and immune cells. To design a realistic computational model for cancer, we need to model both the intracellular reactions and the cell-cell interactions in the TME. Our goal is to develop a "geographical information system" of the breast TME by integrating spatial information of cells (from microscopic imaging) with molecular information (e.g., gene expression and ChIP-seq) using a systems biology approach. While we will discuss a wide spectrum of algorithms involved in this work, we will focus on the microscopic image segmentation problem using the two-point correlation function and two hybrid linear model fitting algorithms.
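For readers unfamiliar with the two-point correlation function, the sketch below computes it for a binary image: S2 at a given offset is the probability that two pixels separated by that offset both belong to the phase of interest. The FFT-based estimator, the periodic boundary assumption, and the random test mask are choices made for this illustration only; the segmentation procedure of the talk is not reproduced here.

```python
import numpy as np

def two_point_correlation(mask):
    """Two-point correlation S2(dy, dx) of a binary mask.

    Uses the FFT autocorrelation, which assumes periodic boundaries
    (an assumption of this sketch). s2[0, 0] equals the area fraction.
    """
    m = np.asarray(mask, dtype=float)
    f = np.fft.fft2(m)
    # Inverse transform of the power spectrum is the circular autocorrelation.
    s2 = np.fft.ifft2(f * np.conj(f)).real / m.size
    return s2

# Hypothetical test image: uncorrelated pixels, about 30% in the phase.
rng = np.random.default_rng(1)
mask = rng.random((64, 64)) < 0.3
s2 = two_point_correlation(mask)
phi = mask.mean()   # area fraction of the phase
```

For a spatially uncorrelated mask, S2 drops from the area fraction at zero offset to roughly its square at any nonzero offset; clustered structures such as cell nuclei would instead decay over a length comparable to the object size.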
Mathematical formalisms used to describe molecular biocircuits range from deterministic Boolean to discrete stochastic models. Between these extremes lie models based on deterministic ordinary differential equations or on continuous stochastic differential equations. This talk focuses on the master equation approach to discrete stochastic molecular biocircuits. The master equation, with time-dependent transition probabilities per unit of time, captures the signal flow through the molecular biocircuit.
The theoretical approach will be used to analyze the experimental data collected at the single-cell level for the activity of the heat shock protein 70 promoter in Chinese hamster ovary cells.
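To make the master equation concrete, here is a minimal sketch for the simplest possible circuit: a birth-death process in which molecules are produced at a time-dependent rate k(t) and each molecule degrades at rate g. The model, the rate values, the truncation of the state space at `n_max`, and the forward-Euler integrator are all assumptions of this example and are not taken from the talk.

```python
import numpy as np

def generator(k, g, n_max):
    """Transition-rate matrix W for dP/dt = W @ P on states n = 0..n_max.

    W[i, j] is the rate of jumping from state j to state i.
    """
    n = np.arange(n_max + 1)
    W = np.zeros((n_max + 1, n_max + 1))
    W[n[1:], n[:-1]] = k           # production: n-1 -> n at rate k
    W[n[:-1], n[1:]] = g * n[1:]   # degradation: n -> n-1 at rate g*n
    W -= np.diag(W.sum(axis=0))    # diagonal makes each column sum to zero
    return W

def solve_master_equation(k_of_t, g, n_max, t_end, dt):
    """Forward-Euler integration starting from zero molecules."""
    p = np.zeros(n_max + 1)
    p[0] = 1.0
    for step in range(int(round(t_end / dt))):
        W = generator(k_of_t(step * dt), g, n_max)
        p = p + dt * (W @ p)
    return p

def pulse(t):
    """Hypothetical stimulus: high production rate that switches off at t = 5."""
    return 10.0 if t < 5.0 else 2.0

p_final = solve_master_equation(pulse, g=1.0, n_max=60, t_end=10.0, dt=1e-3)
mean_copy_number = np.arange(p_final.size) @ p_final
```

Because every column of the rate matrix sums to zero, total probability is conserved at each step. The time step must stay small relative to the largest exit rate (here k + g * n_max) for the Euler update to keep probabilities non-negative.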
High throughput biological assays such as mass spec proteomics and gene expression arrays offer the potential to revolutionize our understanding of biological pathways. However, describing the patterns of expression and understanding what they tell us are daunting challenges. We describe a statistical model which may be fit to data that is characterized by high dimensionality and few observations. We show how this model may be used to discover associations between gene expression data and other types of high throughput assays, and thereby generate testable biological hypotheses.
In this work, we consider the problem of assessing differential expression of entire gene sets in complex biological experiments. We propose a latent variable model that directly incorporates the underlying biological network structure. Subsequently, using the theory of mixed linear models we develop the necessary inference framework for addressing the task at hand. Several test procedures are examined and a network based method for testing changes in expression levels of gene sets, as well as the structure of the network is presented. The performance of the proposed methodology is assessed through a simulation study and applied to a number of real data sets.
We consider the problem of conducting regression or classification analysis with predictors whose relationships are described a priori by a network. A class of motivating examples is to model a quantitative or categorical phenotype using gene expression profiles while accounting for the coordinated functioning of genes in the form of biological pathways or networks. We introduce our new methods and compare them with some existing ones.
Since their first appearance just over a decade ago, microarrays have become the assays of choice for high-throughput genome-wide studies of gene expression. At the same time the use of microarrays has broadened to include studies of DNA polymorphism, DNA copy number, DNA binding proteins, DNA (re)sequencing, and more. In the course of these developments, we have learned a lot about the many non-biological aspects of microarray data, and have devised methods which attempt to deal with them. Also, many novel statistical methods have been developed to address the challenges posed by the availability of large amounts of microarray data for answering biological questions.
Recent improvements in the efficiency, quality, and cost of genome-wide sequencing are prompting biologists to abandon microarrays in favor of next-generation sequencers, e.g., Applied Biosystems' SOLiD, Helicos BioSciences' HeliScope, Illumina's Solexa, and Roche's 454 Life Sciences sequencing systems, and more. These high-throughput sequencing technologies have already been applied to studying genome-wide transcription levels (mRNA-Seq), transcription factor binding sites (ChIP-Seq), chromatin structure, DNA copy number, and DNA methylation status.
While we might hope that these new sequencing-based studies have overcome many of the limitations of microarray-based studies, realistically we should expect that these new technologies raise problems of their own similar to the ones we met with microarrays. If so, there will be a need for statisticians and others to understand and deal with non-biological features of the data, and to modify existing or develop novel statistical methods to get the best out of these data, when helping biologists address the questions of interest to them. This talk, which draws heavily on recent, unpublished work of Sandrine Dudoit and her students, reports on early findings, work in progress, and promising directions.
It is of great interest to identify genes that play a crucial role in the promotion stage of tumor formation. However, the differential signals at this stage tend to be much weaker than those obtained in comparisons between tumor and normal tissues. One strategy in the study of dietary prevention effects in tumorigenesis is to collect multivariate information, for example, microRNA and various types of mRNA measurements, from the same animals under different experimental setups. This practice allows researchers to borrow strength from the related variables to detect the weak but practically important diet differences at the early stage of tumorigenesis. I will present some challenges we encountered during the study and the methods we developed.
A genome-wide tiling array study requires millions of simultaneous significance tests of binding signals. Controlling statistical false positives in tiling array studies is very important, because the number of identified binding regions can easily exceed the capacity for experimental verification. Using ChIP-chip transcription factor binding data as an example, we introduce a novel and efficient method for accurate evaluation of the statistical significance of peaks. We further introduce a modified FDR control method that is more appropriate for tiling arrays. Using a moving window approach, we then demonstrate how to combine results from various window sizes to increase the detection power while maintaining a specified type I error rate or FDR. Our approach is general and can potentially be applied in many large genomic and genetic studies.
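As background for the FDR discussion, the sketch below implements the standard Benjamini-Hochberg step-up procedure, the usual baseline for FDR control. It is not the modified method introduced in the talk, whose details are not given here, and the p-values are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure.

    Returns a boolean array marking which hypotheses are rejected
    at FDR level alpha (valid under independence of the tests).
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * i / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        last = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        reject[order[:last + 1]] = True        # reject everything up to that rank
    return reject

# Made-up p-values for ten candidate peaks.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

The procedure assumes independent (or positively dependent) tests. Neighboring probes on a tiling array are strongly correlated, which is one reason a modified procedure may be more appropriate in that setting.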
In this talk, we will report our recent effort in utilizing the rapidly accumulating body of genomics data, especially the enormous amount of public microarray data, together with the associated phenotypic and environmental context information to reconstruct the biological basis of phenotypes. Traditional association studies have been relatively successful at relating genetic polymorphisms to phenotypes. However, they have met difficulties in elucidating the gene-gene interactions that contribute to complex phenotypes. Here, we develop novel methods aimed at deriving genome-wide molecular networks of genotype-phenotype associations. Furthermore, we develop methods to perform phenotype prediction and computational diagnosis utilizing public genomics databases, particularly the large public microarray repositories, to create an automated disease diagnosis database.
Common human diseases are driven by multiple coherent networks interacting within and between tissues, not by simple changes in single genes. There are transcriptional, protein-protein interaction, phosphorylation, and metabolite networks, to name just a few, in biological systems. In addition, many different genetic and environmental factors can affect these networks and in turn lead to phenotypic change at the organism level. Multiple types of high-throughput data are available for constructing networks, including RNA microarray data, ChIP-chip data, protein array data, siRNA screening data, and DNA variation data, among several other types. These different types of networks interact with each other both within and between multiple tissues, and these interactions in turn contribute to disease risk and progression. We demonstrate here, using yeast, mouse and human systems, how to integrate different types of genomic data to derive networks that elucidate complex disease traits like obesity. We demonstrate how these networks aid in the identification of new drug targets and biomarkers for common human diseases like obesity, diabetes, and heart disease.