Reverse engineering mammalian transcriptional regulatory networks
Pavel Sumazin
Andrew D Smith
Contents
1 Introduction and background
  1.1 Background on transcriptional regulation
  1.2 Genome-scale data on transcriptional regulation
  1.3 Overview of this tutorial
2 Technical details of the analysis methods
  2.1 Identifying differentially regulated and co-regulated gene sets
    2.1.1 Differential expression
    2.1.2 Measuring similarity between expression profiles
    2.1.3 Clustering genes by expression
    2.1.4 Inferring direct regulatory interactions
  2.2 Modeling regulatory elements
  2.3 Predicting transcription factor binding sites
  2.4 Modeling motif enrichment in sequences
    2.4.1 Motif enrichment based on likelihood models
    2.4.2 Relative enrichment between two sequence sets
  2.5 Phylogenetic conservation of regulatory elements
    2.5.1 Three strategies for identifying conserved binding sites
    2.5.2 Considerations when using phylogenetic footprinting
  2.6 Motif discovery
    2.6.1 Word-based and enumerative methods
    2.6.2 General statistical algorithms applied to motif discovery
  2.7 Cis-regulatory modules
    2.7.1 Modeling cis-regulatory modules
    2.7.2 Identifying cis-regulatory modules
    2.7.3 motif module discovery
3 Applications and examples
  3.1 Analyzing sets of co-regulated genes
    3.1.1 Example co-regulated genes
    3.1.2 Obtaining promoter sequences
    3.1.3 Evaluating known motifs
    3.1.4 Predicting functional binding sites
    3.1.5 Identify cis-regulatory modules
  3.2 Analysis of transcription factor localization data
    3.2.1 Example localization data sets
    3.2.2 Obtaining sequences for analysis
    3.2.3 Characterizing TF binding specificity
    3.2.4 Identifying co-factors
  3.3 Discussion
    3.3.1 Near-term vision for the systems biology of transcriptional regulation
    3.3.2 Open problems
Abstract
We give an overview of computational methods for reverse engineering mammalian transcriptional regulatory circuits using evidence from sequence analysis. We break the analysis down into its components and study individual tasks case by case. We explain the motivation for individual tasks, and the concepts underlying solutions and computational tools for these tasks. We also describe how individual results fit within the systems view that is the larger goal of reverse engineering transcriptional regulatory networks. The tutorial includes biological background and motivation, followed by coverage of technical concepts and a presentation of their application. The tutorial will be useful to researchers who are interested in applying computational methods to understanding transcriptional regulatory sequences. It will also be useful to quantitative scientists looking to obtain a practical view of the field, including the current state of the art and short- and long-term visions.
Chapter 1
Introduction and background
This chapter includes biological background and motivating biological context for computational reconstruction of transcriptional regulatory circuits. It presents short- and long-term visions for the field, and a brief outline of the document.
1.1 Background on transcriptional regulation
Gene expression is regulated according to differentiation or maintenance programs, and is influenced by signaling cascades originating in the extracellular environment. Transcription factors (TFs) are the agents that implement the regulatory program by recruiting and initiating the transcription apparatus. The human genome is thought to encode thousands of TFs that regulate gene expression at the transcription level in a context-specific manner. They form quantitative interaction networks, composed of intricate circuits of transcriptional activation and repression, that ultimately determine the level to which each gene is transcribed. While there are several mechanisms by which transcriptional regulatory circuits are implemented, all are believed to include sequence-specific interactions between proteins and DNA.

The molecules and mechanisms involved in transcriptional regulation are often referred to as “regulatory systems” because their behavior is mainly defined by a complex hierarchy of interactions between system layers and modular components. We concentrate on transcriptional regulation and omit post-transcriptional regulation, and technically we focus on regulation of initiation as opposed to other steps of transcriptional regulation. This abstract view is useful because it organizes the similarities between regulatory systems and illuminates their critical differences.

Focused and tenacious research programs have reverse engineered particular regulatory systems and gained a detailed understanding of how these systems operate (Davidson, 2001). This work, based on decades of research, has provided general insight about regulatory systems, and some specific quantitative descriptions of interactions between TFs and their targets.
In the April 5, 2005 PNAS special issue on gene regulatory networks for development, Levine and Davidson introduced work to map the gene regulatory networks that regulate early developmental processes in worm, fly, sea urchin and frog, and cell specification in mammalian B cells. The regulatory networks described are based on verified protein-DNA interactions and present a unique opportunity for direct comparison (Levine and Davidson, 2005). These gene regulatory networks explicitly represent the causality of developmental processes. They explain exactly how genomic sequence encodes the regulation of expression of the sets of genes that progressively generate developmental patterns and execute the construction of multiple states of differentiation (Levine and Davidson, 2005).
1.2 Genome-scale data on transcriptional regulation
The availability of genomic sequence data and the maturation of high-throughput technologies enable individual experiments to rapidly produce genome-scale data sets that comprise thousands of individual observations relevant to transcriptional regulation. Gene expression microarrays, chromatin immunoprecipitation microarrays and parallel sequencing are among the most prominent of these technologies, and their combined commercial availability and decreasing cost will soon make them widely accessible.

Computational methods are important for understanding transcriptional regulation primarily for two reasons. First, the complexity of interactions in transcriptional regulatory systems requires that complex statistical models be constructed and manipulated during the analysis, which in turn requires sophisticated analysis algorithms. Second, and more critically, the “genome-scale” data sets produced by high-throughput experimental technologies require sophisticated and highly efficient computing algorithms. These algorithms need an equivalent effort on developing efficient and robust implementations if they are to extract the full potential from the data.

The kinds of data currently being produced and analyzed to probe the sequence-based mechanisms of transcriptional regulation are highly diverse. Microarray gene expression data is currently a component of much of this research. This technology provides information about the level to which each gene is transcribed in a particular cell type or tissue. Large-scale projects, such as those of Klein et al. (2001), Su et al. (2004), Zhang et al. (2004) and Lamb et al. (2006), have produced expression profiles for many tissues and disease states, and have provided a valuable resource. The exponential growth of the GEO database of expression profiles (Barrett et al., 2007) is evidence that this technology is now broadly accessible in support of smaller projects.
A significant portion of the regulation for individual genes is encoded in proximal promoters as sets of regulatory elements, which are recognized and bound by specific TFs. One of the mechanisms by which TFs binding in proximal promoters regulate expression is through interaction with the PolII complex, influencing the rate of initiation at the TSS. Analysis of promoters for gene sets differentially expressed between two conditions is frequently conducted to identify the binding sites and TFs contributing to the observed expression.

Recently, the advances in hybridization array technology that enabled high-throughput expression profiling have also allowed high-throughput binding site identification for individual TFs. In ChIP-on-chip experiments, TFs are covalently cross-linked to DNA, and the bound genomic locations are identified by sonication, antibody-based immunoprecipitation and subsequent array hybridization. ChIP-on-chip has already been applied to obtain the cell type-specific genomic localization of many important TFs, including p53 (Wei et al., 2006), ER (Carroll et al., 2006), NF-κB (Schreiber et al., 2006), E2F1 (Bieda et al., 2006) and CTCF (Kim et al., 2007). ChIP-on-chip data provides binding site locations with a resolution on the order of hundreds of bases, and in silico analysis is done to pinpoint the exact sites where binding occurs, to identify and control for possible experimental confounds, and to perform higher-level analysis such as identification of co-factors.

Regulatory elements in vertebrates are able to regulate transcription of genes over long distances by a variety of mechanisms. Sequencing-based techniques like ChIP-PET and the use of genome-wide tiling arrays in ChIP-on-chip now permit identification of such distal binding sites by eliminating the need for a priori estimates of where they occur. These additional discovery possibilities present associated challenges for data analysis.
Epigenetic properties of the genome represent another class of regulatory mechanism. Organization of chromatin structure regulates transcription by controlling the accessibility of genes to the transcriptional regulation machinery. The regulatory functions of chromatin are mediated in part by post-translational modification of histone tails, including methylation and acetylation, which have been implicated in both repression and activation of transcription. Modifications to individual DNA bases are also known to influence
transcription. The methylation of DNA may itself physically impede the binding of TFs, thus eliminating their effect on transcription of the corresponding target gene. Also, methylated DNA may be bound by proteins that recruit additional proteins, eventually modifying chromatin state.

Chromatin structure data can be obtained in several ways, including some high-throughput techniques. DNaseI hypersensitive sites, which are in an open chromatin state and are often associated with regulatory DNA, can now be identified in high throughput (Follows et al., 2006). ChIP-on-chip can also be applied to identify regions with specific histone modifications (Hubert et al., 2006; Bernstein et al., 2006). Although still a developing technology, the chromosome conformation capture carbon copy method (Dostie et al., 2006) can identify genomic regions that interact through DNA looping, and will eventually help identify interacting regulatory elements. Analyzing these regions computationally can reveal both the mechanisms responsible for controlling the chromatin structure and the regulatory elements whose accessibility is controlled by the chromatin state. Understanding regulatory proteins that control structural aspects of the genome, along with the associated genomic regions they influence, will be essential to fully elucidating transcriptional regulatory circuits.

The computational analysis, like the experimental techniques themselves, depends critically on the availability of annotated genomic sequences. Of particular importance to understanding transcription is the annotation of promoters, which are demarcated by transcription start sites (TSS). TSS locations in mammalian genomes are being identified by maturing techniques such as CAGE tagging, oligo-cap 5’-RACE and PolII ChIP-on-chip, which has the potential to resolve cell type specificity of alternative promoters.
Databases like CSHLmpd (Xuan et al., 2005), DBTSS (Yamashita et al., 2006) and Fantom3 (The FANTOM Consortium, 2005) hold high-quality information about TSS locations. Availability of multi-species genome alignments allows for the use of evolutionary information in the analysis (Blanchette et al., 2004). A wealth of data is also available from other sources, including functional data from pathway and interaction databases (e.g. the KEGG database of Kanehisa et al. (2006) and the HPRD database of Peri et al. (2003)).
1.3 Overview of this tutorial
The remainder of the tutorial is divided into two parts. The first part focuses on the technical aspects, including models, algorithms and statistics used in reverse engineering transcriptional regulatory circuits. This part provides enough mathematical detail to develop intuition about analysis techniques, and to communicate the essence of why and how particular techniques work. This part is written largely as a survey. Detailed references are provided, along with a historical perspective on the development of the computational technology.

The second part shows how to apply sequence analysis techniques, working through examples motivated by expression and binding data. We analyze regulatory sequences related to co-regulated genes, and sequences identified as bound and unbound by given TFs in ChIP-chip experiments. The worked examples in this part will be useful to researchers not familiar with certain individual analysis tasks, or to researchers who are first presented with transcriptional regulation data and want a rough template on which to model their analysis. To a great extent, mathematical details are avoided in this part, and more emphasis is placed on demonstrating tools and explaining why particular analysis tasks are performed, what can be learned from the tasks, and how to select the appropriate tools and avoid common pitfalls. Example data is provided (see the accompanying website) so that readers may try the examples themselves, or follow the examples using new data.
Chapter 2
Technical details of the analysis methods
We give an overview of methods to identify co-regulated genes and analyze their regulatory regions. The earliest work that used regulatory region analysis to identify DNA patterns that predict expression on a genomic scale originated from the lab of George M. Church (Tavazoie et al., 1999; Pilpel et al., 2001). Following a protocol similar to that of Aach et al. (2000), Pilpel et al. (2001) collected yeast gene expression data from various conditions, and then used a computational method (Hughes et al., 2000) to identify overrepresented patterns in SCPD (Zhu and Zhang, 1999) promoters. More recently, Beer and Tavazoie (2004) demonstrated that DNA patterns in proximal promoters of yeast genes can be used to construct predictive models for co-expression. Specifically, they showed that motifs, their sites in proximal promoters, the affinity of motifs to sites, and distance preference between sites can be used to classify co-expressed genes. In this chapter we review methods to identify differentially regulated and co-regulated genes, predict transcription factor binding sites, identify enriched transcription factor binding sites, and predict synergy between transcription factors.
2.1 Identifying differentially regulated and co-regulated gene sets
We assume that gene expression data is obtained through some high-throughput experimental technology such as gene-expression microarrays. A common scenario is for expression (mRNA concentration) to be measured under two different conditions, for example a healthy cell and a diseased cell of the same type. In this scenario, we search for genes that have some function that is specific to one of the conditions: elevated expression in the diseased cell, for example. We say that such genes are differentially expressed, and our goal is to identify the root cause of their differential regulation. We describe simple methods for identifying differentially expressed genes.

A gene module is a set of genes that are grouped together because they share some property related to their expression. Usually gene modules are sets of co-expressed genes, whose members are likely to be co-regulated or involved in a single cellular function or pathway. Gene modules may contain genes that encode regulatory proteins (e.g. TFs and signaling proteins) and their targets, so the relationships between genes in a module can be complex. Identifying gene modules is the principal goal of many high-throughput gene expression experiments.

With respect to reverse-engineering transcriptional regulation, we are rarely interested in describing the behavior of all genes. We usually focus only on a gene module related to a particular cell type, developmental stage, disease, cellular response or perturbation. Identifying a relevant gene module is the first step towards understanding how gene expression is regulated in that particular context. We review some well-established methods for identifying gene modules and some recent advances.
Although the most common situation is to have expression measurements in two conditions, much more interesting and powerful analysis can be performed when expression measurements have been obtained for multiple (usually related) conditions. When multiple conditions are available, we think of them as being represented in a gene expression matrix (GEM), with rows corresponding to genes and columns corresponding to conditions. For $n$ genes and $m$ conditions, the $(n \times m)$ GEM is written

\[
X = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,m}
\end{pmatrix},
\]

with entry $x_{i,j}$ denoting the value associated with gene $i$ under condition (or experiment) $j$. We refer to this value as the expression level of gene $i$ in condition $j$. Notice that even with just two conditions we could still use a GEM to describe the expression data, and the GEM would have two columns. The expression level of gene $i$ across all experiments can then be represented as a gene expression profile (GEP):

\[
\vec{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,m}).
\]

We seek to obtain information about how the expression levels of different genes relate to each other. Note that methods for processing GEMs most often require that the values be normalized (either by row/gene or column/condition).
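As an illustration of the GEM layout and of row normalization, here is a minimal sketch in plain Python. It assumes the GEM is held as a list of rows, one GEP per gene, and it uses zero mean and unit variance as the normalization, which is only one common choice. The function name `normalize_rows` and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def normalize_rows(gem):
    """Scale each gene expression profile (row) to zero mean and unit variance."""
    normalized = []
    for profile in gem:
        mu = mean(profile)
        sigma = pstdev(profile)  # population standard deviation of the row
        if sigma == 0:
            # A flat profile has no variation to scale; map it to zeros.
            normalized.append([0.0] * len(profile))
        else:
            normalized.append([(x - mu) / sigma for x in profile])
    return normalized

# Toy GEM (made-up values): n = 3 genes as rows, m = 4 conditions as columns.
gem = [
    [2.0, 4.0, 6.0, 8.0],
    [10.0, 10.0, 10.0, 10.0],
    [1.0, 3.0, 2.0, 6.0],
]
normalized = normalize_rows(gem)
gep_0 = normalized[0]  # normalized GEP of the first gene
```

Normalizing by column instead would amount to applying the same function to the transposed matrix, for example `normalize_rows(list(zip(*gem)))`.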
2.1.1 Differential expression
We consider two scenarios:
1. The data includes two measurements, one in each of two conditions.
2. The data includes two sets of experiments, where each set is associated with a particular condition. The sets may include measurements across different states of a given condition, repeated measurements taken from multiple biological sources under each condition, or even simple repeated measurements.

When our data consists of two measurements, we must make choices based on our expectation for the (1) number of differentially expressed genes, (2) significance of expression intensity, and (3) significance of expression ratio. We can rely on present/absent calls made by the microarray analysis software, but this one-dimensional information is seldom sufficient on its own to identify differentially expressed genes.

When two sets of measurements are available, we can identify statistically significant expression changes using a non-parametric test such as the Mann-Whitney U test (Hollander and Wolfe, 1999). When the data includes more than two sets of measurements, the more general Kruskal-Wallis test can be used. We caution that small sets limit the attainable significance. With an exact one-sided test, the smallest p-value attainable for sets of sizes $n_1$ and $n_2$ is $1/\binom{n_1+n_2}{n_1}$, and it is reached only when the smallest measurement in one set is greater than the largest measurement in the other set. To reject the hypothesis that two sets of measurements are taken from the same distribution, at least one of the sets must therefore include more than 3 measurements: two sets of size 3 cannot go below p = 1/20 = 0.05, and sets of size 3 and 4 reach p = 1/35 ≈ 0.029 only under complete separation. Rejecting at p-value < 0.01 requires larger sets, for example sizes 4 and 5, where the minimum is 1/126 ≈ 0.008.
Two measurements. This is the most difficult scenario for identifying significantly differentially expressed genes. The analysis is further complicated when the measuring technology is not the same or when the mRNA sources differ substantially; the comparison of expression in tissue samples and immortalized cell lines is an extreme example of differing mRNA sources. Decisions should not be made based on
expression ratio alone, and should include consideration for absolute probe intensity, and mismatch probes when available. The simplest strategy is to set a minimum mean intensity level threshold on reads across the two measurements, and discard reads that fall below this threshold. Represent each 2-point GEP $\vec{x}_i$ by the ratio of expression values, and compare this ratio to the ratios of the $k$ GEPs that are closest in terms of the mean of their expression across the experiments. Present/absent calls can be used to refine the list further by eliminating genes that are called absent in both experiments. We say that a gene is differentially expressed if its ratio is lower than the ratios of $\rho$ of its $k$ closest neighbors. Alternatively, instead of choosing the $k$ closest neighbors, use a kernel function such as a Gaussian function (Mitchell, 1997) to weight the contribution of neighbors according to the distance of their mean expression level.
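The neighbor comparison for two measurements can be sketched as follows. This is an illustrative sketch under several assumptions that are not from the tutorial: intensities are strictly positive, ratios are compared on a log2 scale, and the result is reported as the fraction of the $k$ intensity-matched neighbors with a smaller ratio, so that a threshold $\rho$ (read here as a fraction) can be applied afterwards in either direction. The function name `neighbor_rank`, the parameter `min_mean` and the toy values are hypothetical.

```python
import math

def neighbor_rank(geps, k=4, min_mean=1.0):
    """For each gene, return the fraction of its k intensity-matched
    neighbors whose log-ratio is smaller than its own.

    geps maps a gene name to (value in condition 1, value in condition 2).
    Assumes strictly positive values so that the ratio is defined.
    """
    # Discard reads whose mean intensity falls below the threshold.
    kept = {g: v for g, v in geps.items() if (v[0] + v[1]) / 2 >= min_mean}
    mean_level = {g: (a + b) / 2 for g, (a, b) in kept.items()}
    log_ratio = {g: math.log2(b / a) for g, (a, b) in kept.items()}

    rank = {}
    for g in kept:
        # The k genes closest to g in mean expression level.
        neighbors = sorted(
            (h for h in kept if h != g),
            key=lambda h: abs(mean_level[h] - mean_level[g]),
        )[:k]
        if neighbors:
            smaller = sum(log_ratio[h] < log_ratio[g] for h in neighbors)
            rank[g] = smaller / len(neighbors)
    return rank

# Made-up intensities for six genes in two conditions.
geps = {
    "g1": (100, 105), "g2": (110, 100), "g3": (95, 98),
    "g4": (105, 104), "g5": (102, 410), "g6": (0.2, 0.9),
}
rank = neighbor_rank(geps, k=4, min_mean=1.0)
```

In this toy set `g6` is dropped by the intensity filter despite its large ratio, and `g5` ranks above all of its neighbors. With so few genes the most extreme gene in each neighborhood always reaches rank 0 or 1, so in practice `k` would need to be much larger for a threshold on the rank to be meaningful.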
Two or more sets of measurements. Given several experiment repetitions, we can identify differentially regulated genes with higher accuracy. Generally, we expect experiment repeats to be normally distributed. However, perfect repetitions are rare and some measurement subsets are more likely to be correlated than others, especially when the source of the data is variable. To avoid analysis errors, use a non-parametric significance test such as a Mann-Whitney U test or a Kruskal-Wallis test to replace the simple ratio used when only two measurements are available. Using these tests will also help identify better estimates for $\rho$.
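For small sets, the Mann-Whitney U statistic and an exact one-sided p-value can be computed by direct enumeration. The sketch below is one way to do this in plain Python; it enumerates every relabeling of the pooled measurements, so it is only practical for a handful of replicates. The function name `mann_whitney_exact` and the measurements are hypothetical.

```python
from itertools import combinations

def mann_whitney_exact(a, b):
    """Return (U, p) where U counts pairs with a-value > b-value (ties count
    as one half) and p is the exact one-sided p-value that values in `a`
    tend to exceed values in `b`, obtained by enumerating all relabelings."""
    def u_stat(xs, ys):
        return sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)

    pooled = list(a) + list(b)
    observed = u_stat(a, b)
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        hits += u_stat(xs, ys) >= observed  # as or more extreme than observed
    return observed, hits / total

# Made-up replicate measurements for one gene in two conditions.
diseased = [9.1, 8.7, 9.4, 8.9]
healthy = [7.2, 7.9, 8.1]
u, p = mann_whitney_exact(diseased, healthy)
```

Here the two sets are completely separated, so `u` equals the number of pairs (12) and `p` equals 1/35, the smallest value attainable with sets of size 4 and 3. For larger sets a library implementation with a normal approximation would be the more usual choice.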
2.1.2 Measuring similarity between expression profiles
The most straightforward method for analyzing a gene expression matrix is to identify genes that exhibit similar behavior across the data set. Identifying co-regulated genes relies on defining a metric of distance or similarity between GEPs. We briefly outline advantages and disadvantages of commonly used metrics.

Euclidean distance. The straight-line distance between gene expression vectors in N-dimensional Euclidean space, defined as:

\[
d_{i,j} = |\vec{x}_i - \vec{x}_j| \tag{2.1}
\]

Before computing Euclidean distance, GEPs should be normalized so that each measurement represents deviation from the average level across all experiments. This is commonly done, for example, by normalizing each GEP to have zero mean and unit variance.

Pearson correlation. This frequently used correlation indicates the strength of the linear relationship between genes. The Pearson correlation coefficient for genes $i$ and $j$ is defined as:

\[
r_{i,j} = \frac{\mathrm{cov}(\vec{x}_i, \vec{x}_j)}{\sqrt{\mathrm{var}(\vec{x}_i)\,\mathrm{var}(\vec{x}_j)}}
= \frac{(\vec{x}_i - \mu_i)^T (\vec{x}_j - \mu_j)}{|\vec{x}_i - \mu_i|\,|\vec{x}_j - \mu_j|} \tag{2.2}
\]

where $\mu_z = \sum_{k=1}^{N} x_{z,k} / N$ and $N$ is the number of measurements in each GEP. This metric takes values between -1 and 1, and negative values indicate inverse correlation. Pearson correlation should be used when each GEP is normally distributed.

Spearman correlation. This is a non-parametric analog of Pearson correlation, and computes the Pearson correlation coefficient on rank-transformed data. This metric should be used when a linear relationship assumption between GEPs is inappropriate, for example when GEPs are not normally distributed. Because it is rank transformed, Spearman correlation is also more robust against outliers than Pearson correlation, but has less statistical power for linearly related data. When the type of correlation between GEPs is not known, and this is generally the case, use Spearman correlation.
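The three metrics can be compared on a small example. The following is a minimal sketch in plain Python that follows equations (2.1) and (2.2) and computes Spearman correlation as Pearson correlation on average ranks. The function names and the toy profiles are hypothetical, and constant profiles are not handled (the Pearson denominator would be zero).

```python
import math

def euclidean(x, y):
    """Straight-line distance between two profiles, as in equation (2.1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation, as in equation (2.2). Undefined for constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return num / den

def ranks(x):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Pearson correlation computed on rank-transformed profiles."""
    return pearson(ranks(x), ranks(y))

# Two made-up profiles with a monotone but non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [v * v for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Because `y` increases monotonically with `x`, the rank-based coefficient is 1 while the Pearson coefficient is slightly below 1, which illustrates why Spearman correlation is preferred when linearity cannot be assumed.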
Mutual information. MI is the most general similarity metric and can capture arbitrary, non-linear relationships. MI computes the differential entropy between GEPs, and for a pair of GEPs $\vec{x}_i$ and $\vec{x}_j$, treated as random variables, is defined as:

\[
I_{i,j} = S(\vec{x}_i) + S(\vec{x}_j) - S(\vec{x}_i, \vec{x}_j) \tag{2.3}
\]

where $S(t)$ is the entropy of an arbitrary variable $t$. For a discrete variable, the entropy is:

\[
S(t) = -\sum_i p(t_i) \log p(t_i) \tag{2.4}
\]

where $p(t_i) = \mathrm{Prob}(t = t_i)$ is the probability of each discrete state (value) of the variable. Like the Pearson correlation, MI measures the degree of statistical dependency between two variables. It is nonzero if and only if the variables are statistically dependent (Shannon, 1948).
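Equations (2.3) and (2.4) can be applied to expression profiles once the values are discretized. The sketch below uses equal-width binning and plug-in probability estimates, which are assumptions made for illustration and not the tutorial's estimator; with few measurements such estimates are biased. The function names, the `bins` parameter and the toy profiles are hypothetical. Entropies are in nats (natural logarithm).

```python
import math
from collections import Counter

def entropy(states):
    """Entropy of a sequence of discrete states, as in equation (2.4)."""
    n = len(states)
    return -sum((c / n) * math.log(c / n) for c in Counter(states).values())

def discretize(profile, bins=3):
    """Equal-width binning of a profile into integer states 0..bins-1."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return [0] * len(profile)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in profile]

def mutual_information(x, y, bins=3):
    """I = S(x) + S(y) - S(x, y), as in equation (2.3), on binned profiles."""
    sx, sy = discretize(x, bins), discretize(y, bins)
    return entropy(sx) + entropy(sy) - entropy(list(zip(sx, sy)))

# A made-up symmetric, non-linear relationship: y depends on x, yet the
# linear (Pearson) correlation between the two profiles is zero.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [(v - 2.5) ** 2 for v in x]
mi_dependent = mutual_information(x, y, bins=3)

# Two binary profiles whose states co-occur uniformly: MI is zero.
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1], bins=2)
```

The first pair shows the property emphasized in the text: a dependency that a linear correlation misses still yields a clearly positive MI.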
2.1.3 Clustering genes by expression
Once a similarity metric is defined, various clustering algorithms can be applied to group together genes with similar expression profiles. Clustering algorithms are general statistical tools, and an introduction to them is available in MacKay (2003). Here we focus on clustering microarray expression data. Most microarray expression clustering algorithms begin by computing an $N \times N$ similarity matrix, where each entry $r_{i,j}$ represents the similarity score between gene expression profiles $\vec{x}_i$ and $\vec{x}_j$.

Hierarchical clustering. The algorithm described by Eisen et al. (1998) is the most commonly used for both visualization and analysis of microarray data. At each iteration the two most similar GEPs are matched and joined into a cluster; this clustering is represented by an edge on the cluster graph, where the length of the edge corresponds to the distance between the GEPs. After GEPs are connected into a cluster, their values in the correlation matrix are replaced by a single value by taking either the average (average linkage), minimum distance (single linkage), or maximum distance (complete linkage) among all values previously computed using the GEPs contained in the newly formed cluster. If genes $\vec{x}_i$ and $\vec{x}_j$ are connected in the cluster graph, then, for each gene $x_k \in \{x\} \setminus \{x_i, x_j\}$, the values $r_{i,k}$, $r_{j,k}$, $r_{k,i}$ and $r_{k,j}$ are replaced by a single value.

K-means clustering. This clustering method (Tavazoie et al., 1999) groups the data into $k$ non-overlapping clusters, where each cluster is represented by a centroid describing the average behavior of the cluster, and each GEP is assigned to the centroid with which it has the highest similarity score.

Self organizing maps. These are closely related to k-means, but force cluster centers to be located on a grid (Tamayo et al., 1999).

Clustering by expectation maximization. A version of k-means, in which GEPs are assigned a soft probability of belonging to each cluster (Segal et al., 2003).
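The matrix update used in hierarchical clustering can be sketched as follows. This is a minimal average-linkage illustration in plain Python, not the implementation of Eisen et al. (1998). It assumes the input is a symmetric distance matrix (for example one minus a correlation), and the function name `average_linkage`, the `n_clusters` stopping rule and the toy data are hypothetical.

```python
def average_linkage(dist, n_clusters=1):
    """Agglomerative clustering on a symmetric distance matrix (list of lists).

    Repeatedly joins the two closest clusters until n_clusters (>= 1) remain.
    Returns (clusters, merges): the final clusters as lists of item indices,
    and the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = {i: [i] for i in range(len(dist))}
    # Current inter-cluster distances, keyed by an ordered pair of cluster ids.
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(dist)
    while len(clusters) > n_clusters:
        (a, b), best = min(d.items(), key=lambda kv: kv[1])
        merges.append((clusters[a], clusters[b], best))
        na, nb = len(clusters[a]), len(clusters[b])
        new = {}
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Average linkage: the two old values are replaced by their
            # size-weighted mean, i.e. the mean over all member pairs.
            new[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        d = {key: v for key, v in d.items() if a not in key and b not in key}
        d.update(new)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return list(clusters.values()), merges

# Four made-up items placed on a line; distances are absolute differences.
points = [0.0, 0.1, 1.0, 1.2]
dist = [[abs(p - q) for q in points] for p in points]
clusters, merges = average_linkage(dist, n_clusters=2)
```

Replacing the weighted mean with `min(dac, dbc)` or `max(dac, dbc)` would give single or complete linkage on distances. The merge history carries the edge lengths needed to draw the cluster graph.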
These clustering approaches assign all GEPs to clusters, and uninformative GEPs and even unrelated data will form clusters. For these reasons, clusters must be interpreted carefully to ensure that they are biologically meaningful.

When searching for specific trends of possibly few genes across expression profiles, an alternative to the hierarchical clustering approach is needed. Schug et al. (2005) describe an algorithm to identify tissue-specific genes. They were interested in isolating a cluster of genes that show significantly elevated or inhibited expression in a small set of possibly related tissues. The expression value of a gene $g$ across $t$ tissues is normalized to sum to 1, and the entropy of a gene’s expression distribution $H_g$ is computed from the resulting