yinlu1978的个人博客分享 http://blog.sciencenet.cn/u/yinlu1978

博文

双聚类xmotif算法的论文翻译及个人理解

已有 3468 次阅读 2013-5-3 10:57 |系统分类:论文交流| xmotif

Extractingconserved gene expression motifs from gene expression data

(论文全文见附件

murali.pdf

1introduction

      geneexpression plays an important role in cell differentiation development andpathological behavior, DNA micorarray offer biologists the remarkable abilityto monitor the expression levels of thousands of genes in a cellsimultaneously. High-throughput gene expression analysis promises to producenew insights into cell function as well as stimulate the development of newtherapies and diagnostics.

      基因表达在细胞分化和病理行为上扮演了重要的作用,DNA微阵列技术提供给生物学家一个非常重要的能力去同时监控一个细胞中成千上万的基因的表达水平。高通量的基因水平分析不仅有望可对细胞功能的有了新的认识而且促进了新治疗和诊断技术发展。

      whena gene's expression level is measured across a variety of samples, theexpression values usually span a wide range. Biologically, these valuescorrespond to a small number of distinct states that the genes is in, e.g.,up-regulated or down-regulated. Since the up-regulated of a gene is a temporalprocess, it is often difficult to determine a gene's expression state based ona set of noisy expression values. However, on average, it might be possible todifferentiate the expression level of an up-regulated gene in one tissuetype(e.g., a cancer tissue) from its level in another type of tissue(e.g., ahealthy tissues) where the gene is not expressed with the same abundance.

      当一个基因的表达水平在多个样本上被测量时,基因值通常是在个区间范围内取值的。从生物学的意义上来说,这些值对应于数量较少的不同状态比如基因是在上调或下调状态。由于一个基因的上调状态是一个短暂的过程,所以很难根据一系列有噪音的表达值去决定一个基因的表达状态。然而,平均来说是有可能区分出在一种组织类型里比如癌症组织上的上调基因的表达水平相对于另外类型的组织比如健康人的组织,在该组织上基因并没有同癌症一样高的表达水平。(这里说明了该算法的生物支持)

      Motivatedby this observation, we say that a gene's expression level is conserved acrossa subset of samples if the gene is in the same state in each of the samples inthis subset. A conserved gene expression motif or xMotif is a subset of geneswhose expression is simultaneously conserved for a subset of samples; we saythat each of these samples matches the motif. In this paper, we use a range ofexpression values to represent a gene's state, if we map each gene to adimension, each sample to a point, each expression value to a coordinate value, an xMotif is identical to a multi-dimensional hyper rectangle that is boundin the dimensions corresponding to the conserved genes in the motif andunbounded in the other dimensions. See figure 1 for the examples.Mathematically, the expression values of a sample that matches a motif satisfya conjunction of inequalities, two for each gene in the motif.

      基于这个现象,我们可以说一个基因的表达水平在一个样本子集上是守恒的如果基因在每一个样本子集对应下的状态是一样的。一个守恒的基因表达基序或xMotif就是一个基因子集,在该子集中的每一个基因的表达水平对于一个特定的样本子集都是守恒的。我们也可以说这个样本子集中的每一个样本匹配基因表达基序。在本文中我们采用基因表达值的范围代表基因的表达状态,如果我们将基因映射为维数,将样本映射成一个点,每个表达值映射成一个坐标值,那么xMotif就等价于一个多维的超长方形,在该超长方形上对于motif中守恒的基因就是有界的维,而在其它维上是没有界的。例子见图1.从数学上来说,一个基序匹配的样本的基因表达值满足不等式的合取对于在基序中的二个基因。

      Inthis work, we address the task of identifying xMotifs in gene expression data.We concentrate on data sets where each sample belongs to a particular class,e.g., different types of cancer, cancerous and healthy tissues, and patientswith different survival rates.

      在本工作中,我们旨在解决在基因表达数据集上确定多个xMotif的任务。我们主要关注于这样的数据集上,在该数据集上每个样本属于一个特定的类型,比如不同类型的癌症,还比如癌症和健康的组织,或者不同存活率的病人。(这里指明了xMotif算法的适用数据集

     We believe that this work is the one of thefirst to suggest representing gene expression data concisely in the form ofxMotifs. Such a representation has several potential biological advantages andapplications. First, by comparing and contrasting the gene motifs for differentclasses, we can identify genes that are conserved in multiple classes but arein different states in different classes. If the classes correspond todifferent diseases or to diseased and normal tissues, such genes are possibledrug targets. Second, if the genes in a motif are believed to interact in apathway, the information present in the motif about which gene are highlyexpressed and which are suppressed could be used to deduce and refine thestructure of the pathway. Third, by requiring that multiple genes besimultaneously conserved across the samples matching a motif, we might be ableto characterize sub-classes in the data that no gene on its own provideshigh-quality evidence for.

      我们相信这个工作是最早提议用xMotif代表基因表达数据集的工作之一。这种表示有几个潜在的生物好处和应用。首先是,通过比较和对比不同类型的基因的基序,我们可以确定这样的基因,这些基因在对应于每一个同一类型的基序是守恒的,但在不同类型下的基序状态是不一样的。如果这样的类型对应于不同的疾病或对应于疾病和正常组织,那么这些基因就可能是药物的靶基因。第二是如果一个基序中的基因被认为是一个通路中在交互的,那么在基序中展示出来的哪些基因是向上调表达的哪些基因是抑制表达的信息能被用来推断和改进通路的结构。第三是,由于基序要求对应的多个基因在基序对应条件下是同时守恒的,我们也可以在该数据集上描述这个子类的特点,这样的数据集上没有一个基因自身能提供任何高质量的证据。

      Beforeattempting to develop an algorithm for computing xMotifs, it is useful toconsider the properties that an xMotif should have. Computing one motif foreach sample makes the representation over-specific. Therefore, we desire thateach motif for a class should be matched by a large fraction of the samples inthat class, if not all the samples in that class. Each motif should contain asmany conserved genes as possible. While a motif that contains one or two genesis biologically feasible, it may not be statistically significant since such amotif could appear with high probability in randomly-generated data. On theother hand, a motif that contains too many genes may be too restrictive sinceno sample may match the motif. Motivated by these observation, we propose thefollowing formal definition of an xMotif:

      在尝试开发算法解决xMotifs问题之前,考虑一个xMotif应该具备什么样的性质是有用的。对每一个样本计算其基序使得表达过于特定。因此,我们希望对于同一类型的每一个基序就算不能在该类型的所有样本上匹配也应该在该类型里的大多数样本上匹配。每一个基序应该尽可能的包含守恒的基因。尽管一个基序包含一个或二个基因在生物角度上也是可行的,但是不可能在统计上有意义由于这样的基序有很大的概率出现在于随机生成的数据集上。另一方面,一个基序包含的基因过多可能会过于严格因为这样会没有样本能匹配这个基序。基于上述事实,我们提出如下关于xMotif的正式定义:

      Definition1.1 Given a set of gene whose expression levels are measured across a set ofsamples and user defined parameters , a conserved gene expression motif or xMotif is a pair , where  is a subset of thesample and is a subset of the genes, that satisfies the followingconditions:

      Size:the number of samples in is at least an fraction of all the samples.

      Conservation:every gene in  is conserved acrossall the samples in , i.e., the gene is in the same state in all the samples in ,and

      Maximality:for every gene not in , the gene is conserved in at most a fraction of the samples in .

      Themaximality condition enforces a balance between the number of genes in themotif and the number of samples matching the motif. If we add a gene to themotif, then the number of sample matching the new motif will decrease by afraction of at least ,a cost may not be willing to pay.

      条件最大性促进了在基序中基因个数和匹配基序的样本平衡。如果我们向基序中增加一个基因,那么匹配该新基序的样本数目就会相对原来的匹配样本数减少,这个代价可能不愿意付出。

      Giventhis definition, the gene expression data may contain many xMotifs. Among allxMotif, we are interested in the largest xMotif, the one that contains themaximum number of conserved genes. In order to cover all the classes completelyusing xMotif, we adopt the following iterative algorithm: find the largestxMotif in the data, remove the sample that satisfy this motif from the data,find the largest motif in the remaining data, and continue in this manner untilall samples satisfy some motif.

      考虑到这个定义,基因表达数据集可以包含大量xMotif. 在大量xmotif中,我们最感兴趣的是最大的xMotif,该xMotif包含有最大数量的守恒基因。为了使用xMotif完全覆盖所有类型,我们采取了如下的迭代算法:在数据集上找到最大的xMotif,从数据集上移除满足该xMotif的样本后再在剩余的数据集上继续再找最大的xMotif,如此循环直到所有的样本满足一些基序。

      Ourapproach has several desirable features. (1) We allow a gene to appear in morethan one motif and in motifs representing different classes, modeling thepossibility that the gene's expression level may be regulated in multipleconditions.(2) By not deleting the samples that match a computed xMotif, we canallow samples to appear in different motifs. this property may be useful when asample belongs to multiple classes or when we are interested in discovering newclassification of the samples.(3) The system need not be told beforehand howmany motifs to compute. (4) Using this approach, we can find xMotif with vastlydifferent numbers of genes; we are likely to discover motif with many genes inearlier iterations and motifs with fewer genes in later iteration.

      我们的方法有以下几个理想的特点:(1)我们允许一个基因出现在多个基序中,其中每一个基序代表不同的类型,这正好模拟了该基因的表达水平可以被多个条件所调控。(2)由于不删除匹配已计算好的xMotif的样本,我们的算法允许样本出现在不同的基序中。这个性质或许是有用的当一个性本属于多个类型时或当我们感兴趣发现样本的新分类。(3)系统不需要预先知道要有多少个基序需要计算。(4)使用该方法,我们可以找到基因大小不一的xMotif,我们可能在早期的迭代过程中发现由多个基因组成的基序也可以在后期的迭代中发现由较少基因组成的基序。

      Ourdefinition of an xMotif is based on the notion of projective cluster developedby Procioiuce et al[6] in the context of problem in computer databases andcomputer vision. It can be shown that the problem of computing the largerxMotif is NP-complete by transforming the problem of computing the maximum-edgebipartite clique in a bipartite graph to the motif computation problem. Insection 4, we present a probabilistic algorithm that exploits the mathematicalstructure of xMotif to compute the largest xMotif efficiently. This algorithmextends the technique developed by Procioiuce [6] to compute projectivecluster.

3.Determining Gene States

      Inthis section, we describe our technique for computing the states correspondingto a gene. In our approach, a state is simply a range of expression values thatis statistically significant. Similar ideas have been adopted by otherresearchers. Thus, if we have  samples, there are  possible states foreach gene, each corresponding to one of the sub-intervals spanned by the expressionvalues. Clearly, not all these states are biologically interesting.Intuitively, a state is interesting if it contains far more expression valuesthan we would expect the state to contain if the expression values weregenerated at random. Formally, as a null hypothesis, we assume that geneexpression values are generated by a uniform distribution. We define a state  to be 'interesting' ifthe expression values in it are unlikely to have been generated by a uniformdistribution. We compute the P-value of the decision that  is interesting, orderstates by P-value, and consider only those states whose P-value is less than auser-defined parameter. When the samples belong to different classes, we alsoadopt a 'supervised' version of this idea. We define a state  to be interesting ifthere is a class such that the set of expression values of the samples fromthat class that lie in  are unlikely to havebeen generated by a uniform distribution. For each class, we calculate aP-value and assign the smallest P-value to .

      Inpractice, we also discard those intervals that contain more than a userspecified number of expression values. The rationale for this step is that evenif an interval containing a large number of expression values is statisticallysignificant, it may not be biologically interesting since it is unlikely tohelp us in distinguishing between the various classes.

      在这节,我们讨论我们的技术是如何计算一个基因所对应的状态的。在我们的方法中,一个状态简化为一个有统计意义的基因表达值的范围。类似的想法已经被其它研究人员采用过。因此,如果我们有个样本,那么每一个基因就有种可能的状态,每一个状态对应一个由基因表达值扩张生成的子区间集合中的子区间。显然,不是所有的状态都有生物意义。直观的来说,如果一个状态(子区间)包含的基因表达值个数大于我们所期待的状态的基因表达值个数(所期待的状态是随机生成基因表达值的状态),那么就说一个状态是有意义的(这里对一个状态是有意义的定义比较难理解,我个人认为如果一个状态也就是一个区间是有意义的,那么就是说由个样本对应于该基因的表达值在该区间内的个数就应该大于由随机分布而导致的基因表达值在该区间的个数)。正式的说,作为一个零假设,我们假定基因表达值是按均匀分布来生成的,我们就定义一个状态是有意义的如果其中的表达值不可能被均匀分布生成的。我们计算状态是有意义这一决策的P值,并按P值排序后只考虑P值小于用户指定的阈值的状态。当样本属于不同的类型时,我们也采用了这个想法的有监督版本。我们定义一个状态是有意义的如果在区间内存在一个类型使得来自该类型的样本基因表达取值集是不可能均匀分布生成的。对每一个类型,我们计算其P值,并将最小的P值指定给区间

      在备注中给出了如何计算P值,如果个值位于区间,并且如果基因的表达值位于范围,那么基因表达值位于区间的概率就是。因此,状态是有意义的决策的P值就是

      在实际的操作中,我们也丢弃了一些区间,这些区间包含的基因表达值超过用户指定的个数。这一步的合理性在于即使一个包含大量数量的基因表达值的区间在统计上是有意义的,它也不可能在生物上有意义,原因是它不可能帮助我们区分各种类型。

      4.algorithm

      Weare now ready to describe our algorithm for computing the largest xMotif. Theinput to the algorithm is a set of genes, a set of samples, and expressionvalue for each gene-sample pair, and for each gene, a list of intervalsrepresenting the states in which the gene is expressed in the samples.

      我们现可就准备描述我们的计算最大xMotif的算法。算法的输入是一个基因集,一个样本集,每一个基因在每一个样本下的表达水平(这也就是基因表达水平矩阵)和每一个基因的区间列表,该区间列表代表了该基因在所有样本下表达的状态。

      Todetermine an xMotif, we have to compute the set  of conserved genes,the states that these genes are in, and the set  of samples that matchthe motif. We observer that if we are given (1) the set  of conserved genes,(2)the states of the conserved genes, and (3) one sample  that matches thismotif, we can compute the remaining samples in  simply checking foreach sample  whether the genes in  are in the same statein  and . Informally,  is a seed from whichwe can compute the entire motif.

      为了确定一个xMotif,我们不得不计算由所有守恒基因组成的集合,这些守恒基因所在的状态和与这些守恒基因所匹配的样本所组成的集合。我们注意到如果我们能给出(1)守恒基因集2)守恒基因的状态和一个匹配基序的样本,那我们就可以计算出中剩下的样本通过简单的检测守恒基因集在样本的状态是不是一样的,正式而言,就是一个通过其计算整个基序的种子。

      Thisobservation is the starting point of our algorithm. Suppose we know a sample  that matches thelargest xMotif. Given such a sample , how can we compute the genes in the largest xMotif and thestates they are in? Suppose we are given a set  of samples with thefollowing properties: (1) for every sample  in  and for every gene inthe largest motif, there is exactly one state such that the gene is in thatstate in sample  and ,and (2) for every gene  that is not in thelargest motif, there exists a sample  in  such that gene  is not in the samplestate in sample  and . We call  a discriminating set.The key property of a discriminating set is that given the seed sample  and such a set , we can compute the largest xMotif by including exactlythose gene-states that satisfy these condition and exactly those samples thatagree with  on all thesegene-states.

      这个观察是我们算法的起点。假定我们知道一个样本能匹配最大的xMotif. 那么已知样本我们如何能计算出xMotif中的基因及这些基因所在的状态?假定我们已经一个样本集有以下性质:(1)对于每一个在中的样本和每一个在最大xMotif中的基因都确定性的存在一个状态使得基因在样本下都在该状态中。(2)对于每一个不在xMoitf中的基因,那么在中就存在一个样本使得基因与样本不在同一个状态中。我们把这个集合称之为区分集。区分集的关键性质是给定种子样本和一个区分集,我们就可以计算最大的xMotif,通过准确地包含那些满足那些条件的基因状态以及准确地包含和样本在同一基因状态下的样本。

      Algorihm1 description the steps of our probabilistic algortihm. We assume that for eachgene, the intervals corresponding to that gene's states are disjoint. Itproceeds by selecting  samples uniformly atrandom from the set of all samples. These samples act as seeds. For each randomseed, we select  sets of samples uniformlyat random from the set of all samples; each set has  elements. These setsserve as candidates for the discriminating set. For each seed-discriminatingset pair, we compute the corresponding xMotif as explained above. We discardthe motif if less than an fraction of the samples match it. Finally, we return themotif that contains the largest number of genes. Suppose our data consist of  samples and  genes. We can extendthe arguments made by Procopiuc to prove that by choosing ,  and , we can compute the largest xMotif with probability greaterthan 1/2 in time . We can increase the probability of success by repeatedlyexecuting this algorithm.

      算法1描述了我们的概率算法的步骤。我们假定对于每一个基因,对于于基因状态的区间是不相邻的。该算法通过从所有样本中均匀地随机选择个样本开始执行。这些选择的样本作为种子样本。对于每一个随机选择的样本,我们随机的从所有样本中按均匀分布随机选择个样本集,每一个样本集中包含个元素。这个样本集就作为区分集的候选集。对于每一个种子及区分集,我们按上述方式计算其xMotif。我们把样本匹配比例小于motif就不考虑了。最后,我们返回基因数目最大的motif。假定我们的数据由个样本个基因组成的,我们可以推广参考文献6 Procopiuc的结论去证明通过选择,,我们可以计算出概率大于1/2找到最大xMotif的时间复杂度是。我们可以通过重复执行该算法来增加成功的概率。

      算法1FindMotif():找最大xMotif的算法

             1for i=1 to

             2,   按均充分布选择一个样本

             3,    for j=1 to

             4,           从所有样本中按均匀分布随机选择个样本组成区分集

             5         对于每一个基因,如果在样本和样本区分集下的基因表达值都在状                      下,那么就将基因及对应的状态加入到集合

             6         是在所有基因集下和一致的样本集。

             7         如果集合中的个数小于,则将该基序丢弃

             8 end for

             9,end for

             10,返回上述找到的所有中最大的motif

 

 



https://wap.sciencenet.cn/blog-801621-686305.html


下一篇:开博感想
收藏 IP: 125.71.231.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

数据加载中...

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-3-28 21:26

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部