chenxinzzbin的个人博客分享 http://blog.sciencenet.cn/u/chenxinzzbin

博文

分子进化程序包PAML codeml程序使用入门

已有 29947 次阅读 2016-12-27 03:19 |系统分类:科研笔记


第一篇:

PAML这个程序非常挑文件格式,建议大家在运行PAML出现问题的时候优先考虑如下几个问题:

1、做序列比对的时候是否用的Codon Alignmnet;

2、核酸序列要保证是3的倍数;

3、序列中不能出现纯数字、不能出现特殊符号,比如()一类的;

4、是否会有额外的空格、换行符等等。

在这里特别说一下,在OSX系统下做的序列文件很可能用不了,原因大概是OSX会在文本文档中增加一些特殊的不可见字符。

随手拿MEGA建个分子进化树非常容易。但是如何选择模型?进化树是否具有生物学意义?如何解读一颗进化树?进化树的拓扑结构是否可重复?进化树下面的Bar以及其数字代表什么?是否可以简单理解为序列的差异?后来在Nei的Molecular Evolution and Phylogenetics找到了部分答案,至少在距离模型里这个Bar的长度代表序列的分歧度。

PAML程序包是分子进化领域的大牛杨子恒教授开发的。他现在在伦敦大学,是英国皇家学会的会员。


下面简单介绍下PAML程序包中CODEML程序的使用。

「预备知识」

①、命令行的基本操作。 ②、MEGA,Clustal, Treeview等程序的使用。


进入正题,从最常用的codeml程序开始说起

①去官网下载并安装PAML程序包:http://abacus.gene.ucl.ac.uk/software/paml.html

Windows版本应该不用解析,Unix、Linux and Mac OSX请参考官网步骤解析程序。

②codeml程序运行需要4个文件(codeml程序,codemlde.ctl配置文件,比对好的序列文件(Phlip或者PAML格式),以及tree文件),最好将他们放在一个目录下。

③ 准备比对好的序列文件,序列要满足以下几个要求:

A.Phylips格式或者PAML格式

B.序列数需要时3的整数倍(密码子)

C.在序列中不能出现终止密码(最后一个如果是终止密码请手动删除)

具体制作步骤:

把要进行Selection分析的序列保存成Fasta格式,用MEGA打开,用Clustal或者Muscle比对,Coden alignment以后手工删掉所有终止密码(Gap可以选择删除或者不删除,反正codeml计算的时候是忽略Gaps的)—从MEGA导出为比对好的Fasta文件。

④将比对好的Fasta文件转化成Phylips格式或者PAML格式。

Phylips和PAML格式的区别仅仅在于Phylips的序列名不能超过10个字母,而PAML的序列名没有限制。

将比对好的Fasta文件改成Phlips或者PAML格式是最简单的方法:

提交比对好的Fasta文件到 http://app3.titan.uio.no/biotools/tool.php?app=fasta-paml ,导出Phlips或者PAML格式

最后,将转化好的序列格式保存成自己想要的名字就好,比如 n1.txt

「PAML」分子进化程序包PAML使用入门—codeml

⑤制作tree文件。

tree文件最简单的方法就是直接用Clustal制作,推荐 http://www.genome.jp/tools/clustalw/

将序列提交上去,点Execute Multiple Alignment,然后将结果里的dnd文件下载,改名为自己想要的文件+后缀名“.trees”。我这里保存为n1.trees。

但是 上Phylogenetics课程的时候问了老师一句关于拿Clustal建树用于PAML分析,得到的回答是:  Never, never use Clustal to build a tree。

所以还是老老实实用正经的建树工具来做Newick格式(见下图)的tree文件吧。

「PAML」分子进化程序包PAML使用入门—codeml

⑥接下来要配置codeml.ctl文件。

可以参考这个链接: http://csbl.bmb.uga.edu/~yinyb/codeml.html

比较重要的几个有

seqfile = n1.txt * 自己制作的比对好的序列名称

treefile = n1.trees * 树文件名称

outfile = mcl * 结果文件名称

runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic

seqtype = 1 * 1:codons; 2:AAs; 3:codons—>AAs

model = 0

?? * models for codons:

?? * 0:one, 1:b, 2:2 or more dN/dS ratios for branches

?? * models for AAs or codon-translated AAs:

?? * 0:poisson, 1:proportional, 2:Empirical, 3:Empirical+F

?? * 6:FromCodon, 7:AAClasses, 8:REVaa_0, 9:REVaa(nr=189)

NSsites = 0 1 2 7 8 * 0:one w;1:neutral;2:selection; 3:discrete;4:freqs;

?? * 5:gamma;6:2gamma;7:beta;8:beta&w;9:betaγ

?? * 10:beta&gamma+1; 11:beta&normal>1; 12:0&2normal>1;

?? * 13:3normal>0

其中NSsites就是模型选择。我一般选择0,1,2,7,8,这样接下去可以对结果进行LRT分析来看结果统计上的显著性。

⑦保证这四个文件:codeml, codeml.ctl, n1.txt, n1.trees 在一个文件夹,cd到目标目录,然后运行codeml程序。

「PAML」分子进化程序包PAML使用入门—codeml

⑧目标目录会出现很多结果文件,其中MCL里面是主要的结果。

「PAML」分子进化程序包PAML使用入门—codeml

⑨打开mcl文件,看M8模型下面的这个部分。

第一列是序列的位置和承受选择的氨基酸

第二列是W>1(W>1: positive selection, W<1: purifying selection)的概率,如果P>95%就会被标记上*,证明该位点收到positive selection。


但是这样的结果其实还不可信,还应该通过LRT来看结果统计上的可信度

Reference

①、「计算分子进化(Molecular Evolution)(Molecular Evolution)」杨子恒

②、「分子进化(Molecular Evolution)(Molecular Evolution)与系统发育」Masatoshi Nei

③、paml中文手册:http://blog.csdn.net/winglyx/article/details/6731787

④、codeml配置文件说明:http://csbl.bmb.uga.edu/~yinyb/codeml.html

⑤、codeml使用流程介绍:http://fhqdddddd.blog.163.com/blog/static/1869915420111091276909/


第二章:

上回详细介绍了PAML程序包中用Codeml程序计算选择位点的操作,最后提到计算选择位点以后还需要利用LRT来看看结果是否可靠。

利用Codeml计算得到的选择位点一定要经过这一步才算数,发表的文献中也都是会提到这一步的。


先简单介绍下LRT,全称是Likelihood ratio testes,在这里就是通过卡方分布(chi-squate),通过比较利用不同参数得到的值,看看结果是否在统计上显著。我们先抛开统计方面的疑问,直接看看具体拿PAML程序包怎么实现。

上回讲到了用codeml计算选择位点的时候,同时选择M7和M8模型,这两个模型的差异在于M8(beta & ω)模型比M7 (beta)多了一个参数,所以可以依靠比较这两个模型来进行LRT。

公式是这个:LRT = 2dl = abs(2 X (l1-l0)) (abs=绝对值)

Step1:

打开之前用codeml生成的mcl文件,找到Model7, 和Model8

两个lnL值相减,取绝对值,乘以2

这里是:2*|-10302.87869-(-10332.64273)|=59.52 (小数点后影响不大)

Step2:

打开PAML程序包中的Chi2程序, 分别输入自由度和刚才得到的数,自由度直接取2就好(这个是杨子恒自己说的,在PAML的文档里有写),后面写上Step1算出来的数字,得到的p小于0.05则结果比较可靠。


第三章:

之前介绍过用PAML程序包里的Codeml计算一组序列的site selection:「PAML」分子进化程序包PAML使用入门—codeml。

以及如何用LRT检验结果的可靠性:「PAML」分子进化程序包PAML使用入门——Vol2. LRT检验。

这次介绍Codeml里的branch model。


先简单介绍点背景知识,要看一组序列受到的选择,首先会从整组序列来考虑,看这组序列的的非同义突变/同义突变(Nonsynonymous substitution/ Synonymous, ω, dN/dS),ω>1,说明这组序列受到正向选择(positive selection),ω=1为中性选择(natural selection),ω<1为纯化选择(purifying selection)。

但这里就会有问题,因为如果拿整组基因来看,大部分情况会得到ω<1的结果,因为大部分有功能的基因不会所有位点都受到强烈的positive selection,这样从整组数据来看,正向选择就被中和掉了。


而PAML的优势就是通过Branch model和Site model,仅仅考察在一个进化树某一个枝所收到的选择压力,或者序列中哪些位点遭受到正选择。

Branch Model可以用于分析物种分化时间;而Site model对于基因功能的研究非常有帮助,比如可以找到基因的关键结构,然后通过定点突变来看是否会影响到功能。


接下来拿PAML自带的Example来说说Branch model。

PAML版本4.4所用数据集为Example中的lysozymeSmall

Step1

为了进行比较我们先来看看整组数据的ω, lysozymeSmall.ctl文件里设置

Model=0 (假设所有枝的w是一样的) NSite=0

运行得到结果。

Step2

因为假设所有枝受到的选择压力是相同的,所以ω一样。这时我们标记一下tree文件,假设3,4这个枝的ω和其他不同,然后改一下ctl文件

Model=2 (假设枝的w可以是不一样的) NSite=0

运行。

注意上面的红框,标记树枝的ω值已经变了。 下面的红框是树文件,可以吧这个文件拷贝到文本编辑器里。然后用Treeview或者Figtree打开,之后就可以将不同数值的ω标记到树图里了。



下面放些更详细的说明:


CodeML/PAML
Install of PAML

Download it from here: http://abacus.gene.ucl.ac.uk/software/paml.html#download

Windows (The bin/ folder already contains windows executables):

cd paml4.6/

Linux:

cd paml4.5/rm bin/*.execd srcmake -f Makefilels -lrm *.omv baseml basemlg codeml pamp evolver yn00 chi2 ../binchmod +x ../bin/*

MacOSX:

cd paml4.6/rm bin/*.execd srccc -O2 -o baseml baseml.c tools.c -lmcc -O2 -o basemlg basemlg.c tools.c -lmcc -O2 -o codeml codeml.c tools.c -lmcc -O2 -o pamp pamp.c tools.c -lmcc -O2 -o mcmctree mcmctree.c tools.c -lmcc -O2 -o evolver evolver.c tools.c -lmcc -O2 -o yn00 yn00.c tools.c -lmcc -O2 -o chi2 chi2.c -lmls -lrm *.omv baseml basemlg codeml pamp evolver yn00 chi2 ../binchmod +x ../bin/*
Theoretical principles of the Branch-site model

The selective pressure in protein coding genes can be detected within the framework of comparative genomics. The selective pressure is assumed to be defined by the ratio (ω) dN/dS. dS represents the synonymous rate (changing the amino acid) and dN the non-synonymous rate (keeping the amino acid). In the absence of evolutionary pressure, the synonymous rate and the non-synonymous rate are equal, so the dN/dS ratio is equal to 1. Under purifying selection, natural selection prevents the replacement of amino acids, so the dN will be lower than the dS, and dN/dS < 1. And under positive selection, the replacement rate of amino acid is favoured by selection, and dN/dS > 1.

CodeML and substitutions models:

CodeML is a program from the package PAML, based on Maximum Likelihood, and developed in the lab of Ziheng Yang, University College London.

It estimates various parameters (Ts/Tv, dN/dS, branch length) on the codon (nucleotide) alignment, based on a predefined topology (phylogenetic tree).

Different categories of codon models exist in CodeML:

  • The model 0 estimates a unique dN/dS ratio for the whole alignment. Not really interesting, except to define a null hypothesis to test against. The other branch models estimate different dN/dS among lineages (ie ASPM, a gene expressed in the brain of primates).
  • The site models estimate different dN/dS among sites (ie in the antigen-binding groove of the MHC).
  • The branch-site models estimate different dN/dS among sites and among branches. It can detect episodic evolution in protein sequences. IMHO, the most interesting model.

First, we have to define the branch where we think that position could have occurred. We will call this branch the “foreground branch” and all other branches in the tree will be the “background” branches. The background branches share the same distribution of ω = dN/dS value among sites, whereas different values can apply to the foreground branch.

To compute the likelihood value, two models are computed: a null model, in which the foreground branch may have different proportions of sites under neutral selection to the background (i.e. relaxed purifying selection), and an alternative model, in which the foreground branch may have a proportion of sites under positive selection.

As the alternative model is the general case, it is easier to present it first.

Four categories of sites are assumed in the branch-site model:

Sites with identical dN/dS in both foreground and background branches:

  • K0 : Proportion of sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
  • K1 : Proportion of sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.

Sites with different dN/dS between foreground and background branches:

  • K2a: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
  • K2b: Proportion of sites that are under positive selection (ω2 ≥ 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.

For each category, we get the proportion of sites and the associated dN/dS values.

In the null model, the dN/dS (ω2) is fixed to 1:

Sites with identical dN/dS in both foreground and background branches:

  • K0 : Sites that are under purifying selection (ω0 < 1) on both foreground and background branches.
  • K1 : Sites that are under neutral evolution (ω1 = 1) on both foreground and background branches.

Sites with different dN/dS between foreground and background branches:

  • K2a: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under purifying selection (ω0 < 1) on background branches.
  • K2b: Sites that are under neutral evolution (ω2 = 1) on the foreground branch and under neutral evolution (ω1 = 1) on background branches.

For each model, we get the log likelihood value (lnL1 for the alternative and lnL0 for the null models), from which we compute the Likelihood Ratio Test (LRT).

The 2×(lnL1-lnL0) follows a χ2 curve with degree of freedom of 1, so we can get a p-value for this LRT.

Let's go in details.

Identification of positive seleciton in the serine/threonine-protein kinase gene family

In the evolution vertebrates, we would like to know if the branch leading to the Teleost fishes (genes A50 to A54) in the serine/threonine-protein kinase PAK 2 gene was under positive selection or not. And if yes, which residues were under positive selection.

We need four files to run CodeML (unzip them all):

  1. The multiple nucleotide (CDS) alignment, in PHYLIP format. CodeML will strictly remove any position that contains at least one gap or an unknown “N” nucleotide: tf105351.eut.3.phy.zip
  2. The phylogenetic tree in newick format, with the branch of interest specified by ”#1”(You can view it with NJplot or FigTree): tf105351.eut.3.53876.tree.zip
  3. A command file where all parameters to run CodeML under the alternative model are specified: tf105351.eut.3.53876.ctl.zip
  4. A command file where all parameters to run CodeML under the null model are specified: tf105351.eut.3.53876.fixed.ctl.zip

The tree looks like:

Execute CodeML

Run command file (alternative model):

We estimate the Ts/Tv ratio (fix_kappa = 0) and the dN/dS (fix_omega = 0). The branch-site model is specified by setting these two parameters:

  • model = 2 (different dN/dS for branches)
  • NSsites value to 2 (which allows 3 categories for sites: purifying, neutral and positive selection).
    seqfile = TF105351.Eut.3.phy            * sequence data file name    treefile = TF105351.Eut.3.53876.tree     * tree structure file name     outfile = TF105351.Eut.3.53876.mlc      * main result file name        noisy = 9   * 0,1,2,3,9: how much rubbish on the screen     verbose = 1   * 1: detailed output, 0: concise output     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic                   * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree      aaDist = 0   * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v}       model = 2   * models for codons:                   * 0:one, 1:b, 2:2 or more dN/dS ratios for branches     NSsites = 2   * 0:one w; 1:NearlyNeutral; 2:PositiveSelection; 3:discrete;                   * 4:freqs; 5:gamma;6:2gamma;7:beta;8:beta&w;9:beta&gamma;10:3normal       icode = 0   * 0:standard genetic code; 1:mammalian mt; 2-10:see below       Mgene = 0   * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated       kappa = 2   * initial or fixed kappa   fix_omega = 0   * 1: omega or omega_1 fixed, 0: estimate       omega = 1   * initial or fixed omega, for codons or codon-based AAs       getSE = 0       * 0: don't want them, 1: want S.E.s of estimatesRateAncestor = 0       * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)  Small_Diff = .45e-6  * Default value.   cleandata = 1       * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 0       * 0: ignore, -1: random, 1: initial, 2: fixed

Run command file (null model):

The command file for the null model is the same as for the alternative model, except for two parameters (in red):

  1. The name of the output file (outfile) is different.
  2. The dN/dS ratio is fixed to 1 (fix_omega = 1).
    seqfile = TF105351.Eut.3.phy               * sequence data file name    treefile = TF105351.Eut.3.53876.tree        * tree structure file name     outfile = TF105351.Eut.3.53876.fixed.mlc   * main result file name        noisy = 9   * 0,1,2,3,9: how much rubbish on the screen     verbose = 1   * 1: detailed output, 0: concise output     runmode = 0   * 0: user tree;  1: semi-automatic;  2: automatic                * 3: StepwiseAddition; (4,5):PerturbationNNI; -2: pairwise     seqtype = 1   * 1:codons; 2:AAs; 3:codons-->AAs   CodonFreq = 2   * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table       clock = 0   * 0: no clock, unrooted tree, 1: clock, rooted tree      aaDist = 0   * 0:equal, +:geometric; -:linear, {1-5:G1974,Miyata,c,p,v}       model = 2   * models for codons:                   * 0:one, 1:b, 2:2 or more dN/dS ratios for branches     NSsites = 2   * 0:one w; 1:NearlyNeutral; 2:PositiveSelection; 3:discrete;                   * 4:freqs; 5:gamma;6:2gamma;7:beta;8:beta&w;9:beta&gamma;10:3normal       icode = 0   * 0:standard genetic code; 1:mammalian mt; 2-10:see below       Mgene = 0   * 0:rates, 1:separate; 2:pi, 3:kappa, 4:all   fix_kappa = 0   * 1: kappa fixed, 0: kappa to be estimated       kappa = 2   * initial or fixed kappa   fix_omega = 1   * 1: omega or omega_1 fixed, 0: estimate               <- this line was changed, dN/dS is fixed to 1.       omega = 1   * initial or fixed omega, for codons or codon-based AAs**           getSE = 0       * 0: don't want them, 1: want S.E.s of estimatesRateAncestor = 0       * (0,1,2): rates (alpha>0) or ancestral states (1 or 2)  Small_Diff = .45e-6  * Default value.   cleandata = 1       * remove sites with ambiguity data (1:yes, 0:no)? fix_blength = 0       * 0: ignore, -1: random, 1: initial, 2: fixed

Launch CodeML:

In MacOSX/Linux, this will look like:

codeml ./TF105351.Eut.3.53876.ctlcodeml ./TF105351.Eut.3.53876.fixed.ctl

In Windows:

codeml.exe TF105351.Eut.3.53876.ctlcodeml.exe TF105351.Eut.3.53876.fixed.ctl

Two mlc output files are produced (as it can take time, you can download them directly in the next step).:

Analyse results
Step 1) Assign significance of the detection of positive selection on the selected branch:

We retieve the likelihood values lnL1 and lnL0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively.

We retieve the number of parameters np1 and np0 from TF105351.Eut.3.53876.mlc and TF105351.Eut.3.53876.fixed.mlc files, respectively.

  • lnL(ntime: 41 np: 46): -4707.210163 +0.000000 (lnL1)
  • lnL(ntime: 41 np: 45): -4710.222252 +0.000000 (lnL0)

We can construct the LRT (you can use your favourite spreadsheet for that. Or even better with R):

ΔLRT = 2×(lnL1 - lnL0) = 2×(-4707.210163 - (-4710.222252)) = 6.024178

The degree of freedom is 1 (np1 - np0 = 46 - 45).

(With OpenOffice: 1-CHISQDIST(6.024178;1))

p-value = 0.014104 (under χ2) ⇒ significant!

A significant result with the branch-site codon model means that positive selection affected a subset of sites during a specific evolutionary time (also called episodic model of protein evolution).

Step 2) If significant, we can retrieve sites under positive selection

In the TF105351.Eut.3.53876.mlc, we can retrieve sites under positive selection using the Bayes Empirical Bayes (BEB) method:

   Positive sites for foreground lineages Prob(w>1):        36 K 0.971*       159 C 0.993**

Amino acids K and C refer to the first sequence in the alignment.

  • Position 36 has a high probability (97.1%) of being under positive selection. It shifted from a lysine to a glycine.
  • Position 159 has a very high probability (99.3%) of being under positive selection.

You can visualise the multiple in Jalview.

- Open Jalview

- Load TF105351.Eut.3.phy

- Then: Calculate→translate cDNA. (tips: by moving the pointer on the amino acids alignment, you can see the corresponding codon in the nucleotide alignment.

Using other models

Other models can be tested by changing these parameters model and NSsites.

Example 1:

Site model M1 (neutral):

  • model = 0 (dN/dS doesn't vary on branches)
  • NSsites = 1 (which allows 2 categories for sites: purifying and neutral).

Site model M2a (positive selection):

  • model = 0 (dN/dS doesn't vary on branches)
  • NSsites = 2 (which allows 3 categories for sites: purifying, neutral and positive selection).

Then we can compare M1 and M2a by the likelihood ratio test.

Example 2:

Branch model M0:

  • model = 0 (dN/dS doesn't vary on branches)
  • NSsites = 0 (dN/dS doesn't vary on sites).

Branch model M2 with different dN/dS (positive selection on selected branches):

  • model = 2 (different dN/dS for branches)
  • NSsites = 2 (dN/dS doesn't vary on sites).

Then we can compare M0 and M2 by the likelihood ratio test.



https://wap.sciencenet.cn/blog-1339458-1023521.html

上一篇:Linux root 密码忘记后修改的方法
下一篇:Linux CentOS下shell显示-bash-4.1$ 不显示用户名和主机名的解决
收藏 IP: 220.249.99.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

数据加载中...
扫一扫,分享此博文

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-5-15 05:29

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部