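yzy2020's personal blog http://blog.sciencenet.cn/u/yzy2020 Technique is craft: it can be mastered through repeated practice, so do not become infatuated with it. Ideas are the way, and they are trained by reading the literature. In short, practice makes perfect!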


[Repost] Questions on building phylogenetic trees: for orthologs, DNA sequences or protein sequences?

2023-5-19 10:02 | Personal category: Linux learning | System category: research notes | Source: repost

1. The original questions found by searching

To do a phylogeny analysis, what is preferred: nucleotide or protein sequence?

https://www.researchgate.net/post/Which_type_of_sequence_should_be_used_for_phylogenetic_tree_construction_conserved_or_non_conserved

Which is more informative: a phylogenetic tree based on alignment of protein amino acid sequences or one based on the corresponding DNA sequences?

https://www.researchgate.net/post/Which-is-more-informative-a-phylogenetic-tree-based-on-alignment-of-protein-amino-acid-sequences-or-one-based-on-the-corresponding-DNA-sequences

Excerpt from one answer: In general I think it will be very difficult to estimate divergence times for organisms that diverged as long ago as, say, a bacterium and a mammal, both of which contribute a member to the same protein superfamily. The rate of evolution of two organisms that are that different is very likely to be quite different, and thus the "molecular clock" assumption that underlies using phylogenies to determine divergence times is not valid on such timescales. This is why divergence time estimation programs generally use DNA sequences: the model is only valid for short divergence times, over which protein sequences don't vary enough to provide enough information, but DNA sequences do.

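Large divergence-time span between species: a tree built from homologous proteins reflects the relationships more accurately.

Small divergence-time span between species: the molecular clock hypothesis is a reasonable fit, and homologous gene (DNA) sequences are the better choice for reflecting the relationships.

The key question: how do we decide whether the divergence-time span is large or small? A bacterium versus a mammal is clearly a large span; can the span between different bacteria be treated as small?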
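One rough way to get a feel for "large" versus "small" is to check how close the DNA alignment is to saturation. The sketch below is not from the quoted answers: it assumes the Jukes-Cantor (JC69) model, under which the corrected distance is d = -3/4 ln(1 - 4p/3) for an observed proportion of differing sites p, and it uses made-up toy sequences. As p approaches 0.75 the correction blows up, which is a hint that nucleotides carry little remaining signal and protein sequences may be the safer choice.

```python
import math

def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected nucleotide distance; infinite once p >= 0.75."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy aligned sequences, invented for illustration only.
close_a = "ATGGCCAAGCTGACCGAT"
close_b = "ATGGCTAAGCTGACCGAT"
far_a = "ATGGCCAAGCTGACCGAT"
far_b = "ATGTCAAGACTTTCAGAC"

p_close = p_distance(close_a, close_b)
p_far = p_distance(far_a, far_b)

for label, p in (("close pair", p_close), ("distant pair", p_far)):
    print(f"{label}: p = {p:.3f}, JC69 d = {jc69_distance(p):.3f}")
```

The gap between p and the corrected d grows with divergence; where that gap becomes large for the taxa at hand is a judgment call, not a fixed threshold.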

2. Related answers

http://ib.berkeley.edu/courses/ib200b/IB200B_SyllabusHandouts.shtml (highly recommended; it tackles the question at its source)

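3. Nature Reviews Genetics (NRG) review (the notes below are my own understanding and may contain errors; check them against the original paper)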

https://www.nature.com/articles/s41576-020-0233-0

Phylogenetic tree building in the genomic age

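3.1 Single-copy orthologous coding genes (proteins) are used to infer the species tree

3.2 Multi-copy genes are used to analyze gene-family contraction and expansion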

In particular, multi-copy gene data can be used simultaneously to estimate the species and gene-family evolution [40,41].

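3.3 Amino acid sequences are more stable and evolve more slowly, so aligning at the protein level is usually better for multiple sequence alignment than aligning the codon triplets directly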

Accurate alignment is fundamental in the inference of evolutionary relationships but, for genes in which indels have been frequent, it is a challenging task. When aligning protein-coding DNA sequences, the nucleotides naturally evolve as codon triplets rather than as single nucleotides. This property, as well as the fact that amino acid sequences change less rapidly than the corresponding nucleotides, means that initial alignment at the protein level rather than at the DNA level is usually appropriate. The codon triplets can then be aligned according to their corresponding amino acids [49–51].
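The last step in that passage, threading codons onto a protein alignment, can be sketched in a few lines. This is a minimal illustration with invented sequences, not the method of any particular program; dedicated tools such as PAL2NAL handle the same task with far more checks. The function name `protein_to_codon_alignment` and the stop-codon handling (standard genetic code only) are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)

def protein_to_codon_alignment(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its aligned protein.

    Each amino acid column takes the next codon from `cds`;
    each gap column becomes a three-nucleotide gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Drop a trailing stop codon if the CDS has exactly one extra codon.
    if len(cds) == 3 * (n_residues + 1) and cds[-3:].upper() in STOP_CODONS:
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein length")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)

# Toy data: a protein alignment and the matching unaligned CDS per species.
aligned = {"sp1": "MK-LT", "sp2": "MKALT"}
cds = {"sp1": "ATGAAACTGACCTAA", "sp2": "ATGAAGGCTCTTACT"}

codon_alignment = {
    name: protein_to_codon_alignment(aligned[name], cds[name])
    for name in aligned
}
for name, seq in codon_alignment.items():
    print(name, seq)
```

Because gaps are inserted in multiples of three, the reading frame is preserved and codon positions stay comparable across species.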

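3.4 Bayesian inference differs from ML in that it uses statistical distributions to quantify the uncertainty in the (tree-building) parameters. The biggest drawback of both is the heavy computation and long run time. Distance, parsimony (MP) and ML trees need the bootstrap to assess confidence, whereas Bayesian trees carry posterior probabilities as support, so no bootstrap is needed. Likelihood-based trees (ML and Bayesian) are more robust to long-branch attraction (LBA) than parsimony because the model accounts for branch lengths, but they depend heavily on the model assumptions and can still suffer from LBA when the substitution model is wrong or too simple.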

The Bayesian method also relies on an explicitly stated model and on the likelihood function. It differs from ML in that it uses statistical distributions to quantify uncertainties in the parameters.

Computation in Bayesian phylogenetics is achieved using the Markov chain Monte Carlo (MCMC) algorithm, which is a computer simulation algorithm that generates a sample of the tree topologies and parameters from their posterior. In practical terms, the frequency with which the algorithm visits a given tree topology is an estimate of the posterior probability for that tree. The maximum posterior probability tree (or the MAP tree) is our best estimate of the true tree [95].
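The "visit frequency estimates the posterior" idea can be shown with a toy tally. The sketch below assumes an MCMC run has already produced a list of sampled topologies, each written as a canonical string so that identical topologies compare equal; the sample itself and the `burn_in` handling are invented for illustration, and real programs summarise their samples in more elaborate ways.

```python
from collections import Counter

def topology_posteriors(sampled_topologies, burn_in=0):
    """Estimate posterior probabilities as visit frequencies after burn-in."""
    kept = sampled_topologies[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    counts = Counter(kept)
    total = len(kept)
    return {topology: n / total for topology, n in counts.items()}

# Made-up MCMC sample for three taxa; the first two draws are discarded.
sample = (
    ["((A,C),B)"] * 2      # burn-in draws
    + ["((A,B),C)"] * 6
    + ["((A,C),B)"] * 3
    + ["((B,C),A)"] * 1
)

posteriors = topology_posteriors(sample, burn_in=2)
map_tree = max(posteriors, key=posteriors.get)

for topology, prob in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{topology}: {prob:.2f}")
print("MAP tree:", map_tree)
```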

A serious drawback of likelihood-based methods, including both ML and BI, is their heavy computational demand since they may take many thousands of CPU hours to run; this is particularly true of MCMC algorithms.

The bootstrap is applied to assess confidence in estimated trees for the distance, parsimony and ML methods. For Bayesian methods, the posterior probabilities for trees and clades provide the natural measure of confidence so that the bootstrap is unnecessary.
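For the non-Bayesian methods, the resampling half of the bootstrap is simple enough to sketch. The example below draws alignment columns with replacement to build pseudo-replicate alignments of the original length; the alignment is made up, and the tree inference on each replicate (and the counting of how often each clade reappears) is left to whatever tree-building program is in use.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }

# Toy alignment, invented for illustration.
alignment = {
    "sp1": "ATGAAACTGACC",
    "sp2": "ATGAAGCTTACT",
    "sp3": "ATGCAGCTTACA",
}

rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]

# Each replicate would be passed to the tree-building step; the support
# for a clade is the fraction of replicate trees that contain it.
print(len(replicates), "replicates of length", len(replicates[0]["sp1"]))
```

The same column indices are applied to every sequence, so each resampled site is still a true column of the original alignment.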

Likelihood methods (ML and BI) are more robust to LBA errors than is parsimony, as they are branch-length aware and hence consider the increased possibility of convergence on two long branches. ML and BI can nevertheless suffer from LBA if the assumed substitution model is incorrect or too simplistic [109] such as wrongly assuming a homogeneous rate of change across sites.

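Collagens change faster than histones; substitution rates differ among nucleotide sites; amino acid sequences are the most stable and change the most slowly. One workable scheme: align the protein sequences, trim the alignment, and then convert it back to nucleotide sequences, taking care to remove stop codons. Model parameters are then estimated for the first and second codon positions separately from the third position, which these notes describe as having higher GC content and a higher substitution rate.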

Collagens change more quickly than histones, introns change more quickly than exons, third positions in a codon change more quickly than the first and second positions, and some amino acids within a protein are under strong stabilizing selection while others are free to vary; ultimately, assuming a constant rate among sites of a gene is unrealistic.
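The codon-position contrast in that passage is easy to check on a codon alignment. The sketch below splits an in-frame alignment into first-plus-second positions and third positions and reports the fraction of variable columns in each; the sequences are invented, and the helper names are hypothetical. The two subsets are the kind of partitions that could then be given separate model parameters.

```python
def split_codon_positions(codon_alignment):
    """Return (positions 1+2, position 3) sub-alignments of an in-frame alignment."""
    pos12, pos3 = {}, {}
    for name, seq in codon_alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{name}: length is not a multiple of three")
        pos12[name] = "".join(seq[i] for i in range(len(seq)) if i % 3 != 2)
        pos3[name] = seq[2::3]
    return pos12, pos3

def variable_fraction(alignment):
    """Fraction of columns with more than one non-gap character state."""
    columns = list(zip(*alignment.values()))
    variable = sum(1 for col in columns if len(set(col) - {"-"}) > 1)
    return variable / len(columns)

# Toy in-frame alignment, invented for illustration.
codon_alignment = {
    "sp1": "ATGAAACTGACC",
    "sp2": "ATGAAGCTTACT",
    "sp3": "ATGAAACTCACA",
}

pos12, pos3 = split_codon_positions(codon_alignment)
frac12 = variable_fraction(pos12)
frac3 = variable_fraction(pos3)
print(f"positions 1+2: {frac12:.2f} variable; position 3: {frac3:.2f} variable")
```

In this toy case every difference is a synonymous change at a third position, which is the pattern the quoted sentence describes.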

4. Supplementary reading

1. http://www.ncbi.nlm.nih.gov/books/NBK21122/

2. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC98952/

3. http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter5.html

4. http://eebweb.arizona.edu/blast/rna_lecture.pdf





https://wap.sciencenet.cn/blog-3434047-1388592.html
