Integrating functional traits and DNA barcodes on phylogenetic trees

2023-5-22 11:45 | Personal category: Paper summaries | System category: Paper exchange

Phylogeny-based assignment of functional traits to DNA barcodes

In 2003, DNA barcoding was proposed as a means to profile species-level diversity (e.g. MOTU) of invertebrates and to assign taxonomic names. Because many easy-to-use software tools build phylogenies from molecular sequences, it also became common to report phylogenies for sets of DNA barcodes. Thus three types of diversity (species, taxonomic, and phylogenetic) became standard measures in studies of alpha and beta diversity in invertebrates. During the same period in plant sciences, functional traits moved center stage as a way to move beyond simple structural measures of diversity and to understand the mechanisms behind community structure. Oddly, there has been very little intersection of these two developments, except in microbiology, where tools and databases were developed for assigning metabolic trait profiles to 16S metabarcodes. Tingting Xie et al. address this gap with two major aims (fig.1). First, they propose approaches for assigning traits to DNA barcodes. Second, they propose a practical pipeline for incorporating traits into DNA barcode-based profiling.

To achieve the first aim, the authors developed three approaches to DNA-based assignment of traits from references to queries. Each approach was built around existing tools for DNA barcoding, phylogenetics and trait modelling. The first method they called ‘Phylogenetic Assignment’. Here, a phylogeny is constructed that includes reference members with known traits, the traits are modelled on the phylogeny, and the model is used to predict the states of the queries. The second method they called ‘Blast Neighborhood’; it is equivalent to an established distance-based approach to DNA-based taxonomic assignment. For Blast Neighborhood, two Blast searches are conducted: the first retrieves a distance threshold for the neighborhood, which is the distance from the query to the nearest reference, and the second retrieves all references within this threshold. The states present in this neighborhood are then assigned to the query. Finally, the authors conduct a simple Blast top-hit search and assign the state of the top hit to the query. The three approaches are assessed using a ‘Leave-1-Out’ strategy, which is often used for assessing DNA barcoding accuracy.
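The two distance-based approaches and the Leave-1-Out check can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain genetic distances stand in for Blast results, the specimen names and distances are invented, and the `margin` parameter is a hypothetical knob that the description above does not mention.

```python
from typing import Callable, Dict, Set

Distances = Dict[str, float]   # reference id -> genetic distance from the query
States = Dict[str, str]        # reference id -> known trait state


def top_hit_assign(dists: Distances, ref_states: States) -> Set[str]:
    """Return the state of the single nearest reference (the 'top hit')."""
    nearest = min(dists, key=dists.get)
    return {ref_states[nearest]}


def neighborhood_assign(dists: Distances, ref_states: States,
                        margin: float = 0.0) -> Set[str]:
    """Return every state present among references inside the neighborhood.

    Step 1 sets the threshold to the distance of the nearest reference;
    step 2 collects all references within it. `margin` is a hypothetical
    way to widen the neighborhood, not something the post specifies.
    """
    threshold = min(dists.values()) + margin
    return {ref_states[ref] for ref, d in dists.items() if d <= threshold}


def leave_one_out(pairwise: Dict[frozenset, float], states: States,
                  assign: Callable[[Distances, States], Set[str]]) -> float:
    """Hold out each specimen as the query and score the assignment.

    An assignment counts as correct only if it is exactly the true state.
    """
    correct = 0
    for query, true_state in states.items():
        refs = [r for r in states if r != query]
        dists = {r: pairwise[frozenset((query, r))] for r in refs}
        ref_states = {r: states[r] for r in refs}
        if assign(dists, ref_states) == {true_state}:
            correct += 1
    return correct / len(states)


# Toy data (invented): four specimens, two trait states.
states = {"a": "social", "b": "social", "c": "solitary", "d": "solitary"}
pairwise = {
    frozenset(("a", "b")): 0.01, frozenset(("a", "c")): 0.10,
    frozenset(("a", "d")): 0.12, frozenset(("b", "c")): 0.09,
    frozenset(("b", "d")): 0.11, frozenset(("c", "d")): 0.02,
}
print(leave_one_out(pairwise, states, top_hit_assign))
print(leave_one_out(pairwise, states, neighborhood_assign))
```

The difference between the two functions shows up when several references tie for nearest but carry different states: the top hit returns one state, while the neighborhood returns all of them, which makes the ambiguity visible.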

Comparing the three methods required a high-resolution test dataset comprising DNA barcodes and traits at the specimen level. To build this dataset, thousands of bees were collected at several sites across China and returned to the lab. The authors measured several morphometric traits on these specimens, in particular body length, inter-tegular distance, head width, wing length, and hair length. COI DNA barcodes were then sequenced from the same specimens. This gave a dataset with the resolution needed to account for intraspecific variation in any trait.

In comparative tests on the specimen-level dataset, the authors report that the rate of successful trait assignment is determined primarily by the genetic distance between the query and the nearest reference. However, the three approaches did not behave the same (fig.2). Phylogenetic Assignment in particular was found to have advantageous features: it rarely returned a significant state assignment where success was unlikely, that is, where the distance from query to reference was large. In other words, it has a much lower false positive rate than the other two methods.
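The relationship between distance and false positives can be made concrete with a small calculation. The sketch below is illustrative only: it assumes a false positive means that a state was returned and it was wrong, and the records and bin edges are invented, not taken from the study.

```python
from typing import List, Optional, Sequence, Tuple

# (distance to nearest reference, predicted state or None, true state)
Record = Tuple[float, Optional[str], str]


def false_positive_rate_by_bin(records: Sequence[Record],
                               bin_edges: Sequence[float]) -> List[Optional[float]]:
    """For each distance bin [lo, hi), return the fraction of queries
    that received a wrong assignment; None marks an empty bin.

    A prediction of None means the method declined to assign a state,
    which is not counted as a false positive.
    """
    rates: List[Optional[float]] = []
    for lo, hi in zip(bin_edges, bin_edges[1:]):
        in_bin = [r for r in records if lo <= r[0] < hi]
        if not in_bin:
            rates.append(None)
            continue
        wrong = sum(1 for _, pred, true in in_bin
                    if pred is not None and pred != true)
        rates.append(wrong / len(in_bin))
    return rates


# Invented records: two close queries, two distant ones.
records = [
    (0.01, "social", "social"),
    (0.02, "social", "social"),
    (0.15, None, "solitary"),      # method declined to assign
    (0.18, "social", "solitary"),  # wrong assignment at large distance
]
print(false_positive_rate_by_bin(records, [0.0, 0.05, 0.2]))
```

Under this definition, a method that declines to assign a state to distant queries keeps its false positive rate low in the far bins, which is the behaviour reported for Phylogenetic Assignment.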

With an accurate approach to trait assignment established (Phylogenetic Assignment), the second major aim of the study was to develop a practical near-term pipeline for incorporating functional traits into DNA-based profiling. For this, the authors propose a species-level reference framework built around published trait records, public DNA barcodes and an integrative phylogeny, chosen because these types of data are widely available. The authors collated thousands of records for a wide variety of functional traits, including morphometric and life history traits. They also mined public DNA barcode data and used a phyloinformatics pipeline to integrate it with backbone phylogenetic information, giving a species-level phylogeny. Using this trait + DNA barcode + phylogeny reference framework (fig.3), the authors then measured the rates at which the many traits were assigned to a query dataset of Chinese bee DNA barcodes delineated to MOTU. The rate of trait assignment varied greatly between types of trait. The highest rates of state assignment were observed for conserved life history traits such as sociality, nest location and parasitism, whereas labile morphometric traits were assigned at a much lower rate (fig.4).
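The per-trait rate of assignment described above amounts to a simple tally over the query MOTUs. The following is a rough sketch with invented MOTU names and states; it assumes that an unassigned trait is recorded as `None`, which is a convention chosen here for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional


def assignment_rates(
    assignments: Dict[str, Dict[str, Optional[str]]]
) -> Dict[str, float]:
    """Return, for each trait, the fraction of MOTUs that received a state.

    assignments[motu][trait] holds the assigned state, or None when no
    state could be assigned to that MOTU for that trait.
    """
    assigned: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for traits in assignments.values():
        for trait, state in traits.items():
            total[trait] += 1
            if state is not None:
                assigned[trait] += 1
    return {trait: assigned[trait] / total[trait] for trait in total}


# Invented example: four MOTUs, one life history and one morphometric trait.
assignments = {
    "MOTU_1": {"sociality": "social", "body_length": "large"},
    "MOTU_2": {"sociality": "solitary", "body_length": None},
    "MOTU_3": {"sociality": "solitary", "body_length": None},
    "MOTU_4": {"sociality": None, "body_length": None},
}
print(assignment_rates(assignments))
```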

Xie et al. (2023) propose to unite two key themes in ecology that have so far been unlinked: functional traits and DNA barcoding. By combining the high throughput in community profiling afforded by DNA barcoding with the power of functional traits to explain community composition, this approach opens new capabilities for understanding the structure and function of biodiversity patterns.

Tingting Xie, Michael C. Orr, Dan Zhang, Rafael R. Ferrari, Yi Li, Xiuwei Liu, Zeqing Niu, Mingqiang Wang, Qingsong Zhou, Jiasheng Hao, Chaodong Zhu, Douglas Chesters. (2023) Phylogeny-based assignment of functional traits to DNA barcodes outperforms distance-based, in a comparison of approaches. Molecular Ecology Resources.

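In 2003, DNA barcoding was proposed and quickly developed into an important means of profiling invertebrate species diversity (e.g. MOTU) and assigning species names. As the corresponding software was developed, reports of phylogenies built from molecular sequences (e.g. DNA barcode datasets) became increasingly common. As a result, three types of diversity measure (species, taxonomic, and phylogenetic) have become standard in studies of α and β diversity in invertebrates. In plant research, functional traits have drawn wide attention because they both reflect the diversity structure of communities and help to explain the mechanisms behind community structure. Surprisingly, apart from work in microbiology on annotating metabolic traits from 16S barcode sequences, the fields of DNA barcoding and functional traits have almost no overlap. Tingting Xie et al. set out to fill this gap with two main aims: first, the study proposes methods for assigning functional traits to DNA barcodes; second, it builds a pipeline framework for combining functional traits with DNA barcodes (fig.1).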

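To achieve the first aim, the study developed three DNA barcode-based methods of trait assignment, each built on DNA barcodes, phylogenetic relationships, and traits. The first is ‘Phylogenetic Assignment’: a phylogeny containing references with known traits is constructed, and the traits are modelled on it to predict the states of the unknowns (queries). The second is ‘Blast Neighborhood’, whose principle is similar to an existing distance-based method for assigning species names. It runs two searches: the first retrieves a distance threshold, which is the distance between the query and the nearest reference, and the second retrieves all results within that threshold. All trait states present in the neighborhood are then assigned to the query. Finally, the study runs a simple top-hit match, in which the query is assigned the state of its nearest match. The accuracy of all three methods is evaluated by leave-one-out cross-validation.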

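Comparing the accuracy of the three methods depends on a high-resolution dataset. The study therefore built a test dataset containing specimen-level DNA barcodes and trait data. To build it, thousands of bees were collected at multiple sites across China and their morphological traits were measured, chiefly body length, inter-tegular distance, head width, wing length, and hair length. The corresponding DNA barcode sequences (COI) were obtained by molecular methods. The high resolution of this dataset helps to account for intraspecific variation in any trait.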

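At the specimen level, comparison of the three methods showed that the success rate of trait assignment depends mainly on the genetic distance between the query and the nearest reference. The methods also differed in accuracy (fig.2). Phylogenetic Assignment performed best: in particular, when the query was distant from the references, it rarely returned a wrong result. In other words, it has a lower false positive rate than the other two methods.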

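Having established an accurate method of trait assignment (Phylogenetic Assignment), the study went on to propose a practical scheme for bringing functional traits into DNA-based analysis. To this end, it built a species-level reference framework from published trait records, public DNA barcodes, and an integrated phylogeny. The authors collected thousands of trait records, covering morphological and life history traits among others. A phyloinformatics pipeline was then used to merge the mined public DNA barcode data with a backbone tree, giving a species-level phylogeny. Finally, the reference framework combining traits, DNA barcodes, and phylogeny was used to assign traits to a dataset of Chinese bee DNA barcodes (fig.3), and the rate of assignment was evaluated. Rates differed considerably between types of trait: conserved life history traits, such as sociality, nest location and parasitism, were assigned at a higher rate than labile morphological traits (fig.4).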

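Xie et al. (2023) propose combining two key themes in ecology that have so far been unlinked, functional traits and DNA barcoding, which will help in understanding the structure and function of biodiversity.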

Fig. 1