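Gene annotation generally refers to the use of bioinformatic methods to determine the positions and structures of genes in an assembled genome. It typically comprises ab initio annotation, homology-based annotation, and annotation based on transcriptome and proteome evidence. Transcriptome- and proteome-based annotation is currently the most accurate approach, but because it is impossible to sample the transcriptome or proteome of every tissue, developmental stage, and condition, homology-based and ab initio results are needed as a supplement. Gene annotation underpins molecular biology research: if the annotation is incorrect or incomplete, the downstream studies that rely on it are affected as well.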
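Many algorithms and software packages have been developed for gene annotation. Ab initio tools include SNAP (Korf, 2004), TwinScan (Korf et al., 2001), FGENESH (Salamov and Solovyev, 2000), Augustus (Stanke et al., 2006), Genscan (Burge and Karlin, 1997), and GAZE (Howe et al., 2002). These tools usually take a set of known genes as a training set and then predict genes with the trained model. In theory, as long as the training set contains enough genes, this approach can predict all gene loci, but it cannot precisely delimit exon-intron boundaries. Homology-based annotation maps transcript or protein sequences from closely related species onto the genome to be annotated. Commonly used tools include BLAST, BLAT (Kent, 2002), Splign (Kapustin et al., 2008), Spidey (Wheelan et al., 2001), sim4 (Florea et al., 1998), Exonerate (Slater and Birney, 2005), GMAP (Wu and Watanabe, 2005), Magic-BLAST (Boratyn et al., 2019), and minimap2 (Li, 2018). Of these, GMAP, Magic-BLAST, and minimap2 are newer-generation aligners that can rapidly align large numbers of transcript sequences to a genome. Homology-based annotation helps discover gene loci, but because genomes differ between species, the gene structure and whether the gene is expressed at all still need support from transcript-level evidence in the species being annotated.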
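As a concrete illustration of how a spliced alignment turns into a candidate gene structure, the following is a minimal Python sketch. It assumes the aligner was run so that each PAF record carries a `cg:Z:` CIGAR tag in which introns appear as `N` operations (as minimap2 produces in spliced mode with CIGAR output enabled). The function names `exons_from_cigar` and `parse_paf_line` are illustrative and do not belong to any of the tools listed above.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exons_from_cigar(target_start, cigar):
    """Convert a spliced-alignment CIGAR into exon intervals.

    Coordinates are 0-based and half-open on the genome, matching PAF.
    'N' operations are treated as introns; 'M', 'D', '=' and 'X' advance
    along the genome; 'I', 'S', 'H' and 'P' do not.
    """
    exons = []
    block_start = target_start
    pos = target_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":
            pos += length
        elif op == "N":
            if pos > block_start:
                exons.append((block_start, pos))
            pos += length          # skip the intron
            block_start = pos      # next exon starts after it
    if pos > block_start:
        exons.append((block_start, pos))
    return exons


def parse_paf_line(line):
    """Extract the fields needed for exon reconstruction from one PAF line."""
    fields = line.rstrip("\n").split("\t")
    # Optional tags look like "cg:Z:50M200N30M": key, type, value.
    tags = {f[:2]: f[5:] for f in fields[12:]}
    return {
        "query": fields[0],
        "strand": fields[4],
        "target": fields[5],
        "target_start": int(fields[7]),
        "cigar": tags.get("cg"),
    }
```

Given a parsed record `rec`, `exons_from_cigar(rec["target_start"], rec["cigar"])` returns the exon blocks implied by that single alignment. Whether those blocks are a trustworthy gene model still depends on the transcript-level evidence discussed in the text.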
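Transcriptome-based annotation aligns transcript sequences from various sources to the genome and annotates genes according to where the transcripts map. The aligners used are the same as those listed above for homologous transcripts. Transcript sequences generally come from ESTs, full-length cDNAs, and transcripts obtained by second- or third-generation sequencing. Because second-generation sequencing fragments the molecules before sequencing them, assembling the reads into full-length transcripts can produce false-positive transcripts. Third-generation sequencing avoids this problem, but owing to its high error rate and relatively high cost it has been used in only a small number of studies. In addition, to obtain gene orientation and more precise transcription start and end sites, data from second-generation platforms such as strand-specific RNA-seq, Cap Analysis Gene Expression (CAGE-seq), and PolyA-seq have also been incorporated into genome annotation workflows (Wang et al., 2019). Compared with transcriptomics, high-throughput proteomics has not yet achieved a key breakthrough. Ribosome profiling (Ribo-seq) can substitute for it to some extent: the technique captures mRNA fragments that are actively being translated, although no report of these data being used in an annotation workflow has appeared so far. Normal transcription of a genome can also produce transcriptional noise that does not correspond to real genes, so expression level should be taken into account during annotation in order to exclude likely noise.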
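One simple way to act on the last point is to normalize read counts to transcripts per million (TPM) and discard transcript models below a cutoff. The sketch below is only an illustration: the function names and the default cutoff of 1 TPM are assumptions rather than values taken from the text, and a real workflow would choose the threshold per dataset.

```python
def tpm(counts, lengths):
    """Compute transcripts per million from raw counts and lengths (bp).

    `counts` and `lengths` are dicts keyed by transcript ID.
    """
    # Reads per base for each transcript.
    rates = {t: counts[t] / lengths[t] for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    return {t: rate / total * 1e6 for t, rate in rates.items()}


def filter_low_expression(tpm_values, min_tpm=1.0):
    """Return the IDs of transcripts at or above the cutoff.

    `min_tpm` is a hypothetical default chosen for illustration.
    """
    return {t for t, value in tpm_values.items() if value >= min_tpm}
```

A transcript that falls below the cutoff in every sampled condition is a candidate for transcriptional noise, but it may also be a genuine gene expressed only under conditions that were not sampled, so such a filter is best treated as one line of evidence rather than a final decision.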
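To improve the accuracy and completeness of gene annotation, the three approaches described above can be combined. Some software integrates all three into a single workflow, such as MAKER (Cantarel et al., 2008), MAKER-P (Campbell et al., 2014), PASA (Haas et al., 2003), and Funannotate[1]. Several comprehensive biological database sites have also developed their own annotation pipelines, such as the Gramene pipeline (Liang et al., 2009), the Ensembl gene annotation system (Aken et al., 2016), the NCBI Eukaryotic Genome Annotation Pipeline[2], and PGSB[3]. As transcripts from third-generation sequencing become more abundant, annotation software built around third-generation transcriptome data has also been developed, such as LoReAn (Cook et al., 2019) and Mikado (Venturini et al., 2018). In addition, as sequencing costs fall and assembly technology improves, assembling a new genome de novo has become easier. For species that already have a gene annotation, the existing annotation can be transferred to the new genome, and bioinformatic tools are available that do this conveniently (Konig et al., 2016; Song et al., 2019). In summary, software that integrates different annotation methods greatly simplifies gene annotation, and manual curation can then be applied to correct gene models that may still be wrong. Apollo, developed by Dunn et al. (2019), makes such manual curation more convenient.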
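The idea of combining evidence types can be sketched with a toy example: group overlapping predictions on the same strand into loci and record which kinds of evidence support each locus. This is not how MAKER, PASA, or any other pipeline named above works internally. The `Evidence` record, the source labels, and the `min_sources` threshold are all assumptions made for illustration.

```python
from collections import namedtuple

# One predicted or aligned gene interval; coordinates are 0-based, half-open.
# `source` is a free-form label such as "ab_initio", "homology" or "transcript".
Evidence = namedtuple("Evidence", "chrom start end strand source")


def cluster_loci(evidence):
    """Merge overlapping same-strand intervals into loci.

    Each locus records the set of evidence sources that contributed to it.
    """
    loci = []
    ordered = sorted(evidence, key=lambda e: (e.chrom, e.strand, e.start, e.end))
    for ev in ordered:
        last = loci[-1] if loci else None
        if (last is not None
                and last["chrom"] == ev.chrom
                and last["strand"] == ev.strand
                and ev.start < last["end"]):
            # Overlaps the current locus: extend it and note the source.
            last["end"] = max(last["end"], ev.end)
            last["sources"].add(ev.source)
        else:
            loci.append({
                "chrom": ev.chrom,
                "start": ev.start,
                "end": ev.end,
                "strand": ev.strand,
                "sources": {ev.source},
            })
    return loci


def supported_loci(loci, min_sources=2):
    """Keep loci backed by at least `min_sources` distinct evidence types."""
    return [locus for locus in loci if len(locus["sources"]) >= min_sources]
```

Loci supported by a single evidence type are not necessarily wrong. In this sketch they are simply the ones a curator might inspect first.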
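The annotation methods above apply equally to wheat genomes. The gene annotations of Triticum urartu, Aegilops tauschii, wild emmer wheat, cultivated emmer wheat, and Chinese Spring all used the three approaches and the associated software described above. In annotating the Chinese Spring genome, the International Wheat Genome Sequencing Consortium used several methods: in addition to the PGSB and PASA pipelines, it used TriAnnot (Leroy et al., 2012), a pipeline developed specifically for annotating the wheat genome. TriAnnot covers transposable element annotation, gene annotation, and subsequent functional annotation of genes (Figure 1). Even so, practical use has shown that errors remain in the current Chinese Spring gene annotation. For example, the wheat male sterility gene Ms2 is absent from the current annotation release (Ni et al., 2017). In addition, many of the Triticeae-specific genes that we identified are also missing from the currently annotated gene set (Ma et al., 2019).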
Figure 1. Overview of the TriAnnot workflow (Leroy et al., 2012).
References
Aken, B.L., Ayling, S., Barrell, D., Clarke, L., Curwen, V., Fairley, S., Fernandez Banet, J., Billis, K., Garcia Giron, C., Hourlier, T., et al. (2016). The Ensembl gene annotation system. Database (Oxford) 2016.
Boratyn, G.M., Thierry-Mieg, J., Thierry-Mieg, D., Busby, B., and Madden, T.L. (2019). Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20:405.
Burge, C., and Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78-94.
Campbell, M.S., Law, M., Holt, C., Stein, J.C., Moghe, G.D., Hufnagel, D.E., Lei, J., Achawanantakun, R., Jiao, D., Lawrence, C.J., et al. (2014). MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol 164:513-524.
Cantarel, B.L., Korf, I., Robb, S.M., Parra, G., Ross, E., Moore, B., Holt, C., Sanchez Alvarado, A., and Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18:188-196.
Cook, D.E., Valle-Inclan, J.E., Pajoro, A., Rovenich, H., Thomma, B., and Faino, L. (2019). Long-Read Annotation: automated eukaryotic genome annotation based on long-read cDNA sequencing. Plant Physiol 179:38-54.
Dunn, N.A., Unni, D.R., Diesh, C., Munoz-Torres, M., Harris, N.L., Yao, E., Rasche, H., Holmes, I.H., Elsik, C.G., and Lewis, S.E. (2019). Apollo: Democratizing genome annotation. PLoS Comput Biol 15:e1006790.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. (1998). A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8:967-974.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31:5654-5666.
Howe, K.L., Chothia, T., and Durbin, R. (2002). GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res 12:1418-1427.
Kapustin, Y., Souvorov, A., Tatusova, T., and Lipman, D. (2008). Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3:20.
Kent, W.J. (2002). BLAT—the BLAST-like alignment tool. Genome Res 12:656-664.
Konig, S., Romoth, L.W., Gerischer, L., and Stanke, M. (2016). Simultaneous gene finding in multiple genomes. Bioinformatics 32:3388-3395.
Korf, I. (2004). Gene finding in novel genomes. BMC Bioinformatics 5:59.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. (2001). Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1:S140-148.
Leroy, P., Guilhot, N., Sakai, H., Bernard, A., Choulet, F., Theil, S., Reboux, S., Amano, N., Flutre, T., Pelegrin, C., et al. (2012). TriAnnot: a versatile and high performance pipeline for the automated annotation of plant genomes. Front Plant Sci 3:5.
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100.
Liang, C., Mao, L., Ware, D., and Stein, L. (2009). Evidence-based gene predictions in plant genomes. Genome Res 19:1912-1923.
Ma, S., Yuan, Y., Tao, Y., Jia, H., and Ma, Z. (2019). Identification, characterization and expression analysis of lineage-specific genes within Triticeae. Genomics doi:10.1016/j.ygeno.2019.1008.1003.
Ni, F., Qi, J., Hao, Q., Lyu, B., Luo, M.C., Wang, Y., Chen, F., Wang, S., Zhang, C., Epstein, L., et al. (2017). Wheat Ms2 encodes for an orphan protein that confers male sterility in grass species. Nat Commun 8:15121.
Salamov, A.A., and Solovyev, V.V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res 10:516-522.
Slater, G.S., and Birney, E. (2005). Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31.
Song, B., Sang, Q., Wang, H., Pei, H., Wang, F., and Gan, X. (2019). A weighted sequence alignment strategy for gene structure annotation lift over from reference genome to a newly sequenced individual. bioRxiv.
Stanke, M., Schoffmann, O., Morgenstern, B., and Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62.
Venturini, L., Caim, S., Kaithakottil, G.G., Mapleson, D.L., and Swarbreck, D. (2018). Leveraging multiple transcriptome assembly methods for improved gene structure annotation. Gigascience 7.
Wang, K., Wang, D., Zheng, X., Qin, A., Zhou, J., Guo, B., Chen, Y., Wen, X., Ye, W., Zhou, Y., et al. (2019). Multi-strategic RNA-seq analysis reveals a high-resolution transcriptional landscape in cotton. Nat Commun 10:4714.
Wheelan, S.J., Church, D.M., and Ostell, J.M. (2001). Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11:1952-1957.
Wu, T.D., and Watanabe, C.K. (2005). GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859-1875.
[1] https://funannotate.readthedocs.io
[2] https://www.ncbi.nlm.nih.gov/books/NBK169439
[3] http://pgsb.helmholtz-muenchen.de/plant