ljxue's personal blog http://blog.sciencenet.cn/u/ljxue Liangjiao Xue, Bioinformatics is my favorite.

JC3: De novo genome assembly

2751 views | 2014-3-18 02:39 | Personal category: Journal Club | System category: Paper Exchange | Assembly, genome

De novo genome assembly: what every biologist should know


If you want a genome assembled....

Seek help. For dnGASP and Assemblathon, some teams simply fed data into an assembler and applied all the default settings. Those teams performed poorly, and even running an assembler on its default settings requires considerable computational expertise. “The developer of software normally knows how to use it best,” says Ivo Gut. Researchers also need help planning and making their libraries.

Know what you want. Assemblers have different strengths and weaknesses. Someone who cares about how large swaths of the genome are arranged would value longer, more accurate contigs. A scientist who cares about having correct reading frames for genes would be more concerned about finer-grained errors.

Take the transcriptome, too. Analyzing transcribed regions can vastly improve assemblies. “Every de novo genome project should have a parallel RNA-seq project,” says Ian Korf. Besides identifying the intron-exon structure within genes, he says, this can help assess the accuracy of assembly, inform scaffold construction and help train algorithms that find genes.

Be realistic about computer resources. Scientists who are considering using a desktop version of a genome assembler must calibrate expectations to the size of the genome they hope to analyze. One study that compared eight assemblers found that only three programs worked on the approximately 250-megabase bumble bee genome. One required certain kinds of data that weren't available. For four of the others, the genome was simply too big for the computer's memory.

If you want to analyze a newly assembled genome....

Don't assume that features missing from the assembly are missing from the organism. If there are ten closely related genes in the genome, the assembly program may not be able to tease those apart, and some genes may be dropped. If researchers really care about a specific gene or other feature, they should consider targeted resequencing. “Don't take as Gospel the output of an assembly program,” says Benedict Paten. “If your paper is going to rely on that [finding], it is absolutely essential that you do PCR and other follow-up experiments.”

Compare alternate assemblies. Although combining assemblies is still difficult, looking at different assemblies may give researchers the information they need. For example, two assemblies of the cow genome each have similar numbers of genes that have not been put together properly, but the genes involved are different.

Turn the assembly tracks on. Though there are as of yet few local measures that assess genome quality, savvy biologists should be on the lookout for trouble. Many misassemblies can be identified by a measure known as the compression-expansion statistic, says Michael Schatz at Cold Spring Harbor Laboratory. “This is one of the few sensitive and specific metrics for identifying insertions and deletions in an assembly without requiring a reference genome.”
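As a rough illustration of the compression-expansion (CE) statistic mentioned above: it is usually described as a z-score comparing the mean insert size of the mate pairs spanning a position with the mean of the whole library. The sketch below assumes the insert sizes have already been extracted from read alignments; the function name `ce_statistic` and the example numbers are made up for illustration and are not taken from any particular tool.

```python
import math


def ce_statistic(local_inserts, lib_mean, lib_sd):
    """Z-score of the local mean insert size against the library mean.

    local_inserts: insert sizes (bp) of mate pairs spanning one position,
                   as measured on the assembly (assumed precomputed).
    lib_mean, lib_sd: mean and standard deviation of the whole library.
    """
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no mate pairs span this position")
    local_mean = sum(local_inserts) / n
    # Standard error of the mean for n spanning pairs.
    return (local_mean - lib_mean) / (lib_sd / math.sqrt(n))


# Hypothetical 3 kb library; four spanning pairs all look too short,
# which would suggest the assembly is compressed at this position.
z = ce_statistic([2400, 2400, 2400, 2400], lib_mean=3000, lib_sd=300)
print(z)  # -4.0
```

Strongly negative values point to compression (sequence missing from the assembly, such as a collapsed repeat), and strongly positive values to expansion. Any cutoff, for example an absolute value above 3, is a choice to tune rather than a fixed rule.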

Regions that have considerably lower read depth than the rest of the assembly may represent a single polymorphic locus that the assembler has classified as two distinct loci. If the read depth is too high, an assembler may have merged repetitive regions, particularly a type of repetitive sequence known as segmental duplication. If a gene or region of interest is near a gap between contigs, researchers should be suspicious. Also, if the tracker indicates high levels of both discordant and concordant data, the region may be polymorphic, with differences between homologous chromosomes.
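The read-depth checks described above can be sketched as a simple comparison of each window's depth with the assembly-wide median. This is a minimal sketch that assumes per-window depths have already been computed; the function name `classify_depth` and the 0.5x and 1.5x cutoffs are illustrative assumptions, not recommended values.

```python
from statistics import median


def classify_depth(window_depths, low=0.5, high=1.5):
    """Label each window by its depth relative to the median depth.

    low / high are hypothetical ratio cutoffs chosen for illustration.
    """
    med = median(window_depths)
    if med == 0:
        raise ValueError("median depth is zero; cannot compute ratios")
    labels = []
    for depth in window_depths:
        ratio = depth / med
        if ratio < low:
            # Possibly one polymorphic locus assembled as two loci.
            labels.append("low")
        elif ratio > high:
            # Possibly merged repeats, e.g. a segmental duplication.
            labels.append("high")
        else:
            labels.append("ok")
    return labels


print(classify_depth([30, 31, 29, 14, 62, 30]))
# ['ok', 'ok', 'ok', 'low', 'high', 'ok']
```

A flagged window is only a hint to look more closely, not proof of a misassembly.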

Expect lower quality in difficult regions. Some genomes are harder to assemble than others. In general, the larger the genome, the more mistakes. But if a scientist's region of interest has a high percentage of guanine and cytosine content or a lot of repeats, that scientist should be particularly wary: DNA amplification and assembly technologies deal poorly with such content.
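To see whether a region of interest is GC-rich, one option is to compute GC content in sliding windows along the sequence. The sketch below is a minimal version; the function name `gc_windows` and the window sizes are assumptions for illustration, and what counts as a "high" GC fraction depends on the organism and the sequencing technology.

```python
def gc_windows(seq, window=100, step=100):
    """Return (start, gc_fraction) for each full window along seq.

    Ambiguous bases such as N count toward the window length but not
    toward GC, so gap-rich windows will look artificially GC-poor.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


# Toy sequence: one all-GC window followed by one all-AT window.
print(gc_windows("GGCCATAT", window=4, step=4))
# [(0, 1.0), (4, 0.0)]
```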


Take-home messages:

1) No single software package works perfectly for any particular genome.

2) Use several criteria to evaluate the assembly, especially the ones you care about most.

3) Combining several assemblies seems to be a good choice.
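As one concrete example of an evaluation criterion, contiguity is often summarized with N50, a metric the text above does not name but which is widely reported for assemblies: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The sketch below assumes the contig lengths are already available as a list; `n50` is a name chosen here for illustration.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length


print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

N50 measures contiguity only. A misassembled genome can still have a high N50, which is one reason to combine it with the local checks described earlier.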






https://wap.sciencenet.cn/blog-285393-776919.html
