syfox's personal blog http://blog.sciencenet.cn/u/syfox

Blog post

Bioinformatics data

Read 7088 times | 2010-8-9 09:28 | Personal category: Sequencing | System category: Research notes

K-mer

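The term k-mer (or x-mer, where x can be virtually any consonant) usually refers to a specific n-tuple or n-gram of a nucleic acid or amino acid sequence that can be used to identify certain regions within biomolecules such as DNA (for example, for gene prediction) or proteins. The k-mer strings themselves can be used to find regions of interest, or k-mer statistics can be used, giving discrete probability distributions over the possible k-mer combinations (or rather, permutations with repetition). Specific short k-mers are called oligomers, or "oligos" for short.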

Examples
  • Dimer (2-mer AG, repeated) = AGAGAGAGAGAGAG
  • Trimer (3-mer AAG, repeated) = AAGAAGAAGAAG
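As a minimal sketch of the k-mer statistics described above, the following Python snippet counts every overlapping k-mer in a sequence with a sliding window and turns the counts into a discrete frequency distribution. The function names `count_kmers` and `kmer_frequencies` are illustrative assumptions, not part of any particular library.

```python
from collections import Counter


def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq using a sliding window."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(seq: str, k: int) -> dict:
    """Return the relative frequency of each observed k-mer."""
    counts = count_kmers(seq, k)
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# The two example sequences from the text.
dimer_counts = count_kmers("AGAGAGAGAGAGAG", 2)   # AG: 7, GA: 6
trimer_counts = count_kmers("AAGAAGAAGAAG", 3)    # AAG: 4, AGA: 3, GAA: 3
```

Because the window slides one base at a time, a pure AG repeat also yields the shifted 2-mer GA, which is why both appear in the counts.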

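A contig is a group of overlapping sequences. It is a concept from genome sequencing and analysis.
After gene fragments containing STS (sequence-tagged site) markers are sequenced separately, overlap analysis yields the complete genome sequence of the chromosome. The contig is one of the concepts used in this analysis.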

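Looking at how a physical map is built makes this concept easier to understand:
The basic principle is to first "smash" the huge, otherwise intractable DNA into pieces and then join the pieces back together. Map distances are given in Mb, kb and bp, and the STS (sequence-tagged site) sequences of DNA probes serve as landmarks. In 1998, a physical map of contiguous clones was completed that contained 52,000 sequence-tagged sites (STSs) and covered most of the human genome. A major part of building a physical map is joining the cloned DNA fragments that contain STS sequences into mutually overlapping contigs. Libraries of human DNA fragments carried in yeast artificial chromosome (YAC) vectors already include highly representative contigs with an overall coverage of 100%, and more reliable BAC, PAC and cosmid libraries have been developed in recent years.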

Common terms in molecular genetics (English)

Sequence

Raw sequence Individual unassembled sequence reads, produced by sequencing of clones containing DNA inserts.

Paired-end sequence Raw sequence obtained from both ends of a cloned insert in any vector, such as a plasmid or bacterial artificial chromosome.

Finished sequence Complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps.

Coverage (or depth) The average number of times a nucleotide is represented by a high-quality base in a collection of random raw sequence. Operationally, a 'high-quality base' is defined as one with an accuracy of at least 99% (corresponding to a PHRED score of at least 20).
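The coverage definition above can be sketched in a few lines of Python: count only the bases whose quality score reaches the threshold (PHRED 20 by default) and divide by the length of the clone or genome being sequenced. The function name `mean_coverage` and the representation of each read as a list of per-base quality scores are assumptions made for illustration.

```python
def mean_coverage(read_qualities, target_length, min_quality=20):
    """Average number of high-quality bases per position of the target.

    read_qualities: iterable of reads, each a list of per-base quality scores.
    target_length:  length in bases of the clone or genome being covered.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    high_quality_bases = sum(
        1
        for quals in read_qualities
        for q in quals
        if q >= min_quality
    )
    return high_quality_bases / target_length


# Three 10-base reads; the last one has two bases below the Q20 cutoff.
reads = [[30] * 10, [25] * 10, [20] * 8 + [10] * 2]
coverage = mean_coverage(reads, target_length=7)  # 28 / 7 = 4.0
```

This gives an average only; it says nothing about how evenly the random reads are spread along the target.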

Full shotgun coverage The coverage in random raw sequence needed from a large-insert clone to ensure that it is ready for finishing; this varies among centres but is typically 8–10-fold. Clones with full shotgun coverage can usually be assembled with only a handful of gaps per 100 kb.

Half shotgun coverage Half the amount of full shotgun coverage (typically, 4–5-fold random coverage).

Clones

BAC clone Bacterial artificial chromosome vector carrying a genomic DNA insert, typically 100–200 kb. Most of the large-insert clones sequenced in the project were BAC clones.

Finished clone A large-insert clone that is entirely represented by finished sequence.

Full shotgun clone A large-insert clone for which full shotgun sequence has been produced.

Draft clone A large-insert clone for which roughly half-shotgun sequence has been produced. Operationally, the collection of draft clones produced by each centre was required to have an average coverage of fourfold for the entire set and a minimum coverage of threefold for each clone.

Predraft clone A large-insert clone for which some shotgun sequence is available, but which does not meet the standards for inclusion in the collection of draft clones.

Contigs and scaffolds

Contig The result of joining an overlapping collection of sequences or clones.

Scaffold The result of connecting contigs by linking information from paired-end reads from plasmids, paired-end reads from BACs, known messenger RNAs or other sources. The contigs in a scaffold are ordered and oriented with respect to one another.

Fingerprint clone contigs Contigs produced by joining clones inferred to overlap on the basis of their restriction digest fingerprints.

Sequenced-clone layout Assignment of sequenced clones to the physical map of fingerprint clone contigs.

Initial sequence contigs Contigs produced by merging overlapping sequence reads obtained from a single clone, in a process called sequence assembly.
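To illustrate the idea of merging overlapping reads into a contig, here is a toy greedy assembler in Python. It is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and overlaps must match exactly. It is not how PHRAP or any production assembler works, and the names `overlap_length` and `greedy_assemble` are hypothetical.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            n = overlap_length(a, b, min_overlap)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            return contigs  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        # Keep a whole, then append the part of b that extends past the overlap.
        contigs.append(best_a + best_b[best_len:])


contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTA"])
# contigs == ["ATGGCGTACCTTA"]
```

Reads that share no sufficiently long overlap are returned as separate contigs, which corresponds to the gaps that remain between contigs in a real assembly.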

Merged sequence contigs Contigs produced by taking the initial sequence contigs contained in overlapping clones and merging those found to overlap. These are also referred to simply as 'sequence contigs' where no confusion will result.

Sequence-contig scaffolds Scaffolds produced by connecting sequence contigs on the basis of linking information.

Sequenced-clone contigs Contigs produced by merging overlapping sequenced clones.

Sequenced-clone-contig scaffolds Scaffolds produced by joining sequenced-clone contigs on the basis of linking information.

Draft genome sequence The sequence produced by combining the information from the individual sequenced clones (by creating merged sequence contigs and then employing linking information to create scaffolds) and positioning the sequence along the physical map of the chromosomes.

N50 length A measure of the contig length (or scaffold length) containing a 'typical' nucleotide. Specifically, it is the maximum length L such that 50% of all nucleotides lie in contigs (or scaffolds) of size at least L.
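The N50 definition above translates directly into code: sort the contig (or scaffold) lengths from longest to shortest and walk down the list until the running total reaches half of all nucleotides. The function name `n50` is an illustrative choice.

```python
def n50(lengths):
    """Largest L such that contigs of length >= L hold at least 50% of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


example = n50([80, 70, 50, 40, 30, 20])  # total 290; 80 + 70 = 150 >= 145, so 70
```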

Computer programs and databases

PHRED A widely used computer program that analyses raw sequence to produce a 'base call' with an associated 'quality score' for each position in the sequence. A PHRED quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRED quality score of 30 corresponds to 99.9% accuracy for the base call in the raw read.

PHRAP A widely used computer program that assembles raw sequence into sequence contigs and assigns to each position in the sequence an associated 'quality score', on the basis of the PHRED scores of the raw sequence reads. A PHRAP quality score of X corresponds to an error probability of approximately 10^(-X/10). Thus, a PHRAP quality score of 30 corresponds to 99.9% accuracy for a base in the assembled sequence.
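The relationship between a quality score and its error probability, which is the same for PHRED and PHRAP scores, can be written as two small conversion functions. This is a sketch of the formula only, not of either program; the function names are hypothetical.

```python
import math


def quality_to_error_probability(q: float) -> float:
    """Error probability for quality score q: 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_quality(p: float) -> float:
    """Quality score for error probability p: -10 * log10(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


p20 = quality_to_error_probability(20)  # about 0.01, i.e. 99% accuracy
p30 = quality_to_error_probability(30)  # about 0.001, i.e. 99.9% accuracy
```

Each increase of 10 in the score therefore corresponds to a tenfold drop in the estimated error rate.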

GigAssembler A computer program developed during this project for merging the information from individual sequenced clones into a draft genome sequence.

Public sequence databases The three coordinated international sequence databases: GenBank, the EMBL data library and DDBJ.

Map features

STS Sequence tagged site, corresponding to a short (typically less than 500 bp) unique genomic locus for which a polymerase chain reaction assay has been developed.

EST Expressed sequence tag, obtained by performing a single raw sequence read from a random complementary DNA clone.

SSR Simple sequence repeat, a sequence consisting largely of a tandem repeat of a specific k-mer (such as (CA)15, that is, 15 tandem copies of the dinucleotide CA). Many SSRs are polymorphic and have been widely used in genetic mapping.
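A perfect tandem repeat such as the one in the SSR definition can be located with a regular-expression backreference. The sketch below only finds exact repeats of a fixed unit length, whereas real SSRs are often imperfect, and with the default unit length of 2 a long homopolymer run such as AAAA... is also reported as a repeat of the unit AA. The function name `find_ssrs` and its parameters are assumptions for illustration.

```python
import re


def find_ssrs(seq: str, unit_len: int = 2, min_repeats: int = 15):
    """Find perfect tandem repeats of a unit_len-base unit.

    Returns a list of (start, unit, repeat_count) tuples.
    """
    if unit_len < 1 or min_repeats < 1:
        raise ValueError("unit_len and min_repeats must be at least 1")
    # Group 1 captures the unit; \1 requires further exact copies of it.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)) // unit_len)
        for m in pattern.finditer(seq.upper())
    ]


hits = find_ssrs("GG" + "CA" * 15 + "TT")
# hits == [(2, "CA", 15)]
```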

SNP Single nucleotide polymorphism, or a single nucleotide position in the genome sequence for which two or more alternative alleles are present at appreciable frequency (traditionally, at least 1%) in the human population.

Genetic map A genome map in which polymorphic loci are positioned relative to one another on the basis of the frequency with which they recombine during meiosis. The unit of distance is centimorgans (cM), denoting a 1% chance of recombination.

Radiation hybrid (RH) map A genome map in which STSs are positioned relative to one another on the basis of the frequency with which they are separated by radiation-induced breaks. The frequency is assayed by analysing a panel of human–hamster hybrid cell lines, each produced by lethally irradiating human cells and fusing them with recipient hamster cells such that each carries a collection of human chromosomal fragments. The unit of distance is centirays (cR), denoting a 1% chance of a break occurring between two loci.



https://wap.sciencenet.cn/blog-223428-351242.html
