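Besides today's first post, we have prepared a hearty serving of wheat bioinformatics basics for you: an introduction to the various wheat genome sequence databases, including IWGSC RefSeq 1.0, covered today, and the TGAC version on plants.ensembl that members of our group chat discussed a while ago. Of these, IWGSC RefSeq 1.0 is the newest and most complete, and it should normally be your first choice. But when your sequence cannot be found in this version, or your genetic map shows poor collinearity with RefSeq, you need to try the other versions. More importantly, some wheat gene expression databases and wheat TILLING databases are not annotated against the latest IWGSC RefSeq 1.0, so knowing the different versions of the wheat databases will let you work with ease later on!
Part 1: Introduction to genome assemblies
Source: www.wheat-training.com
This summary does not cover the D genome database released last year, but after reading the summary below you should be able to work that one out by analogy.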
a) Introduction to the wheat genome
Wheat is an allopolyploid of which there are two major types:
hexaploid common wheat (Triticum aestivum ssp. aestivum; 17 Gb genome size; AABBDD genomes), which is mainly used for bread and biscuit products
tetraploid durum wheat (Triticum turgidum ssp. durum; 12 Gb genome size; AABB genomes), used mainly for pasta.
Hexaploid wheat arose from a polyploidization event ~10,000 years ago, whereas tetraploid wheat arose ~400,000 years ago (Figure 1).
Figure 1: The evolutionary history of allopolyploid wheat. From Borrill et al., 2015. DOI: 10.1111/nph.13533
Hexaploid wheat contains three closely related genomes (A, B and D) which contain homoeologous genes in a conserved order. Wheat homoeologues share over 95% sequence identity within coding regions, and most wheat genes are expected to be present as three copies, one in each of the A, B and D genomes. Due to the high sequence conservation between homoeologues, genes may be functionally redundant or act in a dose-dependent manner. This means that often all three copies must be knocked out to cause a strong phenotype. However, in other cases, the homoeologous genes have developed specialized functions or become pseudogenized over time due to reduced selection pressure on duplicated genes.
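To make the "over 95% sequence identity" figure concrete, here is a minimal sketch of how percent identity between two already-aligned homoeologue sequences could be computed. The function name `percent_identity` and the three 20 bp fragments are made up for illustration; they are not real wheat sequences, and real comparisons would normally start from an aligner's output rather than hand-aligned strings.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical aligned coding fragments for the A, B and D copies of one gene
# (invented for this example only).
copy_a = "ATGGCGTCCAAGCTGATCGA"
copy_b = "ATGGCGTCCAAGCTGATTGA"
copy_d = "ATGGCGTCGAAGCTGATCGA"

for label, (x, y) in {"A vs B": (copy_a, copy_b),
                      "A vs D": (copy_a, copy_d),
                      "B vs D": (copy_b, copy_d)}.items():
    print(f"{label}: {percent_identity(x, y):.1f}% identity")
```

With sequences this similar, a primer or short read designed against one copy will often match the other two as well, which is one reason homoeologue-specific work in wheat needs care.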
b) Multiple genome assemblies are available
The large size of the wheat genome has made it very challenging to produce a reference genome. Different strategies have been used to create draft genome assemblies. Currently, several genome assemblies need to be used in a complementary fashion because no single reference is the best across all regions. A brief explanation about each genome assembly is given below alongside any caveats.
c) Introduction to each assembly
Chinese Spring Survey Sequence (CSS)
The International Wheat Genome Sequencing Consortium (IWGSC) has used flow-sorting to separate out individual chromosome arms. The landrace used (Chinese Spring) had aneuploid genetic stocks available, which have only one arm of each chromosome, e.g. the short arm of chromosome 1A is deleted so chromosome arm 1AL can be separated from the other chromosomes. This enabled the IWGSC to separate each individual chromosome arm by flow sorting, before sequencing each chromosome arm separately. Individual chromosome arms were sequenced to 30-240x coverage using Illumina NGS, generating a 10.2 Gb assembly of Chinese Spring (termed Chinese Spring survey sequence (CSS)).
Gene models were created by mapping RNA-seq data and using gene models from related species. A total of 99,386 protein-coding genes were predicted, with 193,667 transcripts and splice variants. The gene models rely on the assembled scaffolds, and in some cases they are incomplete because the underlying genomic scaffolds are not full-length assemblies. For example, some genes are split between two different genomic scaffolds (shown below), so one gene is given two different identifiers.
Figure 2: CSS gene models are affected by truncated scaffolds.
Chromosome 3B was assembled using a BAC by BAC approach and is currently considered the “gold standard” for the reference genome assembly. 3B gene models were created separately using RNA-seq data.
CSS reference: IWGSC 2014, DOI:10.1126/science.1251788
CSS data access: http://archive.plants.ensembl.org/Triticum_aestivum/Info/Index
W7984
A whole-genome sequencing approach was undertaken in the synthetic hexaploid wheat “Synthetic W7984”. This approach used large-insert sequencing libraries and enabled separate assemblies of the three homoeologous genomes, to a total assembly size of 9.1 Gb. In some regions the W7984 scaffolds are more contiguous than the CSS, while in other regions the CSS is more contiguous than the W7984. Using both references gives a more complete picture of a genomic region of interest. W7984 does not have any gene models associated with it.
W7984 reference: Chapman et al., 2015 DOI: 10.1186/s13059-015-0582-8
W7984 data access: http://www.cerealsdb.uk.net/cerealgenomics/CerealsDB/blast_WGS.php
TGAC
A whole genome shotgun sequence assembly of Chinese Spring was carried out using nested long mate-pair libraries alongside a modified version of the DISCOVAR algorithm for assembly. This method created an assembly with a total length of 13.4 Gb and an N50 approximately 10x longer than those of the CSS and W7984 assemblies.
Gene models from IWGSC were projected onto the TGAC assembly, with 99% of the total genes located on the TGAC assembly. De novo gene prediction has been carried out for the TGAC assembly resulting in a total of 273,739 transcripts (including non-coding and transcript variants). Of these, 104,305 are high confidence protein-coding genes. In general, these gene models are more complete than the CSS gene models due to the longer contig length within the TGAC assembly. These gene models are available from http://plants.ensembl.org
TGAC reference: Clavijo et al., 2017 DOI:10.1101/gr.217117.116
TGAC data access: http://plants.ensembl.org/Triticum_aestivum/Info/Index?db=core
BLAST at https://wheatis.tgac.ac.uk/grassroots-portal/blast
Triticum 3.1
A whole genome shotgun sequence assembly of Chinese Spring was produced using short Illumina reads and long PacBio reads; the assembly was done in several steps using the MaSuRCA and Celera Assembler software.
The combination of very long reads (average read length ~10 kb) coupled with deep sequencing of low error-rate short reads (65x coverage) produced an assembly with a total length of 15.3 Gb, represented by 279,439 contigs with an N50 of 232,659 bp and an average contig size of 54,912 bp. In contrast to other assemblies of the wheat genome, the Triticum 3.1 assembly is highly contiguous and does not contain unknown nucleotides (Ns). The Triticum 3.1 assembly was aligned to an existing assembly of Aegilops tauschii (the D-genome progenitor of hexaploid wheat) to identify its D-genome portion; these data were saved into a separate assembly called Triticum D 1.0.
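Since N50 and average contig size are quoted for several assemblies here, a small sketch of how the two statistics are usually computed may help. N50 is taken in its common sense: the length L such that contigs of length L or longer contain at least half of the assembled bases. The function name `n50` and the toy contig lengths are illustrative assumptions, not data from any wheat assembly.

```python
def n50(lengths):
    """Length L such that contigs of length >= L together hold
    at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy contig lengths in bp (invented for illustration; total = 300).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_mean = sum(toy_contigs) / len(toy_contigs)
print("toy N50:", n50(toy_contigs))
print("toy mean contig size:", round(toy_mean, 1))

# Rough cross-check of the figures quoted in the text for Triticum 3.1:
# total length (rounded to 15.3 Gb) divided by the number of contigs.
approx_mean_bp = 15.3e9 / 279_439
print("approximate mean contig size from quoted totals:", round(approx_mean_bp))
```

The last value lands near the quoted 54,912 bp; the small difference is what one would expect from the total length being rounded to 15.3 Gb. Note also that N50 is much larger than the mean, because it is weighted towards long contigs.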
There is currently no further annotation available for the Triticum 3.1 assembly.
Triticum 3.1 reference: Zimin et al., 2017 DOI: 10.1093/gigascience/gix097
Triticum 3.1 data access: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA392179
Triticum D 1.0 data access: ftp://ftp.ccb.jhu.edu/pub/data/Triticum_aestivum/Wheat_D_genome/
RefSeq v1.0
A whole genome assembly has been carried out by the IWGSC in collaboration with the company NRGene. Using the proprietary algorithm DeNovoMAGIC with Illumina sequencing data, a 14.5 Gb assembly was produced. This assembly has much larger contigs (N50 super-scaffold length 22.8 Mb) and represents a draft genome more similar in quality to rice and other model species. Sequences have been ordered using POPSEQ data and Hi-C (chromosome conformation capture) to generate 21 pseudomolecules representing the majority of the wheat genome. Gene models have been generated, consisting of 107,891 high confidence genes (supported by homology to genes in other species) and 161,537 low confidence genes (e.g. truncated genes missing a start or stop codon, and genes lacking transcriptional evidence or homology to other species).
WGA reference: unpublished
WGA data access: available after registration with URGI.
BLAST at https://urgi.versailles.inra.fr/blast_iwgsc/?dbgroup=wheat_whole_genome_assemblies&program=blastn
All data downloadable at https://urgi.versailles.inra.fr/download/iwgsc/IWGSC-WGA/
d) Comparison between assemblies
Currently, all 4 reference genomes have their merits due to the differences in gene annotation, incorporation into other resources (e.g. expression browsers, SNP markers and TILLING mutants) and the variety which was sequenced.
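As a rough side-by-side view, the total assembly lengths quoted in the sections above can be compared with the ~17 Gb hexaploid genome size given in the introduction. This is only a sketch: the dictionary simply collects numbers already stated in the text, the helper name `fraction_of_genome` is made up for this example, and total assembly length is not the same thing as true genome coverage (collapsed repeats or duplicated regions can push it either way).

```python
GENOME_SIZE_GB = 17.0  # hexaploid wheat genome size quoted in the introduction

# Total assembly lengths in Gb, as quoted in the text above.
assembly_size_gb = {
    "CSS": 10.2,
    "W7984": 9.1,
    "TGAC": 13.4,
    "Triticum 3.1": 15.3,
    "RefSeq v1.0": 14.5,
}


def fraction_of_genome(size_gb, genome_gb=GENOME_SIZE_GB):
    """Assembly length as a fraction of the estimated genome size."""
    return size_gb / genome_gb


# Print the assemblies from longest to shortest total length.
ranked = sorted(assembly_size_gb.items(), key=lambda item: item[1], reverse=True)
for name, size in ranked:
    print(f"{name:<14}{size:>6.1f} Gb {fraction_of_genome(size):>6.0%}")
```

Length alone does not decide which assembly to use; as the paragraph above notes, annotation and integration with other resources matter just as much.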
Table 1. Comparison between different genome assemblies.
Part 2: What is TGAC
Source: https://www.wheatgenome.org/News/Latest-news/New-wheat-genome-assembly-available-at-TGAC
The Genome Analysis Centre (TGAC) announced recently that it has made available a new assembly of the wheat genome. The IWGSC welcomes the production of additional resources ahead of the completion of the full reference sequence, anticipated for 2018. This latest assembly provides an incremental improvement to the genic information currently available to breeders and researchers.
Taking advantage of recent improvements in high throughput sequencing technologies, the new assembly from TGAC builds on the IWGSC chromosome survey sequence to assign improved gene sequences assembled from whole genome sequencing to individual wheat chromosomes.
With this new resource, it will be easier to define the structures of genes and the sequences that surround them, which often have a role in their regulation. The challenge remains, however, to order all of this information along each of the chromosomes to support the identification and isolation of genes and regulatory elements underlying agronomically important traits in bread wheat.
Towards this end, the IWGSC will continue to pursue the physical map-based sequencing of individual chromosomes, such as the recently completed 3B chromosome, to (1) provide an efficient link between genetic maps and the draft sequences; and (2) achieve a reference sequence comparable in quality to that of the rice genome gold standard. The TGAC gene assemblies and all other future efforts that deliver genomic resources for bread wheat will be integrated by the IWGSC into the final gold standard wheat genome sequence.