Our reference genome for European ash (Fraxinus excelsior) was published in Nature in 2016. This assembly can be downloaded here:
Gene annotation files for this assembly can be downloaded here:
The gene annotation of this assembly can be browsed on the Hardwood Genomics Project website and searched using their BLAST installation.
We are currently updating this genome using long reads and Hi-C libraries, and expect to release a chromosomal-level assembly in 2022.
Information about the UK reference genome assembly
The UK reference genome is for an individual tree derived from self-pollination of a tree growing in woodland in Oxfordshire. The controlled self-pollination of the parent tree was carried out by Dr David Boshier of Oxford University. The offspring from this self-pollination are growing at Paradise Wood in Oxfordshire, owned by the Earth Trust. Tissue was collected from one of these trees in January 2013 and DNA was extracted from it by Jasmin Zohren (funded by MSC ITN “Intercrossing”) at QMUL. Using flow cytometry, we estimated the 1C genome size of the tree to be 877 Mbp.
Raw DNA sequence data for the British ash genome were generated by Eurofins, and the data were assembled by Lizzy Sollars (funded by MSC ITN “Intercrossing”) and Richard Buggs at QMUL, in collaboration with CLCbio, using open access and proprietary software. Assembly and analysis of the genome are being carried out on the QMUL-High Performance Computing MidPlus cluster, and on servers at CLCbio.
From 2014 to 2016 we collaborated with The Genome Analysis Centre in Norwich to improve and annotate the genome assembly.
The initial sequencing was funded by an urgency grant awarded by the Natural Environment Research Council in 2013.
The reference assembly, assembled by Lizzy Sollars, was released on 29/10/2015 and published in Nature in January 2017. Paired reads with insert sizes of 200 bp, 300 bp, 500 bp and 5 kb, together with 454 reads, were used to build contigs in CLC Genomics Workbench. Scaffolding was performed using SSPACE with all paired reads (those mentioned above, plus Long Jumping Distance libraries of 3, 8, 20 and 40 kbp). Gaps in the scaffolds were closed using GapCloser, and further joining of scaffolds was done using PBJelly with the 454 reads. The chloroplast and mitochondrial genomes have been extracted from the nuclear genome; both are located at the end of the assembly file. The chloroplast genome is contained in one contig of 155,498 bp, named ‘Cp1’. The mitochondrial genome is present as a draft version in 26 contigs, named ‘Mt#’ (1-26). Stats for the mitochondrial genome are shown in the table below, along with full stats of the whole genome assembly.
Statistic | Contigs | Scaffolds | Mt Genome |
---|---|---|---|
Number | 118,959 | 89,514 | 26 |
Total size | 718.4 Mbp | 867.5 Mbp | 580,788 bp |
Longest | 209,591 bp | 884,900 bp | 184,534 bp |
Shortest | 326 bp | 326 bp | 326 bp |
Number > 1K nt | 68,628 | 40,777 | 25 |
Number > 10K nt | 18,860 | 10,151 | 11 |
Number > 100K nt | 210 | 2,522 | 1 |
Mean size | 6,039 bp | 9,691 bp | 22,338 bp |
Median size | 1,228 bp | 911 bp | 6,487 bp |
N50 length | 25,341 bp | 103,995 bp | 60,627 bp |
L50 count | 7,922 | 2,389 | 3 |
A | 32.87% | 27.22% | 27.49% |
C | 17.13% | 14.19% | 22.50% |
G | 17.14% | 14.19% | 22.29% |
T | 32.85% | 27.20% | 27.67% |
N | 0.00% | 17.19% | 0.04% |
CEGMA complete hits | 208 genes (84%) | | |
CEGMA partial hits | 238 genes (96%) | | |
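For readers who want to recompute figures like those in the table from the assembly FASTA, the following is a minimal sketch in Python. It is not the procedure used to produce the published numbers. It assumes that organelle records can be recognised by the ‘Cp’ and ‘Mt’ name prefixes described above and that no nuclear sequence name begins with either prefix; the function names are illustrative only.

```python
from statistics import mean, median


def read_fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # The record name is the first word after '>'.
            name, length = line[1:].split()[0], 0
        elif line:
            length += len(line)
    if name is not None:
        yield name, length


def n50_l50(lengths):
    """Return (N50, L50) for a list of sequence lengths.

    Sequences are sorted longest first; N50 is the length of the sequence
    at which the running total reaches half of the summed length, and L50
    is the number of sequences needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequences given")


def compartment(name):
    """Assumption: 'Cp...' is chloroplast, 'Mt...' is mitochondrial, all else nuclear."""
    if name.startswith("Cp"):
        return "chloroplast"
    if name.startswith("Mt"):
        return "mitochondrion"
    return "nuclear"


def summarise(lines):
    """Group records by compartment and compute basic length statistics."""
    groups = {}
    for name, length in read_fasta_lengths(lines):
        groups.setdefault(compartment(name), []).append(length)
    summary = {}
    for label, lengths in groups.items():
        n50, l50 = n50_l50(lengths)
        summary[label] = {
            "number": len(lengths),
            "total": sum(lengths),
            "longest": max(lengths),
            "shortest": min(lengths),
            "mean": mean(lengths),
            "median": median(lengths),
            "n50": n50,
            "l50": l50,
        }
    return summary


# Toy input standing in for the real assembly file (hypothetical names and sequences).
toy = [">scaffold1", "ACGTACGTAC", ">scaffold2", "ACGT", ">scaffold3", "ACGTAC",
       ">Cp1", "ACGTA", ">Mt1", "ACG", ">Mt2", "ACGTACG"]
for label, stats in summarise(toy).items():
    print(label, stats)
```

For the real assembly, an open file handle can be passed to `summarise` in place of the toy list. Lengths computed this way include any N characters, so scaffold totals count gap bases as well.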
Annotation
RNA was extracted by Jasmin Zohren from five samples: leaf, cambium, root and flower tissue of the ‘mother’ tree, and leaf tissue of the ‘selfed’ tree (the individual for which we have provided a reference genome sequence). These were sequenced using Illumina HiSeq paired-end technology. Adapter sequences were removed from the reads, which were then quality trimmed to a minimum Phred score of 20 and a minimum length of 50 bp. The transcriptome data were assembled by Lizzy Sollars using the CLC Transcript Discovery Plugin. RNA-seq reads were mapped to the BATG_0.4 reference genome using the Large Gap Read Mapper (which accounts for intron sequences in the reference), and the locations of genes and mRNA transcripts were predicted using the Transcript Discovery tool. Reads were then mapped back to the transcripts, and transcripts with an average coverage of less than 5 were filtered out.
A three-way pipeline was used to predict genes ab initio, consisting of (1) MAKER, (2) Augustus (without RNA-seq data) and (3) Augustus (with RNA-seq data). The following data were fed into the pipeline: RNA reads from five samples, the gene models produced by Lizzy Sollars, a repeat-masked genome produced by Laura Kelly at QMUL, and alignments of protein sequences from eight other plant species. The pipeline was run by Gemy Kaithokottil and David Swarbreck at TGAC. Because each of the three methods predicts slightly different gene structures, EVidenceModeler was then used to select the most accurate structure for each gene. Filtering was performed using PASA. The resulting genes were annotated using BLAST, GO terms and InterProScan.
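The two thresholds in this paragraph (Phred 20 with a 50 bp minimum length, and a minimum average coverage of 5) can be illustrated with a short Python sketch. This is not the trimming algorithm implemented in the CLC software; it is a simple end-trimming stand-in, it assumes Phred+33 quality encoding, and the function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, min_quality=20, min_length=50):
    """Trim low-quality bases from both ends of a read.

    Returns (sequence, quality) for the retained region, or None when the
    trimmed read is shorter than min_length. Illustrative only: internal
    low-quality bases are left untouched.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string)
    start, end = 0, len(scores)
    while start < end and scores[start] < min_quality:
        start += 1
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return sequence[start:end], quality_string[start:end]


def filter_transcripts(mean_coverage, min_coverage=5):
    """Keep transcripts whose average coverage is at least min_coverage."""
    return {name: cov for name, cov in mean_coverage.items() if cov >= min_coverage}


# Toy read: 5 poor bases, 52 good bases, 3 poor bases ('#' is Q2, 'I' is Q40).
read = "A" * 60
quals = "#" * 5 + "I" * 52 + "#" * 3
print(trim_read(read, quals))

# Toy coverage table with hypothetical transcript names.
print(filter_transcripts({"t1": 4.9, "t2": 5.0, "t3": 12.0}))
```

Dedicated trimming tools generally use windowed or cumulative-score criteria rather than this plain end-trimming, so read counts from such a sketch would not be expected to match those of the original analysis.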
Raw data
The raw reads for the assembly and annotation of our reference genome are available as follows:
Description (tissue type, accession) | Reads on ENA | Publication |
---|---|---|
Whole genomic DNA | Project: PRJEB4958 Sample: ERS370607 Experiments: ERX1470833-ERX1470834; ERX344708-ERX344729 Assembly: GCA_900149125.1 | Sollars et al (2017) Nature [doi.org/10.1038/nature20786] |
Leaf transcriptome of reference tree | Project: PRJEB4958 Sample: ERS370607 Experiments: ERX1470832 | Sollars et al (2017) Nature [doi.org/10.1038/nature20786] |
Root, leaf, cambium and flower transcriptomes of parent tree | Project: PRJEB4958 Sample: ERS1138331 Experiments: ERX147051-ERX147054 | Sollars et al (2017) Nature [doi.org/10.1038/nature20786] |
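The experiment accessions in the table are given as ranges. When preparing a download list it can help to expand each range into individual accessions; the following is a minimal Python sketch of that step. The helper name `expand_accessions` is hypothetical, and the sketch assumes every accession in a range shares the same prefix and number of digits. It does not check that each expanded accession exists in ENA.

```python
import re

# Matches a range such as 'ERX344708-ERX344729'.
_RANGE = re.compile(r"^([A-Z]+)(\d+)-([A-Z]+)(\d+)$")


def expand_accessions(spec):
    """Expand an accession range into a list of individual accessions.

    A single accession is returned unchanged as a one-element list.
    """
    spec = spec.replace(" ", "")
    match = _RANGE.match(spec)
    if not match:
        return [spec]
    prefix_a, first, prefix_b, last = match.groups()
    if prefix_a != prefix_b or len(first) != len(last) or int(first) > int(last):
        raise ValueError(f"not a well-formed accession range: {spec}")
    width = len(first)
    return [f"{prefix_a}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


# Ranges copied from the whole genomic DNA row of the table.
for spec in ["ERX1470833-ERX1470834", "ERX344708-ERX344729"]:
    accessions = expand_accessions(spec)
    print(spec, len(accessions), accessions[0], accessions[-1])
```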