UK Reference Genome

Our reference genome for European ash (Fraxinus excelsior) was published in Nature in 2016. This assembly can be downloaded here:


Gene annotation files for this assembly can be downloaded here:


The gene annotation of this assembly can be browsed on the Hardwood Genomics Project website and searched using their BLAST installation.

We are currently updating this genome using long reads and Hi-C libraries, and expect to release a chromosome-level assembly in 2022.

Information about the UK reference genome assembly

The UK reference genome is for an individual tree derived from self-pollination of a tree growing in woodland in Oxfordshire. The controlled self-pollination of the parent tree was carried out by Dr David Boshier of Oxford University. The offspring from this self-pollination are growing at Paradise Wood in Oxfordshire, owned by the Earth Trust. Tissue was collected from one of these trees in January 2013 and DNA was extracted from it by Jasmin Zohren (funded by MSC ITN “Intercrossing”) at QMUL. Using flow cytometry, we estimated the 1C genome size of the tree to be 877 Mbp.

Raw DNA sequence data for the British ash genome were generated by Eurofins and assembled by Lizzy Sollars (funded by MSC ITN “Intercrossing”) and Richard Buggs at QMUL, in collaboration with CLCbio, using open access and proprietary software. Assembly and analysis of the genome were carried out on the QMUL High Performance Computing MidPlus cluster and on servers at CLCbio.

From 2014 to 2016 we collaborated with The Genome Analysis Centre in Norwich to improve and annotate the genome assembly.

The initial sequencing was funded by an urgency grant awarded by the Natural Environment Research Council in 2013.

The reference assembly, produced by Lizzy Sollars, was released on 29/10/2015 and published in Nature in January 2017. Paired-end reads with insert sizes of 200 bp, 300 bp, 500 bp and 5 kb, together with 454 reads, were used to build contigs in CLC Genomics Workbench. Scaffolding was performed using SSPACE with all paired reads (those listed above plus Long Jumping Distance libraries of 3, 8, 20 and 40 kbp). Gaps in the scaffolds were closed using GapCloser, and scaffolds were joined further using PBJelly with the 454 reads. The chloroplast and mitochondrial genomes have been extracted from the nuclear genome; both are located at the end of the assembly file. The chloroplast genome is contained in one contig of 155,498 bp, named ‘Cp1’. The mitochondrial genome is present as a draft version in 26 contigs, named ‘Mt#’ (1-26). Statistics for the mitochondrial genome are shown in the table below, along with full statistics for the whole genome assembly.
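Because the organelle contigs sit at the end of the assembly file under predictable names, they can be separated from the nuclear sequences by name. The following is a minimal sketch, assuming the FASTA identifiers are exactly ‘Cp1’ and ‘Mt1’ to ‘Mt26’ (the real headers may carry a prefix or extra fields); the function names and the demo records are hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: organelle contigs are named exactly Cp1 or Mt1..Mt26.
ORGANELLE_NAME = re.compile(r"^(Cp1|Mt([1-9]|1[0-9]|2[0-6]))$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The identifier is the first whitespace-delimited token.
            name = line[1:].split()[0]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_organelles(lines: Iterable[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (nuclear, organelle) dictionaries keyed by contig name."""
    nuclear: Dict[str, str] = {}
    organelle: Dict[str, str] = {}
    for name, seq in read_fasta(lines):
        target = organelle if ORGANELLE_NAME.match(name) else nuclear
        target[name] = seq
    return nuclear, organelle


# Made-up demo records, not real ash sequence.
demo = [">Contig0", "ACGTACGT", "ACGT", ">Cp1", "GGGCCC",
        ">Mt1", "ATAT", ">Mt26", "TTAA", ">Mt27", "CCCC"]
nuclear, organelle = split_organelles(demo)
print(sorted(organelle))
```

For a real file, the same function can be given an open file handle in place of the demo list, since it only iterates over lines.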

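| | Contigs | Scaffolds | Mt Genome |
| --- | --- | --- | --- |
| Total size | 718.4 Mbp | 867.5 Mbp | 580,788 bp |
| Longest | 209,591 bp | 884,900 bp | 184,534 bp |
| Shortest | 326 bp | 326 bp | 326 bp |
| Number > 1K nt | 68,628 | 40,777 | 25 |
| Number > 10K nt | 18,860 | 10,151 | 11 |
| Number > 100K nt | 210 | 2,522 | 1 |
| Mean size | 6,039 bp | 9,691 bp | 22,338 bp |
| Median size | 1,228 bp | 911 bp | 6,487 bp |
| N50 length | 25,341 bp | 103,995 bp | 60,627 bp |
| L50 count | 7,922 | 2,389 | 3 |

CEGMA complete hits (whole genome assembly): 208 genes (84%)
CEGMA partial hits (whole genome assembly): 238 genes (96%)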
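For readers unfamiliar with the N50 and L50 rows in the table above, the sketch below shows the standard definitions: sort sequences from longest to shortest and accumulate lengths until half of the total is reached. N50 is the length of the sequence at which that happens, and L50 is the number of sequences needed. The toy lengths are invented for illustration. The CEGMA percentages are checked against the 248 core genes in the CEGMA set, a figure not stated on this page.

```python
from typing import List, Tuple


def n50_l50(lengths: List[int]) -> Tuple[int, int]:
    """Return (N50 length, L50 count) for a list of sequence lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: running total always reaches half")


# Toy lengths (invented): total is 300, so half is 150.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(toy_lengths)
print(n50, l50)  # 70 2

# CEGMA percentages, assuming the 248-gene core set.
CEGMA_CORE_GENES = 248
complete_pct = round(100 * 208 / CEGMA_CORE_GENES)
partial_pct = round(100 * 238 / CEGMA_CORE_GENES)
print(complete_pct, partial_pct)  # 84 96
```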


RNA was extracted by Jasmin Zohren from five samples: leaf, cambium, root and flower tissue of the ‘mother’ tree, and leaf tissue of the ‘selfed’ tree (the individual for which we have provided a reference genome sequence). These were sequenced using Illumina HiSeq paired-end technology. Adapter sequences were removed from the reads, which were then quality trimmed to a minimum Phred score of 20 and a minimum length of 50 bp. The transcriptome data were assembled by Lizzy Sollars using the CLC Transcript Discovery Plugin. RNA-seq reads were mapped to the BATG_0.4 reference genome using the Large Gap Read Mapper (which accounts for intron sequences in the reference), and the locations of genes and mRNA transcripts were predicted using the Transcript Discovery tool. Reads were then mapped back to the transcripts, and transcripts with an average coverage of less than 5 were filtered out.
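The thresholds in this paragraph can be illustrated with a small sketch. It is not the trimming procedure actually used (the page does not describe the tool or algorithm); it assumes Phred+33 quality encoding and a simple rule that removes low-quality bases from the 3' end only. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

MIN_PHRED = 20      # minimum Phred score kept at the 3' end
MIN_LENGTH = 50     # reads shorter than this after trimming are discarded
MIN_COVERAGE = 5    # transcripts below this average coverage are dropped


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string (assumption: Phred+33 encoding)."""
    return [ord(char) - offset for char in quality]


def trim_read(seq: str, quality: str) -> Optional[Tuple[str, str]]:
    """Trim trailing bases below MIN_PHRED; return None if too short."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < MIN_PHRED:
        end -= 1
    if end < MIN_LENGTH:
        return None
    return seq[:end], quality[:end]


def keep_transcripts(mean_coverage: Dict[str, float]) -> Dict[str, float]:
    """Keep transcripts whose average coverage is at least MIN_COVERAGE."""
    return {name: cov for name, cov in mean_coverage.items()
            if cov >= MIN_COVERAGE}


# Invented reads: 'I' decodes to Phred 40 and '#' to Phred 2.
good = trim_read("A" * 60, "I" * 55 + "#" * 5)
short = trim_read("A" * 60, "I" * 40 + "#" * 20)
kept = keep_transcripts({"t1": 12.0, "t2": 4.9, "t3": 5.0})
print(len(good[0]), short, sorted(kept))
```

Production trimmers usually use sliding windows or a running-sum algorithm, not a plain end trim, so results on real data would differ from this sketch.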

A three-way pipeline was used to predict genes ab initio, consisting of (1) MAKER, (2) Augustus without RNA-seq data and (3) Augustus with RNA-seq data. The following data were fed into the pipeline: RNA reads from five samples, the gene models produced by Lizzy Sollars, a repeat-masked genome produced by Laura Kelly at QMUL, and alignments of protein sequences from eight other plant species. The pipeline was run by Gemy Kaithokottil and David Swarbreck at TGAC. Because each of the three methods predicts slightly different gene structures, EVidenceModeler was then used to select the most accurate structure for each gene. Filtering was performed using PASA. The resulting genes were annotated using BLAST, GO terms and InterProScan.

Raw data

The raw reads for the assembly and annotation of our reference genome are available as follows:

| Description (tissue type, accession) | Reads on ENA | Publication |
| --- | --- | --- |
| Whole genomic DNA | Project: PRJEB4958; Sample: ERS370607; Experiments: ERX1470833-ERX1470834, ERX344708-ERX344729; Assembly: GCA_900149125.1 | Sollars et al. (2017) Nature |
| Leaf transcriptome of reference tree | Project: PRJEB4958; Sample: ERS370607; Experiments: ERX1470832 | Sollars et al. (2017) Nature |
| Root, leaf, cambium and flower transcriptomes of parent tree | Project: PRJEB4958; Sample: ERS1138331; Experiments: ERX147051-ERX147054 | Sollars et al. (2017) Nature |
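The experiment accessions above are given as ranges, which most download tools will not accept directly. The sketch below expands a range such as ERX344708-ERX344729 into individual accessions. It also builds a query URL for the ENA Portal API file report; that endpoint and its parameters are not taken from this page and should be checked against the current ENA documentation before use. The function names are hypothetical.

```python
import re
from typing import List
from urllib.parse import urlencode

ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_range(spec: str) -> List[str]:
    """Expand 'ERX344708-ERX344729' into a list of single accessions."""
    start, _, end = spec.partition("-")
    start, end = start.strip(), end.strip()
    if not end:
        return [start]
    first = ACCESSION.fullmatch(start)
    last = ACCESSION.fullmatch(end)
    if not first or not last or first.group(1) != last.group(1):
        raise ValueError(f"not a valid accession range: {spec!r}")
    prefix = first.group(1)
    width = len(first.group(2))  # preserve any zero padding
    low, high = int(first.group(2)), int(last.group(2))
    if high < low:
        raise ValueError(f"range end precedes start: {spec!r}")
    return [f"{prefix}{number:0{width}d}" for number in range(low, high + 1)]


def filereport_url(project: str) -> str:
    """Build an ENA Portal API file-report URL (assumed endpoint)."""
    query = urlencode({
        "accession": project,
        "result": "read_run",
        "fields": "run_accession,experiment_accession,fastq_ftp",
        "format": "tsv",
    })
    return "https://www.ebi.ac.uk/ena/portal/api/filereport?" + query


genomic = expand_range("ERX1470833-ERX1470834") + expand_range("ERX344708-ERX344729")
url = filereport_url("PRJEB4958")
print(len(genomic), genomic[0], genomic[-1])
print(url)
```

No network request is made here; the URL is only constructed, so it can be inspected or passed to a download tool of choice.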