UK Reference Genome – Ash tree genomes

BATG-1.0

In 2023 we produced a chromosome-level assembly for Fraxinus excelsior using long-read sequencing of the same tree as the reference genome published in Nature in 2016. The first version of this new assembly is available below.

The methodology we used was as follows. A scion from the individual used to generate the original reference genome (BATG-0.5 – see below) was grafted onto new rootstock by Forest Research in January 2017, before being grown at Kew. Leaf tissue from this individual (i.e. the same genotype as used for the BATG-0.5 assembly) was collected in May 2022, flash frozen, and sent to Cantata Bio (Scotts Valley, California) for genome sequencing and assembly.

Genomic DNA was extracted and whole-genome sequencing was performed on a PacBio Sequel. Omni-C libraries were also produced from this tissue for scaffolding the assembly. These data were used to produce an initial assembly with hifiasm v0.15.4-r347, which was further scaffolded with the proprietary software HiRise.

This sequencing and ongoing analysis of the genome is a ‘Centre for Forest Protection’ project (https://www.forestprotection.uk/project/2208_ash_pangenome/), funded by Defra. The Centre for Forest Protection is a unique collaboration led by Forest Research and RBG Kew, focused on the future of forest and tree health. Analysis is being performed using Queen Mary University of London’s Apocrita HPC facility, supported by QMUL Research-IT (http://doi.org/10.5281/zenodo.438045).

The assembly for one haplotype of this genome is available to download via the link below. Users are free to publish papers dealing with specific genes or small sets of genes (ten or fewer) using the sequence data. If these data are used for publications, please cite this webpage and the 2016 Nature paper. For complete (whole-genome) analyses of features such as genes/gene families, regulatory elements, repeats or other features, or whole-genome comparisons to other F. excelsior individuals or other species, please contact us (l.kelly@kew.org) prior to downloading the assembly to discuss your plans.

BATG-1.0.fasta.gz Download
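Once downloaded, the assembly can be inspected without decompressing it to disk. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the file is a standard gzip-compressed FASTA in which the scaffold name is the first whitespace-delimited token of each header line.

```python
import gzip
from typing import Dict, Iterator, TextIO, Tuple


def fasta_lengths(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (record name, sequence length) for each FASTA record."""
    name = None
    length = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            length = 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def scaffold_lengths(path: str) -> Dict[str, int]:
    """Read a gzip-compressed FASTA file and return {name: length}."""
    with gzip.open(path, "rt") as handle:
        return dict(fasta_lengths(handle))
```

Calling scaffold_lengths("BATG-1.0.fasta.gz") on the downloaded file would return one entry per scaffold, which is enough to recompute size-based summary statistics.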

Assembly statistics – BATG-1.0

Total size (bp)	793,579,810
Number of scaffolds	2,024
N50 (bp)	32,931,246
N90 (bp)	24,538,431
L50 count	11
L90 count	22
Number of gaps	25
GC Content (%)	35.3
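For readers unfamiliar with the N50/L50 and N90/L90 figures in the table, the sketch below shows how such values are conventionally derived from a list of scaffold lengths: sort the lengths from longest to shortest and walk down the list until the running total reaches the chosen percentage of the assembly size. The function name is illustrative, and this is the standard definition rather than necessarily the exact tool used to produce the table.

```python
from typing import Iterable, Tuple


def nx_lx(lengths: Iterable[int], percent: int) -> Tuple[int, int]:
    """Return (Nx, Lx) for the given percentage, e.g. percent=50 for N50/L50.

    Nx is the length of the shortest sequence in the smallest set of longest
    sequences that together cover at least `percent` of the total size.
    Lx is the number of sequences in that set.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1-100")
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths supplied")
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    raise AssertionError("unreachable for percent <= 100")


# Toy example with five sequences totalling 30 bp.
toy = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy, 50)  # (8, 2): 10 + 8 = 18 covers at least half of 30
n90, l90 = nx_lx(toy, 90)  # (4, 4): 10 + 8 + 6 + 4 = 28 covers at least 27
```

Read this way, the table says that the 11 longest scaffolds account for at least half of the assembly and the 22 longest for at least 90% of it.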

BATG-0.5

The initial reference genome for European ash (Fraxinus excelsior) was published in Nature in 2016. This assembly can be downloaded here:

Gene annotation files for this assembly can be downloaded here:

The gene annotation of this assembly can be browsed on the Hardwood Genomics Project website and searched using their BLAST installation.

The BATG-0.5 reference genome is for an individual tree derived from self-pollination of a tree growing in woodland in Oxfordshire. The controlled self-pollination of the parent tree was carried out by Dr David Boshier of Oxford University. The offspring from this self-pollination are growing at Paradise Wood in Oxfordshire, owned by the Earth Trust. Tissue was collected from one of these trees in January 2013 and DNA was extracted from it by Jasmin Zohren (funded by MSC ITN “Intercrossing”) at QMUL. Using flow cytometry, we estimated the 1C genome size of the tree to be 877 Mbp.

Raw DNA sequence data for the British ash genome were generated by Eurofins, and the data were assembled by Lizzy Sollars (funded by MSC ITN “Intercrossing”) and Richard Buggs at QMUL, in collaboration with CLCbio, using open access and proprietary software. Assembly and analysis of the genome is being carried out on the QMUL-High Performance Computing MidPlus cluster, and servers at CLCbio.

From 2014 to 2016 we collaborated with The Genome Analysis Centre in Norwich to improve and annotate the genome assembly.

The initial sequencing was funded by an urgency grant awarded by the Natural Environment Research Council in 2013.

The reference assembly, built by Lizzy Sollars, was released on 29/10/2015 and published in Nature in January 2017. Paired reads with insert sizes of 200 bp, 300 bp, 500 bp and 5 kb, together with 454 reads, were used to build contigs in CLC Genomics Workbench. Scaffolding was performed using SSPACE with all paired reads (those mentioned above plus Long Jumping Distance libraries of 3, 8, 20 and 40 kbp). Gaps in the scaffolds were closed using GapCloser, and further joining of scaffolds was done using PBJelly with the 454 reads. The chloroplast and mitochondrial genomes have been separated from the nuclear genome; both are located at the end of the assembly file. The chloroplast genome is contained in one contig of 155,498 bp, named ‘Cp1’. The mitochondrial genome is present as a draft version in 26 contigs, named ‘Mt#’ (1-26). Statistics for the mitochondrial genome are shown in the table below, along with full statistics for the whole genome assembly.
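Because the organelle contigs follow the naming scheme described above, they can be separated from the nuclear sequences by record name alone. The sketch below is one way to do this; it assumes the FASTA record names are exactly ‘Cp1’ and ‘Mt’ followed by a number, and the nuclear name used in the example ("scaffold_1") is purely hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Naming scheme taken from the text: one chloroplast contig 'Cp1' and
# mitochondrial contigs 'Mt1' ... 'Mt26'. Exact header spelling is assumed.
CHLOROPLAST = re.compile(r"Cp1")
MITOCHONDRION = re.compile(r"Mt\d+")


def classify(name: str) -> str:
    """Assign a record name to 'chloroplast', 'mitochondrion' or 'nuclear'."""
    if CHLOROPLAST.fullmatch(name):
        return "chloroplast"
    if MITOCHONDRION.fullmatch(name):
        return "mitochondrion"
    return "nuclear"


def partition(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group record names by genomic compartment, preserving input order."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[classify(name)].append(name)
    return dict(groups)


# "scaffold_1" is a hypothetical nuclear record name used only for illustration.
example = partition(["scaffold_1", "Cp1", "Mt1", "Mt26"])
```

Using fullmatch rather than a prefix test avoids misclassifying any nuclear record whose name merely begins with "Mt" or "Cp".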

Assembly statistics – BATG-0.5

	Contigs	Scaffolds	Mt Genome
Number	118,959	89,514	26
Total size	718.4 Mbp	867.5 Mbp	580,788 bp
Longest	209,591 bp	884,900 bp	184,534 bp
Shortest	326 bp	326 bp	326 bp
Number > 1K nt	68,628	40,777	25
Number > 10K nt	18,860	10,151	11
Number > 100K nt	210	2,522	1
Mean size	6,039 bp	9,691 bp	22,338 bp
Median size	1,228 bp	911 bp	6,487 bp
N50 length	25,341 bp	103,995 bp	60,627 bp
L50 count	7,922	2,389	3
A	32.87%	27.22%	27.49%
C	17.13%	14.19%	22.50%
G	17.14%	14.19%	22.29%
T	32.85%	27.2%	27.67%
N	0.00%	17.19%	0.04%
CEGMA complete hits		208 genes (84%)
CEGMA partial hits		238 genes (96%)
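The base composition rows are percentages of all positions, including gap characters (N), which is why the A, C, G and T values for the scaffolds sum to well under 100%. To compare GC content between the columns, it helps to express C+G as a share of called bases only. The following is a small worked calculation on the figures from the table; the function name is illustrative.

```python
def gc_percent(a: float, c: float, g: float, t: float) -> float:
    """GC content as a percentage of called bases (A, C, G, T), ignoring N."""
    called = a + c + g + t
    if called <= 0:
        raise ValueError("no called bases")
    return 100.0 * (c + g) / called


# Percentages copied from the BATG-0.5 table (A, C, G, T per column).
contigs_gc = gc_percent(32.87, 17.13, 17.14, 32.85)
scaffolds_gc = gc_percent(27.22, 14.19, 14.19, 27.2)
mt_gc = gc_percent(27.49, 22.50, 22.29, 27.67)

print(f"contigs {contigs_gc:.1f}%, scaffolds {scaffolds_gc:.1f}%, Mt {mt_gc:.1f}%")
```

After this normalisation the contig and scaffold columns give nearly the same GC content, as expected since scaffolding adds gaps rather than new sequence, while the mitochondrial contigs are noticeably more GC-rich.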

BATG-0.5 Annotation

RNA was extracted by Jasmin Zohren from five samples: leaf, cambium, root and flower tissue of the ‘mother’ tree, and leaf tissue of the ‘selfed’ tree (the individual for which we have provided a reference genome sequence). These were sequenced using Illumina HiSeq paired-end technology. Adapter sequences were removed from the reads, which were then quality trimmed to a minimum Phred score of 20 and a minimum length of 50 bp. The transcriptome data were assembled by Lizzy Sollars using the CLC Transcript Discovery Plugin. RNA-seq reads were mapped to the BATG_0.4 reference genome using the Large Gap Read Mapper (which accounts for intron sequences in the reference), and the locations of genes and mRNA transcripts were predicted using the Transcript Discovery tool. Reads were then mapped back to the transcripts, and transcripts with an average coverage of less than 5 were filtered out.
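The two numeric filters described above (Phred 20 with a 50 bp minimum length for reads, and an average coverage of at least 5 for transcripts) can be illustrated with a short sketch. This is not the CLC implementation, whose trimming algorithm is not described here; it assumes simple end-trimming of low-quality bases and defines average coverage as mapped bases divided by transcript length. All function names are hypothetical.

```python
from typing import List, Optional, Tuple


def trim_read(
    seq: str, quals: List[int], min_quality: int = 20, min_length: int = 50
) -> Optional[Tuple[str, List[int]]]:
    """Trim bases below min_quality from both ends; drop reads that end up too short.

    Assumption: plain end-trimming, used here only to illustrate the thresholds.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_quality:
        start += 1
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None  # read discarded
    return seq[start:end], quals[start:end]


def keep_transcript(
    mapped_bases: int, transcript_length: int, min_coverage: float = 5.0
) -> bool:
    """Keep a transcript whose average coverage is at least min_coverage."""
    if transcript_length <= 0:
        raise ValueError("transcript length must be positive")
    return mapped_bases / transcript_length >= min_coverage


# A 60 bp read with 5 poor bases at the start and 3 at the end keeps 52 bp.
kept = trim_read("A" * 60, [10] * 5 + [30] * 52 + [5] * 3)
```

A read that retains fewer than 50 bp after trimming returns None, and a 1,000 bp transcript needs at least 5,000 mapped bases to pass the coverage filter.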

A three-way pipeline was used to predict genes ab initio, consisting of: (1) MAKER; (2) Augustus (without RNA-seq data); and (3) Augustus (with RNA-seq data). The following data were fed into the pipeline: RNA reads from five samples, the gene models produced by Lizzy Sollars, a repeat-masked genome produced by Laura Kelly at QMUL, and alignments of protein sequences from eight other plant species. The pipeline was run by Gemy Kaithokottil and David Swarbreck at TGAC. EVidenceModeler was then used to select the most accurate structure for each gene, because the three methods predict slightly different gene structures. Filtering was performed using PASA. The resulting genes were annotated using BLAST, GO terms and InterProScan.

BATG-0.5 Raw data

The raw reads for the assembly and annotation of BATG-0.5 are available as follows:

Description (tissue type, accession)	Reads on ENA	Publication
Whole genomic DNA	Project: PRJEB4958 Sample: ERS370607 Experiments: ERX1470833-ERX1470834; ERX344708-ERX344729 Assembly: GCA_900149125.1	Sollars et al (2017) Nature [doi.org/10.1038/nature20786]
Leaf transcriptome of reference tree	Project: PRJEB4958 Sample: ERS370607 Experiments: ERX1470832	Sollars et al (2017) Nature [doi.org/10.1038/nature20786]
Root, leaf, cambium and flower transcriptomes of parent tree	Project: PRJEB4958 Sample: ERS1138331 Experiments: ERX147051-ERX147054	Sollars et al (2017) Nature [doi.org/10.1038/nature20786]
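The experiment accessions in the table are given as ranges. When scripting downloads it can be convenient to expand a range into individual accessions. The sketch below assumes that a range such as ERX344708-ERX344729 denotes every consecutive accession between the two endpoints, inclusive; the function name is illustrative.

```python
import re
from typing import List

# Accession = uppercase letter prefix followed by digits, e.g. ERX344708.
_ACCESSION = re.compile(r"([A-Z]+)(\d+)")


def expand_accession_range(text: str) -> List[str]:
    """Expand 'ERX344708-ERX344729' into a list of individual accessions.

    A single accession is returned as a one-element list.
    """
    parts = [part.strip() for part in text.split("-")]
    if len(parts) == 1:
        parts = parts * 2
    if len(parts) != 2:
        raise ValueError(f"not an accession or accession range: {text!r}")
    first = _ACCESSION.fullmatch(parts[0])
    last = _ACCESSION.fullmatch(parts[1])
    if first is None or last is None:
        raise ValueError(f"malformed accession in: {text!r}")
    prefix, width = first.group(1), len(first.group(2))
    if last.group(1) != prefix or len(last.group(2)) != width:
        raise ValueError(f"endpoints do not share a format: {text!r}")
    start, stop = int(first.group(2)), int(last.group(2))
    if stop < start:
        raise ValueError(f"range runs backwards: {text!r}")
    # Re-pad to the original digit width so leading zeros are preserved.
    return [f"{prefix}{number:0{width}d}" for number in range(start, stop + 1)]


genomic = expand_accession_range("ERX344708-ERX344729")
```

The function rejects ranges whose endpoints differ in prefix or digit count instead of guessing at the intended accessions.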