Transcriptomes & Proteome of Fraxinus excelsior

Annotation Version 4 - BATG-0.5

Released 08/02/16 in collaboration with The Genome Analysis Centre

| Annotation | File | Notes |
| --- | --- | --- |
| Full Annotation | GFF | GFF file of all gene, mRNA, CDS, 3'UTR and 5'UTR annotations (all isoforms), excluding transposable elements. |
| Transposable Elements | GFF | GFF file of transposable elements. |
| cDNA Fasta | FASTA | FASTA file of all cDNA (transcript) sequences (all isoforms). |
| CDS Fasta | FASTA | FASTA file of all CDS DNA sequences (all isoforms). |
| Proteome | FASTA | FASTA file of all CDS peptide sequences (all isoforms). |
| Longest transcript: Annotation | GFF | GFF file of all genes, and only mRNA, CDS, 3'UTR and 5'UTR annotations for the longest isoform. |
| Longest transcript: cDNA | FASTA | FASTA file of cDNA (transcript) sequences (longest isoform per gene only). |
| Longest transcript: CDS | FASTA | FASTA file of CDS DNA sequences (longest isoform per gene only). |
| Longest transcript: Proteome | FASTA | FASTA file of CDS peptide sequences (longest isoform per gene only). |
| Functional annotation | TSV | GO and InterProScan terms associated with each gene. |

Annotation Version 3 - BATG-0.4

Released 26/02/15 in collaboration with The Genome Analysis Centre

| Annotation | File | Notes |
| --- | --- | --- |
| Full Annotation | GFF | GFF file of all gene, mRNA, CDS, 3'UTR and 5'UTR annotations (all isoforms). |
| cDNA Fasta | FASTA | FASTA file of all cDNA (transcript) sequences (all isoforms). |
| CDS Fasta | FASTA | FASTA file of all CDS DNA sequences (all isoforms). |
| Proteome | FASTA | FASTA file of all CDS peptide sequences (all isoforms). |
| Functional annotation | TSV | GO and InterProScan terms associated with all genes (all isoforms). |
| Longest transcript: Annotation | GFF | GFF file of all genes, and only mRNA, CDS, 3'UTR and 5'UTR annotations for the longest isoform. |
| Longest transcript: cDNA | FASTA | FASTA file of cDNA (transcript) sequences (longest isoform per gene only). |
| Longest transcript: CDS | FASTA | FASTA file of CDS DNA sequences (longest isoform per gene only). |
| Longest transcript: Proteome | FASTA | FASTA file of CDS peptide sequences (longest isoform per gene only). |
| Longest transcript: Functional annotation | TSV | GO and InterProScan terms associated with the longest isoform of each gene. |

Methods V3 (BATG-0.4)

Genes were predicted ab initio using a three-way pipeline consisting of (1) MAKER, (2) Augustus without RNA-seq data, and (3) Augustus with RNA-seq data. The following data were fed into the pipeline: RNA-seq reads from the five samples listed below under Annotation Version 2, the gene models produced in Version 2 by Lizzy Sollars, a repeat-masked genome produced by Laura Kelly at QMUL, and alignments of protein sequences from eight other plant species. The pipeline was run by Gemy Kaithokottil and David Swarbreck at TGAC. Because each of the three methods predicts slightly different gene structures, EVidenceModeler was then used to select the most accurate structure for each gene. Filtering was performed using PASA. The resulting genes were annotated using BLAST, GO terms and InterProScan. This version predicts more genes than the previous version, because the previous version relied solely on RNA data and therefore captured only genes that were expressed at the time of RNA extraction.

This annotation can be visualised in the JBrowse tool on this website, and also in the GBrowse instance hosted at TGAC.

Annotation Version 2 - BATG-0.4

| Sample | GFF3 | FASTA | No. of genes/proteins | Notes |
| --- | --- | --- | --- | --- |
| Selfed tree leaf | S_L1 | S_L1 | 27,360 | Assembled transcriptome of leaf tissue from the selfed tree (gz compressed). |
| Mother tree leaf | M_L1 | M_L1 | 24,473 | Assembled transcriptome of leaf tissue from the mother tree (gz compressed). |
| Mother tree cambium | M_C1 | M_C1 | 27,368 | Assembled transcriptome of cambium tissue from the mother tree (gz compressed). |
| Mother tree root | M_R2 | M_R2 | 28,275 | Assembled transcriptome of root tissue from the mother tree (gz compressed). |
| Mother tree flower | M_F1 | M_F1 | 29,562 | Assembled transcriptome of flower tissue from the mother tree (gz compressed). |
| All samples combined | Bulk | Bulk | 36,944 | Tar archive of the five files above. |
| All samples | | Unigenes | 36,944 | The longest transcript per gene (gz compressed). Filtered on coverage. |
| All samples | | Proteome | 36,893 | Predicted protein coding sequence for each unigene (gz compressed). |
| All samples unfiltered | | Unigenes | 41,521 | Longest transcript per gene, before filtering. |
| All samples unfiltered | | mRNA | 72,139 | All mRNA transcripts, before filtering. |

Methods V2 (BATG-0.4)

RNA was extracted by Jasmin Zohren from five samples: leaf, cambium, root, and flower tissue of the 'mother' tree, and leaf tissue of the 'selfed' tree (the individual for which we have provided a reference genome sequence). These were sequenced using Illumina HiSeq paired-end technology. Adapter sequences were removed from the reads, which were then quality trimmed to a minimum Phred score of 20 and a minimum length of 50 bp. The transcriptome data were assembled by Lizzy Sollars using the CLC Transcript Discovery Plugin. RNA-seq reads were mapped to the BATG-0.4 reference genome using the Large Gap Read Mapper (which accounts for intron sequences in the reference), and the locations of genes and mRNA transcripts were predicted using the Transcript Discovery tool. Reads were then mapped back to the transcripts, and transcripts with an average coverage of less than 5 were filtered out.
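The two thresholds described above (Phred 20 with a 50 bp minimum length, and an average coverage of at least 5) can be illustrated with a minimal Python sketch. This is not the CLC implementation, whose trimming algorithm is not described here. The sketch assumes Phred+33 quality encoding and simple trimming from the 3' end, and the function names are hypothetical.

```python
def trim_read(seq, qual, min_phred=20, min_len=50, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    Assumes Phred+33 encoded quality strings (an assumption, not stated
    in the methods). Returns (seq, qual) after trimming, or None if the
    trimmed read is shorter than min_len.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_phred:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


def filter_by_coverage(coverage_by_transcript, min_cov=5.0):
    """Keep transcripts whose average coverage is at least min_cov."""
    return {
        transcript: cov
        for transcript, cov in coverage_by_transcript.items()
        if cov >= min_cov
    }
```

For example, `trim_read("A" * 60, "I" * 55 + "#" * 5)` keeps the first 55 bases, while a read left with fewer than 50 bases after trimming is discarded.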

The GFF3 files above contain the locations of genes, transcripts, exons, and introns for each tissue, and the FASTA files contain the sequences of each mRNA transcript. The 'bulk' files are tar archives of the five separate sample files. The unigenes file comprises one transcript per gene, with all samples combined, i.e. a 'complete' reference ash transcriptome. The longest transcript at each gene location was selected for this file, regardless of which sample it originated from. The proteome file contains one protein sequence for each gene, predicted by running a command-line version of OrfPredictor. The inputs were the unigenes file and the output of a BLASTx search of those unigenes against all plant proteins in the RefSeq database.
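Selecting one longest transcript per gene can be sketched in Python as follows. This is an illustration only, not the procedure used to build the unigenes file. The FASTA reader is deliberately minimal, and the transcript-to-gene mapping `gene_of` is a hypothetical function supplied by the caller, since the ID scheme of the released files is not described here.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Use the first whitespace-delimited token as the record ID.
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records, gene_of):
    """Return {gene_id: (transcript_id, sequence)} for the longest transcript.

    gene_of maps a transcript ID to its gene ID (hypothetical; adapt to
    the real ID scheme). Ties keep the first transcript encountered.
    """
    best = {}
    for transcript_id, seq in records:
        gene = gene_of(transcript_id)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (transcript_id, seq)
    return best
```

If transcript IDs happened to take the form `GENE.N`, the mapping could be `lambda tid: tid.rsplit(".", 1)[0]`, and an open FASTA file handle can be passed directly as `lines`.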