FB2014_06, released November 12th, 2014
 
 

A Database of Drosophila Genes & Genomes

Reference Manual G - FlyBase Documentation

Last Updated: 20 March 2012

G.1. Nontraditional alleles

In addition to 'alleles' in the traditional sense, FlyBase now names and curates further classes of allele so that phenotypic or expression pattern data can be captured for in vitro construct alleles and alleles of reporter (e.g., Ecol\lacZ), effector (e.g., Scer\FLP) or toxin (e.g., Rcom\DT-A) genes. Since these alleles have not historically been named by researchers, and have been named by FlyBase, their presentation in FlyBase requires some explanation:

G.1.1. Alleles of reporter genes

Alleles of reporter genes currently fall into two main classes, those resulting from enhancer trap experiments, and those resulting from promoter (or other regulatory region) analysis, where a fragment is used to drive the expression of a reporter gene. Ecol\lacZ will be used for illustration.

Enhancer trap results:

  • The enhancer trap construct causes an allele of a gene and is expressed in a pattern consistent with insertion in that gene. The resulting insertion will be described with the format P{A92}hL43a, and the Ecol\lacZ allele symbol is of the format Ecol\lacZh-L43a.
  • The reporter gene reflects the expression of a gene without causing a mutant allele of that gene. The resulting insertion will be described with the format P{PZ}P2023-44, where P2023-44 reflects the insertion identifier, and the Ecol\lacZ allele symbol is of the format Ecol\lacZhh-P2023-44.
  • The reporter gene reflects the expression of an undescribed gene/enhancer. The resulting insertion will be described with the format P{lacW}1.28, and the Ecol\lacZ allele symbol is of the format Ecol\lacZ1.28.

Promoter analysis results:

  • Generally some fragment of a gene promoter/intron/3'-region is fused to the reporter gene. In this case the allele symbol is of the form 'gene symbol.fragment descriptor' e.g., Ecol\lacZeve.prox54. The fragment descriptor reflects that used in the publication, even though this may be long and cumbersome (this may not be strictly true for such alleles curated early in the FlyBase project).
  • Where a reporter gene is simply described in a publication as being driven by, e.g., an arm promoter, the symbol of the Ecol\lacZ allele is 'arm.PI', where I is the first letter of the surname of the first author of the paper, e.g., Ecol\lacZarm.PV for 'Ecol\lacZ arm promoter construct of Vincent'.
  • For logistical reasons some promoter fusions involving reporter genes such as Ecol\lacZ, though technically protein fusions, are simply treated as alleles of the reporter gene. The symbol for the additional gene(s) contributing to the fusion is indicated as part of a superscript, e.g., Ecol\lacZP\T.A92. In these special cases there is no distinction made between promoter fusions and protein fusions in the gene name.
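
The 'gene.PI' convention described above can be illustrated with a short Python sketch. This is not FlyBase code: the function name promoter_designator is hypothetical, and the sketch assumes the first author's surname is supplied as plain text.

```python
def promoter_designator(driver_gene: str, first_author_surname: str) -> str:
    """Build a 'gene.PI' designator: gene symbol, '.P', then the surname initial."""
    gene = driver_gene.strip()
    surname = first_author_surname.strip()
    if not gene or not surname:
        raise ValueError("both a gene symbol and a surname are required")
    return f"{gene}.P{surname[0].upper()}"


# The 'arm promoter construct of Vincent' example from the text.
print(promoter_designator("arm", "Vincent"))  # arm.PV
```

The designator returned here corresponds to the part of the symbol that follows the reporter gene symbol (Ecol\lacZ in the example).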

G.1.2. Alleles of ectopically expressed Drosophila gene products

Products of genes may be ectopically expressed due either to juxtaposition with different regulatory sequences in the genome (as a result of being inserted into different-than-wild-type locations by chromosome rearrangement or P element transposition) or due to in vitro construction creating a different constellation of regulatory sequences than in wild type.

By analogy with alleles of Ecol\lacZ for enhancer traps, P-element-borne insertions of genes e.g., w or ve that have a qualitatively distinct _position-dependent_ mutant phenotype will be curated as new alleles of e.g., w or ve, e.g., veStg caused by a particular insertion of P{HS-rho}, P{HS-rho}Stg.

The 'in vitro construct' ectopic expression alleles currently fall into two main classes, one component or two component systems:

One component systems:
Gene A is expressed from a promoter of gene B. The allele is typically generated by in vitro construction. In such cases the allele symbol is of the format 'gene-Agene-B.PI', e.g., phylsev.PC or 'gene-Agene-B.fragment descriptor' where the author includes a promoter fragment descriptor, e.g., phylninaE.GMR.

An occasional exception is made for promoter fusions that are widely used to provide essentially wild-type gene function; these alleles have the mini-gene '+m construct' designation (see below) prepended to an, e.g., heat shock designation, e.g., w+mW.hs.

It is common that authors report a construct where e.g., ftz is expressed under a 'heat shock' or Hsp70 promoter, while providing no further details about the nature of the promoter. For these cases the allele symbol hs.PI is employed, e.g., Antphs.PZ for 'Antp heat shock construct of Zeng'. An 'hs' designation should be reserved for when the heat inducible, not just the minimal, promoter fragment is used.

Where the allele is both altered in its coding region and being expressed from an ectopic promoter the sequence 'alteration.promoter' is used in the allele designation, e.g., tor13D.hs.sev to denote the coding sequence of tor13D expressed from a heat shock (undefined) promoter with a sev enhancer. An exception to this rule is made for Tags, which appear as the last component of the allele symbol (see below).

Two component systems:

  • GAL4-UAS The allele symbol for the gene whose expression is dependent upon Scer\GAL4 shall include 'Scer\UAS' and an identifier. The identifier should reflect the construct as named by author e.g., l(1)scDeltaB.Scer\UAS. In the absence of any other identifier '.cIa' is used, where 'c' stands for construct, I for the first author's last name initial and 'a' for the first in the series (subsequent ones will be b, c, etc). e.g., aseScer\UAS.cBa for 'Scer\UAS construct a of Brand'.
  • FLP-FRT Alleles of Scer\FLP are named as outlined above for reporter genes, and allele symbols of genes whose expression is dependent upon that of Scer\FLP include 'Scer\FRT'.
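
As a rough illustration of the fallback '.cIa' identifier used for GAL4-UAS constructs, the following Python sketch builds the identifier from the first author's surname and a position in the series. The function name construct_identifier and the zero-based series_index parameter are assumptions made for this example only.

```python
import string


def construct_identifier(first_author_surname: str, series_index: int = 0) -> str:
    """Return a 'cIa'-style identifier: 'c', the surname initial, then a series letter."""
    surname = first_author_surname.strip()
    if not surname:
        raise ValueError("a surname is required")
    if not 0 <= series_index < len(string.ascii_lowercase):
        raise ValueError("this sketch only covers series letters a to z")
    return f"c{surname[0].upper()}{string.ascii_lowercase[series_index]}"


print(construct_identifier("Brand"))     # cBa
print(construct_identifier("Brand", 1))  # cBb
```

How a series longer than 26 constructs would be lettered is not stated in the text, so the sketch simply rejects it.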

G.1.3. Alleles of ectopically expressed non-Drosophila effector products

A note on ribozymes: FlyBase has a foreign ribozyme gene, symbol LTSV\RBZ. Alleles of LTSV\RBZ capture the different variants, e.g., for a heat inducible ftz-targeted ribozyme: LTSV\RBZhs.ftz (syntax 'promoter.target gene') will be named.

'+m' minigenes

The minigene allele designation is used in its narrow sense, i.e., where the only difference between the allele and the wild type is the removal of more or less non-essential sequences. Thus the minigene allele symbol designation is reserved for those cases where the gene's own promoter is driving its expression.

The minigene allele symbols begin with 'm', for minigene, and are followed by the construct symbol used in the publication. If no construct symbol has been used, the string 'mIa' where 'm' stands for minigene, 'I' for the first author's last name initial and 'a' for the first in the series is used. If the function of the minigene is stated to be indistinguishable from that of the wild type allele, the 'm' is preceded by a '+'.

Tags

Genes can be modified by the addition of a tag allowing the product to be identified, purified, or targeted to a particular subcellular distribution. Tagged alleles have the syntax 'gene-symbol x.T:y', where x is an identifier and y is the name of the tag (e.g., Hsap\MYC, T:Ivir\HA1, SV40\nls2); for example, dap1gm.T:Hsap\Myc. Where a tag is artificial, the species prefix Zzzz is used, e.g. T:Zzzz\His6.
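
The 'x.T:y' part of a tagged allele symbol can be taken apart with a few lines of Python. This is a minimal sketch, assuming the designator is supplied without the gene symbol, that the tag is the last component, and that the tag name carries a species prefix separated by a backslash. The names TaggedDesignator and parse_tagged_designator are hypothetical.

```python
from typing import NamedTuple


class TaggedDesignator(NamedTuple):
    identifier: str   # the 'x' part of 'x.T:y'
    tag_species: str  # species prefix of the tag, e.g. 'Hsap' or 'Zzzz'
    tag_name: str     # tag name without the species prefix


def parse_tagged_designator(designator: str) -> TaggedDesignator:
    """Split an 'x.T:y' designator at its last '.T:' marker."""
    identifier, marker, tag = designator.rpartition(".T:")
    if not marker:
        raise ValueError(f"no '.T:' tag marker in {designator!r}")
    species, backslash, name = tag.partition("\\")
    if not backslash:
        raise ValueError(f"tag {tag!r} has no species prefix")
    return TaggedDesignator(identifier, species, name)


parsed = parse_tagged_designator(r"1gm.T:Hsap\Myc")
print(parsed.identifier, parsed.tag_species, parsed.tag_name)  # 1gm Hsap Myc
```

Splitting at the last '.T:' reflects the rule that tags appear as the final component of the allele symbol.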

G.1.4. Classical alleles engineered into transgene constructs, including rescue constructs

A class of alleles is named to capture fragments of genomic DNA used in rescue constructs. The symbol for the rescuing allele begins with '+t'. This is followed by the length as stated by the authors, or by the construct symbol if the length is not given; if neither length nor construct symbol is stated, '+tIa' is used, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series. When rescue is incomplete, the construct is considered to carry a mutant allele. The allele designator is then the construct symbol, 'length of genomic insert.tIa' if no symbol is given, or 'tIa' where neither length nor construct symbol is stated.

When a classical allele, e.g., wa, is put into a transgene construct it will get a new designation, e.g., wa.tIa, to reflect its transgenic environment, where 't' stands for transgene, 'I' for the first author's last name initial and 'a' for the first in the series.

FlyBase is, of course, happy to discuss and advise on use of nomenclature of these non-traditional alleles.

G.2. Controlled vocabularies used by FlyBase

For many reasons several of the fields in FlyBase use structured controlled vocabularies (aka ontologies). This makes it much easier (and more robust) to make links within the database, as well as making it much easier to search the database for information. Moreover, several of these controlled vocabularies are shared with other databases, and this provides a degree of integration between them. The controlled vocabularies are only implemented in certain fields in FlyBase.

The controlled vocabularies currently used by FlyBase are:

  • The Gene Ontology (GO). This provides structured controlled vocabularies for the annotation of gene products (although FlyBase at present annotates genes with GO terms, as a surrogate for their products). The GO has three domains: the molecular function of gene products, the biological process (i.e. roles) in which they are involved and their cellular component (location).
  • Anatomy. A structured controlled vocabulary of the anatomy of Drosophila melanogaster, used, for example, for the description of phenotypes and where a gene is expressed.
  • Development. A structured controlled vocabulary of the development of Drosophila melanogaster, used, for example, for the description of phenotypes and when a gene is expressed.
  • The Sequence Ontology (SO). A structured controlled vocabulary for sequence annotation, for the exchange of annotation data and for the description of sequence objects in databases. Its use by FlyBase means that the various components of the genome are described in a consistent and rigorous manner.
  • FlyBase controlled vocabulary. A structured controlled vocabulary used for the annotation of various objects in FlyBase, including publications (by their type), alleles (for their mutagen etc). Although some of these domains will probably always remain local to FlyBase, in time, community ontologies will be available for others (e.g. chemical compounds for mutagens) and FlyBase will then use these.

All of these structured controlled vocabularies are in the same format, that used by the Open Biomedical Ontology group. This format is called the OBO format and files using it have the suffix '.obo', e.g. gene_ontology.obo. The OBO format is designed to be used with the freely-downloadable OBO-Edit tool.
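
For readers who want to inspect an .obo file programmatically, the following Python sketch collects the tag-value pairs of each [Term] stanza. It is a deliberately minimal reader, not a full OBO parser: it assumes stanzas are introduced by a bracketed header line, that each following line has the form 'tag: value', and that lines starting with '!' are comments. The function name parse_obo_terms and the sample text are made up for this example.

```python
def parse_obo_terms(text: str) -> dict:
    """Map each term id to a dict of tag -> list of values for [Term] stanzas only."""
    terms = {}
    current = None  # tag/value dict of the [Term] stanza being read, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("!"):
            continue
        if line.startswith("[") and line.endswith("]"):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue  # file header lines and non-Term stanzas are skipped
        tag, _, value = line.partition(":")
        value = value.strip()
        current.setdefault(tag, []).append(value)
        if tag == "id":
            terms[value] = current
    return terms


sample = """format-version: 1.2

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function

[Typedef]
id: part_of
name: part of
"""

terms = parse_obo_terms(sample)
print(terms["GO:0003674"]["name"])  # ['molecular_function']
```

Values are kept as lists because a tag such as a synonym or parent relationship may occur more than once in a stanza.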

Users should be aware that controlled vocabularies undergo continual development; terms and definitions are refined, added, merged, split and obsoleted in an effort to improve the way they represent their various subjects.

Both the current 'live' versions of each controlled vocabulary and the static versions taken at the time data for this FlyBase release was frozen are available to download from the Precomputed files download page under the Files menu of the Navigation bar.

The detail of each controlled vocabulary term is displayed in a CV Term Report in FlyBase. Individual CV Term Reports can be reached either by clicking on the controlled vocabulary term where it is displayed in a report page (e.g. the GENE ONTOLOGY: Function, Process, and Cellular component section of the Gene Report), or by using the TermLink tool, which allows users to search directly for controlled vocabulary terms from any of the controlled vocabularies used by FlyBase.

Controlled vocabulary terms can also be searched using the QueryBuilder tool, via their links to objects (such as genes) in FlyBase. If you wish to search using a controlled vocabulary term in QueryBuilder, you should select the GO/Anatomy CV DB dataset in the query segment box (see the QUERY BUILDER HELP section at the bottom of the QueryBuilder page for more details).

G.3. Classification of Gene Products using Gene Ontology (GO) terms

FlyBase uses Gene Ontology (GO) controlled vocabulary (CV) terms for cellular component, biological process and molecular function to describe properties of gene products. Although GO terms are intended to describe the properties of gene products, FlyBase currently assigns GO terms to genes rather than protein or RNA.

FlyBase is one of the founding members of the Gene Ontology (GO) Consortium and follows the general guidelines for GO annotation as described in the GO documentation. FlyBase also participates in the GO reference genome project.

G.3.1. FlyBase GO data

GO data is displayed in the GENE ONTOLOGY: Function, Process, and Cellular component section of individual Gene Reports.

In addition, the current release of GO data for all Drosophila melanogaster FlyBase genes can be found in the tab delimited text file gene_association.fb. The following provides a brief description of the columns in the gene_association.fb file.

  • DB
    The database contributing the gene_association file
    FB File: always "FB" for gene_association.fb.
  • DB_Object_ID
    A unique identifier in the database for the item being annotated.
    FB File: This is always the primary FlyBase identifier number for a Drosophila gene.
    Example: FBgn0000490
  • DB_Object_Symbol
    A (unique and valid) symbol to which the DB_Object_ID is matched.
    FB File: This is always the valid gene symbol for a Drosophila gene.
    Example: dpp
  • Qualifier (this field is optional)
    One or more of 'NOT', 'contributes_to' or 'colocalizes_with' as qualifier(s) for a GO annotation.
    Multiple qualifiers are separated by a pipe (|).
    FB File: 'contributes_to' or 'colocalizes_with' are not currently
    displayed in gene_association.fb, but they will be displayed in the
    next release of the FlyBase gene_association file.
  • GO ID
    The unique GO identifier for the GO term attributed to the DB_Object_ID.
    Example: GO:0005160
  • DB:Reference
    The unique identifier for the reference to which the GO annotation is attributed.
    FB File: Each FlyBase reference including published literature,
    conference abstracts, personal communications, sequence records and
    computer files has a unique 7 digit identifier (an FBrf). Where this
    reference is a published paper with a PubMed identifier, the PubMed ID
    is also listed in column 6, separated from the FBrf with a pipe (|).
    Example: FB:FBrf0136863|PMID:11432817
  • Evidence
    The evidence code for the GO annotation; one of IMP, IGI, IPI, ISS, IDA, IEP, IEA, TAS, NAS, ND, IC, RCA
  • With (or) From
    FB File: This column contains the identifier for annotations where the
    evidence code is IGI, IPI, ISS, IEA or IC. For IGI the database gene
    symbol and identifier is listed. For ISS and IPI the identifier can be a gene
    symbol and identifier, or a sequence (protein or nucleic acid)
    identifier. For IC, the GO identifier of the term used as the basis of
    a curator inference is given. With statements for IC are not currently
    displayed in gene_association.fb, but they will be displayed in the
    next release of the FlyBase gene_association file.
    IGI example: FLYBASE:rpr; FB:FBgn0011706
    ISS example: UniProt:P35569
    ISS example: EMBL:AF064523
    ISS example: SGD_LOCUS:COP1; SGD:S0002304
    IC example: GO:0045298
  • Aspect
    Which ontology the GO term belongs to: Function (F), Process (P) or Component (C).
    Example: P
  • DB_Object_Name
    FB File: The full name of the FlyBase gene.
    Example: decapentaplegic
    Where a FlyBase gene has no full name (e.g. Pten), this field is left blank.
  • DB_Object_Synonym
    Alternative names by which the database object is known.
    FB File: Multiple synonyms of a FlyBase gene are separated by a pipe (|).
    Example: M(2)LS1|shortvein|Dm-DPP|dpp|Dpp|DPP|CG9885|
    TGF-beta|TGF-&bgr;|TGF-b|Hin-d|l(2)10638|shv|
    DPP-C|ho|M(2)23AB|blk|l(2)22Fa|l(2)k17036|Tg|TGF&bgr;
  • DB_Object_Type
    The type of object being annotated. Always a gene for FlyBase data.
    FB File: always "gene" for gene_association.fb.
  • taxon
    The taxonomic identifier of the species encoding the gene product
    Example: taxon:7227
  • Date
    The date of last annotation update, in the format 'YYYYMMDD'. At
    present this date is the same for all annotations and corresponds to
    the date of the latest FlyBase update; we are in the process of
    changing our system so that dates more accurately reflect the date the
    annotation is made.
    Example: 20040821
  • Assigned_by
    The source of the GO annotation.
    FB File: One of either FB or UniProtKB.
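
The column layout above can be read with a short Python sketch like the one below. It assumes the fifteen columns appear in the order listed, that lines beginning with '!' are header comments, and that the Qualifier, DB:Reference and DB_Object_Synonym columns are the pipe-separated ones. The function name read_gene_association and the dictionary keys are choices made for this example, and the sample line is stitched together from the per-column examples; it is not a real FlyBase annotation.

```python
import csv
import io
from datetime import datetime

COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "GO_ID",
    "DB_Reference", "Evidence", "With_From", "Aspect", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "taxon", "Date", "Assigned_by",
]
PIPE_SEPARATED = ("Qualifier", "DB_Reference", "DB_Object_Synonym")


def read_gene_association(handle):
    """Yield one dict per annotation line of a tab-delimited association file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("!"):
            continue  # blank line or header comment (assumed convention)
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(COLUMNS, fields))
        for name in PIPE_SEPARATED:
            row[name] = [value for value in row[name].split("|") if value]
        row["Date"] = datetime.strptime(row["Date"], "%Y%m%d").date()
        yield row


# Synthetic line built from the column examples; not a real annotation.
sample = "\t".join([
    "FB", "FBgn0000490", "dpp", "", "GO:0005160",
    "FB:FBrf0136863|PMID:11432817", "ISS", "UniProt:P35569", "P",
    "decapentaplegic", "shv|Dpp", "gene", "taxon:7227", "20040821", "FB",
]) + "\n"

rows = list(read_gene_association(io.StringIO("!example header\n" + sample)))
print(rows[0]["DB_Object_Symbol"], rows[0]["DB_Reference"])
```

An empty Qualifier column becomes an empty list, so code using the rows can test for qualifiers without special-casing the optional field.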

The latest version of this data is also available for download from the Gene Ontology consortium site. The accompanying README document includes a detailed description of the file format, FlyBase GO annotation policy and sources used for FlyBase GO annotations.

Note that the GO data available from FlyBase will not necessarily be identical to that found on the GO website. GO validate the data FlyBase submits and remove lines of data that are no longer valid e.g. when a GO term becomes obsolete.

QueryBuilder can be used to identify all the genes associated with a particular GO term. The AmiGO and QuickGO browsing tools can be used to find GO terms of interest.

G.3.2. Evidence

Evidence for a GO term consists of an evidence code that describes the type of analysis carried out together with, in some cases, a reference to another database object that supports the evidence (see with/from Supporting Evidence below).

Evidence codes

The Gene Ontology Guide to GO Evidence Codes contains comprehensive descriptions of the evidence codes used in GO annotation. FlyBase uses the following evidence codes when assigning GO data:

  • inferred from mutant phenotype (IMP)
  • inferred from genetic interaction (IGI)
  • inferred from direct assay (IDA)
  • inferred from physical interaction (IPI)
  • inferred from expression pattern (IEP)
  • inferred from sequence or structural similarity (ISS)
  • inferred from electronic annotation (IEA)
  • inferred from reviewed computational analysis (RCA)
  • traceable author statement (TAS)
  • non-traceable author statement (NAS)
  • inferred by curator (IC)
  • no biological data available (ND)

G.3.2.1. Use of evidence codes

Consistent with the aims of the GO reference genome project, FlyBase prefers to assign GO terms based on experimental evidence codes (IMP, IGI, IDA, IPI, IEP). Of these five codes, FlyBase uses IEP relatively infrequently since expression patterns generally provide less direct evidence for GO terms than the other four codes. FlyBase does use IEP where an author explicitly states that expression data is the evidence for a term.

Evidence codes based on computer predictions (ISS, IEA, RCA), author statements (NAS, TAS) and curator inference (IC) will continue to be used in the absence of experimental data for the same or a more specific GO term. However, we aim to remove GO data with these codes when experimental evidence for the term is curated.
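
The grouping of evidence codes described in this section can be written down as a small Python sketch. The group names, the function names and the (GO ID, evidence code) tuple layout are assumptions for illustration. The sketch only checks for experimental evidence on the same GO term; the "more specific GO term" case mentioned above would need the ontology structure and is not attempted. The pairing of GO IDs and codes in the example is arbitrary.

```python
EVIDENCE_GROUPS = {
    "experimental": {"IMP", "IGI", "IDA", "IPI", "IEP"},
    "computational": {"ISS", "IEA", "RCA"},
    "author_statement": {"NAS", "TAS"},
    "curator_inference": {"IC"},
    "no_data": {"ND"},
}
# Groups whose annotations FlyBase aims to remove once experimental evidence exists.
REPLACEABLE_GROUPS = {"computational", "author_statement", "curator_inference"}


def evidence_group(code: str) -> str:
    """Return the group name for an evidence code used by FlyBase."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


def removal_candidates(annotations):
    """List non-experimental annotations whose GO term also has experimental evidence."""
    experimental_terms = {
        go_id for go_id, code in annotations
        if evidence_group(code) == "experimental"
    }
    return [
        (go_id, code) for go_id, code in annotations
        if evidence_group(code) in REPLACEABLE_GROUPS and go_id in experimental_terms
    ]


# Arbitrary example pairings, for illustration only.
annotations = [
    ("GO:0005160", "IDA"),
    ("GO:0005160", "ISS"),
    ("GO:0016251", "TAS"),
    ("GO:0045298", "ND"),
]
print(removal_candidates(annotations))  # [('GO:0005160', 'ISS')]
```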

The evidence code ND (no biological data available) is used for annotations to the three unknown GO terms: "molecular_function unknown ; GO:0005554", "biological_process unknown ; GO:0000004" and "cellular_component unknown ; GO:0008372". In FlyBase the use of any of these three GO terms, attributed to reference FBrf0159398 and supported by the ND evidence code, signifies that a curator has examined the available literature and sequence for this gene and that, as of the date of the annotation to the term, there is no information supporting an annotation to any more specific GO term in that ontology. Recently, GO removed the unknown terms and changed to using the root terms "molecular_function ; GO:0003674", "biological_process ; GO:0008150" or "cellular_component ; GO:0005575" with the ND evidence code; this provides a more accurate ontological representation of the current knowledge about the gene products. FlyBase will implement this change in the next release.

Additional information about the way FlyBase uses evidence codes can be found in the README document.

with/from Supporting Evidence

Some evidence codes (IGI, IPI, ISS, IEA, IC) are used in conjunction 'with' supporting data in the form of a reference to another database object. These objects are identified by their database abbreviation followed by a colon and the unique identifier for the object in that database. A list of current database abbreviations can be found in the GO.xrf_abbs file. See the GO Annotation Guide for more details.

ISS and IEA 'with'

FlyBase captures GO data based on similarity to other gene products that are known to have that attribute. Since October 1st 2006, it has been mandatory for ISS annotations to include an identifier for the sequence used to make the annotation; earlier FlyBase ISS annotations that do not include identifiers will be updated gradually. In line with current guidelines for reference genomes, curators now check that the similar sequence can be annotated to the GO term with experimental evidence (IDA, IMP, IGI, IPI, IEP) before making an ISS annotation. This policy was adopted to avoid circular similarity-based annotations. Consequently, GO terms are not curated based on multiple sequence alignments if none of the sequences in the alignment have been experimentally verified. Annotations made before October 2006 have not necessarily been checked in this way.

For example, the Drosophila gene bigmax is annotated with the GO term 'regulation of transcription' based on sequence similarity to Max. This annotation is legitimate because Max has been shown to regulate transcription in a direct assay.

The combined evidence appears on the gene report in the format:

inferred from sequence or structural similarity with FLYBASE:Max; FB:FBgn0017578

In this case we have given two identifiers (symbol and gene ID) for the same sequence; identifiers for the same sequence are separated by a semi-colon. If more than one sequence is used to make the annotation then the identifiers for the different sequences are separated by a comma. Note that this use of multiple identifiers is different from that for IGI and IPI.

Where the database object used to make IEA annotations can be identified then this is included in the same way. However, the majority of FlyBase annotations with IEA do not yet include such a reference. Most IEA annotations in FlyBase are based on the presence of protein domains that are mapped to GO terms. The identifiers for the protein domains will be included in future releases.

IGI and IPI 'with'

For both IGI and IPI, the use of multiple identifiers in 'with' has a special meaning. All annotations inferred from genetic interaction (IGI) include an identifier for the interacting gene. If the GO term is inferred based on multiple genes interacting simultaneously then all interacting genes are identified using 'with' (separated by commas). However, if the GO term is inferred from multiple pairwise interactions these are treated as separate pieces of experimental evidence and appear with separate evidence codes on the gene report.

For example, Bruce is annotated with the GO term 'programmed cell death' based on two different pairwise genetic interaction experiments; the evidence appears on the gene report as:

inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946 AND inferred from genetic interaction with FLYBASE:rpr; FB:FBgn0011706

Contrast this with the following, which would imply that all three genes had to interact together to provide evidence for the annotation:

inferred from genetic interaction with FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706

Similar notation is used for IPI where the interacting gene product is identified using 'with'. Where several gene products interact simultaneously they are recorded in a single annotation (separated by commas after the evidence code). Pairwise physical interactions are recorded independently using separate evidence codes.
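
The separator conventions for 'with' statements (a comma between different objects, a semi-colon between identifiers for the same object, and a colon between database abbreviation and identifier) can be sketched in Python as follows. The function name parse_with is hypothetical, and the sketch assumes that symbols themselves contain no commas or semi-colons.

```python
def parse_with(with_field: str):
    """Parse a 'with' statement into a list of objects, each a list of (db, id) pairs."""
    objects = []
    for object_text in with_field.split(","):
        identifiers = []
        for identifier in object_text.split(";"):
            identifier = identifier.strip()
            if not identifier:
                continue
            # Split at the first colon only, so identifiers may contain colons.
            database, colon, accession = identifier.partition(":")
            if not colon:
                raise ValueError(f"missing database abbreviation in {identifier!r}")
            identifiers.append((database, accession))
        if identifiers:
            objects.append(identifiers)
    return objects


simultaneous = parse_with(
    "FLYBASE:grim; FB:FBgn0015946, FLYBASE:rpr; FB:FBgn0011706"
)
print(len(simultaneous))  # 2 objects, each with a symbol and a gene ID
print(simultaneous[0])    # [('FLYBASE', 'grim'), ('FB', 'FBgn0015946')]
```

For IGI and IPI, a result with more than one object means the genes or gene products interact simultaneously; pairwise interactions would instead arrive as separate annotations, each with a single object.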

IC 'from'

Inferred by curator (IC) is the evidence code that uses 'from'. Curators use this code for those cases where an annotation is not supported by any direct evidence, but can reasonably be inferred by a curator from other GO annotations for which evidence is available. The object identified in the IC evidence is always a GO term identifier.

For example, a protein shown to have transcription factor activity in a direct assay could be annotated with the GO term 'general RNA polymerase II transcription factor' (GO:0016251). In the absence of any evidence for the cellular location of that protein, it would be reasonable for the curator to infer that it is (at least sometimes) located in the nucleus. This would lead to the annotation 'nucleus inferred by curator from GO:0016251'; the annotation is attributed to the reference that contains evidence for transcription factor activity.

G.3.2.2. Use of Qualifiers

Qualifiers are used as flags that modify the interpretation of an annotation. Allowable values are NOT, contributes_to, and colocalizes_with. On the gene report page, qualifiers precede the GO term in the CV column. More information about using qualifiers is available in the GO Annotation Guide.

NOT

NOT may be used with terms from any of the three GO ontologies (cellular component, biological process, molecular function).

NOT is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method).

For example, if a protein has sequence similarity to an enzyme such as galactosyltransferase, but has been shown experimentally not to have the galactosyltransferase activity, it can be annotated as NOT galactosyltransferase activity (GO molecular function term: GO:0008378).

NOT can also be used when a cited reference explicitly says so (e.g. "our favorite protein is not found in the nucleus"). Prefixing a GO term with the string NOT allows curators to state that a particular gene product is NOT associated with a particular GO term. This usage of NOT was introduced to allow curators to document conflicting claims in the literature.

Note that NOT is used when a GO term might otherwise be expected to apply to a gene product, but an experiment, sequence analysis, etc. proves otherwise; it is not generally used for negative or inconclusive experimental results.

colocalizes_with

colocalizes_with is used only with cellular component terms.

Gene products that are transiently or peripherally associated with an organelle or complex are annotated to the relevant cellular component term, using the colocalizes_with qualifier. This qualifier is also used in cases where the resolution of an assay is not accurate enough to say that the gene product is a bona fide component member.

contributes_to

contributes_to is used only with molecular function terms.

An individual gene product that is part of a complex is annotated to terms that describe the function of the complex. Many such function annotations include the qualifier contributes_to.

Annotating individual gene products according to attributes of a complex is especially useful for molecular function annotations in cases where a complex has an activity, but not all of the individual subunits do. (For example, there may be a known catalytic subunit and one or more additional subunits, or the activity may only be present when the complex is assembled.) Molecular function annotations of complex subunits that are not known to possess the activity of the complex include the qualifier contributes_to.

Note that contributes_to is not used to annotate a catalytic subunit. Furthermore, contributes_to may be used for any non-catalytic subunit, whether the subunit is essential for the activity of the complex or not.
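
The restrictions on which qualifier may accompany which ontology can be checked mechanically. The Python sketch below assumes the single-letter aspect codes (F, P, C) and the pipe-separated qualifier field used in gene_association.fb; the names ALLOWED_ASPECTS and check_qualifiers are hypothetical.

```python
# Aspects (F = function, P = process, C = component) each qualifier may be used with.
ALLOWED_ASPECTS = {
    "NOT": {"F", "P", "C"},
    "colocalizes_with": {"C"},
    "contributes_to": {"F"},
}


def check_qualifiers(qualifier_field: str, aspect: str) -> list:
    """Return a list of problems found in a pipe-separated qualifier field."""
    problems = []
    for qualifier in filter(None, qualifier_field.split("|")):
        allowed = ALLOWED_ASPECTS.get(qualifier)
        if allowed is None:
            problems.append(f"unknown qualifier: {qualifier}")
        elif aspect not in allowed:
            problems.append(f"{qualifier} is not used with aspect {aspect}")
    return problems


print(check_qualifiers("NOT|contributes_to", "F"))  # []
print(check_qualifiers("colocalizes_with", "P"))
```

An empty list means the combination is consistent with the rules in this section; it says nothing about whether the annotation itself is biologically justified.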

G.4. Computed Feature type of genes

The Feature type field of a Gene Report contains a single controlled vocabulary term from the Sequence Ontology (SO), which aims to describe the key type of the gene.

The single term in this field is computed by FlyBase from the full list of SO terms listed in the SEQUENCE ONTOLOGY: Class of gene section of the Gene Report, according to the following rules:

  • If the gene is a "foreign" gene (a gene that is derived from a species outside the family Drosophilidae, but has been analysed in Drosophila), the relevant SO term - e.g. engineered_foreign_gene - is displayed in the Feature type section, regardless of any other terms in the list.
  • If the gene represents an engineered object, for example a fusion gene between two D. melanogaster genes, the relevant SO term - e.g. engineered_fusion_gene - is displayed in the Feature type section, regardless of any other terms in the list.
  • If the gene represents a natural transposon, or a gene encoded by a natural transposon, the relevant SO term - e.g. transposable_element - is displayed in the Feature type section, regardless of any other terms in the list.
  • If the gene is a pseudogene, the relevant SO term - pseudogene_attribute - is displayed in the Feature type section, regardless of any other terms in the list.
  • If an SO term which describes what type of molecule the gene encodes has been assigned to the gene - e.g. miRNA_gene, tRNA_gene or protein_coding_gene - that SO term is displayed in the Feature type section, regardless of any other terms in the list.
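
One way to read the rules above is as an ordered list in which the first matching rule wins. The Python sketch below works under that assumption, which the text does not state explicitly. The term sets contain only the example SO terms quoted in the rules and are therefore incomplete; the names FEATURE_TYPE_RULES and computed_feature_type are hypothetical.

```python
# Rules in the order listed; each set holds only the example terms from the text.
FEATURE_TYPE_RULES = [
    ("foreign gene", {"engineered_foreign_gene"}),
    ("engineered object", {"engineered_fusion_gene"}),
    ("natural transposon", {"transposable_element"}),
    ("pseudogene", {"pseudogene_attribute"}),
    ("encoded molecule type", {"miRNA_gene", "tRNA_gene", "protein_coding_gene"}),
]


def computed_feature_type(so_terms):
    """Return the SO term selected by the first matching rule, or None if none match."""
    for _rule_name, candidate_terms in FEATURE_TYPE_RULES:
        for term in so_terms:
            if term in candidate_terms:
                return term
    return None


print(computed_feature_type(["protein_coding_gene", "engineered_fusion_gene"]))
# engineered_fusion_gene
```

Under this reading, a gene listed with both protein_coding_gene and engineered_fusion_gene is reported as engineered_fusion_gene because the engineered-object rule comes earlier in the list.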

G.5. Computed cytological data

G.5.1. Computed cytological locations of objects which have been mapped to the genome.

Objects which have been precisely mapped to the genome (such as genes with annotations, or insertions of transposable elements with flanking sequence) have an inferred cytological location which is computed by FlyBase based on their sequence location.

The system used is based on estimates that Sorsa published a few years ago of the size in kb of each polytene band. These estimates can be summed to give the length (according to Sorsa) in kb of a region between two very well-mapped entities ('anchors') that are also identified on the genome. The genome sequence gives a different number for that length, so we then apply a scaling factor, i.e. we calculate the cytology of each mapped object in the region between the anchors by interpolation from its sequence coordinates. The anchors we use are a set of over 1200 P-element insertions that have been localised on the genome by sequencing flanking DNA and on polytene chromosomes by Todd Laverty of the Berkeley Drosophila Genome Project. The scaling works out to be slightly different for each inter-anchor region, but we estimate that even in the middle of a region the error in the computed location should never be more than a band or two. As the remaining gaps in the genome sequence are filled, some currently unmappable stretches of sequence (especially near centromeres) will be joined up with the main sequence, and this will shift all the coordinates. Smaller changes will occur as a result of other gap-filling in the middle of arms. These will be reflected in updates to the map locations.
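
The interpolation step can be sketched in Python as follows. The sketch assumes each anchor is given as a pair of (genome coordinate, cumulative position in kb on the Sorsa-based map) and that anchors are sorted by coordinate; the anchor values below are invented for illustration, and the final lookup from a map position to a polytene band is not shown. The function name interpolate_map_position is hypothetical.

```python
from bisect import bisect_right


def interpolate_map_position(coordinate, anchors):
    """Linearly interpolate a map position (kb) between the two flanking anchors.

    anchors: list of (genome_coordinate, map_position_kb) pairs sorted by coordinate.
    """
    coordinates = [genome_coordinate for genome_coordinate, _ in anchors]
    if len(anchors) < 2 or not coordinates[0] <= coordinate <= coordinates[-1]:
        raise ValueError("coordinate is not flanked by two anchors")
    # Index of the right-hand anchor of the inter-anchor region containing the coordinate.
    right = min(bisect_right(coordinates, coordinate), len(anchors) - 1)
    (left_coord, left_kb), (right_coord, right_kb) = anchors[right - 1], anchors[right]
    # The scaling factor (map kb per genome base) differs for each region.
    return left_kb + (right_kb - left_kb) * (coordinate - left_coord) / (right_coord - left_coord)


# Invented anchors: (genome coordinate in bp, cumulative map position in kb).
anchors = [(1_000_000, 500.0), (1_200_000, 620.0), (1_500_000, 740.0)]
print(interpolate_map_position(1_100_000, anchors))  # 560.0
print(interpolate_map_position(1_350_000, anchors))  # 680.0
```

In the example the first region scales 200 kb of sequence onto 120 kb of map and the second scales 300 kb onto 120 kb, which mirrors the point that the scaling is slightly different for each inter-anchor region.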

FlyBase currently only computes cytological data in this way for objects that have been mapped to the D. melanogaster genome.

Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:

Gene Report

  • Cytogenetic map field of the GENOMIC LOCATION section.
  • FLYBASE COMPUTED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section.

Insertion Report

  • Cytological location (computed by FlyBase) field of the DETAILED MAPPING DATA section, marked in parentheses with "inferred by FlyBase from sequence location".

GBrowse

  • Cytologic band evidence tier of the GBrowse display.

G.5.2. Computed cytological location of insertions based on the gene in which they are inserted.

Insertions of transposable elements that do not have flanking sequence may have a computed cytological location which is based on the computed cytological location of the gene into which they have inserted in the genome (displayed in the Affected gene(s) section of the Insertion Report).

If the affected gene has a computed cytological location based on its sequence location (as described in G.5.1. above) then this is displayed in the Insertion Report in the following field:

  • Cytological location (computed by FlyBase) field of the DETAILED MAPPING DATA section, marked in parentheses with "near gene of known cytology".

G.5.3. Computed cytological locations of objects based upon data from the literature.

Genes that do not have a computed cytology based on their mapping to the genome (described in G.5.1. above) may instead have a computed cytology based upon data from the literature. Aberrations may also have computed cytological breakpoints based upon data from the literature.

Five categories of information are used to compute the cytological location of genes and aberration breakpoints:

  • Polytene localization of genes by chromosome in situ hybridization (reported in the EXPERIMENTALLY DETERMINED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section of the relevant Gene Report).
  • Polytene localization of aberration breakpoints (orcein data) (reported in the Breakpoints field of the NATURE OF THE ABERRATION section of the relevant Aberration Report).
  • Genetic (recombination) mapping data on gene order (reported in the EXPERIMENTALLY DETERMINED RECOMBINATION DATA section of the DETAILED MAPPING DATA section of the relevant Gene Report).
  • Complementation data between alleles and aberrations (reported in the GENE DELETION & DUPLICATION DATA section of the relevant Aberration Report).
  • Molecular data on gene order (reported in the MOLECULAR MAP DATA section of the DETAILED MAPPING DATA section of the relevant Gene Report) and proximity.

Recombination, complementation and molecular information do not reveal polytene locations directly, but can be combined with orcein and in situ data to derive inferred polytene locations. FlyBase has developed software that synthesizes the primary data into a computed cytological location: a best guess of the polytene location of each gene or aberration breakpoint for which any relevant data are known to FlyBase. However, since this type of analysis is non-trivial when conducted on a large dataset, the statements computed in this way should be treated with caution, and users should also consult the five categories of information listed above to see the full extent of the primary data.

The computed cytological location is presented as a range of uncertainty, whose ends are either polytene bands (such as 22F1) or lettered subdivisions (such as 22F). Heterochromatic bands (such as h41) are also used.

Wherever possible, the computed range of uncertainty of a gene or breakpoint is the range consistent with ALL the data known to FlyBase. Thus, if in one publication a gene has been reported to lie in 35B1-4, and in another publication it is reported to lie in 35B3-6, and there is no other relevant information in FlyBase, the computed location will be 35B3-4. More complex situations arise from complementation and recombination data. For example, if Df(1)xyz is stated to have its proximal breakpoint at 15A1-4, and Df(1)pqr is stated to have its distal breakpoint at 15A3-6, and the Deficiencies are known to overlap (because there is a gene, abc, that they both delete), then both those breakpoints will be computed to lie in 15A3-4 -- as will the gene abc itself.
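The simplest case, intersecting several reported ranges, can be sketched as below. This is an illustration only: it handles ranges that lie within a single lettered subdivision (such as 35B1-4), and the function names are hypothetical rather than taken from FlyBase software.

```python
import re

def parse_range(text):
    """Parse a range such as '35B1-4' into (subdivision, first_band, last_band)."""
    m = re.fullmatch(r"(\d+[A-F])(\d+)-(\d+)", text)
    if m is None:
        raise ValueError(f"unsupported range: {text!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))

def intersect_ranges(reports):
    """Return the range consistent with ALL reports, or None if they conflict."""
    parsed = [parse_range(r) for r in reports]
    subdivisions = {p[0] for p in parsed}
    if len(subdivisions) != 1:
        raise ValueError("this sketch only handles a single lettered subdivision")
    subdivision = parsed[0][0]
    left = max(p[1] for p in parsed)   # most restrictive left limit
    right = min(p[2] for p in parsed)  # most restrictive right limit
    if left > right:
        return None  # non-overlapping reports: needs case-by-case curation
    if left == right:
        return f"{subdivision}{left}"
    return f"{subdivision}{left}-{right}"
```

For the example above, intersect_ranges(["35B1-4", "35B3-6"]) gives "35B3-4". Ranges inferred from complementation or recombination data would first have to be converted into limits of this kind.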

If however two publications report cytological ranges that do NOT overlap, a choice must be made regarding which report to prioritize. This is done case-by-case, going back to the original literature. Certain guidelines are used: for example, genetic data on deficiencies are usually favored over cytological data, since point lesions very near to a deficiency are rare. However, inevitably some decisions are wrong -- especially when there is nothing to favor one report over another.

Because of the inherent complexity of these computations, the basis for the computed range is often not obvious at first sight. FlyBase therefore includes one-line descriptions of the primary data from which each end of the range was determined.

Some examples:

For gene abc:

Computed cytological location: 15A3-4
Left limit from inclusion in Df(1)pqr (FBrf0012345)
Right limit from inclusion in Df(1)xyz (FBrf0054321)

For Df(1)xyz:

Computed breakpoints: 14D;15A3-4
Limits of break 1 from polytene analysis (FBrf0013579)
Left limit of break 2 from inclusion of abc (FBrf0056789)
Right limit of break 2 from polytene analysis (FBrf0098765)

For Df(1)pqr:

Computed breakpoints: 15A3-4;15D
Left limit of break 1 from polytene analysis (FBrf0034567)
Limits of break 2 from polytene analysis (FBrf0097531)

Note that there is no requirement that any two data items derive from the same reference.

Notation

If a computed cytological range is inferred from recombination data (for genes) or complementation data (for breakpoints), it is enclosed in square brackets when no range (even a wider one) can be determined by other means (thus square brackets specifically denote the unavailability of any direct data). This is most commonly found for breakpoints of cytologically invisible deficiencies and for genes which were mapped by recombination but never cloned or mapped by complementation.

'One-ended' limits. The commonest example of this is when a deficiency is stated to delete certain genes, thus giving it a minimum extent, but no flanking undeleted genes are specified, so no 'maximum extent' can be computed. In such cases, if there is also no explicit cytology for the deficiency (and if it is also not stated to be cytologically invisible -- see below) the 'half-open' range is denoted by 'less than' and 'greater than' signs, as follows:

For a deficiency that deletes three genes, all localized to 28D-E:

Computed breakpoints: <28E;>28D
Right limit of break 1 from inclusion of abc (FBrf0076543)
Left limit of break 2 from inclusion of abc (FBrf0056789)

Note that there is no 'limit line' for the left limit of break 1 or the right limit of break 2. Note also the superficially odd, but logically sound, mention of 28E for the left break and 28D for the right break.

Proximity rather than order

There are two cases in which locations are computed based on close proximity of a pair of objects, rather than on their chromosomal order. One is when two genes are reported to lie within 20kb of each other on a molecular map. For example, if a gene xyz is stated to lie in 22F1-2 and a second gene, pqr, is stated to lie a few kilobases away from xyz (and there is no other relevant information in FlyBase), the computed location of pqr will be 22F1-2, even if there is no information on the chromosomal order of the two genes.

The other case concerns cytologically invisible deficiencies. If a deficiency is stated to be cytologically invisible, the computation makes the assumption that it is less than a band in extent, so that the ranges of uncertainty of the left and right breakpoint should be identical. For example: if the deficiency in the previous example, which deletes a gene in 28D-E, were said to be cytologically invisible then its computed data would appear as follows:

Computed breakpoints: [28D-E];[28D-E]

Left limit of break 1 from cytological invisibility (FBrf0002468)
Right limit of break 1 from inclusion of abc (FBrf0076543)
Left limit of break 2 from inclusion of abc (FBrf0056789)
Right limit of break 2 from cytological invisibility (FBrf0002468)

Note the use of square brackets as described under "Notation", since this is a case where no explicit cytology is available. A statement that a deficiency is less than 20kb long is, for this purpose, treated as a statement that it is cytologically invisible.

Cytology computed in this way is currently displayed on FlyBase in the following places on the relevant Report:

Gene Report

  • FLYBASE COMPUTED CYTOLOGICAL LOCATION section of the DETAILED MAPPING DATA section.

Note: the one-line description of the primary data from which the range was determined is displayed in the Evidence for location column of the above section.

Aberration Report

  • Computed Breakpoints include field.

Note: the one-line description of the primary data from which the range was determined is displayed in the COMMENTS ON CYTOLOGY section.

G.5.4. Tools

Map-based searches using CytoSearch use computed cytological locations, rather than the primary data reported in the literature. For this reason, it is always advisable to search using a slightly broader range than the one of interest, so as to match entities which have been placed by multiple investigators in slightly varying locations.

The Cytolocation Advanced Search option in GBrowse uses computed cytological locations of objects which have been mapped to the genome (as described in G.5.1. above).

G.6. Personal communications to FlyBase

The policy of FlyBase with respect to the incorporation of unpublished data into the database is as follows. Data will only be considered for curation if available to FlyBase in written or electronic form. FlyBase will not capture data from oral presentations at meetings or seminars, from posters or by word of mouth (we will, however, curate published abstracts). If colleagues wish unpublished data to be considered for incorporation into FlyBase then those data must be submitted to FlyBase in writing or by using the contact FlyBase form (electronic submissions are strongly preferred). Each personal communication will be assigned a FlyBase reference (FBrf) identifier number, and the data will be tied to this citation in the database. These references will appear in the FlyBase bibliographic files, and become citable publications upon entry into the public FlyBase database. Personal communications received in written form (i.e. not electronically) will be archived by FlyBase. For personal communications that have been sent by e-mail, the full text of the communication will be present within the Reference Report. We encourage the citation of these personal communications in the literature in the form:

Gelbart, W.M. (1994). Personal communication to FlyBase. <http://flybase.org/reports/FBrf0075300.html>

Personal communications are incorporated into the FlyBase bibliography and can be searched using either the QuickSearch or the QueryBuilder tool.

G.7. Gene Model Annotation Guidelines

Annotation guidelines have been updated; see section G.8., below.

G.7.1. Criteria for Annotation

Purpose: To determine whether existing gene models are correct and complete and to determine if there is evidence for additional genes or transcripts not already represented by the existing models.

Determine whether a protein-coding gene exists in a region.

Gene prediction algorithms are sufficiently robust that this is rarely an issue for larger genes (200aa or greater), unless the gene consists of many small dispersed exons. To make a judgment in cases of small genes or genes comprised of small exons, available evidence is examined further. Three types of evidence are considered:

  • Matches to cDNA sequence data (BDGP cDNA/EST data or data generated by the community). Considered more significant if it includes an intron with consensus splice sites.
  • Gene prediction data, including conserved protein signatures.
  • BLASTX homology; matches with an expected value less than 1e-7 are considered.

For gene models with only one of these three types of supporting data, models with a predicted CDS greater than 100aa are created or retained. If there are two or more types of supporting data, a gene model is created if the predicted CDS exceeds 50aa. If there is BLASTX homology to a similar small gene in other species, a smaller size limit is accepted.
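The size rule can be written as a small predicate. This is a sketch, not FlyBase code: the evidence-type labels and the function name are hypothetical, and the relaxed limit for BLASTX homology to a small gene in another species is deliberately not modelled because no specific value is given.

```python
def passes_size_threshold(cds_length_aa, evidence_types):
    """Apply the size rule: >100 aa with one evidence type, >50 aa with two or more.

    evidence_types is an iterable of labels such as "cdna", "prediction"
    or "blastx" (illustrative names only).
    """
    n_types = len(set(evidence_types))
    if n_types == 0:
        return False  # no supporting data at all
    minimum = 100 if n_types == 1 else 50
    return cds_length_aa > minimum
```

For example, a 75aa CDS passes with both cDNA and gene prediction support, but not with gene prediction support alone.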

Is there one gene or several?

Gene splits or merges are a common annotation correction and are based upon cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment indicating that a merge or split has occurred, along with an indication of the type of data supporting the change, is placed in the annotation record.

FlyBase considers transcripts for which any portion of the predicted protein is in common to represent a single gene. An overlap of a single amino acid is sufficient, but overlap of a multiphasic coding exon (different ORFs used) is not.

Determine the structure of the transcript(s).

Internal intron-exon structures are based primarily upon EST/cDNA data. If these data are absent, we rely on gene prediction data. In a few cases, approximate gene structures are inferred from BLASTX alignments. In practice, many annotations are based upon a combination of these data types. Examples:

  • When cDNA/ESTs only cover the termini, internal structures will be based upon gene prediction data. The 5' terminus of a transcript is extended to the start of the overlapping EST that extends furthest 5'. Unspliced ESTs generally are not considered.
  • If there is no 5' cDNA/EST data, the transcript is extended to the first in-frame ATG consistent with the gene prediction or BLASTX data. Similarly, if an annotation supported by 5' EST data does not contain an in-frame start codon, the annotation is extended to the first such start.
  • The 3' terminus is extended to the 3' end of a complete cDNA, if available, or to the 3' end of an overlapping 3' EST. Unspliced ESTs generally are not considered.
  • Starting in 2008: full-length cDNAs are checked for terminal polyA's. Annotated transcripts are extended 3' to the last non-A nucleotide and the following comment appended: "Transcript terminates at site supported by polyadenylated cDNA."
  • If there is no 3' cDNA/EST data, the transcript is extended to the first stop codon consistent with the gene prediction data or BLASTX alignment.
  • Whenever splice sites other than GT/AG are annotated, a comment is appended to the transcript.
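The poly(A) check on full-length cDNAs can be illustrated as follows. This is a minimal sketch: the minimum tail length of 10 is an assumption made for the example (no threshold is stated here), and the function name is hypothetical.

```python
def strip_poly_a(cdna_seq, min_tail=10):
    """Return (sequence up to the last non-A nucleotide, whether a tail was found).

    min_tail is an assumed cut-off for calling a terminal run of A's a
    poly(A) tail; shorter runs leave the sequence untouched.
    """
    trimmed = cdna_seq.rstrip("Aa")
    tail_length = len(cdna_seq) - len(trimmed)
    if tail_length < min_tail:
        return cdna_seq, False
    return trimmed, True
```

When a tail is found, the transcript's 3' end would be set to the end of the trimmed sequence and the polyadenylation comment appended.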

Determine the extent of the coding region.

The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. But, in cases supported by the literature (including conservation patterns across Drosophila species), a non-ATG translation start site, or a downstream ATG may be used.

In some cases, especially for annotations supported only by BLASTX data, it is not possible to identify a likely ATG start codon. In such cases, translation is started at the 5'-most internal in-frame codon and an explanatory comment is added.

The following policy was established in 2008: Any change to an existing transcript that results in a change to the CDS will be accompanied by changes to the transcript and protein symbols and IDs.

How many alternative transcripts exist?

We annotate as many alternative transcripts as are supported by cDNA/EST and community data. We will also annotate an alternative transcript if there is convincing gene prediction evidence and/or BLASTX evidence.

If non-contiguous EST data support alternative exons in several regions of the gene, it is not always possible to determine which of all possible combinations actually exist in vivo. The number of such alternative transcripts to be created is at the discretion of the annotator; generally, when there are more than 6-8 transcripts, all alternative exons are represented, but not all possible combinations. In such cases, alternative termini are usually associated with the most common transcript pattern.

Protein conservation data may support additional internal exons. These are assessed for homology to adjacent exons, thus indicating a pattern of exon shuffling.

Partial annotations are avoided except in extreme circumstances; the exception is failure to find a likely ATG start codon if it is encoded in a small or distant exon (see above).

Curator comments (see section G.7.5.).

The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including controlled comments as well as free text comments. The collection of controlled comments was developed during the initial re-annotation stages, and is used as often as possible to facilitate consistency and to provide a means of tracking or querying for various atypical gene structures. For example, all predicted splices that fail to use the canonical GT/AG donor and acceptor splice site dinucleotides are noted, as are genes that have been reported to make use of non-ATG translation starts, genes that have a dicistronic transcript, and genes known to be or appearing to be mutant in the sequenced strain.

Many of the controlled comments address the weaknesses or anomalies in the annotation: an unusual alternative transcript supported by a single EST, incomplete supporting data requiring extension of a gene model to the nearest translation start or stop, or that an ATG translation start codon could not be identified. Genes that are split or merged are commented and the type of evidence supporting the change indicated. Finally, cDNA clones that failed to accurately reflect the annotation (typically those that are incomplete or appear to include intronic sequences) are designated as problematic and have a comment attached.

If such comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will only appear on the report if there are comments attached to the transcript.

G.7.2. Evidence used for gene model annotation as of March 2007

Since the publication of the description of the r3.1 reannotation effort (Misra, et al., 2002), a number of new and expanded data sets allow much more accurate assessment of gene models in D. melanogaster. These include:

  • Expanded collection of high quality full-length cDNA sequences provided by the BDGP
  • Expanded collection of predicted protein sequences, especially from insects (provided to and aligned for FlyBase by NCBI).
  • Additional EST collections, including from Exelixis
  • Additional gene prediction algorithms including Augustus (Stanke, et al. 2006, BMC Bioinformatics 7:62), Contrast (Gross, Do, and Batzoglou, 2005, BCATS 2005 Symposium Proceedings, p. 82), GeneID (Parra, Blanco, and Guigo, 2000, Genome Research, 10: 511-515), NCBI gnomon (Souvorov, et al. 2006), and SNAP (Korf, 2004, BMC Bioinformatics 5:59).
  • The exon prediction algorithm, CONGO, based on conserved protein-coding signatures. (submitted by M. Lin and M. Kellis)
  • Proteomics analysis contributed by the Center for Model Organism Proteomes, SystemsX and Research Priority Project of the University of Zurich, Switzerland.
  • Community submissions of corrections and other data, including non-coding RNA gene models.

G.7.3. Exceptional cases

For genes with data supporting atypical collections of transcripts, a useful rule of thumb is to consider the data for each transcript in isolation, ignoring other transcripts annotated for the same gene and adjacent genes. This helps reduce our bias against the new and unusual.

The majority of annotation comments (see section G.7.5.) flag exceptional types of annotations.

Dicistronic annotations: Adhering to the definition described above, the proteins encoded by a multicistronic transcript are considered to represent different genes. Preferably, a dicistronic transcript is supported by more than one spanning cDNA or EST; care must be taken not to misinterpret cases of overlapping UTRs. Alternative explanations, such as a mutant in the strain, must be ruled out. Each postulated protein should have additional support, especially if it is small.

Atypical cDNAs with retained introns: Initially, cDNAs with retained introns were flagged as problematic clones and a corresponding transcript annotation was not created. But given the unexpected frequency with which we observe such cDNAs, and the fact that there are well characterized systems [such as su(w[a])] with experimental support for such transcripts, we have changed our treatment of these cases. A transcript is created and the following comment appended: "Based on cDNA(s) that contain premature stop codon; may or may not produce functional polypeptide."

Non-ATG starts, stop-codon suppression, and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data.

Mutations in the strain: These are now relatively easily assessed, since comparisons to closely related species are informative.

Pseudogenes: A non-functional gene that (1) has a related gene in the genome and (2) has more than one compromising lesion, is classified as a pseudogene. (If there is only one lesion, it is described as a mutant in the strain.) Retrotransposed pseudogenes appear to be rare. (Retrogenes exist -- at least 98 have been identified -- but nearly all appear to be functional.)

Chimeric genes: These are occasionally created at the site of an aberration (usually a tandem duplication). They are currently annotated as protein-coding genes, usually based on gene prediction algorithms; these annotations need to be reassessed and a consistent comment applied.

Stretching the definition of a gene: As described above, transcripts for which any portion of the predicted protein is in common, even a single amino acid, are considered to be products of a single gene. A number of genes with very complex alternative splicing patterns have pairs of transcripts with coding regions that fail to overlap each other, but both of which overlap the coding region of a third transcript. We flag these cases with the following comment: "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
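Because sharing is transitive, two transcripts with non-overlapping coding regions end up in the same gene when a third transcript overlaps both. A minimal sketch of that grouping is shown below. The data representation (a set of (genome_position, frame) pairs per transcript) and the function name are assumptions made for illustration; requiring a shared position in the same frame is a simplification of the rule that multiphasic exon overlap does not count.

```python
def group_transcripts_into_genes(cds_by_transcript):
    """Group transcripts that share at least one coding position in the same frame.

    cds_by_transcript maps a transcript name to the set of
    (genome_position, frame) pairs covered by its CDS.
    """
    parent = {t: t for t in cds_by_transcript}

    def find(t):
        # Follow parent links to the representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    names = list(cds_by_transcript)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cds_by_transcript[a] & cds_by_transcript[b]:
                parent[find(a)] = find(b)  # merge the two groups

    genes = {}
    for t in names:
        genes.setdefault(find(t), set()).add(t)
    return sorted(sorted(members) for members in genes.values())
```

In this sketch, transcripts A and B with disjoint coding regions are grouped together as soon as a transcript C shares a position with each of them, which is the situation flagged by the comment quoted above.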

Ambiguous cases: Generally, if the case is ambiguous, rather than creating an exceptional annotation a comment is added. Examples: "May be component of a dicistronic gene; available data inconclusive." "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length); possible stop-codon suppression."

G.7.4. Assessment of supporting data

One of the most difficult aspects of annotation is assessing the validity of the supporting data. Most problems are recognized only after a period of time, during which a pattern emerges. Examples: the RE and RH cDNA libraries have a higher frequency of genomically primed clones; the GenScan prediction algorithm tends to overpredict, creating long annotations with many small exons (it is more appropriate for vertebrates); unspliced ESTs must be viewed with caution, especially those from the Exelixis libraries; the IP cDNAs exhibit a higher frequency of chimeric or atypically spliced clones.

We make note of problematic clones, most commonly those that are chimeric, that contain an RTase error resulting in a frameshift, or that appear to be genomically primed. We use the description "Suspect" for clones that appear to be aberrantly spliced (or are unspliced) and do not support a CDS of any size. This information may be found on the FlyBase cDNA clone reports in the "Known Problems" field.

G.7.5. Current list of "canned" annotation comments

An "AnnotationComment" is one that applies to the whole gene model; a "TranscriptComment" applies to a specific transcript.

AnnotationComments

  • "gene_with_dicistronic_processed_transcript ; SO:0000722"
  • "May be component of a dicistronic gene; available data inconclusive."
  • "gene_with_dicistronic_primary_transcript ; SO:0000721"
  • "Shares 5' UTR with upstream gene."
  • "Shares 5' UTR with downstream gene."
  • "Gene merge based on protein alignment (BLASTX) data."
  • "Gene merge based on EST/cDNA data."
  • "Gene split based on protein alignment (BLASTX) data."
  • "Gene split based on EST/cDNA data."
  • "Known mutation in sequenced strain."
  • "Probable mutation in sequenced strain: premature stop."
  • "Probable mutation in sequenced strain: [*]."
  • "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
  • "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length)."
  • "3' UTR contains an ORF that is conserved among close Drosophila species (potential peptide [*] aa in length); possible stop-codon suppression."
  • "Multiphase exon postulated: exon reading frame varies in alternative transcripts."

TranscriptComments

  • "GC splice donor site postulated."
  • "Unconventional splice site postulated."
  • "5' exon not determined (no ATG translation start identified)."
  • "Transcript terminates at site supported by polyadenylated cDNA."
  • "Unconventional (non-ATG) translation start supported by [*]."
  • "Stop codon suppression supported by [*]."
  • "Occurrence of translational frameshift supported by [*]."
  • "Downstream translation start supported by [*]."
  • "Transcript model based on protein alignment (BLASTX); no experimental evidence for splice sites."
  • "Monocistronic transcript; alternative dicistronic transcript(s) exist."
  • "Dicistronic transcript; alternative monocistronic transcript(s) exist."
  • "Dicistronic transcript."
  • "Based on cDNA(s) that contain premature stop codon; may or may not produce functional polypeptide."
  • "Transcript postulated to overlap transposable element."

G.8. Gene Model Annotation Guidelines (2012)

Note on annotation of different gene classes

Originally, manual annotation efforts concentrated on models of protein-coding genes. With the availability of RNA-Seq coverage data, long non-coding RNAs (lncRNAs) are a growing class of new gene models requiring manual annotation.

FlyBase relies on outside expert annotations for the various small non-protein-coding classes. Annotations of tRNAs have been stable since r3.2 and were based on tRNAscan (Lowe and Eddy, 1997, NAR 25:955-964). Annotations of miRNAs are based on data compiled by miRBase (www.mirbase.org) and periodically updated. Annotations of other small RNA classes are based primarily on published sources; these include snoRNAs (FBrf0199239 and others), snRNAs (FBrf0128209, FBrf0193533 and others), and 5S rRNA genes (FBrf0041596).

Note on transcript and protein ID changes

The following policy was established in 2008: Any change to an existing transcript that results in a change to the CDS will be accompanied by changes to the transcript and protein symbols and IDs.

Implementation of changes to annotation guidelines

Aside from minor changes, the previous annotation guidelines were written in 2007 (see section G.7., above) and stood us in good stead until various new classes of high throughput data started to be made available. Reannotation of existing gene models based on the new 2012 guidelines is a work in progress. The release number corresponding to the last time a gene model was reviewed is now noted in the Gene Model Comment section; these guidelines are reflected in gene models reviewed during release 5.45 and later.

Curator comments (see section G.8.5.).

The Apollo annotation tool allows for the inclusion of comments associated with an annotated gene or a specific transcript of an annotated gene. We make extensive use of this capability, including standardized comments as well as free text comments. In the sections that follow, there are frequent references to curator comments; further explanation and a complete list of current standardized comments may be found in section G.8.5.


G.8.1. Types of data that inform gene model annotation (2012)


Initial annotation efforts relied upon three primary data types:
  • cDNA sequence data (high-throughput cDNA/EST data or data generated by the community)
  • Gene prediction data, including conserved protein signatures
  • BLASTX homologies

Starting in 2010, new high-throughput data sets, primarily from the modENCODE project, have had a significant impact on gene model annotation:
  • RNA-Seq coverage data
  • Stranded RNA-Seq coverage data
  • RNA-Seq exon junction data
  • Transcription start site (TSS) data
  • Translation stop-codon read-through predictions

These data can be viewed in GBrowse in the aligned evidence tracks; more information about a specific dataset may be found via links in the GBrowse data tracks listings and in the FlyBase collection reports. Some smaller-scale datasets may not be presented in GBrowse; in these cases, an explanation and reference are provided in the gene or transcript comment section (see ‘Curator comments’). Exceptions include data from publications, including supplementary data, which have not been submitted to a sequence database.


G.8.2. Rules and criteria for annotation (2012)

Classification of a new gene as coding or non-coding

Most new annotations are based on RNA-Seq junction or coverage data. These data were used by modENCODE to isolate cDNAs by inverse-PCR, so there may also be new cDNA data.

Many new annotations are small genes that may be non-coding (lncRNA genes) or encode small polypeptides. Current knowledge of both of these categories is rudimentary, thus FlyBase annotators often must make judgment calls. A primary consideration in this process is whether a potential ORF shows a pattern of conservation among the species in the melanogaster subgroup. New annotations that are difficult to categorize are flagged with a comment stating that the opposite case may be true; see ‘Curator comments,’ below.

A cDNA or an RNA-Seq junction may support the possibility of an antisense RNA gene. If there is also support from stranded RNA-Seq data, a non-coding gene annotation is created and a comment identifying it as antisense is appended; see ‘Curator comments,’ below.

Structure of the transcript(s)

Transcription start site: TSSs rarely consist of a single definitive nucleotide location.
  • For cases with data from modENCODE mapping the TSS frequency distributions (FBrf0213250), the 5’ ends of all overlapping transcripts are set to the 90% TSS point. This is the point at which a summation algorithm hits 0.9 (starting from the 3’-most TSS and moving 5’).
  • If no modENCODE TSS data are available, the 5’ extent of the 5’-most EST or cDNA is used for all overlapping transcripts.
  • Short-capped RNA data may be used (FBrf0209722); if so, a comment is appended (see ‘Curator comments’).
  • If none of the above are available, but there are robust RNA-Seq data, an estimate based on RNA-Seq coverage data is made.
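The 90% TSS point can be illustrated with a short cumulative sum. This is a sketch under stated assumptions, not the modENCODE or FlyBase implementation: the input is taken to be a mapping from genome position to TSS read count, and the strand handling (3'-most means the highest coordinate on the plus strand) is an assumption of the example.

```python
def tss_90_percent_point(tss_counts, strand="+"):
    """Return the position at which the cumulative TSS signal reaches 90%.

    Summation starts at the 3'-most TSS and moves 5'.
    tss_counts maps genome position -> integer read count.
    """
    total = sum(tss_counts.values())
    if total <= 0:
        raise ValueError("no TSS signal")
    # Plus strand: 3'-most is the highest coordinate; minus strand: the lowest.
    positions = sorted(tss_counts, reverse=(strand == "+"))
    running = 0
    for pos in positions:
        running += tss_counts[pos]
        # Integer comparison avoids floating-point rounding at the 0.9 boundary.
        if running * 10 >= total * 9:
            return pos
```

The returned position would then be used as the 5' end of all overlapping transcripts.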

Internal intron/exon structure:
  • cDNAs are the primary data source for internal gene structure, with alternative transcripts based also on EST and RNA-Seq junction data (see ‘Alternative transcripts and the permutation problem,’ below).
  • Some gene models are still primarily supported by gene prediction or protein alignment data, but these have significantly dropped in number.
  • Non-canonical splices require a high level of support. With the exception of the AT-AC splice pair, thus far all supported non-canonical splice sites vary by only one nucleotide from the most common, GT-AG, and the first nucleotide of all donor sites is ‘G’. Whenever splice sites other than GT/AG or GC/AG are annotated, a comment is appended to the transcript.
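The splice-site rule in the last point can be sketched as a simple check on the intron's terminal dinucleotides. The function name is hypothetical, and the intron is assumed to be supplied 5' to 3' on the transcribed strand.

```python
# Donor/acceptor pairs that do not require a transcript comment under this rule.
UNCOMMENTED_PAIRS = {("GT", "AG"), ("GC", "AG")}

def needs_splice_comment(intron_seq):
    """Return True if the intron's donor/acceptor pair calls for a comment."""
    if len(intron_seq) < 4:
        raise ValueError("intron too short to contain donor and acceptor sites")
    donor = intron_seq[:2].upper()
    acceptor = intron_seq[-2:].upper()
    return (donor, acceptor) not in UNCOMMENTED_PAIRS
```

An AT-AC intron, for instance, would return True and so receive a comment.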

3’ terminus: 3’ UTRs may have many polyadenylation sites; no attempt is made to annotate transcripts representing all possibilities.
  • If a polyadenylated cDNA is available, most transcripts are extended 3' to the last non-A nucleotide of the cDNA. A comment is added to each transcript so defined (see ‘Curator comments’).
  • If RNA-Seq coverage data support 3’ UTR sequences beyond that present in a cDNA, at least one transcript is extended 3’ to the approximate terminus supported by the RNA-Seq data and an explanatory comment is added (see ‘Curator comments’).

Annotation of the coding region

The Apollo annotation tool sets the translation start site to the 5'-most in-frame ATG. In cases supported by the literature (including conservation patterns across Drosophila species), a non-ATG translation start site or a downstream ATG may be used; an explanatory comment is appended (see ‘Curator comments’).

The ‘Exceptional cases’ section below discusses non-ATG starts, stop-codon readthrough and other atypical phenomena affecting the defined coding region.

Partial annotations are avoided except in heterochromatic regions where there may be sequence gaps or genomic sequence mis-assembly. In the past, there were some cases for which it was not possible to identify a likely ATG start codon; translation was started at the 5'-most internal in-frame codon and an explanatory comment added. All such cases in the euchromatin have now been resolved.

Alternative transcripts and the permutation problem

Alternative transcripts are annotated based on cDNA/EST data, RNA-Seq data, and community data. Originally, we also annotated an alternative transcript if there was convincing gene prediction evidence and/or BLASTX evidence; however, almost all alternative transcripts are now supported by RNA-based data.

Frequently, RNA-Seq junction data support many alternative splices within the 5’ UTR of a gene. For a given TSS, not all such splices may be annotated. If this is the case, a comment is included in the Gene Model Comment section (see ‘Curator comments’).

RNA-Seq junctions that are of much lower frequency than alternative junctions may not be annotated (see ‘Assessment of supporting data,’ below). If this is the case, a comment is included in the Gene Model Comment section (see ‘Curator comments’).

If non-contiguous data, such as RNA-Seq junction, EST, and TSS data, support alternative exons in several regions of a gene, it is usually not possible to determine which of all possible combinations actually exist in vivo. We call this the “permutation problem.” Combinations supported by full-length cDNAs are annotated. The number of additional transcripts to be created is at the discretion of the annotator. Excluding low-frequency junctions, all alternative splices within the CDS and all promoters are represented, but not necessarily all possible combinations. If not all combinations are represented, a comment is included in the Gene Model Comment section (see ‘Curator comments’).
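
A short sketch shows why the permutation problem grows quickly: if the alternatives in each region are independent, the number of possible transcripts is the product of the number of options per region. The gene layout, region names, and exon labels below are hypothetical.

```python
from itertools import product
from math import prod
from typing import Dict, List


def count_combinations(alternatives: Dict[str, List[str]]) -> int:
    """Number of possible transcripts if every region varies independently."""
    return prod(len(options) for options in alternatives.values())


def enumerate_combinations(alternatives: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """List every combination, one choice per region."""
    regions = list(alternatives)
    return [dict(zip(regions, combo))
            for combo in product(*(alternatives[r] for r in regions))]


# Hypothetical gene: two promoters, a three-way internal choice, two terminal exons.
example = {
    "promoter": ["P1", "P2"],
    "internal exon": ["E4a", "E4b", "skipped"],
    "terminal exon": ["T1", "T2"],
}
print(count_combinations(example))           # 12
print(enumerate_combinations(example)[0])    # {'promoter': 'P1', 'internal exon': 'E4a', 'terminal exon': 'T1'}
```

Only a subset of these 12 combinations would typically be annotated: those supported by full-length cDNAs, plus enough additional transcripts to represent every promoter and every alternative splice within the CDS at least once.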

Alternative transcripts: cases that disrupt the CDS

Due to a retained intron or alternative splice, cDNA, EST, or RNA-Seq junction data may support an alternative transcript that would result in a premature stop codon or a downstream start – it usually cannot be determined which. A higher level of support is required for the annotation of such a transcript (multiple cDNA/ESTs or a high-frequency junction), and a comment is appended to the transcript (see ‘Curator comments’). However, rather than continue to annotate truncated proteins corresponding to these transcripts, FlyBase is developing a proposal to reclassify them as non-coding transcripts produced from protein-coding gene loci.

Merges, splits, and “splerges”

Gene splits or merges are a common annotation correction and are based upon RNA-Seq coverage or junction data, cDNA/EST data, BLASTX homologies, or corrections submitted by the community. A comment is placed in the gene record indicating that a merge or split has occurred, in which release, along with an indication of the type of data supporting the change.

Generally, FlyBase considers transcripts for which any portion of the predicted protein is in common to represent a single gene. Examples are considered on a case-by-case basis; some exceptions to this rule have been made for well-characterized genes that exist as separate unrelated entities in other phylogenetic groups.


G.8.3. Exceptional cases (2012)

Many of the curator comments (see section G.8.5.) flag exceptional types of annotations.

Dicistronic and polycistronic annotations: Proteins encoded by a multicistronic transcript are considered to represent different genes. Preferably, a polycistronic transcript is supported by more than one spanning cDNA or EST; care must be taken not to misinterpret cases of overlapping UTRs. Alternative explanations, such as a mutant in the strain or stop-codon suppression, must be ruled out. Each postulated protein should have additional support. This is not a particularly rare “exceptional” class: there are currently more than 130 dicistronic pairs annotated, 4 tricistronic sets and 2 tetracistronic sets.

Stop-codon suppression/readthrough: This class requires support from the literature, including evolutionary comparative data. As a result of work from one of the modENCODE groups (FBrf0216845), this is no longer a rare class: there are currently more than 300 genes with a transcript annotated with a stop-codon readthrough.

Non-ATG starts and translational frameshifts: These require a high level of support, such as detailed treatment in the literature or unambiguous homology data. There are currently 11 genes annotated with a non-ATG start and one with a translational frameshift (Oda). To date, all non-canonical starts vary from ‘AUG’ by one nucleotide.

Trans-spliced transcripts: One gene in Dmel undergoes extensive trans-splicing (mdg4); others may undergo lower levels. If there is sufficient evidence, the trans-splicing precursors should also be annotated.

Pseudogenes: A non-functional gene that (1) has a related gene in the genome and (2) has more than one compromising lesion, is classified as a pseudogene. If there is only one lesion, it is described as a mutant in the strain (see below); polymorphic pseudogenes are treated as mutations in the strain. Retrotransposed pseudogenes are relatively rare: 5 are currently flagged as such. (Retrogenes exist -- at least 98 have been identified -- but nearly all appear to be functional.)

Mutations in the strain: These are now relatively easily assessed, since there is sequence information for multiple Dmel wild-type strains and for closely related species. Cases for which some wild-type strains carry a functional allele and others carry the mutant allele are flagged with a comment that they represent polymorphic pseudogenes.

Chimeric genes: These are occasionally created at the site of an aberration (usually a tandem duplication). Evidence for expression of such a gene is often ambiguous, since ESTs and RNA-Seq data corresponding to the component genes also may align to the chimeric copy. If a CDS appears to be supported, the gene is classified as coding, but is flagged as “Gene model uncertain.” If it appears unlikely that a protein product is produced, the gene is classified as a pseudogene, and also flagged as “Gene model uncertain.”

Stretching the definition of a gene: As described above, transcripts for which any portion of the predicted protein is in common are usually considered to be products of a single gene. A number of genes with very complex alternative splicing patterns have pairs of transcripts with coding regions that fail to overlap each other, but both of which overlap the coding region of a third transcript. We flag these cases with the following comment: "Gene model includes transcripts encoding non-overlapping portions of the full CDS."
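
The grouping described above is transitive: two transcripts whose coding regions do not overlap still belong to one gene if both overlap a third. A minimal sketch, assuming that coordinate overlap of CDS exons on the same strand stands in for "any portion of the predicted protein in common" (reading frame is ignored here, which is a simplification). The transcript names and coordinates are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) of one CDS exon


def cds_overlap(a: List[Interval], b: List[Interval]) -> bool:
    """True if any CDS exon of one transcript overlaps any CDS exon of the other."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def group_into_genes(cds_by_transcript: Dict[str, List[Interval]]) -> List[Set[str]]:
    """Group transcripts into genes as connected components of CDS overlap."""
    names = list(cds_by_transcript)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cds_overlap(cds_by_transcript[a], cds_by_transcript[b]):
                parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=lambda g: sorted(g))


example = {
    "RA": [(100, 200)],
    "RB": [(500, 600)],
    "RC": [(150, 550)],   # overlaps both RA and RB
    "RD": [(900, 1000)],  # overlaps nothing
}
print(group_into_genes(example))
```

Here RA and RB share no coding sequence, but both overlap RC, so RA, RB, and RC fall into one group while RD stands alone. A gene grouped this way would carry the comment quoted above.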

Ambiguous cases: Generally, if a case is ambiguous, rather than creating an exceptional annotation a comment is added.


G.8.4. Assessment of supporting data (2012)

One of the most difficult aspects of annotation is assessing the validity of the supporting data. Most problems are recognized only after a period of time, during which a pattern emerges. Examples: the RE and RH cDNA libraries have a higher frequency of genomically primed clones; the GenScan prediction algorithm tends to overpredict, creating long annotations with many small exons (it is more appropriate for vertebrates); unspliced ESTs must be viewed with caution, especially those from the Exelixis libraries; the IP cDNAs exhibit a higher frequency of chimeric or atypically spliced clones.

The new high-throughput RNA-Seq, RNA-Seq junction, and TSS data must be viewed with the same caveats. We are still developing annotation guidelines for dealing with these datasets; it is an advantage that they include some quantitative measures of validity and/or frequency. Low-frequency junctions, for example, may not be used in a gene model; if this is the case, a comment is added (see ‘Curator comments’). Any type of aligned data is problematic in regions of repeats; this is true of RNA-Seq coverage and junction data.


G.8.5. Curator comments: Current list of standardized annotation comments (2012)

If curator comments exist for a particular annotation, they can be found on the Gene Report, in the GENE MODEL AND FEATURES section, in a field labeled Comments on Gene Model. If comments exist for a particular transcript, they can be found on the annotated transcript report in a section called COMMENTS. This section will appear on the transcript report only if there are comments associated with the transcript. SO terms and ID numbers are included whenever appropriate (see TermLink).

Annotation Comments (apply to the whole gene model)

  • Annotated transcripts do not represent all possible combinations of alternative exons and/or alternative promoters.
  • Annotated transcripts do not represent all supported alternative splices within 5' UTR.
  • Low-frequency RNA-Seq exon junction(s) not annotated.
  • Supported by RNA-Seq data.
  • Supported by strand-specific RNA-Seq data.
  • Probable lncRNA gene; may encode small polypeptide(s).
  • Possible non-coding RNA gene.
  • Antisense: overlaps [] on opposite strand.
  • Antisense (in part): overlaps [] on opposite strand.
  • gene_with_dicistronic_mRNA ; SO:0000722
  • gene_with_polycistronic_transcript ; SO:0000690
  • May be component of a dicistronic gene; available data inconclusive.
  • Shares 5' UTR with upstream gene.
  • Shares 5' UTR with downstream gene.
  • Gene model includes transcripts encoding non-overlapping portions of the full CDS.
  • Pseudogene similar to []; proximate; partial; created by tandem duplication.
  • Pseudogene similar to []; transposed.
  • Apparent introns not annotated: probable artifact due to repetitive sequence.
  • miRNA(s) located within the transcribed region of this non-coding RNA gene.
  • Alternative translation stop created by use of multiphasic reading frames within coding region.
  • Variable use of small exon; supported combination results in frameshift and premature stop in downstream exon.
  • Multiphase exon postulated: exon reading frame differs in alternative transcripts.
  • Multiphase exon postulated: reading frame of first coding exon differs in alternative transcripts.
  • Mutation in sequenced strain: [*].
  • Polymorphic pseudogene: intact in some individuals or strains, disrupted by mutation in others.
  • Gene model uncertain: []
  • Gene model uncertain: chimeric gene.
  • Gene model is incomplete due to []
  • gene_with_transcript_with_translational_frameshift ; SO:0000712
  • Translational frameshifting postulated (FBrfnnnnnnn): -1 [+1] frameshift reflected in aa sequence of predicted polypeptide[s].
  • gene_with_stop_codon_redefined_as_selenocysteine ; SO:0000710
  • Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn).
  • gene_with_stop_codon_read_through ; SO:0000697
  • Stop-codon suppression (Uxx) postulated; FBrfnnnnnnn.
  • gene_with_unconventional_translation_start_codon ; SO:0001739
  • gene_with_translation_start_codon_CUG ; SO:0001740
  • Unconventional translation start (XYZ) postulated; FBrfnnnnnnn.
  • gene_with_trans_spliced_transcript ; SO:0000459
  • Multiphase exon postulated: this gene shares a region of coding sequence with an overlapping gene, but different reading frames are utilized in the overlapping coding region.
  • Bidirectional region of coding sequence postulated: a portion of the CDS of this gene overlaps a portion of the CDS of a gene on opposite strand.

Transcript Comments (apply to a specific transcript)

  • Transcript terminates at site supported by polyadenylated cDNA.
  • Extended 3' UTR based on RNA-Seq and/or EST data.
  • UTR(s) based on RNA-Seq data.
  • Transcriptional initiation is supported by short-capped RNA data (FBrf0209722).
  • Evidence supports alternative splice leading to premature stop codon and/or downstream start; may or may not produce functional polypeptide.
  • Based on cDNA(s) with retained intron; results in premature stop codon and/or downstream start; may or may not produce functional polypeptide.
  • Unconventional splice site postulated (XY-WZ).
  • Non-coding alternative transcript supported, [retained intron/alternative splice] (FBrfnnnnnnn).
  • Truncated polypeptide supported, [retained intron/alternative splice] (FBrfnnnnnnn).
  • Truncated polypeptide supported, [alternative downstream AUG/alternative terminal exon] (FBrfnnnnnnn).
  • Monocistronic transcript; alternative dicistronic transcript(s) exist.
  • Dicistronic transcript; alternative monocistronic transcript(s) exist.
  • Dicistronic transcript.
  • Polycistronic transcript.
  • Unconventional splice site invoked (XY-WZ); sequence altered due to transposon insertion; this splice may not occur in vivo.
  • Unconventional splice site(s) invoked due to gap in genomic sequence; this splice does not occur in vivo.
  • Stop-codon suppression (UGA as Sec) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
  • Stop-codon suppression (Uxx) postulated (FBrfnnnnnnn); reflected in aa sequence of predicted polypeptide.
  • Unconventional translation start postulated (XYZ encoding Met); FBrfnnnnnnn.
  • Downstream translation start supported by comparative analysis across Drosophila species.
  • Downstream translation start supported by [FBrfnnnnnnn].
  • Transcript postulated to overlap transposable element.

G.9. What does the annotation evidence score mean?

The current implementation of the evidence scoring system is based on assessment of three different classes of evidence used to inform transcript annotations. These are

  • 1) gene prediction algorithms,
  • 2) aligned nucleotide sequences, and
  • 3) overlapping regions of protein similarity.

Note that, in the future, we plan to refine this scoring metric to include support based on comparative genomics and proteomic analyses, as well as to potentially provide details on the quantity and quality of each type of support.

Each transcript gets a score that is based on the sum of the following categories:

  • 1 point if one or more aligned EST sequences are fully consistent with the annotated transcript.
  • 2 points if an annotated exon intersects a region of aligned protein similarity
    (note that similarity to self is excluded)
  • 4 points if there is any gene prediction that is fully consistent with the annotated transcript
  • 8 points if one or more aligned cDNAs are fully consistent with the annotated transcript.

Because each type of evidence is assigned a distinct power of two, every possible combination of supporting evidence types yields a unique score. The score therefore indicates easily and unambiguously which types of evidence support a particular transcript annotation.

For example, to identify all transcripts with cDNA support, look for transcripts with a score greater than or equal to 8. To identify transcripts with no aligned nucleotide support, search for transcripts with scores of 0, 2, 4, or 6. To identify transcripts with both supporting ESTs and gene prediction support, but without a full-length cDNA or protein similarity, search for transcripts with a score equal to 5.
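
Since the point values are 1, 2, 4, and 8, a score can be treated as a set of bit flags. The sketch below is an illustration under that reading; the function names and the evidence labels used as dictionary keys are hypothetical and do not come from a FlyBase tool.

```python
from typing import Iterable, Set

# Point values as listed above; each is a distinct power of two.
EVIDENCE_POINTS = {
    "EST": 1,
    "protein similarity": 2,
    "gene prediction": 4,
    "cDNA": 8,
}


def evidence_score(evidence_types: Iterable[str]) -> int:
    """Sum the points for each distinct type of supporting evidence."""
    return sum(EVIDENCE_POINTS[name] for name in set(evidence_types))


def decode_score(score: int) -> Set[str]:
    """Recover the set of evidence types from a score between 0 and 15."""
    if not 0 <= score <= 15:
        raise ValueError("score must be between 0 and 15")
    return {name for name, points in EVIDENCE_POINTS.items() if score & points}


print(evidence_score(["EST", "gene prediction"]))   # 5
print(sorted(decode_score(13)))                     # ['EST', 'cDNA', 'gene prediction']

# Scores with no aligned nucleotide support (neither EST nor cDNA).
no_nucleotide = [s for s in range(16) if not decode_score(s) & {"EST", "cDNA"}]
print(no_nucleotide)                                # [0, 2, 4, 6]
```

Testing a single bit, as decode_score does, answers questions such as "does this transcript have cDNA support?" without enumerating score values by hand.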

Support means different things for different classes of evidence.

For gene prediction support, the ends of the predicted gene model must either match or be within the annotated CDS of a transcript, and the internal predicted exon/intron junctions must match the annotated junctions along the entire length of the prediction.

The rules are the same for EST and cDNA alignments except that the assessment is based on the entire annotated transcript and not just the coding region.
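
The consistency rule can be sketched as follows. This is one strict reading of the rule, not the scoring code itself: exons are inclusive (start, end) pairs on a single strand, both ends of the evidence must fall inside annotated exons, and every annotated junction spanned by the evidence must be matched exactly. For gene predictions the reference exons would be the CDS exons; for ESTs and cDNAs they would be the exons of the whole transcript. All names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # inclusive (start, end)


def introns(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Junctions as (end of upstream exon, start of downstream exon)."""
    ordered = sorted(exons)
    return [(up_end, down_start)
            for (_, up_end), (down_start, _) in zip(ordered, ordered[1:])]


def in_any_exon(pos: int, exons: List[Exon]) -> bool:
    return any(start <= pos <= end for start, end in exons)


def is_fully_consistent(reference_exons: List[Exon], evidence_exons: List[Exon]) -> bool:
    """True if the evidence lies within the reference and matches its junctions."""
    reference = sorted(reference_exons)
    evidence = sorted(evidence_exons)
    ev_start, ev_end = evidence[0][0], evidence[-1][1]

    # Both ends of the evidence must match or fall within annotated exons.
    if not (in_any_exon(ev_start, reference) and in_any_exon(ev_end, reference)):
        return False

    # Annotated junctions spanned by the evidence must all be present, exactly.
    expected = [j for j in introns(reference) if ev_start <= j[0] and j[1] <= ev_end]
    return introns(evidence) == expected


annotated = [(100, 200), (300, 400), (500, 600)]
print(is_fully_consistent(annotated, [(150, 200), (300, 400), (500, 550)]))  # True
print(is_fully_consistent(annotated, [(150, 200), (310, 400)]))              # False
print(is_fully_consistent(annotated, [(50, 200), (300, 400)]))               # False
```

The second example fails because its acceptor site (310) does not match the annotated junction (300); the third fails because it starts outside the annotated region.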

For protein similarity, a positive score is simply based on a region of aligned protein sequence overlapping any annotated CDS exon of an annotated transcript on the same strand. This simplistic assessment likely produces a fair number of false positives, and we hope to refine this aspect of assessment to provide more meaningful confidence values.