A Database of Drosophila Genes & Genomes

FB2013_03, released May 7th, 2013
 

Dataset mE_Transcription_Start_Sites

General Information
Name: mE_Transcription_Start_Sites
Species: D. melanogaster
Dataset type: genomic sequence feature
FlyBase ID: FBlc0000202
Source & Content
Consists of
Genomic sequences identified by integrative analysis of EST, CAGE and RLM-RACE data.
Available from
Not available as reagents.
Stage & tissue
Comment: 0-24 hr AEL
Description & Members
Description
Genomic sequences identified as transcription start sites (TSSs) by integrating evidence from expressed sequence tags (ESTs), cap analysis of gene expression (CAGE) tags and RNA ligase-mediated rapid amplification of cDNA ends (RLM-RACE).
Experimental protocol
Sample preparation
See component data set reports for details.
Collection preparation
See component data set reports for details.
Mode of assay
See component data set reports for details.
Assay platform
See component data set reports for details.
Data analysis
Characterization of TSS distributions: within each TSS, the distribution of tags from each of the three assays was modeled as a multinomial distribution, with each bin corresponding to a single nucleotide. The tag distributions from the different assays tended to be shifted by 1 or 2 bp relative to one another. The smoothed distributions across the three assays were combined to obtain a consensus probability density function (PDF) for each TSS. A shape index (SI) was calculated for each TSS; the SI is analogous to the thermodynamic entropy of a system and quantifies the number of states occupied by the system (the tag heights and locations) and the total possible states (the entire promoter region). An SI value of -1 was chosen, somewhat arbitrarily, to separate TSSs into discrete classes: 2337 "peaked" TSSs with SI > -1, 6607 "broad" TSSs with SI <= -1, and 3456 TSSs designated as "unclassified" due to either low tag count (2487 TSSs) or class instability (982 TSSs).
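The peaked/broad split described above can be illustrated with a minimal Python sketch. Only the threshold of -1 comes from this report. The formula used here, SI = 2 + sum(p_i * log2(p_i)) over positions carrying tags, is an assumption suggested by the entropy analogy, and the function names and example tag counts are hypothetical. The low-tag-count and class-instability filters are not modeled.

```python
import math

SI_THRESHOLD = -1.0  # from the text: peaked if SI > -1, broad if SI <= -1


def shape_index(tag_counts):
    """Shape index for one TSS from per-nucleotide tag counts.

    ASSUMPTION: SI = 2 + sum(p_i * log2(p_i)), where p_i is the fraction
    of tags at position i. The report itself does not give the formula.
    """
    total = sum(tag_counts)
    if total <= 0:
        raise ValueError("TSS has no tags")
    # Positions without tags contribute nothing to the entropy term.
    probs = [count / total for count in tag_counts if count > 0]
    return 2.0 + sum(p * math.log2(p) for p in probs)


def classify_shape(tag_counts):
    """Return 'peaked' or 'broad' using the SI threshold from the text."""
    return "peaked" if shape_index(tag_counts) > SI_THRESHOLD else "broad"


# Hypothetical tag counts, one entry per nucleotide in the promoter region.
single_peak = [0, 0, 100, 0, 0]  # every tag at one nucleotide
spread_out = [10] * 16           # tags spread evenly over 16 nucleotides

print(classify_shape(single_peak), classify_shape(spread_out))
```

Under this assumed formula, a TSS with all tags at one nucleotide has the maximum SI of 2, and the SI falls as tags spread over more positions.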
Classifying TSS evidentiary support: TSSs were grouped by evidentiary support into three sets: validated (V), supported (S) and RACE-only (R). The validated set (8694 TSSs) is defined by two or more data types (5477 TSSs have all three data types). The supported set (3062 TSSs) is defined by either a CAGE peak or at least three RACE reads overlapping a 5' UTR. The RACE-only set (698 TSSs) is defined by three or more RACE reads with no support from an overlapping 5' UTR. The majority of unsupported CAGE peaks are likely associated with other phenomena, and not with bona fide transcription initiation sites.
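The grouping rules above can be written as a small decision function. This is a sketch under stated assumptions, not the published procedure: it assumes that RACE counts as a data type only at three or more reads, that the 5' UTR overlap requirement applies to both the CAGE and RACE branches of the supported set, and that candidates matching no rule are left unassigned. The function and parameter names are hypothetical.

```python
MIN_RACE_READS = 3  # from the text: "at least three RACE reads"


def classify_support(has_est, has_cage, race_reads, overlaps_5utr):
    """Return 'V', 'S', 'R', or None (no rule matched) for one candidate TSS."""
    # ASSUMPTION: RACE counts as a data type only at >= 3 reads.
    has_race = race_reads >= MIN_RACE_READS
    n_types = sum([has_est, has_cage, has_race])

    if n_types >= 2:
        return "V"  # validated: two or more data types
    # ASSUMPTION: the 5' UTR overlap applies to CAGE peaks as well as RACE reads.
    if overlaps_5utr and (has_cage or has_race):
        return "S"  # supported: one data type plus an overlapping 5' UTR
    if has_race:
        return "R"  # RACE-only: enough RACE reads, no 5' UTR support
    # ASSUMPTION: anything else (e.g. an unsupported CAGE peak) is unassigned.
    return None


print(classify_support(has_est=True, has_cage=True, race_reads=0, overlaps_5utr=False))
```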
Identification of tag clusters: an iterative hierarchical clustering procedure was devised to group tags into TSS regions and was applied to the RE EST, CAGE and RACE data sets independently. These clusters were then integrated to produce consensus clusters based on the tags from all three data sets. In total, 12,454 TSSs (promoters) were identified, of which 11,672 were associated with 8037 gene annotations (FlyBase release 5.12, October 2008) by a progressive strategy: peaks were associated first with 5' UTRs, then with regions within 100 bp of a 5' transcript end, followed by 3' UTRs, introns, protein-coding exons and finally other annotations (e.g., pseudogenes and regions within 100 bp of a 3' end). The remaining TSSs were classified as intergenic.
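The progressive association strategy amounts to a first-match lookup over an ordered list of annotation categories. The sketch below follows the order given in the text; the category labels, the function name and the input format (a mapping from category to the overlapping gene) are hypothetical, and the overlap computation itself is assumed to have been done elsewhere.

```python
# Priority order from the text; the label strings are hypothetical.
PRIORITY = [
    "five_prime_utr",
    "within_100bp_of_5prime_end",
    "three_prime_utr",
    "intron",
    "coding_exon",
    "other_annotation",  # e.g. pseudogenes, regions within 100 bp of a 3' end
]


def associate_tss(overlaps):
    """Pick the highest-priority annotation overlapping a TSS.

    overlaps: dict mapping a category label to the overlapping gene ID.
    Returns (category, gene_id), or ("intergenic", None) if nothing overlaps.
    """
    for category in PRIORITY:
        gene_id = overlaps.get(category)
        if gene_id is not None:
            return category, gene_id
    return "intergenic", None


# Hypothetical example: a peak overlapping an intron of one gene and the
# 5' UTR of another is assigned to the 5' UTR.
print(associate_tss({"intron": "geneA", "five_prime_utr": "geneB"}))
```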
Mapping of tags to the genome: a set of 66,169 cap-trapped and normalized RE ESTs (FBrf0152058) was reanalyzed to ensure accurate vector trimming and genomic alignment; 61,429 of these mapped uniquely to the genome. An EST was associated with a gene if its alignment shared genomic coordinates with either the start or stop codon, or with the start or end coordinate of any exon. See component collections for details of RACE and CAGE tag mapping.
Comments
modENCODE Transcription Start Sites
Core promoter sequence motifs are differentially enriched in the peaked and broad classes of promoters.
Genes with peaked promoters have a marked and highly significant tendency to be expressed in spatially and temporally restricted patterns, and genes with broad promoters do not.
CAGE peaks within 3' UTRs appear to be associated with cytoplasmic transcript degradation products, and not independent promoters.
Synonyms & Secondary IDs
Reported As
Symbol Synonym
Integrated promoters
mE_Transcription_Start_Sites
 
References (2)
Research paper
Hoskins et al., 2011, Genome Res. 21(2): 182-192
Genome-wide analysis of promoter architecture in Drosophila melanogaster. [FBrf0213090]
Supplementary material
Hoskins et al., 2011, Genome Res. 21(2):
Supplemental Data File 3. Integrated promoters. [FBrf0213250]