Evolutionary signatures for protein-coding genes
Findings from studying evolution across the genomes:
- Evolutionary signatures: Reading Frame Conservation, Codon Substitution Biases, Ka/Ks ratio
- Scaling to multiple species: increased separation between genes and non-coding regions
- Combining these signatures: Classification.
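As an illustration of the first signature, reading frame conservation can be computed from a pairwise alignment by checking whether aligned bases sit at the same codon phase in both species. The sketch below is a minimal whole-alignment version under stated assumptions: the function name `reading_frame_conservation`, the use of `-` for gaps, and taking the best of the three informant offsets are illustrative choices, not details given in these notes.

```python
def reading_frame_conservation(ref: str, informant: str) -> float:
    """Fraction of aligned columns where both sequences share codon phase.

    Assumes `ref` and `informant` are rows of one pairwise alignment
    (equal length, '-' for gaps). The informant's starting phase is
    unknown, so all three offsets are tried and the best is returned.
    """
    if len(ref) != len(informant):
        raise ValueError("aligned sequences must have equal length")

    best = 0.0
    for offset in range(3):
        ref_pos = 0          # ungapped position in the reference
        inf_pos = offset     # ungapped position in the informant, shifted
        in_frame = 0
        aligned = 0
        for r, q in zip(ref, informant):
            r_gap = r == "-"
            q_gap = q == "-"
            if not r_gap and not q_gap:
                aligned += 1
                if ref_pos % 3 == inf_pos % 3:
                    in_frame += 1
            if not r_gap:
                ref_pos += 1
            if not q_gap:
                inf_pos += 1
        if aligned:
            best = max(best, in_frame / aligned)
    return best


# A 3-base gap keeps the frame; a 1-base gap shifts everything downstream.
codon_gap_score = reading_frame_conservation("ATGGCCAAATGA", "ATGGCC---TGA")
single_gap_score = reading_frame_conservation("ATGGCCAAATGA", "ATGGC-AAATGA")
```

Gaps whose lengths are multiples of three leave the score at 1.0, while a single-base indel puts every downstream base out of frame, which is the behaviour that separates coding from non-coding alignments.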
Revisiting the D. melanogaster protein-coding gene catalog
The purpose here is not to study gene evolution (see Eisen paper), but rather to leverage the 12 genomes to improve gene annotation quality.
- People involved:
- MIT (Manolis Kellis, Mike Lin)
- FlyBase (Bill Gelbart, Peili Zhang, and FlyBase-Harvard curators)
- BDGP (Sue Celniker, Joe Carlson)
- Identification of novel protein-coding sequence
- Method
- CONGO similar in effect to Siepel's Exoniphy, but uses more flexible discriminative algorithms
- Predictions
- Experimental results/validations
- BDGP August iPCR runs - new cDNAs in GenBank
- BDGP heterochromatin cDNAs (unpublished data)
- FB4.2 vs 4.3
- FB human curation (~400 annotation changes)
- Intersection with Affy transfrags (so-so enrichment)
- Mass spec (ETH Zurich/Sandra Lovenich, Erich Bruggner, Konrad Basler, Ernst Hafen)
- To Do:
- Which genes are these (Dscam etc.), and how many exons did they have prior to this (alternative splicing!)?
- Is the exon length a multiple of 3, or does it otherwise fit with splicing requirements?
- What fraction of novel exons is expected to fall in an intron (1/3?)
- Dubious genes
- Method: identify genes for which we find no comparative evidence that they are real, by multiple metrics and in multiple alignment sets
- Properties of set
- length/RFC distribution
- lack of GO terms, lack of names
- lack of cDNA/EST evidence
- many single exon/single ORF
- For some of those that are transcribed, we predict conserved noncoding elements in the transcripts (RNA genes, microRNA genes)
- "Confirmed" hypothetical genes
- Method
- Identify genes without cDNA/EST evidence, but with strong evolutionary evidence in syntenic alignments
- Limited definition of "confirmed" (can be sure they are protein-coding, but can't really confirm gene structure or identify alt. splicing variation)
- Corrections and adjustments to existing annotations
- Translation start adjustments: evolutionary evidence suggests translation starts at a downstream ATG
- Transcript model corrections: by detecting frameshifts near an intron
- ORF corrections: wrong ORF currently annotated
- Summary: revised gene catalog
Discovery of non-canonical genic phenomena
- People involved:
- MIT (Manolis Kellis, Mike Lin)
- Harvard (Bill Gelbart, Andy Schroeder, others from FlyBase-Harvard)
- UCSC (Jakob Pedersen on RNA involvement in recoding)
- Translational readthrough: observe protein-coding signatures continuing straight past the stop codon
- Frameshifts: observe adjacent windows conserved in different frames (not near an intron)
- Polycistronics/uORFs: observe well-conserved disjoint ORFs in known transcript models
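Before protein-coding signatures can be scored past a stop codon, the candidate readthrough region has to be delimited. A minimal sketch of that first step, assuming the region of interest runs from the annotated stop to the next in-frame stop; the function name `readthrough_region` and its arguments are hypothetical, and scoring the region is left to whichever signature is used.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def readthrough_region(transcript: str, stop_index: int):
    """Return (start, end) of the in-frame stretch after an annotated stop.

    `stop_index` is the 0-based position of the first base of the
    annotated stop codon. `start` is the first base after that stop and
    `end` is the first base of the next in-frame stop codon, so
    transcript[start:end] is the candidate readthrough sequence.
    Returns None when no downstream in-frame stop exists.
    """
    seq = transcript.upper()
    if seq[stop_index:stop_index + 3] not in STOP_CODONS:
        raise ValueError("stop_index does not point at a stop codon")

    start = stop_index + 3
    # Step codon by codon in the frame of the annotated stop.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (start, i)
    return None


example = "ATGGCCTGAGGATCCTAAGG"
region = readthrough_region(example, 6)   # annotated stop TGA at index 6
```

Here `example[region[0]:region[1]]` is the two-codon stretch between the first and second in-frame stops, which is the window that would then be tested for coding signatures.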
Conserved non-coding regions
- People involved:
- MIT (Manolis Kellis, Mike Lin, Huy, Alex Stark, Pouya, Leo)
- Harvard (Bill Gelbart)
- CSHL (Greg Hannon, Julius Brennecke)
- Whitehead (Dave Bartel, Graham Ruby)
- UCSC (Jakob Pedersen)
- "ultraconserved" elements
- 1851 elements > 60 nt, 100% conserved in at least 11 species
- Enriched in intron/exon boundaries and intergenic regions
- Intron CNEs enriched for transcription factor genes
- Intron/exon CNEs enriched in nervous system proteins/channels.
- Overlap with known enhancer elements?
- BLAST/BLAST pipeline (other species) is in place (2 days)
- BLAST to Dmel for other family members: data is there (2 days)
- RNA genes
- tRNAs
- snoRNAs, snRNAs, rRNAs
- New types of RNA genes
- Secondary structure properties of mRNAs
- microRNAs: conservation-based identification of Drosophila miRNAs
- prediction selects against exons, transposons and repeat sequence
- top predictions have miRNA-like features not used for prediction
- top novel hairpins are validated by library cloning (Bartel, Hannon) with 90% accuracy
- 28 novel miRNAs validated, 9 prev. predicted are confirmed by cloning, 6 are corrected
- new miRNA family members, new miRNA families
- targets for new miRNAs
- miRNAs in the introns of msi, kis, E2F, cdc2D
- mature miRNA 5' ends can be predicted with high accuracy; exceptions highlight importance of star sequence (though this is not a general trend!)
- prediction accuracy scales with branch length; prediction of clade-specific miRNAs with high accuracy is currently impossible
- estimate of (conserved) Drosophila miRNAs: < 150
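The "ultraconserved" criterion listed earlier in this section (> 60 nt, 100% conserved in at least 11 species) can be expressed as a scan over multiple-alignment columns. The sketch below is one possible reading, not the procedure actually used: it assumes an uppercase alignment stored as a dict of equal-length rows, treats a column as conserved when at least `min_species` rows (reference included) match the reference base, and does not require the same species to agree along the whole run. The name `ultraconserved_runs` and its parameters are hypothetical.

```python
def ultraconserved_runs(alignment, reference, min_species=11, min_len=61):
    """Find runs of alignment columns conserved in >= min_species rows.

    `alignment` maps species name -> aligned sequence ('-' for gaps).
    Returns half-open (start, end) column intervals of length >= min_len.
    """
    ref = alignment[reference]
    rows = list(alignment.values())
    if any(len(row) != len(ref) for row in rows):
        raise ValueError("all aligned rows must have equal length")

    runs = []
    run_start = None
    # Iterate one column past the end so a trailing run is closed.
    for col in range(len(ref) + 1):
        conserved = False
        if col < len(ref) and ref[col] != "-":
            matches = sum(1 for row in rows if row[col] == ref[col])
            conserved = matches >= min_species
        if conserved:
            if run_start is None:
                run_start = col
        elif run_start is not None:
            if col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs


# Toy alignment with three species and relaxed thresholds.
toy = {
    "a": "ACGTACGTAC",
    "b": "ACGTACTTAC",
    "c": "ACGAACGTAA",
}
strict = ultraconserved_runs(toy, "a", min_species=3, min_len=3)
relaxed = ultraconserved_runs(toy, "a", min_species=2, min_len=4)
```

With the defaults, `min_len=61` encodes "> 60 nt" and `min_species=11` allows one species to be missing or divergent at each column.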
Gene regulation
- People involved:
- MIT (Manolis Kellis, Alex Stark)
- Promoter motifs
- Properties of known regulatory motifs
- Signatures for motif discovery
- Computational validation (against known motifs, tissue-specific expression, GO, positional bias, no strand bias)
- 3' UTR motifs
- Role in miRNA regulation
- Role in identification of new miRNA genes
- Other elements: Pumilio (PUF) binding sites (incl. nanos)
- Identification of gene targets
- Transcriptional regulation
- Targets of miRNA genes
- Motif combinations and grammars
- Towards motif-based gene regulatory networks
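For the link between 3' UTR motifs and miRNA targets, a common starting point is seed matching: looking in the UTR for the reverse complement of the miRNA seed. These notes do not say which targeting rule was applied, so the 7-mer seed (miRNA positions 2-8) used below is an assumption, as are the function name `seed_sites` and the example sequences.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def seed_sites(mirna: str, utr: str):
    """Return 0-based start positions of 7-mer seed matches in a 3' UTR.

    The seed is taken as miRNA nucleotides 2-8 (an assumption); a site is
    an exact occurrence of the seed's reverse complement in the UTR.
    RNA input is accepted and converted to the DNA alphabet.
    """
    mirna = mirna.upper().replace("U", "T")
    utr = utr.upper().replace("U", "T")
    if len(mirna) < 8:
        raise ValueError("miRNA must have at least 8 nucleotides")

    seed = mirna[1:8]
    site = "".join(COMPLEMENT[base] for base in reversed(seed))

    positions = []
    i = utr.find(site)
    while i != -1:
        positions.append(i)
        i = utr.find(site, i + 1)   # allow overlapping matches
    return positions


# Example mature miRNA sequence and a made-up UTR with two sites.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_utr = "AAACTACCTCAAACTACCTCA"
sites = seed_sites(example_mirna, example_utr)
```

Requiring the sites to be conserved across the aligned genomes, in the spirit of the motif signatures above, would be a further filter on top of this list.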
Discussion: Assessing power to identify functional elements with 12 genomes
- Evolutionary signatures
- Protein-coding genes
- miRNA genes
- motifs
- CNEs