All regions of 60 nt or greater (predicted ORF of 20 aa) and some smaller regions were viewed in FlyBase GBrowse with tracks containing existing gene models and RNA-Seq coverage data visible. With the RNA-Seq data available to FlyBase, we were not able to confirm expression in many cases. Analysis of conservation of the putative coding ORFs greater than 24 aa was extended to include additional sequenced species within the Sophophora subgenus. This extended analysis failed to show conservation in some cases. We have found that assessment of conservation becomes uninformative for smaller ORFs, because conservation at the nucleotide level is often so high that there is no discernible protein signature. These length cut-offs exclude the majority of the smORFs identified: 324 of the 401 are shorter than 25 aa.
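The length thresholds described above can be illustrated with a minimal sketch. It is not the procedure FlyBase used, which relied on manual review in GBrowse. The function and constant names are hypothetical, and the conversion from nucleotides to amino acids assumes simple division by three, consistent with the stated 60 nt to 20 aa correspondence.

```python
# Illustrative only: names are hypothetical, thresholds are taken from the text.
MIN_REVIEW_NT = 60         # all regions of 60 nt or greater were viewed
MIN_CONSERVATION_AA = 25   # extended conservation analysis: ORFs greater than 24 aa


def predicted_orf_aa(region_nt: int) -> int:
    """Predicted ORF length in aa, assuming 3 nt per codon (60 nt -> 20 aa)."""
    return region_nt // 3


def triage(region_nt: int) -> dict:
    """Report which length-based review steps a candidate region qualifies for."""
    aa = predicted_orf_aa(region_nt)
    return {
        "aa": aa,
        "always_viewed": region_nt >= MIN_REVIEW_NT,
        "extended_conservation": aa >= MIN_CONSERVATION_AA,
    }


# Fraction of the identified smORFs shorter than 25 aa (counts from the text).
excluded_fraction = 324 / 401
```

With the counts given, `excluded_fraction` is roughly 0.81, so only about one in five of the identified smORFs is long enough for the conservation analysis to be informative.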
Of the 50 ORFs we found to be supported by RNA-Seq data and to show conservation consistent with a protein-coding extent, many correspond to alternative or 5′ exons of larger genes; these include 5 new cases, which were incorporated into new transcript isoforms. A few of the protein-coding extents identified correspond to ORFs that have been incorporated into larger proteins via stop-codon readthrough (FBrf0216845). Nine appear to encode small polypeptides; 4 of these had been previously identified. Three of the newly identified small polypeptides are encoded within UTRs of larger genes. Details may be found in the associated file (link below) on the FlyBase FTP site.