Subject: CG_but_not_EG genes

Sorry to bother you again.

Here is an analysis I performed on div. 1-3 genes reported by Celera (in version 1.0 of the Drosophila genome) but with no corresponding EDGP prediction. Some are small and/or underpredictions. The rest are wrong (I think!).

Please let me know if there are any questions.

Best regards,
takis ;)

----- Begin Included Message -----

Subject: CG but not EDGP

Takis,

Mellanie and I have just realised that the CG genes not predicted by EDGP are non-random. There are runs of sequential or nearly sequential CG numbers, which means that the genes are very close together. This strikes us as peculiar and suspicious. Have you any ideas? Some are indeed very small and must be wrong, but by NO means all!

M & M

CG genes not obviously predicted by EDGP:
(genes between '===' lines are tandemly arranged in the genome)

================================================================================
CG13376 X CT32708 1 213 (-) GeneScene
Start: cnt_1:25974 (frame=-2)
PVB: CG13376 is small (60-70 a.a.) with a low Genefinder score (sc=11.28). The Genscan prediction has a similarly low score.
================================================================================
CG13373 X CT32703 3 1038 (+) GeneScene
Start: cnt_1:230772 (frame=+3)
PVB: CG13373 is in a partially triplicated region. WARNING: possible mis-assembly in Celera's sequence; in JG, gene CG13373 was supposed to be close to CG13376 (see above)!

CG18275 X CT41450 1 187 (-) GeneScene
Start: cnt_1:231936 (frame=-1)
PVB: CG18275 is in the same partially triplicated region.

CG3176 X receptor CT10645 1 65 (+) GeneScene
Start: cnt_1:235283 (frame=+2)
PVB: CG3176 is in the same partially triplicated region.

CG18273 X CT41446 3 1449 (-) GeneScene
PVB: CG18273 is gene EG:171D11.6.

CG18166 X CT40990 2 638 (-) GeneScene
PVB: CG18166 is a partial duplication of gene EG:171D11.6 (see above).
================================================================================
CG13365 X CT32692 1 494 (-) GeneScene
Start: cnt_1:365582 (frame=-2)
PVB: CG13365 is in a partially duplicated region (as shown by the partially duplicated EST:AI294113).
================================================================================
CG13362 X CT32687 1 768 (-) GeneScene
PVB: No comment. I couldn't locate it. Too many partial hits. The 'gene' is obviously from a repetitive region.

>CG13362
MLPGAQYDPYGMTSYAAGRRHDSVSRQEVTAPTDLSNAPSSSRTTSTTP
APTTTTTTTTTTTPAPTTRPSTTTTSTTPPPVPPPPQSTSSSFSEGPTS
SLLRFGEYPAYNRRVNNLYNARPQYPYPDYFNYQPQQQTVVSEQGVSSN
SRIQFVPCMCPVSMPSFVSSSTAATLPSQLSTSSTSFVSQPAARHIEGQ
ELEAEVDGETDNEGEEEDEDEEQGQGQEQDQGQSQSHSLEGIAIKAQTE
RQDITSDSPV

CG13361 X CT32686 2 1335 (-) GeneScene
Start: cnt_1:497318 (frame=-2)
PVB: CG13361 probably corresponds to predictions BACR19J1.e (Genefinder; score=34.92) and/or BACR19J1.gs.3 (Genscan). EST:AA140953, which covers one of the predicted exons, extends into a region with stop codons in all three frames. This made me believe that the EST lies in a UTR; thus there should be no (coding) gene CG13361.
================================================================================
CG14635 X CT34396 1 360 (+) GeneScene
Start: cnt_1:583083 (frame=+3)
PVB: CG14635 is small and I have no prediction in this area (not on this strand anyway).
================================================================================
CG14633 X CT34393 1 831 (-) GeneScene
Start: cnt_1:619985 (frame=-2)
PVB: CG14633 corresponds to prediction BACR7A4.q (Genefinder; score=33.27). It was not reported due to lack of additional evidence.

CG11663 X CT34392 2 573 (+) GeneScene
Start: cnt_1:632851 (frame=+1)
PVB: The only prediction in the area is BACR7A4.u (Genefinder; sc=27.99). The prediction was not reported due to low score and lack of any other supporting evidence.
Moreover, the conceptually translated peptide is of very low complexity.

>CG11663
MANPKSSGGNKSKGKGHQHRQSQQNSHQQQQQQQQQQSQQSQQPQMQTQI
TPAPVASTNLNTPTATPLASHPSEDTLALAAAVAASIPAAPLARPLPDRR
TTTPAVVTTTSNSSSETRNASENLATSRTASAAVAASENRRGILQRLFGW
SS

CG14632 X CT34391 4 1676 (-) GeneScene
Start: cnt_1:632537 (part of it; frame=-2)
PVB: No predictions in this area, and the gene is of low complexity.

>CG14632
MSDEVPLGRLSHIFDTLTNLQQQQHLRSQEQLHSQQHPHSQLQPEPQQS
SAEIRRRSASSSPSPSASASASTSGRATPSLGEVAGSGYLHTFPSHFYH
HQVHHLQQHSQPPSLPTQLGAARGSQSLQGSPLLAKRATSFSGQIPLAQ
GRFTASGTTAASGAIGLPASTPNSPRLLPRRAPRPPPIPAKPNQVKADQ
QSKDAQARNSTTTTVQATVNPVLAALDAPDAPWPHFSTLTEHLDVHQVN
NYGQALPQINWQERCLELQLELHRSKNQAGRIRDMLREKETLFS
================================================================================
CG11639 X transcription CT34386 2 405 (-) GeneScene
Start: cnt_1:749792 (frame=-2)
PVB: CG11639 is gene EG:BACR7A4.7.

CG14631 X CT34389 1 399 (+) GeneScene
Start: cnt_1:750360 (frame=+3)
PVB: CG14631 corresponds to prediction BACR7A4.ag (Genefinder; sc=27.98). It was not reported due to low score and lack of other supporting evidence.
================================================================================
CG11393 X CT31813 1 353 (-) GeneScene
Start: cnt_1:881702 (frame=-2)
PVB: CG11393 is a very small gene (52 a.a.). By the coordinates of the hit, it must lie within the EG:BACR42I17.1 region. I am very confident about EG:BACR42I17.1. It is supported by protein and EST hits and it contains motif PS00813 (IF4E). I don't know what's wrong with Celera's prediction.
================================================================================
CG11381 X transcription CT31772 1 1365 (+) GeneScene
PVB: CG11381 is (part of) gene BACR42I17.9 (reported).
================================================================================
CG14770 X CT34578 2 694 (+) GeneScene
Start: cnt_1:1086807 (frame=+3)
PVB: CG14770 corresponds to Genscan prediction 132E8.gs.2.
The Genscan score is low; there is no Genefinder prediction in this region and no other supporting evidence. Thus, it was not reported.
================================================================================
CG14771 X transcription CT34579 3 2318 (+) GeneScene
CG14772 X CT34580 2 673 (+) GeneScene
PVB: There are two huge predictions in this region (cnt_1:1095100-1120000), one from Genefinder and one from Genscan. They are very atypical for Drosophila genes (many small exons, big introns). The two predictions disagree on many of the exons, and neither includes the region indicated by an EST cluster (e.g. EST:AI062494). The general genome organisation and evidence made me suspicious, so I reported only the exons I was confident about, based on protein similarity hits (gene EG:132E8.4).
================================================================================
CG14778 X CT34588 4 743 (+) GeneScene
Start: cnt_1:1187350 (frame=+1)
PVB: CG14778 corresponds to Genefinder prediction 80H7.i. It was not reported due to low score (sc=34.6) and lack of further supporting evidence.
================================================================================
CG3080 X CT10352 3 2540 (+) GeneScene
Start: cnt_1:1314702 (frame=+3)
PVB: I am not sure what CG3080 corresponds to. It looks like the 'tail' of a Genscan prediction, but there is no supporting evidence in the whole region.

CG3729 X CT12497 2 845 (-) GeneScene
PVB: Small (101 a.a.) and repetitive. I cannot locate it. It should be Genscan prediction 25D2.gs.1, which has a very low score and no supporting evidence.

>CG3729
MLCYVSLTIRRLHSLAPHCQLDAALDAVHWPLAPGPCPPSAIWHPPSPL
IQMLCRSAAIKITRQTTAAEQLKQKKKKKKEKEKRSGKRQQRKRKSGRG
G
================================================================================
CG14797 X CT34609 3 787 (+) GeneScene
Start: cnt_1:1397298 (frame=+3)
PVB: CG14797 probably corresponds to (non-reported) prediction 9D2.gs.1 (Genscan).
This prediction, as well as a similar one (in all but one exon) from Genefinder (pred. gene 9D2.j; sc=27.95), was not reported due to low score and lack of any other supporting evidence.
================================================================================
CG14798 X CT34610 2 376 (+) GeneScene
Start: cnt_1:1406487 (frame=+3)
PVB: CG14798 probably corresponds to the (non-reported) predicted gene 9D2.n (Genefinder; sc=23.26). The prediction was not reported due to the low score, the absence of other supporting evidence and the overlap of its third exon with (reported) gene EG:9D2.3 (on the reverse strand).
================================================================================
CG14810 X CT34623 1 556 (-) GeneScene
Start: cnt_1:1528333 (frame=-3)
PVB: CG14810 probably corresponds to one of the (non-reported) predictions 30B7.gs.6 (Genscan) or 30B7.b (Genefinder; sc=19.37). The gene is small, with low score and no other supporting evidence. Moreover, it is located in a region with a number of small tandem and inverted repeats (and remnants of transp. elements).

CG14811 X CT34624 1 651 (-) GeneScene
Start: cnt_1:1530386 (frame=-2)
PVB: CG14811 probably corresponds to (non-reported) predictions 30B7.a (Genefinder; sc=26.47) and 30B7.gs.7 (Genscan). The prediction was not reported due to the low score and the absence of other supporting evidence.

CG14799 X CT34612 2 1018 (+) GeneScene
Start: cnt_1:1533897 (frame=+3)
PVB: CG14799 corresponds to (non-reported) predictions 30B7.h (Genefinder; sc=8.34) and 30B7.gs.8 (Genscan). The prediction was not reported due to the low score and the absence of other supporting evidence.
================================================================================
CG14800 X CT34613 3 1260 (+) GeneScene
Start: cnt_1:1551750 (frame=+3)
PVB: The closest prediction to CG14800 is 131F2.gs.1 (Genscan).
Only the last exon(s) of this prediction was reported (EG:131F2.2), based on protein similarity hits (e.g. SW:BCT5_BOVIN).
================================================================================
CG14806 X CT34619 3 663 (+) GeneScene
Start: cnt_1:1588425 (frame=+3)
PVB: The only predictions I can find in this area are 63B12.e (Genefinder; sc=15.36) and 63B12.gs.10. The predictions were not reported due to the low score and the absence of other supporting evidence.
================================================================================
CG14819 X CT34632 5 2022 (-) GeneScene
Start: cnt_1:1610637 (frame=-1)
PVB: CG14819 is probably a 'merged' prediction of 86E4.s (Genefinder; sc=18.67) and 86E4.s (Genefinder; sc=18.94). The predictions were not reported due to the low score and the absence of other supporting evidence. Moreover, EST:AA695846 does not agree with prediction 86E4.s.
================================================================================
CG18082 X CT40582 1 222 (+) GeneScene
Start: cnt_1:1919235 (frame=+3)
PVB: CG18082 is small (~70 a.a.), with low score and no supporting evidence (pred. gene 30B8.a; Genefinder sc=10.05).

CG14052 X CT33613 3 1785 (+) GeneScene
Start: cnt_1:1919541 (frame=+3)
PVB: CG14052 corresponds to prediction 30B8.b (Genefinder; sc=24.88). The prediction was not reported due to the low score and the absence of other supporting evidence. Also, the predicted gene is of low complexity.

CG18850
Start: cnt_1:1921912 (frame=-3)
PVB: CG18850 corresponds to prediction 30B8.t (Genefinder; sc=18.05). The prediction was not reported due to the low score and the absence of other supporting evidence. Also, the predicted gene is of low complexity. WARNING! Most importantly, in EDGP sequence THE PREDICTION OVERLAPS WITH GENE EG:30B8.4 (also known as pecanex; pcx; FBgn0003048)! However, in the JG this gene is predicted between the previous two.
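
Several comments above call a predicted peptide "low complexity". The notes do not say how that was judged. As a rough illustration only, Shannon entropy of the amino-acid composition over a sliding window is one common proxy. The function names and the window size below are assumptions made for this sketch, not part of the EDGP analysis.

```python
import math
from collections import Counter


def shannon_entropy(peptide):
    """Entropy (bits per residue) of the amino-acid composition."""
    counts = Counter(peptide)
    total = len(peptide)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def min_window_entropy(peptide, window=20):
    """Lowest entropy found in any window of the given length.

    The window size is an arbitrary choice for illustration.
    """
    if len(peptide) <= window:
        return shannon_entropy(peptide)
    return min(
        shannon_entropy(peptide[i:i + window])
        for i in range(len(peptide) - window + 1)
    )


# First line of the CG11663 translation quoted in this message.
fragment = "MANPKSSGGNKSKGKGHQHRQSQQNSHQQQQQQQQQQSQQSQQPQMQTQI"
print(round(shannon_entropy(fragment), 2))
print(round(min_window_entropy(fragment, window=10), 2))
```

Lower values mean a more repetitive composition. Any cut-off for calling a peptide "low complexity" would have to be chosen separately; none is given in these notes.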
================================================================================
CG3091 X binding or CT9997 6 1492 (-) GeneScene
Start: cnt_1:1949751 (frame=-1)
PVB: CG3091 is (partly) predicted gene 30B8.l (Genefinder; sc=34.63). This prediction was supported by EST(s), but it was not reported because it consists of a partial duplication of gene EG:30B8.3 (without the protein similarity, though; different translation frame).
================================================================================
CG3073 X enzyme CT9957 3 1834 (-) GeneScene
Start: cnt_1:1957696 (frame=-3)
PVB: CG3073 probably consists of the joined Genscan predictions 25E8.gs.1 and 30B8.gs.8. This gene is supported by EST hits, therefore I SHOULD HAVE REPORTED IT. The reason I missed it is that it is located between two cosmids.
================================================================================
CG14049 X CT33608 2 402 (-) GeneScene
Start: cnt_1:2019302 (frame=-2)
PVB: CG14049 corresponds to prediction BACH48C10.gs.3 and (partly) to prediction BACH48C10.e (Genefinder; sc=11.99). The predictions were not reported due to the low score and the absence of other supporting evidence.
================================================================================
CG7894 X cell adhesion CT23737 3 6037 (-) GeneScene
Start: cnt_1:2129993 (frame=-2)
PVB: CG7894 probably consists of the joined Genefinder predictions BACR25B3.n (sc=37.86) and BACH59J11.gs.6 (sc=9.99). This gene is supported by EST hits and (weak) protein similarity hits, therefore I SHOULD HAVE REPORTED IT. The reason I missed it is that it is located between two BACs.
================================================================================
CG8310 X transporter CT24372 3 821 (-) GeneScene
Start: cnt_1:2257964 (frame=-2)
PVB: CG8310 is gene EG:BACR25B3.4.

CG8636 X translation CT25021 2 959 (-) GeneScene
Start: cnt_1:2288033 (frame=-2)
PVB: CG8636 corresponds to prediction BACR7C10.n (Genefinder; sc=61.87). It is supported by (weak) protein hits (corresponding to translation factors) as well as ESTs. However, one of the EST hits (AA820554) lies within 100 bp of the end of the transposable element Burdock (EG:BACR7C10.5) that was also found there. Both the nature of the protein hits and the close proximity to the transposable element made me suspect that this may not be a 'real' Drosophila gene. Thus, I did not report it.
================================================================================
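
The included message notes that the CG genes missing from EDGP fall into runs of sequential or nearly sequential CG numbers. A minimal sketch of how such runs could be listed from a set of identifiers follows. The message does not define "nearly sequential", so the `max_gap` and `min_len` parameters, like the function names, are assumptions for illustration. The identifiers are a subset of those listed above.

```python
import re


def cg_number(gene_id):
    """Return the integer part of an identifier such as 'CG13376'."""
    match = re.fullmatch(r"CG(\d+)", gene_id)
    if match is None:
        raise ValueError(f"not a CG identifier: {gene_id!r}")
    return int(match.group(1))


def find_runs(gene_ids, max_gap=3, min_len=2):
    """Group CG numbers into runs whose sorted neighbours differ by <= max_gap.

    Runs shorter than min_len are dropped. Both thresholds are assumptions.
    """
    numbers = sorted({cg_number(g) for g in gene_ids})
    runs, current = [], []
    for n in numbers:
        if current and n - current[-1] > max_gap:
            if len(current) >= min_len:
                runs.append(current)
            current = []
        current.append(n)
    if len(current) >= min_len:
        runs.append(current)
    return runs


ids = ["CG13376", "CG13373", "CG13365", "CG13362", "CG13361",
       "CG14635", "CG14633", "CG14632", "CG14631",
       "CG14797", "CG14798", "CG14799", "CG14800", "CG3080"]

for run in find_runs(ids):
    print(" ".join(f"CG{n}" for n in run))
```

Whether neighbouring CG numbers really imply neighbouring genome positions is the message's own premise; the Start coordinates given with each entry would be the direct check.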