Introduction
This document provides a high-level overview of the options available with the FlyBase BLAST service. For more information about BLAST please see the extensive documentation provided by the NCBI (BLAST docs). Please note that some of the documentation provided on that site pertains only to the NCBI's BLAST interface.
- BLASTP: Compares an amino acid query sequence against a protein sequence database
- BLASTN: Compares a nucleotide query sequence against a nucleotide sequence database
- BLASTX: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- TBLASTN: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- TBLASTX: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
The query sequence to be used for a BLAST search should be pasted in the 'Sequence' text area. It accepts a number of different types of input and automatically determines the format of the input. To allow this feature there are certain conventions required with regard to the input of identifiers (e.g., accessions or gi's). These are described in 3) below. Accepted input types are FASTA, bare sequence, or sequence identifiers.
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line (defline) is distinguished from the sequence data by a greater-than (">") symbol at the beginning. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Blank lines are not allowed in the middle of FASTA input.
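As a rough illustration of the FASTA conventions above, the following Python sketch splits FASTA text into (description, sequence) records and rejects blank lines inside the input. The function name `parse_fasta` is hypothetical; it only mirrors the rules stated here and is not the parser used by the FlyBase or NCBI BLAST services.

```python
def parse_fasta(text):
    """Split FASTA text into (defline, sequence) pairs.

    Hypothetical helper for illustration only; it mirrors the rules
    described above and is not the parser used by the BLAST service.
    """
    records = []
    defline, chunks = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            # Blank lines are not allowed in the middle of FASTA input.
            raise ValueError("blank line inside FASTA input")
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        else:
            if defline is None:
                raise ValueError("sequence data before a '>' description line")
            chunks.append(line)
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records
```

Sequence lines belonging to one record are simply concatenated, so wrapping the sequence at fewer than 80 characters per line does not change the parsed result.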
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:
A  adenosine
C  cytidine
G  guanine
T  thymidine
N  A/G/C/T (any)
U  uridine
K  G/T (keto)
S  G/C (strong)
Y  T/C (pyrimidine)
M  A/C (amino)
W  A/T (weak)
R  G/A (purine)
B  G/T/C
D  G/A/T
H  A/C/T
V  G/C/A
-  gap of indeterminate length
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
A  alanine
B  aspartate/asparagine
C  cysteine
D  aspartate
E  glutamate
F  phenylalanine
G  glycine
H  histidine
I  isoleucine
K  lysine
L  leucine
M  methionine
N  asparagine
P  proline
Q  glutamine
R  arginine
S  serine
T  threonine
U  selenocysteine
V  valine
W  tryptophan
Y  tyrosine
Z  glutamate/glutamine
X  any
*  translation stop
-  gap of indeterminate length
¹ The degenerate nucleotide codes are treated as mismatches in nucleotide alignments. Too many such degenerate codes within an input nucleotide query will cause blast.cgi to reject the input. For protein queries, too many nucleotide-like codes (A, C, G, T, N) may cause a similar rejection.
² For protein queries, U is replaced by X before the search, since it is not specified in any scoring matrix.
³ BLAST will not take "-" in the query. To represent gaps, use a string of N or X instead.
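The conventions above (case folding, the accepted letter codes, and the substitutions described in the footnotes) can be sketched in Python as follows. This is only an illustration: `normalize_query` and the two alphabet constants are hypothetical names, and the choice to reject digits rather than strip them is an assumption made to keep the example short.

```python
# Letter codes taken from the two tables above.
NUCLEOTIDE_CODES = set("ACGTUNKSYMWRBDHV")
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*")


def normalize_query(seq, is_protein):
    """Apply the input conventions described above to a raw query string.

    Hypothetical helper; not the validation code used by the BLAST service.
    """
    unknown = "X" if is_protein else "N"
    allowed = PROTEIN_CODES if is_protein else NUCLEOTIDE_CODES
    out = []
    for ch in seq.upper():  # lower-case letters are mapped to upper-case
        if ch.isspace():
            continue
        if ch == "-":
            ch = unknown  # footnote 3: use N or X in place of "-"
        elif is_protein and ch == "U":
            ch = "X"  # footnote 2: U is not in any scoring matrix
        elif ch not in allowed:
            # Digits and other characters are rejected (assumption:
            # the caller removes or replaces them before submitting).
            raise ValueError(f"unsupported residue code: {ch!r}")
        out.append(ch)
    return "".join(out)
```

For example, `normalize_query("acg-t n", False)` gives `"ACGNTN"`, and `normalize_query("mkU-*", True)` gives `"MKXX*"`.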
- Bare Sequence
This may be just lines of sequence data, without the FASTA definition line, e.g.:
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:
  1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
 61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp
Blank lines are not allowed in the middle of bare sequence input.
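If you want to clean a flatfile-style sequence yourself before pasting it, something along these lines would work. It is a minimal Python sketch; `strip_flatfile_numbering` is a hypothetical name and the service's own handling of such input may differ.

```python
import re


def strip_flatfile_numbering(text):
    """Remove position numbers and whitespace from GenBank-style sequence text.

    Hypothetical helper for illustration; returns the bare upper-case sequence.
    """
    # Blank lines are not allowed in the middle of bare sequence input.
    if re.search(r"\n[ \t]*\n", text.strip()):
        raise ValueError("blank line inside bare sequence input")
    # Drop every run of digits and whitespace, keep the residue letters.
    return re.sub(r"[\d\s]+", "", text).upper()
```

Applied to the first two lines of the flatfile excerpt above, this yields one continuous string beginning `QIKDLLVSSSTDLDTTLVLV`.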
This function allows users to upload a text file containing queries in any of the formats outlined above. Long sequences should be uploaded through this option to avoid possible browser buffer size limits.
- Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 25 descriptions. See also Expect.
- Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 25. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see Expect below), only the matches ascribed the greatest statistical significance are reported.
- This setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (1) means that 1 such match is expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the Expect threshold, the match will not be reported. Lower Expect thresholds are more stringent, leading to fewer chance matches being reported. See the NCBI's BLAST FAQ for more information.
- The best way to identify an unknown sequence is to see whether that sequence already exists in a public database. If the database sequence is well characterized, then one will have access to a wealth of biological information. MEGABLAST, discontiguous-megablast, and blastn can all be used to accomplish this goal. However, MEGABLAST is specifically designed to efficiently find long alignments between very similar sequences, and is thus the best tool for finding an identical match to your query sequence. In addition to the Expect value significance cut-off, MEGABLAST provides an adjustable percent identity cut-off for the alignment.
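To make the Expect threshold described above more concrete: in Karlin-Altschul statistics, the number of matches with bit score at least S' expected by chance is approximately E = m * n * 2^(-S'), where m is the query length and n is the total database length. The Python sketch below applies that textbook approximation to filter a list of hits. It ignores the effective-length corrections that BLAST applies, and the function names and the (name, bit score) hit format are made up for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


def report_hits(hits, query_len, db_len, expect=1.0):
    """Keep (name, bit_score) hits whose E-value does not exceed `expect`.

    Illustrative only; real BLAST uses effective lengths, not raw lengths.
    """
    kept = []
    for name, bit_score in hits:
        e = expect_value(bit_score, query_len, db_len)
        if e <= expect:
            kept.append((name, e))
    # Most significant (smallest E-value) first.
    return sorted(kept, key=lambda item: item[1])
```

Lowering `expect` shrinks the list of reported hits, which is the sense in which lower thresholds are more stringent.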
- Legacy BLAST Engine
- In BLAST version 2.2.13 the NCBI introduced a new BLAST engine which became the default in 2.2.14. For more information on this new engine please see the NCBI's page. If you wish to use the legacy BLAST engine check this option.
- Filter (Low-complexity)
- This function masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton and Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs. It is not unusual for nothing at all to be masked by SEG when it is applied to sequences in SWISS-PROT or RefSeq, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect. A fully masked query will also lead to a search error when the default setting is used. See http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#filter for more information.
- Word size
- BLAST is a heuristic that works by finding word-matches between the query and database sequences. One may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might eventually lead to full-blown alignments. For nucleotide-nucleotide searches (i.e., "blastn") an exact match of the entire word is required before an extension is initiated, so that one normally regulates the sensitivity and speed of the search by increasing or decreasing the word-size. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can be varied so one normally uses just the word-sizes 2 and 3 for these searches.
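The exact word-matching step described above for nucleotide searches can be sketched in Python as follows. This shows only how seed positions ("hot-spots") are found; it is not BLAST's actual lookup-table implementation, it does not perform the extension step, and it does not cover the non-exact word matching used for protein searches. The function name `exact_word_hits` is hypothetical.

```python
def exact_word_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a word of length
    `word_size` matches exactly. Each pair is a candidate seed from
    which an extension could be started.
    """
    # Index every word of the query by its starting position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits
```

With query `ACGTAC` and subject `TTACGTT`, a word size of 4 yields a single seed, while a word size of 3 yields three. This is the trade-off mentioned above: smaller words give more seeds (more sensitivity) at the cost of more extensions to evaluate (less speed).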
- Query Genetic Code
- Genetic code to be used in blastx translation of the query.
- A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with. See BLAST substitution matrices for more information.
- Other options
- Accepted Parameters for the "Other options" field
-G Cost to open gap [Integer]: default = 5 for nucleotides / 11 for proteins
-E Cost to extend gap [Integer]: default = 2 for nucleotides / 1 for proteins
-q Penalty for nucleotide mismatch [Integer]: default = -3
-r Reward for nucleotide match [Integer]: default = 1
-y Dropoff (X) for BLAST extensions in bits: default = 20 for blastn / 7 for others
-X X dropoff value for gapped alignment (in bits): default = 15 for all programs; not applicable to blastn
-Z Final X dropoff value for gapped alignment (in bits): default = 50 for blastn / 25 for others
-V Force use of the legacy BLAST engine (T/F): default = F
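As an illustration of how a string typed into the "Other options" field could be checked against the parameters listed above, here is a small Python sketch. How the FlyBase form actually parses this field is not documented here, so the function `parse_other_options`, the flag/value pairing, and the numeric types chosen for -y, -X and -Z are all assumptions.

```python
import shlex

# Flags from the list above, mapped to the type of their value.
# The float type for -y, -X and -Z is an assumption; the list only
# marks -G and -E explicitly as integers.
ACCEPTED_OPTIONS = {
    "-G": int, "-E": int, "-q": int, "-r": int,
    "-y": float, "-X": float, "-Z": float, "-V": str,
}


def parse_other_options(text):
    """Parse a string such as "-G 5 -E 2" into a {flag: value} dict.

    Hypothetical helper; raises ValueError on unknown flags or bad values.
    """
    tokens = shlex.split(text)
    if len(tokens) % 2:
        raise ValueError("each flag needs exactly one value")
    options = {}
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if flag not in ACCEPTED_OPTIONS:
            raise ValueError(f"unsupported option: {flag}")
        if flag == "-V":
            if value not in ("T", "F"):
                raise ValueError("-V expects T or F")
            options[flag] = value
        else:
            # int() and float() raise ValueError on malformed numbers.
            options[flag] = ACCEPTED_OPTIONS[flag](value)
    return options
```

For example, `parse_other_options("-G 5 -E 2 -q -3 -V T")` returns `{"-G": 5, "-E": 2, "-q": -3, "-V": "T"}`.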