Method and apparatus for high-performance sequence comparison

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C709S241000, C702S020000

active

06691109

ABSTRACT:

FIELD OF THE INVENTION
The invention relates to a method for searching multiple query sequences against one or more sequence databases. More specifically, the invention relates to a computer-implemented method and apparatus that provide high-performance, high-speed, remotely accessible sequence comparison searches.
BACKGROUND OF THE INVENTION
Sequence similarity is an observable quantity that may be expressed as, for example, a percentage. Comparison of newly identified sequences against known sequences often provides clues about the function of the sequences. If the sequence is a protein sequence, the sequence comparison may also provide clues as to the three-dimensional structure adopted by the protein sequence. Sequence similarity may also lead to inferences on the evolutionary relatedness, or the homology, of the sequences.
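As an illustration of expressing similarity as a percentage, the following minimal sketch computes percent identity over two already-aligned sequences. Percent identity is only one possible convention, and the function name and the choice to skip gap columns are assumptions made for this example, not taken from the patent.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of compared (non-gap) columns whose residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are excluded from the count
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

For example, `percent_identity("ACGT", "ACGA")` compares four columns and finds three identical residues.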
Current sequence databases are already immense and have continued to grow at an exponential rate. For example, the human genome project and other large scale nucleotide sequencing objectives have resulted in a large amount of sequence information available in both private and public databases. Sequence similarity searching is not simply used to compare a single sequence against the sequences in a single database, but is also used to compare or screen large numbers of new sequences against multiple databases. Moreover, sequence alignment and database searches are performed tens of thousands of times per day around the world. Therefore, the ability to quickly and precisely compare new sequence data against such sequence databases is becoming more and more important.
There are many different methods for comparing sequences. Some methods, such as those based on the analysis of transformational grammars (cf. Durbin et al., Biological Sequence Analysis, Cambridge University Press (1998), Chapter 9), compare sequences by comparing the properties of the mathematical algorithms that may be used to generate the sequences in question. However, most common methods involve the use of sequence alignment at some point in the comparison process. Sequence alignment provides an explicit mapping between the residues of two or more sequences. When only two sequences are compared, the process is called pairwise alignment, but there are also methods of constructing multiple alignments that involve aligning more than two sequences.
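To make the idea of an explicit residue-to-residue mapping concrete, here is a minimal sketch of pairwise global alignment by dynamic programming (the Needleman-Wunsch technique). The text does not name this algorithm at this point; it is chosen here only for illustration, and the scoring values (match, mismatch, gap) are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global pairwise alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, or left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover the mapping.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Each column of the two returned strings pairs one residue of the first sequence with one residue of the second, or with a gap character.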
The production of a sequence alignment result may be generically divided into two separate problems. The first problem is the alignment of the query sequence with the sequences in the databases being searched. The second problem is ranking or scoring of the aligned sequences. The results of the sequence alignment search are then reported as a ranked hit list followed by a series of individual sequence alignments, plus various scores and statistics.
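The two sub-problems described above, comparing the query with each database sequence and then ranking the results, can be sketched as follows. The scoring function is a deliberately crude stand-in (a count of identical positions without gaps), and every name here is hypothetical.

```python
def ungapped_score(query, subject):
    """Toy score: number of identical residues at the same positions."""
    return sum(1 for q, s in zip(query, subject) if q == s)

def search(query, database, score_fn=ungapped_score):
    """Score the query against every database entry, then rank the hits."""
    hits = [(name, score_fn(query, seq)) for name, seq in database.items()]
    hits.sort(key=lambda hit: hit[1], reverse=True)  # best score first
    return hits
```

A real search tool would follow the ranked hit list with the individual alignments and their statistics; this sketch stops at the ranking step.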
There are various programs and algorithms available for performing database sequence similarity searching. For a basic discussion of bioinformatics and sequence similarity searching, see BIOINFORMATICS: A Practical Guide to the Analysis of Genes and Proteins, Baxevanis and Ouellette eds., Wiley-Interscience (1998) and Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin et al., Cambridge University Press (1998). One of the earliest algorithms used for sequence alignment searching was incorporated into the FASTA program (Lipman and Pearson, “Rapid and sensitive protein similarity searches,” Science, Vol. 227, pp. 1435-1441 (1985); Pearson and Lipman, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci., Vol. 85, pp. 2444-2448 (1988)). The FASTA program performs optimized searches for local alignments using a substitution matrix. To improve the speed of the search, the program uses an observed pattern of small matches, termed “word” hits, to identify potential matches before performing the more time-consuming optimization search.
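The word-hit idea can be sketched as a lookup of short exact matches followed by a count of hits per diagonal. This is an illustrative simplification, not the FASTA implementation; the function names, the word length `k`, and the diagonal-counting heuristic as written here are assumptions for the example.

```python
from collections import defaultdict

def index_words(sequence, k):
    """Map each length-k word to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly."""
    subject_index = index_words(subject, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in subject_index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, s_pos))
    return hits

def best_diagonal(hits):
    """Return (diagonal, hit_count) for the diagonal with the most word hits."""
    counts = defaultdict(int)
    for q_pos, s_pos in hits:
        counts[s_pos - q_pos] += 1
    return max(counts.items(), key=lambda kv: kv[1]) if counts else None
```

Only subject sequences with a well-populated diagonal would then be passed to the slower optimization step, which is where the time saving comes from.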
A popular algorithm for sequence similarity searching is the BLAST (Basic Local Alignment Search Tool) algorithm, which is employed in programs such as blastp, blastn, blastx, tblastn, and tblastx. (Altschul et al., “Local alignment statistics,” Methods Enzymol., Vol. 266, pp. 460-480 (1996); Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucl. Acids Res., Vol. 25, pp. 3389-3402 (1997); Karlin et al., “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes,” Proc. Natl. Acad. Sci., Vol. 87, pp. 2264-2268 (1990); Karlin et al., “Applications and statistics for multiple high-scoring segments in molecular sequences,” Proc. Natl. Acad. Sci., Vol. 90, pp. 5873-5877 (1993)). The approach used by the BLAST program is to first identify segments, with or without gaps, that are similar in a query sequence and a database sequence, then to evaluate the statistical significance of all such matches that are identified, and finally to summarize only those matches that satisfy a preselected threshold of significance.
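The significance step can be illustrated with the Karlin-Altschul expectation value, E = K·m·n·exp(−λS), where S is the raw score, m and n are the query and database lengths, and K and λ depend on the scoring system. In the sketch below the function names are hypothetical, and the values of `lam` and `k` are supplied by the caller; any numbers used with it should be treated as placeholders rather than fitted parameters.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul E-value: expected number of chance matches scoring >= raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def significant(matches, query_len, db_len, lam, k, threshold):
    """Keep (name, score) matches whose E-value is at or below the threshold."""
    kept = []
    for name, score in matches:
        e = expect_value(score, query_len, db_len, lam, k)
        if e <= threshold:
            kept.append((name, score, e))
    return sorted(kept, key=lambda m: m[2])  # most significant first
```

A lower E-value indicates a match that is less likely to have arisen by chance, so the threshold acts as the preselected significance cutoff mentioned above.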
The blastp program compares an amino acid query sequence against a protein sequence database, while the blastn program compares a nucleotide query sequence against a nucleotide sequence database. The blastx program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. A protein query sequence is compared against a nucleotide sequence database dynamically translated in all six reading frames (both strands) by the tblastn program, and tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The program blastall, one of the implementations of BLAST, can be used to perform all five flavors of the BLAST comparison.
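The six-frame conceptual translation used by blastx, tblastn, and tblastx can be sketched as below. This is a minimal illustration using the standard genetic code, not the code path of any BLAST program; the frame labels (`+1` through `-3`) and the use of `X` for codons containing unrecognized characters are assumptions of the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame_translations(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six resulting protein strings can then be compared against protein sequences in the same way as an ordinary amino acid query.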
The BLAST program can be downloaded from the NCBI and run locally as a full executable. It can be used to run BLAST searches against private local databases or downloaded copies of the NCBI databases. Versions 1.4 and later of BLAST can be run in parallel on shared-memory multiprocessors. (N. Camp, “High-Throughput BLAST,” Silicon Graphics, Inc., September 1988, www.sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html).
Silicon Graphics, Inc. (“SGI”) has developed an alternative parallel system for running multiple BLAST searches. (N. Camp, “High-Throughput BLAST,” Silicon Graphics, Inc., September 1988, sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html). The system, called High-Throughput BLAST (“HT BLAST”), consists of a modified BLAST executable and a driver. HT BLAST allows multiple sequences to be compared against multiple databases with a single invocation of code. The output of HT BLAST is a summary of the High Scoring Pair information generated during the search. Because the code is invoked only once, HT BLAST saves on startup overhead by reusing data structures and eliminating the need to remap the databases. HT BLAST also removes all parallel constructs from BLAST, allowing for increased single-processor speed. Parallelism is instead relocated to the driver, which distributes blocks of sequences to multiple processors running HT BLAST. HT BLAST uses a dynamically scheduled loop to maintain load balance. Because the independent tasks are blocks of sequences compared against multiple databases, the parallel grain size can be much greater than it is for unmodified BLAST. Thus, scaling to large numbers of processors is accomplished even for short sequences and small databases.
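The driver's dynamically scheduled loop can be pictured as a shared queue of sequence blocks from which each worker pulls the next block as soon as it finishes the previous one. The sketch below is not the HT BLAST driver: it uses Python threads as stand-ins for processors, and `search_block` is a hypothetical placeholder for whatever per-block search is run.

```python
import queue
import threading

def make_blocks(sequences, block_size):
    """Split the query sequences into consecutive blocks of at most block_size."""
    return [sequences[i:i + block_size] for i in range(0, len(sequences), block_size)]

def run_dynamic(blocks, search_block, num_workers):
    """Run search_block on every block; idle workers pull the next pending block."""
    tasks = queue.Queue()
    for idx, block in enumerate(blocks):
        tasks.put((idx, block))
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                idx, block = tasks.get_nowait()
            except queue.Empty:
                return  # no blocks left, so this worker stops
            out = search_block(block)  # hypothetical per-block search
            with lock:
                results[idx] = out

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Return results in the original block order, whichever worker produced them.
    return [results[i] for i in range(len(blocks))]
```

Because blocks are claimed on demand rather than assigned up front, a worker that draws a slow block does not hold up the others, which is the load-balancing property the text attributes to dynamic scheduling.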
HT BLAST, however, is run on a single multiprocessor mainframe. The method and apparatus of the instant invention allow a sequence similarity searching program, such as the BLAST executable, to be run on multiple, networked, heterogeneous machines. Moreover, HT BLAST does not allow for dividing up collections of databases both by treating individual databases separately and by partitioning the individual databases. The method and apparatus of the instant invention do not require a shar
