Data processing: measuring, calibrating, or testing – Measurement system in a specific environment – Biological or biochemical
Reexamination Certificate
2000-09-08
2004-01-20
Brusca, John S. (Department: 1631)
C435S006120, C365S094000
active
06681186
ABSTRACT:
BACKGROUND
A. Field of the Invention
This invention relates to deoxyribonucleic acid (DNA) sequencing. More specifically, this invention provides a method and system for improving the accuracy of DNA sequencing and error probability estimation through application of a mathematical model to the analysis of electropherograms.
B. General Description of the Area of Research
With the advent of the Human Genome Project and its massive undertaking to sequence the entire human genome, researchers have been turning to automated DNA sequencers to process vast amounts of DNA sequence information. DNA, or deoxyribonucleic acid, is one of the most important information-carrying molecules in cells. DNA is composed of four different types of monomers, called nucleotides, which are in turn composed of bases linked with a sugar and a phosphate group. The four bases are adenine (A), cytosine (C), guanine (G), and thymine (T). The original state of a DNA fragment is a double helix of two antiparallel chains with complementary nucleotide sequences. The coded information of a DNA sequence is determined by the order of the four bases in either of these chains.
A common approach to obtaining information from DNA is the Sanger method. In this method, single-stranded DNA fragments are used as templates from which a series of nested subfragment sets is generated. (F. Sanger, et al., “DNA Sequencing With Chain-Terminating Inhibitors”, Proceedings of the National Academy of Sciences of the USA, vol. 74, pp. 5463-5467 (1977)). The subfragments start at the same end of the template, and a fraction of the subfragments of each length are caused to terminate by incorporation of chemically modified bases, thereby forming subfragment sets in increments of one nucleotide. In the popular “four-color” method, the terminating bases are labeled by one of four fluorescent dyes specific to the terminating base type, A, C, G or T. (L. M. Smith, et al., “Fluorescence Detection In Automated DNA Sequence Analysis”, Nature, vol. 321, pp. 674-679 (1986)). The resulting mixture of sets of subfragments represents all of the possible sublengths of the template, with each set of subfragments labeled by a fluorescent dye corresponding to its terminating base type. To determine the sequence of the template, the subfragments are sorted by length using electrophoresis. In this process, shorter subfragments migrate faster than longer subfragments in an applied electric field. Because subfragments are created in increments of one nucleotide, they pass through an electrophoretic cell one at a time in the order of the nucleotides in the template. The terminating base types are identified by the wavelength at which they fluoresce. Real-time fluorescent detection of the migrating bands of subfragments is performed as the subfragments pass through a detection zone. The light collected is processed with a set of spectral filters that attempt to isolate the signals from the four dyes.
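The length-sorting idea described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: every subfragment length is present exactly once, each fragment is reduced to a (length, terminating base) pair, and strand complementarity, dye spectra, and signal noise are ignored. The function names `sanger_ladder` and `read_by_electrophoresis` are hypothetical and do not come from the patent.

```python
import random

def sanger_ladder(strand):
    """Build one (length, terminating_base) pair per nested subfragment.

    Toy assumption: exactly one subfragment exists for each length 1..N.
    """
    return [(n, strand[n - 1]) for n in range(1, len(strand) + 1)]

def read_by_electrophoresis(fragments):
    """Recover the base order by sorting fragments by length.

    Shorter fragments migrate faster, so they reach the detector first;
    reading the terminating base of each fragment in arrival order
    spells out the sequence.
    """
    return "".join(base for _, base in sorted(fragments))

strand = "GATTACA"
fragments = sanger_ladder(strand)
random.shuffle(fragments)  # the reaction mixture has no inherent order
recovered = read_by_electrophoresis(fragments)
print(recovered)
```

Because each length occurs once, sorting by length alone is enough to restore the order regardless of how the mixture was shuffled.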
In automated DNA sequencing, these raw signals are then analyzed by signal processing software. The steps of signal processing may include downsampling of the data to 1 Hz if necessary, primer data removal, baseline adjustment, noise filtering, multicomponent transformation, dye mobility shift correction, signal normalization, etc. (see, e.g., M. C. Giddings, et al., “A Software System For Data Analysis In Automated DNA Sequencing”, Genome Research, vol. 8, pp. 644-665 (1998)). Processing the raw data produces analyzed electropherograms with clearly defined peaks. The analyzed data in the form of electropherograms are then processed using a base calling program. The base calling program infers a sequence of bases in the DNA fragment. This sequence of bases is also referred to as a read and is usually about 1,000 bases long. Not all of the called bases are used in subsequent processing. The statistically averaged error produced by any base calling program is usually low, i.e., below 1%, for bases located near the middle of a read and increases significantly toward the beginning and, especially, toward the end of a read. To characterize the reliable, or high quality, part of a read, a threshold of 1% base calling error is commonly accepted. That is, only that part of the read having an average base calling error of 1% or less will be subsequently used. Alternatively, this may be characterized in terms of the quality values assigned to bases, where the quality value is a measure of the reliability of the base call. According to a commonly used definition of quality values, a quality value of 20 or higher corresponds to a probability of error of 1% or less. In practice, when sequencing, the correct sequence is not known in advance, so reliable predictions of quality values for newly sequenced fragments, based on previous training or calibration on a data set with a known correct sequence, are desirable.
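The relation between quality values and error probabilities can be made concrete with a short sketch. It assumes the Phred-style definition Q = -10 log10(P), which is consistent with the statement above that a quality value of 20 corresponds to a 1% error probability. The helper `high_quality_span` is a hypothetical, deliberately simple heuristic (longest contiguous run of bases at or above the threshold) and is not the trimming procedure of any particular base caller.

```python
import math

def error_to_quality(p_error):
    """Quality value from error probability, assuming Q = -10*log10(P)."""
    return -10.0 * math.log10(p_error)

def quality_to_error(q):
    """Error probability from quality value, the inverse of the above."""
    return 10.0 ** (-q / 10.0)

def high_quality_span(qualities, threshold=20.0):
    """Return (start, end) of the longest run with quality >= threshold.

    The span is half-open: bases start..end-1 belong to the run.
    Illustrative heuristic only; returns (0, 0) if no base qualifies.
    """
    best = (0, 0)
    start = None
    # A sentinel below any threshold closes a run that reaches the end.
    for i, q in enumerate(list(qualities) + [float("-inf")]):
        if q >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

# Hypothetical per-base quality values: low at both ends of the read.
read_qualities = [5, 12, 25, 30, 41, 22, 9, 21, 4]
span = high_quality_span(read_qualities)
print(span)
```

The example read mirrors the behavior described above: quality is low at the beginning and end, and only the middle portion passes the 1% error threshold.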
C. Prior Art
1. ABI Base Caller
The ABI Base Caller is a part of DNA Sequencing Analysis software produced by Applied Biosystems of Foster City, Calif. This program takes raw electropherograms as input, processes them to produce analyzed electropherograms having well defined and evenly spaced peaks, and then detects and classifies peaks in the analyzed electropherograms as a sequence of bases. The program outputs the results to a binary file called an ABI sample file. The output includes the raw and analyzed electropherograms for each of the four traces, the array of called bases and the array of locations assigned to the bases in an electropherogram. The output does not include an estimate of quality values, because the ABI Base Caller program does not estimate the reliability of base calls.
The ABI Base Caller was chronologically the first and is still one of the best base calling programs available. The base calls produced by the ABI Base Caller, however, are not very accurate toward the end of a read, where peaks in an analyzed electropherogram become wider and overlap significantly. In this part of the read, the ABI Base Caller produces a considerable number of mismatch errors and unknown base calls, denoted as N's, and overlooks some base calls, resulting in deletion errors.
2. Phred
Phred is the first base calling software program to achieve a lower error rate than the ABI software, and is especially effective at the end of a read. Phred takes analyzed electropherograms produced by the ABI Base Caller as input, calls the bases and assigns quality values to the called bases. (see B. Ewing, et al., “Base Calling Of Automated Sequencer Traces Using Phred. I. Accuracy Assessment”, Genome Research, vol. 8(3), pp. 175-185 (1998); B. Ewing and P. Green, “Base-Calling Of Automated Sequencer Traces Using Phred. II. Error Probabilities”, Genome Research, vol. 8(3), pp. 186-194 (1998)).
The base calling procedure in Phred consists of four phases: locating the predicted peaks, locating the observed peaks, matching the observed and predicted peaks, and finding the missed peaks. In the first phase, Phred attempts to find the idealized locations of all of the base peaks that would have occurred in the absence of imperfections in the sequencing reactions, in the electrophoresis process, and in trace processing. The underlying premise of Phred is that under such idealized conditions, each trace consists of evenly spaced, non-overlapping peaks, corresponding to the labeled fragments that terminate at a particular base in the sequenced strand. To find the positions of predicted peaks, Phred first examines the four trace arrays that correspond to each of the four bases to detect the peaks. A detected peak is identified as the location of the maximum value or, if the maximum does not exist, the midpoint between the inflection points. The processed trace is then scanned to find the regions of uniform peak spacing and the average peak period, which corresponds to the peak-to-peak (inter-peak) spacing and is determined for each of the regions. Phred then uses Fourier methods to find the positions of the predicted peaks in between these
Arehart Alan B.
Curtin Michael D.
Denisov Gennady A.
Brusca John S.
Paracel, Inc.
Pasika Hugh J.