Reexamination Certificate
2002-03-27
2004-04-27
Brusca, John S. (Department: 1631)
Data processing: measuring, calibrating, or testing
Measurement system in a specific environment
Biological or biochemical
C702S027000, C702S030000
active
06728642
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method of classifying biological elements into functional families and identifying biologically active regions of the biological element.
2. Description of the Prior Art
Genomes carry all information of life from one generation to the next for every organism on earth. Each genome, which is a collection of DNA molecules, can be represented as a series of strings composed of four letter symbols. Today, the genomes of a worm known as C. elegans, the fruit fly, the human, and a weed known as Arabidopsis, as well as several dozen microbial genomes, are available. Most of these data are accessible free of charge, encouraging their exploration. However, it is not the genes but the proteins they encode that actually perform the functions of living cells. A search for protein function requires that each protein and its structure be identified and characterized, and that every protein-protein interaction be characterized.
Classification of Proteins
Proteins are molecules constructed from linear sequences of smaller molecules called amino acids. There are twenty naturally occurring amino acids, and they can be represented in a protein sequence as a string of alphabetic symbols. Protein molecules fold to form specific three-dimensional shapes which specify their particular chemical function.
Analysis of protein sequences can provide insights into function and can also lead to knowledge regarding biologically active sites of the protein. While analysis of protein sequences is often performed directly on the symbolic representation of the amino acid sequence, patterns in the sequence are often too weak to be detected as patterns of symbols.
Alternative sequence analysis techniques can be performed by assigning numerical values to the amino acids in a protein. The numerical values are derived from the physico-chemical properties of the amino acid such as hydrophobicity, bulkiness, or electron-ion interaction potential (EIIP) and are relevant to structural folding or biological activity.
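The numerical encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented method: the `EIIP` table holds the values commonly tabulated in the Resonant Recognition Model literature and should be checked against the cited sources before use, and the names `EIIP` and `encode` are hypothetical.

```python
# EIIP values (in rydbergs) as commonly tabulated in the RRM literature.
# Assumption: verify against the cited sources before relying on them.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def encode(sequence, index=EIIP):
    """Map a one-letter amino acid string to a list of numerical values."""
    values = []
    for position, residue in enumerate(sequence.upper()):
        if residue not in index:
            raise ValueError(f"unknown residue {residue!r} at position {position}")
        values.append(index[residue])
    return values


# Encode the first eight residues of a protein sequence.
signal = encode("PALPEDGG")
```

Any other physico-chemical scale, such as hydrophobicity or bulkiness, could be passed as `index` in place of the EIIP table.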
It has been recognized that proteins of a given family have a common characteristic frequency component related to their function which may be used to classify proteins into functional families.
Frequency Analysis Methods
The Resonant Recognition Model is an attempt to use frequency analysis to determine the characteristic frequency components of a family of proteins.
The Resonant Recognition Model, or RRM, is described by I. Cosic in “Macromolecular bioactivity: Is it resonant interaction between macromolecules?-theory and applications,” IEEE Transactions on Biomedical Engineering, vol. 41, December 1994. The RRM is a physico-mathematical model that analyzes the interaction of a protein and its target using digital signal processing methods. One application of this model involves prediction of a protein's biological function. In this technique a Fourier transform is applied to a numerical representation of a protein sequence and a peak frequency is determined for a particular protein's function. The aim of this method is to determine a single parameter that correlates with a biological function of genetic sequences. To determine such a parameter it is necessary to find common characteristics of sequences with the same biological function. The cross-spectral function determines common frequency components of two signals. For a discrete series, the cross-spectral function is defined as:
$$S_n = X_n Y_n^{*}, \quad n = 1, 2, \ldots, N/2$$
where $X_n$ are the Discrete Fourier Transform (DFT) coefficients of the series $X(n)$ and $Y_n^{*}$ are the complex conjugate DFT coefficients of the series $Y(n)$. Peak frequencies in the cross-spectral function define common frequency components for analyzed sequences. The common frequency components for a group of protein sequences can be defined as follows:
$$|M_n| = |X1_n|\,|X2_n| \cdots |XM_n|, \quad n = 1, 2, \ldots, N/2$$
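The two definitions above can be sketched as follows. This is an illustrative sketch using only the Python standard library: the function names are hypothetical, the DFT is a naive implementation, and the toy sinusoids stand in for encoded protein sequences. The inputs are assumed to have equal length; the source does not state how sequences of different lengths are aligned, and zero-padding the shorter series is one common choice.

```python
import cmath
import math


def dft(series):
    """Naive O(N^2) discrete Fourier transform of a real-valued series."""
    n_points = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n_points)
            for t, x in enumerate(series))
        for k in range(n_points)
    ]


def cross_spectrum(x, y):
    """S_n = X_n * conj(Y_n) for n = 1 .. N/2 (inputs must have equal length)."""
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    big_x, big_y = dft(x), dft(y)
    return [big_x[n] * big_y[n].conjugate() for n in range(1, len(x) // 2 + 1)]


def consensus_magnitude(series_list):
    """|M_n| = product of |X_n| over all series, for n = 1 .. N/2."""
    length = len(series_list[0])
    if any(len(s) != length for s in series_list):
        raise ValueError("all series must have equal length")
    spectra = [dft(s) for s in series_list]
    return [math.prod(abs(spec[n]) for spec in spectra)
            for n in range(1, length // 2 + 1)]


# Toy signals: both contain a component at DFT bin 5; each also carries
# one component the other lacks (bin 11 and bin 17 respectively).
N = 64
a = [math.sin(2 * math.pi * 5 * t / N) + math.sin(2 * math.pi * 11 * t / N)
     for t in range(N)]
b = [math.sin(2 * math.pi * 5 * t / N + 0.7) + math.sin(2 * math.pi * 17 * t / N)
     for t in range(N)]

magnitudes = consensus_magnitude([a, b])
# magnitudes[0] corresponds to n = 1, so add 1 to recover the bin number.
peak_bin = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
print(peak_bin)  # 5
```

Only the shared component survives the multiplication, because a frequency that is absent from any one spectrum contributes a factor close to zero to the product.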
This methodology can be illustrated via an example. Fibroblast growth factors (FGF) constitute a family of proteins that affect the growth, differentiation, and survival of certain cells. The symbolic representations of two FGF amino acid sequences are shown below:
>FGF basic bovine (SEQ ID NO:1)
PALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGR
VDGVREKSDPHIKLQLQAEERGVVSIKGVCANRYLAMKE
DGRLLASKCVTDECFFFERLESNNYNTYRSRKYSSWYVA
LKRTGQYKLGPKTGPGQKAILFLPMSAKS
>FGF acid bovine (SEQ ID NO:2)
PNLPLGNYKKPKLLYCSNGGYFLRILPDGTVDGTKDRSD
QHIQLQLCAESIGEVYIKSTETGQFLAMDTDGLLYGSQT
PNEECLFLERLEENHYNTYISKKHAEKHWFVGLKKNGRS
KLGPRTHFGQKAILFLPLPVSSD
Symbolic representations, such as these, can be translated into numerical sequences using the EIIP index, described by K. Tomii and M. Kanehisa in “Analysis of amino acids and mutation matrices for sequence comparison and structure prediction of proteins,” Protein Engineering, vol. 9, January 1996.
V. Veljkovic, I. Cosic, B. Dimitrjevic, and D. Lalovic, in “Is it possible to analyze DNA and protein sequences by the methods of digital signal processing?,” IEEE Transactions on Biomedical Engineering, vol. 32, May 1985, have shown that the EIIP correlates with certain biological properties.
The graphical representations of the corresponding numerical sequences for the FGF proteins (SEQ ID NO:1 and SEQ ID NO:2), obtained by replacing every amino acid with its EIIP value, can be seen in FIGS. 1A and 1B. A DFT is performed on each numerical sequence. The resulting spectra are shown in FIGS. 2A and 2B. The cross-spectral function of the two FGF spectra generates the consensus spectrum shown in FIG. 3. For the spectrum plots, the x-axis represents the RRM frequencies and the y-axis represents the normalized intensities. The prominent peak denotes the common frequency component for this family of proteins.
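The axes of such a spectrum plot can be reproduced with a short sketch. Two assumptions are made here that the text does not spell out: intensities are normalized by dividing by the largest value, and the RRM frequency of DFT bin n in an N-point transform is taken as n/N, so that frequencies run from just above 0 to 0.5. The function names are hypothetical.

```python
def normalize(magnitudes):
    """Scale a spectrum so that its largest intensity equals 1.0."""
    peak = max(magnitudes)
    if peak == 0:
        return [0.0 for _ in magnitudes]
    return [m / peak for m in magnitudes]


def peak_frequency(magnitudes, n_points):
    """Return the frequency n / N of the strongest bin.

    magnitudes[i] is assumed to hold bin n = i + 1 of an N-point DFT,
    matching the range n = 1 .. N/2 used for the consensus spectrum.
    """
    best = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return (best + 1) / n_points


# Toy consensus spectrum for an 8-point transform (bins n = 1 .. 4).
spectrum = [0.1, 4.0, 0.5, 0.2]
print(peak_frequency(spectrum, 8))  # 0.25
```

The strongest bin here is n = 2, which gives a frequency of 2/8 = 0.25 under the assumed scaling.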
The presence of a peak frequency in a consensus spectrum implies that all the analyzed sequences have one frequency component in common. This frequency is related to the biological function provided the following conditions are met:
one peak only exists for a group of protein sequences sharing the same biological function;
no significant peak exists for biologically unrelated protein sequences;
peak frequencies are different for different biological functions.
However, since frequency analysis alone contains no spatial information, there is no indication as to which residues contribute to the frequency components. The RRM technique lacks the ability to reliably identify the individual amino acids that contribute to that peak frequency.
Spatial Analysis Methods
Frequency analysis alone cannot handle the transitory nature of non-stationary signals. However, a time-frequency representation of a signal (or space-frequency representation, as it is synonymously known in the art; see Leon Cohen, Time-Frequency Analysis, Prentice Hall, 1995, p. 113) provides information about how the spectral content of the signal evolves with time (or space) and therefore provides a tool to analyze non-stationary signals.
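To make the idea of a space-frequency representation concrete, the sketch below tracks one frequency component along a series with a sliding-window DFT. This is a swapped-in, simpler technique chosen for illustration; it is not the wavelet-based analysis discussed in the literature cited in this section. The function names, window length, and step size are all hypothetical.

```python
import cmath
import math


def bin_magnitude(segment, k):
    """Magnitude of DFT bin k computed over a single segment."""
    n = len(segment)
    return abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                   for t, x in enumerate(segment)))


def sliding_spectrum(series, window, step, k):
    """Return (start, magnitude) pairs for bin k in each window position."""
    return [(start, bin_magnitude(series[start:start + window], k))
            for start in range(0, len(series) - window + 1, step)]


# Non-stationary toy series: silent for 32 samples, then an oscillation
# with a period of 4 samples for the remaining 32 samples.
series = [0.0] * 32 + [math.cos(2 * math.pi * t / 4) for t in range(32)]

# Bin 4 of a 16-sample window corresponds to a period of 4 samples.
track = sliding_spectrum(series, window=16, step=8, k=4)
for start, magnitude in track:
    print(start, round(magnitude, 3))
```

A single DFT over the whole series would report that the period-4 component is present but not where it occurs. The windowed magnitudes are zero over the silent half and rise once the window overlaps the oscillation, which is the spatial information that plain frequency analysis lacks.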
In an attempt to provide spatial information relating to the proteins, Q. Fang and I. Cosic, in “Prediction of active sites of fibroblast growth factors using continuous wavelet transforms and the resonant recognition model,” Proceedings of The Inaugural Conference of the Victorian Chapter of the IEEE EMBS, 1999, describe a method using a continuous wavelet transform to analyze the EIIP representations of protein sequences. The continuous wavelet transform (CWT) is one of the time-frequency or space-frequency representations. Because the CWT provides the same time/space resolution for each scale, the CWT can be chosen to localize individual events such as active site identification. The amino acids that comprise the active site(s) are identified as the set of local extrema of the coefficients in the wavelet transform domain. The energy-concentrated local extrema are the locations of sharp variation points of the EIIP and are proposed by Fang and Cosic as the most critical locations
Arce Gonzalo Ramiro
Bloch Karen Marie
Brusca John S.
E. I. du Pont de Nemours and Company
Method of non-linear analysis of biological sequence data