Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-02-24
2001-09-25
Alam, Hosain T. (Department: 2172)
Data processing: database and file management or data structures
Database design
Data structure types
C704S009000
Reexamination Certificate
active
06295533
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to accessing databases, and particularly to accessing heterogeneous relational databases.
BACKGROUND OF THE INVENTION
Databases are the principal way in which information is stored. The most commonly used type of database is a relational database, in which information is stored in tables called relations. Relational databases are described in
A First Course on Database Systems
by Ullman and Widom, Prentice Hall, 1997, and in
An Introduction to Database Systems
, by C. J. Date, Addison Wesley, 1995.
Each entry in a relation is typically a character string or a number. Generally relations are thought of as sets of tuples, a tuple corresponding to a single row in the table. The columns of a relation are called fields.
Commonly supported operations on relations include selection and join. Selection is the extraction of tuples that meet certain conditions. Two relations are joined on fields F1 and F2 by first taking their Cartesian product (the Cartesian product of two relations A and B is the set of all tuples a
1
, . . . , am, b
1
, . . . , b
n
, where a
1
, . . . , a
m
is a tuple from A, and b
1
, . . . , b
n
is a tuple from B) and then selecting all tuples such that F1=F2. This leads to a relation with two equivalent fields, so usually one of these is discarded.
Joining relations is the principal means of aggregating information that is spread across several relations. For example,
FIG. 1
shows two sample relations Q 101 and R 102, and the result of joining Q and R (the “Join” of Q and R) 103 on the fields named MovieID (the columns indicated by 104.) For reasons of efficiency, relations are usually joined on special fields that have been designated as keys, and database management systems are implemented so as to efficiently perform joins on fields that are keys.
In most databases, each tuple corresponds to an assertion about the world. For instance, the tuple<12:30, 11, “Queen of Outer Space (Zsa Zsa Gabor)”, 137>(the row indicated by 105) in the relation Q 101 of
FIG. 1
corresponds to the assertion “the movie named ‘Queen of Outer Space’, starring Zsa Zsa Gabor, will be shown at 12:30 on channel 11.”
Known systems can represent information that is uncertain in a database. One known method associates every tuple in the database with a real number indicating the probability that the corresponding assertion about the world is true. For instance, the tuple described above might be associated with the probability 0.9 if the preceding program was a major sporting event, such as the World Series. The uncertainty represented in this probability includes the possibility, for example, that the World Series program may extend beyond its designated time slot. Extensions to the database operations of join and selection useful for relations with uncertain information are also known. One method for representing uncertain information in a database is described in
Probabilistic Datalog—a Logic for Powerful Retrieval Methods
” by Norbert Fuhr, in Proceedings of the 1995 ACM SIGIR Conference on Research in Information Retrieval, pages 282-290, New York, 1995. Other methods are surveyed in
Uncertainty Management in Information Systems
, edited by Motro and Smets, Kluwer Academic Publishers, 1997. Database systems that have been extended in this way are called probabilistic databases.
Another way of storing information is with a text database. Here information is stored as a collection of documents, also known as a corpus. Each document is simply a textual document, typically in English or some other human language. One standard method for representing text in such a database so that it can be easily accessed by a computer is to represent each document as a so-called document vector. A document vector representation of a document is a vector with one component for each term appearing in the corpus. A term is typically a single word, a prefix of a word, or a phrase containing a small number of words or prefixes. The value of the component corresponding to a term is zero if that term does not appear in the document, and non-zero otherwise.
Generally the non-zero values are chosen so that words that are likely to be important have larger weights. For instance, word that occur many times is a document, or words that are rare in the corpus, have large weights. A similarity function can then be defined for document vectors, such that documents with the similar term weights have high similarities, and documents with different term weights have low similarity. Such a similarity function is called a term-based similarity metric.
An operation commonly supported by such text databases is called ranked retrieval. The user enters a query, which is a textual description of the documents he or she desires to be retrieved. This query is then converted into a document vector. The database system then presents to the user a list of documents in the database, ordered (for example) by decreasing similarity to the document vector that corresponds to the query.
As an example, the Review column (the column indicated by 107) of relation R 102 in
FIG. 1
might be instead stored in a text database. The answer to the user query “embarrassingly bad science fiction” might be a list containing the review of “Queen of Outer Space” as its first element, and the review of “Space Balls” as its second element.
In general, the user will only be interested in seeing a small number of the documents that are highly similar. Techniques are known for efficiently generating a reduced list of documents, say of size K, that contains all or most of the K documents that are most similar to the query vector, without generating as an intermediate result a list of all documents that have non-zero similarity to the query. Such techniques are described in Chapters 8 and 9 of
Automatic Text Processing
, edited by Gerard Salton, Addison Wesley, Reading, Mass., 1989, and in
Query Evaluation: Strategies and Optimizations
by Howard Turtle and James Flood in
Information Processing and Management
, 31(6):831-850, November 1995.
In some relational database management systems (RDBMS) relations are stored in a distributed fashion, i.e., different relations are stored on different computers. One issue which arises in distributed databases pertains to joining relations stored at different sites. In order for this join to be performed, it is necessary for the two relations to use comparable keys. For instance, consider two relations M and E, where each tuple in M encodes a single person's medical history, and each tuple in E encodes data pertaining to a single employee of some large company. Joining these relations is feasible if M and E both use social security numbers as keys. However, if E uses some entirely different identifier (say an employee number), then the join cannot be carried out, and there is no known way of aligning the tuples in E with those in M. To take another example, the relations Q 101 and R 102 of
FIG. 1
could not be joined unless they both contained a similar field, such as the MovieID field (column 104.)
In practice, the presence of incomparable key fields is often a problem in merging relations that are maintained by different organizations. A collection of relations that are maintained separately are called heterogeneous. The problem of providing access to a collection of heterogeneous relations is called data integration. The process of finding pairs of keys that are likely to be equivalent key matching is called key matching.
Techniques are known for coping with some sorts of key mismatches that arise in accessing heterogeneous databases. One technique is to normalize the keys. For instance, in the relations Q 101 and R 102 in
FIG. 1
, suppose that numeric MovieID's are not available, and it is desirable to join Q 101 and R 102 on strings that contain the name of the movie, specifically, the MovieName field (the column indicated by 106) of Q 101, and the underlined section of the Review field (the column indicated by 107) of R 102. One might
Alam Hosain T.
AT&T Corp.
Fleurantin Jean Bolte
Kenyon & Kenyon
LandOfFree
System and method for accessing heterogeneous databases does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method for accessing heterogeneous databases, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method for accessing heterogeneous databases will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2517197