Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-04-14
2004-08-24
Pardo, Thuy N. (Department: 2175)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C707S793000
Reexamination Certificate
active
06782380
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to methods and systems to index and search records stored in a language using nested fields, particularly those stored in the Extensible Markup language (XML). In particular, the present invention relates to an improved method and a computerized system to index and search documents and data in languages such as XML hat utilize nested fields.
BACKGROUND OF THE INVENTION
The Extensible Markup Language (XML) is a universally accepted format for representing structured data in textual form. It is widely adopted in enterprise databases, and particularly in databases and applications connected to the World Wide Web. The manipulation and exchange of structured data, e.g., spreadsheets, address books, financial a transactions, technical drawings, etc., is often challenging as the data is traditionally represented in platform or program dependent document formats. XML provides a set of rules and guidelines for designing text formats for such data; these XML text formats are unambiguous, platform-independent, and extensible.
An example of a simple XML document is provided as follows:
<record>
<name>
<first_name>
John
</first_name>
<last_name>
Smith
</last_name>
<
ame>
<address>
<street>
123 Smith Drive
</street>
<city>
New York
</city>
<state>
New York
</state>
</address>
</record>
Basic XML format includes tags with brackets, e.g., <city> begins a field and </city> ends a field. Thus, <city> New York </city> represents a field named “city” that contains the content “New York.” Fields can be nested, e.g., “city” is an element in the field “address,” as shown above. More complex syntax can be used for various types of data.
A key practical issue in realizing advantages afforded by XML is the need for an efficient search method. Easy data manipulation and exchange requires an effective method to handle computational intensive search operations for complex and concurrent queries, which are becoming common place in the use of networked enterprise databases and databases connected to the Internet.
Existing database management systems, such as relational database and object-oriented database systems, are generally equipped with mechanisms or facilities for rapidly retrieving selected records based on key fields in the database. Such facilities or mechanisms often depend upon the data and the schema, and therefore are specific to each database. A variety of complex data structures are implemented in databases to facilitate fast retrieval of data based on key fields; examples include binary trees, B-trees, and red-black trees. Additionally, various types of indices are built for certain key words or fields that are frequently queried in a database to enable fast searching on those words and fields.
Existing full-text indices allow rapid searches on any word in a body of text. They are commonly used by Internet search engines such as Hotbot and Alta Vista to enable a user to quickly identify a particular Web site. Although they vary considerably in their implementation, full-text indices essentially consist of a table of words in alphabetical order, with pointers or links to the corresponding locations of the words in a database or a file. Generally a full-text index also supports wildcard (represented by “*”) searches that locate words based on a partial match. For example, a search for “appl*” will find “apply,” “appliance,” etc.
Neither of these existing technologies provides an efficient way to search XML. Since XML represents structural data in a textual format, it lends itself only to a slow, sequential scan of the text in a search of a particular record. Standard full-text indexing provides only an incomplete solution because the field context of each word is not preserved. For example, a standard full-text index of the sample XML document above supports a search for “Smith,” but not for “Smith” only in the “address” field. That is, one cannot locate an address with “Smith” in it using a full-index search; such a search will find all records in any field that has “Smith” in it. Some full-text indexing systems have the ability to search for a word associated with a particular property or field of a document (such as “Author is John Smith”), but this still does not provide a way to search based on the structural context of a word in an XML file, which involves several nested field qualifiers.
Therefore, much needed is an improved full text indexing mechanism for searching XML data, which is capable of distinguishing between “Smith” in the last_name field and “Smith” in the street field, or between “New York” in the city field and “New York” in the state field. Such a mechanism should also preserve information on nested fields, so that the street field is recognized as an element within the address field, and the last_name field is recognized as an element of the name field. The queries such as “address contains New York” (search for any record that contains New York in the address field or any field under the address field) and “address/city contains New York” (search for any record that contains New York in the city field that is part of an address field) should rapidly retrieve the qualified records using such an improved indexing and searching mechanism. To make fast and effective searches possible, certain external data structures need to be constructed to preserve the inherent structure information in the XML data and to provide a short cut to locate particular items.
However, the current state of the art only provides limited alternatives for indexing and searching XML data. One approach is to create separate indices for each sub-fields, which preserves the structural information of the data but drastically increases the overhead and therefore is not desirable. Another approach is to use a directed graph to represent the nested fields. (Goldman R. et al., Lore: a database management system for XML, 2000) The search through a directed graph can be extremely computationally intensive and costly as the complexity of the data, hence complexity of the graph, grows. Both approaches result in an index structure whose complexity is comparable with that of the XML data itself. A more efficient and cost-saving indexing and searching method is desired.
SUMMARY OF THE INVENTION
To resolve the above problems, the present invention is directed to an improved method and a computer system for indexing and searching records in a language utilizing nested fields, such as XML. The present invention discloses an indexing and searching engine that constructs an improved full-text search index on the input XML data and then performs searches using the index. The indexing and searching engine according to the preferred embodiment of this invention supports exact matches and partial matches using a wildcard character.
In accordance with one aspect of the present invention, the method transforms the problem of indexing and searching nested field records, including XML data, into the problem of full-text indexing and searching of plain text documents. The input XML data is changed into a form that encodes the field structural information by suffixing each word with its corresponding field qualifiers in their nested entirety, or alternatively, by suffixing each word with a numerical code pattern that represents the word's corresponding field qualifiers in their nested entirety. The resulting encoded words are then stored in a full-text index structure.
In accordance with another aspect of the present invention, wildcard matching may be used to perform searches with or without field qualifiers. To search using a wildcard without field qualifiers allows identifying a record including a particular word regardless the field of the record, whereas to search using a wildcard with field qualifiers allows identifying a record including a particular word in a designated field or fields that share certain level of similarly nested structure.
In
LandOfFree
Method and system for indexing and searching contents of... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and system for indexing and searching contents of..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and system for indexing and searching contents of... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3302561