Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
2000-08-31
2004-02-24
Robinson, Greta (Department: 2177)
C707S793000
active
06697801
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to methods for hierarchically parsing and indexing text.
BACKGROUND OF THE INVENTION
Parsing and indexing text are paramount concerns in the search and retrieval industry. Text improperly parsed cannot be indexed properly, and a poorly constructed index will yield poor search results and correspondingly poor answer-set accuracy. Moreover, parsing and indexing text efficiently and accurately impacts a wide variety of information technologies, such as data mining, data abstracting, data extracting, data linking, data compression, data presentation, data visualization, data intelligence, and the like.
Search efficiency is often measured in terms of search performance and search accuracy. Because users desire nearly instantaneous results to their searches, user search queries are often conducted against small organized indexes, rather than against a composite body of text in its native format. Indexes improve search performance because search engines can detect matches or hits in structured indexes much more efficiently than in a body of natively formatted text documents.
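The difference between scanning native text and consulting a structured index can be illustrated with a minimal sketch. The document names, the whitespace tokenization, and the function names below are assumptions made for illustration only; they are not taken from the specification.

```python
from collections import defaultdict

def scan_documents(docs, term):
    """Search every document's text directly (no index)."""
    return sorted(name for name, text in docs.items()
                  if term in text.lower().split())

def build_document_index(docs):
    """Build a simple index mapping each word to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

# Hypothetical sample documents.
docs = {
    "a.txt": "apple pie recipe",
    "b.txt": "cherry pie",
    "c.txt": "apple cider",
}
index = build_document_index(docs)

# A query against the index is a single lookup rather than a pass over all text.
print(sorted(index["pie"]))         # ['a.txt', 'b.txt']
print(scan_documents(docs, "pie"))  # ['a.txt', 'b.txt']
```

Both calls return the same hits; the index simply moves the work of reading the text to index-generation time.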
Search accuracy is primarily governed by the rules employed during index generation. Some rules reduce the amount of information in the index to improve search speed and/or reduce index size. Other rules add information to the index to improve the quality of search results.
An indexer can increase speed and reduce index size by excluding certain words, symbols, and characters from the index. The excluded words are typically those that occur frequently, like ‘the’ and ‘and’, and are sometimes referred to as stop words. Punctuation and capitalization, symbol characters like the dollar “$”, percent “%”, and pound “#”, and other characters considered to be non-word characters are also typically ignored. While these exclusionary rules improve search engine response time, they sacrifice search accuracy. Under these rules, a search engine may not be able to respond at all to certain queries, like “to be, or not to be”, which may consist entirely of non-indexed text.
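A minimal sketch of such an exclusionary indexer follows. The stop-word list and the function name are hypothetical, and real stop-word lists are considerably longer; the point is only to show how a query can vanish entirely.

```python
import re

# Hypothetical stop-word list, chosen for illustration.
STOP_WORDS = {"to", "be", "or", "not", "the", "and"}

def exclusionary_terms(text, stop_words=STOP_WORDS):
    """Return the terms that survive lowercasing, symbol removal, and stop-word removal."""
    # \w+ keeps only runs of word characters, so "$", "%", "#" and punctuation are dropped.
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w not in stop_words]

print(exclusionary_terms("To be, or not to be"))       # []
print(exclusionary_terms("The price and the $5 fee"))  # ['price', '5', 'fee']
```

The first query leaves nothing to look up, and in the second the dollar sign is lost, so "$5" and "5" become indistinguishable.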
Existing parsing and indexing techniques partially deal with the organization and representation of text at various levels. Some examples include web site domain names, file system organizations, and documents decomposed as chapters, pages, and paragraphs. Some of these levels are linguistically oriented, such as the representation of noun phrases and grammatical constructs, and other levels focus on the character strings themselves, and may identify how sequences of characters of different types are grouped together into strings and sub-strings. Little progress has been made with respect to the parsing of strings and substrings, which has made search and retrieval particularly problematic and correspondingly less accurate.
Search accuracy is increasingly important as the body of available information continues to expand. The accuracy sacrificed by excluding certain words and characters is a cause of growing frustration for search engine users. As computer processing power and storage capacity increase, the cost of increasing search accuracy decreases.
To improve accuracy, the text being indexed and the search queries themselves are often parsed to identify character strings representing words. Identifying word boundaries presents a number of problems for software-implemented parsers and linguistic analyzers. Word boundary parsing software will typically divide words when a symbol character is encountered in a character string. In certain cases, such as “CAD/CAM”, the parser will decompose the compound word “CAD/CAM” into two individual words, “CAD” and “CAM”. Because the parser cannot detect such compound words as single entities, queries for them yield slightly less precise results.
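The splitting behavior described above can be sketched as follows, alongside one possible alternative that keeps whitespace-delimited strings whole. Both functions and the punctuation set are assumptions for illustration, not the parser of any particular product.

```python
import re

# Hypothetical set of punctuation trimmed from the edges of a string.
EDGE_PUNCTUATION = ".,;:!?\"'()"

def split_on_symbols(text):
    """Naive word-boundary parser: any non-alphanumeric character ends a word."""
    return re.findall(r"[A-Za-z0-9]+", text)

def keep_compounds(text):
    """Keep whitespace-delimited strings whole, trimming only edge punctuation."""
    trimmed = (t.strip(EDGE_PUNCTUATION) for t in text.split())
    return [t for t in trimmed if t]

print(split_on_symbols("CAD/CAM software."))  # ['CAD', 'CAM', 'software']
print(keep_compounds("CAD/CAM software."))    # ['CAD/CAM', 'software']
```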
The changing nature of the text being indexed and searched may also impact search accuracy. Symbol characters typically ignored by indexers are becoming increasingly prominent. Consider the elements of an e-mail or World Wide Web address. The “at” (@) sign, the dot (.), the colon (:), and the slash (/) have all become commonplace. A search for someone's e-mail address, for example, will yield much more accurate results if the indexer does not ignore the at “@” symbol and the dot “.”.
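As a rough sketch of why retaining symbol characters helps, the example below compares a words-only tokenizer with one that keeps each symbol as its own token. The address, the sample texts, and the function names are hypothetical.

```python
import re

def words_only(text):
    """Tokenizer that ignores symbol characters."""
    return re.findall(r"\w+", text)

def words_and_symbols(text):
    """Tokenizer that keeps each symbol character as a separate token."""
    return re.findall(r"\w+|[^\w\s]", text)

def contains_sequence(tokens, query):
    """True if the query tokens appear consecutively in tokens."""
    n = len(query)
    return any(tokens[i:i + n] == query for i in range(len(tokens) - n + 1))

doc_a = "Contact jdoe@example.com today"  # contains the address
doc_b = "jdoe example com"                # same words, no address
query = "jdoe@example.com"

# Ignoring symbols, both texts match the query.
print(contains_sequence(words_only(doc_b), words_only(query)))                # True
# Keeping "@" and "." as tokens, only the real address matches.
print(contains_sequence(words_and_symbols(doc_a), words_and_symbols(query)))  # True
print(contains_sequence(words_and_symbols(doc_b), words_and_symbols(query)))  # False
```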
When a search engine returns a hit to the user, it is useful to be able to see the surrounding text (the sentence or paragraph, for example) which contains the words that matched the query. This is referred to as “showing the hit in context”. The context of the hit may either be derived from the index itself, or re-extracted from the original document. It is typically less costly to reconstruct a portion of a document from the index, which is already at hand, than it is to locate and retrieve it from the original document.
Any text excluded during index generation would naturally be unavailable for reconstructing the context of a hit. The more completely the text is indexed, the more closely a reconstructed portion of the text will match the original.
Even an indexer that does not exclude any words or symbols from the index may not be able to reconstruct a hit in context. This is the case when the indexer records only a reference to the source document for each piece of text indexed, and not its relative position in the document. The relative position of each piece of text must be known so that the pieces can be reassembled in the proper order.
For example, if an indexer of this type encountered the phrase “apple pie” while indexing a document called “Mom's Recipes”, it would generate two entries, “apple” and “pie”. Each entry would be stored with a reference to “Mom's Recipes”, but with no indication that “apple” came before “pie” or that in fact, the two entries were adjacent to one another.
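By contrast, an index that also records positions can rebuild a stretch of the source. The sketch below assumes whitespace tokenization and a hypothetical one-line document; the function names are illustrative only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to a list of (document name, position) pairs."""
    index = defaultdict(list)
    for name, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].append((name, position))
    return index

def reconstruct(index, name, start, end):
    """Rebuild positions start..end (inclusive) of a document from the index alone."""
    found = {}
    for term, postings in index.items():
        for doc, position in postings:
            if doc == name and start <= position <= end:
                found[position] = term
    return " ".join(found[p] for p in sorted(found))

# Hypothetical document content.
docs = {"Mom's Recipes": "warm apple pie with cream"}
index = build_positional_index(docs)

print(index["apple"])                             # [("Mom's Recipes", 1)]
print(index["pie"])                               # [("Mom's Recipes", 2)]
print(reconstruct(index, "Mom's Recipes", 0, 2))  # warm apple pie
```

Because “apple” is recorded at position 1 and “pie” at position 2, the index now shows that the two entries are adjacent and in which order they occur.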
This lack of knowledge about the ordering of text pieces in the source also rules out proximity searches, in which the query specifies that certain terms must occur within a certain distance of each other.
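Once positions are recorded, a proximity test reduces to comparing position lists. The following is a minimal sketch under the assumption that distance is measured in whitespace-delimited terms; the sample text and function names are hypothetical.

```python
def term_positions(text, term):
    """Positions at which a term occurs in whitespace-tokenized text."""
    return [i for i, word in enumerate(text.split()) if word == term]

def within_distance(positions_a, positions_b, max_distance):
    """True if any occurrence of one term lies within max_distance terms of the other."""
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

text = "apple pie cools while the cider warms"
apple = term_positions(text, "apple")  # [0]
pie = term_positions(text, "pie")      # [1]
cider = term_positions(text, "cider")  # [5]

print(within_distance(apple, pie, 1))    # True
print(within_distance(apple, cider, 3))  # False
```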
Current computing resources enable indexers to include previously excluded words and symbols and their positions in the source, but that is not all. Indexers may also augment the index with additional information to improve search intelligence. For example, using currently available linguistics technology, an indexer may associate thesauri terms, morphological word roots and forms, phonetic and soundex representations, and alternate spellings with the words being indexed. Advanced indexers may also associate concepts, classifications, and categories with the indexed words, permitting more advanced searches and improving the overall quality and relevance of the search results.
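The idea of augmenting an index with associated terms can be sketched as below. The two lookup tables are hypothetical stand-ins for real linguistic resources such as a thesaurus or a morphological analyzer, and the function names are assumptions for illustration.

```python
# Hypothetical lookup tables standing in for real linguistic resources.
WORD_ROOTS = {"pies": "pie", "ran": "run", "running": "run"}
THESAURUS = {"pie": ["tart"], "car": ["automobile"]}

def associated_terms(word):
    """Return the word together with its root and any thesaurus terms for that root."""
    root = WORD_ROOTS.get(word, word)
    return {word, root, *THESAURUS.get(root, [])}

def build_augmented_index(words):
    """Index each position under the word itself and under its associated terms."""
    index = {}
    for position, word in enumerate(words):
        for term in associated_terms(word):
            index.setdefault(term, []).append(position)
    return index

augmented = build_augmented_index(["pies", "ran"])
print(augmented["pie"])   # [0]
print(augmented["tart"])  # [0]
print(augmented["run"])   # [1]
```

A query for “pie” or “tart” now reaches the position where “pies” occurred, at the cost of a larger index.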
SUMMARY OF THE INVENTION
Accordingly, an object of the invention is to provide methods of hierarchically parsing and indexing text. By parsing and indexing text at a level above what is ordinarily considered a word, and including stop words, symbol characters, and formatting characters along with the hierarchical relationships between the various text pieces, searching, retrieving, mining, abstracting, extracting, visualizing, and presenting the text becomes more useful and accurate.
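One way to picture such a hierarchy is a two-level sketch in which each whitespace-delimited string is a higher-level entity and its runs of letters, runs of digits, and individual symbols are recorded as sub-parts. The level definitions, the posting format, and the function names are assumptions made for illustration; they are not the claimed method.

```python
import re

def parse_hierarchically(text):
    """Return (higher_level_entity, sub_parts) pairs for each whitespace-delimited string."""
    entities = []
    for chunk in text.split():
        # Sub-parts: runs of letters, runs of digits, or single symbol characters.
        parts = re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]", chunk)
        entities.append((chunk, parts))
    return entities

def build_hierarchical_index(text):
    """Index entities and sub-parts; postings are (entity position, sub-part position or None)."""
    index = {}
    for top, (entity, parts) in enumerate(parse_hierarchically(text)):
        index.setdefault(entity, []).append((top, None))
        if parts != [entity]:
            for sub, part in enumerate(parts):
                index.setdefault(part, []).append((top, sub))
    return index

hierarchical = build_hierarchical_index("use CAD/CAM tools")
print(hierarchical["CAD/CAM"])  # [(1, None)]
print(hierarchical["CAM"])      # [(1, 2)]
print(hierarchical["/"])        # [(1, 1)]
```

In this sketch a query for “CAD/CAM” matches the compound as a single entity, while a query for “CAM” still finds it as a sub-part, and the slash is retained together with its position inside its parent.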
Additional objectives, advantages and novel features of the invention will be set forth in the description that follows and, in part, will become apparent to those skilled in the art upon examining or practicing the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims. To achieve the foregoing and other objects and in accordance with the purpose of the present invention, methods of hierarchically parsing and indexing text are provided.
A method of indexing text using a set of executable instructions is provided, comprising receiving one or more characters and recognizing the characters as a first level text entity. Further, lower level text entities are recognized as sub-parts
Eldredge Michael A.
Johnson Russell C.
Millet Ronald P.
Pratt John P.
Tietjen Bruce R.
Dinsmore & Shohl LLP
Le Debbie M.
Novell Inc.
Robinson Greta