Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1998-12-07
2002-01-08
Black, Thomas (Department: 2171)
C704S001000
active
06338057
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention lies in the field of methods and apparatus for data management and retrieval and finds particular application to the field of methods and apparatus for identifying key data items within data sets.
2. Related Art
Recent advances in technology, such as CD-ROMs, intranets and the World Wide Web, have vastly increased the volume of information resources that are available in electronic format.
A problem associated with these increasing information resources is that of locating and identifying data sets (e.g. magazine articles, news articles, technical disclosures and other information) of interest to the individual user of these systems.
Information retrieval tools such as search engines and Web guides are one means for assisting users to locate data sets of interest. Proactive tools and services may also be used to identify information that may be of interest to individual users. Examples include newsgroups, broadcast services such as the POINTCAST™ system available on the Internet at www.pointcast.com, and tools like the JASPER agent detailed in the applicant's co-pending international patent application PCT GB96/00132 (U.S. application Ser. No. 08/875,091 filed Jul. 22, 1997, now U.S. Pat. No. 5,931,907), the subject matter of which is incorporated herein by reference.
In order for these information retrieval and management tools to be effective, either a summary or a set of key words is often identified for any data set located by the tool, so that users can form an impression of the subject matter of the data set by reviewing this set of key words or by reviewing the summary.
Summarising tools typically use the key words that occur within a data set as a means of generating a summary. Key words are typically identified by stripping out conjunctions such as “and” and “with”, and other so-called low value words such as “it”, “are”, “they” etc., none of which tends to be indicative of the subject matter of the data set being investigated by the summarising tool.
Increasingly, key words and key phrases are also being used by information retrieval and management tools as a means of indicating a user's preference for different types of information. Such techniques are known as “profiling”, and the profiles can be generated automatically by a tool in response to a user indicating that a data set is of interest, for example by bookmarking a Web page or by downloading data from a Web page.
Advanced profiling tools also use similarity matrices and clustering techniques to identify data sets of relevance to a user's profile. The JASPER tool, referred to above, is an example of such a tool that uses profiling techniques for this purpose.
In the Applicant's co-pending European patent application number EP 97306878.6 (corresponding to U.S. application 09/155,172 filed Sep. 22, 1998), the subject matter of which is incorporated herein by reference, a means of identifying key terms consisting of several consecutive words is disclosed. These key terms are used as well as individual key words within a similarity matrix. This enables terms such as “Information Technology” and “World Wide Web” to be recognised as terms in their own right rather than as two or three separate key words.
However, these techniques for identifying key words and phrases are less than optimal because they eliminate conjunctive words and other low value words in order to identify the key words and phrases of a particular data set. They identify only phrases that consist solely of high value words, such as “information retrieval”. Yet conjunctive terms often provide a great deal of contextual information.
For example, in the English language, the phrase “bread and butter” has two meanings. The first relates to food and the second relates to a person's livelihood or a person's means of survival. Similarly, in the English language, the term “bread and water” again relates to food and also has a second meaning that is often used to imply hardship.
An information retrieval or management tool that eliminates all conjunctive words during the process of identifying key words and phrases in a block of text would reduce the phrases “bread and butter” and “bread and water” to a list of key words consisting of “bread”, “butter”, “water”. In such a list, the second meanings of hardship and a person's livelihood are lost.
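The loss described above can be reproduced with a minimal sketch of a conventional key word extractor. The stop list `LOW_VALUE_WORDS` and the function name `strip_low_value_words` are hypothetical and chosen only for illustration; they are not taken from any tool named in this document.

```python
import re

# Hypothetical, deliberately tiny stop list (an assumption for illustration).
LOW_VALUE_WORDS = {"and", "with", "it", "are", "they", "of", "on", "the", "a"}


def strip_low_value_words(text):
    """Return the distinct high value words of `text` in first-seen order."""
    key_words = []
    for word in re.findall(r"[a-z]+", text.lower()):
        # Conjunctions and other low value words are discarded outright,
        # and repeated high value words are listed only once.
        if word not in LOW_VALUE_WORDS and word not in key_words:
            key_words.append(word)
    return key_words


print(strip_low_value_words("bread and butter, bread and water"))
```

Once “and” has been discarded, nothing in the resulting list records which words originally stood together, so neither phrase can be recovered from it.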
A further problem is that names such as “Bank of England”, “Stratford on Avon” or terms such as “black and white”, “on and off” are reduced to their constituent, higher value words, thus altering the information returned by the tool.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention there is provided an apparatus for managing data sets, having: an input means for receiving data sets as input; means adapted to identify, within a said data set, a first set of words comprising one or more word groups of one or more words, conforming to a predetermined distribution pattern within said data set, wherein said words in said word groups occur consecutively in the data set; means adapted to identify, within said first set, a sub-set of words comprising one or more of said word groups, conforming to a second predetermined distribution pattern within said data set; means adapted to eliminate said sub-set of words from said first set thereby forming a set of key terms of said data set; and output means for outputting at least one said key term.
According to a second aspect of the present invention there is provided a method of managing data sets, including the steps of:
1) receiving a data set as input;
2) identifying a first set of words conforming to a first distribution pattern within said data set, said first set comprising one or more word groups of one or more words, wherein said words in said word groups occur consecutively in the data set;
3) identifying a sub-set of word groups in said first set, said sub-set conforming to a second distribution pattern within said data set;
4) eliminating said sub-set from said first set thereby identifying a set of key terms; and
5) outputting said key terms.
Thus embodiments of the present invention identify, within a received data set, a first set of word groups of one or more words according to a first pattern within the data set and then identify a second pattern of word groups from within the first set. The key terms are those groups of one or more words within the first set that do not conform to the second pattern.
The approach of identifying, within the data set, patterns of word groups, enables key terms to be extracted without first eliminating low value words. This has the advantage that conjunctive words and other low value words can be retained within the data set so that terms such as “on and off”, “bread and water” and “chief of staff” can be identified as key terms in their own right.
This improves the quality of the key terms extracted and also allows key terms of arbitrary length to be identified.
Preferably said first distribution pattern requires that each word group in the first set occurs more than once in said data set and preferably said second distribution pattern requires that each word group in the sub-set comprises a word or a string of words that occurs within a larger word group in the first set.
Thus embodiments of the present invention pick out any repeated words and phrases, and then eliminate any word or phrase already contained in a longer one. For instance, if a document refers to “Internet search engines” more than once, the whole phrase will become a key term but “Internet” and “search engines” on their own would be eliminated, as would “search” and “engines” as single words.
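The preferred distribution patterns can be illustrated with a minimal sketch, which is not the claimed apparatus itself. It assumes that word groups do not cross sentence boundaries and that group length is capped by a hypothetical `max_len` parameter. Neither assumption comes from the text, which allows key terms of arbitrary length, and all function names are illustrative.

```python
import re
from collections import Counter


def repeated_word_groups(text, max_len=5):
    """First pattern: groups of consecutive words occurring more than once."""
    counts = Counter()
    # Assumption: sentences are split on '.', '!' and '?'.
    for sentence in re.split(r"[.!?]", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {group for group, count in counts.items() if count > 1}


def contained_in(short, long):
    """True if `short` occurs as a run of consecutive words in a longer group."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def key_terms(text, max_len=5):
    first_set = repeated_word_groups(text, max_len)
    # Second pattern: groups that occur within a larger group of the first set.
    sub_set = {
        group for group in first_set
        if any(contained_in(group, other) for other in first_set)
    }
    # Key terms are what remains after eliminating the sub-set.
    return {" ".join(group) for group in first_set - sub_set}


text = ("Internet search engines index the Web. "
        "Many users rely on Internet search engines daily.")
print(key_terms(text))
```

Because no words are discarded before counting, a repeated phrase such as “bread and butter” survives intact, conjunction included.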
Preferably said first aspect includes means for modifying said word groups, adapted to remove low value words occurring before the first high value word in a word group and adapted to remove low value words occurring after the last high value word in a word group. In the trivial case of a word group composed of a single, low value word, the word group it
Black Thomas
British Telecommunications public limited company
Nixon & Vanderhye P.C.
Wang Mary