Data processing: database and file management or data structures – Database design – Data structure types
Reexamination Certificate
1999-07-06
2001-10-23
Choules, Jack (Department: 2177)
C707S793000, C704S004000, C704S008000, C704S009000
active
06308172
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to discovering trends in text databases. More particularly, the invention concerns the analysis of databases to find user specified trends in documenting text by employing phrase identification using sequential patterns and trend identification using shape queries.
2. Description of the Related Art
Database technology has been used with great success in traditional business data processing. However, there is an increasing desire to use this technology in new application domains. For example, one such application domain that has acquired considerable significance is that of database text analysis (sometimes referred to as “mining”).
Several approaches to different database content analysis techniques have been proposed as discussed in Feldman et al., “Knowledge Discovery in Textual Databases (KDT)”, Proc. of the 1st Int'l. Conf. on Knowledge Discovery in Databases and Data Mining, 1995; Feldman et al., “Mining Associations in Text in the Presence of Background Knowledge”, Proc. of the 2nd Int'l. Conf. on Knowledge Discovery on Databases and Data Mining, 1996; Renouf, A., “Making Sense of Text: Automated Approaches to Meaning Extraction”, 17th Int'l. On-Line Information Meeting Proceedings, 1993a; Srikant et al., “Mining Sequential Patterns: Generalizations and Performance Improvements”, Proc. of the 5th Int'l. Conf. on Extending Database Technology (EDBT), 1996. As new database content analysis techniques are discovered, an increasing number of organizations are creating ultra large databases (measured in gigabytes and even terabytes) of business data, such as consumer data, transactional histories, sales records, and historical documents. For example, U.S. Patents dating from 1970 may now be found in a computer database which forms a potential gold mine of valuable business information.
A few suggestions have been made by database content analysis practitioners concerning the discovery of interesting patterns and trend analyses on text documents. For example, analyzing trends by comparing concept distributions using old data with distributions using new data has been suggested in Feldman, 1995, supra. In Feldman, 1996, supra, associations between the key words or concepts labeling documents are described, using background knowledge about relationships among the key words. The knowledge base is used to supply unary or binary relations among the key words labeling the documents.
More specifically, using words and phrases to describe themes and concepts in text documents is now being studied by the information retrieval community. For example, mathematical models treating word associations as weighted vectors that represent “concepts” found within documents have been proposed. This “vector” approach allows a query to identify and retrieve a document even when the query and the document share no words, but do share a similar concept. The technique is referred to as Latent Semantic Indexing (LSI) and is discussed in Deerwester et al., “Indexing by Latent Semantic Analysis”, Journal of the American Society for Information Science, 41(6):391-407, 1990. However, one problem with the LSI model is the amount of time it takes to “build” the model.
The use of words and phrases to build more advanced queries to discover trends in databases is of recent advent. Various techniques, such as identifying phrases as concepts and as relationships between concepts, where the quality of text categorization is improved by using word clusters and phrases, have been proposed. However, one problem with such phrase-based database content analysis techniques is their implementation in existing databases. The database systems of today offer little functionality to support such “mining” applications, and machine learning techniques perform poorly when applied to very large databases. The difficulty in implementation of a phrase-based analysis method is one reason why the discovery of trends in text databases has not evolved as quickly as might be expected.
Although these trend-finding methods constitute a significant advance and in some instances enjoy commercial success today, the assignee of the present application has continually sought to improve the performance and efficiency of these data analysis systems. The problem with presently known methods is that trends in databases may not be easily and efficiently discovered using current techniques.
SUMMARY OF THE INVENTION
Broadly, the present invention concerns a method and apparatus used to discover trends in text databases. More particularly, the invention concerns the analysis of the contents of text databases to find user specified trends. The method employs sequential pattern phrase identification and uses shape queries to identify trends in the data.
In one embodiment, the invention may be implemented to provide a method to access and partition a database, identify words and phrases contained in text documents of the partition, and discover trends based upon the frequency with which the phrases appear. A practical example of the implementation of the present invention best summarizes the invention.
In the example, assume the present invention is connected to a database containing all granted U.S. Patents. The patent data is retrieved using a dynamically generated Structured Query Language (SQL) query based upon selection criteria specified by the user. In one embodiment, the selection criteria may be specified by the user using a graphic user interface (GUI). The present invention allows the selection of patents in a specific classification or by key words appearing in the title or abstract of each patent in the database. Once the data is retrieved, a histogram displaying the number of patents for each year may be shown on the GUI and the user may then “partition” the database, i.e., specify a range of years upon which the present invention will be implemented.
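The retrieval and partitioning step described above can be sketched roughly as follows. This is only an illustration, not the patented implementation: the `patents` table, its column names, the sample rows, and the helper names `build_patent_query`, `histogram_by_year`, and `partition_by_year` are all assumptions introduced here, and SQLite stands in for whatever database the embodiment uses.

```python
import sqlite3

def build_patent_query(classification=None, keywords=()):
    """Build a parameterized SQL query from user selection criteria.

    The table and column names are hypothetical.
    """
    clauses, params = [], []
    if classification is not None:
        clauses.append("classification = ?")
        params.append(classification)
    for word in keywords:
        # Match the key word in either the title or the abstract.
        clauses.append("(title LIKE ? OR abstract LIKE ?)")
        params.extend([f"%{word}%", f"%{word}%"])
    where = " AND ".join(clauses) if clauses else "1 = 1"
    sql = (
        "SELECT id, year, title, abstract FROM patents "
        f"WHERE {where} ORDER BY id"
    )
    return sql, params

def histogram_by_year(rows):
    """Count retrieved patents per year (the data behind the GUI histogram)."""
    counts = {}
    for _id, year, _title, _abstract in rows:
        counts[year] = counts.get(year, 0) + 1
    return dict(sorted(counts.items()))

def partition_by_year(rows, start_year, end_year):
    """Group retrieved patents into one partition per year in the chosen range."""
    parts = {year: [] for year in range(start_year, end_year + 1)}
    for row in rows:
        if start_year <= row[1] <= end_year:
            parts[row[1]].append(row)
    return parts

def demo():
    """Run the query against a tiny in-memory database of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patents (id INTEGER, year INTEGER, "
        "classification TEXT, title TEXT, abstract TEXT)"
    )
    conn.executemany(
        "INSERT INTO patents VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1995, "707", "Database index", "A method for indexing"),
            (2, 1995, "707", "Query engine", "Processing a database query"),
            (3, 1996, "707", "Text mining", "Finding trends in a text database"),
            (4, 1996, "704", "Speech coder", "Coding of speech signals"),
            (5, 1997, "707", "Sorting apparatus", "Sorting records"),
        ],
    )
    sql, params = build_patent_query(classification="707", keywords=["database"])
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

Passing the user's values as bound parameters rather than splicing them into the SQL string keeps the dynamically generated query safe against malformed input. Calling `histogram_by_year(demo())` gives the per-year counts a GUI could plot, and `partition_by_year` applies the range of years the user selects.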
The user can also choose the maximum and minimum gap desired between words in the phrases to be mined as well as the minimum support all phrases must meet for each time period between the start and ending years. Once the user has specified a range upon which the method will focus, the text data contained within that range is “cleansed” in one embodiment to remove unwanted symbols and stop words. Transaction IDs are assigned to the words in the text documents depending on their placement within each document contained within the data range. The transaction IDs encode both the position of each word within the document and the sentence, paragraph, and section breaks, and are represented in one embodiment as long integers with the sentence boundaries using the 10^3 location, the paragraph boundaries using the 10^5 location, and the section boundaries using the 10^7 location. By specifying the minimum gap of 10^3, for instance, phrases will consist of words each from different but sequential sentences.
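One possible reading of this transaction ID encoding is sketched below. The text does not spell out the arithmetic, so the scheme here is an assumption: each word receives its running position in the document, and every sentence, paragraph, or section break adds a jump of 10^3, 10^5, or 10^7 respectively. The nested-list document layout and the names `assign_transaction_ids` and `gap_ok` are likewise hypothetical.

```python
SENTENCE_STEP = 10**3   # jump added at each sentence boundary (assumed)
PARAGRAPH_STEP = 10**5  # jump added at each paragraph boundary (assumed)
SECTION_STEP = 10**7    # jump added at each section boundary (assumed)

def assign_transaction_ids(document):
    """Assign an integer transaction ID to every word of a document.

    `document` is a list of sections; a section is a list of paragraphs;
    a paragraph is a list of sentences; a sentence is a list of words.
    Returns a list of (transaction_id, word) pairs in reading order.
    """
    ids = []
    position = 0  # running word position within the document
    offset = 0    # accumulated boundary jumps
    for s, section in enumerate(document):
        if s > 0:
            offset += SECTION_STEP
        for p, paragraph in enumerate(section):
            if p > 0:
                offset += PARAGRAPH_STEP
            for n, sentence in enumerate(paragraph):
                if n > 0:
                    offset += SENTENCE_STEP
                for word in sentence:
                    ids.append((offset + position, word))
                    position += 1
    return ids

def gap_ok(id_a, id_b, min_gap, max_gap):
    """True if the later word follows the earlier one within the gap limits."""
    return min_gap <= id_b - id_a <= max_gap

# A made-up document: two sections, the first with two paragraphs.
EXAMPLE_DOCUMENT = [
    [
        [["data", "mining"], ["text", "trends"]],
        [["shape", "query"]],
    ],
    [
        [["patent", "claims"]],
    ],
]
```

Under this reading, and for documents shorter than 10^3 words, two words in the same sentence always differ by less than 10^3, while words in adjacent sentences of one paragraph differ by at least 10^3. A minimum gap of 10^3 therefore excludes same-sentence pairs, and a maximum gap below 10^5 keeps both words within one paragraph.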
Assuming partitioning and cleansing have occurred as discussed above, each partition containing patent documents is passed over by the present invention using a generalized sequential pattern method to generate those phrases in each partition that meet a minimum support threshold as specified by the user. The resulting phrases may be cached in one embodiment so that different shape queries can be run using the data. The shape query engine used in the present invention takes the set of partitioned phrases and selects those that match the given shape query. In another embodiment, once a shape query has been defined either internally or using a graphical editor, the shape query is rewritten into a standard definition language (SDL). The SDL is used to determine user specified trends which are present in the partitioned database.
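The flow from per-partition phrase support to shape selection might look roughly like the sketch below. It is deliberately simplified and rests on several assumptions: a brute-force count of ordered two-word phrases stands in for the generalized sequential pattern method, a hard-coded Python predicate stands in for a shape query written in SDL, and the sample partitions and all function names are invented for illustration.

```python
from itertools import combinations

def make_doc(text):
    """Turn a cleansed text into (transaction_id, word) pairs, one ID per word."""
    return list(enumerate(text.split()))

def phrase_support(partition, min_gap=1, max_gap=1):
    """Fraction of documents in a partition containing each ordered word pair.

    Brute-force stand-in for the sequential pattern step; documents are
    lists of (transaction_id, word) pairs sorted by ID.
    """
    if not partition:
        return {}
    counts = {}
    for doc in partition:
        found = set()
        for (id_a, w_a), (id_b, w_b) in combinations(doc, 2):
            if min_gap <= id_b - id_a <= max_gap:
                found.add((w_a, w_b))
        for phrase in found:  # count each phrase once per document
            counts[phrase] = counts.get(phrase, 0) + 1
    return {ph: c / len(partition) for ph, c in counts.items()}

def support_histories(partitions, min_support, **gaps):
    """Keep phrases that meet min_support in every period, with their history."""
    periods = sorted(partitions)
    per_period = [phrase_support(partitions[p], **gaps) for p in periods]
    if not per_period:
        return periods, {}
    frequent = set(per_period[0])
    for support in per_period:
        frequent = {ph for ph in frequent if support.get(ph, 0.0) >= min_support}
    return periods, {ph: [s[ph] for s in per_period] for ph in frequent}

def recent_upward_trend(history, last=3):
    """Hypothetical shape: support strictly increases over the last periods."""
    tail = history[-last:]
    return len(tail) == last and all(a < b for a, b in zip(tail, tail[1:]))

def select_by_shape(histories, shape):
    """Return the phrases whose support history matches the shape predicate."""
    return sorted(ph for ph, history in histories.items() if shape(history))

# Made-up partitions, one list of documents per year.
PARTITIONS = {
    1995: [make_doc(t) for t in
           ["data mining system", "file system", "file system", "file system"]],
    1996: [make_doc(t) for t in
           ["data mining tool", "data mining", "file system", "file system"]],
    1997: [make_doc(t) for t in
           ["data mining", "data mining", "data mining file system", "index"]],
}
```

Computing the histories once with `support_histories(PARTITIONS, 0.25, min_gap=1, max_gap=1)` and then passing different predicates to `select_by_shape` mirrors the caching idea above: the phrase supports are computed a single time and can be queried with several shapes.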
In another embodiment the user may define his own shape by using a visual shape editor. In any event, the query may take the form of requesting a trend in phrase usage in patents such as “recent upwards trend”, “recent spikes in usa
Agrawal Rakesh
Lent Brian Scott
Srikant Ramakrishnan
Channavajjala Srirama
Choules Jack
Gray Cary Ware & Freidenrich
International Business Machines - Corporation