System and method for extracting knowledge from documents

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06754654

ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
This invention relates in general to the field of data processing. More specifically, this invention relates to automated systems and methods for analyzing collections of documents to extract important information from the collections.
BACKGROUND OF THE INVENTION
An enormous amount of information is contained in data processing systems around the world. For example, a single large business organization typically has multiple banks of e-mail servers containing millions of e-mail messages for thousands of employees. In addition, organizations often have thousands of personnel records stored on one or more different systems, such as minicomputer or mainframe systems. Additional kinds of information typically kept include marketing materials, technical reports, business memoranda, and so on, stored in various types of computer systems.
Organizations typically use different programs to create and modify different kinds of information, and typically use many different kinds of hardware, operating systems, file systems, and data formats to store the information. When stored, the information is typically organized into discrete records containing closely related data items. For example, a typical e-mail server stores each e-mail message as a separate row in a single database file, with multiple columns within the row holding the data that constitutes the message. Likewise, some personnel systems store each employee's personnel data as related records in one or more files, with multiple fields in each record containing information such as employee name, start date, etc. Similarly, a Web server may store each Web page as lines of text in a file or a group of related files. Despite these differences in file format and storage, each e-mail message, each Web page, each employee's personnel data, and each similar collection of information is referred to as a “document.”
When an organization's databases grow to contain thousands or millions of documents, traditional tools for retrieving data, such as search and sort functions, lose much of their practical utility. For example, when millions of e-mail messages are available, searching for a particular message or for a message relating to a particular topic is like trying to find a needle in a haystack. In such a situation, the individual performing the search is faced with too much information (TMI), and the knowledge embedded within the stored information remains largely untapped.
In recent years, some businesses have attempted to utilize the large pools of information on their data processing systems to greater advantage by analyzing that information with techniques known generally as data mining. As defined by the Microsoft Press Computer Dictionary, data mining is “the process of identifying commercially useful patterns or relationships in databases or other computer repositories through the use of advanced statistical tools” (4th ed., p. 125).
As one example of a data mining technique, a cluster tool organizes documents into groups based on the contents of the documents. For instance, a business with customer complaint e-mails could identify areas of concern by using a cluster tool to group related customer complaints together. By contrast, traditional search techniques require the user to know in advance what characteristics are important. For example, with a traditional search function, an automobile manufacturer specifies a particular term, such as “engine,” to determine whether engine complaints are numerous. A cluster tool, on the other hand, groups complaints into subject areas, thereby highlighting areas of concern that the manufacturer might not otherwise think to explore.
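The contrast between searching for a known term and letting groups emerge from the documents can be illustrated with a minimal sketch. It is not the clustering method of the patent, which this section does not describe. It assumes a simple greedy, single-pass grouping based on Jaccard overlap of terms, and the stopword list, the 0.2 threshold, and the sample complaints are all hypothetical.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"the", "a", "is", "my", "and", "when", "i", "it", "on", "in", "to", "of"}

def terms(text):
    """Lowercase a document and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def jaccard(a, b):
    """Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single-pass grouping: a document joins the first cluster whose
    seed document shares enough terms with it, otherwise it starts a new one."""
    clusters = []  # each entry is (seed_terms, [documents])
    for doc in docs:
        doc_terms = terms(doc)
        for seed_terms, members in clusters:
            if jaccard(seed_terms, doc_terms) >= threshold:
                members.append(doc)
                break
        else:
            clusters.append((doc_terms, [doc]))
    return [members for _, members in clusters]

complaints = [
    "The engine stalls when I brake",
    "Engine stalls on the highway",
    "Paint is peeling on the hood",
    "The paint on my door is peeling",
]
groups = cluster(complaints)
for group in groups:
    print(group)
```

No search term such as “engine” is supplied in advance; the engine and paint complaints fall into separate groups because of the words they share.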
However, a number of disadvantages are associated with conventional data mining systems, including shortcomings relating to the amount of time required to produce results, the pertinence of the results to the organization using those results, and the ability to analyze documents from different time periods, particularly when the analysis involves documents that have been archived.
SUMMARY OF THE INVENTION
Embodiments of the present invention provide a system and method for extracting knowledge from documents. In one embodiment, a data mining system according to the present invention includes a data retrieving component, a data integrating component, and a query manager. The data retrieving component and the data integrating component cooperate to generate intermediate data, such as marked-up documents, key term vectors, and/or data cubes, based on raw documents, such as e-mail messages, associated with an organization. The query manager uses the intermediate data to respond to queries relating to the raw documents.
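How the three components might fit together can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names, the record fields (`from`, `date`, `body`), the two-term vocabulary, and the choice of (sender, month, term) as cube dimensions are all hypothetical.

```python
from collections import Counter
import re

def mark_up(raw):
    """Data retrieving step: turn a raw e-mail into a simple marked-up record."""
    return {"sender": raw["from"], "month": raw["date"][:7], "body": raw["body"]}

def key_term_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in the document body."""
    words = Counter(re.findall(r"[a-z]+", doc["body"].lower()))
    return {term: words[term] for term in vocabulary}

def build_cube(docs, vocabulary):
    """Data integrating step: aggregate term counts by (sender, month, term)."""
    cube = Counter()
    for doc in docs:
        for term, count in key_term_vector(doc, vocabulary).items():
            if count:
                cube[(doc["sender"], doc["month"], term)] += count
    return cube

def query(cube, sender=None, month=None, term=None):
    """Query manager: sum the cube cells that match; None matches any value."""
    wanted = (sender, month, term)
    return sum(
        count
        for key, count in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    )

raw_messages = [
    {"from": "ann", "date": "2001-02-06", "body": "Budget review: budget is late"},
    {"from": "bob", "date": "2001-02-09", "body": "Schedule and budget update"},
    {"from": "ann", "date": "2001-03-01", "body": "Schedule slipped again"},
]
vocabulary = ["budget", "schedule"]
cube = build_cube([mark_up(m) for m in raw_messages], vocabulary)

print(query(cube, term="budget"))
print(query(cube, sender="ann", term="schedule"))
```

The point of the design is that `query` reads only the precomputed cube and never touches the raw messages.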
In another embodiment, the data integrating component generates and stores the intermediate data automatically and substantially independently of the query manager. For instance, the intermediate data may be generated and stored according to a sampling period.
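One way to picture generation according to a sampling period is a store keyed by period start date, filled in without reference to any query. The sketch below assumes a seven-day period, a hypothetical epoch date, and per-period word counts as the intermediate data; none of these details come from the patent text.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def period_start(day, epoch, period_days):
    """Return the first day of the sampling period that contains `day`."""
    offset = (day - epoch).days // period_days
    return epoch + timedelta(days=offset * period_days)

def update_store(store, docs, epoch, period_days=7):
    """Add word counts for every sampling period not yet present in the store."""
    by_period = defaultdict(list)
    for sent, body in docs:
        by_period[period_start(sent, epoch, period_days)].append(body)
    for start, bodies in by_period.items():
        if start not in store:  # periods already stored are not recomputed
            store[start] = Counter(w for b in bodies for w in b.lower().split())
    return store

epoch = date(2001, 1, 1)  # hypothetical start of the first sampling period
docs = [
    (date(2001, 1, 3), "budget review"),
    (date(2001, 1, 9), "budget approved"),
    (date(2001, 1, 10), "schedule review"),
]
store = update_store({}, docs, epoch)
for start in sorted(store):
    print(start, dict(store[start]))
```

Because each period's data is stored once, a later query over an older period can be answered from the store even if the underlying documents have since been archived.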
In another embodiment, the data retrieving component identifies which raw documents are pertinent to the organization, based on characteristic data for the organization (i.e., organization data), such as personnel records. In this embodiment, the data retrieving component filters the raw documents by generating marked-up documents only for the raw documents identified as pertinent. For example, if processing e-mail messages, the data retrieving component may generate marked-up documents only for e-mail messages that were both sent and received by members of the organization.
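The e-mail filtering example can be sketched as follows. The record layout, the addresses, and the helper names are hypothetical. The sketch also assumes that "received by members" means every recipient belongs to the organization; the text could equally be read as requiring only one.

```python
def member_addresses(personnel_records):
    """Build the set of organization e-mail addresses from personnel records."""
    return {record["email"].lower() for record in personnel_records}

def is_pertinent(message, members):
    """True when the sender and every recipient belong to the organization.
    (Assumption: all recipients must be members, not just one.)"""
    recipients = [r.lower() for r in message["to"]]
    return (
        message["from"].lower() in members
        and bool(recipients)
        and all(r in members for r in recipients)
    )

def filter_messages(messages, personnel_records):
    """Keep only the messages that would go on to be marked up."""
    members = member_addresses(personnel_records)
    return [m for m in messages if is_pertinent(m, members)]

personnel = [
    {"name": "Ann", "email": "ann@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]
messages = [
    {"from": "ann@example.com", "to": ["bob@example.com"], "subject": "Budget"},
    {"from": "ann@example.com", "to": ["vendor@example.org"], "subject": "Quote"},
    {"from": "spam@example.net", "to": ["bob@example.com"], "subject": "Offer"},
]
kept = filter_messages(messages, personnel)
print([m["subject"] for m in kept])
```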
Additional embodiments provide other technological solutions which facilitate knowledge extraction.


REFERENCES:
patent: 6182091 (2001-01-01), Pitkow et al.
patent: 6510406 (2003-01-01), Marchisio
Jiawei Han, Towards on-line analytical mining in large databases, 1998, ACM Press, vol. 27, Issue 1, pp. 97-107.*
IBM Intelligent Miner for Text, Fact Sheet obtained from internet at <http://www-4.ibm.com/software/data/iminer/fortext/download/factsheet.pdf>, 1999.
The Trillium Control Center (Figure) obtained from internet at <http://www.trilliumsoft.com/softwareanim.htm>, printed Feb. 6, 2001.
Gray, Jim et al., DataCube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals; Microsoft Research, Advanced Technology Division, Microsoft Corporation, pp. 1-9, Feb. 5, 1995 and revised Oct. 18, 1995, obtained from internet at <http://citeseer.nj.nec.com/cache/papers2/cs/12668/ftp:zSzzSzftp.research.microsoft.comzSzpubzSztrzSztr-95-22.pdf/gray96data.pdf> (printed May 7, 2001).
Barbara, Daniel and Wu, Xintao, George Mason University, The Role of Approximations in Maintaining and Using Aggregate Views, IEEE Computer Society, vol. 22, No. 4, pp. 15-21, Dec. 1999.
