Taxonomy generation for document collections

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details


active

06446061

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to a method within the area of information mining within a multitude of documents stored on computer systems. More particularly, the invention relates to a computerized method of generating a content taxonomy of a multitude of electronic documents.
BACKGROUND OF THE INVENTION
Organizations generate and collect large volumes of data, which they use in daily operations. Yet many companies are unable to capitalize fully on the value of this data because information implicit in the data is not easy to discern. Operational systems record transactions as they occur, day and night, and store the transaction data in files and databases. Documents are produced and placed in shared files or in repositories provided by document management systems. The growth of the Internet, and its increased worldwide acceptance as a core channel both for communication among individuals and for business operations, has multiplied the sources of information and therefore the opportunities for obtaining competitive advantages. Business Intelligence Solutions is the term that describes the processes that together are used to enable improved decision making. Information mining is the process of data mining and/or text mining. It uses advanced technology for gleaning valuable insights from these sources that enable the business user to make the right business decisions and thus obtain the competitive advantages required to thrive in today's competitive environment. Information mining in general generates previously unknown, comprehensible, and actionable information from any source, including transactions, documents, e-mail, web pages, and others, and uses it to make crucial business decisions.
Data is the raw material. It can be a set of discrete facts about events, and in that case, it is most usefully described as structured records of transactions, and it is usually of numeric or literal type. But documents and Web pages are also a source of unstructured data, delivered as a stream of bits which can be decoded as words and sentences of text in a certain language. Industry analysts estimate that unstructured data represents 80% of an enterprise's information, compared to 20% from structured data; it comprises data from different sources, such as text, image, video, and audio; text is, however, the most predominant variety of unstructured data.
The IBM Intelligent Miner Family is a set of offerings that enables the business professional, and in general any knowledge worker, to use the computer to generate meaningful information and useful insights from both structured data and text. Although the general problems to solve (e.g., clustering, classification) are similar for the different data types, the technology used in each case is different, because it needs to be optimized for the media involved, the user needs, and the best use of the computing resources. For that reason, the IBM Intelligent Miner Family comprises two specialized products: the IBM Intelligent Miner for Data, and the IBM Intelligent Miner for Text.
Information mining has been defined as the process of generating previously unknown, comprehensible, and actionable information from any source. This definition exposes the fundamental differences between information mining and the traditional approaches to data analysis, such as query and reporting and online analytical processing (OLAP) for structured data, and full text search for textual data. In essence, information mining is distinguished by the fact that it is aimed at the discovery of information and knowledge, without a previously formulated hypothesis. By definition, the information discovered through the mining process must have been previously unknown, that is, it is unlikely that the information could have been hypothesized in advance. For structured data, the interchangeable terms “data mining” and “knowledge discovery in databases” describe a multidisciplinary field of research that includes machine learning, statistics, database technology, rule based systems, neural networks, and visualization. “Text mining” technology is also based on different approaches of the same technologies; moreover, it exploits techniques of computational linguistics.
Both data mining and text mining share key concepts of knowledge extraction, such as the discovery of which features are important for clustering, that is, finding groups of similar objects that differ significantly from other objects. They also share the concept of classification, which refers to determining the class to which a certain database record belongs, in the case of data mining, or to which a document belongs, in the case of text mining. The classification schema can be discovered automatically through clustering techniques (the machine finds the groups or clusters and assigns to each cluster a generalized title or cluster label that becomes the class name). In other cases the taxonomy can be provided by the user, and the process is called categorization.
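The distinction between clustering (classes are discovered and labeled by the machine) and categorization (classes are supplied by the user) can be illustrated with a minimal Python sketch. It is not the method of the invention: the bag-of-words vectors, the cosine similarity measure, the single-pass grouping strategy, the similarity threshold of 0.3, the tiny stop-word list, and all function names are assumptions chosen only to keep the illustration short.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "for", "on", "is"}

def term_vector(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: each document joins the most similar
    existing cluster, or starts a new one if none reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, text in enumerate(docs):
        vec = term_vector(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [i]})
        else:
            best["centroid"].update(vec)  # add term counts to the centroid
            best["members"].append(i)
    return clusters

def label(c, n=2):
    """Use the n most frequent terms of a cluster as its generated label."""
    ranked = sorted(c["centroid"].items(), key=lambda kv: (-kv[1], kv[0]))
    return " ".join(term for term, _ in ranked[:n])

def categorize(text, taxonomy):
    """Categorization: pick the best class from a user-provided taxonomy,
    given here as a mapping from class name to descriptive keywords."""
    vec = term_vector(text)
    return max(taxonomy, key=lambda name: cosine(vec, term_vector(taxonomy[name])))
```

In this sketch, `cluster` needs no predefined classes and `label` derives a class name from the cluster content, whereas `categorize` only chooses among the classes it is given. A single-pass scheme is sensitive to document order; it is used here only because it fits in a few lines.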
Many of the technologies and tools developed in information mining are dedicated to the task of discovery and extraction of information or knowledge from text documents, called feature extraction. The basic pieces of information in text (such as the language of the text, or company names or dates mentioned) are called features. Information extraction from unconstrained text is the extraction of the linguistic items that provide representative or otherwise relevant information about the document content. These features are used to assign documents to categories in a given scheme, group documents by subject, focus on specific parts of information within documents, or improve the quality of information retrieval systems. The extracted features can also serve as meta data about the analyzed documents. Extracting implicit data from text can be interesting for many reasons; for instance:
to highlight important information e.g. to highlight important terms in documents. This can give a quick impression whether the document is of any interest.
to find names of competitors, e.g. when doing a case study in a certain business area one can run a name extraction on the documents received from different sources and then sort them by names of competitors.
to find and store key concepts. This could replace a text retrieval system where huge indexes are not appropriate but only a few key concepts of the underlying document collection should be stored in a database.
to use related topics for query refinement e.g. store the key concepts found in a database and build an application for query refinement on top of it. Thus topics that are related to the users' initial queries can be suggested to help them refine their queries.
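As a rough illustration of the feature extraction uses listed above, the following Python sketch extracts dates and company-name candidates from text and then groups documents by the company names found in them. The two regular expressions, the company suffixes, the function names, and the sample file names in the usage note are assumptions made for this example; production feature extractors rely on far richer linguistic models than these patterns.

```python
import re
from collections import defaultdict

# Assumed patterns, for illustration only.
# Dates: a month name followed by a four-digit year, e.g. "June 1997".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)
# Company candidates: capitalized words followed by a legal-form suffix.
COMPANY_RE = re.compile(r"\b(?:[A-Z][A-Za-z]* )+(?:Inc|Corp|Ltd)\b")

def extract_features(text):
    """Return the dates and company-name candidates mentioned in a text."""
    return {
        "dates": DATE_RE.findall(text),
        "companies": COMPANY_RE.findall(text),
    }

def index_by_company(docs):
    """Group document names by the company names extracted from them."""
    index = defaultdict(list)
    for name, text in docs.items():
        for company in extract_features(text)["companies"]:
            if name not in index[company]:
                index[company].append(name)
    return dict(index)
```

Calling `index_by_company` on a mapping such as `{"report1.txt": "...", "memo.txt": "..."}` yields one entry per extracted company name, each listing the documents that mention it, which corresponds to sorting received documents by competitor. One known limitation of the pattern: a capitalized word that directly precedes a company name, such as a sentence-initial word, is included in the match.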
Feature extraction from texts, and the harvesting of crisp and vague information, require sophisticated knowledge models, which tend to become domain specific. A recent research prototype has been disclosed by J. Mothe, T. Dkaki, B. Dousset, “Mining Information in Order to Extract Hidden and Strategic Information”, Proceedings of Computer-Assisted Information Searching on Internet, RIAO97, pp 32-51, June 1997.
A further technology of major importance in information mining is dedicated to the task of clustering of documents. Within a collection of objects, a cluster could be defined as a group of objects whose members are more similar to each other than to the members of any other group. In information mining, clustering is used to segment a document collection into subsets, the clusters, with the members of each cluster being similar with respect to certain interesting features. For clustering, no predefined taxonomy or classification schemes are necessary. This automatic analysis of information can be used for several different purposes:
to provide an overview of the contents of a large document collection;
to identify hidden structures between groups of objects, e.g. clustering allows related documents to all be connected by hyperlinks;
to ease the process of browsing to find similar or related information e.g.
