Method and system for classifying semi-structured documents

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000, C370S503000

active

06606620

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of automated information retrieval in the context of document characterization and classification. Particularly, the present invention relates to a system and associated method for classifying semi-structured data maintained in systems that are linked together over an associated network such as the Internet. More specifically, this invention pertains to a computer software product for dynamically categorizing and classifying documents by taking advantage of both textual information and latent information embedded in the structure or schema of the documents, in order to classify their contents with a high degree of precision. This invention incorporates a structured vector model and relies on a document classifier built around that model.
BACKGROUND OF THE INVENTION
The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. The phenomenal growth of the WWW has led to the proliferation of data in semi-structured formats such as HTML and XML. There is a pressing need to support efficient and effective information retrieval, search, and filtering. An accurate classifier is an essential component of building a semi-structured database system.
Currently, users navigate Web pages by means of computer software programs/search tools that commonly fall into two broad categories: net directories and search engines. Net directories provide a hierarchical classification of documents based on a manual classification of Web page materials and data. Search engines use a keyword-based search methodology to return to the user a set of pages that contain a given keyword or words. Both search tools suffer from significant limitations. Net directories are precise but are very limited in scope and expensive to maintain, primarily because of the requirement for human effort to build and maintain them. Search engines are more capable of covering the expanse of the Web but suffer from low precision and, in their current embodiments, are reaching their logical limits. Search engines may provide to the user a null return or, conversely, a multitude of responses, the majority of which are irrelevant.
A number of techniques have been applied to the problem. Among them: statistical decision theory, machine learning, and data mining. Probabilistic classifiers use the joint probabilities of words and categories to estimate the probability of a document falling in a given category. These are the so-called term-based classifiers. Neural networks have been applied to text categorization. Decision tree algorithms have been adapted for data mining purposes.
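A minimal sketch of such a term-based probabilistic classifier follows, using a multinomial Naive Bayes formulation purely to illustrate the prior-art approach described above; the class labels and training data are hypothetical.

```python
from collections import Counter, defaultdict
import math

def train_naive_bayes(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def classify(tokens, class_counts, term_counts, vocab):
    """Pick the class maximizing the joint (log-)probability of category and terms."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)            # prior P(class)
        class_total = sum(term_counts[label].values())
        for t in tokens:                                     # Laplace-smoothed P(term | class)
            score += math.log((term_counts[label][t] + 1) / (class_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical two-class training set
training = [(["golf", "course", "club"], "sports"),
            (["course", "syllabus", "exam"], "education")]
counts, terms, vocab = train_naive_bayes(training)
print(classify(["golf", "club"], counts, terms, vocab))      # -> "sports"
```

Because such a classifier looks only at isolated terms, it exhibits exactly the weakness described below: the word "course" contributes to both classes regardless of its context.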
The problems associated with automated document classification are manifold. The nuances and ambiguity inherent in language contribute greatly to the lack of precision in searches and difficulty of achieving successful automated classification of documents. For example, it is quite easy for an English-speaking individual to differentiate between the meanings of the word “course” in the phrase “golf course” and the phrase “of course.” A pure, term-based classifier, incapable of interpreting contextual meaning, would wrongly lump the words into the same category and reach a flawed conclusion about a document that contained the two phrases. Another difficulty facing automatic classifiers is the fact that all terms are not equal from a class standpoint.
Certain terms are good discriminators because they occur significantly more in one class than another. Other terms must be considered noise because they occur in all classes almost indifferently. The effective classifier must be able to effectively differentiate good discriminators from noise. Yet another difficulty for classifiers is the evaluation of document structure and relative importance of sections within the document. As an example, for a classifier dealing with resumes, sections on education and job skills would need to be recognized as being more important than hobbies or personal background.
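One generic way to quantify that distinction is to compare a term's smoothed relative frequency across classes; this is an illustrative measure only, not the one claimed by the patent, and the per-class counts below are hypothetical.

```python
def discrimination_score(term, term_counts):
    """Ratio of a term's highest to lowest per-class relative frequency.
    Values near 1.0 suggest noise; large values suggest a good discriminator."""
    freqs = []
    for counts in term_counts.values():
        total = sum(counts.values()) or 1
        freqs.append((counts.get(term, 0) + 1) / total)   # +1 smoothing avoids zeros
    return max(freqs) / min(freqs)

# Hypothetical per-class term counts for a resume classifier
term_counts = {
    "engineering": {"java": 40, "course": 12, "golf": 2},
    "hospitality": {"java": 2, "course": 12, "golf": 40},
}
print(discrimination_score("java", term_counts))    # large ratio: good discriminator
print(discrimination_score("course", term_counts))  # near 1.0: mostly noise
```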
These and other language problems represent difficulties for automated classification of documents of any type, but the World Wide Web introduces its own set of problems as well. Among these problems are the following:
1. Web documents are extremely diverse in content, structure, style and format, partly because of their diverse authorship. Many of the techniques that have been developed are effective only on homogeneous corpora.
2. A significant fraction of Web documents are hypertext documents, often divided into pages that are connected by hyperlinks. The documents used in most existing Information Retrieval (IR) studies are self-contained, and the techniques developed for them cannot deal with such links.
3. Most popular web document formats such as HTML or XML are semi-structured, implying an explicit or implicit, though not fixed, schema. Previous Information Retrieval (IR) efforts have focused on flat (unstructured) documents. The markups and formatting cues in the document can mislead classifiers, yet removing or ignoring them means that only part of the original information is available for classification.
The challenges, then, are to deal with the problems inherent in all documents but to also deal with the special problems associated with Web documents, in particular those with a semi-structured format.
As noted, semi-structured data are data that do not conform to a fixed schema; they carry a schema, either implicit or explicit, but that schema is not fixed. By extension, semi-structured documents are text files that contain semi-structured data. Examples include documents in HTML and XML, which together represent a large fraction of the documents on the Web.
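As an illustration, a resume stored as XML might look like the following; the tag names and the Python parsing code are hypothetical, chosen only to show how tags supply an implicit schema without enforcing a fixed one.

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured document: the tags define an implicit schema
# (resume -> education / skills / hobbies), but no fixed schema is enforced,
# and another resume could add or omit sections freely.
doc = """
<resume>
  <education>B.S. Computer Science, 1998</education>
  <skills>Java, databases, information retrieval</skills>
  <hobbies>golf, photography</hobbies>
</resume>
"""

root = ET.fromstring(doc)
for section in root:
    print(section.tag, "->", section.text.strip())
```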
The exploitation of the features inherent in such documents as a key to obtaining better information retrieval is not new. For example, one classifier has been designed specifically to take advantage of the hyperlinks available in HTML. Reference is made to Soumen Chakrabarti, et al., “Enhanced Hypertext Categorization Using Hyperlinks,” Proc. of ACM SIGMOD Conference, pages 307-318, Seattle, Wash., 1998.
In this manner, the classifier can evaluate both local and non-local information to better categorize a document. However, there are more features of semi-structured documents that can be used for classification, along with new techniques for evaluating the information gleaned from the documents.
Currently, there exists no other classifier that takes full advantage of the information available in semi-structured documents to produce accurate classification of such documents residing on the World Wide Web. The need for such a classifier has heretofore remained unsatisfied.
SUMMARY OF THE INVENTION
The text classifier for semi-structured documents and associated method of the present invention satisfy this need. In accordance with one embodiment, the system can dynamically and accurately classify documents with an implicit or explicit schema by taking advantage of the term-frequency and term-distribution information inherent in the document. The system further uses a structured vector model that allows like terms to be grouped together and dissimilar terms to be segregated based on their frequency and distribution within the sub-vectors of the structured vector, thus achieving context sensitivity. The final decision for assigning the class of a document is based on a mathematical comparison of the similarity of the terms in the structured vector to those of the various class models.
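A minimal sketch of that idea follows, assuming one term-frequency sub-vector per named section, cosine similarity between corresponding sub-vectors, and uniform section weights; all of these are simplifying assumptions, not details taken from the patent.

```python
from collections import Counter
import math

def structured_vector(sections):
    """Represent a document as one term-frequency sub-vector per structural section."""
    return {name: Counter(tokens) for name, tokens in sections.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(freq * b.get(term, 0) for term, freq in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity(doc_vec, class_model, weights=None):
    """Weighted sum of per-section similarities between a document and a class model."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * cosine(sub, class_model.get(name, {}))
               for name, sub in doc_vec.items())

def classify_structured(doc_vec, class_models, weights=None):
    """Assign the class whose model is most similar to the structured vector."""
    return max(class_models, key=lambda c: similarity(doc_vec, class_models[c], weights))
```

Because similarities are computed section by section, a term like "course" in a "skills" sub-vector is compared only against the "skills" portion of each class model, which is one way to read the context sensitivity described above.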
The classifier of the present invention is capable of both learning and testing. In the learning phase, the classifier develops a model for each class from the composite information gleaned from numerous training documents. Specifically, it develops a structured vector model for each training document. Then, within a given class of documents, it adds and then normalizes
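Continuing the sketch above, the learning phase might accumulate per-section term counts over a class's training documents and then normalize each sub-vector; normalizing by the section's total term count is an assumption here, since the excerpt does not give the exact normalization.

```python
from collections import Counter, defaultdict

def learn_class_models(training_docs):
    """training_docs: iterable of (sections, label) pairs, where sections maps
    section names to token lists. Sums per-section term frequencies within each
    class, then normalizes each sub-vector to unit total frequency."""
    accumulated = defaultdict(lambda: defaultdict(Counter))
    for sections, label in training_docs:
        for name, tokens in sections.items():
            accumulated[label][name].update(tokens)

    class_models = {}
    for label, sub_vectors in accumulated.items():
        class_models[label] = {}
        for name, counts in sub_vectors.items():
            total = sum(counts.values()) or 1
            class_models[label][name] = {t: c / total for t, c in counts.items()}
    return class_models

# Hypothetical usage together with the classification sketch above:
# models = learn_class_models(training_docs)
# label = classify_structured(structured_vector(new_doc_sections), models)
```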
