Software and method for recognizing similarity of documents...

Data processing: speech signal processing – linguistics – language – Linguistics – Multilingual or national language support

Reexamination Certificate

C707S793000, C707S793000, C704S005000

active

06519557

ABSTRACT:

CROSS-REFERENCE TO RELATED APPLICATIONS
Not Applicable
FIELD OF THE INVENTION
This invention pertains to multi-lingual document data warehousing. More particularly, the invention pertains to a system and method that can identify duplicates or near duplicates of a document in two different languages.
BACKGROUND OF THE INVENTION
The Internet comprises a vast resource of information in the form of web pages. These web pages comprise text, graphics, video and other forms of information on a variety of topics the range of which is coextensive with the vast range of users' interests. The Internet is a global network and thus serves a diverse multi-lingual community.
In the interest of serving the Internet's multi-lingual community, large organizations and companies may have very large web sites, built up over many years by many people. The sites can be so large that no single person has extensive knowledge of the entire site architecture. These sites may often contain multiple versions of documents written in different languages. In some cases different language versions of a web site may be located on different hosts or have separate domain names and be stored in separate directory structures. As the Internet continues to develop rapidly, the desire to revamp web sites often arises. In the case of multi-lingual web resources (i.e., a single multi-lingual site, or multiple sites in different languages) a plan for revamping may include identifying different language versions of the same document as such. The plan might further include eliminating duplicative documents in favor of using a real time machine translation function to present the web page to the user, or it might alternatively include adding cross references from the web pages to the different language versions.
A third party such as a search engine dot com might also want to identify different language versions of the same document so as to enable it to present information identifying different language versions to a user.
Because some languages are laid out differently (for example, Japanese is often written vertically rather than horizontally, and Hebrew is written from right to left rather than from left to right), different language versions of the same web page may have a somewhat different Hyper Text Markup Language (HTML) structure in order to accommodate the layout of the particular language. Thus, a strict comparison on the basis of the HTML code structure alone cannot be relied on to identify different language versions of the same document.
The invention to be described makes use of machine translation. In connection therewith, it should be noted that machine translation does not produce an exact inverse function of the human language translation originally used to produce foreign language versions. There will be differences in the text output by a machine translation function and the original document. Therefore, direct string comparisons between the original document and the translation of the foreign language document back into the original language will not yield a match.
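The point about direct string comparison can be illustrated with a minimal sketch. The two sentences, the function name is_probable_match, and the 0.7 threshold are hypothetical values chosen for illustration, and the character-based ratio from Python's difflib is an assumed stand-in for a similarity measure; the text above does not specify one.

```python
from difflib import SequenceMatcher

# Hypothetical example: an original sentence and a machine translation of its
# foreign-language version back into the original language.
original = "The meeting will be held on Monday."
back_translated = "The meeting will take place on Monday."


def is_probable_match(text_a, text_b, threshold=0.7):
    """Return True when the two texts are similar enough to be treated as
    versions of the same passage. The threshold is an illustrative assumption."""
    ratio = SequenceMatcher(None, text_a, text_b).ratio()
    return ratio >= threshold


# A direct string comparison rejects the pair outright.
exact_match = original == back_translated

# A tolerant, quantitative comparison can still recognize the pair.
tolerant_match = is_probable_match(original, back_translated)
```

Here exact_match is False while tolerant_match is True, which is the kind of tolerance to imperfect machine translation that the following paragraphs call for.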
What is needed is a system for identifying duplicate versions of web pages which may be written in two different languages.
What is further needed is a system for identifying different language versions of a document that can recognize that two documents are the same or similar notwithstanding slight differences in the formatting code (e.g., HTML) structure of the documents.
What is further needed is a system for identifying different language versions of the same document that is tolerant of the imperfections of machine translation.
SUMMARY OF THE INVENTION
Briefly, according to one aspect of the invention, a method of identifying different versions of the same structured document comprises steps of reading a first portion of text which occupies a first position in a first hierarchical structured document, reading a second portion of text which occupies a second position which is congruent to the first position in a second hierarchical structured document, and obtaining a quantitative measure of similarity of the first and second portions of text.
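
The steps summarized above can be sketched roughly as follows. This is an illustrative sketch only, not the claimed implementation: it assumes both documents are well-formed HTML already rendered in the same language (for example, after machine translation), it treats two positions as congruent when their tag paths with sibling indices are identical, and it uses difflib's character-based ratio as the quantitative measure. The names PathTextExtractor, positional_texts, and congruent_similarity are hypothetical.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class PathTextExtractor(HTMLParser):
    """Collect text keyed by its position in the tag hierarchy."""

    def __init__(self):
        super().__init__()
        self.path = []             # e.g. ["html[0]", "body[0]", "p[1]"]
        self.child_counts = [{}]   # per-depth count of child tags seen so far
        self.texts = {}            # position string -> text at that position

    def handle_starttag(self, tag, attrs):
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self.path.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        # Assumption: every start tag has a matching end tag.
        if self.path:
            self.path.pop()
            self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            key = "/".join(self.path)
            self.texts[key] = (self.texts.get(key, "") + " " + text).strip()


def positional_texts(html):
    """Map each position in the document hierarchy to the text found there."""
    parser = PathTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


def congruent_similarity(doc_a, doc_b):
    """Score the text at every position that exists in both documents."""
    texts_a = positional_texts(doc_a)
    texts_b = positional_texts(doc_b)
    return {
        position: SequenceMatcher(None, texts_a[position], texts_b[position]).ratio()
        for position in texts_a.keys() & texts_b.keys()
    }


doc_a = "<html><body><h1>Annual report</h1><p>Sales rose sharply.</p></body></html>"
doc_b = "<html><body><h1>Annual report</h1><p>Sales increased sharply.</p></body></html>"
scores = congruent_similarity(doc_a, doc_b)
```

In this example scores holds one entry per shared position, keyed by strings such as "html[0]/body[0]/p[0]". Identical headings score 1.0 and the differing paragraphs score somewhat lower. Requiring identical paths is stricter than the tolerance to structural differences discussed in the background, so a practical system would likely need a looser notion of congruence.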

