Compression method, method for compressing entry word index...

Data processing: speech signal processing – linguistics – language – Linguistics – Translation machine

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C704S010000, C704S003000, C707S793000, C341S106000

Reexamination Certificate

active

06502064

ABSTRACT:

DETAILED DESCRIPTION OF THE INVENTION
1. Field of the Invention
The present invention relates to a machine translation method for translating or converting original text in a first language (foreign language; e.g., English) to translated text in a second language (native language; e.g., Japanese), and in particular to a machine translation method whereby a computer system performs a translation process using an electronically stored dictionary. More specifically, the present invention pertains to a method for compressing entry word index data in a dictionary; to entry word indexes for a dictionary that have been compressed; and to a method for searching for a word based on an entry word index that has been compressed.
2. Background Art
Over the years, much time and effort has been expended in the study of so-called “machine translation” (or “automatic translation”), a technique involving the use of the hardware resources of a computer system to translate text in one language to text in a second language.
Not long after the end of the Second World War, for example, following the development in 1946 of ENIAC, the first general purpose computer, many researchers became greatly interested in the possibility of using computers for “machine translation.” And over the next ten years universities and research institutes invested an enormous amount of time, energy and money in its study; but with generally unsatisfactory results.
Thereafter, interest in machine translation waned somewhat. But today, impelled by recent developments related to the use of the Internet, the focus is once again on machine translation; once again great interest is being shown in developing and producing software for this purpose. This has come about because many users of the Internet, outside the English speaking community of nations, either cannot read English or read it imperfectly, and since most text on Web pages are written in English, these users can not fully utilize the new global information system, the WWW (the World Wide Web). As a result, translation software that once, when first developed, was priced in the tens of millions of yen can now be purchased for tens of thousands of yen, and since such software is therefore more easily acquired by users, it is now widely used on personal computers. Of the machine translation software products that are presently available, some are specifically intended for the translation of text on the Internet, i.e., the translation of Web pages. One example of such a product is the “King of Translation,” which is sold by IBM Japan Co., Ltd.
In short, machine translation is a technique by which the processing capability of a computer system is applied for the translation of text written in a foreign language, such as English, into text in a native language, such as Japanese (or vice versa). For machine translation, a database is constructed by employing, as a model, the enormous amount of language knowledge that a human possesses (or is assumed to possess), and a translation engine, a type of data processor, is employed to refer to this database and to perform the actual translation.
An example database for a machine translation system is a dictionary. Recent machine translation systems prepare a dedicated dictionary for each genre, such as an art dictionary and a sports dictionary, in addition to a system dictionary that serves as a basic dictionary. Machine translation systems use dictionaries in accordance with the genre to which an object to be translated belongs, and thus the accuracy of a translation can be improved (see the specification in Japanese Unexamined Patent Application Hei 8-272755. and corresponding U.S. Pat. No. 6,119,078 issued on Sep. 12, 2000. Generally, a single machine translation dictionary is constituted by an entry word index portion and a main portion that describes translation data for entry words (includes “morpheme analysis data). The translation engine searches through the entry word indexes to acquire corresponding translation data.
For distribution, a machine translation system, i.e., the machine translation software, is generally recorded on a storage medium, such as a CD (compact disk) or a FD (floppy disk). To activate the machine translation software, an end user inserts into a drive unit of his or her computer system a CD or FD he or she purchased and installs a program recorded thereon in the computer.
An entry word index portion of a machine translation system is generally not stored in text form; usually it is compressed or encoded before being stored. This is done because an index portion that has an easily readable form may be employed or examined by a third party, especially by a competitor, and because the size of compressed entry word index data is reduced and can thus be held resident in memory. This last is important because the entry word index data must be accessed each time a word search is conducted, and when the entry word index data is resident in memory, the speed of a search is greatly increased. In particular, the sizes of entry word indexes for a machine translation system that prepares some dictionaries must be reduced, so that all of them can be held resident in memory. Conventionally, a common compression algorithm, such as “LHA” for the general-purpose personal computer (PC) or a compression command “compress” for UNIX, is employed to compress entry word index data, or only an encoding process is performed for the index data without being compressed. However, these conventional techniques have the following shortcomings.
First, time is required for compression and recovery processing. In particular, once entry word index data are compressed, a search for data can not be performed, and thus, two steps are required: the decompression of the entry word index data and a search for the resultant data. As a result, the search efficiency is deteriorated.
In addition, since the individual entry words are short character strings (20 to 30 bytes at most), the compression rate is not good.
Further, only the simple encoding of data does not reduce data size.
It is, therefore, one object of the present invention to provide a method for compressing entry word index data for a dictionary to be used for machine translation, compressed entry word indexes for a dictionary, and a method for searching for a word using the compressed entry word index data.
It is another object of the present invention to provide a compression method that enables a search for compressed data to be performed without a decompression process being required, entry word indexes for a dictionary to be generated by such a compression method, and a method for searching for a word using the compressed entry word index.
SUMMARY OF THE INVENTION
To achieve the above objects, according to a first aspect of the present invention, a compression method comprises the steps of: (a) extracting character strings, constituted by n (n is an integer greater than 1) or more characters that frequently appear in an object to be compressed, which consists of many words; (b) calculating compression contribution values for the individual extracted character strings; (c) assigning highly ranked character strings having a high compression contribution value to empty columns in a character translation code table; and (d) substituting for a corresponding character translation code the character strings that are registered in the character translation code table.
According to the compression method in the first aspect of the present invention, the object to be compressed may be the entry word index data in a dictionary used for machine translation.
At step (b), for calculating the compression contribution value, the compression contribution value may be represented by (n−k)×count, which is a product of (n−k), a compression value obtained by replacing a character string S having n characters with a character string having k characters (n>k), and count, the frequency at which the character string S of the object to be compressed appears.
The character tra

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Compression method, method for compressing entry word index... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Compression method, method for compressing entry word index..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Compression method, method for compressing entry word index... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2952905

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.