Method and apparatus for multi-language indexing

Data processing: speech signal processing – linguistics – language – Linguistics – Natural language

Reexamination Certificate

Details

C707S793000

active

06389387

ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
The present invention relates to a method of and an apparatus for forming an index. The invention also relates to a storage medium storing a program for performing the method, an index, a storage medium containing the index and the use of the index to access documents.
The techniques disclosed herein may be used for information management. Examples of such applications include information retrieval systems, such as search engines, for accessing information on the internet or in office information systems, information filtering applications (also known as information routing systems) and information extraction applications.
DESCRIPTION OF THE RELATED ART
There are many databases which contain documents in machine-readable form and which can be accessed to locate and retrieve information. Similarly, there are various known techniques for locating documents on the basis of subject matter. One example of this is the collection of published patent specifications. All patent specifications are indexed according to subject matter when the specification is published in accordance with the International Classification. The content of each patent specification is analyzed in accordance with the International Classification and the relevant classification numbers for the subject matter form part of the heading of both the printed patent specification and the machine-readable form.
In order to locate patent specifications, or indeed other documents, whose collections are similarly classified according to subject matter, it is necessary to select the correct international class and to apply this to a searching system. The searching system then locates all patent specifications which have been classified in the same class. However, a disadvantage of this system is that efficient use requires familiarity with and experience of using the International Classification system. Also, this technique relies on correct classification of patent specifications. Inexperienced use can result in relevant patent specifications being missed whereas incorrect classification can prevent a relevant patent specification from ever being located by this technique.
Another known technique for information retrieval relies on the selection of keywords which are then used to search for relevant documents such as patent specifications. In this case, it is necessary to identify words which are likely to appear in the relevant documents but which are unlikely to appear in irrelevant documents. Searching using keywords then reveals all documents which contain the keywords or combination of keywords.
There are several difficulties with this technique. For instance, in the case of subject matter without well-defined or standard terminology it may be difficult or impossible to select all keywords which might identify relevant documents. On the other hand, the use of more general keywords can lead to the disclosure of very large numbers of documents, many of which are irrelevant. Further, such keywords can only be used for documents which are in the same language or which have been completely or partially translated or abstracted into the language of the keywords. The effectiveness of this technique in locating documents in other languages may therefore be poor or nonexistent.
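The keyword technique and its language limitation can be illustrated with a minimal sketch. The documents, identifiers and function names below are hypothetical and chosen only for illustration; they do not come from any system described here.

```python
from collections import defaultdict

def build_keyword_index(docs):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:")].add(doc_id)
    return index

def search_all(index, keywords):
    """Return ids of documents that contain every keyword (AND combination)."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()

# Hypothetical collection: two English documents and one German document.
docs = {
    "D1": "A method of forming an index for documents.",
    "D2": "An apparatus for retrieving documents by keyword.",
    "D3": "Eine Vorrichtung zum Indexieren von Dokumenten.",
}
index = build_keyword_index(docs)
english_hits = search_all(index, ["documents"])
combined_hits = search_all(index, ["index", "documents"])
```

In this sketch the English keyword "documents" never matches the German document D3, even though D3 concerns the same subject matter, which is the cross-language weakness noted above.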
D. A. Hull and G. Grefenstette, “Querying across languages: a Dictionary-Based Approach to Multilingual Information Retrieval”, 19th Annual International Conference on Research and Development in Information Retrieval (SIGIR '96), pages 49-57, 1996 and D. W. Oard and B. J. Dorr, “A Survey of Multilingual Text Retrieval”, Technical Report UMIACS-TR-96-19, University of Maryland, Institute for Advanced Computer Studies, April 1996, disclose two techniques for performing multilingual information retrieval, one based on document translation and the other based on query translation. In each case, each translation is to be performed by a machine translation system. Thus, in the case of document translation, a machine translation system is used to translate all of a collection of documents into a target language so that queries for locating and retrieving information, for instance based on the keyword technique described hereinbefore, may be performed in the source (document) language or in the target language. In the other technique, the documents are not translated but each query is translated into the source or document language and the translations are used to search the document collection.
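As a rough illustration of the query translation technique, the following sketch translates a query word by word and searches untranslated documents. The word-by-word lexicon lookup is a simplifying stand-in for a machine translation system, and the lexicon entries, documents and function names are all hypothetical.

```python
# Hypothetical English-to-German lexicon; the entries are illustrative only.
LEXICON = {
    "index": ["index", "verzeichnis"],
    "documents": ["dokumente", "dokumenten"],
}

def translate_query(query_words, lexicon):
    """Replace each query word by its list of candidate translations.

    Words missing from the lexicon are passed through unchanged.
    """
    return [lexicon.get(w.lower(), [w.lower()]) for w in query_words]

def search_translated(docs, translated):
    """Return ids of documents containing a candidate for every query word."""
    hits = set()
    for doc_id, text in docs.items():
        words = {w.strip(".,;:") for w in text.lower().split()}
        if all(any(c in words for c in candidates) for candidates in translated):
            hits.add(doc_id)
    return hits

# Hypothetical source-language (German) collection, left untranslated.
german_docs = {
    "D1": "Ein Verzeichnis der Dokumente.",
    "D2": "Eine Vorrichtung.",
}
translated = translate_query(["index", "documents"], LEXICON)
query_hits = search_translated(german_docs, translated)
```

Document translation would invert this arrangement: the collection is translated once in advance and queries are then matched against the translated text.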
A disadvantage with query translation is that queries often comprise a few words and may not even be in a sentence context. Thus, automatic linguistic processing of such queries can be difficult and may lead to unsatisfactory results, such as failure to locate relevant documents and location of irrelevant documents.
The use of automatic machine translation to translate whole collections of documents to form an index is also problematic. The resources required in terms of computing time and additional storage medium capacity make this technique unattractive. Although such processing need not be performed in real time and, in particular, is not required as part of each information retrieval request, substantial resources are necessary and there may be a continuing requirement as further documents are added to the collection. Translation into several target languages multiplies the resource requirements.
Machine translation systems also perform tasks which are not useful to information retrieval and, in particular, to the forming of a multilingual index. For instance, in addition to translating words and groups of words contained in documents, machine translation systems also attempt to produce a good quality translation which is readable for human beings. If the translation is merely required for indexing, functions such as correct word ordering in the target language are unnecessary and are therefore wasteful of computing resources.
A further disadvantage with machine translation systems when used to translate documents into a target language for indexing purposes is that the effectiveness of the index may be seriously compromised. Some machine translation systems generate a single preferred translation of an input text. In other words, such systems attempt to identify and produce a single translation which is judged according to automatic criteria within the system as the best translation. If that translation is incorrect, then retrieval of information based on the incorrect translation will be ineffective because relevant documents may not be located and irrelevant documents may be located.
Other machine translation systems attempt to generate all possible translations of input text. Thus, even if the correct translation is present, there may be many other translations which are inappropriate or wrong. The use of such translations for information retrieval results in the generation of spurious matches on queries posed to the system so that very large numbers of irrelevant documents may be located together with the relevant documents.
SUMMARY OF THE INVENTION
According to a first aspect of the invention, there is provided a method of forming, for a plurality of documents, an index comprising indexing features, the method comprising the steps of:
identifying each of at least some of the terms present in the documents;
generating from each identified term at least one equivalent term which is different from but linguistically related to the identified term; and
forming for each of the identified terms and the equivalent terms an indexing feature comprising the identified term or the equivalent term and an identifier of the or each document in which the identified term or the identified term to which the equivalent term is equivalent occurs.
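The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: terms are reduced to single lowercase words, the equivalent terms come from a hypothetical lookup table of German translations, and all names (`EQUIVALENTS`, `identify_terms`, `form_index`) are invented for the example.

```python
from collections import defaultdict

# Hypothetical table of linguistically related terms (here: German translations).
EQUIVALENTS = {
    "index": ["verzeichnis"],
    "document": ["dokument"],
}

def identify_terms(text):
    """Step 1: identify terms; in this sketch, lowercase single words only."""
    return {w.strip(".,;:") for w in text.lower().split()} - {""}

def form_index(docs, equivalents):
    """Steps 2 and 3: generate equivalent terms and form indexing features.

    Each feature pairs a term (identified or equivalent) with the ids of the
    documents in which the originating identified term occurs.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in identify_terms(text):
            index[term].add(doc_id)
            for equivalent in equivalents.get(term, []):
                index[equivalent].add(doc_id)
    return dict(index)

# Hypothetical two-document collection.
docs = {"D1": "An index of each document.", "D2": "A document store."}
index = form_index(docs, EQUIVALENTS)
```

With this index, a query using the equivalent term "dokument" reaches both D1 and D2 without any document having been translated as a whole.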
The expression “term” used herein means an individual word, a group of linked words which occur adjacent each other in a document (continuous collocation), or a group of words which are linked to each other but which are divided into at least two subgroups of words separated in a document by one or
