Linguistic search system

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000, C707S793000, C707S793000, C707S793000, C704S009000

active

06202064

ABSTRACT:

BACKGROUND
The present invention relates to data processing, and more particularly to techniques for searching for information in a text database or corpus.
Most techniques currently used to retrieve a piece of information from a text corpus are based on substring search (also known as full-text search). Because this basic string search mechanism is weak when the user wants to catch more than a simple sequence of characters, data providers have developed various techniques to enhance substring matching: wildcards, regular expressions, Boolean operators, proximity factors (e.g. the words must be in the same sentence, or no more than N words may separate two words) and stemming.
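To make the limitation concrete, the following is a minimal Python sketch of an exact substring search and of a simple proximity search. The function names, the sample sentence and the way the gap is counted are assumptions chosen for illustration; they are not taken from the patent or from any particular search engine.

```python
import re

def substring_hits(text, query):
    """Return the start offset of every exact occurrence of query in text."""
    hits, start = [], text.find(query)
    while start != -1:
        hits.append(start)
        start = text.find(query, start + 1)
    return hits

def proximity_hit(text, first, second, max_gap):
    """True if `second` follows `first` with at most max_gap words in between."""
    words = re.findall(r"\w+", text.lower())
    for i, word in enumerate(words):
        if word == first:
            # Look at the next word plus up to max_gap intervening words.
            window = words[i + 1 : i + 2 + max_gap]
            if second in window:
                return True
    return False

SAMPLE = "She bought new red shoes. He kept his old shoe."

# The exact sequence "new shoe" never occurs, so substring search finds nothing.
print(substring_hits(SAMPLE, "new shoe"))
# A proximity search succeeds only if the user thinks of the plural form.
print(proximity_hit(SAMPLE, "new", "shoes", 1))
print(proximity_hit(SAMPLE, "new", "shoe", 1))
```

In this sketch the user still has to supply each inflected form by hand, and the proximity window ignores sentence boundaries, which is one way irrelevant hits can creep in.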
Existing techniques generally pursue the same goal: to let the user better express the variability of the natural language in which the expression is to be searched, so as not to miss any place where the expression appears.
However, known techniques suffer from several drawbacks: the end user has to learn the query language proposed by the search engine; no two search engines have the same query language; if the user doesn't think of all the possible variations of the searched expression, he can miss some relevant documents; and, conversely, if the search expression is too “loose”, many irrelevant documents will be retrieved, generating noise.
The linguistic search techniques according to the present invention overcome at least some of the above-mentioned problems. They rely both on linguistic tools (such as a tokeniser, morphological analyser and disambiguator) and on the generation of complex regular expressions to match against the text database.
This mechanism has the advantage over a basic full-text search engine that the end user doesn't need to learn an esoteric query language: he just has to type the multiword expression he is looking for in natural language.
A further advantage is that the retrieved documents will be much more relevant to the query from a linguistic point of view (although it doesn't ensure that all relevant documents will be retrieved from the point of view of the meaning).
A further advantage is that many variations will be captured by the linguistic processing. As a consequence, even a user who is not familiar with the language in which the searched documents are written doesn't have to know about the linguistic variation that might occur.
The linguistic search techniques according to the invention provide a new way to search for information in a text database. They enable users to find portions of a text which match multiword expressions given by the user. Matches include possible variations that are relevant to the initial criteria from a linguistic point of view. These range from simple inflections, such as plural/singular, masculine/feminine or conjugated verbs, to more complex variations, such as the insertion of additional adjectives, adverbs, etc. between the words specified by the user. This technique can complement conventional full-text search engines by reducing the number of retrieved documents that are inconsistent with the query.
SUMMARY
The present invention provides a method of searching for information in a text database, comprising: (a) receiving at least one user input, the user input(s) defining a natural language expression including one or more words, (b) converting the natural language expression to a tagged form of the expression, the tagged form including said one or more words and, associated therewith, a part-of-speech tag, (c) applying to the tagged form one or more grammar rules of the language of the natural language expression, to derive a regular expression, and (d) analysing the text database to determine whether there is a match between said regular expression and a portion of said text database.
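Steps (a) to (d) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the tag names, the single grammar rule (modifiers may be inserted before a noun) and the function names are hypothetical, and the sketch stands in for the tokeniser, morphological analyser and disambiguator with a simple dictionary lookup.

```python
import re

# Hypothetical toy lexicon: surface form -> (root form, part-of-speech tag).
LEXICON = {
    "new": ("new", "ADJ"), "red": ("red", "ADJ"), "very": ("very", "ADV"),
    "shoe": ("shoe", "NOUN"), "shoes": ("shoe", "NOUN"),
    "buy": ("buy", "VERB"), "buys": ("buy", "VERB"), "bought": ("buy", "VERB"),
}

def tag(expression):
    """Step (b): map each word to (root form, POS tag).
    Words missing from the toy lexicon raise KeyError."""
    return [LEXICON[word] for word in expression.lower().split()]

def surface_forms(root, pos):
    """All lexicon surface forms that share this root form and tag."""
    return sorted(w for w, entry in LEXICON.items() if entry == (root, pos))

def to_regex(tagged):
    """Step (c): apply one toy grammar rule, allowing any number of
    adjectives or adverbs to be inserted immediately before a noun."""
    modifiers = sorted(w for w, (_, p) in LEXICON.items() if p in ("ADJ", "ADV"))
    filler = r"(?:(?:%s)\s+)*" % "|".join(map(re.escape, modifiers))
    parts = []
    for root, pos in tagged:
        forms = "(?:%s)" % "|".join(map(re.escape, surface_forms(root, pos)))
        parts.append((filler if pos == "NOUN" else "") + forms)
    return r"\b" + r"\s+".join(parts) + r"\b"

def search(expression, corpus):
    """Steps (a)-(d): return the portions of the corpus that match."""
    pattern = re.compile(to_regex(tag(expression)), re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(corpus)]

print(search("buy new shoe", "She bought new red shoes yesterday."))
# → ['bought new red shoes']
```

Here the user types the uninflected expression and the generated regular expression absorbs both the inflections ("bought", "shoes") and the inserted adjective ("red"). A real system would derive these variations from full linguistic resources rather than from a hand-written dictionary.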
Preferably, step (b) comprises the step of: tagging the natural language expression by, for each word in said natural language expression, (b1) converting each word to its root form, and (b2) applying a part-of-speech tag to each word, thereby generating a complex tagged form.
Preferably, the part-of-speech tag includes a syntactic category marker and a morphological feature marker, and step (b) further comprises the step of: (b3) simplifying said complex tagged form by removing each morphological feature marker, to generate said tagged form.
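The simplification in step (b3) might be sketched as follows. The `root+CATEGORY+FEATURE` notation used here is a hypothetical encoding chosen for the example; the text above does not specify how tags are written.

```python
def simplify(complex_tagged):
    """Keep the root form and syntactic category of each item and
    drop any morphological feature markers that follow them."""
    simple = []
    for item in complex_tagged:
        root, category, *_features = item.split("+")
        simple.append(f"{root}+{category}")
    return simple

# Hypothetical complex tagged form of "bought new shoes".
complex_form = ["buy+VERB+PAST", "new+ADJ", "shoe+NOUN+PL"]
print(simplify(complex_form))
# → ['buy+VERB', 'new+ADJ', 'shoe+NOUN']
```

Dropping the feature markers leaves a form that no longer commits to a particular tense or number, which is what lets the later matching step accept inflected variants.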
Preferably, the method further includes the step of (e) determining the location in said text database of a match with said regular expression.
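One way to report the location of a match, assuming the text database is available as a single string and that a line/column pair is an acceptable notion of "location", is sketched below in Python. The function name and the hand-written pattern are illustrative only.

```python
import re

def locate(pattern, text):
    """Yield (line, column, matched_text) for each match, both 1-based."""
    for m in re.finditer(pattern, text):
        line = text.count("\n", 0, m.start()) + 1
        # Column is the offset from the character after the previous newline.
        column = m.start() - (text.rfind("\n", 0, m.start()) + 1) + 1
        yield line, column, m.group(0)

text = "old shoe\nnew red shoes\n"
pattern = r"\bnew\s+(?:\w+\s+)?shoes?\b"
print(list(locate(pattern, text)))
# → [(2, 1, 'new red shoes')]
```

Character offsets from `m.start()` and `m.end()` could be reported instead where the database is not line-oriented.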
The invention further provides a programmable data processing apparatus when suitably programmed for carrying out the method of any of the appended claims, or according to any of the particular embodiments described herein.

