Reexamination Certificate
1998-06-18
2003-01-07
Edouard, Patrick N. (Department: 2654)
Data processing: speech signal processing, linguistics, language
Linguistics
C704S009000, C707S793000
active
06505150
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to computational linguistics. In particular, the present invention relates to a method of automatically filtering searches of large, untagged, heterogeneous collections of machine-readable texts using text genre.
BACKGROUND OF THE INVENTION
The word “genre” usually functions as a literary substitute for “kind of text.” Text genre differs from the related concepts of text topic and document genre. Text genre and text topic are not wholly independent. Distinct text genres like newspaper stories, novels and scientific articles tend to deal largely with different ranges of topics; however, topical commonalities within each of these text genres are very broad and abstract. Additionally, any extensive collection of texts relating to a single topic almost always includes works of more than one text genre, so that the formal similarities between them are limited to the presence of lexical items. While text genre as a concept is independent of document genre, the two genre types grow up in close historical association with dense functional interdependencies. For example, a single text genre may be associated with several document genres. A short story may appear in a magazine or an anthology, and a novel can be published serially in parts, reissued as a hardcover and later as a paperback. Similarly, a document genre like a newspaper may contain several text genres, like features, columns, advice-to-the-lovelorn, and crossword puzzles. These text genres might not read as they do if they did not appear in a newspaper, which licenses the use of context-dependent words like “yesterday” and “local”. By virtue of their close association, material features of document genres often signal text genre. For example, a newspaper may use one font for the headlines of “hard news” and another in the headlines of analysis; a periodical may signal its topical content via paper stock; business and personal letters can be distinguished based upon page layout; and so on. It is because digitization eliminates these material clues as to text and document genres that it is often difficult to retrieve relevant texts from heterogeneous digital text collections.
The boundaries between textual genres mirror the divisions of social life into distinct roles and activities—between public and private, generalist and specialist, work and recreation, etc. Genres provide the context that makes documents interpretable, and for this reason genre, no less than content, shapes the user's conception of relevance. For example, a researcher seeking information about supercolliders or Napoleon will care as much about text genre as content—she will want to know not just what the source says, but whether that source appears in a scholarly journal or in a popular magazine.
Until recently, work on information retrieval and text classification has focused almost exclusively on the identification of topic, rather than on text genre. Two reasons explain this neglect. First, the traditional print-based document world did not perceive a need for genre classification because in this world genres are clearly marked, either intrinsically or by institutional and contextual features. A scientist looking in a library for an article about cold fusion need not worry about how to restrict his search to journal articles, which are catalogued and shelved so as to keep them distinct from popular science magazines. Second, early information retrieval work with on-line text databases focused on small, relatively homogeneous databases in which text genre was externally controlled, like encyclopedia or newspaper databases. The creation of large, heterogeneous text databases, in which the lines between text genres are often unmarked, highlights the importance of genre classification of texts. Topic-based search tools alone cannot adequately winnow the domain of a reader's interest when searching a large heterogeneous database.
Applications of genre classification are not limited to the field of information retrieval. Several linguistic technologies could also profit from its application. Both automatic part-of-speech taggers and sense taggers could benefit from genre classification because it is well known that the distribution of word senses varies enormously according to genre.
Discussions of literary classification stretch back to Aristotle. The literature on genre is rich with classificatory schemes and systems, some of which might be analyzed as simple attribute systems. These discussions tend to be vague and to focus exclusively on literary forms like the eclogue or the novel, and, to a lesser extent, on paraliterary forms like the newspaper crime report or the love letter. Classification discussions tend to ignore unliterary textual types such as annual reports, email communications, and scientific abstracts. Moreover, none of these discussions makes an effort to tie the abstract dimensions along which genres are distinguished to any formal features of the texts.
The only linguistic research specifically concerned with quantitative methods of genre classification of texts is that of Douglas Biber. His work includes: “Spoken and Written Textual Dimensions in English: Resolving the Contradictory Findings”, Language, 62(2):384-413, 1986; Variation Across Speech and Writing, Cambridge University Press, 1988; “The Multidimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings”, Computers and the Humanities, 26(5-6):331-347, 1992; “Using Register-Diversified Corpora for General Language Studies”, in Using Large Corpora, pp. 179-202 (Susan Armstrong ed.) (1994); and, with Edward Finegan, “Drift and the Evolution of English Style: A History of Three Genres”, Language, 65(1):93-124, 1989.
Biber's work is descriptive, aimed at differentiating text genres functionally according to the types of linguistic features that each tends to exploit. He begins with a corpus that has been hand-divided into a number of distinct genres, such as “academic prose” and “general fiction.” He then ranks these genres along several textual “dimensions” or factors, typically three or five. Biber individuates his factors by applying factor analysis to a set of linguistic features, most of them syntactic or lexical. These features include, for example, past-tense verbs, past participial clauses and “wh-” questions. He then assigns to his factors general meanings or functions by abstracting over the discourse functions that linguists have assigned to the individual components of each factor; for example, an “informative vs. involved” dimension, a “narrative vs. non-narrative” dimension, and so on. Note that these factors are not individuated according to their usefulness in classifying individual texts according to genre. The score that any text receives on a given factor or set of factors may not be greatly informative as to its genre because there is considerable overlap between genres with regard to any individual factor.
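As a rough illustration of the feature-counting step that precedes the factor analysis described above, the following Python sketch computes per-thousand-word rates for two of the named features and averages them over a hand-labelled corpus. The function names, the regular expressions, the “-ed” suffix heuristic for past-tense verbs, and the two-text toy corpus are all assumptions made for illustration; none of them comes from Biber's work, which relies on a far richer feature set and proper linguistic analysis. The factor analysis itself is not shown.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical list of wh-words used to spot "wh-" questions.
WH_WORDS = {"who", "whom", "whose", "what", "which", "when", "where", "why", "how"}


def feature_rates(text):
    """Return per-1000-word rates for two crude surface features."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty input

    # Assumption: words longer than three letters ending in "ed"
    # approximate past-tense verbs. A real system would use a tagger.
    past = sum(1 for w in words if len(w) > 3 and w.endswith("ed"))

    # Assumption: a wh-question is a "?"-terminated sentence whose
    # first word is a wh-word.
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    wh = 0
    for s in sentences:
        tokens = s.split()
        if tokens and s.strip().endswith("?") and tokens[0].lower() in WH_WORDS:
            wh += 1

    return {"past_tense": 1000 * past / n, "wh_question": 1000 * wh / n}


def genre_means(corpus):
    """Average each feature rate over the texts of each hand-assigned genre."""
    by_genre = defaultdict(list)
    for genre, text in corpus:
        by_genre[genre].append(feature_rates(text))
    return {
        genre: {name: mean(row[name] for row in rows) for name in rows[0]}
        for genre, rows in by_genre.items()
    }


# Toy corpus of (genre label, text) pairs; the labels are assigned by hand.
corpus = [
    ("general fiction", "She walked home. Why had he called?"),
    ("academic prose", "We report results. The method is simple."),
]

for genre, rates in genre_means(corpus).items():
    print(genre, rates)
```

Per-genre averages like these can rank genres along a feature, but a single text's score on one feature says little about its genre, which is the overlap problem noted above.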
Jussi Karlgren and Douglass Cutting describe their effort to apply some of Biber's results to automatic categorization of genre in “Recognizing Text Genres with Simple Metrics Using Discriminant Analysis”, in Proceedings of Coling '94, Volume II, pp. 1071-1075, August 1994. They too begin with a corpus of hand-classified texts, the Brown corpus. The people who organized the Brown corpus describe their classifications as generic, but the fit between the texts and the genres a sophisticated reader would recognize is only approximate. Karlgren and Cutting use either lexical or distributional features: the lexical features include first-person pronoun count and present-tense verb count, while the distributional features include long-word count and characters-per-word average. They do not use punctuational or character-level features. Using discriminant analysis, the authors classify the texts into various numbers of categories. When Karlgren and Cutting used a number of functions equal to the number of categories assigned by hand, the fit bet
Kessler Brett L.
Nunberg Geoffrey D.
Pedersen Jan O.
Schuetze Hinrich
Edouard Patrick N.
Xerox Corporation