Classification: Image analysis – Image transformation or preprocessing – Image storage or retrieval
Type: Reexamination Certificate
Filed: 1998-05-01
Issued: 2001-05-22
Examiner: Couso, Jose L. (Department: 2721)
US Classes: C382S155000, C382S229000, C707S793000
Status: Active
Patent Number: 06236768
ABSTRACT:
BACKGROUND OF THE INVENTION
The tremendous amounts of information now available even to casual computer users, particularly over large computer networks such as the Internet, have engendered numerous efforts to ease the burden of locating, filtering, and organizing such information. These include classification and prioritization systems for e-mail (see, e.g., Maes, Commun. of the ACM 37(7):30-40 (1994); Cohen, “Learning Rules that Classify E-mail,” AAAI Spring Symposium on Machine Learning in Information Access, March 1996), systems for filtering news downloaded from the Internet (see, e.g., Lang, “NewsWeeder: Learning to Filter Netnews,” Machine Learning: Proc. of 12th Int'l Conf. (1995)), and schemes for organizing user-specific information such as notes, files, diaries, and calendars (see, e.g., Jones, Int'l J. of Man-Machine Studies 25 at 191-228 (1986); Lamming et al., “Forget-me-not: Intimate Computing in Support of Human Memory,” Proc. FRIEND21 '94 Int'l Symp. on Next Generation Human Interface (1994)).
Systems designed for information retrieval generally function in response to explicit user-provided queries. They do not, however, assist the user in formulating a query, nor can they assist users unable or unwilling to pose one. The Remembrance Agent (“RA”), described in Rhodes et al., Proc. of 1st Int'l Conf. on Practical Application of Intelligent Agents and Multi-Agent Technology at 487-495 (1996), is a computer program that watches what a user is typing in a word processor (specifically the Emacs UNIX-based text editor) and continuously displays a list of documents that might be relevant to the document currently being written or read. For example, if a journalist is writing a newspaper article about a presidential campaign, the RA might suggest notes from a recent interview, an earlier article about the campaign, and a piece of e-mail from her editor suggesting revisions to a previous draft of the article.
The utility of the RA stems from the fact that currently available desktop computers are fast and powerful, so that most processing time is spent waiting for the user to hit the next keystroke, read the next page, or load the next packet off the network. The RA utilizes otherwise-wasted CPU cycles to perform continuous searches for information of possible interest to the user based on current context, providing a continuous, associative form of recall. Rather than distracting from the user's primary task, the RA serves to augment or enhance it.
The RA works in two stages. First, the user's collection of text documents is indexed into a database saved in a vector format. These documents form the reservoir from which later suggestions of relevance are drawn; that is, stored documents will later be “suggested” as being relevant to a document currently being edited or read. The stored documents can be any sort of text document (notes, Usenet entries, webpages, e-mail, etc.). This indexing is usually performed automatically every night, and the index files are stored in a database. After the database is created, the second stage of the RA is run from Emacs, periodically taking a sample of text from the working buffer. The RA finds documents “similar” to the current sample according to word similarities; that is, the more times a word in the current sample is duplicated in a candidate database document, the greater will be the assumed relevance of that database document. The RA displays one-line summaries of the best few documents at the bottom of the Emacs window. These summary lines contain a line number, a relevance ranking (from 0.0=not relevant to 1.0=extremely relevant), and header information to identify the document. The list is updated at a rate selectable by the user (generally every few seconds), and the system is configured such that the entirety of a suggested document can be brought up by the user pressing the “Control-C” key combination and the line number to display.
Briefly, the concept behind the indexing scheme used in the RA is that any given document may be represented by a multidimensional vector, each dimension or entry of which corresponds to a single word and is equal in magnitude to the number of times that word appears in the document. The number of dimensions is equal to the number of allowed or indexed words. The advantages gained by this representation are relatively speedy disk retrieval and an easily computed quantity indicating similarity between two documents: the dot product of their (normalized) vectors.
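By way of illustration, a minimal sketch of this representation and comparison might look as follows in Python. This is not the patent's code; the names docvec, normalize, and similarity are invented for the example, and a Counter stands in for the sparse vector:

    import math
    from collections import Counter

    def docvec(words):
        """Document vector: one entry per word, equal to its occurrence count."""
        return Counter(words)

    def normalize(vec):
        """Divide each weight by the vector magnitude (root of summed squares)."""
        magnitude = math.sqrt(sum(w * w for w in vec.values()))
        if magnitude == 0.0:
            return dict(vec)
        return {word: w / magnitude for word, w in vec.items()}

    def similarity(a, b):
        """Similarity score: dot product of the two normalized vectors."""
        na, nb = normalize(a), normalize(b)
        return sum(weight * nb.get(word, 0.0) for word, weight in na.items())

    # Example: identical word distributions score 1.0; disjoint ones score 0.0.
    print(similarity(docvec(["agent", "agent", "good"]), docvec(["good", "agent"])))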
The RA creates vectors in three steps:
1. Removal of common words (called stop words), identified in a list of stop words.
2. Stemming of words (changing “jumped” and “jumps” to “jump,” for example). This is preferably accomplished using the Porter stemming algorithm, a standard method in the text-retrieval field.
3. Vectorization of the remaining text into a “document vector” (or “docvec”). Conceptually, a docvec is a multidimensional vector each entry of which indicates the number of times each word appears in the document.
For example, suppose a document contains only the words: “These remembrance agents are good agents.”
Step 1: Remove stop words
This converts the text to “Remembrance agents good agents”
Step 2: Stem words
This converts the text to “remembrance agent good agent”
Step 3: Make the document vector
This produces the vector:
000 . . . 121 . . . 000
Each position in the vector corresponds to an allowed word. The zeroes represent all allowed words not actually appearing in the text. The non-zero numerals indicate the number of times the corresponding word appears, e.g., a 1 for the words “good” and “remembr,” and a 2 for the word “agent”; thus, the numbers indicate the document “weight” for the word in question.
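Rendered as code, the worked example above might proceed as in the following sketch. The stop-word list and stemming table here are toy stand-ins sized to this one sentence; the RA itself uses a full stop list and the Porter algorithm:

    from collections import Counter

    # Toy stand-ins for the RA's stop list and Porter stemmer, covering
    # only the words of the example sentence.
    STOP_WORDS = {"these", "are"}
    STEMS = {"remembrance": "remembr", "agents": "agent"}  # Porter-style output

    def to_docvec(text):
        words = [w.strip(".,;:!?") for w in text.lower().split()]
        words = [w for w in words if w not in STOP_WORDS]   # Step 1: stop words
        words = [STEMS.get(w, w) for w in words]            # Step 2: stemming
        return Counter(words)                               # Step 3: docvec

    print(to_docvec("These remembrance agents are good agents."))
    # Counter({'agent': 2, 'remembr': 1, 'good': 1})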
Step 4: Normalize the vector
Document vectors are normalized (i.e., divided by the magnitude of the vector). The vector magnitude is given by the square root of the sum of the squared weights. (In fact, the normalization step takes place in the context of other computations, as described more fully below.) Normalization facilitates meaningful comparison between the words in a query and the words in a document in terms of their relative importance; for example, a word mentioned a few times in a short document carries greater significance than the same word mentioned a few more times in a very long document.
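Continuing the example, the non-zero weights are 1, 2, and 1, so the magnitude and normalized weights come out as:

$$\|v\| = \sqrt{1^2 + 2^2 + 1^2} = \sqrt{6} \approx 2.449, \qquad \hat{v} = \tfrac{1}{\sqrt{6}}\,(1, 2, 1) \approx (0.408,\ 0.816,\ 0.408)$$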
In a more recent implementation of the RA, a fifth step is added to improve the quality of matching beyond that attainable based solely on term frequency. In this fifth step, vectors are weighted by the inverse of the document frequency of the term, based on the assumption that words occurring frequently in a given document should carry more weight than words occurring frequently throughout the entire indexed corpus (which are less distinguishing). More rigorously, the similarity between two word vectors is found by multiplying the document term weight (DTW) for each term by the query term weight (QTW) for that term, and summing these products:
$$\mathrm{relevance} = \sum \mathrm{DTW} \cdot \mathrm{QTW}$$

where

$$\mathrm{DTW} = \frac{tf \cdot \log\frac{N}{n}}{\sqrt{\sum_{vector}\left(tf_i \cdot \log\frac{N}{n_i}\right)^2}}$$

and

$$\mathrm{QTW} = \frac{\left(0.5 + 0.5\,\frac{tf}{\max tf}\right)\cdot \log\frac{N}{n}}{\sqrt{\sum_{query}\left(\left(0.5 + 0.5\,\frac{tf_i}{\max tf}\right)\cdot \log\frac{N}{n_i}\right)^2}}$$
The document term weight is computed on a document-by-document basis for each indexed word in the document vector. Because it does not change until new documents are added to the corpus, these computations may take place only when the corpus is indexed and re-indexed. The summation in the denominator covers all words in the document vector (i.e., all indexed words) that also appear in the current document for which DTW is computed (since a summation term is zero otherwise); this facilitates normalization. The term frequency tf refers to the number of times a particular term appears in the current document; N is the total number of documents in the corpus; and n is the number of documents in which the term appears. The sum
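A direct, hypothetical transcription of these formulas into Python might look as follows; corpus is assumed here to be the list of indexed docvecs, from which N and n are derived:

    import math

    def idf(term, corpus):
        """log(N/n): N = documents in the corpus, n = documents containing term."""
        n = sum(1 for doc in corpus if term in doc)
        return math.log(len(corpus) / n) if n else 0.0

    def dtw(doc, corpus):
        """Document term weights tf * log(N/n), normalized by the square root
        of the sum of the squared weights (computed at indexing time)."""
        raw = {t: tf * idf(t, corpus) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {t: w / norm for t, w in raw.items()} if norm else raw

    def qtw(query, corpus):
        """Query term weights (0.5 + 0.5 * tf / max tf) * log(N/n), normalized
        the same way over the terms of the query."""
        max_tf = max(query.values())
        raw = {t: (0.5 + 0.5 * tf / max_tf) * idf(t, corpus)
               for t, tf in query.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        return {t: w / norm for t, w in raw.items()} if norm else raw

    def relevance(query, doc, corpus):
        """Sum of DTW * QTW over the terms the query and document share."""
        d, q = dtw(doc, corpus), qtw(query, corpus)
        return sum(w * q.get(t, 0.0) for t, w in d.items())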
Inventors: Maes, Pattie E.; Pentland, Alex P.; Rhodes, Bradley J.; Starner, Thad E.
Examiners: Couso, Jose L.; Desire, Gregory
Assignee: Massachusetts Institute of Technology
Attorney/Agent: Testa Hurwitz & Thibeault LLP