Method and apparatus for identifying spoof documents

Electrical computers and digital processing systems: multicomput – Computer network managing – Computer network monitoring

Reexamination Certificate

Details

C709S217000

active

06442606

ABSTRACT:

FIELD OF THE INVENTION
The present invention generally relates to data processing. The invention relates more specifically to identifying spoof documents among a large collection of electronic documents that are associated with, for example, an indexing system or search-and-retrieval system.
BACKGROUND OF THE INVENTION
The Internet, often simply called “the Net,” is a worldwide system of computer networks and, in a larger sense, the people using it. The Internet is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the Web”. The Web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is used to specify the contents and format of a hypermedia document (e.g., a Web page).
In this context, an HTML file is a file that contains the source code for a particular Web page. A Web page is the image that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or Web document may refer to either the source code for a particular Web page or the Web page itself.
Each page can contain embedded references to images, audio, or other Web documents. A user, using a Web browser, browses for information by following references, known as hyperlinks, that are embedded in each of the documents. The HyperText Transfer Protocol (“HTTP”) is the protocol used to access a Web document.
Through the use of the Web, individuals have access to millions of pages of information. However, a significant drawback of the Web is that, because it has so little organization, it can at times be extremely difficult for users to locate the particular pages that contain the information that is of interest to them.
To address this problem, a mechanism known as a “search engine” has been developed to index a large number of Web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. Indexes are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied by the location of the information. Values in one or more columns of a table are stored in an index, which is maintained separately from the actual database table. An “index word set” of a document is the set of words that are mapped to the document in an index. For documents that are not indexed, the index word set is empty.
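The index and “index word set” notions above can be illustrated with a minimal sketch. The function names `build_index` and `index_word_set`, the whitespace tokenization, and the two sample documents are assumptions made for illustration only; they are not taken from the patent.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: words are lowercased and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    # Ordered list of words, each accompanied by its locations.
    return {word: sorted(ids) for word, ids in sorted(index.items())}

def index_word_set(index, doc_id):
    """Return the set of words that the index maps to doc_id."""
    return {word for word, ids in index.items() if doc_id in ids}

# Hypothetical sample documents.
documents = {
    "page1.html": "cheap flights to paris",
    "page2.html": "paris travel guide",
}
index = build_index(documents)
```

With this sketch, a document that was never indexed simply has an empty index word set, which matches the definition given above.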
Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one “spider” that “crawls” across the Internet to locate Web documents around the world. Upon locating a document, the spider stores the document's Uniform Resource Locator (URL), and follows any hyperlinks associated with the document to locate other Web documents. Second, each search engine contains an indexing mechanism that indexes certain information about the documents that were located by the spider. In general, index information is generated based on the contents of the HTML file. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users to search the databases in order to locate specific documents that contain information that is of interest to them. To provide up-to-date information, the spiders continually crawl across the Internet to identify both new and updated documents for indexing. When a new or updated page is identified, the search engine makes corresponding updates to the database so as to continually provide up-to-date information.
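The three parts described above (spider, indexing mechanism, search tool) can be sketched against a tiny in-memory stand-in for the Web. Everything named here (`WEB`, `crawl`, `build_index`, `search`, and the example URLs) is hypothetical, and a real spider would fetch pages over HTTP rather than from a dictionary.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing hyperlinks).
WEB = {
    "http://a.example/": ("welcome to the home page",
                          ["http://a.example/news", "http://b.example/"]),
    "http://a.example/news": ("latest news page", ["http://a.example/"]),
    "http://b.example/": ("another home page", []),
}

def crawl(start_url, fetch):
    """Spider: visit pages breadth-first, following each hyperlink once."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        pages[url] = text  # store the URL together with the page contents
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Indexing mechanism: map each word to the URLs that contain it."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Search tool: return the URLs that contain every query word."""
    results = None
    for word in query.lower().split():
        urls = index.get(word, set())
        results = urls if results is None else results & urls
    return sorted(results or [])

pages = crawl("http://a.example/", WEB.__getitem__)
index = build_index(pages)
```

Re-running `crawl` and `build_index` periodically is the simplest way to model the continual re-crawling the text describes; incremental updates are left out of this sketch.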
Electronic documents include both visible text portions and non-visible text portions. The visible text portions of an electronic document include the textual information that is contained in the document and that is displayed to a user when the electronic document is rendered using an application such as a Web browser. The non-visible text portions include the textual information that is contained in the electronic document but that is not displayed, and therefore is not visible to a user, when the document is rendered using an application such as a Web browser. For example, FIG. 1A illustrates an HTML file 100 that contains both visible text portions and non-visible text portions. The visible text portions include text data 108, which is displayed when HTML file 100 is rendered by a browser application such as Netscape Navigator® or Microsoft Internet Explorer®. In contrast, the non-visible text portions include title data 104 and comment data 106. Also depicted in FIG. 1A are HTML tags 102, which represent codes that are used by browser applications to determine what information is to be made visible and how the visible information is to be structured and formatted when displayed. Title data 104 and comment data 106, also referred to as metadata, include textual information, referred to herein as “metawords”, that may be included in an HTML file but that is not displayed when the document is rendered by a browser application. For example, FIG. 1B illustrates Web page 110 as seen by rendering HTML file 100 through the use of a browser application. As depicted, upon rendering HTML file 100, the visible text portion (text data 108) is displayed in Web page 110 and is therefore visible to the user. In contrast, the non-visible text portions (title data 104 and comment data 106) are not displayed in Web page 110 and therefore are not visible to the user.
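The distinction between visible text, title data, and comment data can be sketched with Python's standard `html.parser` module. The class name `TextSplitter` and the sample markup are assumptions for illustration; the sample is not the HTML file shown in FIG. 1A, and the sketch ignores other non-displayed content such as scripts and style sheets.

```python
from html.parser import HTMLParser

class TextSplitter(HTMLParser):
    """Separate displayed body text from title data and comment data."""

    def __init__(self):
        super().__init__()
        self.visible, self.title, self.comments = [], [], []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text inside <title> is treated as non-visible title data.
            (self.title if self._in_title else self.visible).append(text)

    def handle_comment(self, data):
        # Comment text is never rendered in the page.
        self.comments.append(data.strip())

# Hypothetical sample document.
SAMPLE = """<html><head><title>Used cars</title></head>
<body><!-- cars trucks bargains -->
<p>Browse our listings.</p></body></html>"""

splitter = TextSplitter()
splitter.feed(SAMPLE)
splitter.close()
```

After parsing, `splitter.visible` holds the text a reader would see in the page, while `splitter.title` and `splitter.comments` hold the metawords.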
Different search engines use different techniques to extract and index information contained on the Internet. For example, some search engines use indexing mechanisms that index every single word in each document, while others index only the first “N” number of words in each document.
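As a rough sketch of the two policies just mentioned, the hypothetical helper below returns either every word of a document or only its first N words. The name `words_to_index` and the whitespace tokenization are assumptions, not details from the patent.

```python
def words_to_index(text, first_n=None):
    """Return the words an indexer would consider for one document.

    first_n=None models indexing every word; an integer models
    indexing only the first N words.
    """
    words = text.lower().split()
    return words if first_n is None else words[:first_n]

# Hypothetical sample text.
sample = "Browse our listings of used cars and trucks"
all_words = words_to_index(sample)
first_three = words_to_index(sample, first_n=3)
```

Under the first-N policy, any word that appears only later in the document never reaches the index.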
Because certain non-visible portions of documents typically provide an accurate description of the visible contents of the document, many search engines index not only the visible text portion but also sections of the non-visible text portions. For example, the metadata associated with the tag <title> typically include title information that concisely and accurately describes the subject matter or contents of the particular document. Similarly, the metadata associated with the tag <comment> may include comment information that relates to the subject matter or contents of the particular document. An illustrated example is provided by title data 104 and comment data 106 of FIG. 1A. Thus, by indexing a document based on the metadata that is associated with certain tags contained therein, the document can be indexed in a way that accurately reflects its contents.
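A minimal sketch of indexing on metadata as well as visible text is given here, assuming the title and comment strings have already been extracted from the document. The function name `index_word_set` and the sample strings are hypothetical.

```python
def index_word_set(visible_text, title="", comments=()):
    """Union of visible words and metawords from title and comment data."""
    words = set(visible_text.lower().split())
    words.update(title.lower().split())
    for comment in comments:
        words.update(comment.lower().split())
    return words

# Hypothetical sample document parts.
visible_only = index_word_set("Browse our listings")
with_metadata = index_word_set(
    "Browse our listings",
    title="Used cars",
    comments=["cars trucks bargains"],
)
```

The difference between the two sets shows which index entries come only from text that a reader of the rendered page never sees.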
Because the results of a query search are highly dependent on the indexes that are used to process the query, it is critical that the indexes used in a search be as accurate as possible. Therefore, it is important that the indexing mechanisms index each document based on those words or terms that most accurately describe the contents of the document. However, certain Web marketers and site designers want as many “hits” on their Web pages as possible. Thus, to increase the number of hits on a particular Web page, certain Web page developers have employed a technique known as “spamdexing” to cause numerous non-representative index entries to be generated for their Web pages.
In this context, the term spamdexing is defined as adding additional words or terms to a document in order to affect how the document is indexed or otherwise treated. Spamdexing may be performed by adding unrelated visible text to a document, and/or by adding non-visible metadata. Documents in which spamdexing has been applied are genera
