Associating files of data

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate


Details

C707S793000, C707S793000, C709S217000

active

06308176

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to associating large data files with information categories.
INTRODUCTION TO THE INVENTION
Traditionally, database technology has been dedicated to the organisation of numerical and tabular data, and it is only recently, particularly with the expansion of the Internet, that demand has grown significantly for the retrieval of text-based files. Several facilities, commonly referred to as “search engines”, are available on the Internet to assist in locating information. The majority of these operate by performing what has become known as “free text” searching, in which a user specifies words which they believe are contained within the target file as a mechanism for instructing the system to retrieve files of interest.
Problems with this technique are well known to users of the available search engines: a simple enquiry can generate hundreds of thousands of “hits”, the majority of which will tend to be totally irrelevant to the user's needs. Furthermore, other relevant files may be missed simply because they do not contain the specific chosen words.
As is well documented, the freedom of the Internet is also its downfall. Information is not classified before it is made available; it is therefore highly likely that even the simplest search will fail to identify relevant documentation and will take a considerable period of time to perform.
Procedures for classifying volumes of data so as to facilitate subsequent searching are known but these classification processes often involve manual intervention, thereby making them time consuming and prone to human error. Furthermore, except in circumstances where the documentation is considered to be extremely valuable and will continue to be required over a significant period of time, the cost of performing this manual exercise cannot be justified in terms of the commercial worth of the data sources being considered. Consequently, the problem results in much data being effectively inaccessible and outside the realm of searchable knowledge.
Procedures are known for processing a data file so as to determine whether the data file should be associated with a particular information category. The known processes require a machine readable association file (or outline file). Using this file, an incoming data file can be processed to produce a numerical score value defining the extent to which the data file is relevant to the associated category. Thereafter, decisions may be made as to whether the data file is to be associated with particular categories by performing respective threshold comparisons.
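The score-and-threshold procedure described above can be sketched as follows. This is a minimal illustration only: the representation of an outline file as a mapping from terms to weights, the summing of weights per occurrence, and the names `score_file` and `associate` are assumptions made for the example and are not taken from the specification.

```python
import re


def score_file(text, outline):
    """Sum the outline weight of every word occurrence in the text.

    `outline` is an assumed stand-in for an outline file: a dict that
    maps a term to a numerical weight.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sum(outline.get(word, 0.0) for word in words)


def associate(text, outlines, thresholds):
    """Return the categories whose score meets that category's threshold."""
    return [
        category
        for category, outline in outlines.items()
        if score_file(text, outline) >= thresholds[category]
    ]


# Hypothetical outlines and thresholds, for illustration only.
outlines = {
    "finance": {"bank": 2.0, "loan": 1.5, "interest": 1.0},
    "sport": {"goal": 2.0, "match": 1.0},
}
thresholds = {"finance": 3.0, "sport": 3.0}

print(associate("The bank raised interest on every loan.", outlines, thresholds))
```

Because the score in this sketch grows with the number of matching words, a longer file tends to score higher against every outline, which is the size sensitivity the specification goes on to address.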
In practical systems, thousands of such outline files would be required in order to provide a useful level of categorisation. In the present applicant's co-pending British patent application number 98 08 808.1, in the present applicant's co-pending European patent application (DGC-P11-EP) and in the present Assignee's co-pending United States patent application (DGC-P11-US) a method of generating machine readable association files is described. A plurality of data files are manually selected as being examples of files which should be associated with a particular category. In addition, a plurality of files are selected manually which are considered not to be associated with a particular category. Having identified these files, the process identifies preferred term candidates from the associated files, weights these candidates with reference to files not associated with the category and applies terms to a machine readable association file by analysing the weighting values.
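One possible reading of this generation process is sketched below. The weighting formula is an assumption chosen for illustration (the share of example files containing a term minus the share of counter-example files containing it); the co-pending applications may use a different weighting. The names `build_outline` and `min_weight` are hypothetical, and both file lists are assumed to be non-empty.

```python
import re


def tokens(text):
    """Return the set of distinct lower-case words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_outline(positive_files, negative_files, min_weight):
    """Build a term -> weight mapping from manually selected examples.

    Candidates are the terms found in the positive files. Each candidate
    is weighted against the negative files, and only candidates whose
    weight reaches `min_weight` are written to the outline.
    """
    pos_sets = [tokens(t) for t in positive_files]
    neg_sets = [tokens(t) for t in negative_files]
    candidates = set().union(*pos_sets)
    outline = {}
    for term in candidates:
        pos_rate = sum(term in s for s in pos_sets) / len(pos_sets)
        neg_rate = sum(term in s for s in neg_sets) / len(neg_sets)
        weight = pos_rate - neg_rate  # assumed weighting, not from the source
        if weight >= min_weight:
            outline[term] = weight
    return outline


# Hypothetical example files for a "finance" category.
positive = ["bank loan rates", "the bank approved a loan"]
negative = ["the match ended with a late goal", "rates of scoring in the match"]

print(build_outline(positive, negative, min_weight=0.75))
```

Terms such as "the" and "rates" appear in the counter-examples as well, so their weights fall below the cut-off and they are left out of the outline.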
The resulting association files are particularly well suited to associating new data files which are of substantially similar size to the original source data files. Similarly, association files generated by more traditional techniques still tend to be well suited to input data files of a particular size but less well suited to incoming data files of differing sizes. Thus, if a new incoming data file is larger than the optimum file size, it is possible that many irrelevant files will be inappropriately categorised given that the processing of these files will result in an inappropriately high weighting value being calculated.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention, there is provided apparatus configured to associate files of data of a size greater than a predetermined size, comprising dividing means configured to divide a file into a plurality of file sections having a size substantially consistent with a preferred size; categorising means configured to categorise each of said file sections to produce sets of section associations; and processing means configured to process said sets of section associations to produce a set of category associations for the original undivided file.
In a preferred embodiment, the categorising means is configured to categorise each file section by processing a section in combination with association files. Preferably, the apparatus includes storage means for storing the association files as outline files and each of the stored outline files may relate to a respective category.
According to a second aspect of the present invention, there is provided a method of associating files of data of a size greater than a predetermined size, comprising steps of dividing a file into a plurality of file sections each having a size substantially consistent with a preferred size; categorising each of said file sections to produce sets of section associations; and processing said sets of section associations to produce a set of category associations for the original undivided file.
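The three steps of the method (divide, categorise each section, combine the section associations) can be sketched as follows. Several details are assumptions made for the example: sections are measured in words, an outline is a mapping from terms to weights, and a category is assigned to the whole file when at least `min_sections` sections are associated with it. The summary does not state the combination rule, so that rule in particular is hypothetical.

```python
import re


def split_into_sections(text, preferred_words):
    """Divide a text into sections of about `preferred_words` words.

    The final section may be shorter than the preferred size.
    """
    words = text.split()
    return [
        " ".join(words[i:i + preferred_words])
        for i in range(0, len(words), preferred_words)
    ]


def categorise_section(section, outlines, thresholds):
    """Return the set of categories whose score meets its threshold."""
    found = re.findall(r"[a-z0-9]+", section.lower())
    result = set()
    for category, outline in outlines.items():
        score = sum(outline.get(word, 0.0) for word in found)
        if score >= thresholds[category]:
            result.add(category)
    return result


def associate_large_file(text, outlines, thresholds, preferred_words, min_sections=1):
    """Combine per-section associations into associations for the whole file."""
    sections = split_into_sections(text, preferred_words)
    counts = {}
    for section in sections:
        for category in categorise_section(section, outlines, thresholds):
            counts[category] = counts.get(category, 0) + 1
    # Assumed rule: keep categories matched by at least `min_sections` sections.
    return sorted(c for c, n in counts.items() if n >= min_sections)


# Hypothetical outlines, thresholds, and input text.
outlines = {
    "finance": {"bank": 2.0, "loan": 1.5},
    "sport": {"goal": 2.0, "match": 1.5},
}
thresholds = {"finance": 3.0, "sport": 3.0}
text = (
    "the bank approved the loan today "
    "weather stayed mild and dry today "
    "a late goal decided the match"
)

print(associate_large_file(text, outlines, thresholds, preferred_words=6))
```

Scoring each section separately keeps every score on the scale the outlines were built for, regardless of how long the undivided file is.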
Preferably, the preferred size is smaller than the predetermined size.
In a preferred embodiment, tables are removed from a data file before the file is divided into sections. Preferably, an assessment is made as to whether it is desirable to increase the size of the sections, whereafter the size of said sections is increased and the dividing process is repeated. Preferably, data files are continually received from data sources.
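A sketch of table removal followed by repeated division with growing sections is given below. The summary states neither how tables are recognised nor how the assessment is made, so both are assumptions here: a line is treated as tabular if it contains a tab or a pipe character, and the section size is multiplied by `growth` whenever no section received any category. The `categorise` argument is a hypothetical callable that takes a list of words and returns a set of categories.

```python
def remove_tables(text):
    """Drop lines that look tabular (assumed: they contain a tab or a pipe)."""
    kept = [
        line for line in text.splitlines()
        if "\t" not in line and "|" not in line
    ]
    return "\n".join(kept)


def divide(words, size):
    """Split a list of words into consecutive sections of `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def categorise_with_growth(text, categorise, start_size, growth=2):
    """Divide, categorise, and retry with larger sections if nothing matched.

    Assumes start_size >= 1 and growth >= 2 so that the loop terminates.
    Returns the section size finally used and the per-section results.
    """
    words = remove_tables(text).split()
    size = start_size
    while True:
        results = [categorise(section) for section in divide(words, size)]
        # Assumed assessment: stop once any section has a category, or
        # once a single section already covers the whole file.
        if any(results) or size >= len(words):
            return size, results
        size *= growth


# Hypothetical categoriser: "finance" needs both terms in one section.
def categorise(section):
    return {"finance"} if {"bank", "loan"} <= set(section) else set()


text = "bank policy update\nyear\tamount\n1998\t100\nevery loan was reviewed"
size, results = categorise_with_growth(text, categorise, start_size=3)
print(size, results)
```

In this example the two relevant terms fall in different three-word sections, so the first pass finds nothing and the second pass, with six-word sections, associates the first section with the category.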


REFERENCES:
patent: 5604849 (1997-02-01), Artwick et al.
patent: 5717914 (1998-02-01), Husick et al.
patent: 5748954 (1998-05-01), Mauldin
Hearst et al, “Subtopic Structuring For Full-Length Document Access” (Jun. 27, 1993). Proceedings of SIGIR '93. Pittsburgh. PA. p. 59-68.*
IBM Technical Disclosure Bulletin; vol. 34, No. 4B; Sep. 1991; Intelligent Library Filter for Office.
Subtopic Structuring for Full-Length Document Access; Marti Hearst et al.; Computer Science Division; UC Berkeley; Berkeley, California.
TCS: A Shell for Content-Based Text Categorization; Philip J. Hayes et al.; Carnegie Group; Pittsburgh, Pennsylvania.
