Patent
1996-09-09
1999-06-01
Black, Thomas G.
Data processing: database and file management or data structures
Database design
Data structure types
707/104, G06F 9/00
active
059096802
ABSTRACT:
A system and method for efficient document categorization are disclosed. In one embodiment, word length distribution information is used as a basis for categorization. Greater than 90% accuracy in classification may be achieved in, e.g., distinguishing newspaper articles from scientific journal articles. Word length distribution information may be developed without optical character recognition (OCR), permitting use of degraded document images.
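The idea of categorizing by word length distribution can be illustrated with a minimal sketch. It assumes the word lengths are already available; the abstract describes deriving them from document images without OCR, and that step is not reproduced here, so plain text tokens serve as a stand-in. The bin cap `MAX_LEN`, the L1 distance, the nearest-profile rule, and all function names are illustrative assumptions, not the method claimed in the patent.

```python
import string
from collections import Counter

MAX_LEN = 15  # assumed cap: longer words share the last bin
PUNCT = string.punctuation


def length_distribution(word_lengths, max_len=MAX_LEN):
    """Return a normalized histogram of word lengths over bins 1..max_len."""
    counts = Counter(min(max(n, 1), max_len) for n in word_lengths)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * max_len
    return [counts.get(i, 0) / total for i in range(1, max_len + 1)]


def text_word_lengths(text):
    """Stand-in for image-based estimation: measure whitespace-split tokens."""
    stripped = (w.strip(PUNCT) for w in text.split())
    return [len(w) for w in stripped if w]


def l1_distance(p, q):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))


def classify(distribution, reference_profiles):
    """Pick the category whose reference histogram is closest (assumed rule)."""
    return min(
        reference_profiles,
        key=lambda name: l1_distance(distribution, reference_profiles[name]),
    )


# Toy reference profiles built from one made-up sentence per category.
profiles = {
    "newspaper": length_distribution(text_word_lengths(
        "The mayor said the city will open a new park by the lake")),
    "journal": length_distribution(text_word_lengths(
        "Experimental measurements demonstrate statistically significant correlation")),
}
query = length_distribution(text_word_lengths(
    "The team won the game in the last minute"))
print(classify(query, profiles))
```

In practice the reference profiles would be averaged over many documents per category, and the accuracy figure quoted in the abstract should not be expected from toy data like this.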
REFERENCES:
patent: 4839853 (1989-06-01), Deerwester et al.
patent: 5159667 (1992-10-01), Borrey et al.
patent: 5369577 (1994-11-01), Kadashevich et al.
patent: 5418951 (1995-05-01), Damashek
patent: 5465353 (1995-11-01), Hull et al.
patent: 5537586 (1996-07-01), Amram et al.
patent: 5542089 (1996-07-01), Lindsay et al.
patent: 5642522 (1997-06-01), Zaenen et al.
patent: 5689585 (1997-11-01), Bloomberg et al.
Inspec Abstract No. C79024302 Cluster Analysis of English Text, Toussaint & Shinghal, May 31, 1978.
M. Damashek, "Gauging similarity with n-grams: Language-independent categorization of text," Science 267 (Feb. 10, 1995), pp. 843-848.
R. Shinghal and G.T. Toussaint, "Cluster analysis of English text," Proceedings of the IEEE Computer Society Conference on Pattern Recognition and Image Processing, Chicago, Illinois, May 31-Jun. 2, 1978, pp. 164-172.
Chen and Haralick, "Extraction of Text Layout Structures on Document Images Based on Statistical Characterization," Proceedings of the SPIE Document Recognition II Conference (Feb. 1995).
J.J. Hull and Y. Li, "Word Recognition Result Interpretation Using the Vector Space Model for Information Retrieval," Proceedings of the IEEE Computer Society Conference on Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, Apr. 26-28, 1993, pp. 147-155.
J.M. Trenkle and R.C. Vogt, "Word Recognition for Information Retrieval in the Image Domain," Proceedings of the IEEE Computer Society Conference on Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, Apr. 26-28, 1993, pp. 105-122.
Black, Thomas G.
Mills, John G., III
Ricoh Company Limited
Ricoh Corporation
TITLE:
Document categorization by word length distribution analysis