Document categorization by word length distribution analysis

Data processing: database and file management or data structures – Database design – Data structure types

Patent

Details

707104, G06F 900

Patent

active

059096802

ABSTRACT:
A system and method for efficient document categorization are disclosed. In one embodiment, word length distribution information is used as a basis for categorization. Greater than 90% accuracy in classification may be achieved in, e.g., distinguishing newspaper articles from scientific journal articles. Word length distribution information may be developed without optical character recognition (OCR), permitting use of degraded document images.
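The abstract does not say how the word length distribution is compared against known categories, so the following is only a minimal sketch of the general idea and not the patented method. It assumes plain text is already available (the patent derives word lengths from document images without OCR, which is not shown here), and it swaps in a simple nearest-mean classifier with squared Euclidean distance. The names `word_length_distribution`, `mean_distribution`, `classify`, and the `MAX_LEN` cap are hypothetical.

```python
from collections import Counter

MAX_LEN = 15  # hypothetical cap: words longer than this share the last bin


def word_length_distribution(text, max_len=MAX_LEN):
    """Return the fraction of words at each length 1..max_len."""
    # Strip common punctuation so "mat." counts as a 3-letter word.
    lengths = [min(len(w.strip(".,;:!?\"'()")), max_len) for w in text.split()]
    lengths = [n for n in lengths if n > 0]
    total = len(lengths)
    if total == 0:
        return [0.0] * max_len
    counts = Counter(lengths)
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def mean_distribution(texts):
    """Average the distributions of several sample documents of one category."""
    dists = [word_length_distribution(t) for t in texts]
    return [sum(col) / len(dists) for col in zip(*dists)]


def classify(text, profiles):
    """Pick the category whose mean distribution is closest to the text's."""
    dist = word_length_distribution(text)

    def sq_dist(profile):
        return sum((a - b) ** 2 for a, b in zip(dist, profile))

    return min(profiles, key=lambda label: sq_dist(profiles[label]))


# Toy training data, invented purely for illustration.
profiles = {
    "news": mean_distribution(["The cat sat on the mat and ran to the den."]),
    "science": mean_distribution(
        ["Electromagnetic interactions characterize semiconductor heterostructures."]
    ),
}

print(classify("A dog ran up the hill to get the ball.", profiles))
```

With real data, each category profile would be built from many sample documents, and the accuracy figure quoted in the abstract should not be expected from a toy setup like this one.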

REFERENCES:
patent: 4839853 (1989-06-01), Deerwester et al.
patent: 5159667 (1992-10-01), Borret et al.
patent: 5369577 (1994-11-01), Kadashevich et al.
patent: 5418951 (1995-05-01), Damashek
patent: 5465353 (1995-11-01), Hull et al.
patent: 5537586 (1996-07-01), Amram et al.
patent: 5542089 (1996-07-01), Lindsay et al.
patent: 5642522 (1997-06-01), Zaenen et al.
patent: 5689585 (1997-11-01), Bloomberg et al.
Inspec Abstract No. C79024302 Cluster Analysis of English Text, Toussaint & Shinghal, May 31, 1978.
M. Damashek, "Gauging similarity with n-grams: Language-independent categorization of text," Science 267,10 (Feb. 10, 1995), pp. 843-848.
R. Shinghal and G.T. Toussaint, "Cluster analysis of English text," Proceedings of the IEEE Computer Society Conference on Pattern Recognition and Image Processing, Chicago, Illinois, May 31-Jun. 2, 1978, pp. 164-172.
Chen and Haralick, "Extraction of Text Layout Structures on Document Images Based on Statistical Characterization," Proceedings of the SPIE Document Recognition II Conference (Feb. 1995).
J.J. Hull and Y. Li, "Word Recognition Result Interpretation Using the Vector Space Model for Information Retrieval," Proceedings of the IEEE Computer Society Conference on Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, Apr. 26-28, 1993, pp. 147-155.
J.M. Trenkle and R.C. Vogt, "Word Recognition for Information Retrieval in the Image Domain," Proceedings of the IEEE Computer Society Conference on Second Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, Nevada, Apr. 26-28, 1993, pp. 105-122.
