Data processing: database and file management or data structures – Database design – Data structure types
Patent
1997-06-27
2000-04-25
Ho, Ruay Lian
Data processing: database and file management or data structures
Database design
Data structure types
707100, G06F 1730
Patent
active
060555394
ABSTRACT:
A method and system for generating a decision-tree classifier from a training set of records, independent of the system memory size. The method includes the steps of: generating an attribute list for each attribute of the records, sorting the attribute lists for numeric attributes, and generating a decision tree by repeatedly partitioning the records using the attribute lists. For each node, split points are evaluated to determine the best split test for partitioning the records at the node. Preferably, a gini index and class histograms are used in determining the best splits. The gini index indicates how well a split point separates the records while the class histograms reflect the class distribution of the records at the node. Also, a hash table is built as the attribute list of the split attribute is divided among the child nodes, which is then used for splitting the remaining attribute lists of the node. The method reduces I/O read time by combining the read for partitioning the records at a node with the read required for determining the best split test for the child nodes. Further, it requires writes of the records only at one out of n levels of the decision tree where n.gtoreq.2. Finally, a novel data layout on disk minimizes disk seek time. The I/O optimizations work in a general environment for hierarchical data partitioning. They also work in a multi-processor environment. After the generation of the decision tree, any prior art pruning methods may be used for pruning the tree.
REFERENCES:
patent: 5787274 (1998-07-01), Agrawal et al.
patent: 5799300 (1998-08-01), Agrawal et al.
patent: 5799311 (1998-08-01), Agrawal et al.
John Shafer et al., "SPRINT: A Scalable Parallel Classifier for Data Mining", Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, (1996). pp. 544-555.
R. Agrawal et al., "An Interval Classifier for Database Mining Applications", Proceedings of the 18th VLDB Conference, Vancouver, British Columbia, (1992), pp. 560-573 and cover page.
R. Agrawal et al., "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, vol. 5, No. 6, Dec. 1993, pp. 914-925.
L. Breiman et al., "Classification and Regression Trees", Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, California, 1984 pp. 18-59.
M. Mehta et al., "MDL-based Decision Tree Pruning", Proceedings First International Conference on Knowledge Discovery & Data Mining, D (KDD-95), Aug. 1995, pp. 216-221.
J. R. Quinlan et al., "Inferring Decision Trees Using the Minimum Description Length Principle", Information and Computation, vol. 80, (1989), pp. 227-248.
M. Mehta et al., "SLIQ: Fast scalable Classifier for Data Mining", EDBT 96, Avignon, France, Mar. 1996, pp. 18-33.
D. J. DeWitt et al., "Parallel Sorting on Shared-Nothing Architecture using Probabilistic Splitting", IEEE International First Conference on Parallel and Distributed Informaton Systems, Dec. 4-6, 1991, Miami Beach, Florida, pp. 280-291 & cover sheet.
Mike James, "Classification Algorithms", (book), Wiley-Interscience Publicaton, (1985), Chapters 1-3.
J. Catlett, "Megainduction: Machine Learning on Very Large Databases", PHD Thesis, University of Sidney, issue Jun./Dec. 1991.
Singh Vineet
Srivastava Anurag
Ho Ruay Lian
International Business Machines - Corporation
Jordan Kevin M.
LandOfFree
Method to reduce I/O for hierarchical data partitioning methods does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method to reduce I/O for hierarchical data partitioning methods, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method to reduce I/O for hierarchical data partitioning methods will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-1002118