Method and apparatus for parallel sorting using parallel...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C709S241000, C709S241000

Reexamination Certificate

active

06427148

ABSTRACT:

APPENDIX
This application includes Appendix A, which forms a part of this specification and which is herein incorporated by reference.
FIELD OF THE INVENTION
This invention relates generally to data processing and, specifically, to a method and apparatus that partitions data in conjunction with, for example, a parallel sorting method.
BACKGROUND OF THE INVENTION
As data processing has advanced in recent years, the amount of data stored and processed by computer and other data processing systems has grown enormously. Current applications, such as data mining systems and systems that perform data operations on very large databases, often need to process huge amounts of data (called a “data set”). Such large data sets can often be larger than the memory of the computer or computers that process them. For example, current data sets are often in the range of several terabytes (2
40
) or more, and it is anticipated that data sets will be even larger in the future. Current data processing systems require parallel external sorting techniques.
Various conventional methods have been devised to sort very large amounts of data, including data that is larger than the memory of the system doing the sorting. The standard text of Knuth, “The Art of Computer Programming, Vol. 3, Sorting and Searching,” Addison Wesley Longman Publishing, second edition, 1998, pp 252-380 discloses several conventional external sorting methods. In order to perform a parallel sort, it is necessary to determine a set of sort key values that will be used to divide the sorted data between the multiple processes or cpus involved in the sort. This problem is called “partitioning” or “selection.” Several conventional parallel sorts use a sampling method to determine the keys for the multiple processes.
As data sets grow ever larger, however, conventional sorting methods are often not fast enough and are not always efficient for all distributions of data. In addition, certain conventional methods do not work when the data to be sorted contains variable length records. What is needed is a new method of parallel sorting that is faster and more efficient that conventional parallel sorting methods and that operates correctly on a wide range of data distributions, as well as variable length records.
SUMMARY OF THE INVENTION
An embodiment of the present invention provides a method and apparatus for sorting very large data sets using a parallel merge sort. A described embodiment of the invention operates in a clustered computer system, although it is contemplated that the invention can be implemented for any appropriate distributed (or shared memory) computer system, such as a computer network or the internet. The method of the present invention can also be used to locate database quantiles or to partition other types of keys in near-minimum time (as discussed in further detail below). The method of the present invention can also be used to perform a distribution sort, as described in Appendix A, which is a part of this specification and is herein incorporated by reference.
Given sorted work files S
1
, . . . , S
P
, produced by P processes, the described embodiment of the method effectively implements a parallel merge onto respective output partitions O
1
, . . . , O
p
of the P processes. Because each of these output partitions O
j
has a finite size (1<=j<=p), the invention must quickly determine “splitting keys” for each output partition O
j
in such a way that the data in the work files will be split between the multiple output partitions O
j
without overrunning the size of any of the partitions O
j
. Once the splitting keys for each partition are determined, the processes exchange data so that the output partitions of each process contains data between the splitting keys associated with that output partition.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method of parallel sorting a large amount of data, performed by a plurality of processes of the data processing system and comprising: providing, for each process, a work file, each work file containing a respective portion of the data to be sorted, where the data within each work file is in sorted order; determining an initial upper and lower bound associated with each process; sending, by each of the processes in parallel, a plurality of messages to each of the other processes indicating current upper bounds of the sending process to determine an upper bound for the sending process; and performing, by the processes, a merge in which each of the processes creates an output partition containing data within its upper and lower bounds.
Advantages of the invention will be set forth, in part, in the description that follows and, in part, will be understood by those skilled in the art from the description herein. The advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims and equivalents.


REFERENCES:
patent: 5210870 (1993-05-01), Baum et al.
patent: 5287494 (1994-02-01), Garcia et al.
patent: 5355478 (1994-10-01), Brady et al.
patent: 5487166 (1996-01-01), Cossock
patent: 5852826 (1998-12-01), Graunke et al.
patent: 6003036 (1999-12-01), Martin
Guy E. Blelloch, C. Greg Plaxton, Charles E. Leiserson, Stephen J. Smith, Bruce M. Maggs and Marco Zagha, “A Comparison of Sorting Algorithms for the Connection Machine CM-2,” © 1991 ACM, pp. 3-16.
Web article by Sandia National Laboratories entitled: “‘Fastest sorting machine known . . . ’ Sandia and Compaq together set world record in large database sorting”, published by Sandia National Laboratories-News Releases, Nov. 10, 1998, http://www.sandia.gov/media/sort.htm, pp. 1-4.
Web article by Compaq entitled: “Compaq and Sandia Labs Shatter Performance Record with 72-Node Industry-Standard Cluster”, published by Compaq-Newsroom, Nov. 10, 1998, http://www.compaq.com
ewsroom/pr/pr101198b.html, pp. 1-3.
Tom Leighton, “Tight Bounds on the Complexity of Parallel Sorting,” Apr. 1985, IEEE Transactions on Computers, vol. c-24, No. 4, pp. 344-354.
Bülent Abali, Füsun Özgüner and Abdulla Bataineh, “Balance Parallel Sort on Hypercube Multiprocessors,” May 1993, IEEE Transactions on Parallel and Distributed Systems, vol. 4, No. 5, pp. 572-581.
Richard John Cole, “An Optimally Efficient Selection Algorithm.” Jan. 25, 1988, Information Processing Letters 26 (1987/88), pp. 295-299.
Alok Aggarwal and C. Greg Plaxton, “Optimal Parallel Sorting in Multi-Level Storage,” Department of Computer Sciences, The University of Texas at Austin, Mathematical Sciences Library, Stanford University, Stanford, CA 94305, pp. 1-23.
C. Greg Plaxton, “On The Network Complexity of Selection,” Aug. 1989, Report No. STAN-CS-89-1276, Mathematical Sciences Library, Stanford University, Stanford, CA 94305, Department of Computer Science, pp. 1-17.
Uzi Vishkin, “An Optimal Parallel Algorithm for Selection,” Dec. 1983, Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, Ultra Computer Note 64, Technical Report 106, 10-page article.
P. Berthomé, A. Ferreira, B. M. Maggs, S. Perennes and C.G. Plaxton, “Sorting-Based Selection Algorithms for Hypercubic Networks,” 7thInternationa Parallel Processing Symposium, Apr. 13-16, 1993, Newport Beach, California, pp. 89-95.
Micah Akler, John w. Byers and Richard M. Karp, “Parallel Sorting With Limited Bandwidth,” International Computer Science Institute and Computer Science Division, UC Berkeley, SPAA '95, 7thAnnual ACM Symposium on Parallel Algorithms and Architectures, Jul. 17-19, 1995, pp. 129-136.
Michael Ian Shamos, “Geometry and Statistics: Problems at the Interface,” Departments of Computer Science and Mathematics, Carnegie-Mellon University, Algorithms and Complexity New Directions and Recent Results, 1976, pp. 251-280.
Edited By John H. Reif, “Synthesis of Parallel Algorithms, Randon Sampling Lemmas for Sorting and Selection,” pp. 415-417.
Balakrishna R. Iyer, Gary R. Ricard and Peter J. Varman, “P

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Method and apparatus for parallel sorting using parallel... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Method and apparatus for parallel sorting using parallel..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for parallel sorting using parallel... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2870217

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.