Reexamination Certificate
2000-11-20
2004-06-01
Iqbal, Nadeem (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C717S152000, C707S793000
active
06745344
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to debugging software programs and, more specifically, to techniques for debugging database systems.
BACKGROUND OF THE INVENTION
In a database system, an area of system memory is allocated and one or more processes are started to execute one or more transactions. The database server communicates with connected user processes and performs tasks on behalf of the user. These tasks typically include the execution of transactions. The combination of the allocated system memory and the processes executing transactions is commonly termed a database “server” or “instance”.
Like most software systems, a database server has complicated shared memory structures. A shared memory structure contains data and control information for a portion of a database system. Because of software, hardware, or firmware bugs that may exist in a complex database system, shared memory structures may become logically incorrect. When structures become logically incorrect, the database is likely to fail. Database failure is typically discovered in the following ways: by checking consistency of structures; by verifying certain assumptions; or by running into corrupted pointers. Attempting to process corrupted pointers will lead to a “crash,” where normal database operation is no longer possible.
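The kind of structural consistency check mentioned above can be illustrated with a minimal sketch. It assumes a doubly linked list kept in a shared structure as two index arrays, with -1 standing for a null pointer. The function name `check_links` and the sample arrays are hypothetical and are not taken from the patent.

```python
def check_links(next_idx, prev_idx):
    """Report inconsistencies in a doubly linked list stored as index arrays.

    Hypothetical layout: next_idx[i] and prev_idx[i] hold the index of the
    neighbouring node, or -1 for "no neighbour".
    """
    if len(next_idx) != len(prev_idx):
        raise ValueError("index arrays must have the same length")
    problems = []
    n = len(next_idx)
    for i, nxt in enumerate(next_idx):
        if nxt == -1:
            continue  # end of list, nothing to verify
        if not 0 <= nxt < n:
            # A pointer outside the structure is the "corrupted pointer" case.
            problems.append(f"node {i}: next pointer {nxt} out of range")
        elif prev_idx[nxt] != i:
            # The forward and backward links disagree.
            problems.append(f"node {i}: node {nxt} does not point back")
    return problems


healthy = check_links([1, 2, -1], [-1, 0, 1])
corrupted = check_links([1, 7, -1], [-1, 0, 1])
mismatched = check_links([1, 2, -1], [-1, 0, 0])
print(healthy, corrupted, mismatched, sep="\n")
```

A check of this sort only reads the structure, so it can flag a logically incorrect state without following the bad pointer and crashing.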
A major responsibility of the database administrator is to be prepared for the possibility of hardware, software, network, process, or system failure. When shared structures are presumed to be corrupted, the best course of action for a database administrator is to cease further processing of the database. If a failure occurs such that the operation of a database system is affected, the administrator must usually recover the database and return the database to normal operations as quickly as possible. Recovery should protect the database and associated users from unnecessary problems and avoid or reduce the possibility of having to duplicate work manually.
Recovery processes vary depending on the type of failure that occurred, the structures affected, and the type of recovery that is performed. If no files are lost or damaged, recovery may amount to no more than rebooting the database system. On the other hand, if data has been lost, recovery requires additional steps in order to put the database back into normal working order.
Once the database is recovered or rebooted, the immediate problem is quickly resolved, but because the root cause is still undetermined and therefore unresolved, the error condition may resurface, potentially causing several additional outages. Therefore, it is still important to diagnose the state of the structures and data surrounding the database failure. Such a diagnosis may provide valuable information that can reduce the chance of failure in the future. As a practical matter, diagnosing the failure may lead to determining which vendor's hardware or software is responsible for the database failure. Such information is valuable for a vendor's peace of mind, if nothing else. Thus, the goal of determining why the database system failed in the first place competes with the goal of recovering the database as quickly as possible.
Unfortunately, even with traditional techniques of diagnosing a database failure, the system administrator is usually unable to obtain enough clues to determine why the failure happened. A deliberate and thorough diagnosis of the failure may require an unacceptable amount of database downtime. For example, any amount of downtime over 30 minutes may be extremely costly for a database that is associated with a highly active web site. Too much downtime may have unduly expensive business ramifications, such as lost revenue and damage to the reputation of the web site owner.
Another problem with traditional debugging techniques is that they can be intrusive. For example, a database system that supports the Structured Query Language (SQL) may be debugged by compiling SQL statements and running them against the database. The act of compiling and executing the SQL statements changes the state of the database system. Thus, the mere act of diagnosing the problem can easily make the problem worse because diagnosis may involve altering the state of the database. Diagnosing the problem typically involves using debugging software, which calls for exploration into data structures within the complex memory structures of the database system. Although the data structures are best left untouched upon a failure, diagnosing the failure may involve working directly on the same data structures from which data is to be obtained. Nevertheless, it is important to preserve the original data and not change the data from its state at the time of failure. A customer of the database may take issue with such changes, because they may jeopardize or even destabilize the customer's database system.
Effective diagnosis, however, requires getting as much information as possible out of the data structures. It may be useful here to refer to Heisenberg's uncertainty principle, which effectively states that the closer an object is analyzed, the more the object materially changes because the mere act of analyzing is intrusive. Applying this principle to the act of diagnosing a database failure, a typical debugging process is naturally intrusive. Thus, it is difficult to be non-intrusive on a database and at the same time obtain a sufficient amount of meaningful data for debugging.
Traditional debugging techniques involve formatting certain parts of the database system and displaying this formatted portion in a human-readable form. This human-readable form can be set aside for later analysis, for example, after the database has been recovered or is no longer down. The entire memory of the database server is not dumped because an average database server is very large, typically between about 200 megabytes and about 100 gigabytes of unformatted binary and data.
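As an illustration of formatting a selected portion of memory into human-readable form, the following is a minimal sketch of a hex dump over a byte range. It assumes the memory of interest is available as a Python `bytes` object. The function name `format_region`, its parameters, and the sample data are hypothetical and do not come from the patent.

```python
def format_region(memory: bytes, start: int, length: int, width: int = 16) -> str:
    """Format memory[start:start+length] as offset, hex bytes, and printable text."""
    lines = []
    end = min(start + length, len(memory))
    for offset in range(start, end, width):
        chunk = memory[offset:min(offset + width, end)]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable bytes are shown as "." so the text column stays aligned.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)


# Hypothetical 16-byte region: a short header, four small values, then row data.
sample = b"HDR\x00" + bytes(range(4)) + b"row-data"
dump = format_region(sample, 0, len(sample), width=8)
print(dump)
```

Formatting only a chosen range, instead of the whole server memory, keeps the output small enough to set aside and read later.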
Unfortunately, such a debugging technique provides diagnosis only to the database server's end-memory state, which is the state after the database has been shut down. Because the end-memory state is being analyzed separately from the database, the programmer performing the debugging does not have access to the real database and some of the database's persistent structures. Some of these persistent structures could be on disk or, in a multiple node system, on other nodes. For example, in a parallel server configuration, the persistent structures needed for debugging could reside on other servers. Thus, the technique of separately debugging portions of the database prevents the programmer from analyzing data that resides in the persistent structures of the database.
For the foregoing reasons, what is needed is a method of debugging a software program, such as a database system, that is non-intrusive, yet allows for a comprehensive assessment of a failure.
SUMMARY OF THE INVENTION
A method and apparatus for debugging a software program is provided. In one embodiment, the method comprises preserving a first memory state of a software program, such as a database system, at the time just before failure occurs. A second memory state of the software program is also preserved after failure occurs. The failure analysis involves comparing the first memory state with the second memory state.
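The comparison of the two preserved memory states can be sketched as follows. This is a minimal illustration, not the patented mechanism: it assumes each state has been captured as a byte string of equal length, and the function name `diff_memory_states` and the sample snapshots are hypothetical.

```python
def diff_memory_states(before: bytes, after: bytes):
    """Return (offset, old_bytes, new_bytes) for each contiguous run that differs."""
    if len(before) != len(after):
        raise ValueError("snapshots must be the same length")
    runs = []
    start = None  # offset where the current differing run began, if any
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, before[start:i], after[start:i]))
            start = None
    if start is not None:
        # The final run extends to the end of the snapshots.
        runs.append((start, before[start:], after[start:]))
    return runs


# Hypothetical snapshots taken just before and just after a failure.
first_state = bytes([0x00, 0x11, 0x22, 0x33, 0x44, 0x55])
second_state = bytes([0x00, 0x11, 0xFF, 0xFE, 0x44, 0x00])
changed = diff_memory_states(first_state, second_state)
for offset, old, new in changed:
    print(f"offset {offset:#06x}: {old.hex()} -> {new.hex()}")
```

Because the comparison works on preserved copies, it reads neither the live database nor its in-memory structures, which is consistent with the non-intrusive goal stated in the background.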
Joshi Vikram
Tsukerman Alex
Yamaguchi Shari
Becker Edward A.
Hickman Palermo & Truong & Becker LLP
Iqbal Nadeem
Oracle International Corporation