System for collecting specific information from several...

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

Details

C707S793000

active

06694307

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to the field of data extraction, more specifically to a system for collecting specific information from several sources of unstructured data. In a practical application, the invention may be used to extract specific information, such as business-related information, from the multiple pages of the World Wide Web (WWW).
BACKGROUND OF THE INVENTION
With over one and a half billion pages, the WWW is one of the largest sources of information on the planet. Whether searching for corporate, educational, historical, social, current affairs, geographical or general-knowledge information, among many other types, the WWW offers the richest, most up-to-date bank of information in existence.
Unfortunately, the content of the WWW is extremely vast and unstructured, and navigating it may be difficult and even unsuccessful. In order to find and extract a few specific and relevant pieces of information, a Web user may have to personally search through many Web pages and immense quantities of disorganised information. This exhaustive searching of the WWW consumes an excessive amount of time and is oftentimes very frustrating for the Web user.
Present day technology provides to the Web user the capability to search the WWW for specific information, using a search engine to identify its probable location. However, once potential Web pages are found, the pages have to be thoroughly visited by the Web user in order to find and extract the relevant information, with no guarantee that the required information is even present in the potential Web pages. Further, where a structured compilation of the specific information is required, the Web user must personally create this compilation by identifying, extracting and formatting the relevant information from the WWW.
One system that is currently used for collecting specific information from the WWW involves the use of dedicated databases containing specific information, where the information contained in each dedicated database is associated with pages of the WWW, in a simplified example through cross-referencing. These dedicated databases are created and maintained by a human operator, for use by the system, and require constant maintenance and updating. Once a search of the WWW has identified possible relevant Web pages, the system accesses the appropriate database, determines the information contained therein that corresponds to the relevant Web pages and generates therefrom a structured compilation of the requested information. In a particular example, assume that the specific information being searched for is contact information for a particular company, a search of the WWW having identified several potentially relevant Web pages. In this case, the system accesses a dedicated database containing commercial information, including contact information, on various corporate entities and extracts therefrom the required contact information, on the basis of the Web pages revealed by the search.
Unfortunately, this system has many disadvantages. In particular, the specific information provided to the Web user in the structured compilation is only as up-to-date as the last time the dedicated database from which the specific information was taken was updated, and may lack information newly available on the WWW. Another, and greater, disadvantage is the need for human resources to create and continuously update the dedicated databases, as well as the potential for incorrect information stored in the dedicated databases due to human error. Finally, while certain specific information may be unpublished (unavailable) on the WWW but available elsewhere, such as in a private Intranet or in a set of data files on a workstation, the system is specifically designed to work only with the pages of the WWW.
The background information provided above clearly indicates that there exists a need in the industry to provide a novel system for extracting and structurally compiling specific information from unstructured digitized data, such as the Web pages of the WWW.
SUMMARY OF THE INVENTION
Under a broad aspect, the invention provides a system for collecting specific information from several sources of unstructured digitized data. The system has an input for receiving at least one instruction governing the collection of the specific information. In a specific, non-limiting example of implementation, the system receives an instruction conveying the location(s) where the collection is to take place. The system includes a processing unit that connects to a plurality of sources of unstructured digitized data from which the specific information is to be collected, at least in part on the basis of the instruction(s) received at the input. The processing unit is operative to analyse the contents of each source of unstructured digitized data to identify in each source the information elements relevant to the specific information. The processing unit extracts the identified information elements from each source of unstructured digitized data where information elements relevant to the specific information have been identified, and processes the extracted information elements for generating an output signal containing the specific information. The system further includes an output for releasing the output signal.
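The flow described above (instruction in, per-source analysis, extraction, compiled output) can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `Instruction`, `identify_elements`, `collect` and `SOURCES` are hypothetical, the sources are in-memory strings rather than live Web pages, and simple regular expressions for e-mail addresses and North American phone numbers stand in for whatever analysis the processing unit actually performs.

```python
import re
from dataclasses import dataclass

# Assumed, simplified patterns; real contact data is far more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}")


@dataclass
class Instruction:
    """Hypothetical instruction: the locations where collection takes place."""
    locations: list


def identify_elements(text):
    """Analyse one source and return the information elements found in it."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


def collect(instruction, sources):
    """Visit each location in the instruction and merge the extracted elements."""
    output = {"emails": [], "phones": []}
    for location in instruction.locations:
        text = sources.get(location, "")  # unknown locations yield no data
        for kind, values in identify_elements(text).items():
            for value in values:
                if value not in output[kind]:  # drop duplicates across sources
                    output[kind].append(value)
    return output


# Stand-in sources with no shared layout (illustrative data only).
SOURCES = {
    "page_a": "Contact us at sales@example.com or call (514) 555-0100.",
    "page_b": "<p>Fax: 514-555-0199</p><p>Email: sales@example.com</p>",
    "page_c": "No contact details here.",
}

result = collect(Instruction(["page_a", "page_b", "page_c"]), SOURCES)
print(result)
```

In this sketch, sources that contain no relevant elements contribute nothing to the output, and an element found in several sources is reported once.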
The advantages of this system are twofold. First of all, the sources of unstructured digitized data do not have to be personally searched in their entirety by a human operator in order to collect the specific information. Rather, the system analyzes the contents of each source of unstructured digitized data and automatically extracts therefrom the requested specific information. Secondly, the specific information collected by the system is the most up-to-date information available from the particular source(s) of unstructured digitized data from which the specific information originated, since the specific information is taken directly from the particular source(s) of unstructured digitized data.
In this specification, the term “source” in the expression “source of unstructured digitized data” refers to a broad category of facilities containing, storing or providing digitized data, including databases, servers, memory modules, text files, digitized documents, among other possibilities. The sources of unstructured digitized data may be of different, even incompatible, data formats.
In this specification, the term “unstructured” in the expression “source of unstructured digitized data” is defined with respect to the information being searched for in the source of digitized data, from the point of view of the searcher. More specifically, the searcher is unaware of any particular layout or structure organizing the information contained in the digitized data. Further, several sources of unstructured digitized data are considered to be “unstructured” since they share no common structure or layout for the information contained therein.
In a specific non-limiting example of implementation, the unstructured digitized data is the data contained in the many pages of the WWW and the specific information is business-related information, in particular sales lead information for prospective clients. Such sales lead information, also referred to herein as contact information, may include the business name, the postal address, the e-mail address, the telephone and fax numbers, the name and title of a contact person, the number of employees, etc. The system is software implemented and resides on a computing device, such as a server or a workstation. For the purposes of this specific example, the system resides on a workstation at which a system user can access and use the system. In particular, the processing unit includes an identification unit having an input for receiving at least one instruction that governs the collection of the contact information. In this specific example, the identification unit receives from the system user an instruction conveying the location of a remote WWW site, in the form of a machine-readable URL (Universal R
