System and method for determining a character encoding scheme

Data processing: database and file management or data structures – Database design – Data structure types

Reexamination Certificate

active

06701320

ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to character encoding systems and methods. More particularly, the present invention relates to systems and methods for determining the appropriate or best fit character encoding scheme for a set of data.
BACKGROUND OF THE INVENTION
The use of computer networks, particularly the Internet, to store data and provide information to users is becoming increasingly common. The Internet is a loosely organized network of computers spanning the globe. Client computers, such as home computers, can connect to other clients and servers on the Internet through a regional Internet Service Provider (“ISP”) that further connects to larger regional ISPs or directly to one of the Internet's “backbones.” Regional and national backbones are interconnected through long range data transport connections such as satellite relays and undersea cables. Through these layers of interconnectivity, each computer connected to the Internet can connect to every other computer on the Internet (or at least to a large percentage of them).
The Internet is generally arranged on a client-server architecture. In this network model, client computers request information stored on servers, and servers find and return the requested information to the client computer. The server computers can store a variety of data types and provide a number of services. For example, servers can provide telnet, ftp (file transfer protocol), gopher, smtp (simple mail transfer protocol) and world wide web services, to name a few. In some cases, any number of these services can be provided by the same physical server over different ports (e.g., world wide web content over port 80, email over port 25, etc.). If a server makes a particular port available, client computers can connect to that port from virtually anywhere on the Internet, leading to global connectivity between computers.
For typical Internet users, the world wide web and email (smtp) have become the predominant services utilized. The world wide web was developed to facilitate the sharing of technical documents, but over the past decade the number of information providers has increased dramatically and now technical, commercial and recreational content is available to a user from around the world. The information provided through world wide web services is typically presented in the form of hypertext documents, known as web pages, that allow the user to “click” on certain words and graphics to retrieve additional web pages.
When a user requests a web page, a program known as a web browser makes a request to the appropriate web server (usually after retrieving the IP address for the web server from a name server). The web server locates the web page and transmits the data corresponding to the web page to the client computer as a series of ones and zeros (e.g., 00000010000001010001000000001100 . . . ). The web browser must transform the bytes received into recognizable characters for display to the user.
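As a small illustration of the first part of that transformation, the sketch below groups the example bit string from the paragraph above into 8-bit bytes. The helper name `bits_to_bytes` is hypothetical and is not taken from the patent text.

```python
def bits_to_bytes(bits: str) -> bytes:
    """Group a string of '0'/'1' characters into 8-bit bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


# The example bit string quoted in the text (without the trailing ellipsis).
received = bits_to_bytes("00000010000001010001000000001100")
print(list(received))  # [2, 5, 16, 12]
```

The resulting integers carry no meaning on their own; a character encoding scheme is still needed to map them to characters.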
Character encoding schemes provide a mechanism for mapping the retrieved bytes to recognizable characters. In a character encoding scheme, a “coded character set” is a mapping from a set of characters to a set of non-negative integers, with a character being defined within the coded character set if the coded character set contains a mapping from the character to an integer. The integer is known as a “code point” and the character as an “encoded character.” A large number of character encoding schemes are defined, many of them by individual vendors, but no standardized character encoding scheme has been adopted universally. The lack of standardization is problematic because an integer that maps to the character “a” in one character encoding scheme may map to “I,” a Chinese character, or no character at all in another character encoding scheme. If a web browser receiving web page data uses an incorrect character encoding scheme to display the web page's contents, the contents may appear unintelligible or meaningless.
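The ambiguity described above can be seen with a single byte. The following sketch, which is an illustration and not part of the patent, decodes the integer 228 (0xE4) under four encodings available in Python's standard codec registry; the helper `decode_or_none` is a hypothetical name.

```python
from typing import Optional


def decode_or_none(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded text, or None if the bytes are undefined in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = bytes([0xE4])  # one byte holding the integer 228

for encoding in ("latin-1", "cp1251", "iso-8859-7", "ascii"):
    # The same integer yields a Latin, a Cyrillic and a Greek letter,
    # and no character at all under 7-bit ASCII.
    print(encoding, "->", decode_or_none(raw, encoding))
```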
In order to properly display a web page, a web browser must determine the appropriate character encoding scheme for that web page. This is typically done by reading a “charset” parameter in the content-type HTTP header of the web page or in a META declaration contained in the web page. Both of these mechanisms, however, require that the character encoding scheme be declared along with the web page itself. For web pages that do not provide this character encoding information, the web browser must attempt to determine the appropriate character encoding scheme through other mechanisms.
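A minimal sketch of looking up a declared charset is shown below. It is an illustration only: the function name `declared_charset`, the 1024-byte scan window and the simplified META pattern are assumptions, and real browsers follow considerably more detailed rules. The header parsing relies on `email.message.Message.get_content_charset` from the Python standard library.

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern (assumption): finds charset=... inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the declared charset in lower case, or None if none is declared."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lower-cased charset parameter or None
        if charset:
            return charset
    # Fall back to a META declaration near the start of the document.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


print(declared_charset("text/html; charset=ISO-8859-1", b""))
print(declared_charset("text/html", b'<meta charset="utf-8">'))
print(declared_charset(None, b"<html></html>"))
```

When the function returns `None`, the page falls into the case the text describes: the encoding has to be determined by some other mechanism.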
Existing web browsers such as Microsoft® Internet Explorer and Netscape® Navigator attempt to determine the appropriate character encoding scheme (when the character encoding scheme is not otherwise defined) by defining subsets of character ranges that are unique or special to a given character encoding scheme. For example, the web browser may define 1-3 as corresponding to a first character encoding scheme and 6-9 as corresponding to a second character encoding scheme. If the integers received by the web browser are 4, 5, and 8, more of these integers fit in the defined range 6-9 for the second character encoding scheme. Therefore, the web browser could choose that scheme. The web browser can then display characters based on the second character encoding scheme. This process can be inefficient because the web browser must test a large number of ranges, and it can be inaccurate because the ranges for various character encoding schemes can overlap. Moreover, many character encoding schemes do not use consecutive integers to encode characters, and a character encoding scheme may not use a well-defined range of integers at all, leading to the display of incorrect characters by the web browser.
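The range-counting heuristic described above can be sketched as follows, using the ranges 1-3 and 6-9 and the integers 4, 5 and 8 from the example. The names `RANGES`, `count_hits` and `guess_scheme` are hypothetical, and the tie-breaking rule (first scheme listed wins) is an assumption made only for illustration.

```python
from typing import Dict, Sequence, Tuple

# Ranges from the example in the text: inclusive (low, high) per scheme.
RANGES: Dict[str, Tuple[int, int]] = {
    "first scheme": (1, 3),
    "second scheme": (6, 9),
}


def count_hits(values: Sequence[int],
               ranges: Dict[str, Tuple[int, int]] = RANGES) -> Dict[str, int]:
    """Count how many received integers fall inside each scheme's range."""
    return {
        name: sum(low <= v <= high for v in values)
        for name, (low, high) in ranges.items()
    }


def guess_scheme(values: Sequence[int],
                 ranges: Dict[str, Tuple[int, int]] = RANGES) -> str:
    """Pick the scheme with the most hits (ties go to the first scheme listed)."""
    hits = count_hits(values, ranges)
    return max(hits, key=hits.get)


print(count_hits([4, 5, 8]))    # {'first scheme': 0, 'second scheme': 1}
print(guess_scheme([4, 5, 8]))  # second scheme

# Overlapping ranges make the vote ambiguous: both schemes score 2 here.
overlapping = {"scheme A": (1, 6), "scheme B": (4, 9)}
print(count_hits([4, 5], overlapping))  # {'scheme A': 2, 'scheme B': 2}
```

Note that the example input is decided by a single integer (8), and that the overlapping case cannot be resolved by counting alone, which matches the inaccuracy the text points out.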
SUMMARY OF THE INVENTION
The present invention provides a character encoding detection system and method that eliminates or substantially reduces disadvantages and problems associated with previously developed character encoding detection systems and methods. More particularly, one aspect of the present invention can be characterized as a method for determining an appropriate (or best-fit) character encoding scheme including the steps of (i) generating a set of reference characters based on a reference character encoding scheme and a first set of bytes; (ii) generating a set of test characters based on a test character encoding scheme and said first set of bytes; (iii) generating a set of test bytes based on said test character encoding scheme and said set of test characters; (iv) generating a set of comparison characters based on said reference character encoding scheme and said set of test bytes; and (v) comparing said set of reference characters to said set of comparison characters. In one embodiment of the present invention, the aforementioned steps are implemented as a Java-based software program with Unicode (e.g., UCS-2) as the reference character encoding scheme.
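Steps (i) through (v) can be sketched as a round trip through the test encoding. This is a hedged illustration, not the patented implementation: it is written in Python rather than Java, it assumes Latin-1 as a stand-in reference scheme because Latin-1 maps every byte to exactly one Unicode code point, and it assumes that undecodable or unencodable data is substituted (`errors="replace"`) rather than rejected. The function name `round_trip_matches` is hypothetical.

```python
def round_trip_matches(data: bytes, test_encoding: str,
                       reference_encoding: str = "latin-1") -> bool:
    """Return True if `data` survives a decode/encode round trip in `test_encoding`."""
    # (i) reference characters from the reference scheme and the original bytes
    reference_chars = data.decode(reference_encoding, errors="replace")
    # (ii) test characters from the test scheme and the original bytes
    test_chars = data.decode(test_encoding, errors="replace")
    # (iii) test bytes from the test scheme and the test characters
    test_bytes = test_chars.encode(test_encoding, errors="replace")
    # (iv) comparison characters from the reference scheme and the test bytes
    comparison_chars = test_bytes.decode(reference_encoding, errors="replace")
    # (v) compare reference characters with comparison characters
    return reference_chars == comparison_chars


sample = "\u65e5\u672c\u8a9e".encode("shift_jis")  # Japanese text as Shift_JIS bytes

print(round_trip_matches(sample, "shift_jis"))  # True
print(round_trip_matches(sample, "utf-8"))      # False
print(round_trip_matches(sample, "ascii"))      # False
```

Under these assumptions, a test encoding that cannot represent the bytes introduces substitution characters in step (ii), so the bytes regenerated in step (iii) differ from the originals and the comparison in step (v) fails. A single-byte encoding that defines all 256 byte values always passes this check, so a match alone does not prove that the test encoding is the intended one.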
In another embodiment of the present invention, rather than comparing a set of reference characters to a set of comparison characters, the present invention can compare the original set of bytes with the set of test bytes. This embodiment of the present invention can omit generating the reference characters and the comparison characters. Yet another embodiment of the present invention can generate a set of reference integers corresponding to the original set of bytes (and the reference characters) and a set of test integers corresponding to the set of test bytes (and the test characters) and then compare the set of reference integers with the set of test integers. Again, this embodiment of the present invention can optionally omit generating the set of reference characters and the set of comparison characters.
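The two alternative comparisons described above can be sketched without building reference or comparison characters. As before, this is an illustration under stated assumptions: `errors="replace"` stands in for whatever substitution behavior an actual implementation uses, the "integers" are taken to be the byte values themselves, and both function names are hypothetical.

```python
from typing import List


def regenerate_bytes(data: bytes, test_encoding: str) -> bytes:
    """Decode with the test scheme, then re-encode with the same scheme."""
    test_chars = data.decode(test_encoding, errors="replace")
    return test_chars.encode(test_encoding, errors="replace")


def bytes_match(data: bytes, test_encoding: str) -> bool:
    """Byte variant: compare the original bytes with the regenerated test bytes."""
    return regenerate_bytes(data, test_encoding) == data


def integers_match(data: bytes, test_encoding: str) -> bool:
    """Integer variant: compare integer values of original and test bytes."""
    reference_integers: List[int] = list(data)
    test_integers: List[int] = list(regenerate_bytes(data, test_encoding))
    return reference_integers == test_integers


sample = "Gr\u00fc\u00dfe".encode("utf-8")  # German text as UTF-8 bytes

print(bytes_match(sample, "utf-8"), integers_match(sample, "utf-8"))  # True True
print(bytes_match(sample, "ascii"), integers_match(sample, "ascii"))  # False False
```

In this sketch the two variants always agree, because the integers are derived directly from the bytes; they differ only in the representation being compared.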
Regardless of whether the test bytes are compared to the original set of bytes, the test integers are compared to the reference integers or the comparison characters are compared to the reference characte
