Data processing: speech signal processing – linguistics – language – Linguistics – Multilingual or national language support
Reexamination Certificate
1998-12-07
2002-08-20
Edouard, Patrick N. (Department: 2644)
C345S467000
active
06438516
ABSTRACT:
FIELD OF THE INVENTION
The invention relates, in general, to methods and systems used for the computer processing of text, and more specifically, to the composing and decomposing of text represented according to the Unicode Standard in a computer system.
BACKGROUND OF THE INVENTION
Computer systems that process text information may use an international standard for coded text. The accepted standard for international coded text information is the Unicode® Standard, published by Unicode, Inc. According to the Unicode Standard, “text” refers to alphabetic characters as well as punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, etc. The Unicode Standard, Version 2.0 and subsequent versions and revisions thereto, provides the capacity to encode all the characters used for the major written languages of the world and is incorporated herein by reference. For example, Unicode scripts include Latin, Greek, Armenian, Hebrew, Arabic, Bengali, Thai, Japanese kana, and a unified set of Chinese, Japanese, and Korean ideographs, as well as many others. The Unicode Standard provides codes for nearly 39,000 characters from the world's alphabets, symbol collections, and ideograph sets. Some 18,000 codes are left unused for future expansion, while over 6,000 codes are reserved for private use. The private use codes are intended to be system or application specific and can be defined by those developing their own system or application.
The Unicode Standard is based on a 16-bit code set that provides codes for more than 65,000 characters, whereby each character is identified by a unique 16-bit value. In fact, there are 65,536, i.e., 2 to the sixteenth power, possible values inherent in a 16-bit word. The code values of the Unicode Standard are equivalent to the code values of the “Universal Character Set” in two-octet form (UCS-2), which is a subset of ISO/IEC 10646. The full code set of ISO/IEC 10646 is called the Universal Character Set in four-octet form (UCS-4). Unicode does not use complex modes or escape codes for constructing or representing characters and thus is a simplified and straightforward approach to representing characters.
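The size of the 16-bit code space and the two-octet form of a code value can be checked with a short sketch. Python is used purely for illustration; the names `CODE_SPACE_16_BIT` and `ucs2_octets` are hypothetical and are not taken from the Unicode Standard or from this patent.

```python
# Number of distinct values that fit in a 16-bit word.
CODE_SPACE_16_BIT = 2 ** 16


def ucs2_octets(ch: str) -> bytes:
    """Return the two-octet (big-endian) form of one 16-bit code value.

    Assumption: the character fits in the 16-bit code set described in
    the text; anything larger is rejected rather than encoded.
    """
    code = ord(ch)
    if code >= CODE_SPACE_16_BIT:
        raise ValueError("code value does not fit in two octets")
    return code.to_bytes(2, "big")


print(CODE_SPACE_16_BIT)       # 65536
print(ucs2_octets("u").hex())  # 0075
```

Each character maps to exactly one fixed-width value, with no mode switches or escape sequences, which is the simplification the paragraph above describes.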
The Unicode Standard is based on three underlying premises. The first premise is that the standard must define the smallest useful elements of text being coded. The second premise is that a unique character code must be assigned to each element. Finally, the third premise is that basic rules for encoding and interpreting text must be provided so that programs can successfully read and process the coded text. When defining elements of text for a given language, it must be determined which textual elements are the smallest ones used to create words and sentences in that language. In many languages, the smallest textual elements are single graphical elements. In other languages, such as those written in Devanagari, the smallest textual elements may consist of multiple graphical elements.
Regardless of the language, the smallest textual elements are represented in Unicode as “code elements”. Code elements serve as the building blocks for Unicode “characters”, wherein a Unicode “character” may be an element itself, e.g., “u”, a combination of text elements, e.g., “ü”, or, to a much lesser extent, a symbol, e.g., “*”. For the most part, code elements correspond to the most commonly used text elements. For example, each upper case and lower case letter in the English alphabet is represented by a single code element. As a result, coding of elements under the Unicode Standard remains straightforward, with a single value for each element. Where appropriate, the Unicode Standard also defines codes for the presentation of text. For instance, some codes control the direction in which text is written, whether left to right or right to left, including the rare cases where text must change direction within a single run of script. The Unicode Standard also defines explicit characters for line and paragraph endings, but the large majority of codes represent text or code elements.
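As a minimal sketch of the kinds of codes mentioned above, the following Python snippet lists the code value and name of two English letters, one directional code, and the explicit line and paragraph separators. The helper `describe` is hypothetical. The standard-library `unicodedata` module is backed by a much later Unicode version than 2.0, so it serves only as a convenient stand-in for the character tables the text refers to.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


samples = [
    "D",       # upper case letter: one code element
    "d",       # lower case letter: a different code element
    "\u200f",  # a code that influences text direction
    "\u2028",  # explicit line ending
    "\u2029",  # explicit paragraph ending
]

for ch in samples:
    print(describe(ch))
```

Most entries in the database are ordinary text elements like the first two; the remaining three illustrate the smaller group of codes that affect presentation.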
Typically, interpretation of text by a computer system is accomplished as the text is being processed. For example, consider the case where a user is typing on a computer system using a word processor application. When the computer operator depresses a key or key combination, for example “shift and d”, the computer system receives a signal or message that the “shift” and “d” keys were simultaneously pressed at the keyboard. This message is encoded by the computer system as a Unicode Standard code. An application, e.g., a word processor, stores the code in memory and also passes it on to the display software for rendering the character on the screen. The display software processes the code and displays the letter “D”. This process continues as typing continues.
While the Unicode Standard directly addresses encoding and interpreting of text for presentation, it does not address many other actions performed on the text related to presentation or the application itself. For example, the standard does not address issues such as spell checking, which is left to applications. Furthermore, the Unicode Standard does not address how characters are rendered on the screen, such as their font and size. The representation or rendering of a character on the screen is called a “glyph”. The Unicode Standard does not define glyphs; rather, it limits itself to the code value associated with an abstract character entity, such as the Latin character “b”. It is the software or hardware rendering engine of the computer or application program that is responsible for the appearance of the characters on the screen.
In addition, the Unicode Standard does address encoding of “composed character sequences” (CCS). CCS refers to the representation of multiple characters rendered together. For example, “â” is a composed character created by rendering an “a” and “^” together. According to the standard, a CCS consists of a base character, occupying a single space, followed by one or more non-spacing marks to be rendered in the same space as the base character, or by a spacing mark to be rendered adjacent to the base character. For often used CCSs, the Unicode Standard defines a single code value to represent the common combination of characters, rather than combining a base character with a combination of other individual characters each time the common CCS is used. These are referred to as “pre-composed” characters. For example, the character “ü” can be encoded as the single code value U+00FC or as two values where the base character U+0075 represents “u” followed by the non-spacing character U+0308 which represents “¨”, expressed as “u+¨”.
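The two encodings of “ü” given above can be compared directly. This is an illustrative sketch only: the variable and function names are hypothetical, and the call to `unicodedata.normalize` with form `"NFC"` is a modern standard-library stand-in for composition, not the method this patent goes on to describe.

```python
import unicodedata

precomposed = "\u00fc"     # single code value for "ü"
sequence = "\u0075\u0308"  # base "u" followed by the non-spacing mark


def code_values(text: str) -> list[str]:
    """List the code values in a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_values(precomposed))  # ['U+00FC']
print(code_values(sequence))     # ['U+0075', 'U+0308']

# The raw code values differ, so a naive comparison fails...
print(precomposed == sequence)   # False

# ...but composing the sequence yields the pre-composed character.
print(unicodedata.normalize("NFC", sequence) == precomposed)  # True
```

A renderer would draw both strings identically, which is why software that compares or searches text needs to account for both representations.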
Decomposition of pre-composed characters is also defined by the Unicode Standard. For example, a word processor importing a text file containing a pre-composed character may decompose the character into its base character and subsequent non-spacing characters if, for some reason, this makes processing within the word processor easier or more efficient. A pre-composed character is simply a special type of CCS, whereby the pre-composed character is represented by a single predefined Unicode value.
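A minimal sketch of such a decomposition step, assuming Python's `unicodedata` module as the source of decomposition mappings, might look like the following. The helpers `decompose_once` and `decompose_fully` are hypothetical names. The sketch applies canonical mappings only and deliberately omits any reordering of the resulting marks, so it is not a complete implementation of the Standard's procedure.

```python
import unicodedata


def decompose_once(ch: str) -> str:
    """Apply one canonical decomposition mapping to a single character.

    Characters with no mapping, or with a compatibility mapping
    (tagged like '<compat>'), are returned unchanged.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(chr(int(value, 16)) for value in mapping.split())


def decompose_fully(text: str) -> str:
    """Repeat decompose_once until no character can be decomposed further."""
    pieces = []
    for ch in text:
        step = decompose_once(ch)
        pieces.append(step if step == ch else decompose_fully(step))
    return "".join(pieces)


# "ü" becomes base "u" plus the non-spacing diaeresis.
print([f"U+{ord(c):04X}" for c in decompose_once("\u00fc")])
# "ǖ" (U+01D6) needs two steps: first "ü" + macron, then "u" + diaeresis + macron.
print([f"U+{ord(c):04X}" for c in decompose_fully("\u01d6")])
```

The recursion matters because a pre-composed character may map to another pre-composed character plus a mark, as the second example shows.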
The Unicode Standard specifies an algorithm for determining whether CCSs of Unicode are “equivalent”. The Unicode concept of equivalence facilitates the interchanging of pre-composed characters with decomposed versions of the same characters and vice versa. Pre-composed characters and character sequences are equivalent if, when fully decomposed and correctly ordered, they yield identical elements in identical sequences. The Unicode Standard algorithm decomposes pre-composed characters and then orders the resulting elements according to the Unicode rules based, in part, on each character's combining class. Elements which combine with other elements are referred to as “combining characters” and have associated with them a “combining class”. The combining class is a Unicode Standard construct whereby
International Business Machines - Corporation
Kudirka & Jobse LLP