Method of identifying a language and of controlling a speech...

Data processing: speech signal processing – linguistics – language – Speech signal processing – Recognition

Reexamination Certificate

Details

C704S002000, C704S005000, C704S008000, C704S009000

active

06711542

ABSTRACT:

DESCRIPTION
The invention relates to a method of identifying a language in which a text is composed in the form of a string of characters, and also to a method of controlling a speech synthesis unit and to a communication device.
Communication devices, that is to say terminal devices used in a communication network, such as for example mobile phones or PCs (personal computers), may have at their user interface a speech reproduction unit for reproducing texts. To reproduce a text with the correct pronunciation, in particular a received text or message such as a short message (SMS), an e-mail, traffic information and the like, the language of the received text or message must be known.
To make possible the correct pronunciation of a name by means of a speech synthesis unit, EP 0 372 734 B1 proposes a method of identifying the language of a name in which a spoken name to be reproduced is broken down into groups of letters of 3 letters each and for each of the 3-letter groups the probability of the respective 3-letter group belonging to a certain language is established, in order then to ascertain from the sum of the probabilities of all the 3-letter groups the association with a language or a language group.
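The summing of 3-letter-group probabilities described for this prior-art method can be illustrated with a minimal sketch. The probability table, the function names and the use of overlapping groups are assumptions made for illustration; they are not taken from EP 0 372 734 B1.

```python
def letter_triples(name):
    """Overlapping 3-letter groups of a name (overlap is an assumption)."""
    s = name.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]


def score_languages(name, triple_probs):
    """Sum, per language, the probabilities of all 3-letter groups of the name."""
    triples = letter_triples(name)
    return {
        lang: sum(probs.get(t, 0.0) for t in triples)
        for lang, probs in triple_probs.items()
    }


# Hypothetical probabilities, invented purely for this example.
TRIPLE_PROBS = {
    "de": {"sch": 0.9, "chm": 0.5},
    "en": {"smi": 0.8},
}

scores = score_languages("Schmidt", TRIPLE_PROBS)
best_language = max(scores, key=scores.get)
```

Groups that are missing from a language's table contribute nothing to that language's sum in this sketch.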
In a known method (GB 2 318 659 A) of identifying a language in which a document is written, the words of a language that are used most frequently are selected for each of a multiplicity of languages available and are stored in respective word tables of the language. In order to identify the language of a document, words of the documents are compared with the most frequently used words of the various languages, the number of matches being counted. The language for which the greatest number of matches is obtained in the word-for-word comparison is then established as the language of the document.
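The word-for-word comparison of this known method might be sketched as follows. The word tables here are tiny and hypothetical, and the function name is an assumption; a real table would hold many more of the most frequently used words of each language.

```python
# Hypothetical word tables, invented for illustration only.
WORD_TABLES = {
    "en": {"the", "and", "of", "to", "is"},
    "de": {"der", "die", "und", "das", "ist"},
}


def identify_by_common_words(text, word_tables=WORD_TABLES):
    """Return (language, match_count) for the table with the most matches."""
    words = text.lower().split()
    counts = {
        lang: sum(word in table for word in words)
        for lang, table in word_tables.items()
    }
    best = max(counts, key=counts.get)
    return best, counts[best]
```

Because whole words must match, a short message with few common words yields few matches, which is the weakness for short texts that the invention addresses.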
In a further known method of identifying a language on the basis of 3-letter groups (U.S. Pat. No. 5,062,143), a text is broken down into a multiplicity of 3-letter groups in such a way that at least some of the 3-letter groups overlap neighbouring words, that is to say contain a space in the middle. The 3-letter groups obtained in this way are compared with key sets of 3-letter groups of various languages, in order to ascertain the language of a text from the ratio of the number of 3-letter groups of the text that match a key set to the total number of 3-letter groups of the text.
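The matching ratio used by this known method can be sketched as below. The sketch assumes that every overlapping window of three characters is taken, so that groups spanning a space between words arise naturally; the key set and the function names are hypothetical.

```python
def trigrams(text):
    """All overlapping 3-character groups, including ones that span a space."""
    s = " ".join(text.lower().split())  # collapse runs of whitespace
    return [s[i:i + 3] for i in range(len(s) - 2)]


def key_set_ratio(text, key_set):
    """Share of the text's 3-character groups that occur in a language key set."""
    groups = trigrams(text)
    if not groups:
        return 0.0
    return sum(g in key_set for g in groups) / len(groups)
```

For the text "to be", the groups are "to ", "o b" and " be"; the middle one has a space in the middle, as described above.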
The invention is based on the object of providing a further method of identifying a language which makes it possible with little expenditure to identify reliably the language in which the text is composed, even in the case of short texts. In addition, the invention is based on the object of providing a method of controlling a speech synthesis unit and a communication device with which correct speech reproduction is possible for various languages with little expenditure.
This object is achieved by the methods according to claims 1 and 14 and by the communication device according to claim 15.
Thus, according to the invention, a frequency distribution of letters in a text of which the language is sought is ascertained. This frequency distribution is compared with corresponding frequency distributions of available languages, in order to establish similarity factors which indicate to what extent the ascertained frequency distribution coincides with the frequency distributions of each available language. The language for which the ascertained similarity factor is the greatest is then established as the language of the text. In this case, it is expedient if the language is established only if the greatest similarity factor ascertained is greater than a threshold value.
Thus, according to the invention, the statistical distribution of letters, that is to say of individual letters, groups of 2 letters or groups of more than 2 letters, in a text to be analysed is established and compared with corresponding statistical distributions of the languages respectively available. This procedure requires relatively low computer capacities and relatively little storage space in its implementation.
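The comparison of frequency distributions could look like the following minimal sketch. The description does not say how the similarity factor is computed, so the histogram overlap used here is an assumption, as are the function names and the default threshold value. Spaces are counted like any other character in this sketch.

```python
from collections import Counter


def frequency_distribution(text, n=1):
    """Relative frequencies of n-character groups (n=1: individual letters)."""
    s = text.lower()
    groups = [s[i:i + n] for i in range(len(s) - n + 1)]
    total = len(groups)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(groups).items()}


def similarity(dist_a, dist_b):
    """Assumed similarity factor: histogram overlap, between 0 and 1."""
    return sum(min(p, dist_b.get(g, 0.0)) for g, p in dist_a.items())


def identify_language(text, language_dists, threshold=0.5, n=1):
    """Return (language or None, all similarity factors).

    The language is established only if its factor exceeds the threshold.
    """
    dist = frequency_distribution(text, n)
    factors = {lang: similarity(dist, ref) for lang, ref in language_dists.items()}
    if not factors:
        return None, factors
    best = max(factors, key=factors.get)
    return (best if factors[best] > threshold else None), factors
```

Only one table of relative frequencies per language and group size has to be stored, which is consistent with the low storage requirement stated above.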
In an advantageous development of the invention, it is provided that the ascertained frequency distribution is stored as the frequency distribution of a new language or is added to a corresponding frequency distribution of a language if, in response to an inquiry, a language to which the ascertained frequency distribution is to be assigned is indicated. In this way, it is made possible in a self-learning process for frequency distributions to be produced for further languages or, if a frequency distribution for this language has already been stored, to increase its statistical reliability.
In an advantageous development of the invention, it may be provided that the ascertained frequency distribution is added to the corresponding frequency distribution of the language established. As a result, the statistical reliability of stored frequency distributions of available languages can be automatically further improved, without the user needing to intervene.
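Both self-learning variants amount to adding the counts of the analysed text to a stored table. A minimal sketch follows, under the assumption that absolute counts rather than relative frequencies are stored so that distributions can simply be summed; the names `letter_counts`, `learn` and `store` are hypothetical.

```python
from collections import Counter


def letter_counts(text):
    """Absolute counts of individual letters in the text."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def learn(store, language, text_counts):
    """Add counts to a stored distribution, creating a new language if needed."""
    store.setdefault(language, Counter()).update(text_counts)
    return store


store = {}
learn(store, "en", letter_counts("abc"))   # new language is created
learn(store, "en", letter_counts("aab"))   # existing distribution is extended
```

The same `learn` call serves both cases: with a language indicated by the user in response to an inquiry, or with the language the method itself has established.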
In order to facilitate the processing of the text when ascertaining the frequency distribution of letters and groups of letters in the text, it is provided in an advantageous development of the invention that all non-letter characters, apart from spaces, are removed from the string of characters of the text, in order to ascertain from the string of characters thus obtained frequency distributions of letters and groups of letters in the text.
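This preprocessing step can be sketched in a few lines. The function name is an assumption; `str.isalpha` is Unicode-aware, so special letters such as "ü" or "ß" are retained.

```python
def keep_letters_and_spaces(text):
    """Remove every character that is neither a letter nor a space."""
    return "".join(ch for ch in text if ch.isalpha() or ch == " ")
```

For example, "Hello, World! 123" becomes "Hello World " with the trailing space retained, since only non-letter characters other than spaces are removed.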
In another development of the invention, it is provided that the length of the text is established and, depending on the length of the text, one, two or more frequency distributions of letters and groups of letters in the text are ascertained, the length of the text being established as the number of letters in the text and the number of letters in the text being compared with the number of letters in an alphabet, in order to determine which frequency distributions are ascertained.
In this way, the computing effort for ascertaining the frequency distribution or frequency distributions, and for the subsequent comparison of the frequency distributions to establish similarity factors, can be reduced without significantly impairing the reliability of the language identification, since the only frequency distributions omitted are those whose statistical significance would be extremely low.
In particular, it is expedient that the frequency distributions of groups of letters with three letters, of groups of letters with two letters and of individual letters are ascertained if the number of letters in the text is greater than the square of the number of letters in the alphabet. Thus, if the number of letters in the text is very great, it is advantageous if not only the frequency distributions of individual letters and of 2-letter groups but also the frequency distribution of 3-letter groups are ascertained, whereby the statistical reliability of the overall finding is significantly increased.
If there is a reduced number of letters in the text, which is greater than the number of letters in the alphabet but less than its square, the frequency distributions of groups of letters with 2 letters and of individual letters are ascertained. If the number of letters in the text is less than the number of letters in the alphabet, expediently only the frequency distribution of individual letters is ascertained, since the statistical significance of the frequency distributions of groups of letters is then practically no longer assured in the method of evaluation according to the invention.
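The length-dependent choice of frequency distributions can be sketched as follows. The default alphabet size of 26 and the function name are assumptions, and the text does not state what happens when the number of letters exactly equals the alphabet size or its square; this sketch falls back to the smaller set of distributions in those cases.

```python
def group_sizes_for(text, alphabet_size=26):
    """Return the letter-group sizes whose distributions should be ascertained."""
    n_letters = sum(ch.isalpha() for ch in text)
    if n_letters > alphabet_size ** 2:
        return [1, 2, 3]   # individual letters, 2-letter and 3-letter groups
    if n_letters > alphabet_size:
        return [1, 2]      # individual letters and 2-letter groups
    return [1]             # individual letters only
```

With a 26-letter alphabet, a text needs more than 676 letters before 3-letter groups are evaluated, so a typical short message would be analysed with individual letters and 2-letter groups only.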
A particularly expedient development of the invention is distinguished by the fact that a complete alphabet is used, including special letters of various languages based on Latin letters. The use of a complete alphabet, that is to say an alphabet which contains not only the Latin letters common to all languages
