Speech recognition

Speech recognition technologies allow computers equipped with microphones to interpret human speech, e.g. for transcription or as a control method.

Such systems can be classified by whether they require the user to "train" the system to recognise their particular speech patterns, whether they can recognise continuous speech or require users to break up their speech into discrete words, and whether the vocabulary they recognise is small (on the order of tens or at most hundreds of words) or large (thousands of words).

Commercial systems for speech recognition have been available off the shelf since the 1990s. As of 2001, systems requiring a short amount of training can capture continuous speech with a large vocabulary at a normal pace with an accuracy of about 98% (getting two words in one hundred wrong), while other systems that require no training can recognise a small number of words (for instance, the ten digits of the decimal system) as spoken by most English speakers.
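To make the accuracy figure concrete, the following is a minimal sketch of one common way to score a transcript: count the word-level edits (substitutions, insertions and deletions) needed to turn the recognised text into a reference transcript, and divide by the number of reference words. The function names word_errors and word_accuracy are illustrative assumptions, and the article does not say which measure lies behind the 98% figure.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance between a reference and a recognised transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of reference words transcribed correctly (assumed definition)."""
    n = len(reference.split())
    if n == 0:
        raise ValueError("reference transcript is empty")
    return 1.0 - word_errors(reference, hypothesis) / n


if __name__ == "__main__":
    # Hypothetical 100-word reference with two words misrecognised.
    ref_words = [f"w{i}" for i in range(100)]
    hyp_words = list(ref_words)
    hyp_words[10] = "x"
    hyp_words[57] = "y"
    print(word_accuracy(" ".join(ref_words), " ".join(hyp_words)))
```

Under this definition, two wrong words in a hundred correspond to an accuracy of about 0.98. Because inserted words also count as errors, a transcript with many spurious words can score below zero.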

Despite the apparent success of the technology, however, few people use such speech recognition systems. It appears that most computer users can create and edit documents more quickly with a conventional keyboard, even though most people can speak considerably faster than they can type. In addition, heavy use of the speech organs results in vocal loading.

Some of the key technical problems in speech recognition are that:

inter-speaker differences are often large and difficult to account for. It is not clear which characteristics of speech are speaker-independent.
the interpretation of many phonemes, words and phrases is context-sensitive. For example, phonemes are often shorter in long words than in short words. Words have different meanings in different sentences, e.g. "Philip lies" could mean either that Philip is a liar or that Philip is lying on a bed.
intonation and speech timbre can completely change the correct interpretation of a word or sentence, e.g. "Go!", "Go?" and "Go." can clearly be distinguished by a human but not so easily by a computer.
words and sentences can have several valid interpretations, such that the speaker leaves the choice of the correct one to the listener.
written language may need punctuation according to strict rules that are not strongly present in speech and are difficult to infer without knowing the meaning (commas, sentence endings, quotations).

A general solution to many of the above problems effectively requires human knowledge and experience, and would thus require advanced artificial intelligence technologies to be implemented on a computer. In lower-complexity systems it can suffice to use knowledge from linguistics to interpret the speech.

The "understanding" of the meaning of spoken words is generally regarded as a separate field, that of natural language understanding.

See also: