
Question:

What is a voice utterance?

Answer:

An utterance is what a user says when making a request to a smart speaker or a virtual assistant.

For natural language processors, an intent represents an action that fulfills a user’s spoken request. Intents can optionally have arguments called slots. Sample utterances are a set of likely spoken phrases mapped to the intents.


Example:

What is the weather in Boston today

The intent is weather

There are two slots: location (“Boston”) and time (“today”).


Here is an example of what an alternative utterance would be:

Tell me in Boston what is the weather today
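
The mapping from utterances to an intent and its slots can be sketched in a few lines of Python. This is only an illustration, not how any particular voice platform works: the template syntax with {location} and {time}, the SAMPLE_UTTERANCES table, and the helper names are all assumptions made for this example, and each slot is assumed to be a single word.

```python
import re

# Hypothetical sample utterances for a "weather" intent.
# {location} and {time} mark the slots (assumed syntax for this sketch).
SAMPLE_UTTERANCES = {
    "weather": [
        "what is the weather in {location} {time}",
        "tell me in {location} what is the weather {time}",
    ],
}


def template_to_regex(template):
    """Turn a template into a regex with one named group per slot."""
    # re.split with a capturing group alternates literal text and slot names.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal text
        else:
            pattern += f"(?P<{part}>\\w+)"   # single-word slot value
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def parse(utterance):
    """Return (intent, slots) for the first matching sample utterance."""
    text = utterance.strip().rstrip("?.!")
    for intent, templates in SAMPLE_UTTERANCES.items():
        for template in templates:
            match = template_to_regex(template).match(text)
            if match:
                return intent, match.groupdict()
    return None, {}


print(parse("What is the weather in Boston today"))
print(parse("Tell me in Boston what is the weather today"))
```

Both utterances resolve to the weather intent with the location slot set to Boston and the time slot set to today. Production systems generally rely on statistical models rather than exact template matching, so that phrasings not listed in the samples can still be recognized.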

Frequently Asked Questions

What is an acoustic model?
An acoustic model is a representation that maps “the relationship between an audio signal and the phonemes or other linguistic units that make up speech. The model is learned from a set of audio recordings and their corresponding transcripts. It is created by taking audio recordings of speech, and their text transcriptions, and using software to create statistical representations of the sounds that make up each word.”
What is an adaptive system?
An adaptive system is a system that adapts its behavior to changing parameters, such as the user’s identity, the time of day, day of week or month, the context of the interaction, etc.
What is ASR?
ASR: Automatic Speech Recognition, or Automatic Speech Recognizer; software that maps audio input to a word or a language utterance.
What is ASR Tuning?
ASR Tuning: The activity of iteratively configuring the ASR (Automatic Speech Recognition, or Automatic Speech Recognizer) software to better map, both in accuracy and in speed, the audio input to a word or an utterance.
What is an earcon?
Earcon: The audio equivalent of an “icon” in graphical user interfaces. Earcons are used to signal conversation marks (e.g., when the system starts listening, when the system stops listening) and to communicate brand, mood, and emotion during a voice-first based interaction.
What is echo cancellation?
Echo cancellation: A technique that filters out audio coming out of a device while processing incoming audio for speech recognition into that same device.
What is far field speech recognition?
Far Field Speech Recognition: Speech recognition technology that is able to process speech spoken by a user from a distance (usually 10 feet away or more) to the receiving device, usually in a context where there is ambient noise.
What is Houndify?
Houndify: A platform by music identifier service SoundHound that lets developers integrate speech recognition and Natural Language Processing systems into hardware and other software systems.
What is an intent in natural language processing?
A natural language processing intent is what a user is trying to accomplish. Spoken requests fall into three types. Full Intent: A spoken request in which the user expresses everything that is required to complete their request, all at once, such as “Alexa, ask Elle.com for today’s horoscope for Virgo.” Partial Intent: A spoken request in which the user expresses only part of the information required to complete their request, such as “Alexa, ask Elle.com for the horoscope.” No Intent: A spoken request with minimal information, such as “Alexa, talk to Elle Magazine.”
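
As a rough illustration of the three types, the sketch below classifies a request by checking whether an intent was recognized and whether all of its required slots were filled. The function name, the "horoscope" intent name, and the "sign" and "date" slot names are assumptions made for this example and do not come from any real skill definition.

```python
def classify_request(intent, required_slots, filled_slots):
    """Label a request as a full, partial, or no-intent request.

    intent: recognized intent name, or None if the user named no action.
    required_slots: slot names the intent needs before it can be fulfilled.
    filled_slots: dict of slot values the user actually supplied.
    """
    if intent is None:
        return "no intent"
    missing = [name for name in required_slots if name not in filled_slots]
    return "partial intent" if missing else "full intent"


REQUIRED = ["sign", "date"]  # hypothetical slots for a horoscope intent

# "Alexa, ask Elle.com for today's horoscope for Virgo."
print(classify_request("horoscope", REQUIRED, {"sign": "Virgo", "date": "today"}))
# "Alexa, ask Elle.com for the horoscope."
print(classify_request("horoscope", REQUIRED, {}))
# "Alexa, talk to Elle Magazine."
print(classify_request(None, REQUIRED, {}))
```

For a partial intent, a voice application would typically prompt the user for the missing slots before completing the request.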
What is N-best?
N-Best: In speech recognition, given an audio input, an ASR (Automatic Speech Recognizer) returns a list of results, with each result assigned a confidence score, usually a fraction between 0 and 1 (e.g., “0.87”) or a percentage. N-Best refers to the “N” results that were returned by the ASR and that were above the “confidence threshold”. For instance, if the user were to say “Austin,” and the recognizer were to return “Austin” with a score of 0.92, “Boston” with 0.87, “Houston” with 0.65, “Aspen” with 0.52, and “Oslo” with 0.43, and the threshold were set at 0.70, only the first two, “Austin” and “Boston,” would be returned.
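
The thresholding in the “Austin” example can be written out as a short Python sketch. The n_best function and the list-of-pairs format are assumptions made for illustration; real recognizers expose their results through their own APIs.

```python
def n_best(hypotheses, threshold):
    """Keep hypotheses scoring above the threshold, best first."""
    kept = [(text, score) for text, score in hypotheses if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Recognizer output from the example above: (text, confidence score).
hypotheses = [
    ("Austin", 0.92),
    ("Boston", 0.87),
    ("Houston", 0.65),
    ("Aspen", 0.52),
    ("Oslo", 0.43),
]

print(n_best(hypotheses, threshold=0.70))
# [('Austin', 0.92), ('Boston', 0.87)]
```

Raising the threshold shortens the list, and lowering it lets weaker candidates such as “Houston” through.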
What is Named Entity Recognition?
Named Entity Recognition (NER) identifies words or phrases in text that belong to predefined categories, such as people, places, organizations, and dates.
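
A very simplified way to picture this is a lookup table that maps known words to categories. The sketch below takes that approach. The GAZETTEER table, its category labels, and the find_entities helper are assumptions made for this example. Practical NER systems generally use trained statistical models instead of a fixed word list.

```python
# Hypothetical lookup table mapping lowercase words to entity categories.
GAZETTEER = {
    "boston": "LOCATION",
    "austin": "LOCATION",
    "today": "DATE",
    "tomorrow": "DATE",
}


def find_entities(text):
    """Return (word, category) pairs for words found in the lookup table."""
    entities = []
    for word in text.split():
        cleaned = word.strip(".,?!")  # drop surrounding punctuation
        category = GAZETTEER.get(cleaned.lower())
        if category is not None:
            entities.append((cleaned, category))
    return entities


print(find_entities("What is the weather in Boston today?"))
# [('Boston', 'LOCATION'), ('today', 'DATE')]
```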
What is NearField Speech Recognition?
Near Field Speech Recognition: In contrast to “Far Field” speech recognition, which processes speech spoken by a human to a device from a distance (usually of 10 feet or more), Near Field speech recognition technology is used for handling spoken input from hand-held mobile devices (such as Siri on the iPhone) that are used within a few inches to two feet of the speaker at most.
What is recognition tuning?
Recognition Tuning: The activity of configuring the ASR’s (Automatic Speech Recognizer) settings to optimize recognition accuracy and processing speed.
What is a sample utterance?
Sample utterance: A structured string of words that connects a specific intent to a likely utterance, which is what a user says when making a request. You provide a set of sample utterances as part of your interaction model for a custom skill. When users say one of these utterances, the Alexa service sends a request to your service that includes the corresponding intent.  Note: You only provide sample utterances for custom skills. Utterances for smart home skills are defined by the Smart Home Skill API.
What is a Speech Recognition Application?
Speech Recognition Application: Software that enables a device or computer to convert spoken words into written text by finding the best-matching word sequence. A Speech-To-Text (STT) engine lets you dictate a message, which your device then sends as text. A Text-To-Speech (TTS) engine does the reverse and speaks written words aloud.
What is speech tagging?
Speech Tagging, more commonly called part-of-speech tagging, is when a natural language processor such as a chatbot labels each word with its part of speech, such as noun or verb. Speech tagging helps a computer understand how sentence structure affects meaning.
What is Speech to Text?
Speech To Text (STT):  Software that converts an audio signal to words (text).  “Speech to Text” is a term that is less frequently used in the industry than “Speech Recognition,” “Speech Reco,” or “ASR.”
What is text to speech?
Text to Speech (TTS): Technology that converts text to audio that is spoken by the system. TTS is usually used for dynamically retrieved information (e.g., a product ID), or when the list of possible items to be spoken by the system (such as full addresses) is very large, making it impractical to record all of the options.
What is tokenization?
Tokenization means that a natural language processor divides text into smaller units called tokens, such as words and punctuation marks, that the computer can then process.
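
A minimal sketch of tokenization in Python follows, assuming a simple rule: each run of letters or digits is one token, and each punctuation mark is its own token. The tokenize function is a name chosen for this example. Real natural language processors often use more elaborate rules or learned subword vocabularies.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("What is the weather in Boston today?"))
# ['What', 'is', 'the', 'weather', 'in', 'Boston', 'today', '?']
```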