Today’s world of commerce is built around technologies like AI (machine learning, NLP, and NLU), and devices are increasingly being manufactured smart enough to capitalize on them. One such service, Amazon’s Alexa, leverages these technologies and builds on top of them. For anyone keen to learn what Alexa is and how it works, this article should be useful. In this first part of the series, I describe the components that make Alexa work, followed by a high-level view of Alexa’s architecture. In the next part, I will discuss Alexa’s technical architecture and walk through an example application, including the dependencies required to build it and how to deploy it on an Alexa app server.
Alexa, as an AI assistant, acts as a bridge between man and machine. Built on AI, it enables humans to talk to machines by giving instructions in the form of an action, a command, or a question. Earlier Echo speakers required the user to hold a button while saying the wake word to activate the device, but recent Echo speakers no longer need such a button to wake. Moreover, Amazon continues to bring Alexa’s capabilities to other smart devices, such as phones, tablets, and home appliances. To understand how Alexa works, you first need to understand the terminology and the role of each component.
What is an Echo speaker:
An Echo speaker (or Amazon Echo) is a speaker device that a user talks to in order to pass instructions to Amazon’s personal, intelligent assistant, Alexa. These devices are available in many models and are activated by a specific wake word, with which they come pre-configured.
What is a Wake Word:
The wake word activates an Echo device so that it listens for the user’s instructions. Typical wake words are “Alexa,” “Echo,” or “Computer.”
What is Invocation Name:
This is the keyword that prompts a particular Alexa skill. Every custom skill requires an invocation name to start an interaction. A developer can change the invocation name during development of the skill, but once the skill is certified and published, the invocation name cannot be changed. Use of an invocation name must abide by the Alexa policies available under “Policy testing for Alexa skills”; for example, an invocation name must not infringe the intellectual property rights of a person or an organization. An invocation name can be combined with a question, command, or action. Below is an example of an invocation name in a sentence.
“Hey, Alexa can you start the action movie Terminator 3”
“Alexa” is the wake word in this instruction.
“Action Movie” is the invocation name here.
As a policy, an invocation name may only be a single word if it relates to a brand or intellectual property. A good invocation name is a compound of two or more words, though there are further conditions around this, depending on the skill’s language (German, for example).
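The distinction between wake word, invocation name, and utterance can be illustrated with a small sketch. This is a toy in pure Python with a hypothetical skill registry, not Alexa’s actual parser:

```python
# Minimal sketch of how a spoken request decomposes into wake word,
# invocation name, and utterance. The skill registry is hypothetical;
# Alexa's real speech pipeline is far more sophisticated.

WAKE_WORDS = {"alexa", "echo", "computer"}

# Hypothetical registry mapping invocation names to skills.
SKILLS = {
    "action movie": "MoviePlayerSkill",
}

def parse_request(sentence: str):
    """Split a request into (wake_word, invocation_name, utterance)."""
    words = sentence.lower().replace(",", "").split()
    if not words or words[0] not in WAKE_WORDS:
        return None  # without a wake word, the device stays asleep
    rest = " ".join(words[1:])
    for name, skill in SKILLS.items():
        if name in rest:
            # Everything after the invocation name is treated as the utterance.
            utterance = rest.split(name, 1)[1].strip()
            return words[0], name, utterance
    return None

print(parse_request("Alexa, can you start the action movie Terminator 3"))
# -> ('alexa', 'action movie', 'terminator 3')
```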
What is Utterance:
An utterance is what the user wants Alexa to execute. In the example above, “Terminator 3” is an utterance. Utterances are simply the phrases users speak when giving instructions to Alexa. Alexa’s response is decided based on the utterance identified in the user’s request.
What is NLP:
NLP stands for Natural Language Processing and is a subset of Artificial Intelligence. It is responsible for interactions between humans and computerized devices, driving the complex task of analyzing and processing the natural language used by humans so that computers can understand it. NLP enables computers to understand, analyze, process, and respond to humans in natural language, making it possible for man and machine to communicate in the form of text, speech, and more.
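As a taste of what the earliest NLP stage looks like, here is a toy preprocessing step (tokenization plus normalization) in pure Python. Production NLP systems use trained models, not simple string splitting, and the stop-word list here is an illustrative assumption:

```python
# Toy sketch of an early NLP pipeline stage: lowercasing, stripping
# punctuation, tokenizing, and dropping stop words before deeper
# analysis. Illustrative only; the stop-word list is hypothetical.

import string

STOP_WORDS = {"the", "a", "an", "can", "you"}

def preprocess(text: str):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

print(preprocess("Alexa, can you start the action movie Terminator 3?"))
# -> ['alexa', 'start', 'action', 'movie', 'terminator', '3']
```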
What is NLU:
NLU stands for Natural Language Understanding. It is a subset of NLP and can be considered the first step in interpreting human natural language; it too falls under the umbrella of Artificial Intelligence. Understanding human natural language (of which the world has many) with a computational algorithm is a daunting task. What makes it even harder is sentence formation: the same sentence can be expressed through many combinations and permutations of words, in either speech or text. Here, computational power comes into play to decode the meaningful words of a sentence and pass them on to further processing logic (NLP), so that the user receives the most appropriate response to the request. This requires scaling of servers, which is most practically achieved through cloud computing, a capability Amazon has. NLU plays another major role by deeply understanding the context of a sentence and identifying the verbs, nouns, and tenses used in it. This process is known as “Part of Speech Tagging” (POS).
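The POS tagging step can be sketched with a toy lexicon-based tagger. Real NLU systems use statistical or neural taggers trained on large corpora; the lexicon and fallback rules below are assumptions for illustration:

```python
# Toy "Part of Speech Tagging" (POS) sketch: tag tokens by lexicon
# lookup with crude fallback rules. The lexicon is hypothetical; real
# taggers are trained on large annotated corpora.

LEXICON = {
    "alexa": "NOUN",
    "start": "VERB",
    "play": "VERB",
    "movie": "NOUN",
    "action": "ADJ",
}

def pos_tag(tokens):
    """Tag each token via lexicon lookup, falling back to NUM or NOUN."""
    tags = []
    for tok in tokens:
        if tok in LEXICON:
            tags.append((tok, LEXICON[tok]))
        elif tok.isdigit():
            tags.append((tok, "NUM"))
        else:
            tags.append((tok, "NOUN"))  # noun is the most common fallback
    return tags

print(pos_tag(["start", "action", "movie", "terminator", "3"]))
# -> [('start', 'VERB'), ('action', 'ADJ'), ('movie', 'NOUN'),
#     ('terminator', 'NOUN'), ('3', 'NUM')]
```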
What is Deep Learning:
Deep learning is a subset of machine learning. In Alexa’s case, deep learning is the training process behind the acoustic model, accomplished by closely observing how audio and transcripts are paired. Deep learning can be compared with how the human brain works: just as the brain has neurons that help it make decisions, deep learning works with webs of artificial neural networks. The data to be processed is huge in volume and unstructured, so deep learning helps machines process it in a non-linear way. Deep learning is a continuously evolving field, and many companies have invested in research in this area.
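To make the “artificial neuron” idea concrete, here is a single neuron trained by gradient descent on the logical AND function, in pure Python. This is only a toy: real acoustic models stack millions of such units across many layers.

```python
# A single artificial neuron (weighted sum + sigmoid) trained by
# gradient descent to learn logical AND. Illustrative toy only.

import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training data: (inputs) -> target output.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

random.seed(0)
w1, w2, b = random.random(), random.random(), random.random()
lr = 1.0  # learning rate

for _ in range(5000):
    for (x1, x2), target in data:
        out = sigmoid(w1 * x1 + w2 * x2 + b)
        # Gradient of squared error through the sigmoid.
        grad = (out - target) * out * (1 - out)
        w1 -= lr * grad * x1
        w2 -= lr * grad * x2
        b -= lr * grad

for (x1, x2), target in data:
    print((x1, x2), round(sigmoid(w1 * x1 + w2 * x2 + b)))
```

After training, the rounded outputs match the AND truth table, showing how repeated exposure to paired inputs and targets tunes the neuron’s weights, which is the same principle that pairing audio with transcripts uses at a vastly larger scale.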
Alexa, a cloud-based service from Amazon, comprises the following components in its end-to-end architecture. Below is a high-level diagram of Alexa’s architecture, followed by details of the associated components.
Echo Speaker:
This takes instructions from the user, as explained above. As Amazon keeps advancing the ability to take user instructions from smart devices such as phones, tablets, and smart home appliances, the need for dedicated Echo speakers may be eliminated going forward.
Signal Processing:
When a user speaks to an Echo speaker, identifying the intended sound in a far-field environment is no easy job. There may be many spurious signals, such as noise from a TV or music playing nearby. Fetching the right voice command is critical, so signal processing plays an important role here. This is accomplished by using multiple microphones (a technique known as beam-forming) and by cancelling or reducing noise signals through acoustic echo cancellation, so that only the signal of interest remains for further processing.
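The beam-forming idea can be sketched with a toy delay-and-sum example: aligning the signals from several microphones and averaging them reinforces sound from the target direction, while uncorrelated noise partly cancels. The sample values and delays below are illustrative assumptions, not real audio processing:

```python
# Toy delay-and-sum beamforming sketch. Each microphone hears the same
# source with a different arrival delay; shifting by the known delays
# and averaging reconstructs the source. Illustrative numbers only.

def delay_and_sum(mic_signals, delays):
    """Shift each mic's samples by its known delay and average them."""
    length = min(len(sig) - d for sig, d in zip(mic_signals, delays))
    output = []
    for n in range(length):
        total = sum(sig[n + d] for sig, d in zip(mic_signals, delays))
        output.append(total / len(mic_signals))
    return output

# The same source arrives at three mics with per-mic delays of 0, 1, 2 samples.
source = [0.0, 1.0, 0.0, -1.0, 0.0]
mics = [
    source,                # delay 0
    [0.0] + source,        # delay 1 sample
    [0.0, 0.0] + source,   # delay 2 samples
]
print(delay_and_sum(mics, delays=[0, 1, 2]))
# -> [0.0, 1.0, 0.0, -1.0, 0.0]  (the aligned average reproduces the source)
```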
Alexa Voice Service:
This can be considered the brain of Alexa: a suite of services, APIs, and tools configured around the Alexa AI assistant. It holds the responsibility of understanding human natural language from the voice commands users issue through an Echo device. AI has machine learning underneath, which in turn provides capabilities such as NLP and NLU. This service resolves complex voice commands with the help of advanced computational power and deep learning algorithms.
Alexa Skills:
The services in the Alexa Voice Service are Alexa skills. Depending on the voice command, the most appropriate skill is invoked and serves the user with the most meaningful response to the request. Alexa skill development is a niche area that requires developers to implement commanding solutions, and these skills are key to responding to users with the expected results. This is the component that makes a decision by looking at the invocation name and the utterance in a spoken sentence; it interprets the user’s input, processes it, and responds as expected. The utterance is the phrase that encapsulates the user’s desired result.
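The decision step described above can be sketched as a routing table from recognized intents to handlers. The skill and intent names are hypothetical; real skills are built with the Alexa Skills Kit SDK and run as cloud endpoints such as AWS Lambda functions:

```python
# Sketch of intent routing inside a skill: dispatch the resolved
# intent to its handler, which builds the spoken response. Intent
# names and handlers here are hypothetical.

def play_movie_handler(slots):
    return f"Starting the movie {slots.get('title', 'you asked for')} now."

def weather_handler(slots):
    return "Here is today's weather forecast."

# Hypothetical intent -> handler routing table.
HANDLERS = {
    "PlayMovieIntent": play_movie_handler,
    "WeatherIntent": weather_handler,
}

def handle_request(intent_name, slots):
    """Dispatch the resolved intent to its handler, or apologize."""
    handler = HANDLERS.get(intent_name)
    if handler is None:
        return "Sorry, I don't know how to help with that yet."
    return handler(slots)

print(handle_request("PlayMovieIntent", {"title": "Terminator 3"}))
# -> Starting the movie Terminator 3 now.
```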
Device Cloud:
This receives input from the Alexa Voice Service (the response produced by an Alexa skill for the user’s request). It then sends command signals to the appropriate device connected to the device cloud to carry out the action the user instructed, such as starting an air conditioner or playing a movie on a TV.
Deepak Chhabra is working as VP - Product Engineering at QSS Technosoft Pvt. Ltd.
QSS has a proven track record of executing web and mobile applications for its esteemed customers. The company has a core competency in developing and delivering enterprise-level applications using cloud computing services.