Today’s commerce increasingly revolves around AI technologies such as machine learning, NLP and NLU, and devices are being built smart enough to capitalize on them. One such service is Amazon’s Alexa, which leverages the capabilities of these technologies and builds on top of them. This article may be useful to anyone who wants to understand what Alexa is and how it works. In this first part of the series, I describe the components that make Alexa work, followed by Alexa’s high-level architecture. In the next part, I will discuss Alexa’s technical architecture and an example application, along with the dependencies required to build that application and how to deploy it on the Alexa App Server.
Alexa, as an AI assistant, is presented as a bridge between man and machine. Built on AI, it lets humans talk to machines by accepting their instructions in the form of an action, a command or a question. Earlier Echo speakers required the user to hold a button while saying the wake word to activate the Alexa-powered device, but recent Echo speakers wake without any button. Moreover, Amazon is working to bring Alexa’s capabilities to other smart devices, which could be a phone, a tablet or a home appliance. To understand how Alexa works in more detail, you first need to know the terminology and the importance of each component.
What is an Echo speaker:
An Echo speaker (or Amazon Echo) is a speaker device through which a user talks to Alexa, Amazon’s intelligent personal assistant, to give instructions for a task. These devices are available in many models. They are manufactured with one or more pre-configured wake words and are activated by a specific wake word.
What is a Wake Word:
The wake word activates an Echo device so that it listens to the user’s instructions. Typical wake words are “Alexa”, “Echo” and “Computer”.
What is an Invocation Name:
An invocation name is a keyword used to trigger a particular Alexa skill. Every custom skill requires an invocation name to start the interaction. A developer can change the invocation name while the skill is in development, but once the skill is certified and published the invocation name can no longer be changed. Invocation names must comply with the Alexa policies described under “Policy testing for Alexa skills”. For example, an invocation name must not infringe the intellectual property rights of a person or an organization. An invocation name can be combined with a question, a command or an action. Below is an example of an invocation name in a sentence.
“Hey, Alexa can you start action movie Terminator 3”
“Alexa” is the wake word in this instruction.
“Action Movie” is the invocation name here.
As a policy, an invocation name can consist of a single word only if that word relates to a brand or intellectual property. Otherwise, a good invocation name is a compound of two or more words, although further conditions apply depending on the language of the skill, for example German.
What is an Utterance:
An utterance is what the user wants Alexa to execute. In the above example, “Terminator 3” is the utterance. Utterances are simply the phrases users say when giving instructions to Alexa. Alexa’s response is determined by the utterance it identifies in the user’s request.
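To make the three terms concrete, here is a minimal Python sketch that splits the example sentence into wake word, invocation name and utterance. It is only an illustration under stated assumptions: the function `split_request` and the `INVOCATION_NAMES` list are hypothetical, and real Alexa devices work on audio with trained models rather than on string matching.

```python
# Wake words mentioned in this article.
WAKE_WORDS = ("alexa", "echo", "computer")

# Hypothetical list of known invocation names (an assumption for this sketch).
INVOCATION_NAMES = ("action movie",)


def split_request(sentence):
    """Split a transcribed sentence into wake word, invocation name and utterance."""
    words = sentence.replace(",", " ").split()
    lowered = [w.lower() for w in words]

    # Without a wake word the device stays asleep.
    wake_index = next((i for i, w in enumerate(lowered) if w in WAKE_WORDS), None)
    if wake_index is None:
        return None

    rest = lowered[wake_index + 1:]
    rest_original = words[wake_index + 1:]

    # Look for a known invocation name; what follows it is the utterance.
    for name in INVOCATION_NAMES:
        name_words = name.split()
        size = len(name_words)
        for start in range(len(rest) - size + 1):
            if rest[start:start + size] == name_words:
                return {
                    "wake_word": lowered[wake_index],
                    "invocation_name": name,
                    "utterance": " ".join(rest_original[start + size:]),
                }

    # No invocation name found: treat everything after the wake word as the utterance.
    return {
        "wake_word": lowered[wake_index],
        "invocation_name": None,
        "utterance": " ".join(rest_original),
    }


print(split_request("Hey, Alexa can you start action movie Terminator 3"))
```

For the example sentence, the sketch reports “alexa” as the wake word, “action movie” as the invocation name and “Terminator 3” as the utterance.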
What is NLP:
NLP stands for Natural Language Processing and is a subset of Artificial Intelligence. It deals with interactions between humans and computerized devices, and takes on the complex task of analyzing and processing the natural language used by humans so that computers can understand it. This enables computers to understand, analyze, process and respond to humans in natural language, which makes communication between man and machine possible in the form of text, speech and more.
What is NLU:
NLU stands for Natural Language Understanding. It is a subset of NLP, so it also comes under the umbrella of Artificial Intelligence, and it can be seen as the first step in interpreting human natural language. Understanding natural language (and there are many languages in this world) with a computational algorithm is a daunting task. The language may be native to the speaker, and sentence formation makes it even more difficult: the same meaning can be expressed through many combinations and orderings of words, whether in speech or in text. This is where computational power comes into play. It decodes the meaningful words of a sentence and passes them on to further processing logic (NLP) so that the user receives the most appropriate response to the request. This requires servers that scale, which is most practically achieved through cloud computing, a capability Amazon has. NLU also plays another major role: it analyzes the context of a sentence and identifies elements such as the verb, the noun or the tense used. This process is known as “Part of Speech Tagging” (POS).
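As a rough illustration of what Part of Speech tagging produces, the sketch below tags each word using a tiny hand-written lexicon. The lexicon and the `tag` function are assumptions made for this example only. Production taggers are statistical or neural models that use the surrounding context, which a plain lookup like this cannot do.

```python
# Tiny hand-written lexicon (an assumption for this sketch, not a real tagger).
LEXICON = {
    "alexa": "NOUN",
    "movie": "NOUN",
    "music": "NOUN",
    "play": "VERB",
    "start": "VERB",
    "stop": "VERB",
    "the": "DET",
    "a": "DET",
}


def tag(sentence):
    """Return (word, part-of-speech) pairs using a simple dictionary lookup."""
    tokens = sentence.lower().split()
    return [(token, LEXICON.get(token, "UNKNOWN")) for token in tokens]


print(tag("Alexa play the movie"))
```

Words missing from the lexicon are labelled UNKNOWN, which is exactly the gap that trained models close by learning from large amounts of text.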
What is Deep Learning:
Deep learning is a subset of machine learning. In Alexa, it is the training process behind the acoustic model, which learns by closely observing how audio and transcripts are paired. Deep learning is often compared to how the human brain works: just as the brain has neurons that help it make decisions, deep learning works with a web of artificial neurons, known as an artificial neural network. Because the data to be processed is huge in volume and unstructured, deep learning helps machines process it in a non-linear way. Deep learning keeps evolving day by day, and many companies are investing in research in this area.
Alexa, which is a cloud-based service from Amazon, consists of the following components, which together form its end-to-end architecture. Below is a high-level diagrammatic depiction of Alexa’s architecture, followed by some details of the associated components.
The Echo speaker takes instructions from the user, as already explained above. Since Amazon keeps extending Alexa to smart devices such as phones, tablets and smart home appliances, this will eventually eliminate the need for an Echo speaker.
When a user speaks to an Echo speaker, it is not easy to isolate the voice in a far-field environment. There may be many interfering signals around, such as sound from a TV or music. Capturing the right voice command is very important, so signal processing plays a key role here. It is accomplished by using several microphones (a technique known as beamforming) and by cancelling or reducing noise signals through acoustic echo cancellation, so that only the signal of interest remains for further processing.
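The idea of processing data in a non-linear way can be shown with a very small neural network. The sketch below is an assumption-laden toy: the weights are set by hand rather than learned, and the function names are invented for this example. It computes XOR, a function that no single linear unit can represent, by combining two hidden neurons with a non-linear activation (ReLU).

```python
def relu(x):
    """Non-linear activation: pass positive values, clip negatives to zero."""
    return max(0.0, x)


def tiny_network(x1, x2):
    """Two hidden neurons plus one output neuron, with hand-picked weights."""
    h1 = relu(x1 + x2)        # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2      # output neuron combines the hidden values


for a in (0, 1):
    for b in (0, 1):
        print(a, b, tiny_network(a, b))
```

In real deep learning the weights are not chosen by hand. They are adjusted automatically during training on large amounts of data, and the networks have many more layers and neurons.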
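The multi-microphone idea can be sketched with delay-and-sum, one of the simplest forms of beamforming. This is not a description of what Echo devices actually run. It is a toy simulation in which the arrival delay at each microphone is assumed to be known, the voice is a sine wave, and the noise is random. All function names and parameters are invented for this example.

```python
import math
import random


def delay_and_sum(mic_signals, delays):
    """Undo each microphone's known delay (in samples) and average the signals."""
    length = min(len(sig) - d for sig, d in zip(mic_signals, delays))
    output = []
    for n in range(length):
        total = sum(sig[n + d] for sig, d in zip(mic_signals, delays))
        output.append(total / len(mic_signals))
    return output


def simulate(num_samples=200, delays=(0, 2, 4, 6), noise_level=0.5, seed=0):
    """Create a clean 'voice' and noisy, delayed copies as heard by each microphone."""
    rng = random.Random(seed)
    voice = [math.sin(2 * math.pi * n / 20) for n in range(num_samples)]
    mics = []
    for d in delays:
        # The microphone hears the voice d samples late, plus its own noise.
        delayed = [0.0] * d + voice[:num_samples - d]
        mics.append([v + rng.gauss(0.0, noise_level) for v in delayed])
    return voice, mics


def mean_squared_error(a, b):
    """Average squared difference over the overlapping part of two signals."""
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a[:n], b[:n])) / n


delays = (0, 2, 4, 6)
voice, mics = simulate(delays=delays)
beamformed = delay_and_sum(mics, delays)
print("single microphone error:", mean_squared_error(mics[0], voice))
print("beamformed error:       ", mean_squared_error(beamformed, voice))
```

Because the voice lines up after the delays are undone while the noise at each microphone is independent, averaging keeps the voice and shrinks the noise, so the beamformed error comes out lower than the single-microphone error.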
Alexa Voice Service:
Alexa Voice Service can be considered the brain of Alexa. It is a suite of services, that is, APIs and tools, built around Alexa as an AI assistant. This service is responsible for understanding human natural language by taking voice commands from users via an Echo device. Because the underlying AI relies on machine learning, which in turn includes capabilities like NLP and NLU, it can resolve complex voice commands with the help of advanced computational power and deep learning algorithms.
The services in Alexa Voice Service are the Alexa skills. Depending on the voice command, the most appropriate skill gets invoked and serves the user with the most meaningful response to the request. Alexa skill development is a niche area that requires developers to implement robust solutions, and these skills are key to responding to users with the expected results. This is the component that makes a decision by looking at the invocation name and the utterance in a spoken sentence: it works out the user’s input, processes it and responds as expected. The utterance is the phrase that encapsulates the user’s desired result.
The device cloud receives its input from Alexa Voice Service, namely the response produced by an Alexa skill for the user’s input. It then sends command signals to the appropriate device connected online to the device cloud to carry out the action requested by the user. For example, this could be switching on an air conditioner or playing a movie on a TV.
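The decision described here, choosing a skill by invocation name and handing it the utterance, can be sketched as a simple lookup table. This is a hypothetical illustration only: the handler functions, the `SKILLS` registry and the `route` function are invented for this example and are not part of any Amazon API.

```python
def action_movie_skill(utterance):
    """Hypothetical skill that plays an action movie."""
    return f"Playing the action movie {utterance}"


def weather_report_skill(utterance):
    """Hypothetical skill that reports the weather."""
    return f"Fetching the weather for {utterance}"


# Hypothetical registry mapping invocation names to skill handlers.
SKILLS = {
    "action movie": action_movie_skill,
    "weather report": weather_report_skill,
}


def route(invocation_name, utterance):
    """Pick the skill that matches the invocation name and pass it the utterance."""
    handler = SKILLS.get(invocation_name.lower())
    if handler is None:
        return "Sorry, I don't know that skill."
    return handler(utterance)


print(route("Action Movie", "Terminator 3"))
```

A table like this keeps the routing decision separate from what each skill does, so adding a new skill means adding one entry rather than changing the routing logic.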
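The last hop can be pictured as a registry of connected devices that forwards a command to the right one. The sketch below is a simplified assumption of that flow. The `Device` and `DeviceCloud` classes and their methods are hypothetical names for this example, and a real device cloud would communicate with devices over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical connected device that records the commands it receives."""
    name: str
    online: bool = True
    log: list = field(default_factory=list)

    def execute(self, command):
        self.log.append(command)
        return f"{self.name}: {command} done"


class DeviceCloud:
    """Hypothetical registry that forwards commands to online devices."""

    def __init__(self):
        self._devices = {}

    def register(self, device):
        self._devices[device.name] = device

    def dispatch(self, device_name, command):
        device = self._devices.get(device_name)
        if device is None or not device.online:
            return f"{device_name} is not reachable"
        return device.execute(command)


cloud = DeviceCloud()
cloud.register(Device("air conditioner"))
print(cloud.dispatch("air conditioner", "turn_on"))
print(cloud.dispatch("tv", "play_movie"))
```

The second call shows the case where the requested device is not registered, so the command is not delivered.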
Deepak Chhabra works as VP - Product Engineering at QSS Technosoft Pvt. Ltd.
QSS has a proven track record of executing web and mobile applications for its customers. The company has a core competency in developing and delivering enterprise-level applications using cloud computing services.