Rohit Prasad: Amazon Alexa and Conversational AI | Lex Fridman Podcast #57
Ad89JYS-uZM • 2019-12-14
The following is a conversation with Rohit Prasad. He's the Vice President and Head Scientist of Amazon Alexa, and one of its original creators. The Alexa team embodies some of the most challenging, incredible, impactful, and inspiring work that is done in AI today. The team has to both solve problems at the cutting edge of natural language processing and provide a trustworthy, secure, and enjoyable experience to millions of people. This is where state-of-the-art methods in computer science meet the challenges of real-world engineering. In many ways, Alexa and the other voice assistants are the voices of artificial intelligence to millions of people, and an introduction to AI for people who have only encountered it in science fiction. This is an important and exciting opportunity, so the work that Rohit and the Alexa team are doing is an inspiration to me and to many researchers and engineers in the AI community.

This is the Artificial Intelligence Podcast. If you enjoy it, subscribe on YouTube, give it five stars on Apple Podcasts, support it on Patreon, or simply connect with me on Twitter @lexfridman, spelled F-R-I-D-M-A-N. If you leave a review on Apple Podcasts especially, but also Castbox, or comment on YouTube, consider mentioning topics, people, ideas, questions, quotes in science, tech, or philosophy that you find interesting, and I'll read them on this podcast. I won't call out names, but I love comments with kindness and thoughtfulness in them, so I thought I'd share them. Someone on YouTube highlighted a quote from the conversation with Ray Dalio, where he said that you have to appreciate all the different ways that people can be A players. This connected with me: on teams of engineers, it's easy to think that raw productivity is the measure of excellence, but there are others. I've worked with people who brought a smile to my face every time I got to work in the morning. Their contribution to the team is immeasurable.
I recently started doing podcast ads at the end of the introduction. I'll do one or two minutes after introducing the episode, and never any ads in the middle that break the flow of the conversation. I hope that works for you; it doesn't hurt the listening experience.

This show is presented by Cash App, the number one finance app in the App Store. I personally use Cash App to send money to friends, but you can also use it to buy, sell, and deposit Bitcoin in just seconds. Cash App also has a new investing feature: you can buy fractions of a stock, say $1 worth, no matter what the stock price is. Brokerage services are provided by Cash App Investing, a subsidiary of Square, and member SIPC. I'm excited to be working with Cash App to support one of my favorite organizations, called FIRST, best known for their FIRST Robotics and LEGO competitions. They educate and inspire hundreds of thousands of students in over 110 countries, and have a perfect rating on Charity Navigator, which means the donated money is used to maximum effectiveness. When you get Cash App from the App Store or Google Play and use code LEXPODCAST, you'll get $10, and Cash App will also donate $10 to FIRST, which, again, is an organization that I've personally seen inspire girls and boys to dream of engineering a better world.

This podcast is also supported by ZipRecruiter. Hiring great people is hard, and to me is one of the most important elements of a successful mission-driven team. I've been fortunate to be a part of and lead several great engineering teams. The hiring I've done in the past was mostly through tools we built ourselves, but reinventing the wheel was painful. ZipRecruiter is a tool that's already available for you; it seeks to make hiring simple, fast, and smart. For example, Codable co-founder Gretchen Huebner used ZipRecruiter to find a new game artist to join her education tech company. By using ZipRecruiter's screening questions to filter candidates, Gretchen found it easier to focus on the best candidates, and finally hired the perfect person for the role, in less than two weeks, from start to finish. ZipRecruiter: the smartest way to hire. See why ZipRecruiter is effective for businesses of all sizes by signing up, as I did, for free at ziprecruiter.com/lexpod. That's ziprecruiter.com/lexpod.

And now, here's my conversation with Rohit Prasad.
Lex Fridman: In the movie Her — I'm not sure if you've ever seen it — a human falls in love with the voice of an AI system. Let's start at the highest philosophical level, before we get to deep learning and some of the fun things. Do you think what the movie Her shows is within our reach?

Rohit Prasad: I think, not specifically about Her, but what we are seeing is a massive increase in the adoption of AI assistants, of AI, in all parts of our social fabric. What I do believe is that the utility these AIs provide — some of the functionalities that are shown — is absolutely within reach.

Lex Fridman: Some of the functionality in terms of the interactive elements — but in terms of the deep connection that's purely voice-based, do you think such a close connection is possible with voice alone?

Rohit Prasad: It's been a while since I saw Her, but I would say, in terms of interactions which are human-like, in these AI assistants you also have to value what is superhuman. We as humans can be in only one place; AI assistants can be in multiple places at the same time — one with you on your mobile device, one at your home, one at work. So you have to respect those superhuman capabilities too. Plus, as humans, we have certain attributes we are very good at — we're very good at reasoning; AI assistants are not yet there. But in terms of AI assistants, what they're great at is computation and memory: it's near-infinite and pure. These are the attributes you have to start respecting. So the comparison with human-like, versus the other aspect, which is superhuman, has to be taken into consideration. I think we need to elevate the discussion beyond just human-like.

Lex Fridman: So there are certainly elements, as we just mentioned — Alexa is everywhere, computationally speaking, so this is a much bigger infrastructure than just the thing that sits there in the room with you. But it certainly feels to us mere humans that there's just another little creature there when you're interacting with it. You're not interacting with the entirety of the infrastructure; you're interacting with the device. The feeling is — okay, sure, we anthropomorphize things, but that feeling is still there. So what do we as humans — in the purity of the interaction with a smart assistant — what do you think we look for in that interaction?
Rohit Prasad: I think in certain interactions it will feel very much like a human, because it has a persona of its own, and in certain ones it won't. A simple example to think of: if you're walking through the house and you just want to turn your lights on and off, you're issuing a command — that's not a very human-like interaction, and that's where the AI shouldn't come back and have a conversation with you; it should simply complete the command. So it's a blend: we have to think about this not as human-human alone; it is human-machine interaction, and certain aspects of humans are needed, while certain situations demand it to be like a machine.

Lex Fridman: So — I told you it's going to be full of philosophy in parts — what is the difference between human and machine in that interaction? When we interact with humans, especially those who are our friends and loved ones, versus you and a machine that you also are close with?

Rohit Prasad: I think you have to think about the roles the AI plays, right? It differs from customer to customer and from situation to situation. I can speak from Alexa's perspective: it is a companion, a friend at times, an assistant, an advisor down the line. I think most AIs will have these kinds of attributes, and it will be very situational in nature. So where is the boundary? The boundary depends on the exact context in which you are interacting with the AI.

Lex Fridman: The depth and the richness of natural language conversation has been used, by Alan Turing, to try to define what it means to be intelligent. There's a lot of criticism of that kind of test, but what do you think is a good test of intelligence, in your view? In the context of the Turing test and Alexa, or the Alexa Prize — this whole realm — do you think about human intelligence, what it means to define it, what it means to reach that level?
Rohit Prasad: I do think the ability to converse is a sign of ultimate intelligence; I think there is no question about it. If you think about all aspects of humans: there are the sensors we have, and those are basically a data collection mechanism, and based on that we make some decisions with our sensory brains, right? From that perspective, there are elements we have to talk about — how we sense the world, and then how we act based on what we sense — and those elements machines clearly have. But then there's the other aspect, computation, where machines are way better. I also mentioned memory: again, it's near-infinite, depending on the storage capacity you have, and the retrieval can be extremely fast and pure, in the sense that there's no ambiguity about "who did I see when," right? Machines can remember that quite well. So again, on a philosophical level, I do subscribe to the view that being able to converse — and, as part of that, being able to reason based on the world knowledge you've acquired and the sensory knowledge that is there — is definitely very much the essence of intelligence. But intelligence can go beyond human-level intelligence, based on what machines are becoming capable of.
Lex Fridman: So what do you think, maybe stepping outside of Alexa, broadly as an AI field — what is a good test of intelligence? Put another way — outside of Alexa, because so much of Alexa is a product, an experience for the customer — on the research side, what would impress the heck out of you if you saw it? What is the test where you'd say, wow, this thing is now starting to encroach into the realm of what we loosely think of as human intelligence?

Rohit Prasad: So we're talking about AGI and human intelligence all together, right? In some sense — and I think we are quite far from that — an unbiased view I have is that Alexa's intelligence capability is a great test. There are many other proof points, like self-driving cars, or game playing like Go or chess. Take those two as examples: they clearly require a lot of data-driven learning and intelligence, but neither is as hard a problem as an AI conversing with humans to accomplish certain tasks, or open-domain chat, as you mentioned with the Alexa Prize. In those settings, the key difference is that the end goal is not defined, unlike in game playing. You also do not know exactly what state you are in, in a particular goal-completion scenario. Sometimes you can, if it is a simple goal, but take even examples like planning a weekend: you can imagine how many things change along the way. You look at the weather; you may change your mind and change the destination; or you want to catch a particular event, and then you decide, no, there's this other event I want to go to. These dimensions of how many different steps are possible when you're conversing as a human with a machine make it an extremely daunting problem, and I think it is the ultimate test for intelligence.
Lex Fridman: And don't you think natural language is enough to prove that — conversation?

Rohit Prasad: From a scientific standpoint, natural language is a great test, but I would go beyond. I don't want to limit it to natural language as simply understanding an intent or parsing for entities and so forth; we are really talking about dialogue. So I would say human-machine dialogue is definitely one of the best tests of intelligence.
Lex Fridman: Can you briefly speak to the Alexa Prize, for people who are not familiar with it, and also maybe where things stand, what you have learned, and what's been surprising? What have you seen that's surprising from this incredible competition?

Rohit Prasad: Absolutely. The Alexa Prize is essentially a grand challenge in conversational artificial intelligence, where we threw down the gauntlet to the universities who do active research in the field, to say: can you build what we call a social bot that can converse with you coherently and engagingly for 20 minutes? That is an extremely hard challenge. Talking to someone you're meeting for the first time — or even someone you've met quite often — for 20 minutes, on any topic, with an evolving set of topics, is super hard. We have completed two successful years of the competition: the first was won by the University of Washington, the second by the University of California. We are in our third instance; we have an extremely strong cohort of ten teams, the third instance of the Alexa Prize is underway now, and we are seeing a constant evolution. The first year was definitely a learning year — a lot of things had to be put together; we had to build a lot of infrastructure to enable these universities to build magical experiences and do high-quality research.
Lex Fridman: Just a few quick questions, sorry for the interruption: what does failure look like in the 20-minute session? What does it mean to fail — not to reach the 20 minutes?

Rohit Prasad: Awesome question. First of all, I forgot to mention one more detail: it's not just the 20 minutes, but the quality of the conversation that matters too. And the beauty of this competition — before I answer that question of what failure means — is that you actually converse with millions and millions of customers through these social bots. During the judging, there are multiple phases. Before we get to the finals, there is a very controlled judging situation where we bring in judges and we have interactors who interact with these social bots — that is a much more controlled setting. But up to the point we get to the finals, all the judging is essentially done by the customers of Alexa, and there you basically rate, on a simple question, how good your experience was. So there we are not testing for a 20-minute boundary being crossed, because you do want a very clear-cut winner to be chosen, and it's an absolute bar. "Did you really break the 20-minute barrier?" is why we have to test in a more controlled setting, with interactors essentially, and see how the conversation goes. This is the subtle difference between how it's tested in the field with real customers, versus in the lab to award the prize. In the latter, what it means is essentially that there are three judges, and two of them have to say the conversation has stalled.

Lex Fridman: Got it. And the judges are human experts?

Rohit Prasad: The judges are human experts.

Lex Fridman: Okay, great.
Lex Fridman: So this is the third year. What's been the evolution? Like in the DARPA Grand Challenge with autonomous vehicles: in the first year, nobody finished; in the second year, a few more finished in the desert. So how far along in this — I would say much harder — challenge are we?
Rohit Prasad: This challenge has come a long way, to the extent that — we're definitely not close to the 20-minute barrier with a coherent and engaging conversation; I think we are still five to ten years away from completing that. But the progress is immense. What you're finding is that the accuracy, and the kinds of responses these social bots generate, is getting better and better. What's even more amazing to see is that now there's humor coming in. You were talking about ultimate signs of intelligence: I think humor is a very high bar, in terms of what it takes to create humor, and I don't mean just being goofy. I really mean that a good sense of humor is a sign of intelligence, in my mind, and something very hard to do. So these social bots are now exploring not only what we think of as natural language abilities, but also personality attributes, and aspects like when to inject an appropriate joke, or — when you don't know the question or the domain — how you come back with something more intelligible so that you can continue the conversation. If you and I are talking about AI and we are domain experts, we can speak to it, but if you suddenly switch the topic to something I don't know, how do I change the conversation? You're starting to notice these elements as well, and that's coming partly from the nature of the 20-minute challenge: people are getting quite clever about how to really converse, and essentially mask some of the understanding defects, if they exist.
Lex Fridman: So some of this — this is not the Alexa product; this is somewhat for fun, for research, for innovation, and so on. I have a question about this modern era: you look at Twitter and Facebook and so on, there's public discourse going on, and some things are a little bit too edgy, people get blocked, and so on. Just out of curiosity: are people in this context pushing the limits? Is anyone using the F-word? Is anyone pushing back — arguing, I guess I should say — as part of the dialogue, to really draw people in?
Rohit Prasad: First of all, let me just back up a bit in terms of why we're doing this. You said it's fun. I think fun is more part of the engaging aspect for customers — it is one of the most-used skills in our skill store — but that apart, the real goal was this: with a lot of AI research moving to industry, we felt that academia runs the risk of not having the same resources at its disposal that we have, which is lots of data, massive computing power, and clear ways to test these AI advances with real customer benefit. So we brought all three of those together in the Alexa Prize; that's why it's one of my favorite projects at Amazon. And with that, the secondary effect is that, yes, it has become engaging for our customers as well. We're not there yet in terms of where we want it to be, but it's huge progress. But coming back to your question on how the conversations evolve: yes, there are some natural attributes of what you said, in terms of arguments and some amount of swearing. The way we take care of that is a sensitive filter we have built that catches these words. And it's more than keywords — of course there's a keyword basis too — but these words can be very contextual, as you can see, and the topic can also be something you don't want a conversation to happen about, because this is a communal device as well; a lot of people use these devices. So we have put in a lot of guardrails so the conversation is more useful for advancing AI, and not so much for these other issues you attributed to what's happening in the AI field as well.
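The two-layer filter Rohit describes — an exact keyword layer plus a contextual, topic-level layer — can be sketched roughly as follows. This is a minimal hypothetical illustration, not Amazon's implementation; the function names, word lists, and the stand-in topic classifier are all invented for the example.

```python
# Hypothetical sketch of a sensitive-content filter with two layers:
# an exact keyword match and a contextual topic check, as described above.

BLOCKED_KEYWORDS = {"swearword1", "swearword2"}   # placeholder keyword layer
SENSITIVE_TOPICS = {"violence", "adult"}          # topics to steer away from

def classify_topic(text: str) -> str:
    """Stand-in for a real topic classifier (in practice, a trained model)."""
    if "fight" in text.lower():
        return "violence"
    return "general"

def is_allowed(response: str) -> bool:
    words = set(response.lower().split())
    if words & BLOCKED_KEYWORDS:                  # keyword layer hit
        return False
    if classify_topic(response) in SENSITIVE_TOPICS:  # contextual layer hit
        return False
    return True

print(is_allowed("let's talk about the weather"))  # True
print(is_allowed("they got into a fight"))         # False
```

The point of the second layer is exactly what Rohit notes: an innocuous word set can still form a sensitive utterance, so a keyword list alone is not enough.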
Lex Fridman: So this is actually a serious opportunity — I didn't use the right word, "fun." I think it's an open opportunity to do some of the best innovation in conversational agents in the world.

Rohit Prasad: Absolutely.

Lex Fridman: Why just universities?

Rohit Prasad: Because, as I said, I really felt — young minds. It's also, if you think about the other aspect of where the whole industry is moving with AI, there's a dearth of talent given the demands. So you do want universities to have a clear place where they can invent and research and not fall behind; without that, they can't motivate students. Imagine if all grad students left to industry like us, or faculty members — which has happened too. So if you're passionate about a field where you feel industry and academia need to work well together, this is a great example and a great way for universities to participate.
Lex Fridman: So what do you think it takes to build a system that wins the Alexa Prize?

Rohit Prasad: I think you have to start focusing on aspects of reasoning. There are still more lookups of what intents customers are asking for, and responses to those, rather than real reasoning about the elements of the conversation. For instance, if the conversation is about games, and it's about a recent sports event, there's so much context involved, and you have to understand the entities that are being mentioned so that the conversation is coherent — rather than suddenly switching to some fact you know about a sports entity and just relating that, without understanding the true context of the game. If you just said, "I learned this fun fact about Tom Brady," rather than really saying how he played the game the previous night, then the conversation is not really that intelligent. So you have to get to the reasoning elements of understanding the context of the dialogue and giving more appropriate responses — which tells you that we are still quite far, because a lot of the time it's facts being looked up, and something that's close enough as an answer, but not really the answer. That is where the research needs to go: toward actual, true understanding and reasoning. And that's why I feel this is a great way to do it, because you have an engaged set of users working to help these AI advances happen.
Lex Fridman: In this case, the people helping are actually customers — there are quite a few, and there's a skill. What is the experience like for the user who is helping? Just to clarify: as far as I understand, this skill is a standalone for the Alexa Prize — it's focused on the Alexa Prize. It's not you ordering things on Amazon.com, or checking the weather, or playing Spotify; those are separate skills entirely. So you're focused on helping — well, I don't know: how do customers think of it? Are they having fun? Are they helping teach the system? What's the experience like?
Rohit Prasad: I think it's both, actually. Let me tell you how you invoke this skill: all you have to say is, "Alexa, let's chat." The first time you say "Alexa, let's chat," it comes back with a clear message that you're interacting with one of those three social bots, so you know exactly how you're interacting — that is why it's very transparent: you are being asked to help. And we have a lot of mechanisms: when we are in the feedback phase of the competition, we send a lot of emails to our customers, and they know that the teams need a lot of interactions to improve the accuracy of their systems. So we have a lot of customers who really want to help these university bots, and they converse with them. Some are just having fun, just saying "Alexa, let's chat," and there's also some adversarial behavior — seeing how much you understand as a social bot. So I think we have a good, healthy mix of all three situations.
Lex Fridman: So — if we talk about solving the Alexa challenge, the Alexa Prize — what does the dataset of really engaging, pleasant conversations look like? If we think of this as a supervised learning problem — I don't know if it has to be, but if it does, maybe you can comment on that — do you think there needs to be a dataset of what it means to be an engaging, successful, fulfilling conversation?

Rohit Prasad: That's part of the research question here. I think we at least got the first part right, which is having a way for universities to build and test in a real-world setting. Now you're asking about the next phase of questions — which we are still asking as well, by the way: what does success look like, from an optimization-function standpoint? We as researchers are used to having a great corpus of annotated data and then, you know, tuning our algorithms on it, right? Fortunately or unfortunately, in this world of the Alexa Prize, that is not the way we are going after it. You have to focus more on learning from live feedback — that is another element that's unique. I started by telling you how you enter and experience this capability as a customer. What happens when you're done? You're asked a simple question: on a scale of one to five, how likely are you to interact with this social bot again? That is good feedback, and customers can also leave more open-ended feedback. Partly, that to me is one part of the question you're asking — what I'm saying is that it's a mental-model shift: as researchers, you have to change your mindset. This is not a DARPA evaluation or an NSF-funded study where you have a nice corpus. This is the real world: you have real data, and the scale is amazing.
Lex Fridman: It's a beautiful thing. And the customer — the user — can quit the conversation at any time?

Rohit Prasad: Exactly — when the user quits, that is also a signal for how good you were at that point.

Lex Fridman: And on the scale of one to five, do they say how likely they are to chat again, or is it just binary?

Rohit Prasad: One to five.

Lex Fridman: One to five — wow. Okay, that's such a beautifully constructed challenge.
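The live-feedback loop described above — an explicit one-to-five "would you chat again?" rating, plus early quits as an implicit signal — could be aggregated per social bot along these lines. This is a hypothetical sketch for illustration; the class, field names, and quit-rate penalty are invented, not the Alexa Prize's actual scoring.

```python
# Hypothetical sketch: aggregating explicit 1-5 ratings and early-quit
# events into a per-bot score, as in the live-feedback loop described above.

from dataclasses import dataclass, field

@dataclass
class BotStats:
    ratings: list = field(default_factory=list)  # explicit 1-5 scores
    quits: int = 0                               # early-exit count
    sessions: int = 0

    def record(self, rating=None, quit_early=False):
        """Log one conversation's outcome."""
        self.sessions += 1
        if rating is not None:
            self.ratings.append(rating)
        if quit_early:
            self.quits += 1

    def score(self) -> float:
        """Mean rating, lightly penalized by the early-quit rate
        (the penalty weight 0.5 is an arbitrary illustrative choice)."""
        if not self.ratings:
            return 0.0
        mean = sum(self.ratings) / len(self.ratings)
        quit_rate = self.quits / self.sessions
        return mean * (1 - 0.5 * quit_rate)

bot = BotStats()
bot.record(rating=4)
bot.record(rating=5)
bot.record(quit_early=True)
print(round(bot.score(), 2))  # 3.75
```

The design point this mirrors is Rohit's: there is no fixed annotated corpus, so the "objective function" has to be assembled from whatever signals live customers emit.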
Lex Fridman: Okay — you said the only way to make a smart assistant really smart is to give it eyes and let it explore the world. I'm not sure — it might have been taken out of context — but can you comment on that? Can you elaborate on that idea? I personally find that idea super exciting, from a social robotics, personal robotics perspective.
Rohit Prasad: Yeah, a lot of things do get taken out of context. That particular one was from a philosophical discussion we were having about what intelligence looks like, and the context was learning. As I said, we as humans are empowered with many different sensory abilities, and I do believe that eyes are an important aspect of it. If you think about how we as humans learn, it is quite complex, and it's also not unimodal — it's not that you are fed a ton of text or audio and you just learn that way. No: you learn by experience, you learn by seeing, you're taught by humans, and we're very efficient in how we learn. Machines, on the contrary, are very inefficient in how they learn — especially these AIs. I think the next wave of research is going to be with less data: not just with less labeled data, but also with a lot of weak supervision, and where you can increase the learning rate. I don't mean less data in the sense of not having a lot of data to learn from — we are generating so much data — it is more about how fast you can learn.

Lex Fridman: So improving the quality of the data, and the learning process?

Rohit Prasad: I think more the learning process. We as humans learn from a lot of noisy data, right? And I don't think that part should change. What should change is how we learn. You mentioned supervised learning: we are making transformative shifts, moving to more unsupervised learning, more weak supervision. Those are the key aspects of how to learn, and in that setting, I hope you agree with me that having other senses is very crucial in terms of how you learn.
how you learn so absolutely and from a
machine learning perspective which I
hope we get a chance to talk to a few
aspects that are fascinating there but
just stick on the point a sort of a body
you know an embodiment so Alexa has a
body is a very minimalistic beautiful
interface or there's a ring and so on I
mean I'm not sure of all the flavors of
the devices that Alyssa lives on but
there's a minimalistic basic interface
and nevertheless we humans so I have a
Roomba of all kinds of robots and all
over everywhere so what do you think the
Alexa the future looks like if it begins
to shift what his body looks like what
uh what may be beyond the Alexa what do
you think are the different devices in
the home as they start to embody their
intelligence more and more what do you
think that looks like philosophically a
future what do you think that looks I
Rohit Prasad: Let's look at what's happening today. You mentioned, I think, our devices — Amazon devices — but I also want to point out that Alexa is already integrated into a lot of third-party devices, which also come in lots of forms and shapes: some in robots, some in microwaves, some in appliances that you use in everyday life. So it's not just the shape Alexa takes in terms of form factors, but also where it's available: it's getting into cars, into different appliances in homes, even toothbrushes. So you have to think about it not as a physical assistant. It will be in some embodiment, as you said — we already have these nice devices — but it's also important to think of it as a virtual assistant. It is superhuman in the sense that it is in multiple places at the same time. So the actual embodiment, in some sense, to me doesn't matter. You have to think of it not as human-like, but in terms of what its capabilities are that drive a lot of benefit for customers, and the different ways to delight customers with different experiences. I am a big fan of it not being just human-like. It should be human-like in certain situations — the Alexa Prize social bot, in terms of conversation, is a great example — but there are other scenarios where human-like, I think, is underselling the abilities of this AI.
Lex Fridman: If I could trivialize what we're talking about: if you look at the way Steve Jobs thought about the interaction with the devices that Apple produced, there was an extreme focus on controlling the experience, by making sure there were only Apple-produced devices. You see the voice of Alexa taking all kinds of forms, depending on what the customers want, and that means it could be anywhere from the microwave to a vacuum cleaner to the home, and so on — the voice is the essential element of the interaction.
think voice is an essence it's not all
but it's a key aspect I think to your
question in terms of you should be able
to recognize Alexa and that's a huge
problem I think in terms of a huge
scientific problem I should say like
what are the traits what makes it look
like Alexa especially in different
settings and especially if it's
primarily voice what it is but LX is not
just voice either right I mean we have
devices with a screen now you're seeing
just other behaviors of Alexa so I think
they're in very early stages of what
that means and this will be an important
profit for the following years but I do
believe that being able to recognize and
tell when it's Alexa versus it's not as
going to be important from an Alexa
perspective I'm not speaking for the
entire AI Thank You Marie but from but I
think attribution and as we go into more
of understanding who did what that
identity of the AI is crucial in the
coming world I think from the broad AI
community perspective that's also a
fascinating problem so basically if I
close my eyes and listen to the voice
what would it take for me to recognize
that this is Alexa exactly or at least
the Alexa that I've come to known from
my personal experience in my home
through my interactions that Korea and
the Alexa here in the u.s. is very
different the Alexa and UK and Alexa
India even though they are all speaking
English or the Australian version so
again we're so now think about when you
go into a different culture different
community but you travel there
what do you recognize Alexa I think
these are super hard questions actually
Lex Fridman: So there's a team that works on personality. If we talk about those different flavors of what it means, culturally speaking — India, the U.K., the U.S. — the problem we just stated, which is fascinating: how do we make it purely recognizable that it's Alexa, assuming that the qualities of the voice are not sufficient? It's also the content of what is being said. How do we do that? How does the personality come into play? What would that research look like?
have some very fascinating folks who
from both the UX background and human
factors are looking at these aspects and
these exact questions but I'll
definitely say it's not just how it
sounds the choice of words the tone not just the voice identity of it but the tone matters the speed matters how you speak how you enunciate words what choice of words you are using how terse you are or how expansive your explanations are all of these are factors and you also mentioned something crucial you may have personalized Alexa to some extent in your homes or in the devices
you are interacting with so
you as an individual how you prefer Alexa to sound can be different than how I prefer it and the amount of customizability you want to give is also a key debate we always have but I do want to point out it's more than a voice actor having recorded it and it sounding like that actor it is more about the choices of words the attributes of tonality the volume in terms of how you
raise your pitch and so forth all of
that matters this is a fascinating
problem from a product perspective I
could see those debates just happening
inside of the Alexa team of how much
personalization do you do for the
specific customer because you're taking a risk if you over personalize if you create one personality for a million people you can test that better you can create a rich fulfilling experience that will do well but the more you personalize it the less you can test it the less you can know that it's a great experience so how much
personalization what's the right balance
I think the right balance depends on the
customer give them the control so I'd
say I think the more control you give
customers the better it is for everyone
and I'll give you some key
personalization features I think we have
a feature called remember this which is
where you can tell Alexa to remember
something so you have an explicit sort of control in the customer's hands because they have to say remember XYZ what kind of things would that be used for so for instance I have stored my tire specs for my car
nice because it's so hard to go and find
and see what it is right when you're
having some issues I store my mileage plan numbers for all the frequent-flyer programs where sometimes you're just looking for it and it's not handy so those are my own personal choices asking for Alexa to remember something on my behalf right
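The remember-this behavior described here amounts to a per-customer store of free-form facts with explicit opt-in; a minimal sketch in Python (all names and the keyword-matching retrieval are hypothetical illustrations, not Alexa's actual implementation):

```python
# Minimal sketch of a "remember this" style memory. All names are
# hypothetical illustrations, not Alexa's actual implementation.

_STOPWORDS = {"my", "the", "a", "is", "are", "what", "for"}


class MemoryStore:
    """Stores free-form facts per customer, retrieved by keyword overlap."""

    def __init__(self):
        self._facts = {}  # customer_id -> list of remembered strings

    def remember(self, customer_id, fact):
        # Explicit control: a fact is stored only when the customer asks.
        self._facts.setdefault(customer_id, []).append(fact)

    def recall(self, customer_id, query):
        # Return facts sharing a meaningful word with the query, newest first.
        words = set(query.lower().split()) - _STOPWORDS
        matches = [f for f in self._facts.get(customer_id, [])
                   if words & (set(f.lower().split()) - _STOPWORDS)]
        return list(reversed(matches))


store = MemoryStore()
store.remember("lex", "my tire specs are 235/45R18")
store.remember("lex", "my mileage plan number is AB12345")
print(store.recall("lex", "what are my tire specs"))
```

The design point the conversation stresses is the explicit trigger: nothing enters the store unless the customer says to remember it.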
so again I think the choice was to be explicit about how you provide that to a customer as a control so these are the aspects think about where we can use speaker recognition capabilities if you taught Alexa that you are Lex and this other person in your household is person two then you can personalize the experiences again
the CX the customer experience patterns are very clear and transparent about when a personalization action is happening and then you have other ways of explicit control right now through your app if you have multiple service providers let's say for music which one is your preferred one so when you say play Sting depending on whether you have preferred Spotify or Amazon Music or Apple Music the decision is made where to play it from
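Routing a play Sting request to the customer's preferred service is essentially a per-customer default lookup; a small sketch (provider names and the fallback default are illustrative assumptions, not Alexa's actual logic):

```python
# Sketch of preferred-provider routing for a request like "play Sting".
# Provider names and the default are illustrative, not Alexa's actual logic.

DEFAULT_PROVIDER = "Amazon Music"


def route_music_request(artist, preferences):
    """Choose which service handles playback from the customer's preference."""
    provider = preferences.get("music", DEFAULT_PROVIDER)
    return {"action": "play", "artist": artist, "provider": provider}


request = route_music_request("Sting", {"music": "Spotify"})
print(request["provider"])  # the customer's stated preference wins
```

When no preference has been set, the request falls back to a default, which is exactly the kind of control-in-the-app decision being described.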
so what's Alexa's backstory from her perspective I remember just asking as probably a lot of us have the basic questions about love and so on of Alexa just to see what the answer would be it feels like there's a little bit of personality but not too much does Alexa have a metaphysical presence in this human universe we live in or is it something more ambiguous is there a past is there a birth is there a family kind of idea even for joking purposes and so on
I think it does tell you you should double-check this but if you said when were you born I think we do respond I need to double check that but I'm pretty positive about it that's like saying I was born like a brand of champagne in whatever the year yeah so in terms of the metaphysical I think it's early does it have the historical knowledge about herself to be able to do that maybe have we crossed that boundary not yet we have thought about it quite a bit but I wouldn't say that we have come to a clear decision in terms of what it
should look like but you can imagine
though and I bring this back to the Alexa Prize social bots there you will start seeing some of that these bots have their identity and in terms of that you may find you know this is such a great research topic that some academic team may think of these problems and start solving them too so let me ask a question it's kind of
difficult I think but it feels
fascinating to me because I'm fascinated
with psychology it feels that the more
personality you have the more dangerous
it is from a customer perspective if you want to create a product that's useful by dangerous I mean creating an experience that upsets me so how do you get that right
because if you look at the relationships
maybe I'm just a screwed-up Russian but
if you look at the real human to human
relationship some of our deepest
relationships have fights have tension
have the push and pull have a little
flavor in them do you want to have such
flavor in an interaction with Alexa how
do you think about that so there's one other common thing that you didn't say but that we think of as paramount for any deep relationship that's trust trust yeah so every attribute you said a fight some tension is all healthy but what is sort of non-negotiable in this instance is trust and I think the bar to earn customer trust for AI is very high in some sense more than for a human it's
it's not just about personal information
or your data it's also about your
actions on a daily basis how trustworthy
are you in terms of consistency in terms
of how accurate are you in understanding
me like if you're talking to a person
on the phone if you have a problem with
your let's say your internet or
something if the person is not
understanding you lose trust right away
you don't want to talk to that person
that whole example gets amplified by a factor of 10 because when you're a human interacting with an AI you have a certain expectation either you expect it to be very intelligent and then you get upset why is it behaving this way or you expect it to be not so intelligent and when it surprises you you're like really you're trying to be too smart so I think
we grapple with these hard questions as
well but I think the key is actions need to be trustworthy from these AIs it's not just about data protection and personal information protection but also how accurately it accomplishes all commands or all interactions well it's tough to hear because you're absolutely right but trust is such a high bar with AI systems because people and I see this because I work with autonomous vehicles the bar that is placed on AI systems is unreasonably high yeah I agree with you and I think of it as a challenge and it also keeps my job so from that
perspective I think of it from both sides as a customer and as a researcher as a researcher yes occasionally it will frustrate me why is the bar so high for these AIs and as a customer I say absolutely it has to be that high so I think that's the trade-off we have to balance but it doesn't change the fundamentals that trust has to be earned and the question then becomes are we holding the AIs to a different bar on accuracy and mistakes than we hold humans that's going to be a great societal question for years to come well
one of the questions that we grapple with as a society now that I think about a lot and that Alexa is taking on head-on is privacy the reality is the data we give over to any AI system can be used to enrich our lives in profound ways basically any product that does anything awesome for you the more data it has the more awesome things it can do and yet on the other side people imagine the worst possible scenario of what could be done with that data it goes down to trust as you said there's a fundamental distrust in certain groups of government and so on depending on the government depending on who is in power depending on all these kinds of factors and so here's Alexa in the middle of all of it in the home trying to do good things
for the customers so how do you think
about privacy in this context the smart
assistants in the home how do you
maintain how do you earn trust
absolutely so as you said Trust is the
key here so you start with trust and
then privacy is a key aspect of it it has to be designed in from the very beginning and we believe in two fundamental principles one is transparency and second is control by transparency I mean when we built what is now called the smart speaker the first Echo we were quite judicious about making the right trade-offs on customers' behalf it is pretty clear when the audio is being sent to the cloud the light ring comes on when it has heard you say the wake word and then the streaming happens right so
the light ring comes on we also put a physical mute button on it just so if you didn't want it to be listening even for the wake word then you turn the mute button on and that disables the microphones that's just the first decision on essentially transparency and control then even when we launched we gave the control into the hands of the customers you can go and look at any of your individual utterances that are recorded and delete them anytime and we have held true to that promise right so that is again a great instance of showing how you have the control then we made it even easier you can say Alexa delete what I said today so that is putting even more control in your hands using what's most convenient about this technology voice you delete it with your voice now
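Delete what I said today amounts to filtering a customer's stored utterances by day; a toy sketch of that control (a hypothetical illustration, not Alexa's actual data model):

```python
# Toy sketch of "Alexa, delete what I said today": drop a customer's stored
# utterances for a given day. Hypothetical illustration only.
from datetime import date


class UtteranceLog:
    def __init__(self):
        self._log = []  # list of (customer_id, day, text)

    def record(self, customer_id, day, text):
        self._log.append((customer_id, day, text))

    def delete_for_day(self, customer_id, day):
        """Remove everything this customer said on the given day."""
        before = len(self._log)
        self._log = [(c, d, t) for c, d, t in self._log
                     if not (c == customer_id and d == day)]
        return before - len(self._log)  # how many utterances were deleted


log = UtteranceLog()
today = date(2019, 12, 14)
log.record("lex", today, "play Sting")
log.record("lex", today, "what's the weather")
print(log.delete_for_day("lex", today))
```

The deletion is scoped to one customer and one day, mirroring the voice command being described.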
so these are the types of decisions we continually make we just recently launched a feature for if you want humans not to review your data because you mentioned supervised learning in supervised learning humans have to give some annotation and that is now a feature where essentially if you select that flag your data will not be reviewed by a human so these are the types of controls that we have to constantly offer to customers so
everything you just said is really
powerful the control the ability to delete because we have studies running here at MIT that collect huge amounts of data and people consent and so on the ability to delete that data is really empowering and almost nobody ever asks to delete it but the ability to have that control is really powerful but
still you know there's this popular anecdotal evidence people like to tell that they and a friend were talking about something I don't know sweaters for cats and all of a sudden they'll have advertisements for cat sweaters on Amazon that's a popular anecdote as if something is always listening can you explain that anecdote that experience that people have what's the psychology of that what's that experience you've answered it but let me just ask is Alexa
listening no Alexa listens only for the wake word on the device right and a wake word is a word like Alexa Amazon or Echo but you only choose one at a time you choose one and it listens only for that on our devices so that's first from a listening perspective we have to be very clear that it's just the wake word so you said why is there this anxiety it's because there's a lot of confusion about what it really listens to right and I think it's partly on us to keep educating our customers and the general media in terms of what really happens and we've done a lot of it and our information pages are clear but still there's always a hunger for information and clarity and we'll constantly look at how best to communicate if you go back and read everything yes it states exactly that
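The on-device behavior described above, one chosen wake word gating whether anything streams to the cloud, can be caricatured as a tiny gate (a hedged sketch with hypothetical names; the real detector runs on audio signals, not text):

```python
# Caricature of on-device wake-word gating: audio stays local unless the
# single configured wake word is detected. Hypothetical sketch; the real
# detector operates on audio signals, not text.

WAKE_WORDS = {"alexa", "amazon", "echo"}  # the customer picks exactly one


class WakeWordGate:
    def __init__(self, wake_word):
        if wake_word not in WAKE_WORDS:
            raise ValueError("unsupported wake word: " + wake_word)
        self.wake_word = wake_word

    def should_stream(self, heard_text):
        # Only the chosen wake word opens the stream to the cloud.
        return self.wake_word in heard_text.lower().split()


gate = WakeWordGate("alexa")
print(gate.should_stream("alexa what's the weather"))  # opens the stream
print(gate.should_stream("echo what's the weather"))   # ignored, not chosen
```

The point the conversation makes is the default-closed design: ambient conversation with no wake word never leaves the device.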
and then people could still question it and I think that's absolutely okay to question what we have to make sure is that our fundamental philosophy is customer first customer obsession is our leadership principle as researchers I put myself in the shoes of the customer and all decisions in Amazon are made with that and trust has to be earned and we have to keep earning the trust of our customers in this setting and to
your other point on is there something showing up based on your conversations no I think the answer is a lot of times when those experiences happen you have to also know that okay maybe it's winter season and people are looking for sweaters and it shows up on amazon.com because it is popular so there are many of these you mentioned personality or personalization it turns out we are not that unique either right so we as humans start thinking oh it must be because something was heard and that's why this other thing showed up the answer is no probably it is just the season for sweaters I'm not gonna ask
you this question because people have so much paranoia but I'll say from my perspective I hope there's a day when a customer can ask Alexa to listen all the time to improve the experience because I personally don't see the negative because if you have the control and if you have the trust there's no reason why it shouldn't be listening all the time to the conversations to learn more about you because ultimately as long as you have control and trust every data you provide that the device wants is going to be useful to me as a machine learning person I
think it worries me how sensitive people
are about their data relative to how
empowering it could be for the devices
around them how enriching it could be
for their own life to improve
the product so it's something I think about a lot how do we make better devices obviously the Alexa team thinks about it a lot as well I don't know if you want to comment on that okay let me ask it in the form of a question have you seen an evolution in the way people think about their private data in the previous several years as we as a society get more comfortable with the benefits we get by sharing more data first let me
answer that part and then I want to go back to the other aspect you were mentioning so in general we are getting more comfortable as a society that doesn't mean that everyone is and I think we have to respect that I don't think one-size-fits-all is always gonna be the answer by definition so I think that's
something to keep in mind going back to your point on what more magical experiences can be launched in these kinds of AI settings I think again if you give the control it's possible for certain parts of it we have a feature called follow-up mode where if you turn it on Alexa after you've spoken to it will open the mic again thinking you may ask something again like if you're adding items to your shopping list or to-do list you're not done you want to keep going so in that setting it's awesome that it opens the mic for you to say eggs and milk and then bread right so these are the kinds of things you can empower and
then another feature we have which is called Alexa Guard I said it only listens for the wake word but let's say you leave your home and you want Alexa to listen for a couple of sound events like a smoke alarm going off or someone breaking your glass just to keep your peace of mind so you can say Alexa I'm on guard or I'm away and then it can be listening for these sound events and when you're home you come out of that mode right
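The Guard-style behavior just described, an explicit away mode that watches only a short list of sound events, can be sketched as a small state machine (event names and the API are hypothetical illustrations, not Alexa's actual implementation):

```python
# State-machine sketch of a Guard-style away mode: at home only the wake
# word matters; when away, a short list of sound events is watched for.
# Event names and the API are hypothetical illustrations.

GUARD_EVENTS = {"smoke_alarm", "glass_break"}


class GuardMode:
    def __init__(self):
        self.away = False  # home by default

    def set_away(self, away):
        # Explicit customer control, e.g. "I'm away" / coming home.
        self.away = away

    def handle_event(self, event):
        """Alert only when away and the event is on the watched list."""
        if self.away and event in GUARD_EVENTS:
            return "alert: " + event + " detected"
        return None  # ignored when home, or for unwatched events


guard = GuardMode()
guard.set_away(True)
print(guard.handle_event("glass_break"))
```

Again the control is explicit: the customer switches the mode on and off, and nothing outside the short watched list triggers anything.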
so this is another one where you again gave controls into the hands of the user or the customer to enable some experiences that have higher utility and maybe are even more delightful in certain settings like follow-up mode and so forth again the general principle is the same control in the hands of the customer so I know we kind of started with a lot of philosophy and a lot of interesting topics and we're just jumping all over the place but there are some really fascinating things the Alexa team and Amazon are doing on the algorithm side the data side the technology the deep learning machine learning and so on so can you give a brief history of Alexa from the perspective of just innovation the algorithms the data of how it was born how it came to be how it's grown where it is today yeah
in Amazon everything starts with the
customer and we have a process called
working backwards for Alexa and more specifically the product Echo there was a working backwards document essentially that reflected what it would be it started with a very simple vision statement that morphed into a full-fledged document along the way changing into all that it can do but the inspiration was the Star Trek computer so when you think of it that way you know everything is possible but when you launch a product you have to start someplace and when I joined the product was already in conception and we started working on far-field speech recognition because that was the first thing to solve by that we mean that you should be able to
that we mean that you should be able to
Resume
Read
file updated 2026-02-13 13:24:23 UTC
Categories
Manage