David Ferrucci: The Story of IBM Watson Winning in Jeopardy | AI Podcast Clips
4Hx15WVxvII • 2019-10-12
So one of the greatest accomplishments in the history of AI is Watson competing against humans in a game of Jeopardy, and you were a lead on that, a critical part of that. So let's start with the very basics: what is the game of Jeopardy, the game for us humans, human versus human?

Right, so it's to take a question and answer it — actually, no, it's not. It's really to get a question and answer it, but it's what we call factoid questions. The notion is that the question relates to some fact, and few people would argue about whether the fact is true or not; Jeopardy kind of counts on the idea that these statements have factual answers. And the idea is, first of all, to determine whether or not you know the answer, which is sort of an interesting twist.

So first of all you have to understand the question.

You have to understand the question — what is it asking? And that's a good point, because the questions are not asked directly. The way the questions are asked is nonlinear: it's a little bit witty, a little bit playful, sometimes a little bit tricky.

Yeah, they're asked in numerous witty, tricky ways.

Exactly. What they're asking is not obvious. It takes even experienced humans a while to go, "what is it even asking?" And it's sort of an interesting realization. You think, "oh, Jeopardy is a question-answering show, and I know a lot" — and then you read the clue, and you're still trying to process the question while the champions have answered it and moved on. They're three questions ahead by the time you've figured out what the question even meant.

So there's definitely an ability there, just to parse out what the question even is.

That was certainly challenging. It's interesting, historically, though: if you look back at the Jeopardy games much earlier — like the sixties — the questions were much more direct. They weren't quite like that. They got more and more interesting over time — more subtle and nuanced and humorous and witty — which really required the human to make the right connections in figuring out what the question was even asking. So yeah, you have to figure out what the question is even asking, then you have to determine whether or not you think you know the answer, and because you have to buzz in really quickly, you sort of have to make that determination as quickly as you possibly can. Otherwise you lose the opportunity to buzz in.

You end up buzzing in before you really know whether you know the answer.

I think a lot of humans will look at the question very superficially — what's the topic, what are some keywords — and just ask, "do I know this area or not?" before they actually know the answer. They'll buzz in, and then they'll think about it. It's interesting what humans do. Now, some people — Ken Jennings, or the more recent big Jeopardy players — seem to just know almost all of Jeopardy. Watson, interestingly, didn't even come close to knowing all of Jeopardy, even at its peak. For example, we had this metric called recall, which is: of all the Jeopardy questions, for how many could we even find the right answer anywhere in our sources? We had a big body of knowledge, on the order of several terabytes — from a web scale actually very small, but from a book scale we're talking about millions of books: encyclopedias, dictionaries, a ton of information. And for, I think, only about 85% of the questions was the answer anywhere to be found.
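That corpus-recall measurement — for what fraction of questions can the gold answer be found anywhere in the retrieved sources — can be sketched roughly as follows. The tiny corpus, question set, and keyword-overlap retrieval here are invented placeholders for illustration, not Watson's actual components.

```python
# Rough sketch of the "recall" measurement described above: for what
# fraction of questions does the gold answer appear anywhere in the
# retrieved text? Corpus, questions, and retrieval are toy stand-ins.

def retrieve(question, corpus):
    """Toy retrieval: return every document sharing a word with the question."""
    q_words = set(question.lower().split())
    return [doc for doc in corpus if q_words & set(doc.lower().split())]

def corpus_recall(questions, corpus):
    """Fraction of questions whose gold answer occurs in some retrieved doc."""
    found = sum(
        1
        for question, gold in questions
        if any(gold.lower() in doc.lower() for doc in retrieve(question, corpus))
    )
    return found / len(questions)

corpus = [
    "Emily Dickinson wrote nearly 1800 poems in Amherst",
    "The capital of France is Paris",
]
questions = [
    ("This poet wrote nearly 1800 poems in Amherst", "Emily Dickinson"),
    ("This city is the capital of France", "Paris"),
    ("This element has atomic number 79", "gold"),  # answer not in corpus
]
print(corpus_recall(questions, corpus))  # 2 of 3 answerable from this corpus
```

At Watson's scale the same idea applied: if the answer is simply not present in the sources, no amount of downstream cleverness can recover it, which is why source expansion mattered so much.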
So you're already down at that level just to get started. And so it was important to get a very quick sense of: do you think you know the right answer to this question? We had to compute that confidence as quickly as we possibly could. So in effect we had to answer it — at least spend some time essentially answering it — then judge the confidence that our answer was right, and decide whether or not we were confident enough to buzz in. And that would depend on what else was going on in the game, because there was risk involved. If you're in a situation where you have to take a guess and you have very little to lose, then you'll buzz in with less confidence.

So that factored in the financial standings of the different competitors?

Correct — how much money was at stake, how much time was left, where you were in the standings, things like that.

How many hundreds of milliseconds are we talking about here? Do you have a sense of what the targets were?

Yes — we targeted answering, and buzzing in, in under three seconds.

So the decision to buzz in and the actual answering — are those two different stages?

Yes, there were two different things. In fact we had multiple stages: we would estimate our confidence, which was a sort of shallow answering process, then ultimately decide to buzz in, and then we might take another second or so to go in there and refine the answer. But by and large we were saying: we can't play the game, we can't even compete, if we can't on average answer these questions in around three seconds or less.
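The confidence-gated buzz decision described above might look something like this minimal sketch. The threshold values and game-state fields are invented for illustration; Watson's actual betting and buzzing strategy was far more elaborate.

```python
# Minimal sketch of the buzz-in decision: estimate a confidence for the
# best candidate answer, then buzz only if it clears a threshold that
# depends on game state. Thresholds and fields here are invented.

def buzz_threshold(my_score, leader_score, clues_remaining):
    """Lower the bar when trailing late in the game: a guess risks little."""
    desperate = my_score < leader_score and clues_remaining <= 5
    return 0.30 if desperate else 0.65

def decide_buzz(confidence, my_score, leader_score, clues_remaining):
    return confidence >= buzz_threshold(my_score, leader_score, clues_remaining)

# Comfortable lead, mediocre confidence: stay quiet.
print(decide_buzz(0.5, my_score=20000, leader_score=12000, clues_remaining=3))  # False
# Far behind with few clues left: take the gamble.
print(decide_buzz(0.5, my_score=4000, leader_score=20000, clues_remaining=3))   # True
```

The key point from the conversation is that the same confidence estimate can lead to opposite decisions depending on the standings and time remaining.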
So there are these three humans playing a game, and you stepped in with the idea that IBM Watson would replace one of the humans and compete against the other two. Can you tell the story of Watson taking on this game? It seems exceptionally difficult.

Sure. The story was that it was coming up on, I think, the ten-year anniversary of Deep Blue — not Big Blue, Deep Blue — and IBM wanted to do another kind of really fun public challenge that could bring attention to IBM Research and the cool stuff we were doing. I had been working in AI at IBM for some time. I had a team doing what's called open-domain factoid question answering, which is: we're not going to tell you what the questions are, we're not even going to tell you what they're about — can you go off and get accurate answers to these questions? It was an area of AI research that I was involved in, and it was a very specific passion of mine. Language understanding had always been a passion of mine, and one narrow slice of whether you could do anything with language was this notion of "open domain," meaning I could ask anything about anything, and "factoid," meaning it essentially had an answer — and being able to do that accurately and quickly. So that was a research area my team and I were already in. And completely independently, several IBM executives were asking: what are we going to do? What's the next cool thing? Ken Jennings was on his winning streak — this was, whatever, 2004, I think — and someone thought, hey, it would be really cool if a computer could play Jeopardy. So around 2004 they were shopping this thing around, and everyone was telling the research execs: no way, this is crazy. We had some pretty senior people saying it's crazy, and it would come across my desk, and I was like, "but that's kind of what I'm really interested in doing." But there was such a prevailing sense of "this is nuts, we're not going to risk IBM's reputation on this, we're just not doing it." It happened in 2004; it happened in 2005.
At the end of 2006 it was coming around again. I was still doing the open-domain question-answering work, but I was coming off a couple of other projects, so I had a lot more time to put into this, and I argued that it could be done — and I argued it would be crazy not to do it.

Can I ask — to be honest, even though you argued for it, what confidence did you have yourself, privately, that this could be done? We just talked about how you tell stories to convince others; how confident were you? What was your estimation of the problem at that time?

So I thought it was possible, and a lot of people thought it was impossible. I thought it was possible because I had done some brief experimentation, and I knew a lot about how we were approaching open-domain factoid question answering — we had been doing it for some years. I looked at the Jeopardy material and said: this is going to be hard, for a lot of the reasons you mentioned earlier — hard to interpret the question, hard to do it quickly enough, hard to compute an accurate confidence. None of this had been done well enough before, but a lot of the technologies we were building were the kinds of technologies that should work. But more to the point, what was driving me was that I was a senior leader in IBM Research, and this is the kind of stuff we were supposed to do — this was the moonshot. We were supposed to take things and say: this is an active research area, and it's our obligation, if we have the opportunity, to push it to the limits, and if it doesn't work, to understand more deeply why we can't do it. I was very committed to that notion. I said: folks, this is what we do; it's crazy not to do it. This is an active research area, we've been in it for years — why wouldn't we take this grand challenge and push it as hard as we can? At the very least we'd be able to come out and say: here's why this problem is way hard, here's what we tried, and here's how we failed. So I was very driven as a scientist from that perspective. And then I also argued from a feasibility study we did — why I thought it was hard but possible — and I showed examples of where it succeeded, where it failed, why it failed, and a high-level architectural approach for why we should do it. But for the most part, at that point the execs were really just looking for someone crazy enough to say yes, because for several years everyone had said: no, I'm not willing to risk my reputation and my career on this thing.

Clearly you did not have such fears.

I did not.

And yet, from what I understand, it was performing very poorly in the beginning. So what were the initial approaches, and why did they fail?

Well, there were lots of hard aspects to it. One of the reasons why prior approaches we had worked on in the past failed was that the questions were difficult to interpret — what are you even asking for? Very often, if the question was very direct — "what city," or "what person" — even then it could be tricky, but often it would name the answer type very clearly. And if there were just a small set of types — in other words, we're only going to ask about these five types: the answer will be a city in this state or this country, or the answer will be a person of this type, like an actor or whatever — you could handle it. But it turns out that in Jeopardy there were tens of thousands of these answer types, and it was a very, very long tail — it just went on and on. So even if you focused on encoding the types at the very top — say, the five most frequent — you'd still cover a very small percentage of the data.
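That long-tail coverage problem can be illustrated with a toy frequency distribution over answer types. The counts below follow a made-up Zipf-like shape, not the real Jeopardy statistics.

```python
# Illustration of the long tail of answer types ("what kind of thing is
# the answer?") described above: even the most frequent types cover only
# a small slice of the questions. All counts are invented.
from collections import Counter

# type -> number of questions asking for that type (invented numbers)
type_counts = Counter({"country": 300, "person": 250, "city": 200,
                       "film": 150, "author": 120})
# ...plus thousands of rare types that each appear only a couple of times
for i in range(2000):
    type_counts[f"rare_type_{i}"] = 2

total = sum(type_counts.values())
top5 = sum(count for _, count in type_counts.most_common(5))
print(f"top 5 types cover {top5 / total:.0%} of questions")
```

Under a distribution like this, hand-building knowledge for even the fifty most common types still leaves most questions uncovered, which is why a type-by-type approach could not work.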
So you couldn't take the approach of saying, "I'm just going to collect facts about these five, or ten, or twenty, or fifty types." That was one of the first questions: what do you do about that? And we came up with an approach to it, and the approach looked promising, and we continued to improve our ability to handle that problem throughout the project.

The other issue was that, right from the outset, I committed to doing this in three to five years — we did it in four, so I got lucky. One of the things about putting that stake in the ground was that I knew how hard the language-understanding problem was, and I said: we are not going to actually understand language to solve this problem. We are not going to interpret the question and the domain of knowledge the question refers to, and reason over that to answer these questions — obviously we were not going to be doing that. At the same time, simple search wasn't good enough to confidently answer with a single correct answer.

That's brilliant — such a great mix of innovation and practical engineering. So you were not trying to solve the general natural-language-understanding problem; you were saying, let's solve this in any way possible.

Oh yeah. I was committed to saying: look, we're just solving the open-domain question-answering problem, and we're using Jeopardy as a driver for it.

Hard enough — a big benchmark.

Exactly. And we were going to do whatever works, because I wanted to be able to go back to the academic and scientific community and say: here's what we tried, here's what worked, here's what didn't. I didn't want to go in and say, "oh, I only have one technology, and I'm only going to use this." I was going to do whatever it takes — think out of the box, do whatever it takes. And one other thing: I believed that the fundamental NLP technologies and machine-learning technologies would be adequate, and that this was a question of how we enhance them, integrate them, and advance them. I had one researcher, who had been working on question answering with me for a very long time, come to me and say, "we're going to need Maxwell's equations for question answering." And I said: if we need some fundamental formula that breaks new ground in how we understand language, we're screwed — we're not going to get there from here. My assumption was: I am not counting on some brand-new invention. What I'm counting on is the ability to take everything that has been done before, figure out an architecture for how to integrate it well, and then see where it breaks and make the necessary advances we need to make.

So this thing works — push it hard, see where it breaks, and then patch it up. That's how people change the world; that's the Musk approach with the rockets at SpaceX, that's Henry Ford, and so on.

Right. And in this case I happened to be right — but we didn't know that. You kind of have to place a bet on how you're going to run the project.

And backtracking to search: if you were to do the brute-force solution, what would you search over? You have a question — how would you search the possible space of answers?

Look, web search has come a long way, even since then. But at the time — first of all, there were a couple of interesting constraints around the problem. You couldn't go out to the web; you couldn't search the internet. In other words, the AI experiment was: we want a self-contained device. If the device is as big as a room, fine — it's as big as a room — but we want a self-contained device. You're not going out to the internet; you don't have a lifeline to anything. So it has to kind of fit in a shoebox, if you will, or at least be the size of a few refrigerators, whatever it might be. But also, you couldn't just go off-network.
So there was that limitation. But the basic idea was: go do a web search. The problem was, even when we went and did a web search — I don't remember the exact numbers, but on the order of 65% of the time — the answer would be somewhere in the top 10 or 20 documents. So first of all, that's not even good enough to play Jeopardy. In other words, even if you could perfectly pull the answer out of the top 10 or 20 documents — which we didn't know how to do — and even if you knew it was right, had enough confidence in it, and could do that fast enough to go buzz in, you'd still only get 65% of them right, and that doesn't even put you in the winner's circle.

The winner's circle?

You have to be up over 70%, and you have to do it really quickly. But now the problem is: even if the answer is somewhere in the top 10 documents, how do I figure out where in those documents the answer is, and how do I compute a confidence across all the possible candidates? It's not like I go in knowing the right answer and have to pick it out — I don't know the right answer. I have a bunch of documents, and somewhere in there is the right answer. How do I, as a machine, go out and figure out which one is right, and then how do I score it? And on top of that, how do I deal with the fact that I can't actually go out to the web?

If you pause and just think about it: if you could go to the web, do you think that problem is solvable — even beyond Jeopardy, the problem of reading text to find where the answer is?

Well, we solved that, in some definition of "solved," given the Jeopardy challenge.

How did you do it? How did you take a body of work on a particular topic and extract the key pieces of information?

So now, forgetting about the huge volumes that are on the web: we did a lot of source research. In other words, what body of knowledge is going to be small enough, but broad enough, to answer Jeopardy? And we ultimately did find that body of knowledge. It included Wikipedia and a bunch of other stuff: encyclopedias, dictionaries, thesauri, different types of semantic resources like WordNet, as well as some web crawls. We would take that core content and then expand it — statistically producing seeds, using those seeds for other searches, and then expanding on the results. Using these expansion techniques, we went out and found enough content and said, okay, this is good. We had a thread of research that was always trying to figure out what content we could efficiently include. There's a lot of popular culture — "who is the Church Lady?" — well, that's probably in an encyclopedia. But then we would take that stuff and expand it: we would go find other content that wasn't in the core resources, and the amount of content grew by an order of magnitude — but still, from a web-scale perspective, a very small, very select amount of content. We then took all that content and pre-analyzed the crap out of it, meaning we parsed it, broke it down into all those individual words, and did syntactic and semantic parses on it — computer algorithms annotated it — and we indexed all of that in a very rich and very fast index.
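The "pre-analyze everything ahead of time, then index it" step described above can be sketched as follows. The trivial token and capitalization annotations here are toy stand-ins for Watson's real parsers and annotators.

```python
# Sketch of pre-analysis plus an in-memory inverted index: every document
# is analyzed once, ahead of time, so question-time lookups return both
# the documents and their precomputed metadata. Annotations are toy ones.

def analyze(text):
    """Toy pre-analysis: lowercase tokens plus a trivial 'metadata' field."""
    tokens = text.lower().replace(".", "").replace(",", "").split()
    capitalized = [w for w in text.split() if w[:1].isupper()]
    return {"tokens": tokens, "capitalized": capitalized}

def build_index(docs):
    """In-memory inverted index: token -> set of doc ids, plus stored analyses."""
    inverted, analyses = {}, {}
    for doc_id, text in docs.items():
        analysis = analyze(text)
        analyses[doc_id] = analysis
        for token in analysis["tokens"]:
            inverted.setdefault(token, set()).add(doc_id)
    return inverted, analyses

docs = {
    "d1": "Emily Dickinson lived in Amherst.",
    "d2": "Paris is the capital of France.",
}
inverted, analyses = build_index(docs)
print(inverted["amherst"])  # {'d1'}
```

The design choice mirrored in this sketch is the one described in the interview: pay the analysis cost once at indexing time, so that at question time everything comes back with its metadata already attached.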
So we had a relatively huge amount of content — let's say, for the sake of argument, the equivalent of two to five million books — and we had analyzed all of that, blowing it up in size even more with all this metadata, and then we richly indexed all of it — and, by the way, in a giant in-memory cache. Watson did not go to disk.

Can you speak to the infrastructure component? This was around 2008, 2009 — a long time ago. How hard was the infrastructure, the hardware component — using multiple machines?

We used IBM hardware. We had something like — I forget exactly — 2,000, close to 3,000 cores, completely connected: there was a switch where every CPU was connected to every other one.

And they were sharing memory in some kind of way?

Right, a large amount of cleverly shared memory. And all this data was pre-analyzed and put into a very fast indexing structure that was all in memory. Then we would take the question and analyze it. All the content was now pre-analyzed, so if I went and tried to find a piece of content, it would come back with all the metadata that we had pre-computed.

How do you connect the question — the big knowledge base of metadata and its index — to the simple, little, witty, confusing question?

Right — therein lies the Watson architecture. We would take the question and analyze it, which means we would parse it and interpret it a bunch of different ways. We would try to figure out what it was asking about — we had multiple strategies to determine what it was asking for. That might be represented as a simple character string, or we would connect it back to different semantic types from existing resources. The bottom line is, we would do a bunch of analysis on the question, and question analysis had to finish, and had to finish fast. Because from the question analysis we would then produce searches. We had built — using open-source search engines, which we modified — a number of different search engines with different characteristics. We went in there and engineered and modified those search engines, ultimately, to take our question analysis, produce multiple queries based on different interpretations of the question, and fire off a whole bunch of searches in parallel. They would come back with passages — these were passage-search algorithms. So each search would come back with a whole bunch of passages — maybe you had a total of a thousand, or five thousand, whatever. For each passage you would then parallelize again: you would go and figure out whether or not there was a candidate answer in it. You had a whole bunch of other algorithms — candidate-answer generators, a whole bunch of those — that would find candidate answers, possible answers to the question. And for every one of these components, the team was constantly doing research: coming up with better ways to generate search queries from the questions, better ways to analyze the question, better ways to generate candidates.

And "better" is accuracy and speed?

Right, though for the most part speed and accuracy were handled separately. I focused purely on accuracy — end-to-end accuracy: are we ultimately getting more questions right and producing more accurate confidences? And there was a whole other team that was constantly analyzing the workflow to find the bottlenecks, and figuring out how to both parallelize and drive the algorithm speed.
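The fan-out just described — analyze the question, generate several queries, run searches in parallel, pull passages, and extract candidate answers from each passage — can be sketched at toy scale. Every component below is an invented stand-in for Watson's real query generators, search engines, and candidate generators.

```python
# Highly simplified sketch of the question-analysis -> search -> candidate
# fan-out described above. All components are toy stand-ins.
from concurrent.futures import ThreadPoolExecutor

def generate_queries(question):
    """Multiple interpretations of the question -> multiple queries."""
    words = question.lower().split()
    return [" ".join(words), " ".join(words[-3:])]

def search(query, passages):
    """Toy passage search: passages sharing any query word."""
    q = set(query.split())
    return [p for p in passages if q & set(p.lower().split())]

def extract_candidates(passage):
    """Toy candidate generation: capitalized words in the passage."""
    return {w.strip(".,") for w in passage.split() if w[:1].isupper()}

def answer_pipeline(question, passages):
    queries = generate_queries(question)
    with ThreadPoolExecutor() as pool:  # fire off the searches in parallel
        hit_lists = pool.map(lambda q: search(q, passages), queries)
    candidates = set()
    for hits in hit_lists:
        for passage in hits:
            candidates |= extract_candidates(passage)
    return candidates

passages = ["Emily Dickinson wrote poems in Amherst.",
            "Walt Whitman wrote Leaves of Grass."]
print(answer_pipeline("poet who wrote poems in Amherst", passages))
```

Note how recall-oriented this stage is: it deliberately over-generates candidates (here, Whitman as well as Dickinson), leaving it to the downstream scorers to sort out which candidate is actually right.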
Anyway — so now think of it: you have this big fan-out. You had multiple queries; now you have thousands of candidate answers. For each candidate answer, you're going to score it. You're going to use all the data that built up: the question analysis, how the query was generated, the passage itself, and the candidate answer that was generated. And you're going to score all of that. So now we had a group of researchers coming up with scorers — there were hundreds of different scorers. So now you're fanning out again, from however many candidate answers you have to all the different scores. If you have 200 different scorers and a thousand candidates, now you have 200,000 scores. And now you have to figure out: how do I rank these answers based on the scores that came back? I want to rank them based on the likelihood that they are correct answers to the question. Every scorer was its own research project.

What do you mean by a scorer — is that the annotation process of a human being judging an answer?

Think of it in terms of what a human would be doing: a human would be looking at a possible answer — say, "Emily Dickinson" — reading the passage in which it occurred, looking at the question, and making a judgment of how likely it is that Emily Dickinson, given the evidence in this passage, is the right answer to that question.

Got it — so that's the annotation task, the scoring task. But scoring implies a zero-to-one kind of value?

That's right — a score, not a binary yes or no.

And did different humans give different scores, so that you had to somehow normalize and deal with all of that?

That depends on what your strategy is. It could be relative; we actually looked at the raw scores as well as standardized scores.

But humans are not involved in this?

Humans are not involved.

Sorry — then I'm misunderstanding the procedure. Where is the ground truth coming from?

The ground truth is only the answers to the questions.

So it's end to end.

It's end to end. I was always driving end-to-end performance — a very interesting engineering approach, and ultimately scientific and research approach: always drive end-to-end performance. That's not to say we wouldn't make hypotheses that individual component performance was related in some way to end-to-end performance — of course we would, because people had to build individual components. But ultimately, to get your component integrated into the system, you had to show impact on end-to-end performance — on question-answering performance.

So there are many very smart people working on this, and they're basically trying to sell their ideas as components that should be part of the system.

That's right. And they would do research on their component and say things like: I'm going to improve this candidate generator, or this question scorer, or this passage scorer, or this parser — and I can improve it by two percent on its component metric: a better parse, a better candidate, a better type estimation, whatever it is. And then I would say: I need to understand how the improvement on that component metric is going to affect end-to-end performance. If you can't estimate that, and you can't do experiments to demonstrate it, it doesn't get in.
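The final ranking step described above — hundreds of scorer outputs per candidate, combined into a single estimated probability of correctness using weights learned from right/wrong outcomes — can be sketched with a tiny hand-rolled logistic regression. The two toy scorers, training data, and candidates are all invented; Watson's actual score-combination models were far more elaborate.

```python
# Sketch of combining many scorer outputs into one ranked probability of
# correctness, with weights learned from right/wrong labels. All data and
# the two toy "scorers" are invented for illustration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(examples, epochs=2000, lr=0.5):
    """examples: list of (score_vector, is_correct) pairs."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for scores, label in examples:
            p = sigmoid(sum(wi * s for wi, s in zip(w, scores)) + b)
            err = p - label
            w = [wi - lr * err * s for wi, s in zip(w, scores)]
            b -= lr * err
    return w, b

def rank(candidates, w, b):
    """candidates: answer -> score vector; highest estimated probability first."""
    scored = [(sigmoid(sum(wi * s for wi, s in zip(w, scores)) + b), ans)
              for ans, scores in candidates.items()]
    return sorted(scored, reverse=True)

# Two toy scorers (say, passage support and type match), made-up labels.
training = [([0.9, 0.8], 1), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.3, 0.2], 0)]
w, b = train_weights(training)
ranking = rank({"Emily Dickinson": [0.85, 0.9], "Walt Whitman": [0.2, 0.3]}, w, b)
print(ranking[0][1])
```

This is the divide-and-conquer property described next in the conversation: each scorer can be developed independently, because the learned combination — not the component authors — decides how much each score matters.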
That's like the best-run AI project I've ever heard of. That's awesome. What breakthrough would you say — I'm sure there were a lot of day-to-day breakthroughs, but was there a breakthrough that really helped improve performance, that made people begin to believe? Or was it just a gradual process?

Well, I think it was a gradual process, but one of the things that gave people confidence that we could get there was what happened as we followed this procedure: different ideas, build different components, plug them into the architecture, run the system, see how we do, do the error analysis, start off new research projects to improve things. And there was a very important idea in it: the individual component work did not have to deeply understand everything that was going on with every other component. This is where we leveraged machine learning in a very important way. While individual components could be statistically driven machine-learning components — some of them were heuristic, some of them were machine learning — the system as a whole combined all the scores using machine learning. This was critical, because that way you can divide and conquer. You can say: okay, you work on your candidate generator, you work on this approach to answer scoring, you work on this approach to type scoring, you work on this approach to passage search or passage selection, and so forth. Then we just plug it all in, and we had enough training data to train and figure out how to weigh all the scores relative to each other, based on predicting the outcome — which is right or wrong on Jeopardy. This enabled people to work independently and to let the machine learning do the integration.

Beautiful — so the machine learning is doing the fusion; it's a human-orchestrated ensemble of different approaches. It's still impressive that you were able to get it done in a few years. It's not obvious to me that it's doable, if I just put myself in that mindset. When you look back at the Jeopardy challenge, what are you most proud of, looking back at those days?

I'm most proud of my commitment, and my team's commitment, to be true to the science — to not be afraid to fail.

That's beautiful, because there was so much pressure — it is a public event, a public show — and you were dedicated to that idea.

That's right.

Do you think it was a success? In the eyes of the world it was a success, but by your, I'm sure, exceptionally high standards, is there something you regret, that you would do differently?

It was a success. It was a success for our goal. Our goal was to build the most advanced open-domain question-answering system. We went back to the old problems that we used to try to solve, and we did dramatically better on all of them — as well as beating Jeopardy; we won at Jeopardy. So it was a success. I did worry that the world would not understand it that way, because it came down to only one game, and I knew that, statistically speaking, this could be a huge technical success and we could still lose that one game — and that's a whole other theme of the journey. But it was a success. It was not a success in natural language understanding — but that was not the goal.

I understand what you're saying in terms of the science, but I would argue that, while not a success in terms of solving natural language understanding, it was a success in being an inspiration to future challenges.

Absolutely — challenges that drive future efforts.