"Robots deployed in the real world need sophisticated visual intelligence in order to cope with the wealth of situations they may encounter." So says computer scientist Professor Dr. Enrico Motta of the Knowledge Media Institute (KMi) at the Open University in the UK. The dominant deep learning methods are not sufficient for this. How visual intelligence can be improved by combining it with other components of artificial intelligence is the subject of Motta's lecture "A Hybrid Approach to Enabling Visual Intelligence in Robots" in the series "Co-Constructing Intelligence", offered jointly by the universities of Bielefeld, Bremen and Paderborn.
"A visually intelligent robot must at minimum be able to recognize the content of its observations," says Enrico Motta. Object recognition is typically done with deep learning (DL) methods. They are the de facto standard for a range of artificial intelligence tasks, including image and speech recognition.
Okay, Zoom doesn't like my keypad, so I'll use the mouse. So the very quick summary is as follows: I am going to talk about robots, and in particular about the work we have done on trying to get robots to understand the world around them through their visual capabilities. The approach we have used is what we call a hybrid approach, which integrates state-of-the-art deep learning techniques for object recognition with additional sources of knowledge. I'll talk quite a bit about these hybrid elements, because I believe that, even in the age of deep learning, unless you augment the state-of-the-art deep learning techniques, and machine learning techniques in general, with other types of reasoning and other types of knowledge, you are not going to get the kind of performance, and you are not going to exhibit the level of intelligence, that is required in a lot of real-world situations. And of course I want to make this extremely concrete, so all of this discussion will be grounded in a scenario that we have implemented, in which we have a robot called HanS; HanS is an acronym for Health and Safety. This robot has been trained to go round the lab and spot any situation which is a violation of the health and safety guidelines.
Now, talking about robots: technically, HanS is what is called a service robot, and in the past few years there has been quite an explosion of service robots in various industries. The photo on the top left is from the SciRoc robotics competition, where one of the tasks was to serve people in a real restaurant; indeed, on the top left you can see two customers being served by a TIAGo robot. And it was a competition, which means that European teams were benchmarked against each other on this task. Even more interesting, if you look at the image in the bottom right corner: these are delivery robots, and I believe Milton Keynes, the city where the Open University is based, has been the first city in the UK, and one of the first cities in the world, to have autonomous robots roaming the streets and delivering food. So the point is that this is well beyond academia. These service robot applications are now not just established in specific industrial scenarios, they are part of everyday life: if you walk around Milton Keynes, you will bump into delivery robots pretty regularly.
Okay, so let me show you, just to give you a taste of what we are going to talk about, and then I'll take a little digression and talk a bit more about AI paradigms. When we talk about detecting health and safety issues, you have to understand that, as the robot goes around the lab, it needs to recognize objects; and not just recognize them, it also needs to interpret the situation. For example, a possible health and safety violation is when, as in this picture, you have a book which is next to an electric heater. Now, the health and safety guidelines basically say that electrical devices cannot be next to flammable material, which means that HanS needs to understand: a portable heater is an electrical device; electrical devices can overheat; the book is made of paper; paper is flammable; and therefore we have a health and safety violation here.
So, as you can see straight away, there are quite a number of capabilities that are required from a robot in order to be able to perform this job as a kind of health and safety inspector. First of all, you have to have knowledge of the health and safety rules, like the one I just mentioned. You operate in a real-world environment, so you need to identify objects in that environment; and this object identification is much more complicated than in a lot of the standard image recognition benchmarks, which are based on rather sanitized photos. This is the real world: day and night, changing light conditions, people moving stuff around. So it is a much more challenging problem. Then you need to understand the relevant domain entities; like I said, the book is made of paper, and paper is a flammable material. And among other things you also need to understand spatial relations, at least in order to interpret the guidelines, which talk about what flammable material can be next to, say, a heater or an electrical appliance. So there is quite a lot that HanS needs to be able to handle.
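To make the reasoning chain concrete, here is a minimal sketch of how such a guideline check could look; the class names, knowledge tables and helper functions are hypothetical illustrations, not the actual HanS implementation.

```python
# Minimal sketch of the heater/book reasoning chain (hypothetical names,
# not the actual HanS code).

# Background knowledge: a tiny type hierarchy and material properties.
IS_A = {"portable_heater": "electrical_device", "book": "paper_object"}
MADE_OF = {"paper_object": "paper"}
FLAMMABLE_MATERIALS = {"paper"}

def is_flammable(obj_class: str) -> bool:
    """An object is flammable if the material it is made of is flammable."""
    material = MADE_OF.get(IS_A.get(obj_class, obj_class))
    return material in FLAMMABLE_MATERIALS

def violates_guideline(obj_a: str, obj_b: str, spatial_relation: str) -> bool:
    """Guideline: electrical devices must not be next to flammable material."""
    if spatial_relation != "next_to":
        return False
    a_is_electrical = IS_A.get(obj_a) == "electrical_device"
    return a_is_electrical and is_flammable(obj_b)

print(violates_guideline("portable_heater", "book", "next_to"))  # True
```

Even in this toy form, the chain mirrors the three ingredients just listed: encoded guidelines, recognized objects, and a spatial relation linking them.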
So now, how do we go about building HanS? Let's take a little digression and talk about paradigm shifts in AI. Back at the very beginning, what AI was about was building intelligent machines: machines like HanS that can perform a job requiring a certain degree of intelligence. It is what has been nicely called a synthetic approach to intelligence. Over the sixty-plus years since then (in fact, it is almost seventy years since the workshop that is considered the birth of the field), a number of paradigms have been proposed. One of the very first was the one by Newell and Simon: GPS, the General Problem Solver. It is quite interesting, because in those early days the idea was essentially that the mind was some kind of computer, and that reasoning was essentially logical reasoning. Therefore the problem solver worked on a search space in which the operators were primarily logical operators, as shown in this slide.
The typical benchmark task was chess; in the bottom right corner you can see Newell and Simon playing chess. This is actually quite interesting: although Newell and Simon introduced a lot of important ideas that are still very relevant, for example the notion of heuristic search, which is pretty much as relevant today as it was in 1958, there were also things they did not get right. You cannot really solve the most interesting real-world problems simply by mapping the problem onto a nice logical framework; at the very least, the hard job is getting from the complexity of the world to the nice logical reasoning expressed in these operators. Interestingly, there is a book I recommend by Garry Kasparov called Deep Thinking, which is basically about AI and chess. He points out that when Newell and Simon used chess as their example it was problematic, because neither of them could play chess at the level of a world champion like Garry Kasparov; and indeed, chess players do not search spaces the way Newell and Simon thought about chess. There is a lot of pattern recognition involved, which is a completely different problem-solving paradigm.
The limitations of this initial paradigm were quickly identified, and the next big paradigm was the knowledge-based paradigm, where people realized that intelligence is not really about having a general-purpose problem solver or understanding the logical properties of disjunction; it is really about acquiring domain knowledge from experts. In the case of the MYCIN system, for example, the system selects antibiotics for bacteremia and tries to replicate the reasoning process of an expert. The reasoning process here shows that if somebody has a low white blood cell count, you can deduce that their immune system is somehow compromised, from which you can conclude that there is some kind of infection, and then you can try to refine that and understand which kind of infection specifically we are talking about. So this is all good, but it is all extremely domain specific.
What Clancey then did, in analyzing a number of expert systems of this kind, was to point out that it is not just about domain knowledge: there are also more general problem-solving methods, such as heuristic classification, that can be used in a variety of different domains. In particular, in the case of MYCIN, what these systems are doing is taking some data and carrying out a number of abstraction steps, until they conclude that we are talking about, say, an immunodeficiency or a compromised host. Then there is a heuristic match step, which leads to a category of solutions, and finally a refinement step to a more specific solution. So you have abstraction steps going up, a heuristic step going horizontally, and a refinement step coming down; this is what is called heuristic classification. This paradigm pretty much dominated AI in the eighties and nineties.
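As a rough illustration of the three steps, here is a toy sketch; the thresholds, tables and the medical details are invented for illustration and are not Clancey's formalization or MYCIN's actual rules.

```python
# Toy sketch of heuristic classification:
# data abstraction -> heuristic match -> refinement.

def abstract(data: dict) -> set:
    """Data abstraction: raw findings to abstract characterizations."""
    findings = set()
    if data["white_blood_cell_count"] < 2500:
        findings.add("leukopenia")          # low white blood cell count
    if "leukopenia" in findings:
        findings.add("immunosuppressed")    # compromised host
    return findings

# Heuristic match: abstract problem features -> broad solution categories.
HEURISTICS = {"immunosuppressed": "infection"}

# Refinement: broad category -> more specific candidate solutions.
REFINEMENTS = {"infection": ["gram_negative_infection", "bacteremia"]}

def classify(data: dict) -> dict:
    categories = {HEURISTICS[f] for f in abstract(data) if f in HEURISTICS}
    return {c: REFINEMENTS.get(c, []) for c in categories}

print(classify({"white_blood_cell_count": 2000}))
# {'infection': ['gram_negative_infection', 'bacteremia']}
```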
My own work in those days was on parametric design. What we did was take a big class of tasks that were amenable to this kind of treatment, such as design, scheduling, planning or diagnosis, and really understand the structure of the task and the kind of problem solving required for it. So the level of abstraction went up significantly, but the paradigm remained that of the knowledge-based system: the idea was essentially that intelligence is a side effect of knowledge. If you want intelligent problem solving, you have to represent explicitly a large body of knowledge which is relevant to the specific task that you want to carry out and the specific principles that you want to use.
This knowledge-based paradigm was quite dominant for a number of years, and a side effect of the notion that you need a lot of knowledge in order to be intelligent was a variety of activities to do with building very large knowledge bases. The most pioneering one was the Cyc knowledge base, a project that started in 1984. There was all this flurry of activity, but the bottom line is that nowadays the dominant paradigm is not really the knowledge-based paradigm: it is the deep learning paradigm, where you have very clever machine learning architectures with multiple layers of neurons stacked on top of each other which, provided enough data is available, produce rather exceptional results.
For example, you will know, I am sure, about ChatGPT, but I do not know how many of you know that ChatGPT was tried on a classic biology test for high school students in the States, and out of 60 questions it got 59 correct, which basically means it performs as well as a top high school student. Now, the interesting thing is that for a period of almost ten years, roughly from 2003 to 2013, a major project worked on exactly this goal: building a system able to do biology, chemistry and physics at the level of a high school student. And it pretty much failed. They did a lot of interesting work, but they did not achieve the performance that ChatGPT was able to achieve. One could argue that the reason is essentially that they were anchored in a knowledge-based paradigm, as opposed to working on deep learning. In reality, both paradigms are to some extent based on data and knowledge. But it is telling that a flagship project using the knowledge-based paradigm did not really reach its objectives, while ChatGPT, reasonably quickly once it was applied to solving biology problems, was able to do it.
So yes, deep learning is the current dominant paradigm. However, it also has non-trivial limitations. First of all, you need data: if you do not have enough data, the performance decreases quite dramatically. There is also a certain element of brittleness, again because these systems are so data driven: as soon as a deep learning system encounters something it has never seen before, something novel, the performance drops dramatically. And you do not have to come up with an esoteric scenario to see this: just take object recognition trained on pictures provided by Amazon and run it in the real world, and the performance decreases dramatically, simply because the real world is much more dynamic than the Amazon catalog. But then, I would argue, there is a more fundamental issue here when you compare this to the way people learn. People learn concepts, which is different from learning patterns, as deep learning systems, and machine learning systems in general, do. So let me give you an example of what I am talking about.
Can anybody tell me what this is? [An audience member answers.] Exactly, a Volkswagen Beetle. So you pass the test. Now, have you ever seen a Volkswagen Beetle that looks like this one? No, exactly. So you see, you have demonstrated that you are a human being and not a chatbot, even though I cannot see you right now: you have the concept of a Volkswagen Beetle, you have not just been trained on images of what a Beetle typically looks like, with its four wheels, not much wider than it is tall, et cetera. If you train a machine learning or deep learning system on a thousand pictures of Volkswagen Beetles, there is a very high probability that if you show it this picture it will not be able to recognize it, because such systems just recognize patterns, in particular geometrical patterns, as opposed to the concept: knowing that the back end looks exactly like a Beetle's, the wheels look like a Beetle's, it says Volkswagen on it, et cetera.
Here is another example, which I prepared just a couple of days ago for a different presentation. This is DALL-E, one of these generative systems which is able to generate art from a description. I told DALL-E: generate the image of a lion with an Etruscan iPhone. And of course it is amazing: it just generates a lion with an iPhone. The problem is that this is a lion collaged with an iPhone, which is not the same thing as a lion with an iPhone; and moreover, in Etruscan times there were no iPhones, so the specification I gave is nonsensical. But DALL-E has not got any domain knowledge: it does not really know what it is talking about. What it does is draw on a huge database of annotated images, billions of data items, so that every time we ask it to produce an image it can construct a new one from the database. That does not necessarily mean it understands what we are talking about, and therefore, if I say something nonsensical, it just goes along with it.
As a result of all this, there has been a lot of interest in hybrid approaches, integrating machine learning with knowledge representation. Indeed, for the past three or four years there has been the AAAI-MAKE workshop, one of the AAAI Spring Symposium workshops, which is all about bringing together people interested in the integration of machine learning and knowledge engineering. And just in the past two or three months, a couple of major reviews of the field have come out; so there really is a lot of interest in the integration of these two paradigms.
Now let me make a small point. If I can blow my own trumpet, as we say in England, this is stuff we have been doing since 2005 or 2006, so we are talking about 17 or 18 years. Indeed, in 2008 we published a paper on the next generation of semantic web applications, and what we said there, as a quick summary, was that the semantic web, and the availability of data on a huge scale, means we need to move on from the classic logical, inferential reasoning of traditional knowledge-based systems. How do you handle scale? You cannot handle billions of data points, or billions of images, with the traditional inference mechanisms of traditional knowledge-based systems. You need machine learning, statistics, data mining: methods that are able to handle this scale and extract significant, interesting patterns. But then, as in the DALL-E example, if you want to make sense of these patterns, you need knowledge. So the paradigm we have advocated for well over ten years is essentially to combine the ability of data mining and machine learning systems to manage huge amounts of data with the sense-making capability that the large-scale knowledge bases built over the past 20, 30, 40 years bring to the table, in order to come up with meaningful interpretations of the phenomenon you want to study.
For example, in our work on scholarly data, one of the first things we did was to automatically generate a taxonomy of research areas from 6 million papers. Without giving too many details, because that would take another hour: essentially, we used large-scale data mining on the papers to identify candidate patterns that refer to research areas, using a variety of statistical techniques, and we then brought in all the knowledge we could find in the world, from Wikipedia, from Scopus and also, for example, from calls for papers, to validate whether what we had found really were research areas, with their associated conferences, workshops, et cetera. The result is the Computer Science Ontology (you can find it at cso.kmi.open.ac.uk), which is by far the largest ontology of research areas in computing. Just for comparison: the ACM taxonomy has on the order of a thousand research areas; ours has about 10,000. It has been formally adopted by Springer Nature, and it is used by other academic and commercial organizations.
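Schematically, this two-stage pipeline (statistical mining of candidates, then validation against external knowledge) can be sketched as follows; the function names, threshold and toy data are illustrative only, and the real system uses far more sophisticated statistical and semantic machinery.

```python
# Simplified sketch: mine candidate research areas statistically, then keep
# only those confirmed by external sources (Wikipedia, Scopus, CfPs, ...).
from collections import Counter

def mine_candidates(paper_keywords, min_count=100):
    """Stage 1: keep terms that occur often enough across papers."""
    counts = Counter(kw for kws in paper_keywords for kw in kws)
    return {kw for kw, c in counts.items() if c >= min_count}

def validate(candidates, external_sources):
    """Stage 2: keep candidates confirmed by at least one external source."""
    return {c for c in candidates if any(c in src for src in external_sources)}

papers = [["semantic web", "ontology"], ["semantic web"]] * 60  # toy corpus
print(validate(mine_candidates(papers), [{"semantic web"}]))    # {'semantic web'}
```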
So, in terms of the hybrid paradigm and the integration of paradigms, I am not telling you anything new. The interesting thing is how these generic paradigms can be instantiated in a concrete problem, which is getting a robot to perform intelligently in the real world. So let's move back to HanS. Actually, I think this is a good point for a quick stop, just in case anybody has a quick question on this first part; please shout now, otherwise in 10 seconds I will move on. Does anybody have any questions?
[Audience member] I do, if you don't mind. I was wondering about the whole idea of the scalability of knowledge, which you just pointed out; that has been an important problem in the community for a while. Do you think we are there yet? I have more questions for later, but do you think we have reached a point where we can scale knowledge as well as we can scale data, in the sense of vector representations and machine learning?

[Enrico Motta] When you talk about scale, it all depends on what you want to do. What I have emphasized in my presentation so far is that classic knowledge-based systems operate through rules, through inference making; you remember the picture I showed you of the search space with the operators. You have these operators, and the operators apply to a knowledge base. Now, if you are talking about billions and billions of data points, that paradigm still has scalability problems; you need to do something else. And in some cases you do not even have the knowledge in the format that you want. So the answer to your question is: if you are talking about, for example, knowledge graph technology, about triple stores and SPARQL, then I think the level of performance now is extremely good; it is close to the level of performance you get in the area of very large databases. But if you are talking about settings where you have a lot of unstructured data together with structured data, or where you need to analyze lots of data points to extract meaningful patterns and gain insights, then you cannot do that with knowledge technology alone. You have to use other methods, and then bring in the knowledge technology. That would be my standpoint.
Okay, so let's move back to HanS. The first capability is object recognition: recognizing objects in the world. You have to start from that, and for basic object recognition you use the state of the art. The scenario we are talking about is health and safety in KMi, which means you basically want to recognize the typical objects found in KMi. So we did a survey of the lab and came up with a catalog of 60 distinct classes, and then we sent HanS around to collect data. HanS came back with over 1,600 observations of what we would not quite call instances, not really instances in the logical sense, because HanS may find the same object twice without recognizing it as the same object; so these are observations of the individual items it came across. Those of you who know KMi will appreciate that it is a place where you can find a printer, but you can also find an electric guitar, because that is the nature of research labs; so the collection is quite heterogeneous. And of course there are also health-and-safety-specific items that are really important for the application domain, like fire extinguishers, safety signs, et cetera.
So, as I say, we sent HanS around to collect all these observations, and we equipped HanS with two deep-learning-based recognizers that were state of the art when we were doing this work, which is now between two and three years ago. One is called N-net and the other K-net; N-net does better when you are trying to identify novel classes, while K-net is the one that works best on the known objects. As you can see, there are a number of results here, but the really important one is the highlighted one: the instance-based measure, which means recognizing the individual items, as opposed to performance on the class-based test set. Essentially, it does not really go higher than 50%: on average, one out of every two observations is wrong. And as you can see, this is much, much worse than what these networks can do on the Amazon benchmark, where they get close to 100% success. The reason is simply the difference between the real world and the slightly sanitized world of image recognition benchmarks. To us, of course, this was good news: had the baseline been close to 100%, there would have been nothing to improve. The fact that we only get 50% is perfect; it gives us a decent baseline on which to attempt improvements.
So why are there so many mistakes? We did an analysis of all the errors that occurred, and I will show you a couple of examples. This one is actually quite funny: this armchair, seen from the back, was recognized as a mug. Nobody, I think, has ever seen a mug as big as that armchair. Again, this is because these deep learning systems do not have knowledge of the world. You and I understand that a mug cannot be that big, because otherwise the KMi people would have to be about 20 meters tall to use a mug of those proportions. So that is an interesting case: if the system knew about object sizes, errors like
these could simply be eliminated. Then there is another case, which is even more interesting. We have many radiators, and if you make different observations over time, these radiators may sometimes look like window blinds because of the reflection of the sun. Actually, it is the other way around: it is a window which, because of the reflection of the sun on the lowered blinds, appears to be a radiator; it does not appear to be just glass, it appears to be a shiny surface. The interesting thing is that a human being would not make this mistake, or at least would not keep making it, because once you know there is a window in a particular spot, you also have the knowledge that windows are not movable objects. So if one day you see a window in a particular spot, it is very unlikely that the next day that window has been replaced by a radiator. It is not impossible, but it is very, very unlikely. Why do we reason like this? Because, among the various types of common sense knowledge we have, we know about the shape and motion properties of objects: we are able to distinguish objects which are static from objects which can easily be moved from one place to another.
From this analysis we asked: if we equip HanS with commonsense knowledge, will it improve the object recognition performance? And what types of common sense knowledge do we have to represent? So what my PhD student Agnese did was a formal analysis of the errors, in which we used categories from a framework that we built by looking at the cognitive science literature on common sense, drawing in particular on two influential papers. Basically, the framework identifies five or six key elements.
What we know, for example, is the following. There is something called motion vision: there is evidence that the human brain keeps distinct representations for static and for moving objects, which means we are very, very good at distinguishing between things that are static and things that move. In fact it goes further than that, because we are also very good at keeping track of an object that moves over time: if we follow a car passing on the street, we are very good at following that car and understanding that it is still the same car in the next frame. There is also a lot of evidence that 3D is something we construct essentially in post-processing, from 2D shapes. We have very fast perceptual mechanisms, which means that not just known objects but also new objects can be recognized extremely fast. We are very good at compositionality, at identifying the sub-parts of objects, and at spatial relations. And something I would say is really mind-blowing: there have been experiments showing that even infants younger than six months understand certain basic principles such as occlusion, which is quite amazing. So the bottom line is that human beings are pretty good at understanding the world. The key thing is that we construct models of the world, and we are able to keep these models up to date as things dynamically change, but also to apply them to situations we have experienced before.
So what Agnese did was essentially to look at how to operationalize this framework in a way that could be implemented. For us, intuitive physics means representing the physical properties of objects in a way that the robot can make use of. Compositionality means representing parts and wholes. Spatial reasoning means representing spatial relations and being able to reason about them. Perception speed is essentially the easy one: the deep learning methods already provide extremely fast object recognition, at least on known objects. Generalization to new classes, again, is provided by existing computer vision systems. And motion vision means equipping the physical robot with the ability to track objects over time.
As I said, we did this error analysis, and it turns out that although there are a number of explanations for the errors, three elements of common sense really stand out. One is the one I showed you with the example of the armchair and the mug: relative size. If only HanS understood size, in particular relative size, then in principle we should be able to improve its object recognition performance. The second is spatial reasoning; again, that appeared in a number of errors. The third, related to the example I gave you about the window and the radiator, is motion: the ability to understand the difference between objects that are movable and objects that are not, and the fact that non-movable objects are unlikely to change across observations over time.
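The motion element can be cashed out as a simple consistency check across observations. Here is a minimal sketch, assuming a hypothetical MOVABLE lookup table; the actual machinery in HanS is richer than this.

```python
# Sketch: flag label changes at a fixed location that violate motion common sense.
MOVABLE = {"mug": True, "chair": True, "window": False, "radiator": False}

def plausible_change(prev_label: str, new_label: str) -> bool:
    """A non-movable object (e.g. a window) should not be replaced by a
    different object at the same spot between observations."""
    if prev_label == new_label:
        return True
    # If either object is non-movable, the apparent swap is suspicious.
    return MOVABLE.get(prev_label, True) and MOVABLE.get(new_label, True)

print(plausible_change("window", "radiator"))  # False: re-examine this detection
```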
So we were equipped with this error analysis. And let me say one other thing: something else that turns out to be quite useful, although it was not in the framework, is reusing existing knowledge bases. We also did a knowledge base coverage study; I will not have much time to talk about it, but the interesting finding is that the only existing knowledge base with information about simple motion trajectories, and about the difference between static and movable objects, is KnowRob. So this analysis was quite useful for identifying gaps in the common sense knowledge that you can extract from existing knowledge bases, as opposed to having to build new ones.
So what Agnese did first was to say: let's introduce a size reasoner and see whether the performance improves. In terms of the cognitive science, there is quite a lot of evidence that we use size in a big way to understand the reality around us, as in the example of the armchair misclassified as a mug. There is work in the literature which basically says that in our minds we have canonical understandings of size: we expect a pen to be of a certain size and a car to be of a certain size, and we use these expectations as part of the way we categorize and conceptualize objects.
I will not give you all the details of the size reasoner, except to say that we went for qualitative sizes: we abstracted away from specific dimensions. A cardboard box can be very small or very big; it can be thick or flat; of course, it is flat when it is folded, so the same object can have multiple shapes. What we did, essentially, was to characterize qualitative size along two dimensions: one is whether the object is thick or flat, and the other is surface area, whether it is small or large. This allows us to do much faster inference when HanS goes around trying to understand objects.
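To illustrate the idea, here is a minimal sketch of such a qualitative encoding; the metric thresholds are invented for illustration and would, under a reasonable assumption, be derived from data about the known object catalog rather than fixed by hand.

```python
# Sketch of the qualitative size encoding (illustrative thresholds).

def qualitative_size(w: float, h: float, d: float):
    """Map metric dimensions (metres) to two qualitative features:
    thick/flat, and small/large surface area."""
    dims = sorted([w, h, d], reverse=True)
    area = dims[0] * dims[1]        # product of the two largest extents
    thickness = dims[2]             # smallest extent
    return ("flat" if thickness < 0.05 else "thick",
            "small" if area < 0.25 else "large")

print(qualitative_size(0.08, 0.25, 0.08))  # bottle-like: ('thick', 'small')
print(qualitative_size(0.9, 0.8, 0.7))     # armchair-like: ('thick', 'large')
```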
We then integrated this size reasoning with the deep learning component. The integration was done according to a waterfall model, a kind of cascade. First we run the deep learning algorithm; in this case, the deep learning component sees a fire extinguisher and thinks it may be a bottle, because the shape of a fire extinguisher is indeed very similar to that of a bottle. So it comes up with three ranked predictions: it could be a bottle, and it cannot quite decide whether it is a fire extinguisher or a bin. These predictions are then passed to the size reasoner, which essentially says: well, that object is much too big to be a bottle; pretty much all the bottles in my knowledge base are smaller than that. The size is, however, plausible for a fire extinguisher or a bin, so fire extinguisher and bin are validated as objects that satisfy the size criteria. And because fire extinguisher was ranked higher than bin by the deep learning component, the final output is fire extinguisher.
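A minimal sketch of this waterfall integration, using the fire-extinguisher example; the size table, confidence threshold and class names are illustrative stand-ins, not HanS's actual knowledge base.

```python
# Sketch of the cascade: the DL ranking is filtered by the size reasoner
# whenever the DL model is unsure. Tables and threshold are illustrative.
PLAUSIBLE_SIZES = {
    "bottle":            {("thick", "x-small")},
    "fire_extinguisher": {("thick", "small")},
    "bin":               {("thick", "small"), ("thick", "large")},
}

def cascade(dl_ranking, observed_size, dl_confidence, threshold=0.9):
    """dl_ranking lists classes best-first as scored by the DL model."""
    if dl_confidence >= threshold:
        return dl_ranking[0]        # a confident DL prediction passes through
    # Keep only classes compatible with the observed qualitative size,
    # preserving the original DL ordering to break ties.
    ok = [c for c in dl_ranking if observed_size in PLAUSIBLE_SIZES.get(c, set())]
    return ok[0] if ok else dl_ranking[0]

# Too big for a bottle, so 'bottle' is discarded and the DL order decides.
print(cascade(["bottle", "fire_extinguisher", "bin"], ("thick", "small"), 0.4))
# -> 'fire_extinguisher'
```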
So, does this improve performance? As I showed you earlier, with the deep learning reasoner alone we only achieve about 50%. We ran a number of variations, but essentially the last row here, the one which uses the hybrid configuration (first the deep learning, then all the information about size, both area and thickness), produces a 5% improvement over the baseline. It is not unbelievably fantastic, but a 5% improvement is actually quite meaningful.
And the interesting thing is what happens if we take a slightly idealized scenario. In what we call the realistic scenario, everything is automatic: if the deep learning system is very confident that something is, say, a bottle, the prediction is not passed to the size reasoner; a prediction is only passed on if the deep learning system is not very confident about it. If instead we remove the deep learning mistakes from the picture, in the sense that we pass to the size reasoner exactly the predictions that are wrong, regardless of confidence, then the performance increases dramatically, to 66%. One of the problems with this cascade architecture, of course, is that at the end of the day garbage in means garbage out: only a subset of the incorrect predictions can be fixed by size, because we assume a real-world scenario in which the robot is completely autonomous, and therefore we do not have perfect knowledge about the performance of the deep learning system. Nonetheless, 55% is an improvement on 50%. And interestingly, we did even better on the Amazon Robotics Challenge data, which is a much easier challenge: there we estimated that we achieved 88% performance, compared to 78%, on the mixture of known and novel objects.
As I showed you earlier, it also looked like it would be very useful to integrate a spatial reasoner. Again, I will skip some of the details, but this is what we did. We developed a formal theory, a first-order logic theory with about 40 axioms, of all the spatial relations we needed. We did quite a lot of work on the notion of the bounding box: given an object, the system automatically generates the minimal bounding box for that object, and of course the object can have different shapes. The interesting thing, in contrast with a lot of the formal work on spatial reasoning, is that the robot's viewpoint is dynamic. So we came up with the notion of the contextualized bounding box, which is basically the minimal bounding box, but rotated to the point of view of the robot. The end result is something like this: when the robot analyzes a scene in KMi, it can extract a representation based on the spatial primitives that were formally defined in first-order logic and then implemented in the system, so that the robot is able to reason quite efficiently about these spatial relations.
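The geometric core of the idea can be sketched as re-expressing the corners of a minimal bounding box in the robot's current frame. The sketch below is a simplified illustration (rotation about the vertical axis only), not the implementation accompanying the formal theory.

```python
# Sketch of a contextualized bounding box: take the minimal bounding box
# and re-express its corners from the robot's point of view.
import numpy as np

def contextualized_bbox(corners_world, robot_pose_xy, robot_yaw):
    """corners_world: (8, 3) array of box corners in the world frame.
    Returns the corners in a frame centred on the robot and aligned
    with its heading (rotation about the vertical axis only)."""
    c, s = np.cos(-robot_yaw), np.sin(-robot_yaw)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    translated = corners_world - np.array([*robot_pose_xy, 0.0])
    return translated @ rot.T
```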
The other important thing about the spatial relations is that, at the end of the day, we want to use them to support what you might call contextual object recognition. If I see a keyboard next to a screen, that screen is much more likely to be a computer screen than a TV screen, because statistically you are far more likely to find keyboards next to computer screens than next to TV screens. Even more: if an object that looks like a computer mouse is next to a keyboard, it is much more likely to actually be a computer mouse than an object with the same shape lying on a bed, far away from any keyboard. So essentially we tried to reproduce this sort of contextual reasoning, which has been analyzed in cognitive science, to improve HanS's ability to recognize objects.
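One simple way to realize this kind of contextual reasoning is to re-weight the deep learning scores by co-occurrence statistics of nearby objects. The sketch below is illustrative, with an invented co-occurrence table; it is not the actual mechanism inside the spatial reasoner, and in practice such weights would be learned from annotated scenes.

```python
# Sketch of context-based re-scoring: nearby objects shift class likelihoods.
CO_OCCURRENCE = {  # unnormalized "how typical is class next to neighbour" weights
    ("computer_screen", "keyboard"): 5.0,
    ("tv_screen", "keyboard"): 0.5,
    ("computer_mouse", "keyboard"): 5.0,
}

def rescore(dl_scores: dict, neighbours: list) -> str:
    """Multiply each DL class score by how typical that class is
    next to the objects observed around it; return the best class."""
    out = {}
    for cls, score in dl_scores.items():
        for nb in neighbours:
            score *= CO_OCCURRENCE.get((cls, nb), 1.0)
        out[cls] = score
    return max(out, key=out.get)

print(rescore({"tv_screen": 0.55, "computer_screen": 0.45}, ["keyboard"]))
# -> 'computer_screen'
```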
So the architecture itself becomes more interesting, because we add the spatial reasoner; now we have three reasoners and the problem of how to organize the reasoning. We ran two different tests. In the first, we have the deep learning component, then the size reasoner, then the spatial reasoner: a cascade of three reasoners. In the second test, we have the deep learning component and then the size reasoner and the spatial reasoner working in parallel: the size reasoner gets the input from the deep learning component, the spatial reasoner also gets the input from the deep learning component, and then the two assessments are combined. Take the same example as before: is it a bottle, a fire extinguisher or a bin? We have the ranking from the deep learning component, and we have already seen what the size reasoner does. If you look at the spatial reasoner, it says the bin is on the floor, which makes sense: it has seen a lot of images of bins on floors, so that is consistent. But the fire extinguisher is also near the extinguisher sign, so that is consistent too. So both fire extinguisher and bin are plausible candidates for the spatial reasoner, and both are plausible candidates for the size reasoner. Again, we fall back on the original deep learning ranking, and we go for fire extinguisher.
If you remember, we had 55% with deep learning plus the size reasoner; now we add the spatial reasoner. Of course, the moment you add a second knowledge component, you have more scenarios. You may assume the completely realistic case, in which HanS computes everything by itself; or a completely idealized case, in which an oracle tells you which predictions from the deep learning component are correct and which are not, so that you only pass the incorrect ones on for fixing. In addition, the qualitative spatial reasoner wants to use the other objects in the scene as context for classifying the object in question, and if the deep learning component gets those other objects wrong, that is going to mess up the spatial reasoner as well. So the idealized scenario, in which everything passed to the size reasoner and to the qualitative spatial reasoner is correct, is scenario A; the real-world scenario, in which what gets passed is pot luck and we do not know what is correct and what is not, is scenario D; and there are two intermediate cases, B and C. If we go straight to D, the realistic case, we can see that integrating the size and spatial reasoners in parallel produces a further improvement, to 58%. In scenario A, the idealized one in which only the incorrect deep learning predictions are passed on to the other reasoners, the improvement goes up to 68%.
So the bottom line is that we started from a baseline of 50% performance in terms of F-measure, and we are now at 58%. The final piece of the puzzle is to represent the health and safety rules. This is the input we got from the health and safety people: the kind of inspection form and guidelines that a human inspector needs to follow. These rules were encoded into HanS. For example, the first one is "waste/rubbish kept in designated areas": you cannot keep rubbish just anywhere in the lab; there are two or three specific areas where rubbish can be kept. We formalized this as a first-order logic statement. And the important thing, as you can see in the second one, the ignition hazard rule, is that we make use of the spatial relations, the qualitative spatial relations, defined in our formal framework. And then, bingo, we launched HanS with an object recognition system augmented with the spatial reasoner and the size reasoner, and with the rules for health and safety.
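To give a flavour of what such encoded rules might look like, here is a hedged sketch in which guidelines are checked against a list of recognized objects; all names and the scene representation are hypothetical illustrations, not HanS's actual first-order rule language.

```python
# Sketch: health and safety guidelines as checks over a recognized scene.
ELECTRICAL = {"portable_heater", "power_strip"}
FLAMMABLE = {"book", "cardboard_box"}

def rubbish_in_designated_area(scene, designated_areas):
    """Waste must only be kept in designated areas; return offending objects."""
    return [o for o in scene
            if o["class"] == "rubbish_bag" and o["area"] not in designated_areas]

def ignition_hazard(scene, next_to):
    """Electrical devices must not be next to flammable material."""
    return [(a["id"], b["id"]) for a in scene for b in scene
            if a["class"] in ELECTRICAL and b["class"] in FLAMMABLE
            and next_to(a, b)]

scene = [{"id": 1, "class": "portable_heater", "area": "office"},
         {"id": 2, "class": "book", "area": "office"}]
print(ignition_hazard(scene, lambda a, b: a["area"] == b["area"]))  # [(1, 2)]
```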
Let me switch screens and show you a quick video, and then we will finish. So this is HanS going around doing the inspection. From the map, HanS knows, for example, which areas are designated for rubbish and which are not. Here it looks at the fire extinguishers; it needs to understand what they are. Let's just move forward. Okay, here is the risk assessment routine: for example, here it is looking at a fire extinguisher. (Let me just go backwards; sorry about that, there is a pause button somewhere.) This just shows you the way the situation is represented: the object is classified as a fire extinguisher with a given confidence; using the spatial reasoner, HanS is able to deduce relations such as the extinguisher being affixed to the wall; and it gets information about the properties of the object to be inspected from existing knowledge bases like Quasimodo and ShapeNet. Then it moves on. It concludes: the extinguishers are wall-mounted, yes; are they accessible? Yes, there is nothing restricting access to the extinguisher. Has it got a clear label? It says no, and the reason is that the object recognition did not spot the labels. The labels are there; this is a failure of the deep learning system: it fails to spot the labels. Let's move on and look at another scenario. This is the situation I showed you at the very beginning: the book next to the electric heater. This also shows you a bit of the trace of the system as it runs the inspection and tries the various rules. In this case it finds rubbish outside the designated area and raises a warning; there is another case in which there is a trip hazard, and again this is properly spotted. Okay, I think we have seen enough to get excited about this, so let's go to the conclusions.
The bottom line is that, using this hybrid approach as opposed to a purely deep learning approach, we have improved the performance of HanS by 8%, from 50% to 58%, and there is still a lot more work that can be done. I think this demonstrates the value of combining the two paradigms, especially because a lot of the errors downstream are really errors upstream: as I said, we do not assume an oracle that tells us which incorrect predictions to pass on for fixing; we just pass on everything, and therefore errors made upstream propagate to the downstream components. Another advantage of this type of architecture is that, as you can see from the analysis I presented, we can very clearly understand the contribution of the different components, which is much more difficult if you try to embed the domain knowledge directly into the deep learning component. In terms of future work, Agnese is currently working on precision farming, essentially applying these techniques to robots in agriculture. We also have a project on robotic assistants in the home, in which, of course, you want the robot to make sense of the home environment, for example to identify possible hazards for elderly people. And there is more that can be done in terms of adding further common sense components. Okay, I will stop here. Thank you so much for your attention.
Current technology depends on large amounts of data
"Yet despite the great successes on these and other benchmarks, DL architectures still perform poorly from a cognitive point of view compared to human capabilities, both in terms of efficiency and in terms of how knowledge is acquired," says Enrico Motta. With regard to efficiency, DL methods are notoriously data-hungry, whereas humans are able to learn and generalize even from a single example.
There is more: from an epistemological point of view, humans have the advantage over machines that they can understand what they see even when an object lacks typical features. "From an epistemological perspective, a key aspect of human learning is that it goes far beyond pattern recognition. Humans learn concepts, not just patterns. They can therefore recognize examples of these concepts even when important features are missing," says Motta. This applies, for example, to a car whose wheels have all been removed, or to the depiction of a pink elephant in a tutu. This ability avoids the "brittleness" that is typical of DL methods as well as of other kinds of AI systems.
Compensating for the weaknesses of deep learning
To endow robots with visual intelligence, Enrico Motta and his team are working on hybrid computational architectures that combine DL methods with other AI components. In his lecture, Motta presents his current research on this topic. "We have developed a hybrid architecture: it complements a deep learning approach with a variety of reasoning components drawn from cognitive science, in order to develop a new class of visually intelligent robots."
"Enrico Motta is one of the internationally leading researchers in knowledge representation and management and in semantic technologies," says Professor Dr. Philipp Cimiano, head of the Semantic Databases research group at Bielefeld University, who co-organizes the lecture. "His research provides far-reaching stimulus for how robots and other AI systems can semantically interpret observations by relying not only on the observed features but also on their background knowledge. This enables robots to generalize better and to abstract from concrete situations and observations, so as to transfer their knowledge more effectively to unfamiliar situations."
Professor Dr. Enrico Motta is Professor of Knowledge Technologies at the Open University, based in Milton Keynes, UK. There he heads the Intelligent Systems and Data Science research group at the Knowledge Media Institute (KMi), of which he was director from 2002 to 2007. He also holds a professorship at the Department of Information Science and Media Studies of the University of Bergen in Norway. His research focuses on the integration and modelling of large amounts of data, semantic and language technologies, intelligent systems and robotics, and human-machine interaction.
A lecture series on interpreting the environment together
The lecture is titled "A Hybrid Approach to Enabling Visual Intelligence in Robots". It is part of the lecture series "Co-Constructing Intelligence", a cooperation between the universities of Bielefeld, Bremen and Paderborn. Philipp Cimiano organizes the new series together with, among others, the Bielefeld computer scientist Professor Dr.-Ing. Britta Wrede, the Bremen computer scientist Professor Dr. Michael Beetz and the Paderborn linguist Professor Dr. Katharina Rohlfing. The lecture series is offered by a joint research initiative of the three universities. The consortium uses the principle of co-construction to align the understanding and capabilities of robots with those of humans. In this way, the researchers are laying the foundations for flexible and meaningful everyday interaction between robots and humans. The term co-construction refers to the fact that interpreting the environment and carrying out actions happen collaboratively.
Further information:
- Details about the lecture on the website of the JAII institute
- "Maschinen beibringen, wie Menschen zu denken" ("Teaching machines to think like humans", press release of 16 January 2023)