
Machine Common Sense Through Embodied AI

I finally got approval to write about my work on the DARPA Machine Common Sense project. I was part of the MIT-Harvard-Stanford team, and we built embodied agents for an ai2thor-based simulation environment with the goal of mimicking the cognitive abilities of young children, in particular their interaction with the environment, intuitive physics and goal inference about other agents.

The basic overview looks like this: We face interaction, agency and physics tasks and are always given RGB-D output, sometimes together with full segmentation maps, which shifts the focus from basic vision problems to higher-level cognition. From the point clouds we build 3D occupancy grids, which make planning much easier. Additionally, we build scene graphs and use probabilistic programming techniques to solve sub-tasks like reorientation and pose estimation. In this article I will stay fairly high-level and primarily give an overview of spatial perception, motion planning and how probabilistic approaches tie into this, but the papers are linked further down if you would like to dive deeper.

In the past we have employed standard SLAM (Simultaneous Localization and Mapping) techniques: first infer the pose of our agent given the current observation, the partial map and proprioceptive signals (localization), then infer the map of the environment given our observations and inferred poses (mapping). However, since our environments are fairly static, our vision is almost perfect with no noise and only a few artifacts from the limited depth map resolution, and the uncertainty about our movements is very low, we can usually solve this in a straightforward way. Here you see an RGB-D image, the point cloud and the projected point cloud.
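For illustration, here is a minimal sketch of the back-projection from a depth map to a point cloud, assuming a simple pinhole camera with a known field of view; the actual intrinsics and the camera-to-world transform depend on the simulator configuration:

```python
import numpy as np

def depth_to_point_cloud(depth, fov_deg=90.0):
    """Back-project a depth map (H, W) into an (H*W, 3) point cloud
    in the camera frame, assuming a pinhole camera."""
    h, w = depth.shape
    f = (w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels
    cx, cy = w / 2.0, h / 2.0                          # principal point
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / f
    y = (v - cy) * depth / f
    z = depth
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```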

This approach works across simulators. For instance, here is the same point cloud extraction applied to the ai2thor-based MCS environment, TDW out of MIT and iGibson together with a unified visualization tool:

You can then discretize the point clouds into 3D occupancy grids:

Comparing the two representations, we find that the discretized version is indeed a useful representation for planning and reasoning:
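The discretization step itself is simple; here is a minimal sketch, where the voxel size is a placeholder that trades resolution against memory:

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """Discretize an (N, 3) point cloud into a boolean 3D occupancy grid."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    grid = np.zeros(idx.max(axis=0) + 1, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid, origin  # origin maps voxel indices back to world coordinates
```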

Here you can see how the 2D projection of the occupancy grid dynamically grows as our agent runs through the environment.
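The projection and the incremental growth amount to little more than collapsing the grid along the height axis and OR-ing in each new observation; a tiny sketch, assuming the height axis is the last one:

```python
import numpy as np

def project_to_2d(grid_3d):
    """Collapse a boolean 3D occupancy grid along the (assumed last) height axis."""
    return grid_3d.any(axis=-1)

def merge_observation(global_map_2d, local_map_2d):
    """Grow the global 2D map by OR-ing in a newly observed local map."""
    return np.logical_or(global_map_2d, local_map_2d)
```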

We are already able to motion plan and explore:

Since we can easily estimate bounding boxes, basic A*, RRT and other planning algorithms can be applied. At some point I used OMPL for this, but we also used our own implementations. Exploration is usually done by computing the frontier of the known area and moving towards it. However, since most of our environments are fixed rooms, we also have an exploration strategy that computes a connected-components labeling of the unexplored regions to easily determine the largest one, computes the convex hull around it (so that we do not move onto unknown cells that could be occupied by holes, lava or big objects), moves onto a cell of that hull, and then rotates so that the agent looks at the centroid of the unexplored region, approximately maximizing the information gain; the labeling step is sketched below. There are certainly better approaches, but this has worked reasonably well.
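A rough sketch of the connected-components part of that strategy, assuming the 2D grid encodes free, occupied and unknown cells (the cell codes are made up here, and the convex-hull and rotation logic is omitted):

```python
import numpy as np
from scipy import ndimage

FREE, OCCUPIED, UNKNOWN = 0, 1, 2  # assumed cell encoding of the 2D grid

def largest_unexplored_region(grid_2d):
    """Label connected unknown regions and return the mask and centroid
    of the largest one; the agent later rotates towards that centroid."""
    unknown = grid_2d == UNKNOWN
    labels, n = ndimage.label(unknown)
    if n == 0:
        return None, None
    sizes = ndimage.sum(unknown, labels, index=range(1, n + 1))
    mask = labels == (np.argmax(sizes) + 1)
    return mask, ndimage.center_of_mass(mask)
```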

An interesting navigation approach I came up with for handling holes, ramps and lava was to build a navigation mesh:

The navigation mesh is built by looking at the height gradients in the occupancy grid: edges are only added between nodes if the gradient indicates that the agent can safely pass, i.e., if it is low enough that the step is neither too steep nor a wall or a hole. I then simply let networkx render the graph and find the shortest path.
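A minimal sketch of that construction with networkx, where the maximum traversable height difference is a placeholder parameter:

```python
import networkx as nx

def build_nav_mesh(height_map, max_step=0.15):
    """Connect neighbouring cells of a 2D height map only if the height
    difference is small enough for the agent to traverse."""
    h, w = height_map.shape
    graph = nx.Graph()
    for i in range(h):
        for j in range(w):
            for ni, nj in ((i + 1, j), (i, j + 1)):
                if ni < h and nj < w and \
                        abs(height_map[ni, nj] - height_map[i, j]) <= max_step:
                    graph.add_edge((i, j), (ni, nj))
    return graph

# e.g. path = nx.shortest_path(graph, source=start_cell, target=goal_cell)
```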

Let us assume we have the following complex setup of ramps. Then the first navigation plan will look like this. As you can see, different connectivity yields different plans, and we have to replan once we have climbed the first ramp, since we cannot see the elevated surface from where we start.

Note that the underlying graph is fairly flexible. For instance, you can delete edges in areas that a visual detector perceives as lava. It turns out that lava is so uniquely determined by its color that simple color segmentation can classify it virtually perfectly:
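As a hedged sketch of what such a segmentation can look like in OpenCV, with HSV bounds that are purely illustrative and would need to be tuned to the actual lava texture:

```python
import cv2
import numpy as np

def lava_mask(rgb, lower=(0, 120, 150), upper=(25, 255, 255)):
    """Segment lava-coloured pixels by thresholding in HSV space.
    The bounds are placeholders, not the values we actually used."""
    hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
    return cv2.inRange(hsv, np.array(lower, np.uint8), np.array(upper, np.uint8))
```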

You might be tempted to follow a similar color segmentation approach to determine the agent position in agency tasks, i.e., apply a homography to recover tiles, subtract tiles, cluster colors as depicted here:

This works, but it is more elegant to just use the point clouds again and apply DBSCAN or a similar algorithm to cluster points into objects, as illustrated in the following. I also show the PDDL generated from the known tile structure and agent positions, which is very handy for goal inference in subsequent steps:
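The clustering step is essentially a one-liner with scikit-learn; eps and min_samples are placeholders that depend on the point density:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_objects(points, eps=0.05, min_samples=20):
    """Cluster an (N, 3) point cloud into object candidates with DBSCAN."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return [points[labels == k] for k in np.unique(labels) if k != -1]  # -1 = noise
```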

I will not go much into agency in this article, but you can read more in our paper here. The same trick we used to plan and reason over the environment more easily also applies to objects: we can discretize them into voxels, which lets us take their spatial structure into account without having to deal with enormous point clouds.

You might also be interested in our 3DP3 NeurIPS paper, which leverages scene graphs and point clouds together with probabilistic programming on top of neural pose estimators to obtain more accurate and robust pose estimates. More can be found here.

Another example of where PPL techniques come in handy is reorientation after the agent has been displaced (“kidnapped”). For simplicity we assume rectangular rooms and right angles between walls. In the new position we can still easily find room corners, and based on a corner and our position relative to it we can geometrically derive that we can only be in one of four spots:

Note that you can shoot rays around the agent to characterize its position in the room relative to the walls. I deliberately show a slightly pathological case: It can easily happen that the rays penetrate walls due to the limited depth map resolution, so any approach we pick needs to have a certain robustness.

Based on these two initial considerations we can then use importance resampling, MCMC with Gaussian drift proposals or particle filters with rejuvenation steps to determine our position. In the rectangular case we can see that only the two positions that make geometric sense gain support, since for the other two positions the wall lengths do not match.
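To make the importance-resampling variant concrete, here is a very reduced sketch: it assumes an axis-aligned rectangular room of known dimensions and a Gaussian noise model on the measured ray lengths, and it leaves out the corner constraint, wall colors and rejuvenation discussed above; the function names and the noise scale are made up for illustration:

```python
import numpy as np

def expected_ray_length(x, y, phi, width, length):
    """Distance from (x, y) to the nearest wall of a width x length room
    along direction phi (room corners at (0, 0) and (width, length))."""
    dx, dy = np.cos(phi), np.sin(phi)
    hits = []
    if dx > 1e-9:
        hits.append((width - x) / dx)
    if dx < -1e-9:
        hits.append(-x / dx)
    if dy > 1e-9:
        hits.append((length - y) / dy)
    if dy < -1e-9:
        hits.append(-y / dy)
    return min(hits)

def importance_resample(observed, angles, width, length, n=5000, sigma=0.1):
    """Weight random candidate poses by how well their predicted wall
    distances match the observed ray lengths, then resample."""
    xs = np.random.uniform(0, width, n)
    ys = np.random.uniform(0, length, n)
    thetas = np.random.uniform(0, 2 * np.pi, n)
    log_w = np.zeros(n)
    for obs, a in zip(observed, angles):
        pred = np.array([expected_ray_length(x, y, t + a, width, length)
                         for x, y, t in zip(xs, ys, thetas)])
        log_w += -0.5 * ((pred - obs) / sigma) ** 2  # Gaussian ray likelihood
    w = np.exp(log_w - log_w.max())
    keep = np.random.choice(n, size=n, p=w / w.sum())
    return xs[keep], ys[keep], thetas[keep]
```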

Let us consider six cases: three with colored walls and three without distinguishable wall colors. In (a) the agent looks at one of the shorter walls. If the walls are colored in a distinguishable way that breaks the symmetry, our approach converges towards a single pose; if the walls are indistinguishable, we do not know which of the two shorter walls we are looking at, only that it is not one of the longer walls. In (b) we do not look at a wall but into a corner. We again see that unique colors break the symmetry and let the inference converge, whereas monochromatic walls yield the same four positions we determined geometrically above. Finally, in (c) we see what happens when we stare at a wall without seeing any corner: if the wall has a unique color, we do not know exactly where along the wall we are, only that we must be on a line parallel to it, whereas with equally colored walls we could be looking at any of the walls at the measured distance. One of the compelling properties of probabilistic models is that they capture this uncertainty quite elegantly.

Of course, you can also combine these approaches with neural detectors, for instance with a Detectron model for object classification and instance segmentation as shown here, but I will not go into that, since I already explained it in another article and since the emphasis of our group was on probabilistic approaches.

The project was quite interesting: we were on tier 1, focusing on the cognitive abilities of young children, tier 2 focused on creating the evaluations and the simulation environment, and tier 3 (e.g., Yejin Choi’s group) focused on commonsense knowledge that can be captured with language models and the like. I enjoyed working directly with Josh Tenenbaum, Vikash Mansinghka, Dan Yamins, Dan Bear and Elizabeth Spelke, that the all-hands meetings attracted additional high-class AI researchers like Gary Marcus, Doug Lenat, Henry Lieberman, David Ferrucci and others, and that our probabilistic agent performed quite competitively and won at least twice by total points. This is satisfying not only because our colleagues from Berkeley and OSU were quite strong themselves, but also because Berkeley focused on RL and OSU on automated planning, so it was a very fruitful and friendly comparison of paradigms. It reminds me of my operating systems class at university: I learned a lot there, but I really learned OS principles in a seminar where we wrote an operating system for a microcontroller. Similarly, this project forced me to write production code to make an agent solve a whole variety of tasks, which was more insightful than implementing PPO or A3C yet again in isolation. Furthermore, I find Josh Tenenbaum’s Game Engine in the Head concept quite compelling, and this was a great hands-on way to explore it. Finally, it raises the interesting question of to what degree AI has to be embodied to truly learn about the world.
