Robot Learns From Watching YouTube Videos

Can a robot learn to make a soufflé or a stir-fry the same way many of us do, by spending afternoons watching cooking videos? New research at the University of Maryland suggests it’s possible.

Domestic robots were once a staple of popular science fiction. Think of Rosie the Robot from the cartoon TV series The Jetsons, or the Hired Girl robots from Robert Heinlein’s novel The Door into Summer. Back in the ’50s and early ’60s, the promise of cybernetics seemed imminent.

The reality has proved much more difficult, however. The Roomba, a self-directing vacuum cleaner, seems to be as far as we’ve come in the home. Impressive, in its own way, but not an autonomous robot that can flexibly take on tasks.

The problem lies partly in perception: building a robot that can identify objects from any angle in different environments. It also lies in what might be called the acquisition of common-sense knowledge: how do you give a robot the tacit background understanding that lets it change its plans when the situation changes or a new constraint is introduced?

Researchers from the University of Maryland and NICTA, Australia’s Information and Communications Technology Research Centre, have recently demonstrated how a robot named Baxter, made by Rethink Robotics, can learn these types of skills by watching YouTube cooking demonstrations, a step on the road to an autonomous robot.

Massachusetts-based Rethink Robotics was founded by the well-known robotics researcher Rodney Brooks. Baxter has two arms and an animated face on a screen that shows what it’s looking at, and is commonly used in robotics research.


The University of Maryland project uses two types of recognition modules to gain information from the videos. The first recognizes the object in a video as one of 48 kitchen staples, such as an apple, a fork, or yogurt. The second recognizes which of six grasp types (a power grasp on a small-diameter object, for example, or a precision grasp on a spherical object) suits the action being performed.

This recognition is done by a convolutional neural network, or CNN. Neural networks are statistical learning algorithms modeled loosely on networks of biological neurons. They have been around for decades, but CNNs have recently advanced to the point where they rival human performance on some visual recognition tasks. Google and Baidu, for example, use CNNs in their visual search engines.
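The structure of that two-module setup can be sketched in code. Below is a minimal, untrained stand-in in Python with NumPy: one convolutional layer feeding two softmax "heads," one over the 48 object classes and one over the 6 grasp types. The filter count, image size, and pooling choice are illustrative assumptions, not the team's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernels):
    """Valid 2D convolution of an (H, W) image with (K, kh, kw) kernels."""
    K, kh, kw = kernels.shape
    H, W = image.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(image[i:i + kh, j:j + kw] * kernels[k])
    return out

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Untrained random weights: this shows the architecture only.
kernels = rng.normal(size=(8, 3, 3))   # 8 learned filters (illustrative)
W_obj   = rng.normal(size=(48, 8))     # head over 48 kitchen objects
W_grasp = rng.normal(size=(6, 8))      # head over 6 grasp types

image = rng.normal(size=(32, 32))              # stand-in for a video frame
features = np.maximum(conv2d(image, kernels), 0)  # ReLU nonlinearity
pooled = features.mean(axis=(1, 2))            # global average pooling -> 8 values

object_probs = softmax(W_obj @ pooled)   # e.g. "apple", "fork", "yogurt"
grasp_probs  = softmax(W_grasp @ pooled) # e.g. power-small, precision-sphere
```

In practice both heads would share many trained convolutional layers; the point here is only that one learned feature extractor can serve two classification tasks at once.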

But recognizing the object and the grasp in the video is just the first step. To act, the robot needs to understand the hierarchical and recursive structure of the actual action. Baxter models that understanding semantically, with a grammatical structure similar to those used in linguistic analysis. In other words, the robot builds a visual sentence by following certain rules, and then acts it out. Once Baxter has learned from the video, it can do things like make a salad or serve a cup of coffee.
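The "visual sentence" idea can be illustrated with a toy grammar. The rules and token names below are invented for illustration, not the researchers' actual rule set: each recognized (grasp, object, verb) triple becomes an action, and actions chain recursively into a nested command tree the robot could then act out.

```python
# Toy manipulation grammar (illustrative):
#   command -> action | action command
#   action  -> grasp object verb

def parse(tokens):
    """Fold flat recognition labels into a recursive command tree."""
    if len(tokens) < 3:
        raise ValueError("need at least one (grasp, object, verb) triple")
    action = {"grasp": tokens[0], "object": tokens[1], "verb": tokens[2]}
    rest = tokens[3:]
    return {"action": action, "then": parse(rest)} if rest else {"action": action}

# Labels as the recognition modules might emit them for a cooking clip.
tokens = ["power-small", "knife", "grasp",
          "precision-sphere", "tomato", "cut"]
tree = parse(tokens)
```

The recursion is what gives the representation its hierarchical structure: "grasp the knife, then cut the tomato" is one command built from two actions, just as a sentence is built from clauses.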

Linking perception, reasoning, and action control in a single system allows Baxter to not only imitate the movement of human beings, but to also understand the goals of the actions. With this advancement, a robot could theoretically use many different actions—even those unexpected by its designers—to achieve the desired goal.

“Imagine that we want the robot to cut up a cucumber,” says Yezhou Yang, a Ph.D. candidate in the University of Maryland’s Department of Computer Science. “But it doesn’t see a knife on the table. We’re interested in whether the robot would be able to reason that it could use an available spatula to cut the cucumber instead. That’s the kind of reasoning we’re hoping to develop.”
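One simple way to frame the tool-substitution reasoning Yang describes is an affordance lookup: each tool is tagged with the actions it affords, and the planner falls back to any available tool that affords the goal action. The table and function below are a hypothetical sketch, not the Maryland system's actual mechanism.

```python
# Hypothetical affordance table; entries are illustrative only.
AFFORDANCES = {
    "knife":   {"cut", "spread"},
    "spatula": {"cut", "flip", "scoop"},
    "fork":    {"pierce", "mix"},
}

def substitute(goal_action, preferred, available):
    """Pick a tool that affords the goal action, preferring the usual one."""
    if preferred in available and goal_action in AFFORDANCES.get(preferred, set()):
        return preferred
    for tool in available:
        if goal_action in AFFORDANCES.get(tool, set()):
            return tool
    return None  # no available tool affords the action

# The knife is missing, so the planner falls back to the spatula.
tool = substitute("cut", preferred="knife", available=["spatula", "fork"])
```

A real system would have to learn these affordances from observation rather than read them from a hand-built table, which is exactly the open problem Yang points to.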

The approach followed by the University of Maryland team is part of a broader trend. Another example is the Stanford University-based Robo Brain project, a multi-university open-source initiative that explores how machines learn, focusing on large-scale data processing, language and dialog, perception, and reasoning systems.

“The focus really is moving toward having robots manipulate things by watching humans,” says Ashesh Jain, a Ph.D. student at Cornell currently working on the Robo Brain project. “You can learn high-level symbolic concepts. ‘Am I grasping the bottle, yes or no?’ This is a more recent topic of study, in the last three to five years.”

As for Yang, he doesn’t see anything particularly special about a robot cooking. “We started with cooking videos mostly because they are so easily available online,” he says. “We will be extending these actions to things like arts and crafts, as well as assembly line work and so forth.”

Still, there is a charm to robots wielding spatulas, bowls, and cutting boards, allowing them to show a domestic side they certainly don’t display in most dystopian fiction. Will you fear a Terminator wearing an apron? “I’ll be baaack—with the gâteau pêche!”

Image courtesy of Yang, Yi, Fermüller, and Aloimonos