The Quest to Give AI Chatbots a Hand—and an Arm

Peter Chen, CEO of the robot software company Covariant, sits in front of a chatbot interface resembling the one used to communicate with ChatGPT. “Show me the tote in front of you,” he types. In reply, a video feed appears, revealing a robot arm over a bin containing various items—a pair of socks, a tube of chips, and an apple among them.

The chatbot can discuss the items it sees—but also manipulate them. When WIRED suggests Chen ask it to grab a piece of fruit, the arm reaches down, gently grasps the apple, and then moves it to another bin nearby.

This hands-on chatbot is a step toward giving robots the kind of general and flexible capabilities exhibited by programs like ChatGPT. There is hope that AI could finally fix the long-standing difficulty of programming robots and having them do more than a narrow set of chores.

“It’s not at all controversial at this point to say that foundation models are the future of robotics,” Chen says, using a term for large-scale, general-purpose machine-learning models developed for a particular domain. The handy chatbot he showed me is powered by a model developed by Covariant called RFM-1, for Robot Foundation Model. Like the models behind ChatGPT, Google’s Gemini, and other chatbots, it has been trained with large amounts of text, but it has also been fed video as well as hardware control and motion data from tens of millions of examples of robot movements sourced from labor in the physical world.

Including that extra data produces a model that is fluent not only in language but also in action, and that is able to connect the two. RFM-1 can not only chat and control a robot arm but also generate videos showing robots doing different chores. When prompted, RFM-1 will show how a robot should grab an object from a cluttered bin. “It can take in all of these different modalities that matter to robotics, and it can also output any of them,” says Chen. “It’s a little bit mind-blowing.”

Video generated by the RFM-1 AI model. Courtesy of Covariant

The model has also shown it can learn to control similar hardware not in its training data. With further training, this might even mean that the same general model could operate a humanoid robot, says Pieter Abbeel, cofounder and chief scientist of Covariant, who has pioneered robot learning. In 2010 he led a project that trained a robot to fold towels—albeit slowly—and he also worked at OpenAI before it stopped doing robot research.

Covariant, founded in 2017, currently sells software that uses machine learning to let robot arms pick items out of bins in warehouses, but those robots are usually limited to the task they’ve been trained for. Abbeel says that models like RFM-1 could allow robots to turn their grippers to new tasks much more fluently. He compares Covariant’s strategy to how Tesla uses data from cars it has sold to train its self-driving algorithms. “It’s kind of the same thing here that we’re playing out,” he says.

Abbeel and his Covariant colleagues are far from the only roboticists hoping that the capabilities of the large language models behind ChatGPT and similar programs might bring about a revolution in robotics. Projects like RFM-1 have shown promising early results. But how much data may be required to train models that give robots much more general abilities, and how to gather it, is an open question.