Rho-alpha is designed to help robots including humanoids become more autonomous. Source: Microsoft
To be useful in more dynamic and less structured environments, robots need artificial intelligence trained on a variety of sensory inputs. Microsoft Corp. today announced Rho-alpha, or ρα, the first robotics model derived from its Phi series of vision-language models.
Vision-language-action models (VLAs) enable physical AI systems to perceive, reason, and act with increasing levels of autonomy, noted Microsoft. The new model, built on Phi, is intended to make robots more adaptable and trustworthy, the company said.
“Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks,” wrote Ashley Llorens, corporate vice president and managing director of the Microsoft Research Accelerator. “It can be described as a VLA+ model in that it expands the set of perceptual and learning modalities beyond those typically used by VLAs.”
For perception, Rho-alpha adds tactile sensing, and Microsoft said it is working to include modalities such as force. For learning, the company claimed that Rho-alpha can continually improve with feedback provided by people.
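As a rough illustration of what such a VLA+ interface implies, the sketch below shows a policy that consumes a language command, a camera image, and tactile readings, and emits low-level control signals. The names and shapes here are assumptions for illustration only; Microsoft has not published Rho-alpha's architecture.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Observation:
    instruction: str       # natural language command, e.g. "insert the plug"
    rgb: np.ndarray        # camera image, shape (H, W, 3)
    tactile: np.ndarray    # fingertip tactile readings, shape (n_sensors, d)


class VLAPlusPolicy:
    """Toy stand-in: maps language + vision + touch to control signals."""

    def act(self, obs: Observation) -> np.ndarray:
        # A real VLA would fuse the modalities with a learned backbone and
        # decode an action chunk; this placeholder returns a zero command
        # sized for a bimanual setup (2 arms x 7 joints).
        return np.zeros(14)
```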
The video below demonstrates Rho-alpha, cued by natural language instructions, interacting with BusyBox, a physical interaction benchmark that Microsoft Research recently introduced.
Rho-alpha uses simulation, demonstration, and the Web
Rho-alpha co-trains for tactile awareness on trajectories from physical demonstrations and simulated tasks, as well as web-scale visual question-answering data, said Llorens in a blog post. “We plan to use the same blueprint to continue extending the model to additional sensing modalities across a variety of real-world tasks,” he added.
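A co-training recipe like this typically interleaves batches from each data source during training. The sketch below shows one common way to do that with weighted sampling; the stream names and mixture ratios are assumptions, not Microsoft's published recipe.

```python
import itertools
import random

# Stand-in batch streams for the three sources named in the blog post.
streams = {
    "physical_demos": itertools.cycle(["demo_batch"]),  # teleoperated trajectories
    "sim_tasks":      itertools.cycle(["sim_batch"]),   # synthetic tactile rollouts
    "web_vqa":        itertools.cycle(["vqa_batch"]),   # web-scale visual QA pairs
}
weights = [0.4, 0.4, 0.2]  # assumed mixture ratios, not Microsoft's

def sample_batch():
    """Pick a source by weight, then draw the next batch from it."""
    name = random.choices(list(streams), weights=weights)[0]
    return name, next(streams[name])

# Each training step would then call: source, batch = sample_batch()
```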
There is a lack of scalable robotics training data, especially for tactile and other less-common sensing modalities, acknowledged Microsoft. With the open NVIDIA Isaac Sim framework, researchers can generate synthetic data in a multistage process based on reinforcement learning.
“While generating training data by teleoperating robotic systems has become a standard practice, there are many settings where teleoperation is impractical or impossible,” said Abhishek Gupta, assistant professor at the University of Washington. “We are working with Microsoft Research to enrich pre-training datasets collected from physical robots with diverse synthetic demonstrations using a combination of simulation and reinforcement learning.”
“Training foundation models that can reason and act requires overcoming the scarcity of diverse, real-world data,” observed Deepu Talla, vice president of robotics and edge AI at NVIDIA. “By leveraging NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets, Microsoft Research is accelerating the development of versatile models like Rho-alpha that can master complex manipulation tasks.”
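One common pattern for the kind of multistage process described above is to train a reinforcement-learning policy in simulation, then keep only its successful rollouts as synthetic demonstrations. The sketch below assumes a gymnasium-style environment interface as a stand-in; it does not use actual NVIDIA Isaac Sim APIs, and Microsoft has not detailed its pipeline.

```python
def collect_demonstrations(env, rl_policy, n_episodes=100, min_return=1.0):
    """Roll out a trained RL policy and keep successful episodes as demos."""
    demos = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        trajectory, total_reward, done = [], 0.0, False
        while not done:
            action = rl_policy(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            trajectory.append((obs, action, reward))
            total_reward += reward
            done = terminated or truncated
            obs = next_obs
        # Only successful rollouts are worth adding to the pre-training set.
        if total_reward >= min_return:
            demos.append(trajectory)
    return demos

# Usage (given a simulator env and a trained policy):
# demos = collect_demonstrations(env, rl_policy)
```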
Humans provide course correction for Microsoft models
Even with expanded perception, robots can still make mistakes during operation, said Microsoft. It explained that corrective feedback from teleoperation devices such as a 3D mouse can help Rho-alpha continue learning.
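In an intervention scheme like the one described, an operator's 3D-mouse nudge can be blended into the policy's command, with the corrected pair logged for later fine-tuning in the spirit of DAgger. The function name, gain, and buffer below are illustrative assumptions; Microsoft has not specified how Rho-alpha ingests corrective feedback.

```python
import numpy as np

replay_buffer = []  # human-corrected (observation, action) pairs

def on_timestep(obs, policy_action, mouse_delta, gain=0.05):
    """Blend an operator's 6-DoF nudge into the policy's command."""
    corrected = policy_action + gain * mouse_delta
    # Log the corrected pair so the model can later be fine-tuned on it,
    # closing the continual-learning loop.
    replay_buffer.append((obs, corrected))
    return corrected
```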
In the video below, Microsoft shows two UR5e cobot arms with tactile sensors using Rho-alpha to insert a plug. The right arm has difficulty with the task and is aided by human guidance in real time.
“Our team is working toward end-to-end optimizations of Rho-alpha’s training pipeline and training data corpus for performance and efficiency on bimanual manipulation tasks of interest to Microsoft and our partners,” said Llorens. “The model is currently under evaluation on dual-arm setups and humanoid robots. We will publish a technical description in the coming months.”
Microsoft said it is looking to work with robotics manufacturers, integrators, and end users to see how technologies such as Rho-alpha and associated tooling can help them train, deploy, and continuously adapt cloud-hosted physical AI with their own data. The company invited interested stakeholders to join its Research Early Access Program.