On Tuesday, Hugging Face released SmolVLA, an open-source vision-language-action (VLA) artificial intelligence (AI) model. The model is designed for robotics workflows and training-related tasks, and the company says it is compact and efficient enough to run locally on a computer with a single consumer GPU, or even on a MacBook. The New York, US-based AI model repository further claims that SmolVLA can outperform considerably larger models. The model is available for download now.
Hugging Face's SmolVLA AI model can run locally on a MacBook
According to Hugging Face, progress in robotics has been modest despite the broader boom in AI. The company attributes this to a shortage of high-quality, diverse data, as well as a lack of large language models (LLMs) tailored to robotics workflows.
VLAs have emerged as a solution to the latter problem, but most of the top models from firms such as Google and Nvidia are proprietary and trained on private datasets. As a result, the wider robotics research community, which relies on open-source data, faces significant inefficiencies when trying to reproduce or build on these AI models, the company says.
These VLA models can take in images, videos, or a live camera feed, interpret the real-world scene, and then carry out a specified task with robotic hardware, in a perceive-decide-act cycle like the one sketched below.
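To make that cycle concrete, here is a minimal, self-contained Python sketch; the camera, arm, and policy objects are dummy stand-ins written purely for illustration, not part of any SmolVLA or LeRobot API.

```python
import numpy as np

class DummyCamera:
    """Stand-in for a real camera feed (hypothetical)."""
    def read(self):
        return np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder RGB frame

class DummyArm:
    """Stand-in for a real robot arm interface (hypothetical)."""
    def read_state(self):
        return np.zeros(6)                # e.g. six joint angles
    def apply(self, action):
        print("applying action:", action)

def dummy_policy(frame, instruction, state):
    """Stand-in for a VLA policy: image + instruction + state -> chunk of actions."""
    return np.zeros((10, 6))              # a chunk of 10 future 6-DoF commands

camera, arm = DummyCamera(), DummyArm()
instruction = "pick up the red cube"

for _ in range(3):                        # a few control steps
    frame = camera.read()                 # perceive
    state = arm.read_state()
    for action in dummy_policy(frame, instruction, state):  # decide
        arm.apply(action)                 # act
```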
According to Hugging Face, SmolVLA tackles both of these major pain points in robotics research: it is an open-source, robotics-focused model trained on open datasets from the LeRobot community. SmolVLA is a 450-million-parameter AI model that can run on a desktop computer with a single suitable GPU, as well as on a recent MacBook.
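The "single consumer GPU or a recent MacBook" claim maps onto standard PyTorch device selection. A minimal sketch, assuming a PyTorch-based setup (nothing here is specific to SmolVLA):

```python
import torch

# Pick the best locally available backend: a single consumer GPU (CUDA),
# Apple Silicon on a recent MacBook (MPS), or plain CPU as a fallback.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Running the policy on: {device}")
# model = model.to(device)  # a 450M-parameter model fits comfortably in consumer memory
```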
The architecture builds on the company's vision-language models (VLMs). It comprises a SigLIP vision encoder and a language decoder (SmolLM2). The vision encoder captures and extracts the visual information, while natural-language prompts are tokenized and fed into the decoder.
When it comes to motion or physical action (carrying out the task with robotic hardware), the sensorimotor signals are compressed into a single token. The decoder then merges all of this data into a single stream and processes it together, which lets the model interpret real-world data and tasks in context rather than as isolated inputs.
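As a rough illustration of that single-stream idea, the sketch below concatenates visual tokens, instruction tokens, and one projected sensorimotor token into a single sequence for a decoder to attend over; the dimensions, module names, and state size are assumptions made for the example, not Hugging Face's actual implementation.

```python
import torch
import torch.nn as nn

hidden = 960                                  # assumed decoder width
vision_tokens = torch.randn(1, 64, hidden)    # from a SigLIP-style vision encoder
text_tokens   = torch.randn(1, 16, hidden)    # embedded instruction, e.g. "pick up the cube"
robot_state   = torch.randn(1, 6)             # e.g. joint angles read from the arm

state_proj  = nn.Linear(6, hidden)                       # sensorimotor signals -> one token
state_token = state_proj(robot_state).unsqueeze(1)       # shape (1, 1, hidden)

# One combined sequence lets the decoder attend across image, language, and state jointly.
stream = torch.cat([vision_tokens, text_tokens, state_token], dim=1)
print(stream.shape)                                      # torch.Size([1, 81, 960])
```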
SmolVLA then passes everything it has learned to another component known as the action expert, which decides what action to take. The action expert is a transformer-based module with roughly 100 million parameters. It predicts a sequence of future robot motions (walking steps, arm movements, and so on), known as action chunks.
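A toy version of such a chunk-predicting head might look like the following; the sizes are deliberately tiny and the query-slot design is an assumption made for illustration, not the released 100-million-parameter action expert.

```python
import torch
import torch.nn as nn

# Toy action expert (assumed sizes, far smaller than ~100M parameters): a small
# transformer that reads the fused feature stream and predicts a "chunk" of
# future low-level actions in one shot instead of one step at a time.
hidden, chunk_len, action_dim = 960, 50, 6

class TinyActionExpert(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(1, chunk_len, hidden))  # one slot per future step
        self.head = nn.Linear(hidden, action_dim)                        # per-step motor command

    def forward(self, stream):
        # Append the action-chunk query slots to the context and let attention fill them in.
        x = torch.cat([stream, self.queries.expand(stream.size(0), -1, -1)], dim=1)
        x = self.encoder(x)
        return self.head(x[:, -chunk_len:])          # (batch, chunk_len, action_dim)

expert = TinyActionExpert()
stream = torch.randn(1, 81, hidden)                   # fused vision + language + state tokens
action_chunk = expert(stream)
print(action_chunk.shape)                             # torch.Size([1, 50, 6])
```

Predicting a whole chunk at once means the decoder does not have to be called for every single motor command, which is part of what keeps a model this small practical for real-time control.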
While this targets a niche audience, robotics researchers and developers can download the open weights, datasets, and training recipes to reproduce or improve the SmolVLA model. Likewise, robotics enthusiasts with access to a robotic arm or similar hardware can download them to run the model and experiment with real-time robotics workflows.
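For those who want to try it, the weights can be fetched from the Hugging Face Hub with the standard huggingface_hub client; the repo id used below is an assumption about where the checkpoint is published, so check the Hub listing or the LeRobot documentation for the exact name.

```python
from huggingface_hub import snapshot_download

# Pull the open weights to a local folder. The repo id is an assumption about
# where the checkpoint lives; the matching datasets and training recipes are
# published alongside it on the Hub.
local_dir = snapshot_download(repo_id="lerobot/smolvla_base")
print("Model files downloaded to:", local_dir)
```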