
Deploying OpenVLA on SiMa.ai MLSoC

PyTorch · ONNX Runtime · Model Quantization · OpenVLA · LLaMA-2 · DINOv2 · SigLIP · SiMa.ai Modalix MLSoC · Neural Processing Units (NPU)
This project focuses on deploying OpenVLA, a large-scale Vision-Language-Action (VLA) model, onto the SiMa.ai Modalix MLSoC edge AI platform for real-time robotics applications. The work demonstrates how multimodal AI models can be adapted from research environments to specialized edge hardware while preserving performance, efficiency, and deployment feasibility. The project explores the challenges involved in translating large multimodal architectures designed for GPU environments into optimized pipelines capable of running on low-power edge AI hardware.

The Problem & Solution

Problem

Modern multimodal AI models such as OpenVLA are typically developed in frameworks like PyTorch and designed to run on powerful GPU infrastructure. Deploying these models on edge AI hardware presents several challenges:

  • Unsupported PyTorch operations on specialized accelerators

  • Dynamic tensor shapes incompatible with static computation graphs

  • High computational cost of large transformer architectures

  • Hardware constraints on memory usage and power consumption

Without significant architectural adaptation, these models cannot run efficiently on the edge devices used in robotics and embedded AI systems.
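The static-shape constraint can be made concrete with a small sketch (illustrative only, not the project's actual code): a compiled static graph sees exactly one input shape, so variable-length token sequences are padded to a fixed maximum length and accompanied by an attention mask. The `MAX_SEQ_LEN` value and `pad_to_static` helper below are hypothetical.

```python
import numpy as np

# Illustrative only: static-graph compilers require fixed tensor shapes,
# so variable-length token sequences are padded to a fixed maximum length.
MAX_SEQ_LEN = 64  # hypothetical compile-time sequence length


def pad_to_static(tokens, pad_id=0):
    """Pad a variable-length token list to MAX_SEQ_LEN and build a mask."""
    if len(tokens) > MAX_SEQ_LEN:
        raise ValueError("sequence exceeds the compiled static length")
    ids = np.full(MAX_SEQ_LEN, pad_id, dtype=np.int64)
    mask = np.zeros(MAX_SEQ_LEN, dtype=np.int64)
    ids[: len(tokens)] = tokens
    mask[: len(tokens)] = 1  # only real tokens receive attention
    return ids, mask


ids, mask = pad_to_static([101, 7592, 2088, 102])
print(ids.shape, int(mask.sum()))
```

Every inference call then presents the same `(64,)` shape to the accelerator, regardless of how long the instruction actually is.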

Solution

To address these limitations, the OpenVLA model architecture was re-engineered for deployment on the SiMa.ai Modalix MLSoC platform. The deployment workflow included:

  • Analyzing the original PyTorch architecture of OpenVLA

  • Re-implementing model components for ONNX Runtime

  • Optimizing tensor operations for static graph execution

  • Applying quantization techniques to reduce memory and compute overhead

  • Converting the optimized model into SiMa.ai Model SDK formats

  • Deploying and validating the model using the Modalix simulator
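The quantization step in the workflow above can be illustrated with a minimal symmetric per-tensor INT8 scheme. This is a generic post-training-quantization sketch, not the SiMa.ai Model SDK's actual quantizer; the matrix size is arbitrary.

```python
import numpy as np

# Minimal symmetric per-tensor INT8 quantization sketch (generic PTQ,
# not the actual SiMa.ai Model SDK quantizer).
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)  # one fp32 weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).mean()
print(f"bytes: {w.nbytes} -> {q.nbytes}, mean abs error {err:.5f}")
```

Storing INT8 instead of FP32 cuts weight memory 4x, at the cost of a bounded rounding error (at most half a quantization step per element), which is why per-tensor or per-channel scales are chosen carefully during calibration.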

Architecture

Vision Module: The system processes visual inputs using two transformer-based encoders:

  • DINOv2 – extracts dense geometric features from images

  • SigLIP – generates semantic visual embeddings

Multimodal Projector: A projection module fuses the outputs from the vision transformers, producing a unified feature representation that combines spatial and semantic information.

Language & Action Module: The architecture incorporates a LLaMA-2 7B language model backbone that processes both visual embeddings and natural language instructions. The model produces action tokens, which correspond to robot control parameters such as position, rotation, and gripper state.
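The projector's fusion step can be sketched in a few lines of numpy. The patch count, feature widths (1024-d DINOv2, 1152-d SigLIP, 4096-d LLaMA-2 hidden size), and the single linear layer standing in for the projector MLP are all illustrative assumptions, not the exact OpenVLA configuration.

```python
import numpy as np

# Illustrative fusion sketch. The dimensions below are assumptions:
# DINOv2 patch features (1024-d), SigLIP patch features (1152-d),
# projected to an assumed LLaMA-2 7B hidden size (4096-d).
NUM_PATCHES, D_DINO, D_SIGLIP, D_LLM = 256, 1024, 1152, 4096

rng = np.random.default_rng(0)
dino = rng.normal(size=(NUM_PATCHES, D_DINO)).astype(np.float32)
siglip = rng.normal(size=(NUM_PATCHES, D_SIGLIP)).astype(np.float32)

# Channel-wise concatenation: each patch keeps its geometric and
# semantic features side by side.
fused = np.concatenate([dino, siglip], axis=-1)

# A single linear projection stands in for the projector MLP,
# mapping fused patch features into the language model's token space.
W = rng.normal(scale=0.02, size=(D_DINO + D_SIGLIP, D_LLM)).astype(np.float32)
visual_tokens = fused @ W
print(visual_tokens.shape)
```

The resulting `visual_tokens` have the same width as the language model's text embeddings, so they can be prepended to the instruction tokens and processed by the LLaMA-2 backbone as ordinary sequence positions.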

Key Features

  • Deployment of a 7B-parameter Vision-Language-Action model on edge AI hardware

  • Multimodal reasoning using vision, language, and action outputs

  • ONNX-based model re-implementation for hardware compatibility

  • Static graph optimization for edge inference pipelines

  • Quantization strategies for memory and compute efficiency

  • Modular deployment workflow for robotics AI systems

Key Impact

  1. Demonstrates the feasibility of deploying large multimodal AI models on edge computing hardware

  2. Enables real-time robotic decision making using vision and language inputs

  3. Reduces dependency on cloud infrastructure for robotics AI workloads

  4. Provides a scalable pipeline for future deployment of multimodal AI models on specialized hardware

  5. Advances the integration of multimodal AI with edge robotics systems
