In this chapter, we will introduce the advanced development workflow of D-Robotics-LLM.
This workflow applies to the following scenarios:
- Quantizing models yourself.
- Offline execution: The model generates textual responses by reading local audio, video, or image data.
- Online execution: The model generates textual responses from streaming audio or video data. Compared with offline execution, online execution processes data while it is being transmitted, significantly reducing the latency before the model outputs its first token.
For the above scenarios, we will continue using the Qwen2.5_Omni_3B model as an example to demonstrate usage.
Please ensure you have correctly completed environment setup for both the development host and development board as described in the Environment Deployment section.
Download the provided deployment package D-Robotics_LLM_{version}.tar.gz and extract it.
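For example:

```bash
tar -xzvf D-Robotics_LLM_{version}.tar.gz
```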
Currently, only the Qwen2.5-Omni-3B model is supported. Before downloading the model, make sure you understand its license terms, required dependencies, and other relevant information to ensure smooth use later on.
You can obtain Omni-series models from the Hugging Face platform. Below is the download link for the model:
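- Qwen2.5-Omni-3B: https://huggingface.co/Qwen/Qwen2.5-Omni-3B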
D-Robotics-LLM provides a command-line tool to quantize and compile models for on-device deployment. Using the Qwen2.5-Omni-3B model as an example, the reference command is as follows:
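(The option names below are illustrative placeholders rather than the tool's documented flags; the chunk_size of 256 and cache_len of 2048 match the only configuration currently supported.)

```bash
# Flag names are illustrative placeholders only.
oellm_build \
    --model-path ./Qwen2.5-Omni-3B \
    --chunk-size 256 \
    --cache-len 2048 \
    --output-dir ./output
```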
For detailed usage instructions and important considerations regarding the oellm_build tool, please refer to the oellm_build Tool section.
If you obtain our pre-compiled HBM models via the links provided in resolve_model.txt, you may skip this model quantization step.
All Omni models provided in the resolve_model.txt file are compiled with chunk_size set to 256 and cache_len configured to 2048. Currently, only this configuration is supported.
The Qwen2.5_Omni_3B model supports multiple modalities including text, audio, images, and video. Regardless of input combinations, the model always outputs plain text.
Multimodal support operates in two modes—offline and online—with slight differences in supported input combinations, as detailed below:
| No. | Text | Audio | Image | Video |
|---|---|---|---|---|
| 1 | Y | N/A | N/A | N/A |
| 2 | N/A | Y | N/A | N/A |
| 3 | N/A | N/A | Y | N/A |
| 4 | N/A | N/A | N/A | Y |
| 5 | Y | N/A | Y | N/A |
| 6 | N/A | Y | Y | N/A |
| 7 | Y | N/A | N/A | Y |
| 8 | N/A | Y | N/A | Y |
- Text content should be included directly within the JSON file; no separate text file is needed.
- Supported audio formats include mp3, wav, and flac, with a maximum duration of 30 seconds.
- Supported image formats include jpg, png, bmp, and jpeg; images will be resized to a fixed resolution of 448x448.
- Supported video formats include mp4 and mkv, with a maximum duration of 5 seconds. Videos are sampled at 2 frames per second and resized to 448x448. Additionally, if a video contains audio and no separate audio input is provided, the embedded audio will be processed. If a separate audio input is provided, the audio embedded in the video will be ignored.
All modal inputs must be configured via a JSON file. For detailed instructions, please refer to the On-Device Execution section.
| No. | Text | Audio | Video |
|---|---|---|---|
| 1 | Y | N/A | Y |
| 2 | N/A | Y | Y |
| 3 | N/A | Y | N/A |
- Text content can be fed to the model using the xlm_omni_feed_text_online API.
- Video format is limited to nv12. You can use the xlm_omni_feed_video_online API to feed single-frame nv12 data to the model. Frames will be resized to 448x448, and each conversation supports transmission of 2 to 10 frames.
- Audio data must be of type float32 with values in the range [-1, 1]. You can use the xlm_omni_feed_audio_online API to either transmit complete audio in one go or stream audio segments incrementally. Each conversation supports up to 30 seconds of cumulative audio.
Within the directory D-Robotics_LLM_{version}/oellm_runtime/example, we have pre-prepared compiled executables in subdirectories that can be run directly on the device. Alternatively, you can generate the required files yourself by executing different build scripts. Reference commands are as follows:
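(The script names below are hypothetical placeholders; substitute the build scripts actually shipped in the example subdirectories.)

```bash
cd D-Robotics_LLM_{version}/oellm_runtime/example
# Hypothetical script names; run the offline and online build scripts provided here.
bash build_omni_offline.sh
bash build_omni_online.sh
```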
Next, create a working directory on the device with the following command:
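(This uses the /home/root/llm path referenced later in this section.)

```bash
# On the device: create the working directory for the deployment files.
mkdir -p /home/root/llm
```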
Before execution, ensure the following items are ready:

- The quantized model files (*.hbm).
- The input embedding weights (embed_tokens.bin).
- The compiled executables (oellm_omni_offline and oellm_omni_online) along with their corresponding JSON configuration files.
- The performance script D-Robotics_LLM_{version}/oellm_runtime/set_performance_mode.sh.
- The runtime libraries in D-Robotics_LLM_{version}/oellm_runtime/lib.
- The configuration files in D-Robotics_LLM_{version}/oellm_runtime/config.
- The examples in D-Robotics_LLM_{version}/oellm_runtime/example.

After preparing all necessary files, organize them into a unified directory structure as shown below:
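One possible layout (illustrative; the model directory name and placement of the *.hbm and embed_tokens.bin files are placeholders):

```
llm/
└── D-Robotics_LLM_{version}/
    └── oellm_runtime/
        ├── set_performance_mode.sh
        ├── lib/                      # runtime libraries
        ├── config/                   # JSON configuration files
        ├── example/                  # oellm_omni_offline / oellm_omni_online
        └── model/                    # *.hbm files and embed_tokens.bin (placeholder name)
```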
Copy the prepared folder from your development host to the device directory using the following command:
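For instance (the board's IP address is a placeholder):

```bash
# Run on the development host; replace 192.168.1.10 with your board's address.
scp -r ./llm root@192.168.1.10:/home/root/
```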
Finally, configure LD_LIBRARY_PATH under the path /home/root/llm/D-Robotics_LLM_{version}/oellm_runtime with the following commands:
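A minimal sketch, assuming the shared libraries sit in the lib subdirectory listed above:

```bash
cd /home/root/llm/D-Robotics_LLM_{version}/oellm_runtime
export LD_LIBRARY_PATH=$(pwd)/lib:$LD_LIBRARY_PATH
```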
Reference command for offline execution:
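(Both paths below are placeholders for your actual executable and configuration file locations.)

```bash
./example/oellm_omni_offline -c ./config/omni_offline_config.json
```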
Program arguments are as follows:
| Argument | Description | Optional/Required |
|---|---|---|
| -h, --help | Display help information. | / |
| -c, --config | Specifies the path to the JSON configuration file used at runtime. | Required |
Example JSON configuration file:
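(A sketch assembled from the parameter table below; all file paths are placeholders.)

```json
{
    "visual_hbm_path": "./model/omni_visual.hbm",
    "audio_hbm_path": "./model/omni_audio.hbm",
    "text_hbm_path": "./model/omni_text.hbm",
    "embed_tokens": "./model/embed_tokens.bin",
    "tokenizer_dir": "./tokenizer",
    "model_type": 5,
    "online_mode": "false"
}
```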
The parameters in the JSON configuration file are described as follows:
| Parameter | Description | Optional/Required |
|---|---|---|
| visual_hbm_path | Specifies the path to the quantized video/image feature extraction model file (*.hbm). | Required |
| audio_hbm_path | Specifies the path to the quantized audio feature extraction model file (*.hbm). | Required |
| text_hbm_path | Specifies the path to the quantized text model file (*.hbm). | Required |
| embed_tokens | Specifies the path to the model's input embedding weights (embed_tokens.bin). | Required |
| tokenizer_dir | Specifies the path to the tokenizer and partial initialization data configuration. | Required |
| model_type | Specifies the model type to run; the current Omni model type is 5. | Required |
| online_mode | Specifies whether the model runs in online or offline mode. Valid values: 'true', 'false'. | Required |
When running the program, you also need to provide the path to a JSON file containing multimodal input information via the command line, then press Enter to start the interaction.
Offline execution supports modalities including text, audio, images, and video. You must prepare the input information in advance within the JSON file and save it locally. The template is as follows:
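(The sketch below is reconstructed from the field description that follows; the exact schema is defined by the runtime, and all media paths are placeholders.)

```json
{
    "conversation": [
        {
            "role": "system",
            "content": [
                { "type": "text", "text": "You are a helpful assistant." }
            ]
        },
        {
            "role": "user",
            "content": [
                { "type": "text", "text": "Describe what you see and hear." },
                { "type": "audio", "path": "./data/test.wav" },
                { "type": "image", "path": "./data/test.jpg" },
                { "type": "video", "path": "./data/test.mp4" }
            ]
        }
    ]
}
```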
Note: This JSON file template is provided for illustrative purposes only. For details on supported multimodal input combinations, please refer to the Multimodal Support section.
In this JSON template, a single conversation node contains the system role with a text field and the user role with optional text, audio, image, and video fields. If a particular modality is not needed, delete the entire corresponding object, including its braces.
For example, when providing only video input, the JSON file can be configured as follows:
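(Again a sketch, consistent with the template above; the video path is a placeholder.)

```json
{
    "conversation": [
        {
            "role": "system",
            "content": [
                { "type": "text", "text": "You are a helpful assistant." }
            ]
        },
        {
            "role": "user",
            "content": [
                { "type": "video", "path": "./data/test.mp4" }
            ]
        }
    ]
}
```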
The D-Robotics-LLM deployment package provides an API that supports streaming-based online execution of the Qwen2.5_Omni_3B model. We provide an online execution example for reference.
The reference command to run this example on the device is as follows:
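(Executable and configuration paths are placeholders.)

```bash
./example/oellm_omni_online -c ./config/omni_online_config.json
```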
The program accepts the following command-line arguments:
| Parameter | Description | Optional/Required |
|---|---|---|
| -h, --help | Display help information. | / |
| -c, --config | Specifies the path to the runtime JSON configuration file. | Required |
An example JSON configuration file is shown below:
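(A sketch assembled from the parameter table below; all file paths and the user text are placeholders.)

```json
{
    "visual_hbm_path": "./model/omni_visual.hbm",
    "audio_hbm_path": "./model/omni_audio.hbm",
    "text_hbm_path": "./model/omni_text.hbm",
    "embed_tokens": "./model/embed_tokens.bin",
    "tokenizer_dir": "./tokenizer",
    "model_type": 5,
    "online_mode": "true",
    "video_path": "./data/test.mp4",
    "user_text": "What is happening in this video?"
}
```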
The parameters in the JSON configuration file are described as follows:
| Parameter | Description | Optional/Required |
|---|---|---|
| visual_hbm_path | Specifies the path to the quantized video/image feature extraction model file (*.hbm). | Required |
| audio_hbm_path | Specifies the path to the quantized audio feature extraction model file (*.hbm). | Required |
| text_hbm_path | Specifies the path to the quantized text model file (*.hbm). | Required |
| embed_tokens | Specifies the path to the model's input embedding weights (embed_tokens.bin). | Required |
| tokenizer_dir | Specifies the path to the tokenizer and partial initialization data configuration. | Required |
| model_type | Specifies the model type to run; the current Omni model type is 5. | Required |
| online_mode | Specifies whether the model runs in online or offline mode. Valid values: 'true', 'false'. | Required |
| video_path | Specifies the path to the video file to be processed during online execution. | Required |
| user_text | Specifies the user's textual input content. | Optional |
Audio data in online execution is extracted from the video. If you test with your own video data, we recommend using videos that contain audio tracks.
A complete offline interaction proceeds as follows. [User] <<< is followed by the path to your JSON file containing the input data (e.g., ./omni_offline_prompt.json), and [Assistant] >>> shows the model’s textual output. Before generating the response, the program prints the input information from the JSON file to the terminal.
Enter 1, 2, or 3 to run one of the three supported multimodal input combinations in online mode; enter 0 to exit the program. A full demonstration of the online interaction is shown below: