Advanced Development

In this chapter, we will introduce the advanced development workflow of D-Robotics-LLM.

This workflow applies to the following scenarios:

  1. Quantizing models yourself.

  2. Offline execution: The model generates textual responses by reading local audio, video, or image data.

  3. Online execution: The model generates textual responses by streaming audio or video data. Compared to offline execution, online execution processes data while it is being transmitted, significantly reducing the latency before the model outputs its first token.

For the above scenarios, we will continue using the Qwen2.5_Omni_3B model as an example to demonstrate usage.

Environment Setup

Please ensure you have correctly completed environment setup for both the development host and development board as described in the Environment Deployment section.

Deployment Package Preparation

Download the provided deployment package D-Robotics_LLM_{version}.tar.gz and extract it.

Model Preparation

Note

Currently, only the Qwen2.5-Omni-3B model is supported. Before downloading the model, please ensure you understand the model's license terms, required dependencies, and other necessary information to guarantee proper subsequent usage.

You can obtain Omni-series models, including Qwen2.5-Omni-3B, from the Hugging Face platform.

Model Quantization

D-Robotics-LLM provides a command-line tool to quantize and compile models for on-device deployment. Using the Qwen2.5-Omni-3B model as an example, the reference command is as follows:

```shell
oellm_build \
    --model_name qwen2_5-omni-3b \
    --input_model_path ./models/qwen/Qwen2.5-Omni-3B \
    --output_model_path ./output_hbm \
    --march nash-m \
    --chunk_size 256 \
    --cache_len 2048 \
    --device cuda:1
```

Note

For detailed usage instructions and important considerations regarding the oellm_build tool, please refer to the oellm_build Tool section.

If you obtain our pre-compiled HBM models via the links provided in resolve_model.txt, you may skip this model quantization step.

All Omni models provided in the resolve_model.txt file are compiled with chunk_size set to 256 and cache_len configured to 2048. Currently, only this configuration is supported.
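As a rough illustration of what these two values imply, the sketch below computes how many chunk-sized prefill passes a prompt would need and checks it against the cache capacity. This is an assumption-driven sketch (the chunked-prefill behavior is not specified in detail here), not part of the toolchain:

```python
import math

CHUNK_SIZE = 256   # chunk_size used by the provided HBM models
CACHE_LEN = 2048   # cache_len (KV-cache capacity in tokens)

def prefill_chunks(prompt_tokens: int) -> int:
    """Number of chunk-sized prefill passes needed for a prompt.

    Assumes (not confirmed by this document) that prefill proceeds in
    chunk_size-token chunks and that the prompt must fit within cache_len.
    """
    if prompt_tokens > CACHE_LEN:
        raise ValueError(f"prompt exceeds cache_len ({CACHE_LEN} tokens)")
    return math.ceil(prompt_tokens / CHUNK_SIZE)
```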

Multimodal Support

The Qwen2.5_Omni_3B model supports multiple modalities including text, audio, images, and video. Regardless of input combinations, the model always outputs plain text.

Multimodal support operates in two modes—offline and online—with slight differences in supported input combinations, as detailed below:

Offline Execution

| No. | Text | Audio | Image | Video |
|-----|------|-------|-------|-------|
| 1   | Y    | N/A   | N/A   | N/A   |
| 2   | N/A  | Y     | N/A   | N/A   |
| 3   | N/A  | N/A   | Y     | N/A   |
| 4   | N/A  | N/A   | N/A   | Y     |
| 5   | Y    | N/A   | Y     | N/A   |
| 6   | N/A  | Y     | Y     | N/A   |
| 7   | Y    | N/A   | N/A   | Y     |
| 8   | N/A  | Y     | N/A   | Y     |

  • Text content should be included directly within the JSON file; no separate text file is needed.

  • Supported audio formats include mp3, wav, and flac, with a maximum duration of 30 seconds.

  • Supported image formats include jpg, png, bmp, and jpeg; images will be resized to a fixed resolution of 448x448.

  • Supported video formats include mp4 and mkv, with a maximum duration of 5 seconds. Videos are sampled at 2 frames per second and resized to 448x448. Additionally, if a video contains audio and no separate audio input is provided, the embedded audio will be processed. If a separate audio input is provided, the audio embedded in the video will be ignored.

All modal inputs must be configured via a JSON file. For detailed instructions, please refer to the On-Device Execution section.
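The format and duration limits above can be checked on the host before launching the program. The following Python sketch validates a single media file against the documented limits; the helper name `check_media` is our own, and the duration must be measured separately (for example with ffprobe) and passed in:

```python
from pathlib import Path

# Documented offline input limits from the list above.
AUDIO_EXTS = {".mp3", ".wav", ".flac"}          # audio: at most 30 s
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # resized to 448x448
VIDEO_EXTS = {".mp4", ".mkv"}                   # video: at most 5 s, 2 fps

MAX_AUDIO_S = 30.0
MAX_VIDEO_S = 5.0

def check_media(path: str, duration_s=None) -> list[str]:
    """Return a list of problems for one media file, or [] if it passes."""
    ext = Path(path).suffix.lower()
    problems = []
    if ext in AUDIO_EXTS:
        if duration_s is not None and duration_s > MAX_AUDIO_S:
            problems.append(f"{path}: audio longer than {MAX_AUDIO_S:g} s")
    elif ext in VIDEO_EXTS:
        if duration_s is not None and duration_s > MAX_VIDEO_S:
            problems.append(f"{path}: video longer than {MAX_VIDEO_S:g} s")
    elif ext not in IMAGE_EXTS:
        problems.append(f"{path}: unsupported format {ext!r}")
    return problems
```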

Online Execution

| No. | Text | Audio | Video |
|-----|------|-------|-------|
| 1   | Y    | N/A   | Y     |
| 2   | N/A  | Y     | Y     |
| 3   | N/A  | Y     | N/A   |

  • Text content can be fed to the model using the xlm_omni_feed_text_online API.

  • Video format is limited to nv12. You can use the xlm_omni_feed_video_online API to feed single-frame nv12 data to the model. Frames will be resized to 448x448, and each conversation supports transmission of 2 to 10 frames.

  • Audio data must be of type float32 with values in the range [-1, 1]. You can use the xlm_omni_feed_audio_online API to either transmit complete audio in one go or stream audio segments incrementally. Each conversation supports up to 30 seconds of cumulative audio.
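Two small host-side helpers can illustrate these input requirements: converting signed 16-bit PCM samples to the expected float32 range, and computing the buffer size of a single NV12 frame. The helper names are ours; only the data-format requirements come from the list above:

```python
# Illustrative input-preparation helpers; only the formats are documented above.

def pcm16_to_float32(samples: list[int]) -> list[float]:
    """Scale signed 16-bit PCM samples into [-1, 1], the range expected
    for audio fed via xlm_omni_feed_audio_online-style input."""
    return [s / 32768.0 for s in samples]

def nv12_frame_bytes(width: int, height: int) -> int:
    """Byte size of one NV12 frame: a full-resolution Y plane plus an
    interleaved UV plane at quarter resolution (1.5 bytes per pixel)."""
    return width * height * 3 // 2
```

For a 448x448 frame (the resolution frames are resized to), `nv12_frame_bytes(448, 448)` gives 301056 bytes.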

On-Device Execution Preparation

Within the directory D-Robotics_LLM_{version}/oellm_runtime/example, we have pre-prepared compiled executables in subdirectories that can be run directly on the device. Alternatively, you can generate the required files yourself by executing different build scripts. Reference commands are as follows:

```shell
# Offline execution
sh build_omni_offline.sh
# Online execution
sh build_omni_online.sh
```

Next, create a working directory on the device with the following command:

```shell
# Running on S100/S100P
mkdir -p /home/root/llm
```

Before execution, ensure the following items are ready:

  • A functional development board for running on-device programs.
  • On-device deployable model files (*.hbm).
  • Input embedding weights (embed_tokens.bin).
  • Executable files (oellm_omni_offline and oellm_omni_online) along with their corresponding JSON configuration files.
  • Required runtime libraries. To simplify deployment, you may directly use the contents from the following directories within the D-Robotics-LLM package:
    • D-Robotics_LLM_{version}/oellm_runtime/set_performance_mode.sh
    • D-Robotics_LLM_{version}/oellm_runtime/lib
    • D-Robotics_LLM_{version}/oellm_runtime/config
    • D-Robotics_LLM_{version}/oellm_runtime/example
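Before copying anything to the board, a quick pre-flight check can confirm that the staging directory contains every item in the checklist above. This is a hypothetical helper, not part of the deployment package; the expected layout follows the directory tree shown below:

```python
from pathlib import Path

# Entries the deployment directory should contain, per this section's checklist.
REQUIRED = [
    "model/Qwen2.5_Omni_3B_Audio.hbm",
    "model/Qwen2.5_Omni_3B_Visual.hbm",
    "model/Qwen2.5_Omni_3B_Text.hbm",
    "model/embed_tokens.bin",
    "config/Qwen2.5_Omni_3B_config",
    "example/oellm_omni_offline/oellm_omni_offline",
    "example/oellm_omni_online/oellm_omni_online",
    "lib",
    "set_performance_mode.sh",
]

def missing_files(root: str) -> list[str]:
    """Return the required entries that are absent under `root`."""
    base = Path(root)
    return [rel for rel in REQUIRED if not (base / rel).exists()]
```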

After preparing all necessary files, organize them into a unified directory structure as shown below:

```
root@ubuntu:/home/root/llm
.
|-- model
|   |-- resolve_model.txt
|   |-- Qwen2.5_Omni_3B_Audio.hbm
|   |-- Qwen2.5_Omni_3B_Visual.hbm
|   |-- Qwen2.5_Omni_3B_Text.hbm
|   `-- embed_tokens.bin
|-- config
|   `-- Qwen2.5_Omni_3B_config
|-- example
|   |-- oellm_omni_offline
|   |   |-- oellm_omni_offline
|   |   |-- omni_offline_config.json
|   |   |-- omni_offline_prompt.json
|   |   `-- draw_guitar.mp4
|   `-- oellm_omni_online
|       |-- oellm_omni_online
|       |-- omni_online_config.json
|       `-- draw_guitar.mp4
|-- include
|-- lib
`-- set_performance_mode.sh
```

Copy the prepared folder from your development host to the device directory using the following command:

```shell
scp -r llm/* root@{board_ip}:/home/root/llm
```

Finally, in the /home/root/llm directory on the device, switch to performance mode and configure LD_LIBRARY_PATH with the following commands:

```shell
# Modify hardware registers to switch the device into performance mode
sh set_performance_mode.sh
# Set environment variables
lib=/home/root/llm/lib
export LD_LIBRARY_PATH=${lib}:${LD_LIBRARY_PATH}
```

On-Device Execution

Offline Execution

Reference command for offline execution:

```shell
cd ./example/oellm_omni_offline
./oellm_omni_offline --config ./omni_offline_config.json
```

Program arguments are as follows:

| Argument | Description | Required |
|----------|-------------|----------|
| -h, --help | Displays help information. | / |
| -c, --config | Specifies the path to the JSON configuration file used at runtime. | Required |

Example JSON configuration file:

omni_offline_config.json

```json
{
  "visual_hbm_path": "../../model/Qwen2.5_Omni_3B_Visual.hbm",
  "audio_hbm_path": "../../model/Qwen2.5_Omni_3B_Audio.hbm",
  "text_hbm_path": "../../model/Qwen2.5_Omni_3B_Text.hbm",
  "embed_tokens": "../../model/embed_tokens.bin",
  "tokenizer_dir": "../../config/Qwen2.5_Omni_3B_config/",
  "model_type": 5,
  "online_mode": false
}
```

The parameters in the JSON configuration file are described as follows:

| Parameter | Description | Optional/Required |
|-----------|-------------|-------------------|
| visual_hbm_path | Specifies the path to the quantized video/image feature extraction model file (*.hbm). | Required |
| audio_hbm_path | Specifies the path to the quantized audio feature extraction model file (*.hbm). | Required |
| text_hbm_path | Specifies the path to the quantized text model file (*.hbm). | Required |
| embed_tokens | Specifies the path to the model's input embedding weights (embed_tokens.bin). | Required |
| tokenizer_dir | Specifies the path to the tokenizer and partial initialization data configuration. | Required |
| model_type | Specifies the model type to run; the current Omni model type is 5. | Required |
| online_mode | Specifies whether the model runs in online or offline mode. Valid values: `true`, `false`. | Required |

When running the program, you also need to provide the path to a JSON file containing the multimodal input information at the interactive prompt, then press Enter to start the interaction.

Offline execution supports modalities including text, audio, images, and video. You must prepare the input information in advance within the JSON file and save it locally. The template is as follows:

Note

This JSON file template is provided for illustrative purposes only. For details on supported multimodal input combinations, please refer to the Multimodal Support section.

```json
{
  "conversation": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "user_text_input" },
        { "type": "audio", "audio": "user_audio_input.mp3" },
        { "type": "image", "image": "user_image_input.jpg", "resized_width": 448, "resized_height": 448 },
        { "type": "video", "video": "user_video_input.mp4", "resized_width": 448, "resized_height": 448 }
      ]
    }
  ]
}
```

In this JSON template, the conversation array contains a system role with a text field and a user role with optional text, audio, image, and video entries. If a particular modality is not needed, delete the entire corresponding object (including its braces).

For example, when providing only video input, the JSON file can be configured as follows:

omni_offline_prompt.json
```json
{
  "conversation": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
        }
      ]
    },
    {
      "role": "user",
      "content": [
        { "type": "video", "video": "draw_guitar.mp4", "resized_width": 448, "resized_height": 448 }
      ]
    }
  ]
}
```
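Rather than editing the template by hand, the prompt file can also be generated programmatically. The sketch below builds the video-only example; `make_offline_prompt` is our own helper name, and only the JSON schema comes from this section:

```python
import json

SYSTEM_TEXT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)

def make_offline_prompt(video: str) -> dict:
    """Build a video-only offline prompt following the schema in this section."""
    return {
        "conversation": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_TEXT}]},
            {
                "role": "user",
                "content": [
                    {
                        "type": "video",
                        "video": video,
                        "resized_width": 448,
                        "resized_height": 448,
                    }
                ],
            },
        ]
    }

# Write the file that oellm_omni_offline reads at its interactive prompt.
with open("omni_offline_prompt.json", "w") as f:
    json.dump(make_offline_prompt("draw_guitar.mp4"), f, indent=2)
```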

Online Execution

The D-Robotics-LLM deployment package provides an API that supports streaming-based online execution of the Qwen2.5_Omni_3B model. We provide an online execution example for reference.

The reference command to run this example on the device is as follows:

```shell
cd ./example/oellm_omni_online
./oellm_omni_online --config ./omni_online_config.json
```

The program accepts the following command-line arguments:

| Parameter | Description | Optional/Required |
|-----------|-------------|-------------------|
| -h, --help | Displays help information. | / |
| -c, --config | Specifies the path to the runtime JSON configuration file. | Required |

An example JSON configuration file is shown below:

omni_online_config.json
```json
{
  "visual_hbm_path": "../../model/Qwen2.5_Omni_3B_Visual.hbm",
  "audio_hbm_path": "../../model/Qwen2.5_Omni_3B_Audio.hbm",
  "text_hbm_path": "../../model/Qwen2.5_Omni_3B_Text.hbm",
  "embed_tokens": "../../model/embed_tokens.bin",
  "tokenizer_dir": "../../config/Qwen2.5_Omni_3B_config/",
  "model_type": 5,
  "online_mode": true,
  "video_path": "./draw_guitar.mp4",
  "user_text": "Please describe what I am doing"
}
```

The parameters in the JSON configuration file are described as follows:

| Parameter | Description | Optional/Required |
|-----------|-------------|-------------------|
| visual_hbm_path | Specifies the path to the quantized video/image feature extraction model file (*.hbm). | Required |
| audio_hbm_path | Specifies the path to the quantized audio feature extraction model file (*.hbm). | Required |
| text_hbm_path | Specifies the path to the quantized text model file (*.hbm). | Required |
| embed_tokens | Specifies the path to the model's input embedding weights (embed_tokens.bin). | Required |
| tokenizer_dir | Specifies the path to the tokenizer and partial initialization data configuration. | Required |
| model_type | Specifies the model type to run; the current Omni model type is 5. | Required |
| online_mode | Specifies whether the model runs in online or offline mode. Valid values: `true`, `false`. | Required |
| video_path | Specifies the path to the video file to be processed during online execution. | Required |
| user_text | Specifies the user's textual input content. | Optional |
Tip

Audio data in online execution is extracted from the video. If you test with your own video data, we recommend using videos that contain audio tracks.

Execution Results

Offline Execution

A complete offline interaction proceeds as follows. [User] <<< is followed by the path to your JSON file containing the input data (e.g., ./omni_offline_prompt.json), and [Assistant] >>> shows the model’s textual output. Before generating the response, the program prints the input information from the JSON file to the terminal.

```
[User] <<< omni_offline_prompt.json
Role: system
Type: text
Text: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
Role: user
Type: video
Video: "draw_guitar.mp4"
VideoPreprocess Time: 1799.69ms
Audio(inVideo)Preprocess Time: 214.787ms
[Assistant] >>> Oh, that's really cool! You're drawing a guitar on the tablet. Have you been practicing drawing for a long time? If you want to practice more, you can try to draw other things like flowers or animals. It's also a great way to relax and have fun. So, what do you think about it?
Performance prefill: 895.73tokens/s decode: 14.03tokens/s
```

Online Execution

Enter 1, 2, or 3 to run one of the three supported multimodal input combinations in online mode; enter 0 to exit the program. A full demonstration of the online interaction is shown below:

```
xlm init success
On-device Omni multimodal LLM interactive online demo
Currently supported multimodal input combinations:
1. Video (NV12) from sensor + text
2. Video (NV12) from sensor + audio (PCM) from sensor
3. Audio (PCM) from sensor
This demo simulates the online scenario.
Please enter 1, 2, or 3 to run the corresponding example. Enter 0 to quit.
[User] <<< 1
Role: system
Type: text
Text: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
Role: user
Type: text
Text: "Please describe what I am doing"
Type: video
Video: " "
VideoPreprocess Time: 1009.98ms
[Assistant] >>> Hmm... You're drawing on a tablet. Look, there's a black outline on the screen that looks like the silhouette of a musical instrument, and you're sketching over it with a stylus while holding the tablet with your fingers. Are you practicing drawing, or is this for something else? If you have any thoughts, feel free to share them with me.
Performance prefill: 894.32tokens/s decode: 14.11tokens/s
[User] <<< 2
Role: system
Type: text
Text: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
Role: user
Type: video
Video: " "
Type: audio
Audio: " "
AudioPreprocess Time: 221.13ms
VideoPreprocess Time: 1007.44ms
[Assistant] >>> Oh, that's a really cool drawing! It looks like a guitar. You've got the basic shape and the strings all drawn in. What made you decide to draw a guitar? It's a great choice. If you want, you can tell me more about your drawing process.
Performance prefill: 894.91tokens/s decode: 14.10tokens/s
[User] <<< 3
Role: system
Type: text
Text: "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."
Role: user
Type: audio
Audio: " "
AudioPreprocess Time: 210.43ms
[Assistant] >>> Oh, sure! I'm here. What's your drawing? Let me see it.
Performance prefill: 896.71tokens/s decode: 14.10tokens/s
[User] <<< 0
[system out] >>> Alright, wish you a wonderful day—goodbye!
```