Yannqi's homepage

Beijing, China

I am currently a Ph.D. student at the Institute of Automation, Chinese Academy of Sciences (CASIA), under the guidance of Prof. Shiming Xiang. I received my bachelor’s degree from the University of Electronic Science and Technology of China (UESTC), under the supervision of Prof. Lu Yang.

My research focuses on building intelligent multimodal systems. I am particularly interested in:

Multimodal Large Language Models – reasoning, auto-thinking, and retrieval-augmented generation
Audio-Visual Learning – segmentation, generation, and cross-modal understanding
Visual Semantic Segmentation – open-vocabulary, continual, and remote-sensing segmentation
AIGC – image, audio, and video generation

I am happy to collaborate and share. If you are interested in my research, please feel free to contact me via email.

News

Mar 12, 2026	Congratulations! Our paper R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs has been accepted by CVPR!
Aug 28, 2025	Our paper R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs is now available on arXiv! R-4B achieves SOTA across 25 benchmarks.
Jun 09, 2025	Our paper RCTS: Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger is now available on arXiv! Code is released on GitHub.
May 01, 2025	Congratulations! Our paper RCTS: Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger has been selected as an ICML 2025 Spotlight Paper! 😮
Sep 06, 2024	Our paper Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis is now available on arXiv!
Apr 06, 2024	Congratulations! Our paper Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation has been selected as a CVPR 2024 Highlight Paper! 😮
Feb 29, 2024	Congratulations! Our paper Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation has been accepted by CVPR 2024! 😄

Selected Publications

Continual Semantic Segmentation via Scalable Contrastive Clustering and Background Diversity

Qi Yang, Xing Nie, Linsu Shi, and 3 more authors

In 2023 IEEE International Conference on Data Mining (ICDM), 2023

Despite their efficacy on static data distributions, traditional semantic segmentation methods suffer from catastrophic forgetting when handling continually changing data streams. We introduce a novel, scalable segmentation architecture called ScaleSeg, designed to adapt to incremental scenarios. The architecture consists of a series of prototypes updated by online contrastive clustering. Additionally, we propose a background diversity strategy to enhance the model’s plasticity and stability, thus overcoming background shift. Comprehensive experiments demonstrate that ScaleSeg surpasses previous state-of-the-art methods.

@inproceedings{ScaleSeg,
  author = {Yang, Qi and Nie, Xing and Shi, Linsu and Yu, Jiazhong and Li, Fei and Xiang, Shiming},
  booktitle = {2023 IEEE International Conference on Data Mining (ICDM)},
  title = {Continual Semantic Segmentation via Scalable Contrastive Clustering and Background Diversity},
  year = {2023},
  pages = {1475--1480},
  url = {https://ieeexplore.ieee.org/document/10415751},
  doi = {10.1109/ICDM58522.2023.00194},
  issn = {2374-8486}
}

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation

Qi Yang, Xing Nie, Tong Li, and 5 more authors

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

CVPR 2024 Highlight

We propose an innovative audio-visual transformer framework, termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement, and temporal entanglement. Comprehensive experiments on the AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods.

@inproceedings{COMBO,
  title = {Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation},
  author = {Yang, Qi and Nie, Xing and Li, Tong and Gao, Pengfei and Guo, Ying and Zhen, Cheng and Yan, Pengfei and Xiang, Shiming},
  year = {2024},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {27134--27143},
  url = {https://arxiv.org/abs/2312.06462},
  note = {CVPR 2024 Highlight}
}

Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Qi Yang, Binjie Mao, Zili Wang, and 6 more authors

2024

We construct a controllable video-to-audio synthesis model, termed Draw an Audio, which supports multiple input instructions through drawn masks and loudness signals. We introduce the Mask-Attention Module (MAM) and the Time-Loudness Module (TLM) to ensure content consistency and temporal alignment. Furthermore, we have extended a large-scale V2A dataset, named VGGSound-Caption. Extensive experiments verify that Draw an Audio achieves state-of-the-art performance.

@misc{DrawAnAudio,
  title = {Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis},
  author = {Yang, Qi and Mao, Binjie and Wang, Zili and Nie, Xing and Gao, Pengfei and Guo, Ying and Zhen, Cheng and Yan, Pengfei and Xiang, Shiming},
  year = {2024},
  eprint = {2409.06135},
  archiveprefix = {arXiv},
  primaryclass = {cs.SD},
  url = {https://arxiv.org/abs/2409.06135},
}

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Qi Yang, Bolin Ni, Shiming Xiang, and 3 more authors

2025

We propose R-4B, an auto-thinking MLLM that adaptively decides when to think based on problem complexity. The central idea is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO). Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks, outperforming Qwen2.5-VL-7B in most tasks.

@misc{R4B,
  title = {R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning},
  author = {Yang, Qi and Ni, Bolin and Xiang, Shiming and Hu, Han and Peng, Houwen and Jiang, Jie},
  year = {2025},
  eprint = {2508.21113},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  url = {https://arxiv.org/abs/2508.21113},
}

Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger

Qi Yang, Chenghao Zhang, Lubin Fan, and 3 more authors

2025

We propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. We introduce a self-consistent evaluation mechanism and Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR). Extensive experiments demonstrate state-of-the-art performance on multiple VQA datasets.

@misc{RCTS,
  title = {Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger},
  author = {Yang, Qi and Zhang, Chenghao and Fan, Lubin and Ding, Kun and Ye, Jieping and Xiang, Shiming},
  year = {2025},
  eprint = {2506.07785},
  archiveprefix = {arXiv},
  primaryclass = {cs.CV},
  url = {https://arxiv.org/abs/2506.07785},
}